6th European Conference - Academic Conferences
Large-Scale Analysis of Continuous Data in Cyber-Warfare<br />
Threat Detection<br />
William Acosta<br />
University of Toledo, USA<br />
william.acosta@utoledo.edu<br />
Abstract: Combating cyber/information warfare threats requires analyzing vast quantities of diverse data. The<br />
data required to detect attacks as they occur (on-line analysis of live data) and predict future threats (forensic<br />
analysis/data mining) is not only large, but is growing at a staggering rate. Data such as network traffic logs,<br />
emails, social networking posts, SMS messages, and cell phone call logs are, by nature, continuous and<br />
growing. The problem addressed in this research is that current systems are not designed to handle either the<br />
scope or nature of the analysis or the data itself. For example, distributed data processing systems like Google’s<br />
MapReduce provide the ability to process large data sets, but they are not designed to easily support processing<br />
of changing data sets or data-mining algorithms. In light of this, Google has itself recently stopped using<br />
MapReduce for building its web-index, opting instead for a custom mechanism that can more quickly respond to<br />
and process new content. Non-traditional databases, like vertically-partitioned/column-store databases, can<br />
efficiently support analysis algorithms on large quantities of data, but they are not designed to support<br />
continuously changing data sets. The goal of this research is to explore and design a new data management<br />
system that can handle large quantities of incrementally growing data and directly support data mining<br />
and analysis algorithms. Specifically, this research proposes a new distributed data processing system that<br />
exploits the parallel and distributed resources/computation of cloud computing infrastructures. It makes use of<br />
summary data structures that can be updated incrementally and continuous queries to support analysis and data<br />
mining algorithms natively. This approach allows for larger-scale and more robust analysis on continuously<br />
growing data that can help detect, predict and respond to cyber-warfare threats.<br />
Keywords: data-mining, databases, text-search, cloud computing, data integration<br />
1. Introduction<br />
Protection against cyber/information warfare threats requires understanding the nature, methods, and<br />
patterns of those attacks. Such understanding can allow for early detection and, possibly, prediction<br />
of attacks. Gaining an understanding of the patterns and mechanisms used in cyber/information<br />
warfare attacks requires analyzing large amounts of diverse data such as server logs (Myers et al.<br />
2010), emails, SMS messages, and social-networking data. Not only is the data diverse, but it is also<br />
continuous; new data gets generated every day. Furthermore, analysis of this data can require<br />
equally diverse approaches: graph-theoretic algorithms (detecting patterns in social-networking), data<br />
mining algorithms (associations between events), statistical models, clustering algorithms, etc. The<br />
diverse nature of the data and analysis algorithms, as well as the large quantity of data to be analyzed,<br />
poses problems for both traditional databases and storage systems. In order to provide the analysis of<br />
diverse and continuous data required for cyber-warfare threat detection, a new system is needed for<br />
managing large quantities of diverse data that can support equally diverse analysis algorithms.<br />
The need to incrementally process large quantities of data is applicable to a wide range of applications.<br />
For example, Google replaced MapReduce (Dean & Ghemawat 2004), its former web-indexing<br />
system, in order to enable faster updates of its index (Metz 2010, Peng & Dabek 2010). Similarly,<br />
detecting and responding to information security threats requires a mechanism that can not only<br />
manage large quantities of data, but also provide for fast response time of complex, continuous<br />
analysis. This paper proposes a new distributed data-analysis framework that is designed to meet the<br />
needs of applications that require analysis of continuous data. Next, Section 2 presents the design of<br />
the proposed system in the context of related work. Section 3 then provides concluding remarks.<br />
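As a rough illustration of what analysis of continuous data can look like, the following is a minimal single-process sketch, not the proposed distributed system. It assumes an append-only event log, a per-key counter as the incrementally updated summary structure, and a threshold callback standing in for a continuous query. The names `Event`, `ContinuousStore`, `register_query`, and `append` are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical raw record, e.g. one line of a server log."""
    source: str  # kind of data: "server_log", "email", "sms", ...
    key: str     # attribute being tracked, e.g. a source IP address


class ContinuousStore:
    """Append-only log plus a summary that is updated on every append."""

    def __init__(self):
        self._log = []            # raw data: only appended to, never updated
        self._counts = Counter()  # summary structure: number of events per key
        self._queries = []        # registered (threshold, callback) pairs

    def register_query(self, threshold, callback):
        """Register a continuous query that fires when a key's count
        reaches `threshold` (an assumed, deliberately simple query form)."""
        self._queries.append((threshold, callback))

    def append(self, event):
        """Add new raw data and update the summary without rescanning the log."""
        self._log.append(event)
        self._counts[event.key] += 1
        count = self._counts[event.key]
        for threshold, callback in self._queries:
            if count == threshold:
                callback(event.key, count)

    def count(self, key):
        """Answer a point query from the summary alone."""
        return self._counts[key]


alerts = []
store = ContinuousStore()
store.register_query(3, lambda key, n: alerts.append((key, n)))
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]:
    store.append(Event("server_log", ip))
```

In this sketch the query is evaluated as each event arrives, so the cost of an update does not depend on how much historical data has already been stored. A real system would need summaries suited to the analysis at hand and would distribute both the log and the summaries across machines.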
2. Design and requirements of a continuous data analysis system<br />
Cyber-warfare threat detection requires analyzing large quantities of diverse data that is continuously<br />
generated. The properties of the raw data in this type of application impose some constraints on the<br />
analysis and data storage systems. These applications require analyzing not only current data, but<br />
also prior/historical data from many heterogeneous sources. Because the raw data is continuously<br />
generated, old data must be kept for analysis while new data is integrated into the storage and<br />
analysis framework. Because old data must be kept and not changed, the system need not support<br />
updates of raw data. Effectively, raw data is append-only. This can be leveraged to improve storage<br />
efficiency and performance; it is easier to implement and support distributed storage as no write-<br />