27.06.2013 Views

6th European Conference - Academic Conferences

Large-Scale Analysis of Continuous Data in Cyber-Warfare Threat Detection

William Acosta

University of Toledo, USA

william.acosta@utoledo.edu

Abstract: Combating cyber/information warfare threats requires analyzing vast quantities of diverse data. The data required to detect attacks as they occur (on-line analysis of live data) and to predict future threats (forensic analysis/data mining) is not only large, but is growing at a staggering rate. Data such as network traffic logs, emails, social networking posts, SMS messages, and cell phone call logs are, by nature, continuous and growing. The problem addressed in this research is that current systems are not designed to handle either the scope or nature of the analysis or the data itself. For example, distributed data processing systems like Google’s MapReduce provide the ability to process large data sets, but they are not designed to easily support processing of changing data sets or data-mining algorithms. In light of this, Google has itself recently stopped using MapReduce for building its web index, opting instead for a custom mechanism that can more quickly respond to and process new content. Non-traditional databases, like vertically-partitioned/column-store databases, can efficiently support analysis algorithms on large quantities of data, but they are not designed to support continuously changing data sets. The goal of this research is to explore and design a new data management system that can handle large quantities of incrementally growing data and directly support data mining and analysis algorithms. Specifically, this research proposes a new distributed data processing system that exploits the parallel and distributed resources/computation of cloud computing infrastructures. It makes use of summary data structures that can be updated incrementally and continuous queries to support analysis and data mining algorithms natively. This approach allows for larger-scale and more robust analysis of continuously growing data that can help detect, predict, and respond to cyber-warfare threats.

Keywords: data-mining, databases, text-search, cloud computing, data integration

1. Introduction

Protection against cyber/information warfare threats requires understanding the nature, methods, and patterns of those attacks. Such understanding can allow for early detection and, possibly, prediction of attacks. Gaining an understanding of the patterns and mechanisms used in cyber/information warfare attacks requires analyzing large amounts of diverse data such as server logs (Myers et al. 2010), emails, SMS messages, and social-networking data. Not only is the data diverse, but it is also continuous; new data is generated every day. Furthermore, analysis of this data can require equally diverse approaches: graph-theoretic algorithms (detecting patterns in social networking), data mining algorithms (associations between events), statistical models, clustering algorithms, etc. The diversity of the data and analysis algorithms, together with the large quantity of data to be analyzed, poses problems for both traditional databases and storage systems. To provide the analysis of diverse and continuous data required for cyber-warfare threat detection, a new system is needed that can manage large quantities of diverse data and support equally diverse analysis algorithms.

The need to incrementally process large quantities of data applies to a wide range of applications. For example, Google replaced MapReduce (Dean & Ghemawat 2004), its previous web-indexing system, in order to enable faster updates of its index (Metz 2010, Peng & Dabek 2010). Similarly, detecting and responding to information security threats requires a mechanism that can not only manage large quantities of data, but also provide fast response times for complex, continuous analysis. This paper proposes a new distributed data-analysis framework that is designed to meet the needs of applications that require analysis of continuous data. Section 2 presents the design of the proposed system in the context of related work. Section 3 then provides concluding remarks.
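As an informal illustration of the two mechanisms named in the abstract, summary data structures that are updated incrementally and continuous queries, the following Python sketch may help. It is not the proposed system: the class `IncrementalSummary`, the methods `register_query` and `append`, the `(source, event_type)` record layout, and the threshold of three failed logins are all hypothetical names and values chosen only for this example, and the sketch runs on a single machine rather than on distributed resources.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical record layout for this sketch: (source address, event type).
Event = Tuple[str, str]
Predicate = Callable[[Event, int], bool]
Action = Callable[[Event, int], None]


class IncrementalSummary:
    """Counts per (source, event type), updated as each new record arrives."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self._queries: List[Tuple[Predicate, Action]] = []

    def register_query(self, predicate: Predicate, on_match: Action) -> None:
        """Register a continuous query that is re-checked on every update."""
        self._queries.append((predicate, on_match))

    def append(self, event: Event) -> None:
        """Fold one new record into the summary without rescanning old data."""
        self.counts[event] += 1
        count = self.counts[event]
        for predicate, on_match in self._queries:
            if predicate(event, count):
                on_match(event, count)


alerts: List[Event] = []
summary = IncrementalSummary()

# Continuous query (assumed rule): flag a source at its third failed login.
summary.register_query(
    lambda ev, n: ev[1] == "login_failed" and n == 3,
    lambda ev, n: alerts.append(ev),
)

stream = [
    ("10.0.0.5", "login_failed"),
    ("10.0.0.9", "login_failed"),
    ("10.0.0.5", "login_failed"),
    ("10.0.0.5", "login_failed"),
]
for ev in stream:
    summary.append(ev)

print(alerts)  # [('10.0.0.5', 'login_failed')]
```

Each arriving record costs one counter update plus one check per registered query, so the work per update does not depend on how much historical data has already been absorbed. A batch job, by contrast, would recompute the counts from the full data set.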

2. Design and requirements of a continuous data analysis system

Cyber-warfare threat detection requires analyzing large quantities of diverse data that is continuously generated. The properties of the raw data in this type of application impose some constraints on the analysis and data storage systems. These applications require analyzing not only current data, but also prior/historical data from many heterogeneous sources. Because the raw data is continuously generated, old data must be kept for analysis while new data is integrated into the storage and analysis framework. Because old data must be kept and not changed, the system need not support updates of raw data. Effectively, raw data is append-only. This can be leveraged to improve storage efficiency and performance; it is easier to implement and support distributed storage as no write-
