
6th European Conference - Academic Conferences


William Acosta

locking of existing data is necessary. It also allows the analysis framework to make use of novel summary data structures and algorithms that can incorporate the changes made to the data without requiring analysis of the full dataset.

2.1 Storage and data management

The large quantity of data makes a centralized storage solution unfeasible; instead, a distributed storage solution is favored. The parallel nature of many of the algorithms makes a distributed solution not only more feasible, but also desirable. Distributed storage systems such as Google’s BigTable (Chang et al. 2006), Yahoo’s PNUTS (Cooper et al. 2008), and Amazon’s Dynamo (DeCandia et al. 2007) provide the low-level mechanisms for storing and managing large quantities of data. These systems were designed to support coordinated reads and updates of data in a distributed environment. To support the needs of applications like cyber-warfare threat detection, a distributed storage system should provide efficient, low-level support for append-only writes of raw data, as well as efficient tracking of incremental additions and updates to the dataset.
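The append-only write model with incremental tracking described above can be sketched as follows. This is a minimal, single-process illustration (the class and method names are hypothetical, not from any of the cited systems): each record receives a monotonically increasing sequence number, so a consumer can ask for only the records added since its last read rather than rescanning the full dataset.

```python
class AppendOnlyStore:
    """Illustrative append-only record log with incremental tracking."""

    def __init__(self):
        self._log = []  # records are only ever appended, never rewritten

    def append(self, record):
        """Append a raw record without locking or modifying existing data."""
        seq = len(self._log)
        self._log.append(record)
        return seq  # sequence number identifying this addition

    def since(self, seq):
        """Return only the records added after sequence number `seq`."""
        return self._log[seq + 1:]


store = AppendOnlyStore()
s0 = store.append({"src": "10.0.0.1", "event": "login"})
store.append({"src": "10.0.0.2", "event": "scan"})
new = store.since(s0)  # only the one record written after s0
```

In a real distributed store the log would be partitioned and replicated, but the key property is the same: writers never block on existing data, and analysis code can consume just the increment.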

2.2 Distributed processing of data

Recently, there has been a great deal of research on Google’s MapReduce (Dean & Ghemawat 2004), a distributed computing software framework for processing large datasets. However, MapReduce is batch-oriented and was not designed to deal with incremental or continuous data updates. This makes it unsuitable for a variety of applications, including cyber-warfare threat analysis and detection. Systems like HaLoop (Bu et al. 2010) and MapReduce Online (Condie et al. 2010) have sought to add continuous query support to MapReduce. To achieve this, these systems had to make fundamental changes to the API and underlying architecture of MapReduce. This paper argues that what is needed instead is a system designed from the ground up to support the demands of analysis and mining algorithms on large sets of continuously generated data.
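The batch-oriented limitation noted above is visible even in a toy sketch of the MapReduce pattern (this is a single-process illustration, not Google's implementation): every run consumes the entire input, so incorporating new records means reprocessing the whole dataset from scratch.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user's mapper to every record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def reduce_phase(pairs, reducer):
    """Group pairs by key, then reduce each group to a single value."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Count event words in a log: the whole input is read on every run.
logs = ["login fail", "login ok", "scan"]
counts = reduce_phase(
    map_phase(logs, lambda line: [(word, 1) for word in line.split()]),
    lambda key, values: sum(values),
)
```

If one new log line arrives, this model offers no way to fold it into `counts` other than rerunning both phases over all of `logs` plus the new line, which is exactly the cost the proposed system aims to avoid.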

2.3 Data management and analysis

The problem of analyzing continuous data has been explored by stream databases (Abadi et al. 2005, Shah et al. 2004). Similarly, continuous queries in databases have been proposed in systems such as TelegraphCQ (Chandrasekaran et al. 2003) and CQL (Arasu et al. 2006). These systems can process long-running/continuous queries over streams of data. However, they lack the ability to support analytic algorithms over a large and diverse dataset. In contrast, vertically-partitioned databases such as C-Store (Stonebraker et al. 2005) excel at fast and efficient support of complex analytics. Unfortunately, vertically-partitioned databases suffer from poor write performance: insertions and updates require that the index be rebuilt, and although reads are very fast once the index is built, building it is very expensive. What is needed is a system that can perform complex analytics on continuous data without requiring a complex index to be completely rebuilt as a result of data updates. This paper proposes a new, incremental indexing system that keeps track of summarized historical data while allowing many small incremental updates to be incorporated. The key difference is that, unlike traditional database indexes, the new incremental index would not be built off-line as a batch process. Instead, the index would incorporate the many incremental updates on-line, so that the index of past data is always active and valid.
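The incremental-index idea can be sketched minimally as follows. This is a hypothetical illustration of the behavior described above, not the paper's actual design: small batches of update deltas are folded into a running summary on-line, so lookups over past data remain valid at all times and no off-line rebuild ever occurs.

```python
class IncrementalIndex:
    """Illustrative on-line index over summarized historical data."""

    def __init__(self):
        self._summary = {}  # summarized historical counts, keyed by item

    def apply_updates(self, updates):
        """Fold a small batch of (key, delta) updates into the summary.

        Cost is proportional to the size of the batch, not the size of
        the historical data, so no full rebuild is needed.
        """
        for key, delta in updates:
            self._summary[key] = self._summary.get(key, 0) + delta

    def lookup(self, key):
        """The index stays active: reads reflect all applied updates."""
        return self._summary.get(key, 0)


idx = IncrementalIndex()
idx.apply_updates([("10.0.0.1", 3), ("10.0.0.2", 1)])
idx.apply_updates([("10.0.0.1", 2)])  # incremental batch, no rebuild
total = idx.lookup("10.0.0.1")        # reflects both batches: 3 + 2 = 5
```

Contrast this with a vertically-partitioned index, where absorbing either batch would have required rebuilding the index over all historical data.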

In addition to the storage and distributed computing framework, it is also important to consider the needs of the algorithms that will be used in the system. Applications with such diverse data require equally diverse analysis. For example, detecting hidden correlations and associations between events seen in server logs requires mining association rules (Agrawal & Srikant 1994), whereas detecting interaction of attackers in a network may involve graph-theoretic algorithms.
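As a concrete illustration of the first example, the support-counting pass at the heart of Apriori-style association-rule mining (Agrawal & Srikant 1994) can be sketched as follows; the event names and the `frequent_pairs` helper are hypothetical, chosen only to echo the server-log scenario.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support):
    """Count co-occurring event pairs and keep those meeting min_support.

    This is the first (pairwise) candidate-counting pass of the Apriori
    approach; a full miner would iterate to larger itemsets and then
    derive rules with sufficient confidence.
    """
    counts = Counter()
    for events in sessions:
        for pair in combinations(sorted(set(events)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Each session is the set of event types seen for one source in the logs.
sessions = [
    {"login_fail", "port_scan"},
    {"login_fail", "port_scan", "privilege_escalation"},
    {"port_scan"},
]
rules = frequent_pairs(sessions, min_support=2)
# ("login_fail", "port_scan") co-occurs in 2 of 3 sessions and survives
```

On the continuously generated data this paper targets, these pair counts are exactly the kind of summary an incremental index could maintain on-line as new sessions arrive.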

3. Conclusion

This paper presents a case for a new distributed computing system that is explicitly designed to meet the unique needs of applications such as cyber-warfare threat detection. The system should support large quantities of diverse data such as server logs, emails, and social-network data. It should allow a variety of mining and analysis algorithms to be executed in a parallel and distributed manner. The system must not only meet these needs, but also do so in a way that can efficiently support continuous analysis of continuously generated data.

