
tome), complex graphical models, etc., new methods are being developed for sampling large graphs to estimate a given network’s parameters, as well as the node-, edge-, or subgraph-statistics of interest. For example, snowball sampling is a common method that starts with an initial sample of seed nodes and, in each step i, includes in the sample all nodes that have edges with the nodes in the sample at step i − 1 but were not yet present in the sample. Network sampling also includes degree-based or PageRank-based methods and different types of random walks. Finally, the statistics are aggregated from the sampled subnetworks.
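
As a rough illustration of snowball sampling, here is a minimal Python sketch using the networkx package; the synthetic graph, the number of seeds, and the number of waves are hypothetical choices for demonstration only.

```python
import random
import networkx as nx

def snowball_sample(G, num_seeds=5, num_waves=2, rng_seed=0):
    """Snowball sampling: start from random seed nodes, then in each wave
    add every neighbor of the current sample that is not yet in it."""
    rng = random.Random(rng_seed)
    sample = set(rng.sample(list(G.nodes()), num_seeds))
    frontier = set(sample)
    for _ in range(num_waves):           # each wave corresponds to one step i
        new_nodes = set()
        for u in frontier:
            new_nodes.update(v for v in G.neighbors(u) if v not in sample)
        sample.update(new_nodes)
        frontier = new_nodes             # expand only from the newly added nodes
    return sample

# Example: estimate an average degree statistic from the sampled subnetwork
G = nx.barabasi_albert_graph(10_000, 3)
S = snowball_sample(G, num_seeds=10, num_waves=2)
sub = G.subgraph(S)
print(len(S), sum(dict(sub.degree()).values()) / len(S))
```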

For dynamic data streams, the task of statistical summarization is even more challenging, as the learnt models need to be continuously updated. One approach is to use a “sketch”, which is not a sample of the data but rather its synopsis captured in a space-efficient representation, usually obtained via hashing, that allows rapid computation of statistics such that a probability bound can ensure a high approximation error is unlikely. For instance, a sublinear count-min sketch could be used to determine the most frequent items in a data stream [9]. Histograms and wavelets are also used for statistically summarizing data in motion [8].
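
To make the sketch idea concrete, the following is a minimal count-min sketch in Python; the width, depth, hashing scheme, and toy stream are illustrative assumptions rather than details from the text.

```python
import hashlib

class CountMinSketch:
    """Space-efficient frequency summary of a stream: depth hash rows of
    a fixed width; estimates may over-count but never under-count."""
    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # Taking the minimum across rows limits the effect of hash collisions.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

# Feed a stream and query candidate heavy hitters
cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a", "b"] * 1000:
    cms.add(token)
print(cms.estimate("a"), cms.estimate("c"))
```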

The most popular approach for summarizing large datasets is, of course, clustering. Clustering is the general process of grouping similar data points in an unsupervised manner such that an overall aggregate of distances between pairs of points within each cluster is minimized while that across different clusters is maximized. Thus, the cluster representatives (say, the k means from the classical k-means clustering solution), along with other cluster statistics, can offer a simpler and cleaner view of the structure of a dataset containing a much larger number of points (n ≫ k) and including noise.
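
For instance, a k-means summary of a large point cloud could be computed along the following lines; scikit-learn and the synthetic data are illustrative assumptions, not part of the text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "big n, small k" data: ~100,000 noisy points around 3 centers
centers = np.array([[0, 0], [5, 5], [0, 8]])
X = np.vstack([c + rng.normal(scale=1.0, size=(33_000, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The k cluster representatives and per-cluster statistics summarize n >> k points
print(kmeans.cluster_centers_)
print(np.bincount(kmeans.labels_))   # cluster sizes
print(kmeans.inertia_)               # within-cluster sum of squares
```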

For big n, however, various strategies are being used to improve upon the classical clustering approaches. Limitations such as the need for iterative computations involving a prohibitively large O(n²) pairwise-distance matrix, or indeed the need to have the full dataset available beforehand for conducting such computations, are overcome by many of these strategies. For instance, a two-step online–offline approach (cf. Aggarwal [10]) first uses an online step to rapidly assign stream data points to the closest of k′ (≪ n) “microclusters.” Stored in efficient data structures, the microcluster statistics can be updated in real time as soon as the data points arrive, after which those points are not retained. In a slower offline step that is conducted less frequently, the retained k′ microclusters’ statistics are then aggregated to yield the latest result on the k (< k′) actual clusters in the data.
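
A toy sketch of this two-step idea is given below; the simple microcluster summary (count, linear sum, squared sum), the radius threshold, and the final k-means step are simplified assumptions rather than Aggarwal's actual algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

class MicroCluster:
    """Additive summary of the absorbed points: count, linear sum, and
    squared sum are enough to recover the mean and variance."""
    def __init__(self, point):
        self.n, self.ls, self.ss = 1, point.copy(), point ** 2

    def absorb(self, point):
        self.n += 1
        self.ls += point
        self.ss += point ** 2

    @property
    def center(self):
        return self.ls / self.n

def online_step(stream, k_prime=50, radius=2.0):
    """Assign each arriving point to the nearest microcluster (or open a
    new one); the raw points are discarded immediately afterwards."""
    micro = []
    for x in stream:
        if micro:
            d = [np.linalg.norm(x - m.center) for m in micro]
            j = int(np.argmin(d))
            if d[j] < radius or len(micro) >= k_prime:
                micro[j].absorb(x)
                continue
        micro.append(MicroCluster(x))
    return micro

def offline_step(micro, k=3):
    """Cluster the k' microcluster centers into the k actual clusters."""
    centers = np.array([m.center for m in micro])
    weights = np.array([m.n for m in micro])
    return KMeans(n_clusters=k, n_init=10).fit(centers, sample_weight=weights)

rng = np.random.default_rng(1)
stream = rng.normal(size=(20_000, 2)) + rng.choice([0, 6, 12], size=(20_000, 1))
model = offline_step(online_step(stream), k=3)
print(model.cluster_centers_)
```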

Clustering algorithms can also use sampling (e.g., CURE [11]), parallel computing (e.g., PKMeans [12] using MapReduce), and other strategies as required to handle big data.
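
As a hedged illustration of the MapReduce style of parallel clustering (a generic sketch, not the PKMeans implementation of [12]): the map phase assigns each point to its nearest centroid, and the reduce phase averages the points assigned to each centroid to produce updated centroids.

```python
import numpy as np
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (nearest_centroid_index, point) pairs; in a real MapReduce
    job each mapper would process one shard of the data in parallel."""
    for x in points:
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        yield j, x

def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid index."""
    sums = defaultdict(lambda: np.zeros(centroids.shape[1]))
    counts = defaultdict(int)
    for j, x in pairs:
        sums[j] += x
        counts[j] += 1
    return np.array([sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(len(centroids))])

rng = np.random.default_rng(2)
X = rng.normal(size=(5_000, 2)) + rng.choice([0.0, 8.0], size=(5_000, 1))
centroids = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(10):                  # one map + reduce round per iteration
    centroids = reduce_phase(map_phase(X, centroids), centroids)
print(centroids)
```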

While subsampling and clustering are approaches to deal with the big n problem of big data, dimensionality reduction techniques are used to mitigate the challenges of big p. Dimensionality reduction is one of the classical concerns that can be traced back to the work of Karl Pearson, who, in 1901, introduced principal component analysis (PCA), which uses a small number of principal components to explain much of the variation in a high-dimensional dataset.
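
A brief sketch of PCA-based reduction follows; scikit-learn and the simulated high-dimensional data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# High-dimensional data (p = 200) whose variation lies mostly in a few directions
latent = rng.normal(size=(1_000, 5))
X = latent @ rng.normal(size=(5, 200)) + 0.1 * rng.normal(size=(1_000, 200))

pca = PCA(n_components=5).fit(X)
X_reduced = pca.transform(X)                 # n x 5 instead of n x 200
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```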

PCA, which is a lossy linear model of dimensionality reduction, and other more involved projection models, typically using matrix decomposition for feature selection, have long been the main-
