Big Data Analytics
tome), complex graphical models, etc., new methods are being developed for sampling large graphs to estimate a given network’s parameters, as well as the node-, edge-, or subgraph-statistics of interest. For example, snowball sampling is a common method that starts with an initial sample of seed nodes and, in each step i, includes in the sample all nodes that have edges with the nodes in the sample at step i − 1 but were not yet present in the sample. Network sampling also includes degree-based or PageRank-based methods and different types of random walks. Finally, the statistics are aggregated from the sampled subnetworks. For dynamic data streams, the task of statistical summarization is even more challenging, as the learnt models need to be continuously updated. One approach is to use a “sketch”, which is not a sample of the data but rather its synopsis captured in a space-efficient representation, usually obtained via hashing, that allows rapid computation of statistics with a probability bound ensuring that a high approximation error is unlikely. For instance, a sublinear count-min sketch could be used to determine the most frequent items in a data stream [9]. Histograms and wavelets are also used for statistically summarizing data in motion [8].
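The snowball procedure described above can be sketched in a few lines; the graph representation and function name here are illustrative, not from the text:

```python
def snowball_sample(adj, seeds, num_steps):
    """Snowball sampling: start from seed nodes and, at each step i,
    add every node adjacent to the step-(i-1) sample that is not yet
    in the sample. `adj` maps each node to its list of neighbors."""
    sample = set(seeds)
    frontier = set(seeds)  # nodes added in the previous step
    for _ in range(num_steps):
        new_nodes = {v for u in frontier for v in adj[u]} - sample
        if not new_nodes:  # the sample has stopped growing
            break
        sample |= new_nodes
        frontier = new_nodes
    return sample
```

Statistics of interest (degree distributions, subgraph counts, etc.) would then be estimated on the returned subsample rather than on the full graph.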
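To make the sketch idea concrete, the following is a toy count-min sketch in the spirit of [9]; the class name, hash choice, and default table sizes are illustrative assumptions:

```python
import hashlib

class CountMinSketch:
    """Space-efficient frequency synopsis: `depth` hashed counter rows
    of size `width`. Estimates never undercount; collisions can only
    inflate them, and taking the minimum over rows bounds the error."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # A distinct hash function per row, simulated by salting with the row id.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

The storage is O(width × depth) regardless of the stream length, which is what makes the synopsis sublinear in the data.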
The most popular approach for summarizing large datasets is, of course, clustering. Clustering is the general process of grouping similar data points in an unsupervised manner such that an overall aggregate of distances between pairs of points within each cluster is minimized while that across different clusters is maximized.
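As a minimal illustration of this process, here is a plain sketch of Lloyd's k-means iteration (the classical algorithm alluded to below); the function name and the fixed iteration count are illustrative:

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternately assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive initialization from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[j]  # keep an empty cluster's centroid
                     for j, cl in enumerate(clusters)]
    return centroids, clusters
```

The returned k centroids are exactly the kind of cluster representatives discussed next: a compact stand-in for the full dataset.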
Thus, the cluster representatives (say, the k means from the classical k-means clustering solution), along with other cluster statistics, can offer a simpler and cleaner view of the structure of a dataset containing a much larger number of points (n ≫ k) and including noise. For big n, however, various strategies are being used to improve upon the classical clustering approaches. Many of these strategies overcome limitations such as the need for iterative computations involving a prohibitively large O(n²) pairwise-distance matrix, or indeed the need to have the full dataset available beforehand for conducting such computations. For instance, a two-step online–offline approach (cf. Aggarwal [10]) first uses an online step to rapidly assign stream data points to the closest of k′ (≪ n) “microclusters.” Stored in efficient data structures, the microcluster statistics can be updated in real time as soon as the data points arrive, after which those points are not retained. In a slower offline step that is conducted less frequently, the retained k′ microclusters’ statistics are then aggregated to yield the latest result on the k (< k′) actual clusters in the data.
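A simplified sketch of the online step, in the spirit of the microcluster approach of [10], might look as follows; the class, the radius rule, and all names are illustrative assumptions, not the cited method itself:

```python
class MicroCluster:
    """Additive summary statistics per microcluster: point count,
    per-dimension linear sums, and squared sums. Points are discarded
    once absorbed; only these statistics are retained."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)                 # linear sums
        self.ss = [x * x for x in point]      # squared sums

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

def online_assign(micros, point, radius, max_micros):
    """Absorb the point into the nearest microcluster if within `radius`;
    otherwise open a new microcluster (up to max_micros)."""
    if micros:
        nearest = min(micros,
                      key=lambda m: sum((a - b) ** 2
                                        for a, b in zip(m.centroid(), point)))
        if sum((a - b) ** 2
               for a, b in zip(nearest.centroid(), point)) <= radius ** 2:
            nearest.absorb(point)
            return micros
    if len(micros) < max_micros:
        micros.append(MicroCluster(point))
    else:
        # Simplification: real systems instead merge the two closest
        # microclusters, or retire stale ones, before opening a new one.
        nearest.absorb(point)
    return micros
```

The offline step would then run an ordinary clustering routine over the k′ retained centroids (weighted by their counts) to produce the k actual clusters.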
Clustering algorithms can also use sampling (e.g., CURE [11]), parallel computing (e.g., PKMeans [12] using MapReduce), and other strategies as required to handle big data.
While subsampling and clustering are approaches to deal with the big n problem of big data, dimensionality reduction techniques are used to mitigate the challenges of big p. Dimensionality reduction is one of the classical concerns that can be traced back to the work of Karl Pearson, who, in 1901, introduced principal component analysis (PCA), which uses a small number of principal components to explain much of the variation in a high-dimensional dataset. PCA, which is a lossy linear model of dimensionality reduction, and other more involved projection models, typically using matrix decomposition for feature selection, have long been the main-