10.11.2016 Views

Learning Data Mining with Python


Working with Big Data

Big data

What makes big data different? Most big-data proponents talk about the four Vs of big data:

1. Volume: The amount of data that we generate and store is growing at an increasing rate, and predictions of the future generally only suggest further increases. Today's multi-gigabyte hard drives will turn into exabyte hard drives in a few years, and network traffic will increase as well. The signal-to-noise ratio can be quite poor, with important data being lost in the mountain of unimportant data.

2. Velocity: While related to volume, the velocity of data is increasing too. Modern cars have hundreds of sensors that stream data into their computers, and the information from these sensors needs to be analyzed at a subsecond level to operate the car. It isn't just a case of finding answers in the volume of data; those answers often need to come quickly.

3. Variety: Nice datasets with clearly defined columns are only a small part of the data that we have these days. Consider a social media post, which may have text, photos, user mentions, likes, comments, videos, geographic information, and other fields. Simply ignoring the parts of this data that don't fit your model will lead to a loss of information, but integrating that information can itself be very difficult.

4. Veracity: With the increase in the amount of data, it can be hard to determine whether the data is being correctly collected: whether it is outdated, is noisy, contains outliers, or is useful at all. Being able to trust the data is hard when a human can't reliably verify the data itself. External datasets are increasingly being merged into internal ones too, giving rise to more trouble relating to the veracity of the data.

These four main Vs (others have proposed additional Vs) outline why big data is different from just lots of data. At these scales, the engineering problem of working with the data is often difficult in its own right, let alone the analysis. While there are lots of snake oil salesmen who overstate what big data can do, it is hard to deny the engineering challenges and the potential of big-data analytics.

The algorithms we have used to date load the dataset into memory and then work on the in-memory version. This gives a large benefit in terms of speed of computation, as it is much faster to compute on in-memory data than to load each sample before we use it. In addition, in-memory data allows us to iterate over the data many times, improving our model.
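The difference between the two approaches can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tiny CSV string `RAW` stands in for a file on disk, and `load_in_memory`, `stream_samples`, and `fit_slope` are hypothetical helper names, not functions used elsewhere in this book. The model is a one-parameter line fitted by gradient descent, chosen only because it needs many passes over the data.

```python
import csv
import io

# Hypothetical toy dataset standing in for a file on disk (y = 2 * x).
RAW = "x,y\n1,2\n2,4\n3,6\n4,8\n"


def load_in_memory(text):
    """Parse the whole dataset once and keep every sample in a list."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["x"]), float(row["y"])) for row in reader]


def stream_samples(text):
    """Yield one sample at a time, re-parsing the source on every pass."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield float(row["x"]), float(row["y"])


def fit_slope(make_pass, n_passes=50, learning_rate=0.01):
    """Fit y = w * x by gradient descent, making several passes over the data.

    make_pass is a callable returning an iterable of (x, y) samples.
    """
    w = 0.0
    for _ in range(n_passes):
        for x, y in make_pass():
            # Gradient of the squared error (w * x - y) ** 2 with respect to w.
            w -= learning_rate * 2 * (w * x - y) * x
    return w


# In-memory: parse once, then iterate over the same list on every pass.
data = load_in_memory(RAW)
w_memory = fit_slope(lambda: data)

# Streaming: the source is parsed again on each of the 50 passes.
w_stream = fit_slope(lambda: stream_samples(RAW))

print(w_memory, w_stream)
```

Both calls reach the same slope, but the streaming version repeats the parsing work on every pass. With a dataset that fits in memory that cost is avoidable; with a dataset that does not, something like the streaming version is the only option.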
