25.10.2016 Views

SAP HANA Predictive Analysis Library (PAL)


3.6 Preprocessing Algorithms<br />

Records in a business database are usually not directly ready for predictive analysis, for the following reasons:<br />

● Some data comes in large volumes, which may exceed the capacity of an algorithm.<br />

● Some data contains noisy observations, which may hurt the accuracy of an algorithm.<br />

● Some attributes are badly scaled, which can make an algorithm unstable.<br />

To address the above challenges, <strong>PAL</strong> provides several convenient algorithms for data preprocessing.<br />

3.6.1 Binning<br />

Binning data is a common requirement prior to running certain predictive algorithms. It generally reduces the<br />

complexity of the model, for example, the model in a decision tree.<br />

Binning methods replace a value by a "bin number" defined by all elements of its neighborhood, that is, the bin<br />

it belongs to. The ordered values are distributed into a number of bins. Because binning methods consult the<br />

neighborhood of values, they perform local smoothing.<br />

Note<br />

Binning can only be used on a table with only one attribute.<br />

Binning Methods<br />

There are four binning methods:<br />

● Equal widths based on the number of bins<br />

Specify an integer to determine the number of equal-width bins and calculate the width of each bin by:<br />

BinWidth = (MaxValue - MinValue) / K<br />

Where MaxValue is the largest value in the column, MinValue is the smallest value in the column,<br />

and K is the number of bins.<br />

For example, according to this rule:<br />

○ MinValue + BinWidth > Values in Bin 1 ≥ MinValue<br />

○ MinValue + 2 * BinWidth > Values in Bin 2 ≥ MinValue + BinWidth<br />

● Equal bin widths defined as a parameter<br />

Specify the bin width and calculate the start and end of bin intervals by:<br />

Start of bin intervals = Minimum data value – 0.5 * Bin width<br />

End of bin intervals = Maximum data value + 0.5 * Bin width<br />

For example, assuming the data has a range from 6 to 38 and the bin width is 10:<br />

Start of bin intervals = 6 – 0.5 * 10 = 1<br />

End of bin intervals = 38 + 0.5 * 10 = 43<br />
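The two rules above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the PAL interface: the function names `equal_width_bin_numbers` and `bin_interval_bounds` are hypothetical, and placing the maximum value in the last bin is an assumption, since the rule as written leaves the upper edge of the last bin open.

```python
def equal_width_bin_numbers(values, k):
    """Return a 1-based bin number for each value, using K equal-width bins.

    Follows BinWidth = (MaxValue - MinValue) / K, with bin i covering
    MinValue + (i - 1) * BinWidth <= value < MinValue + i * BinWidth.
    """
    min_value, max_value = min(values), max(values)
    bin_width = (max_value - min_value) / k
    if bin_width == 0:
        # All values are identical, so everything falls into the first bin.
        return [1] * len(values)
    numbers = []
    for v in values:
        number = int((v - min_value) // bin_width) + 1
        # Assumption: the maximum value is placed in the last bin
        # instead of opening a (K + 1)-th bin.
        numbers.append(min(number, k))
    return numbers


def bin_interval_bounds(values, bin_width):
    """Return (start, end) of the bin intervals for a user-specified bin width.

    Start = minimum data value - 0.5 * bin width
    End   = maximum data value + 0.5 * bin width
    """
    start = min(values) - 0.5 * bin_width
    end = max(values) + 0.5 * bin_width
    return start, end
```

With data ranging from 6 to 38 and a bin width of 10, `bin_interval_bounds([6, 38], 10)` reproduces the start of 1 and end of 43 from the example above. For the first rule, `equal_width_bin_numbers([6, 10, 14, 22, 30, 38], 4)` uses a bin width of 8 and assigns the bin numbers 1, 1, 2, 3, 4, 4.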

