25.10.2016

SAP HANA Predictive Analysis Library (PAL)


Expected Result

PAL_SILHOUETTE_RESULT_TBL:

3.1.8 K-Medians

K-medians is a clustering algorithm similar to K-means. Both partition n observations into K clusters according to their nearest cluster center. In contrast to K-means, K-medians calculates each cluster center from the median of each feature instead of the mean.

A median is the middle value of a set of values arranged in order.
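To illustrate why the choice of median over mean matters, here is a minimal Python sketch. The sample values are hypothetical, and Python's `statistics` module stands in for whatever PAL does internally. Note that for an even number of values, `statistics.median` averages the two middle values, which is a property of that Python function, not something the text above states about PAL.

```python
from statistics import mean, median

# Hypothetical one-dimensional feature values with a single outlier.
values = [1, 2, 3, 4, 100]

center_mean = mean(values)      # 110 / 5 = 22, pulled toward the outlier
center_median = median(values)  # middle of the sorted values = 3

print(center_mean, center_median)
```

The single outlier moves the mean far from most of the data, while the median stays at the middle value.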

Given an initial set of K cluster centers m1, ..., mk, the algorithm alternates between the following two steps and repeats until the assignments no longer change:

● Assignment step: assigns each observation to the cluster with the closest center.

● Update step: calculates the new median of each feature of each cluster to be the new center of that cluster.
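The two alternating steps can be sketched in plain Python as follows. This is an illustrative sketch, not the PAL implementation: the function name `k_medians`, the `max_iter` safeguard, the Euclidean distance, the first-K initialization, and the sample points are all assumptions made for the example.

```python
from math import dist
from statistics import median

def k_medians(points, k, max_iter=100):
    """Cluster equal-length numeric tuples into k clusters (illustrative only)."""
    # Assumption: the first k points serve as the initial centers.
    centers = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the closest center (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments no longer change
        assignments = new_assignments
        # Update step: per-feature median of each cluster's members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(median(col) for col in zip(*members))
    return centers, assignments

# Hypothetical sample data: two visually separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centers, assignments = k_medians(points, k=2)
print(centers)      # [(1.0, 1.0), (9.0, 9.0)]
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

On this small input the loop stops after the second pass, because the assignments computed from the updated medians match the previous ones.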

The K-medians implementation in PAL supports multithreading, data normalization, different distance measures, and cluster quality measurement (Silhouette). The implementation does not support categorical data directly, but this can be managed through data transformation. Because a median cannot be computed for categorical data, the K-medians implementation uses the most frequent value instead. The first K and random K starting methods are supported.
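As a rough illustration of the two starting methods and of using the most frequent value for a categorical feature, consider the following Python sketch. The function names, the `method` strings, and the third sample row are hypothetical and are not PAL parameters or PAL data. How PAL resolves ties between equally frequent values is not stated here, so the behavior of Python's `statistics.mode` should not be read as PAL's.

```python
import random
from statistics import median, mode

def initial_centers(rows, k, method="first_k", seed=None):
    """Pick k starting centers; the method names are illustrative only."""
    if method == "first_k":
        return list(rows[:k])                              # first K rows as given
    if method == "random_k":
        return random.Random(seed).sample(list(rows), k)   # K distinct random rows
    raise ValueError(f"unknown method: {method}")

def cluster_center(rows, categorical):
    """Median for numeric columns, most frequent value for categorical ones."""
    columns = list(zip(*rows))
    return tuple(
        mode(col) if i in categorical else median(col)
        for i, col in enumerate(columns)
    )

# The third row is made up for the example.
rows = [(31, 10000, "Female"), (27, 8000, "Male"), (45, 12000, "Female")]
center = cluster_center(rows, categorical={2})
print(center)  # (31, 10000, 'Female')
```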

Support for Categorical Attributes

If an attribute is of category type, it is converted to a binary vector and then used as a numerical attribute. For example, in the table below, "Gender" is of category type.

Table 49:

Customer ID   Age   Income   Gender
T1            31    10,000   Female
T2            27    8,000    Male

Because "Gender" has two distinct values, it is converted into a binary vector with two dimensions:

Table 50:

Customer ID   Age   Income   Gender_1   Gender_2
T1            31    10,000   0          1
T2            27    8,000    1          0

Thus, the Euclidean distance between T1 and T2 is:
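As a hedged illustration, the conversion from Table 49 to Table 50 and a plain, unweighted Euclidean distance over the converted rows can be sketched in Python. The `encode` helper and the category order are assumptions chosen to reproduce Table 50. Any normalization or category weighting that the library may apply is not modeled here.

```python
from math import dist, sqrt

# Rows of Table 49: (customer_id, age, income, gender)
table = [("T1", 31, 10000, "Female"), ("T2", 27, 8000, "Male")]

# Column order chosen to match Table 50: Gender_1 = Male, Gender_2 = Female.
categories = ["Male", "Female"]

def encode(row):
    """Replace the category value with a binary vector, one slot per distinct value."""
    _, age, income, gender = row
    return (age, income) + tuple(int(gender == c) for c in categories)

t1, t2 = (encode(r) for r in table)
# t1 == (31, 10000, 0, 1) and t2 == (27, 8000, 1, 0), as in Table 50

# (31-27)^2 + (10000-8000)^2 + (0-1)^2 + (1-0)^2 = 16 + 4000000 + 1 + 1 = 4000018
distance = dist(t1, t2)
print(distance)  # sqrt(4000018), roughly 2000.0045
```

In this unnormalized form the income difference dominates the result, which shows why data normalization is useful before clustering.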
