04.06.2015 Views

Database Modeling and Design

Database Modeling and Design

Database Modeling and Design

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

172 CHAPTER 8 Business Intelligence<br />

0.1<br />

P = 0.9<br />

0.1<br />

0.9 0.1<br />

0.9<br />

0.1<br />

0.9 0.1 0.9 0.1 0.9 0.1 0.9<br />

0.001<br />

0.009 0.009<br />

0.081 0.009 0.081<br />

0.081<br />

0.729<br />

a = 3 lefts<br />

P 3 = 0.001<br />

3<br />

C = 1 bin<br />

3<br />

a = 2 lefts<br />

P 2 = 0.009<br />

3<br />

C = 3 bins<br />

2<br />

a = 1 left<br />

P 1 = 0.081<br />

3<br />

C = 3 bins<br />

1<br />

a = 0 lefts<br />

P 0 = 0.729<br />

3<br />

C = 1 bin<br />

0<br />

Figure 8.15<br />

Example of a binomial multifractal distribution tree<br />

The values of P <strong>and</strong> k can be estimated based on sample data. The<br />

algorithm used in [Faloutsos, Matias, <strong>and</strong> Silberschatz, 1996] has three<br />

inputs: the number of rows in the sample, the frequency of the most<br />

commonly occurring value, <strong>and</strong> the number of distinct aggregate rows<br />

in the sample. The value of P is calculated based on the frequency of the<br />

most commonly occurring value. They begin with:<br />

k = ⎡Log 2 (Distinct rows in sample)⎤ 8.7<br />

<strong>and</strong> then adjust k upwards, recalculating P until a good fit to the number<br />

of distinct rows in the sample is found.<br />

Other distribution models can be utilized to predict the size of a view<br />

based on sample data. For example, the use of the Pareto distribution<br />

model has been explored [Nadeau <strong>and</strong> Teorey, 2003]. Another possibility<br />

is to find the best fit to the sample data for multiple distribution models,<br />

calculate which model is most likely to produce the given sample data,<br />

<strong>and</strong> then use that model to predict the number of rows for the full data<br />

set. This would require calculation for each distribution model considered,<br />

but should generally result in more accurate estimates.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!