20.01.2014 Views

thesis - Faculty of Information and Communication Technologies ...


Chapter 5. Growth Dynamics

ments. These zero values need to be eliminated because log-normal, Pareto and power-law distributions are defined only for positive values. However, the impact of these transformations, if any, is not properly represented in the studies [21, 54, 223, 270, 299]. Furthermore, once data is transformed, the transformation has to be taken into account when deriving any inferences. Recently, a weakness of the approach with respect to fitting power laws has been put forward by Goldstein et al. [104] as well as Clauset et al. [48]. They argue that the widely used approach of fitting power laws using a graphical linear fit of data transformed into a log-log scale is biased and inaccurate, especially since this approach uses no quantitative measure of goodness-of-fit. This limitation would apply to the work by Wheeldon et al. [299], as they use a direct linear fit of the log-log plot of the full raw histogram of the data. Potanin et al. [223] and Concas et al. [55] also use a linear fit of the logarithmically binned histogram, which limits the power and conclusions of their studies [48].
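The bias of a straight-line fit on a log-log histogram can be illustrated with synthetic data. The following is a minimal sketch, not the procedure of any cited study: it draws samples from a continuous power law with a known exponent, then estimates that exponent twice, once with a least-squares line through the log-log raw histogram and once with the standard maximum-likelihood estimator for a continuous power law with known lower bound. The function names, the unit-width bins, the sample size and the seed are all assumptions chosen for illustration.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n samples from a continuous power law p(x) ~ x^-alpha, x >= xmin."""
    # Inverse-transform sampling; 1 - u lies in (0, 1], so the power is defined.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def alpha_mle(xs, xmin):
    """Maximum-likelihood exponent for a continuous power law with known xmin."""
    n = len(xs)
    return 1.0 + n / sum(math.log(x / xmin) for x in xs)


def alpha_loglog_fit(xs):
    """Exponent from a least-squares line through the log-log raw histogram."""
    counts = {}
    for x in xs:
        k = int(x)  # unit-width linear bins [k, k + 1), an illustrative choice
        counts[k] = counts.get(k, 0) + 1
    # Empty bins never enter the dict, so every log() below is defined.
    pts = [(math.log(k + 0.5), math.log(c)) for k, c in counts.items()]
    mx = sum(px for px, _ in pts) / len(pts)
    my = sum(py for _, py in pts) / len(pts)
    sxy = sum((px - mx) * (py - my) for px, py in pts)
    sxx = sum((px - mx) ** 2 for px, _ in pts)
    return -sxy / sxx  # the fitted slope is -alpha


def compare(n=20000, alpha=2.5, xmin=1.0, seed=1):
    """Return (MLE estimate, log-log fit estimate) for one synthetic sample."""
    rng = random.Random(seed)
    xs = sample_power_law(n, alpha, xmin, rng)
    return alpha_mle(xs, xmin), alpha_loglog_fit(xs)


if __name__ == "__main__":
    mle, ls = compare()
    print(f"true alpha = 2.5, MLE = {mle:.3f}, log-log fit = {ls:.3f}")
```

In this setup the sparsely populated tail bins, most of which hold a single observation, pull the fitted line away from the true slope, whereas the maximum-likelihood estimate uses every observation directly and does not depend on a binning choice. Neither estimate is by itself a goodness-of-fit test.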

Another limitation is that we cannot use these distributions for a meaningful comparison between software systems or different releases of the same software system. This is because the distributions are created by estimating the parameters from the underlying raw data rather than from empirically derived tables. Further, the value of fitting metric data to known distributions in order to infer the underlying generative process has not yet been properly established [210], especially since multiple non-correlated processes have been shown to generate the same distribution. Interestingly, this limitation is acknowledged by Concas et al. [54, 55], but they present their work of fitting metric data to a distribution as valuable since it provides a broad indication of a potential underlying process and, more importantly, can indicate the presence of extreme values. A similar argument is also extended by Valverde et al. [287]. The common approach used by these studies, based on the analysis of a metric data distribution, is to infer the underlying generative process by investigating a single release. For instance, Concas et al. [55] argue that the presence of these skewed distributions in software denotes that the programming activity cannot be considered to be a process involving random addition of independent increments but exhibits strong organic dependencies on what has been already developed.
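The contrast between independent increments and growth that depends on existing code can be made concrete with a toy simulation. The sketch below is purely hypothetical and is not the model used in any of the cited studies: each simulated class either grows by a random amount that ignores its current size, or grows by a random factor of its current size. The growth ranges, the number of classes and steps, and the function names are assumptions chosen only to make the difference visible.

```python
import random


def skewness(xs):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def simulate(n_classes=5000, steps=50, seed=7):
    """Return (skewness under additive growth, skewness under dependent growth)."""
    rng = random.Random(seed)
    additive, dependent = [], []
    for _ in range(n_classes):
        a = 1.0  # grows by increments that ignore the current size
        d = 1.0  # grows in proportion to what already exists
        for _ in range(steps):
            a += rng.uniform(0.0, 2.0)   # hypothetical independent increment
            d *= rng.uniform(0.8, 1.3)   # hypothetical size-dependent growth factor
        additive.append(a)
        dependent.append(d)
    return skewness(additive), skewness(dependent)


if __name__ == "__main__":
    add_skew, dep_skew = simulate()
    print(f"additive skewness = {add_skew:.2f}, dependent skewness = {dep_skew:.2f}")
```

Under these assumptions the additive process yields a roughly symmetric size distribution, while the size-dependent process yields a strongly right-skewed one. This only shows that dependent growth is one route to skewed data; as noted above, several unrelated processes can produce the same distribution, so the skew alone does not identify the generative process.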
