thesis - Faculty of Information and Communication Technologies ...
Chapter 5. Growth Dynamics
ments. These zero values need to be eliminated because log-normal, Pareto
and power-law distributions apply only to data with positive values.
However, the impact of these transformations, if any, is not properly
represented in the studies [21, 54, 223, 270, 299]. Furthermore,
once data is transformed, the transformation has to be taken into account
when deriving any inferences. Recently, a weakness of the approach with respect
to fitting power laws has been put forward by Goldstein et al. [104] as
well as Clauset et al. [48]. They argue that the widely used approach
of fitting power laws using a graphical linear fit of data transformed
into a log-log scale is biased and inaccurate, especially since this approach
uses no quantitative measure of goodness-of-fit.
This limitation would apply to the work by Wheeldon et al. [299],
as they use a direct linear fit of the log-log plot of the full raw histogram
of the data. Potanin et al. [223] and Concas et al. [55] also use a linear
fit of the logarithmically binned histogram, which limits the power and
conclusions of their studies [48].
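
The difference between the two fitting approaches can be illustrated with a minimal sketch. It is not taken from any of the cited studies: the function names, the synthetic data, the bin count and the choice of x_min = 1 are all assumptions made for illustration. The sketch draws samples from a continuous power law with a known exponent, then estimates that exponent in two ways: by a least-squares line through the log-log plot of a linearly binned histogram, and by the maximum-likelihood estimator for the continuous case, alpha = 1 + n / sum(ln(x_i / x_min)).

```python
import math
import random


def sample_power_law(n, alpha, x_min, rng):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    # Inverse-transform sampling; 1 - rng.random() lies in (0, 1], so the base is never zero.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def mle_exponent(values, x_min):
    """Maximum-likelihood estimate of alpha for a continuous power law."""
    tail = [v for v in values if v >= x_min]
    return 1.0 + len(tail) / sum(math.log(v / x_min) for v in tail)


def loglog_slope_exponent(values, n_bins=50):
    """Exponent taken from a least-squares line through the log-log histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Empty bins are dropped because log(0) is undefined.
    points = [(math.log(lo + (i + 0.5) * width), math.log(c))
              for i, c in enumerate(counts) if c > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return -slope


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    true_alpha = 2.5        # hypothetical exponent for the synthetic data
    data = sample_power_law(10000, true_alpha, 1.0, rng)
    print("true exponent:        ", true_alpha)
    print("log-log linear fit:   ", loglog_slope_exponent(data))
    print("maximum likelihood:   ", mle_exponent(data, 1.0))
```

Neither function reports a goodness-of-fit measure, so the sketch only compares the estimates themselves. Note also that the linear fit has to discard empty bins, for the same reason that zero values must be removed before fitting.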

Another limitation is that we cannot use these distributions for a meaningful
comparison between software systems or different releases of the
same software system. This is because the distributions are created
by estimating the parameters from the underlying raw data rather than
from empirically derived tables. Further, the value of fitting metric data
to known distributions in order to infer the underlying generative process
has not yet been properly established [210], especially since multiple
non-correlated processes have been shown to generate the same
distribution. Interestingly, this limitation is acknowledged by Concas
et al. [54, 55], but they present their work of fitting metric data to a
distribution as valuable since it provides a broad indication of a potential
underlying process and, more importantly, can indicate the presence
of extreme values. A similar argument is also made by Valverde et
al. [287]. The common approach used by these studies, which are based on the
analysis of a metric data distribution, is to infer the underlying generative
process by investigating a single release. For instance, Concas et
al. [55] argue that the presence of these skewed distributions in software
denotes that the programming activity cannot be considered to be
a process involving random addition of independent increments but instead exhibits
strong organic dependencies on what has already been developed.
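
The contrast drawn in this argument can be made concrete with a toy simulation. This is an illustrative assumption on our part, not the model used by Concas et al.: the function names, the number of steps and the growth rate of up to 50% per step are hypothetical. In the first process each entity grows by independent increments that ignore its current size; in the second each increment is proportional to the size already reached, which is one simple form of dependency on what has been developed.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def additive_growth(n_entities, n_steps, rng):
    """Each entity grows by independent increments unrelated to its current size."""
    return [sum(rng.random() for _ in range(n_steps)) for _ in range(n_entities)]


def proportional_growth(n_entities, n_steps, rng):
    """Each increment is a random fraction (hypothetically up to 50%) of the current size."""
    sizes = []
    for _ in range(n_entities):
        size = 1.0
        for _ in range(n_steps):
            size += size * 0.5 * rng.random()
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, chosen arbitrarily
    print("skewness, independent increments: ", skewness(additive_growth(2000, 50, rng)))
    print("skewness, size-dependent growth:  ", skewness(proportional_growth(2000, 50, rng)))
```

Under these assumptions the additive process yields a roughly symmetric size distribution, while the size-dependent process yields a strongly right-skewed one. The sketch shows only that a skewed distribution is consistent with dependent growth. It does not show the converse, since different processes can produce the same distribution.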