thesis - Faculty of Information and Communication Technologies ...
Chapter 3. Data Selection Methodology<br />
Using Binaries<br />
We extract the measures for each class by processing the compiled Java<br />
bytecode instructions generated by the compiler (details are explained<br />
in Chapter 4). This method allows us to avoid running a (sometimes<br />
quite complex) build process for each release under investigation since<br />
we only analyze code that has actually been compiled.<br />
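As a rough illustration of why compiled classes are convenient input, the sketch below reads field and method counts straight from the bytes of a Java class file. It is not the extraction tool used in this work (that is described in Chapter 4); the function name `class_size_measures` and the choice of these two counts are assumptions made purely for the example. It relies only on the published class-file layout: a magic number, version fields, the constant pool, and then the interface, field and method tables.

```python
import io
import struct

# Byte sizes of the fixed-length constant-pool entries, keyed by tag
# (values from the JVM class-file format).
_FIXED_SIZES = {3: 4, 4: 4, 5: 8, 6: 8, 7: 2, 8: 2, 9: 4, 10: 4,
                11: 4, 12: 4, 15: 3, 16: 2, 17: 4, 18: 4, 19: 2, 20: 2}


def _u2(stream):
    """Read an unsigned big-endian 16-bit value."""
    return struct.unpack(">H", stream.read(2))[0]


def _u4(stream):
    """Read an unsigned big-endian 32-bit value."""
    return struct.unpack(">I", stream.read(4))[0]


def _skip_members(stream):
    """Skip a field or method table and return how many entries it held."""
    count = _u2(stream)
    for _member in range(count):
        stream.read(6)                       # access_flags, name_index, descriptor_index
        for _attribute in range(_u2(stream)):
            stream.read(2)                   # attribute_name_index
            stream.read(_u4(stream))         # attribute body
    return count


def class_size_measures(data):
    """Return field and method counts for one compiled class (hypothetical helper)."""
    stream = io.BytesIO(data)
    if _u4(stream) != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    stream.read(4)                           # minor_version, major_version
    pool_count = _u2(stream)
    index = 1                                # constant-pool indices start at 1
    while index < pool_count:
        tag = stream.read(1)[0]
        if tag == 1:                         # Utf8: length-prefixed bytes
            stream.read(_u2(stream))
        else:
            stream.read(_FIXED_SIZES[tag])
        index += 2 if tag in (5, 6) else 1   # Long and Double occupy two slots
    stream.read(6)                           # access_flags, this_class, super_class
    stream.read(2 * _u2(stream))             # interface indices
    fields = _skip_members(stream)
    methods = _skip_members(stream)
    return {"fields": fields, "methods": methods}
```

Because the input is the compiler's output, only classes that were actually built are measured, and no build script has to be run to find out which ones those are.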
Our approach of using compiled binaries to extract metric data is more<br />
precise than the methods used by other researchers who have studied<br />
evolution in open-source software systems, since the earlier<br />
work used source code directories as input for data analysis [41,<br />
100,105,120,153,217,239,256]. In order to process the large amount<br />
of raw data, many of the previous open-source software evolution studies<br />
used size measures such as raw file count,<br />
raw folder count and raw line count. These measures were computed<br />
with minimal filtering using Unix text utilities that select files<br />
based on their extension, for example, *.c and *.cpp to capture C and<br />
C++ source files respectively. These approaches have the advantage of<br />
providing a general trend quickly and are practical when attempting to<br />
process many thousands of projects. The file-based processing method,<br />
however, does not directly mine any structural dependency information.<br />
It also includes source code files that may no longer be part of the code<br />
base: essentially unused and unreachable code that has not been removed<br />
from the repositories.<br />
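The file-based measures described above can be sketched in a few lines. The following is a minimal illustration, not a reconstruction of any cited study's tooling: the function name `raw_size_measures`, the default extension list, and the decision to count every sub-folder (rather than only folders containing source files) are assumptions made for the example.

```python
import os


def raw_size_measures(root, extensions=(".c", ".cpp")):
    """Compute raw file, folder and line counts under a source directory."""
    suffixes = tuple(extensions)
    file_count = folder_count = line_count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folder_count += len(dirnames)        # every sub-folder is counted
        for name in filenames:
            if not name.endswith(suffixes):  # filter purely on file extension
                continue
            file_count += 1
            with open(os.path.join(dirpath, name), "rb") as handle:
                line_count += sum(1 for _line in handle)
    return {"files": file_count, "folders": folder_count, "lines": line_count}
```

The filter looks only at file names, so a source file that is never compiled into the release is counted exactly like one that is, which is the imprecision discussed above.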
This practice of leaving old code in place has been noted by researchers in the<br />
field of code clone detection, who observed the tendency of developers<br />
to copy a block of code, modify it, and leave the old code in the<br />
repository [5,135,155,157]. Godfrey et al. [100], in their study of Linux<br />
kernel evolution, noted that, depending on the configuration settings in<br />
the build script (Makefile), it is possible that only 15% of the Linux<br />
source files are part of the final build. The use of only a small subset of the<br />
source for a release is common in software that can be built for multiple<br />
environments. For instance, Linux is an operating system designed to<br />
run on a large range of hardware platforms. When building the operating<br />
system for a specific hardware configuration, many modules are not<br />