20.01.2014 Views

thesis - Faculty of Information and Communication Technologies ...


Chapter 3. Data Selection Methodology

Using Binaries

We extract the measures for each class by processing the Java bytecode generated by the compiler (details are explained in Chapter 4). Because we analyze only code that has actually been compiled, this method lets us avoid running a (sometimes quite complex) build process for each release under investigation.
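As a rough illustration of what working from binaries can look like, the sketch below lists the compiled classes in a release archive and reads the version fields from each class-file header. It is not the extraction tool described in Chapter 4. It assumes, purely for illustration, that a release is available as a JAR (ZIP) archive, and the function names `read_class_header` and `compiled_classes` are hypothetical. The header layout (a 4-byte magic number `0xCAFEBABE` followed by 2-byte minor and major versions) comes from the Java class-file format.

```python
import io
import struct
import zipfile
from typing import Dict, Tuple

CLASS_MAGIC = 0xCAFEBABE  # first four bytes of every valid Java class file


def read_class_header(data: bytes) -> Tuple[int, int]:
    """Return (major, minor) class-file version, or raise ValueError."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    # Big-endian: u4 magic, u2 minor_version, u2 major_version.
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != CLASS_MAGIC:
        raise ValueError("missing 0xCAFEBABE magic number")
    return major, minor


def compiled_classes(jar_bytes: bytes) -> Dict[str, Tuple[int, int]]:
    """Map each .class entry in a JAR to its (major, minor) version.

    Hypothetical helper: only entries that were actually compiled into
    the archive are visited, so no build step is needed.
    """
    result = {}
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for name in jar.namelist():
            if name.endswith(".class"):
                result[name] = read_class_header(jar.read(name))
    return result
```

Only entries ending in `.class` are considered, so documentation, resources, and any source files that never reached the compiler do not enter the data set. Extracting actual measures would require parsing the rest of each class file, which this sketch does not attempt.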

Our approach of using compiled binaries to extract metric data is more precise than the methods used by other researchers who studied evolution in open-source software systems, since the earlier work used source code directories as input for data analysis [41, 100, 105, 120, 153, 217, 239, 256]. To process the large amount of raw data, many of the previous open-source software evolution studies relied on size measures such as raw file count, raw folder count, and raw line count. These measures were computed with minimal filtering, using Unix text utilities that select files by extension, for example, *.c and *.cpp to capture C and C++ source files respectively. These approaches have the advantage of providing a general trend quickly and are practical when attempting to process many thousands of projects. The file-based processing method, however, does not directly mine any structural dependency information. It also includes source code files that may no longer be part of the code base: essentially unused and unreachable code that has not been removed from the repositories.
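The file-based measures described above can be sketched as follows. This is a minimal illustration, not the tooling used in the cited studies (which relied on Unix text utilities); the function name `raw_size_measures` and the exact counting rules (every subdirectory counts as a folder, every physical line counts as a line) are assumptions made for the example.

```python
import os
from typing import Dict, Iterable


def raw_size_measures(root: str, extensions: Iterable[str]) -> Dict[str, int]:
    """Compute raw file, folder, and line counts under a source directory.

    Files are selected purely by extension, so the counts include any
    file that is still in the tree, whether or not it is ever built.
    """
    exts = tuple(extensions)
    files = folders = lines = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folders += len(dirnames)  # subdirectories only; root is not counted
        for name in filenames:
            if name.endswith(exts):
                files += 1
                # Read in binary mode to avoid encoding errors in old sources.
                with open(os.path.join(dirpath, name), "rb") as fh:
                    lines += sum(1 for _ in fh)
    return {"files": files, "folders": folders, "lines": lines}
```

Calling `raw_size_measures("path/to/release", [".c", ".cpp"])` gives a quick size trend, but an abandoned file left in the tree contributes to every count exactly as a live one does, which is the imprecision noted above.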

This practice of leaving old code behind has been noted by researchers in the field of code clone detection, who observed the tendency of developers to copy a block of code, modify it, and leave the old code in the repository [5, 135, 155, 157]. Godfrey et al. [100], in their study of Linux kernel evolution, noted that, depending on the configuration settings in the build script (Makefile), it is possible that only 15% of the Linux source files are part of the final build. Using only a small subset of the source for a release is common in software that can be built for multiple environments. For instance, Linux is an operating system designed to run on a large range of hardware platforms. When building the operating system for a specific hardware configuration, many modules are not
