SEKE 2012 Proceedings - Knowledge Systems Institute


Figure 1. High-level overview of detecting compatible licenses

license extraction techniques may be used to populate such repositories. By combining the system’s model with license metadata, an enriched version of the system model can be derived, holding information about both the components that comprise the system and the license of each participating component. This enriched model can be seen as the license architecture of the software system, depicting artifact and license interconnections. The validity of these connections can be examined by checking whether two connected components have compatible licenses. To perform this task, a definition of the compatibilities between licenses is required. This can be achieved by analyzing the licenses and modeling their compatibilities either as a whole or at the level of individual permissions and restrictions. By combining the enriched system model with the license compatibility model, it is possible to detect permissible licensing schemes for the software system and/or possible licensing violations.
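The check described above can be illustrated with a minimal sketch, assuming whole-license compatibility modeled as a set of ordered pairs. The names `COMPATIBLE` and `find_violations`, the component names, and the table entries are hypothetical and serve only as an illustration; they are not a complete or authoritative compatibility model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical compatibility model: a pair (license of the used component,
# license of the using component) is listed when the combination is assumed
# to be acceptable. Entries are illustrative assumptions only.
COMPATIBLE: Set[Tuple[str, str]] = {
    ("MIT", "MIT"),
    ("MIT", "Apache-2.0"),
    ("MIT", "GPL-3.0"),
    ("Apache-2.0", "Apache-2.0"),
    ("Apache-2.0", "GPL-3.0"),
    ("GPL-3.0", "GPL-3.0"),
}


def find_violations(
    licenses: Dict[str, str],
    dependencies: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the (user, used) connections whose licenses are not compatible."""
    violations = []
    for user, used in dependencies:
        pair = (licenses[used], licenses[user])
        if pair not in COMPATIBLE:
            violations.append((user, used))
    return violations


# Enriched system model: components annotated with licenses, plus connections.
licenses = {"app": "Apache-2.0", "parser": "MIT", "crypto": "GPL-3.0"}
dependencies = [("app", "parser"), ("app", "crypto")]

violations = find_violations(licenses, dependencies)
print(violations)
```

Modeling compatibility at the level of individual permissions and restrictions would replace the pair lookup with a comparison of per-license term sets; the traversal of the connections would stay the same.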

III. LICENSE INFORMATION EXTRACTION

In FLOSS development, several methodologies and corresponding tools have been developed to identify licenses in source code, software libraries in binary format, and components or programs from different suppliers under FLOSS or proprietary licenses. One category of extraction methods is based on regular expressions that identify the license of each file through string matching. These patterns result from the analysis of the actual license text. Tools using this approach include Ohcount¹, OSLC (Open Source License Checker), and ASLA (Automated Software License Analyzer) [3]. In this method, source files and files that might contain license information are first identified (i.e., commented parts of source files and whole files); their content is then compared against database files to conclude whether a specific license is present.
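As a rough illustration of the regular-expression approach, the following sketch collects the commented parts of a source file and matches them against a small pattern table. The patterns, the comment heuristic, and the function names are assumptions made for this example; they are not taken from Ohcount, OSLC, or ASLA, and real tools use far larger pattern databases.

```python
import re
from typing import List

# Hypothetical pattern table built from characteristic license phrases.
LICENSE_PATTERNS = {
    "GPL": re.compile(r"GNU\s+General\s+Public\s+License", re.IGNORECASE),
    "Apache-2.0": re.compile(
        r"Licensed\s+under\s+the\s+Apache\s+License,\s+Version\s+2\.0",
        re.IGNORECASE,
    ),
    "MIT": re.compile(
        r"Permission\s+is\s+hereby\s+granted,\s+free\s+of\s+charge",
        re.IGNORECASE,
    ),
}

# Simplistic heuristic: a line counts as a comment if it starts with
# '#', '//', '/*', '*' or '*/' (after optional indentation).
COMMENT_LINE = re.compile(r"^\s*(?:#|//|/\*+|\*+/?)\s?(.*)$")


def extract_comment_text(source: str) -> str:
    """Join the text of all comment lines into one string."""
    parts = []
    for line in source.splitlines():
        match = COMMENT_LINE.match(line)
        if match:
            parts.append(match.group(1))
    return " ".join(parts)


def identify_licenses(source: str) -> List[str]:
    """Return the names of all patterns found in the commented parts."""
    text = extract_comment_text(source)
    return sorted(name for name, pat in LICENSE_PATTERNS.items() if pat.search(text))


sample = (
    "/*\n"
    " * Licensed under the Apache License,\n"
    ' * Version 2.0 (the "License");\n'
    " */\n"
    "int main() { return 0; }\n"
)
print(identify_licenses(sample))
```

Joining the comment lines before matching lets a pattern span a line break in the header, which is why the patterns use `\s+` between words.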

A similar approach is followed in LIDESC (Librock License Awareness System²), an awareness tool for preventing unintended violations of the license terms of software that the developer writes or acquires. In LIDESC, a license is modeled as a set of “license stamps”, with each stamp corresponding to an entry encountered in the license text. A stamp can be seen as a pair of a label and a string. The label defines a specific right or obligation in the license, and the same label may be used in multiple licenses, whereas the string contains the license text that defines the corresponding right or obligation. By using text comparison, a set of license stamps is generated for each file. Each license is defined in two files: a .txt file that contains the license text and a .lh file that uses the C language to define strings of symbolic terms that describe the license.

¹ http://www.ohloh.net/p/ohcount

² http://www.mibsoftware.com/librock/lidesc/
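The label/string pairing behind license stamps can be sketched as follows. This is only an illustration of the idea: the class `LicenseStamp`, the label names, and the matching by normalized substring are assumptions of this example and do not reproduce LIDESC's actual .lh format or its comparison procedure.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class LicenseStamp:
    label: str  # symbolic name of a right or obligation (hypothetical)
    text: str   # license wording that defines that right or obligation


def normalize(text: str) -> str:
    """Lower-case the text and collapse all whitespace runs."""
    return " ".join(text.lower().split())


def stamps_for(file_text: str, known_stamps: List[LicenseStamp]) -> Set[str]:
    """Return the labels whose defining text occurs in the file."""
    haystack = normalize(file_text)
    return {s.label for s in known_stamps if normalize(s.text) in haystack}


# Illustrative stamps; the same label could be reused by several licenses.
KNOWN_STAMPS = [
    LicenseStamp(
        "MAY_USE_AND_REDISTRIBUTE",
        "Redistribution and use in source and binary forms, "
        "with or without modification, are permitted",
    ),
    LicenseStamp(
        "MUST_RETAIN_NOTICE",
        "Redistributions of source code must retain the above copyright notice",
    ),
    LicenseStamp("NO_WARRANTY", 'the software is provided "as is"'),
]

header = (
    "Redistribution and use in source and binary forms, with or without\n"
    "modification, are permitted provided that the following conditions are met:\n"
    "1. Redistributions of source code must retain the above copyright notice."
)
print(sorted(stamps_for(header, KNOWN_STAMPS)))
```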

An algorithm for automatic license identification from source code has been implemented in the Ninka tool [4]. The algorithm extracts the license statement from the file, breaks it into textual sentences, and tries to find a match for each sentence. The list of matched sentences is then analyzed to determine whether it contains one or more licenses. This method requires a knowledge base with four sets of information: filtering keywords, sets of equivalence phrases, known sentence-token expressions, and license rules.
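The sentence-matching pipeline can be approximated by the sketch below, which mirrors the four knowledge-base sets with tiny hypothetical tables. None of the entries, names, or regular expressions come from Ninka itself, and the rule check used here (all required tokens present) is a simplification chosen for the example.

```python
import re
from typing import List

# Hypothetical, minimal knowledge base with the four kinds of information.
FILTER_KEYWORDS = ("license", "copyright", "warranty", "permission")
EQUIVALENCE_PHRASES = {"licence": "license"}
SENTENCE_TOKENS = [
    (
        "GPL_V2_OR_LATER",
        re.compile(
            r"gnu general public license.*either version 2 of the license, "
            r"or \(at your option\) any later version"
        ),
    ),
    ("NO_WARRANTY", re.compile(r"without any warranty")),
]
LICENSE_RULES = [
    ("GPL-2.0-or-later", ("GPL_V2_OR_LATER", "NO_WARRANTY")),
]


def normalize(text: str) -> str:
    """Lower-case, collapse whitespace, and apply equivalence phrases."""
    text = " ".join(text.lower().split())
    for variant, canonical in EQUIVALENCE_PHRASES.items():
        text = text.replace(variant, canonical)
    return text


def match_sentences(text: str) -> List[str]:
    """Split into sentences, filter by keyword, and map each to a token."""
    tokens = []
    for sentence in re.split(r"(?<=\.)\s+", normalize(text)):
        if not any(keyword in sentence for keyword in FILTER_KEYWORDS):
            continue  # sentence is unlikely to belong to a license statement
        for name, pattern in SENTENCE_TOKENS:
            if pattern.search(sentence):
                tokens.append(name)
                break
        else:
            tokens.append("UNKNOWN")
    return tokens


def identify(text: str) -> List[str]:
    """Report every license whose required tokens were all matched."""
    tokens = match_sentences(text)
    return [lic for lic, required in LICENSE_RULES if all(t in tokens for t in required)]


header = (
    "This program is free software; you can redistribute it and/or modify it "
    "under the terms of the GNU General Public Licence as published by the "
    "Free Software Foundation; either version 2 of the License, or (at your "
    "option) any later version. This program is distributed in the hope that "
    "it will be useful, but WITHOUT ANY WARRANTY."
)
print(match_sentences(header), identify(header))
```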

When focusing on binaries, the Binary Analysis Tool³ (BAT) looks inside binary code to find license compliance issues. BAT can detect bootloaders (such as RedBoot, loadlin and U-Boot), open various compressed files and archives (such as ZIP, RAR, tar, cpio and LZMA), search for Linux kernel and BusyBox issues, and identify dynamically linked libraries; it reports the outcome in XML format. BAT uses symbol and string table comparisons to read binary code in firmware formats and compare it with source code, without undertaking reverse engineering. BAT can also compare the compiled version of the software under review with the corresponding source code, yielding more accurate results.

A different technique is based on the automatic tracking of the licenses of third-party artifacts [5]. Popular open-source build frameworks, such as Maven, Ivy and Gradle, describe artifacts and dependencies in terms of reusable declarative modules. Project Object Model (POM) files in Maven, as well as the descriptor files in Ivy and Gradle, were designed to contain license information as part of the module metadata, raising the hope that the extraction of license information from module metadata can become fully automated. As we will describe, this is usually achieved by using an artifact repository.
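For the Maven case, reading license metadata from a POM can be sketched with the Python standard library, assuming the POM declares the usual `http://maven.apache.org/POM/4.0.0` namespace and lists its licenses directly. The function name `licenses_from_pom` and the sample project coordinates are hypothetical; licenses inherited from a parent POM are not resolved here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace used by Maven POM files (model version 4.0.0).
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def licenses_from_pom(pom_xml: str) -> List[Tuple[str, str]]:
    """Return (name, url) for each <license> declared in the POM text."""
    root = ET.fromstring(pom_xml)
    found = []
    for lic in root.findall("m:licenses/m:license", POM_NS):
        name = lic.findtext("m:name", default="", namespaces=POM_NS).strip()
        url = lic.findtext("m:url", default="", namespaces=POM_NS).strip()
        found.append((name, url))
    return found


# Hypothetical module descriptor used only for this example.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo</artifactId>
  <version>1.0</version>
  <licenses>
    <license>
      <name>Apache License, Version 2.0</name>
      <url>https://www.apache.org/licenses/LICENSE-2.0.txt</url>
    </license>
  </licenses>
</project>"""

print(licenses_from_pom(SAMPLE_POM))
```

A module that declares no `licenses` element yields an empty list, which is the case where metadata-based extraction would have to fall back on one of the text-based techniques described earlier.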

IV. LICENSES AND REPOSITORIES

Among open source code repositories, many locations can be enumerated, starting with SourceForge. Further repositories that aggregate project content from other locations can be found in commercial (e.g., Black Duck Knowledge Base⁴) or non-commercial solutions (e.g., Swik⁵, Ohloh). Another approach can be found in FLOSSmole [6], a central repository containing data and analyses about FLOSS projects, collected and prepared in a decentralized manner.

³ http://www.binaryanalysis.org

⁴ http://www.blackducksoftware.com/knowledgebase

⁵ http://swik.net/
