10.11.2016 Views

Learning Data Mining with Python

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Authorship Attribution<br />

Attributing documents to authors<br />

Authorship analysis has a background in stylometry, which is the study of an<br />

author's style of writing. The concept is based on the idea that everyone learns<br />

language slightly differently, and measuring the nuances in people's writing will<br />

enable us to tell them apart using only the content of their writing.<br />

The problem has been historically performed using manual analysis and statistics,<br />

which is a good indication that it could be automated <strong>with</strong> data mining. Modern<br />

authorship analysis studies are almost entirely data mining-based, although quite a<br />

significant amount of work is still done <strong>with</strong> more manually driven analysis using<br />

linguistic styles.<br />

Authorship analysis has many subproblems, and the main ones are as follows:<br />

• Authorship profiling: This determines the age, gender, or other traits<br />

of the author based on the writing. For example, we can detect the first<br />

language of a person speaking English by looking for specific ways in<br />

which they speak the language.<br />

• Authorship verification: This checks whether the author of this document<br />

also wrote the other document. This problem is what you would normally<br />

think about in a legal court setting. For instance, the suspect's writing style<br />

(content-wise) would be analyzed to see if it matched the ransom note.<br />

• Authorship clustering: This is an extension of authorship verification,<br />

where we use cluster analysis to group documents from a big set into<br />

clusters, and each cluster is written by the same author.<br />

However, the most common form of authorship analysis study is that of authorship<br />

attribution, a classification task where we attempt to predict which of a set of authors<br />

wrote a given document.<br />

Applications and use cases<br />

Authorship analysis has a number of use cases. Many use cases are concerned <strong>with</strong><br />

problems such as verifying authorship, proving shared authorship/provenance, or<br />

linking social media profiles <strong>with</strong> real-world users.<br />

In a historical sense, we can use authorship analysis to verify whether certain<br />

documents were indeed written by their supposed authors. Controversial authorship<br />

claims include some of Shakespeare's plays, the Federalist papers from the USA's<br />

foundation period, and other historical texts.<br />

[ 186 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!