TFIDF on large datasets - Nguyen Dang Binh
Using term-frequency-inverse-document-frequency of email to detect change in social groups
Abstract—Interest in the classification of large text data sets continues to grow. The Enron email corpus remains a rich source of inquiry given its size and a distinctive characteristic of email: timestamps. Using this temporal data, we study a method for detecting temporal changes in the classifications. We compute term-frequency-inverse-document-frequency (TFIDF) vectors and apply a change detection algorithm (CUSUM) to them; our results suggest a methodology for predicting changes in email conversations in a social network.
Index Terms—Enron, Machine Learning, Statistical Process Control, TF-IDF, Text Analysis, Text Classification.
I. INTRODUCTION
Along with the rapidly expanding scale of available text data, interest is growing in classifying these data sets. The text available on the web alone is huge. With no search engine indexing more than about 16% of the estimated size of the publicly indexable web [1], at least 20 billion web objects have been indexed [2]. Approaches to classification have been developed specifically to be more effective in all-text environments [3].
Since becoming available in 2002, the large Enron email corpus has been the subject of repeated study by researchers. It possesses qualities that warrant continued inquiry. Many papers explicitly referencing the Enron dataset have already been published [4,5]. This dataset will likely continue to be of interest for some time. The Enron email corpus represents a body of correspondence within a defined social group at a scale unusual in the genre. We are fortunate to have not only this substantial collection of emails but also time stamps on them.
This work was supported by the Center for Computational Analysis of Social and Organizational Systems, School of Computer Science, Carnegie Mellon University, http://www.casos.cs.cmu.edu.
I. McCulloh is with the Network Science Center, U.S. Military Academy, West Point, NY 10996 (phone: 845-702-9115, fax: 845-938-2409, email: imccullo@cs.cmu.edu).
E. Daimler is with Carnegie Mellon University, Building 23, Moffett Field, California 94035 (phone: 408-241-0055, email: edaimler@cs.cmu.edu).
K.M. Carley is with the Center for Computational Analysis of Social and Organizational Systems, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA (phone: 412-268-8163, email: kathleen.carley@cs.cmu.edu).
Ian McCulloh, Eric Daimler, Kathleen M. Carley
Timestamps on correspondence provide temporal richness to text classification. Tracking changes in the text classifications over time may give visibility to changes in the social structure defined by the messages. We may generalize this challenge: given a corpus of weekly email texts for a social group, develop a process to discover characteristics of temporal changes in that social group.
Our subject is not sentiment per se, but rather changes in sentiment. Rather than looking at what the sentiment tells us directly, we are looking at what the changes in sentiment tell us. We are concerned with the degree to which a change in sentiment can help to predict real world events. The ability to detect changes in social groups is important in a variety of applications.
Machine learning techniques may be useful for quantifying communication in these social communities. For all identifiable social communities, there exists an opportunity to gain valuable insight into social dynamics. Analysis of these communities can be more robust with more objective, quantifiable metrics. A variety of change detection algorithms can then be used to detect changes in these communication metrics. A quantifiable method of detecting changes in social communities has far-reaching applications ranging from defense, to economics, to public policy.
A. Public Policy
Public policy and policy makers' representation of events must reflect constituent beliefs. Placing objective, quantitative measures on these beliefs can make for more responsive, if not better, governance.
B. Commercialization
Customer adoption can be powerfully affected by word-of-mouth. The targeting of customer referrals could be made more effective. Customer complaints could be addressed before they become a material issue.
C. Financial Risks
Issues as diverse as securities or currency speculation, unstable government policy, tax-avoidance schemes, accounting policy changes, or changes in credit standards are reflected in natural language in addition to numbers. Detecting changes in natural language may be an important adjunct to detecting changes in the numerical data. Increasingly, organizations examine email to protect themselves against corporate malfeasance [6].
D. Security & law enforcement
To the degree that unlawful activities are reflected in sentiment, automated processes for detecting changes can handle a far larger volume of data than manual processes.
The sheer scale of the text to be classified in the Enron email dataset makes it of substantial interest to researchers. While there has been study of corpora of substantial size, few, if any, studies have been possible at this scale on email. Analysis can be done on web log communication with a virtually unlimited data set. However, email can represent a more active dialogue, with communications occurring more rapidly and the subject changing quickly and in discontinuous spurts, and the time stamp adds to the data's richness. The time stamps on the text data set allow the application of approaches to change detection such as cumulative sum (CUSUM) statistical process control charts to investigate the degree to which changes in salient words may be detected.
II. BACKGROUND
The classification of documents has been studied for those in analog form, those in electronic form, emails, and even the Enron Corpus in particular [5,6,7]. These studies have ranged from inquiry into the effectiveness of the classification documents themselves [4] to exploratory data analysis [8]. As new classification algorithms have been developed, they have been tested against existing corpora [9]. As new corpora have been created, they have been tested with existing classification techniques [4].
The approach of classifying text in this way has been studied [10]. Change detection has also been studied [11,12]. Comparing changes in social networks in this way has been studied as well. We look to expand this literature by exploring the degree to which we might predict changes in the social network with a particular approach to combining these methods.
Efforts have been made to classify online communication and email communication [8,9,13]. Classification algorithms have been applied to these Enron email communications as a test of the algorithms' effectiveness [4].
The many expressions of exploratory data analysis can be used to look at history to help predict the future. Along with classification methodologies, approaches to clustering have been brought to bear in many studies [7]. Other studies have used clustering to look at trends in data and detect changes in dialogue [3,8]. Clusters can be formed along many lines: topics, dates, sentiment and others.
While large data sets are of growing interest, we might find these in blogs in addition to email corpora. The effectiveness of text classification algorithms has been explored in the ample data sets available in the texts of blogs [9]. Email has been distinguished as having its own challenges for data mining [5]. Much work has gone into cleaning up the data for use by researchers. Potentially confounding characteristics include the use of multiple email aliases by employees. As has been studied on multiple occasions [5,6], email also represents a point between formal written communication and less formal spoken communication. Interpreting half-sentences, abbreviations, or (intentional or unintentional) misspellings has been sufficient to occupy its own studies.
Yet to be sufficiently explored is a mapping of these classifications to the events surrounding the subjects of the emails themselves. While electronic communications of all sorts are interesting in and of themselves and the classifications are worthy of study, we extend their study to explore how classification algorithms might be used to predict events. We look to see if we can use changes in the nature of the email communication to predict the future.
Despite its simplicity, results of experiments on Web pages and TV closed captions demonstrate high classification accuracy for TFIDF [10]. TFIDF has also been explored in other work [14,15,16]. This study suggests a method to identify organizational change in a corpus of over 50K emails by applying TFIDF and then a statistical process control chart from quality engineering.
The corpus of Enron email used in this study comprises 50,000 email text documents. This data set spans a sufficient time period to be meaningful (created over a period of four years, 1998-2002) and contains at least one known major organizational change point (in this case, turnover of the CEO & Chairman). The data set forms a closed network, where members of the social network send all texts in this dataset to other members of the social network.
We demonstrate a method of investigating potential changes through TFIDF and the CUSUM control chart. Key terms identified in the TFIDF vectors over the weeks identified as potential change points correlate well with major historical events involving Enron. These key terms also lead investigators to insightful emails in reference to potential causes of change.
III. METHOD
This research explores characteristics of temporal change in text classification of large data sets. We study the following characteristics: the point when a change has been detected; the magnitude of the change; and the most likely estimate of when the change originally occurred. We then map these changes to some of the organization's historical events.
Below is an outline of the approach pursued in this study:
A. Conducted TF-IDF on weekly email documents
We applied the TF-IDF algorithm to weekly email documents of Enron. TF-IDF is used here as a measure of sentiment in documents. Character strings are given a high score when they occur frequently in a document yet infrequently across multiple documents. The formula for TF-IDF is given by
$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \, \ln \frac{|D|}{|\{d_j : t_i \in d_j\}|},$$
where $n_{i,j}$ is the number of occurrences of character string $t_i$ in document $d_j$; $\sum_{k} n_{k,j}$ is the total number of terms in document $d_j$; $|D|$ is the total number of documents in the corpus; and $|\{d_j : t_i \in d_j\}|$ is the number of documents that contain $t_i$. For a comprehensive explanation of TF-IDF refer to Sparck Jones [19]. Because a change must be detected from sequentially observed weekly email, computing TF-IDF over all possible documents is infeasible. We therefore calculate TF-IDF over one week of documents (emails) at a time. The TF-IDF vectors for each document are then averaged to create a TF-IDF vector that represents the sentiment of that particular week.
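The weekly TF-IDF step can be sketched as follows. This is a minimal illustration, assuming each email has already been tokenized into a list of character strings; the function names `tfidf_vectors` and `weekly_vector` and the sample tokens are hypothetical and not taken from the study.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return vectors

def weekly_vector(docs):
    """Average the per-document tf-idf vectors of one week's emails."""
    vectors = tfidf_vectors(docs)
    summed = Counter()
    for vec in vectors:
        for term, score in vec.items():
            summed[term] += score
    return {term: score / len(vectors) for term, score in summed.items()}

# Hypothetical week containing two tiny "emails".
week = [
    "raptor swap agreement".split(),
    "raptor meeting agenda".split(),
]
print(weekly_vector(week))
```

Note that a term appearing in every email of the week (here "raptor") scores zero, because the logarithm of 1 is 0; this is a property of the formula as stated when it is applied to a single week's documents.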
B. Performed cosine similarity between vectors of each week's concepts
While comparing the difference between corresponding components would be difficult for generating test statistics, comparing the angle between each vector and an average vector detects differences and allows the use of control charts. The cosine similarity between weekly vectors of influential terms is calculated to quantify different potential measures of weekly change.
A reference vector is determined by averaging available weekly TF-IDF vectors across rows. This average vector has no significance other than serving as a reference point for calculating differences between weekly vectors. If this were not considered, small trend changes in weekly sentiment would go undetected. The angle between two weekly TF-IDF vectors, a and b, is given by
$$\theta_{a,b} = \arccos\left(\frac{a \cdot b}{\|a\|\,\|b\|}\right).$$
These angles between vectors represent change in weekly email sentiment.
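A short sketch of this comparison, assuming weekly vectors are stored as sparse dictionaries mapping terms to TF-IDF weights. The names `angle_between` and `reference_vector` and the sample weights are hypothetical.

```python
import math

def angle_between(a, b):
    """Angle in radians between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

def reference_vector(weekly_vectors):
    """Average the weekly vectors term by term (missing terms count as 0)."""
    terms = set().union(*weekly_vectors)
    n = len(weekly_vectors)
    return {t: sum(v.get(t, 0.0) for v in weekly_vectors) / n for t in terms}

# Hypothetical weekly vectors: two similar weeks and one that differs.
weeks = [
    {"swap": 0.2, "raptor": 0.1},
    {"swap": 0.2, "raptor": 0.1},
    {"meters": 0.3},
]
ref = reference_vector(weeks)
angles = [angle_between(w, ref) for w in weeks]
print(angles)
```

In this toy data the third week sits at a larger angle from the reference vector than the first two, which is the kind of deviation a control chart can then monitor.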
C. Applied CUSUM control chart statistic
Statistical process control charts [21,22,23] are used to detect changes in temporal data. We use the CUSUM control chart statistic to identify potential changes in the semantic content of Enron communication. The CUSUM signals change and estimates the time of change.
Control charts help to distinguish process abnormality: measurements from the process are used to compute a test statistic. When the test statistic exceeds the limits of the control chart, the process is deemed abnormal. This indicates that a change in the process may have occurred. The process (in this case group sentiment) can then be investigated to identify the potential cause of the change. The CUSUM statistic is given by
$$C^{+}_{x} = \max\left\{0,\; \frac{\theta_{x,\bar{x}} - \bar{\theta}}{s_{\theta}} - \frac{\delta}{2} + C^{+}_{x-1}\right\},$$
where $\theta_{x,\bar{x}}$ is the value for week $x$ measured against the reference vector; $\bar{\theta}$ is the average cosine similarity between a weekly vector and the reference vector when the sentiment is not changing; $s_{\theta}$ is the corresponding standard deviation; and $\delta$ is the magnitude of change that the CUSUM is optimized to detect. For this study, $\delta = 1$ for all calculations. A control limit of 2.03 was used, which corresponds to a type I error of 0.05. This allowed the weekly data to be broken into time periods of similar sentiment.
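The recursion can be sketched in a few lines. This is an illustration under stated assumptions: the in-control mean and standard deviation are supplied by the caller, and the chart is restarted at zero after each signal, which the text does not specify. The function name `cusum_upper` and the sample values are hypothetical.

```python
def cusum_upper(values, mean, std, delta=1.0, limit=2.03):
    """One-sided upper CUSUM on standardized values.

    Returns the statistic at each step and the indices where it
    exceeds the control limit.
    """
    k = delta / 2.0  # reference value: half the shift to be detected
    stat, stats, signals = 0.0, [], []
    for i, value in enumerate(values):
        z = (value - mean) / std
        stat = max(0.0, z - k + stat)
        stats.append(stat)
        if stat > limit:
            signals.append(i)
            stat = 0.0  # assumption: restart the chart after a signal
    return stats, signals

# Hypothetical standardized weekly measurements (mean 0, std 1).
stats, signals = cusum_upper([0.0, 0.5, -0.5, 2.0, 2.5, 0.0], mean=0.0, std=1.0)
print(stats)    # [0.0, 0.0, 0.0, 1.5, 3.5, 0.0]
print(signals)  # [4]
```

With delta = 1 the reference value is 0.5, so small positive deviations are absorbed and only a sustained rise accumulates enough to cross the limit of 2.03.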
D. Identified most salient character strings for each time period
We then identified the most salient character strings for each time period by ranking the terms according to their transformed TF-IDF scores for each time period. In addition, the biggest changes in salient terms between time periods were calculated by taking the absolute value of the difference between sequential time-period vectors.
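A minimal sketch of this ranking step, assuming each time period is summarized by a dictionary of term scores. The transformation applied to the TF-IDF scores is not described in the text, so the scores here are used as given; the names `top_terms` and `biggest_changes` and the sample values are hypothetical.

```python
def top_terms(vector, n=5):
    """Terms with the highest scores, best first (ties broken alphabetically)."""
    return sorted(vector, key=lambda t: (-vector[t], t))[:n]

def biggest_changes(prev, curr, n=5):
    """Terms whose score changed most between two consecutive periods."""
    diff = {t: abs(curr.get(t, 0.0) - prev.get(t, 0.0))
            for t in set(prev) | set(curr)}
    return top_terms(diff, n)

# Hypothetical period vectors.
period_1 = {"swap": 0.10, "meters": 0.30, "agreements": 0.05}
period_2 = {"swap": 0.40, "meters": 0.25, "terminate": 0.20}
print(top_terms(period_2, 2))                  # ['swap', 'meters']
print(biggest_changes(period_1, period_2, 2))  # ['swap', 'terminate']
```

A term missing from one period is treated as having a score of zero there, so newly appearing terms such as "terminate" register as changes.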
E. Matched character strings against Enron history
We searched through four weeks of email data surrounding changes for reference to the salient terms. These emails were reviewed for any potential insight into Enron activities.
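The search step might look like the sketch below, assuming emails are available as (date, body) pairs and that matching is done on whitespace-separated, lower-cased tokens. The window size, the function name `emails_near_change`, and the sample messages and dates are hypothetical.

```python
from datetime import date, timedelta

def emails_near_change(emails, change_date, terms, weeks=2):
    """Emails dated within `weeks` of change_date that mention any term."""
    window = timedelta(weeks=weeks)
    wanted = {t.lower() for t in terms}
    hits = []
    for sent, body in emails:
        in_window = abs(sent - change_date) <= window
        if in_window and wanted & set(body.lower().split()):
            hits.append((sent, body))
    return hits

# Hypothetical messages around a hypothetical signaled week.
emails = [
    (date(2001, 3, 26), "unwind the raptor swaps"),
    (date(2001, 3, 28), "lunch on friday"),
    (date(2001, 6, 1), "raptor agreement"),
]
hits = emails_near_change(emails, date(2001, 3, 26), ["Raptor", "terminate"])
print(hits)
```

Splitting on whitespace means a term followed directly by punctuation would be missed; a real search would need a proper tokenizer.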
IV. RESULTS
There were 11 time periods identified in the Enron data. A plot of the CUSUM statistic over time is shown in Figure [1].
Figure [1]. Enron LSA Cosine Similarity, CUSUM (k=0.5, h=2.03); horizontal axis dated 8/28/1999 to 5/24/2002, vertical axis from 0.2 to 1.8.
Below are the salient character strings identified for four of the largest change points. Analyses of some other change points were omitted due to the lack of available historical data. The salient character strings were:
enron; swap/counterparty; agreements; terminate; meters; ectcc.
A search on these salient terms identified several interesting emails that suggest that a change in the organization may have actually occurred. In order to realize the significance of the email messages, it is helpful to review the four major periods in Enron's history when the CUSUM signaled potential change.
In November 2000, Kenneth Lay, Enron CEO, sells his shares and files a fraudulent quarterly 10-Q for the third consecutive quarter.
In early March 2001, Fortune Magazine publishes an article that questions Enron's stock price; legal questions are raised about LJM, a company used to hide Enron debt; and problems with the Raptor partnership surface. Enron repurchased Chewco's investment in JEDI in late March to cover the problem. Shortly after these events, Enron announces a large first quarter profit of $536 million.
In late July 2001, Enron's stock price closed below $47 per share. This was a critical point for the Raptor partnerships. Three weeks later, Jeffrey Skilling resigns as CEO.
In late September 2001, Skilling sells half a million shares of stock in Enron, and Director Robert Belfer sells 109,000 shares. CEO Kenneth Lay tells employees that Enron's accounting practices are “legal and totally appropriate.”
These four time periods in 2000 and 2001 all correspond to a shift in organization sentiment. There were 36 emails identified by the TFIDF salient terms within three weeks of the CUSUM signaled weeks. Of these 36, 14 suggested a possible cause of change. In late September 2000, three emails address “swaps” in “Raptor”. One of these specifically addresses Chewco and JEDI I, Enron's interest in these companies, and employees are told that this will be “helpful in [their] review of Raptor matters”. In late March 2001, five emails involve “swaps”, “agreement” and “terminate” in conjunction with “Raptor”. Two emails discuss handling of equity, warrants and debt. The other emails discuss that “Mary” is on vacation and they are not sure how to handle the “Raptor” swaps. On 3 April 2001, an identified email states “I wasn't trying to be critical of anyone specifically…this ‘loss’ of value, which does not show up in a P/L since it is hedged by Raptor…”, which suggests that there was some discussion of how swaps were handled following the 26 March move to cover problems with Raptor. An early August email suggests a plan to “write down some of our problem assets and unwind raptor” on the heels of Enron's stock price closing below $47, which was a critical point in the Raptor partnership. There are three other emails in August that discuss a “terminate” clause in a “Raptor” “agreement”. The final email of potential interest discusses how “Raptor” has “blown up.”
This analysis indicates that a likely cause of change in the Enron email sentiment is discussion of swaps in reference to “Raptor”. There is also some involvement with the companies Chewco and JEDI I.
We have shown a method to detect possible changes in a social group for investigation. We demonstrated a method of investigating potential changes through TFIDF and a CUSUM control chart scheme, where key terms could be identified. These key terms show the major activities related to that domain (political ties) as identified by someone other than the principal investigators.
V. CONCLUSION
The very large data set that we have investigated has been the subject of a great deal of interest but remains rich enough to justify continued study. Our collection of 50,000 emails actually represents a subset of a larger set where at least 250,000 emails are available. Should the larger set become usable, it will be of interest to at least these researchers. Unfortunately, this larger set suffers difficulties requiring further cleaning (e.g., unreconciled to/from fields).
There exist at least three straightforward extensions to the work presented in this paper:
1) Applying additional classification algorithms to the existing Enron corpus investigated here.
2) Applying the classification approaches to other data sets.
3) Applying the methodology for the purpose of predicting changes/events for the social network described by the email communications.
As new approaches to classification problems become available, they will also likely justify testing against the Enron corpus.
ACKNOWLEDGMENT
We are grateful to Carolyn Rose for her feedback. Jana Diesner and Terrill Franz with the Center for Computational Analysis of Social and Organizational Systems (http://www.casos.cs.cmu.ed) provided assistance in the preparation of the original dataset.
REFERENCES
[1] Lawrence, S., & Giles, L. (1999). Accessibility and Distribution of Information on the Web. Nature, 400, 107-109.
[2] UCBerkeleyLibrary. (2007). The BEST Search Engines. Retrieved November 2007, from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
[3] Mani, I., & Bloedorn, E. (1997). Summarizing Similarities and Differences Among Related Documents. Information Retrieval(1), 35-67.
[4] Stockinger, K., Rotem, D., Shoshani, A., & Wu, K. (2006). Analyzing Enron Data: Bitmap Indexing Outperforms MySQL Queries by Several Orders of Magnitude [Electronic Version], 4. Retrieved 2006 Jan 28.
[5] Carley, K. M., & Skillicorn, D. (2005). Special Issue on Analyzing Large Scale Networks: The Enron Corpus. Computational & Mathematical Organization Theory(11), 179-181.
[6] Keila, P. S., & Skillicorn, D. B. (2005). Structure in the Enron Email Dataset. Computational & Mathematical Organization Theory(11), 183-199.
[7] Priebe, C. E., Conroy, J. M., Marchette, D. J., & Park, Y. (2005). Scan Statistics on Enron Graphs. Computational & Mathematical Organization Theory(11), 229-247.
[8] Godbole, N., Srinivasaiah, M., & Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs (System Demonstration). Paper presented at the International Conference for Weblogs and Social Media (ICWSM 07), Boulder, CO.
[9] Li, Y. H., & Jain, A. K. (1998). Classification of Text Documents. The Computer Journal, 41(8), 537-546.
[10] Chuang, W. T., Tiyyagura, A., Yang, J., & Giuffrida, G. (2000). A fast algorithm for hierarchical text classification. In Y. Kambayashi, M. Mohania & A. M. Tjoa (Eds.), DaWaK 2000 (Vol. LNCS 1874, pp. 409-418): Springer-Verlag Berlin Heidelberg 2000.
[11] Lorden, G. (1971). Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics, 42(6), 1897-1908.
[12] Pignatiello, J. J., Jr., & Simpson, J. R. (2002). A Magnitude-Robust Control Chart For Monitoring And Estimating Step Changes For Normal Process Means. Quality And Reliability Engineering International(18), 429-441.
[13] Hynek, J., & Jezek, K. (2003, 25-28 June). Practical Approach to Automatic Text Summarization. Paper presented at the 7th ICCC/IFIP International Conference on Electronic Publishing, Universidade do Minho, Portugal.
[14] Aizawa, A. (2000). The feature quantity: an information theoretic perspective of Tfidf-like measures. Paper presented at the Annual ACM conference on research and development in information retrieval, Athens, Greece.
[15] Jing, L.-P., Huang, H.-K., & Shi, H.-B. (2003). Improved feature selection approach TFIDF in text mining. Paper presented at the International Conference on Machine Learning and Cybernetics, 2002. Proceedings. 2002.
[16] Hovy, E. (2006). Learning Ontological Knowledge from the Web.
[17] Moon, N., & Singh, R. (2005). Experiments in Text-Based Mining and Analysis of Biological Information from MEDLINE on Functionally-Related Genes. Paper presented at the 18th International Conference on Systems Engineering (ICSEng'05).
[18] Ishii, N., Murai, T., Yamada, T., & Bao, Y. (2006). Text Classification by Combining Grouping, LSA and kNN. Paper presented at the 5th IEEE/ACIS International Conference on Computer and Information Science and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse (ICIS-COMSAR'06).
[19] Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.
[20] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
[21] Montgomery, D. C. (1996). Introduction to Statistical Quality Control (3rd ed.): John Wiley & Sons.
[22] Page, E. S. (1954). Continuous Inspection Schemes. Biometrika(41), 100-115.
[23] Page, E. S. (1961). Cumulative Sum Control Charts. Technometrics, 3(1), 1-9.
[24] Mani, I., & Bloedorn, E. (1998). Machine Learning of Generic and User-Focused Summarization. Paper presented at the Fifteenth National Conference on AI (AAAI-98). Retrieved from http://arxiv.org/abs/cs/9811006v1