23.07.2013 Views

TFIDF on large datasets - Nguyen Dang Binh


Using term-frequency-inverse-document-frequency of email to detect change in social groups

Abstract: Interest in the classification of large text data sets continues to grow. The Enron email corpus remains a rich source of inquiry given its size and the unique characteristic of email: timestamps. Using this temporal data, we study a method for detecting temporal changes in the classifications. We compute term-frequency-inverse-document-frequency (TFIDF) vectors and apply a change detection algorithm (CUSUM) to them; our results suggest a methodology for predicting changes in email conversations in a social network.

Index Terms: Enron, Machine Learning, Statistical Process Control, TF-IDF, Text Analysis, Text Classification.

I. INTRODUCTION

Along with the rapidly expanding scale of available text data, interest in classifying these growing data sets is also increasing. The available text on the web alone is huge. Although no search engine indexes more than about 16% of the estimated size of the publicly indexable web [1], at least 20 billion web objects have been indexed [2]. Approaches to classification have been developed specifically to be more effective in all-text environments [3].

Since becoming available in 2002, the large Enron email corpus has been the subject of repeated study by researchers. It possesses qualities that warrant continued inquiry. Many papers explicitly referencing the Enron dataset have already been published [4,5]. This dataset will likely continue to be of interest for some time. The email corpus of Enron represents a body of correspondence within a defined social group at a scale unusual in the genre. We are fortunate to have not only this substantial collection of emails but also time stamps on them.

This work was supported by the Center for Computational Analysis of Social and Organizational Systems, School of Computer Science, Carnegie Mellon University, http://www.casos.cs.cmu.edu.

I. McCulloh is with the Network Science Center, U.S. Military Academy, West Point, NY 10996 (phone: 845-702-9115, fax: 845-938-2409, email: imccullo@cs.cmu.edu)

E. Daimler is with Carnegie Mellon University, Building 23, Moffett Field, California 94035 (phone: 408-241-0055 email: edaimler@cs.cmu.edu)

K.M. Carley is with the Center for Computational Analysis of Social and Organizational Systems, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA (phone: 412-268-8163 email: kathleen.carley@cs.cmu.edu)

Ian McCulloh, Eric Daimler, Kathleen M. Carley

Timestamps on correspondence provide temporal richness to text classification. Tracking changes in the text classifications over time may give visibility to changes in the social structure defined by the messages. We may generalize this challenge: given a corpus of weekly email texts for a social group, develop a process to discover characteristics of temporal changes in that social group.

This is not the study of sentiment per se, but rather of changes in sentiment. Rather than looking at what the sentiment tells us directly, we are looking at what the changes in sentiment tell us. We are concerned with the degree to which a change in sentiment can help to predict real world events. The ability to detect changes in social groups is important in a variety of applications.

Machine learning techniques may be useful for quantifying communication in these social communities. For all identifiable social communities, there exists an opportunity to gain valuable insight into social dynamics. Analysis of these communities can be more robust with more objective, quantifiable metrics. A variety of change detection algorithms can then be used to detect changes in these communication metrics. A quantifiable method of detecting changes in social communities has far-reaching applications ranging from defense, to economics, to public policy.

A. Public Policy

Public policy and policy makers' representation of events must reflect constituent beliefs. Placing objective, quantitative measures on these beliefs can make for more responsive, if not better, governance.

B. Commercialization

Customer adoption can be powerfully impacted by word-of-mouth. The targeting of customer referrals could be made more effective. Customer complaints could be addressed before they become a material issue.

C. Financial Risks

Issues as diverse as securities or currency speculation, unstable government policy, tax-avoidance schemes, accounting policy changes, or changes in credit standards are reflected in Natural Language in addition to numbers. Detecting changes in Natural Language may be an important adjunct to changes in the numerical data. Increasingly, organizations examine email to protect themselves against corporate malfeasance [6].

D. Security & law enforcement

To the degree that unlawful activities are reflected in sentiment, automated processes for detecting changes can handle a far larger volume of data than manual processes.

The sheer scale of the text to be classified makes the Enron email dataset of substantial interest to researchers. While corpora of substantial size have been studied, few, if any, studies have been possible at this scale on email. Analysis can be done on web log communication with a virtually unlimited data set. However, email can represent a more active dialogue: communications occur more rapidly, the subject changes quickly and in discontinuous spurts, and the time stamp adds to the data’s richness. The time stamps on the text data set allow the application of change detection approaches such as cumulative sum (CUSUM) statistical process control charts to investigate the degree to which changes in salient words may be detected.

II. BACKGROUND

The classification of documents has been studied for those in analog form, those in electronic form, emails, and even the Enron Corpus in particular [5,6,7]. These studies have ranged from inquiry into the effectiveness of the classification documents themselves [4] to exploratory data analysis [8]. As new classification algorithms have been developed, they have been tested against existing corpora [9]. As new corpora have been created, they have been tested with existing classification techniques [4].

The approach of classifying text in this way has been studied [10]. The exploration of change detection has been studied [11,12]. Comparing changes in social networks in this way has also been studied. We look to expand this literature by exploring the degree to which we might predict changes in the social network with a particular approach to combining these methods.

Efforts have been made to classify online communication and email communication [8,9,13]. Classification algorithms have been applied to these Enron email communications as a test of the algorithms' effectiveness [4].

The many expressions of exploratory data analysis can be used to look at history to help predict the future. Along with classification methodologies, approaches to clustering have been brought to bear in many studies [7]. Other studies have used clustering to look at trends in data and detect changes in dialogue [3,8]. Clusters can be formed along many lines: topics, dates, sentiment and others.

Large data sets are of growing interest, and we might find these in blogs in addition to email corpora. The effectiveness of text classification algorithms has been explored in the ample data sets available in the texts of blogs [9]. Email has been distinguished as having its own challenges for data mining [5]. Much work has gone into cleaning up the data for use by researchers. Some potentially confounding characteristics include the use of multiple email aliases by employees. As has been studied on multiple occasions [5,6], email also represents a point between formal written communication and less formal spoken communication. Interpreting half-sentences, abbreviations, or (intentional or unintentional) misspellings has been sufficient to occupy its own studies.

Yet to be sufficiently explored is a mapping of these classifications to the events surrounding the subjects of the emails themselves. While electronic communications of all sorts are interesting in and of themselves and the classifications are worthy of study, we extend their study to explore how classification algorithms might be used to predict events. We look to see if we can use changes in the nature of the email communication to predict the future.

Despite the simplicity of TFIDF, results of experiments on Web pages and TV closed captions demonstrate high classification accuracy [10]. While TFIDF has been explored [14,15,16], this study suggests a method to identify organizational change in a corpus of over 50K emails by applying TFIDF and then a statistical process control chart from quality engineering.

The corpus of Enron email used in this study comprises 50,000 email text documents. This data set spans a sufficient time period to be meaningful (created over a period of four years, 1998-2002) and contains at least one known major organizational change point (in this case, turnover of the CEO & Chairman). The data set forms a closed network: all texts in this dataset are sent by members of the social network to other members of the social network.

We demonstrate a method of investigating potential changes through TFIDF and the CUSUM control chart. Key terms identified in the TFIDF vectors over the weeks identified as potential change points correlate well with major historical events involving Enron. These key terms also lead investigators to insightful emails in reference to potential causes of change.

III. METHOD

This research explores characteristics of temporal change in text classification of large data sets. We study the following characteristics: the point when a change has been detected; the magnitude of the change; and the most likely estimate of when the change originally occurred. We then map these changes to some of the organization’s historical events.


Below is an outline of the approach pursued in this study:

A. Conducted TF-IDF on weekly email documents

We applied the TF-IDF algorithm to weekly email documents of Enron. The TF-IDF is used here as a measure of sentiment in documents. Character strings are given a high score when they occur frequently in a document yet infrequently across multiple documents. The formula for TF-IDF is given by,

$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \, \ln \frac{|D|}{\left|\{ d_j : t_i \in d_j \}\right|},$$

where $n_{i,j}$ is the number of occurrences of character string $t_i$ in document $d_j$; $\sum_k n_{k,j}$ is the total number of terms in document $d_j$; and $|D|$ is the total number of documents in the corpus. For a comprehensive explanation of TF-IDF refer to Sparck Jones [19]. Because a change must be determined from sequentially observed weekly email, conducting TF-IDF over all possible documents is infeasible. We therefore calculate TF-IDF over one week of documents (emails) at a time. The TF-IDF vectors for each document are then averaged to create a TF-IDF vector that represents the sentiment of that particular week.
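The weekly TF-IDF step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: whitespace tokenization, lowercasing, and the function names `tfidf_vectors` and `weekly_vector` are assumptions introduced here, and the sample emails are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: score} dict per document, scoring tf * ln(|D| / df)."""
    # Assumption: terms are lowercase whitespace-separated character strings.
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values())
        if total == 0:
            vectors.append({})  # an empty email contributes no terms
            continue
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def weekly_vector(documents):
    """Average the per-document TF-IDF vectors of one week into one vector."""
    vectors = tfidf_vectors(documents)
    averaged = Counter()
    for vec in vectors:
        for term, score in vec.items():
            averaged[term] += score / len(vectors)
    return dict(averaged)


# Hypothetical one-week corpus of three short emails.
week = ["raptor swap agreement", "swap meeting schedule", "lunch schedule"]
print(weekly_vector(week))
```

A term that appears in every email of the week scores zero, because ln(|D|/|D|) = 0, which is why the corpus used for the document frequency matters.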

B. Performed cosine similarity between vectors of each week’s concepts

While comparing the difference between corresponding components would be difficult for generating test statistics, comparing the angle between each vector and an average vector detects differences and allows use of control charts. The cosine similarity between weekly vectors of influential terms is calculated to quantify different potential measures of weekly change.

A reference vector is determined by averaging available weekly TF-IDF vectors across rows. This average vector has no significance other than serving as a reference point for calculating differences between weekly vectors. Without such a reference point, small trend changes in weekly sentiment would go undetected. The angle between two weekly TF-IDF vectors, a and b, is given by,

$$\theta_{a,b} = \arccos\left( \frac{a \cdot b}{\|a\| \, \|b\|} \right)$$

These angles between vectors represent change in weekly email sentiment.
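As a rough sketch of this step, the angle and the reference vector can be computed over sparse term-to-score dictionaries. The names `angle_between` and `reference_vector` and the dictionary representation are assumptions made for illustration; the paper does not specify a data structure.

```python
import math


def angle_between(a, b):
    """Angle in radians between two sparse {term: score} vectors."""
    dot = sum(score * b.get(term, 0.0) for term, score in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before arccos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def reference_vector(weekly_vectors):
    """Average each term's score across all available weekly vectors."""
    terms = set().union(*weekly_vectors)
    n = len(weekly_vectors)
    return {t: sum(v.get(t, 0.0) for v in weekly_vectors) / n for t in terms}


# Hypothetical weekly vectors; each week is compared with the reference.
weeks = [{"swap": 0.4, "raptor": 0.1}, {"swap": 0.1, "raptor": 0.5}]
ref = reference_vector(weeks)
print([angle_between(week, ref) for week in weeks])
```

Terms missing from a week are treated as zeros, so weeks with different vocabularies can still be compared against the same reference.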

C. Apply CUSUM control chart statistic

Statistical process control charts [21,22,23] are used to detect changes in temporal data. We use the CUSUM control chart statistic to identify potential changes in the semantic content of Enron communication. The CUSUM signals change and estimates the time of change.

Control charts help to distinguish process abnormality: measurements from the process are used to compute a test statistic. When the test statistic exceeds the limits of the control chart, the process is deemed abnormal. This indicates that a change in the process may have occurred. The process (in this case group sentiment) can then be investigated to identify the potential cause of the change. The CUSUM statistic is given by,

$$C_x^{+} = \max\left\{ 0, \; \frac{\theta_{x,\bar{x}} - \bar{\theta}}{s_\theta} - \frac{\delta}{2} + C_{x-1}^{+} \right\},$$

where $\bar{\theta}$ is the average cosine similarity between a weekly vector and the reference vector when the sentiment is not changing; $s_\theta$ is the corresponding standard deviation; and $\delta$ is the magnitude of change that the CUSUM is optimized to detect. For this study, $\delta = 1$ for all calculations. A control limit of 2.03 was used, which corresponds to a type I error of 0.05. This allowed the weekly data to be broken into time periods of similar sentiment.
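The recursion can be illustrated with a short Python sketch. The function name `cusum_upper` is hypothetical, and estimating the change point as the first week after the statistic was last zero is a common CUSUM convention assumed here, not a detail confirmed by the paper. The default `delta` and `limit` are the values stated above.

```python
def cusum_upper(values, mean, std, delta=1.0, limit=2.03):
    """One-sided upper CUSUM over a sequence of weekly measurements.

    Returns (statistics, signal_week, estimated_change_week); the last two
    are None when the control limit is never exceeded.
    """
    stats = []
    c = 0.0
    last_zero = -1  # index of the most recent week where the statistic was 0
    for week, value in enumerate(values):
        # Standardize the observation, subtract delta / 2, and accumulate.
        c = max(0.0, (value - mean) / std - delta / 2.0 + c)
        stats.append(c)
        if c == 0.0:
            last_zero = week
        elif c > limit:
            # Assumed convention: the change began right after the last zero.
            return stats, week, last_zero + 1
    return stats, None, None


# Hypothetical weekly angles with an in-control mean of 0 and deviation of 1.
stats, signal, change = cusum_upper([0.0, 0.0, 2.0, 2.0], mean=0.0, std=1.0)
print(stats, signal, change)  # → [0.0, 0.0, 1.5, 3.0] 3 2
```

To break a long series into periods, one option is to restart the statistic at zero after each signal and re-estimate the mean and deviation for the new period; that choice is left open here.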

D. Identified most salient character strings for each time period

We then identified the most salient character strings for each time period by ranking the terms according to their transformed TF-IDF scores for each time period. In addition, the biggest changes in salient terms between time periods were calculated by taking the absolute value of the difference between sequential time-period vectors.
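A minimal sketch of the ranking and differencing, assuming each time period is summarized by a term-to-score dictionary. The helper names `top_terms` and `biggest_changes` and the sample scores are illustrative only.

```python
def top_terms(vector, k=5):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def biggest_changes(previous, current, k=5):
    """Return the k terms whose scores changed most between two periods."""
    terms = set(previous) | set(current)
    # Absolute difference per term; a term absent from a period counts as 0.
    change = {t: abs(current.get(t, 0.0) - previous.get(t, 0.0)) for t in terms}
    return top_terms(change, k)


# Hypothetical period vectors.
before = {"raptor": 0.1, "lunch": 0.3}
after = {"raptor": 0.9, "lunch": 0.3, "swap": 0.2}
print(top_terms(after, 2))
print(biggest_changes(before, after, 2))
```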

E. Matched character strings against Enron history

We searched through four weeks of email data surrounding changes for reference to the salient terms. These emails were reviewed for any potential insight into Enron activities.
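The search step might look like the sketch below. It assumes emails are available as (date, text) pairs and reads "four weeks surrounding" a change as two weeks on either side; both are assumptions, as is the function name `emails_near_change`. The sample emails are invented.

```python
from datetime import date, timedelta


def emails_near_change(emails, change_date, terms, weeks=2):
    """Return (sent_date, text) pairs near change_date that mention any term."""
    window = timedelta(weeks=weeks)
    lowered = [t.lower() for t in terms]
    return [
        (sent, text)
        for sent, text in emails
        # Keep emails inside the window that contain at least one salient term.
        if abs(sent - change_date) <= window
        and any(t in text.lower() for t in lowered)
    ]


# Hypothetical emails around a signaled week.
emails = [
    (date(2001, 3, 20), "Raptor swap review"),
    (date(2001, 5, 1), "raptor follow-up"),
    (date(2001, 3, 28), "lunch on Friday"),
]
print(emails_near_change(emails, date(2001, 3, 26), ["raptor", "swap"]))
```

Substring matching is deliberately loose here so that "swap" also finds "swaps"; the matched emails still need to be read by an investigator.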

IV. RESULTS

There were 11 time periods identified in the Enron data. A plot of the CUSUM statistic over time is shown in Figure [1].



Figure [1]. Enron LSA Cosine Similarity: CUSUM statistic (k=0.5, h=2.03) plotted over time, from 8/28/1999 to 5/24/2002.

Below are the salient character strings identified for four of the largest change points. Analyses of some other change points were omitted due to the lack of available historical data. The salient character strings were:

enron; swap/counterparty; agreements; terminate; meters; ectcc.

A search on these salient terms identified several interesting emails suggesting that a change in the organization may actually have occurred. To appreciate the significance of the email messages, it is helpful to review the four major periods in Enron’s history when the CUSUM signaled potential change.

In November 2000, Kenneth Lay, Enron CEO, sells his shares and files a fraudulent quarterly 10-Q for the third consecutive quarter.

In early March 2001, Fortune Magazine publishes an article that questions Enron’s stock price; legal questions are raised about LJM, a company used to hide Enron debt; and problems with the Raptor partnership surface. Enron repurchases Chewco’s investment in JEDI in late March to cover the problem. Shortly after these events, Enron announces a large first quarter profit of $536 million.

In late July 2001, Enron’s stock price closes below $47 per share. This was a critical point for the Raptor partnerships. Three weeks later, Jeffrey Skilling resigns as CEO.

In late September 2001, Skilling sells half a million shares of stock in Enron, and Director Robert Belfer sells 109,000 shares. CEO Kenneth Lay tells employees that Enron’s accounting practices are “legal and totally appropriate.”

These four time periods in 2000 and 2001 all correspond to a shift in organization sentiment. There were 36 emails identified by the TFIDF salient terms within three weeks of the CUSUM signaled weeks. Of these 36, 14 suggested a possible cause of change. In late September 2000, three emails address “swaps” in “Raptor”. One of these specifically addresses Chewco and JEDI I and Enron’s interest in these companies, and employees are told that this will be “helpful in [their] review of Raptor matters”. In late March 2001, five emails involve “swaps”, “agreement” and “terminate” in conjunction with “Raptor”. Two emails discuss handling of equity, warrants and debt. The other emails discuss that “Mary” is on vacation and that they are not sure how to handle the “Raptor” swaps. On 3 April 2001, an identified email states “I wasn’t trying to be critical of anyone specifically…this ‘loss’ of value, which does not show up in a P/L since it is hedged by Raptor…”, which suggests that there was some discussion of how swaps were handled following the 26 March move to cover problems with Raptor. An early August email suggests a plan to “write down some of our problem assets and unwind raptor” on the heels of Enron’s stock price closing below $47, which was a critical point in the Raptor partnership. There are three other emails in August that discuss a “terminate” clause in a “Raptor” “agreement”. The final email of potential interest discusses how “Raptor” has “blown up.”

This analysis indicates that a likely cause of change in the Enron email sentiment is discussion of swaps in reference to “Raptor”. There is also some involvement with the companies Chewco and JEDI I.

We have shown a method to detect possible changes in a social group for investigation. We demonstrated a method of investigating potential changes through TFIDF and a CUSUM control chart scheme, where key terms could be identified. These key terms show the major activities related to that domain (political ties) as identified by someone other than the principal investigators.

V. CONCLUSION

The very large data set that we have investigated has been the subject of a great deal of interest but remains rich enough to justify continued study. Our collection of 50,000 emails actually represents a subset of a larger set in which at least 250,000 emails are available. Should the larger set become usable, it will be of interest to at least these researchers. Unfortunately, this larger set suffers from difficulties that require further cleaning (e.g., unreconciled to/from fields).

There exist at least three straightforward extensions to the work presented in this paper:

1) Applying additional classification algorithms to the existing Enron corpus investigated here.

2) Applying the classification approaches to other data sets.

3) Applying the methodology for the purpose of predicting changes/events for the social network described by the email communications.


As new approaches to classification problems become available, they will also likely justify testing against the Enron corpus.

ACKNOWLEDGMENT

We are grateful to Carolyn Rose for her feedback. Jana Diesner and Terrill Franz with the Center for Computational Analysis of Social and Organizational Systems (http://www.casos.cs.cmu.ed) provided assistance in the preparation of the original dataset.

REFERENCES

[1] Lawrence, S., & Giles, L. (1999). Accessibility and Distribution of Information on the Web. Nature, 400, 107-109.

[2] UCBerkeleyLibrary. (2007). The BEST Search Engines. Retrieved November 2007 from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

[3] Mani, I., & Bloedorn, E. (1997). Summarizing Similarities and Differences Among Related Documents. Information Retrieval(1), 35-67.

[4] Stockinger, K., Rotem, D., Shoshani, A., & Wu, K. (2006). Analyzing Enron Data: Bitmap Indexing Outperforms MySQL Queries by Several Orders of Magnitude [Electronic Version], 4. Retrieved 2006 Jan 28.

[5] Carley, K. M., & Skillicorn, D. (2005). Special Issue on Analyzing Large Scale Networks: The Enron Corpus. Computational & Mathematical Organization Theory(11), 179-181.

[6] Keila, P. S., & Skillicorn, D. B. (2005). Structure in the Enron Email Dataset. Computational & Mathematical Organization Theory(11), 183-199.

[7] Priebe, C. E., Conroy, J. M., Marchette, D. J., & Park, Y. (2005). Scan Statistics on Enron Graphs. Computational & Mathematical Organization Theory(11), 229-247.

[8] Godbole, N., Srinivasaiah, M., & Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs (System Demonstration). Paper presented at the International Conference for Weblogs and Social Media (ICWSM 07), Boulder, CO.

[9] Li, Y. H., & Jain, A. K. (1998). Classification of Text Documents. The Computer Journal, 41(8), 537-546.

[10] Chuang, W. T., Tiyyagura, A., Yang, J., & Giuffrida, G. (2000). A fast algorithm for hierarchical text classification. In Y. Kambayashi, M. Mohania & A. M. Tjoa (Eds.), DaWaK 2000 (Vol. LNCS 1874, pp. 409-418): Springer-Verlag Berlin Heidelberg 2000.

[11] Lorden, G. (1971). Procedures for reacting to a change distribution. The Annals of Mathematical Statistics, 42(6), 1897-1908.

[12] Pignatiello, J. J., Jr., & Simpson, J. R. (2002). A Magnitude-Robust Control Chart For Monitoring And Estimating Step Changes For Normal Process Means. Quality And Reliability Engineering International(18), 429-441.

[13] Hynek, J., & Jezek, K. (2003, 25-28 June). Practical Approach to Automatic Text Summarization. Paper presented at the 7th ICCC/IFIP International Conference on Electronic Publishing, Universidade do Minho, Portugal.

[14] Aizawa, A. (2000). The feature quantity: an information theoretic perspective of Tfidf-like measures. Paper presented at the Annual ACM conference on research and development in information retrieval, Athens, Greece.

[15] Jing, L.-P., Huang, H.-K., & Shi, H.-B. (2003). Improved feature selection approach TFIDF in text mining. Paper presented at the International Conference on Machine Learning and Cybernetics, 2002. Proceedings. 2002.

[16] Hovy, E. (2006). Learning Ontological Knowledge from the Web.

[17] Moon, N., & Singh, R. (2005). Experiments in Text-Based Mining and Analysis of Biological Information from MEDLINE on Functionally-Related Genes. Paper presented at the 18th International Conference on Systems Engineering (ICSEng'05).

[18] Ishii, N., Murai, T., Yamada, T., & Bao, Y. (2006). Text Classification by Combining Grouping, LSA and kNN. Paper presented at the 5th IEEE/ACIS International Conference on Computer and Information Science and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse (ICIS-COMSAR'06).

[19] Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.

[20] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1988). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.

[21] Montgomery, D. C. (1996). Introduction to Statistical Quality Control (3rd ed.): John Wiley & Sons.

[22] Page, E. S. (1954). Continuous Inspection Schemes. Biometrika(41), 100-115.

[23] Page, E. S. (1961). Cumulative Sum Control Charts. Technometrics, 3(1), 1-9.

[24] Mani, I., & Bloedorn, E. (1998). Machine Learning of Generic and User-Focused Summarization. Paper presented at the Fifteenth National Conference on AI (AAAI-98). From http://arxiv.org/abs/cs/9811006v1
