TFIDF on large datasets - Nguyen Dang Binh

Using term-frequency-inverse-document-frequency of email to detect change in social groups

Abstract—Interest in the classification of large text data sets 

continues to grow. The Enron email corpus remains a rich source 

of inquiry given its size and the unique characteristic of email: 

timestamps. Using this temporal data, we study a method for detecting temporal changes in the classifications. We compute term-frequency-inverse-document-frequency (TFIDF) scores and apply a change detection algorithm (CUSUM) to them. Our results suggest a methodology for predicting changes in email conversations in a social network.

Index Terms—Enron, Machine Learning, Statistical Process Control, TF-IDF, Text Analysis, Text Classification.

I. INTRODUCTION 

Along with the rapidly expanding scale of available text data, interest in classifying these growing data sets is also increasing. The available text on the web alone is huge. Although no search engine indexes more than about 16% of the estimated size of the publicly indexable web [1], at least 20 billion web objects have been indexed [2]. Approaches to classification have been developed specifically to be more effective in all-text environments [3].

Since becoming available in 2002, the large Enron email 

corpus has been the subject of repeated study by researchers. It 

possesses qualities that warrant continued inquiry. Many papers 

explicitly referencing the Enron dataset have already been 

published [4,5]. This dataset will likely continue to be of interest 

for some time. The email corpus of Enron represents a body of 

correspondence within a defined social group at a scale unusual 

in the genre. We are fortunate to not only have this substantial 

collection of emails but also time stamps on them. 

Ian McCulloh, Eric Daimler, Kathleen M. Carley

This work was supported by the Center for Computational Analysis of Social and Organizational Systems, School of Computer Science, Carnegie Mellon University, http://www.casos.cs.cmu.edu.

I. McCulloh is with the Network Science Center, U.S. Military Academy, West Point, NY 10996 (phone: 845-702-9115, fax: 845-938-2409, email: imccullo@cs.cmu.edu).

E. Daimler is with Carnegie Mellon University, Building 23, Moffett Field, California 94035 (phone: 408-241-0055, email: edaimler@cs.cmu.edu).

K.M. Carley is with the Center for Computational Analysis of Social and Organizational Systems, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA (phone: 412-268-8163, email: kathleen.carley@cs.cmu.edu).

Timestamps on correspondence provide temporal richness to 

text classification. Tracking changes in the text classifications 

over time may give visibility to changes in the social structure 

defined by the messages. We may generalize this challenge: 

given a corpus of weekly email texts for a social group, develop 

a process to discover characteristics of temporal changes in that 

social group. 

Our focus is not the study of sentiment per se, but rather changes in sentiment. Rather than looking at what the sentiment tells us directly, we look at what the changes in sentiment tell us. We are concerned with the degree to which a change in sentiment can help to predict real-world events. The ability to detect changes in social groups is important in a variety of applications.

Machine learning techniques may be useful for quantifying 

communication in these social communities. For all identifiable 

social communities, there exists an opportunity to gain valuable 

insight into social dynamics. Analysis of these communities can 

be more robust with more objective, quantifiable metrics. A 

variety of change detection algorithms can then be used to 

detect changes in these communication metrics. A quantifiable method of detecting changes in social communities has far-reaching applications, ranging from defense to economics to public policy.

A. Public Policy 

Public policy and policymakers’ representation of events must reflect constituent beliefs. Placing objective, quantitative measures on these beliefs can make for more responsive, if not better, governance.

B. Commercialization 

Customer adoption can be powerfully impacted by word-of-mouth. The targeting of customer referrals could be made more effective. Customer complaints could be addressed before they become a material issue.

C. Financial Risks 

Issues as diverse as securities or currency speculation, unstable government policy, tax-avoidance schemes, accounting policy changes, or changes in credit standards are reflected in natural language in addition to numbers. Detecting changes in natural language may be an important adjunct to detecting changes in the numerical data. Increasingly, organizations examine email to protect themselves against corporate malfeasance [6].

D. Security & law enforcement 

To the degree that unlawful activities are reflected in sentiment, automated processes for detecting changes can handle a far larger volume of data than manual processes.

The sheer scale of the text to be classified in the Enron email dataset makes it of substantial interest to researchers. While corpora of substantial size have been studied, few, if any, studies have been possible at this scale on email. Analysis can be done on web log communication with a virtually unlimited data set. However, email can represent a more active dialogue, with communications occurring more rapidly and the subject changing quickly and in discontinuous spurts; the time stamp adds to the data’s richness. The time stamps on the text data set allow the application of change detection approaches, such as cumulative sum (CUSUM) statistical process control charts, to investigate the degree to which changes in salient words may be detected.

II. BACKGROUND 

The classification of documents has been studied for those in 

analog form, those in electronic form, emails, and even the 

Enron Corpus in particular [5,6,7]. These studies have ranged 

from inquiry into the effectiveness of the classification 

documents themselves [4] to exploratory data analysis [8]. As 

new classification algorithms have been developed, they have 

been tested against existing corpora [9]. As new corpora have 

been created, they have been tested on existing classification 

techniques [4]. 

The approach of classifying text in this way has been studied [10]. Change detection has also been studied [11,12], as has the comparison of changes in social networks. We look to expand this literature by exploring the degree to which we might predict changes in the social network with a particular approach to combining these methods.

Efforts have been made to classify online communication and 

email communication [8,9,13]. Classification algorithms have 

been applied to these Enron email communications as a test of the algorithms’ effectiveness [4].

The many expressions of exploratory data analysis can be 

used to look at history to help predict the future. Along with 

classification methodologies, approaches to clustering have been 

brought to bear in many studies [7]. Other studies have used it 

for the purpose of looking at trends in data to detect changes in 

dialogue [3,8]. Clusters can be formed along many lines: topics, 

dates, sentiment and others. 

While large data sets are of growing interest, we might find 

these in blogs in addition to email corpora. The effectiveness of 

text classification algorithms has been explored in the ample 

data sets available in the texts of blogs [9]. Email has been 

distinguished as having its own challenges for data mining [5]. 

Much work has gone into cleaning up the data for use by researchers. Some potentially confounding characteristics include the use of multiple email aliases by employees. As has been studied on multiple occasions [5,6], email also represents a point between formal written communication and less formal spoken communication. Interpreting half-sentences, abbreviations, or (intentional or unintentional) misspellings has been sufficient to occupy its own studies.

Yet to be sufficiently explored is a mapping of these 

classifications to the events surrounding the subjects of the 

emails themselves. While electronic communications of all sorts 

are interesting in and of themselves and the classifications are 

worthy of study, we extend their study to explore how 

classification algorithms might be used to predict events. We 

look to see if we can use changes in the nature of the email 

communication to predict the future. 

Despite its simplicity, results of experiments on Web pages and TV closed captions demonstrate high classification accuracy for TFIDF [10]. While TFIDF has been explored [14,15,16], this study suggests a method to identify organizational change in a corpus of over 50,000 emails by applying TFIDF and then a statistical process control chart from quality engineering.

The corpus of Enron email used in this study comprises 

50,000 email text documents. This data set spans a sufficient 

time period to be meaningful (created over a period of four 

years, 1998-2002) and contains at least one known major 

organizational change point (in this case, turnover of the CEO 

& Chairman). The data set forms a closed network, where 

members of the social network send all texts in this dataset to 

other members of the social network. 

We demonstrate a method of investigating potential changes through TFIDF and the CUSUM control chart. Key terms identified in the TFIDF vectors over the weeks identified as potential change points correlate well with major historical events involving Enron. These key terms also lead investigators to insightful emails that refer to potential causes of change.

III. METHOD 

This research explores characteristics of temporal change in 

text classification of large data sets. We study the following 

characteristics: The point when a change has been detected; the 

magnitude of the change; and the most likely estimate of when 

the change originally occurred. We then map these changes to 

some of the organization’s historical events.

Below is an outline of the approach pursued in this study: 

A. Conducted TF-IDF on weekly email documents 

We applied the TF-IDF algorithm to weekly email documents of Enron. TF-IDF weights each character string by how characteristic it is of a document, and we use these weights as a measure of sentiment in documents. Character strings are given a high score when they occur frequently in a document, yet infrequently across multiple documents. The formula for TF-IDF is given by,

$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \, \ln \frac{|D|}{\left|\{d_j : t_i \in d_j\}\right|},$$

where $n_{i,j}$ is the number of occurrences of character string $t_i$ in document $d_j$; $\sum_k n_{k,j}$ is the total number of terms in document $d_j$; $|D|$ is the total number of documents in the corpus; and $|\{d_j : t_i \in d_j\}|$ is the number of documents that contain $t_i$. For a comprehensive explanation of TF-IDF, refer to Sparck Jones [19]. Because a change must be detected from sequentially observed weekly email, conducting TF-IDF over all possible documents is infeasible. We therefore calculate TF-IDF over one week of documents (emails) at a time. The TF-IDF vectors for each document are then averaged to create a TF-IDF vector that represents the sentiment of that particular week.
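The weekly calculation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each email has already been split into a list of character strings, and the function names `tfidf_vectors` and `weekly_vector` and the toy emails are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return vectors


def weekly_vector(docs):
    """Average the per-document TF-IDF vectors into one weekly vector."""
    vectors = tfidf_vectors(docs)
    summed = Counter()
    for vec in vectors:
        for term, score in vec.items():
            summed[term] += score
    return {term: score / len(vectors) for term, score in summed.items()}


# Toy data (assumed): three tokenized emails from a single week.
week = [
    ["raptor", "swap", "agreement"],
    ["raptor", "meeting"],
    ["lunch", "meeting"],
]
print(weekly_vector(week))
```

A term that appears in every email of the week receives a score of zero, because the logarithm of 1 is zero; this is the behavior the formula prescribes for strings that do not discriminate between documents.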

B. Performed cosine similarity between vectors of each week’s concepts 

While comparing the differences between corresponding components would make it difficult to generate test statistics, comparing the angle between each vector and an average vector detects differences and allows the use of control charts. The cosine similarity between weekly vectors of influential terms is calculated to quantify different potential measures of weekly change. A reference vector is determined by averaging available weekly TF-IDF vectors across rows. This average vector has no significance other than serving as a reference point for calculating differences between weekly vectors. Without such a reference, small trend changes in weekly sentiment would go undetected. The angle between two weekly TF-IDF vectors, a and b, is given by,

$$\theta_{a,b} = \arccos\left(\frac{a \cdot b}{\|a\|\,\|b\|}\right)$$

These angles between vectors represent change in weekly email sentiment.
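As an illustration of this step, the sketch below computes a reference vector and the angle of each week against it. It assumes weekly vectors are stored as sparse `{term: weight}` dictionaries; the names `angle_between` and `reference_vector` and the sample weights are hypothetical, not taken from the paper.

```python
import math


def angle_between(a, b):
    """Angle in radians between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before arccos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def reference_vector(weekly_vectors):
    """Average the weekly vectors term by term (missing terms count as 0)."""
    terms = set().union(*weekly_vectors)
    n = len(weekly_vectors)
    return {t: sum(v.get(t, 0.0) for v in weekly_vectors) / n for t in terms}


# Toy data (assumed): three weekly TF-IDF vectors.
weeks = [
    {"raptor": 0.2, "swap": 0.1},
    {"raptor": 0.1, "meters": 0.3},
    {"swap": 0.4},
]
ref = reference_vector(weeks)
angles = [angle_between(week, ref) for week in weeks]
print(angles)
```

Each weekly vector is thereby reduced to a single number, its angle to the reference, which is what makes a univariate control chart applicable.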

C. Applied CUSUM control chart statistic

Statistical process control charts [21,22,23] are used to 

detect changes in temporal data. We use the CUSUM 

control chart statistic to identify potential changes in the 

semantic content of Enron communication. The 

CUSUM signals change and estimates the time of 

change. 

Control charts help to distinguish process abnormality: measurements from the process are used to compute a test statistic. When the test statistic exceeds the limits of the control chart, the process is deemed abnormal. This indicates that a change in the process may have occurred. The process (in this case group sentiment) can then be investigated to identify the potential cause of the change. The CUSUM statistic is given by,

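$$C_x^{+} = \max\left\{0,\ \frac{\theta_{x,\bar{x}} - \bar{\theta}}{s_\theta} - \frac{\delta}{2} + C_{x-1}^{+}\right\},$$

where $\theta_{x,\bar{x}}$ is the angle between the TF-IDF vector for week $x$ and the reference vector $\bar{x}$; $\bar{\theta}$ is the average cosine similarity between a weekly vector and the reference vector when the sentiment is not changing, and $s_\theta$ is the corresponding standard deviation; and $\delta$ is the magnitude of change that the CUSUM is optimized to detect. For this study, $\delta = 1$ for all calculations. A control limit of 2.03 was used, which corresponds to a type I error of 0.05. This allowed the weekly data to be broken into time periods of similar sentiment.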
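The recursion above can be sketched as follows, using the stated settings of δ = 1 and a control limit of 2.03. This is an illustration under assumptions: the in-control mean and standard deviation are taken as known inputs, the sample angles are invented, and the function name `cusum_upper` is hypothetical. The text does not say how the chart was handled after a signal, so this sketch simply keeps accumulating.

```python
def cusum_upper(angles, mean, std, delta=1.0, limit=2.03):
    """One-sided upper CUSUM over a sequence of weekly angles.

    Returns the statistic for each week and the indices of the weeks
    where the statistic exceeds the control limit.
    """
    k = delta / 2.0  # reference value: half the shift to be detected
    c_prev = 0.0
    stats, signals = [], []
    for week, theta in enumerate(angles):
        z = (theta - mean) / std  # standardized weekly angle
        c = max(0.0, z - k + c_prev)
        stats.append(c)
        if c > limit:
            signals.append(week)
        c_prev = c
    return stats, signals


# Toy data (assumed): angles drift upward from week 4 onward.
angles = [1.0, 1.1, 0.9, 1.0, 1.4, 1.5, 1.6]
stats, signals = cusum_upper(angles, mean=1.0, std=0.1)
print(stats)
print(signals)
```

With δ = 1 the reference value is 0.5, matching the k = 0.5 shown with Figure [1]. A common estimate of when the change began is the last week before the signal at which the statistic was zero.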

D. Identified most salient character strings for each time period 

We then identified the most salient character strings for each time period by ranking the terms according to their transformed TF-IDF scores for each time period. In addition, the biggest changes in salient terms between time periods were calculated by taking the absolute value of the difference between sequential time-period vectors.
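A rough sketch of this ranking step is given below. It assumes each time period is summarized by a `{term: score}` dictionary; the function names `top_terms` and `biggest_changes` and the scores are hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```python
def top_terms(vector, n=5):
    """Return the n highest-scoring terms of a {term: score} vector."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def biggest_changes(prev, curr, n=5):
    """Return the n terms whose score changed most between two periods."""
    terms = set(prev) | set(curr)
    # Absolute difference per term; a term absent from a period scores 0.
    diffs = {t: abs(curr.get(t, 0.0) - prev.get(t, 0.0)) for t in terms}
    ranked = sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy data (assumed): averaged TF-IDF scores for two adjacent periods.
period_1 = {"swap": 0.30, "meters": 0.10, "enron": 0.05}
period_2 = {"swap": 0.05, "terminate": 0.40, "enron": 0.06}
print(top_terms(period_2, n=2))
print(biggest_changes(period_1, period_2, n=2))
```

Taking the absolute value means that a term which disappears between periods ranks as highly as one that newly appears with the same weight.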

E. Matched character strings against Enron history 

We searched through four weeks of email data 

surrounding changes for reference to the salient terms. 

These emails were reviewed for any potential insight into Enron activities.
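The search step can be illustrated as below. This is a sketch under assumptions: emails are represented as `(date, text)` pairs, terms are matched as whole lowercase words, and the window half-width `days_each_side` is a hypothetical parameter, since the text does not say how the four weeks were placed around each change.

```python
import re
from datetime import date, timedelta


def emails_near_change(emails, change_date, terms, days_each_side=14):
    """Return emails near change_date that mention any salient term."""
    window = timedelta(days=days_each_side)
    wanted = {t.lower() for t in terms}
    hits = []
    for sent, text in emails:
        if abs(sent - change_date) <= window:
            # Whole-word match on lowercase alphanumeric tokens.
            words = set(re.findall(r"[a-z0-9]+", text.lower()))
            if wanted & words:
                hits.append((sent, text))
    return hits


# Toy data (assumed): three dated emails.
emails = [
    (date(2001, 3, 20), "Need to terminate the Raptor swaps"),
    (date(2001, 3, 22), "Lunch on Friday?"),
    (date(2001, 6, 1), "Raptor agreement update"),
]
hits = emails_near_change(emails, date(2001, 3, 26), ["raptor", "terminate"])
print(hits)
```

The filter only narrows the pool of candidates; the emails it returns still have to be read by a person to judge whether they point to a cause of change.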

IV. RESULTS 

There were 11 time-periods identified in the Enron data. A 

plot of the CUSUM statistic over time is shown in Figure [1]. 


Figure [1]. Enron LSA Cosine Similarity, CUSUM (k=0.5, h=2.03); horizontal axis: date, 8/28/1999 to 5/24/2002.

Below are the salient character strings identified for four of 

the largest change points. Analyses of some other change points 

were omitted due to the lack of available historical data. The 

salient character strings were: 

enron; swap/counterparty; agreements; terminate; 

meters; ectcc. 

A search on these salient terms identified several interesting 

emails that suggest that a change in the organization may have 

actually occurred. In order to realize the significance of the 

email messages, it is helpful to review the four major periods in 

Enron’s history, when the CUSUM signaled potential change. 

In November 2000, Kenneth Lay, Enron CEO, sells his 

shares and files fraudulent quarterly 10-Q for the third 

consecutive quarter. 

In early March 2001, Fortune Magazine publishes an 

article that questions Enron’s stock price; legal questions 

are raised about LJM, a company used to hide Enron 

debt; and problems with the Raptor partnership surface. 

Enron repurchased Chewco’s investment in JEDI in late 

March to cover the problem. Shortly after these events, 

Enron announces a large first quarter profit of $536 

million. 

In late July 2001, Enron’s stock price closed below $47 per share. This was a critical point for the Raptor partnerships. Three weeks later, Jeffrey Skilling resigns as CEO.

In late September 2001, Skilling sells half a million shares of stock in Enron, and Director Robert Belfer sells 109,000 shares. CEO Kenneth Lay tells employees that Enron’s accounting practices are “legal and totally appropriate.”

These four time periods in 2000 and 2001 all correspond to a shift in organization sentiment. There were 36 emails identified by the TFIDF salient terms within three weeks of the CUSUM-signaled weeks. Of these 36, 14 suggested a possible cause of change. In late September 2000, three emails address “swaps”

in “Raptor”. One of these specifically addresses Chewco and 

JEDI I, Enron’s interest in these companies, and employees are 

told that this will be “helpful in [their] review of Raptor 

matters”. In late March 2001, five emails involve “swaps”, 

“agreement” and “terminate” in conjunction with “Raptor”. 

Two emails discuss handling of equity, warrants and debt. The 

other emails discuss that “Mary” is on vacation and they are not 

sure how to handle the “Raptor” swaps. On 3 April 2001, an identified email stating “I wasn’t trying to be critical of anyone specifically…this ‘loss’ of value, which does not show up in a P/L since it is hedged by Raptor…” suggests that there was some discussion of how swaps were handled following the 26 March move to cover problems with Raptor. An early August email suggests a plan to “write down some of our problem assets and unwind raptor” on the heels of Enron’s stock price closing below $47, which was a critical point in the Raptor partnership. There are three other emails in August that discuss a “terminate” clause in a “Raptor” “agreement”. The final email of potential interest discusses how “Raptor” has “blown up.”

This analysis indicates that a likely cause of change in the Enron email sentiment is discussion of swaps in reference to “Raptor”. There is also some involvement with the companies Chewco and JEDI I.

We have shown a method to detect possible changes in a 

social group for investigation. We demonstrated a method of investigating potential changes through TFIDF and a CUSUM control chart scheme, where key terms could be identified.

These key terms show the major activities related to that domain 

(political ties) as identified by someone other than the principal 

investigators. 

V. CONCLUSION 

The very large data set that we have investigated has been the 

subject of a great deal of interest but remains rich enough to 

justify continued study. Our collection of 50,000 emails actually 

represents a subset of a larger set where at least 250,000 emails 

are available. Should the larger set become usable, it will be of 

interest to at least these researchers. Unfortunately, this larger set suffers difficulties requiring further cleaning (e.g., unreconciled to/from fields).

There exist at least three straightforward extensions to the 

work presented in this paper: 

1) Applying additional classification algorithms to the existing 

Enron corpus investigated here. 

2) Applying the classification approaches to other data sets. 

3) Applying the methodology for the purpose of predicting 

changes/events for the social network described by the email 

communications.

As new approaches to classification problems become 

available, they will also likely justify testing against the Enron 

corpus. 

ACKNOWLEDGMENT 

We are grateful to Carolyn Rose for her feedback. Jana 

Diesner and Terrill Franz with the Center for Computational 

Analysis of Social and Organizational Systems 

(http://www.casos.cs.cmu.edu) provided assistance in the preparation of the original dataset.

REFERENCES 

[1] Lawrence, S., & Giles, L. (1999). Accessibility 

and Distribution of Information on the Web. 

Nature, 400, 107-109. 

[2] UCBerkeleyLibrary. (2007). The BEST Search Engines. Retrieved November 2007, from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

[3] Mani, I., & Bloedorn, E. (1997). Summarizing 

Similarities and Differences Among Related 

Documents. Information Retrieval(1), 35-67. 

[4] Stockinger, K., Rotem, D., Shoshani, A., & Wu, 

K. (2006). Analyzing Enron Data: Bitmap 

Indexing Outperforms MySQL Queries by Several 

Orders of Magnitude [Electronic Version], 4. 

Retrieved 2006 Jan 28. 

[5] Carley, K. M., & Skillicorn, D. (2005). Special 

Issue on Analyzing Large Scale Networks: The 

Enron Corpus. Computational & Mathematical 

Organizational Theory(11), 179-181. 

[6] Keila, P. S., & Skillicorn, D. B. (2005). 

Structure in the Enron Email Dataset. 

Computational & Mathematical Organization 

Theory(11), 183–199. 

[7] Priebe, C. E., Conroy, J. M., Marchette, D. J., 

& Park, Y. (2005). Scan Statistics on Enron 

Graphs. Computational & Mathematical 

Organization Theory(11), 229-247. 

[8] Godbole, N., Srinivasaiah, M., & Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs (System Demonstration). Paper presented at the International Conference for Weblogs and Social Media (ICWSM 07), Boulder, CO.

[9] Li, Y. H., & Jain, A. K. (1998). Classification 

of Text Documents. The Computer Journal, 

41(8), 537-546. 

[10] Chuang, W. T., Tiyyagura, A., Yang, J., & 

Giuffrida, G. (2000). A fast algorithm for 

hierarchical text classification. In Y. 

Kambayashi, M. Mohania & A. M. Tjoa (Eds.), 

DaWaK 2000 (Vol. LNCS 1874, pp. 409-418): 

Springer-Verlag Berlin Heidelberg 2000. 

[11] Lorden, G. (1971). Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics, 42(6), 1897-1908.

[12] Pignatiello, J. J., Jr., & Simpson, J. R. (2002). A Magnitude-Robust Control Chart For Monitoring And Estimating Step Changes For Normal Process Means. Quality And Reliability Engineering International(18), 429-441.

[13] Hynek, J., & Jezek, K. (2003, 25-28 June 2003). 

Practical Approach to Automatic Text 

Summarization. Paper presented at the 7th 

ICCC/IFIP International Conference on 

Electronic Publishing, Universidade do Minho, 

Portugal. 

[14] Aizawa, A. (2000). The feature quantity: an information theoretic perspective of Tfidf-like measures. Paper presented at the Annual ACM conference on research and development in information retrieval, Athens, Greece.

[15] Jing, L.-P., Huang, H.-K., & Shi, H.-B. (2003). Improved feature selection approach TFIDF in text mining. Paper presented at the International Conference on Machine Learning and Cybernetics, 2002. Proceedings. 2002.

[16] Hovy, E. (2006). Learning Ontological Knowledge 

from the Web. 

[17] Moon, N., & Singh, R. (2005). Experiments in Text-Based Mining and Analysis of Biological Information from MEDLINE on Functionally-Related Genes. Paper presented at the 18th International Conference on Systems Engineering (ICSEng'05).

[18] Ishii, N., Murai, T., Yamada, T., & Bao, Y. 

(2006). Text Classification by Combining 

Grouping, LSA and kNN. Paper presented at the 

5th IEEE/ACIS International Conference on 

Computer and Information Science and 1st 

IEEE/ACIS International Workshop on 

Component-Based Software Engineering,Software 

Architecture and Reuse (ICIS-COMSAR'06). 

[19] Sparck Jones, K. (1972). A statistical 

interpretation of term specificity and its 

application in retrieval. Journal of 

Documentation, 28 (1), 11-21. 

[20] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.

[21] Montgomery, D. C. (1996). Introduction to 

Statistical Quality Control (3 Sub edition 

ed.): John Wiley & Sons. 

[22] Page, E. S. (1954). Continuous Inspection 

Schemes. Biometrika(41), 100-115. 

[23] Page, E. S. (1961). Cumulative Sum Control 

Charts. Technometrics, 3(1), 1-9. 

[24] Mani, I., & Bloedorn, E. (1998). Machine 

Learning of Generic and User-Focused 

Summarization. Paper presented at the 

Fifteenth National Conference on AI (AAAI- 

98). from http://arxiv.org/abs/cs/9811006v1
