Visualizing Text
http://www.chrisharrison.net/index.php/Visualizations/BibleViz
Logistics
• No new (non-project) homework!
• Next homework will be to turn in a
processed dataset (due April 12)
• Demo day scheduled!
• May 3, 3-5 pm
Text
• We encounter text everywhere
• Over 1 trillion web pages
• Over 100 trillion words
• How can we usefully gain insights from
text?
http://www.common-place.org/vol-09/no-02/reviews/images/bigmap1.jpg
http://www.flickr.com/photos/trustypics/6864756449/
What analysis questions
can we ask about text?
• What are the main themes of a web site
• How does one document differ from
another
• What is the tone of a tweet
• What else?
Word clouds
Word/Tag clouds
Final project wordle
Alphabetical
Hassan-Montero & Herrero-Solano, 2006
Semantic
Hassan-Montero & Herrero-Solano, 2006
Topigraphy
• Use topigraphical map to display relations
between tags and abstraction level
Fujimura et al., WWW2008
SparkClouds
Lee et al., 2010
Are tag clouds good?
• Search time: alphabetical layout wins
• Find topic: no difference
• Recall: no difference
• Conclusion: not a good information processing approach
Schrammel, Leitner & Tscheligi, CHI 2009
So why are they used?
Final project wordle
Final project word list
• Data
• User
• Visualization
• Information
• Application
• Use
• Project
• Users
• Music
• May
• Patterns
So why are they used?
“What might be considered design flaws from a data
visualization perspective make sense in terms of what
information is intended to be conveyed.
...a large part of the appeal of the visual
appearance of tag clouds are its fun, non-conformist
view, and the feeling that it evokes of human activity”
Hearst & Rosner, 2008
Comparing texts
Multiple tag clouds
Collins, Viegas & Wattenberg, 2009
http://www.neoformix.com/2008/DocumentContrastDiagrams.html
http://www.chrisharrison.net/projects/wordspectrum/index.html
Text clustering
Topic models
• Infer latent topics in text corpus
Steyvers & Griffiths, 2007
Scatter/Gather
Hierarchical text clustering
Cutting et al., 1992
Interactive topic models
Eisenstein, J., Chau, D., Kittur, A., Xing, E. (CHI 2012 WIP)
Text sequences
Arc diagrams
http://www.chrisharrison.net/index.php/Visualizations/BibleViz
Dot plots
http://www.vivo.colostate.edu/molkit/dnadot/bkg.html
Dot plots
http://en.wikipedia.org/wiki/File:Zinc-finger-dot-plot.png
Problems with many
repeated sequences
Wattenberg, 2002
What is a repeat?
• 123a123
• Maximal matching pairs: 123 vs. 12 or 23
• 10101010
• Repetition regions: 10 vs. 1010
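The two examples above can be tried out with a small brute-force search. This is only a sketch in Python, assuming that a "maximal matching pair" means two identical, non-overlapping substrings that cannot be extended to the left and are extended to the right as far as they stay disjoint. The function name `maximal_matching_pairs` and the `min_len` parameter are made up for illustration; this is not Wattenberg's algorithm.

```python
def maximal_matching_pairs(s, min_len=2):
    """Return (start_a, start_b, length) for identical, non-overlapping
    substrings that cannot be extended to the left and still match."""
    pairs = []
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] != s[j]:
                continue
            # Skip matches that are just the tail of a longer match to the left.
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Extend to the right while characters agree and the copies stay disjoint.
            length = 0
            while j + length < n and i + length < j and s[i + length] == s[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


print(maximal_matching_pairs("123a123"))   # [(0, 4, 3)]
print(maximal_matching_pairs("10101010"))  # [(0, 2, 2), (0, 4, 4), (0, 6, 2)]
```

On 123a123 only the pair of 123s is reported, not the shorter 12 or 23. On 10101010 both 10 and 1010 show up as candidates, which is the ambiguity the second example points at.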
Example
http://www.chrisharrison.net/index.php/Visualizations/BibleViz
http://a.openbible.info/labs/cross-references/grid-1000.jpg
http://a.openbible.info/blog/2010-04-cross-references-2000.png
Beyond the Bible
http://similardiversity.net
Arc Diagrams
Text processing and visualization on the iPad
Top-Down
Until now, we have focused on specific
programming topics.
This week we will change gears a bit.
We will talk about how to achieve a
specific visualization from start to finish.
I will be sharing an approach, not a
polished product.
An arc diagram is a one-dimensional layout of
nodes, with circular arcs to represent links.
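The geometry behind that definition is small enough to write down. The following Python sketch assumes nodes are evenly spaced along the x-axis and that each link is drawn as a semicircle; the function name `arc_geometry` and the `spacing` parameter are hypothetical.

```python
def arc_geometry(i, j, spacing=10.0):
    """Return (center_x, radius) of the semicircle linking nodes i and j.

    Nodes are assumed to sit on the x-axis at x = index * spacing.
    """
    x_i, x_j = i * spacing, j * spacing
    center_x = (x_i + x_j) / 2.0    # midpoint between the two nodes
    radius = abs(x_j - x_i) / 2.0   # half the distance between them
    return center_x, radius


print(arc_geometry(0, 3))  # (15.0, 15.0)
```

Links between distant nodes get tall arcs and links between neighbors get small ones, so arc height already encodes distance along the line.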
Our Arc Diagram
1. Let's write an app that draws an arc diagram for the words in a document.
2. Let's do all the calculation on the iPad.
3. Let's let the user browse the web to choose any document in any language.
4. Let's make the drawing explorable with multitouch.
Design
• Along the bottom will be all the words in
the document.
• An arc will connect two words if they
appeared adjacent to each other.
• A thicker arc will indicate a stronger
relationship.
What will we need?
We need the list of words.
We need, for each word, the list of
connected words.
We need the strength of the connection.
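These three ingredients can be sketched in a few lines. The Python below is an illustration rather than the app's code, and it assumes the strength of a connection is simply the number of times the two words appear next to each other. The function name `build_graph` is hypothetical.

```python
from collections import Counter, defaultdict


def build_graph(words):
    """Return (word_counts, neighbors, strength) for a list of word tokens."""
    word_counts = Counter(words)
    # strength[(a, b)] = number of times b immediately follows a
    strength = Counter(zip(words, words[1:]))
    # neighbors[w] = every word that appears directly before or after w
    neighbors = defaultdict(set)
    for a, b in strength:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return word_counts, neighbors, strength


words = "the man sat the man ran".split()
counts, neighbors, strength = build_graph(words)
print(counts["the"])               # 2
print(sorted(neighbors["man"]))    # ['ran', 'sat', 'the']
print(strength[("the", "man")])    # 2
```

Everything here lives in memory at once, which is fine for a sentence and becomes the problem for a whole book.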
Strength
Two words are strongly linked if they
appeared adjacent many times.
So we will be counting words and
word pairs (bigrams).
“Word.” You keep saying that word. I
do not think it means what you think it
means.
Words
“The man sat.” has three words.
How many words here:
Words
iOS offers us a solution to the word
segmentation problem in NSString:
- (void)enumerateSubstringsInRange:(NSRange)range
                           options:(NSStringEnumerationOptions)opts
                        usingBlock:(void (^)(NSString *substring,
                                             NSRange substringRange,
                                             NSRange enclosingRange,
                                             BOOL *stop))block
Set opts to NSStringEnumerationByWords.
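To see why segmentation is worth delegating to the platform, here is a deliberately naive stand-in written in Python. It assumes a word is a run of letters, digits or underscores, optionally joined by apostrophes. The name `segment_words` and the regular expression are illustrative only and are not equivalent to NSStringEnumerationByWords.

```python
import re

# \w matches Unicode letters, digits and underscore in Python 3;
# the optional group keeps apostrophes that sit inside a word.
WORD_RE = re.compile(r"\w+(?:['\u2019]\w+)*")


def segment_words(text):
    """Return the word tokens of text, dropping punctuation and whitespace."""
    return WORD_RE.findall(text)


print(segment_words("The man sat."))  # ['The', 'man', 'sat']
print(segment_words("don't stop"))    # ["don't", 'stop']
```

A pattern like this would treat an unspaced run of Chinese or Japanese text as a single token, which is exactly the kind of case "any document in any language" has to handle.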
The First Challenge
The user chooses an arbitrary document
and we:
1. Analyze that document to segment
the text into words.
2. Count the words and bigrams.
3. Present the results in a table view.
4. Do not run out of memory.
5. Do not take forever.
Thought Experiment
• Let's consider our approach.
• Let's store the document in an NSString.
• Enumerate each word with enumerateSubstringsInRange.
• Store each word and bigram in a couple of NSDictionaries.
Thought Experiment
• This approach does not scale.
• If the document is large, there isn't room in RAM for all of this.
Use Core Data
Time/memory tradeoff with Core Data.
Trade for some complexity (memory,
threading)
Flow
• UIWebView: user picks a URL.
• Loop over chunks. Parameter: chunk size.
• Phase 1: count grams in RAM. Smaller chunk = better (less RAM).
• Phase 1: update Core Data. Smaller chunk = worse (more updates).
• Phase 2: relate grams and compute secondary stats. Unigram to bigrams might be a huge relation.
• UITableViews: our first port of call to visualize the data.
• The Core Data steps loop over existing data in CD (“The Tandem Stream”).
Classes
ACWebSearchViewController
• Stripped-down web browser
• Analyze button begins analysis
• Loads selected URL asynchronously
• Feeds chunks to processing engine
ACTextProcessor
• Manages Core Data context
• Drives the text analysis engine
• Handles the threading with GCD
ACBigramCounter
• Counts grams in RAM
• Updates Core Data
• Performs Phase 2 (relationships and extra stats)
ACWebSearchViewController
Chunk size
#define CHUNK_SIZE 400000
This is the amount of text to send for
processing at one time.
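The buffering that a chunk size implies can be sketched independently of iOS. The Python below is an assumption-laden illustration: the class name `ChunkBuffer`, its methods, and the tiny chunk size are all made up, and the `while` loop (which drains several chunks if a lot of text arrives at once) is a choice of this sketch.

```python
CHUNK_SIZE = 8  # tiny value for illustration; the slides use 400000


class ChunkBuffer:
    """Accumulate incoming text and hand it on in chunk_size pieces."""

    def __init__(self, process, chunk_size=CHUNK_SIZE):
        self.process = process        # callback that receives each chunk
        self.chunk_size = chunk_size
        self.pending = ""

    def feed(self, text):
        """Append newly arrived text and emit every full chunk."""
        self.pending += text
        while len(self.pending) > self.chunk_size:
            self.process(self.pending[:self.chunk_size])
            self.pending = self.pending[self.chunk_size:]

    def finish(self):
        """Emit whatever is left once the download completes."""
        if self.pending:
            self.process(self.pending)
        self.pending = ""


chunks = []
buf = ChunkBuffer(chunks.append)
buf.feed("the man sat ")
buf.feed("on the mat")
buf.finish()
print(chunks)  # ['the man ', 'sat on t', 'he mat']
```

Note that a cut at a fixed character offset can fall inside a word ("t" and "he" above), so a fixed-size split slightly distorts counts at every chunk boundary.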
Go button
- (void)loadCurrentURL
{
    NSURL *url = self.currentURL;
    if (url) {
        NSURLRequest *req = [NSURLRequest requestWithURL:url];
        [self.activityIndicator startAnimating];
        [self.webView loadRequest:req];
    }
    else {
        NSLog(@"Unable to create valid URL for current string %@",
              self.currentURLString);
    }
}
UIWebView delegate
- (void)webViewDidFinishLoad:(UIWebView *)webView
{
    [self.activityIndicator stopAnimating];
    self.webAddressTextField.text = webView.request.URL.absoluteString;
    self.analyzeButton.enabled = YES;
}

- (void)webViewDidStartLoad:(UIWebView *)webView
{
    [self.activityIndicator startAnimating];
}

- (void)webView:(UIWebView *)webView didFailLoadWithError:(NSError *)error
{
    NSLog(@"Web view load failed with error: %@", error);
}
Analysis button
- (IBAction)analyzeButtonPressed:(id)sender {
    AppDelegate *appDelegate = (AppDelegate *)[[UIApplication sharedApplication] delegate];
    [appDelegate deleteEntireDataStore];
    _textProcessor = nil; // reset this guy too because it is powered by a MOC
    [[appDelegate tabBarController] setSelectedIndex:1];

    // request the page currently shown again so its raw data streams to our delegate
    NSString *urlString = self.webView.request.URL.absoluteString;
    NSURL *url = [NSURL URLWithString:urlString];
    if (url) {
        NSURLRequest *req = [NSURLRequest requestWithURL:url];
        [[NSURLConnection alloc] initWithRequest:req delegate:self];
    }
}
NSURLConnection delegate
- (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data
{
    // nil if the bytes are not valid UTF-8 (e.g. a multi-byte character split across two callbacks)
    NSString *incomingString = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    NSLog(@"newly available data: %lu bytes", (unsigned long)[data length]);
    [self processIncomingString:incomingString];
}

- (void)connection:(NSURLConnection *)connection didFailWithError:(NSError *)error
{
    NSLog(@"NSURLConnection failed with error: %@", error);
}
The main engine
- (void)processIncomingString:(NSString *)string
{
    // buffer the new text
    if (string) {
        self.documentString = [self.documentString stringByAppendingString:string];
    }
    // once the buffer exceeds one chunk, hand the first CHUNK_SIZE characters to the processor
    if ([self.documentString length] > CHUNK_SIZE) {
        [self.textProcessor processString:[self.documentString substringToIndex:CHUNK_SIZE]
                                 logBlock:^(NSString *s) {
                                     // noise
                                 }];
        self.documentString = [self.documentString substringFromIndex:CHUNK_SIZE];
    }
}

// a last piece of URLConnectionDelegate for the last bit of data
- (void)connectionDidFinishLoading:(NSURLConnection *)connection {
    [self.textProcessor processString:self.documentString logBlock:^(NSString *s) {
        // noise
    }];
    _documentString = @"";
    [self.textProcessor didFinishProcessingStrings:^(NSString *s) {
        // noise
    }];
}
ACTextProcessor
• Reminder: The interface to CD is the NSManagedObjectContext (“MOC”).
• Rule: CD work on a thread must use a MOC created in that thread.
• “Thread” includes serial GCD queue.
ACTextProcessor
- (id)init {
    if (self = [super init]) {
        _queue = dispatch_queue_create("edu.cmu.dataviz", DISPATCH_QUEUE_SERIAL);
        _dispatchGroup = dispatch_group_create();
        // create the managed object context in the same thread where we will be doing the work
        __block NSManagedObjectContext *context = nil;
        dispatch_sync(_queue, ^{
            context = [[NSManagedObjectContext alloc] init];
            context.persistentStoreCoordinator =
                [(AppDelegate *)[[UIApplication sharedApplication] delegate]
                    managedObjectContext].persistentStoreCoordinator;
            context.undoManager = nil;
        });
        self.managedObjectContext = context;
        _bigramCounter = [[ACBigramCounter alloc] initWithCoreDataContext:self.managedObjectContext];
    }
    return self;
}
ACTextProcessor
- (void)processString:(NSString *)string logBlock:(void (^)(NSString *))logger {
    // count this chunk on the serial queue, tracked by the dispatch group
    dispatch_group_async(_dispatchGroup, _queue, ^{
        logger(@"Collecting counts...");
        [self.bigramCounter addCountsFromString:string];
        logger(@"done.\n");
    });
}

- (void)didFinishProcessingStrings:(void (^)(NSString *))logger {
    dispatch_async(_queue, ^{
        // make sure every chunk submitted to the group has been counted
        dispatch_group_wait(_dispatchGroup, DISPATCH_TIME_FOREVER);
        logger(@"Adding database relations...");
        [self.bigramCounter finalizeCounts];
        logger(@"done.\n");
        logger(@"All done.\n");
    });
}
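The ordering guarantee that the serial queue provides can be imitated outside GCD. The Python sketch below is only an analogy: a thread pool with a single worker stands in for the serial queue, and the function names `count_chunk` and `finalize` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

log = []


def count_chunk(chunk):
    """Stand-in for counting the grams of one chunk."""
    log.append(f"counted {len(chunk.split())} words")


def finalize():
    """Stand-in for the relation-building step."""
    log.append("finalized")


# max_workers=1 gives one background thread that runs jobs in submission order.
with ThreadPoolExecutor(max_workers=1) as queue:
    for chunk in ["the man sat", "on the mat"]:
        queue.submit(count_chunk, chunk)
    queue.submit(finalize)  # runs only after every chunk job has finished

print(log)  # ['counted 3 words', 'counted 3 words', 'finalized']
```

Because all jobs run on the same single worker, any state they share is only ever touched from one thread, which is the property the MOC rule asks for.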
ACBigramCounter
Aside: MyCount
Simple class that contains a mutable
count, for storing in NSDictionary.
@interface MyCount : NSObject {
    int64_t mCount; // same as Core Data 64-bit integer
}
@property (nonatomic, assign) int64_t count;
@end

@implementation MyCount
@synthesize count = mCount;
- (NSString *)description {
    // a 64-bit value needs %lld; %d would misprint it
    return [NSString stringWithFormat:@"%lld", (long long)mCount];
}
@end
Data model
- (BOOL)addCountsFromString:(NSString *)trueCaseStr {
    NSString *str = [trueCaseStr lowercaseString];
    NSMutableDictionary *unigrams = [[NSMutableDictionary alloc] init];
    NSMutableDictionary *bigrams = [[NSMutableDictionary alloc] init];
    __block NSString *previousToken = nil;

    // Count the unigrams and bigrams of this chunk in RAM.
    @autoreleasepool {
        [str enumerateSubstringsInRange:NSMakeRange(0, [str length])
                                options:NSStringEnumerationByWords
                             usingBlock:^(NSString *currentToken, NSRange substringRange,
                                          NSRange enclosingRange, BOOL *stop) {
            _totalUnigramCount += 1;
            MyCount *currentCount = [unigrams objectForKey:currentToken];
            if (currentCount) {
                currentCount.count = currentCount.count + 1;
            } else {
                MyCount *one = [[MyCount alloc] init];
                one.count = 1;
                [unigrams setObject:one forKey:currentToken];
                _totalNumberOfUnigrams += 1;
            }
            if (previousToken != nil) {
                NSString *bigram =
                    [NSString stringWithFormat:@"%@ %@", previousToken, currentToken];
                MyCount *bigramCount = [bigrams objectForKey:bigram];
                if (bigramCount) {
                    bigramCount.count = bigramCount.count + 1;
                } else {
                    _totalNumberOfBigrams += 1;
                    MyCount *one = [[MyCount alloc] init];
                    one.count = 1;
                    [bigrams setObject:one forKey:bigram];
                }
            }
            previousToken = currentToken;
        }];
    }

    // Merge the chunk's unigram counts into Core Data, walking both sorted lists in tandem.
    @autoreleasepool {
        NSArray *chunkTokens =
            [[unigrams allKeys] sortedArrayUsingSelector:@selector(localizedCompare:)];
        ACCDEnumerator *existingEnum =
            [[ACCDEnumerator alloc] initWithManagedObjectContext:self.managedObjectContext];
        [existingEnum setEntity:@"Token" sortKey:@"token" ascending:YES
                     comparison:@selector(localizedCompare:) predicate:nil];
        existingEnum.fetchLimit = 10000;
        Token *existingToken = [existingEnum nextObject];
        NSComparisonResult comparison = NSOrderedAscending;
        for (NSString *chunkToken in chunkTokens) {
            // advance the Core Data stream until it matches or passes the chunk token
            while (existingToken != nil &&
                   [chunkToken localizedCompare:existingToken.token] == NSOrderedDescending) {
                existingToken = [existingEnum nextObject];
            }
            // once the existing tokens run out, every remaining chunk token is novel
            comparison = existingToken
                ? [chunkToken localizedCompare:existingToken.token]
                : NSOrderedAscending;
            int64_t chunkTokenCount = [(MyCount *)[unigrams objectForKey:chunkToken] count];
            // same -> match, ascending -> novel
            if (comparison == NSOrderedSame) {
                existingToken.count =
                    [NSNumber numberWithLongLong:[existingToken.count longLongValue] + chunkTokenCount];
            } else if (comparison == NSOrderedAscending) {
                Token *newToken = [NSEntityDescription insertNewObjectForEntityForName:@"Token"
                                                inManagedObjectContext:self.managedObjectContext];
                newToken.token = [chunkToken copy];
                newToken.count = [NSNumber numberWithLongLong:chunkTokenCount];
            }
        }
    }
    // ...
}
The Tandem Stream
List 1      List 2
aardvark    aardvark
atom        add
azure       atom
            big
            car
            dog
            elephant
            gorilla
            helicopter
            ice
            jump
• Walk through two lists in tandem in order to merge some info.
• Move one item at a time in the first list.
• Move in the second list until you match or exceed.
• If match, merge the info. If exceed, add a new record.
• The first list could be chunk tokens, and the second list could be Core Data tokens.
• One of the lists could be bigrams sorted by first word (aardvark cage, atom age, azure sky)…
• …or by second word (bad aardvark, the atom, bright azure).
• We keep the two words in separate fields, so these comparisons are easy.
• This operation lets us merge large tables of data with a controlled memory footprint.
• The key is the ability to sort the data.
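The walk described above is an ordinary sorted merge. Here is a minimal Python sketch of it, assuming both lists hold (token, count) pairs already sorted by token and that plain string comparison is an acceptable stand-in for the locale-aware comparison used in the app. The function name `tandem_merge` is hypothetical.

```python
def tandem_merge(chunk_counts, store):
    """Merge sorted (token, count) pairs from a chunk into a sorted store.

    Both inputs must be sorted by token. Returns a new sorted list.
    """
    merged = []
    existing = iter(store)
    current = next(existing, None)
    for token, count in chunk_counts:       # move one item at a time in the first list
        # move in the second list until we match or exceed
        while current is not None and current[0] < token:
            merged.append(current)
            current = next(existing, None)
        if current is not None and current[0] == token:
            merged.append((token, current[1] + count))  # match: merge the info
            current = next(existing, None)
        else:
            merged.append((token, count))               # exceed: add a new record
    # copy whatever is left in the second list
    while current is not None:
        merged.append(current)
        current = next(existing, None)
    return merged


chunk = [("aardvark", 2), ("atom", 1), ("azure", 4)]
store = [("aardvark", 5), ("add", 1), ("atom", 3), ("big", 2)]
print(tandem_merge(chunk, store))
# [('aardvark', 7), ('add', 1), ('atom', 4), ('azure', 4), ('big', 2)]
```

Each list is read once, front to back, so only the current item of each stream has to be in memory at any moment.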
Implementation
• Version 1: query CD for all entries in both tables and then loop along the results.
• CD will incrementally fetch more results as you move along these arrays.
• But memory grows without bound.
Implementation
• Version 2: Only fetch N items at a time, and save and reset the DB before fetching more: ACCDEnumerator
• -[NSFetchRequest setFetchLimit:]
• -[NSFetchRequest setFetchOffset:]
- (id)nextObject {
    id nextObj = nil;
    if (self.currentItem && nil == [self.currentItem managedObjectContext]) {
        // the context was reset by another enumerator, recover
        self.currentItems = nil;
        self.currentItem = nil;
        self.enumerator = nil;
        self.nextEnumeratedItemOffset -= 1; // refetch this guy
        self.fetchOffset = self.nextEnumeratedItemOffset;
        [self _performNextFetch:NO];
    }
    if (self.enumerator) {
        nextObj = [self.enumerator nextObject];
        if (nil == nextObj) {
            // current page exhausted, fetch the next one
            self.currentItems = nil;
            [self _performNextFetch:YES];
            if (self.enumerator) {
                nextObj = [self.enumerator nextObject];
            }
        }
    }
    if (nil != nextObj) {
        self.nextEnumeratedItemOffset += 1;
        self.currentItem = nextObj;
    }
    return nextObj;
}
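The fetch-limit-plus-offset idea is easy to try against any sorted store. The Python sketch below uses an in-memory SQLite table as a stand-in for Core Data, with SQL LIMIT and OFFSET playing the role of the fetch limit and fetch offset. The table, its columns and the function name `paged_rows` are all hypothetical, and the context-reset recovery shown above is not modeled.

```python
import sqlite3


def paged_rows(conn, page_size):
    """Yield (token, n) rows in token order, fetching page_size rows at a time."""
    offset = 0
    while True:
        page = conn.execute(
            "SELECT token, n FROM tokens ORDER BY token LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not page:
            return              # no more rows
        yield from page
        offset += len(page)     # the next fetch starts after the rows already seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (token TEXT PRIMARY KEY, n INTEGER)")
conn.executemany("INSERT INTO tokens VALUES (?, ?)",
                 [("dog", 3), ("aardvark", 1), ("car", 2)])
print(list(paged_rows(conn, page_size=2)))
# [('aardvark', 1), ('car', 2), ('dog', 3)]
```

Only one page of rows is held at a time, so memory use is set by the page size rather than by the size of the table.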
The Relations and other stats
- (void)_finalizeCounts {
    // Establish all the relationships in the DB
    ACCDEnumerator *bigramEnumerator = [[ACCDEnumerator alloc]
        initWithManagedObjectContext:self.managedObjectContext];
    [bigramEnumerator setEntity:@"Bigram" sortKey:@"first_token" ascending:YES
                     comparison:@selector(localizedCompare:) predicate:nil];
    ACCDEnumerator *unigramEnumerator = [[ACCDEnumerator alloc]
        initWithManagedObjectContext:self.managedObjectContext];
    [unigramEnumerator setEntity:@"Token" sortKey:@"token" ascending:YES
                      comparison:@selector(localizedCompare:) predicate:nil];
    Bigram *bigram = [bigramEnumerator nextObject];
    Token *unigram = [unigramEnumerator nextObject];
    while (bigram != nil) {
        // advance the unigram stream until it matches or passes the bigram's first word;
        // a single step is not enough when a token never starts a bigram
        while (unigram != nil &&
               [bigram.first_token localizedCompare:unigram.token] == NSOrderedDescending) {
            unigram = [unigramEnumerator nextObject];
        }
        if (unigram != nil &&
            [bigram.first_token localizedCompare:unigram.token] == NSOrderedSame) {
            bigram.first = unigram;
            unigram.num_bigrams_as_first =
                [NSNumber numberWithInt:[unigram.num_bigrams_as_first intValue] + 1];
        } else {
            NSLog(@"ERROR unexpected mismatch: %@, %@", bigram.bigram, unigram.token);
        }
        bigram = [bigramEnumerator nextObject];
    }
}
The other stats
• bigrams_as_first/second tells you how many unique bigrams this token is a part of.
• Imagine a token with a large count but small bigrams_as_second.
• Example: “Francisco” mostly follows “San”.
• Such tokens might be great for the visualization to focus on.
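The "large count, few distinct predecessors" signal can be computed directly on a toy text. This Python sketch assumes the second number is the count of distinct words seen immediately before a token; the function name `token_stats` and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def token_stats(words):
    """Return {token: (count, distinct_predecessors)} for a token list."""
    counts = Counter(words)
    predecessors = defaultdict(set)
    for first, second in zip(words, words[1:]):
        predecessors[second].add(first)
    return {w: (counts[w], len(predecessors[w])) for w in counts}


words = "san francisco and san francisco or san diego and the bay".split()
stats = token_stats(words)
print(stats["francisco"])  # (2, 1)
print(stats["and"])        # (2, 2)
```

Both tokens occur twice, but "francisco" is always preceded by the same word while "and" is not, which is the contrast the slide suggests looking for.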
ACAnalysisViewController
Performance on iPad 2
• “A Tale of Two Cities” (140,000 words, 800KB)
• Run 1
• 400k chunks
• 5000 bigram limit
• 10000 unigram limit during count
• 2000 for both during linking
• RAM 22M, time 2h16m.
• Run 2
• 400k chunks
• 5000 bigram limit
• 10000 unigram limit
• RAM 22M, time 1h
Performance on iPad 2
• This is a baseline implementation.