25.07.2013 Views

words - Apple


Visualizing Text

http://www.chrisharrison.net/index.php/Visualizations/BibleViz


Logistics

• No new (non-project) homework!

• Next homework will be to turn in a processed dataset (due April 12)

• Demo day scheduled!

• May 3, 3-5 pm


Text

• We encounter text everywhere

• Over 1 trillion web pages

• Over 100 trillion words

• How can we usefully gain insights from text?


http://www.common-place.org/vol-09/no-02/reviews/images/bigmap1.jpg


http://www.flickr.com/photos/trustypics/6864756449/


What analysis questions can we ask about text?

• What are the main themes of a web site?

• How does one document differ from another?

• What is the tone of a tweet?

• What else?


Word clouds


Word/Tag clouds


Final project wordle


Alphabetical

Hassan-Montero & Herrero-Solano, 2006


Semantic

Hassan-Montero & Herrero-Solano, 2006


Topigraphy

• Use a topigraphical map to display relations between tags and abstraction level

Fujimura et al., WWW2008


SparkClouds

Lee et al., 2010


Schrammel, Leitner & Tscheligi, CHI 2009


Are tag clouds good?

• Search time

  • Alpha wins

• Find topic

  • No difference

• Recall

  • No difference

• Conclusion: Not a good information processing approach

Schrammel, Leitner & Tscheligi, CHI 2009


So why are they used?


Final project wordle


Final project word list

• Data

• User

• Visualization

• Information

• Application

• Use

• Project

• Users

• Music

• May

• Patterns



So why are they used?

“What might be considered design flaws from a data visualization perspective make sense in terms of what information is intended to be conveyed.

...a large part of the appeal of the visual appearance of tag clouds are its fun, non-conformist view, and the feeling that it evokes of human activity”

Hearst & Rosner, 2008


Comparing texts


Multiple tag clouds


Collins, Viegas & Wattenberg, 2009


Collins, Viegas & Wattenberg, 2009


http://www.neoformix.com/2008/DocumentContrastDiagrams.html


http://www.chrisharrison.net/projects/wordspectrum/index.html


http://www.chrisharrison.net/projects/wordspectrum/index.html


Text clustering


Topic models

• Infer latent topics in text corpus

Steyvers & Griffiths, 2007


Scatter/Gather

Hierarchical text clustering

Cutting et al., 1992


Interactive topic models


Eisenstein, J., Chau, D., Kittur, A., Xing, E. (CHI 2012 WIP)


Eisenstein, J., Chau, D., Kittur, A., Xing, E. (CHI 2012 WIP)


Text sequences


Arc diagrams

http://www.chrisharrison.net/index.php/Visualizations/BibleViz


Dot plots

http://www.vivo.colostate.edu/molkit/dnadot/bkg.html



Dot plots

http://en.wikipedia.org/wiki/File:Zinc-finger-dot-plot.png


Problems with many repeated sequences

Wattenberg, 2002



What is a repeat?

• 123a123

• Maximal matching pairs: 123 vs. 12 or 23

• 10101010

• Repetition regions: 10 vs. 1010
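
The two bullets above can be made concrete with a brute-force search, sketched here in Python purely so it can be run anywhere. The function name `maximal_matching_pairs`, the `min_len` parameter, and the rule that the two copies may not overlap are assumptions made for illustration; this is a simplified reading of the idea, not Wattenberg's algorithm.

```python
def maximal_matching_pairs(seq, min_len=2):
    """Return (start_a, start_b, length) for repeated substrings that
    cannot be extended to the left and do not overlap each other."""
    pairs = []
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs that could be extended one step to the left.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right while the copies match and do not overlap.
            length = 0
            while (j + length < n and i + length < j
                   and seq[i + length] == seq[j + length]):
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


print(maximal_matching_pairs("123a123"))
print(maximal_matching_pairs("10101010"))
```

On "123a123" the sketch reports only the full "123" pair, not "12" or "23". On "10101010" it reports both "10" and "1010" repeats, which is the ambiguity the last bullet points at.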


Example


http://www.chrisharrison.net/index.php/Visualizations/BibleViz


http://a.openbible.info/labs/cross-references/grid-1000.jpg


http://a.openbible.info/blog/2010-04-cross-references-2000.png


Beyond the bible

http://similardiversity.net


Arc Diagrams

Text processing and visualization on the iPad






Top-Down

Until now, we have focused on specific programming topics.

This week we will change gears a bit.

We will talk about how to achieve a specific visualization from start to finish.

I will be sharing an approach, not a polished product.


An arc diagram is a one-dimensional layout of nodes, with circular arcs to represent links.


Our Arc Diagram

1. Let's write an app that draws an arc diagram for the words in a document.

2. Let's do all the calculation on the iPad.

3. Let's let the user browse the web to choose any document in any language.

4. Let's make the drawing explorable with multitouch.


Design

• Along the bottom will be all the words in the document.

• An arc will connect two words if they appeared adjacent to each other.

• A thicker arc will indicate a stronger relationship.
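
The design above can be sketched as a small layout function, written in Python so it runs outside iOS. Each arc is a semicircle whose center is the midpoint between the two words and whose radius is half their distance. The name `layout_arcs`, the `spacing` and `max_width` parameters, and the linear scaling of stroke width by count are assumptions for illustration, not part of the slides.

```python
def layout_arcs(words, bigram_counts, spacing=10.0, max_width=8.0):
    """Place words on a line and describe one semicircular arc per bigram."""
    x = {word: index * spacing for index, word in enumerate(words)}
    strongest = max(bigram_counts.values(), default=1)
    arcs = []
    for (first, second), count in bigram_counts.items():
        if first == second:
            continue  # a word next to itself has no arc to draw
        arcs.append({
            "from": first,
            "to": second,
            "center_x": (x[first] + x[second]) / 2,
            "radius": abs(x[first] - x[second]) / 2,
            # Assumed mapping: the most frequent pair gets max_width.
            "width": max_width * count / strongest,
        })
    return arcs


for arc in layout_arcs(["the", "man", "sat"],
                       {("the", "man"): 2, ("man", "sat"): 1}):
    print(arc)
```

A drawing layer would then stroke each arc above the baseline with the given center, radius, and width.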





What will we need?

We need the list of words.

We need, for each word, the list of connected words.

We need the strength of the connection.


Strength

Two words are strongly linked if they appeared adjacent many times.

So we will be counting words and word pairs (bigrams).

“Word.” You keep saying that word. I do not think it means what you think it means.
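
Counting words and bigrams, as described above, can be sketched in a few lines of Python. The function name `count_grams` is hypothetical, and the regular expression `\w+` is only a stand-in tokenizer assumed for this example; it does not solve the segmentation problem the next slides raise.

```python
import re
from collections import Counter


def count_grams(text):
    """Return (unigram counts, bigram counts) for a piece of text."""
    words = re.findall(r"\w+", text.lower())
    unigrams = Counter(words)
    # Pair every word with its successor to form the bigrams.
    bigrams = Counter(zip(words, words[1:]))
    return unigrams, bigrams


unigrams, bigrams = count_grams("The man sat. The man ran.")
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

The bigram count is the "strength" of the link between two words; the unigram count says how often each word occurs at all.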


Words

“The man sat.” has three words.

How many words here:




Words

iOS offers us a solution to the word segmentation problem in NSString:

    - (void)enumerateSubstringsInRange:(NSRange)range
                               options:(NSStringEnumerationOptions)opts
                            usingBlock:(void (^)(NSString *substring,
                                                 NSRange substringRange,
                                                 NSRange enclosingRange,
                                                 BOOL *stop))block

Set opts to NSStringEnumerationByWords.
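
To see why segmentation needs library support, consider a naive splitter that only looks at whitespace. This is a Python sketch, not the NSString API; the helper `naive_words` and the Chinese sample sentence are assumptions chosen for illustration.

```python
def naive_words(text):
    """Split on whitespace and strip surrounding ASCII punctuation."""
    stripped = (piece.strip(".,;:!?\"'()") for piece in text.split())
    return [word for word in stripped if word]


print(naive_words("The man sat."))
# A script written without spaces comes back as one "word".
print(len(naive_words("我喜欢可视化。")))
```

Whitespace splitting works for the English sentence but treats the whole unspaced sentence as a single token, which is why a locale-aware word enumerator matters when the user may pick a document in any language.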


The First Challenge

The user chooses an arbitrary document and we:

1. Analyze that document to segment the text into words.

2. Count the words and bigrams.

3. Present the results in a table view.

4. Do not run out of memory.

5. Do not take forever.


Thought Experiment





Let's consider our approach.

Let's store the document in an NSString.

Enumerate each word with enumerateSubstringsInRange.

Store each word and bigram in a couple of NSDictionaries.


Thought Experiment



This approach does not scale.

If the document is large, there isn't room in RAM for all of this.


Use Core Data

Time/memory tradeoff with Core Data.

Trade for some complexity (memory, threading)


Flow

Phase 1

• UIWebView: User picks a URL

• Loop over chunks (Parameter: chunk size)

  • Count grams in RAM: Smaller chunk = better (less RAM)

  • Update Core Data: Smaller chunk = worse (more updates)

Phase 2

• Relate grams. Secondary stats. Unigram to bigrams might be a huge relation.

• UITableViews: Our first port of call to visualize the data.

The Core Data steps will loop over existing data in CD: “The Tandem Stream”

Classes

ACWebSearchViewController

• Stripped-down web browser

• Analyze button begins analysis

• Loads selected URL asynchronously

• Feeds chunks to processing engine

ACTextProcessor

• Manages Core Data context

• Drives the text analysis engine

• Handles the threading with GCD

ACBigramCounter

• Counts grams in RAM

• Updates Core Data

• Performs Phase 2 (relationships and extra stats)


ACWebSearchViewController



Chunk size

#define CHUNK_SIZE 400000

This is the amount of text (in NSString characters) to send for processing at one time.


Go button

    - (void)loadCurrentURL
    {
        NSURL *url = self.currentURL;
        if (url) {
            NSURLRequest *req = [NSURLRequest requestWithURL:url];
            [self.activityIndicator startAnimating];
            [self.webView loadRequest:req];
        }
        else {
            NSLog(@"Unable to create valid URL for current string %@",
                  self.currentURLString);
        }
    }


UIWebView delegate

    - (void)webViewDidFinishLoad:(UIWebView *)webView
    {
        [self.activityIndicator stopAnimating];
        self.webAddressTextField.text = webView.request.URL.absoluteString;
        self.analyzeButton.enabled = YES;
    }

    - (void)webViewDidStartLoad:(UIWebView *)webView
    {
        [self.activityIndicator startAnimating];
    }

    - (void)webView:(UIWebView *)webView didFailLoadWithError:(NSError *)error
    {
        NSLog(@"Web view load failed with error: %@", error);
    }


Analysis button

    - (IBAction)analyzeButtonPressed:(id)sender {
        [(AppDelegate *)[[UIApplication sharedApplication] delegate] deleteEntireDataStore];
        _textProcessor = nil; // reset this guy too because it is powered by a MOC

        // Download the page currently shown in the web view for analysis
        NSString *urlString = self.webView.request.URL.absoluteString;
        NSURL *url = [NSURL URLWithString:urlString];
        if (url) {
            NSURLRequest *req = [NSURLRequest requestWithURL:url];
            [[NSURLConnection alloc] initWithRequest:req delegate:self];
        }

        // Switch to the analysis tab
        [[(AppDelegate *)[[UIApplication sharedApplication] delegate] tabBarController]
            setSelectedIndex:1];
    }


NSURLConnection delegate

    - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data
    {
        // May be nil if this packet ends in the middle of a multi-byte UTF-8 character
        NSString *incomingString = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        NSLog(@"newly available data: %lu bytes", (unsigned long)[data length]);
        [self processIncomingString:incomingString];
    }

    - (void)connection:(NSURLConnection *)connection didFailWithError:(NSError *)error
    {
        NSLog(@"NSURLConnection failed with error: %@", error);
    }


The main engine

    - (void)processIncomingString:(NSString *)string
    {
        if (string)
        {
            self.documentString = [self.documentString stringByAppendingString:string];
        }
        if ([self.documentString length] > CHUNK_SIZE) {
            [self.textProcessor processString:[self.documentString substringToIndex:CHUNK_SIZE]
                                     logBlock:^(NSString *s) {
                //noise;
            }];
            self.documentString = [self.documentString substringFromIndex:CHUNK_SIZE];
        }
    }

    // a last piece of URLConnectionDelegate for the last bit of data
    - (void)connectionDidFinishLoading:(NSURLConnection *)connection {
        [self.textProcessor processString:self.documentString logBlock:^(NSString *s) {
            //noise;
        }];
        _documentString = @"";
        [self.textProcessor didFinishProcessingStrings:^(NSString *s) {
            //noise;
        }];
    }




ACTextProcessor

Reminder: The interface to Core Data (CD) is the NSManagedObjectContext (“MOC”).

Rule: CD work on a thread must use a MOC created in that thread.

“Thread” includes serial GCD queue.


ACTextProcessor

    - (id)init {
        if (self = [super init]) {
            _queue = dispatch_queue_create("edu.cmu.dataviz", DISPATCH_QUEUE_SERIAL);
            _dispatchGroup = dispatch_group_create();
            // create the managed object context in the same thread where we will be doing the work
            __block NSManagedObjectContext *context = nil;
            dispatch_sync(_queue, ^{
                context = [[NSManagedObjectContext alloc] init];
                context.persistentStoreCoordinator =
                    [(AppDelegate *)[[UIApplication sharedApplication] delegate]
                        managedObjectContext].persistentStoreCoordinator;
                context.undoManager = nil;
            });
            self.managedObjectContext = context;
            _bigramCounter = [[ACBigramCounter alloc] initWithCoreDataContext:self.managedObjectContext];
        }
        return self;
    }


ACTextProcessor

    - (void)processString:(NSString *)string logBlock:(void (^)(NSString *))logger {
        // Each chunk is counted on the serial queue that owns the MOC
        dispatch_group_async(_dispatchGroup, _queue, ^{
            logger(@"Collecting counts...");
            [self.bigramCounter addCountsFromString:string];
            logger(@"done.\n");
        });
    }

    - (void)didFinishProcessingStrings:(void (^)(NSString *))logger {
        dispatch_async(_queue, ^{
            // All chunk blocks were queued earlier on this serial queue
            dispatch_group_wait(_dispatchGroup, DISPATCH_TIME_FOREVER);
            logger(@"Adding database relations...");
            [self.bigramCounter finalizeCounts];
            logger(@"done.\n");
            logger(@"All done.\n");
        });
    }
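
The threading pattern above (one serial queue that creates the context, does every piece of work, and runs the final step last) can be sketched in Python with a single-worker executor standing in for the serial GCD queue. The class `SerialProcessor` and its dictionary "context" are hypothetical; only the ordering and thread-ownership ideas are taken from the slides.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SerialProcessor:
    def __init__(self):
        # One worker thread means jobs run one at a time, in submission order.
        self._queue = ThreadPoolExecutor(max_workers=1)
        # Create the context on the worker, where all later work happens.
        self._context = self._queue.submit(self._make_context).result()

    def _make_context(self):
        return {"owner": threading.get_ident(), "chunks": []}

    def _work(self, text):
        # The context is only ever touched from the thread that created it.
        assert threading.get_ident() == self._context["owner"]
        self._context["chunks"].append(text)

    def process(self, text):
        return self._queue.submit(self._work, text)

    def finish(self):
        # Queued last, so it runs after every chunk already submitted.
        result = self._queue.submit(lambda: list(self._context["chunks"])).result()
        self._queue.shutdown()
        return result


processor = SerialProcessor()
processor.process("chunk 1")
processor.process("chunk 2")
print(processor.finish())
```

Because the queue is serial, the final step needs no extra locking: by the time it runs, every earlier chunk has already been processed.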


ACBigramCounter



Aside: MyCount

Simple class that contains a mutable count, for storing in NSDictionary.

    @interface MyCount : NSObject {
        int64_t mCount; // same as Core Data 64-bit integer
    }
    @property (nonatomic, assign) int64_t count;
    @end

    @implementation MyCount
    @synthesize count = mCount;
    - (NSString *)description {
        // %lld matches the 64-bit count; %d would truncate it
        return [NSString stringWithFormat:@"%lld", (long long)mCount];
    }
    @end


Data model


    - (BOOL)addCountsFromString:(NSString *)trueCaseStr {
        NSString *str = [trueCaseStr lowercaseString];
        NSMutableDictionary *unigrams = [[NSMutableDictionary alloc] init];
        NSMutableDictionary *bigrams = [[NSMutableDictionary alloc] init];
        __block NSString *previousToken = nil;
        @autoreleasepool {
            [str enumerateSubstringsInRange:NSMakeRange(0, [str length])
                                    options:NSStringEnumerationByWords
                                 usingBlock:^(NSString *currentToken, NSRange substringRange,
                                              NSRange enclosingRange, BOOL *stop) {
                _totalUnigramCount += 1;
                MyCount *currentCount = [unigrams objectForKey:currentToken];
                if (currentCount) {
                    currentCount.count = currentCount.count + 1;
                } else {
                    MyCount *one = [[MyCount alloc] init];
                    one.count = 1;
                    [unigrams setObject:one forKey:currentToken];
                    _totalNumberOfUnigrams += 1;
                }
                if (previousToken != nil) {
                    NSString *bigram =
                        [NSString stringWithFormat:@"%@ %@", previousToken, currentToken];
                    MyCount *currentCount = [bigrams objectForKey:bigram];
                    if (currentCount) {
                        currentCount.count = currentCount.count + 1;
                    } else {
                        _totalNumberOfBigrams += 1;
                        MyCount *one = [[MyCount alloc] init];
                        one.count = 1;
                        [bigrams setObject:one forKey:bigram];
                    }
                }
                previousToken = currentToken;
            }];
        }


        @autoreleasepool {
            NSArray *chunkTokens = [[unigrams allKeys]
                sortedArrayUsingSelector:@selector(localizedCompare:)];
            ACCDEnumerator *existingEnum =
                [[ACCDEnumerator alloc] initWithManagedObjectContext:self.managedObjectContext];
            [existingEnum setEntity:@"Token" sortKey:@"token" ascending:YES
                         comparison:@selector(localizedCompare:) predicate:nil];
            existingEnum.fetchLimit = 10000;
            Token *existingToken = [existingEnum nextObject];
            for (NSString *chunkToken in chunkTokens) {
                // Advance the stored stream until it matches or passes chunkToken.
                // Once the stored stream runs out, every remaining chunk token is novel.
                NSComparisonResult comparison = NSOrderedAscending;
                while (existingToken != nil) {
                    comparison = [chunkToken localizedCompare:existingToken.token];
                    if (comparison != NSOrderedDescending) break;
                    existingToken = [existingEnum nextObject];
                    comparison = NSOrderedAscending;
                }
                int64_t chunkTokenCount = [(MyCount *)[unigrams objectForKey:chunkToken] count];
                // same -> match, ascending -> novel
                if (comparison == NSOrderedSame) {
                    // counts are 64-bit, so use the long long accessors
                    existingToken.count = [NSNumber numberWithLongLong:
                        [existingToken.count longLongValue] + chunkTokenCount];
                } else {
                    Token *newToken = [NSEntityDescription insertNewObjectForEntityForName:@"Token"
                                                    inManagedObjectContext:self.managedObjectContext];
                    newToken.token = [chunkToken copy];
                    newToken.count = [NSNumber numberWithLongLong:chunkTokenCount];
                }
            }
        }


The Tandem Stream

    List 1      List 2
    aardvark    aardvark
    atom        add
    azure       atom
                big
                car
                dog
                elephant
                gorilla
                helicopter
                ice
                jump

Walk through two lists in tandem in order to merge some info.

1. Move one item at a time in the first list.

2. Move in the second list until you match or exceed.

3. If match, merge the info. If exceed, add a new record.

The first list could be chunk tokens, and the second list could be Core Data tokens.

One of the lists could be bigrams sorted by first word…

    List 1           List 2
    aardvark cage    aardvark
    atom age         add
    azure sky        atom

…or by second word.

    List 1           List 2
    bad aardvark     aardvark
    the atom         add
    bright azure     atom

We keep the two words in separate fields, so these comparisons are easy.

This operation lets us merge large tables of data with a controlled memory footprint. The key is the ability to sort the data.
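
The walk described above can be sketched as a merge of two sorted streams. This is a Python illustration under stated assumptions: `tandem_merge` is a hypothetical name, the chunk counts are a plain dictionary, and the stored records are `(word, count)` tuples already sorted by word. In the app the second stream would come from Core Data rather than a list.

```python
def tandem_merge(chunk_counts, stored):
    """Merge new counts into sorted (word, count) records in one pass."""
    merged = []
    existing = iter(stored)
    current = next(existing, None)
    for word in sorted(chunk_counts):
        # Move along the second list until it matches or exceeds the word.
        while current is not None and current[0] < word:
            merged.append(current)
            current = next(existing, None)
        if current is not None and current[0] == word:
            # Match: merge the info.
            merged.append((word, current[1] + chunk_counts[word]))
            current = next(existing, None)
        else:
            # Exceeded (or ran out): add a new record.
            merged.append((word, chunk_counts[word]))
    # Keep whatever is left of the second list unchanged.
    while current is not None:
        merged.append(current)
        current = next(existing, None)
    return merged


chunk = {"aardvark": 2, "atom": 1, "azure": 4}
stored = [("aardvark", 5), ("add", 1), ("atom", 3), ("big", 2)]
print(tandem_merge(chunk, stored))
```

Only one record from each stream is held at a time, which is what keeps the memory footprint under control no matter how large the stored table grows.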





Implementation

Version 1: query CD for all entries in both tables and then loop along the results.

CD will incrementally fetch more results as you move along these arrays.

But memory grows without bound.





Implementation

Version 2: Only fetch N items at a time, and save and reset the DB before fetching more: ACCDEnumerator

    -[NSFetchRequest setFetchLimit:]
    -[NSFetchRequest setFetchOffset:]
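
The limit-and-offset idea behind Version 2 can be sketched as a Python generator. The names `paged` and `make_list_fetcher` are hypothetical, and a list slice stands in for a fetch request; the save-and-reset step between pages that the slide mentions is not modeled here.

```python
def paged(fetch, page_size):
    """Yield items one by one while only ever holding one page of them."""
    offset = 0
    while True:
        page = fetch(offset, page_size)
        if not page:
            return  # an empty page means the data is exhausted
        yield from page
        offset += len(page)


def make_list_fetcher(rows, log):
    """Build a fetch(offset, limit) function over a list, logging page sizes."""
    def fetch(offset, limit):
        page = rows[offset:offset + limit]
        log.append(len(page))
        return page
    return fetch


log = []
rows = list(range(7))
print(list(paged(make_list_fetcher(rows, log), 3)))
print(log)
```

The caller sees one continuous stream, while the log shows that no fetch ever returned more than `page_size` items.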


    - (id)nextObject {
        id nextObj = nil;
        if (self.currentItem && nil == [self.currentItem managedObjectContext]) {
            // the context was reset by another enumerator, recover
            self.currentItems = nil;
            self.currentItem = nil;
            self.enumerator = nil;
            self.nextEnumeratedItemOffset -= 1; // refetch this guy
            self.fetchOffset = self.nextEnumeratedItemOffset;
            [self _performNextFetch:NO];
        }
        if (self.enumerator) {
            nextObj = [self.enumerator nextObject];
            if (nil == nextObj) {
                self.currentItems = nil;
                [self _performNextFetch:YES];
                if (self.enumerator) {
                    nextObj = [self.enumerator nextObject];
                }
            }
        }
        if (nil != nextObj) {
            self.nextEnumeratedItemOffset += 1;
            self.currentItem = nextObj;
        }
        return nextObj;
    }


The Relations and other stats


    - (void)_finalizeCounts {
        // Establish all the relationships in the DB
        ACCDEnumerator *bigramEnumerator = [[ACCDEnumerator alloc]
            initWithManagedObjectContext:self.managedObjectContext];
        [bigramEnumerator setEntity:@"Bigram" sortKey:@"first_token" ascending:YES
                         comparison:@selector(localizedCompare:) predicate:nil];
        ACCDEnumerator *unigramEnumerator = [[ACCDEnumerator alloc]
            initWithManagedObjectContext:self.managedObjectContext];
        [unigramEnumerator setEntity:@"Token" sortKey:@"token" ascending:YES
                          comparison:@selector(localizedCompare:) predicate:nil];
        Bigram *bigram = [bigramEnumerator nextObject];
        Token *unigram = [unigramEnumerator nextObject];
        while (bigram != nil && unigram != nil) {
            NSComparisonResult comparison =
                [bigram.first_token localizedCompare:unigram.token];
            if (comparison == NSOrderedDescending) {
                // the unigram stream is behind: advance it and compare again
                unigram = [unigramEnumerator nextObject];
                continue;
            }
            if (comparison == NSOrderedAscending) {
                NSLog(@"ERROR unexpected mismatch: %@, %@", bigram.bigram,
                      unigram.token);
            } else {
                // match: link the bigram to its first token
                bigram.first = unigram;
                unigram.num_bigrams_as_first =
                    [NSNumber numberWithInt:[unigram.num_bigrams_as_first intValue] + 1];
            }
            bigram = [bigramEnumerator nextObject];
        }
    }




The other stats

num_bigrams_as_first/second tells you how many unique bigrams this token is a part of.

Imagine a token with a large count but small num_bigrams_as_second.

Example: “Francisco” mostly follows “San”.

Such tokens might be great for the visualization to focus on.
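
The statistic above can be sketched in Python by comparing how often a word occurs with how many distinct words precede it. The function name `predecessor_stats` and the sample sentence are assumptions for illustration; in the app these numbers would live in the Token records rather than in a dictionary.

```python
from collections import Counter, defaultdict


def predecessor_stats(words):
    """Map each word to (count, number of distinct preceding words)."""
    counts = Counter(words)
    predecessors = defaultdict(set)
    for first, second in zip(words, words[1:]):
        predecessors[second].add(first)
    return {word: (counts[word], len(predecessors[word])) for word in counts}


text = "san francisco is foggy but san francisco is fun and san francisco is big"
stats = predecessor_stats(text.split())
print(stats["francisco"])
print(stats["san"])
```

A word with a high count but very few distinct predecessors, like "francisco" here, is tightly bound to its neighbor and is a natural candidate for emphasis in the drawing.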


ACAnalysisViewController


Performance on iPad 2

• “A Tale of Two Cities” (140,000 words, 800KB)

• Run 1

  • 400k chunks

  • 5000 bigram limit

  • 10000 unigram limit during count

  • 2000 for both during linking

  • RAM 22M, time 2h16m.

• Run 2

  • 400k chunks

  • 5000 bigram limit

  • 10000 unigram limit

  • RAM 22M, time 1h


Performance on iPad 2


This is a baseline implementation.
