25.11.2014 Views

ANDREW N EDMONDS PHD, CONCEPT STRINGS LLC - LT-Innovate


The technology and uses of Concept Strings

ANDREW N EDMONDS PHD, CONCEPT STRINGS LLC


Words and Concepts

English has more than 40,000 words

Words have no obvious structure, other than alphabetical order

Handling data structures of this size is computationally expensive

If a piece of text has one intended meaning, then:

Behind that meaning there is a sequence of intended concepts

Concepts can be collected and are, presumably, universal

Concepts have structure (a set of trees):

Hypernymy – “is a kind of” relationships

Meronymy – “is a part of” relationships

Antonymy – “is the opposite of” relationships
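The three relationship types can be pictured as typed edges between concepts. The following is a minimal sketch only: the `ConceptGraph` class, its method names and the sample concepts are assumptions made for illustration, not the Concept Strings implementation.

```python
from collections import defaultdict


class ConceptGraph:
    """Toy store of typed relations between concept names (illustrative only)."""

    def __init__(self):
        # relation name -> source concept -> set of related concepts
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, relation, source, target):
        self.edges[relation][source].add(target)
        if relation == "antonym":
            # "is the opposite of" holds in both directions
            self.edges[relation][target].add(source)

    def ancestors(self, concept, relation="hypernym"):
        """All concepts reachable by repeatedly following one relation."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.edges[relation][stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = ConceptGraph()
graph.add("hypernym", "feline", "carnivore")   # feline "is a kind of" carnivore
graph.add("hypernym", "carnivore", "mammal")
graph.add("hypernym", "mammal", "animal")
graph.add("meronym", "paw", "feline")          # paw "is a part of" feline
graph.add("antonym", "hot", "cold")            # hot "is the opposite of" cold
```

Here `graph.ancestors("feline")` walks the hypernym tree upward, which is the kind of structure a flat alphabetical word list cannot offer.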


Where to get concepts?

One solution: WordNet (Princeton, 1998)

WordNets are available for most languages

Strictly, a WordNet is just a collection of synsets

Each synset represents one concept, tied to a part of speech (POS).

00122338 04 n 02 mailing 0 posting 0 004 @ 00121366 n 0000 + 01031256 v 0202 + 01437888 v 0101 + 01031256 v 0101 | the transmission of a letter; "the postmark indicates the time of mailing"
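The synset record shown above follows the layout of WordNet's data files: synset offset, lexicographer file number, part of speech, a hexadecimal word count, the words with their lex ids, a decimal pointer count, the pointers, and finally the gloss after a `|`. The sketch below parses that one record. The `Synset` class and `parse_data_line` function are hypothetical names, only two pointer symbols are named, and verb frames (which noun records do not carry) are not handled.

```python
from dataclasses import dataclass

# Two pointer symbols used in the record; the WordNet format defines more.
POINTER_NAMES = {"@": "hypernym", "+": "derivationally related form"}


@dataclass
class Synset:
    offset: str
    pos: str
    words: list
    pointers: list   # (relation, target offset, target POS, source/target field)
    gloss: str


def parse_data_line(line):
    body, _, gloss = line.partition("|")
    f = body.split()
    offset, pos = f[0], f[2]            # f[1] is the lexicographer file number
    w_cnt = int(f[3], 16)               # word count: two hexadecimal digits
    words = [f[4 + 2 * i] for i in range(w_cnt)]   # each word is followed by a lex id
    p = 4 + 2 * w_cnt
    p_cnt = int(f[p])                   # pointer count: three decimal digits
    pointers = []
    for i in range(p_cnt):
        symbol, target, target_pos, src_tgt = f[p + 1 + 4 * i: p + 5 + 4 * i]
        pointers.append((POINTER_NAMES.get(symbol, symbol), target, target_pos, src_tgt))
    return Synset(offset, pos, words, pointers, gloss.strip())


record = ('00122338 04 n 02 mailing 0 posting 0 004 @ 00121366 n 0000 '
          '+ 01031256 v 0202 + 01437888 v 0101 + 01031256 v 0101 '
          '| the transmission of a letter; "the postmark indicates the time of mailing"')
synset = parse_data_line(record)
```

After parsing, `synset.words` holds the two words that share this concept and `synset.pointers` holds its four links to other synsets, the first of which is the hypernym.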


Mapping words to concepts is ambiguous

Each sentence has multiple possible mappings / readings

A new data structure is needed that encodes this ambiguity

The Concept String:

Words: The cat sat on the mat

POS: article noun verb prep. article noun

Concepts (candidates for each word):

The (article): article concept

cat (noun): feline / whip / tractor

sat (verb): be seated / ride / sit a baby

on (prep.): on prep. concept

the (article): article concept

mat (noun): carpet / picture mount / tangled mess
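One way to picture a Concept String is as a sequence with one slot per word, each slot holding every concept the word could stand for. The sketch below is an assumption about the shape of the data, not the author's actual structure. The grouping of alternative concepts under "cat", "sat" and "mat" is inferred from the slide layout, and the function names are made up for illustration.

```python
from itertools import product
from math import prod

# One entry per word: (word, POS, candidate concepts).
concept_string = [
    ("The", "article", ["article concept"]),
    ("cat", "noun", ["feline", "whip", "tractor"]),
    ("sat", "verb", ["be seated", "ride", "sit a baby"]),
    ("on", "prep.", ["prep. concept"]),
    ("the", "article", ["article concept"]),
    ("mat", "noun", ["carpet", "picture mount", "tangled mess"]),
]


def count_readings(cs):
    """Number of distinct readings the concept string encodes."""
    return prod(len(candidates) for _, _, candidates in cs)


def readings(cs):
    """Lazily yield every reading as a tuple of concepts, one per word."""
    return product(*(candidates for _, _, candidates in cs))


total = count_readings(concept_string)
first = next(readings(concept_string))
```

Six slots are enough to encode all 27 readings of this sentence (3 × 3 × 3), which is why storing the ambiguity compactly is cheaper than listing each reading.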


Disambiguation

Two main sources of uncertainty:

Part of speech

Intended concepts

WordNet contains nouns, adjectives, verbs and adverbs

By adding conjunctions, articles, pronouns, etc., we can create an in-memory directed graph of English.

This large data structure can be extended to include slang, spellings, etc.

From this graph we can work out all the possible POS for each word

Using frequency information (Concept N-Grams) produced by mining a large sample corpus, we can reduce POS and concept ambiguity
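To illustrate how concept n-gram frequencies can narrow down the readings, the sketch below scores each candidate reading with concept bigram counts and keeps the highest-scoring one. The counts are invented for the example, the scoring rule is an assumption, and function words are left out for brevity. A real system would mine the counts from a corpus and would likely use dynamic programming rather than enumerating every reading.

```python
from itertools import product

# Candidate concepts for "cat", "sat" and "mat".
candidates = [
    ["feline", "whip", "tractor"],
    ["be seated", "ride", "sit a baby"],
    ["carpet", "picture mount", "tangled mess"],
]

# Hypothetical concept bigram counts; real values would come from corpus mining.
bigram_counts = {
    ("feline", "be seated"): 120,
    ("be seated", "carpet"): 80,
    ("tractor", "ride"): 15,
    ("whip", "ride"): 4,
    ("be seated", "tangled mess"): 2,
}


def score(reading):
    """Product of (count + 1) over adjacent concept pairs.

    Adding one keeps an unseen pair from zeroing out the whole reading.
    """
    total = 1
    for pair in zip(reading, reading[1:]):
        total *= bigram_counts.get(pair, 0) + 1
    return total


def best_reading(candidates):
    """Brute-force search over all readings for the highest score."""
    return max(product(*candidates), key=score)


best = best_reading(candidates)
```

With these made-up counts the reading "feline, be seated, carpet" wins, since both of its adjacent pairs are frequent.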


What can we do with this?

We’ve created two useful data structures:

Concept String Suffix Trees: used to locate matching texts in a corpus

Concept Trees: used to rapidly scan text for matching templates expressed as concept strings, but also containing wild cards

These two structures are computationally efficient (linear or better) and “big data” friendly.
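As a rough illustration of suffix-based lookup over concepts, the sketch below indexes a short, already disambiguated concept sequence and finds every position where a query sequence occurs. It uses a simple suffix array instead of a suffix tree, and its naive construction is not linear time, so it shows the idea only and not the performance claimed above. The function names and the sample corpus are assumptions.

```python
def build_suffix_array(concepts):
    """Start positions of all suffixes, sorted by the suffix they begin."""
    return sorted(range(len(concepts)), key=lambda i: concepts[i:])


def find_all(concepts, suffix_array, query):
    """Positions where query occurs, via binary search over the sorted suffixes."""
    n = len(query)

    def prefix(i):
        return concepts[i:i + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        if prefix(suffix_array[mid]) < query:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and prefix(suffix_array[lo]) == query:
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


# A tiny made-up corpus, written as a sequence of concepts.
corpus = ["feline", "be seated", "carpet", "tractor", "ride",
          "feline", "be seated", "tangled mess"]
index = build_suffix_array(corpus)
positions = find_all(corpus, index, ["feline", "be seated"])
```

Because the index is built over concepts and not words, two passages that use different words for the same concepts would share the same entries.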

Example: …The cat, which had previously entered, sat on the mat for several minutes…

Template: feline, wildcard, be seated

Match!
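The wildcard match in this example can be sketched as follows. This is a naive backtracking matcher over a single disambiguated reading, not the Concept Tree structure itself. The `WILDCARD` marker, the function names, and every concept name other than "feline" and "be seated" are assumptions for illustration.

```python
WILDCARD = "*"   # assumed to match any run of zero or more concepts


def matches_at(template, concepts, start):
    """True if the template matches the concepts beginning exactly at start."""
    def walk(t, c):
        if t == len(template):
            return True
        if template[t] == WILDCARD:
            # Try every possible length for the skipped run.
            return any(walk(t + 1, k) for k in range(c, len(concepts) + 1))
        return (c < len(concepts)
                and concepts[c] == template[t]
                and walk(t + 1, c + 1))
    return walk(0, start)


def find_matches(template, concepts):
    """All start positions at which the template matches."""
    return [i for i in range(len(concepts)) if matches_at(template, concepts, i)]


# Concepts for "The cat, which had previously entered, sat on the mat".
text = ["article concept", "feline", "relative pronoun", "have", "previously",
        "enter", "be seated", "prep. concept", "article concept", "carpet"]
template = ["feline", WILDCARD, "be seated"]
hits = find_matches(template, text)
```

The template matches at the position of "feline" because the wildcard absorbs the four concepts of the relative clause before "be seated".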


Applications

Searching for phrases and sentences with a selected meaning across text streams or databases.

Indexing a corpus so that sections with similar meaning are easily detected.

Uses are in:

Legal discovery

Compliance: currently in trials for a legal compliance application for a UK company

Homeland security

Document fragment matching and searching: successful long-term collaboration with a legal publisher in South America.


Business possibilities

Concept Strings’ business model is to:

License technology (the Dolby/ARM model)

Help new companies to exploit the technology for vertical markets

Develop prototypes, pilots or full products for customers

Usually use Microsoft technologies/Azure cloud to do this

Not require users to credit our technology in their products

Contacting us:

Email: andy@conceptstrings.com

Tel: +44 (0)203 289 0580

Skype: Scientio

Twitter: ConceptStrings

LinkedIn: http://uk.linkedin.com/in/drandrewedmonds
