ANDREW N EDMONDS PHD, CONCEPT STRINGS LLC - LT-Innovate
The technology and uses of Concept Strings
ANDREW N EDMONDS PHD, CONCEPT STRINGS LLC
Words and Concepts
English has more than 40,000 words
Words have no obvious structure, except alphabetical order
Handling data structures of this size is computationally expensive
If a piece of text has one intended meaning, then behind that meaning there is a sequence of intended concepts
Concepts can be collected and are, presumably, universal
Concepts have structure (a set of trees):
Hypernymy – “is a kind of” relationships
Meronymy – “is a part of” relationships
Antonymy – “is the opposite of” relationships
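
The "is a kind of" relationship can be illustrated with a minimal sketch. The hierarchy below is a hand-made toy, not WordNet data, and the names `HYPERNYM`, `ancestors` and `is_a_kind_of` are hypothetical helpers chosen for this example.

```python
# Hypothetical toy hierarchy: each concept maps to its hypernym ("is a kind of").
# These entries are illustrative and are not taken from WordNet.
HYPERNYM = {
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "dog": "canine",
    "canine": "carnivore",
}


def ancestors(concept):
    """Return the chain of hypernyms from a concept up to the root of its tree."""
    chain = []
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        chain.append(concept)
    return chain


def is_a_kind_of(concept, candidate):
    """True if `candidate` appears anywhere above `concept` in the tree."""
    return candidate in ancestors(concept)
```

Because every concept has at most one parent here, walking upwards is a simple loop; a real concept inventory may allow several parents per concept.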
Where to get concepts?
One solution: WordNet (Princeton, 1998)
WordNets are available for most languages
Strictly, a WordNet is just a collection of synsets
Each synset represents one concept, tied to a part of speech (POS).
00122338 04 n 02 mailing 0 posting 0 004 @ 00121366 n 0000 + 01031256 v 0202 + 01437888 v 0101 + 01031256 v 0101 | the transmission of a letter; "the postmark indicates the time of mailing"
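
A record like the one above can be split into its fields with a short parser. This is a minimal sketch that assumes the standard WordNet data file layout (offset, lexicographer file number, POS, a two-digit hexadecimal word count, word and lex_id pairs, a decimal pointer count, four-field pointers, then the gloss after `|`). The class and function names are hypothetical, and verb frame fields are not handled.

```python
from dataclasses import dataclass


@dataclass
class Pointer:
    symbol: str         # e.g. "@" (hypernym) or "+" (derivationally related form)
    offset: str         # offset of the target synset
    pos: str            # part of speech of the target synset
    source_target: str  # four hex digits: source and target word numbers


@dataclass
class Synset:
    offset: str
    lex_filenum: str
    pos: str
    words: list
    pointers: list
    gloss: str


def parse_data_line(line):
    """Parse one synset record (sketch; ignores verb frames)."""
    fields_part, _, gloss = line.partition("|")
    f = fields_part.split()
    offset, lex_filenum, pos = f[0], f[1], f[2]
    w_cnt = int(f[3], 16)          # word count is written in hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append(f[i])         # f[i + 1] is the lex_id, skipped here
        i += 2
    p_cnt = int(f[i])              # pointer count is written in decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        pointers.append(Pointer(*f[i:i + 4]))
        i += 4
    return Synset(offset, lex_filenum, pos, words, pointers, gloss.strip())


RECORD = (
    '00122338 04 n 02 mailing 0 posting 0 004 @ 00121366 n 0000 '
    '+ 01031256 v 0202 + 01437888 v 0101 + 01031256 v 0101 '
    '| the transmission of a letter; "the postmark indicates the time of mailing"'
)
synset = parse_data_line(RECORD)
```

For this record, `synset.words` holds the two words of the synset and the first pointer (`@`) gives the offset of its hypernym.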
Mapping words to concepts is ambiguous
Each sentence has multiple possible mappings / readings
A new data structure is needed that encodes this ambiguity
The Concept String:
Words     The              cat      sat         on                the              mat
POS       article          noun     verb        prep.             article          noun
Concepts  Article concept  feline   be seated   on prep. concept  Article concept  carpet
                           whip     ride                                           picture mount
                           tractor  sit a baby                                     tangled mess
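
The slide's example can be written down as a simple data structure. This is a minimal sketch, assuming a concept string is a sequence of slots, one per word, each holding every candidate concept. The representation and the helper names `reading_count` and `readings` are assumptions made for illustration, not the author's implementation.

```python
from itertools import product
from math import prod

# One slot per word: (surface word, candidate concepts), as in the slide.
concept_string = [
    ("The", ["article"]),
    ("cat", ["feline", "whip", "tractor"]),
    ("sat", ["be seated", "ride", "sit a baby"]),
    ("on", ["on (prep.)"]),
    ("the", ["article"]),
    ("mat", ["carpet", "picture mount", "tangled mess"]),
]


def reading_count(cs):
    """Number of distinct readings: the product of the slot sizes."""
    return prod(len(candidates) for _, candidates in cs)


def readings(cs):
    """Lazily enumerate every reading as a tuple of concepts."""
    return product(*(candidates for _, candidates in cs))
```

Three ambiguous words with three candidates each give 3 × 3 × 3 = 27 readings for this six-word sentence, which is why the ambiguity is kept inside one compact structure rather than expanded.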
Disambiguation
Two main sources of uncertainty:
Part of speech
Intended concepts
WordNet contains nouns, adjectives, verbs and adverbs
By adding conjunctions, articles, pronouns, etc., we can create an in-memory directed graph of English.
This large data structure can be extended to include slang, spellings, etc.
We can work out all the possible POS for each word from this graph
Using frequency information (concept n-grams) produced by mining a large sample corpus, we can reduce POS and concept ambiguity
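
The slides do not say how the n-gram frequencies are applied. One plausible approach is Viterbi-style dynamic programming over concept bigrams, sketched below. The counts in `BIGRAM_COUNTS` are invented for illustration only, and the function names are hypothetical.

```python
import math

# Invented concept bigram counts, for illustration only (not mined from a corpus).
BIGRAM_COUNTS = {
    ("article", "feline"): 50,
    ("article", "whip"): 5,
    ("article", "tractor"): 8,
    ("feline", "be seated"): 30,
    ("feline", "ride"): 1,
    ("tractor", "ride"): 2,
    ("be seated", "on"): 40,
    ("ride", "on"): 25,
    ("on", "article"): 90,
    ("article", "carpet"): 20,
    ("article", "tangled mess"): 3,
}


def score(a, b):
    """Add-one smoothed log count of the concept bigram (a, b)."""
    return math.log(BIGRAM_COUNTS.get((a, b), 0) + 1)


def disambiguate(slots):
    """Pick the reading with the highest total bigram score."""
    # best[c] = (score, path) of the best path that ends in concept c
    best = {c: (0.0, [c]) for c in slots[0]}
    for candidates in slots[1:]:
        new_best = {}
        for c in candidates:
            prev, (s, path) = max(
                best.items(), key=lambda kv: kv[1][0] + score(kv[0], c)
            )
            new_best[c] = (s + score(prev, c), path + [c])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]


slots = [
    ["article"],
    ["feline", "whip", "tractor"],
    ["be seated", "ride", "sit a baby"],
    ["on"],
    ["article"],
    ["carpet", "picture mount", "tangled mess"],
]
best_reading = disambiguate(slots)
```

With these toy counts the selected reading is article, feline, be seated, on, article, carpet. The work grows with the number of slots times the square of the candidates per slot, so the full set of readings is never enumerated.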
What can we do with this?
We’ve created two useful data structures:
Concept String Suffix Trees
Used to locate matching texts in a corpus
Concept Trees
Used to rapidly scan text for matching templates expressed as concept strings, which may also contain wildcards
These two structures are computationally efficient (linear or better) and “big data” friendly.
Example text: …The cat, which had previously entered, sat on the mat for several minutes…
Template: feline, wildcard, be seated
Result: Match!
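
The wildcard match above can be reproduced with a naive backtracking matcher. This is a sketch of the matching idea only: it is not the Concept Tree structure and does not attempt the linear-time behaviour the slides describe. The concept labels for "which had previously entered" are placeholders, and `WILDCARD`, `match_at` and `find_matches` are hypothetical names.

```python
WILDCARD = "*"


def match_at(template, slots, t=0, s=0):
    """True if template[t:] matches the text starting at slot index s."""
    if t == len(template):
        return True
    if template[t] == WILDCARD:
        # A wildcard absorbs zero or more slots.
        return any(
            match_at(template, slots, t + 1, k) for k in range(s, len(slots) + 1)
        )
    # A concept matches if it is among the slot's candidate concepts.
    if s < len(slots) and template[t] in slots[s]:
        return match_at(template, slots, t + 1, s + 1)
    return False


def find_matches(template, slots):
    """Return every slot index at which the template matches."""
    return [i for i in range(len(slots)) if match_at(template, slots, 0, i)]


# "The cat, which had previously entered, sat on the mat" as candidate sets.
text_slots = [
    {"article"},
    {"feline", "whip", "tractor"},
    {"which"},        # placeholder concept labels for the relative clause
    {"have"},
    {"previously"},
    {"enter"},
    {"be seated", "ride", "sit a baby"},
    {"on"},
    {"article"},
    {"carpet", "picture mount", "tangled mess"},
]
hits = find_matches(["feline", WILDCARD, "be seated"], text_slots)
```

Matching against sets of candidates means the template can hit even before the text has been fully disambiguated; here the template matches starting at the slot for "cat".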
Applications
Searching for phrases and sentences with a selected meaning across text streams or databases.
Indexing a corpus so that sections with similar meaning are easily detected.
Uses are in:
Legal discovery
Compliance – being trialled for a legal compliance application for a UK company
Homeland security
Document fragment matching and searching – successful long-term collaboration with a legal publisher in South America.
Business possibilities
Concept Strings’ business model is to:
License technology (the Dolby/ARM model)
Help new companies to exploit the technology for vertical markets
Develop prototypes, pilots or full products for customers
Usually use Microsoft technologies/Azure cloud to do this
Not require users to credit our technology in their products
Contacting us:
Email: andy@conceptstrings.com
Tel: +44 (0)203 289 0580
Skype: Scientio
Twitter: ConceptStrings
LinkedIn: http://uk.linkedin.com/in/drandrewedmonds