Exploiting Corpora with Sketch Engine 

Miloš Jakubíček 

Lexical Computing Ltd., Brighton, United Kingdom 

milos.jakubicek@sketchengine.co.uk 

Natural Language Processing Centre 

Faculty of Informatics, Masaryk University, Brno, Czech Republic 

jak@fi.muni.cz 

CLARA course on Multilingual Lexical Resources and Tools, 

20-23 June 2011, Bergen 


Contents 

1 Current Trends in Corpus Processing 

2 Sketch Engine 

3 Finding Collocations 

4 Using CQL 




Organisation 

Schedule 

Building & Searching Corpora 

Exercises 

Word Sense & Word Sketch 

Exercises 

If you don’t mind, we will mix lectures and exercises 

We will be using the Sketch Engine installation at http://the.sketchengine.co.uk – you need to register for a trial account

Ask questions immediately 

Do get back to me if you would like to focus on anything in particular

Download slides from 

http://nlp.fi.muni.cz/~xjakub/clara.pdf 




Today’s Corpora I 

billions of tokens 

complex multi-level multi-value annotation 

wide range of languages 

growing demand on complex searching – moving from 

morphology to syntax and semantics 

search API for automatic information retrieval and 

post-processing in particular applications needed 

parallel distributed processing needed 




Today’s Corpora II 

What is the key property of a modern corpus? 




Today’s Corpora III 

Yes, it’s the size. 




Today’s Corpora IV 

Why are quantitative aspects so important? 

Law of large numbers 

Most language phenomena follow the Zipfian distribution 




Zipf’s law I 

[Figure: plot of zipf(x) for x from 0 to 1000, vertical axis from 0 to 100]




Zipf’s law II 

may be simplified to inductive definition: 

Zipf’s law (simplified) 

frequency of the n-th element: $f_n \approx \frac{1}{n} \cdot f_1$

⇒ frequency is inversely proportional to the rank according to 

frequency 

⇒ one needs really large corpora to capture all the variety of 

many language phenomena 
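
The consequence for corpus size can be made concrete with a small calculation. The following is a minimal Python sketch that assumes the simplified law holds exactly; the function names and the sample figure (a top item covering 6% of all tokens) are illustrative assumptions, not measurements from any corpus.

```python
def zipf_frequency(rank: int, top_frequency: float) -> float:
    """Expected frequency of the item at `rank` under f_n = f_1 / n."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return top_frequency / rank


def tokens_needed(rank: int, min_occurrences: int,
                  top_relative_frequency: float) -> float:
    """Corpus size at which the item at `rank` is expected to occur
    `min_occurrences` times, given the relative frequency of the top item."""
    relative = zipf_frequency(rank, top_relative_frequency)
    return min_occurrences / relative


# Assumed for illustration: the most frequent item covers 6% of all tokens.
for rank in (100, 10_000, 1_000_000):
    size = tokens_needed(rank, min_occurrences=20, top_relative_frequency=0.06)
    print(f"rank {rank:>9,}: about {size:,.0f} tokens")
```

Under these assumptions the required corpus size grows linearly with the rank, so observing rare items a useful number of times quickly pushes the corpus into the billions of tokens.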




Zipf’s law III 

Substantives + Verb tags on the Brown corpus 




Size is not everything . . . 

Why are qualitative aspects so important – well this can’t be really 

a question, right? 

the web is the most widely used data source for obtaining enough source texts – "web as corpus"

web is garbage (by definition) 

garbage as corpus? 

building corpora from web requires extensive post-processing 




Building corpora from web 

crawling the data 

detecting language, detecting encoding 

cleaning boilerplate (metadata, ads, navigation etc.): jusText 

(web demo) 

deduplicating: onion 

see Pomikálek (2011) PhD thesis 




Deduplicating 

quite straightforward for full duplicates, but that’s not enough 

people do not only copy full documents 

they copy just parts of the document: orig vs. copy 

they copy and modify: orig vs. mod 

they copy and extend: orig vs. ext 
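
The three cases above can be told apart from unrelated text by comparing word n-grams rather than whole documents. The sketch below is a simplified illustration of that general idea and is not the algorithm implemented in onion; the function names, the n-gram length of 3, and the sample sentences are all assumptions made for the example.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams in `text` (lowercased, whitespace-tokenised)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def seen_ratio(candidate: str, original: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that already occur in `original`."""
    cand = shingles(candidate, n)
    if not cand:
        return 0.0
    return len(cand & shingles(original, n)) / len(cand)


orig = "the quick brown fox jumps over the lazy dog near the river bank"
copy = "the quick brown fox jumps over the lazy dog"                      # partial copy
mod = "the quick brown fox leaps over the lazy dog near the river bank"   # one word changed
ext = orig + " while the sun sets slowly"                                 # copy with added text

for name, text in (("copy", copy), ("mod", mod), ("ext", ext)):
    print(name, round(seen_ratio(text, orig), 2))
```

A document would then be treated as a near-duplicate when the ratio exceeds some threshold; choosing that threshold and the n-gram length is a tuning decision this sketch leaves open.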




Building and searching corpora 

standard IT solution for storing data is a relational database, 

i. e. a table (rows are elements of the relation) 

not suitable for storing corpora at all (text doesn’t have 

relational backbone, nor most annotation types) 

requirements for specific database management system 

and appropriate user interface on top of it 




Introducing Sketch Engine (SkE) 

since 2003 as a commercial variant of the Manatee/Bonito 

corpus management system (Rychlý 1999) 

named after one of the key features – word sketches (to be 

introduced) 

most components released as open-source NoSketch Engine, 

following pay-per-service principle 




SkE architecture I 




SkE architecture II 

Components: 

Finlib (C++) – Fast Indexing Library (core indexing and data 

stream processing module) 

Manatee (C++) – Module for corpus encoding, searching and 

managing 

Bonito (Python) – Web user interface for Manatee 

Corpus Architect (Python) – Web user interface for managing 

corpora 




Sketch Engine infrastructure 

= servers and their equipment 

what do we need? 

most companies either store lots of data but do not need fast access (e.g. backups, logs), or store a fairly small amount of data that must be accessible quickly (information systems, databases)

we need both, plus lots of memory and a fair number of CPU cores

we need to manage concurrent access 




Sketch Engine in numbers 

by 2011: 

> 8500 registered users 

191 preloaded corpora, 38G tokens, 47 languages 

4,335 user corpora, 2.5G tokens 

ca. 30,000 requests per day 




SkE storage size 




SkE 2007 




SkE 2008 




SkE 2010 




SkE 2011 




SkE 2011 




Searching corpora – selected topics 

corpus querying using CQL – Corpus Query Language 

finding collocates 




Finding collocates 

many association scores available 

most of which do not have linguistically plausible 

interpretation 

most of which do not scale well according to corpus size ⇒ 

cannot be used for cross-corpora comparisons 

see e. g. Rychlý (2008), Evert (2004) 




logDice I 

$\mathrm{logDice} = 14 + \log_2 D = 14 + \log_2 \frac{2 \cdot f_{xy}}{f_x + f_y}$

yet another association score for collocations 

simple, scales well for corpus size 

has linguistically plausible interpretation 

based on the well known Dice association score (D above) 

see Rychlý (2008) sample comparison to T-score, MI-score 

and others 




logDice II 

Formal properties of logDice: 

Theoretical maximum is 14 – when X always co-occurs with Y 

and vice versa. Usually the value is less than 10. 

Value of 0 means there is less than 1 co-occurrence of X and 

Y per 16000 X or 16000 Y. We can say that negative values 

mean there is no statistical significance of XY co-occurrence. 

Comparing two scores, plus 1 point means 2 times more co-occurrences, plus 7 points means roughly 100 times more co-occurrences.

The score is independent of the corpus size. The score 

combines relative frequencies of XY in relation to X and Y. 
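
These properties follow directly from the formula and can be checked numerically. The snippet below is a minimal Python sketch; the function name `log_dice` and the sample frequencies are made up for illustration.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y))."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# X and Y always co-occur: the theoretical maximum of 14.
print(log_dice(f_xy=500, f_x=500, f_y=500))

# Doubling the co-occurrence count adds one point to the score.
print(log_dice(200, 10_000, 30_000) - log_dice(100, 10_000, 30_000))

# Multiplying all counts by 10 (a ten times larger corpus with the same
# proportions) leaves the score unchanged.
print(log_dice(100, 10_000, 30_000), log_dice(1_000, 100_000, 300_000))
```

The corpus size never appears in the formula, which is why scores from corpora of different sizes can be compared directly.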




CQL 

= Corpus Query Language (Christ and Schulze, 1994) 

positions and positional attributes: [attr="value"] 

structures and structural attributes: 

example: 

[word=".*ing" & tag="V.*"] 

[word=".*ing"] | [tag="V.*"] 

 

established a within query: 

[tag="N.*"]+ within <s>

and alternative meet/union query: 

(meet [lemma="take"] [tag="N.*"] -5 +5) 

(union (meet ...) (meet ...)) 




CQL in Manatee/Bonito 

enhancements and differences compared to the original CQL syntax

within and containing 

meet/union (sub)query 

inequality comparisons 

frequency function 




within/containing queries 

searching for particles: 

[tag="PR.*"] within [tag="V.*"] [tag="AT0"]? [tag="AJ0"]* [tag="(PR.?|N.*)"] [tag="PR.*"] within <s/>

searching for a Czech idiom “hnout někomu žlučí” (“to get 

somebody’s goat”): 

word-by-word translated as: 

hnout “move” [V, infinitive] 

někomu “somebody” [N, dative] 

žlučí “bile” [N, instrumental]. 

<s/> containing [lemma="hnout"] containing [tag=".*c3.*"] containing [word="žlučí"]
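
The effect of chaining several containing clauses, as in the idiom query, can be illustrated on token spans. This is a naive Python sketch of the semantics only and says nothing about how Manatee evaluates such queries; the span representation, function names and hit positions are hypothetical.

```python
Span = tuple[int, int]  # (first token position, last token position), inclusive


def within(matches: list[Span], structures: list[Span]) -> list[Span]:
    """Keep the matches that lie entirely inside at least one structure."""
    return [m for m in matches
            if any(s[0] <= m[0] and m[1] <= s[1] for s in structures)]


def containing(structures: list[Span], matches: list[Span]) -> list[Span]:
    """Keep the structures that enclose at least one match."""
    return [s for s in structures
            if any(s[0] <= m[0] and m[1] <= s[1] for m in matches)]


sentences = [(0, 4), (5, 9), (10, 14)]   # hypothetical sentence spans
hnout = [(1, 1), (11, 11)]               # hypothetical hits for [lemma="hnout"]
dative = [(2, 2), (6, 6)]                # hypothetical hits for [tag=".*c3.*"]
zluci = [(3, 3), (12, 12)]               # hypothetical hits for [word="žlučí"]

# Chained containing: sentences that contain all three parts, in any order.
result = containing(containing(containing(sentences, hnout), dative), zluci)
print(result)
```

Because each clause only tests membership in the sentence, the three parts of the idiom may appear in any order and at any distance, which suits a language with free word order such as Czech.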




within/containing queries 

structure boundaries: begin: <s>, whole structure: <s/>, end: </s>

changes: within <s> is no longer allowed; use within <s/>




meet/union queries 

combined with regular query: 

<s/> containing (meet [lemma="have"] [tag="P.*"] -5 5)

containing (meet [tag="N.*"] [lemma="blue"]) 

changes: meet/union queries can be used at any position, they can contain labels, and the MU keyword is no longer required (it is deprecated):

(meet 1:[] 2:[]) & 1.tag = 2.tag 
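
The window semantics of meet and the combination performed by union can be sketched on plain position lists. The Python below is an illustration only: it assumes that meet returns the positions of the first constraint and that the window bounds are inclusive. Both points, like the function names and sample positions, are assumptions of this sketch and not a description of Manatee internals.

```python
def meet(first: list[int], second: list[int], left: int, right: int) -> list[int]:
    """Positions from `first` that have some position from `second`
    at an offset between `left` and `right` (both inclusive)."""
    second_set = set(second)
    return [p for p in first
            if any(p + off in second_set for off in range(left, right + 1))]


def union(*results: list[int]) -> list[int]:
    """Sorted union of several position lists."""
    return sorted(set().union(*results))


take_positions = [3, 20, 41]   # hypothetical hits for [lemma="take"]
noun_positions = [6, 30, 47]   # hypothetical hits for [tag="N.*"]

# Rough counterpart of (meet [lemma="take"] [tag="N.*"] -5 5)
print(meet(take_positions, noun_positions, -5, 5))
print(union([3], [1, 3, 9]))
```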




Inequality comparisons 

former comparisons allowed only equality and its negation: 

[attr="value"] [attr!="value"] 

inequality comparisons implemented: [attr<="value"] [attr>="value"] [attr!<="value"] [attr!>="value"]

intended usage: 

[tag="AJ.*"] [tag="NN.*"] within ="2009"> 

sophisticated comparison performed on the attribute value: 


Fixed string comparisons 

normally the CQL values are regular expressions 

sometimes this is not desirable (batch processing needs 

escaping of metacharacters) 

new == and !== operators introduced for fixed-string comparison

no escaping needed except for ’"’ and ’\’

examples: ".", "$", "~" match a single dot, dollar sign and tilde, respectively; "\n" matches a backslash followed by the character n
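
The difference between the two kinds of comparison, and the escaping burden that fixed strings remove from batch processing, can be shown with Python's standard `re` module. This is an analogy only, not Manatee code; it assumes that a regular-expression value has to match the whole attribute value, and the helper names are invented for the example.

```python
import re


def regex_equals(value: str, pattern: str) -> bool:
    """Regex comparison: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None


def fixed_equals(value: str, literal: str) -> bool:
    """Fixed-string comparison: plain equality, no metacharacters."""
    return value == literal


# As a regular expression, "." matches any single character ...
print(regex_equals("a", "."))
# ... whereas as a fixed string it matches only a literal dot.
print(fixed_equals("a", "."))
print(fixed_equals(".", "."))
# Without a fixed-string operator, a batch job has to escape each value itself.
print(regex_equals("$", re.escape("$")))
```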




Frequency function 

a frequency constraint allowed in the global conditions part of 

CQL: 

1:[tag="PP.*"] 2:[tag="NN.*"] & f(1.word) > 10 
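
What such a global condition does to a result set can be sketched as a filter over labelled matches. The Python below is a simplified illustration that assumes f(1.word) denotes the corpus frequency of the word form at label 1; the function name and the toy data are hypothetical.

```python
from collections import Counter


def filter_by_frequency(tokens: list[str], matches: list[tuple[int, int]],
                        threshold: int = 10) -> list[tuple[int, int]]:
    """Keep matches whose label-1 token has a corpus frequency above `threshold`."""
    freq = Counter(tokens)
    return [(p1, p2) for p1, p2 in matches if freq[tokens[p1]] > threshold]


# Toy corpus: "it" occurs 12 times, every other word once.
tokens = ["it"] * 12 + ["works", "she", "runs"]
# Hypothetical (label 1, label 2) position pairs found by the positional part.
matches = [(11, 12), (13, 14)]

print(filter_by_frequency(tokens, matches))
```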




Performance evaluation 

Table: Query performance evaluation. Corpora legend: ◦ BNC (110M tokens), • BiWeC (version with 9.5G tokens), ∗ Czes (1.2G tokens)

corpus  query             # of results   time (m:s)
◦       [lemma="time"]         179,321         0.07
◦       [lemma="t.*"]       14,660,881         3.12
◦       Ex: particles        1,219,973        33.36
•       Ex: particles       97,671,485     32:26.48
∗       Ex: idioms                  66       1:6.86
◦       Ex: meet/union               3         8.47
•       Ex: meet/union           1,457      7:13.12




Thank you! 

Thank you for your attention!
