Views
2 years ago

Corpus-Driven Bilingual Lexicon Extraction - University of Malta

Corpus-Driven Bilingual Lexicon Extraction - University of Malta

Corpus-Driven Bilingual Lexicon Extraction 81In order to develop this idea, we will regard a text S as a set of segments. Segments are simplylinguistic entities of a certain type, where the types vary according to the level of analysis. Thus ata given level of analysis, a segment might be a sequence of words, a sequence of paragraphs, or anentire chapter. Furthermore, other relations might be defined over a set of such segments. Withina sentence text, for example, sequence-of-word segments will be ordered. However, for the presentwe will not be concerned with either the internal character of segments nor with the relations thatmight exist over them. The important intuition is that the text is a set.3 AlignmentWithin the framework under discussion, bilingual knowledge is derived from correspondences betweena text S and its translation T, that is, between the respective segments in S = {s 1 , s 2 , . . . , s n }and T = {t 1 , t 2 , . . . , t m }. These correspondences are expressed using the notion of alignment, definedby Isabelle and Simard [4] as a subset of the Cartesian product S × T .To give an example, supposeandA particular alignment might beS = {s 1 , s 2 , s 3 , s 4 , s 5 }T = {t 1 , t 2 , t 3 , t 4 , t 5 }A = {(s 1 , t 1 ), (s 2 , t 2 ), (s 2 , t 3 ), (s 3 , t 4 ), (s 4 , t 5 ), (s 5 , t 5 )}This associates the segment s 1 with segment t 1 ; the segment s 2 with segments t 2 and t 3 ; thesegment s 3 to segment t 4 ; and the segments s 4 and s 5 to the same segment t 5 .A bilingual text, or “bitext” is a triple (A,S,T) where A is an alignment, S is a text, and T is itstranslation.Input to an alignment system is normally the bitext ({(s,t)},{s},{t}), i.e. comprising a single, largetext segment, its translation , and the alignment that is the one-to-one mapping between the two.In order to derive bilingual lexical information from this initial input, a bitext at the next levelmust be constructed by– identifying the subsegments of S and T– computing the best alignment of the subsegments of S and T.This process is repeated until subsegmentation reaches the word level where, potentially everysubset of words in the source text can be aligned with every subset of words in the target text.At this point we inspect the best alignment and count occurrences of similar pairs. The highestranking pairs are frequently occurring n-gram-to-n-gram translations and therefore good candidatesfor the bilingual lexicon.Of course, this is a deliberate simplification of the process. When, at a given level, the number ofsubsegments is large, the cost of computing the possible alignments of all possible subsegments isprohibitively high so in practice certain heuristic assumptions must be incorporated to reduce thiscost, e.g.

82 Michael Rosner– Subsegments are defined in such a way that that the number of them at a given level isreasonable. This is achieved by staging the alignment process to deal, in order, with alignmentsof sections, paragraphs, sentences, and finally words.– Limits are imposed on the set of candidate alignments that will actually be considered. Whenperforming subsentence alignment, for example, one can put an upper bound on their length.Hence, most work in this area in fact deals with single words. We shall see other examplesbelow.3.1 Paragraph AlignmentTypically, segmentation of paragraphs is carried out in two stages: anchor alignment and paragraphalignment, as described originally by Brown et. al. [2]. A slightly simplified procedure was adoptedfor the Maltese/English work by Bajada [1].The two texts are inspected and a set of anchor point types manually identified. An anchor point isa point that is easily recognised in both source and target texts such as a chapter heading, title, orother mark that is characteristic of the text type under investigation. The identification of anchorpoints divides the two texts into sections containing one or more paragraphs, as shown in figure 4.A table of anchor point types together with their respective translations is compiled and stored inmemory (to be used for computing alignment costs).anchor point✲ .S1S2sectionsectionanchor point✲S3sectionanchor point✲S4sectionFig. 4.For the most part, corresponding anchor points are present in both source and target texts. However,there are typically omissions of one or other anchor and/or minor departures from the exactanchor translations as stored in the table. Hence, the automatic alignment of anchors is not entirelystraightforward.Both Brown et. al. and Bajada adopted the technique of assigning a cost to the alignment of aproposed source/target anchor pairing. If the proposed pair is in the table, the cost is zero. If oneelement of the pair is an omission, a fixed cost is assigned. Otherwise, rules can be supplied fordealing with concrete cases. In the case of Bajada, a dynamic programming algorithm to find theleast cost alignment is based on a limited pattern of possibilities for the next anchor pair.

Disambiguating vectors for bilingual lexicon extraction from ... - Limsi
Utilizing Contextually Relevant Terms in Bilingual Lexicon Extraction
Automatic Induction of Bilingual Lexicons for Machine Translation
Bilingual Lexicon Induction - LREC Conferences
A Bilingual Corpus of Novels Aligned at Paragraph Level - Natural ...
Learning Bilingual Lexicons from Monolingual Corpora
Standardising bilingual lexical resources according to the Lexicon ...
Acquisition of Bilingual MT Lexicons from OCRed Dictionaries
Bilingual Collocation Extraction Based on Syntactic and Statistical ...
Standardising bilingual lexical resources according to the Lexicon ...
JEIDA's bilingual corpus and other corpora for NLP ... - CiteSeerX
Low Cost Construction of a Multilingual Lexicon from Bilingual Lists
Knowledge Extraction and Representation - LexiCon Research Group
A Chunk-Driven Bootstrapping Approach to Extracting ... - CLiPS
Language-Independent Bilingual Terminology Extraction from a ...
Extracting bilingual word pairs from Wikipedia
Creating Term and Lexicon Entries from Phrase Tables
Automatic Extraction of Bilingual Terms from Comparable Corpora
Identifying bilingual Multi-Word Expressions for Statistical Machine ...
Creating Term and Lexicon Entries from Phrase Tables - FBK
The Translation Corpus Aligner: A program for automatic alignment ...
Anchor Text Mining for Translation Extraction of Query Terms