Corpus-Driven Bilingual Lexicon Extraction - University of Malta
Corpus-Driven Bilingual Lexicon Extraction 83Specifically, two lists of anchor points are created, one for the source and one for the target. Theordering within these list preserves the ordering of anchor points in the texts. The algorithm iteratesthrough each source anchor point for which the first N target anchor points are compared, whereN is a small integer, and the one with the least cost assigned. This is an example of limiting the setof candidate alignments that will actually be considered, as mentioned above. In the work carriedout by Bajada, a value of 3 was chosen for N, since it was established empirically that for the textsunder consideration, the anchor points were never out of alignment by more than 3.The quality of the results obtained depends largely on the reliability of the anchor points identifiedby hand as indicators of “section-ness”. Although these can be very reliable in formalised textsuch as legal documents, the problem is harder in the case of, for example, government circularsor communications from the local Water Services. Simard et. al  have proposed using linguisticcognates as the basis for recognising potential anchor points automatically.Anchor points provide a rough alignment of source and target sections each containing severalunaligned paragraphs.The next stage aligns the paragraphs within sections. For each aligned anchor pair, the algorithmfirst retrieves and then counts the number of source and target paragraphs they enclose. Let N andM be the respective counts.Following the strategy of limiting the set of possible alignments, Bajada allows the following alignmentoptions, for n, m > 1:0 : 0, 1 : 1, 1 : 0, 0 : 1, 1 : n, n : 1, n : n, n : m(n < m), n : m(m > n), 1 : 0, 0 : 1where X:Y means that X source paragraphs are aligned with Y target paragraphs. Clearly, thechoice is fully determined for all but the last two cases, for which all possible combinations areenumerated and ranked by considering the length, in characters, of individual paragraphs or concatenationsof paragraphs. This strategy proved to be tractable because the number of paragraphsunder consideration was always small for the texts being considered.The underlying assumption is that the length ratio of correctly aligned paragraphs is approximatelyconstant for a given language pair. Hence the cost of a given alignment is proportional to the sum,over all source/target pairs in the alignment, of the difference in character length between eachsource/target pair, and the least-cost alignment is that which minimises the sum.3.2 Sentence AlignmentLike paragraph alignment, the sentence alignment algorithms proposed by Brown et. al. , andGale and Church , are based on the length of sentences in characters. The latter was reimplementedby Bajada . Like paragraph alignment, both algorithms assume that longer sentencestend to be translated by longer sentences and vice versa. The operation of the algorithm is summarisedby Bajada as follows:“The algorithm calculates the cost of aligning sentences by assigning a probabilisticscore to each possible pair of sentences which make up an alignment. The score is assignedaccording to the ratio of the lengths of the two sentences and on the variance of thisratio for all sentences. Dynamic programming is then used to find the maximum likelihoodalignment of the sentences.”
84 Michael Rosner3.3 Word AlignmentIn a classic paper that outlines the basic tenets of statistical machine translation, Brown et. al. develop a model in which for estimating Pr(f|e), the probability that f, a target (French) sentence,is the translation of e, a source (English) sentence. This is of course defined in terms of alignmentsexpressed as sets of what they term connections.A connection, written con f,ej,i , states that position j in f is connected to position i in e. The modelis directional, so that each position in f is connected to exactly one position in e, including thespecial “null” position 0 (where it is not connected to any word). Consequently, the number ofconnections in a given alignment is equal to the length of f.The translation probability for a pair of sentences is expressed asPr(f|e) = Σ a∈A Pr(f, a|e)where A is the set of all possible alignments for the sentences f and e. In other words, each possiblealignment contributes its share of probability additively. The question is, how to estimate Pr(f, a|e).Brown et. al. propose 5 different models.The first model assumes that Pr(f, a|e) depends purely on the translation probabilities t(φ|ɛ)between the word pairs (φ, ɛ) contained in the connections of a, and is obtained by multiplyingthese probabilities together. The individual values for t(φ|ɛ) are estimated using a training setcomprising a bitext that has been aligned at sentence level.The second model improves upon the first by taking into account the observation that, undertranslation, and for certain language pairs at least, words tend to retain their relative position inthe sentence: words that are near the beginning of the source sentence tend to appear near thebeginning of the target sentence.3.4 ConclusionBajada’s FYP report  implemented a complete framework for the extraction of bilingual lexicalequivalences that included all of the other alignment phases described above, including the twoversions of word alignment mentioned above. Using training and test material based on an abridgedversion of the Malta-EU accession treaty, results for section, paragraph and sentence alignment wereencouraging, with precision and recall above 90% over the test material.The results for word alignment were somewhat less satisfactory, with precision and recall figurescloser to 50%. It was shown that the second model for word alignment described above was animprovement on the first.Amongst the strategies under consideration for improving the performance of the system at wordlevel are:– Investigation of larger datasets in order to determine the rate of improvement for a givenincrease in size.– Use of a lemmatiser to reduce the vocabulary size and hence the frequency of individual connectionsbetween source and target words (this will be of particular interest on the Malteseside).– Investigation of possible equivalences between frequent n-grams of words for given n.