slides - Academia Sinica


DOM E2C NE, M2M, CRF, AV

Mike Tian‐Jian Jiang, Chan‐Hung Kuo, Wen‐Lian Hsu

Institute of Information Science, Academia Sinica


DOM E2C

Direct orthographical mapping for English‐to‐Chinese named entity transliteration

http://mind.c.u‐tokyo.ac.jp/Sakai_Lab_files/News/HFSP2004.htm

IIS, Sinica 2/36

2011/11/12


What’s M2M?

M2M‐aligner, actually.


http://widget.bigoo.ws/community/graphic.asp?ID=11011


Expectation Maximization

Many‐to‐many alignment

(Jiampojamarn et al., 2007)

(Ristad and Yianilos, 1998)



Why M2M‐aligner?

(Rather than GIZA++?)



Strict standard runs

Without human annotators or IBM Models 1–5



Pilot tests of initial alignments

• Max source gram length X: 1–24

• Max target gram length Y: 1–3



Source     Target     M2M‐Aligner Result
ABBADIE    阿 巴 迪    A:B|B:A|D:I:E|    阿 | 巴 | 迪
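The aligner result can be read back into chunk pairs. The following is a minimal sketch, assuming the format shown on this slide: chunks are separated by `|`, letters inside a source chunk are joined by `:`, and each target chunk is a single Chinese character (MaxY=1). The function name `parse_m2m_line` is hypothetical and is not part of M2M‐aligner.

```python
def parse_m2m_line(source_side, target_side):
    """Pair up aligned chunks from one aligner result row (assumed format)."""
    # Chunks are separated by '|'; letters inside a source chunk by ':'.
    src_chunks = [c.strip().split(":") for c in source_side.split("|") if c.strip()]
    tgt_chunks = [c.strip() for c in target_side.split("|") if c.strip()]
    if len(src_chunks) != len(tgt_chunks):
        raise ValueError("source and target have different numbers of chunks")
    return list(zip(src_chunks, tgt_chunks))


pairs = parse_m2m_line("A:B|B:A|D:I:E|", "阿 | 巴 | 迪")
for letters, hanzi in pairs:
    print("".join(letters), "->", hanzi)
```

Each pair holds the English letters of one chunk and the Chinese character they were aligned to.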



MaxX=8, MaxY=1

• Without any missed alignment

• Deeper context for source

• Unigram for target



Why unigram for target?

(No context?)



Simple labeling scheme

For linear‐chain conditional random fields

(With bi‐gram context for target)



Character    Label
A            B 阿
B            I
B            B 巴
A            I
D            B 迪
I            I
E            I
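The labeling for ABBADIE can be produced mechanically from the aligned chunks. This is a minimal sketch, assuming the chunks AB/阿, BA/巴, DIE/迪 from the aligner example; the function name `bi_labels` and the `"B 阿"` string format for a label are illustrative choices, not the authors' code.

```python
def bi_labels(aligned_chunks):
    """Label each source letter: 'B <hanzi>' opens a chunk, 'I' continues it."""
    rows = []
    for letters, hanzi in aligned_chunks:
        for i, letter in enumerate(letters):
            label = f"B {hanzi}" if i == 0 else "I"
            rows.append((letter, label))
    return rows


aligned = [("AB", "阿"), ("BA", "巴"), ("DIE", "迪")]
rows = bi_labels(aligned)
for letter, label in rows:
    print(letter, label)
```

The first letter of every chunk carries the Chinese character, so the target stays a unigram while the label sequence still encodes chunk boundaries.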



…and then focus on feature engineering

Combinations of feature sets and context coverage



To be automatic and semi‐supervised

(Well, semi‐automatic)

http://www.phdcomics.com/store/mojostore.php?_=view&ProductID=12681


Related work on CRF

Yang et al. (2009)

Reddy and Waxmonsky (2009)

Shishtla et al. (2009)



Unsupervised learning

For feature extraction from unlabeled data



Accessor Variety

(Feng et al., 2004)



…or Branching Entropy

(Jin and Tanaka‐Ishii, 2006)



Forward or backward?

Left or Right?



AV(s) = min{L_AV(s), R_AV(s)}

• L_AV(s): type counts of preceding characters

• R_AV(s): type counts of succeeding characters
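The definition above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the function name `accessor_variety`, the three-word corpus, and the choice to count the word boundary as one ordinary context type (`^` on the left, `$` on the right) are all assumptions.

```python
from collections import defaultdict


def accessor_variety(corpus, max_n=3):
    """Return AV(s) = min(L_AV(s), R_AV(s)) for every n-gram s up to max_n."""
    left = defaultdict(set)   # distinct characters seen just before s
    right = defaultdict(set)  # distinct characters seen just after s
    for word in corpus:
        padded = "^" + word + "$"  # assumption: boundaries act as context types
        for n in range(1, max_n + 1):
            for i in range(1, len(padded) - n):
                s = padded[i:i + n]
                left[s].add(padded[i - 1])
                right[s].add(padded[i + n])
    return {s: min(len(left[s]), len(right[s])) for s in left}


corpus = ["ABBADIE", "ABBOT", "BADEN"]
av = accessor_variety(corpus, max_n=2)
print(av["B"], av["BA"])
```

Only distinct neighbouring characters are counted, so a string that always appears in the same context gets a low AV no matter how frequent it is.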



Integer (real) values for features?

…or binary values?



A ranking function

To cover binary feature functions



f(s) = r, if 2^r ≤ AV(s) < 2^(r+1)

(Zhao and Kit, 2007)
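The ranking function is the integer part of the base‐2 logarithm of AV(s). A minimal sketch follows; the name `av_rank` is hypothetical, and rejecting AV values below 1 is an assumption, since the formula does not cover them.

```python
def av_rank(av):
    """Return r such that 2**r <= av < 2**(r + 1)."""
    if av < 1:
        # The formula is undefined for AV < 1; refusing is an assumption here.
        raise ValueError("AV must be at least 1")
    return av.bit_length() - 1  # floor(log2(av)) for positive integers


for value in (1, 2, 3, 128, 255):
    print(value, av_rank(value))
```

Bucketing AV into a handful of ranks lets each rank act as a discrete symbol in binary CRF feature functions, instead of feeding a raw count to the model.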



A B B A D I E
7 5 5 7 7 5 7
3 4 4 3 4 3
2 2 2 3 2
0 0 0 1
1 0 0



         AV Feature
Input    1 Char    2 Char    3 Char    4 Char    5 Char    Label
A        7S        3B        2B        0B        1B        B 阿
B        5S        3E        2C        0C        1C        I
B        5S        3B        2E        0D        1D        B 巴
A        7S        4B        2E        1B        1I        I
D        7S        4E        3B        1C        1E        B 迪
I        5S        4E        3C        1D        0I        I
E        7S        3E        3E        1E        0E        I

B: beginning; C: next to B; D: next to C; I: intermediate; E: ending



CRF Labeling Scheme

• Context Coverage (Feature Template)

• Prediction Label Variety (Output)



Context Coverage

1UB

C_0, C_{-1}, C_1

C_0 C_1, C_{-1} C_0

2UB

C_0, C_{-1}, C_1, C_{-2}, C_2

C_0 C_1, C_{-1} C_0, C_{-2} C_{-1}, C_1 C_2

3UB

C_0, C_{-1}, C_1, C_{-2}, C_2, C_{-3}, C_3

C_0 C_1, C_{-1} C_0, C_{-2} C_{-1}, C_1 C_2, C_{-3} C_{-2}, C_2 C_3
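The three templates follow one pattern: for kUB, single characters at offsets −k to k, plus every pair of adjacent offsets in that window. The sketch below generates the offsets and applies them to one position; the function names, the feature string format, and the `_` padding symbol for positions outside the word are assumptions for illustration only.

```python
def context_offsets(k):
    """Offsets for the kUB template: single characters and adjacent pairs."""
    unigrams = [0] + [o for d in range(1, k + 1) for o in (-d, d)]
    bigrams = [(i, i + 1) for i in range(-k, k)]
    return unigrams, bigrams


def extract_features(chars, t, k, pad="_"):
    """Build feature strings for position t of chars under the kUB template."""
    def at(offset):
        j = t + offset
        return chars[j] if 0 <= j < len(chars) else pad

    unigrams, bigrams = context_offsets(k)
    feats = [f"C[{o}]={at(o)}" for o in unigrams]
    feats += [f"C[{a}]C[{b}]={at(a)}{at(b)}" for a, b in bigrams]
    return feats


features = extract_features(list("ABBADIE"), t=3, k=2)
print(features)
```

Moving from 1UB to 3UB adds two single‐character and two pair features per step, which is how deeper context coverage raises the number of activated features per position.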



Prediction Label

Positioning tag

P_BI: position B(eginning) and position I(ntermediate)

P_BIE: B(eginning), I(ntermediate), E(nding)

Chinese character (grapheme) tag

G_B: on B only

G_BI: on both B and I
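Whichever tag variant is used, a predicted label sequence has to be turned back into a Chinese string. A minimal sketch follows, assuming labels are written as a position tag optionally followed by a space and the character (for example `"B 阿"`, `"I"`, or `"I 阿"`); that string format and the name `decode_labels` are assumptions, not the authors' format.

```python
def decode_labels(labels):
    """Concatenate the Chinese character attached to each B label."""
    output = []
    for label in labels:
        parts = label.split()
        # Read the character from B only, so characters repeated on I labels
        # (grapheme tag on both B and I) are not emitted twice.
        if parts and parts[0] == "B" and len(parts) > 1:
            output.append(parts[1])
    return "".join(output)


print(decode_labels(["B 阿", "I", "B 巴", "I", "B 迪", "I", "I"]))
print(decode_labels(["B 阿", "I 阿", "B 巴", "I 巴"]))
```

Placing the character on I as well multiplies the label set, which matters for training cost because the number of labels enters the cost quadratically.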



Preliminary Test

Configuration           ACC      Mean F
1UB, P_BI, G_B          0.001    0.151
1UB, P_BI, G_B, AV      0.000    0.078
1UB, P_BI, G_BI         0.454    0.762
1UB, P_BI, G_BI, AV     0.547    0.813
1UB, P_BIE, G_B         0.182    0.586
1UB, P_BIE, G_B, AV     0.273    0.656
2UB, P_BI, G_B          0.001    0.122
2UB, P_BI, G_B, AV      0.000    0.064
2UB, P_BI, G_BI         0.547    0.813
2UB, P_BI, G_BI, AV     0.753    0.910
2UB, P_BIE, G_B         0.347    0.708
2UB, P_BIE, G_B, AV     0.483    0.800
3UB, P_BI, G_B, AV      0.569    0.860
3UB, P_BIE, G_B         0.449    0.771
3UB, P_BIE, G_B, AV     0.597    0.857



Results

Configuration           ACC      Mean F
2UB, P_BI, G_BI, AV     0.327    0.688
3UB, P_BIE, G_B, AV     0.303    0.675



Potential Issues

Source                         Target
COMMONWEALTH OF THE BAHAMAS    巴 哈 马 / 联 邦
ARAL SEA                       咸 / 海

Multi‐word named entities and semi‐translated transliteration in the development set

Unseen in the training set

Unused in the test set

Inappropriate AV and context structure

Depth of context vs. Coverage of context (trigram?)

Length of n‐gram for AV

Distribution and scale of ranking for AV

Distinction of left and right for AV

Variety of prediction label



C2E Issues

Space and time complexity



CRF Training Cost

Time complexity of single L‐BFGS iteration

O(L^2 NTF)

L: number of labels

N: number of sequences

T: average length of sequences

F: average number of activated features

Contribution rate C

C = log_2(L^2 F_total)

F_total ≈ NTF



C for NEWS10 Test

Configuration           F_total      L      C_ACC     C_F       C_MRR     C_MAP
2UB, P_BI, G_BI         2,501,328    744    0.0292    0.0575    0.0350    0.0280
2UB, P_BI, G_BI, AV     4,882,872    744    0.0287    0.0561    0.0337    0.0275
2UB, P_BIE, G_B         1,125,744    376    0.0273    0.0601    0.0335    0.0261
2UB, P_BIE, G_B, AV     2,322,176    376    0.0275    0.0588    0.0332    0.0263
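To compare training cost across these four configurations, the per‐iteration term from the CRF Training Cost slide can be evaluated with F_total standing in for NTF, that is, as L² · F_total. The sketch below only does that arithmetic on the F_total and L columns of this table; the name `cost_proxy` is hypothetical, and the slides' contribution rate C is deliberately not recomputed here.

```python
# (F_total, L) per configuration, copied from the table.
configs = {
    "2UB, P_BI, G_BI": (2_501_328, 744),
    "2UB, P_BI, G_BI, AV": (4_882_872, 744),
    "2UB, P_BIE, G_B": (1_125_744, 376),
    "2UB, P_BIE, G_B, AV": (2_322_176, 376),
}


def cost_proxy(f_total, num_labels):
    """L^2 * F_total, standing in for the O(L^2 NTF) per-iteration cost."""
    return num_labels ** 2 * f_total


costs = {name: cost_proxy(f, l) for name, (f, l) in configs.items()}
cheapest = min(costs.values())
for name, cost in costs.items():
    print(f"{name}: {cost / cheapest:.1f}x the cheapest configuration")
```

Because L is squared, the two G_BI configurations with 744 labels cost far more per iteration than the G_B ones with 376 labels, even before the extra AV features are counted.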



Conclusion & Future Work

AV does help

Deeper coverage does not always help

The more positions where Chinese characters are labeled, the better the performance

Different alignment approaches?

Pair‐wise multiple sequence alignment

Evolutionary algorithm

Parallelism

More features of target grapheme?

Log‐linear or generative models?

Two stages (but not pipelined)?



Thank YOU!

Any questions or comments?



Reduced n‐gram

BEHIND THE SCENES



zhi1‐shi4

shi4‐wei4

tou2‐shi4

shi4‐li4

dao4‐shi4

shi4‐shang4

fang1‐shi4

shi4‐gu4

chang2‐shi4

shi4‐yi2

yi4‐shi4

shi4‐ji1

zhi4‐shi4

shi4‐zhong1
