Japanese English Statistical Machine Translation
Disclaimer: This page is for notes and discussion of work in progress on SMT between Japanese and English. It is unlikely to be understandable or useful to anyone outside the project.
Results (no MERT):
| Model | Factors | Pair | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments |
|---|---|---|---|---|---|---|---|---|
| MeCab; no punctuation | surface->surface | JE | 19.76 | 19.36 | 20.44 | 19.85 | | JST data |
| MeCab; punctuation | surface->surface | JE | 21.11 | 21.39 | 21.84 | 21.45 | | |
| MeCab tokenization & ChaSen POS | surface->surface; pos->pos | JE | 19.14 | 19.56 | 20.14 | 19.58 | | |
| Juman; no punctuation | surface->surface; pos->pos | JE | 18.98 | 17.55 | 17.66 | 18.71 | | |
| Juman; punctuation, lemmas too | surface->surface; pos->pos; lemma,pos->lemma; lemma,pos->surface | JE | 20.72 | 21.44 | 21.72 | 21.29 | | |
| MeCab; punctuation, POS, lemmas, morph | t2: surface->surface; t0: lemma->lemma; g1: lemma->pos; t1: morph->pos; g2: pos,lemma->surface | JE | 19.68 | 19.87 | 19.59 | 19.71 | | |
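The "Factors" column refers to Moses-style factored input, where each token carries pipe-separated annotation layers. Below is a minimal sketch of building surface|pos|lemma Japanese input with the mecab-python3 bindings; it assumes the default IPAdic feature layout (coarse POS in field 0, lemma in field 6), which is an assumption about the setup, not a record of what was actually run.

```python
# Minimal sketch (assumes mecab-python3 and the IPAdic feature layout):
# turn a raw Japanese sentence into Moses-style factored tokens
# surface|pos|lemma.
import MeCab

tagger = MeCab.Tagger()

def factorize(sentence):
    tokens = []
    for line in tagger.parse(sentence).splitlines():
        if line == "EOS":
            break
        surface, feature = line.split("\t")
        fields = feature.split(",")
        pos = fields[0]  # coarse POS is the first IPAdic feature field
        # lemma is field 6 when known; fall back to the surface form
        lemma = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        tokens.append(f"{surface}|{pos}|{lemma}")
    return " ".join(tokens)

print(factorize("私は日本語を話します"))
# e.g. 私|名詞|私 は|助詞|は 日本語|名詞|日本語 を|助詞|を 話し|動詞|話す ます|助動詞|ます
```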
Next:
| Model | Pair | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments |
|---|---|---|---|---|---|---|---|
| 1 | EJ | | | | | | JST data |
| 2 | EJ | | | | | | |
| 3 (MeCab) | EJ | 24.67 | | | | | |
| 4 (Juman) | EJ | | | | | | |
| 3 (reversed) | EJ | | | | | | |
- JST data: 100,000 sentence pairs
Eric’s systems:
| Model | Factors | Corpus | Pair | pre-MERT | BLEU | 2nd Run | Comments | Time |
|---|---|---|---|---|---|---|---|---|
| punctuation; lowercase | none | IWSLT06 | JE | – | – | | tokenization: MeCab; Moses baseline script | |
| punctuation; lowercase | none | Tanaka | JE | 14.39 | 17.69 | | tokenization: MeCab; Moses baseline script | |
| punctuation; lowercase | surface->surface+pos | Tanaka | JE | 11.39 | 19.06 | 17.75 | EN factors: TreeTagger | < 24 hrs |
| punctuation; lowercase | t: surface->surface; g: surface->pos | Tanaka | JE | 11.39 | 17.89 | – | EN factors: TreeTagger | 11 hrs |
| punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface | Tanaka | JE | 18.67 | – | | JA factors: MeCab, morph == pos; EN factors: TreeTagger | |
| punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface | Tanaka | JE | 9.66 | – | | JA factors: MeCab, morph == morph form, type; EN factors: TreeTagger, morpha | |
| punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: pos+morph->pos; g: lemma+pos->surface | Tanaka | JE | 6.91 | – | | JA factors: MeCab, morph == morph form, type; EN factors: TreeTagger, morpha | |
| punctuation; lowercase | none | Tanaka | EJ | 26.87 | – | | tokenization: Moses baseline script; MeCab | |
| punctuation; lowercase | surface->surface+pos | Tanaka | EJ | 26.10 | – | | JA factors: MeCab | |
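For reference, a factor specification like "t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface" corresponds to Moses' factored-training options. The sketch below is an illustration under stated assumptions, not the command actually used: it assumes factor order surface=0, lemma=1, pos=2, morph=3 on the Japanese side, surface=0, lemma=1, pos=2 on the English side, and placeholder corpus paths.

```python
# Minimal sketch: map a "t:/g:" factor spec onto Moses train-model.perl
# options. Factor indices, paths, and corpus names are assumptions.
opts = {
    "translation-factors": "1-1+3-2",      # t: lemma->lemma; t: morph->pos
    "generation-factors":  "1-2+1,2-0",    # g: lemma->pos; g: lemma+pos->surface
    "decoding-steps":      "t0,g0,t1,g1",  # interleave the four mapping steps
}
cmd = ["train-model.perl",
       "--root-dir work", "--corpus corpus/train", "--f ja", "--e en"]
cmd += [f"--{key} {val}" for key, val in opts.items()]
print(" \\\n  ".join(cmd))
```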
Models Under Construction
Model 1: (lowercase English, no punctuation; MeCab-tokenized Japanese, POS from MeCab (is it really ChaSen?), no punctuation) sentence.lc.np.pose.en sentence.np.tokm.posm.ja
Model 2: (lowercase English, punctuation; MeCab-tokenized Japanese, POS from MeCab (is it really ChaSen?), punctuation) sentence.lc.p.pose.en sentence.p.tokm.posm.ja
Model 3: (lowercase English, no punctuation; MeCab-tokenized Japanese, POS from ChaSen, no punctuation) sentence.lc.np.pose.en sentence.np.tokc.posm.ja
Model 4: (lowercase English, no punctuation; Juman-tokenized Japanese, POS from Juman, no punctuation) sentence.lc.np.pose.en sentence.np.tokj.posm.ja
Model 5: (lowercase English, no punctuation; lemmatized in both languages, no punctuation) sentence.lc.np.dicm.pose.en sentence.np.tokj.dicj.posm.ja
Model 6: (best of 1-5 + NE) run named-entity tagging on both languages and add the NE label as a factor, e.g. Francis|n|name-B Bond|n|name-M was|v|O here|n|O (or here|n|place-B, depending on your tagger); see the sketch after these notes.
- Sort of inspired by work at ATR: "Introducing Translation Dictionary Into Phrase-based SMT".
- Which NE tagger? Try Sekine's.
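A minimal sketch of adding NE labels as a third factor, in the format of the Model 6 example. The parallel token, POS, and NE-tag lists are assumed to come from upstream taggers (e.g. Sekine's); the function itself is tagger-agnostic and is illustration, not the project's actual code.

```python
# Minimal sketch: merge parallel token / POS / NE-tag lists into Moses-style
# factored tokens (word|pos|ne). Inputs are assumed tagger outputs.
def add_ne_factor(tokens, pos_tags, ne_tags):
    return " ".join(f"{w}|{p}|{n}" for w, p, n in zip(tokens, pos_tags, ne_tags))

print(add_ne_factor(
    ["Francis", "Bond", "was", "here"],
    ["n", "n", "v", "n"],
    ["name-B", "name-M", "O", "O"],
))
# -> Francis|n|name-B Bond|n|name-M was|v|O here|n|O
```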
Model 7: (best of 1-5 + NE variant) run NE tagging on both languages and, in preprocessing, filter out NEs that don't align: compare the results on the two sides (perhaps keeping only the intersection). This should give better results, since the cues for named entities differ between the two languages. A sketch follows.
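A minimal sketch of the filtering idea, assuming each side's tagger output has already been reduced to an {entity: label} dict in a shared space (e.g. via a bilingual name list, which is not shown here); only entities tagged on both sides survive.

```python
# Minimal sketch: keep only the NE annotations found on both sides of a
# sentence pair. ne_en / ne_ja are assumed {entity: label} dicts whose keys
# have been mapped into a common space (an assumption, not shown).
def intersect_nes(ne_en, ne_ja):
    common = ne_en.keys() & ne_ja.keys()
    return {entity: ne_en[entity] for entity in common}

print(intersect_nes({"Francis Bond": "name", "Tokyo": "place"},
                    {"Francis Bond": "name"}))
# -> {'Francis Bond': 'name'}; 'Tokyo' is dropped, as only one side tagged it
```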
Legend:
- lc = lowercase
- np = no punctuation
- p = punctuation
- tokc = ChaSen tokenized
- tokj = Juman tokenized
- tokm = MeCab tokenized
- dicj = root form from Juman
- dicm = root form from MORPH (English morphological tagger)
- pose = POS from Adwait Ratnaparkhi's maximum entropy tagger
- posc = POS from ChaSen
- posj = POS from Juman
- posm = POS from MeCab
- en = English
- ja = Japanese
Sample file names: sentences.lc.p.pose.en, sentences.p.tokm.posm.ja
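Given these suffix conventions, a surface file and a matching tag file can be zipped into Moses' word|tag factored format. This is a sketch under assumptions: the file names follow the legend above, and the tag file is assumed to hold one tag per token rather than already-factored text.

```python
# Minimal sketch: zip a tokenized surface file with a parallel tag file into
# Moses factored input (word|tag). File names follow the suffix legend and
# are assumptions about the actual layout on disk.
def merge_factors(surface_path, tag_path, out_path):
    with open(surface_path) as sf, open(tag_path) as tf, open(out_path, "w") as out:
        for words, tags in zip(sf, tf):
            pairs = zip(words.split(), tags.split())
            out.write(" ".join(f"{w}|{t}" for w, t in pairs) + "\n")

merge_factors("sentences.lc.p.en", "sentences.lc.p.pose.en",
              "sentences.lc.p.factored.en")
```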
Other Ideas
- Use the French translations from http://wwwcyg.utc.fr/tatoeba/ to cross-align the Tanaka Corpus.
- Parse and generate both sides and train on the expanded corpus.
- Reverse the Japanese (suggested by Jason Katz-Brown); see the sketch below.
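The reversal idea is cheap to try: flip the token order of each already-tokenized Japanese line so that head-final Japanese more closely mirrors English word order for the aligner. A minimal sketch, with file names assumed from the legend above:

```python
# Minimal sketch: reverse the token order of each tokenized Japanese line.
# The file names follow the suffix legend and are assumptions.
with open("sentences.p.tokm.ja") as inp, open("sentences.p.tokm.rev.ja", "w") as out:
    for line in inp:
        out.write(" ".join(reversed(line.split())) + "\n")
```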
Data Sources
Other Experiments