Japanese English Statistical Machine Translation
Disclaimer: This page is for notes and discussion of work in progress on SMT between Japanese and English. It is unlikely to be understandable or useful to anyone outside the project.
Results (no MERT):
| Model | Factors | Pair | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments |
|---|---|---|---|---|---|---|---|---|
| MeCab; no punctuation | surface->surface | JE | 19.76 | 19.36 | 20.44 | 19.85 | | JST data |
| MeCab; punctuation | surface->surface | JE | 21.11 | 21.39 | 21.84 | 21.45 | | |
| MeCab tokenization & ChaSen POS | surface->surface; pos->pos | JE | 19.14 | 19.56 | 20.14 | 19.58 | | |
| Juman; no punctuation | surface->surface; pos->pos | JE | 18.98 | 17.55 | 17.66 | 18.71 | | |
| Juman; punctuation, lemmas too | surface->surface; pos->pos; lemma,pos->lemma; lemma,pos->surface | JE | 20.72 | 21.44 | 21.72 | 21.29 | | |
| MeCab; punctuation, POS, lemmas, morph | t2: surface->surface; t0: lemma->lemma; g1: lemma->pos; t1: morph->pos; g2: pos,lemma->surface | JE | 19.68 | 19.87 | 19.59 | 19.71 | | |
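The "Factors" column refers to Moses-style factored input, where each token carries pipe-separated annotation layers. Below is a minimal sketch of building surface|pos|lemma Japanese input with the mecab-python3 bindings; it assumes the default IPAdic feature layout (coarse POS in field 0, lemma in field 6), which is an assumption about the setup, not a record of what was actually run.

```python
# Minimal sketch (assumes mecab-python3 and the IPAdic feature layout):
# turn a raw Japanese sentence into Moses-style factored tokens
# surface|pos|lemma.
import MeCab

tagger = MeCab.Tagger()

def factorize(sentence):
    tokens = []
    for line in tagger.parse(sentence).splitlines():
        if line == "EOS":
            break
        surface, feature = line.split("\t")
        fields = feature.split(",")
        pos = fields[0]  # coarse POS is the first IPAdic feature field
        # lemma is field 6 when known; fall back to the surface form
        lemma = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        tokens.append(f"{surface}|{pos}|{lemma}")
    return " ".join(tokens)

print(factorize("私は日本語を話します"))
# e.g. 私|名詞|私 は|助詞|は 日本語|名詞|日本語 を|助詞|を 話し|動詞|話す ます|助動詞|ます
```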
Next:
| Model | Pair | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments |
|---|---|---|---|---|---|---|---|
| 1 | EJ | | | | | | JST data |
| 2 | EJ | | | | | | |
| 3 (MeCab) | EJ | 24.67 | | | | | |
| 4 (Juman) | EJ | | | | | | |
| 3 (reversed) | EJ | | | | | | |
- JST data: 100,000 sentence pairs
Eric’s systems:
| Model | Factors | Corpus | Pair | pre-MERT | BLEU | 2nd Run | Comments | Time |
|---|---|---|---|---|---|---|---|---|
| punctuation; lowercase | none | IWSLT06 | JE | – | – | | tokenization: MeCab; Moses baseline script | |
| punctuation; lowercase | none | Tanaka | JE | 14.39 | 17.69 | | tokenization: MeCab; Moses baseline script | |
| punctuation; lowercase | surface->surface+pos | Tanaka | JE | 11.39 | 19.06 | 17.75 | EN factors: TreeTagger | < 24 hrs |
| punctuation; lowercase | t: surface->surface; g: surface->pos | Tanaka | JE | 11.39 | 17.89 | – | EN factors: TreeTagger | 11 hrs |
| punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface | Tanaka | JE | 18.67 | – | | JA factors: MeCab, morph == pos; EN factors: TreeTagger | |
| punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface | Tanaka | JE | 9.66 | – | | JA factors: MeCab, morph == morph form, type; EN factors: TreeTagger, morpha | |
| punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: pos+morph->pos; g: lemma+pos->surface | Tanaka | JE | 6.91 | – | | JA factors: MeCab, morph == morph form, type; EN factors: TreeTagger, morpha | |
| punctuation; lowercase | none | Tanaka | EJ | 26.87 | – | | tokenization: Moses baseline script; MeCab | |
| punctuation; lowercase | surface->surface+pos | Tanaka | EJ | 26.10 | – | | JA factors: MeCab | |
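For reference, a factor specification like "t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface" corresponds to Moses' factored-training options. The sketch below is an illustration under stated assumptions, not the command actually used: it assumes factor order surface=0, lemma=1, pos=2, morph=3 on the Japanese side, surface=0, lemma=1, pos=2 on the English side, and placeholder corpus paths.

```python
# Minimal sketch: map a "t:/g:" factor spec onto Moses train-model.perl
# options. Factor indices, paths, and corpus names are assumptions.
opts = {
    "translation-factors": "1-1+3-2",      # t: lemma->lemma; t: morph->pos
    "generation-factors":  "1-2+1,2-0",    # g: lemma->pos; g: lemma+pos->surface
    "decoding-steps":      "t0,g0,t1,g1",  # interleave the four mapping steps
}
cmd = ["train-model.perl",
       "--root-dir work", "--corpus corpus/train", "--f ja", "--e en"]
cmd += [f"--{key} {val}" for key, val in opts.items()]
print(" \\\n  ".join(cmd))
```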
Models Under Construction
Model 1: (lowercase English, no punctuation; MeCab-tokenized Japanese, POS from MeCab (is it really ChaSen?), no punctuation) sentence.lc.np.pose.en sentence.np.tokm.posm.ja
Model 2: (lowercase English, punctuation; MeCab-tokenized Japanese, POS from MeCab (is it really ChaSen?), punctuation) sentence.lc.p.pose.en sentence.p.tokm.posm.ja
Model 3: (lowercase English, no punctuation; MeCab-tokenized Japanese, POS from ChaSen, no punctuation) sentence.lc.np.pose.en sentence.np.tokc.posm.ja
Model 4: (lowercase English, no punctuation; Juman-tokenized Japanese, POS from Juman, no punctuation) sentence.lc.np.pose.en sentence.np.tokj.posm.ja
Model 5: (lowercase English, no punctuation; lemmatized in both languages, no punctuation) sentence.lc.np.dicm.pose.en sentence.np.tokj.dicj.posm.ja
Model 6: (best of 1-5 + NE) run named-entity tagging on both languages and add the NE label as a factor, e.g. Francis|n|name-B Bond|n|name-M was|v|O here|n|O (or here|n|place-B, depending on your tagger); see the sketch after these notes.
- Sort of inspired by work at ATR: "Introducing Translation Dictionary Into Phrase-based SMT".
- Which NE tagger? Try Sekine's.
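A minimal sketch of adding NE labels as a third factor, in the format of the Model 6 example. The parallel token, POS, and NE-tag lists are assumed to come from upstream taggers (e.g. Sekine's); the function itself is tagger-agnostic and is illustration, not the project's actual code.

```python
# Minimal sketch: merge parallel token / POS / NE-tag lists into Moses-style
# factored tokens (word|pos|ne). Inputs are assumed tagger outputs.
def add_ne_factor(tokens, pos_tags, ne_tags):
    return " ".join(f"{w}|{p}|{n}" for w, p, n in zip(tokens, pos_tags, ne_tags))

print(add_ne_factor(
    ["Francis", "Bond", "was", "here"],
    ["n", "n", "v", "n"],
    ["name-B", "name-M", "O", "O"],
))
# -> Francis|n|name-B Bond|n|name-M was|v|O here|n|O
```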
Model 7: (best of 1-5 + NE variant) run NE tagging on both languages and, in preprocessing, filter out NEs that don't align: compare the results on the two sides (perhaps keeping only the intersection). This should give better results, since the cues for named entities differ between the two languages. A sketch follows.
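A minimal sketch of the filtering idea, assuming each side's tagger output has already been reduced to an {entity: label} dict in a shared space (e.g. via a bilingual name list, which is not shown here); only entities tagged on both sides survive.

```python
# Minimal sketch: keep only the NE annotations found on both sides of a
# sentence pair. ne_en / ne_ja are assumed {entity: label} dicts whose keys
# have been mapped into a common space (an assumption, not shown).
def intersect_nes(ne_en, ne_ja):
    common = ne_en.keys() & ne_ja.keys()
    return {entity: ne_en[entity] for entity in common}

print(intersect_nes({"Francis Bond": "name", "Tokyo": "place"},
                    {"Francis Bond": "name"}))
# -> {'Francis Bond': 'name'}; 'Tokyo' is dropped, as only one side tagged it
```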
Legend:
- lc = lowercase
- np = no punctuation
- p = punctuation
- tokc = ChaSen tokenized
- tokj = Juman tokenized
- tokm = MeCab tokenized
- dicj = root form from Juman
- dicm = root form from MORPH (English morphological tagger)
- pose = POS from Adwait Ratnaparkhi's maximum entropy tagger
- posc = POS from ChaSen
- posj = POS from Juman
- posm = POS from MeCab
- en = English
- ja = Japanese
Sample file names: sentences.lc.p.pose.en, sentences.p.tokm.posm.ja
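Given these suffix conventions, a surface file and a matching tag file can be zipped into Moses' word|tag factored format. This is a sketch under assumptions: the file names follow the legend above, and the tag file is assumed to hold one tag per token rather than already-factored text.

```python
# Minimal sketch: zip a tokenized surface file with a parallel tag file into
# Moses factored input (word|tag). File names follow the suffix legend and
# are assumptions about the actual layout on disk.
def merge_factors(surface_path, tag_path, out_path):
    with open(surface_path) as sf, open(tag_path) as tf, open(out_path, "w") as out:
        for words, tags in zip(sf, tf):
            pairs = zip(words.split(), tags.split())
            out.write(" ".join(f"{w}|{t}" for w, t in pairs) + "\n")

merge_factors("sentences.lc.p.en", "sentences.lc.p.pose.en",
              "sentences.lc.p.factored.en")
```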
Other Ideas
- Use the French translations from http://wwwcyg.utc.fr/tatoeba/ to cross-align the Tanaka Corpus.
- Parse and generate both sides and train on the expanded corpus.
- Reverse the Japanese (suggested by Jason Katz-Brown); see the sketch below.
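The reversal idea is cheap to try: flip the token order of each already-tokenized Japanese line so that head-final Japanese more closely mirrors English word order for the aligner. A minimal sketch, with file names assumed from the legend above:

```python
# Minimal sketch: reverse the token order of each tokenized Japanese line.
# The file names follow the suffix legend and are assumptions.
with open("sentences.p.tokm.ja") as inp, open("sentences.p.tokm.rev.ja", "w") as out:
    for line in inp:
        out.write(" ".join(reversed(line.split())) + "\n")
```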
Data Sources
Other Experiments