Japanese-English Statistical Machine Translation

Disclaimer: This page is for notes and discussion of work in progress on SMT between Japanese and English. It is unlikely to be understandable or useful to anyone outside the project.

Results (no MERT):

                 
Model | Factors | Pair | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments
MeCab; No Punctuation | surface->surface | JE | 19.76 | 19.36 | 20.44 | 19.85 | | JST data
MeCab; Punctuation | surface->surface | JE | 21.11 | 21.39 | 21.84 | 21.45 | |
MeCab Tokenization & Chasen POS | surface->surface pos->pos | JE | 19.14 | 19.56 | 20.14 | 19.58 | |
Juman; No Punctuation | surface->surface pos->pos | JE | 18.98 | 17.55 | 17.66 | 18.71 | |
Juman & Punctuation, Lemmas too | surface->surface pos->pos lemma,pos->lemma lemma,pos->surface | JE | 20.72 | 21.44 | 21.72 | 21.29 | |
MeCab; Punctuation, POS, Lemmas, Morph | t2:surface->surface, t0:lemma->lemma, g1:lemma->pos, t1:morph->pos, g2:pos,lemma->surface | JE | 19.68 | 19.87 | 19.59 | 19.71 | |
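
The Average BLEU column is presumably the arithmetic mean of the three test-set scores. A minimal Python sketch for recomputing it as a sanity check follows; the function names `average_bleu` and `mismatches` and the tolerance are made up for illustration, and the numbers are copied from the results above in row order.

```python
def average_bleu(scores):
    """Arithmetic mean of per-test-set BLEU scores, rounded to two decimals."""
    if not scores:
        raise ValueError("need at least one BLEU score")
    return round(sum(scores) / len(scores), 2)


def mismatches(rows, tolerance=0.015):
    """Return {row: (listed, recomputed)} where the two differ by more than tolerance."""
    flagged = {}
    for name, (*scores, listed) in rows.items():
        recomputed = average_bleu(scores)
        if abs(recomputed - listed) > tolerance:
            flagged[name] = (listed, recomputed)
    return flagged


# (Test 1, Test 2, Test 3, listed average), one entry per result row, in order.
ROWS = {
    "row 1": (19.76, 19.36, 20.44, 19.85),
    "row 2": (21.11, 21.39, 21.84, 21.45),
    "row 3": (19.14, 19.56, 20.14, 19.58),
    "row 4": (18.98, 17.55, 17.66, 18.71),
    "row 5": (20.72, 21.44, 21.72, 21.29),
    "row 6": (19.68, 19.87, 19.59, 19.71),
}

for name, (listed, recomputed) in mismatches(ROWS).items():
    print(f"{name}: listed {listed}, recomputed {recomputed}")
```

Any row it prints is worth rechecking against the raw scores, since either a test score or the listed average may have been mistyped.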

Next:

               
Model | Pair | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments
1 | EJ | | | | | | JST data
2 | EJ | | | | | |
3 (MeCab) | EJ | 24.67 | | | | |
4 (Juman) | EJ | | | | | |
3 (reversed) | EJ | | | | | |

Eric’s systems:

                 
Model | Factors | Corpus | Pair | pre-MERT / BLEU / 2nd Run | Comments | Time
punctuation; lowercase | none | IWSLT06 | JE | | tokenization: MeCab; Moses baseline script |
punctuation; lowercase | none | Tanaka | JE | 14.39 17.69 | tokenization: MeCab; Moses baseline script |
punctuation; lowercase | surface->surface+pos | Tanaka | JE | 11.39 19.06 17.75 | EN factors: tree tagger | < 24 hrs
punctuation; lowercase | t: surface->surface; g: surface->pos | Tanaka | JE | 11.39 17.89 | EN factors: tree tagger | 11 hrs
punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface | Tanaka | JE | 18.67 | JA factors: MeCab, morph == pos; EN factors: tree tagger |
punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface | Tanaka | JE | 9.66 | JA factors: MeCab, morph == morph form, type; EN factors: tree tagger, morpha |
punctuation; lowercase | t: lemma->lemma; g: lemma->pos; t: pos+morph->pos; g: lemma+pos->surface | Tanaka | JE | 6.91 | JA factors: MeCab, morph == morph form, type; EN factors: tree tagger, morpha |
punctuation; lowercase | none | Tanaka | EJ | 26.87 | tokenization: Moses baseline script; MeCab |
punctuation; lowercase | surface->surface+pos | Tanaka | EJ | 26.10 | JA factors: MeCab |

Models Under Construction

Model 1: (Lowercase English, no punctuation; MeCab-tokenized Japanese, POS from MeCab (is it really Chasen??), no punctuation) sentence.lc.np.pose.en sentence.np.tokm.posm.ja

Model 2: (Lowercase English, punctuation; MeCab-tokenized Japanese, POS from MeCab (is it really Chasen??), punctuation) sentence.lc.p.pose.en sentence.p.tokm.posm.ja

Model 3: (Lowercase English, no punctuation; MeCab-tokenized Japanese, POS from Chasen, no punctuation) sentence.lc.np.pose.en sentence.np.tokc.posm.ja

Model 4: (Lowercase English, no punctuation; Juman-tokenized Japanese, POS from Juman, no punctuation) sentence.lc.np.pose.en sentence.np.tokj.posm.ja

Model 5: (Lowercase English, no punctuation; lemmatized in both languages, no punctuation) sentence.lc.np.dicm.pose.en sentence.np.tokj.dicj.posm.ja

Model 6: (best of 1-5 + NE) Do NE tagging on both languages and add the NE tag as a factor, e.g. Francis|n|name-B Bond|n|name-M was|v|O here|n|O (or here|n|place-B, depending on your tagger).

  • Sort of inspired by work at ATR: "Introducing Translation Dictionary Into Phrase-based SMT".
  • Which NE tagger? Try Sekine’s.
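
The factored line in Model 6 joins each token's factors with `|`, which is the separator Moses uses for factored input. A minimal sketch of building such a line from parallel word, POS and NE tag sequences follows; the helper name `to_factored` is hypothetical and the tags are simply those from the example above, not the output of any particular tagger.

```python
def to_factored(words, pos_tags, ne_tags, sep="|"):
    """Join parallel word/POS/NE sequences into one factored line."""
    if not (len(words) == len(pos_tags) == len(ne_tags)):
        raise ValueError("all factor sequences must have the same length")
    tokens = []
    for factors in zip(words, pos_tags, ne_tags):
        # An empty factor or one containing the separator would corrupt the line.
        if any(not f or sep in f for f in factors):
            raise ValueError(f"bad factor in {factors!r}")
        tokens.append(sep.join(factors))
    return " ".join(tokens)


line = to_factored(
    ["Francis", "Bond", "was", "here"],
    ["n", "n", "v", "n"],
    ["name-B", "name-M", "O", "O"],
)
print(line)  # Francis|n|name-B Bond|n|name-M was|v|O here|n|O
```

Rejecting tokens that already contain `|` is a design choice of this sketch; a real pipeline would more likely escape them.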

Model 7: (best of 1-5 + NE variant) Do NE tagging on both languages and filter out NEs that don’t align in preprocessing, e.g. compare the results from the two sides (maybe taking only the intersection). This should give better results, as the cues must be different in the two languages.
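
One possible reading of "taking only the intersection" is sketched below: for each sentence pair, keep an NE tag only if its type also occurs on the other side, and reset everything else to `O`. This is an assumption about what Model 7 intends, not the project's method; it compares NE types per sentence pair and uses no word alignment. The function names and the Japanese tag sequence are invented for illustration.

```python
OUTSIDE = "O"  # tag for tokens that are not part of any named entity


def ne_type(tag):
    """Strip the position suffix: 'name-B' -> 'name'. Outside tags have no type."""
    if tag == OUTSIDE:
        return None
    return tag.rsplit("-", 1)[0]


def keep_shared_types(src_tags, tgt_tags):
    """Reset to OUTSIDE every NE tag whose type is missing from the other side."""
    shared = {ne_type(t) for t in src_tags} & {ne_type(t) for t in tgt_tags}
    shared.discard(None)

    def clean(tags):
        return [t if ne_type(t) in shared else OUTSIDE for t in tags]

    return clean(src_tags), clean(tgt_tags)


# English tags follow the Model 6 example with the place-B variant;
# the Japanese tags are invented for illustration.
en_tags = ["name-B", "name-M", "O", "place-B"]
ja_tags = ["name-B", "O", "O"]
print(keep_shared_types(en_tags, ja_tags))
```

Here the `place` entity is found only on the English side, so it is dropped, while the `name` entity survives on both sides.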

  • lc = lowercase
  • np = no punctuation
  • p = punctuation
  • tokc = Chasen Tokenized
  • tokj = Juman Tokenized
  • tokm = MeCab Tokenized
  • dicj = Root Form from Juman
  • dicm = Root Form from MORPH English Morphological Tagger
  • pose = POS from Adwait Ratnaparkhi's Maximum Entropy Tagger
  • posc = POS from Chasen
  • posj = POS from Juman
  • posm = POS from MeCab
  • en = English
  • ja = Japanese

Sample File Names: sentences.lc.p.pose.en sentences.p.tokm.posm.ja
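
Since every preprocessing step is encoded as a dot-separated suffix, a file name can be decoded mechanically. A small sketch follows, assuming the first component is a free-form base name and every later component is one of the codes listed on this page; the names `SUFFIXES` and `describe` are hypothetical, and the descriptions are shortened.

```python
# Suffix codes as listed on this page; descriptions are abbreviated.
SUFFIXES = {
    "lc": "lowercase",
    "np": "no punctuation",
    "p": "punctuation",
    "tokc": "Chasen tokenized",
    "tokj": "Juman tokenized",
    "tokm": "MeCab tokenized",
    "dicj": "root form from Juman",
    "dicm": "root form from MORPH",
    "pose": "POS from the English maximum entropy tagger",
    "posc": "POS from Chasen",
    "posj": "POS from Juman",
    "posm": "POS from MeCab",
    "en": "English",
    "ja": "Japanese",
}


def describe(filename):
    """Expand the suffix codes of a corpus file name into readable labels."""
    _base, *codes = filename.split(".")
    unknown = [c for c in codes if c not in SUFFIXES]
    if unknown:
        raise ValueError(f"unknown suffix(es) in {filename!r}: {unknown}")
    return [SUFFIXES[c] for c in codes]


for name in ("sentences.lc.p.pose.en", "sentences.p.tokm.posm.ja"):
    print(name, "=", "; ".join(describe(name)))
```

Failing on unknown codes rather than skipping them makes typos in file names show up early.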

Other Ideas

  • Use French from http://wwwcyg.utc.fr/tatoeba/ to cross align on Tanaka Corpus
  • Parse and generate both sides and train off the expanded corpus.
  • Reverse the Japanese (suggested by Jason Katz-Brown)
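
For the last idea, a minimal sketch, assuming "reverse" means reversing the token order of each already-tokenized Japanese sentence (one sentence per line, tokens separated by spaces). The function name `reverse_tokens` and the sample sentence are made up for illustration.

```python
def reverse_tokens(line):
    """Reverse the order of whitespace-separated tokens in one sentence."""
    return " ".join(reversed(line.split()))


# Invented sample sentence, tokenized with spaces between words.
print(reverse_tokens("私 は 学生 です"))  # です 学生 は 私
```

Applying the function twice restores the original token order, so output translated into reversed Japanese can be un-reversed the same way. Whether sentence-final punctuation should stay in place is left open here.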

Data Sources

Other Experiments

NE_Tagging_For_Improving_SMT

Last update: 2011-10-09 by anonymous