
Parallel sentence generation from comparable corpora for improved SMT

Bibliographic Details
Published in: Machine Translation, 2011-12, Vol. 25 (4), p. 341-375
Main Authors: Rauf, Sadaf Abdul, Schwenk, Holger
Format: Article
Language: English
Description
Summary: A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required amounts for all domains and languages. An approach is presented here which aims at producing parallel corpora from available comparable corpora. An SMT system is used to translate the source-language part of a comparable corpus and the translations are used as queries to conduct information retrieval from the target-language side of the comparable corpus. Simple filters are then used to score the SMT output and the IR-returned sentence with the filter score defining the degree of similarity between the two. Using SMT system output gives the benefit of trying to correct one of the common errors by sentence tail removal. The approach was applied to Arabic-English and French-English systems using comparable news corpora and considerable improvements were achieved in the BLEU score. We show that our approach is independent of the quality of the SMT system used to make the queries, strengthening the claim of applicability of the approach for languages and domains with limited parallel corpora available to start with. We compare our approach with one of the earlier approaches and show that our approach is easier to implement and gives equally good improvements.
ISSN: 0922-6567, 1573-0573
DOI: 10.1007/s10590-011-9114-9