Translation Induction on Indian Language Corpora using Translingual Themes from Other Languages

Goutham Tholpadi, Chiranjib Bhattacharyya, Shirish Shevade.
16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), 2015.

Download the paper and the supplementary material.

Abstract

Identifying translations from comparable corpora is a well-known problem with several applications, e.g., dictionary creation in resource-scarce languages. The scarcity of high-quality corpora, especially in Indian languages, makes this problem hard: state-of-the-art techniques achieve a mean reciprocal rank (MRR) of 0.66 for English-Italian, but a mere 0.187 for Telugu-Kannada. Comparable corpora do exist, however, between many Indian languages and other “auxiliary” languages. We observe that a word and its translation have many topically related words in common in the auxiliary language. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for translation induction. Extensive experiments on 35 comparable corpora using English and French as auxiliary languages show that this approach can yield dramatic improvements in performance (e.g., MRR improves by 124% to 0.419 for Telugu-Kannada). A user study on WikiTSu, a system for cross-lingual Wikipedia title suggestion that uses our approach, shows a 20% improvement in the quality of the suggested titles.
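
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how mean reciprocal rank is commonly computed for translation induction. It is not the authors' evaluation code: the function names, the data layout (a ranked candidate list per source word and a set of gold translations per source word), and the toy words are all assumptions made for illustration. The second helper only checks the arithmetic behind the reported relative improvement for Telugu-Kannada.

```python
from typing import Mapping, Sequence, Set


def mean_reciprocal_rank(
    ranked: Mapping[str, Sequence[str]],
    gold: Mapping[str, Set[str]],
) -> float:
    """Average of 1/rank of the first correct candidate per source word.

    A source word whose candidate list contains no gold translation
    contributes 0 to the average (a common convention, assumed here).
    """
    if not ranked:
        raise ValueError("no source words to evaluate")
    total = 0.0
    for source, candidates in ranked.items():
        correct = gold.get(source, set())
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break  # only the first correct candidate counts
    return total / len(ranked)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Hypothetical toy data: three source words with ranked candidates.
toy_ranked = {
    "src1": ["x", "y", "z"],  # correct translation at rank 2
    "src2": ["p", "q"],       # correct translation at rank 1
    "src3": ["m", "n"],       # no correct translation in the list
}
toy_gold = {"src1": {"y"}, "src2": {"p"}, "src3": {"k"}}

toy_mrr = mean_reciprocal_rank(toy_ranked, toy_gold)

# MRR figures quoted in the abstract for Telugu-Kannada.
gain = relative_improvement(0.187, 0.419)
```

On the toy data the reciprocal ranks are 1/2, 1, and 0, so the MRR is 0.5. Applying `relative_improvement` to the two Telugu-Kannada figures gives roughly 124%, which matches the improvement stated in the abstract.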

Resources

Datasets (135 MB)
Code

Contact

Please contact “gtholpadi at csa dot iisc dot ernet dot in” with any queries or comments.