Dept of Inf Studies.Research. FIRE ENGL / FIRE FIN .Projects

Project: TRT – Transliteration-Based Matching for Out-Of-Vocabulary Words in CLIR Applications

Description

Technical terms and proper names are a major problem in dictionary-based cross-language information retrieval (CLIR) because translation dictionaries typically do not include them. However, technical terms and proper names in different languages often share the same Latin or Greek origin (or at least orthography) and are thus spelling variants of each other. We have developed a novel three-step fuzzy translation technique for such cross-lingual spelling variants, called TRT (transliteration rule based translation). In the first step, TRT rules are generated automatically using translation dictionaries as source data. The rules specify character transformations in context between languages, together with their frequency and reliability. In the second step, the transformation rules are applied to source words to make them more similar to their target language equivalents. In the third step, the intermediate forms obtained in the second step are translated into the target language using fuzzy matching, e.g. n-gram matching. The effectiveness of the technique has been evaluated empirically using five source languages and English as the target language.
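The second and third steps can be illustrated with a minimal Python sketch. It is not the project's implementation: the two Finnish-to-English rules, the tiny target lexicon, the function names and the choice of a digram Dice coefficient are all assumptions made for illustration, and real TRT rules also carry context, frequency and reliability information that is omitted here.

```python
# Hypothetical transformation rules (source substring -> target substring).
# Real TRT rules are generated from dictionaries and are context-sensitive.
TRT_RULES = [("k", "c"), ("ia", "y")]

# Hypothetical target-language (English) word list.
LEXICON = ["psychology", "physiology", "pathology"]


def apply_rules(word, rules):
    """Step 2: rewrite the source word into an intermediate form."""
    for src, tgt in rules:
        word = word.replace(src, tgt)
    return word


def digrams(word):
    """Return the set of character digrams of a word, padded at both ends."""
    padded = f"_{word}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice coefficient over digram sets (one possible fuzzy-matching measure)."""
    ga, gb = digrams(a), digrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def translate(word, rules, lexicon):
    """Steps 2 and 3: transform the word, then pick the closest lexicon entry."""
    intermediate = apply_rules(word, rules)
    return max(lexicon, key=lambda target: dice(intermediate, target))


print(apply_rules("psykologia", TRT_RULES))         # psycology
print(translate("psykologia", TRT_RULES, LEXICON))  # psychology
```

The intermediate form "psycology" is not an English word, but it is close enough to "psychology" for the n-gram matcher to prefer it over the other lexicon entries.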


If transformation rules are applied liberally, many candidate target language word forms are generated for each source word. To identify the correct one among them, we have devised a novel statistical technique called FITE (frequency-based identification of translation equivalents). The effectiveness of FITE has been tested using biological and medical cross-lingual spelling variants and out-of-vocabulary (OOV) words in Spanish-English and Finnish-English TRT. The tests indicate that FITE-TRT translation may achieve high translation recall and high translation precision, and that it can also indicate with high reliability which words the technique cannot translate. In CLIR experiments, dictionary-based CLIR augmented with FITE-TRT performed substantially better than basic dictionary-based CLIR in which OOV keys were kept intact.
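The general idea of frequency-based candidate selection can be sketched as follows. This is a simplified illustration, not the FITE statistics themselves, which this page does not spell out: the rules, the frequency table, the `min_count` threshold and the function names are hypothetical, and the sketch simply assumes that the correct equivalent is the candidate seen most often in target-language text.

```python
from itertools import product

# Hypothetical Finnish-to-English rules, applied liberally (each is optional).
FITE_RULES = [("k", "c"), ("s", "c"), ("um", "e")]

# Hypothetical target-language word frequencies (e.g. counts from a corpus).
FREQ = {"calcium": 1200, "calcite": 40}


def candidates(word, rules):
    """Generate every form obtained by applying any subset of the rules."""
    forms = set()
    for mask in product([False, True], repeat=len(rules)):
        form = word
        for use, (src, tgt) in zip(mask, rules):
            if use:
                form = form.replace(src, tgt)
        forms.add(form)
    return forms


def pick_by_frequency(forms, freq, min_count=1):
    """Return the most frequent candidate, or None if none is attested.

    Returning None marks the word as one the technique cannot translate.
    """
    best = max(forms, key=lambda form: freq.get(form, 0))
    return best if freq.get(best, 0) >= min_count else None


forms = candidates("kalsium", FITE_RULES)
print(len(forms))                      # 8
print(pick_by_frequency(forms, FREQ))  # calcium
print(pick_by_frequency(forms, {}))    # None
```

Of the eight generated forms, only "calcium" occurs in the frequency table, so it is selected. With an empty table no candidate reaches the threshold and the word is reported as untranslatable.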

Duration

2001 - 2007

Researchers

Dr. Ari Pirkola
Dr. Jarmo Toivonen, Tampere University of Technology
Mr. Heikki Keskustalo

Prof. Kalervo Järvelin

Publications

  1. Pirkola, A. & Toivonen, J. & Keskustalo, H. & Järvelin, K. (2005). Frequency-based Identification of Correct Translation Equivalents (FITE) Obtained through Transliteration Rules. Submitted.
  2. Pirkola, A. & Toivonen, J. & Keskustalo, H. & Järvelin, K. (2006). FITE-TRT: A High Quality Translation Technique for OOV Words. In: Wainwright, R.L. & Ossowski, S. et al. (Eds.) Proceedings of the 21st Annual ACM Symposium on Applied Computing. Dijon, France, April 23-27, 2006, pp. xxx-xxx. To appear.
  3. Toivonen, J. & Pirkola, A. & Keskustalo, H. & Visala, K. & Järvelin, K. (2005). Translating cross-lingual spelling variants using transformation rules. Information Processing & Management 41(4): 859-872.
  4. Pirkola, A. & Toivonen, J. & Keskustalo, H. & Visala, K. & Järvelin, K. (2003). Fuzzy Translation of Cross-Lingual Spelling Variants. In: Callan, J. & Hawking, D. & Smeaton, A. & Clarke, C. (Eds.) Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR '03), Toronto, Canada, July 28 – August 1, 2003. New York, NY: ACM Press, pp. 345-352.


Updated 29.12.2005 Responsibility for updating: KJ
