Project:
FITE-TRT – Transliteration-Based Matching for Out-Of-Vocabulary Words in CLIR Applications
Description
Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR), as they are often not included in translation dictionaries. However, they may be essential search keys for the searcher's topic. Luckily, technical terms and proper names in different languages often share the same Latin or Greek origin (or at least orthography) and are thus spelling variants of each other. We have developed a three-step fuzzy translation technique for such cross-lingual spelling variants, called TRT (transliteration rule based translation). First, TRT rules are generated automatically using translation dictionaries as source data. The rules specify character transformations in context between languages, together with their frequency and reliability. In the second step, the transformation rules are applied to source words to render them more similar to their target language equivalents. Initially, the third step consisted of translating the intermediate forms so obtained into the target language using fuzzy matching, e.g. n-gram matching. The effectiveness of the technique was evaluated empirically using five source languages and English as the target language.
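The second and third steps described above can be illustrated with a minimal sketch. It is not the project's implementation: the rule list `RULES`, the helper names, and the use of padded bigrams with the Dice coefficient are all assumptions made for illustration, and real TRT rules also carry context, frequency and reliability information that is omitted here.

```python
# Hypothetical Finnish-to-English rules: (source substring, target substring).
# Real TRT rules are generated from dictionaries and include context,
# frequency and reliability; none of that is modelled in this sketch.
RULES = [("k", "ch"), ("ia", "y")]


def apply_rules(word, rules):
    """Return the source word plus every form reachable by applying
    any subset of the rules (each rule rewrites all of its occurrences)."""
    forms = {word}
    for src, tgt in rules:
        forms |= {form.replace(src, tgt) for form in forms}
    return forms


def bigrams(word):
    """Set of character bigrams of the word, padded at both ends."""
    padded = f"_{word}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice coefficient over the bigram sets of two words."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def best_match(source_word, rules, lexicon):
    """Pick the target word most similar to any intermediate form."""
    forms = apply_rules(source_word, rules)
    return max(lexicon, key=lambda target: max(dice(f, target) for f in forms))


if __name__ == "__main__":
    lexicon = ["psychology", "physiology", "pathology"]
    print(sorted(apply_rules("psykologia", RULES)))
    print(best_match("psykologia", RULES, lexicon))
```

Applying the rules before matching is what lets the intermediate form line up with the target word; matching the untransformed source word against the lexicon would rely on the n-gram overlap alone.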
If transformation rules are used liberally, many target language candidate word forms are generated. To identify the correct one, we have devised a novel statistical technique, FITE (frequency-based identification of translation equivalents), which identifies the translation equivalents of source words among the forms obtained by TRT rules. The effectiveness of FITE has been tested using biological and medical cross-lingual spelling variants and OOV words in Spanish-English and Finnish-English TRT. The tests indicate that FITE-TRT translation may achieve high translation recall and high translation precision, as well as high indication prediction, that is, reliable identification of the words that the technique cannot translate. In CLIR experiments, dictionary-based CLIR augmented with FITE-TRT performed substantially better than basic dictionary-based CLIR where OOV keys were kept intact.
An effective implementation of the FITE-TRT technique has also been developed (Loponen et al., 2008).
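The general idea of choosing among TRT candidates by frequency can be sketched as follows. This is a deliberately simplified illustration and not the FITE method as published or implemented by the project: it assumes a plain word-frequency table for the target language (`target_freq`), a hypothetical threshold `min_freq`, and a hypothetical function name, and it simply keeps the most frequent attested candidate.

```python
from collections import Counter


def select_by_frequency(candidates, target_freq, min_freq=1):
    """Return the candidate with the highest target-language frequency.

    Returns None when no candidate reaches min_freq, which stands in for
    indicating that the word could not be translated.
    This is an illustrative simplification, not the published FITE method.
    """
    scored = [(target_freq.get(form, 0), form) for form in candidates]
    freq, best = max(scored, default=(0, None))
    return best if freq >= min_freq else None


if __name__ == "__main__":
    # Hypothetical frequency counts from a target-language corpus.
    target_freq = Counter({"psychology": 42, "pathology": 17})
    candidates = {"psykologia", "psychologia", "psykology", "psychology"}
    print(select_by_frequency(candidates, target_freq))
    print(select_by_frequency({"xyzzy"}, target_freq))
```

Returning `None` rather than the least bad candidate mirrors the point made above that the technique should also point out words it cannot translate.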
Duration
2001 - 2010
Researchers
Dos. Ari Pirkola
Mr. Aki Loponen
Dr. Jarmo Toivonen, Tampere University of Technology (2001-2007)
Mr. Heikki Keskustalo
Prof. Kalervo Järvelin
Publications
- Pirkola, A. & Toivonen, J. & Keskustalo, H. & Järvelin, K. (2007). Frequency-based identification of correct translation equivalents (FITE) obtained through transformation rules. ACM Transactions on Information Systems (TOIS) 26(1): article 2. Preprint
- Pirkola, A. & Toivonen, J. & Keskustalo, H. & Järvelin, K. (2006). FITE-TRT: A High Quality Translation Technique for OOV Words. In: Wainwright, R.L. & Ossowski, S. & al. (Eds.) Proceedings of the 21st Annual ACM Symposium on Applied Computing. Dijon, France, April 23-27, 2006, pp. 1043-1049. Preprint
- Toivonen, J. & Pirkola, A. & Keskustalo, H. & Visala, K. & Järvelin, K. (2005). Translating cross-lingual spelling variants using transformation rules. Information Processing & Management 41(4): 859-872. Preprint
- Pirkola, A. & Toivonen, J. & Keskustalo, H. & Visala, K. & Järvelin, K. (2003). Fuzzy Translation of Cross-Lingual Spelling Variants. In: Callan, J. & Hawking, D. & Smeaton, A. & Clarke, C. (Eds.) Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR '03), Toronto, Canada, July 28 – August 1, 2003. New York, NY: ACM Press, pp. 345-352. Preprint
- Loponen, A. & Pirkola, A. & Järvelin, K. (2008). An Effective Implementation of the FITE-TRT Method for OOV Word Translation. In: Ruthven, I. & al. (Eds.) Proceedings of the 30th European Conference on Information Retrieval (ECIR 2008), Glasgow, April 2008. Heidelberg: Springer, Lecture Notes in Computer Science vol. 4956, pp. xx-yy
Updated 12.03.2008
Responsibility for updating: KJ