University of Tampere, Faculty of Information Sciences
Department of Information Studies

Project: COCOT – corpus-based CLIR methods

Description

This project develops and evaluates a new method for creating a comparable document collection from two document collections in different languages. The best query keys are extracted from each source-collection document with the relative average term frequency (RATF) formula. The keys are then translated into the target language with UTACLIR, a dictionary-based query translation program. The resulting word lists are run as queries against the target collection with the nearest neighbor method. The documents are aligned with topical and/or date-restricted alignment schemes. The combined scheme performed best when the relatedness of the document pairs was assessed on a five-point relevance scale. The project also develops methods for automatically acquiring comparable corpora in various languages from the web.
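The key extraction and alignment steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation. The RATF form used here, (cf/df) * 1000 / ln(df + SP)^p, is the one commonly given in the RATF literature rather than on this page. The parameter defaults (sp, p), the function names, the score threshold, and the date window are all hypothetical. Translation with UTACLIR and the nearest neighbor search itself are left out; the align function only assumes that scored hits from the target collection are already available.

```python
import math
from collections import Counter
from datetime import date


def ratf(cf, df, sp=3000.0, p=3.0):
    """RATF score of a term: average term frequency (cf / df), scaled by
    1000 and damped by a power of the log document frequency.
    sp and p are tuning parameters; the defaults are placeholders."""
    return (cf / df) * 1000.0 / math.log(df + sp) ** p


def collection_stats(docs):
    """Collection frequency (cf) and document frequency (df) per term,
    computed from a list of tokenized documents."""
    cf, df = Counter(), Counter()
    for tokens in docs:
        cf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts once per term
    return cf, df


def best_keys(tokens, cf, df, n=3):
    """The n distinct terms of one document with the highest RATF scores.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(set(tokens), key=lambda t: (-ratf(cf[t], df[t]), t))
    return ranked[:n]


def align(source_date, hits, min_score=0.5, max_days=1):
    """Hypothetical combined scheme: keep a target hit only if it passes
    both a similarity threshold (topical) and a date window (date-restricted).
    hits is a list of (doc_id, score, publication_date) tuples."""
    return [
        doc_id
        for doc_id, score, pub_date in hits
        if score >= min_score and abs((pub_date - source_date).days) <= max_days
    ]
```

A typical use would call collection_stats once over the source collection, then best_keys per source document to build the query that is translated and run against the target collection, and finally align on the returned hits.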

The resulting comparable collections, consisting of several thousand document alignments, are used in retrieval experiments to translate queries automatically into the target languages; this system is called COCOT. Various test setups have been tried, using the dictionary-based translator UTACLIR, COCOT, other translation methods, and methods for handling out-of-vocabulary (OOV) words, both alone and in combination. The combined systems are the most effective. The final papers compare topic-specific comparable corpora with topically more general parallel corpora in retrieval. The findings favor topically accurate corpora, even when their alignment quality is weaker.

Duration

2004 - 2008

Researchers

Mr. Tuomas Talvensaari, supervised by Prof. Martti Juhola (Dept. of Computer Science) and Prof. Kalervo Järvelin. Talvensaari will submit his thesis in spring 2008.

 

Publications

  1. Talvensaari, T. & Laurikkala, J. & Järvelin, K. & Juhola, M. (2006). A study on automatic creation of a comparable document collection in cross-language information retrieval. Journal of Documentation 62(3): 372-387. (Preprint)
  2. Talvensaari, T. & Laurikkala, J. & Järvelin, K. & Juhola, M. (2007). Corpus-based CLIR in retrieval of highly relevant documents. Journal of the American Society for Information Science and Technology (JASIST) 58(3): 322-334. (Preprint)
  3. Talvensaari, T. & Laurikkala, J. & Järvelin, K. & Juhola, M. (2007). Creating and exploiting a comparable corpus in cross-language information retrieval. ACM Transactions on Information Systems (ACM TOIS) 25(1): article 4. (Preprint)
  4. Talvensaari, T. & Pirkola, A. & Järvelin, K. & Juhola, M. & Laurikkala, J. (2008). Focused Web Crawling in the Acquisition of Comparable Corpora. Information Retrieval 11(xx): xxx-yyy, in press. (Preprint)
  5. Talvensaari, T. (2008). Effects of Aligned Corpus Quality and Size in Corpus-based CLIR. In: Ruthven, I. et al. (Eds.), Proc. of the 30th European Conference on Information Retrieval (ECIR 2008), Glasgow, April 2008. Heidelberg: Springer, Lecture Notes in Computer Science vol. 4956, pp. xx-yy. (Preprint)

 

Updated 11.3.2008 Responsibility for updating: KJ

 

