Project: COCOT – corpus-based CLIR methods

Description

This project develops and evaluates a new method for creating a comparable document collection from two document collections in different languages. The best query keys are extracted from a source collection with the relative average term frequency (RATF) formula. The keys are then translated into the target language with the dictionary-based query translation program UTACLIR. The resulting word lists are used as queries that are run against the target collection with the nearest neighbor method. The documents are aligned with topical and/or date-restricted alignment schemes. The combined scheme was found to be the best when the relatedness of the document pairs was assessed on a five-degree relevance scale.

The project also develops methods for acquiring comparable corpora in various languages automatically from the web. In retrieval experiments, the resulting comparable collections, consisting of several thousand document alignments, are used to translate queries automatically into target languages; this system is called COCOT. Various test set-ups using the dictionary-based translator UTACLIR and COCOT, as well as other translation methods and methods for handling out-of-vocabulary words, have been tested alone and in combination. The combined systems offer the best effectiveness. The final papers test topic-specific comparable corpora against topically more general but parallel corpora in retrieval. The findings favor topically accurate corpora even if their alignment quality is weaker.

Duration

2004 - 2008

Researchers

Mr. Tuomas Talvensaari – supervised by Prof. Martti Juhola (Dept. of Computer Science) and Prof. Kal Järvelin. Talvensaari will submit his thesis in spring 2008.
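Illustration of the key-extraction step in the description: the following is a minimal Python sketch of how query keys might be ranked with an RATF-style score. It is not the project's implementation. The formula, RATF(k) = (cf_k / df_k) × 10^3 / ln(df_k + SP)^p, is written as it is commonly stated in the RATF literature and should be checked against the project's publications. The parameter values `sp` and `p`, the function names `ratf_scores` and `best_keys`, and the choice to rank the terms of a single document are all assumptions made for this example.

```python
import math
from collections import Counter


def ratf_scores(documents, sp=3000, p=3):
    """Score every term in the collection with an RATF-style formula.

    documents: list of token lists, one list per document.
    sp, p: scaling and power parameters (illustrative values, assumed here).
    """
    cf = Counter()  # collection frequency: total occurrences of each term
    df = Counter()  # document frequency: number of documents containing it
    for tokens in documents:
        cf.update(tokens)
        df.update(set(tokens))
    return {
        term: (cf[term] / df[term]) * 1000 / math.log(df[term] + sp) ** p
        for term in cf
    }


def best_keys(documents, doc_index, n=3, sp=3000, p=3):
    """Return the n highest-scoring terms of one document (hypothetical helper)."""
    scores = ratf_scores(documents, sp, p)
    terms = set(documents[doc_index])
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(terms, key=lambda t: (-scores[t], t))[:n]
```

In the workflow described above, the selected keys would then be translated and run as a query against the target collection; those steps depend on UTACLIR and a retrieval engine and are not sketched here.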
Publications
Updated 11.3.2008 Responsibility for updating: KJ