Project: Morphological Processing and Non-processing in Monolingual Information Retrieval
Description
The project consisted of two parts. In the first part, different morphological methods for managing keyword variation were tested in full-text retrieval of Finnish in a probabilistic query environment (InQuery). The following morphological methods were compared during the project:
- lemmatization (FINTWOL)
- generation of inflectional stems
- enhanced inflectional stems (inflectional stem enhancements with regular expressions)
- stemming (Snowball stemmer for Finnish)
- limited coverage of morphological variation of the keywords (based on the statistical distribution of case forms in texts of the language). This method was named FCG, frequent case (form) generation.
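The idea behind FCG can be illustrated with a minimal Python sketch. It is not the project's implementation: the paradigm table is listed by hand for a single Finnish noun, the ranking of case forms is a hypothetical placeholder rather than a measured distribution, and the function names `frequent_forms` and `expand_query` are invented for this example. A real system would take the forms from a word form generator and the ranking from corpus statistics.

```python
# Hand-listed paradigm for one Finnish noun ("talo", house).
# Assumption: a real system would obtain these forms from a word form generator.
PARADIGM = {
    "talo": {
        ("nominative", "sg"): "talo",
        ("genitive", "sg"): "talon",
        ("partitive", "sg"): "taloa",
        ("inessive", "sg"): "talossa",
        ("elative", "sg"): "talosta",
        ("illative", "sg"): "taloon",
        ("nominative", "pl"): "talot",
        ("genitive", "pl"): "talojen",
        ("partitive", "pl"): "taloja",
        ("inessive", "pl"): "taloissa",
    },
}

# Hypothetical ranking of case forms, most frequent first.
# Assumption: illustrative order only, not taken from corpus statistics.
CASE_RANKING = [
    ("nominative", "sg"),
    ("genitive", "sg"),
    ("partitive", "sg"),
    ("inessive", "sg"),
    ("nominative", "pl"),
    ("illative", "sg"),
    ("elative", "sg"),
    ("partitive", "pl"),
    ("genitive", "pl"),
    ("inessive", "pl"),
]


def frequent_forms(lemma, n):
    """Return the forms of `lemma` for the n top-ranked case slots."""
    forms = PARADIGM[lemma]
    return [forms[slot] for slot in CASE_RANKING[:n] if slot in forms]


def expand_query(keywords, n):
    """Map each keyword to its n most frequent forms (limited coverage)."""
    return {keyword: frequent_forms(keyword, n) for keyword in keywords}


if __name__ == "__main__":
    # Cover only the three top-ranked forms instead of the whole paradigm.
    print(expand_query(["talo"], 3))
```

Each keyword is then searched as the small set of generated forms against an index of inflected word forms, so neither the query nor the index needs lemmatization or stemming.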
In short, the results of the papers are as follows: although lemmatization performs best when mean average precision is considered, all the other methods compare fairly well with it in different settings and in two different collections (TUTK and CLEF 2003). Two of these methods, the third and the fifth in the list above (enhanced inflectional stems and FCG), are new developments based on the ideas of the researcher.
In the final phase of the project we focused on testing our FCG method with other European languages that are morphologically complex enough to be of interest (Swedish, German and Russian) in the CLEF 2003 and 2004 collections. For Swedish and German we achieved positive results; our Russian results were ambiguous, most likely because of the limited size of the collection. We also tested short queries for Finnish in this last phase.
The final product of the project is the Ph.D. thesis Reductive and Generative Approaches to Morphological Variation of Keywords in Monolingual Information Retrieval.
The FCG Project
The FCG project continued the work begun in the Ph.D. thesis. We continued testing and evaluating the FCG method with real word form generators for several languages (English, Finnish, German, Swedish).
Automatic generation of the most frequent forms was tried successfully with English, Finnish, German and Swedish. Promising CLIR results were also achieved with the approach when queries were translated from English to Finnish, German and Swedish with different MT programs.
Duration
2003 - 2007 (Morphological Processing and Non-processing...)
2007 - 2008 (The FCG project)
Researchers
Kimmo Kettunen, Ph.D.
Publications
2009
- Kettunen, Kimmo 2009. Choosing the best MT programs for CLIR purposes - can MT metrics be helpful? To appear in Proceedings of 31st European Conference on Information Retrieval (ECIR2009).
- Kettunen, Kimmo 2009. Reductive and Generative Approaches to Management of Morphological Variation of Keywords in Monolingual Information Retrieval - an Overview. To appear in Journal of Documentation Volume 65 Issue 2.
2008
- Kettunen, Kimmo 2008b. Automatic Generation of Frequent Case Forms of Query Keywords in Text Retrieval. In Nordström, B. and Ranta, A. (eds.), Advances in Natural Language Processing, GoTAL 2008, LNAI 5221, 222-236. Preprint
- Kettunen, Kimmo 2008c. Frequent Case Form Generation of Query Keywords in Text Retrieval. In Nuno Guimarães & Pedro Isais (eds.), Proceedings of IADIS International Conference Applied Computing, 164-170.
- Sadeniemi, Markus, Kettunen, Kimmo, Lindh-Knuutila, Tiina & Honkela, Timo 2008. Complexity of European Union Languages: A Comparative Approach. Journal of Quantitative Linguistics 15(2), 185-211.
2007
- Kettunen, Kimmo 2007. Reductive and Generative Approaches to Morphological Variation of Keywords in Monolingual Information Retrieval. Acta Universitatis Tamperensis 1261.
- Kettunen, Kimmo, Airio, Eija & Järvelin, Kalervo 2007. Restricted Inflectional Form Generation in Management of Morphological Keyword Variation. Information Retrieval 10, 415-444.
- Kettunen, Kimmo 2007. Management of Keyword Variation with Frequency Based Generation of Word Forms in IR. In Proceedings of SIGIR'07, 691-692.
- Kettunen, Kimmo 2007. Managing Keyword Variation with Frequency Based Generation of Word Forms in IR. In J. Nivre, H-J Kaalep, K. Muischnek and M. Koit (Eds.), Proceedings of Nodalida 2007, 318-323.
2006
- Kettunen, Kimmo 2006. Developing an automatic linguistic truncation operator for best-match retrieval of Finnish in inflected word form text database indexes. Journal of Information Science, 32(5), 465–479.
- Kettunen, Kimmo & Airio, Eija 2006. Is a morphologically complex language really that complex in full-text retrieval? In T. Salakoski et al. (Eds.): Advances in Natural Language Processing, LNAI 4139, pp. 411–422, 2006. Springer-Verlag Berlin Heidelberg.
- Kettunen, Kimmo, Sadeniemi, Markus, Lindh-Knuutila, Tiina & Honkela, Timo 2006. Analysis of EU Languages through Text Compression. In T. Salakoski et al. (Eds.): Advances in Natural Language Processing, LNAI 4139, pp. 99–109, 2006. Springer-Verlag Berlin Heidelberg.
2005
- Kettunen, Kimmo, Kunttu, Tuomas & Järvelin, Kalervo 2005. To stem or lemmatize a highly inflectional language in a probabilistic IR environment? Journal of Documentation 61 (4), 476–496.
- Kettunen, Kimmo 2005. Enhanced Inflectional Stems as search keys in best-match IR. In M. Langemets & P. Penjam (eds.) HLT-2005, Second Baltic Conference on Human Language technologies, 149–155.
2004
- Kettunen, Kimmo 2004. Covering the morphological variation of Finnish query nouns in a probabilistic best-match system. In The First Baltic Conference, Human Language technologies - The Baltic Perspective. Riga, Latvia, 73–80.
Presentations and other material
- Kettunen, Kimmo 2008a. Machine Translation Meets Frequent Case Generation in Query Translation Based CLIR. SLTC 2008.
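- Kettunen, Kimmo 2008b. Sanoja analysoivat ja tuottavat ohjelmat hakutermien vaihtelun hallinnassa tekstitiedonhaussa [Programs that analyze and generate words in managing search term variation in text retrieval]. Informaatiotutkimus 27 (1), 24-26.
- Kettunen, Kimmo 2008c. "Suomen kielen substantiivilla on noin 2000 erilaista muotoa" - vai onko sittenkään? ["A Finnish noun has about 2000 different forms" - or does it after all?] Presentation at the XXXV Kielitieteen päivät (Days of Linguistics) in Vaasa, 24.5. (slides in Finnish) (abstract in Finnish)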
- Kettunen, Kimmo 2005. Sijamuodot haussa - tarvitseeko kaikkea hakutermien morfologista vaihtelua kattaa? [Case forms in retrieval - does all the morphological variation of search terms need to be covered?] M.Sc. thesis, Dept. of Information Studies.
Updated 16.12.2008 Responsibility for updating: KK