University of Tampere
Faculty of Information Sciences
Department of Information Studies

 

Information Retrieval Laboratory

The Information Retrieval Laboratory (founded in 1991) is the only laboratory environment in Finland suitable for research on and teaching of IR systems. The laboratory supports empirical research on text retrieval from large full text databases (five doctoral dissertations were completed during 1999-2004, and six doctoral dissertation projects were under way in autumn 2004). The laboratory also supports contract research projects and teaching in the advanced courses of the department.

Main Databases

Research environments suitable for testing text retrieval methods and systems are available. The main databases are large Finnish and English full text research databases. The databases can be used in contract research projects, for empirical research, and for IR teaching. The databases reside on the Unix server of the laboratory.

Each research database (i.e. research and teaching environment) consists of three parts: the article or picture database; the database containing descriptions of the search topics; and the database containing human relevance judgments.

The Finnish full text research database consists of electronic articles from the newspapers Keskisuomalainen, Kauppalehti and Aamulehti. The database contains about 54,000 articles. A set of 35 search topics has been created on the basis of the articles. Different search formulations and query versions have also been designed for each search topic. Based on the article database and the search topic database, about 17,000 human relevance judgments have been made. These judgments use a four-level relevance scale.
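As a rough illustration of how the three parts of a research database fit together, the following Python sketch models them in memory. Everything in it is an assumption made for illustration: the class name ResearchDatabase, its fields, the sample data, and the numbering of the four relevance levels as 0 to 3 do not come from the laboratory's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchDatabase:
    """Hypothetical container for the three parts of one research database."""
    articles: dict = field(default_factory=dict)   # article id -> full text
    topics: dict = field(default_factory=dict)     # topic id -> topic description
    judgments: dict = field(default_factory=dict)  # (topic id, article id) -> level

    def judge(self, topic_id, article_id, level):
        # Assumed encoding of the four-level scale: 0 (lowest) to 3 (highest).
        if level not in range(4):
            raise ValueError("relevance level must be 0, 1, 2 or 3")
        self.judgments[(topic_id, article_id)] = level

    def relevant_articles(self, topic_id, min_level=1):
        # Articles judged at or above the chosen level for one topic.
        return {
            article_id
            for (t, article_id), level in self.judgments.items()
            if t == topic_id and level >= min_level
        }


# Invented sample data, for illustration only.
db = ResearchDatabase()
db.articles.update({"a1": "first article", "a2": "second article", "a3": "third article"})
db.topics["t1"] = "paper industry exports"
db.judge("t1", "a1", 3)
db.judge("t1", "a2", 1)
db.judge("t1", "a3", 0)

lenient = db.relevant_articles("t1")               # any level above 0
strict = db.relevant_articles("t1", min_level=3)   # only the highest level
```

Keeping the judgments separate from the articles lets the same collection be evaluated under a lenient or a strict relevance threshold without touching the texts.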

The English full text research database TREC is delivered as part of international research cooperation. The subset of the database used in the laboratory contains about 500,000 English articles and 350 test questions with their respective relevance judgments. The TREC database is gradually updated. The laboratory also offers several smaller English text collections, for example small bibliographical collections, and several research collections of the CLEF campaign.

We have created a graded relevance corpus for the TREC dataset (see Sormunen 2002). It is now available to the research community. We expect research papers that employ the corpus to give appropriate acknowledgements.

The CLEF dataset (CLEF - Cross Language Evaluation Forum) is in our use as well. It consists of newspaper articles in eight languages: English, Finnish, Swedish, German, French, Spanish, Italian and Dutch. We use the CLEF databases especially for cross-language IR research.

Programs and Applications

The software packages used in text indexing and retrieval are InQuery, TRIP, Lemur and Terrier. In InQuery (Computer Science Department of the University of Massachusetts) a query can be evaluated using Boolean logic or Bayesian inference networks (relevance ranking). TRIP is based on Boolean logic. Lemur (Computer Science Department at the University of Massachusetts and School of Computer Science at Carnegie Mellon University) has a language modeling background, but it also supports traditional methods. Terrier (University of Glasgow) is based on a probabilistic framework for IR.
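The difference between Boolean evaluation and relevance ranking can be sketched with a toy inverted index in Python. This is only an illustration: the sample documents are invented, the function names are hypothetical, and the ranking below is a plain term-frequency sum used as a stand-in, not the inference network model of InQuery or the API of any of the systems named above.

```python
from collections import Counter, defaultdict

# Invented sample documents.
sample_docs = {
    "d1": "paper industry exports grow",
    "d2": "paper prices fall as exports fall",
    "d3": "forest industry news",
}

# Inverted index: term -> Counter mapping document id to term frequency.
inverted = defaultdict(Counter)
for doc_id, text in sample_docs.items():
    for term in text.split():
        inverted[term][doc_id] += 1


def boolean_and(terms):
    """Return the unordered set of documents containing every query term."""
    postings = [set(inverted.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def ranked(terms):
    """Return documents ordered by summed term frequency (toy scoring)."""
    scores = Counter()
    for t in terms:
        for doc_id, tf in inverted.get(t, {}).items():
            scores[doc_id] += tf
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


exact_set = boolean_and(["paper", "exports"])
ranking = ranked(["paper", "exports", "industry"])
```

The Boolean query returns a set with no internal order, while the ranked query also returns partially matching documents and orders them by score.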

In InQuery, several different types of indexes have been built for one and the same text database. For example, in one index version, the Finnish words have been reduced to their base forms using the TWOL software of Lingsoft Corp. FINTWOL is a morphological analysis program for Finnish developed at the University of Helsinki. The English and German morphological programs ENGTWOL and GERTWOL are also available at the laboratory. Several electronic dictionaries by Kielikone are used in cross-language information retrieval projects.

TRIP (by TietoEnator) is a full text information retrieval software package. At the moment the Finnish words of the database residing on it have not been processed by a morphological program, so the index words are in inflected forms.
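The practical effect of indexing base forms instead of inflected forms can be sketched as follows. The small lookup table is a hypothetical stand-in for a real morphological analyzer; it does not reproduce the interface or output of FINTWOL, and the sample sentences are invented.

```python
# Toy lemma table (assumption): inflected Finnish forms of "talo" (house).
TOY_LEMMAS = {"talossa": "talo", "talot": "talo", "talon": "talo"}


def to_base_form(word):
    """Return the base form if the toy table knows it, else the word itself."""
    return TOY_LEMMAS.get(word, word)


def build_index(docs, normalize=lambda w: w):
    """Map each (optionally normalized) word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(normalize(word), set()).add(doc_id)
    return index


sample_articles = {
    "a1": "talossa asuu perhe",
    "a2": "talot ovat kalliita",
    "a3": "talo myytiin",
}

inflected_index = build_index(sample_articles)                # words as they occur
base_form_index = build_index(sample_articles, to_base_form)  # words reduced to base forms

hits_inflected = inflected_index.get("talo", set())
hits_base_form = base_form_index.get("talo", set())
```

With the inflected index, the query word "talo" finds only the article that contains exactly that form; with the base form index it finds all three.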

We have developed an interactive research and learning environment, QPA (Query Performance Analyzer). The first version of QPA, IR-Game, was introduced in 1998. This application, designed in the Department, gives users immediate and automatic visual feedback on their search performance. The searcher can choose among several types of feedback, for example a set of recall-precision curves showing the performance of different query formulations on the same screen. At the moment the tool contains many components and functionalities, e.g. picture and text databases, two different retrieval engines, and several different types of dictionaries.
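The data behind a recall-precision curve can be computed from a ranked result list and the set of relevant documents. The sketch below is a generic textbook calculation under assumed inputs, not code from QPA; the function name and the two example rankings are invented.

```python
def recall_precision_points(ranking, relevant):
    """Return (recall, precision) pairs, one per relevant document retrieved."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Recall: share of all relevant documents found so far.
            # Precision: share of retrieved documents that are relevant.
            points.append((hits / len(relevant), hits / rank))
    return points


# Two hypothetical query formulations for the same topic.
relevant_docs = {"d1", "d2"}
formulation_a = ["d1", "d3", "d2", "d4"]
formulation_b = ["d3", "d4", "d1", "d2"]

curve_a = recall_precision_points(formulation_a, relevant_docs)
curve_b = recall_precision_points(formulation_b, relevant_docs)
```

Plotting both lists of points in the same chart shows that the first formulation reaches each recall level at higher precision than the second.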

Current research in the lab includes the development of novel fuzzy string matching methods and a matching environment. The query translation system UTACLIR has been developed as part of the EU-funded CLARITY research program.
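To give an idea of what fuzzy string matching means in practice, the sketch below scores spelling variants by the overlap of their character bigrams (Dice coefficient). This is a common generic technique chosen here as an assumption; it is not a description of the methods developed in the lab, and the word list is invented.

```python
def bigrams(word):
    """Return the set of character bigrams of a word, padded with spaces."""
    padded = f" {word.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice coefficient of two bigram sets: 1.0 for identical words, 0.0 for no overlap."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def best_matches(query, vocabulary, top_n=3):
    """Return the index words most similar to the query, best first."""
    return sorted(vocabulary, key=lambda w: (-dice(query, w), w))[:top_n]


# Hypothetical index vocabulary containing a spelling variant of the query.
vocabulary = ["gorbatshov", "gorbunov", "tampere"]
candidates = best_matches("gorbachev", vocabulary)
```

Matching of this kind is useful for proper names and other words that a dictionary-based translation step cannot handle, because spelling variants across languages still share many character sequences.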

We have developed an ontology-based Web interface, QUCCOO (QUery ConstruCtion with OntOlogies for direct content access), as well as a tool for editing search ontologies, ShOE (SearcH Ontology Editor).

To store the user logs of participants in our user studies, we have developed a tool called ProxyLogger, as well as LogBrowser for browsing the logs.

Hardware

  • Unix server SUN Fire 280R (4 GB RAM, 2 x UltraSPARC-III processors)
  • backup tape station Sun FlexiPak (144 GB DDS-3)
  • search session recording system, based on Macintosh G4 computer, including a VGA/Video signal transformer, microphone, television, and video recorder
  • multimedia workstation Pinus PeII
  • color and black-and-white printers

Research

The resources of the lab are used in several projects. For descriptions, see

Teaching

The laboratory is part of the IR teaching of the Department. It is used for lecture demonstrations, exercises, projects and theses. Examples of its use are:

  • guided teaching of full text IR and of picture IR (TRIP, InQuery)
  • course on database design and database implementation (TRIP)
  • using the interactive learning environment tool (IR-Game)

Updated 7th of August 2006 by Eija Airio (email: eija.airio@uta.fi)
