University of Tampere, Faculty of Information Sciences
Department of Information Studies

 

Project: BioNLP

Project Description

Natural language constitutes an essential part of information retrieval (IR). Its flexibility may be a strength in human communication, but variation in the use of natural language expressions is a major problem in information retrieval. Effective IR needs to take morphological, syntactic, semantic and discourse-level variation and ambiguity into account. The BioNLP project addresses linguistic problems typical of the biosciences, as well as general, domain-independent linguistic problems that affect the effectiveness of IR in the field. The problems specific to the biosciences are gene name synonymy and ambiguity, and the identification of phrasal gene and protein names. The general problems are orthographic variation of names, named entity recognition and the identification of the full forms of acronyms.

Gene name synonymy is one of the hardest problems because alternative gene names are so widely used. Authors very often use alternative names and symbols in their articles instead of the official names and symbols. Moreover, only some of the genes of humans and model organisms have been discovered so far. The synonymy problem is therefore not a passing one, since new gene names enter the literature continuously. We will explore the severity of the gene name synonymy problem in genomics information retrieval, and develop and apply natural language processing and data mining techniques and tools that allow effective gene name searching.
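
As a rough illustration of the synonymy and orthographic variation problems described above, the following minimal sketch expands a query term with its known alternative names. It is not the project's method: the synonym table, the normalization rule and the function names (normalize, build_index, expand_query) are all assumptions made for the example.

```python
import re

# Illustrative synonym table (an assumption for this sketch): an official
# gene symbol mapped to alternative names. A real table would be built
# from a curated nomenclature resource.
SYNONYMS = {
    "TP53": ["p53", "tumor protein p53"],
}


def normalize(name: str) -> str:
    """Collapse case, hyphens and whitespace so that orthographic
    variants such as 'P-53', 'p 53' and 'p53' map to the same key."""
    return re.sub(r"[\s\-]+", "", name).lower()


def build_index(synonyms: dict) -> dict:
    """Map every normalized name (official or alternative) to the full
    list of names in its synonym group."""
    index = {}
    for official, alternatives in synonyms.items():
        group = [official, *alternatives]
        for name in group:
            index[normalize(name)] = group
    return index


def expand_query(term: str, index: dict) -> list:
    """Return all known names for a query term; an unknown term is
    returned unchanged as a one-element list."""
    return index.get(normalize(term), [term])


index = build_index(SYNONYMS)
expanded = expand_query("P-53", index)
unknown = expand_query("some unknown gene", index)
```

Here a query for the variant spelling "P-53" is expanded to the whole synonym group, so documents using either the official symbol or an alternative name can be matched. A term missing from the table passes through unchanged.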

We evaluate the effectiveness of IR supported by the techniques and tools developed in the project using the TREC Genomics Track test setting (test collections, test topics, relevance assessments and evaluation methods).


Duration

2004 - 2007

Researchers

Principal researcher Dr. Ari Pirkola

Publications

  1. Pirkola, A. (2006). The problem of gene name synonymy from the information retrieval perspective. Journal of the American Society for Information Science and Technology. To appear.
  2. Pirkola, A., Toivonen, J., Keskustalo, H. & Järvelin, K. (2006). FITE-TRT: A high quality translation technique for OOV words. ACM SAC - The 21st Annual ACM Symposium on Applied Computing, Dijon, France, April 23 - 27, 2006.
  3. Pirkola, A. (2005). TREC 2005 Genomics track experiments at UTA. Notebook paper. TREC Conference, November 15-18, 2005, Gaithersburg, Maryland, USA.
  4. Pirkola, A. (2005). TREC 2004 Genomics track experiments at UTA: The effects of primary keys, bigram phrases and query expansion on retrieval performance. The Thirteenth Text REtrieval Conference, Gaithersburg, MD. Available at: http://trec.nist.gov/pubs/trec13/t13_proceedings.html
  5. Pirkola, A. (2004). Genomics information retrieval: Challenges, problems and empirical findings. Third biennial DISSAnet Conference, Pretoria, South Africa, October 28 - 29, 2004. Available at: http://www.dissanet.com/jsp/modules/repository/index.jsp?repository=Library
  6. Pirkola, A. (2004). Genomics track experiments at UTA. Notebook paper. TREC Conference, November 16-19, 2004, Gaithersburg, Maryland, USA.
  7. Pirkola, A. and Leppänen, E. (2004). TREC 2003 genomics track experiments at UTA: query expansion with predefined high frequency terms. The Twelfth Text REtrieval Conference, Gaithersburg, MD. Available at: http://trec.nist.gov/pubs/trec12/t12_proceedings.html

Relevant links

TREC Genomics home page
TREC
Ari Pirkola

 

 

Updated 02.01.2006 Responsibility for updating: AP

