Project: sGRAM – Approximate String Matching for Out-Of-Vocabulary Words in CLIR Applications
Description
Untranslatable query
keys pose an abiding problem in dictionary-based cross-lingual information
retrieval (CLIR). One approach for solving it consists of using
approximate string matching methods for retrieving word form variants
of the source key from the target database index. We have developed
a novel n-gram based string matching technique, which we call the
s-gram matching technique (s-gram for skip-gram). In the technique,
n-grams are classified into categories on the basis of character
contiguity in words. The categories are then utilized in matching.
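
As an illustration only, the idea of grouping character pairs by contiguity can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's formal definition: the function names `skip_digrams` and `sgram_similarity`, the `max_skip` parameter, the restriction to two-character grams, and the Jaccard-style scoring are all hypothetical choices made for the example.

```python
from collections import defaultdict


def skip_digrams(word, max_skip=2):
    """Group the character pairs of `word` by how many characters lie between them.

    Class 0 holds the conventional (adjacent) digrams; class k holds pairs
    formed by skipping k characters. Illustrative only.
    """
    classes = defaultdict(set)
    for skip in range(max_skip + 1):
        for i in range(len(word) - skip - 1):
            classes[skip].add(word[i] + word[i + skip + 1])
    return classes


def sgram_similarity(a, b, max_skip=2):
    """Assumed scoring: shared grams over all grams, compared within each skip class."""
    grams_a = skip_digrams(a, max_skip)
    grams_b = skip_digrams(b, max_skip)
    shared = total = 0
    for skip in range(max_skip + 1):
        shared += len(grams_a[skip] & grams_b[skip])
        total += len(grams_a[skip] | grams_b[skip])
    return shared / total if total else 0.0


if __name__ == "__main__":
    # Adjacent pairs and skip-one pairs of a short word, by class.
    print(dict(skip_digrams("abcd", max_skip=1)))
    # Score a spelling-variant pair with and without the skip-one class.
    print(sgram_similarity("colour", "color", max_skip=0))
    print(sgram_similarity("colour", "color", max_skip=1))
```

With `max_skip=0` the sketch reduces to conventional digram matching on adjacent characters, so the two settings can be compared on the same word pairs.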
The technique has been compared with the conventional n-gram technique,
which uses adjacent characters as n-grams. Several types of words and
word pairs were studied, including biological, geographical, economic,
technological and other terms. The source-language words were in
French, Spanish, Italian, German, Swedish, Finnish and English, and
the target words were their spelling variants in Finnish and
English within target word lists of up to 200 000 words. In recent
work, Norwegian and Swedish have also been tested. In all cross-lingual
tests done, the targeted s-gram matching technique outperformed
the conventional n-gram matching technique as well as longest common
subsequence and edit distance. The technique has also been highly
effective for the identification of monolingual word form variants. Formal definitions of the techniques have been given. The work continues with
new language pairs, historical and OCR'd language, and new test set-ups.
Duration
2002 - 2008
Researchers
Mr. Heikki Keskustalo – supervised by Dr. Ari Pirkola and Prof. Kal Järvelin
Mrs. Sanna Kumpulainen
Mr. Antti Järvelin
Publications
Updated 11.03.2008. Responsibility for updating: KJ