INFORMATICA
International Journal

INFORMATICA, 2004, Vol. 15, No. 4, 565-580
© Institute of Mathematics and Informatics

ISSN 0868-4952

Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

Airenas VAICIUNAS (a), Vytautas KAMINSKAS (a), Gailius RASKINIS (b)

(a) Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT-3035 Kaunas, Lithuania. E-mail: airenas@freemail.lt, V.Kaminskas@if.vdu.lt

(b) Center of Computational Linguistics, Vytautas Magnus University, Donelaicio 52, LT-3000 Kaunas, Lithuania. E-mail: idgara@vdu.lt

Abstract

This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model reduced perplexity by up to 13% compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.

Keywords:

language models, n-grams, class-based models, morphology, inflections, interpolation, perplexity reduction, out-of-vocabulary words
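
The linear interpolation and perplexity evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the interpolation weight `lam`, and the toy probabilities are all assumptions made for illustration, and a real system would estimate the component probabilities from the corpus and tune the weight on held-out data.

```python
import math


def class_based_prob(p_class_given_history, p_word_given_class):
    """Class-based n-gram estimate: P(class | class history) * P(word | class).

    Both arguments are assumed to be pre-computed probabilities.
    """
    return p_class_given_history * p_word_given_class


def interpolate(p_word, p_class, lam):
    """Linear interpolation: lam * P_word + (1 - lam) * P_class."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a test sequence from its per-word probabilities."""
    if not probs:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Toy per-word probabilities (hypothetical values, not from the paper).
# The second word is rare, so the word 3-gram model assigns it very little mass.
word_model = [0.1, 0.001, 0.2]
class_model = [
    class_based_prob(0.5, 0.1),
    class_based_prob(0.2, 0.1),
    class_based_prob(0.5, 0.2),
]

lam = 0.5  # assumed weight; normally tuned on held-out text
mixed = [interpolate(pw, pc, lam) for pw, pc in zip(word_model, class_model)]

baseline_ppl = perplexity(word_model)
mixed_ppl = perplexity(mixed)
print(f"baseline perplexity:     {baseline_ppl:.2f}")
print(f"interpolated perplexity: {mixed_ppl:.2f}")
```

In this toy setting the class-based model smooths the estimate for the rare word, so the interpolated model reaches a lower perplexity than the word model alone. The size of the gain here is an artifact of the invented numbers and says nothing about the reported 13% figure.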
