INFORMATICA, 2004, Vol. 15, No. 4, 565-580
© Institute of Mathematics and Informatics,

ISSN 0868-4952

Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

Airenas VAICIUNAS (a), Vytautas KAMINSKAS (a), Gailius RASKINIS (b)

(a) Department of Applied Informatics, Vytautas Magnus University, Vileikos 8, LT-3035 Kaunas, Lithuania

(b) Center of Computational Linguistics, Vytautas Magnus University, Donelaicio 52, LT-3000 Kaunas, Lithuania


Abstract. This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the word 3-gram model reduced perplexity by up to 13% compared with the baseline 3-gram model. Morphological models reduced the out-of-vocabulary word rate from 1.5% to 1.02%.


Key words: language models, n-grams, class-based models, morphology, inflections, interpolation, perplexity reduction, out-of-vocabulary words.
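
The three quantities named in the abstract (linear interpolation, perplexity, and out-of-vocabulary rate) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the interpolation weight `lam`, and all probabilities below are hypothetical values chosen only for illustration.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linearly interpolate two probabilities of the same word in the same context.

    lam is the (hypothetical) weight of the word-based model, 0 <= lam <= 1.
    """
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a text, given the model probability of each of its words."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are not covered by the vocabulary."""
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)


# Made-up per-word probabilities from a word 3-gram model and a class-based model.
word_probs = [0.10, 0.02, 0.30]
class_probs = [0.20, 0.05, 0.10]
mixed = [interpolate(pw, pc, lam=0.7) for pw, pc in zip(word_probs, class_probs)]

print("word model perplexity:  ", perplexity(word_probs))
print("interpolated perplexity:", perplexity(mixed))
```

In practice the weight would be tuned on held-out data rather than fixed by hand, and a relative perplexity reduction would be computed against the baseline word model's perplexity on the same test text.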

