Articles in the August 2022 issue
Building a multifunctional lexical component for a natural language text analysis system
The paper proposes a new, improved algorithm, UniLemm (universal lemmatizer), which solves both the direct problem of constructing the lemma of a word and the inverse problem of constructing word forms with fixed grammemes from a lemma.
The lemmatizer is an important component of advanced artificial intelligence systems that analyze natural language texts.
The task of lemmatization is to assign the initial form (lemma) to each word of the input text.
This paper reduces lemmatization to a classification problem. Each word form with given grammemes (grammatical categories) is assigned a class: a declension paradigm, where a paradigm is a set of declension rules.
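To make the idea of a paradigm as a class concrete, the following minimal Python sketch models a paradigm as a mapping from grammeme sets to word-form suffixes. The `Paradigm` class, its method names, and the toy paradigm for «кошка» are illustrative assumptions, not the paper's implementation; the grammeme tags follow OpenCorpora naming (`nomn`, `gent`, `datv`, `accs`, `sing`).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass
class Paradigm:
    """A toy declension paradigm: one suffix per grammeme set (hypothetical model)."""
    lemma_suffix: str
    form_suffixes: Dict[FrozenSet[str], str]

    def lemmatize(self, word_form: str, grammemes: Iterable[str]) -> str:
        """Direct problem: word form + grammemes -> lemma."""
        suffix = self.form_suffixes[frozenset(grammemes)]
        if not word_form.endswith(suffix):
            raise ValueError(f"{word_form!r} does not fit this paradigm")
        stem = word_form[:len(word_form) - len(suffix)]
        return stem + self.lemma_suffix

    def inflect(self, lemma: str, grammemes: Iterable[str]) -> str:
        """Inverse problem: lemma + grammemes -> word form."""
        if not lemma.endswith(self.lemma_suffix):
            raise ValueError(f"{lemma!r} does not fit this paradigm")
        stem = lemma[:len(lemma) - len(self.lemma_suffix)]
        return stem + self.form_suffixes[frozenset(grammemes)]


# Toy paradigm covering four singular forms of nouns like "кошка".
KOSHKA = Paradigm(
    lemma_suffix="а",
    form_suffixes={
        frozenset({"nomn", "sing"}): "а",
        frozenset({"gent", "sing"}): "и",
        frozenset({"datv", "sing"}): "е",
        frozenset({"accs", "sing"}): "у",
    },
)

print(KOSHKA.lemmatize("кошки", {"gent", "sing"}))  # кошка
print(KOSHKA.inflect("кошка", {"datv", "sing"}))    # кошке
```

Under this model, classifying a word form amounts to choosing which `Paradigm` object applies to it; the same object then serves both lemmatization and word-form generation.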
When building a classifier for the lemmatization problem, we take into account the existence of non-dictionary words, as well as the situation when grammemes for the word form are not specified.
The OpenCorpora Russian language dictionary serves as the training set for building the classification tree. When constructing the classification tree nodes, we take into account two important orthogonal aspects: the suffixes of word forms and the set of grammemes. The set of grammemes used in this work is a subset of the grammemes used in the Russian National Corpus and a superset of the grammemes used in the Universal Dependencies (UD) notation.
When building the classification tree, we use an original data structure based on ripple-down rules (RDR), which makes it possible to formulate not only a declension rule for a word form, but also its possible exceptions.
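The general idea of a rule with nested exceptions can be sketched as follows. This is a simplified illustration of ripple-down rules, not the paper's original data structure: the `RdrRule` class, the suffix-only conditions, and the three toy rules are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Transform = Tuple[str, str]  # (suffix to strip, suffix to append)


@dataclass
class RdrRule:
    """A rule that fires on a suffix unless one of its exceptions fires first."""
    suffix: str
    transform: Transform
    exceptions: List["RdrRule"] = field(default_factory=list)

    def find(self, word: str) -> Optional[Transform]:
        if not word.endswith(self.suffix):
            return None
        # A more specific exception overrides the conclusion of this rule.
        for exception in self.exceptions:
            result = exception.find(word)
            if result is not None:
                return result
        return self.transform


def lemmatize(root: RdrRule, word: str) -> str:
    transform = root.find(word)
    if transform is None:
        return word  # no rule applies: leave the word unchanged
    strip, add = transform
    return word[:len(word) - len(strip)] + add


# Default: keep the word. Exception: "-и" -> "-а". Exception to that: "-ии" -> "-ия".
ROOT = RdrRule("", ("", ""), [
    RdrRule("и", ("и", "а"), [
        RdrRule("ии", ("и", "я")),
    ]),
])

print(lemmatize(ROOT, "кошки"))  # кошка
print(lemmatize(ROOT, "армии"))  # армия
print(lemmatize(ROOT, "метро"))  # метро
```

The nesting keeps each exception local to the rule it refines, so adding a new special case does not require changing the more general rules above it.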
The UniLemm algorithm builds a combined classification tree containing suffix subtrees and grammeme subtrees. Suffix subtrees perform the primary classification, while grammeme subtrees resolve homonymy.
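A rough sketch of how suffix-based and grammeme-based decisions might be combined is given below. It assumes a plain suffix trie whose nodes carry a flat list of grammeme-tagged candidates, which is much simpler than the tree described in the paper; `SuffixNode`, `insert`, `lemmatize`, and the two rules for «стали» are hypothetical names and data chosen for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

Transform = Tuple[str, str]  # (suffix to strip, suffix to append)


class SuffixNode:
    """Trie node keyed by characters read from the end of the word."""

    def __init__(self) -> None:
        self.children: Dict[str, "SuffixNode"] = {}
        # Stand-in for a grammeme subtree: candidates tagged with grammemes.
        self.candidates: List[Tuple[FrozenSet[str], Transform]] = []


def insert(root: SuffixNode, suffix: str, grammemes: Iterable[str],
           transform: Transform) -> None:
    node = root
    for ch in reversed(suffix):
        node = node.children.setdefault(ch, SuffixNode())
    node.candidates.append((frozenset(grammemes), transform))


def lemmatize(root: SuffixNode, word: str,
              grammemes: Optional[Iterable[str]] = None) -> List[str]:
    # Primary classification: follow the longest known suffix.
    node, best = root, root.candidates
    for ch in reversed(word):
        node = node.children.get(ch)
        if node is None:
            break
        if node.candidates:
            best = node.candidates
    # Homonymy resolution: keep candidates compatible with the given grammemes.
    # With no grammemes given, every candidate is kept.
    wanted = frozenset(grammemes or ())
    return [word[:len(word) - len(strip)] + add
            for required, (strip, add) in best if wanted <= required]


TREE = SuffixNode()
insert(TREE, "али", {"VERB", "past", "plur"}, ("ли", "ть"))  # стали -> стать
insert(TREE, "али", {"NOUN", "gent", "sing"}, ("и", "ь"))    # стали -> сталь

print(lemmatize(TREE, "стали", {"VERB"}))  # ['стать']
print(lemmatize(TREE, "стали", {"NOUN"}))  # ['сталь']
print(lemmatize(TREE, "стали"))            # ['стать', 'сталь']
```

Because the lookup relies only on suffixes, the same sketch also returns candidates for words absent from the training dictionary, as long as they share a known suffix.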
The final stage of the algorithm represents the resulting classification tree as a DFA (deterministic finite automaton).
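As an illustration of why an automaton representation is convenient, the sketch below flattens a set of suffix rules into a transition table and classifies a word in a single pass over its characters from the end. The functions `build_dfa` and `run_dfa`, the class labels, and the convention that a missing transition acts as an implicit dead state are assumptions of this example, not details taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

Dfa = Tuple[List[Dict[str, int]], List[Optional[str]]]


def build_dfa(suffix_classes: Dict[str, str]) -> Dfa:
    """Build a transition table over reversed suffixes; state 0 is the start."""
    transitions: List[Dict[str, int]] = [{}]  # transitions[state][char] -> state
    outputs: List[Optional[str]] = [None]     # outputs[state] -> class label
    for suffix, label in suffix_classes.items():
        state = 0
        for ch in reversed(suffix):
            nxt = transitions[state].get(ch)
            if nxt is None:
                nxt = len(transitions)
                transitions.append({})
                outputs.append(None)
                transitions[state][ch] = nxt
            state = nxt
        outputs[state] = label
    return transitions, outputs


def run_dfa(dfa: Dfa, word: str) -> Optional[str]:
    """Return the label of the longest matching suffix, or None."""
    transitions, outputs = dfa
    state = 0
    label = outputs[0]
    for ch in reversed(word):
        nxt = transitions[state].get(ch)
        if nxt is None:  # implicit dead state: stop reading
            break
        state = nxt
        if outputs[state] is not None:
            label = outputs[state]
    return label


DFA = build_dfa({"": "keep", "и": "strip-и-add-а", "ии": "strip-и-add-я"})

print(run_dfa(DFA, "кошки"))  # strip-и-add-а
print(run_dfa(DFA, "армии"))  # strip-и-add-я
print(run_dfa(DFA, "метро"))  # keep
```

Each input character costs one dictionary lookup, so the running time depends on the length of the matched suffix and not on the number of rules.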
The correctness and quality of the algorithm were checked both on the OpenCorpora control sample and on two subcorpora containing original texts of various subjects and styles. The algorithm showed good results in both lemmatization accuracy (above 90%) and text processing speed (250-300 thousand words per second in single-threaded mode).