Information Technology and Systems

Thursday, October 6 14:30 - 16:30 Ballroom C	Session: Computer Linguistics (rus)

	Chair: Ph.D. Leonid Iomdin

Leonid Iomdin
Ph.D. (fil), Principal Researcher (IITP RAS)

Lecture: Microsyntax of Russian (rus)

Abstract:

The lecture will focus on a variety of phenomena in modern Russian that occupy an intermediate position between the grammar and the dictionary. Most of these phenomena are of a highly individual character and require especially fine tools for research (the way certain medical conditions in need of surgical treatment require microsurgical instruments and techniques).

In particular, these phenomena include 1) syntactic idioms (i.e. phraseological units whose syntactic properties differ from those of normal linguistic units) and 2) non-standard syntactic constructions.

Examples of the first type of phenomena are expressions like
(1) s”est’ sobaku na X (lit. to eat a dog on X) ≈ ‘to have much experience in X’
or
(2) ruki češutsja X-ovat’ ≈ ‘there is a strong desire to perform X’
If we compare them with expressions that are lexically close to (1) and (2) but do not form phraseological units (cf. Zajac ne možet s”est’ sobaku ‘A hare cannot eat a dog’ or Posle etogo krema u menja ruki češutsja ‘my hands itch after I apply this kind of cream’), we will easily see that, in addition to their idiomatic meaning, the given idioms acquire special properties of government (the preposition na in (1), which introduces the scope of experience, or the infinitive in (2), which refers to the object of strong desire).

Examples of the second type of phenomena are constructions like
(3) Z-u X-ovat’ (cf. Vam vyxodit’ ‘You are to get off’, Komu rabotat’ ≈ ‘Who is there to work?’),
where the sense of a certain modality is expressed not by specific words like nužno ‘need’ or sleduet ‘should’ but by the construction itself, or
(4) constructions with repeated words like X-ovat’ ne X-oval (cf. čitat’ ne čital (no listal) ≈ ‘one didn’t really read it but flipped through’, xrapet’ on ne xrapit (a kakie-to zvuki vo sne proizvodit) ≈ ‘he does not exactly snore but he makes some sounds when he sleeps’), where the idea of incomplete action is expressed not by concrete lexical units like ne sovsem ‘not exactly’ or ne vprjamuju ‘not directly’ but by the repetition of words.

Special attention will be given to the techniques of strict description of such linguistic phenomena in theoretical and applied tasks.

Anton Kazennikov
Morphological Guesser Algorithm Based on ETAP-3 Dictionary Data

Abstract: This paper presents a method for building a morphological analyzer for unknown words based on the dictionaries of the ETAP-3 system. The analyzer is built on a finite-state automaton, and the ETAP-3 dictionaries serve as the source material for its construction. When the analyzer is built, the inflected part of each word, together with its assigned morphological features, is entered into the automaton. The algorithm was tested experimentally on the SynTagRus corpus, where it proved effective. It can therefore be used both for the immediate task of analyzing unknown words and for stemming (lemmatization).
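
As a rough illustration of the general idea of guessing morphology from word endings stored in an automaton, here is a minimal sketch in Python. It is not the algorithm from the paper: it assumes a simple trie over reversed endings, and the class name SuffixGuesser, the tag labels, and the transliterated training words are all hypothetical and stand in for real ETAP-3 dictionary data.

```python
from collections import Counter


class SuffixGuesser:
    """Toy guesser: a trie over reversed word endings; each node counts tags.

    Hypothetical illustration only, not the ETAP-3 analyzer.
    """

    def __init__(self, max_suffix=4):
        self.max_suffix = max_suffix
        self.root = {"tags": Counter(), "next": {}}

    def add(self, word, tag):
        """Store the last max_suffix characters of a known word with its tag."""
        node = self.root
        node["tags"][tag] += 1
        for ch in reversed(word[-self.max_suffix:]):
            node = node["next"].setdefault(ch, {"tags": Counter(), "next": {}})
            node["tags"][tag] += 1

    def guess(self, word):
        """Follow the longest known ending and return its tags, most frequent first."""
        node = self.root
        for ch in reversed(word[-self.max_suffix:]):
            nxt = node["next"].get(ch)
            if nxt is None:
                break
            node = nxt
        return [tag for tag, _ in node["tags"].most_common()]


# Made-up training data (transliterated Russian forms with invented tag names).
guesser = SuffixGuesser()
for form, tag in [
    ("chitala", "V.PAST.F"),
    ("pisala", "V.PAST.F"),
    ("stola", "N.GEN.SG"),
    ("doma", "N.GEN.SG"),
]:
    guesser.add(form, tag)

print(guesser.guess("gugala"))  # an unknown word ending in -ala
print(guesser.guess("kola"))    # an unknown word ending in -ola
```

In this sketch an unknown word is matched against the longest ending seen in the training data, and the tags recorded at that point are returned as hypotheses.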

Leonid Kreydlin
Using descriptor weights in the ETAP-3 parser

Abstract: As part of the work on improving the syntactic parser of the ETAP-3 system, an experiment was carried out to compute the degree of semantic relatedness between words that are potential members of coordinative constructions, and to use the resulting data when filtering syntactic hypotheses.

Vadim Petrochenkov
Statistical word sense disambiguation in ETAP-3 system

Abstract: The paper describes an experiment on resolving lexical ambiguity during the construction of syntactic structure in the ETAP-3 system. To resolve this ambiguity, word co-occurrence data collected from the SynTagRus corpus of annotated texts were used, along with similar data from the CrossLexica dictionary. Using these data together with a probabilistic model increased the proportion of correctly resolved lexical ambiguities.
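
To make the notion of choosing a sense from co-occurrence data with a probabilistic model more concrete, here is a small Python sketch. It is an assumption-laden toy, not the model used in the paper: it scores each candidate sense with add-one-smoothed log-probabilities of its context words. The counts, the sense labels for the ambiguous word luk ('onion' vs. 'bow'), and the function names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical co-occurrence counts: (sense, context lemma) -> frequency.
# Real counts would come from an annotated corpus; these are made up.
COOC = Counter({
    ("luk/onion", "rezat"): 8,      # 'to cut'
    ("luk/onion", "zharit"): 5,     # 'to fry'
    ("luk/bow", "streljat"): 6,     # 'to shoot'
    ("luk/bow", "natjanut"): 3,     # 'to draw (a bow)'
})


def score(sense, context, cooc, alpha=1.0):
    """Sum of smoothed log P(context word | sense) over the context words."""
    vocab = {word for (_, word) in cooc}
    total = sum(count for (s, _), count in cooc.items() if s == sense)
    denom = total + alpha * len(vocab)
    return sum(math.log((cooc[(sense, w)] + alpha) / denom) for w in context)


def disambiguate(senses, context, cooc):
    """Return the candidate sense with the highest score for this context."""
    return max(senses, key=lambda s: score(s, context, cooc))


senses = ["luk/onion", "luk/bow"]
print(disambiguate(senses, ["rezat", "zharit"], COOC))
print(disambiguate(senses, ["streljat"], COOC))
```

The smoothing constant alpha keeps unseen sense and context pairs from producing a zero probability; any other smoothing scheme could be substituted.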