Rhymes and Syntax: A Morpho-Syntactic Analysis of Czech Poetry

Authors

  • Silvie Cinková
  • Petr Plecháč
  • Martin Popel

DOI:

https://doi.org/10.3986/pkn.v47.i2.04

Keywords:

Czech poetry, distant reading, text corpora, Universal Dependencies, natural language processing, treebanks

Abstract

A linguistically informed distant reading presupposes adequate performance of Natural Language Processing tools. This article describes our evaluation of the UDPipe parser on a manually annotated sample of nineteenth-century Czech poetry in three steps: (1) creation of a documented data set for this domain (poetry, nineteenth century, Czech); (2) domain-specific annotation decisions; (3) error analysis. The sample consisted of 29 randomly selected poems, which were first automatically tagged and parsed with UDPipe and then manually checked word by word. The following features were checked: word segmentation (chunking), lemmatization, part-of-speech assignment, assignment of more fine-grained morphological details, the position in the syntactic dependency tree (selection of the syntactic parent), and the label of the syntactic relation between the word and its parent. The error analysis shows that the most typical parser errors are associated with complex noun phrases that contain other nouns as modifiers, especially when these occur in a poetry-specific word order, that is, preposed to the governing noun. By contrast, neither archaic orthography nor neologisms posed substantial problems.
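The word-by-word check described above can be illustrated with a minimal sketch that compares parser output against a manually corrected version of the same text in CoNLL-U format. This is not the authors' evaluation script: the function names `read_tokens` and `accuracy_by_feature` are hypothetical, and the sketch assumes that both files have identical word segmentation, which a real evaluation could not take for granted.

```python
from collections import Counter

# CoNLL-U column positions (0-based) for the features named in the abstract.
COLUMNS = {"lemma": 2, "upos": 3, "feats": 5, "head": 6, "deprel": 7}


def read_tokens(conllu_text):
    """Return one list of columns per syntactic word, skipping comments,
    blank lines, multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)."""
    tokens = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append(cols)
    return tokens


def accuracy_by_feature(parsed_text, gold_text):
    """Share of words on which the parser agrees with the manual annotation.

    Assumption: both inputs contain the same words in the same order.
    Differently segmented text would have to be aligned first.
    """
    parsed, gold = read_tokens(parsed_text), read_tokens(gold_text)
    if not gold:
        raise ValueError("no words found in the gold annotation")
    if len(parsed) != len(gold):
        raise ValueError("segmentation differs; alignment is needed first")
    correct = Counter()
    for p, g in zip(parsed, gold):
        for name, idx in COLUMNS.items():
            correct[name] += p[idx] == g[idx]
        # Labeled attachment: both the parent and the relation label must match.
        correct["las"] += p[6] == g[6] and p[7] == g[7]
    total = len(gold)
    return {name: correct[name] / total for name in [*COLUMNS, "las"]}
```

In this sketch the `head` score corresponds to what is commonly called the unlabeled attachment score, and `las` to the labeled attachment score. Word segmentation is deliberately left out, since scoring it requires aligning the two versions of the text first.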

Published

2024-06-21

Section

Thematic section