SrpELTeC: A Serbian Literary Corpus for Distant Reading

Authors

  • Ranka Stanković
  • Cvetana Krstev
  • Duško Vitas

DOI:

https://doi.org/10.3986/pkn.v47.i2.03

Keywords:

digital humanities, Serbian literature, text corpora, distant reading, linked data, named entity recognition, text analytics

Abstract

The article presents SrpELTeC, a corpus developed within the COST Action Distant Reading for European Literary History (CA16204). All novels in SrpELTeC were selected, prepared, and annotated according to the common principles established for all language collections in the European Literary Text Collection (ELTeC). The article outlines the challenges of preparing SrpELTeC from scratch and the solutions adopted. All novels were manually encoded in TEI with rich metadata and structural annotation. The automatic annotation covered POS-tagging, lemmatization, and named entity recognition, relying on Natural Language Processing resources developed and maintained by the JeRTeh Language Resources and Technologies Society. The integration of SrpELTeC with Wikidata is supported by a set of SPARQL queries that retrieve metadata and offer different visualization options. Recent activities within the COST Action NexusLinguarum: European Network for Web-centred Linguistic Data Science (CA18209) concern the linked data version of SrpELTeC, which is based on the NLP Interchange Format (NIF). All versions of SrpELTeC are freely available under the CC-BY license.

Published

2024-06-21

Section

Thematic section