SrpELTeC: A Serbian Literary Corpus for Distant Reading
DOI:
https://doi.org/10.3986/pkn.v47.i2.03
Keywords:
digital humanities, Serbian literature, text corpora, distant reading, linked data, named entity recognition, text analytics
Abstract
This article presents the SrpELTeC corpus, created within the COST Action Distant Reading for European Literary History (CA16204). All novels in SrpELTeC were selected, prepared, and annotated according to the common principles that apply to every language collection in the European Literary Text Collection (ELTeC). The article describes the challenges encountered in preparing SrpELTeC and the solutions adopted. All novels were manually encoded in accordance with XML-TEI and supplied with rich metadata and structural annotations. Automatic annotation covered morphosyntactic (part-of-speech) tags, lemmatization, and named entities, and relied on natural language processing resources developed and maintained by JeRTeh, the Society for Language Resources and Technologies. The integration of SrpELTeC with Wikidata is supported by a set of SPARQL queries that retrieve metadata and offer various visualization options. As part of recent activities within the COST Action NexusLinguarum: European Network for Web-centred Linguistic Data Science (CA18209), a linked data version of SrpELTeC was created using the NLP Interchange Format (NIF). All versions of SrpELTeC are freely available under the CC-BY license.
References
Burnard, Lou, et al. “In Search of Comity: TEI for Distant Reading.” Journal of the Text Encoding Initiative, vol. 14, 2021, pp. 1–9, https://doi.org/10.4000/jtei.3500. Accessed 24 Jan. 2024.
Deretić, Jovan. Istorija srpske književnosti. Belgrade, Nolit, 1983.
Deretić, Jovan. Srpski roman 1800–1950. Belgrade, Nolit, 1981.
Heiden, Serge. “The TXM Platform: Building Open-source Textual Analysis Software Compatible with the TEI Encoding Scheme.” Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. Vol. 2. No. 3, edited by Ryo Otoguro et al., Institute for Digital Enhancement of Cognitive Development, Waseda University, 2010, pp. 389–398, https://aclanthology.org/Y10-1044. Accessed 24 Jan. 2024.
Ikonić Nešić, Milica, et al. “From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back).” Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference, edited by Thierry Declerck et al., European Language Resources Association, Paris, 2022, pp. 7–16, https://aclanthology.org/2022.ldl-1.2/. Accessed 24 Jan. 2024.
Ikonić Nešić, Milica, et al. “Serbian ELTeC Sub-Collection in Wikidata.” Infotheca, vol. 21, no. 2, 2021, pp. 60–87, https://doi.org/10.18485/infotheca.2021.21.2.4. Accessed 24 Jan. 2024.
Kilgarriff, Adam, et al. “The Sketch Engine.” Proceedings of the Eleventh EURALEX International Congress, Université de Bretagne-Sud, Faculté des lettres et des sciences humaines, 2004, pp. 105–115, https://euralex.org/publications/the-sketch-engine/. Accessed 24 Jan. 2024.
Kilgarriff, Adam, et al. “The Sketch Engine: Ten Years On.” Lexicography, vol. 1, no. 1, 2014, pp. 7–36, https://doi.org/10.1007/s40607-014-0009-9. Accessed 24 Jan. 2024.
Krstev, Cvetana. Processing of Serbian: Automata, Texts and Electronic Dictionaries. Belgrade, Faculty of Philology of the University, 2008.
Krstev, Cvetana. “The Serbian Part of the ELTeC Collection Through the Magnifying Glass of Metadata.” Infotheca, vol. 21, no. 2, 2021, pp. 26–42, https://doi.org/10.18485/infotheca.2021.21.2.2. Accessed 24 Jan. 2024.
Krstev, Cvetana, et al. “A System for Named Entity Recognition Based on Local Grammars.” Journal of Logic and Computation, vol. 24, no. 2, 2014, pp. 473–489.
Krstev, Cvetana, et al. “Analysis of the First Serbian Literature Corpus of the Late 19th and Early 20th Century with the TXM Platform.” DH_BUDAPEST_2019, Eötvös Loránd University, Centre for Digital Humanities, 2019, pp. 36–37.
Milisavac, Živan, editor. Pripovedači. Novi Sad / Belgrade, Matica srpska / Srpska književna zadruga, 1972.
Moretti, Franco. “Conjectures on World Literature.” New Left Review, vol. 1, 2000, pp. 54–68.
Odebrecht, Carolin, et al. European Literary Text Collection (ELTeC): April 2021 Release with 14 Collections of at Least 50 Novels (v1.1.0). Zenodo, 2021, https://doi.org/10.5281/zenodo.4662444. Accessed 24 Jan. 2024.
Patras, Roxana, et al. “Thresholds to the ‘Great Unread’: Titling Practices in Eleven ELTeC Collections.” Interférences littéraires / Literaire interferenties, vol. 25, 2021, pp. 163–187, http://interferenceslitteraires.be/index.php/illi/article/view/1102. Accessed 24 Jan. 2024.
Schmid, Helmut. “Improvements in Part-of-speech Tagging with an Application to German.” Natural Language Processing Using Very Large Corpora, edited by Susan Armstrong et al., Springer Netherlands, 1999, pp. 13–25.
Schöch, Christof, et al. “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives.” Modern Languages Open, vol. 1, 2021, https://doi.org/10.3828/mlo.v0i0.364. Accessed 24 Jan. 2024.
Stanković, Ranka, et al. “Annotation of the Serbian ELTeC Collection.” Infotheca, vol. 21, no. 2, 2021, pp. 43–59, https://doi.org/10.18485/infotheca.2021.21.2.3. Accessed 24 Jan. 2024.
Stanković, Ranka, et al. “Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection.” LREC 2022 Conference Proceedings, edited by Nicoletta Calzolari et al., European Language Resources Association, Paris, 2022, pp. 3337–3345, https://aclanthology.org/2022.lrec-1.356/. Accessed 24 Jan. 2024.
Stanković, Ranka, et al. “SrpELTeC on Platforms: Udaljeno čitanje, Aurora, noSketch.” Infotheca, vol. 21, no. 2, 2021, pp. 136–153, https://doi.org/10.18485/infotheca.2021.21.2.7. Accessed 24 Jan. 2024.
Stanković, Ranka, et al. “Towards ELTeC-LLOD: European Literary Text Collection Linguistic Linked Open Data.” Language, Data and Knowledge 2023, edited by Sara Carvalho et al., NOVA CLUNL, Lisbon, 2023, pp. 180–191.
Šandrih, Branislava, et al. “Development and Evaluation of Three Named Entity Recognition Systems for Serbian–The Case of Personal Names.” RANLP 2019: Natural Language Processing in a Deep Learning World: Proceedings, edited by Galia Angelova et al., INCOMA Ltd., Shumen, 2019, pp. 1060–1068, https://aclanthology.org/R19-1122/. Accessed 24 Jan. 2024.
Šandrih Todorović, Branislava, et al. “Serbian NER&Beyond: The Archaic and the Modern Intertwinned.” RANLP 2021: Deep Learning for Natural Language Processing Methods and Applications, edited by Galia Angelova et al., INCOMA Ltd., Shumen, 2021, pp. 1252–1260, https://aclanthology.org/2021.ranlp-1.141/. Accessed 24 Jan. 2024.
Trtovac, Aleksandra, et al. “The Serbian Part of the ELTeC–From the Empty List to the 100 Novels Collection.” Infotheca, vol. 21, no. 2, 2021, pp. 7–25, https://doi.org/10.18485/infotheca.2021.21.2.1. Accessed 24 Jan. 2024.
Vitas, Duško. “From Onions to Champagne–Food and Drink in the SrpELTeC Corpus.” Infotheca, vol. 21, no. 2, 2021, pp. 88–118, https://doi.org/10.18485/infotheca.2021.21.2.5. Accessed 24 Jan. 2024.