New Russian Stylometry Dataset!
Russian Stylometric Dataset (RSD): 322 texts from the 19th to early 20th centuries (16 million words), prepared for analysis in stylo (R) and machine learning (Python).
What's inside?
Fiction, journalism, scientific texts, drama, poetry
Grouped by author, gender, age, genre, literary movements (Romanticism/Realism)
Character speech (Tolstoy, Gogol, Ostrovsky)
Generated texts (LSTM, GPT)
Use cases: authorship attribution, clustering, classification, benchmarking methods.
Public domain + GPL-3.0 license.
Learn more: https://github.com/nevmenandr/RSD
DOI: 10.5281/zenodo.20701309
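To give a flavor of the authorship attribution use case, here is a minimal sketch of Burrows's Delta in plain Python. It does not rely on RSD's actual file layout: the function names (`relative_freqs`, `burrows_delta`) and the toy input are hypothetical, and you would load the real texts yourself.

```python
import re
from statistics import mean, pstdev


def relative_freqs(text):
    """Map each word in a text to its relative frequency."""
    words = re.findall(r"\w+", text.lower())  # \w matches Cyrillic in Python 3
    if not words:
        return {}
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    return {w: c / len(words) for w, c in freqs.items()}


def burrows_delta(corpus, unknown, n_mfw=100):
    """Rank known texts by Burrows's Delta distance to an unknown text.

    corpus:  dict mapping a label (e.g. an author name) to raw text
    unknown: raw text whose author we want to guess
    n_mfw:   how many most frequent words to use as features
    Returns a list of (label, delta) pairs, closest first.
    """
    freqs = {label: relative_freqs(t) for label, t in corpus.items()}

    # Pick the most frequent words across the known corpus.
    totals = {}
    for f in freqs.values():
        for w, v in f.items():
            totals[w] = totals.get(w, 0.0) + v
    mfw = sorted(totals, key=totals.get, reverse=True)[:n_mfw]

    # Per-word mean and standard deviation; skip words that never vary.
    stats = {}
    for w in mfw:
        vals = [f.get(w, 0.0) for f in freqs.values()]
        sd = pstdev(vals)
        if sd > 0:
            stats[w] = (mean(vals), sd)
    if not stats:
        raise ValueError("need at least two known texts that differ")

    def zscores(f):
        return {w: (f.get(w, 0.0) - m) / sd for w, (m, sd) in stats.items()}

    z_unknown = zscores(relative_freqs(unknown))
    scores = {}
    for label, f in freqs.items():
        z_known = zscores(f)
        # Delta = mean absolute difference of z-scores over the word list.
        scores[label] = mean(abs(z_unknown[w] - z_known[w]) for w in stats)
    return sorted(scores.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    toy = {
        "A": "the cat and the dog and the bird",
        "B": "a cat a dog a bird a fish",
    }
    print(burrows_delta(toy, "the cat and the fish and the dog"))
```

On real data you would pass one entry per author (or per text) and a much longer word list; the toy example only shows the mechanics.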