THE DEVELOPMENT OF AN APPLICATION FOR COLLECTING AND FORMING CORPORA OF THE KAZAKH LANGUAGE

Authors

  • Karyukin Vladislav
  • Abdurakhmonova Nilufar

Keywords:

data parsing, text processing, low-resource languages, Selenium WebDriver, Adilet website, Kazakh.

Abstract

Text processing and analysis are central to progress in NLP. Most tasks in the field, such as text generation, sentiment analysis, and machine translation, require large amounts of language data. High-resource languages such as English, German, French, and Chinese benefit from large, diverse corpora that support the development of robust language models, whereas low-resource languages face significant challenges caused by data scarcity. This paper addresses the shortage of resources for Kazakh, a Turkic language, which hinders the training of high-quality models for various NLP tasks.
To enlarge the available corpora, the study proposes parsing the Adilet legislative website, which hosts a very large collection of well-structured, error-free texts. The dataset was gathered in two phases. In the first phase, links to pages of the Adilet website were collected with the Content Downloader program. In the second phase, a parser based on Selenium WebDriver extracted the data and formed a dataset containing the title, the date of the law article, its text, and its URL. The resulting corpus contains 9575 texts.
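
The second phase can be illustrated with a minimal Python sketch. It is not the authors' parser: the CSS selectors, the function names, and the CSV output format are assumptions made for illustration, and the real selectors would have to be taken from the actual Adilet page markup. The sketch assumes the links from the first phase are already available as a list of strings.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from typing import Callable, Iterable, List


@dataclass
class LawRecord:
    """One dataset row: the four fields named in the abstract."""
    title: str
    date: str
    text: str
    url: str


# Hypothetical CSS selectors; inspect the real pages before using them.
SELECTORS = {"title": "h1", "date": ".doc-date", "text": ".doc-text"}


def parse_page(driver, url: str) -> LawRecord:
    """Load one page in a Selenium WebDriver session and extract its fields."""
    # Imported here so the rest of the sketch runs without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(url)

    def grab(key: str) -> str:
        return driver.find_element(By.CSS_SELECTOR, SELECTORS[key]).text.strip()

    return LawRecord(grab("title"), grab("date"), grab("text"), url)


def build_corpus(urls: Iterable[str],
                 fetch: Callable[[str], LawRecord]) -> List[LawRecord]:
    """Apply `fetch` to every link, skipping blank lines and duplicate links."""
    seen = set()
    records = []
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        records.append(fetch(url))
    return records


def corpus_to_csv(records: Iterable[LawRecord]) -> str:
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    names = [f.name for f in fields(LawRecord)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Example wiring (requires Selenium and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       records = build_corpus(links, lambda u: parse_page(driver, u))
#   finally:
#       driver.quit()
```

Passing the page fetcher as a callable keeps link handling and serialization independent of the browser, so those parts can be checked with a stub before a full crawl is started.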

Published

2024-06-24

Section

SECTION 3. Language and speech analysis in NLP (morphological, syntactic and semantic analysis; speech analysis and synthesis).

How to Cite

THE DEVELOPMENT OF AN APPLICATION FOR COLLECTING AND FORMING CORPORA OF THE KAZAKH LANGUAGE. (2024). «CONTEMPORARY TECHNOLOGIES OF COMPUTATIONAL LINGUISTICS», 2(22.04), 260-265. https://myscience.uz/index.php/linguistics/article/view/61
