
Herald of the Kazakh-British Technical University


A COMPUTATIONAL PIPELINE FOR LEXICAL AND THEMATIC ANALYSIS OF THE CODE OF ADMINISTRATIVE OFFENSES OF THE REPUBLIC OF KAZAKHSTAN

https://doi.org/10.55452/1998-6688-2025-22-4-227-243

Abstract

This study introduces a computational pipeline for the automated linguistic and structural analysis of legal texts and applies it to the Code of Administrative Offenses of the Republic of Kazakhstan (CAO RK, K1400000235). The workflow integrates data collection, text preprocessing, tokenization, keyword extraction, semantic clustering, and visualization, using natural language processing (NLP) and statistical techniques implemented in Python. The pipeline unites lexical, thematic, and quantitative linguistic analyses into a single sequence that identifies frequency distributions, semantic fields, and latent topics across the hierarchical structure of the Code (sections, chapters, and articles). The analysis of the CAO RK corpus revealed several distinctive linguistic patterns: a dominance of sanction- and responsibility-related vocabulary (штраф "fine", ответственность "liability", правонарушение "offense"), high lexical density in chapters regulating economic and procedural offenses, and concentrated thematic clusters reflecting the normative-punitive orientation of administrative law. Visualizations such as frequency histograms, thematic heatmaps, and topic maps illustrate the pipeline's potential for exploring legislative language quantitatively. Overall, the framework establishes a scalable foundation for comparative legal linguistics, automated legislative monitoring, and the modernization of legal analytics in Kazakhstan.
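The tokenization, frequency, and lexical-density steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two-chapter corpus, the stopword list, the function names, and the definition of lexical density as the share of non-stopword tokens are all assumptions made for illustration, and the sketch skips lemmatization, which a real pipeline for Russian or Kazakh text would likely need.

```python
import re
from collections import Counter

# Hypothetical toy corpus: chapter label -> text. These sentences are
# invented for illustration and are not quotations from the Code.
CHAPTERS = {
    "chapter_a": "Штраф налагается за правонарушение. Штраф уплачивается в срок.",
    "chapter_b": "Ответственность наступает за правонарушение и влечет штраф.",
}

# Tiny illustrative stopword list (an assumption; a real list is far larger).
STOPWORDS = {"за", "в", "и"}

# Runs of lowercase Russian letters plus Kazakh-specific Cyrillic letters.
TOKEN_RE = re.compile(r"[а-яёәғқңөұүһі]+")


def tokenize(text):
    """Lowercase the text and return its Cyrillic word tokens."""
    return TOKEN_RE.findall(text.lower())


def content_tokens(text):
    """Return the tokens that are not in the stopword list."""
    return [t for t in tokenize(text) if t not in STOPWORDS]


def lexical_density(text):
    """Share of non-stopword tokens among all tokens (assumed definition)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(content_tokens(text)) / len(tokens)


def top_terms(chapters, n):
    """Return the n most frequent non-stopword tokens across all chapters."""
    counts = Counter()
    for text in chapters.values():
        counts.update(content_tokens(text))
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_terms(CHAPTERS, 2))
    for name, text in CHAPTERS.items():
        print(name, round(lexical_density(text), 2))
```

Counting surface forms keeps the sketch dependency-free, but inflected variants of the same word are counted separately, so frequencies from a real corpus would be more meaningful after lemmatization.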

About the Authors

B. Mukhsimbayev
Kazakh-British Technical University
Kazakhstan

PhD student

Almaty



A. Pak
Kazakh-British Technical University
Kazakhstan

PhD, Professor

Almaty



A. Kuralbayev
Kazakh-British Technical University
Kazakhstan

PhD student

Almaty





For citations:


Mukhsimbayev B., Pak A., Kuralbayev A. A COMPUTATIONAL PIPELINE FOR LEXICAL AND THEMATIC ANALYSIS OF THE CODE OF ADMINISTRATIVE OFFENSES OF THE REPUBLIC OF KAZAKHSTAN. Herald of the Kazakh-British Technical University. 2025;22(4):227-243. https://doi.org/10.55452/1998-6688-2025-22-4-227-243



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6688 (Print)
ISSN 2959-8109 (Online)