APPLICATION OF NON-AUTOREGRESSIVE DECODING FOR KAZAKH SPEECH RECOGNITION
https://doi.org/10.55452/1998-6688-2025-22-4-23-30
Abstract
In the field of speech recognition, end-to-end models are gradually replacing traditional and hybrid approaches. Their main principle is autoregressive decoding, in which the output sequence is generated from left to right. However, it has not been shown that this strategy yields the best results when converting speech to text. Moreover, end-to-end models rely solely on the preceding context, which complicates the processing of unclear or distorted sounds. To address this, the insertion method was proposed: it forgoes autoregressive decoding and generates output tokens in an arbitrary order. This paper examines a Kazakh speech recognition model trained with the insertion method and Connectionist Temporal Classification (CTC). The experiments show that this approach improves recognition accuracy. Unlike autoregressive models, the insertion method offers greater flexibility in sequence processing because it does not impose a strict order on output generation, which reduces decoding latency and makes the model more robust to poorly pronounced words. Furthermore, combining the insertion method with CTC improves the alignment between the audio signal and its text transcription, which is especially important for agglutinative languages such as Kazakh. According to the experimental results, the proposed model achieved a recognition error rate of 10.2%, making it competitive with current systems.
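To make the abstract's two ingredients concrete, here is a minimal sketch in PyTorch. It is purely illustrative: the paper publishes no code, and every name, tensor shape, and the scripted toy decoder below are our assumptions, not the authors' implementation. The first part computes a standard CTC loss, which aligns unsegmented audio frames with a transcription; the second distills insertion-based decoding, where the hypothesis grows by inserting tokens at arbitrary positions instead of strictly left to right.

```python
import torch
import torch.nn as nn

# --- 1) CTC: align unsegmented audio frames with a target transcription ---
# Shapes and vocabulary size are stand-ins, not values from the paper.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C = 50, 1, 34                                # frames, batch, vocab (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(-1)   # stand-in for encoder outputs
targets = torch.randint(1, C, (N, 8))              # stand-in token ids (no blanks)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.tensor([T]),
                target_lengths=torch.tensor([8]))

# --- 2) Insertion decoding: grow the hypothesis in arbitrary order --------
def insertion_decode(score_slots, max_steps=100):
    """score_slots(hyp) returns (slot, token) to insert, or None to stop."""
    hyp = []
    for _ in range(max_steps):
        step = score_slots(hyp)
        if step is None:            # the model signals the sequence is complete
            break
        slot, token = step
        hyp.insert(slot, token)     # tokens may land anywhere, not only at the end
    return hyp

# Toy driver: a scripted "model" emits the Kazakh word sälem middle-out,
# showing that left-to-right generation order is not required.
order = iter([(0, "l"), (0, "ä"), (2, "e"), (0, "s"), (4, "m")])
print("".join(insertion_decode(lambda hyp: next(order, None))))   # -> sälem
```

One intuition for why this ordering can suit an agglutinative language such as Kazakh: a confidently recognized stem can be emitted first, and uncertain affixes inserted around it later, rather than forcing every suffix to be committed strictly after its stem.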
About the Authors
D. Oralbekova
Kazakhstan
PhD, Senior Researcher
Almaty
O. Mamyrbayev
Kazakhstan
PhD, Professor, Chief Researcher
Almaty
A. Yerimbetova
Kazakhstan
PhD, Cand.Tech.Sc., Associate Professor
Almaty
A. Bekarystankyzy
Kazakhstan
PhD, Senior Researcher
Almaty
M. Turdalyuly
Kazakhstan
PhD, Senior Researcher
Almaty
References
1. Mamyrbayev, O., Oralbekova, D. Modern trends in the development of speech recognition systems. News of the National Academy of Sciences of the Republic of Kazakhstan, 4 (332), 42–51 (2020).
2. Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML, Pittsburgh, USA, pp. 369–376 (2006).
3. Majhi, M.K., and Saha, S.K. An automatic speech recognition system in Odia language using attention mechanism and data augmentation. International Journal of Speech Technology, 27, 717–728 (2024). https://doi.org/10.1007/s10772-024-10132-6.
4. Cheng, L., Zhu, H., Hu, Z., and Luo, B. A Sequence-to-Sequence Model for Online Signal Detection and Format Recognition. IEEE Signal Processing Letters, 31, 994–998 (2024). https://doi.org/10.1109/LSP.2024.3384015.
5. Mundotiya, R.K., Mehta, A., Baruah, R., and Singh, A.K. Integration of morphological features and contextual weightage using monotonic chunk attention for part of speech tagging. Journal of King Saud University – Computer and Information Sciences, 34, 7324–7334 (2021).
6. Deng, K., Cao, S., Zhang, Y., Ma, L., Cheng, G., Xu, J., and Zhang, P. Improving CTC-based speech recognition via knowledge transferring from pre-trained language models (2022). https://doi.org/10.48550/arXiv.2203.03582.
7. Dingliwal, S., Sunkara, M., Ronanki, S., Farris, J., Kirchhoff, K., and Bodapati, S. Personalization of CTC Speech Recognition Models. 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, pp. 302–309 (2023). https://doi.org/10.1109/SLT54892.2023.10022705.
8. Zeyer, A., Kazuki, I., Ralf, S., and Hermann, N. Improved training of end-to-end attention models for speech recognition. arXiv preprint arXiv:1805.03294 (2018).
9. Wang, Z., Tao, Z., Shao, Y., and Ding, B. LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement. Applied Acoustics, 172, 107647 (2021).
10. Huang, Z., Wang, P., Wang, J., Miao, H., Xu, J., and Zhang, P. Improving Transformer-Based End-to-End Code-Switching Speech Recognition Using Language Identification. Applied Sciences, 11 (19), 9106 (2021). https://doi.org/10.3390/app11199106.
11. Miao, H., Cheng, G., Gao, C., Zhang, P., and Yan, Y. Transformer-Based Online CTC/Attention End-to-End Speech Recognition Architecture. ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6084–6088 (2020). https://doi.org/10.1109/ICASSP40776.2020.9053165.
12. Li, F., Chen, J., and Zhang, X. A Survey of Non-Autoregressive Neural Machine Translation. Electronics, 12 (13), 2980 (2023). https://doi.org/10.3390/electronics12132980.
13. Chen, N., Watanabe, S., Villalba, J., and Dehak, N. Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition. arXiv preprint arXiv:1911.04908 (2020).
14. Fujita, Y., Watanabe, S., Omachi, M., and Chang, X. Insertion-Based Modeling for End-to-End Automatic Speech Recognition. INTERSPEECH 2020 (2020). https://doi.org/10.48550/arXiv.2005.13211.
15. Rakhimova, D., Sagat, K., Zhakypbaeva, K., and Zhunussova, A. Development and Study of a Post-editing Model for Russian-Kazakh and English-Kazakh Translation Based on Machine Learning. In: Wojtkiewicz K., Treur J., Pimenidis E., Maleszka M. (eds) Advances in Computational Collective Intelligence. ICCCI 2021. Communications in Computer and Information Science, vol. 1463. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88113-9_42.
16. Karyukin, V., Rakhimova, D., Karibayeva, A., Turganbayeva, A., and Turarbek, A. The neural machine translation models for the low-resource Kazakh–English language pair. PeerJ Computer Science, 9, e1224 (2023). https://doi.org/10.7717/peerj-cs.1224.
17. Kydyrbekova, A., and Oralbekova, D. Speaker identification using distribution-preserving x-vector generation. News of the National Academy of Sciences of the Republic of Kazakhstan, Physico-mathematical series, 4 (352), 152–162 (2024). https://doi.org/10.32014/2024.2518-1726.314.
18. Mamyrbayev, O., Alimhan, K., Oralbekova, D., Bekarystankyzy, A., and Zhumazhanov, B. Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level. Eastern-European Journal of Enterprise Technologies, 1 (9(115)), 84–92 (2022). https://doi.org/10.15587/1729-4061.2022.252801.
19. Mamyrbayev, O., Oralbekova, D., Kydyrbekova, A., Turdalykyzy, T., and Bekarystankyzy, A. End-to-End Model Based on RNN-T for Kazakh Speech Recognition. 2021 3rd International Conference on Computer Communication and the Internet (ICCCI), pp. 163–167 (2021). https://doi.org/10.1109/ICCCI51764.2021.9486811.
20. Soleymanpour, M., Johnson, M.T., Soleymanpour, R., and Berry, J. Synthesizing Dysarthric Speech Using Multi-Speaker TTS for Dysarthric Speech Recognition. ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, pp. 7382–7386 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746585.
For citations:
Oralbekova D., Mamyrbayev O., Yerimbetova A., Bekarystankyzy A., Turdalyuly M. APPLICATION OF NON-AUTOREGRESSIVE DECODING FOR KAZAKH SPEECH RECOGNITION. Herald of the Kazakh-British Technical University. 2025;22(4):23-30. (In Kazakh) https://doi.org/10.55452/1998-6688-2025-22-4-23-30