Herald of the Kazakh-British Technical University

APPLICATION OF NON-AUTOREGRESSIVE DECODING FOR KAZAKH SPEECH RECOGNITION

https://doi.org/10.55452/1998-6688-2025-22-4-23-30

Abstract

In the field of speech recognition, end-to-end models are gradually replacing traditional and hybrid approaches. Their core principle is autoregressive decoding, in which the output sequence is generated from left to right. However, it has not been shown that this strategy yields the best results for converting speech to text. Moreover, autoregressive end-to-end models rely solely on the previous context, which complicates the processing of unclear or distorted sounds. To address this, the insertion method was proposed, which does not use autoregressive decoding and generates output tokens in an arbitrary order. This paper examines a Kazakh speech recognition model trained using the insertion method and Connectionist Temporal Classification (CTC). The experiments conducted showed that this method improves recognition accuracy. Unlike autoregressive models, the insertion method provides greater flexibility in processing sequences, as it does not impose a strict generation order on the output. This reduces decoding latency and makes the model more robust to poorly pronounced words. Furthermore, combining the insertion method with CTC improves the alignment between the audio data and the text transcription, which is especially important for agglutinative languages such as Kazakh. According to the experimental results, the proposed model achieved a recognition error rate of 10.2%, making it competitive with current systems.
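The abstract refers to two decoding mechanisms: CTC, which collapses a per-frame label sequence by merging consecutive repeats and dropping blanks, and insertion-based generation, which builds the output in an arbitrary order rather than strictly left to right. The following toy sketch (hypothetical Python, not the authors' implementation; the example word and the insertion order are invented for illustration) shows both rules:

```python
BLANK = "_"

def ctc_collapse(frames):
    """CTC collapse rule: merge consecutive repeated labels, then drop blanks."""
    out, prev = [], None
    for sym in frames:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

def apply_insertions(steps):
    """Insertion-based generation: each step supplies a token together with
    the position at which it is inserted into the growing partial output."""
    seq = []
    for pos, tok in steps:
        seq.insert(pos, tok)
    return "".join(seq)

# A frame-level CTC labelling collapses to the transcript "salem".
print(ctc_collapse("__ss__aa_l__eemm__"))  # -> salem

# The same transcript, generated middle-out instead of left to right.
print(apply_insertions([(0, "l"), (0, "a"), (2, "e"), (0, "s"), (4, "m")]))  # -> salem
```

The second call illustrates why insertion decoding is order-free: any permutation of the tokens, with appropriately chosen positions, reconstructs the same final sequence.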

About the Authors

D. Oralbekova
Institute of Information and Computational Technologies
Kazakhstan

PhD, Senior Researcher

Almaty



O. Mamyrbayev
Institute of Information and Computational Technologies
Kazakhstan

PhD, Professor, Chief Researcher

Almaty



A. Yerimbetova
Institute of Information and Computational Technologies
Kazakhstan

PhD, Cand.Tech.Sc., Associate Professor

Almaty



A. Bekarystankyzy
Institute of Information and Computational Technologies
Kazakhstan

PhD, Senior Researcher

Almaty



M. Turdalyuly
Institute of Information and Computational Technologies
Kazakhstan

PhD, Senior Researcher

Almaty





For citations:


Oralbekova D., Mamyrbayev O., Yerimbetova A., Bekarystankyzy A., Turdalyuly M. APPLICATION OF NON-AUTOREGRESSIVE DECODING FOR KAZAKH SPEECH RECOGNITION. Herald of the Kazakh-British Technical University. 2025;22(4):23-30. (In Kazakh) https://doi.org/10.55452/1998-6688-2025-22-4-23-30

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6688 (Print)
ISSN 2959-8109 (Online)