SPEECH SEGMENTATION DURING SPEAKER MATCHING
https://doi.org/10.55452/1998-6688-2025-22-2-10-23
Abstract
Speech segmentation is the process of dividing a speech signal into parts and is an important aspect of speaker identification and speech recognition systems. By accurately detecting the beginning and end of speech, it improves the efficiency of such systems. Voice activity detectors (VADs) play an important role in segmentation, as they help determine the boundaries between speech and silence. The most common segmentation errors, however, are false positives and false negatives, which degrade the overall accuracy of the system, so these errors must be reduced through various approaches and methods. Measures such as background noise reduction, deep learning models, and data augmentation can significantly improve segmentation quality, while spectral analysis methods and spectral features make it possible to clearly distinguish speech from background noise. The purpose of this study is to optimize the segmentation process, analyze the probability of errors, and improve the efficiency of speech recognition systems; in doing so, this work provides a basis for further research and development in the field of speech recognition. The article considers the problem of speech segmentation for speaker identification and describes possible segmentation criteria: qualitative and quantitative characteristics of spoken speech, such as pauses and intonation, as well as their acoustic relationships. These criteria allow a specialist to identify specific segmental units (syllables, words, etc.), record their structure, and identify their main features.
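To illustrate the kind of VAD-based boundary detection described in the abstract, below is a minimal sketch of an energy-based voice activity detector in Python. It is not the method used in the article: the frame length, hop size, threshold factor, and function names (energy_vad, segments_from_decisions) are illustrative assumptions, and real systems would add spectral features and smoothing as the abstract notes.

# Minimal energy-based VAD sketch (illustrative assumptions, not the article's method).
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def energy_vad(x: np.ndarray, sr: int, frame_ms: float = 25.0,
               hop_ms: float = 10.0, threshold_factor: float = 1.5) -> np.ndarray:
    """Return a boolean speech/non-speech decision per frame.

    A frame is marked as speech when its short-time energy exceeds
    threshold_factor times the median frame energy (a simple noise-floor
    estimate). Production VADs typically add spectral features and
    hangover logic to reduce false positives and false negatives.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)
    threshold = threshold_factor * np.median(energy)
    return energy > threshold

def segments_from_decisions(decisions: np.ndarray, hop_ms: float = 10.0):
    """Convert per-frame decisions into (start_ms, end_ms) speech segments."""
    segments, start = [], None
    for i, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * hop_ms, i * hop_ms))
            start = None
    if start is not None:
        segments.append((start * hop_ms, len(decisions) * hop_ms))
    return segments

# Example (assuming a 16 kHz mono signal `x` as a NumPy array):
#   decisions = energy_vad(x, sr=16000)
#   print(segments_from_decisions(decisions))  # e.g. [(250.0, 1480.0), ...]

The median-based noise-floor estimate keeps the threshold robust to short bursts of speech energy; swapping in spectral features or a learned model changes only the per-frame decision, not the segment-building step.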
About the Authors
A. T. Akhmediyarova
Kazakhstan
PhD, Associate Professor
Almaty
Zh. M. Alibiyeva
Kazakhstan
PhD, Associate Professor
Almaty
N. K. Mukazhanov
Kazakhstan
PhD, Associate Professor
Almaty
For citations:
Akhmediyarova A.T., Alibiyeva Zh.M., Mukazhanov N.K. SPEECH SEGMENTATION DURING SPEAKER MATCHING. Herald of the Kazakh-British Technical University. 2025;22(2):10-23. (In Kazakh) https://doi.org/10.55452/1998-6688-2025-22-2-10-23