METHODS OF IDENTIFICATION AND SELECTION OF CHARACTERISTICS IN THE PROCESSING OF SCIENTIFIC INFORMATION RESOURCES OF THE UNIVERSITY
Abstract
This paper discusses methods for identifying and selecting features when processing the scientific information resources of a university. Processing unstructured information resources involves several stages: extraction of terminological collocations, feature selection, classification, thematic annotation, document clustering, and analytical information retrieval. Methods for the automatic extraction of terminological collocations are used to form a subset of domain terms. The set of terminological collocations extracted from a given collection of scientific texts characterizes the narrow subject area of that collection. Automatic extraction of keywords and terminological collocations is the main stage in natural language processing tasks. For the automatic extraction of terminological collocations from scientific texts, this paper considers the C-value method. Setting a threshold on the C-value admits only terms longer than one word. The candidate terms obtained in this way form a list of n-grams (bigrams, trigrams). The main modification of this statistical method is the preliminary application of morphological filter patterns. Candidate phrases are extracted with the C-value method as follows: the text is segmented; phrases that satisfy the established conditions are extracted from it; and a database record is created for every candidate term that passes the established threshold. Feature selection methods are used to reduce the dimensionality of the feature space and retain the most informative features. Feature selection improves learning efficiency by reducing the size of the lexicon, and improves classification accuracy by eliminating noisy features. To remove non-informative terms, i.e., to assess the importance of the terms, a selection criterion was chosen. The corpus of documents for processing was assembled from articles published in journals in various fields.
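The C-value step described in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited form of the score, log2(|a|) * f(a) for a candidate a that is not nested in a longer candidate, and log2(|a|) * (f(a) - mean frequency of the longer candidates containing a) otherwise. The function names, the toy sentences, and the threshold value are hypothetical, and the morphological filter patterns mentioned in the abstract are omitted, so this is not the authors' implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=2, n_max=3):
    """Yield all n-grams (as tuples) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of a longer `long`."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_values(sentences, n_min=2, n_max=3):
    """Compute a simplified C-value for every bigram/trigram candidate."""
    freq = Counter(g for s in sentences for g in ngrams(s, n_min, n_max))
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of longer candidates that contain `a`.
        containing = [f_b for b, f_b in freq.items() if is_nested(a, b)]
        if containing:
            adjusted = f_a - sum(containing) / len(containing)
        else:
            adjusted = f_a
        # log2(1) == 0, so single-word candidates could never pass a
        # positive threshold; here only n >= 2 is generated anyway.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


def select_terms(scores, threshold):
    """Keep candidates whose C-value exceeds the (assumed) threshold."""
    return {term: s for term, s in scores.items() if s > threshold}


# Toy, already tokenized sentences (illustrative data only).
sentences = [
    ["solid", "state", "physics"],
    ["solid", "state", "physics"],
    ["solid", "state", "laser"],
]
scores = c_values(sentences)
terms = select_terms(scores, threshold=1.0)
```

In this toy run the bigram "state physics" scores zero because it only ever occurs inside the longer candidate "solid state physics", which is the nesting penalty the method is designed to apply. In a pipeline like the one the abstract outlines, the entries of `terms` would be the candidates written to the database.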
About the Authors
G. Zhomartkyzy
Kazakhstan
S. K. Kumargazhanova
Kazakhstan
G. V. Popova
Kazakhstan
For citations:
Zhomartkyzy G., Kumargazhanova S.K., Popova G.V. METHODS OF IDENTIFICATION AND SELECTION OF CHARACTERISTICS IN THE PROCESSING OF SCIENTIFIC INFORMATION RESOURCES OF THE UNIVERSITY. Herald of the Kazakh-British technical university. 2019;16(3):116-121. (In Russ.)