(1.Beijing Petroleum Machinery Co.,,Ltd.,,China Petroleum Engineering Technology Research Institute, Beijing 102206,,China,; 2.School of Information,Renmin University of China,,Beijing 100872,,China)
Abstract: In the process of digital construction of enterprises, the digital processing of a large number of daily business activity texts of enterprises is usually multi-task, and it is necessary to complete information extraction tasks and text classification tasks for text data at the same time. In this application scenario, in order to achieve a more accurate text classification effect, this paper proposes an improved TF-IDF algorithm, which uses the text information extraction result as the distinguishing feature of important text categories, and introduces the information gain method to obtain an improved weight calculation formula. Then an improved text feature vector space representation is obtained, and then a text classification model is constructed. The experiment takes the Chinese text of the petroleum industry as an example, and selects 2 006 test texts for text classification comparison experiments. The experimental results show that the improved TF-IDF algorithm has an accuracy rate P of 99.99% and a recall rate R of 99.87%. The algorithm text classification effect has been significantly improved.
Key words : text classification;VSM,;TF-IDF,;petroleum;support vector machine