CLC number: TP391.6  Document code: A  DOI: 10.16157/j.issn.0258-7998.223376
Chinese citation format: 陳木生, 高斐, 吳俊華. 基于單頁語義特征的垃圾網頁檢測[J]. 電子技術應用, 2023, 49(6): 24-29.
English citation format: Chen Musheng, Gao Fei, Wu Junhua. Web spam detection based on semantic features from current page[J]. Application of Electronic Technique, 2023, 49(6): 24-29.
Web spam detection based on semantic features from current page
Chen Musheng1,2, Gao Fei1, Wu Junhua1
(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang 330013, China)
Abstract: Feature extraction for web spam detection is difficult and computationally expensive. To address this, a method is proposed that extracts semantic features using only the HTML source of the current page. First, the domain name is segmented by a memoized search algorithm that combines depth-first search with dynamic programming. Second, latent Dirichlet allocation (LDA) is used to extract topic words from the web page. Third, three single-page semantic similarity features are computed using Word2Vec and the Word Mover's Distance. These semantic similarity features are combined with single-page statistical features, and classification algorithms such as random forest are used to build web spam detection models. Experimental results show that classification with the single-page semantic and statistical features reaches an AUC of 88.0%, about 4% higher than the control method.
Key words: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; Word2Vec; Word Mover's Distance; random forest
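The domain name segmentation step named in the abstract (memoized search combining depth-first search with dynamic programming) can be illustrated with a minimal Python sketch. The abstract does not state the segmentation objective or the dictionary used, so the following are assumptions made only for illustration: the toy vocabulary `VOCAB`, the function name `segment_domain`, and the choice to prefer the split with the fewest words. It is not the authors' implementation.

```python
from functools import lru_cache

# Hypothetical toy vocabulary (an assumption); a real system would load a large word list.
VOCAB = frozenset({"free", "cheap", "pills", "pill", "s", "online", "on", "line", "shop"})


def segment_domain(domain: str, vocab: frozenset = VOCAB) -> list[str]:
    """Split a domain label into vocabulary words using memoized depth-first search.

    Assumed objective: the split with the fewest words wins.
    If no full segmentation exists, the lowercased label is returned as one token.
    """
    text = domain.lower()

    @lru_cache(maxsize=None)
    def best(start: int):
        # Best segmentation of text[start:], as a tuple of words, or None if impossible.
        if start == len(text):
            return ()
        best_split = None
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word not in vocab:
                continue
            rest = best(end)  # depth-first descent; repeated suffixes are served from the cache
            if rest is None:
                continue
            candidate = (word,) + rest
            if best_split is None or len(candidate) < len(best_split):
                best_split = candidate
        return best_split

    result = best(0)
    return list(result) if result is not None else [text]


if __name__ == "__main__":
    print(segment_domain("cheappillsonline"))
```

Caching `best(start)` means each suffix of the label is solved only once, so the search runs over a quadratic number of substrings instead of exploring every possible split repeatedly. The resulting word list is the kind of input that could then be compared with page topic words in the later similarity steps.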