Application of Electronic Technique
CLC number: TP391.6
Document code: A
DOI: 10.16157/j.issn.0258-7998.223376
Citation format: Chen Musheng, Gao Fei, Wu Junhua. Web spam detection based on semantic features from current page[J]. Application of Electronic Technique, 2023, 49(6): 24-29.
Web spam detection based on semantic features from current page
Chen Musheng1,2, Gao Fei1, Wu Junhua1
(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang 330013, China)
Abstract: To address the difficulty and computational cost of feature extraction in web spam detection, this paper proposes a method that extracts semantic features solely from the HTML source of the current page. First, the domain name is segmented into words by a memoized search algorithm that combines depth-first search with dynamic programming. Second, latent Dirichlet allocation (LDA) is used to extract topic words from the page. Third, three single-page semantic similarity features are computed from Word2Vec word vectors and the word mover's distance. These semantic similarity features are then combined with single-page statistical features, and classifiers such as random forest are trained to detect web spam. Experiments show that classification on the combined single-page semantic and statistical features reaches an AUC of 88.0%, about 4% higher than the baseline method.
Key words: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; Word2Vec; word mover's distance; random forest
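
The domain-name segmentation step named in the abstract can be illustrated with a minimal sketch. The abstract gives neither the vocabulary nor the scoring rule, so everything here is an assumption for illustration: the toy `VOCAB` with made-up frequencies, the `UNKNOWN_LOGP` penalty, the choice to maximize the sum of log-frequencies, and the names `segment` and `best_from`. Only the overall shape follows the abstract: a depth-first search over split points whose sub-results are memoized, as in dynamic programming.

```python
from functools import lru_cache
import math

# Hypothetical toy vocabulary with made-up relative frequencies.
# A real system would load a large word-frequency list instead.
VOCAB = {
    "cheap": 5e-5, "car": 2e-4, "cars": 1e-4, "insurance": 4e-5,
    "in": 2e-2, "sure": 3e-4, "online": 1e-4, "on": 1e-2, "line": 2e-4,
}
MAX_WORD_LEN = max(len(w) for w in VOCAB)
UNKNOWN_LOGP = math.log(1e-10)  # assumed penalty for one unmatched character


def segment(label: str) -> list[str]:
    """Split one domain label into words, maximizing total log-frequency."""
    label = label.lower()

    @lru_cache(maxsize=None)  # memoization: each suffix is solved only once
    def best_from(start: int) -> tuple[float, tuple[str, ...]]:
        # Base case: nothing left to split.
        if start == len(label):
            return 0.0, ()
        # Fallback: emit a single character as an unknown token.
        tail_score, tail_words = best_from(start + 1)
        best = (UNKNOWN_LOGP + tail_score, (label[start],) + tail_words)
        # Depth-first step: try every vocabulary word that starts here.
        for end in range(start + 1, min(len(label), start + MAX_WORD_LEN) + 1):
            word = label[start:end]
            if word in VOCAB:
                tail_score, tail_words = best_from(end)
                score = math.log(VOCAB[word]) + tail_score
                if score > best[0]:
                    best = (score, (word,) + tail_words)
        return best

    return list(best_from(0)[1])
```

With this toy vocabulary, `segment("cheapcarinsurance")` returns `["cheap", "car", "insurance"]`, and `segment("carsonline")` prefers `["cars", "online"]` over `["cars", "on", "line"]` because the two-word split has the higher total log-frequency. Without the cache, the same suffix would be re-explored once for every way of reaching it; with it, each start index is solved once.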

0 Introduction

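Today, with the rapid growth of information on the Internet, search engines are regarded as the key tool for reaching websites, and their users account for more than 80% of all Internet users[1]. However, studies show that about 60% of users look only at the first five results on the first page[2]. Pages ranked near the top of the search results therefore attract more visitors and, in turn, more revenue. Because raising a page's ranking by legitimate means is very difficult, some websites resort to illegitimate means and techniques to deceive search engines into ranking their pages higher; such pages are called web spam[3]. Web spam degrades the quality of search results, wastes users' time, and encroaches on the legitimate interests of search engine companies and other content websites[4]. Although search engine companies have adopted a variety of methods to counter web spam, its detection remains a hard problem that search engines still need to solve, as well as a frontier topic in academic research. Detecting web spam efficiently and accurately is therefore of great importance.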



For the full text of this article, download: http://forexkbc.com/resource/share/2000005343







This content is original to the AET website; reproduction without authorization is prohibited.