CLC number: TN919.5; TP391.1
Document code: A
DOI: 10.16157/j.issn.0258-7998.200160
Chinese citation format: 淮曉永, 韓曉東, 高若辰, 等. 一種自適應網頁結構化信息提取方法[J]. 電子技術應用, 2020, 46(12): 97-102.
English citation format: Huai Xiaoyong, Han Xiaodong, Gao Ruochen, et al. An adaptive method for extracting structured information from web pages[J]. Application of Electronic Technique, 2020, 46(12): 97-102.
An adaptive method for extracting structured information from web pages
National Computer System Engineering Research Institute of China, Beijing 100083, China
Abstract: Traditional whole-page collection of web site information yields mixed content that cannot be used directly, while manual structured collection is costly and inefficient. To meet the needs of Internet information collection and mining, this paper proposes an adaptive method for extracting structured information from web pages. We implement a web page classification algorithm and subtree-based algorithms for extracting structured title items and content items. Trained on an annotated classification dataset of typical web site pages, the classification model adapts to the differences among web sites and classifies each page; list-structured or content-structured information is then extracted according to the page class. This technique plays an important role in improving the automation level and processing efficiency of structured web site information collection and processing.
Key words: information extraction; structured information; classification model; adaptive
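Traditional whole-page collection of web site information yields mixed content that cannot be used directly, while manual structured collection is costly and inefficient. To address these problems, this work develops a DOM-tree-based method for extracting structured information from web pages (DOM based Web-page Structured Information Extraction, DWSIE) and implements a service toolkit for web page structured information extraction. The toolkit greatly improves the automation level and processing efficiency of structured web site information collection and processing.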
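The classify-then-extract flow described above can be illustrated with a minimal Python sketch. It is not the DWSIE implementation: the paper trains a classification model on annotated pages, whereas this sketch substitutes a simple link-text-ratio heuristic as a stand-in. All names (`Node`, `DomBuilder`, `classify`, `extract_list_items`, `extract_content`) and the 0.5 threshold are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class Node:
    """A minimal DOM node: tag, attributes, children, and directly owned text."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree using only the standard library."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in self.VOID:      # void elements never receive children
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur                  # climb to the nearest open element with this tag
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text += data.strip()


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def full_text(node):
    return "".join(n.text for n in walk(node))


def links(node):
    return [n for n in walk(node) if n.tag == "a" and "href" in n.attrs]


def classify(root, threshold=0.5):
    """Heuristic stand-in for the trained classifier (an assumption):
    a page whose text sits mostly inside links is treated as a list page."""
    total = len(full_text(root))
    linked = sum(len(full_text(a)) for a in links(root))
    return "list" if total and linked / total >= threshold else "content"


def extract_list_items(root):
    """Pick the subtree with the most same-tag children that each contain a
    link, and return one title item (title, url) per such child."""
    best, best_count = None, 0
    for n in walk(root):
        kids = [c for c in n.children if links(c)]
        if len(kids) > best_count and len({c.tag for c in kids}) == 1:
            best, best_count = n, len(kids)
    if best is None:
        return []
    items = []
    for child in best.children:
        found = links(child)
        if found:
            items.append({"title": full_text(found[0]), "url": found[0].attrs["href"]})
    return items


def extract_content(root):
    """Return a content item: first <h1> as the title, <p> texts as the body."""
    h1 = next((n for n in walk(root) if n.tag == "h1"), None)
    paras = [full_text(n) for n in walk(root) if n.tag == "p"]
    return {"title": full_text(h1) if h1 else "",
            "content": "\n".join(p for p in paras if p)}
```

A caller would parse a page, call `classify`, and dispatch to `extract_list_items` or `extract_content` depending on the result. A learned classifier adapted to site differences, as the paper describes, would replace `classify` without changing the two extraction functions.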