1   Purpose of the Research 
With the rapid development of computers and the Internet, more and more information is 
being generated and stored on the Internet. How can accurate and useful information be obtained? 
A search engine is a good tool for obtaining useful information, and it has become the most 
popular on-line service besides e-mail. 
The working process of a general search engine can be described as follows. First, the 
network robot, also called a spider, skims through the Internet and collects the URLs of web 
pages together with the information contained in those pages; the spider stores this information 
in an index database. The search utility then builds a results page that lists links to the URLs 
in the index matching the visitor's search keywords. 
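This crawl-index-query cycle can be illustrated with a minimal in-memory sketch. The names and data structures below (add_page, search, the dictionary standing in for the index database) are illustrative assumptions, not part of the system described in this paper. 

```python
# Minimal sketch of the crawl -> index -> query cycle described above.
from collections import defaultdict

index = defaultdict(set)   # keyword -> set of URLs (the "index database")
pages = {}                 # URL -> raw page text collected by the spider

def add_page(url, text):
    """Store a page fetched by the spider and index its words."""
    pages[url] = text
    for word in text.lower().split():
        index[word].add(url)

def search(keywords):
    """Return the URLs whose indexed words match every query keyword."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()

add_page("http://example.com/a", "wholesale electronics price list")
add_page("http://example.com/b", "travel blog and photos")
print(search(["electronics", "price"]))   # {'http://example.com/a'}
```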
However, the result pages contain a great deal of irrelevant information, so more and 
more attention is being paid to vertical search within a specific domain. 
Business information is only a small part of the information on the network. Searching for 
business information with a general spider would take far more time and effort, because all 
the information the spider finds must be downloaded and then judged to be business 
information or not. Therefore, studying how to implement an efficient commerce-oriented 
spider program is necessary and of real value. This paper introduces a method for 
implementing a commerce-oriented search engine. 
2   Realization Process 
The network robot always starts from one or several web pages and then goes through all 
the pages it can reach. First, the spider analyzes the HTML code of a 
web page and seeks the hyperlinks in it; it then skims through all the linked pages 
using a recursive or non-recursive algorithm. Recursion is a technique in which a 
procedure calls itself. It is simple, but it does not combine well with multi-threading, 
so it cannot be adopted in an efficient spider program. With the non-recursive method, 
the spider program puts the hyperlinks it finds into a waiting 
queue instead of following them immediately. When the spider has finished scanning the 
current web page, it takes the next URL from the queue according to its scheduling algorithm. 
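A sketch of this non-recursive crawl loop follows: links found on the current page are pushed onto a waiting queue rather than followed at once. The function is_commerce_related is a placeholder assumption standing in for the filtering step described next; the rest of the names are likewise illustrative. 

```python
# Non-recursive spider loop: hyperlinks go into a waiting queue.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def is_commerce_related(url):
    # Placeholder for the LSA-based judgement introduced below.
    return True

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)      # the waiting queue of URLs
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()     # FIFO scheduling; other orders are possible
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue              # skip pages that cannot be fetched
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited and is_commerce_related(absolute):
                queue.append(absolute)
    return visited
```

A FIFO queue gives a breadth-first traversal; replacing the deque with a priority queue would let the spider visit the most promising commerce-related links first. 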
Before a hyperlink is added to the queue, the commerce-oriented spider judges whether it 
is related to commerce. This is achieved as follows: 
1. Collect some typical commerce-related documents and convert them to text 
files; these serve as the initial training texts; 
2. Use LSA theory to build an entry-text matrix from the training texts. In the LSA 
model, a text set can be represented as an r×m entry matrix D, where m is the number of texts 
in the set and r is the number of distinct entries occurring in it. That is, each distinct entry 
corresponds to a row of D, and each text corresponds to a column of D: 
$D = [d_{ij}]_{r \times m}$, where $d_{ij}$ is the weight of entry i in text j. 
As is well known, there are many formulas for calculating such weights in traditional 
vector-space representations, and a very familiar weighting formula is used. 
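The excerpt breaks off before the paper's weight formula appears. A very common choice in traditional vector-space models is TF-IDF weighting, $d_{ij} = tf_{ij} \cdot \log(m / n_i)$, where $tf_{ij}$ is the frequency of entry i in text j and $n_i$ is the number of texts containing entry i; the sketch below builds the entry-text matrix under that assumption. The function name and tokenisation are illustrative. 

```python
# Sketch of building the r x m entry-text matrix D = [d_ij] from training texts,
# assuming TF-IDF weights (the paper's own formula is not shown in this excerpt).
import math

def build_entry_text_matrix(texts):
    """texts: list of m tokenised training documents (lists of entries)."""
    entries = sorted({term for doc in texts for term in doc})   # r distinct entries
    m = len(texts)
    # Number of documents containing each entry, used for the IDF factor.
    doc_freq = {e: sum(1 for doc in texts if e in doc) for e in entries}
    D = []
    for e in entries:                                  # one row per entry
        idf = math.log(m / doc_freq[e])
        row = [doc.count(e) * idf for doc in texts]    # d_ij = tf_ij * idf_i
        D.append(row)
    return entries, D

docs = [
    "wholesale price quotation for electronics".split(),
    "online shop order payment and delivery".split(),
    "hiking trail weather report".split(),
]
entries, D = build_entry_text_matrix(docs)   # D has r rows and m = 3 columns
```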