10384 200128011 UDC
The Design and Implementation of a Specialized Search Engine Based on Robot Technology
2004 5 2004 2004 2004 5
ABSTRACT
This thesis first gives a brief introduction to the history, evolution, and working principles of the World Wide Web, together with the information retrieval problems that the Web raises. Most general-purpose search engines today use robot software to fetch as many Web pages as possible and then build a full-document index or a partial index. When a user query arrives, the search engine retrieves the best-matching URLs from its database according to specific strategies and returns them to the user as an ordered list. With the rapid growth of the WWW and the increasing complexity of search engine systems, it has become much harder to design and implement a satisfactory search engine. At present it is feasible to design and implement a specialized search engine aimed at specific users and a specific specialty field, and this is also a research trend. Much valuable research has been done on focused crawling. This thesis applies an efficient automatic topic-term expansion algorithm in a specialized search engine based on focused crawling technology. The algorithm makes heavy use of the metadata of URLs within Web documents, obtained with Web mining techniques. Under an ordinary software and hardware configuration and with limited network resources, the system searches and indexes topic-relevant Web pages quickly and accurately, and it provides specialized users with a high-quality specialized information retrieval service.
Key Words: Search Engine; Focused Crawling; Web Mining
Contents
ABSTRACT
1.1 Web ...
1.2 Web ...
1.3 ...
1.4 ...
1.5 ...
2.1 ...
2.2 ...
2.3 ...
2.3.1 HTTP/HTTPS ...
2.3.2 Robot ...
2.3.3 ...
2.3.4 ...
2.3.5 ...
2.3.6 ...
2.3.7 ...
2.3.8 ...
2.3.9 XML ...
2.4 Google ...
2.4.1 Google ...
2.4.2 ...
2.5 ...
3.1 ...
3.2 ...
3.3 ...
3.3.1 ...
3.3.2 URL ...
3.3.3 ...
3.3.4 ...
3.4 ...
3.5 HITS ...
4.1 Web ...
4.1.1 MetaData ...
4.1.2 MetaData ...
4.2 Web ...
4.3 ...
4.4 ...
4.5 ...
5.1 ...
5.2 ...
5.3 ...
WWW World Wide Web 1989 3 CERN the European Laboratory for Particle Physics B/S Web 1993 Internet Web Mosaic Netscape Navigator Web Web Internet Web WWW Internet Web Home Page WWW
1.1 Web
WWW Web Web Web URL Web Hyper Text Web Web HTML Hyper Text Markup Language Web
Web Web IE Netscape Opera Web HTML Web WWW B/S Client Server
1.2 Web
Search Engine Web 1995 Internet Yahoo AltaVista Infoseek/Go Excite Lycos Google Ask Jeeves Baidu [1-5] Yahoo Infoseek Excite Lycos [6-9]
1.3
[10]
1. Single/General Search Engine: Google, AltaVista, Excite, Infoseek/Go, Lycos
2. Meta Search Engine: ALL4ONE, MetaCrawler, ProFusion, 3721
3. Intelligent Search Engine: Ask Jeeves, Google
4. Personal Search Engine
5. Specialized Search Engine: AAAFREESTUFF, MAPBLAST, SE4Topic Web
1.4
Web Meta data Web
1.5
WWW
SEARCH ENGINE WEB CRAWLERS [11-12] SPIDER ROBOT Internet Intranet
2.1
1994 4 WEBCRAWLER WWW Lycos 1995 Yahoo Excite Infoseek/Go AltaVista 1994 4 Internet Baidu Yahoo
2.2
Web URL Web [13] Web [14] 2.1 URL URL Robot WWW WWW 2.1
2.3
2.3.1 HTTP/HTTPS
HTTP TCP/IP WWW HTTP WWW HTTP Header Fields Entity HTTP/1.1 Server Date Content-type Last-modified Content-length
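The response header fields named above (Server, Date, Content-type, Last-modified, Content-length) can be read from a response head roughly as in the following minimal sketch. The sample response, the function names `parse_response_head` and `is_unchanged`, and the idea of using Last-Modified to skip pages that have not changed since the last visit are assumptions made for illustration; they are not taken from the system described in this thesis.

```python
from datetime import datetime, timezone
from email.parser import Parser
from email.utils import parsedate_to_datetime

# Hypothetical HTTP/1.1 response head, invented for this example.
RAW_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Server: ExampleServer/1.0\r\n"
    "Date: Mon, 10 May 2004 08:00:00 GMT\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Last-Modified: Sat, 08 May 2004 12:30:00 GMT\r\n"
    "Content-Length: 2048\r\n"
    "\r\n"
)


def parse_response_head(raw):
    """Split a response head into its status code and header fields."""
    status_line, _, header_block = raw.partition("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    # HTTP header fields use the same "Name: value" syntax as mail headers,
    # so the standard-library parser can read them (lookups ignore case).
    headers = Parser().parsestr(header_block, headersonly=True)
    return int(code), headers


def is_unchanged(headers, last_fetch):
    """Assumed policy: a page is unchanged if Last-Modified <= last fetch."""
    value = headers.get("Last-Modified")
    if value is None:
        return False  # no information, so fetch again
    return parsedate_to_datetime(value) <= last_fetch


code, headers = parse_response_head(RAW_HEAD)
for name in ("Server", "Date", "Content-Type", "Last-Modified", "Content-Length"):
    print(f"{name}: {headers[name]}")
print(is_unchanged(headers, datetime(2004, 5, 9, tzinfo=timezone.utc)))
```

A robot that records the time of its last visit could call `is_unchanged` before downloading and indexing a page again; whether the thesis system does so is not stated.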
HTTP /
2.3.2 Robot
Robot Robot Internet URL URL URL Robot Robot Robot
1 URL
2 URL HTTP/HTTPS Internet HTML URL
3 URL
4 2 3
2.3.3
1 [15] NOT AND OR
2
3 [16-17]
4
2.3.4
Internet Web 40 60 HTML 200 Internet
2.3.5
[18] 1 2 n
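The robot loop of section 2.3.2 (start from a set of URLs, fetch each page over HTTP/HTTPS, extract the URLs found in its HTML, and repeat) can be sketched as follows. This is a minimal sketch under stated assumptions, not the robot implemented in this thesis: the names `LinkExtractor`, `crawl`, and `FAKE_WEB` are hypothetical, the breadth-first queue and the `max_pages` limit are choices made for the example, and an in-memory dictionary stands in for real HTTP fetching so that the code runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque(seeds)   # step 1: start from the seed URLs
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # step 2: download the page
        if html is None:
            continue       # unreachable page: skip it
        pages[url] = html
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:  # step 3: queue newly found URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages           # step 4 is the loop itself


# Hypothetical three-page site used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.test/a.html": '<a href="/">home</a>',
    "http://example.test/b.html": '<a href="http://example.test/missing.html">x</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

Passing `fetch` as a parameter keeps the loop independent of the transport: a real robot would supply a function that issues HTTP/HTTPS requests, while a focused crawler would additionally filter or prioritize the queued links by topic relevance.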
Degree papers are held in the Xiamen University Electronic Theses and Dissertations Database. Full texts are available in the following ways: 1. If your library is a CALIS member library, please log on to http://etd.calis.edu.cn/ and submit a request online, or consult the interlibrary loan department in your library. 2. Users of non-CALIS member libraries should email etd@xmu.edu.cn for delivery details.