MASSACHVSETTS INSTITVTE OF TECHNOLOGY
Department of Electrical Engineering and Computer Science
6.001 - Structure and Interpretation of Computer Programs
Spring Semester, 2005

Issued: Tuesday, March 15, 2005
Solutions due on online tutor: Friday, April 1, 2005 by 6:00 PM

Crawling and Indexing the World Wide Web

This project explores some issues that arise in constructing a "spider" or a "web agent" that crawls over documents in the World Wide Web. For purposes of this project, the Web is an extremely large collection of documents. Each document contains some text and also links to other documents, in the form of URLs.

In this project, we'll be working with programs that can start with an initial document and follow the references to other documents to do useful things. For example, we could construct an index of all the words occurring in documents, and make this available to people looking for information on the web (as do many of the search engines on the web, such as Google or Yahoo).

Just in case you aren't fluent with the details of HTTP, URLs, URIs, HTML, XML, XSL, HTTP-NG, DOM, and the rest of the alphabet soup that makes up the technical details of the Web, here's a simplified version of what goes on behind the scenes:

1. The Web consists of a very large number of things called documents, identified by names called URLs (Uniform Resource Locators). For example, the OCW home page has the URL http://ocw.mit.edu/. The first portion of a URL (http://) reveals the name of a protocol (in this case Hypertext Transfer Protocol, or HTTP) that can be used to fetch the document, and the rest of the URL contains information needed by the protocol to specify which document is intended. (A protocol is a particular set of rules for how to communicate.)

2. By using the HTTP protocol, a program (most commonly a browser, but any program can do this; "web agents" and spiders are examples of such programs that aren't browsers) can retrieve a document whose URL starts with http. The document is returned to the program, along with information about how it is encoded, for example, ASCII or Unicode text, HTML, images in GIF or JPG or MPEG or PNG or some other format, an Excel or other spreadsheet, etc.

3. Documents encoded in HTML (HyperText Markup Language) form can contain a mixture of text, images, formatting information, and links to other documents. Thus, when a browser (or other program) gets an HTML document, it can extract the links from it, yielding URLs for other documents in the Web. If these are in HTML format, then they too can be retrieved and will yield yet more links, and so on.

4. A spider is a program that starts with an initial set of URLs, retrieves the corresponding documents, adds the links from these documents to the set of URLs, and keeps on going. Every time it retrieves a document, it does some (hopefully useful) work in addition to just finding the embedded links.
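To make items 1 and 3 concrete, here is a minimal Python sketch of splitting a URL into its protocol and remainder, and of pulling links out of an HTML document. It is only an illustration, not part of the project code: the class name LinkExtractor, the helper extract_links, and the SAMPLE_HTML page body are all assumptions invented for this example. Only the OCW URL comes from the text above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags as absolute URLs (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html_text):
    """Return the list of absolute URLs linked from html_text."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up page body, used only for illustration.
SAMPLE_HTML = (
    '<p>See <a href="/courses/">courses</a> and '
    '<a href="http://web.mit.edu/">MIT</a>.</p>'
)

# The first portion names the protocol; the rest identifies the document.
parts = urlsplit("http://ocw.mit.edu/")
print(parts.scheme, parts.netloc, parts.path)

print(extract_links("http://ocw.mit.edu/", SAMPLE_HTML))
```

The sketch works on a string rather than fetching anything over the network, so it can be run without a connection. A real web agent would first retrieve the document over HTTP and check how it is encoded before trying to parse it as HTML.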
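The spider described in item 4, combined with the word index mentioned in the introduction, might be sketched as follows. This is a hedged illustration under stated assumptions, not the project's own implementation: TOY_WEB is a made-up in-memory stand-in for the Web, fetch is a hypothetical replacement for an HTTP retrieval, and the breadth-first order is just one reasonable choice of traversal.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (text, outgoing links).
TOY_WEB = {
    "http://example.org/a": (
        "spiders crawl the web",
        ["http://example.org/b", "http://example.org/c"],
    ),
    "http://example.org/b": (
        "the web has documents",
        ["http://example.org/a"],
    ),
    "http://example.org/c": (
        "documents contain links",
        ["http://example.org/missing"],
    ),
}


def fetch(url):
    """Stand-in for an HTTP fetch: return (text, links), or None if unavailable."""
    return TOY_WEB.get(url)


def crawl_and_index(start_urls):
    """Visit every reachable document once; map each word to the URLs containing it."""
    index = {}
    seen = set()
    frontier = deque()

    def enqueue(url):
        # Remember every URL ever queued so that cycles do not loop forever.
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    for url in start_urls:
        enqueue(url)

    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue  # Dead link: nothing to index, nothing to follow.
        text, links = page
        # The "useful work" done at each document: index its words.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Add the document's links to the set of URLs still to visit.
        for link in links:
            enqueue(link)
    return index


index = crawl_and_index(["http://example.org/a"])
print(sorted(index["documents"]))
```

The seen set matters because documents can link back to each other (here a and b do), so a spider that did not track visited URLs would never terminate.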