6.001, Spring Semester, 2005 -- Project 3

1.1 The Web as a graph

The Web itself can be thought of as a directed graph in which the nodes are HTML documents and the edges are hyperlinks to other HTML documents. For example, in Figure 2 the node labeled B would be a URL, and a directed edge exists between two nodes B and E if the document represented by node B contains a link to the document represented by node E (as it does in this case).

As mentioned earlier, a web spider (or web crawler) is a program that traverses the web. A web spider might support procedures such as:

  (find-URL-links web URL)   returns a list of the URLs that are outbound links from URL
  (find-URL-text web URL)    returns an alphabetized list of all of the words occurring in the document at URL

Note, we have not said anything yet about the actual representation of the web. We are simply stating an abstract definition of a data structure. In a real web crawler, find-URL-links would involve retrieving the document over the network using its URL, parsing the HTML returned by the web server, and extracting the link information from <A HREF=...>, <FRAME SRC=...> and similar tags. Similarly, in a real web crawler, find-URL-text would retrieve the document, discard all of the mark-up commands (such as <BODY>, <HEAD>, <P>, etc.), and alphabetize (and remove duplicates from) the resulting list of words.

For this project our programs will not actually go out and retrieve documents over the web. Instead, we will represent a collection of web documents as a graph, as discussed earlier. When you load the code for this project, you will have available a global variable, the-web, which holds the graph representation for a set of documents for use in this project.

Note: it is important to separate our particular representation of information on the web from the idea of the web as a loose collection of documents.
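To make the abstract interface concrete, here is a hypothetical interaction sketch, assuming the-web encodes the Figure 2 graph and that URLs are represented as Scheme symbols (both assumptions for illustration; the handout has not yet fixed a representation):

  ;; Hypothetical session against the abstract spider interface.
  (find-URL-links the-web 'B)
  ;; => a list of B's outbound links, which would include E

  (find-URL-text the-web 'E)
  ;; => an alphabetized, duplicate-free list of the words in document E

Nothing in these calls depends on whether the web is real or simulated; that is the point of stating the interface abstractly before committing to a representation.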
We are choosing to use a graph to capture a simple version of the web. This is simply to provide us with a concrete representation of the web, so that we can examine issues related to exploring the web. In practice, we would never build an entire graph representation of the web; we would simply take advantage of the abstraction to conceptualize the structure of the web, especially since it is a dynamic thing that constantly changes. Our implementation of find-URL-links and find-URL-text will simply use the graph procedures to get web links (children) and web page contents:

  (define (find-URL-links web url)
    (find-node-children web url))

  (define (find-URL-text web url)
    (find-node-contents web url))

In other words, we are converting operations that would normally apply to the web itself into operations that work on the internal representation of the web as a graph.
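As a self-contained sketch of this conversion (not the project's actual graph code), one could model a tiny web as association lists and supply minimal stand-in versions of the graph selectors. Only find-node-children and find-node-contents are names from the text; tiny-web, its layout, and the selector bodies are assumptions made for illustration:

  ;; Stand-in graph representation, for illustration only:
  ;; a graph is a list of nodes, each node a list (name children contents).
  (define tiny-web
    '((B (E) (help this is))
      (E ()  (a tiny web))))

  ;; Minimal stand-in versions of the graph selectors used by the text.
  (define (find-node-children graph name)
    (cadr (assq name graph)))
  (define (find-node-contents graph name)
    (caddr (assq name graph)))

  ;; The procedures from the text, unchanged:
  (define (find-URL-links web url)
    (find-node-children web url))
  (define (find-URL-text web url)
    (find-node-contents web url))

  (find-URL-links tiny-web 'B)   ; => (E)
  (find-URL-text tiny-web 'E)    ; => (a tiny web)

Because find-URL-links and find-URL-text are defined purely in terms of the graph selectors, swapping this toy representation for the project's real the-web requires no change to them.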