Nc&IS Web Graph Link Analysis http:/net.pku.edu.cn/wwbia 彭波 pb@netpku.edu.cn 北京大学信息科学技术学院 9/27/2010
Web Graph & Link Analysis http://net.pku.edu.cn/~wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 9/27/2010
上一讲主要内容 crawle面临的难题 Scalable, fast, polite ync UDP robust, continuous (slack about DNS prefetch K expiry dates) Text indexing client (UDP) 画晶实现高效率的基本技术 cache Cache Hyperlink Prefetch HttpH normalizer receve a Concurrency Page fetching context'thread disPageKnown? 多进程/多线程 e Craw k+ 异步I/O Persistent work-threadK H isUrlVisited? K+ pool of URLs 〓■有趣的技术 Bloom filter Consistent Hashing
上一讲主要内容 ◼ Crawler面临的难题 ◼ Scalable, fast, polite, robust, continuous ◼ 实现高效率的基本技术 ◼ Cache ◼ Prefetch ◼ Concurrency ◼ 多进程/多线程 ◼ 异步I/O ◼ 有趣的技术 ◼ Bloom filter ◼ Consistent Hashing
I ATT I BBNIGTE MAEW I CERFnet I DIgex I PSI I Sprint I UUnet Unkn owi MAE
Web Graph a玉 an Kodak Canon Worl wide Gateway Apple-Products-QuikTie TOshiba Dell C debe Systemi Incorporated Corpor atien td. Logitech Keyboards AMDE Adva AsUS International lemas Instruments In Veritas softw check c)2003 Touch Graph llC http://www.touchgraph.com/tggooglebroWseR.html
Web Graph http://www.touchgraph.com/TGGoogleBrowser.html
Giant Global Graph My Spac Linkedin Pinkbike PASA biking c content benullman. col google (portfolio sitel news creativity meetup me side proje groups Charlotte google budesigns.com groups employer Design Twitter personal site. intranet Charlotte o● leg: site owner, partner) shopping weekly O●幽 (eg: contributor, organizer Last. fm shallow Icc mont Amazon ebay Pandora (eg maintain a profile)
Giant Global Graph
图论、线性代数若干概念回顾 ■图,有向图,邻接矩阵,节点的度( degree),两节 点间的距离(d),图的直径(r),图的连通性, 有向图的强连通分支,连通分支 nd(uV):从u到v的最短路径的长度 r(G):最大的距离 矩阵(A),矩阵的转置(AT),行列式(|A|) 线性相关性,矩阵的秩,特征值,特征向量,特征 子空间 Ax=x, (I-A)x=0
图论、线性代数若干概念回顾 ◼ 图,有向图,邻接矩阵,节点的度(degree),两节 点间的距离(d),图的直径(r),图的连通性, 有向图的强连通分支,连通分支 ◼ d(u,v):从u到v的最短路径的长度 ◼ r(G):最大的距离 ◼ 矩阵(A),矩阵的转置(AT),行列式(|A|), 线性相关性,矩阵的秩,特征值,特征向量,特征 子空间 Ax = x, (I − A)x = 0
本次课大纲 nWeb有多大? Web的连通性如何? Web上节点的分布如何? Web上节点距离有多远? Web上节点重要度如何度量?
本次课大纲 ◼ Web 有多大? ◼ Web的连通性如何? ◼ Web上节点的分布如何? ◼ Web上节点距离有多远? ◼ Web上节点重要度如何度量?
Nc&IS Web有多大?
Web 有多大?
Web的大小一网页总数 ■图大小不可知,也无法定义 估计Web图节点数的下界 搜索引擎索引的网页数( crawled pages) 例如 CNNIC中国互联网网页调查报告 ■能更逼近真实值吗? 亿个 84.7 □网页数 一网页数增长率 240% 200.1% 180% 178.0% 44.7 89.49 120% 30 260 98.5% 60% 8.7 1.6 2002.122003.122004.122005.122006.122007.12
Web的大小—网页总数 ◼ 图大小不可知,也无法定义 ◼ 估计Web图节点数的下界 ◼ 搜索引擎索引的网页数(crawled pages) ◼ 例如CNNIC中国互联网网页调查报告 ◼ 能更逼近真实值吗?
Capture-Recapture Model Unknown number of fish in a lake Catch a sample and mark them m++, Let them loose Recapture a sample and look for marks Estimate population size n1 number in first sample 15 n2 number in second sample 10 n12 number in both samples 5 N= total population size assume that n1/N= n12/n2 therefore 15/N 5/10 N=(10X15)/5=30
Capture-Recapture Model ◼ Unknown number of fish in a lake ◼ Catch a sample and mark them ◼ Let them loose ◼ Recapture a sample and look for marks ◼ Estimate population size ◼ n1 = number in first sample 15 n2 = number in second sample 10 n12 = number in both samples 5 N = total population size assume that n1/N = n12/n2 therefore 15/N = 5/10 N = (10 x 15) / 5 = 30