Web Crawlers
Introduction to Data Science (The Way of Python)
Powered by 陈昊鹏
What is a crawler?
A crawler, also called a web spider, is a program or script that automatically visits websites and downloads their content.
Why do we need crawlers?
They collect data from the web efficiently and automatically, so that it can be processed afterwards.
Basic workflow
URL (address): the Uniform Resource Locator, i.e. the web address. The URL specifies which page to crawl.
HTML (raw page content): the markup of the page, from which we extract the key information we need.
TOOL (crawler tool): the concrete tool that visits the page and parses the information out of the HTML, for example Scrapy.
FORMAT (unified format): the extracted data is organized into one uniform format and exported.
[Screenshot: the investment-firm list at https://www.itjuzi.com/investfirm, with columns for firm, total number of investments, and latest investment event]
Scrapy - Specifying URLs

    def __init__(self):
        self.file = open('demo1_quotes.json', 'w')
        # Build the list of pages to crawl
        self.urls = []
        for i in range(1, 3):
            self.urls.append('http://quotes.toscrape.com/page/' + str(i))
        # The loop above has the same effect as:
        # self.urls = [
        #     'http://quotes.toscrape.com/page/1/',
        #     'http://quotes.toscrape.com/page/2/',
        # ]
        print(self.urls)

[Screenshot: browser tabs showing quotes.toscrape.com/page/1/ and quotes.toscrape.com/page/2/ of the "Quotes to Scrape" site]
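The page URLs follow a simple numeric pattern, so the same list can also be produced with a list comprehension. This is only a minimal sketch: the helper name `build_page_urls` and its parameters are hypothetical (they are not part of Scrapy or of the original slides), and the trailing slash follows the commented-out URLs on the slide.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return the URLs for pages first_page..last_page (inclusive)."""
    return [f"{base_url}/page/{n}/" for n in range(first_page, last_page + 1)]


# Same two pages as on the slide.
urls = build_page_urls("http://quotes.toscrape.com", 1, 2)
print(urls)
```

In a Scrapy spider, a list like this is commonly assigned to the `start_urls` attribute or iterated in `start_requests`; the slide's `self.urls` is presumably consumed in a similar way.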
Scrapy - Analyzing the HTML
[Screenshot: browser developer tools inspecting quotes on quotes.toscrape.com (by Albert Einstein and J.K. Rowling); each quote is a <div class="quote"> element, and the quote text is in the highlighted span.text element]
Scrapy - Parsing the page

    def parse(self, response):
        # Extract the list of quotes
        quotes = response.css("div.quote")
        for quote in quotes:
            # Extract the author of each quote
            author = quote.css("small.author::text").extract_first()
            # Extract the text of the quote
            text = quote.css(".text::text").extract_first()
            # Extract the quote's tags
            tags = quote.css(".tags .tag::text").extract()

[Screenshot: the page elements being matched: the quote text, "by Albert Einstein (about)", and the tags change, deep-thoughts, thinking, world]
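To try the extraction logic without Scrapy or a network connection, the same three fields can be pulled out with the standard library's `html.parser`. Note the swap: this sketch does not use Scrapy's CSS selectors, and `SAMPLE_HTML` is a hand-written, simplified stand-in for the site's markup (an assumption, not a copy of the real page). The class name `QuoteParser` is likewise hypothetical.

```python
from html.parser import HTMLParser

# Simplified stand-in for one quote block (assumed structure).
SAMPLE_HTML = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag">change</a>
    <a class="tag">deep-thoughts</a>
    <a class="tag">thinking</a>
    <a class="tag">world</a>
  </div>
</div>
"""


class QuoteParser(HTMLParser):
    """Collect author, text, and tags for each <div class="quote"> block."""

    def __init__(self):
        super().__init__()
        self.quotes = []    # finished items
        self._item = None   # item being built, or None outside a quote
        self._field = None  # field the next text node belongs to
        self._depth = 0     # <div> nesting depth inside the current quote

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._item is not None:
                self._depth += 1
            elif "quote" in classes:
                self._item = {"author": None, "text": None, "tags": []}
                self._depth = 1
        elif self._item is not None:
            if tag == "span" and "text" in classes:
                self._field = "text"
            elif tag == "small" and "author" in classes:
                self._field = "author"
            elif tag == "a" and "tag" in classes:
                self._field = "tag"

    def handle_data(self, data):
        if self._item is None or self._field is None:
            return
        if self._field == "tag":
            self._item["tags"].append(data.strip())
        else:
            self._item[self._field] = data.strip()
        self._field = None

    def handle_endtag(self, tag):
        # Closing the outermost <div> of a quote finishes the item.
        if tag == "div" and self._item is not None:
            self._depth -= 1
            if self._depth == 0:
                self.quotes.append(self._item)
                self._item = None


parser = QuoteParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.quotes)
```

The hand-rolled state tracking here is what a selector such as `quote.css(".tags .tag::text")` does in a single line, which is the main reason to prefer a tool like Scrapy for real crawls.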
Scrapy - Formatted export
The parse method is called each time a request receives its response.

    import json
    import os

    def parse(self, response):
        # Extract the list of quotes
        quotes = response.css("div.quote")
        for quote in quotes:
            # Extract the author of each quote
            author = quote.css("small.author::text").extract_first()
            # Extract the text of the quote
            text = quote.css(".text::text").extract_first()
            # Extract the quote's tags
            tags = quote.css(".tags .tag::text").extract()
            # Build a dict object
            item = {"author": author, "text": text, "tags": tags}
            # Convert the dict to a JSON string
            line = json.dumps(dict(item))
            # Write each entry to the file as one line
            self.file.write(line + "\n")
            # Flush promptly; otherwise the file may lag slightly behind
            self.file.flush()
            os.fsync(self.file.fileno())
        # Print the URL of the page just parsed. This serves as a progress
        # indicator and has nothing to do with the program logic.
        print("over:" + response.url)

[Screenshot: the exported file, one JSON object per line with "author", "tags", and "text" fields, for authors such as Albert Einstein, J.K. Rowling, and Jane Austen; non-ASCII characters appear escaped, as in "Andr\u00e9 Gide"]
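The export format on this slide is one JSON object per line. The sketch below writes such a file and reads it back. It also shows why the slide's output contains `Andr\u00e9`: `json.dumps` escapes non-ASCII characters unless `ensure_ascii=False` is passed. The sample items, their placeholder text, and the file name `quotes.jsonl` are made up for illustration.

```python
import json
import os
import tempfile

# Made-up sample items in the same shape as the slide's output.
items = [
    {"author": "Albert Einstein", "text": "placeholder quote text", "tags": ["change"]},
    {"author": "André Gide", "text": "placeholder quote text", "tags": ["life"]},
]

path = os.path.join(tempfile.mkdtemp(), "quotes.jsonl")

# Write one JSON object per line; ensure_ascii=False keeps "é" readable
# instead of the default "\u00e9" escape.
with open(path, "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

# Read the file back, parsing each line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == items)            # True
print(json.dumps("André Gide"))   # "Andr\u00e9 Gide"
```

Because each line is a complete JSON document, a partially written file is still readable up to its last complete line. The `with` block flushes and closes the file on exit, so an explicit `flush()` is mainly useful when another process watches the file while the crawl is still running.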
Common problems
IP address blocked
Site restructured
Page content updated
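One common way to lower the risk of an IP block is to crawl more slowly. The fragment below is a sketch of a Scrapy project's `settings.py` using setting names that Scrapy provides; the values are illustrative assumptions, not recommendations from the slides, and they do not guarantee that a site will not block the crawler.

```python
# settings.py (fragment); all values are illustrative assumptions.

ROBOTSTXT_OBEY = True                 # respect the site's robots.txt rules
DOWNLOAD_DELAY = 2.0                  # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the delay instead of using a fixed interval
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # do not hit one site with parallel requests
AUTOTHROTTLE_ENABLED = True           # adapt the delay to the server's response times
RETRY_TIMES = 2                       # retry failed requests a limited number of times
```

For the other two problems, a cheap safeguard is to check extracted fields for `None`: `extract_first()` returns `None` when a selector no longer matches, which is often the first sign that the page layout has changed.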