Python version: fetching the real sub-links from a Baidu search results page

Size: 3KB

File type: .py

Published: 2021-05-10
Language: Python

Resource description

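This script follows the same approach as the C++ version: you specify how many pages of Baidu search results to crawl, and it collects the sub-links found on each result page.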
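
The page-by-page crawl described above can be sketched as a small helper that builds one search URL per requested page. This is a minimal Python 3 sketch, not the author's code: `build_search_urls` and `results_per_page` are hypothetical names, and it assumes that Baidu's `pn` parameter is the zero-based offset of the first result on a page (10 results per page), which the original snippet does not confirm.

```python
from urllib.parse import quote


def build_search_urls(keyword, pages=1, results_per_page=10):
    """Return one Baidu search URL per requested result page.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Percent-encode the UTF-8 bytes of the keyword for use in the query string.
    encoded = quote(keyword.encode("utf-8"))
    urls = []
    for page in range(pages):
        # Assumption: `pn` is a result offset (0, 10, 20, ...), not a page number.
        offset = page * results_per_page
        urls.append(
            "http://www.baidu.com/s?wd=%s&pn=%d&ie=utf-8" % (encoded, offset)
        )
    return urls


if __name__ == "__main__":
    for url in build_search_urls("python", pages=3):
        print(url)
```

Keeping URL construction separate from fetching makes it possible to check the generated URLs without touching the network.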

Code snippet and file information

#!/usr/bin
#coding:utf-8
import sys
import urllib
import urllib2
import re

class FetchUrl:
    """A Baidu crawler that collects the sub-URLs found in search result pages."""

    def __init__(self, strKeyword, iPages=1):
        '''Store the search keyword and the number of result pages to crawl.'''
        self.m_strKeyword = strKeyword
        self.m_iPages = iPages

    def GetSubPageUrlList(self, url, comreg):
        '''Fetch a page and return the sub-URLs matched by comreg.'''
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            print "******Got an HTTPError, trying again*****"
            response = urllib2.urlopen(url)
        except urllib2.URLError as e:
            print "******Got a URLError, trying again*****"
            response = urllib2.urlopen(url)
        htmlpage = response.read()
        infoList1 = re.findall(comreg, htmlpage)
        # Deduplicate the list before returning it
        return list(set(infoList1))

    def GetUrlList(self):
        '''Collect the sub-links from the specified number of result pages.'''
        mainList = []
        reg = r'http://www.baidu.com/link\?url=.[^\"]+'
        comreg = re.compile(reg)
        print "Task keyword: %s" % self.m_strKeyword
        # URL-encode the keyword
        encodeKeyword = urllib.quote(self.m_strKeyword.decode('gbk').encode('utf-8'))
        i = 1
        while i <= self.m_iPages:
            url = 'http://www.baidu.com/s?wd=%s&pn=%d&tn=baiduhome_pg&ie=utf-8&usm=4' % (encodeKeyword, i)
            subList = self
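
The two ideas in `GetSubPageUrlList` (matching Baidu redirect links with a regular expression, then deduplicating them, with a retry when the request fails) can be sketched in Python 3 as follows. This is an illustrative sketch under stated assumptions rather than the author's implementation: `extract_sub_links` and `fetch_with_retry` are hypothetical names, the pattern is copied from the snippet, and the retry count is bounded here, whereas the snippet retries once without guarding the second attempt.

```python
import re

# Same pattern as in the snippet: a Baidu redirect link up to the closing quote.
LINK_PATTERN = re.compile(r'http://www.baidu.com/link\?url=.[^\"]+')


def extract_sub_links(html):
    """Return the distinct redirect links found in `html`, in first-seen order.

    Hypothetical helper; the snippet uses list(set(...)), which loses ordering.
    """
    seen = set()
    links = []
    for link in LINK_PATTERN.findall(html):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


def fetch_with_retry(fetch, url, attempts=2):
    """Call fetch(url), trying at most `attempts` times before giving up.

    `fetch` is any callable that takes a URL (an assumption made so the
    helper can be exercised without network access). `attempts` must be
    at least 1.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except OSError as error:
            # urllib.error.URLError and HTTPError are subclasses of OSError.
            last_error = error
    raise last_error


if __name__ == "__main__":
    sample = (
        '<a href="http://www.baidu.com/link?url=abc123">A</a>'
        '<a href="http://www.baidu.com/link?url=abc123">A again</a>'
        '<a href="http://www.baidu.com/link?url=xyz789">B</a>'
    )
    print(extract_sub_links(sample))
```

Passing the fetch function in as a parameter keeps the retry logic independent of any particular HTTP library; in real use it could wrap `urllib.request.urlopen`.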
