
  • Size: 3KB
    File type: .py
    Coins: 1
    Downloads: 0
    Published: 2021-05-10
  • Language: C/C++
  • Tags:

Resource Overview

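This script follows the same approach as the C++ version: given a keyword and a number of pages, it crawls that many Baidu search result pages and collects the result links found on them.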
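
As a rough illustration of the per-page crawling idea, the Python 3 sketch below builds one search URL per requested page. The query string is copied from the snippet further down; the function name `build_search_urls`, the `page_size` parameter, and the treatment of `pn` as a zero-based result offset in steps of 10 are assumptions made for illustration, not something the original file confirms.

```python
from urllib.parse import quote

# Query template taken from the original snippet.
SEARCH_TEMPLATE = "http://www.baidu.com/s?wd=%s&pn=%d&tn=baiduhome_pg&ie=utf-8&usm=4"


def build_search_urls(keyword, pages=1, page_size=10):
    """Return one search URL per requested result page (hypothetical helper)."""
    # quote() percent-encodes the keyword's UTF-8 bytes.
    encoded = quote(keyword)
    # Assumption: pn is a result offset, so page k starts at k * page_size.
    return [SEARCH_TEMPLATE % (encoded, page * page_size) for page in range(pages)]
```

Calling `build_search_urls("hello world", pages=2)` yields two URLs that differ only in their `pn` value.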

Code Snippet and File Information

#!/usr/bin/env python
# coding: utf-8
import sys
import urllib
import urllib2
import re

class FetchUrl:
    """A Baidu crawler that collects the sub-URLs found in result pages."""

    def __init__(self, strKeyword, iPages=1):
        '''Store the search keyword and the number of result pages to crawl.'''
        self.m_strKeyword = strKeyword
        self.m_iPages = iPages

    def GetSubPageUrlList(self, url, comreg):
        '''Fetch the sub-URLs of a page.'''
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print "******Got an HTTPError, trying again******"
            response = urllib2.urlopen(url)
        except urllib2.URLError, e:
            print "******Got a URLError, trying again******"
            response = urllib2.urlopen(url)
        htmlpage = response.read()
        infoList1 = re.findall(comreg, htmlpage)
        # Deduplicate the list before returning it
        return list(set(infoList1))

    def GetUrlList(self):
        '''Get the sub-links from the specified number of result pages.'''
        mainList = []
        reg = r'http://www.baidu.com/link\?url=.[^\"]+'
        comreg = re.compile(reg)
        print "Task keyword: %s" % self.m_strKeyword
        # URL-encode the keyword
        encodeKeyword = urllib.quote(self.m_strKeyword.decode('gbk').encode('utf-8'))
        i = 1
        while i <= self.m_iPages:
            url = 'http://www.baidu.com/s?wd=%s&pn=%d&tn=baiduhome_pg&ie=utf-8&usm=4' % (encodeKeyword, i)
            subList = self
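
The snippet above targets Python 2 (`urllib2`, `print` statements). For readers on Python 3, the following is a minimal sketch of the same fetch-and-extract step, not the author's implementation. The names `extract_links` and `fetch_links`, the `retries` and `timeout` parameters, and the bounded retry loop are assumptions added for illustration. The regular expression is the one from the snippet, with the dots in the host name escaped.

```python
import re
import urllib.error
import urllib.request

# Pattern from the original snippet; dots in the host name are escaped here.
LINK_RE = re.compile(r'http://www\.baidu\.com/link\?url=.[^"]+')


def extract_links(html):
    """Return the unique redirect links in html, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving order,
    # unlike the set() used in the original.
    return list(dict.fromkeys(LINK_RE.findall(html)))


def fetch_links(url, retries=1, timeout=10):
    """Download url and extract its links, retrying a bounded number of times."""
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                html = response.read().decode("utf-8", errors="replace")
            return extract_links(html)
        except urllib.error.URLError:
            # HTTPError is a subclass of URLError, so both are handled here.
            if attempt == retries:
                raise
```

`extract_links` can be tried on any saved result page without network access. `fetch_links` performs a real request, so whether it returns anything useful depends on what the server currently serves.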
