  • Size: 800 B
    File type: .zip
    Coins: 2
    Downloads: 0
    Published: 2021-05-12
  • Language: Python
  • Tags: python, spider

Resource description

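While developing crawlers I kept running into efficiency problems. After consulting various sources, I wrote this project from what I had learned. The demo is built entirely on Python's coroutine model (a generator driven with yield and send), and it is suitable both for your own study and for use in your own projects. Download it if you need it.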
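
The coroutine idea mentioned above can be illustrated with a minimal sketch of the yield/send handshake between a producer and a consumer. The string 'handled ...' is a placeholder invented for this illustration; in the demo the consumer fetches a page instead.

```python
def consumer():
    """Generator-based coroutine: receives an item via send() and yields a result back."""
    result = ''
    while True:
        item = yield result      # pause here until the producer sends an item
        if item:
            result = 'handled %s' % item   # placeholder for real work


def produce(c, items):
    next(c)                      # prime the coroutine: run it up to the first yield
    results = [c.send(item) for item in items]
    c.close()                    # stop the consumer once all items are processed
    return results


print(produce(consumer(), ['a', 'b']))   # prints ['handled a', 'handled b']
```

Each call to send() resumes the consumer, hands it one item, and returns the value of the next yield, so the two functions take turns inside a single thread.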

Code snippet and file information

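from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse

start_url = 'https://www.cnblogs.com'   # page where the crawl begins
trust_host = 'www.cnblogs.com'          # only URLs on this host are followed
ignore_path = []                        # URLs containing any of these substrings are skipped
history_urls = []                       # URLs that have already been visited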


def parse_html(html):
    """Print the page title and return a generator of the href of every link."""
    soup = BeautifulSoup(html, "lxml")
    print(soup.title)
    links = soup.find_all('a', href=True)
    return (a['href'] for a in links if a['href'])


def parse_url(url):
    """Normalize a link; return None if it should not be crawled."""
    url = url.strip()

    # Drop the fragment part, if any.
    if url.find('#') >= 0:
        url = url.split('#')[0]
    if not url:
        return None
    if url.find('javascript:') >= 0:
        return None

    for f in ignore_path:
        if f in url:
            return None
    # Links that do not contain "http" are treated as paths relative to the site root.
    if url.find('http') < 0:
        url = start_url + url
        return url
    parse = urlparse(url)
    if parse.hostname == trust_host:
        return url
    return None


def consumer():
    """Coroutine: receive a URL via send() and yield the fetched page body."""
    html = ''
    while True:
        url = yield html

        if url:
            print('[CONSUMER] Consuming %s...' % url)
            rsp = requests.get(url)
            html = rsp.content


def produce(c):
    next(c)  # prime the coroutine so it is waiting at its first yield

    def do_work(urls):
        for u in urls:
            if u not in history_urls:
                history_urls.append(u)
                print('[PRODUCER] Producing %s...' % u)
                html = c.send(u)
                results = parse_html(html)
                work_urls = (x for x in map(parse_url, results) if x)
                # Recurse into the links found on this page.
                do_work(work_urls)

    do_work([start_url])
    c.close()

if __name__ == '__main__':
    c = consumer()
    produce(c)
    print(len(history_urls))
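
The link filtering and traversal in the listing above could also be sketched with the standard library's urljoin and urldefrag and an explicit queue. This is an illustrative variant, not the author's code: normalize, crawl, the fetch_links callback, and the limit parameter are hypothetical names introduced here, and fetch_links stands in for the download-and-parse step so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(link, base_url, trust_host):
    """Resolve link against base_url; return None if it should be skipped."""
    link = link.strip()
    if not link or link.lower().startswith('javascript:'):
        return None
    # urljoin handles relative paths; urldefrag strips the "#fragment" part.
    absolute, _fragment = urldefrag(urljoin(base_url, link))
    if urlparse(absolute).hostname != trust_host:
        return None
    return absolute


def crawl(start_url, trust_host, fetch_links, limit=100):
    """Breadth-first traversal; fetch_links(url) returns the hrefs found on url."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            target = normalize(link, url, trust_host)
            if target and target not in seen:
                if len(seen) >= limit:   # stop once the page budget is used up
                    return seen
                seen.add(target)
                queue.append(target)
    return seen
```

A set makes the "already visited" check constant-time, and the queue replaces recursion, so deep link chains do not run into Python's recursion limit.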

  Attribute      Size        Date  Time  Name
----------- ---------  ---------- -----  ----
       File      1588  2019-10-21 14:10  python_spider.py
