《用Python写网络爬虫》读书笔记

发表于 2017-01-16 更新于 2020-11-03 分类于技术阅读次数：本文字数： 2.6k 阅读时长 ≈ 10 分钟

学习《用Python写网络爬虫》，为了加深记忆，边读边做笔记。如有侵权，立即删除。

网络爬虫简介

背景调研

识别网站所用技术

使用builtwith模块，可以检查网站构建的技术类型。安装方法如下：

1	pip install builtwith

该模块将URL作为参数，下载该URL并对其进行分析，然后返回该网站使用的技术。下面是使用该模块的一个例子。

1 2	>>> import builtwith >>> builtwith.parse('http://example.webscraping.com')

### 寻找网站的所有者我们可以使用WHOIS协议查询域名的注册者是是谁。Python中有一个针对该协议的封装库，通过pig安装：

1	pip install python-whois

使用该模块对appspot.com这个域名进行WHOIS查询。

1 2	>>> import whois >>> print whois.whois('appspot.com')

## 编写第一个网络爬虫 ### 下载网页要想爬取网页，我们首先要将其下载下来。使用Python的urllib2模块下载URL。当下载网页时，我们可能会遇到一些无法控制的错误，比如请求的页面可能不存在。此时，urllib2会抛出异常，然后退出脚本。安全起见，可以捕获这些异常。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:',e.reason
        html = None
    return html

现在，当出现下载错误时，该函数能够捕获到异常，然后返回None。

重试下载

下载时遇到的错误经常是临时性的，比如服务器过载时返回的503 Service Unavailable错误。对于此类错误，我们可以尝试重新下载，因为这个服务器问题现在可能已经解决。不过，我们不需要对所有错误都尝试重新下载。如果服务器返回的是404 Not Found这种错误，则说明该网页目前并不存在，再次尝试同样的请求一般也不会出现不同的结果。所以，我们只需要确保download函数在发生5xx错误时重试下载即可。下面是支持重试下载功能的新版本代码。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html

现在，当download函数遇到5xx错误码时，将会递归调用函数自身进行重试。此外，该函数还增加了一个参数，用于设定重试下载的次数，其默认值为两次。我们在这里限制网页下载的尝试次数，是因为服务器错误可能暂时还没有解决。可以通过下载http://httpstat.us/500 ，该网址始终返回500错误，测试该函数。

1	>>> download('http://httpstat.us/500')

设置用户代理

下面的代码对download函数进行了修改，设定了一个默认的用户代理“wswp”（即Web Scraping with Python的首字母缩写）

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2


def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1)
    return html


download('http://www.meetup.com/')

### 网站地图爬虫在第一个简单的爬虫中，我们将使用实例网站robots.txt文件中发现的网站地图来下载所有网页。为了解析网站地图，我们将会使用一个简单的正则表达式，从标签中提取出URL。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2
import re

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:',e.reason
        html = None
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>',sitemap)
    # download each link
    for link in links:
        html = download(link)


if __name__ == '__main__':
    crawl_sitemap('http://example.webscraping.com/sitemap.xml')

ID遍历爬虫

分析实例url - http://example.webscraping.com/view/Afghanistan-1 - http://example.webscraping.com/view/Aland-Islands-2 - http://example.webscraping.com/view/Albania-3

可以看出，这些URL只在结尾处有所区别，包括国家名（作为页面别名）和ID。在URL中包含页面别名是非常普遍的做法，可以对搜索引擎优化起到帮助作用。一般情况下，Web服务器会忽略这个字符串，只使用ID来匹配数据库中的相关记录。下面我们将别名移除，加载http://example.webscraping.com/view/1 ，测试连接仍然可用。因此，我们可以忽略页面别名，只遍历ID来下载所有国家的页面。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2
import itertools

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:',e.reason
        html = None
    return html

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        break
    else:
        # success - can scrape the result
        pass

在这段代码中，我们对ID进行遍历，直到出现下载错误时停止，我们假设此时已到达最后一个国家的页面。不过，这种实现方式存在一个缺陷，那就是某些记录可能已被删除，数据库ID之间并不是连续的。此时，只要访问到某个间隔点，爬虫就会立即退出。下面是这段代码的改进版本，在该版本中连续发生多次下载错误后才会退出程序。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import urllib2
import itertools

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:',e.reason
        html = None
    return html

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this webpage
        num_errors += 1
        if num_errors == max_errors:
            # reached maximum number of
            # consecutive errors so exit
            break
    else:
        # success - can scrape the result
        # ...
        num_errors = 0

上面代码中实现的爬虫需要连续5次下载错误才会停止遍历，这样就很大程度上降低了遇到被删除记录时过早停止遍历的风险。 ### 链接爬虫如果想要爬取的是国家列表索引页和国家页面。其中，索引页链接格式如下： - http://example.webscraping.com/index/1 - http://example.webscraping.com/index/2

国家页链接格式如下： - http://example.webscraping.com/view/Afghanistan-1 - http://example.webscraping.com/view/Aland-Islands-2

因此，我们可以用/(index|view)这个简单的正则表达式来匹配这两类网页。可以使用如下代码：

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re
import urllib2
import urlparse

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html


def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


link_crawler('http://example.webscraping.com', '/(index|view)')

你会发现我们会得到如下的下载错误：

ssh://root@192.168.1.122:22/usr/bin/python -u /root/pyFile/test.py
Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
  ...
ValueError: unknown url type: /index/1

问题出在下载/index/1时，该链接只有网页的路径部分，而没有协议和服务器部分，也就是说这个一个相对链接。由于浏览器知道你正在浏览哪个网页，所以在浏览器浏览时，相对链接是能够正常工作的。但是urllib2是无法获知上下文的。为了让urllib2能够定位网页，我们需要将链接转换为绝对链接的形式，以便包含定位网页所有的细节。可以使用urlparse模块。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re
import urllib2
import urlparse

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html


def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urlparse.urljoin(seed_url,link)
                crawl_queue.append(link)


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


link_crawler('http://example.webscraping.com', '/(index|view)')

当你运行这段代码时，会发现虽然网页下载没有出现错误，但是同样的地点总是会被不断下载到。这是因为这些站点相互之间存在链接。比如，澳大利亚链接到了南极洲，而南极洲也存在到澳大利亚的链接，此时爬虫就会在它们之间不断循环下去。要想避免重复爬取相同的链接，我们需要记录哪些链接已经被爬取过。下面是修改后的link_crawler函数，已具备存储已发现URL的功能，可以避免重复下载。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re
import urllib2
import urlparse

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html


def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    # keep track which URL's have seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # from absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


link_crawler('http://example.webscraping.com', '/(index|view)')

当运行该脚本时，它会爬取所有地点，并且能够如期停止。 #### 高级功能 解析robot.txt 首先，我们需要解析robots.txt文件，以避免下载禁止爬取的URL。使用Python自带的robotparser模块，就可以轻松完成这项工作。

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()
url = 'http://example.webscraping.com'
user_agent = 'BadCrawler'
print rp.can_fetch(user_agent, url)
user_agent = 'GoodCrawler'
print rp.can_fetch(user_agent, url)

当用户代理设置为’BadCrawler’时，robotparser模块会返回结果表明无法获取网页。 #### 支持代理下面是集成了代理功能的新版download函数。

def download(url, user_agent='wswp', proxy=None, num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <- e.code < 600:
                # retry 5XX HTTP errors
                html = download(url, user_agent, proxy, num_retries-1)
    return html

#### 下载限速如果我们爬取网站的速度过快，就会面临被封禁或是造成服务器过载的风险。为了降低这些风险，我们可以在两次下载之间添加延时，从而对爬虫限速。