Python 爬取豆瓣图书评论


代码来自中国大学慕课 用 Python 玩转数据。添加了详细的注释,修改了计算平均分的循环条件。

time.sleep(5) 一定要加,不然会被豆瓣检测到异常流量而屏蔽 get 请求。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import requests,re,time
from bs4 import BeautifulSoup

count = 0 # 爬取 50 条评论循环
i = 0 # 控制评论页数
sum = 0 # 50 条评论分数总和
count_s = 0 # 爬取 50 条评论分数循环
while(count < 50):
try:
r = requests.get('https://book.douban.com/subject/27599965/comments/hot?p=' + str(i+1))
except Exception as err:
print(err)
break
soup = BeautifulSoup(r.text,'lxml') # html 解析
comments=soup.find_all('p','comment-content')
for item in comments:
count = count + 1
if count > 50:
break
print(count,item.string) # 打印输出每条评论
pattern = re.compile('<span class="user-stars allstar(.*?) rating"') # 构建获取分数的正则表达式
p = re.findall(pattern, r.text)
for star in p:
count_s = count_s + 1
sum += int(star)
time.sleep(5) # delay request from douban's robots.txt
i += 1
print("count: %d"%count)
print("count_s: %d"%count_s)
if count == 51:
print(sum/count_s)

参考文档:
用 Python 玩转数据