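The code comes from the China University MOOC (中国大学慕课) course 用 Python 玩转数据. I added detailed comments and changed the condition used when computing the average score.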
`time.sleep(5)` must be included; otherwise Douban detects abnormal traffic and blocks the GET requests.
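As a rough sketch of the throttling idea, the helper below pauses between consecutive requests. The names `fetch_politely`, `fetch`, and the injectable `sleep` parameter are hypothetical and not part of the course code; `fetch` stands in for whatever function downloads one page. Unlike the script in this post, which sleeps after every page, this sketch sleeps only between pages.

```python
import time


def fetch_politely(fetch, pages, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing between consecutive calls.

    All names here are hypothetical; `fetch` could wrap requests.get
    for a single page of comments.
    """
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(page))
    return results


# Demo with a fake fetch and a recording "sleep", so nothing touches the network.
pauses = []
pages_fetched = fetch_politely(lambda p: "page %d" % p, [1, 2, 3], sleep=pauses.append)
print(pages_fetched)
print(pauses)
```

Passing `sleep` as a parameter is only a convenience for trying the helper without actually waiting; in real use the default `time.sleep` applies.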
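```python
import re
import time

import requests
from bs4 import BeautifulSoup

count = 0      # number of comments printed so far
i = 0          # zero-based page index
star_sum = 0   # sum of the rating values (renamed from `sum` to avoid shadowing the built-in)
count_s = 0    # number of ratings found

# Captures the numeric suffix that follows "allstar" in the rating span's class.
pattern = re.compile('<span class="user-stars allstar(.*?) rating"')

while count < 50:
    try:
        r = requests.get('https://book.douban.com/subject/27599965/comments/hot?p=' + str(i + 1))
    except Exception as err:
        print(err)
        break
    soup = BeautifulSoup(r.text, 'lxml')
    comments = soup.find_all('p', 'comment-content')
    if not comments:
        # No comments found on this page, so stop rather than loop forever.
        break
    for item in comments:
        if count >= 50:
            break
        count += 1
        print(count, item.string)
    # Accumulate every rating that appears on this page.
    for star in pattern.findall(r.text):
        count_s += 1
        star_sum += int(star)
    time.sleep(5)  # pause between requests so Douban does not flag the traffic
    i += 1

print("count: %d" % count)
print("count_s: %d" % count_s)
# Print the average only when at least one rating was found (avoids division by zero).
if count_s > 0:
    print(star_sum / count_s)
```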
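To see what the rating regex captures without hitting the network, here is a minimal sketch that runs the same pattern on a made-up HTML fragment. The fragment and the helper name `average_rating` are assumptions for illustration only; real Douban markup may differ.

```python
import re

# Made-up markup that only mimics the class names the regex looks for.
sample_html = '''
<span class="user-stars allstar50 rating" title="a"></span>
<span class="user-stars allstar40 rating" title="b"></span>
<span class="user-stars allstar30 rating" title="c"></span>
'''

pattern = re.compile('<span class="user-stars allstar(.*?) rating"')


def average_rating(html):
    """Return the mean of the captured allstar values, or None if there are none."""
    stars = [int(s) for s in pattern.findall(html)]
    if not stars:
        return None
    return sum(stars) / len(stars)


avg = average_rating(sample_html)
print(avg)  # 40.0
```

The result is on the same scale as the captured values; if they run from 10 to 50, as this sample assumes, dividing by 10 would give a star count.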
Reference:
用 Python 玩转数据