Goal
Yesterday I finished scraping the list pages; today I'll move on to the detail pages. These are much more valuable, since they hold most of the useful information about each film.
Steps
1. Get the detail-page URL
2. Fetch the page and parse it
3. Save the data to the database
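The three steps can be sketched as a tiny pipeline. Note that `get_url`, `parse_page`, and `save_detail` here are dummy in-memory stand-ins, not the database- and network-backed versions in the full script below:

```python
# Minimal sketch of the crawl loop; the three functions are placeholders
# (the real ones query MySQL and download from movie.douban.com).
def get_url(rank):
    # Step 1: look up the detail-page URL for this rank (fake mapping here)
    return "https://movie.douban.com/subject/%d/" % (1000000 + rank)

def parse_page(url):
    # Step 2: download and parse the page (returns a dummy record here)
    return {"url": url, "director": "?"}

store = {}
def save_detail(rank, detail):
    # Step 3: persist the parsed record (an in-memory dict here)
    store[rank] = detail
    return "insert rank %d ok" % rank

for rank in (1, 2, 3):
    print(save_detail(rank, parse_page(get_url(rank))))
```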
Third-party libraries
requests: http://cn.python-requests.org/zh_CN/latest/
BeautifulSoup4: https://www.crummy.com/software/BeautifulSoup/
pymysql (pip install pymysql)
Basic template (the #info block of a detail page)
<div id="info"> <span><span class="pl">导演</span>: <span class="attrs"><a href="/celebrity/1047973/" rel="v:directedBy">弗兰克·德拉邦特</a></span></span><br> <span><span class="pl">编剧</span>: <span class="attrs"><a href="/celebrity/1047973/">弗兰克·德拉邦特</a> / <a href="/celebrity/1049547/">斯蒂芬·金</a></span></span><br> <span class="actor"><span class="pl">主演</span>: <span class="attrs"><span><a href="/celebrity/1054521/" rel="v:starring">蒂姆·罗宾斯</a> / </span><span><a href="/celebrity/1054534/" rel="v:starring">摩根·弗里曼</a> / </span><span><a href="/celebrity/1041179/" rel="v:starring">鲍勃·冈顿</a> / </span><span><a href="/celebrity/1000095/" rel="v:starring">威廉姆·赛德勒</a> / </span><span><a href="/celebrity/1013817/" rel="v:starring">克兰西·布朗</a> / </span><span style="display: none;"><a href="/celebrity/1010612/" rel="v:starring">吉尔·贝罗斯</a> / </span><span style="display: none;"><a href="/celebrity/1054892/" rel="v:starring">马克·罗斯顿</a> / </span><span style="display: none;"><a href="/celebrity/1027897/" rel="v:starring">詹姆斯·惠特摩</a> / </span><span style="display: none;"><a href="/celebrity/1087302/" rel="v:starring">杰弗里·德曼</a> / </span><span style="display: none;"><a href="/celebrity/1074035/" rel="v:starring">拉里·布兰登伯格</a> / </span><span style="display: none;"><a href="/celebrity/1099030/" rel="v:starring">尼尔·吉恩托利</a> / </span><span style="display: none;"><a href="/celebrity/1343305/" rel="v:starring">布赖恩·利比</a> / </span><span style="display: none;"><a href="/celebrity/1048222/" rel="v:starring">大卫·普罗瓦尔</a> / </span><span style="display: none;"><a href="/celebrity/1343306/" rel="v:starring">约瑟夫·劳格诺</a> / </span><span style="display: none;"><a href="/celebrity/1315528/" rel="v:starring">祖德·塞克利拉</a></span><a href="javascript:;" class="more-actor" title="更多主演">更多...</a></span></span><br> <span class="pl">类型:</span> <span property="v:genre">剧情</span> / <span property="v:genre">犯罪</span><br> <span class="pl">制片国家/地区:</span> 美国<br> <span class="pl">语言:</span> 英语<br> <span class="pl">上映日期:</span> 
<span property="v:initialReleaseDate" content="1994-09-10(多伦多电影节)">1994-09-10(多伦多电影节)</span> / <span property="v:initialReleaseDate" content="1994-10-14(美国)">1994-10-14(美国)</span><br> <span class="pl">片长:</span> <span property="v:runtime" content="142">142 分钟</span><br> <span class="pl">又名:</span> 月黑高飞(港) / 刺激1995(台) / 地狱诺言 / 铁窗岁月 / 消香克的救赎<br> <span class="pl">IMDb链接:</span> <a href="http://www.imdb.com/title/tt0111161" target="_blank" rel="nofollow">tt0111161</a><br> </div>
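Each field in this block sits on its own line of text, with the label in front, so the labels themselves can key a dict instead of relying on fixed line positions. Here is a sketch using only the standard library (the actual script below uses BeautifulSoup instead); the sample fragment is trimmed down from the template above:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, turning <br> into newlines."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)

def parse_info(fragment):
    """Map each 'label: value' line of the #info block to a dict entry."""
    p = TextExtractor()
    p.feed(fragment)
    fields = {}
    for line in "".join(p.chunks).splitlines():
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields

# Trimmed-down sample of the real #info block:
info = ('<div id="info"><span class="pl">导演</span>: '
        '<a href="/celebrity/1047973/">弗兰克·德拉邦特</a><br>'
        '<span class="pl">类型:</span> <span property="v:genre">剧情</span>'
        ' / <span property="v:genre">犯罪</span><br>'
        '<span class="pl">语言:</span> 英语<br></div>')
print(parse_info(info))
```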
Code
""" @author: jtusta @license: MIT Licence @contact: root@jtahstu.com @site: www.jtahstu.com @software: PyCharm Community Edition @file: getMoviesDetail.py @time: 2017/01/09 14:16 """ import requests from bs4 import BeautifulSoup import pymysql import pymysql.cursors import time def db(type, rank, detail=[]): res = "" connection = pymysql.connect(host='localhost', user='root', password='pass', db='test', port=3306, charset='utf8') cursor = connection.cursor() if type == 1: sql = 'select detail_url from douban_movie_top250 where rank=%s' cursor.execute(sql, (str(rank))); res = cursor.fetchone()[0] elif type == 2: sql = 'insert into douban_movie_top250_details(rank,dirsctor, screenwriter, starring, type, location, language, rel_time, len, other_names, imdb, introduce)' \ 'values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)' count = cursor.execute(sql, ( rank, detail[0], detail[1], detail[2], detail[3], detail[4], detail[5], detail[6], detail[7], detail[8], detail[9], detail[10])) if count: res = 'insert rank ' + str(rank) + ' ok' else: res = 'insert rank ' + str(rank) + ' fail' connection.commit() cursor.close() connection.close() return res def getUrl(rank): url = db(1, rank) return url def saveDetail(rank, detail): res = db(2, rank, detail) return res def getDetail(url): headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' , 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0' , 'Host': 'movie.douban.com' , 'Referer': 'https://movie.douban.com/top250' , 'Upgrade-Insecure-Requests': '1' , 'Cache-Control': 'max-age=0' , 'Connection': 'keep-alive' , 'Accept-Encoding': 'gzip, deflate, br' , 'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3'} html = requests.get(url, headers=headers) soup = BeautifulSoup(html.text, "html.parser") info = soup.select("#info") infos = info[0].text.split("\n") director = infos[1].replace("导演:", "").strip() screenwriter = infos[2].replace("编剧:", "").strip() starring = 
infos[3].replace("主演:", "").strip() type = infos[4].replace("类型:", "").strip() location = infos[5].replace("制片国家/地区:", "").strip() language = infos[6].replace("语言:", "").strip() reltime = infos[7].replace("上映日期:", "").strip() len = infos[8].replace("片长:", "").strip() othernames = infos[9].replace("又名:", "").strip() imdb = infos[10].replace("IMDb链接:", "").strip() introduceAll = soup.select(".all") if introduceAll: introduce = introduceAll[0].text.strip() else: introduce = soup.select("#link-report") introduce = introduce[0].text.replace("©豆瓣", "").strip() list = [director, screenwriter, starring, type, location, language, reltime, len, othernames, imdb, introduce] return list list = [22, 30] # 这里是分阶段抓取的,改成(1,251)即可 for i in range(100, 251): # for i in list: print("正在抓取rank %d" % i) try: url = getUrl(i) detail = getDetail(url) res = saveDetail(i, detail) print(res) time.sleep(1) except: list.append(i) print('rank ' + str(i) + ' 抓取异常') continue print("抓取完毕,以下rank抓取失败:") print(list) # [22, 30, 110, 131, 237] # 这些出错的都是404错误,实际上页面是存在的,这5个就不弄了实际上今天用的解析方法比较偷懒,代码中你就可以看的出来
Compared with yesterday's script, the only real change is that the logic has been lightly wrapped into a few functions.
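Since the five 404s were intermittent (the pages open fine in a browser), a simple retry with a delay would likely have recovered most of them. A sketch of that idea; `fetch_with_retry` and the `flaky` stub are hypothetical helpers, not part of the script above:

```python
import time

def fetch_with_retry(fetch, url, tries=3, delay=2.0):
    # Retry a flaky fetch a few times before giving up. `fetch` is any
    # callable that returns the page body or raises on failure (e.g. a
    # wrapper around requests.get that calls raise_for_status()).
    last_err = None
    for attempt in range(tries):
        try:
            return fetch(url)
        except Exception as err:
            last_err = err
            time.sleep(delay * (attempt + 1))  # back off a bit more each time
    raise last_err

# Stub standing in for a real download: fails twice, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("HTTP 404")
    return "<html>ok</html>"

print(fetch_with_retry(flaky, "https://movie.douban.com/subject/1292052/", delay=0.01))
```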
Execution log
...snip...
fetching rank 200
insert rank 200 ok
fetching rank 201
insert rank 201 ok
fetching rank 202
insert rank 202 ok
fetching rank 203
insert rank 203 ok
fetching rank 204
insert rank 204 ok
fetching rank 205
insert rank 205 ok
...snip...
fetching rank 230
insert rank 230 ok
fetching rank 231
insert rank 231 ok
fetching rank 232
insert rank 232 ok
fetching rank 233
insert rank 233 ok
fetching rank 234
insert rank 234 ok
fetching rank 235
insert rank 235 ok
fetching rank 236
insert rank 236 ok
fetching rank 237
rank 237 failed
fetching rank 238
insert rank 238 ok
fetching rank 239
insert rank 239 ok
...snip...
fetching rank 250
insert rank 250 ok
done; these ranks failed:
[22, 30, 110, 131, 237]
Database schema

```sql
CREATE TABLE `douban_movie_top250_details` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `rank` int(11) DEFAULT NULL COMMENT 'rank',
  `dirsctor` varchar(255) DEFAULT NULL COMMENT 'director',
  `screenwriter` varchar(255) DEFAULT NULL COMMENT 'screenwriter',
  `starring` text COMMENT 'starring',
  `type` varchar(255) DEFAULT NULL COMMENT 'genre',
  `location` varchar(255) DEFAULT NULL COMMENT 'country/region of production',
  `language` varchar(255) DEFAULT NULL COMMENT 'language',
  `rel_time` varchar(255) DEFAULT NULL COMMENT 'release date',
  `len` varchar(255) DEFAULT NULL COMMENT 'runtime',
  `other_names` varchar(255) DEFAULT NULL COMMENT 'alternate titles',
  `imdb` varchar(255) DEFAULT NULL,
  `introduce` text COMMENT 'synopsis',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
```
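To sanity-check that the INSERT's column list lines up with this schema, the table can be mirrored in SQLite (types simplified; the misspelled `dirsctor` column is kept as-is so it matches both the script and the MySQL table; all values below are placeholders, not scraped data):

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE douban_movie_top250_details (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        rank INTEGER, dirsctor TEXT, screenwriter TEXT, starring TEXT,
        type TEXT, location TEXT, language TEXT, rel_time TEXT,
        len TEXT, other_names TEXT, imdb TEXT, introduce TEXT
    )
""")

# Placeholder record in the same order the script builds its list.
detail = ["Frank Darabont", "Frank Darabont / Stephen King", "Tim Robbins / ...",
          "Drama / Crime", "USA", "English", "1994-09-10", "142 min",
          "The Shawshank Redemption", "tt0111161", "..."]
conn.execute(
    "INSERT INTO douban_movie_top250_details"
    "(rank, dirsctor, screenwriter, starring, type, location, language,"
    " rel_time, len, other_names, imdb, introduce)"
    " VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
    [1] + detail)
print(conn.execute(
    "SELECT rank, dirsctor, imdb FROM douban_movie_top250_details").fetchone())
```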
Some SQL I ran

```sql
-- Align id with rank, then insert placeholder rows for the five failed ranks:
update douban_movie_top250_details set id=rank where rank!=id;
insert into douban_movie_top250_details(rank) values(22);
insert into douban_movie_top250_details(rank) values(30);
insert into douban_movie_top250_details(rank) values(110);
insert into douban_movie_top250_details(rank) values(131);
insert into douban_movie_top250_details(rank) values(237);
```
Database screenshot

Full table data download: douban_movie_top250_details.sql
Summary
I spent this morning trying to install Scrapy, the Python crawler framework. After installing a pile of dependencies (no fewer than ten libraries), Scrapy was basically usable; at least **scrapy startproject douban_movie_top250** would generate a project. But it wouldn't actually run: it kept complaining that win32api was missing, and after a lot of fiddling I still couldn't get that installed. I ran out of patience and gave up on Scrapy for now.
For pages generated by JavaScript, libraries such as Selenium and PhantomJS can drive a real (or headless) browser. I haven't looked into the details yet; I'll study them when the need comes up.
OK, that wraps up the detail pages. After the short comments, the movie Q&A pages, and the full reviews, the series will be done. Stay tuned.
---
This article is licensed under the Creative Commons Attribution 2.5 China Mainland License. If you repost it, you must credit the author and link back to this article.
---