jtahstu's Blog

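Python Crawler: Scraping the Douban Movie Top 250 Detail Pages

Goal

Yesterday I finished scraping the list pages; today I move on to the detail pages. These pages are more valuable because they carry a lot of useful information.

Steps

 1. Get each detail page's link

 2. Fetch the page and parse it

 3. Save the result to the database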

Third-party libraries

requests http://cn.python-requests.org/zh_CN/latest/

BeautifulSoup4 https://www.crummy.com/software/BeautifulSoup/

pymysql (pip install pymysql)

Basic template of the detail page

<div id="info">
        <span><span class="pl">导演</span>: <span class="attrs"><a href="/celebrity/1047973/" rel="v:directedBy">弗兰克·德拉邦特</a></span></span><br>
        <span><span class="pl">编剧</span>: <span class="attrs"><a href="/celebrity/1047973/">弗兰克·德拉邦特</a> / <a href="/celebrity/1049547/">斯蒂芬·金</a></span></span><br>
        <span class="actor"><span class="pl">主演</span>: <span class="attrs"><span><a href="/celebrity/1054521/" rel="v:starring">蒂姆·罗宾斯</a> / </span><span><a href="/celebrity/1054534/" rel="v:starring">摩根·弗里曼</a> / </span><span><a href="/celebrity/1041179/" rel="v:starring">鲍勃·冈顿</a> / </span><span><a href="/celebrity/1000095/" rel="v:starring">威廉姆·赛德勒</a> / </span><span><a href="/celebrity/1013817/" rel="v:starring">克兰西·布朗</a> / </span><span style="display: none;"><a href="/celebrity/1010612/" rel="v:starring">吉尔·贝罗斯</a> / </span><span style="display: none;"><a href="/celebrity/1054892/" rel="v:starring">马克·罗斯顿</a> / </span><span style="display: none;"><a href="/celebrity/1027897/" rel="v:starring">詹姆斯·惠特摩</a> / </span><span style="display: none;"><a href="/celebrity/1087302/" rel="v:starring">杰弗里·德曼</a> / </span><span style="display: none;"><a href="/celebrity/1074035/" rel="v:starring">拉里·布兰登伯格</a> / </span><span style="display: none;"><a href="/celebrity/1099030/" rel="v:starring">尼尔·吉恩托利</a> / </span><span style="display: none;"><a href="/celebrity/1343305/" rel="v:starring">布赖恩·利比</a> / </span><span style="display: none;"><a href="/celebrity/1048222/" rel="v:starring">大卫·普罗瓦尔</a> / </span><span style="display: none;"><a href="/celebrity/1343306/" rel="v:starring">约瑟夫·劳格诺</a> / </span><span style="display: none;"><a href="/celebrity/1315528/" rel="v:starring">祖德·塞克利拉</a></span><a href="javascript:;" class="more-actor" title="更多主演">更多...</a></span></span><br>
        <span class="pl">类型:</span> <span property="v:genre">剧情</span> / <span property="v:genre">犯罪</span><br>
        
        <span class="pl">制片国家/地区:</span> 美国<br>
        <span class="pl">语言:</span> 英语<br>
        <span class="pl">上映日期:</span> <span property="v:initialReleaseDate" content="1994-09-10(多伦多电影节)">1994-09-10(多伦多电影节)</span> / <span property="v:initialReleaseDate" content="1994-10-14(美国)">1994-10-14(美国)</span><br>
        <span class="pl">片长:</span> <span property="v:runtime" content="142">142 分钟</span><br>
        <span class="pl">又名:</span> 月黑高飞(港) / 刺激1995(台) / 地狱诺言 / 铁窗岁月 / 消香克的救赎<br>
        <span class="pl">IMDb链接:</span> <a href="http://www.imdb.com/title/tt0111161" target="_blank" rel="nofollow">tt0111161</a><br>

</div>
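Several fields in the block above carry a `property` attribute (`v:genre`, `v:runtime`, `v:initialReleaseDate`), which makes them easy to pick out precisely. As a rough sketch only, using the standard library's `html.parser` instead of the BeautifulSoup calls used in this post, the class below (`PropertyCollector` is a hypothetical name, not part of the original script) gathers the text of every element that has a `property` attribute. It assumes those elements contain plain text with no nested tags.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect the text of every element that carries a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}      # property name -> list of text values
        self._current = None  # property of the element being read, if any

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop
            self.values.setdefault(prop, []).append("")

    def handle_data(self, data):
        if self._current:
            self.values[self._current][-1] += data

    def handle_endtag(self, tag):
        # Assumption: property elements hold plain text, so any end tag closes them
        self._current = None


# Two lines taken from the #info block shown above
snippet = (
    '<span class="pl">类型:</span> <span property="v:genre">剧情</span> / '
    '<span property="v:genre">犯罪</span><br>'
    '<span class="pl">片长:</span> '
    '<span property="v:runtime" content="142">142 分钟</span><br>'
)

collector = PropertyCollector()
collector.feed(snippet)
print(collector.values)
# {'v:genre': ['剧情', '犯罪'], 'v:runtime': ['142 分钟']}
```

With BeautifulSoup the same idea would be a CSS attribute selector such as `soup.select('span[property="v:genre"]')`; the standard-library version is used here only to keep the sketch dependency-free.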

  Code

"""
@author: jtusta
@license: MIT Licence 
@contact: root@jtahstu.com
@site: www.jtahstu.com
@software: PyCharm Community Edition
@file: getMoviesDetail.py
@time: 2017/01/09 14:16
"""
import requests
from bs4 import BeautifulSoup
import pymysql
import pymysql.cursors
import time

def db(type, rank, detail=None):
    # type 1: look up the detail-page URL for a rank
    # type 2: insert one row of scraped details for a rank
    res = ""
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='pass',
                                 db='test',
                                 port=3306,
                                 charset='utf8')
    try:
        cursor = connection.cursor()
        if type == 1:
            # `rank` is backquoted because it is a reserved word in MySQL 8.0 and later
            sql = 'select detail_url from douban_movie_top250 where `rank`=%s'
            cursor.execute(sql, (rank,))
            res = cursor.fetchone()[0]
        elif type == 2:
            # The column name dirsctor is spelled the same way as in the table definition
            sql = 'insert into douban_movie_top250_details(`rank`,dirsctor, screenwriter, starring, type, location, language, rel_time, len, other_names, imdb, introduce)' \
                  'values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
            # detail holds the 11 values returned by getDetail, in column order
            count = cursor.execute(sql, (rank, *detail))
            if count:
                res = 'insert rank ' + str(rank) + ' ok'
            else:
                res = 'insert rank ' + str(rank) + ' fail'
        connection.commit()
        cursor.close()
    finally:
        # Close the connection even if a query raised
        connection.close()
    return res

def getUrl(rank):
    url = db(1, rank)
    return url

def saveDetail(rank, detail):
    res = db(2, rank, detail)
    return res

def getDetail(url):
    # Browser-like request headers. 'br' is left out of Accept-Encoding because
    # requests can only decode Brotli responses when an optional Brotli package is installed.
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
        'Host': 'movie.douban.com',
        'Referer': 'https://movie.douban.com/top250',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    }
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, "html.parser")
    info = soup.select("#info")
    # The text of #info has one "label: value" pair per line. Fields are picked by
    # line position, so this assumes every page lists the same fields in the same order.
    infos = info[0].text.split("\n")
    director = infos[1].replace("导演:", "").strip()
    screenwriter = infos[2].replace("编剧:", "").strip()
    starring = infos[3].replace("主演:", "").strip()
    genre = infos[4].replace("类型:", "").strip()
    location = infos[5].replace("制片国家/地区:", "").strip()
    language = infos[6].replace("语言:", "").strip()
    reltime = infos[7].replace("上映日期:", "").strip()
    length = infos[8].replace("片长:", "").strip()
    othernames = infos[9].replace("又名:", "").strip()
    imdb = infos[10].replace("IMDb链接:", "").strip()

    # Prefer the expanded synopsis (.all) when the page has one, else fall back to #link-report
    introduceAll = soup.select(".all")
    if introduceAll:
        introduce = introduceAll[0].text.strip()
    else:
        introduce = soup.select("#link-report")
        introduce = introduce[0].text.replace("©豆瓣", "").strip()
    # The order matches the column order of the insert statement in db()
    return [director, screenwriter, starring, genre, location, language, reltime, length, othernames, imdb, introduce]

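# Failed ranks, seeded with 22 and 30, which lie below the range scraped in this stage
failed = [22, 30]
# Scraping was done in stages; change the range to (1, 251) to fetch everything in one run
for i in range(100, 251):
    # for i in failed:
    print("正在抓取rank %d" % i)  # "scraping rank N"
    try:
        url = getUrl(i)
        detail = getDetail(url)
        res = saveDetail(i, detail)
        print(res)
        time.sleep(1)
    except Exception:
        # Record the failed rank and move on to the next one
        failed.append(i)
        print('rank ' + str(i) + ' 抓取异常')  # "scrape failed"
        continue

print("抓取完毕,以下rank抓取失败:")  # "done; the following ranks failed:"
print(failed)
# [22, 30, 110, 131, 237]
# All of these failed with 404 errors even though the pages actually exist; I left these 5 alone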
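Admittedly, the parsing approach I used today is rather lazy, as you can tell from the code.
Compared with yesterday's version, it just wraps things in a few small functions.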
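
Because getDetail picks fields by line position, a page with an extra or missing line in its #info block would shift every later field. A slightly less lazy alternative is to key on the label instead. The sketch below is not the code I ran: `LABELS`, `LINE_RE` and `parse_info_text` are hypothetical names, and the field names simply reuse the database columns. It takes the text of the #info block (what `info[0].text` returns) and accepts either an ASCII or a full-width colon after the label.

```python
import re

# Hypothetical mapping from the labels on the page to field names
LABELS = {
    "导演": "director",
    "编剧": "screenwriter",
    "主演": "starring",
    "类型": "type",
    "制片国家/地区": "location",
    "语言": "language",
    "上映日期": "rel_time",
    "片长": "len",
    "又名": "other_names",
    "IMDb链接": "imdb",
}

# "label: value", split at the first ASCII or full-width colon
LINE_RE = re.compile(r"^\s*([^::]+?)\s*[::]\s*(.*?)\s*$")


def parse_info_text(text):
    """Return a dict of field name -> value; fields not found stay empty."""
    fields = {name: "" for name in LABELS.values()}
    for line in text.split("\n"):
        match = LINE_RE.match(line)
        if match and match.group(1) in LABELS:
            fields[LABELS[match.group(1)]] = match.group(2)
    return fields


sample = (
    "\n        导演: 弗兰克·德拉邦特"
    "\n        类型: 剧情 / 犯罪"
    "\n        "
    "\n        语言: 英语"
    "\n        IMDb链接: tt0111161\n"
)
fields = parse_info_text(sample)
print(fields["director"], "|", fields["type"], "|", fields["imdb"])
```

Lines with an unknown label are ignored, and a missing label just leaves its field empty instead of shifting the others.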

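Execution output

The log lines are the script's literal output: 正在抓取rank N means "scraping rank N", 抓取异常 means "scrape failed", and the last line introduces the list of ranks that failed.

...omitted...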
正在抓取rank 200
insert rank 200 ok
正在抓取rank 201
insert rank 201 ok
正在抓取rank 202
insert rank 202 ok
正在抓取rank 203
insert rank 203 ok
正在抓取rank 204
insert rank 204 ok
正在抓取rank 205
insert rank 205 ok
...omitted...
正在抓取rank 230
insert rank 230 ok
正在抓取rank 231
insert rank 231 ok
正在抓取rank 232
insert rank 232 ok
正在抓取rank 233
insert rank 233 ok
正在抓取rank 234
insert rank 234 ok
正在抓取rank 235
insert rank 235 ok
正在抓取rank 236
insert rank 236 ok
正在抓取rank 237
rank 237 抓取异常
正在抓取rank 238
insert rank 238 ok
正在抓取rank 239
insert rank 239 ok
...omitted...
正在抓取rank 250
insert rank 250 ok
抓取完毕,以下rank抓取失败:
[22, 30, 110, 131, 237]
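
I did not retry the failed ranks, but if the failures are transient (which I have not confirmed), a small retry wrapper might recover some of them. The following is only a sketch: `retry` is a hypothetical helper that is not in the script above, and the attempt count and delays are arbitrary assumptions.

```python
import time


def retry(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the pause after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            # Wait before the next attempt, but not after the final one
            if attempt < attempts - 1:
                sleep(delay * 2 ** attempt)
    raise last_error


# Demonstration with a stand-in function that fails twice, then succeeds
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("temporary failure")
    return "ok"


pauses = []
result = retry(flaky, attempts=3, delay=0.5, sleep=pauses.append)
print(result, pauses)
```

In the scraping loop this could wrap the fetch, for example `detail = retry(lambda: getDetail(url))`, so that a rank only lands in the failure list after every attempt has failed.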

Database schema

CREATE TABLE `douban_movie_top250_details` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `rank` int(11) DEFAULT NULL COMMENT 'rank',
  `dirsctor` varchar(255) DEFAULT NULL COMMENT 'director',
  `screenwriter` varchar(255) DEFAULT NULL COMMENT 'screenwriter',
  `starring` text COMMENT 'starring',
  `type` varchar(255) DEFAULT NULL COMMENT 'genre',
  `location` varchar(255) DEFAULT NULL COMMENT 'production country/region',
  `language` varchar(255) DEFAULT NULL COMMENT 'language',
  `rel_time` varchar(255) DEFAULT NULL COMMENT 'release date',
  `len` varchar(255) DEFAULT NULL COMMENT 'runtime',
  `other_names` varchar(255) DEFAULT NULL COMMENT 'also known as',
  `imdb` varchar(255) DEFAULT NULL,
  `introduce` text COMMENT 'plot summary',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Some SQL I ran afterwards

-- `rank` is backquoted because it is a reserved word in MySQL 8.0 and later
update douban_movie_top250_details set id=`rank` where `rank`!=id;

insert into douban_movie_top250_details(`rank`) values(22);
insert into douban_movie_top250_details(`rank`) values(30);
insert into douban_movie_top250_details(`rank`) values(110);
insert into douban_movie_top250_details(`rank`) values(131);
insert into douban_movie_top250_details(`rank`) values(237);
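
The five placeholder inserts above could also be generated from the failure list in one call. The sketch below uses the standard library's `sqlite3` with an in-memory database purely as a stand-in for MySQL, so it runs without a server; the table here is a cut-down, hypothetical version of the real one. With pymysql the equivalent call would be `cursor.executemany`, with `%s` placeholders instead of `?`.

```python
import sqlite3

failed = [22, 30, 110, 131, 237]

# In-memory SQLite database standing in for the MySQL database
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE details (id INTEGER PRIMARY KEY, "rank" INTEGER, director TEXT)'
)

# One parameterized statement, executed once per failed rank
conn.executemany(
    'INSERT INTO details ("rank") VALUES (?)',
    [(rank,) for rank in failed],
)
conn.commit()

rows = conn.execute(
    'SELECT "rank", director FROM details ORDER BY "rank"'
).fetchall()
print(rows)
conn.close()
```

Each placeholder row has only its rank filled in; every other column stays NULL, just as with the hand-written inserts.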

Database screenshot

Full table data download: douban_movie_top250_details.sql

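Summary

I spent this morning trying to install Scrapy, the Python crawling framework. After installing a pile of dependencies (no fewer than ten libraries), Scrapy was basically usable: at least **scrapy startproject douban_movie_top250** could generate a project. But the project would not run and complained that win32api was missing. After a lot of fiddling I still could not get it installed, so, good grief, I quit. I had no patience left to keep wrestling with it and dropped Scrapy.
    For pages generated by JavaScript, tools such as selenium and phantomjs can simulate a browser visit. I have not looked into the details and will study them when the need arises.
    OK, that wraps up the detail pages. Still to come are the short comments, the questions about each movie, and the review pages, and then the series is done. Stay tuned.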

---

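This article is licensed under the Creative Commons Attribution 2.5 China Mainland License. You are welcome to repost it, adapt it, or use it for commercial purposes.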
