Python 3 Web Scraping, Part 6: Scraping the Douban Top 250 Movies with requests and BeautifulSoup

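This time we will use the BeautifulSoup + requests combination introduced last time to scrape the Douban Top 250 movies.

The scraped data will be saved to an Excel file.

Open the target site: https://movie.douban.com/top250?start=0&filter=

Each click on "next page" increases the value of start by 25. There are ten pages in total, so the largest value is 225.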
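
The pagination rule above can be sketched in a few lines. This is only an illustration: the helper name page_urls and its parameters are assumptions made for this example, not part of the original script.

```python
BASE_URL = "https://movie.douban.com/top250"


def page_urls(pages=10, page_size=25):
    """Build one URL per page; start goes 0, 25, ..., (pages - 1) * page_size."""
    return [f"{BASE_URL}?start={page * page_size}&filter=" for page in range(pages)]


urls = page_urls()
print(urls[0])   # first page, start=0
print(urls[-1])  # last page, start=225
```

With the defaults this yields ten URLs, matching the ten pages described above.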

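Next, let's look at the main pieces of information we want.

Main approach:

Request the Douban URL to get the page source.

Then use BeautifulSoup to extract the content we need.

Finally, store the data in an Excel file.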

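def main(page):
    url = "https://movie.douban.com/top250?start=" + str(page * 25) + "&filter="  # build the URL for this page
    html = request_douban(url)  # fetch the HTML
    if html is None:  # the request failed, so skip this page
        return
    soup = BeautifulSoup(html, "lxml")  # parse the HTML
    save_to_excel(soup)  # extract the fields and write them to the sheet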
Requesting the URL
# Request a Douban movie page
def request_douban(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        return None
    return None  # any non-200 status
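Parsing the HTML
soup = BeautifulSoup(html, "lxml")
Processing the HTML to extract the information
All the information we need is inside the element with class grid_view.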
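
To see how these lookups behave before running them against the live site, here is a minimal sketch on a hand-written HTML fragment. The markup is hypothetical and heavily simplified; it only imitates the grid_view / li layout and the title, rating_num and inq class names used in this article, and it uses Python's built-in html.parser instead of lxml so that nothing beyond bs4 is required. The function name parse_movies is also an assumption for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page contains far more elements.
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <div class="item">
      <span class="title">Movie A</span>
      <span class="rating_num">9.7</span>
      <span class="inq">A short quote.</span>
    </div>
  </li>
  <li>
    <div class="item">
      <span class="title">Movie B</span>
      <span class="rating_num">9.6</span>
    </div>
  </li>
</ol>
"""


def parse_movies(html):
    """Return one dict per <li> found inside the grid_view element."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.find(class_="grid_view").find_all("li"):
        quote_tag = item.find(class_="inq")  # None when the movie has no quote
        movies.append({
            "name": str(item.find(class_="title").string),
            "score": str(item.find(class_="rating_num").string),
            "quote": str(quote_tag.string) if quote_tag is not None else "",
        })
    return movies


movies = parse_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

The second entry has no inq element, so its quote falls back to an empty string instead of raising an AttributeError on None.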
def save_to_excel(soup):
    global n  # next row to write; initialised in the main block
    items = soup.find(class_="grid_view").find_all("li")  # one <li> per movie
    for item in items:
        item_name = item.find(class_="title").string  # title
        item_img = item.find("a").find("img").get("src")  # poster image URL
        item_index = item.find(class_="").string  # rank
        item_score = item.find(class_="rating_num").string  # rating
        if item.find(class_="inq") is not None:  # one-line quote, which may be missing
            item_quote = item.find(class_="inq").string
        else:
            item_quote = ""
        item_author = item.find(class_="bd").find(class_="").text  # director and cast
        print("爬取电影:" + item_name + "|" + item_score + "|" + item_quote)  # print progress
        sheet.write(n, 0, item_name)
        sheet.write(n, 1, item_img)
        sheet.write(n, 2, item_index)
        sheet.write(n, 3, item_score)
        sheet.write(n, 4, item_author)
        sheet.write(n, 5, item_quote)
        n += 1  # move to the next row so movies do not overwrite each other
    return None
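
The function above relies on a global counter n to know which row to write. As an alternative sketch, the row can be computed from the page number and the position of the item on that page. The helper row_for is hypothetical, and it assumes every page holds exactly 25 movies and that row 0 is the header row.

```python
def row_for(page, position, page_size=25, header_rows=1):
    """Return the 0-based sheet row for the item at `position` on `page`."""
    return header_rows + page * page_size + position


print(row_for(0, 0))    # first movie on the first page
print(row_for(9, 24))   # last movie on the last page
```

Inside the loop this would pair with enumerate(items) to supply position, at the cost of passing the page number into the function.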
Main block: writing the Excel file
Writing the Excel file requires the xlwt library.
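if __name__ == "__main__":
    n = 1  # row 0 holds the headers, so data starts at row 1
    book = xlwt.Workbook()  # create a workbook; options such as encoding="utf-8" can be passed here
    sheet = book.add_sheet("豆瓣电影TOP250", cell_overwrite_ok=True)  # create a sheet named "豆瓣电影TOP250" and allow cells to be overwritten
    # sheet.write(i, j, value) writes value to the cell at row i, column j
    sheet.write(0, 0, "名称")  # name
    sheet.write(0, 1, "图片")  # image
    sheet.write(0, 2, "排名")  # rank
    sheet.write(0, 3, "评分")  # rating
    sheet.write(0, 4, "作者")  # director and cast
    sheet.write(0, 5, "简介")  # quote
    for i in range(0, 10):
        main(i)
    book.save(u"豆瓣电影TOP250.xls")  # book.save(path) saves the file
    print("爬取完成")  # "scraping finished"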
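Problems encountered
1. If a system proxy is enabled, the script may raise an error when it runs.
2. book.save() writes a file in xls format, so the file name should also end in .xls; otherwise the file is reported as corrupted and cannot be opened.
3. Some scraped values may be empty. Here item_quote is sometimes missing, so check for that before using it.
4. The content written must match the workbook's encoding, otherwise saving fails. For example, if the encoding is set to utf-8, everything written must be utf-8 encoded.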
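
For problem 2, one way to avoid a mismatched extension is to normalise the file name before saving. The sketch below uses only pathlib; the helper name ensure_xls_suffix is an assumption for this example and is not part of xlwt.

```python
from pathlib import Path


def ensure_xls_suffix(filename):
    """Return filename with its extension replaced by (or set to) .xls."""
    return str(Path(filename).with_suffix(".xls"))


print(ensure_xls_suffix("豆瓣电影TOP250.xlsx"))
print(ensure_xls_suffix("douban_top250"))
```

Note that with_suffix replaces whatever follows the last dot, so a name such as "top.250" would become "top.xls".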
Complete code
from bs4 import BeautifulSoup
import requests
import xlwt
def main(page):
    url = "https://movie.douban.com/top250?start=" + str(page * 25) + "&filter="
    html = request_douban(url)
    if html is None:  # the request failed, so skip this page
        return
    soup = BeautifulSoup(html, "lxml")
    save_to_excel(soup)
# Request a Douban movie page
def request_douban(url):
    # Send a browser-like User-Agent with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/88.0.4324.146 Safari/537.36',
    }
    try:
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        return None
    return None  # any non-200 status
def save_to_excel(soup):
    global n  # next row to write; initialised in the main block
    items = soup.find(class_="grid_view").find_all("li")  # one <li> per movie
    for item in items:
        item_name = item.find(class_="title").string  # title
        item_img = item.find("a").find("img").get("src")  # poster image URL
        item_index = item.find(class_="").string  # rank
        item_score = item.find(class_="rating_num").string  # rating
        if item.find(class_="inq") is not None:  # one-line quote, which may be missing
            item_quote = item.find(class_="inq").string
        else:
            item_quote = ""
        item_author = item.find(class_="bd").find(class_="").text  # director and cast
        print("爬取电影:" + item_name + "|" + item_score + "|" + item_quote)  # print progress
        sheet.write(n, 0, item_name)
        sheet.write(n, 1, item_img)
        sheet.write(n, 2, item_index)
        sheet.write(n, 3, item_score)
        sheet.write(n, 4, item_author)
        sheet.write(n, 5, item_quote)
        n += 1  # move to the next row so movies do not overwrite each other
    return None
if __name__ == "__main__":
    n = 1  # row 0 holds the headers, so data starts at row 1
    book = xlwt.Workbook()
    sheet = book.add_sheet("豆瓣电影TOP250", cell_overwrite_ok=True)
    sheet.write(0, 0, "名称")  # name
    sheet.write(0, 1, "图片")  # image
    sheet.write(0, 2, "排名")  # rank
    sheet.write(0, 3, "评分")  # rating
    sheet.write(0, 4, "作者")  # director and cast
    sheet.write(0, 5, "简介")  # quote
    for i in range(0, 10):
        main(i)
    book.save(u"豆瓣电影TOP250.xls")
    print("爬取完成")  # "scraping finished"