This article shows how to crawl static data with Scrapy and JavaScript-generated dynamic data with Selenium + Headless Chrome, so that complete app data can be scraped from the Google Play Indonesian market.
Note: each country's market pages use a different data format, so the parsing logic differs as well. To crawl app data from another country's market, you need to adapt the parsing code (in the GooglePlaySpider.py file shown below).
Project environment:
Platform: macOS
Python version: 3.6
IDE: Sublime Text
Scrapy, the crawler framework
$ sudo easy_install pip
$ pip install scrapy
Selenium, the browser automation framework
$ pip install selenium
Chrome Driver, the browser driver
Download it and unzip it directly
Chrome, the browser
$ curl https://intoli.com/install-google-chrome.sh | bash
SQLAlchemy, the SQL toolkit/ORM
$ pip install sqlalchemy
$ pip install sqlalchemy_utils
MySQL
Create the project with Scrapy
$ scrapy startproject gp
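The command generates the standard Scrapy project skeleton (the exact files can vary slightly between Scrapy versions):

gp/
    scrapy.cfg            # deployment configuration
    gp/                   # the innermost gp package referenced throughout this article
        __init__.py
        items.py          # Item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py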
Define the crawler Item classes
Add the following to the items.py file:
import scrapy


class ProductItem(scrapy.Item):
    gp_icon = scrapy.Field()
    gp_name = scrapy.Field()
    # ...


class GPReviewItem(scrapy.Item):
    avatar_url = scrapy.Field()
    user_name = scrapy.Field()
    # ...
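The pipeline shown later also reads item['gp_url'], item['updated_at'] and item['gp_review'] (a list of review dicts), so a fuller sketch of ProductItem, assuming those are declared as plain Fields here, would look like this:

class ProductItem(scrapy.Item):
    gp_url = scrapy.Field()      # product page URL (read by the pipeline; assumed to be declared here)
    updated_at = scrapy.Field()  # crawl timestamp (read by the pipeline)
    gp_icon = scrapy.Field()
    gp_name = scrapy.Field()
    gp_review = scrapy.Field()   # list of review dicts, as seen in the crawl output below
    # ...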
Create GooglePlaySpider.py in the spiders folder:
import scrapy

from gp.items import ProductItem, GPReviewItem


class GooglePlaySpider(scrapy.Spider):
    name = 'gp'
    allowed_domains = ['play.google.com']

    def __init__(self, *args, **kwargs):
        urls = kwargs.pop('urls', [])
        super(GooglePlaySpider, self).__init__(*args, **kwargs)
        if urls:
            self.start_urls = urls.split(',')
        print('start urls = ', self.start_urls)

    def parse(self, response):
        print('Begin parse ', response.url)
        item = ProductItem()
        exception_count = 0  # counts fields that failed to parse
        content = response.xpath('//div[@class="LXrl4c"]')
        try:
            item['gp_icon'] = response.urljoin(content.xpath('//img[@class="T75of ujDFqe"]/@src')[0].extract())
        except Exception as error:
            exception_count += 1
            print('gp_icon except = ', error)
            item['gp_icon'] = ''
        try:
            item['gp_name'] = content.xpath('//h1[@class="AHFaub"]/span/text()')[0].extract()
        except Exception as error:
            exception_count += 1
            print('gp_name except = ', error)
            item['gp_name'] = ''
        # ...
        yield item
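The elided part of parse() is where the remaining fields, including gp_review, get filled in. A minimal sketch of that review-collection pattern, sitting inside parse() and using placeholder XPaths that would have to be replaced with the real class names from the page, might look like:

        # Hypothetical sketch -- 'REVIEW_NODE' and 'USER_NAME' are placeholders, not real Google Play markup.
        reviews = []
        for node in content.xpath('//div[@class="REVIEW_NODE"]'):
            review = GPReviewItem()
            review['avatar_url'] = response.urljoin(node.xpath('.//img/@src').extract_first(''))
            review['user_name'] = node.xpath('.//span[@class="USER_NAME"]/text()').extract_first('')
            # rating_star and review_text would be extracted the same way
            reviews.append(dict(review))
        item['gp_review'] = reviews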
Run the spider:
$ scrapy crawl gp -a urls='https://play.google.com/store/apps/details?id=id.danarupiah.weshare.jiekuan&hl=id'
The review data in the output:
'gp_review': []
The reason no review data is returned is that the reviews are generated dynamically by JavaScript, so the page has to be requested through a simulated browser to obtain them.
Fetching the review data with Selenium + Headless Chrome
Create a configuration file configs.py in the innermost gp folder and add the browser paths:
CHROME_PATH = r''
CHROME_DRIVER_PATH = r''
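For reference, on macOS the two values typically end up looking like this (the chromedriver location below is only an assumption; point it at wherever the downloaded driver was unzipped):

CHROME_PATH = r'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'  # default macOS Chrome binary
CHROME_DRIVER_PATH = r'/usr/local/bin/chromedriver'                            # assumed unzip location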
Create a ChromeDownloaderMiddleware class in the middlewares.py file:
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

from gp.configs import *


class ChromeDownloaderMiddleware(object):

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        if CHROME_PATH:
            options.binary_location = CHROME_PATH
        if CHROME_DRIVER_PATH:
            self.driver = webdriver.Chrome(chrome_options=options, executable_path=CHROME_DRIVER_PATH)
        else:
            self.driver = webdriver.Chrome(chrome_options=options)

    def __del__(self):
        self.driver.close()

    def process_request(self, request, spider):
        try:
            print('Chrome driver begin...')
            self.driver.get(request.url)
            return HtmlResponse(url=request.url, body=self.driver.page_source, request=request, encoding='utf-8',
                                status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)
        finally:
            print('Chrome driver end...')
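The middleware returns page_source as soon as get() finishes. If the review block renders late, a hedged variant of process_request can wait explicitly before grabbing the source; the extra imports go at the top of middlewares.py, and the CSS selector is a placeholder that must be adapted to the real page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
            # Wait up to 10 seconds for a (placeholder) review container to appear.
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'div.REVIEW_NODE')))
            return HtmlResponse(url=request.url, body=self.driver.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)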
Add the following to the settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'gp.middlewares.ChromeDownloaderMiddleware': 543,
}
Run the spider again:
$ scrapy crawl gp -a urls='https://play.google.com/store/apps/details?id=id.danarupiah.weshare.jiekuan&hl=id'
The review data in the output this time:
'gp_review': [{'avatar_url': 'https:
               'rating_star': '5',
               'review_text': 'Euis Suharani',
               'user_name': 'Euis Suharani'},
              {'avatar_url': 'https:
               'rating_star': '3',
               'review_text': 'Pengguna Google',
               'user_name': 'Pengguna Google'},
              {'avatar_url': 'https:
               'rating_star': '5',
               'review_text': 'novi anna',
               'user_name': 'novi anna'},
              {'avatar_url': 'https:
               'rating_star': '4',
               'review_text': 'Pengguna Google',
               'user_name': 'Pengguna Google'}]
Using SQLAlchemy to work with MySQL
Add the database connection information to the configuration file configs.py:
DATABASES = {
    'DRIVER': 'mysql+pymysql',
    'HOST': '127.0.0.1',
    'PORT': 3306,
    'NAME': 'gp',
    'USER': 'root',
    'PASSWORD': 'root',
}
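The mysql+pymysql driver string relies on the PyMySQL package, which the installs above do not pull in; if it is missing, install it first:
$ pip install pymysql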
Create the database connection file connections.py in the innermost gp folder:
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_utils import database_exists, create_database

from gp.configs import *

Base = declarative_base()


def db_connect_engine():
    engine = create_engine("%s://%s:%s@%s:%s/%s?charset=utf8"
                           % (DATABASES['DRIVER'],
                              DATABASES['USER'],
                              DATABASES['PASSWORD'],
                              DATABASES['HOST'],
                              DATABASES['PORT'],
                              DATABASES['NAME']),
                           echo=False)
    if not database_exists(engine.url):
        create_database(engine.url)
    Base.metadata.create_all(engine)
    return engine


def db_session():
    return sessionmaker(bind=db_connect_engine())
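Note that db_session() returns a session factory (a sessionmaker bound to the engine), not a session, so callers invoke it twice, which is how the pipeline below uses it:

Session = db_session()  # builds the engine and creates the database/tables if they do not exist
session = Session()     # an actual session for queries and inserts
# ... use session ...
session.close()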
Create the SQLAlchemy model file models.py in the innermost gp folder:
from sqlalchemy import Column, ForeignKey
from sqlalchemy.dialects.mysql import TEXT, INTEGER
from sqlalchemy.orm import relationship

from gp.connections import Base


class Product(Base):
    __tablename__ = 'product'

    id = Column(INTEGER, primary_key=True, autoincrement=True)
    updated_at = Column(INTEGER)
    gp_icon = Column(TEXT)
    gp_name = Column(TEXT)
    # ...


class GPReview(Base):
    __tablename__ = 'gp_review'

    id = Column(INTEGER, primary_key=True, autoincrement=True)
    product_id = Column(INTEGER, ForeignKey(Product.id))
    avatar_url = Column(TEXT)
    user_name = Column(TEXT)
    # ...
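relationship is imported but not used in the excerpt above. Since GPReview carries a foreign key to Product.id, one plausible use (an assumption about the elided code, not something the original shows) is to expose a product's reviews directly on the model:

class Product(Base):
    # ... columns as above ...
    reviews = relationship('GPReview', backref='product')  # assumed convenience relationship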
Add the database persistence code to the pipelines.py file:
from gp.connections import *
from gp.items import ProductItem
from gp.models import *


class GoogleplayspiderPipeline(object):

    def __init__(self):
        self.session = db_session()

    def process_item(self, item, spider):
        print('process item from gp url = ', item['gp_url'])
        if isinstance(item, ProductItem):
            session = self.session()
            model = Product()
            model.gp_icon = item['gp_icon']
            model.gp_name = item['gp_name']
            # ...
            try:
                m = session.query(Product).filter(Product.gp_url == model.gp_url).first()
                if m is None:
                    print('add model from gp url ', model.gp_url)
                    session.add(model)
                    session.flush()
                    product_id = model.id
                    for review in item['gp_review']:
                        r = GPReview()
                        r.product_id = product_id
                        r.avatar_url = review['avatar_url']
                        r.user_name = review['user_name']
                        # ...
                        session.add(r)
                else:
                    print("update model from gp url ", model.gp_url)
                    m.updated_at = item['updated_at']
                    m.gp_icon = item['gp_icon']
                    m.gp_name = item['gp_name']
                    # ...
                    product_id = m.id
                    session.query(GPReview).filter(GPReview.product_id == product_id).delete()
                    session.flush()
                    for review in item['gp_review']:
                        r = GPReview()
                        r.product_id = product_id
                        r.avatar_url = review['avatar_url']
                        r.user_name = review['user_name']
                        # ...
                        session.add(r)
                session.commit()
                print('spider_success')
            except Exception as error:
                session.rollback()
                print('gp error = ', error)
                print('spider_failure_exception')
                raise
            finally:
                session.close()
        return item
Uncomment ITEM_PIPELINES in the settings.py file:
ITEM_PIPELINES = {
    'gp.pipelines.GoogleplayspiderPipeline': 300,
}
Run the spider once more:
$ scrapy crawl gp -a urls='https://play.google.com/store/apps/details?id=id.danarupiah.weshare.jiekuan&hl=id'
Check the crawled data stored in the MySQL database:
Connect to MySQL: $ mysql -u root -p, then enter the password: root
List all databases: mysql> show databases; — the newly created gp database should be listed
Switch to it: mysql> use gp;
List all tables: mysql> show tables; — the newly created product and gp_review tables should be listed (example output below)
View the product data: mysql> select * from product;
View the review data: mysql> select * from gp_review;
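If everything worked, show tables should print roughly the following (standard MySQL output format):

mysql> show tables;
+--------------+
| Tables_in_gp |
+--------------+
| gp_review    |
| product      |
+--------------+
2 rows in set (0.00 sec)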
Complete project code