I've run into a small problem while trying to scrape data from a job board; I'm fairly new to Python and Scrapy in general.
I have a script that extracts data from individual Indeed postings. The spider doesn't seem to throw errors, but it won't extract data from pages that respond with a 301 or 302 status code.
I've pasted the script and the log at the bottom.
Any help would be appreciated.
import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        handle_httpstatus_list = [True]
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            posting_url = "https://indeed.com" + posting_link
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            yield Request(posting_url, callback=self.parse_page, meta={'title': title, 'posting_url': posting_url, 'job_location': job_location})

        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        absolute_next_url = "https://indeed.com" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        posting_url = response.meta.get('posting_url')
        title = response.meta.get('title')
        job_location = response.meta.get('job_location')
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2,
               'job_location': job_location}
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-29 12:37:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1860897,
'downloader/request_count': 1616,
'downloader/request_method_count/GET': 1616,
'downloader/response_bytes': 13605809,
'downloader/response_count': 1616,
'downloader/response_status_count/200': 360,
'downloader/response_status_count/301': 758,
'downloader/response_status_count/302': 498,
'dupefilter/filtered': 9,
'elapsed_time_seconds': 28.657843,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 9, 29, 19, 37, 53, 776779),
'item_scraped_count': 337,
'log_count/DEBUG': 1954,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 54546432,
'memusage/startup': 54546432,
'request_depth_max': 20,
'response_received_count': 360,
'robotstxt/request_count': 3,
'robotstxt/response_count': 3,
'robotstxt/response_status_count/200': 3,
'scheduler/dequeued': 1612,
'scheduler/dequeued/memory': 1612,
'scheduler/enqueued': 1612,
'scheduler/enqueued/memory': 1612,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2019, 9, 29, 19, 37, 25, 118936)}
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Spider closed (finished)
2 Answers
0 votes
According to the RedirectMiddleware documentation, you have a few different ways out of this situation:

- setting dont_redirect=True in a specific Request.meta;
- setting handle_httpstatus_all=True in a specific Request.meta;
- adding handle_httpstatus_list as an attribute of the Spider, whose contents are the numeric HTTP codes for which the Spider wishes to process the actual redirect Response (see the sketch below);
- or, of course, disabling the RedirectMiddleware in your settings.py with REDIRECT_ENABLED = False, which will force every Spider to be responsible for its own redirect handling.
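For example, here is a minimal sketch of the spider-attribute option. The spider name and callback body are hypothetical, added only for illustration; handle_httpstatus_list itself is the documented Scrapy attribute. Note that with this set, your callback receives the raw 301/302 responses instead of the followed redirects, so you have to read the Location header yourself:

import scrapy

class RedirectAwareSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the spider-attribute option.
    name = "redirect_aware"
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]
    # With this attribute set, Scrapy hands 301/302 responses straight to the
    # callback instead of letting RedirectMiddleware follow them.
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        if response.status in (301, 302):
            # A redirect response carries no job data, only a Location header
            # pointing at the real page.
            target = response.headers.get("Location", b"").decode()
            self.logger.info("Got a %d from %s -> %s", response.status, response.url, target)
        else:
            # Normal 200 handling would go here.
            self.logger.info("Got a 200 from %s", response.url)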
0 votes
I just gave your scraper a quick test run, and as far as I can tell it is actually working as intended.
Edit: to make my explanation clearer: you cannot scrape a 301 or 302 redirect, because it is just that, a redirect. If you request a URL that gets redirected, Scrapy handles it for you automatically and scrapes the data from the page you are redirected to. Only the final destination of the redirect gives you a 200 response.
If you follow the logic I walk through below, you will see that Scrapy requests the URL http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3 but gets redirected to https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3. It is this final page that you will be able to scrape. (You can try this yourself by clicking the initial URL and comparing it to the final one.)
To repeat: you will not be able to scrape anything from the 301 and 302 redirects (there is nothing there to scrape), only from the final page that responds with a 200.
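As a side note: if you want to record the redirect chain Scrapy followed for a given item, the default RedirectMiddleware stores the intermediate URLs in the redirect_urls request meta key. A minimal sketch you could drop into your parse_page callback, assuming the default middleware is enabled:

# Inside parse_page: 'redirect_urls' is filled in by Scrapy's
# RedirectMiddleware and is absent when the request was not redirected.
redirect_chain = response.request.meta.get('redirect_urls', [])
if redirect_chain:
    self.logger.debug('Followed %d redirect(s): %s -> %s',
                      len(redirect_chain), redirect_chain[0], response.url)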
I have attached a suggested version of your scraper below that saves both the requested URL and the URL that was actually scraped. As far as I can tell, everything is fine and your scraper works the way it should. (Note, however, that indeed.com will only serve you up to 19 pages of search results, which limits you to roughly 190 items.)
I hope this makes things clearer.
Here is an example of the output, starting from the original request.
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3> from <GET http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
This is redirected with a 301 to the next link.
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> from <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
This in turn is redirected with a 302 to the next link.
2019-09-30 10:37:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> (referer: None)
Finally, we can scrape the data.
2019-09-30 10:37:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3>
{'title': 'General Manager', 'posting_url': 'https://indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3', 'job_name': 'General Manager', 'job_location': 'Plano, TX 75024', 'job_description_1': [], 'posted_on_date': ' - 30+ days ago', 'job_description_2': None}
So the data is scraped from the final page that received a 200 response. Note that in the scraped item, posting_url is the URL passed in through the meta attribute, not the URL that was actually scraped. This may be what you want, but if you want to save the actual URL that was scraped, you should use posting_url = response.url instead. Edit: see the suggested update below.
Suggested code update:
import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            referer_url = "https://indeed.com" + posting_link
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            yield scrapy.Request(url=referer_url,
                                 callback=self.parse_page,
                                 meta={'title': title,
                                       'referer_url': referer_url,
                                       'job_location': job_location})

        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if relative_next_url:
            absolute_next_url = "https://indeed.com" + relative_next_url
            yield scrapy.Request(absolute_next_url, callback=self.parse)
        else:
            self.logger.info('No more pages found.')

    def parse_page(self, response):
        referer_url = response.meta.get('referer_url')
        title = response.meta.get('title')
        job_location = response.meta.get('job_location')
        posting_url = response.url
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'referer_url': referer_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2}
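To try this out, you can run the spider standalone and dump the items to a file with Scrapy's standard CLI, assuming the spider is saved as jobs.py (the filename is just an example):

scrapy runspider jobs.py -o jobs.json

Each scraped item will then contain both referer_url (the link taken from the search results page) and posting_url (the final URL after any redirects), so you can see exactly which page each item came from.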