I'm running into a small problem trying to scrape data from a job site, and I'm fairly new to Python and Scrapy in general.
I have a script that extracts data from various Indeed postings. The spider doesn't seem to throw any errors, but it won't extract data from pages that respond with a 301 or 302 status code.
I've pasted the script and the log at the bottom.
Any help would be appreciated.
import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        handle_httpstatus_list = [True]
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            posting_url = "https://indeed.com" + posting_link
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            yield Request(posting_url, callback=self.parse_page,
                          meta={'title': title, 'posting_url': posting_url,
                                'job_location': job_location})

        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        absolute_next_url = "https://indeed.com" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        posting_url = response.meta.get('posting_url')
        title = response.meta.get('title')
        job_location = response.meta.get('job_location')
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2}
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-29 12:37:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1860897,
'downloader/request_count': 1616,
'downloader/request_method_count/GET': 1616,
'downloader/response_bytes': 13605809,
'downloader/response_count': 1616,
'downloader/response_status_count/200': 360,
'downloader/response_status_count/301': 758,
'downloader/response_status_count/302': 498,
'dupefilter/filtered': 9,
'elapsed_time_seconds': 28.657843,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 9, 29, 19, 37, 53, 776779),
'item_scraped_count': 337,
'log_count/DEBUG': 1954,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 54546432,
'memusage/startup': 54546432,
'request_depth_max': 20,
'response_received_count': 360,
'robotstxt/request_count': 3,
'robotstxt/response_count': 3,
'robotstxt/response_status_count/200': 3,
'scheduler/dequeued': 1612,
'scheduler/dequeued/memory': 1612,
'scheduler/enqueued': 1612,
'scheduler/enqueued/memory': 1612,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2019, 9, 29, 19, 37, 25, 118936)}
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Spider closed (finished)
2 Answers
Per the documentation for Scrapy's RedirectMiddleware, you have a few different ways of getting out of this situation (a minimal sketch follows the list):
- setting dont_redirect=True in a specific Request.meta
- setting handle_httpstatus_all=True in a specific Request.meta
- adding handle_httpstatus_list as an attribute of the Spider, whose contents are the numeric HTTP status codes for which the Spider wishes to process the actual redirect Response
- or, of course, disabling the RedirectMiddleware in your settings.py with REDIRECT_ENABLED = False, which forces every Spider to be responsible for its own redirect handling
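For instance, a minimal sketch of the per-request and Spider-attribute options, using a hypothetical spider and placeholder URLs:

import scrapy

class RedirectHandlingSpider(scrapy.Spider):
    # Hypothetical spider and URLs, for illustration only.
    name = "redirect_handling_example"
    start_urls = ["https://www.example.com/"]

    # As a Spider attribute: 301/302 responses are delivered to the
    # callback instead of being followed by the RedirectMiddleware.
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        # Per-request: dont_redirect=True makes the RedirectMiddleware
        # ignore this request, and handle_httpstatus_list in meta keeps
        # the HttpErrorMiddleware from filtering the non-200 response out.
        yield scrapy.Request(
            "https://www.example.com/redirecting-page",
            callback=self.parse_redirect,
            meta={'dont_redirect': True,
                  'handle_httpstatus_list': [301, 302]},
        )

    def parse_redirect(self, response):
        # The raw redirect response: its status code and the Location
        # header it points to are both available here.
        self.logger.info("Got %s redirect to %s",
                         response.status,
                         response.headers.get('Location'))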
I just ran a quick test of your scraper, and as far as I can tell it is actually working as intended.
Edit: to make my explanation clearer: you cannot scrape a 301 or 302 redirect, because it is just a redirect. If you request a URL that gets redirected, Scrapy handles the redirect for you automatically and scrapes the data from the page you are redirected to. Only the final destination of the redirect returns a 200 response.
If you follow the logic I lay out below, you will see that Scrapy requests the URL http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3 but gets redirected to https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3. It is this final page that you are able to scrape. (You can try it yourself by opening the initial URL and comparing it with the final one.)
To repeat: you will not be able to scrape anything from the 301 and 302 redirects themselves (there is nothing there to scrape), only from the final page that returns a 200 response.
I have attached a suggested version of your scraper below, which saves both the requested URL and the URL that was actually scraped. As far as I can tell everything is fine and your scraper works the way it should. (Note, however, that indeed.com will only serve you a maximum of 19 pages of search results, which limits you to scraping 190 items.)
I hope this makes things a bit clearer now.
Here is an example of the output, starting from the original request:
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3> from <GET http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
This gets a 301 redirect to the next URL:
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> from <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
which in turn gets a 302 redirect to the next URL:
2019-09-30 10:37:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> (referer: None)
And finally, we can scrape the data:
2019-09-30 10:37:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3>
{'title': 'General Manager', 'posting_url': 'https://indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3', 'job_name': 'General Manager', 'job_location': 'Plano, TX 75024', 'job_description_1': [], 'posted_on_date': ' - 30+ days ago', 'job_description_2': None}
So the data is scraped from the final page, the one that received the 200 response. Note that in the scraped item, posting_url is the value that was passed in via the meta attribute, not the URL that was actually scraped. That may be what you want, but if you want to save the URL that was actually scraped, you should use posting_url = response.url instead. Edit: see the suggested update below.
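As a side note, if you also want to record the intermediate hops, the RedirectMiddleware stores them in response.meta['redirect_urls']; a minimal sketch of how parse_page could use that:

def parse_page(self, response):
    # URLs this request passed through before landing here; set by the
    # RedirectMiddleware, and absent when no redirect happened.
    redirect_chain = response.meta.get('redirect_urls', [])
    yield {
        'requested_url': redirect_chain[0] if redirect_chain else response.url,
        'final_url': response.url,  # the page that actually returned 200
    }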
Suggested code update:
import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            referer_url = "https://indeed.com" + posting_link
            yield scrapy.Request(url=referer_url,
                                 callback=self.parse_page,
                                 meta={'title': title,
                                       'referer_url': referer_url,
                                       'job_location': job_location})

        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if relative_next_url:
            absolute_next_url = "https://indeed.com" + relative_next_url
            yield scrapy.Request(absolute_next_url, callback=self.parse)
        else:
            self.logger.info('No more pages found.')

    def parse_page(self, response):
        referer_url = response.meta.get('referer_url')
        title = response.meta.get('title')
        job_location = response.meta.get('job_location')
        posting_url = response.url  # the final URL that returned 200
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'referer_url': referer_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2}
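For completeness, one way to run the spider outside the scrapy crawl command (a minimal sketch, assuming the spider class above lives in the same file; jobs.json is an arbitrary output name):

from scrapy.crawler import CrawlerProcess

# Standalone runner: FEED_FORMAT/FEED_URI tell Scrapy to write the
# scraped items to a JSON file.
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'jobs.json',
})
process.crawl(JobsSpider)
process.start()  # blocks until the crawl finishes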