I've run into a small problem trying to scrape data from a job-posting site, and I'm fairly new to Python and Scrapy in general.

I have a script that pulls data from various Indeed postings. The crawler doesn't seem to throw errors, but it doesn't pull data from pages that respond with a 301 or 302 status code.

I've pasted the script and the log at the bottom.

Any help would be appreciated.

import scrapy
from scrapy import Request
class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]
    def parse(self, response):
        # note: assigned as a local variable here, this has no effect;
        # see the accepted answer for how it is meant to be used
        handle_httpstatus_list = [True]
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            posting_url = "https://indeed.com" + posting_link
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            yield Request(posting_url, callback=self.parse_page, meta={'title': title, 'posting_url':posting_url, 'job_location':job_location})
        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        absolute_next_url = "https://indeed.com" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)
    def parse_page(self, response):
        posting_url = response.meta.get('posting_url')
        title = response.meta.get('title')
        job_location = response.meta.get('job_location')
        job_name= response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none  jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1=response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date= response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location=response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs  jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2=response.xpath('//div[@class="jobsearch-JobComponent-description  icl-u-xs-mt--md  "]/text()').extract_first()
        yield {'title': title,
            'posting_url': posting_url,
            'job_name': job_name,
            'job_location': job_location,
            'job_description_1': job_description_1,
            'posted_on_date': posted_on_date,
            'job_description_2': job_description_2}

And here is the log:
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-29 12:37:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1860897,
 'downloader/request_count': 1616,
 'downloader/request_method_count/GET': 1616,
 'downloader/response_bytes': 13605809,
 'downloader/response_count': 1616,
 'downloader/response_status_count/200': 360,
 'downloader/response_status_count/301': 758,
 'downloader/response_status_count/302': 498,
 'dupefilter/filtered': 9,
 'elapsed_time_seconds': 28.657843,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 9, 29, 19, 37, 53, 776779),
 'item_scraped_count': 337,
 'log_count/DEBUG': 1954,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'memusage/max': 54546432,
 'memusage/startup': 54546432,
 'request_depth_max': 20,
 'response_received_count': 360,
 'robotstxt/request_count': 3,
 'robotstxt/response_count': 3,
 'robotstxt/response_status_count/200': 3,
 'scheduler/dequeued': 1612,
 'scheduler/dequeued/memory': 1612,
 'scheduler/enqueued': 1612,
 'scheduler/enqueued/memory': 1612,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2019, 9, 29, 19, 37, 25, 118936)}
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Spider closed (finished)
python · scrapy · web-crawler · data-science

Asked by bobparker on 2019-09-30

2 Answers

Answer by mdaniel, 2019-09-30 (accepted):

Per the documentation for RedirectMiddleware, you have a couple of different ways of getting out of that situation:

  • setting dont_redirect=True in a specific Request.meta
  • setting handle_httpstatus_all=True in a specific Request.meta
  • adding handle_httpstatus_list as an attribute of the Spider, whose contents are the numeric HTTP codes for which the Spider wishes to process the actual redirect Response (see the sketch after this list)
  • or, of course, disable the RedirectMiddleware in your settings.py with REDIRECT_ENABLED = False, which will force every Spider to be responsible for its own redirect handling
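
For example, here is a minimal sketch of the handle_httpstatus_list approach (the spider name and the manual-follow logic are illustrative, not taken from the original code):

    import scrapy

    class RedirectAwareSpider(scrapy.Spider):
        name = "redirect_aware"
        start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]
        # With this attribute set, RedirectMiddleware hands 301/302
        # responses to the spider instead of following them itself.
        handle_httpstatus_list = [301, 302]

        def parse(self, response):
            if response.status in (301, 302):
                # The redirect target is in the Location header;
                # follow it manually only if you actually want to.
                target = response.headers.get('Location', b'').decode()
                self.logger.info('Redirect: %s -> %s', response.url, target)
                yield response.follow(target, callback=self.parse)
            else:
                pass  # normal parsing of 200 responses would go here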
  • Strange, I tried those settings and got some changes in the log file, but it still isn't extracting all the data from each link.
    Well, phew, good thing you didn't update your question with the new log messages you received, or with the details of what you tried that didn't work; otherwise someone might have accidentally helped you solve your problem. If you change your mind, be sure to read the How to Ask page, and pay particular attention to the MCVE section.
    Answer by Tor Stava, 2019-09-30:

    I just ran a quick test of your scraper, and as far as I can tell it is actually working exactly as it is supposed to.

    Edit: To make my explanation clearer: you cannot scrape a 301 or 302 response, because it is just a redirect. If you request a URL that redirects, Scrapy handles the redirect automatically and scrapes the page you are redirected to. Only the final destination of the redirect returns a 200 response.

    If you follow the logic I lay out below, you will see that Scrapy requests the URL http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3, but gets redirected to https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3. It is this final page that you are able to scrape. (You can try this yourself by clicking the initial URL and comparing it with the final one.)

    To repeat: you will not be able to scrape anything from the 301 and 302 redirects (there is nothing there to scrape), only from the final page that returns a 200 response.
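
    As an aside: if you do want to keep track of the chain, RedirectMiddleware records every URL it followed in the request's meta under the key redirect_urls, so the callback can recover the originally requested URL. A small sketch, with illustrative field names:

    def parse_page(self, response):
        # 'redirect_urls' is set by RedirectMiddleware and lists every
        # URL in the redirect chain; the first entry is the originally
        # requested URL. The key is absent when no redirect happened.
        redirect_urls = response.meta.get('redirect_urls', [])
        original_url = redirect_urls[0] if redirect_urls else response.url
        yield {
            'original_url': original_url,  # e.g. the /rc/clk?... link
            'final_url': response.url,     # e.g. the /viewjob?... page
        }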

    I have attached a suggested version of your scraper that stores both the requested URL and the scraped URL. As far as I can tell, everything is fine and your scraper works the way it should. (Note, however, that indeed.com will only give you up to 19 pages of search results, which limits you to about 190 items.)

    I hope this explains things a bit better now.

    Here is an example of the output, starting from the initial request:

    2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3> from <GET http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
    

    That 301 redirect is in turn redirected to the next link:

    2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> from <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
    

    That 302 redirect finally lands on a page that is crawled with a 200 response:

    2019-09-30 10:37:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> (referer: None)
    

    And finally, the data can be scraped:

    2019-09-30 10:37:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> 
    {'title': 'General Manager', 'posting_url': 'https://indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3', 'job_name': 'General Manager', 'job_location': 'Plano, TX 75024', 'job_description_1': [], 'posted_on_date': ' - 30+ days ago', 'job_description_2': None}
    

    So the data is scraped from the final page that returned the 200 response. Note that in the scraped item, posting_url is the value passed along in the meta attribute, not the URL that was actually scraped. That may be what you want, but if you want to store the URL that was actually scraped, you should use posting_url = response.url instead. Edit: see the suggested update below.

    Suggested code update:

    import scrapy
    class JobsSpider(scrapy.Spider):
        name = "jobs"
        allowed_domains = ["indeed.com"]
        start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]
        def parse(self, response):
            jobs = response.xpath('//div[@class="title"]')
            for job in jobs:
                title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            referer_url = "https://indeed.com" + posting_link
            yield scrapy.Request(url=referer_url,
                                 callback=self.parse_page,
                                 meta={'title': title,
                                       'referer_url': referer_url,
                                       'job_location': job_location})
            relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
            if relative_next_url:
                absolute_next_url = "https://indeed.com" + relative_next_url
                yield scrapy.Request(absolute_next_url, callback=self.parse)
            else:
                self.logger.info('No more pages found.')
        def parse_page(self, response):
            referer_url = response.meta.get('referer_url')
            title = response.meta.get('title')
            job_location = response.meta.get('job_location')
            posting_url = response.url
            job_name= response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none  jobsearch-JobInfoHeader-title"]/text()').extract_first()
            job_description_1=response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
            posted_on_date= response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
            job_location=response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs  jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
            job_description_2=response.xpath('//div[@class="jobsearch-JobComponent-description  icl-u-xs-mt--md  "]/text()').extract_first()
            yield {'title': title,
                   'posting_url': posting_url,
                   'referer_url': referer_url,
                   'job_name': job_name,
                   'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2}
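
    If you save this as, say, jobs_spider.py (the filename is just an example), you can test it without a full Scrapy project via scrapy runspider:

    scrapy runspider jobs_spider.py -o jobs.json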