Scrapy stops crawling and yielding items after a while, but keeps running

Question 1

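I wrote some Scrapy code that should loop through a list of cities, go to a specific page for each city, scrape all the data from a table on that page, and then iterate through all of that city's table pages. My code runs, but after a while it seems to time out, and I start seeing this in the log: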

2020-12-16 18:47:47 [yjs] INFO: Parsing table and getting job data for page url http://www.yingjiesheng.com/other-morejob-1372.html
2020-12-16 18:48:27 [scrapy.extensions.logstats] INFO: Crawled 113 pages (at 2 pages/min), scraped 111 items (at 2 items/min)
2020-12-16 18:49:27 [scrapy.extensions.logstats] INFO: Crawled 113 pages (at 0 pages/min), scraped 111 items (at 0 items/min)
2020-12-16 18:50:27 [scrapy.extensions.logstats] INFO: Crawled 113 pages (at 0 pages/min), scraped 111 items (at 0 items/min)
2020-12-16 18:51:27 [scrapy.extensions.logstats] INFO: Crawled 113 pages (at 0 pages/min), scraped 111 items (at 0 items/min)
2020-12-16 18:52:27 [scrapy.extensions.logstats] INFO: Crawled 113 pages (at 0 pages/min), scraped 111 items (at 0 items/min)
This seems to happen at random points. The first time I ran it, it started after 66 pages. Here is my spider code:
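import datetime
import re
import scrapy
# YjsTable is the project's own Item class, assumed to be defined elsewhere (e.g. in items.py).
URLROOT = "https://www.yingjiesheng.com/"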
CITIES = {"beijing": "北京"}
class YjsSpider(scrapy.Spider):
    name = "yjs"
    def start_requests(self):
        # loop through cities and pass info
        for key, value in CITIES.items():
            self.logger.info('Starting requests for %s', key)
            url = URLROOT + str(key)
            yield scrapy.Request(
                url=url, callback=self.retrieve_tabsuffix, 
                meta={'city': key, 'city_ch': value},
                encoding='gb18030')
    def retrieve_tabsuffix(self, response):
        city = response.meta['city']
        city_ch = response.meta['city_ch']
        morepages = response.xpath(
            '//*[contains(concat( " ", @class, " " ), concat( " ", "mbth", " " ))]')
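        morepage_html = morepages.css("a::attr(href)").get()
        # Bail out if the "more jobs" link is missing or has an unknown format;
        # otherwise jobpage_one would be undefined below.
        if morepage_html is None:
            self.logger.warning('No "more jobs" link found for %s', city)
            return
        if "-morejob-" in morepage_html:
            jobpage_one = f"{URLROOT}{city}-morejob-1.html"
        elif "list_" in morepage_html:
            jobpage_one = f"{URLROOT}{city}/list_1.html"
        else:
            self.logger.warning('Unrecognised link format: %s', morepage_html)
            return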
        yield response.follow(
            url=jobpage_one, 
            callback=self.retrieve_tabhtmls,
            meta={'city': city, 'city_ch': city_ch},
            encoding='gb18030')
    def retrieve_tabhtmls(self, response):
        city = response.meta['city']
        city_ch = response.meta['city_ch']
        # encoding1/encoding2 were never defined; log the detected response encoding instead
        self.logger.info('Response encoding is %s', response.encoding)
        # htmls
        listhtmls = response.xpath(
                '//*[contains(concat( " ", @class, " " ), concat( " ", "clear", " " ))]').get()
        totalrecords = response.xpath(
            '//*[contains(concat( " ", @class, " " ), concat( " ", "act", " " ))]').get()
        self.logger.info("totalrecords: %s", totalrecords)
        # identify the last page number
        listhtmls = listhtmls.split("a href=\"")
        for listhtml in listhtmls:
            if "last page" in listhtml:
                lastpagenum = re.findall(r"\d+", listhtml)[0]
        morejobpages = list(range(1, int(lastpagenum) + 1))
        self.logger.info("total number tables %s", lastpagenum)
        self.logger.info('Getting all table page URLs for %s', city)
        morejobpages_urls = [
                "http://www.yingjiesheng.com/{}/list_{}.html".format(city, i) for i in morejobpages]
        self.logger.info(morejobpages)
        yield from response.follow_all(
            urls=morejobpages_urls,
            callback=self.parse_tab,
            meta={'city': city, 'city_ch': city_ch,
                  'totalrecords': totalrecords},
            encoding='gb18030')
    def parse_tab(self, response):
        city = response.meta['city']
        city_ch = response.meta['city_ch']
        totalrecords = response.meta['totalrecords']
        self.logger.info('Parsing table and getting job data for page url %s', response.url)
        # table content
        tabcontent = response.xpath(
            '//*[(@id = "tb_job_list")]')
        # list of rows
        tabrows = tabcontent.css("tr.jobli").getall()
        item = YjsTable()
        item['table'] = tabrows
        item['time_scraped'] = datetime.datetime.now().strftime(
                "%m/%d/%Y %H:%M:%S")
        item['city'] = city
        item['city_ch'] = city_ch
        item['totalrecords'] = totalrecords
        item['pageurl'] = response.url
        yield item
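This is the only other post I have found that seems to describe the same problem, but they were pulling from a SQL database and I am not.
Does anyone know why Scrapy would work for a while and then suddenly stop requesting pages and scraping data, while continuing to run?
Edit: I reran with the log level set to DEBUG and got this: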
2020-12-17 10:35:47 [scrapy.extensions.logstats] INFO: Crawled 41 pages (at 0 pages/min), scraped 39 items (at 0 items/min)
2020-12-17 10:35:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.yingjiesheng.com/app/job.php?Action=FullTimeMore&Location=guangzhou&Source=Other&Page=86> from <GET http://www.yingjiesheng.com/guangzhou-morejob-86.html>
2020-12-17 10:36:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.yingjiesheng.com/guangzhou-morejob-86.html> from <GET http://www.yingjiesheng.com/app/job.php?Action=FullTimeMore&Location=guangzhou&Source=Other&Page=86>
2020-12-17 10:36:24 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.yingjiesheng.com/app/job.php?Action=FullTimeMore&Location=guangzhou&Source=Other&Page=85> from <GET http://www.yingjiesheng.com/guangzhou-morejob-85.html>
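So it looks like I am being redirected, but Scrapy does not manage to scrape anything from the redirect and just moves on to the next page. Does anyone know how to make Scrapy keep retrying a page until it succeeds? Or is there a better way to handle this?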

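Question 2

First, check your logging settings so you can enable better logging for your case. The simplest approach is to set LOG_LEVEL='DEBUG' so you can see everything that is happening; right now it looks like it is set to "INFO".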
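
As a minimal sketch of that setting, assuming a standard project layout with a settings.py file (the LOGSTATS_INTERVAL line is optional and is shown with its default value):

```python
# settings.py (fragment)

# Log at DEBUG level so that redirects and retries show up in the output.
LOG_LEVEL = "DEBUG"

# How often (in seconds) the "Crawled N pages ..." line is printed; 60 is the default.
LOGSTATS_INTERVAL = 60.0
```

The same level can also be set for a single run from the command line with `scrapy crawl yjs -L DEBUG`.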
            
            
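What may be happening is that the spider keeps finding requests, but all of them are being rejected, so they are not counted as "pages"; they could be returning 404, 503, and so on.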
            
            
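You could also have some pages timing out. Scrapy does not stop working in that case because it is asynchronous by nature, so the log lines can keep appearing even while Scrapy is waiting for a proper response.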
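
If timeouts are the suspect, the downloader and retry settings are one place to look. This is only a sketch for settings.py: the setting names are Scrapy's, but the values are illustrative guesses, not recommendations for this site.

```python
# settings.py (fragment)

# Give up on a response after 30 seconds instead of the default 180,
# so a hanging request does not leave the crawl idle for minutes.
DOWNLOAD_TIMEOUT = 30

# Retry failed downloads (timeouts, connection errors, some 5xx responses).
RETRY_ENABLED = True
RETRY_TIMES = 5  # retries in addition to the first attempt; the default is 2
```

With the default 180-second timeout, a few stuck requests can be enough to produce several "0 pages/min" log lines in a row.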
            
            
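You can also configure a Scrapy project to behave like this (never finishing, just staying alive), but from what you have shared that does not seem to be the case. Still, check what your extensions, pipelines, and middlewares are doing so you can be sure they are not interfering with your spider.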
            
            
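You can always kill the spider, but try to let it stop on its own, because it then also returns some stats that are very revealing about what happened during the run.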
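
The stats Scrapy dumps at shutdown are a plain dictionary, so they are easy to summarize. Below is a small sketch; `summarize_stats` is a hypothetical helper name, and the numbers in `example_stats` are made up for illustration rather than taken from the crawl in the question. The key names follow the ones Scrapy uses in its stats dump.

```python
from collections import Counter

STATUS_PREFIX = "downloader/response_status_count/"


def summarize_stats(stats):
    """Return (HTTP status code counts, number of requests dropped as duplicates)."""
    statuses = Counter()
    for key, value in stats.items():
        if key.startswith(STATUS_PREFIX):
            # The status code is the part of the key after the prefix.
            statuses[int(key[len(STATUS_PREFIX):])] += value
    return dict(statuses), stats.get("dupefilter/filtered", 0)


# Made-up numbers for illustration only.
example_stats = {
    "downloader/response_status_count/200": 41,
    "downloader/response_status_count/301": 12,
    "downloader/response_status_count/302": 12,
    "dupefilter/filtered": 12,
    "item_scraped_count": 39,
}

statuses, filtered = summarize_stats(example_stats)
print(statuses)
print(filtered)
```

A large share of 3xx, 4xx, or 5xx responses, or a non-zero duplicate count, would point toward rejected or filtered requests rather than a crash.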

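Question 3

I solved my problem. As the log below shows, these links were returning a 301 redirect to a different link, which then returned a 302 and redirected back to the original link. Scrapy then did not scrape the page, because it was a duplicate request.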
            
            2020-12-17 10:35:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.yingjiesheng.com/app/job.php?Action=FullTimeMore&Location=guangzhou&Source=Other&Page=86> from <GET http://www.yingjiesheng.com/guangzhou-morejob-86.html>
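
The effect can be reproduced with a toy model. The sketch below is not Scrapy's implementation: `BouncingServer` and `crawl` are hypothetical names, and the server behaviour (the page is served normally once the request has bounced through the detour URL) is an assumption, since the log only shows the two redirects.

```python
class BouncingServer:
    """Hypothetical server: the first hit on a page bounces through a detour URL."""

    def __init__(self):
        self.bounced = set()

    def fetch(self, url):
        if url.startswith("detour:"):
            # The detour always redirects back to the original page (the 302).
            return ("redirect", url[len("detour:"):])
        if url not in self.bounced:
            # First visit: send the client to the detour (the 301).
            self.bounced.add(url)
            return ("redirect", "detour:" + url)
        return ("ok", url)


def crawl(server, start_url, dont_filter=False, max_redirects=20):
    """Follow redirects, dropping any URL the duplicate filter has already seen."""
    seen = set()
    url = start_url
    for _ in range(max_redirects + 1):
        if url in seen and not dont_filter:
            return "filtered"
        seen.add(url)
        status, target = server.fetch(url)
        if status == "ok":
            return "scraped"
        url = target
    return "gave up"


print(crawl(BouncingServer(), "guangzhou-morejob-86.html"))
print(crawl(BouncingServer(), "guangzhou-morejob-86.html", dont_filter=True))
```

In Scrapy itself, the corresponding switch is the `dont_filter=True` argument of `scrapy.Request`, which tells the scheduler not to drop the request as a duplicate. Whether that is the exact change used here is not shown in the log above. If redirects can loop indefinitely, the `REDIRECT_MAX_TIMES` setting bounds how many are followed for one request.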