I've rewritten this question to make clearer what I need help with. I'm trying to scrape pages of search results like this one:

http://search.people.com.cn/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0

But when I run it in Scrapy, the requests seem to be redirected:

2020-01-10 09:55:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> from <GET http://search.people.com.cn/cnpeople/search.do?pageNum=7&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>

And then nothing is scraped.

Is that just the way the website works to redirect me to a list of results, or is it trying to prevent me scraping it? Is there anything I can do?

Below is my spider code:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "RMW"

    def start_requests(self):
        # Request result pages 1 through 9 of the search.
        for num in range(1, 10):
            url = ('http://search.people.com.cn/cnpeople/search.do?pageNum=' + str(num)
                   + '&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0')
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield the first link found inside each <ul> element.
        for link in response.css("ul"):
            yield {
                'link': link.css("a::attr(href)").get()
            }

I'd really appreciate any help resolving this from somebody with more expertise in the area.

Thanks, I had been looking at that. I’m not sure whether in my case the redirect is the server blocking my scraping or just part of the way the website delivers search results. I was hoping somebody might advise on this... – Nick Olczak Jan 10, 2020 at 11:12
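
One way to tell the two cases apart is to look at the redirect itself rather than letting the client follow it. The sketch below is a minimal, standard-library-only illustration and not part of the original spider: `probe_redirect` is a hypothetical helper name, and the `User-Agent` value is an arbitrary assumption. It sends a single GET, does not follow redirects, and returns the status code together with the `Location` header so the redirect target can be inspected.

```python
import http.client
from urllib.parse import urlsplit


def probe_redirect(url, timeout=10.0):
    """Send one GET without following redirects.

    Returns a (status, location) tuple, where location is the value of the
    Location header, or None if the server did not send one.
    """
    parts = urlsplit(url)
    # http.client never follows redirects on its own, so a 302 is returned as-is.
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Rebuild the request target from the path and the raw query string,
        # leaving the percent-encoded keyword untouched.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        # Assumption: a browser-like User-Agent; the value is arbitrary.
        conn.request("GET", target, headers={"User-Agent": "Mozilla/5.0"})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling `probe_redirect` on one of the search URLs above and comparing the result with what a browser's network tab shows for the same URL may indicate whether the 302 is part of the site's normal flow. Within Scrapy, setting `meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}` on the `scrapy.Request` should similarly hand the 302 response to the callback instead of following it.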
