python - Scrapy handle 302 redirections

I'm trying to crawl a website using a Scrapy CrawlSpider. The problem is that the website keeps redirecting me in a seemingly random pattern: the same URL sometimes loads and is sometimes redirected to a certain page. I tried changing my user agent, and I tried to mimic the browser by sending HTTP headers similar to the ones it sends. I even used Crawlera to send the requests, but nothing solved the problem. I'd be thankful if someone could guide me through this.

Console:

2017-11-06 02:11:14 [scrapy.core.engine] INFO: Spider opened
2017-11-06 02:11:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-06 02:11:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-11-06 02:11:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_intnl/dap/shopping-tourism.html> from <GET http://www.sears.com/en_intnl/dap/shopping-tourism.html>
2017-11-06 02:11:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/en_intnl/dap/shopping-tourism.html> (referer: None)
2017-11-06 02:11:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/en_us/sitemap.html>
2017-11-06 02:11:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/en_us/botmanagement.html> (referer: http://www.sears.com/en_intnl/dap/shopping-tourism.html)
2017-11-06 02:11:34 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.sears.com/gifts/b-1020009> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-11-06 02:11:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/seasonal-christmas/b-1100100> (referer: http://www.sears.com/en_intnl/dap/shopping-tourism.html)
2017-11-06 02:11:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/toys-games/b-1020010> (referer: http://www.sears.com/en_intnl/dap/shopping-tourism.html)
2017-11-06 02:11:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/home-decor-decorative-accents/b-1348893716>
2017-11-06 02:11:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/tvs-electronics-home-theater-audio-musical-instruments-guitars-string-instruments/b-5000861>
2017-11-06 02:12:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/tvs-electronics-gaming/b-1347529268>
Sears probably requires you to solve a captcha if you visit from any non-residential IP address (like the ones used by Crawlera). You'll have to solve the captcha to bypass the check, or test whether the appropriate cookie can be copied from your own browser.
– Blender
Nov 6, 2017 at 0:24
Thanks for replying. I disabled Crawlera and ran the spider on my own IP, and got the same result. In fact, the console log I attached comes from running the spider on my own IP.
– Mohamed Elmahdi
Nov 6, 2017 at 0:52
I got the captcha immediately when visiting the website with my typical browser. Try pulling the cookie from your home browser.
– Blender
Nov 6, 2017 at 0:56
The cookie is set by the website when you complete the captcha. There'd be no point in having a webpage called botmanagement.html if it were completely trivial to bypass.
– Blender
Nov 6, 2017 at 3:10
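To try the cookie suggestion from the comments, one option is to copy the Cookie request header from the browser's developer tools and turn it into a dict. The sketch below assumes that approach; `parse_cookie_header` is a hypothetical helper, and the cookie names in the example string are placeholders, since the thread does not say which cookie the site actually checks.

```python
def parse_cookie_header(header):
    """Turn a raw 'Cookie:' header value copied from a browser into a dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Placeholder header: the real names and values must come from your own browser.
RAW_COOKIE_HEADER = "session_id=abc123; captcha_token=xyz=="
BROWSER_COOKIES = parse_cookie_header(RAW_COOKIE_HEADER)
```

The resulting dict can then be passed to a request through Scrapy's `cookies` argument, for example `scrapy.Request(url, cookies=BROWSER_COOKIES)`. Whether the site accepts a cookie replayed from another client is not something this thread confirms.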
Check if response.url == "http://www.sears.com/en_us/botmanagement.html": in your callback to detect whether you have been redirected to the reCAPTCHA page.
Use Selenium with Scrapy. Selenium controls a browser directly, so you will be able to watch the whole scraping process and pass the reCAPTCHA manually.
Slow down your scraping speed to reduce the chance of being detected as a spider.
Public proxies
Gather proxies
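The redirect check and the slow-down advice can be sketched roughly as follows. This is a minimal sketch rather than a tested fix for this site: `is_bot_wall` and `THROTTLE_SETTINGS` are hypothetical names, and the delay values are arbitrary guesses. The setting keys themselves are standard Scrapy settings.

```python
from urllib.parse import urlsplit

# Path of the page suspected bots are redirected to (taken from the console log).
BOT_WALL_PATH = "/en_us/botmanagement.html"


def is_bot_wall(url):
    """Return True if the URL points at the bot-management page."""
    return urlsplit(url).path == BOT_WALL_PATH


# Standard Scrapy settings that slow the crawl down; the values are guesses.
# They can go in settings.py or in the spider's custom_settings dict.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 5,                  # seconds between requests to a site
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the delay around DOWNLOAD_DELAY
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
}
```

In a spider callback, `if is_bot_wall(response.url): return` would skip parsing the captcha page. When the redirect middleware followed a redirect, the originally requested URLs are available in `response.meta["redirect_urls"]`, which could be used to log or retry the blocked page later.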