import requests
import json
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
res = requests.get("https://www.walmart.com/cp/976759", headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
script = soup.find("script", {"id":"category"})
data = json.loads(script.get_text(strip=True))
with open("data.json", "w") as f:
    json.dump(data, f)
The complete data is stored in a script tag with id "category", as mentioned in the answer to "lxml web-scraping is returning empty values".
I have more pages to fetch, and it appears they are loaded through JavaScript too. How do I determine which script tag id the site data is stored in? For example, how do I determine the script tag id for these links:
https://www.walmart.com/cp/coffee/1086446?povid=976759+%7C+2018-12-26+%7C+Food%20Coffee%20Shop%20by%20Category%20Tile%201
and this one
https://www.walmart.com/browse/food/coffee/976759_1086446_1229654?povid=1086446+%7C++%7C+Coffee%20Bottle%20Coffee%20Featured%20Categories%20Collapsible
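For reference, a minimal sketch (reusing the same headers as above) that lists every script tag's id, type, and payload size on one of these pages, so the JSON-bearing candidates can be spotted by eye:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
url = "https://www.walmart.com/cp/coffee/1086446"
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
# print the id, type, and content length of every script tag on the page
for tag in soup.find_all("script"):
    print(tag.get("id"), tag.get("type"), len(tag.get_text(strip=True)))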
You can use regular expressions to match against attributes, and also to exclude attributes. I realise that the script tags you are looking for are all of type application/json, which I made the first filter, i.e. soup.find_all('script', {'type': 'application/json'}). Next, there are tags whose ids start with tb-djs-wlm, which refer to several images. I exclude them using the regular expression re.compile(r'^((?!tb-djs).)*$').
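As a quick check of what that pattern does, here is a small sketch (the sample ids are made up for illustration) showing which ids it accepts and which it rejects:
import re
pattern = re.compile(r'^((?!tb-djs).)*$')
# ids containing 'tb-djs' anywhere fail the lookahead; everything else matches
for sample in ['category', 'searchContent', 'tb-djs-wlm-3', 'tb-djs-wlm-hero']:
    print(sample, bool(pattern.match(sample)))
# category True
# searchContent True
# tb-djs-wlm-3 False
# tb-djs-wlm-hero False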
So, now we have:
from bs4 import BeautifulSoup
import requests
import re
session = requests.Session()
# your test urls
url1 = 'https://www.walmart.com/cp/coffee/1086446?povid=976759+%7C+2018-12-26+%7C+Food%20Coffee%20Shop%20by%20Category%20Tile%201'
url2 = 'https://www.walmart.com/browse/food/coffee/976759_1086446_1229654?povid=1086446+%7C++%7C+Coffee%20Bottle%20Coffee%20Featured%20Categories%20Collapsible'
url3 = 'https://www.walmart.com/cp/976759'
urls = [url1, url2, url3]
def find_tag(soup):
    script = soup.find('script', {'type': 'application/json', 'id': re.compile(r'^((?!tb-djs).)*$')})
    return script['id']
for url in urls:
    soup = BeautifulSoup(session.get(url).text, 'html.parser')
    print(find_tag(soup))
# category
# searchContent
# category
To get the content of the script tag you can use the json library on the bs4 tag element and simply load it with json.loads(script_soup.text).
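For example, putting both steps together in a self-contained sketch (the top-level keys inside the JSON will differ from page to page):
import json
import re
import requests
from bs4 import BeautifulSoup
session = requests.Session()
url = 'https://www.walmart.com/cp/976759'
soup = BeautifulSoup(session.get(url).text, 'html.parser')
# same filter as in find_tag: a JSON script tag whose id does not contain 'tb-djs'
script = soup.find('script', {'type': 'application/json', 'id': re.compile(r'^((?!tb-djs).)*$')})
data = json.loads(script.get_text(strip=True))
print(script['id'], list(data.keys()))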