import json
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
res = requests.get("https://www.walmart.com/cp/976759", headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
# the page data lives in a script tag with id "category"
script = soup.find("script", {"id":"category"})
data = json.loads(script.get_text(strip=True))
with open("data.json", "w") as f:
    json.dump(data, f)

The complete data is stored in a script tag whose id is category, as mentioned in this answer: "lxml web-scraping is returning empty values".

I have more pages to fetch, and they appear to be loaded through JavaScript too. How can I determine the id of the script tag that holds the site data? For example, how do I determine the script tag id for these links:

https://www.walmart.com/cp/coffee/1086446?povid=976759+%7C+2018-12-26+%7C+Food%20Coffee%20Shop%20by%20Category%20Tile%201

and this one

https://www.walmart.com/browse/food/coffee/976759_1086446_1229654?povid=1086446+%7C++%7C+Coffee%20Bottle%20Coffee%20Featured%20Categories%20Collapsible

You can use regular expressions to match against attributes, and also to exclude attributes. I noticed that the script tags you are looking for are all of type application/json, so I made that the first filter, i.e. soup.find_all('script', {'type': 'application/json'}). Next, there are tags whose ids start with tb-djs-wlm, which refer to several images. I exclude them using the regular expression re.compile(r'^((?!tb-djs).)*$'), which matches only ids that do not contain tb-djs.
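The exclusion pattern can be checked on its own before it is used with BeautifulSoup. The following is a minimal sketch using only the re module; the ids in sample_ids are hypothetical values chosen for illustration, not ids read from the actual pages.

```python
import re

# Same pattern as in the answer: it matches only strings in which
# "tb-djs" never appears (negative lookahead before every character).
not_tb_djs = re.compile(r'^((?!tb-djs).)*$')

# Hypothetical ids, for illustration only.
sample_ids = ['category', 'searchContent', 'tb-djs-wlm-1', 'x-tb-djs']

# Keep only the ids the pattern accepts.
kept = [i for i in sample_ids if not_tb_djs.search(i)]
print(kept)
```

Note that the pattern rejects ids containing tb-djs anywhere, not only at the start.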

So, now we have:

from bs4 import BeautifulSoup
import requests
import re
session = requests.Session()
# your test urls
url1 = 'https://www.walmart.com/cp/coffee/1086446?povid=976759+%7C+2018-12-26+%7C+Food%20Coffee%20Shop%20by%20Category%20Tile%201'
url2 = 'https://www.walmart.com/browse/food/coffee/976759_1086446_1229654?povid=1086446+%7C++%7C+Coffee%20Bottle%20Coffee%20Featured%20Categories%20Collapsible'
url3 = 'https://www.walmart.com/cp/976759'
urls = [url1, url2, url3]
def find_tag(soup):
    # first JSON script tag whose id does not contain "tb-djs"
    script = soup.find('script', {'type': 'application/json', 'id':re.compile(r'^((?!tb-djs).)*$')})
    return script['id']
for url in urls:
    soup = BeautifulSoup(session.get(url).text, 'html.parser')
    print(find_tag(soup))
# category
# searchContent
# category

To get the content of the script, use the json library on the bs4 tag element: if script_soup is the matched script tag, simply load it with json.loads(script_soup.text).
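Putting both steps together might look roughly like the sketch below. It runs against a small made-up HTML string instead of a live request, so the markup and its JSON payloads are assumptions, and load_script_json is a hypothetical helper name. It reads the tag content through .string and returns None values when no tag matches, which the find_tag function above does not handle.

```python
import json
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page (assumption, not real site markup).
html = '''
<script type="application/json" id="tb-djs-wlm-0">{"img": "a.png"}</script>
<script type="application/json" id="category">{"title": "Coffee", "items": [1, 2]}</script>
'''

def load_script_json(soup):
    """Return (id, parsed JSON) of the first matching script tag, or (None, None)."""
    script = soup.find('script', {'type': 'application/json',
                                  'id': re.compile(r'^((?!tb-djs).)*$')})
    if script is None:
        return None, None
    # .string is the tag's single text child, i.e. the raw JSON text
    return script['id'], json.loads(script.string)

soup = BeautifulSoup(html, 'html.parser')
tag_id, data = load_script_json(soup)
print(tag_id, data['title'])
```

For a real page, soup would instead be built from session.get(url).text as in the loop above.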
