Problem scraping a dynamic page with a Python crawler?

Target page: Editor's Choice - Photos import requests from lxml import etree url = 'https://pixabay…

1 answer

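In [12]: from bs4 import BeautifulSoup  # missing from the original session
In [13]: import requests
In [14]: r = requests.get('https://pixabay.com/zh/editors_choice/')
In [15]: soup = BeautifulSoup(r.text, 'lxml')
In [16]: # lazily loaded images keep their real URLs in data-lazy-srcset
    ...: for img in soup.find_all('img', attrs={'data-lazy-srcset': True}):
    ...:     print(img['data-lazy-srcset'])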

out:

https://cdn.pixabay.com/photo/2017/02/19/15/28/italy-2080072__340.jpg 1x, https://cdn.pixabay.com/photo/2017/02/19/15/28/italy-2080072__480.jpg 2x
https://cdn.pixabay.com/photo/2017/02/26/21/39/rose-2101475__340.jpg 1x, https://cdn.pixabay.com/photo/2017/02/26/21/39/rose-2101475__480.jpg 2x
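
Each printed value is a srcset-style string: comma-separated candidates, each a URL followed by a density descriptor such as 1x or 2x. If you only want one image URL per entry, something like the following might work. It is a minimal sketch, not part of the original answer: the helper names parse_srcset and largest_url are hypothetical, and it assumes the URLs contain no commas and that only "x" descriptors appear, as in the output above.

```python
def parse_srcset(srcset):
    """Split a srcset-style string into (url, descriptor) pairs."""
    candidates = []
    for part in srcset.split(','):
        fields = part.split()
        if not fields:
            continue  # skip empty pieces, e.g. from a trailing comma
        url = fields[0]
        # A candidate without a descriptor is treated as 1x.
        descriptor = fields[1] if len(fields) > 1 else '1x'
        candidates.append((url, descriptor))
    return candidates


def largest_url(srcset):
    """Return the URL with the highest pixel-density descriptor."""
    def density(candidate):
        descriptor = candidate[1]
        # Only 'Nx' descriptors are understood; anything else counts as 1.0.
        return float(descriptor[:-1]) if descriptor.endswith('x') else 1.0
    return max(parse_srcset(srcset), key=density)[0]


# Sample value copied from the output above.
sample = ('https://cdn.pixabay.com/photo/2017/02/19/15/28/italy-2080072__340.jpg 1x, '
          'https://cdn.pixabay.com/photo/2017/02/19/15/28/italy-2080072__480.jpg 2x')
print(largest_url(sample))
```

In the loop above you could call largest_url(img['data-lazy-srcset']) instead of printing the raw attribute.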

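The HTML you see in the browser has already been modified by JavaScript. If you fetch the page directly with requests and print the response, you will see this instead: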

<div class="item" data-h="427" data-w="640">