Using the test automation tool Selenium to expose the truth behind a scam

大神带我来搬砖

2018-07-09 13:25 IP location: Liaoning

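A few days ago I wrote about using a web crawler to expose the truth behind a paid-writing scam. In practice, though, dynamically loaded data is hard to fetch with a plain crawler. In that situation Selenium can simulate browser behavior and reach the same goal.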

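Once Python is installed, install Selenium with pip and download the driver that matches your browser, and you are ready to go. This time we use Selenium to open a user's timeline page and then keep scrolling down until the text "加入了简书" ("joined Jianshu") appears on the page.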
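The stopping rule described above (keep scrolling until the marker text shows up) can be sketched independently of any browser. This is only a minimal illustration: `scroll_until`, `FakePage`, and the `max_rounds` safety cap are hypothetical names that do not appear in the original script, and the fake page merely stands in for a real timeline.

```python
from typing import Callable


def scroll_until(
    get_text: Callable[[], str],
    scroll_once: Callable[[], None],
    marker: str,
    max_rounds: int = 1000,
) -> bool:
    """Scroll repeatedly until `marker` shows up in the page text.

    Returns True if the marker was found, False if `max_rounds` scrolls
    were used up first (the cap is an assumption, not part of the article).
    """
    for _ in range(max_rounds):
        if marker in get_text():
            return True
        scroll_once()
    return marker in get_text()


class FakePage:
    """Stand-in for a browser page: each scroll reveals one more entry."""

    def __init__(self, entries):
        self._entries = entries
        self._visible = 1
        self.scrolls = 0

    def text(self) -> str:
        return "\n".join(self._entries[: self._visible])

    def scroll(self) -> None:
        self.scrolls += 1
        self._visible = min(self._visible + 1, len(self._entries))


page = FakePage(["发表了文章", "关注了作者", "加入了简书"])
found = scroll_until(page.text, page.scroll, "加入了简书")
print(found, page.scrolls)
```

With a real browser, `get_text` would presumably read the text of the timeline list and `scroll_once` would run a scrolling script through Selenium. The cap guards against looping forever if the marker never appears.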

Scrolling the page with Selenium

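Normally we scroll a page in the browser with the mouse wheel, and Selenium can simulate mouse actions as well. This time, however, we scroll with JavaScript instead. The scrolling code is as follows: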

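    js = "document.documentElement.scrollTop=%d" % step
    browser.execute_script(js)
    time.sleep(0.2)

Here step is the scroll offset in pixels. It is increased on every iteration, so the page scrolls down automatically.

Stopping Chrome from loading images automatically

By default the browser loads images. To speed things up, we tell Chrome not to load them:

    options = webdriver.ChromeOptions()
    # 2 means "block images"
    prefs = {"profile.managed_default_content_settings.images": 2}
    options.add_experimental_option("prefs", prefs)
    browser = webdriver.Chrome(options=options)

Running Chrome in headless mode

Once too many timeline entries have loaded, the browser still freezes. At that point it is worth running Chrome in headless mode. Headless Chrome shows no graphical user interface, so it runs faster.

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')

Removing elements from the page

Even in headless mode the browser kept getting slower, most likely because the page held too many elements to render. The fix I came up with was to delete page elements with JavaScript. Note that the last li element must be kept so that max_id can still be computed.

    var nodeList = document.querySelectorAll("#list-container > ul > li");
    for (var i = 0; i < nodeList.length - 1; i++) {
        nodeList[i].remove();
    }

Running this JS code from Selenium is all it takes. Because elements are now being deleted, the scrolling code needs a small change too: jump back to the top first and then scroll down, so that loading more entries is reliably triggered.

    browser.execute_script("document.documentElement.scrollTop=0")
    browser.execute_script("document.documentElement.scrollTop=1600")

Analyzing the crawl results

Searching the crawled timeline entries still turns up no "大神带我来搬砖". The complete script:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    # Block image loading to speed up the page
    prefs = {"profile.managed_default_content_settings.images": 2}
    options.add_experimental_option("prefs", prefs)
    browser = webdriver.Chrome(options=options)
    browser.set_page_load_timeout(60)
    browser.get("https://www.jianshu.com/users/5aa8494a18c8/timeline")
    time.sleep(5)

    # Remove every li except the last one, which is needed to compute max_id
    remove_js = '''var nodeList=document.querySelectorAll("#list-container > ul > li");for(var i=0;i<nodeList.length-1;i++){nodeList[i].remove()}'''

    with open("browser.txt", 'w', encoding='utf-8') as file:
        while True:
            text = browser.find_element(By.XPATH, '//*[@id="list-container"]/ul').text
            file.write(text)
            browser.execute_script(remove_js)
            # Jump to the top, then scroll down so more entries get loaded
            browser.execute_script("document.documentElement.scrollTop=0")
            browser.execute_script("document.documentElement.scrollTop=1600")
            time.sleep(10)
            if '加入了简书' in text:
                print("end")
                break

    browser.quit()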
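The search mentioned under the result analysis could be done by hand in a text editor, but a few lines of Python work as well. The sketch below assumes the dump was written to browser.txt as in the script; `count_mentions` is a hypothetical helper name, not something from the original post.

```python
from pathlib import Path


def count_mentions(path: str, keyword: str) -> int:
    """Return how many lines of the text dump at `path` contain `keyword`."""
    text = Path(path).read_text(encoding="utf-8")
    return sum(1 for line in text.splitlines() if keyword in line)


if __name__ == "__main__":
    dump = Path("browser.txt")  # file name taken from the script above
    if dump.exists():
        # 0 means the user name never shows up in the crawled timeline
        print(count_mentions(str(dump), "大神带我来搬砖"))
```

A count of 0 would match the finding reported here, namely that the name does not occur anywhere in the crawled entries.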

Last edited: 2018-07-09 13:37