I was trying to make a Wikipedia crawler that gets the "See also" link text and then follows the URLs those tags link to. However, the "See also" part of the article (which is an unordered list) doesn't have any class or id, so I get it with the method find_next_sibling. Next, it goes through every linked Wikipedia page there and does the same thing. This is my code:
import requests
from bs4 import BeautifulSoup

def wikipediaCrawler(page, maxPages):
    pageNumber = 1
    while pageNumber < maxPages:
        url = "https://en.wikipedia.org" + page
        sourceCode = requests.get(url)
        print(sourceCode)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        ul = soup.find("h2", text="See also").find_next_sibling("ul")
        for li in ul.findAll("li"):
            print(li.get_text())
        for link in ul.findAll('a'):
            page = str(link.get('href'))
            print(page)
        pageNumber += 1

wikipediaCrawler("/wiki/Online_chat", 3)
It prints the first page normally.
The problem is that whenever it tries to switch to the next page, I get this error:
Traceback (most recent call last):
File "C:/Users/Shaman/PycharmProjects/WebCrawler/main.py", line 23, in <module>
wikipediaCrawler("/wiki/Online_chat", 3)
File "C:/Users/Shaman/PycharmProjects/WebCrawler/main.py", line 14, in wikipediaCrawler
ul = soup.find("h2", text="See also").find_next_sibling("ul")
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
I print the response from requests and it says "Response<200>", so it doesn't seem to be a permissions issue. I honestly have no clue why it happens. Any ideas? Thanks in advance.
Edit: I know that the Wikipedia articles it crawls all contain a tag with the text "See also". In this case it searched the "Voice_chat" article and didn't find anything, despite the section being there.
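A minimal diagnostic sketch (my addition, using the same libraries as the question) that fetches the article that failed and prints each h2's .string next to its visible text; find("h2", text=...) matches against .string, so this shows whether the heading ever matches:

import requests
from bs4 import BeautifulSoup

# Inspect every h2 on the page that failed: find("h2", text=...) compares
# against h2.string, and .string is None whenever the h2 has more than one
# child (on Wikipedia, often a headline span plus an edit-section span).
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Voice_chat").text,
                     "html.parser")
for h2 in soup.find_all("h2"):
    print(repr(h2.string), "->", h2.get_text(strip=True))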
I think you want the <ul> after the h2 tag that starts the "See also" section. One way to find that h2 is to use CSS selectors to find the right tag, then grab the parent element (the h2), and then get the next sibling from there:
import requests
from bs4 import BeautifulSoup

def wikipediaCrawler(page, maxPages):
    url = "https://en.wikipedia.org" + page
    plainText = requests.get(url).text
    soup = BeautifulSoup(plainText, "html.parser")
    # The heading text sits in a span with id="See_also"; step up to its parent h2.
    see_also = soup.select("h2 > #See_also")[0]
    ul = see_also.parent.find_next_sibling("ul")
    for link in ul.findAll('a'):
        page = str(link.get('href'))
        print(page)

wikipediaCrawler("/wiki/Online_chat", 3)
Output:
/wiki/Chat_room
/wiki/Collaborative_software
/wiki/Instant_messaging
/wiki/Internet_forum
/wiki/List_of_virtual_communities_with_more_than_100_million_active_users
/wiki/Online_dating_service
/wiki/Real-time_text
/wiki/Videotelephony
/wiki/Voice_chat
/wiki/Comparison_of_VoIP_software
/wiki/Massively_multiplayer_online_game
/wiki/Online_game
/wiki/Video_game_culture
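The snippet above handles a single page. As a rough sketch of how it could fold back into the question's multi-page loop (my assumption, not part of the original answer), with an explicit guard for pages that have no "See also" section:

import requests
from bs4 import BeautifulSoup

def wikipediaCrawler(page, maxPages):
    pageNumber = 1
    while pageNumber < maxPages:
        url = "https://en.wikipedia.org" + page
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # select() returns an empty list when nothing matches, so a page
        # without a "See also" section can be skipped instead of crashing.
        matches = soup.select("h2 > #See_also")
        if not matches:
            print("no 'See also' section on", page)
            return
        ul = matches[0].parent.find_next_sibling("ul")
        for link in ul.findAll("a"):
            page = str(link.get("href"))
            print(page)
        pageNumber += 1

wikipediaCrawler("/wiki/Online_chat", 3)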
The piece of code soup.find("h2", text="See also") sometimes just cannot find the element and then returns None. A quick fix is to catch the error:
import requests
from bs4 import BeautifulSoup

def wikipediaCrawler(page, maxPages):
    pageNumber = 1
    while pageNumber < maxPages:
        url = "https://en.wikipedia.org" + page
        sourceCode = requests.get(url)
        print(sourceCode)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        try:
            ul = soup.find("h2", text="See also").find_next_sibling("ul")
            for li in ul.findAll("li"):
                print('li: ', pageNumber, li.get_text())
            for link in ul.findAll('a'):
                page = str(link.get('href'))
                print('a:', pageNumber, page)
        except Exception as e:
            print(e)
            print(soup.find("h2", text="See also"))
        pageNumber += 1

wikipediaCrawler("/wiki/Online_chat", 3)
I added a small change to the print statements to make debugging easier.
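Catching every Exception like this also hides unrelated bugs. A leaner alternative sketch (printSeeAlso is a hypothetical helper name, not from the answer) tests for None explicitly and leaves other errors alone:

from bs4 import BeautifulSoup

def printSeeAlso(soup, pageNumber):
    # Explicit None check instead of a catch-all except, so unrelated
    # errors in the loop body still raise real tracebacks.
    heading = soup.find("h2", text="See also")
    if heading is None:
        print("no 'See also' heading on page", pageNumber)
        return
    ul = heading.find_next_sibling("ul")
    for li in ul.findAll("li"):
        print('li: ', pageNumber, li.get_text())
    for link in ul.findAll('a'):
        print('a:', pageNumber, link.get('href'))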