
I was trying to make a Wikipedia crawler that gets the "See also" link texts and then visits the URLs those links point to. However, the "See also" part of the article (an unordered list) doesn't have any class or id, so I get it with the find_next_sibling method. Next, the crawler goes through every linked Wikipedia page there and does the same thing. This is my code:

import requests
from bs4 import BeautifulSoup
def wikipediaCrawler(page, maxPages):
    pageNumber = 1
    while pageNumber < maxPages:
        url = "https://en.wikipedia.org" + page
        sourceCode = requests.get(url)
        print(sourceCode)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        ul = soup.find("h2", text="See also").find_next_sibling("ul")
        for li in ul.findAll("li"):
            print(li.get_text())
        for link in ul.findAll('a'):
            page = str(link.get('href'))
            print(page)
        pageNumber += 1
wikipediaCrawler("/wiki/Online_chat", 3)

It prints the first page normally. The problem is that whenever it tries to switch the page I get this error:

Traceback (most recent call last):
  File "C:/Users/Shaman/PycharmProjects/WebCrawler/main.py", line 23, in <module>
    wikipediaCrawler("/wiki/Online_chat", 3)
  File "C:/Users/Shaman/PycharmProjects/WebCrawler/main.py", line 14, in wikipediaCrawler
    ul = soup.find("h2", text="See also").find_next_sibling("ul")
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

I printed the requests response and it says "Response<200>", so it doesn't seem like a permission issue. I honestly have no clue why this happens. Any ideas? Thanks in advance

Edit: I know that the Wikipedia articles it searches all contain a tag with the text "See also". In this case it searched the "Voice_chat" article and didn't find anything, despite the tag being there.

I know from the source code that on every page an h2 with the text "See also" exists, and the next sibling with the ul tag exists too – Marcin Gawronek Jun 20, 2018 at 5:03

I think you want the <ul> after the h2 tag that starts the "See also" section.

One way to find that h2 is to use CSS selectors to find the right tag, then grab the parent element (the h2), and then get the next sibling from there:

import requests
from bs4 import BeautifulSoup

def wikipediaCrawler(page, maxPages):
    url = "https://en.wikipedia.org" + page
    plainText = requests.get(url).text
    soup = BeautifulSoup(plainText, "html.parser")
    # "See_also" is the id of the span nested inside the section's h2
    see_also = soup.select("h2 > #See_also")[0]
    ul = see_also.parent.find_next_sibling("ul")
    for link in ul.findAll('a'):
        page = str(link.get('href'))
        print(page)

wikipediaCrawler("/wiki/Online_chat", 3)

Output:

/wiki/Chat_room
/wiki/Collaborative_software
/wiki/Instant_messaging
/wiki/Internet_forum
/wiki/List_of_virtual_communities_with_more_than_100_million_active_users
/wiki/Online_dating_service
/wiki/Real-time_text
/wiki/Videotelephony
/wiki/Voice_chat
/wiki/Comparison_of_VoIP_software
/wiki/Massively_multiplayer_online_game
/wiki/Online_game
/wiki/Video_game_culture
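The likely reason soup.find("h2", text="See also") returns None: in Wikipedia's markup the visible heading text lives in a child <span class="mw-headline" id="See_also">, and the h2 also contains an edit-section span, so the h2 itself has no single string for the text= filter to match. A minimal sketch of the failure and the workaround, using a hand-written HTML snippet standing in for the real page:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for Wikipedia's heading markup: the h2 holds two
# spans (headline + "[edit]" link), so h2.string is None.
html = ('<h2><span class="mw-headline" id="See_also">See also</span>'
        '<span class="mw-editsection">[edit]</span></h2>'
        '<ul><li><a href="/wiki/Chat_room">Chat room</a></li></ul>')
soup = BeautifulSoup(html, "html.parser")

# text= compares against tag.string, which is None here -> no match
print(soup.find("h2", text="See also"))    # prints None

# Matching the span by id, then walking up to the h2, works
heading = soup.select("h2 > #See_also")[0]
ul = heading.parent.find_next_sibling("ul")
print([a.get("href") for a in ul.find_all("a")])   # prints ['/wiki/Chat_room']
```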

The call soup.find("h2", text="See also") sometimes just cannot find the element, and then it returns None.

A quick fix is to catch the error:

import requests
from bs4 import BeautifulSoup

def wikipediaCrawler(page, maxPages):
    pageNumber = 1
    while pageNumber < maxPages:
        url = "https://en.wikipedia.org" + page
        sourceCode = requests.get(url)
        print(sourceCode)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        try:
            ul = soup.find("h2", text="See also").find_next_sibling("ul")
            for li in ul.findAll("li"):
                print('li: ', pageNumber, li.get_text())
            for link in ul.findAll('a'):
                page = str(link.get('href'))
                print('a:', pageNumber, page)
        except Exception as e:
            print(e)
            print(soup.find("h2", text="See also"))
        pageNumber += 1

wikipediaCrawler("/wiki/Online_chat", 3)

I added a small change in printing to debug easier.

It prints: <Response [200]>, 'NoneType' object has no attribute 'find_next_sibling', None. It seems it gets to the webpage, but then it can't find the h2, whether it exists or not – Marcin Gawronek Jun 20, 2018 at 5:24
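Combining the two answers, a more defensive variant (a sketch; the helper name see_also_links is my own) looks the heading up by its id rather than by text, and returns an empty list when either the heading or the following ul is missing, so the crawler can simply skip such pages:

```python
from bs4 import BeautifulSoup

def see_also_links(html):
    """Return the hrefs under a page's "See also" list, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(id="See_also")          # the span inside the <h2>
    if heading is None:
        return []
    ul = heading.parent.find_next_sibling("ul")
    if ul is None:
        return []
    return [a.get("href") for a in ul.find_all("a") if a.get("href")]

# Hypothetical usage with requests:
# html = requests.get("https://en.wikipedia.org/wiki/Online_chat").text
# for href in see_also_links(html):
#     print(href)
```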
