I am trying to scrape a website with BeautifulSoup but am having a problem.
I was following a tutorial written for Python 2.7 that used exactly the same code and had no problems.
import urllib.request
from bs4 import *
htmlfile = urllib.request.urlopen("http://en.wikipedia.org/wiki/Steve_Jobs")
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext)
title = (soup.title.text)
body = soup.find("Born").findNext('td')
print (body.text)
If I try to run the program I get:
Traceback (most recent call last):
  File "C:\Users\USER\Documents\Python Programs\World Population.py", line 13, in <module>
    body = soup.find("Born").findNext('p')
AttributeError: 'NoneType' object has no attribute 'findNext'
Is this a problem with Python 3, or am I just too naive?
The find and find_all methods do not search for arbitrary text in the document; they search for HTML tags. The documentation makes that clear (my italics):
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match. This is the simplest usage:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
That's why your soup.find("Born") is returning None, and hence why it complains about NoneType (the type of None) having no findNext() method.
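To make the difference concrete, here is a minimal sketch that runs offline. The HTML string is a hypothetical, simplified table row, not the real Wikipedia markup, and the variable names are only for illustration. It shows that a bare string passed to find is treated as a tag name, while the string argument (available in recent Beautiful Soup 4 releases) matches on a tag's text:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment; the real page markup is larger and may differ.
html = """
<table>
  <tr>
    <th scope="row">Born</th>
    <td><span class="nickname">Steven Paul Jobs</span><br/>
        <span class="bday">1955-02-24</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "Born" is treated as a tag name; there is no <Born> tag, so this is None.
missing = soup.find("Born")

# Match on the text instead: a <th> whose string is exactly "Born".
header = soup.find("th", string="Born")

# Check for None before chaining, then step to the following <td>.
cell = header.find_next("td") if header is not None else None
nickname = cell.find("span", class_="nickname").text if cell is not None else None

print(missing)
print(nickname)
```

Checking for None before calling find_next is what avoids the AttributeError from the question when the page layout changes.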
That page you reference contains (at the time this answer was written) eight copies of the word "born", none of which are tags.
Looking at the HTML source for that page, you'll find the best option may be to look for the correct span (formatted for readability):
<th scope="row" style="text-align: left;">Born</th>
<span class="nickname">Steven Paul Jobs</span><br />
<span style="display: none;">(<span class="bday">1955-02-24</span>)</span>February 24, 1955<br />
The find method looks for tags, not text. To find the name, birthday and birthplace, you would have to look up the span elements with the corresponding class name, and access the text attribute of that item:
import urllib.request
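from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(urllib.request.urlopen("http://en.wikipedia.org/wiki/Steve_Jobs"), "html.parser")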
title = soup.title.text
name = soup.find('span', {'class': 'nickname'}).text
bday = soup.find('span', {'class': 'bday'}).text
birthplace = soup.find('span', {'class': 'birthplace'}).text
print(name)
print(bday)
print(birthplace)
Output:
Steven Paul Jobs
1955-02-24
San Francisco, California, US
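The class names above depend on the page markup at the time of writing, so any of those lookups can return None if the page changes. A small defensive wrapper might look like the sketch below; span_text is a hypothetical helper name, and the HTML string is a stand-in for the downloaded page that deliberately omits the birthplace span:

```python
from bs4 import BeautifulSoup

def span_text(soup, css_class):
    """Return the stripped text of the first <span> with css_class, or None if absent."""
    tag = soup.find("span", class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None

# Hypothetical stand-in for the downloaded page; it has no "birthplace" span on purpose.
html = '<span class="nickname">Steven Paul Jobs</span><span class="bday">1955-02-24</span>'
soup = BeautifulSoup(html, "html.parser")

name = span_text(soup, "nickname")
bday = span_text(soup, "bday")
birthplace = span_text(soup, "birthplace")  # missing span gives None, not an exception

print(name, bday, birthplace)
```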
PS: You don't have to call read on urlopen; BS accepts file-like objects.
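As a quick offline illustration of that last point, the sketch below passes a file-like object straight to BeautifulSoup. Here io.BytesIO is an assumption standing in for the response returned by urlopen; both expose a read() method, which is what Beautiful Soup relies on:

```python
import io
from bs4 import BeautifulSoup

# io.BytesIO stands in for the urlopen response so the example needs no network.
fake_response = io.BytesIO(b"<html><head><title>Steve Jobs</title></head></html>")

# No explicit .read() call: the file-like object is handed over directly.
soup = BeautifulSoup(fake_response, "html.parser")
title = soup.title.text

print(title)
```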