I'm writing my first "real" project, a web crawler, and I don't know how to fix this error. Here's my code
import requests
from bs4 import BeautifulSoup
def main_spider(max_pages):
page = 1
for page in range(1, max_pages+1):
url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll("a"):
href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
print(href)
page += 1
main_spider(1)
Here's the error
href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
TypeError: must be str, not NoneType
As noted by @Shiping, your code is not indented properly; I corrected it below.
Also, link.get('href') is not returning a string in one of the cases.
import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)

main_spider(1)
To see what was happening, I added a few lines of code between your existing lines and removed the offending line (for the time being).
soup = BeautifulSoup(plain_text, "html.parser")
print('All anchor tags:', soup.findAll('a'))  ### ADDED
for link in soup.findAll("a"):
    print(type(link.get("href")), link.get("href"))  ### ADDED
The result of my additions was this (truncated for brevity):
Note that the first anchor does NOT have an href attribute, so link.get('href') has no value to return and returns None.
[<a id="top"></a>, <a href="#mw-head">navigation</a>,
<a href="#p-search">search</a>,
<a href="/wiki/Special:SiteMatrix" title="Special:SiteMatrix">sister...
<class 'NoneType'> None
<class 'str'> #mw-head
<class 'str'> #p-search
<class 'str'> /wiki/Special:SiteMatrix
<class 'str'> /wiki/File:Wiktionary-logo-v2.svg
To prevent the error, a possible solution would be to add a conditional or a try/except block to your code. I'll demo the conditional.
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll("a"):
    if link.get('href') is None:
        continue
    else:
        href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
        print(href)
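Another way to get the same effect, sketched here on a hand-written HTML snippet rather than the live page, is to ask Beautiful Soup for only those anchors that carry an href. The `html` string and the `hrefs` name are made up for this illustration; passing `href=True` as an attribute filter to `find_all` is my assumption about what fits here, not something the original code used.

```python
from bs4 import BeautifulSoup

# Hand-written sample modelled on the first anchors in the output above.
html = (
    '<a id="top"></a>'
    '<a href="#mw-head">navigation</a>'
    '<a href="/wiki/Special:SiteMatrix">sister</a>'
)

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that have an href attribute at all,
# so the <a id="top"> placeholder is never visited.
hrefs = [link["href"] for link in soup.find_all("a", href=True)]
print(hrefs)
```

With this filter the loop body no longer needs its own None check, because anchors without an href are excluded before the loop starts.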
The first anchor on the page has no href attribute, so link.get("href") returns None for it.
To fix this, check for None first:
if link.get('href') is not None:
    href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
    # do stuff here
Not all anchors (<a> elements) need to have an href attribute (see https://www.w3schools.com/tags/tag_a.asp):
In HTML5, the <a> tag is always a hyperlink, but if it has no href attribute, it is only a placeholder for a hyperlink.
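As a rough sketch, the None check can also live in a small helper so the loop stays short. `absolute_link` and `BASE_URL` are hypothetical names, and using `urllib.parse.urljoin` instead of plain string concatenation is a substitution on my part: it resolves values like /wiki/... against the page URL rather than appending them to it.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Star_Wars"


def absolute_link(href, base=BASE_URL):
    """Return an absolute URL for href, or None when the anchor had no href."""
    if href is None:
        return None
    return urljoin(base, href)


# Sample values of link.get("href"), including a missing one.
for href in [None, "#mw-head", "/wiki/Special:SiteMatrix"]:
    url = absolute_link(href)
    if url is not None:
        print(url)
```

Inside the crawler loop this would be called as `absolute_link(link.get("href"))`, skipping the link whenever the result is None.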
Actually, you already got the exception, and Python is great at handling exceptions, so why not catch it? This style is called "Easier to ask for forgiveness than permission" (EAFP) and is actually encouraged:
import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            # The following part is new:
            try:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href)
            except TypeError:
                # link.get("href") returned None (anchor without href); skip it
                pass

main_spider(1)
Also, the page = 1 and page += 1 lines can be omitted. The for page in range(1, max_pages+1): statement is already sufficient here.
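To try the try/except approach without making a network request, here is a minimal sketch that runs the same pattern over a hand-written list of values. `collect_links` and the sample list are invented for illustration; only the `BASE_URL + href` concatenation and the `except TypeError` come from the code above.

```python
BASE_URL = "https://en.wikipedia.org/wiki/Star_Wars"


def collect_links(hrefs):
    """Concatenate each href onto BASE_URL, skipping values that are not strings."""
    links = []
    for href in hrefs:
        try:
            links.append(BASE_URL + href)
        except TypeError:
            # str + None raises TypeError; skip anchors that had no href.
            continue
    return links


print(collect_links([None, "#mw-head", "#p-search"]))
```

Keeping the try block down to the single concatenation means an unrelated TypeError elsewhere in the loop would not be silently swallowed.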
I had the same error from different code. After adding a conditional inside a function, I thought that the return type was not being set properly, but what I realized was that when the condition was False, the return statement was not being called at all -- a change to my indentation fixed the problem.
I had the same error message in a similar situation.
I was concatenating strings too and one variable was supposed to be assigned a return value of a function.
But in one case there was no return value and the variable was "empty". This caused the same error message.
value = get_input()  # <-- make sure this always returns a value
print("input was " + value)
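The missing-return case described in these two answers can be reproduced in a few lines. The functions `describe` and `describe_fixed` below are hypothetical stand-ins, not anyone's actual code: the first one only returns inside the if branch, so it hands back None otherwise.

```python
def describe(value):
    """Buggy on purpose: the return only runs when value is truthy."""
    if value:
        return "value was " + str(value)
    # No return on this path, so Python returns None implicitly.


def describe_fixed(value):
    """Every path returns a string."""
    if value:
        return "value was " + str(value)
    return "no value"


print(describe(0) is None)
print("result: " + describe_fixed(0))

try:
    print("result: " + describe(0))
except TypeError as exc:
    # Same kind of error as in the question: str + None.
    print("TypeError:", exc)
```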