I'm writing my first "real" project, a web crawler, and I don't know how to fix this error. Here's my code:

import requests
from bs4 import BeautifulSoup
def main_spider(max_pages):
    page = 1
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)
    page += 1
main_spider(1)

Here's the error:

href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
TypeError: must be str, not NoneType  

I'm sorry for being so dumb. It seems I fixed it, but now I have a new problem: instead of getting all the links from the page, I'm getting the URL of the original page over and over again.
– Dylan Boyd
Apr 23, 2017 at 1:55
How come? You indented main_spider(1), so you shouldn't get anything. Or did you have two main_spider(1) lines, one inside the function itself?
– Shiping
Apr 23, 2017 at 2:05

As noted by @Shiping, your code is not indented properly; I corrected it below. Also, link.get('href') does not return a string in one of the cases.

import requests
from bs4 import BeautifulSoup
def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"): 
            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)
main_spider(1)

To see what was happening, I added a few lines between your existing ones and removed the offending line for the time being.

        soup = BeautifulSoup(plain_text, "html.parser")
        print('All anchor tags:', soup.findAll('a'))     ### ADDED
        for link in soup.findAll("a"): 
            print(type(link.get("href")), link.get("href"))  ### ADDED

The result of my additions was this (truncated for brevity). Note that the first anchor does NOT have an href attribute, so link.get('href') has no value to return and returns None:

[<a id="top"></a>, <a href="#mw-head">navigation</a>, 
<a href="#p-search">search</a>, 
<a href="/wiki/Special:SiteMatrix" title="Special:SiteMatrix">sister...   
<class 'NoneType'> None
<class 'str'> #mw-head
<class 'str'> #p-search
<class 'str'> /wiki/Special:SiteMatrix
<class 'str'> /wiki/File:Wiktionary-logo-v2.svg      

To prevent the error, you could add either a conditional or a try/except statement to your code. I'll demo the conditional.

        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"): 
            if link.get('href') is None:
                continue
            else:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href) 

Great, that works perfectly! However, is there a way I could limit the links I get back to only the ones about Star Wars? Since it's set up for every link on the page, could I limit the output to the main links?
– Dylan Boyd
Apr 23, 2017 at 2:18
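
One possible approach to that follow-up, sketched under assumptions: resolve each href against the page URL with the standard library's urllib.parse.urljoin rather than string concatenation, skip fragment links such as "#mw-head" (they only point back into the same page, which may be why the original URL shows up repeatedly), and keep only links whose path mentions the topic. The helper name filter_links, the "Star_Wars" keyword test, and the "/wiki/Star_Wars_(film)" sample value are illustrative assumptions, not something from the answer above.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://en.wikipedia.org/wiki/Star_Wars"


def filter_links(hrefs, page_url=PAGE_URL, keyword="Star_Wars"):
    """Resolve raw href values and keep those whose path contains keyword.

    hrefs may contain None (anchors without an href attribute).
    """
    results = []
    for href in hrefs:
        # Skip missing hrefs and same-page fragment links like "#mw-head".
        if href is None or href.startswith("#"):
            continue
        absolute = urljoin(page_url, href)
        # Keep each matching URL once, in first-seen order.
        if keyword in urlparse(absolute).path and absolute not in results:
            results.append(absolute)
    return results


# Sample values in the shape link.get("href") returns; the film link is made up.
sample = [None, "#mw-head", "/wiki/Special:SiteMatrix",
          "/wiki/Star_Wars_(film)", "/wiki/Star_Wars_(film)"]
print(filter_links(sample))
```

In the crawler, the list passed in would be the link.get("href") values collected in the loop. A substring test on the path is a crude filter; a stricter rule may be needed for real pages.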

Some anchors on the page have no href attribute (for example, <a id="top"></a>). For those, link.get("href") returns None.

To fix this, check for None first:

if link.get('href') is not None:
    href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
    # do stuff here
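
As an alternative sketch, Beautiful Soup can do this filtering at query time: passing href=True to find_all (the newer spelling of findAll) matches only tags that carry that attribute, so the loop never sees None. The inline HTML string is a stand-in modeled on the anchors printed earlier in the thread, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Stand-in HTML so no request is needed; the first anchor has no href.
html = (
    '<a id="top"></a>'
    '<a href="#mw-head">navigation</a>'
    '<a href="/wiki/Special:SiteMatrix">sister</a>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that have an href attribute.
hrefs = [link["href"] for link in soup.find_all("a", href=True)]
print(hrefs)
```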

Not all anchors (<a> elements) need to have an href attribute (see https://www.w3schools.com/tags/tag_a.asp):

In HTML5, the <a> tag is always a hyperlink, but if it has no href attribute, it is only a placeholder for a hyperlink.

You already got the exception, and Python is great at handling exceptions, so why not catch it? This style is called "Easier to ask for forgiveness than permission" (EAFP) and is actually encouraged:

import requests
from bs4 import BeautifulSoup
def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            # The following part is new:
            try:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href)
            except TypeError:
                pass
main_spider(1)

Also the page = 1 and page += 1 lines can be omitted. The for page in range(1, max_pages+1): instruction is already sufficient here.
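
To make the comparison concrete, here is a small offline sketch of both styles, EAFP and its counterpart "look before you leap" (LBYL), applied to a plain list of href values standing in for what link.get("href") returns. The function names are illustrative. One caveat: a try block that wraps more code can also swallow unrelated TypeErrors, so keeping it narrow is a reasonable precaution.

```python
BASE = "https://en.wikipedia.org/wiki/Star_Wars"

# Stand-in for the values link.get("href") produced in the thread.
hrefs = [None, "#mw-head", "#p-search"]


def build_eafp(values):
    """EAFP: attempt the concatenation and skip values that fail."""
    urls = []
    for value in values:
        try:
            urls.append(BASE + value)
        except TypeError:  # raised when value is None
            continue
    return urls


def build_lbyl(values):
    """LBYL: check for None before concatenating."""
    return [BASE + value for value in values if value is not None]


print(build_eafp(hrefs))
print(build_lbyl(hrefs))
```

Both functions return the same two URLs for this input; the choice between them is largely a matter of style.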

@DylanBoyd I updated the answer. I don't know how it could lead to a SyntaxError, but maybe something went wrong with the indentation or during copying. :)
– MSeifert
Apr 23, 2017 at 2:11
@DylanBoyd No problem if you choose another answer, but if you want to use a conditional I would recommend Jackywathy's answer instead. The continue is unnecessary, and then you don't need the else.
– MSeifert
Apr 23, 2017 at 2:32

I had the same error from different code. After adding a conditional inside a function, I thought the return type was not being set properly, but then I realized that when the condition was False, the return statement was never reached. A change to my indentation fixed the problem.
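
A minimal sketch of that situation, with made-up function names: when the return statement sits inside the conditional, the other path falls off the end of the function and Python returns None implicitly.

```python
def label(value):
    """Buggy: returns a string only when the condition is true."""
    if value > 0:
        return "positive"
    # Falling off the end of a function returns None implicitly.


def label_fixed(value):
    """Every path returns a string."""
    if value > 0:
        return "positive"
    return "not positive"


print(repr(label(-1)))
print("value is " + label_fixed(-1))
```

Concatenating the result of label(-1) to a string would raise the same kind of TypeError as in the question.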

I had the same error message in a similar situation.

I was concatenating strings too, and one variable was supposed to be assigned the return value of a function.

But in one case the function returned nothing, so the variable was "empty" (None). This caused the same error message.

user_input = get_input()  # <-- make sure this always returns a value
print("input was " + user_input)
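
If the function cannot be changed, the caller can guard against the missing value instead. A possible sketch, assuming a hypothetical get_input that returns None for blank text (the real function in this answer is not shown):

```python
def get_input(raw):
    """Hypothetical stand-in: returns stripped text, or None if blank."""
    text = raw.strip()
    if text:
        return text
    return None


def describe_input(raw):
    value = get_input(raw)
    if value is None:
        # Substitute an explicit placeholder instead of concatenating None.
        value = "<nothing>"
    return "input was " + value


print(describe_input("  hello "))
print(describe_input("   "))
```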
        
