I have a small utility that downloads an MP3 file from a website on a schedule and then builds/updates a podcast XML file, which I've added to iTunes.

The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.

I struggled to find a way to actually download the file in Python, which is why I resorted to using wget.

So, how do I download the file using Python?

Many of the answers below are not a satisfactory replacement for wget. Among other things, wget (1) preserves timestamps (2) auto-determines filename from url, appending .1 (etc.) if the file already exists (3) has many other options, some of which you may have put in your .wgetrc. If you want any of those, you have to implement them yourself in Python, but it's simpler to just invoke wget from Python. – ShreevatsaR Sep 27, 2016 at 17:22

Short solution for Python 3: import urllib.request; s = urllib.request.urlopen('http://example.com/').read().decode() – Basj Nov 26, 2019 at 9:47

wget is still a better approach, if you need to automatically retrieve filename and timestamps and handling duplicating files as stackoverflow.com/users/4958/shreevatsar stated. If the urls are variables, one can still handle in python using subprocess. – Tendai Mar 2 at 8:19
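Two of the wget behaviours mentioned in the comments above (taking the filename from the URL, and appending .1, .2, ... when the file already exists) can be approximated in a few lines. This is only a sketch: filename_from_url, unique_path, and the index.html fallback are names and defaults assumed for illustration, not wget's actual implementation.

```python
import os
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="index.html"):
    """Return the last path component of the URL, or a default if there is none."""
    path = urlsplit(url).path          # drops the query string and fragment
    name = os.path.basename(unquote(path))
    return name or default             # assumed fallback for URLs ending in "/"


def unique_path(path):
    """Append .1, .2, ... until the path does not collide with an existing file."""
    if not os.path.exists(path):
        return path
    counter = 1
    while os.path.exists("{0}.{1}".format(path, counter)):
        counter += 1
    return "{0}.{1}".format(path, counter)
```

A download routine could then write to unique_path(filename_from_url(url)) instead of a hard-coded name.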

One more, using urlretrieve :

import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

(for Python 2 use import urllib and urllib.urlretrieve)

Oddly enough, this worked for me on Windows when the urllib2 method wouldn't. The urllib2 method worked on Mac, though. – InFreefall May 15, 2011 at 21:49

Bug: file_size_dl += block_sz should be += len(buffer) since the last read is often not a full block_sz. Also on windows you need to open the output file as "wb" if it isn't a text file. – Eggplant Jeff May 25, 2011 at 17:53

Me too urllib and urllib2 didn't work but urlretrieve worked well, was getting frustrated - thanks :) – funk-shun Jul 12, 2011 at 6:08

Wrap the whole thing (except the definition of file_name) with if not os.path.isfile(file_name): to avoid overwriting podcasts! useful when running it as a cronjob with the urls found in a .html file – Sriram Murali May 1, 2012 at 20:15

According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future". docs.python.org/3/library/urllib.request.html#legacy-interface – Louis Yang Dec 24, 2020 at 22:21
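The os.path.isfile guard suggested in the comments could look roughly like this. It is a minimal sketch; download_if_missing is a hypothetical helper name, and the return value is an added convenience rather than anything urlretrieve provides.

```python
import os
import urllib.request


def download_if_missing(url, file_name):
    """Download url to file_name unless that file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.isfile(file_name):
        return False
    urllib.request.urlretrieve(url, file_name)
    return True
```

Calling it twice with the same file_name downloads only once, which is what a scheduled job usually wants.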

Use urllib.request.urlopen():

import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
    html = f.read().decode('utf-8')

This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
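For the error handling left out above, one possible shape is the following. It is a sketch under assumptions: fetch_text is a hypothetical name, the 10 second timeout is an arbitrary choice, and returning None on failure is just one convention among several.

```python
import urllib.error
import urllib.request


def fetch_text(url, encoding="utf-8", timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as f:
            return f.read().decode(encoding)
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 403.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error {0}: {1}".format(e.code, e.reason))
    except urllib.error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, missing file...
        print("Could not reach {0}: {1}".format(url, e.reason))
    return None
```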

On Python 2, the method is in urllib2:

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
                This won't work if there are spaces in the url you provide. In that case, you'll need to parse the url and urlencode the path.
– Jason Sundram
                Apr 14, 2010 at 21:17
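The fix described in this comment (split the URL and percent-encode its path) might be sketched like this. quote_url_path is a hypothetical helper name, and it only touches the path; a query string containing spaces would need similar treatment.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode unsafe characters (such as spaces) in the path of a URL."""
    parts = urlsplit(url)
    # Keep "/" as a separator and "%" so that sequences which are already
    # percent-encoded are not encoded a second time.
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment)
    )
```

The result can be passed to urlopen in place of the original URL.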
>>> import requests
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print(len(r.content))
10485760

You can run pip install requests to get it.

Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.

2015-12-30

People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:

from tqdm import tqdm
import requests
url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
with open("10MB", "wb") as handle:
    for data in tqdm(response.iter_content(chunk_size=8192)):
        handle.write(data)

This is essentially the implementation @kvance described 30 months ago.

How does this handle large files, does everything get stored into memory or can this be written to a file without large memory requirement? – Bibek Shrestha Dec 17, 2012 at 16:05

It is possible to stream large files by setting stream=True in the request. You can then call iter_content() on the response to read a chunk at a time. – kvance Jul 28, 2013 at 17:14

Why would a url library need to have a file unzip facility? Read the file from the url, save it and then unzip it in whatever way floats your boat. Also a zip file is not a 'folder' like it shows in windows, Its a file. – Harel Nov 15, 2013 at 16:36

@Ali: r.text: For text or unicode content. Returned as unicode. r.content: For binary content. Returned as bytes. Read about it here: docs.python-requests.org/en/latest/user/quickstart – hughdbrown Jan 17, 2016 at 18:44

I think a chunk_size argument is desirable along with stream=True. The default chunk_size is 1, which means, each chunk could be as small as 1 byte and so is very inefficient. – haridsv Oct 1, 2018 at 10:54
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
  output.write(mp3file.read())

The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.

The disadvantage of this solution is, that the entire file is loaded into ram before saved to disk, just something to keep in mind if using this for large files on a small system like a router with limited ram. – tripplet Nov 18, 2012 at 13:33

To avoid reading the whole file into memory, try passing an argument to file.read that is the number of bytes to read. See: gist.github.com/hughdbrown/c145b8385a2afa6570e2 – hughdbrown Oct 7, 2015 at 16:02

@hughdbrown I found your script useful, but have one question: can I use the file for post-processing? suppose I download a jpg file that I want to process with OpenCV, can I use the 'data' variable to keep working? or do I have to read it again from the downloaded file? – Rodrigo E. Principe Nov 16, 2016 at 12:29
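The chunked read suggested in the comments could be written along these lines for Python 3. This is a sketch, not the linked gist: download_in_chunks and the 64 KiB default chunk size are assumptions made for illustration.

```python
import urllib.request


def download_in_chunks(url, file_name, chunk_size=64 * 1024):
    """Copy url to file_name chunk by chunk and return the bytes written."""
    total = 0
    with urllib.request.urlopen(url) as response, open(file_name, "wb") as output:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty bytes object signals the end of the stream
                break
            output.write(chunk)
            total += len(chunk)
    return total
```

Only one chunk is held in memory at a time, so the size of the file no longer matters on a machine with little RAM.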
import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
  • urllib.request.urlretrieve

    import urllib.request
    urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
    

    Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)

  • Python 2

  • urllib2.urlopen (thanks Corey)

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()
    
  • urllib.urlretrieve (thanks PabloG)

    import urllib
    urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
                    It sure took a while, but there, finally is the easy straightforward api I expect from a python stdlib :)
    – ThorSummoner
                    Aug 4, 2017 at 20:52
                    @EdouardThiel If you click on urllib.request.urlretrieve above it'll bring you to that exact link. Cheers!
    – bmaupin
                    Dec 23, 2019 at 14:44
                    urllib.request.urlretrieve is documented as a "legacy interface" and "might become deprecated in the future".
    – gerrit
                    Mar 27, 2020 at 17:32
                    You should mention that you are getting a bunch of bytes that need to be handled after that.
    – thoroc
                    Jun 14, 2020 at 13:02
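On the point about bytes: response.read() returns a bytes object, so text has to be decoded before use. A minimal sketch, assuming the server may or may not declare a charset; fetch_decoded and the UTF-8 fallback are illustrative choices, not part of the answer above.

```python
import urllib.request


def fetch_decoded(url, fallback_encoding="utf-8"):
    """Return the body of url as str, using the charset the server declares."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # always bytes
        # get_content_charset() reads the charset from the Content-Type
        # header and returns None when the server does not declare one.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset)
```

For binary files such as MP3s, skip the decoding step and write the bytes straight to a file opened in 'wb' mode.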
    import requests
    def download(url):
        get_response = requests.get(url,stream=True)
        file_name  = url.split("/")[-1]
        with open(file_name, 'wb') as f:
            for chunk in get_response.iter_content(chunk_size=1024):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
    download("https://example.com/example.jpg")
                    Thanks, also, replace with open(file_name,... with with open('thisname'...) because it may throw an error
    – the sigmoid infinity
                    Nov 7, 2020 at 19:24
    
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from __future__ import ( division, absolute_import, print_function, unicode_literals )
    import sys, os, tempfile, logging
    if sys.version_info >= (3,):
        import urllib.request as urllib2
        import urllib.parse as urlparse
    else:
        import urllib2
        import urlparse
    def download_file(url, dest=None):
        """Download and save a file specified by url to dest directory."""
        u = urllib2.urlopen(url)
        scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
        filename = os.path.basename(path)
        if not filename:
            filename = 'downloaded.file'
        if dest:
            filename = os.path.join(dest, filename)
        with open(filename, 'wb') as f:
            meta = u.info()
            meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
            meta_length = meta_func("Content-Length")
            file_size = None
            if meta_length:
                file_size = int(meta_length[0])
            print("Downloading: {0} Bytes: {1}".format(url, file_size))
            file_size_dl = 0
            block_sz = 8192
            while True:
                buffer = u.read(block_sz)
                if not buffer:
                    break
                file_size_dl += len(buffer)
                f.write(buffer)
                status = "{0:16}".format(file_size_dl)
                if file_size:
                    status += "   [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
                status += chr(13)
                print(status, end="")
            print()
        return filename
    if __name__ == "__main__":  # Only run if this file is called directly
        print("Testing with 10MB download")
        url = "http://download.thinkbroadband.com/10MB.zip"
        filename = download_file(url)
        print(filename)
    

    A simple way that is compatible with both Python 2 and Python 3 comes with the six library:

    from six.moves import urllib
    urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
    
  • urllib.urlretrieve ('url_to_file', file_name)

  • urllib2.urlopen('url_to_file')

  • requests.get(url)

  • wget.download('url', file_name)

  • Note: urlopen and urlretrieve are found to perform relatively badly when downloading large files (size > 500 MB). requests.get stores the file in memory until the download is complete.

    You should change from -o to -O to avoid confusion, as it is in GNU wget. Or at least both options should be valid. – erik Jul 17, 2015 at 15:46

    @eric I am not sure that I want to make wget.py an in-place replacement for real wget. The -o already behaves differently - it is compatible with curl this way. Would a note in documentation help to resolve the issue? Or it is the essential feature for an utility with such name to be command line compatible? – anatoly techtonik Jul 17, 2015 at 20:24

    In Python 3 you can use urllib.request together with shutil. Both modules are part of the standard library, so there is nothing to install with pip.

    Run this code:

    import urllib.request
    import shutil
    url = "http://www.somewebsite.com/something.pdf"
    output_file = "save_this_name.pdf"
    with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    

    Note that the third-party urllib3 package is a different library from the standard urllib used in this code.

    I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:

    import urllib
    response = urllib.urlopen('http://www.example.com/sound.mp3')
    mp3 = response.read()
    

    This works fine. Or, if you don't want to deal with the "response" object, you can call read() directly:

    import urllib
    mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
    
    from parallel_sync import wget
    urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
    wget.download('/tmp', urls)
    # or a single file:
    wget.download('/tmp', urls[0], filenames='x.zip', extract=True)
    https://pythonhosted.org/parallel_sync/pages/examples.html

    This is pretty powerful. It can download files in parallel, retry upon failure, and even download files on a remote machine.

    You can get the progress feedback with urlretrieve as well:

    import sys
    import urllib.request
    def report(blocknr, blocksize, size):
        # called by urlretrieve after each block; size is the total file size
        current = blocknr*blocksize
        sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))
    def downloadFile(url):
        print("\n" + url)
        fname = url.split('/')[-1]
        print(fname)
        urllib.request.urlretrieve(url, fname, report)
    

    Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, and can avoid re-downloading existing files (-nc), while aria2 can do multi-connection downloads, which can potentially speed up your downloads.

    import subprocess
    subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
    

    In Jupyter Notebook, one can also call programs directly with the ! syntax:

    !wget -O example_output_file.html https://example.com
    # Save file data to local copy
    with open(local_file_name, 'wb') as file:
        file.write(data.content)
    

    Now do something with the local copy of the remote file

    import sys
    # Detect the Python version; the python3 flag is used in download() below
    try:
        import urllib.request
        python3 = True
    except ImportError:
        import urllib2
        python3 = False
    def progress_callback_simple(downloaded, total):
        sys.stdout.write(
            "\r" +
            (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
            " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
        )
        sys.stdout.flush()
    def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
        def _download_helper(response, out_file, file_size):
            if progress_callback is not None: progress_callback(0,file_size)
            if block_size is None:
                buffer = response.read()
                out_file.write(buffer)
                if progress_callback is not None: progress_callback(file_size,file_size)
            else:
                file_size_dl = 0
                while True:
                    buffer = response.read(block_size)
                    if not buffer: break
                    file_size_dl += len(buffer)
                    out_file.write(buffer)
                    if progress_callback is not None: progress_callback(file_size_dl,file_size)
        with open(dstfilepath,"wb") as out_file:
            if python3:
                with urllib.request.urlopen(srcurl) as response:
                    file_size = int(response.getheader("Content-Length"))
                    _download_helper(response,out_file,file_size)
            else:
                response = urllib2.urlopen(srcurl)
                meta = response.info()
                file_size = int(meta.getheaders("Content-Length")[0])
                _download_helper(response,out_file,file_size)
    import traceback
    try:
        download(
            "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
            "output.zip",
            progress_callback_simple
        )
    except:
        traceback.print_exc()
        input()

    If speed matters to you, I made a small performance test for the modules urllib and wget; for wget I tried once with the status bar and once without. I took three different 500MB files to test with (different files, to eliminate the chance that there is some caching going on under the hood). Tested on a Debian machine with Python 2.

    First, these are the results (they are similar in different runs):

    $ python wget_test.py 
    urlretrive_test : starting
    urlretrive_test : 6.56
    ==============
    wget_no_bar_test : starting
    wget_no_bar_test : 7.20
    ==============
    wget_with_bar_test : starting
    100% [......................................................................] 541335552 / 541335552
    wget_with_bar_test : 50.49
    ==============
    

    I ran the test using a "profile" decorator. This is the full code:

    import wget
    import urllib
    import time
    from functools import wraps
    def profile(func):
        @wraps(func)
        def inner(*args):
            print func.__name__, ": starting"
            start = time.time()
            ret = func(*args)
            end = time.time()
            print func.__name__, ": {:.2f}".format(end - start)
            return ret
        return inner
    url1 = 'http://host.com/500a.iso'
    url2 = 'http://host.com/500b.iso'
    url3 = 'http://host.com/500c.iso'
    def do_nothing(*args):
        pass
    @profile
    def urlretrive_test(url):
        return urllib.urlretrieve(url)
    @profile
    def wget_no_bar_test(url):
        return wget.download(url, out='/tmp/', bar=do_nothing)
    @profile
    def wget_with_bar_test(url):
        return wget.download(url, out='/tmp/')
    urlretrive_test(url1)
    print '=============='
    time.sleep(1)
    wget_no_bar_test(url2)
    print '=============='
    time.sleep(1)
    wget_with_bar_test(url3)
    print '=============='
    time.sleep(1)
    

    urllib seems to be the fastest

    There must be something completely horrible going on under the hood to make the bar increase the time so much. – Alistair Carscadden Sep 10, 2018 at 6:39

    Can I ask - where does the file save once the program runs? Also, is there a way to name it and save it in a specific location? This is the link I am working with - when you click the link it immediately downloads an excel file: ons.gov.uk/generator?format=xls&uri=/economy/… – Joshua Tinashe Oct 14, 2020 at 13:03

    You can supply the save location as second argument, e.g.: dload.save(url, "/home/user/test.xls") – Pedro Lobito Oct 14, 2020 at 15:42

    This may be a little late, but I saw PabloG's code and couldn't help adding an os.system('cls') to make it look AWESOME! Check it out:

        import urllib2,os
        url = "http://download.thinkbroadband.com/10MB.zip"
        file_name = url.split('/')[-1]
        u = urllib2.urlopen(url)
        f = open(file_name, 'wb')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (file_name, file_size)
        os.system('cls')
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8)*(len(status)+1)
            print status,
        f.close()
    

    If running in an environment other than Windows, you will have to use something other than 'cls'. On macOS and Linux it should be 'clear'.

    cls doesn't do anything on my OS X or nor on an Ubuntu server of mine. Some clarification could be good. – the Sep 24, 2014 at 21:57

    I think you should use clear for linux, or even better replace the print line instead of clearing the whole command line output. – Arijoon Jan 21, 2015 at 1:01

    this answer just copies another answer and adds a call to a deprecated function (os.system()) that launches a subprocess to clear the screen using a platform specific command (cls). How does this have any upvotes?? Utterly worthless "answer" IMHO. – Corey Goldberg Dec 11, 2015 at 19:56
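The alternative raised in the comments, redrawing the status line instead of clearing the whole screen, works the same way on Windows, macOS, and Linux. A rough sketch; format_progress and show_progress are hypothetical helper names and the column widths are arbitrary.

```python
import sys


def format_progress(downloaded, total):
    """Build a one-line status from the byte counts."""
    if total:
        return "{0:10d} / {1} [{2:6.2f}%]".format(
            downloaded, total, downloaded * 100.0 / total
        )
    return "{0:10d} bytes".format(downloaded)  # size unknown: show bytes only


def show_progress(downloaded, total, stream=sys.stdout):
    """Redraw the status in place by returning the cursor to the line start."""
    stream.write("\r" + format_progress(downloaded, total))
    stream.flush()
```

Inside the download loop, show_progress(file_size_dl, file_size) would take the place of the print and os.system calls.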

    urlretrieve and requests.get are simple, but reality is not. I have fetched data from a couple of sites, including text and images, and those two functions probably solve most tasks. For a more universal solution, though, I suggest using urlopen. Because it is included in the Python 3 standard library, your code can run on any machine that runs Python 3 without pre-installing a site package.

    import urllib.request
    url_request = urllib.request.Request(url, headers=headers)
    url_connect = urllib.request.urlopen(url_request)
    #remember to open file in bytes mode
    with open(filename, 'wb') as f:
        while True:
            buffer = url_connect.read(buffer_size)
            if not buffer: break
            #an integer value of size of written data
            data_wrote = f.write(buffer)
    #you could probably use with-open-as manner
    url_connect.close()
    

    This answer provides a solution to HTTP 403 Forbidden errors when downloading a file over HTTP using Python. I have tried only the requests and urllib modules; other modules may offer something better, but this is the one I used to solve most of the problems.
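The snippet above leaves url, headers, filename, and buffer_size undefined. A self-contained variant might look like the following; the function name, the browser-like User-Agent value, and the 8192-byte buffer are assumptions for illustration, and whether a particular server accepts that header is not guaranteed.

```python
import urllib.request


def download_with_headers(url, filename, headers=None, buffer_size=8192):
    """Download url to filename, sending the given request headers."""
    # Assumed default: some servers answer 403 to urllib's own
    # "Python-urllib/x.y" User-Agent, so a browser-like value is sent instead.
    if headers is None:
        headers = {"User-Agent": "Mozilla/5.0"}
    request = urllib.request.Request(url, headers=headers)
    # Both the connection and the file are closed when the with block exits.
    with urllib.request.urlopen(request) as response, open(filename, "wb") as f:
        while True:
            buffer = response.read(buffer_size)
            if not buffer:
                break
            f.write(buffer)
    return filename
```

The same headers dictionary is where an access token would go when the URL is restricted.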

    outfile = os.path.join(SAVE_DIR, file_name)
    response = requests.get(URL, stream=True)
    with open(outfile,'wb') as output:
        output.write(response.content)

    You can use shutil

    import os
    import requests
    import shutil
    outfile = os.path.join(SAVE_DIR, file_name)
    response = requests.get(url, stream = True)
    with open(outfile, 'wb') as f:
      shutil.copyfileobj(response.raw, f)  # response.raw is file-like; response.content is bytes
    
  • If you are downloading from restricted url, don't forget to include access token in headers
  • I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided on the Python route and found this thread.

    After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.

    It uses BeautifulSoup to collect all the URLs of the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.

    Here it is:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    from __future__ import (division, absolute_import, print_function, unicode_literals)
    import sys, os, argparse
    from bs4 import BeautifulSoup
    # --- insert Stan's script here ---
    # if sys.version_info >= (3,): 
    # def download_file(url, dest=None): 
    # --- new stuff ---
    def collect_all_url(page_url, extensions):
        """Recovers all links in page_url checking for all the desired extensions."""
        conn = urllib2.urlopen(page_url)
        html = conn.read()
        soup = BeautifulSoup(html, 'lxml')
        links = soup.find_all('a')
        results = []    
        for tag in links:
            link = tag.get('href', None)
            if link is not None: 
                for e in extensions:
                    if e in link:
                        # Fallback for badly defined links
                        # checks for missing scheme or netloc
                        if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                            results.append(link)
                        else:
                            new_url=urlparse.urljoin(page_url,link)                        
                            results.append(new_url)
        return results
    if __name__ == "__main__":  # Only run if this file is called directly
        # Command line arguments
        parser = argparse.ArgumentParser(
            description='Download all files from a webpage.')
        parser.add_argument(
            '-u', '--url', 
            help='Page url to request')
        parser.add_argument(
            '-e', '--ext', 
            nargs='+',
            help='Extension(s) to find')    
        parser.add_argument(
            '-d', '--dest', 
            default=None,
            help='Destination where to save the files')
        parser.add_argument(
            '-p', '--par', 
            action='store_true', default=False, 
            help="Turns on parallel download")
        args = parser.parse_args()
        # Recover files to download
        all_links = collect_all_url(args.url, args.ext)
        # Download
        if not args.par:
            for l in all_links:
                try:
                    filename = download_file(l, args.dest)
                    print(l)
                except Exception as e:
                    print("Error while downloading: {}".format(e))
        else:
            from multiprocessing.pool import ThreadPool
            results = ThreadPool(10).imap_unordered(
                lambda x: download_file(x, args.dest), all_links)
            for p in results:
                print(p)
    

    An example of its usage is:

    python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
    

    And an actual example if you want to see it in action:

    python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
    

    Another possibility is with built-in http.client:

    from http import HTTPStatus, client
    from shutil import copyfileobj
    # using https
    connection = client.HTTPSConnection("www.example.com")
    connection.request("GET", "/noise.mp3")  # request() returns None
    with connection.getresponse() as response:
        if response.status == HTTPStatus.OK:
            with open("noise.mp3", "wb") as out_file:
                copyfileobj(response, out_file)
        else:
            raise Exception("request needs work")
    

    The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.

    path_to_downloaded_file = keras.utils.get_file(
        fname="file name",
        origin="https://www.linktofile.com/link/to/file",
        extract=True,
        archive_format="zip",  # downloaded file format
        cache_dir="/",  # cache and extract in current directory
    )

    Another way is to call an external process such as curl.exe. By default, curl displays a progress bar, average download speed, time left, and more, all formatted neatly in a table. Put curl.exe in the same directory as your script.

    from subprocess import call
    url = ""
    call(["curl", url, '--output', "song.mp3"])
    

    Note: --output accepts a full path as well as a bare file name, so an os.rename afterwards is not needed.
