Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
from BeautifulSoup import BeautifulSoup
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
"http://ibm.com", "http://apple.com"]
queue = Queue.Queue()
out_queue = Queue.Queue()
class ThreadUrl(threading.Thread):
"""Threaded Url Grab"""
def __init__(self, queue, out_queue):
threading.Thread.__init__(self)
self.queue = queue
self.out_queue = out_queue
def run(self):
while True:
#grabs host from queue
host = self.queue.get()
#grabs urls of hosts and then grabs chunk of webpage
url = urllib2.urlopen(host)
chunk = url.read()
#place chunk into out queue
self.out_queue.put(chunk)
#signals to queue job is done
self.queue.task_done()
class DatamineThread(threading.Thread):
"""Threaded Url Grab"""
def __init__(self, out_queue):
threading.Thread.__init__(self)
self.out_queue = out_queue
def run(self):
while True:
#grabs host from queue
chunk = self.out_queue.get()
#parse the chunk
soup = BeautifulSoup(chunk)
print soup.findAll(['title'])
#signals to queue job is done
self.out_queue.task_done()
start = time.time()
def main():
#spawn a pool of threads, and pass them queue instance
for i in range(5):
t = ThreadUrl(queue, out_queue)
t.setDaemon(True)
t.start()
#populate queue with data
for host in hosts:
queue.put(host)
for i in range(5):
dt = DatamineThread(out_queue)
dt.setDaemon(True)
dt.start()
#wait on the queue until everything has been processed
queue.join()
out_queue.join()
main()
print "Elapsed Time: %s" % (time.time() - start)
Sometimes I get this error here:
Exception in thread Thread-10 (most likely raised during interpreter shutdown)
Explain please what caused it.
Update by another author:
Here's the full exception I see in similar code:
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 552, in __bootstrap_inner
File "/usr/local/lib/python2.7/threading.py", line 505, in run
File "mine.py", line 86, in run
File "/usr/local/lib/python2.7/Queue.py", line 168, in get
File "/usr/local/lib/python2.7/threading.py", line 237, in wait
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
–
–
That is bug http://bugs.python.org/issue14623
The most simple workaround is to add timeout
time.sleep(1)
in the end of script that allows the threads to finish before the script comes to end of life and close
–
Your example script seems okay - which is to say, it runs fine for me using python 2.7.2.
What version of python are you using? It's possible the errors you're seeing may be related to this bug. If so, then upgrading to python>=2.6.5 or python>=3.1 might help.
–
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.