http - What is the fastest way to send 100,000 HTTP requests in Python?

I'm opening a file that has 100,000 URLs. I need to send an HTTP request to each URL and print the status code. I'm using Python 2.6, and so far I've looked at the ways Python implements threading/concurrency; I've even looked at the concurrence library, but I can't figure out how to write this program correctly. Has anyone come across a similar problem?

A Twisted-less solution:


from urlparse import urlparse
from threading import Thread
import httplib, sys
from Queue import Queue

concurrent = 200

def doWork():
    # Each worker thread pulls URLs off the queue until the main thread exits.
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()

def getStatus(ourl):
    # Issue a HEAD request and return (status, url); any failure is reported as "error".
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(url.netloc)
        conn.request("HEAD", url.path)
        res = conn.getresponse()
        return res.status, ourl
    except:
        return "error", ourl

def doSomethingWithResult(status, url):
    print status, url

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)

This one is slightly faster than the Twisted solution and uses less CPU.

A solution using the Tornado asynchronous networking library:


from tornado import ioloop, httpclient

i = 0

def handle_request(response):
    # Print each status code and stop the IOLoop once all requests have returned.
    print(response.code)
    global i
    i -= 1
    if i == 0:
        ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
for url in open('urls.txt'):
    i += 1
    http_client.fetch(url.strip(), handle_request, method='HEAD')
ioloop.IOLoop.instance().start()

The Twisted solution:


from twisted.internet import reactor, threads
from urlparse import urlparse
import httplib
import itertools


concurrent = 200
finished = itertools.count(1)
reactor.suggestThreadPoolSize(concurrent)

def getStatus(ourl):
    # Blocking HEAD request; it runs on the reactor's thread pool via deferToThread.
    url = urlparse(ourl)
    conn = httplib.HTTPConnection(url.netloc)
    conn.request("HEAD", url.path)
    res = conn.getresponse()
    return res.status

def processResponse(response, url):
    print response, url
    processedOne()

def processError(error, url):
    print "error", url  # , error
    processedOne()

def processedOne():
    # Stop the reactor once every URL that was added has been processed.
    if finished.next() == added:
        reactor.stop()

def addTask(url):
    req = threads.deferToThread(getStatus, url)
    req.addCallback(processResponse, url)
    req.addErrback(processError, url)

added = 0
for url in open('urllist.txt'):
    added += 1
    addTask(url.strip())

try:
    reactor.run()
except KeyboardInterrupt:
    reactor.stop()

Test time:


[kalmi@ubi1:~] wc -l urllist.txt
10000 urllist.txt
[kalmi@ubi1:~] time python f.py > /dev/null 

real 1m10.682s
user 0m16.020s
sys 0m10.330s
[kalmi@ubi1:~] head -n 6 urllist.txt
http://www.google.com
http://www.bix.hu
http://www.godaddy.com
http://www.google.com
http://www.bix.hu
http://www.godaddy.com
[kalmi@ubi1:~] python f.py | head -n 6
200 http://www.bix.hu
200 http://www.bix.hu
200 http://www.bix.hu
200 http://www.bix.hu
200 http://www.bix.hu
200 http://www.bix.hu

Ping time:


bix.hu is ~10 ms away from me
godaddy.com: ~170 ms
google.com: ~30 ms

Use grequests; it's a combination of Requests and the Gevent module.

GRequests allows you to use Requests with Gevent to make asynchronous HTTP requests easily.

Usage is simple:


import grequests

urls = [
 'http://www.heroku.com',
 'http://tablib.org',
 'http://httpbin.org',
 'http://python-requests.org',
 'http://kennethreitz.com'
]

Create a set of unsent Requests:


>>> rs = (grequests.get(u) for u in urls)

Send them all at the same time:


>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]
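
Since the goal here is HEAD requests and status codes for a very large URL list, a variant along these lines may fit better. This is a minimal sketch, assuming grequests' head() helper and the size argument of imap() to cap concurrency; urls.txt is a hypothetical input file:


import grequests

# Build unsent HEAD requests from the URL file (urls.txt is assumed here).
reqs = (grequests.head(u.strip()) for u in open('urls.txt'))

# imap yields responses as they complete; size caps how many are in flight
# at once, so 100,000 URLs are not all fired simultaneously.
for r in grequests.imap(reqs, size=200):
    print("%s %s" % (r.status_code, r.url))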

In your case, threading will probably do the trick, since you'll likely spend most of your time waiting for responses. There are a few helpful modules in the standard library, such as Queue, that can help.

I've done something similar with parallel downloading of files before, and it was good enough for me, but it wasn't on the scale you're talking about.

If your task is more CPU-bound, you may want to look at the multiprocessing module, which will let you use more CPUs/cores/threads (more processes that won't block each other, since locking is per process).
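
A minimal sketch of that idea, using the standard multiprocessing module from Python 2.6; the check() helper and the urllist.txt filename are illustrative, not from the answer above:


from multiprocessing import Pool
from urlparse import urlparse
import httplib

def check(ourl):
    # HEAD one URL and return (status, url); failures are reported as "error".
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(url.netloc)
        conn.request("HEAD", url.path)
        return conn.getresponse().status, ourl
    except Exception:
        return "error", ourl

if __name__ == '__main__':
    urls = [line.strip() for line in open('urllist.txt')]
    pool = Pool(processes=20)
    # imap_unordered streams results back as worker processes finish them.
    for status, url in pool.imap_unordered(check, urls):
        print status, url

Note that these requests are network-bound, so processes mainly pay off when each result also needs real CPU work afterwards.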

Using a thread pool is a good option and will make this fairly easy. Unfortunately, Python doesn't have a standard library that makes thread pools dead simple, but here's a decent library that should get you started: http://www.chrisarndt.de/projects/threadpool/

A code example from their site:


pool = ThreadPool(poolsize)
requests = makeRequests(some_callable, list_of_args, callback)
[pool.putRequest(req) for req in requests]
pool.wait()
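
To tie this to the question, the callable and callback might look roughly like this. It's a sketch against that threadpool library; the import name, the callback signature, and the request.args lookup follow my reading of its docs, and head_status/print_result are made-up names:


import threadpool
from urlparse import urlparse
import httplib

def head_status(ourl):
    # Worker function: HEAD one URL and return its status code.
    url = urlparse(ourl)
    conn = httplib.HTTPConnection(url.netloc)
    conn.request("HEAD", url.path)
    return conn.getresponse().status

def print_result(request, status):
    # Result callback; request.args[0] is the URL that was passed in.
    print status, request.args[0]

urls = [line.strip() for line in open('urllist.txt')]
pool = threadpool.ThreadPool(200)
for req in threadpool.makeRequests(head_status, urls, print_result):
    pool.putRequest(req)
pool.wait()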

Hope this helps.

...