Fastest way in Python to expand 5 million shortened URLs


I am working on a project in which I need to expand 5 million shortened URLs. These URLs could have been shortened by any URL shortener. What is the fastest way to do this?

Current code:

import csv
import pandas as pd
from urllib2 import urlopen
import urllib2
import threading
import time



def urlResolution(url,tweetId,w):

    try:

        print "Entered Function"
        print "Original Url:",url

        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

        #header has been added since some sites give an error otherwise
        req = urllib2.Request(url, headers=hdr)
        temp = urlopen(req)
        newUrl = temp.geturl()
        print "Resolved Url:",newUrl
        if newUrl!= 'None':
            print "in if condition"
            w.writerow([tweetId,newUrl])

    except Exception,e:
        print "Throwing exception"
        print str(e)
        return None


def urlResolver(urlFile):
    df=pd.read_csv(urlFile, delimiter="\t")

    df['Url']
    df2 = df[["Tweet ID","Url"]].copy()
    start = time.time()

    df3 = df2[df2.Url!="None"]

    list_url = []
    n=0
    outFile = open("OUTPUT_FILE.tsv", "w")
    w = csv.writer(outFile, delimiter = '\t')
    w.writerow(["Tweet ID","Url"])

    maxC = 0
    while maxC < df3.shape[0]:
        #creates threads in batches of 40, since for a large number of threads it gives <too many open files> error
        upper = min(maxC+40, df3.shape[0])
        threads = [threading.Thread(target=urlResolution, args=(df3.iloc[n]['Url'],df3.iloc[n]['Tweet ID'],w)) for n in range(maxC,upper)]

        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        maxC = maxC + 40
    print "threads complete"
    print "Elapsed Time: %s" % (time.time() - start)

    w.close()




if __name__ == '__main__':
    df3 = urlResolver("INPUT_FILE.tsv")
I have written the above program in Python using urllib2 (for URL expansion) with multiple threads, but it seems very slow.


Any suggestions on how to speed this up further?

First of all, StackOverflow is not really the place for this kind of question; it would be better asked elsewhere. Secondly, you have to provide your actual code. You need to keep connections alive and reuse them for new requests, and pipeline the requests (send them in large batches without waiting for the responses). That will probably mean bypassing urllib2 and working with sockets or a TLS wrapper directly.