Optimizing web page URL-to-title conversion in Python 2.7

I have a list of tweets in a document called twfile.txt. A sample txt file might look like this:

RT @CriticalReading: How #Islamophobia works. #Germanwings http://t.co/rX6XVxARiD
Family of Australian victims visit the #Germanwings #GermanWingsCrash crash site in #FrenchAlps #A320Crash #A320 http://t.co/ztReJ1tifU
RT @morningshowon7: #Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
Three generations from the same family were killed in the #Germanwings Alps crash: http://t.co/6F5MgvBSZG http://t.co/HzJZCZKVZe
Alps crash pilot's hidden illness sparks medical privacy debate #Germanwings. http://t.co/Efe89rxwJG
#Germanwings crash: church in #AndreasLubitz's home town stands by his family http://t.co/QkePs5sG4W http://t.co/irdDnHhxF7
Breaking: #Germanwings co-pilot had been treated 4 suicidal tendencies: http://t.co/6qEynKMSEI/s/KJKu http://t.co/TVdqP4EeWu/s/b4vR @Reuters
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
Audio last 60 seconds from flight deck http://t.co/T4IYK26NrG     #Germanwings #GermanWingsCrash #GermanyWings #4U9525 #AndreasLubitz
#Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
RT @surfinwav: American intelligence contractor among those killed in Alps plane crash http://t.co/m4L0EOd9L2 #Germanwings #GermanWingsCrash
Excellent help & resources from our friends @MindframeMedia over responsible reporting re #Germanwings http://t.co/EQG0kxyQgd  #NoStigma
.@Boba71 @Reuters So in Germany any sick psycho can fly a commercial plane hiding behind the so called privacy laws? #germanwings
The World Will Never Forget  https://t.co/Th41xouUiS  #4U9525 #GermanWings #A320Crash #indeepsorrow #AndreasLubitz
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
I am uncomfortable using word 'depression' for the #Germanwings pilot, depression does not kill other people.
Google Maps has blurred out the home of #Germanwings crash pilot Andreas Lubitz. http://t.co/VTm5sfmT6e
#Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/YpDB8trKFL http://t.co/uML8h6vwD8
#Lufthansa #Germanwings prepare for negligence charges since copilot was known to be suicidal 7 years ago
ICYMI: @swaindiana's interview w. lawyer who represents 4 families, who lost loved ones in #Germanwings crash. http://t.co/dnUXKkCD46 #CBCNN
An airplane crashes, after a couple of HOURS we get who's guilty, with the perfect solution for everybody. I don't buy it. #Germanwings
#Germanwings Crash Settlements Are Likely to Vary by Passenger Nationality - #aviationlaw #montrealconvention http://t.co/MWM8nSEYwG
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
German prosecutors confirm #Germanwings pilot "had continued to see psychiatrists and neurologists until recently" http://t.co/ma1v9zeiIV
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
RT @MindframeMedia: MEDIA: tips when including #mentalillness in stories to avoid perpetuating #stigma http://t.co/W7RlJVe9Lq #Germanwings
#Germanwings plane crash in French Alps: First clues - CNN : http://t.co/AbMPbXFfjG
RT @MindframeMedia: MEDIA: Get to know the facts about  #mentalillness & avoiding  stigmatising stories http://t.co/ZDd7AFOAir #Germanwings
RT @michaelhallida4: Am I Mad Enough To Crash A Plane Into A Mountain? https://t.co/M9d5nlf4bM #auspol #Germanwings
It's a sick world! How can this happen? RT @Reuters #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/ryw6nTmTNF
RT @Reuters: #Germanwings co-pilot Andreas Lubitz had been treated for suicidal tendencies: http://t.co/p7wqBNvoEW http://t.co/KKAGnvXFDd
I suffer #depression too but I would never risk other people's life. #Germanwings
The following code reads from the file, expands each URL, and replaces the old URL with the expanded one. It also checks whether the URL points to an image; if not, it replaces the URL with the web page's title, otherwise it leaves it as-is. The code works fine, but the problem is that this process takes far too long for a document containing thousands of tweets. How can I make it run faster?

import codecs
from bs4 import BeautifulSoup
import urllib

output = codecs.open('tw1file.txt', 'w', 'utf-8')

with open('twfile.txt', 'r') as inputf:
    for line in inputf:
        try:
            list1 = line.split(' ')
            for i in range(len(list1)):
                if "http" in list1[i]:
                    # Follow redirects to expand the shortened URL
                    response = urllib.urlopen(list1[i])
                    a = response.url
                    if 'photo' in a:
                        # URL points to an image: keep the expanded URL as-is
                        list1[i] = a + ' '
                    else:
                        # Otherwise replace the URL with the page title
                        html = response.read()
                        soup = BeautifulSoup(html)
                        title = soup.html.head.title
                        # Strip the surrounding <title>...</title> tags
                        list1[i] = str(title)[7:-8] + ' '
            line = ' '.join(list1)
            print line
            output.write(line)
        except Exception:
            pass

output.close()
Possibly by buying more bandwidth.

Look here:

Then work out where the time is actually being spent; I'd bet most of it goes to downloading the web sites, not to the script itself.

If you have a lot of idle time on the network (because the sites respond more slowly than your bandwidth allows), you could try putting the lines into a processing queue and letting a pool of worker threads do the actual work.

Look here: (for example code using workers, see the answer)
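The worker-pool idea above can be sketched with `multiprocessing.dummy.Pool`, a thread pool available in both Python 2.7 and 3. Since each tweet line is processed independently and the bottleneck is network I/O, mapping lines over worker threads lets several URL lookups overlap. This is a minimal sketch: `expand_tokens` is a hypothetical stand-in for the per-line work in the question (the real version would call `urllib.urlopen` on each http token), stubbed out here so the example stays self-contained.

```python
from multiprocessing.dummy import Pool  # thread pool; works on Python 2.7 and 3


def expand_tokens(line):
    # Hypothetical stand-in for the per-line work from the question:
    # split the line, transform each http token, and rejoin.  In the real
    # script the transformation would be the urllib.urlopen call that
    # expands the shortened URL or fetches the page title.
    tokens = line.split(' ')
    return ' '.join(t.upper() if t.startswith('http') else t for t in tokens)


def process_lines(lines, workers=8):
    # Each line is independent, so map them over a pool of worker threads.
    # Threads help here despite the GIL because the real work is
    # network-bound, not CPU-bound.
    pool = Pool(workers)
    try:
        return pool.map(expand_tokens, lines)
    finally:
        pool.close()
        pool.join()
```

With the per-line work factored into a function like this, the main loop reduces to `for line in process_lines(inputf): output.write(line)`, and the number of concurrent downloads is controlled by `workers`.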
