Python - Improving the speed of pandas.append


So I am trying to scrape headlines from a newspaper for the past decade.

The years are contained here:

/resources/archive/us/2007.html
/resources/archive/us/2008.html
/resources/archive/us/2009.html
/resources/archive/us/2010.html
/resources/archive/us/2011.html
/resources/archive/us/2012.html
/resources/archive/us/2013.html
/resources/archive/us/2014.html
/resources/archive/us/2015.html
/resources/archive/us/2016.html
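These ten per-year paths can also be generated rather than hand-listed; a short sketch, assuming the list is called years as in the code below:

```python
# Build the archive paths for 2007 through 2016
years = ["/resources/archive/us/%d.html" % y for y in range(2007, 2017)]
print(years[0])  # /resources/archive/us/2007.html
```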
So what my code does is open each year's page, collect all of the date links, then open each link individually, grab all of the .text, and append each headline with its corresponding date as a row to the dataframe headlines:

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

headlines = pd.DataFrame(columns=["date", "headline"])

for y in years:
    yurl = "http://www.reuters.com" + str(y)
    response = requests.get(yurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')

    # Collect the per-day links listed under each month heading
    days = []
    links = bs.findAll('h5')
    for mon in links:
        for day in mon.next_sibling.next_sibling:
            days.append(day)

    days = [e for e in days if str(e) not in ('\n')]
    for ind in days:
        hlday = ind['href']
        # Pull the YYYYMMDD date out of the link and reorder it as MM-DD-YYYY
        date = re.findall(r'(?!\/)[0-9].+(?=\.)', hlday)[0]
        date = date[4:6] + '-' + date[6:] + '-' + date[:4]
        print(date.split('-')[2])
        yurl = "http://www.reuters.com" + str(hlday)
        response = requests.get(yurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
        if response.status_code == 404 or response.content == b'':
            print('')  # skip missing or empty pages
        else:
            bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
            lines = bs.findAll('div', {'class': 'headlineMed'})
            for h in lines:
                headlines = headlines.append([{"date": date, "headline": h.text}], ignore_index=True)
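As a side note, the regex-and-slice date handling above can also be written with datetime, which states the YYYYMMDD to MM-DD-YYYY intent directly (a sketch on a made-up date string, not the author's code):

```python
from datetime import datetime

# "20080102" stands in for a date string extracted from a day link
date = datetime.strptime("20080102", "%Y%m%d").strftime("%m-%d-%Y")
print(date)  # 01-02-2008
```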
It takes forever to run, so instead of running the for loop over all the years, I ran it for just this one year:

/resources/archive/us/2008.html

It has been 3 hours and it is still running.

Since I am new to Python, I don't understand what I am doing wrong, or how I could do this better.


Is it because pandas.append has to read and write an ever-larger dataframe on every iteration that it takes so long?

You are using this anti-pattern:

headlines = pd.DataFrame()
for y in years:
    for ind in days:
        headlines = headlines.append(blah)
Instead, do this:

headlines = []
for y in years:
    for ind in days:
        headlines.append(pd.DataFrame(blah))

headlines = pd.concat(headlines)
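To make the difference concrete, here is a runnable sketch with made-up rows standing in for the scraped blah above. Each rebuild of the frame copies every row accumulated so far, so n row-by-row appends do O(n^2) work, while list-then-concat copies everything exactly once. (DataFrame.append was deprecated and removed in pandas 2.0, so the slow variant is shown with pd.concat in a loop, which has the same per-iteration copy cost.)

```python
import pandas as pd

# Made-up headline rows standing in for the scraped data
rows = [{"date": "01-%02d-2008" % i, "headline": "story %d" % i} for i in range(1, 6)]

# Anti-pattern: rebuild (and recopy) the whole frame on every iteration
slow = pd.DataFrame(columns=["date", "headline"])
for r in rows:
    slow = pd.concat([slow, pd.DataFrame([r])], ignore_index=True)

# Better: accumulate small frames in a list, concatenate once
parts = [pd.DataFrame([r]) for r in rows]
fast = pd.concat(parts, ignore_index=True)

print(fast.equals(slow))  # both build the same frame; only the running time differs
```

With a few thousand rows the two versions produce identical frames, but the first one's running time grows quadratically with the row count.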

The second potential problem is that you are making 3650 web requests. If I were running a website like that, I would build in throttling to slow down scrapers like yours. You may find it better to gather the raw data once, store it on disk, and then do the processing in a second pass. That way you don't incur the cost of 3650 web requests every time you need to debug your program.
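A minimal sketch of that fetch-once idea, assuming a hypothetical local cache/ directory (the fetch function name and cache layout are made up, not part of the original code):

```python
import hashlib
import os

CACHE_DIR = "cache"  # hypothetical directory for saved pages

def fetch(url, headers=None):
    """Return the page body for url, reading from disk if we already have it."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())
    if os.path.exists(path):  # cache hit: no web request at all
        with open(path, "rb") as f:
            return f.read()
    import requests  # imported lazily; only needed on a cache miss
    response = requests.get(url, headers=headers)
    with open(path, "wb") as f:  # save the raw bytes for the next run
        f.write(response.content)
    return response.content
```

With something like this in place, each requests.get call in the loops above becomes fetch(yurl, headers=...), and re-running the scraper after the first full pass costs zero web requests.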

Don't call append multiple times. Instead, keep a list of the 100 individual dataframes, then call pd.concat on it once at the end.