Problems merging scraped data with Pandas and numpy in Python
I am trying to gather information from many different URLs and combine the data based on year and golfer name. So far I have been trying to write the information to CSV and then match rows up with pd.merge(), but that requires a unique name for every dataframe being merged. I tried using a numpy array, but I am stuck on the final step of getting all of the separate data merged.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']
alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:])  # may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)
#no repeats
gamma = []
for i in beta:
    if i not in gamma:
        gamma.append(i)
#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section', {'class': 'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i, str(y + '.csv')])
                print([i, str(y + '.csv')])
    except Exception as e:
        print(e)
        pass
# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp = []
title1 = []
title = []
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] + y + end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table', {'id': 'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                        for l in title1:
                            if l not in title:
                                title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table', {'id': 'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
from functools import reduce  # needed for the chained merge

dfs = [df1, df2, df3]  # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
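One way around the unique-name problem, suggested in the comments at the end of this page, is to loop over the CSV files, read each into a list with pd.read_csv(), and chain-merge the list. A minimal sketch, assuming every CSV shares the year and PLAYER NAME key columns (the file pattern and key names here are illustrative, not from the original code):

import glob
from functools import reduce

import pandas as pd

# hypothetical: every CSV written by the scraper sits in the working directory
frames = [pd.read_csv(path) for path in glob.glob("*.csv")]

# chain-merge on the shared key columns (assumed to exist in every file)
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["year", "PLAYER NAME"], how="outer"),
    frames,
)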
UPDATE (per the comments): This question is partly about the technical method (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. As such, I've added a bit more detail and explanation than a coding solution strictly requires.
You can basically use the same approach as my original answer to get data from the different URL categories. I'd suggest keeping a {url: data} dict as you iterate over your URL list, and then building clean dataframes from that dict.

Setting up the cleaning portion takes a little legwork, since you need to adjust for the different columns in each URL category. I've demonstrated the manual approach, using only a few test URLs. But if you have thousands of different URL categories, you may need to think about how to collect and organize the column names programmatically. That feels out of scope for this question.
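As a rough illustration of what a programmatic approach might look like (a sketch only; pgatour.com's markup may well have changed, and the 'statsTable' id is simply taken from the scraping code on this page):

from urllib.request import urlopen

from bs4 import BeautifulSoup

def collect_columns(url):
    """Scrape the header cells of a stats table and return the column names."""
    soup = BeautifulSoup(urlopen(url), "html.parser")
    table = soup.find("table", {"id": "statsTable"})  # id assumed from the code above
    if table is None:
        return []
    return [th.get_text(strip=True) for th in table.find_all("th")]

# hypothetical usage: build the per-category column map programmatically
# col_map = {url.split("/")[-1]: collect_columns(url) for url in urls}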
As long as you're sure that year and PLAYER NAME exist in every URL, the merge below should work. As before, let's assume you don't need to write to CSV, and hold off on making any optimizations to your scraping code for now:
First, define the URL categories in urls. By URL category, I mean that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the URL itself, e.g. http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the URL category that contains multiple years of player data.
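Concretely, the year-specific URL is built the way the scraping loop below does it, with url[:-4] + y + end; a standalone illustration:

base_url = 'http://www.pgatour.com/stats/stat.02356.html'
end = '.html'
years = ['2017', '2016', '2015', '2014', '2013']

# base_url[:-4] drops the trailing 'html', leaving '...stat.02356.', so appending
# the year and '.html' yields e.g. .../stats/stat.02356.2017.html
for y in years:
    print(base_url[:-4] + y + end)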
import pandas as pd
# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']
# we'll store data from each url category in this dict.
url_data = {}
Now iterate over urls. Inside the urls loop, this code is exactly the same as my original answer, which in turn came from the OP; only some variable names have been adjusted to reflect our new capturing method:
for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
            #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] + y + end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table', {'id': 'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                    for l in url_data[url]["title1"]:
                        if l not in url_data[url]["title"]:
                            url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table', {'id': 'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
Now, for each key url in url_data, rows_sp contains the data you're interested in for that particular URL category. Note that rows_sp will actually be url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks come from my original answer and therefore use the old rows_sp variable name.
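For example, to pull the captured rows for a single URL category (one of the test URLs above):

rows_sp = url_data['http://www.pgatour.com/stats/stat.02356.html']["rows_sp"]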
# example rows_sp
[['year',
'RANK THIS WEEK',
'RANK LAST WEEK',
'PLAYER NAME',
'EVENTS',
'RATING',
'year',
'year',
'year',
'year'],
['2017'],
['2017', '1', '1', 'Sam Burns', '1', '9.2'],
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
['2017', '2', '3', 'Whee Kim', '2', '8.8'],
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
...
]
Writing rows_sp directly to a dataframe shows that the data isn't in quite the right format:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None
7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None
pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object
With a little cleanup, we can get rows_sp into a dataframe with proper numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8
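As a quick sanity check (my addition, not part of the original answer), the dtypes should now show numeric types for the converted columns, consistent with the head() output above:

print(df.dtypes)  # e.g. RANK THIS WEEK / EVENTS as int64, RANK LAST WEEK / RATING as float64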
UPDATED CLEANING: Now that we have a range of URL categories to handle, each with a different set of fields to clean, the section above gets a bit more complicated. If you only have a few pages, it may be viable to visually review the fields for each category and store them, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', 'AVERAGE',
'TOTAL SG:APP', 'MEASURED ROUNDS',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
},
'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
},
'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'EVENTS', 'RATING',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
'EVENTS', 'RATING']
}
}
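For example, looking up the stored spec for one category:

page = 'http://www.pgatour.com/stats/stat.02356.html'.split("/")[-1]  # 'stat.02356.html'
print(cols[page]["numeric"])  # ['RANK THIS WEEK', 'RANK LAST WEEK', 'EVENTS', 'RATING']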
Then you can cycle through url_data again, storing the results in a dfs collection:
dfs = {}
for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df
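Because the merge below assumes year and PLAYER NAME exist in every category, a quick check over dfs can catch a missing key column early (my addition; a small sketch, not part of the original answer):

for url, df in dfs.items():
    missing = {"year", "PLAYER NAME"} - set(df.columns)
    assert not missing, f"{url} is missing merge keys: {missing}"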
At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually merge iteratively within the cleaning loop, but I'm separating the steps here for demonstration purposes.) After the merge below runs, master contains the combined data for each player-year, and the groupby() call that follows gives a view into the result:
master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2
RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548
TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57
RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91
TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0
You may want to do some additional cleaning of the merged columns, since some fields are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicated field names appear to contain exactly the same information, so you can probably just drop the _y version of each.
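A minimal sketch of that cleanup, assuming the _x/_y pairs really are identical (worth spot-checking before dropping anything):

# drop the duplicated _y columns, then strip the _x suffix from their twins
dup_cols = [c for c in master.columns if c.endswith("_y")]
master = master.drop(columns=dup_cols)
master.columns = [c[:-2] if c.endswith("_x") else c for c in master.columns]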
Comments on the answer:

- Where in your code are you using Pandas? Where is the attempted merge?
- I haven't attempted it, but something like: dataframes = [df1, df2, df3] # store in one list; df_merge = reduce(lambda left, right: pd.merge(left, right, on=['column'], how='outer'), dataframes). That's the process I'm trying to accomplish, but I can't make use of it.
- Why doesn't the chained merge work? An error? Unexpected results? Aren't you reading the dataframes in from CSV?
- Converting a CSV to a dataframe requires a dataframe name as I understand it, so I'm having trouble naming the dataframes uniquely in order to use a chained merge.
- You can loop over the CSV files, running pd.read_csv() repeatedly and appending to a list or dict, then run the chained merge.
- Welcome! Does this answer provide enough of a solution to your original question? If so, please consider marking it accepted by clicking the checkmark to the left of the answer. If not, where are you still stuck?
- It might help if you add an example of your desired output to your original post, with column names and a few rows of data (even made-up data would be useful if the format is right). Creating separate dataframes for each URL is straightforward, but if each URL covers different years, I'm still not sure how you intend to merge on player-year. If the practice URLs you listed aren't representative of the other URLs you want to scrape, consider adding practice URLs that would give you valid merge possibilities.
- Happy to update if you spot mistakes. This was working correctly when I posted, but I may have some errors.