Problems merging scraped data with Pandas and numpy in Python
I am trying to gather information from many different URLs and combine the data based on year and golfer name. So far I have been trying to write the information to CSV and then match rows up with pd.merge(), but that requires a unique name for every dataframe being merged. I tried using a numpy array, but I am stuck on the final step of getting all of the separate data merged.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']
alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:])  # may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)
#no repeats
gamma = []
for i in beta:
    if i not in gamma:
        gamma.append(i)
#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section', {'class': 'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i, str(y + '.csv')])
                print([i, str(y + '.csv')])
    except Exception as e:
        print(e)
        pass
# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp = []
title1 = []
title = []
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] + y + end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table', {'id': 'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                        for l in title1:
                            if l not in title:
                                title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table', {'id': 'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
from functools import reduce  # needed for the chained merge

dfs = [df1, df2, df3]  # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
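One way around the unique-name problem, suggested in the comments at the end of this page, is to loop over the CSV files, read each into a list with pd.read_csv(), and chain-merge the list. A minimal sketch, assuming every CSV shares the year and PLAYER NAME key columns (the file pattern and key names here are illustrative, not from the original code):

import glob
from functools import reduce

import pandas as pd

# hypothetical: every CSV written by the scraper sits in the working directory
frames = [pd.read_csv(path) for path in glob.glob("*.csv")]

# chain-merge on the shared key columns (assumed to exist in every file)
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["year", "PLAYER NAME"], how="outer"),
    frames,
)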
UPDATE (per the comments): This question is partly about the technical method (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. As such, I've added a bit more detail and explanation than a coding solution strictly requires.
You can basically use the same approach as my original answer to get data from the different URL categories. I'd suggest keeping a {url: data} dict as you iterate over your URL list, and then building clean dataframes from that dict.

Setting up the cleaning portion takes a little legwork, since you need to adjust for the different columns in each URL category. I've demonstrated the manual approach, using only a few test URLs. But if you have thousands of different URL categories, you may need to think about how to collect and organize the column names programmatically. That feels out of scope for this question.
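As a rough illustration of what a programmatic approach might look like (a sketch only; pgatour.com's markup may well have changed, and the 'statsTable' id is simply taken from the scraping code on this page):

from urllib.request import urlopen

from bs4 import BeautifulSoup

def collect_columns(url):
    """Scrape the header cells of a stats table and return the column names."""
    soup = BeautifulSoup(urlopen(url), "html.parser")
    table = soup.find("table", {"id": "statsTable"})  # id assumed from the code above
    if table is None:
        return []
    return [th.get_text(strip=True) for th in table.find_all("th")]

# hypothetical usage: build the per-category column map programmatically
# col_map = {url.split("/")[-1]: collect_columns(url) for url in urls}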
As long as you're sure that year and PLAYER NAME exist in every URL, the merge below should work. As before, let's assume you don't need to write to CSV, and hold off on making any optimizations to your scraping code for now:
First, define the URL categories in urls. By URL category, I mean that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the URL itself, e.g. http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the URL category that contains multiple years of player data.
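Concretely, the year-specific URL is built the way the scraping loop below does it, with url[:-4] + y + end; a standalone illustration:

base_url = 'http://www.pgatour.com/stats/stat.02356.html'
end = '.html'
years = ['2017', '2016', '2015', '2014', '2013']

# base_url[:-4] drops the trailing 'html', leaving '...stat.02356.', so appending
# the year and '.html' yields e.g. .../stats/stat.02356.2017.html
for y in years:
    print(base_url[:-4] + y + end)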
import pandas as pd
# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']
# we'll store data from each url category in this dict.
url_data = {}
Now iterate over urls. Inside the urls loop, this code is exactly the same as my original answer, which in turn came from the OP; only some variable names have been adjusted to reflect our new capturing method:
for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
            #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] + y + end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table', {'id': 'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                    for l in url_data[url]["title1"]:
                        if l not in url_data[url]["title"]:
                            url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table', {'id': 'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
Now, for each key url in url_data, rows_sp contains the data you're interested in for that particular URL category. Note that rows_sp will actually be url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks come from my original answer and therefore use the old rows_sp variable name.
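For example, to pull the captured rows for a single URL category (one of the test URLs above):

rows_sp = url_data['http://www.pgatour.com/stats/stat.02356.html']["rows_sp"]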
# example rows_sp
[['year',
'RANK THIS WEEK',
'RANK LAST WEEK',
'PLAYER NAME',
'EVENTS',
'RATING',
'year',
'year',
'year',
'year'],
['2017'],
['2017', '1', '1', 'Sam Burns', '1', '9.2'],
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
['2017', '2', '3', 'Whee Kim', '2', '8.8'],
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
...
]
Writing rows_sp directly to a dataframe shows that the data isn't in quite the right format:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None
7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None
pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object
With a little cleanup, we can get rows_sp into a dataframe with proper numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8
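As a quick sanity check (my addition, not part of the original answer), the dtypes should now show numeric types for the converted columns, consistent with the head() output above:

print(df.dtypes)  # e.g. RANK THIS WEEK / EVENTS as int64, RANK LAST WEEK / RATING as float64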
UPDATED CLEANING: Now that we have a range of URL categories to handle, each with a different set of fields to clean, the section above gets a bit more complicated. If you only have a few pages, it may be viable to visually review the fields for each category and store them, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', 'AVERAGE',
'TOTAL SG:APP', 'MEASURED ROUNDS',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
},
'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
},
'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'EVENTS', 'RATING',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
'EVENTS', 'RATING']
}
}
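For example, looking up the stored spec for one category:

page = 'http://www.pgatour.com/stats/stat.02356.html'.split("/")[-1]  # 'stat.02356.html'
print(cols[page]["numeric"])  # ['RANK THIS WEEK', 'RANK LAST WEEK', 'EVENTS', 'RATING']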
Then you can cycle through url_data again, storing the results in a dfs collection:
dfs = {}
for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df
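Because the merge below assumes year and PLAYER NAME exist in every category, a quick check over dfs can catch a missing key column early (my addition; a small sketch, not part of the original answer):

for url, df in dfs.items():
    missing = {"year", "PLAYER NAME"} - set(df.columns)
    assert not missing, f"{url} is missing merge keys: {missing}"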
At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually merge iteratively within the cleaning loop, but I'm separating the steps here for demonstration purposes.) After the merge below runs, master contains the combined data for each player-year, and the groupby() call that follows gives a view into the result:
master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2
RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548
TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57
RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91
TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0
You may want to do some additional cleaning of the merged columns, since some fields are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicated field names appear to contain exactly the same information, so you can probably just drop the _y version of each.
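A minimal sketch of that cleanup, assuming the _x/_y pairs really are identical (worth spot-checking before dropping anything):

# drop the duplicated _y columns, then strip the _x suffix from their twins
dup_cols = [c for c in master.columns if c.endswith("_y")]
master = master.drop(columns=dup_cols)
master.columns = [c[:-2] if c.endswith("_x") else c for c in master.columns]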
Comments on the answer:

- Where in your code are you using Pandas? Where is the attempted merge?
- I haven't attempted it, but something like: dataframes = [df1, df2, df3] # store in one list; df_merge = reduce(lambda left, right: pd.merge(left, right, on=['column'], how='outer'), dataframes). That's the process I'm trying to accomplish, but I can't make use of it.
- Why doesn't the chained merge work? An error? Unexpected results? Aren't you reading the dataframes in from CSV?
- Converting a CSV to a dataframe requires a dataframe name as I understand it, so I'm having trouble naming the dataframes uniquely in order to use a chained merge.
- You can loop over the CSV files, running pd.read_csv() repeatedly and appending to a list or dict, then run the chained merge.
- Welcome! Does this answer provide enough of a solution to your original question? If so, please consider marking it accepted by clicking the checkmark to the left of the answer. If not, where are you still stuck?
- It might help if you add an example of your desired output to your original post, with column names and a few rows of data (even made-up data would be useful if the format is right). Creating separate dataframes for each URL is straightforward, but if each URL covers different years, I'm still not sure how you intend to merge on player-year. If the practice URLs you listed aren't representative of the other URLs you want to scrape, consider adding practice URLs that would give you valid merge possibilities.
- Happy to update if you spot mistakes. This was working correctly when I posted, but I may have some errors.