Python: difficult DataFrame lookup query
Tags: python, pandas, dataframe
I'm fairly sure a question about this already exists, so I'd appreciate it if someone could point me in the right direction. I have two DataFrames, DF1:
+----------+-----------+------------+-------------+--------------------+
| Survived | Surname | FamilySize | NumSurvived | FamilySurvivalRate |
+----------+-----------+------------+-------------+--------------------+
| 0 | Braund | 2 | 0 | 0 |
| 1 | Cumings | 1 | 1 | 1 |
| 1 | Heikkinen | 1 | 1 | 1 |
| 1 | Futrelle | 2 | 1 | 0.5 |
| 0 | Allen | 2 | 1 | 0.5 |
| 0 | Moran | 3 | 1 | 0.333333333 |
| 0 | McCarthy | 1 | 0 | 0 |
| 0 | Palsson | 4 | 0 | 0 |
+----------+-----------+------------+-------------+--------------------+
and DF2:
+----------+-----------+------------+-------------+--------------------+
| Survived | Surname | FamilySize | NumSurvived | FamilySurvivalRate |
+----------+-----------+------------+-------------+--------------------+
| 0 | Braund | 2 | 0 | |
| 1 | Cumings | 1 | 1 | |
| 1 | Heikkinen | 1 | 1 | |
| 1 | Futrelle | 2 | 1 | |
| 0 | Allen | 2 | 1 | |
| 0 | Moran | 3 | 1 | |
| 0 | McCarthy | 1 | 0 | |
| 0 | Palsson | 4 | 0 | |
+----------+-----------+------------+-------------+--------------------+
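For each surname in DF2, I need to look up that surname's FamilySurvivalRate in DF1 and put the value into DF2. If the surname is not in DF1, the value must be 0.
Thanks.
You need to merge the two DataFrames based on the entries in DF2, then fill the missing values with 0: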
(
    df2
    # Remove FamilySurvivalRate from df2, as it is not of interest
    .drop(columns=["FamilySurvivalRate"])
    # Retrieve possibly existing values from df1
    .merge(df1, how="left")
    # Fill missing values with 0
    .fillna({"FamilySurvivalRate": 0})
)
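You can try the following (it compares the two frames row by row, so it only works when DF1 and DF2 share the same index):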
DF2.loc[DF2['Surname']==DF1['Surname'],['FamilySurvivalRate']] = DF1['FamilySurvivalRate']
Use map with a Series created from df1, and replace the unmatched values with fillna:
print (df2)
Survived Surname FamilySize NumSurvived
0 0 Braund 2 0
1 1 Cumings1 1 1 <- change surname for no match
2 1 Heikkinen 1 1
3 1 Futrelle 2 1
4 0 Allen 2 1
5 0 Moran 3 1
6 0 McCarthy 1 0
7 0 Palsson 4 0
s = df1.set_index('Surname')['FamilySurvivalRate']
df2['FamilySurvivalRate'] = df2['Surname'].map(s).fillna(0)
print (df2)
Survived Surname FamilySize NumSurvived FamilySurvivalRate
0 0 Braund 2 0 0.000000
1 1 Cumings1 1 1 0.000000
2 1 Heikkinen 1 1 1.000000
3 1 Futrelle 2 1 0.500000
4 0 Allen 2 1 0.500000
5 0 Moran 3 1 0.333333
6 0 McCarthy 1 0 0.000000
7 0 Palsson 4 0 0.000000
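Try this; I hope it solves your problem:
import pandas as pd
df2 = df2.drop('FamilySurvivalRate', axis=1)
# how='left' keeps every row of df2; surnames missing from df1 get NaN, which is then filled with 0
df2 = pd.merge(left=df2, right=df1[['Surname','FamilySurvivalRate']], on='Surname', how='left')
df2['FamilySurvivalRate'] = df2['FamilySurvivalRate'].fillna(0)
df2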
I think the same result can also be achieved with merge().
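Are the two DataFrames the same size?
@jezrael - No, and there are duplicate surnames, but every duplicate surname has the same family survival rate.
@ZackJoubert - It depends on what you need. If you need the first value: s = df1.drop_duplicates('Surname').set_index('Surname')['FamilySurvivalRate']. If you need the mean: s = df1.groupby('Surname')['FamilySurvivalRate'].mean()
@ZackJoubert Not exactly, because you need to remove duplicates by Surname, which becomes the index of the Series. So first drop duplicates by column, then create the index with set_index.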
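The two options from the comments can be sketched end to end. This is only a minimal sketch on made-up data: the small frames, the unmatched surname Smith, and the column names RateFirst and RateMean are hypothetical and exist just to show duplicate and missing surnames.

```python
import pandas as pd

# Hypothetical sample data: "Moran" appears twice in df1 with the same rate.
df1 = pd.DataFrame({
    "Surname": ["Braund", "Moran", "Moran", "Futrelle"],
    "FamilySurvivalRate": [0.0, 0.25, 0.25, 0.5],
})
df2 = pd.DataFrame({"Surname": ["Moran", "Smith", "Futrelle", "Moran"]})

# Option 1: keep the first value per surname so the lookup index is unique.
first = df1.drop_duplicates("Surname").set_index("Surname")["FamilySurvivalRate"]

# Option 2: average the values per surname (groupby yields a unique index).
mean = df1.groupby("Surname")["FamilySurvivalRate"].mean()

# Look up each surname; surnames missing from df1 become NaN, then 0.
df2["RateFirst"] = df2["Surname"].map(first).fillna(0)
df2["RateMean"] = df2["Surname"].map(mean).fillna(0)
```

When every duplicate surname carries the same rate, as in the question, both options give the same result. They differ only if the duplicates disagree.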
df2.merge(df1[["Surname","FamilySurvivalRate"]],how ='left', on = "Surname").fillna(0)
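Because DF1 may contain duplicate surnames, a plain left merge can return more rows than DF2 has. The sketch below shows one way to guard against that, assuming one rate per surname is wanted. The sample frames and the name lookup are hypothetical; validate="many_to_one" is an optional check that makes merge raise an error if the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical sample data: "Moran" is duplicated in df1, "Smith" is absent.
df1 = pd.DataFrame({
    "Surname": ["Braund", "Moran", "Moran", "Futrelle"],
    "FamilySurvivalRate": [0.0, 0.25, 0.25, 0.5],
})
df2 = pd.DataFrame({
    "Surname": ["Moran", "Smith", "Futrelle"],
    "FamilySize": [3, 1, 2],
})

# Keep one row per surname so the merge cannot multiply rows of df2.
lookup = df1[["Surname", "FamilySurvivalRate"]].drop_duplicates("Surname")

# Left merge keeps every row of df2; validate checks lookup keys are unique.
result = df2.merge(lookup, how="left", on="Surname", validate="many_to_one")

# Fill only the looked-up column, leaving other columns untouched.
result = result.fillna({"FamilySurvivalRate": 0})
```

Filling only FamilySurvivalRate, rather than calling fillna(0) on the whole frame, avoids overwriting genuine missing values in other columns.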