Python: comparing two DataFrames row by row, cell by cell
I have two DataFrames, df1 and df2, and want to do the following, storing the results in df3:
    for each row in df1:
        for each row in df2:
            create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
            for each cell (column) in df1:
                for the cell in df2 whose column name is the same as for the cell in df1:
                    compare the cells (using some comparing function func(a, b)) and,
                    depending on the result of the comparison, write the result into the
                    appropriate column of the "df1-1, df2-1" row of df3
For example, something like:
df1

    A    B     C       D
    foo  bar   foobar  7
    gee  whiz  herp    10

df2

    A    B    C       D
    zoo  car  foobar  8

df3

    df1-df2  A              B               C                    D
    foo-zoo  func(foo,zoo)  func(bar,car)   func(foobar,foobar)  func(7,8)
    gee-zoo  func(gee,zoo)  func(whiz,car)  func(herp,foobar)    func(10,8)
I started with the following:
    for r1 in df1.iterrows():
        for r2 in df2.iterrows():
            for c1 in r1:
                for c2 in r2:
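But I don't know where to go from here, and would appreciate some help.

So, to continue the discussion from the comments: you can use vectorization, which is one of the selling points of libraries like pandas or numpy. Ideally, you should never need to call iterrows(). To make my suggestion a bit more explicit: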
    import pandas as pd

    # with df1 and df2 provided as above, an example
    df3 = df1['A'] * 3 + df2['A']
    # recall that df2 only has the one row, so pandas aligns on the index
    # and row 1, which has no partner in df2, becomes NaN
    print(df3)
    # 0    foofoofoozoo
    # 1             NaN
    # Name: A, dtype: object

    # more generally
    # we know that df1 and df2 share column names, so we can initialize df3 with those names
    df3 = pd.DataFrame(columns=df1.columns)
    for colName in df1:
        df3[colName] = func(df1[colName], df2[colName])
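To make the column loop concrete, here is a minimal runnable sketch. It assumes two frames of the same shape (the second frame and the equality-based `func` are made up for illustration, they are not from the question), so that rows pair up cleanly by index:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
# Hypothetical second frame with the same number of rows as df1.
df2 = pd.DataFrame({"A": ["zoo", "gee"], "B": ["car", "whiz"],
                    "C": ["foobar", "derp"], "D": [8, 10]})


def func(a, b):
    """Hypothetical comparison: True where the two columns hold equal values."""
    return a == b


# One vectorized call per column instead of nested row/cell loops.
df3 = pd.DataFrame(columns=df1.columns)
for col_name in df1:
    df3[col_name] = func(df1[col_name], df2[col_name])

print(df3)
```

Each call to `func` receives two whole Series, so the comparison runs once per column rather than once per cell.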
Now you can even apply different functions to different columns, for example by creating lambda functions and zipping them with the column names:
    # some example functions
    colAFunc = lambda x, y: x + y
    colBFunc = lambda x, y: x - y
    # ....
    columnFunctions = [colAFunc, colBFunc, ...]  # one function per column, in column order
    # initialize df3 as above
    df3 = pd.DataFrame(columns=df1.columns)
    for func, colName in zip(columnFunctions, df1.columns):
        df3[colName] = func(df1[colName], df2[colName])
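A variation on the same idea, sketched under my own assumptions: keeping the functions in a dict keyed by column name avoids relying on the list order matching the column order. The data and the two lambdas below are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo", "gee"], "D": [8, 10]})

# Hypothetical per-column functions, keyed by column name.
column_functions = {
    "A": lambda x, y: x + "-" + y,  # string concatenation
    "D": lambda x, y: x - y,        # numeric difference
}

df3 = pd.DataFrame(columns=df1.columns)
for col_name, col_func in column_functions.items():
    df3[col_name] = col_func(df1[col_name], df2[col_name])

print(df3)
```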
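The only "gotcha" that comes to mind is that you need to make sure your functions are valid for the data in the columns. For example, if you did something like df1['A'] - df2['A'] (with the df1 and df2 you provided), it would raise a TypeError, because subtraction of two strings is undefined. Just something to be aware of.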
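A small sketch of that gotcha and one possible guard. The frames are hypothetical, and the string column is forced to object dtype, which is the case I am sure about; other string dtypes may report the failure differently.

```python
import pandas as pd

df1 = pd.DataFrame({"A": pd.Series(["foo", "gee"], dtype=object), "D": [7, 10]})
df2 = pd.DataFrame({"A": pd.Series(["zoo", "gee"], dtype=object), "D": [8, 10]})

# Subtracting string columns fails, because str - str is undefined in Python.
try:
    df1["A"] - df2["A"]
    subtraction_failed = False
except TypeError:
    subtraction_failed = True

# One possible guard: apply arithmetic only to the numeric columns.
numeric_cols = df1.select_dtypes(include="number").columns
diff = df1[numeric_cols] - df2[numeric_cols]

print(subtraction_failed)
print(diff)
```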
Edit, re: your comment: that's doable too. Iterate over the larger of the dfX.columns so that you don't run into a KeyError, and throw an if statement in there:
    # all the other jazz
    # let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
    # so iterate over df2 columns
    for colName in df2:
        if colName not in df1:
            df3[colName] = np.nan  # be sure to import numpy as np
        else:
            df3[colName] = func(df1[colName], df2[colName])
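If you don't know in advance which frame has more columns, a sketch along these lines might help. It iterates over the union of both column sets; the data and the equality-based `func` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
df2 = pd.DataFrame({"A": [1, 0], "B": [3, 4], "C": [0, 6], "D": [7, 8]})


def func(a, b):
    """Hypothetical comparison: True where the two columns hold equal values."""
    return a == b


# Union of both column sets, so a column missing from either frame is covered.
all_columns = df1.columns.union(df2.columns)
df3 = pd.DataFrame(index=df1.index, columns=all_columns)
for col_name in all_columns:
    if col_name in df1 and col_name in df2:
        df3[col_name] = func(df1[col_name], df2[col_name])
    else:
        df3[col_name] = np.nan  # column exists in only one frame

print(df3)
```

Giving df3 an explicit index up front means the scalar NaN fills every row regardless of which column is processed first.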
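Comments:

Since you're applying func to columns with the same name, couldn't you just iterate over those columns and use vectorization, e.g. df3['a'] = func(df1['a'], df2['a']), and so on?

@StarFox Interesting, so I could do this: for column in df3: df3[column] = func(df1[column], df2[column])?

Absolutely! That's the power of pandas/numpy (and of vectorization in general). I'll provide some examples below and we can go from there.

I think you could formulate a solution based on the Cartesian product of the two DataFrames; see here for a starting point:

@Svend That looks like a promising idea, thanks! I'll try StarFox's solution first, though.

Yes, this was very helpful and I've accepted it as the answer. Thank you very much for your time! Could this be modified to work when the number of columns isn't equal? That is, df1 might have columns that don't exist in df2; the comparison function should just output something like N/A.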
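For the Cartesian-product idea raised in the comments, a rough sketch could look like the following. It assumes pandas 1.2 or later (for `how="cross"`) and uses a hypothetical equality-based `func`; it pairs every row of df1 with every row of df2, which is what the pseudocode in the question asks for.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})


def func(a, b):
    """Hypothetical comparison: True where the two columns hold equal values."""
    return a == b


# Pair every row of df1 with every row of df2; shared column names get suffixes.
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# Label each pair by the 'A' values of its two source rows, then compare columns.
df3 = pd.DataFrame({"df1-df2": pairs["A_1"] + "-" + pairs["A_2"]})
for col_name in df1.columns:
    df3[col_name] = func(pairs[col_name + "_1"], pairs[col_name + "_2"])
df3 = df3.set_index("df1-df2")

print(df3)
```

The cross join grows as len(df1) * len(df2) rows, so this is only reasonable when both frames are fairly small.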