
Python: a column where each value depends on a query on another df


I am facing a tricky problem. I have a first dataframe holding clients (note that ClientID is not unique: the same ClientID can appear with different TestDates):

df1:

ClientID  TestDate
1A        2019-12-24
1B        2019-08-26
1B        2020-01-12
I have another "operations" dataframe indicating the dates and the clients involved:

df2:

LineNumber  ClientID  Date          Amount
1           1A        2020-01-12    50
2           1A        2019-09-24    15
3           1A        2019-12-25    20
4           1A        2018-12-30    30
5           1B        2018-12-30    60
6           1B        2019-12-12    40
I want to add columns to df1 containing the number of such rows and their mean Amount, but taking only the df2 rows whose Date < TestDate.

For example, for client 1A I would take only LineNumbers 2 and 4 (since LineNumbers 1 and 3 have a Date later than the TestDate), and get the following output for df1:

Expected df1:

ClientID  TestDate      NumberOp  MeanOp
1A        2019-12-24    2         22.5
1B        2019-08-26    1         60
1B        2020-01-12    2         50
Note: for the first row of client 1B, since the TestDate is 2019-08-26, only one operation is seen (the operation on LineNumber 6 was done on 2019-12-12, i.e. after the TestDate, so I do not take it into account).

I already have code that works, but it has to iterrows over my df1, which takes a very long time.

Current code (works, but slow):

for index, row in df1.iterrows():
    client_id = row['ClientID']
    test_date = row['TestDate']
    # keep only this client's operations that happened before the test date
    df2_known = df2.loc[df2['ClientID'] == client_id]
    df2_known = df2_known.loc[df2_known['Date'] < test_date]
    df1.loc[index, 'NumberOp'] = df2_known.shape[0]
    df1.loc[index, 'MeanOp'] = df2_known['Amount'].mean()
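For reference, the questioner's loop can be run end to end on the sample data above; the following self-contained sketch (parsing the date columns with pd.to_datetime is an assumption, since the question does not show the dtypes) reproduces the expected output:

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({
    'ClientID': ['1A', '1B', '1B'],
    'TestDate': pd.to_datetime(['2019-12-24', '2019-08-26', '2020-01-12']),
})
df2 = pd.DataFrame({
    'LineNumber': [1, 2, 3, 4, 5, 6],
    'ClientID': ['1A', '1A', '1A', '1A', '1B', '1B'],
    'Date': pd.to_datetime(['2020-01-12', '2019-09-24', '2019-12-25',
                            '2018-12-30', '2018-12-30', '2019-12-12']),
    'Amount': [50, 15, 20, 30, 60, 40],
})

# Row-by-row version from the question: count and average the operations
# of each client that happened strictly before that row's TestDate
for index, row in df1.iterrows():
    ops = df2[(df2['ClientID'] == row['ClientID']) & (df2['Date'] < row['TestDate'])]
    df1.loc[index, 'NumberOp'] = len(ops)
    df1.loc[index, 'MeanOp'] = ops['Amount'].mean()

print(df1)
```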

If I do the groupby and transform the way the answer gives them, my output has no rows at all for ClientID='5C', because that client has no matching Date, so the condition Date < TestDate never holds, and the rows are lost when I execute df = df[df['Date'] < df['TestDate']].

You can merge and transform:

df = df2.merge(df1, on=['ClientID'])
# filter based on condition
df = df[df['Date'] < df['TestDate']]
# get the mean and count into new columns
df['MeanOp'] = df.groupby(['ClientID'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID'])['Amount'].transform('count')
# drop duplicates and irrelevant columns
df = df.drop(columns=['Amount', 'Date', 'LineNumber']).drop_duplicates()
Edit: if you want to keep the df1 rows whose key has no match in df2:

df = df2.merge(df1, on=['ClientID'], how='right')
df = df[(df['Date'] < df['TestDate']) | (df['Date'].isnull())]
df['MeanOp'] = df.groupby(['ClientID'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID'])['Amount'].transform('count')
df = df.drop(columns=['Amount', 'Date', 'LineNumber']).drop_duplicates()
Update: per the edit on the question, if you want to group them by (ClientID, TestDate):
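The update's code block did not survive the page scrape. The following is a hedged reconstruction, assuming it simply extends the Edit above by grouping on both keys; the sample data is reproduced here so the sketch is self-contained:

```python
import pandas as pd

df1 = pd.DataFrame({
    'ClientID': ['1A', '1B', '1B'],
    'TestDate': pd.to_datetime(['2019-12-24', '2019-08-26', '2020-01-12']),
})
df2 = pd.DataFrame({
    'LineNumber': [1, 2, 3, 4, 5, 6],
    'ClientID': ['1A', '1A', '1A', '1A', '1B', '1B'],
    'Date': pd.to_datetime(['2020-01-12', '2019-09-24', '2019-12-25',
                            '2018-12-30', '2018-12-30', '2019-12-12']),
    'Amount': [50, 15, 20, 30, 60, 40],
})

# Right merge keeps df1 rows even for clients with no operations at all
df = df2.merge(df1, on=['ClientID'], how='right')
# Keep operations strictly before the TestDate (or clients with no operations)
df = df[(df['Date'] < df['TestDate']) | df['Date'].isnull()]
# Aggregate per (ClientID, TestDate) so repeated ClientIDs stay separate
df['NumberOp'] = df.groupby(['ClientID', 'TestDate'])['Amount'].transform('count')
df['MeanOp'] = df.groupby(['ClientID', 'TestDate'])['Amount'].transform('mean')
df = df.drop(columns=['Amount', 'Date', 'LineNumber']).drop_duplicates()
print(df.reset_index(drop=True))
```

Note the caveat raised in the comments below: an operation whose Date precedes two TestDates of the same client contributes to both groups.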


@BeamsAdept You're welcome. I added an edit to the post to cover that case.

Thank you very much for your solution, it works in the general case. I still have an issue: to ask the question I simplified my problem a bit, and now I am stuck on it again. In fact, my df1 can contain the same client several times (so ClientID is not unique) but with different TestDates (the date we use as the reference for the check operation). In that case your approach seems to run into a problem at merge time. That is entirely my fault for not being precise in my original question, sorry about that.

@BeamsAdept Nothing that cannot be fixed, it should still work. What exactly seems wrong with multiple TestDates for the same ClientID? What do you expect the mean/count to be: per ClientID, or per (ClientID, TestDate)? Please detail the real case so we can help you better. Thanks.

@BeamsAdept Simply groupby (ClientID, TestDate). I will add an edit. Note that, by your definition and the new example, a Date in df2 that is smaller than two TestDates of the same ClientID in df1 takes part in the mean/count of both TestDates. Is that what you want, or do you want to bucket the date intervals by TestDate?

@BeamsAdept Please check the update on the post and see if it solves your problem. Thanks.