Python: Lining up debits and credits in pandas?

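I have a number of debit and credit rows in a pandas dataframe (sample data below).

I'm trying to identify the Date/Party pairings and the amounts where they offset each other. For example, on 9/1 you can see a debit and a credit transaction with Wells that offset.

What I tried was to create a separate debit dataframe and credit dataframe, then merge the two on Date/Party: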

import pandas as pd

df = pd.DataFrame({'Date': ['9/1/2020','9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
                  'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
                  'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
                  'Amount': [4, -4, 4, 4, -4, 4, 4]})
# split into debits and credits, then pair them up on Date/Party
debit_df = df.loc[df['Debit/Credit'] == 'Debit']
credit_df = df.loc[df['Debit/Credit'] == 'Credit']
offset_df = debit_df.merge(credit_df, on=['Date', 'Party'])
# keep the pairs whose amounts cancel out
matching_trans = offset_df.loc[offset_df['Amount_x'] == offset_df['Amount_y'].abs()]

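The problem with this approach is that I obviously get a Cartesian product wherever there are multiple similar transactions, as with Wells. Is there a way to identify the matching pairs for Wells (i.e. a debit of 4 and a credit of -4) and how many times this occurs? My data is much larger, but in this example only 1 result would be returned in the final matching_trans dataframe.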
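To make the Cartesian product concrete, here is a small sketch that reruns the merge from the question on the sample data and counts the rows it produces. The variable `wells_rows` is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
    'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
    'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
    'Amount': [4, -4, 4, 4, -4, 4, 4],
})

debit_df = df.loc[df['Debit/Credit'] == 'Debit']
credit_df = df.loc[df['Debit/Credit'] == 'Credit']

# Every debit is joined to every credit that shares its Date and Party.
offset_df = debit_df.merge(credit_df, on=['Date', 'Party'])

# Wells has 3 debits and 1 credit on 9/1, so the single credit appears 3 times.
wells_rows = offset_df.loc[offset_df['Party'] == 'Wells']
print(len(wells_rows))
```

The one Wells credit ends up paired with each of the three Wells debits, which is the overcounting described above.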

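If you only need the number of times this happens, you can compare the counts of matching instances. First, count the rows with the same amount for each Date/Party, separately for debits and credits: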

debit_df = df.loc[df['Debit/Credit'] == 'Debit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()
credit_df = df.loc[df['Debit/Credit'] == 'Credit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()
Then flip the sign of one of the amounts so that it can also be used for matching:

credit_df.rename(columns={'Amount':'Credit_Amount'}, inplace=True)
credit_df['Amount'] = -credit_df['Credit_Amount']
Finally, merge the two dataframes on Date, Party and Amount, drop the NAs and find the offsets:

matching_trans = debit_df.merge(credit_df, on=['Date', 'Party', 'Amount'], how='left').dropna(axis=0)
matching_trans.rename(columns={'Amount':'Debit_Amount', 'Debit/Credit_x':'Debit_count',
                               'Debit/Credit_y':'Credit_count'}, inplace=True)
matching_trans['offset_count'] = matching_trans.apply(lambda x: min(x.Credit_count, x.Debit_count),axis=1)

The offset_count column gives you the number of offsets for each Date/Party/Amount combination.
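The same counts can also be obtained in one pass. The following is a minimal sketch, not the answer's own code. It assumes debits are always positive and credits always negative, so the absolute amount can serve as the matching key. The names `AbsAmount`, `counts` and `offset_counts` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
    'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
    'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
    'Amount': [4, -4, 4, 4, -4, 4, 4],
})

# Count rows per (Date, Party, absolute amount), with one column per type.
counts = (
    df.assign(AbsAmount=df['Amount'].abs())
      .groupby(['Date', 'Party', 'AbsAmount', 'Debit/Credit'])
      .size()
      .unstack('Debit/Credit', fill_value=0)
      .reindex(columns=['Debit', 'Credit'], fill_value=0)
)

# Each offset uses up one debit and one credit, so the smaller count wins.
counts['offset_count'] = counts[['Debit', 'Credit']].min(axis=1)
offset_counts = counts.loc[counts['offset_count'] > 0, 'offset_count'].reset_index()
print(offset_counts)
```

Groups with no credit at all (Chase in the sample data) get an offset count of 0 and are filtered out.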

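Here is a way to identify the matching pairs. It's long, but not complicated. Make a defaultdict for debits and one for credits:

  • key is Date + Party + Amount (flip the sign on credit amounts)
  • value is a unique ID (I call it seq_num, but it's just the original index)
Next: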

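from collections import defaultdict

import pandas as pd

# setup assumed by the loop below (not shown in the original answer):
# parsed dates, an attribute-friendly column name, and seq_num = original index
df['Date'] = pd.to_datetime(df['Date'])
df = df.rename(columns={'Debit/Credit': 'Debit_Credit'})
df['seq_num'] = df.index

# make a default dictionary for debits
# key => (Date + Party + Amount)
# value => list of seq_num
# same for credits (except use -1 * Amount)

debits = defaultdict(list)
credits = defaultdict(list)

for row in df.itertuples():
    if row.Debit_Credit == 'Debit':
        key = (row.Date, row.Party, row.Amount)
        debits[key].append(row.seq_num)
    elif row.Debit_Credit == 'Credit':
        key = (row.Date, row.Party, (-1) * row.Amount)
        credits[key].append(row.seq_num)
    else:
        continue # can't get here!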
Now iterate over the debits dict. If the key also exists in the credits dict, we have found a matching pair, so move the seq_nums to the offsets dict:

offsets = defaultdict(list)

for key, value in debits.items():
    # is this key also in credits?
    if key in credits:
        # pair off debits and credits until one side runs out
        while value and credits[key]:
            print(key, 'found offset!')
            debit_seq_num = value.pop()
            credit_seq_num = credits[key].pop()
            offsets[key].append((debit_seq_num, credit_seq_num))
Finally, we can print a little report by iterating over each dict:

# print report

print('debits')
for key, value in debits.items():
    if value:
        print('    ', key, value)
        
print('credits')
for key, value in credits.items():
    if value:
        print('    ', key, value)

print('offsets')
for key, value in offsets.items():
    if value:
        print('    ', key, value)

debits
     (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [0, 2]
     (Timestamp('2020-09-03 00:00:00'), 'Chase', 4) [6]
credits
offsets
     (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [(3, 1)]
     (Timestamp('2020-09-02 00:00:00'), 'BOA', 4) [(5, 4)]

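The offsets dict gives the pairs of seq_nums that offset each other. Note that the union of debits, credits and offsets is the same as the original dataframe (we didn't double count, and we didn't drop anything).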
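For larger data, the same one-to-one pairing can be sketched without a Python loop by numbering repeated rows with `groupby(...).cumcount()` and including that number in the merge key. This is an alternative technique, not the answer's method. It assumes debits are positive and credits negative, and the names `work`, `AbsAmount`, `occurrence`, `pairs` and `unmatched` are invented for this example. It pairs the earliest rows first, whereas the dict version pops the latest ones.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
    'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
    'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
    'Amount': [4, -4, 4, 4, -4, 4, 4],
})

# Keep the original index as seq_num so that pairs can be traced back.
work = df.reset_index().rename(columns={'index': 'seq_num'})
work['AbsAmount'] = work['Amount'].abs()

# Number repeated rows within each (Date, Party, AbsAmount, type): 0, 1, 2, ...
work['occurrence'] = work.groupby(
    ['Date', 'Party', 'AbsAmount', 'Debit/Credit']
).cumcount()

keys = ['Date', 'Party', 'AbsAmount', 'occurrence']
debits = work.loc[work['Debit/Credit'] == 'Debit', keys + ['seq_num']]
credits = work.loc[work['Debit/Credit'] == 'Credit', keys + ['seq_num']]

# The n-th debit can only join the n-th credit, so there is no Cartesian product.
pairs = debits.merge(credits, on=keys, suffixes=('_debit', '_credit'))

# Whatever was not paired is left over as an unmatched debit or credit.
matched = set(pairs['seq_num_debit']) | set(pairs['seq_num_credit'])
unmatched = work.loc[~work['seq_num'].isin(matched)]

print(pairs[['Date', 'Party', 'AbsAmount', 'seq_num_debit', 'seq_num_credit']])
print(unmatched[['seq_num', 'Date', 'Party', 'Debit/Credit', 'Amount']])
```

Because `occurrence` is part of the join key, each row can appear in at most one pair, so `len(pairs)` is also the total number of offsets.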

Hello - this approach works, but when you have more than 1 duplicate value it doesn't determine which pair to match; it will keep repeating values.

@TomWatson You're absolutely right, I missed that. I replaced the whole answer with a different one. Check and see if that works.