Lining up debits and credits in pandas?
I have a number of debit and credit rows in a pandas dataframe (some sample data is below). I am trying to identify the date/party pairs, and the amounts, at which debits and credits offset each other. For example, on 9/1 you can see debit and credit transactions with Wells that offset.

What I tried was to create a separate debit dataframe and credit dataframe and then merge the two on date/party:
import pandas as pd

df = pd.DataFrame({'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
                   'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
                   'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
                   'Amount': [4, -4, 4, 4, -4, 4, 4]})
debit_df = df.loc[df['Debit/Credit'] == 'Debit']
credit_df = df.loc[df['Debit/Credit'] == 'Credit']
# merging on Date/Party alone pairs every debit with every credit in the group
offset_df = debit_df.merge(credit_df, on=['Date', 'Party'])
matching_trans = offset_df.loc[offset_df['Amount_x'] == offset_df['Amount_y'].abs()]
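The problem with this approach is that I obviously get a Cartesian product when there are multiple similar transactions, as with Wells. Is there a way to identify the matching pairs for Wells (i.e. debit 4, credit 4) and the number of times they occur? My data is much larger, but in this example only 1 result for Wells should be returned in the final matching_trans dataframe.

If you only need the number of times this happens, you can compare the counts of matching instances. First, count the like amounts of debits and credits for each date/party: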
debit_df = df.loc[df['Debit/Credit'] == 'Debit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()
credit_df = df.loc[df['Debit/Credit'] == 'Credit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()
Then flip the sign of the credit amounts so that they can be matched against the debit amounts:
credit_df.rename(columns={'Amount':'Credit_Amount'}, inplace=True)
credit_df['Amount'] = -credit_df['Credit_Amount']
Finally, merge the two dfs on date, party and amount, drop the NAs, and find the offsets:
matching_trans = debit_df.merge(credit_df, on=['Date', 'Party', 'Amount'], how='left').dropna(axis=0)
matching_trans.rename(columns={'Amount': 'Debit_Amount', 'Debit/Credit_x': 'Debit_count',
                               'Debit/Credit_y': 'Credit_count'}, inplace=True)
matching_trans['offset_count'] = matching_trans.apply(lambda x: min(x.Credit_count, x.Debit_count),axis=1)
The offset_count column gives you the number of offsets for each date/party/amount combination.
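The counting steps above can also be sketched in a single pass. This is only one possible variant, not the answer's own code: it groups on the absolute amount so a debit of 4 and a credit of -4 share a key, and the names AbsAmount, counts and offset_counts are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
                   'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
                   'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
                   'Amount': [4, -4, 4, 4, -4, 4, 4]})

# Count rows per (date, party, absolute amount, side); AbsAmount is an
# illustrative helper column, not part of the original data.
counts = (
    df.assign(AbsAmount=df['Amount'].abs())
      .groupby(['Date', 'Party', 'AbsAmount', 'Debit/Credit'])
      .size()
      .unstack('Debit/Credit', fill_value=0)
)

# Make sure both sides exist as columns even if one never occurs in the data.
counts = counts.reindex(columns=['Debit', 'Credit'], fill_value=0)

# The number of offsetting pairs is limited by the scarcer side.
counts['offset_count'] = counts[['Debit', 'Credit']].min(axis=1)

# Keep only the groups where at least one debit/credit pair offsets.
offset_counts = counts[counts['offset_count'] > 0].reset_index()
print(offset_counts)
```

On the sample data this should leave one row for Wells and one for BOA, each with an offset_count of 1; Chase drops out because it has no credit.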

Here is a way to identify the matching pairs. It is long, but not complicated. Make a defaultdict for the debits and one for the credits:
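- The key is Date + Party + Amount (with the sign of the credit amount flipped)
- The value is a list of unique IDs (I call it seq_num, but it is just the original index)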
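from collections import defaultdict

# Assumed setup (not shown in the answer): parse the dates, rename
# 'Debit/Credit' to a valid attribute name, and keep the index as seq_num
df['Date'] = pd.to_datetime(df['Date'])
df = df.rename(columns={'Debit/Credit': 'Debit_Credit'})
df['seq_num'] = df.index

# make a default dictionary for debits
# key => (Date + Party + Amount)
# value => list of seq_num
# same for credits (except use -1 * Amount)
debits = defaultdict(list)
credits = defaultdict(list)
for row in df.itertuples():
    if row.Debit_Credit == 'Debit':
        key = (row.Date, row.Party, row.Amount)
        debits[key].append(row.seq_num)
    elif row.Debit_Credit == 'Credit':
        key = (row.Date, row.Party, (-1) * row.Amount)
        credits[key].append(row.seq_num)
    else:
        continue  # can't get here!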
Now iterate over the debits dict. If the key also exists in the credits dict, we have found a matching pair, so move the sequence numbers to the offsets dict:
offsets = defaultdict(list)
for key, value in debits.items():
    # is this key also in credits?
    if key in credits:
        # pair off debits and credits until one side runs out
        while value and credits[key]:
            print(key, 'found offset!')
            debit_seq_num = value.pop()
            credit_seq_num = credits[key].pop()
            offsets[key].append((debit_seq_num, credit_seq_num))
Finally, we can print a small report by iterating over each dict:
# print report
print('debits')
for key, value in debits.items():
    if value:
        print(' ', key, value)
print('credits')
for key, value in credits.items():
    if value:
        print(' ', key, value)
print('offsets')
for key, value in offsets.items():
    if value:
        print(' ', key, value)
debits
  (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [0, 2]
  (Timestamp('2020-09-03 00:00:00'), 'Chase', 4) [6]
credits
offsets
  (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [(3, 1)]
  (Timestamp('2020-09-02 00:00:00'), 'BOA', 4) [(5, 4)]
The offsets dict gives the pairs of sequence numbers that offset each other. Note that the union of the debits, credits and offsets is the same as the original dataframe (nothing is double counted and nothing is lost).
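The same pairing can also be sketched without an explicit loop. This is a different technique from the dict-based code above, offered only as an assumption-laden alternative: each row is numbered within its (date, party, absolute amount, side) group using groupby(...).cumcount(), and debits are merged with credits on that number as well, so the n-th debit can only meet the n-th credit and no Cartesian product arises. The names work, AbsAmount, occurrence and pairs are illustrative, and seq_num is again just the original index.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
                   'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
                   'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
                   'Amount': [4, -4, 4, 4, -4, 4, 4]})

# Keep the original index as seq_num so matched rows can be traced back.
work = df.reset_index().rename(columns={'index': 'seq_num'})
work['AbsAmount'] = work['Amount'].abs()

# Number each row within its (date, party, amount, side) group: 0, 1, 2, ...
work['occurrence'] = work.groupby(
    ['Date', 'Party', 'AbsAmount', 'Debit/Credit']
).cumcount()

keys = ['Date', 'Party', 'AbsAmount', 'occurrence']
debits = work.loc[work['Debit/Credit'] == 'Debit', keys + ['seq_num']]
credits = work.loc[work['Debit/Credit'] == 'Credit', keys + ['seq_num']]

# Inner merge: the n-th debit pairs only with the n-th credit of the same key.
pairs = debits.merge(credits, on=keys, suffixes=('_debit', '_credit'))

# Rows whose seq_num appears in neither pair column are left unmatched.
matched = set(pairs['seq_num_debit']) | set(pairs['seq_num_credit'])
unmatched = work.loc[~work['seq_num'].isin(matched)]
print(pairs)
print(unmatched)
```

Unlike the pop()-based loop, which takes the last debit first, this sketch pairs the earliest rows first, so on the sample data Wells is matched as debit 0 with credit 1 rather than debit 3 with credit 1. The number of offsets per key is the same either way.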

Hello - this approach works, but when you have more than 1 duplicate value it does not identify which pairs to match; it just keeps repeating the values.

@TomWatson You are absolutely right, I missed that. I replaced the whole answer with a different one. Check and see if this works.