Python: iterative merging with a values table, conditional on values between intervals (pandas)
I am trying to merge two tables where the rows of the left-hand table stay fixed and its columns get updated. If a value on the right is the highest so far (i.e., above the current value on the left) but below a separately set threshold, the left table's column takes the value from the right.

The threshold is set by the "Snapshot" column; the "Latest value found" column holds the highest value observed so far (within the threshold).

For memory efficiency, the process works on many small chunks of data, so it needs to iterate over a list of DataFrames. In each DataFrame the origin is recorded in the "Table ID" column; when the main DataFrame finds a value, it stores that origin in its "Found in" column.

The example tables (first data chunk, result after the first merge, second data chunk, result after the second merge) and the code are shown below.
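The update rule described above can be sketched for a single comparison. This is a minimal, hypothetical miniature: `threshold`, `current`, and `candidate` are stand-in names for the "Snapshot", "Latest value found", and incoming "Snapshot timestamp" values, not the actual columns.

```python
import pandas as pd

# Hypothetical miniature of the rule: take the candidate value if it is
# newer than the current best AND still below the row's threshold.
row = pd.Series({
    "threshold": pd.Timestamp("2019-08-31"),   # 'Snapshot' column
    "current": pd.Timestamp("2019-02-28"),     # 'Latest value found'
    "candidate": pd.Timestamp("2019-04-30"),   # incoming 'Snapshot timestamp'
})

take_candidate = (
    row["candidate"] < row["threshold"]
    and (pd.isna(row["current"]) or row["candidate"] > row["current"])
)
new_value = row["candidate"] if take_candidate else row["current"]
print(new_value)  # 2019-04-30: newer than current and within the threshold
```

The same boolean expression, applied column-wise instead of per row, is what a vectorized version of the merge would evaluate.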
Here is my workaround, although if there were only two tables I would try not to do this in a loop. I dropped your "Idx" column from the joined tables.
df_list = [df, Table1, Table2]  # df, Table1, Table2 as defined in the question's setup code
main_df = df_list[0]
count_ = 0
for i in df_list[1:]:
    main_df = main_df.merge(i, how='left', on='ID').sort_values(
        by=['ID', 'Snapshot_timestamp'], ascending=[True, False])
    main_df['rownum'] = main_df.groupby(['ID']).cumcount() + 1
    if count_ < 1:
        # First chunk: no previous value to compare against.
        main_df = main_df[main_df['rownum'] == 1].drop(columns=['rownum', 'Latest_value_found', 'Found_in'])
        main_df['Latest_value_found'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'],
                                                 main_df['Snapshot_timestamp'], pd.NaT)
        main_df['Found_in'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'],
                                       main_df['Table_ID'], np.nan)
        main_df = main_df.drop(columns=['Snapshot_timestamp', 'Table_ID']).reset_index(drop=True)
        count_ += 1
    else:
        main_df = main_df[main_df['rownum'] == 1].drop(columns='rownum').reset_index(drop=True)
        this_table = []
        this_date = []
        for idx in main_df.index:  # renamed from 'i' to avoid shadowing the chunk loop variable
            curr_snapshot = pd.to_datetime(main_df.loc[idx, 'Snapshot'])
            curr_latest_val = pd.to_datetime(main_df.loc[idx, 'Latest_value_found'])
            curr_foundin = main_df.loc[idx, 'Found_in']
            next_foundin = main_df.loc[idx, 'Table_ID']
            next_snapshot = pd.to_datetime(main_df.loc[idx, 'Snapshot_timestamp'])
            if curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val == next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val > next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val < next_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            elif pd.isnull(curr_latest_val) and next_snapshot < curr_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            else:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
        main_df = main_df.drop(columns=['Latest_value_found', 'Found_in', 'Table_ID', 'Snapshot_timestamp'])
        main_df = pd.concat([main_df, pd.Series(this_date), pd.Series(this_table)],
                            axis=1).rename(columns={0: 'Latest_value_found', 1: 'Found_in'})
        count_ += 1
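The inner row-by-row loop above can usually be replaced with boolean masks. Below is a hedged sketch of the same update on a toy frame standing in for the merged result of one chunk; the data values are hypothetical, and only the column names follow the code above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for main_df after merging one chunk (hypothetical data).
main_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Snapshot": pd.to_datetime(["2019-08-31", "2019-08-31", "2019-05-31"]),
    "Latest_value_found": pd.to_datetime(["2019-02-28", None, None]),
    "Found_in": ["Table1", None, None],
    "Snapshot_timestamp": pd.to_datetime(["2019-04-30", "2019-02-28", "2019-08-31"]),
    "Table_ID": ["Table2", "Table2", "Table2"],
})

# Candidate is eligible if below the threshold, and better if it beats the
# current value (or the current value is missing).
eligible = main_df["Snapshot_timestamp"] < main_df["Snapshot"]
better = main_df["Latest_value_found"].isna() | (
    main_df["Snapshot_timestamp"] > main_df["Latest_value_found"]
)
take = eligible & better

# Keep the current value where ~take holds, otherwise take the candidate.
main_df["Latest_value_found"] = main_df["Latest_value_found"].where(
    ~take, main_df["Snapshot_timestamp"])
main_df["Found_in"] = main_df["Found_in"].where(~take, main_df["Table_ID"])
main_df = main_df.drop(columns=["Snapshot_timestamp", "Table_ID"])
print(main_df)
```

The two masks cover all five branches of the loop in one pass, because the first two branches and the final else branch all keep the current value, while the remaining branches take the candidate exactly when it is eligible and better.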
Comments:
"Thanks. Your answer is helpful, but it still seems iterative. Maybe I am bringing the wrong mindset from SQL, but I think this will be inefficient. I used two tables only as an example (note the plural); in my case there are hundreds of them."
"Yes, you are right: if you need to merge hundreds of tables, you would probably want to pass some kind of function and avoid loops where applicable."
"Can you demonstrate how that would work? I am new to pandas and cannot use SQL at the moment."
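The loop-free direction suggested in the comments can be sketched as follows. This is only a sketch under assumed column names matching the setup code below: concatenate all chunks once, keep only rows below each customer's threshold, and take the per-customer maximum. Note that concatenating everything trades away the memory savings that motivated the chunking in the first place.

```python
import pandas as pd

# Hypothetical small chunks in the same shape as Table1/Table2.
chunk1 = pd.DataFrame({"Table_ID": ["Table1"] * 2, "Customer_ID": [1, 1],
                       "Snapshot_timestamp": pd.to_datetime(["2019-01-31", "2019-02-28"])})
chunk2 = pd.DataFrame({"Table_ID": ["Table2"] * 2, "Customer_ID": [1, 2],
                       "Snapshot_timestamp": pd.to_datetime(["2019-04-30", "2019-02-28"])})
main = pd.DataFrame({"ID": [1, 2],
                     "Snapshot": pd.to_datetime(["2019-08-31", "2019-05-31"])})

# One concat, one merge, one groupby: no per-table loop.
all_chunks = pd.concat([chunk1, chunk2], ignore_index=True)
merged = main.merge(all_chunks, left_on="ID", right_on="Customer_ID", how="left")
merged = merged[merged["Snapshot_timestamp"] < merged["Snapshot"]]  # within threshold
# For each ID, keep the row with the latest eligible timestamp.
best = merged.sort_values("Snapshot_timestamp").groupby("ID").tail(1)
result = main.merge(
    best[["ID", "Snapshot_timestamp", "Table_ID"]].rename(
        columns={"Snapshot_timestamp": "Latest_value_found", "Table_ID": "Found_in"}),
    on="ID", how="left")
print(result)
```

Customers with no eligible row simply come out of the final left merge with NaT/NaN, matching the NULL rows in the example result tables.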
First data chunk (Table1):
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1 | Table1 | 1 | Jan-14 |
| 2 | Table1 | 1 | Feb-14 |
| 3 | Table1 | 2 | Jan-14 |
| 4 | Table1 | 2 | Feb-14 |
| 5 | Table1 | 3 | Mar-14 |
+-----+----------+-------------+--------------------+
Result: left side after first merge:
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1 | Aug-18 | Feb-14 | Table1 |
| 2 | Aug-18 | Feb-14 | Table1 |
| 3 | May-18 | Mar-14 | Table1 |
| 4 | May-18 | NULL | NULL |
| 5 | May-18 | NULL | NULL |
+----+--------------------+--------------------+----------+
Second data chunk (Table2):
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
| 1 | Table2 | 1 | Mar-15 |
| 2 | Table2 | 1 | Apr-15 |
| 3 | Table2 | 2 | Feb-14 |
| 4 | Table2 | 3 | Feb-14 |
| 5 | Table2 | 4 | Aug-19 |
+-----+----------+-------------+--------------------+
Result: left side after second merge:
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
| 1 | Aug-18 | Apr-15 | Table2 |
| 2 | Aug-18 | Feb-14 | Table1 |
| 3 | May-18 | Mar-14 | Table1 |
| 4 | May-18 | NULL | NULL |
| 5 | May-18 | NULL | NULL |
+----+--------------------+--------------------+----------+
import pandas as pd
import numpy as np

# Main dataframe
df = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                   "Snapshot": ["2019-08-31", "2019-08-31", "2019-05-31", "2019-05-31", "2019-05-31"],  # the maximum interval that can be used
                   "Latest_value_found": [None, None, None, None, None],
                   "Found_in": [None, None, None, None, None]})

# Data chunks used for updates
Table1 = pd.DataFrame({"Idx": [1, 2, 3, 4, 5],
                       "Table_ID": ["Table1", "Table1", "Table1", "Table1", "Table1"],
                       "Customer_ID": [1, 1, 2, 2, 3],
                       "Snapshot_timestamp": ["2019-01-31", "2019-02-28", "2019-01-31", "2019-02-28", "2019-03-30"]})
Table2 = pd.DataFrame({"Idx": [1, 2, 3, 4, 5],
                       "Table_ID": ["Table2", "Table2", "Table2", "Table2", "Table2"],
                       "Customer_ID": [1, 1, 2, 3, 4],
                       "Snapshot_timestamp": ["2019-03-31", "2019-04-30", "2019-02-28", "2019-02-28", "2019-08-31"]})
list_of_data_chunks = [Table1, Table2]

# work: iteration
for data_chunk in list_of_data_chunks:
    pass  # here the merging is performed iteratively