Python: iterative merge of value tables, conditional on values between intervals (pandas)


I am trying to merge two tables so that the rows on the left stay fixed while its columns get updated. If the value on the right is the highest seen so far (i.e., higher than the current value on the left) but still below a separately set threshold, the left table's column takes the value from the right.

The threshold is set by the "Snapshot" column; the "Latest value found" column holds the highest value observed so far (within the threshold).

For memory efficiency, the process works on many small chunks of data and needs to iterate over a list of DataFrames. Within each chunk, the origin is recorded in the "Table ID" column. When the main DataFrame picks up a value, it stores that origin in its "Found in" column.
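The update rule can be sketched with boolean masks. This is a minimal illustration with made-up data, using the column names from this question ("Snapshot" as the threshold, "Latest_value_found" as the running maximum):

```python
import pandas as pd

# Made-up rows: ID 1 already holds Feb-14, ID 2 holds nothing yet.
left = pd.DataFrame({
    "ID": [1, 2],
    "Snapshot": pd.to_datetime(["2019-08-31", "2019-05-31"]),      # threshold
    "Latest_value_found": pd.to_datetime(["2014-02-28", pd.NaT]),  # best so far
})
candidate = pd.Series(pd.to_datetime(["2015-04-30", "2019-08-31"]))  # from the right side

# Take the candidate only if it beats the current value AND stays below the threshold.
better = candidate > left["Latest_value_found"].fillna(pd.Timestamp.min)
within = candidate < left["Snapshot"]
take = better & within
left.loc[take, "Latest_value_found"] = candidate[take]
# ID 1 is updated to 2015-04-30; ID 2's candidate exceeds its threshold and is rejected.
```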

Example (tables and setup code below): main table (left side) · first data chunk · result: left side after the first merge · second data chunk · result: left side after the second merge · code.
Here is my workaround, although if there were only two tables I would try not to do this in a loop. I dropped your "Idx" column from the joined tables.

import pandas as pd
import numpy as np

df_list = [df, Table1, Table2]
main_df = df_list[0]

count_ = 0
for i in df_list[1:]:
    main_df = main_df.merge(i, how = 'left', on = 'ID').sort_values(by = ['ID','Snapshot_timestamp'], ascending = [True,False])
    main_df['rownum'] = main_df.groupby(['ID']).cumcount()+1
    if count_ < 1:
        main_df = main_df[main_df['rownum'] == 1].drop(columns = ['rownum','Latest_value_found','Found_in'])
        main_df['Latest_value_found'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Snapshot_timestamp'], pd.NaT)
        main_df['Found_in'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Table_ID'], np.nan)
        main_df = main_df.drop(columns = ['Snapshot_timestamp','Table_ID']).reset_index(drop = True)
        count_ += 1
    else:
        main_df = main_df[main_df['rownum']==1].drop(columns = 'rownum').reset_index(drop = True)
        this_table = []
        this_date = []
        for i in main_df.index:
            curr_snapshot = pd.to_datetime(main_df.loc[i,'Snapshot'])
            curr_latest_val = pd.to_datetime(main_df.loc[i,'Latest_value_found'])
            curr_foundin = main_df.loc[i,'Found_in']
            next_foundin = main_df.loc[i,'Table_ID']
            next_snapshot = pd.to_datetime(main_df.loc[i,'Snapshot_timestamp'])
            if curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val == next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val > next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val < next_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            elif pd.isnull(curr_latest_val) and next_snapshot < curr_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            else:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)

        main_df = main_df.drop(columns = ['Latest_value_found','Found_in','Table_ID','Snapshot_timestamp'])
        main_df = pd.concat([main_df,pd.Series(this_date),pd.Series(this_table)], axis = 1).rename(columns = {0:'Latest_value_found',1:'Found_in'})
        count_ += 1
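If the inner row loop becomes a bottleneck, the same per-chunk update can usually be expressed with vectorized column operations. The following is a sketch using the column names from the setup in this question; `merge_chunk` is a hypothetical helper, not the code above:

```python
import pandas as pd

def merge_chunk(main_df, chunk):
    """One vectorized update step (a hypothetical sketch, not the author's code)."""
    # Keep each customer's newest row from the chunk (ISO date strings sort correctly).
    best = (chunk.sort_values("Snapshot_timestamp")
                 .groupby("Customer_ID", as_index=False).last())
    merged = main_df.merge(best, left_on="ID", right_on="Customer_ID", how="left")
    cand = pd.to_datetime(merged["Snapshot_timestamp"])
    curr = pd.to_datetime(merged["Latest_value_found"])
    thr = pd.to_datetime(merged["Snapshot"])
    # Accept the chunk value when it is below the threshold and beats the current value.
    take = (cand < thr) & (curr.isna() | (cand > curr))
    merged["Latest_value_found"] = curr.where(~take, cand)
    merged["Found_in"] = merged["Found_in"].where(~take, merged["Table_ID"])
    return merged[main_df.columns]
```

Applied over the chunk list with a plain loop (or `functools.reduce`), this replaces the row-by-row branching with a handful of column operations per chunk.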
Thanks, your answer is helpful, but it still seems "iterative". Maybe I have the wrong mindset coming from SQL, but I think this will be inefficient. I used two tables only as an example (note the plural); in my case there are hundreds.
Yes, you're right: if you need to merge hundreds of tables, you probably want to pass some kind of function and avoid loops where applicable.
Could you demonstrate how that would work? I'm very new to pandas and can't use SQL at the moment.
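Regarding the comment about hundreds of tables: the run then reduces to folding one per-chunk update function over the list, so only the main table and the current chunk are held at once. A minimal sketch, where `merge_one` stands for any per-chunk update step (hypothetical wiring):

```python
from functools import reduce

def fold_chunks(main_df, chunks, merge_one):
    # merge_one(accumulated_main, chunk) -> updated main table;
    # reduce applies it chunk by chunk, carrying only the accumulator.
    return reduce(merge_one, chunks, main_df)
```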
First data chunk (Table1):
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
|   1 | Table1   |           1 | Jan-14             |
|   2 | Table1   |           1 | Feb-14             |
|   3 | Table1   |           2 | Jan-14             |
|   4 | Table1   |           2 | Feb-14             |
|   5 | Table1   |           3 | Mar-14             |
+-----+----------+-------------+--------------------+
Result: left side after the first merge:
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
|  1 | Aug-18             | Feb-14             | Table1   |
|  2 | Aug-18             | Feb-14             | Table1   |
|  3 | May-18             | Mar-14             | Table1   |
|  4 | May-18             | NULL               | NULL     |
|  5 | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
Second data chunk (Table2):
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
|   1 | Table2   |           1 | Mar-15             |
|   2 | Table2   |           1 | Apr-15             |
|   3 | Table2   |           2 | Feb-14             |
|   4 | Table2   |           3 | Feb-14             |
|   5 | Table2   |           4 | Aug-19             |
+-----+----------+-------------+--------------------+
Result: left side after the second merge:
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
|  1 | Aug-18             | Apr-15             | Table2   |
|  2 | Aug-18             | Feb-14             | Table1   |
|  3 | May-18             | Mar-14             | Table1   |
|  4 | May-18             | NULL               | NULL     |
|  5 | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
import pandas as pd
import numpy as np

# Main dataframe
df = pd.DataFrame({"ID": [1,2,3,4,5],
                  "Snapshot": ["2019-08-31", "2019-08-31","2019-05-31","2019-05-31","2019-05-31"],  # the maximum interval than can be used
                   "Latest_value_found": [None,None,None,None,None],
                   "Found_in": [None,None,None,None,None]}
)

# Data chunks used for updates
Table1 = pd.DataFrame({"Idx": [1,2,3,4,5],
                  "Table_ID": ["Table1", "Table1", "Table1", "Table1", "Table1"],
                   "Customer_ID": [1,1,2,2,3],
                   "Snapshot_timestamp": ["2019-01-31","2019-02-28","2019-01-31","2019-02-28","2019-03-30"]}
)
Table2 = pd.DataFrame({"Idx": [1,2,3,4,5],
                  "Table_ID": ["Table2", "Table2", "Table2", "Table2", "Table2"],
                   "Customer_ID": [1,1,2,3,4],
                   "Snapshot_timestamp": ["2019-03-31","2019-04-30","2019-02-28","2019-02-28","2019-08-31"]}
)

list_of_data_chunks = [Table1, Table2]

# work: iteration
for data_chunk in list_of_data_chunks:
    pass
    # here the merging is performed iteratively
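With hundreds of chunks it may also help not to materialize `list_of_data_chunks` at all. If each chunk lives in its own CSV file (a hypothetical layout; file pattern assumed), a generator yields them one at a time and keeps memory flat:

```python
import glob
import pandas as pd

def iter_chunks(pattern):
    # Hypothetical layout: one CSV file per data chunk.
    # Yield chunks lazily instead of building a full list.
    for path in sorted(glob.glob(pattern)):
        yield pd.read_csv(path)
```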