Python: iterative merge of value tables, conditional on values between intervals (pandas)


I am trying to merge two tables so that the rows on the left stay fixed while its columns get updated. If the value on the right is the highest seen so far (i.e., higher than the current value on the left) but still below a separately set threshold, the left table's column takes the value from the right.

The threshold is set by the "Snapshot" column; the "Latest value found" column holds the highest value observed so far (within the threshold).

For memory efficiency, the process works on many small chunks of data and needs to iterate over a list of DataFrames. Within each chunk, the origin is recorded in the "Table ID" column. When the main DataFrame picks up a value, it stores that origin in its "Found in" column.
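The update rule can be sketched with boolean masks. This is a minimal illustration with made-up data, using the column names from this question ("Snapshot" as the threshold, "Latest_value_found" as the running maximum):

```python
import pandas as pd

# Made-up rows: ID 1 already holds Feb-14, ID 2 holds nothing yet.
left = pd.DataFrame({
    "ID": [1, 2],
    "Snapshot": pd.to_datetime(["2019-08-31", "2019-05-31"]),      # threshold
    "Latest_value_found": pd.to_datetime(["2014-02-28", pd.NaT]),  # best so far
})
candidate = pd.Series(pd.to_datetime(["2015-04-30", "2019-08-31"]))  # from the right side

# Take the candidate only if it beats the current value AND stays below the threshold.
better = candidate > left["Latest_value_found"].fillna(pd.Timestamp.min)
within = candidate < left["Snapshot"]
take = better & within
left.loc[take, "Latest_value_found"] = candidate[take]
# ID 1 is updated to 2015-04-30; ID 2's candidate exceeds its threshold and is rejected.
```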

Example (tables and setup code below): main table (left side) · first data chunk · result: left side after the first merge · second data chunk · result: left side after the second merge · code.
Here is my workaround, although if there were only two tables I would try not to do this in a loop. I dropped your "Idx" column from the joined tables.

import pandas as pd
import numpy as np

df_list = [df, Table1, Table2]
main_df = df_list[0]

count_ = 0
for i in df_list[1:]:
    main_df = main_df.merge(i, how = 'left', on = 'ID').sort_values(by = ['ID','Snapshot_timestamp'], ascending = [True,False])
    main_df['rownum'] = main_df.groupby(['ID']).cumcount()+1
    if count_ < 1:
        main_df = main_df[main_df['rownum'] == 1].drop(columns = ['rownum','Latest_value_found','Found_in'])
        main_df['Latest_value_found'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Snapshot_timestamp'], pd.NaT)
        main_df['Found_in'] = np.where(main_df['Snapshot'] > main_df['Snapshot_timestamp'], main_df['Table_ID'], np.nan)
        main_df = main_df.drop(columns = ['Snapshot_timestamp','Table_ID']).reset_index(drop = True)
        count_ += 1
    else:
        main_df = main_df[main_df['rownum']==1].drop(columns = 'rownum').reset_index(drop = True)
        this_table = []
        this_date = []
        for i in main_df.index:
            curr_snapshot = pd.to_datetime(main_df.loc[i,'Snapshot'])
            curr_latest_val = pd.to_datetime(main_df.loc[i,'Latest_value_found'])
            curr_foundin = main_df.loc[i,'Found_in']
            next_foundin = main_df.loc[i,'Table_ID']
            next_snapshot = pd.to_datetime(main_df.loc[i,'Snapshot_timestamp'])
            if curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val == next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val > next_snapshot:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)
            elif curr_snapshot > curr_latest_val and curr_snapshot > next_snapshot and curr_latest_val < next_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            elif pd.isnull(curr_latest_val) and next_snapshot < curr_snapshot:
                this_date.append(next_snapshot)
                this_table.append(next_foundin)
            else:
                this_date.append(curr_latest_val)
                this_table.append(curr_foundin)

        main_df = main_df.drop(columns = ['Latest_value_found','Found_in','Table_ID','Snapshot_timestamp'])
        main_df = pd.concat([main_df,pd.Series(this_date),pd.Series(this_table)], axis = 1).rename(columns = {0:'Latest_value_found',1:'Found_in'})
        count_ += 1
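If the inner row loop becomes a bottleneck, the same per-chunk update can usually be expressed with vectorized column operations. The following is a sketch using the column names from the setup in this question; `merge_chunk` is a hypothetical helper, not the code above:

```python
import pandas as pd

def merge_chunk(main_df, chunk):
    """One vectorized update step (a hypothetical sketch, not the author's code)."""
    # Keep each customer's newest row from the chunk (ISO date strings sort correctly).
    best = (chunk.sort_values("Snapshot_timestamp")
                 .groupby("Customer_ID", as_index=False).last())
    merged = main_df.merge(best, left_on="ID", right_on="Customer_ID", how="left")
    cand = pd.to_datetime(merged["Snapshot_timestamp"])
    curr = pd.to_datetime(merged["Latest_value_found"])
    thr = pd.to_datetime(merged["Snapshot"])
    # Accept the chunk value when it is below the threshold and beats the current value.
    take = (cand < thr) & (curr.isna() | (cand > curr))
    merged["Latest_value_found"] = curr.where(~take, cand)
    merged["Found_in"] = merged["Found_in"].where(~take, merged["Table_ID"])
    return merged[main_df.columns]
```

Applied over the chunk list with a plain loop (or `functools.reduce`), this replaces the row-by-row branching with a handful of column operations per chunk.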
Thanks, your answer is helpful, but it still seems "iterative". Maybe I have the wrong mindset coming from SQL, but I think this will be inefficient. I used two tables only as an example (note the plural); in my case there are hundreds.
Yes, you're right: if you need to merge hundreds of tables, you probably want to pass some kind of function and avoid loops where applicable.
Could you demonstrate how that would work? I'm very new to pandas and can't use SQL at the moment.
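Regarding the comment about hundreds of tables: the run then reduces to folding one per-chunk update function over the list, so only the main table and the current chunk are held at once. A minimal sketch, where `merge_one` stands for any per-chunk update step (hypothetical wiring):

```python
from functools import reduce

def fold_chunks(main_df, chunks, merge_one):
    # merge_one(accumulated_main, chunk) -> updated main table;
    # reduce applies it chunk by chunk, carrying only the accumulator.
    return reduce(merge_one, chunks, main_df)
```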
First data chunk (Table1):
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
|   1 | Table1   |           1 | Jan-14             |
|   2 | Table1   |           1 | Feb-14             |
|   3 | Table1   |           2 | Jan-14             |
|   4 | Table1   |           2 | Feb-14             |
|   5 | Table1   |           3 | Mar-14             |
+-----+----------+-------------+--------------------+
Result: left side after the first merge:
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
|  1 | Aug-18             | Feb-14             | Table1   |
|  2 | Aug-18             | Feb-14             | Table1   |
|  3 | May-18             | Mar-14             | Table1   |
|  4 | May-18             | NULL               | NULL     |
|  5 | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
Second data chunk (Table2):
+-----+----------+-------------+--------------------+
| Idx | Table ID | Customer ID | Snapshot timestamp |
+-----+----------+-------------+--------------------+
|   1 | Table2   |           1 | Mar-15             |
|   2 | Table2   |           1 | Apr-15             |
|   3 | Table2   |           2 | Feb-14             |
|   4 | Table2   |           3 | Feb-14             |
|   5 | Table2   |           4 | Aug-19             |
+-----+----------+-------------+--------------------+
Result: left side after the second merge:
+----+--------------------+--------------------+----------+
| ID | Snapshot timestamp | Latest value found | Found in |
+----+--------------------+--------------------+----------+
|  1 | Aug-18             | Apr-15             | Table2   |
|  2 | Aug-18             | Feb-14             | Table1   |
|  3 | May-18             | Mar-14             | Table1   |
|  4 | May-18             | NULL               | NULL     |
|  5 | May-18             | NULL               | NULL     |
+----+--------------------+--------------------+----------+
import pandas as pd
import numpy as np

# Main dataframe
df = pd.DataFrame({"ID": [1,2,3,4,5],
                  "Snapshot": ["2019-08-31", "2019-08-31","2019-05-31","2019-05-31","2019-05-31"],  # the maximum interval than can be used
                   "Latest_value_found": [None,None,None,None,None],
                   "Found_in": [None,None,None,None,None]}
)

# Data chunks used for updates
Table1 = pd.DataFrame({"Idx": [1,2,3,4,5],
                  "Table_ID": ["Table1", "Table1", "Table1", "Table1", "Table1"],
                   "Customer_ID": [1,1,2,2,3],
                   "Snapshot_timestamp": ["2019-01-31","2019-02-28","2019-01-31","2019-02-28","2019-03-30"]}
)
Table2 = pd.DataFrame({"Idx": [1,2,3,4,5],
                  "Table_ID": ["Table2", "Table2", "Table2", "Table2", "Table2"],
                   "Customer_ID": [1,1,2,3,4],
                   "Snapshot_timestamp": ["2019-03-31","2019-04-30","2019-02-28","2019-02-28","2019-08-31"]}
)

list_of_data_chunks = [Table1, Table2]

# work: iteration
for data_chunk in list_of_data_chunks:
    pass
    # here the merging is performed iteratively
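With hundreds of chunks it may also help not to materialize `list_of_data_chunks` at all. If each chunk lives in its own CSV file (a hypothetical layout; file pattern assumed), a generator yields them one at a time and keeps memory flat:

```python
import glob
import pandas as pd

def iter_chunks(pattern):
    # Hypothetical layout: one CSV file per data chunk.
    # Yield chunks lazily instead of building a full list.
    for path in sorted(glob.glob(pattern)):
        yield pd.read_csv(path)
```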