Python pandas: apply different filters to a DataFrame depending on the year

python pandas dataframe

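I want to apply different filters to my DataFrame depending on whether the year is above or below a certain threshold. Here is the DataFrame: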

import pandas as pd

dataset=pd.DataFrame({'ID': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5], 
                      'Avail' : [2017,2017,2017,2018,2018,2018,2017,2017,2017,2017,2017,2017,2017,2018,2018], 
                      'Change' : [0,0,2018,0,0,0,0,0,0,0,0,0,2018,0,0],
                      'Pref' : [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
                      'Status': ['null', 'null','Q','null','null','null','Q','null','null','null','null','null','Q','null','null']
                      },columns=['ID', 'Avail', 'Change', 'Pref', 'Status'])

Here is the code I wrote, which raises an error:

def yearfilt(x):
    if x.loc[:, ['Avail', 'Change']].values.max(axis=1) < 2018:
        if pd.isnull(x.Status):
            x.drop_duplicates(subset=['STU_ID','Status' ], keep='last')
        else:
            x=x.drop(x[pd.isnull(x.Status)].index)
    else:
        if pd.isnull(x.ASSESSMENT_OUTCOME_CD):
            x.drop_duplicates(subset=['STU_ID','Status' ], keep='first')
        else:
         x=x.drop(x[pd.isnull(x.Status)].index)

df=dataset.groupby(['ID']).apply(yearfilt).sort_values(["ID"]).reset_index(drop=True)

What I want to do is:

If the max (Avail, Change) < 2018 then
Case 1: the same status --> drop duplicates and keep the last
Case 2: different status --> drop null-value statuses

else (in other words max (Avail, Change) = 2018)
Case 1: the same status --> drop duplicates and keep the first
Case 2: different status --> drop null-value statuses

Basically, I want to keep only one row for each ID.

Thanks.

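The ValueError you are seeing occurs because you are trying to evaluate if (some Series). I am not sure which line raises the error you mention, but every one of your if statements looks like it could cause it.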

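For example, the first if statement compares a whole array of values (one maximum per row) with a single value. The result is an array of booleans, not the single True/False that an if statement expects. The same can happen with pd.isnull, which returns one boolean per element when it is given a Series.

Check which expressions return an array-like result and consider how that fits the logic of your code.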
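To make the point concrete, here is a minimal sketch of the failure and of one way around it. The three-row frame and the variable names (row_max, mask, below_2018) are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# One group from the question, reduced to the two year columns.
group = pd.DataFrame({
    "Avail": [2017, 2017, 2017],
    "Change": [0, 0, 2018],
})

# Row-wise maximum: one value per row, so the comparison yields three booleans.
row_max = group[["Avail", "Change"]].max(axis=1)
mask = row_max < 2018

try:
    if mask:  # ambiguous: which of the three booleans should decide?
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

# Reduce to one scalar first; the comparison then yields a single boolean.
group_max = group[["Avail", "Change"]].max().max()
below_2018 = bool(group_max < 2018)
```

If the intent really is "all rows" or "any row", mask.all() or mask.any() also turn the boolean Series into a single value.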

If I understand your question correctly, here is one possible solution:

def yearfilt(group):
    # Apply .max() twice to get a single value across the group.
    # Otherwise the result is a Series, and using it in an if raises a ValueError.
    if group[['Avail', 'Change']].max().max() < 2018:
        # Returns true if there is a unique status value.
        if group['Status'].unique().shape[0] == 1:
            # Return last row as a dataframe.
            return group.iloc[-1:]
        else:
            # Return ALL rows with status not null (may be more than 1?).
            return group[group['Status'] != 'null']
    else:
        if group['Status'].unique().shape[0] == 1:
            # Return first row as a dataframe.
            return group.iloc[:1]
        else:
            return group[group['Status'] != 'null']

dataset.groupby('ID').apply(yearfilt).reset_index(drop=True)

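A few things to keep in mind:

Every group passed to the function used in groupby().apply is a subset of the whole DataFrame. You need to return a new object rather than modify the group the function receives.

If you use isnull, the values you want to filter must be real missing values such as None or NaN, not the strings 'null', 'None', 'nan', etc. See the pandas notes on missing values.

You cannot use an if statement on a Series, only on a single value.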
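The first two notes can be illustrated with a short sketch. The small frames and the list of placeholder spellings are assumptions made up for this example; adjust them to whatever your real data contains.

```python
import pandas as pd

# 1) Methods such as drop_duplicates return a new object by default.
df = pd.DataFrame({"ID": [1, 1, 2], "Status": ["Q", "Q", "null"]})
df.drop_duplicates(subset=["ID", "Status"], keep="last")  # result is discarded
rows_unchanged = len(df)                                  # df itself is untouched
deduped = df.drop_duplicates(subset=["ID", "Status"], keep="last")

# 2) Placeholder strings are ordinary values, so isnull() does not flag them.
status = pd.Series(["null", "Q", " ", "", "None", "Q"])
flagged_before = int(status.isnull().sum())

# Strip whitespace, then blank out the assumed placeholder spellings.
placeholders = ["null", "None", "nan", ""]
stripped = status.str.strip()
cleaned = stripped.mask(stripped.isin(placeholders))  # masked entries become missing
flagged_after = int(cleaned.isnull().sum())
```

After this kind of cleanup, isnull, notnull and dropna behave as expected on the Status column.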

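Could you post the line that is giving you trouble?

The last line of the code: df=dataset.groupby(['ID']).apply(yearfilt).sort_values(["ID"]).reset_index(drop=True)

Thanks. But do you have any suggestion for applying filters based on the year, like the ones described in cases 1 and 2 above? I am really stuck on this.

I have edited the post with a possible solution. Hope it helps!

This is gold, thank you. It works fine on the data sample I provided. However, when I apply it to my larger DataFrame (about 30 columns), something seems to go wrong. I am trying to figure out why the filter does not give consistent answers. Although my actual dataset has more columns, I only use the columns from the example in the post to apply the filter. Really strange!

A caution about using .iloc to return the first and last rows: groupby may not hand you the subset DataFrames in the order you expect, so you may want to use idxmax/idxmin to pick the desired first/last row according to some explicit ordering. Also check the behaviour of group[group['Status'] != 'null'], since it may return multiple rows.

You are absolutely right about the 'null' values. In my actual dataset I had blanks instead of nulls, and that was the root of the problem. I changed them all to "Nan", but that does not seem to be treated as unique in Python. I also used "None" to replace the blanks. This works for every ID except ID 1.

Result on the sample data, one row per ID:

ID  Avail  Change  Pref  Status
1   2017   2018    3     Q
2   2018   0       1     null
3   2017   0       1     Q
4   2017   0       3     null
5   2017   2018    1     Q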
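The concerns raised in the comments (row order within a group, and the non-null filter possibly returning several rows) could be addressed along these lines. This is only a sketch: the helper name pick_one, sorting by Pref to define "first" and "last", and reusing the first/last rule to break ties among several non-null rows are all assumptions, not something the original post specifies. It iterates over the groups and concatenates the pieces, so the ID column stays in the result.

```python
import pandas as pd

dataset = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    "Avail": [2017, 2017, 2017, 2018, 2018, 2018, 2017, 2017, 2017,
              2017, 2017, 2017, 2017, 2018, 2018],
    "Change": [0, 0, 2018, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2018, 0, 0],
    "Pref": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Status": ["null", "null", "Q", "null", "null", "null", "Q", "null",
               "null", "null", "null", "null", "Q", "null", "null"],
})


def pick_one(group):
    """Return exactly one row of the group as a one-row DataFrame."""
    # Assumption: Pref defines what "first" and "last" mean within a group.
    group = group.sort_values("Pref")
    keep_last = group[["Avail", "Change"]].max().max() < 2018
    if group["Status"].nunique() > 1:
        # Mixed statuses: drop the 'null' placeholder rows first.
        group = group[group["Status"] != "null"]
    # Whatever remains, keep one row: the last before 2018, the first otherwise.
    return group.iloc[-1:] if keep_last else group.iloc[:1]


pieces = [pick_one(g) for _, g in dataset.groupby("ID")]
result = pd.concat(pieces).reset_index(drop=True)
```

On the sample data this keeps one row per ID. On a larger dataset it is worth checking result["ID"].is_unique to confirm that still holds.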
