Python pandas: remove rows based on "neighbors"

python pandas


Given the following DataFrame:

import pandas as pd

data = [['2019-06-20 12:28:00', '05123', 2, 8888],
        ['2019-06-20 13:28:00', '55874', 6, 8888],
        ['2019-06-20 13:35:00', '12345', 1, 8888],
        ['2019-06-20 13:35:00', '35478', 2, 1234],
        ['2019-06-20 13:35:00', '12345', 2, 8888],
        ['2019-06-20 14:22:00', '98765', 1, 8888]]

columns = ['pdate', 'station', 'ptype', 'train']
df = pd.DataFrame(data, columns = columns)

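where 'pdate' = passing time, 'station' = station code, 'ptype' = passing type, and 'train' = train number.

'ptype' can take the following values: 1 = arrival, 2 = departure, 6 = pass.

This is the result: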

                 pdate station  ptype  train
0  2019-06-20 12:28:00   05123      2   8888
1  2019-06-20 13:28:00   55874      6   8888
2  2019-06-20 13:35:00   12345      1   8888
3  2019-06-20 13:35:00   35478      2   1234
4  2019-06-20 13:35:00   12345      2   8888
5  2019-06-20 14:22:00   98765      1   8888

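Unfortunately, at some stations, instead of registering a single 'ptype' = 6 (pass), staff sometimes enter 'ptype' = 1 (arrival) and 'ptype' = 2 (departure) with the same timestamp. I treat those two records as a single pass record.

I have to remove from the DataFrame every row where ptype = 6, and every row where ptype = 1 and the next record for the same station and the same train number has ptype = 2 at exactly the same time (together with that ptype = 2 row).

So, from the given example, I have to remove rows 1, 2 and 4.

I can remove all rows with ptype = 6: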

df = df.drop(df[(df['ptype']==6)].index)

But I don't know how to remove the other pairs.

Any ideas?

IIUC, you can use groupby and nunique:

# convert to datetime. Skip if already is.
df.pdate = pd.to_datetime(df.pdate)

# drop all the 6 records:
df = df[df.ptype.ne(6)]

(df[df.groupby(['pdate','train'])
      .ptype.transform('nunique').eq(1)]
)

Output:

                pdate station  ptype  train
0 2019-06-20 12:28:00   05123      2   8888
3 2019-06-20 13:35:00   35478      2   1234
5 2019-06-20 14:22:00   98765      1   8888
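
The answer above groups by 'pdate' and 'train' only, while the question also requires the same station. A minimal sketch of that variant, assuming 'station' should be part of the grouping key, could look like this (the names no_pass, distinct and result are illustrative, not from the original answer):

```python
import pandas as pd

data = [['2019-06-20 12:28:00', '05123', 2, 8888],
        ['2019-06-20 13:28:00', '55874', 6, 8888],
        ['2019-06-20 13:35:00', '12345', 1, 8888],
        ['2019-06-20 13:35:00', '35478', 2, 1234],
        ['2019-06-20 13:35:00', '12345', 2, 8888],
        ['2019-06-20 14:22:00', '98765', 1, 8888]]
df = pd.DataFrame(data, columns=['pdate', 'station', 'ptype', 'train'])
df['pdate'] = pd.to_datetime(df['pdate'])

# Drop the explicit pass records first.
no_pass = df[df['ptype'].ne(6)]

# Count distinct ptype values per (pdate, station, train).
# Assumption: a group holding both 1 and 2 is a mis-registered pass.
distinct = (no_pass.groupby(['pdate', 'station', 'train'])['ptype']
                   .transform('nunique'))

# Keep only the groups with a single ptype value.
result = no_pass[distinct.eq(1)]
print(result)
```

On the sample data this keeps the same rows as the output shown above (index 0, 3 and 5); the two versions would only differ if one train had records at two different stations with the same timestamp.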

Here is how you can do it:

# We look at the problematic ptypes
# We groupby station train and pdate to  identify the problematic rows
test = df[(df['ptype'] == 1) | (df['ptype'] == 2)].groupby(['station', 'train', 'pdate']).size().reset_index()

# If there is more than one row that means there is a duplicate 
errors = test[test[0] >1][['station', 'train', 'pdate']]
# We create a column to_remove to later identify the problematic rows
errors['to_remove'] = 1

df = df.merge(errors, on=['station', 'train', 'pdate'], how='left')

#We drop the problematic rows
df = df.drop(index = df[df['to_remove'] == 1].index)

# We drop the column to_remove which is no longer necessary
df.drop(columns='to_remove', inplace = True)

Output:

                 pdate station  ptype  train
0  2019-06-20 12:28:00   05123      2   8888
1  2019-06-20 13:28:00   55874      6   8888
3  2019-06-20 13:35:00   35478      2   1234
5  2019-06-20 14:22:00   98765      1   8888
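
Note that the output above still contains the ptype = 6 row, since this answer only removes the arrival/departure pairs. As a hedged alternative that avoids the merge and the helper column, the same idea can be sketched with duplicated(keep=False), which flags every member of a duplicated group. The names keys, arr_dep, paired and cleaned are illustrative:

```python
import pandas as pd

data = [['2019-06-20 12:28:00', '05123', 2, 8888],
        ['2019-06-20 13:28:00', '55874', 6, 8888],
        ['2019-06-20 13:35:00', '12345', 1, 8888],
        ['2019-06-20 13:35:00', '35478', 2, 1234],
        ['2019-06-20 13:35:00', '12345', 2, 8888],
        ['2019-06-20 14:22:00', '98765', 1, 8888]]
df = pd.DataFrame(data, columns=['pdate', 'station', 'ptype', 'train'])

keys = ['station', 'train', 'pdate']

# Only arrival (1) and departure (2) records can form a pair.
arr_dep = df[df['ptype'].isin([1, 2])]

# keep=False marks ALL rows of a duplicated key, not just the repeats.
# Like the answer above, this assumes any repeated key is a 1/2 pair.
paired = (arr_dep.duplicated(subset=keys, keep=False)
                 .reindex(df.index, fill_value=False))

# Remove the pairs and the explicit pass records in one step.
cleaned = df[~paired & df['ptype'].ne(6)]
print(cleaned)
```

Because the mask is aligned to the original index, the original row labels are preserved.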

This is not a very pandas-like approach, but if I understood your goal correctly, it does produce the result you want:

# a dict for unique filtered records
filtered_records = {}

def unique_key(row):
    return '%s-%s-%d' % (row[columns[0]],row[columns[1]],row[columns[3]])

# populate a map of unique dt, train, station records
for index, row in df.iterrows():
    key = unique_key(row)
    val = filtered_records.get(key,None)
    if val is None:
        filtered_records[key] = row[columns[2]]
    else:
        # if there is both a 1 and a 2 record, mark the record as a 6
        if val * row[columns[2]] == 2:
            filtered_records[key] = 6

# helper function for apply
def update_row_ptype(row):
    val = filtered_records[unique_key(row)]
    return val if val == 6 else row[columns[2]]

# update the dataframe with invalid detected entries from the dict
df[columns[2]] = df.apply(lambda row: update_row_ptype(row), axis = 1)
# drop em
df.drop(df[(df[columns[2]]==6)].index,inplace=True)

print(df)

Output:

                 pdate station  ptype  train
0  2019-06-20 12:28:00   05123      2   8888
3  2019-06-20 13:35:00   35478      2   1234
5  2019-06-20 14:22:00   98765      1   8888

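No, there is no error in the sample data. Train 8888 travels along the following route: '05123' at 12:28, '55874' at 13:28, '12345' at 13:35 and '98765' at 14:22, which looks valid.

But why wouldn't dropping duplicates keep one record (by removing all the other duplicates)? I'm not sure I follow.

drop_duplicates does keep one record and removes the others. In this case that solution is wrong. I have to remove rows 2 and 4 entirely, because those two rows count as a single pass record, so I don't need them. For example, for train #8888 I only need to keep stations '05123' and '98765'.

Yes! The update works fine. Thanks! You're awesome! :-)

Well, although this produces the expected result, performance-wise iterrows is not the best solution, especially in my case, where I have to process tens of millions of records.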
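
For the performance concern raised in the last comment, a vectorized sketch of the rule exactly as the question states it (ptype = 1 followed by ptype = 2 for the same station and train at the same time) might use shift within groups instead of iterrows. This assumes the rows are already in chronological order with the arrival listed before the departure; all variable names are illustrative. It also shows why drop_duplicates alone is not enough here:

```python
import pandas as pd

data = [['2019-06-20 12:28:00', '05123', 2, 8888],
        ['2019-06-20 13:28:00', '55874', 6, 8888],
        ['2019-06-20 13:35:00', '12345', 1, 8888],
        ['2019-06-20 13:35:00', '35478', 2, 1234],
        ['2019-06-20 13:35:00', '12345', 2, 8888],
        ['2019-06-20 14:22:00', '98765', 1, 8888]]
df = pd.DataFrame(data, columns=['pdate', 'station', 'ptype', 'train'])
df['pdate'] = pd.to_datetime(df['pdate'])

# Look at the neighbouring record of the same station and train.
grp = df.groupby(['station', 'train'], sort=False)
next_ptype, next_pdate = grp['ptype'].shift(-1), grp['pdate'].shift(-1)
prev_ptype, prev_pdate = grp['ptype'].shift(1), grp['pdate'].shift(1)

# Arrival immediately followed by a departure at the same time ...
arrival_of_pair = (df['ptype'].eq(1) & next_ptype.eq(2)
                   & next_pdate.eq(df['pdate']))
# ... and the matching departure.
departure_of_pair = (df['ptype'].eq(2) & prev_ptype.eq(1)
                     & prev_pdate.eq(df['pdate']))

to_drop = df['ptype'].eq(6) | arrival_of_pair | departure_of_pair
cleaned = df[~to_drop]
print(cleaned)

# For contrast: drop_duplicates keeps the first row of each pair (row 2).
kept_one = df.drop_duplicates(subset=['pdate', 'station', 'train'])
```

On the sample data, to_drop selects rows 1, 2 and 4, matching the rows listed in the question, whereas kept_one still contains row 2.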