Python: when multiple messages occur on the same date, how do I mark the first message as induced?
I have a large DataFrame of car maintenance messages. I am trying to clean this data and remove all induced messages.

Whenever car message 44 occurs, my code flags every message that occurs alongside it. I am trying to invert my logic so that whenever message 44 occurs together with another message, message 44 itself is flagged as induced.

I have filtered the data so that the first message on any date will be message 44.

My code is as follows:
df['MsgCat'] = 'New'
for i in range(1,len(df)):
    if df['MsgCat'].iloc[i] == 'New':
        if df['CarSerial'].iloc[i] == df['CarSerial'].iloc[i-1]:
            if df['Date'].iloc[i] == df['Date'].iloc[i-1]:
                df['MsgCount'].iloc[i] = df['MsgCount'].iloc[i-1] + 1
                if df['MsgId'].iloc[i-((df['MsgCount'].iloc[i])-1)] == 1:
                    df['MsgCat'].iloc[i] = 'Induced'
            else:
                df['MsgCount'].iloc[i] = 1
        else:
            df['MsgCount'].iloc[i] = 1
    else:
        df['MsgCount'].iloc[i] = 1
Output:
CarSerial Date MessageNum MsgId MsgCount MsgCat
015 10/14/2015 44 1 1 New
015 10/14/2015 21 2 2 Induced
015 10/14/2015 22 3 3 Induced
015 10/20/2015 30 5 1 New
022 5/1/2015 44 1 1 New
022 7/10/2015 44 1 1 New
022 1/4/2016 44 1 1 New
141 1/10/2016 17 9 1 New
141 1/10/2016 18 10 2 New
008 1/21/2016 44 1 1 New
008 2/4/2016 44 1 1 New
008 2/4/2016 30 5 2 Induced
008 2/4/2016 31 6 3 Induced
Expected output:
CarSerial Date MessageNum MsgId MsgCount MsgCat
015 10/14/2015 44 1 1 Induced
015 10/14/2015 21 2 2 New
015 10/14/2015 22 3 3 New
015 10/20/2015 30 5 1 New
022 5/1/2015 44 1 1 New
022 7/10/2015 44 1 1 New
022 1/4/2016 44 1 1 New
141 1/10/2016 17 9 1 New
141 1/10/2016 18 10 2 New
008 1/21/2016 44 1 1 New
008 2/4/2016 44 1 1 Induced
008 2/4/2016 30 5 2 New
008 2/4/2016 31 6 3 New
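Thanks in advance.

Simply inverting the logic is not enough: by the time you find out that message 44 is induced, you have already moved past it. You have two basic choices: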
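Group the rows by CarSerial and Date. For each group, record whether it contains a MessageNum of 44 and whether it has more than one row, by adding an item to a dictionary named changes. Each item in the dictionary is an instance of a class based on dict that maps 44 to 'Induced' and everything else to 'New'. Any qualifying group is therefore represented by one item in changes, which gives the MsgCat label required for the records that must change. Then use the change_if_need_be function to examine each row: look it up in changes and assign a result both to the records that appear in changes and to all the others.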
>>> import pandas as pd
>>> df = pd.read_csv('cars.csv', sep=r'\s+')
>>> df
CarSerial Date MessageNum MsgId MsgCount
0 15 10/14/2015 44 1 1
1 15 10/14/2015 21 2 2
2 15 10/14/2015 22 3 3
3 15 10/20/2015 30 5 1
4 22 5/1/2015 44 1 1
5 22 7/10/2015 44 1 1
6 22 1/4/2016 44 1 1
7 141 1/10/2016 17 9 1
8 141 1/10/2016 18 10 2
9 8 1/21/2016 44 1 1
10 8 2/4/2016 44 1 1
11 8 2/4/2016 30 5 2
12 8 2/4/2016 31 6 3
>>> grouping = df.groupby(df['CarSerial'].apply(lambda n: str(n)) + ' ' + df['Date'])
>>> class Once(dict):
...     def __missing__(self, key):
...         return 'New'
...
>>> once = Once()
>>> once[44] = 'Induced'
>>> def change_if_need_be(row):
...     key = str(row['CarSerial'])+' '+row['Date']
...     if key in changes:
...         return changes[key][row['MessageNum']]
...     else:
...         return 'New'
...
>>> changes = {}
>>> for g in grouping:
...     if any(g[1].MessageNum == 44) and g[1].MessageNum.count()>1:
...         changes[g[0]] = once
...
>>> df['MsgCat'] = df.apply(change_if_need_be, axis=1)
>>> df
CarSerial Date MessageNum MsgId MsgCount MsgCat
0 15 10/14/2015 44 1 1 Induced
1 15 10/14/2015 21 2 2 New
2 15 10/14/2015 22 3 3 New
3 15 10/20/2015 30 5 1 New
4 22 5/1/2015 44 1 1 New
5 22 7/10/2015 44 1 1 New
6 22 1/4/2016 44 1 1 New
7 141 1/10/2016 17 9 1 New
8 141 1/10/2016 18 10 2 New
9 8 1/21/2016 44 1 1 New
10 8 2/4/2016 44 1 1 Induced
11 8 2/4/2016 30 5 2 New
12 8 2/4/2016 31 6 3 New
Edit: I thought of an improvement that runs faster.

Change the function to this:
>>> def change_if_need_be(row):
...     key = str(row['CarSerial'])+' '+row['Date']
...     if key in changes:
...         return once[row['MessageNum']]
...     else:
...         return 'New'
...
And change changes from a dict to a list, like this:
>>> changes = []
>>> for g in grouping:
...     if any(g[1].MessageNum == 44) and g[1].MessageNum.count()>1:
...         changes.append(g[0])
...
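One possible further tweak, offered as a sketch and not as part of the original answer: `key in changes` scans a list from the start, whereas a set gives constant-time lookups on average. The sample frame below is a cut-down copy of the question's data, built inline only so the snippet runs on its own; `astype(str)` is used in place of the `apply(lambda ...)` call and should produce the same keys.

```python
import pandas as pd

# Cut-down copy of the question's data, just enough to exercise the logic.
df = pd.DataFrame({
    'CarSerial': [15, 15, 15, 15, 22, 8, 8, 8],
    'Date': ['10/14/2015', '10/14/2015', '10/14/2015', '10/20/2015',
             '5/1/2015', '2/4/2016', '2/4/2016', '2/4/2016'],
    'MessageNum': [44, 21, 22, 30, 44, 44, 30, 31],
})

# Same composite key as in the answer: "<CarSerial> <Date>".
grouping = df.groupby(df['CarSerial'].astype(str) + ' ' + df['Date'])

# Collect the key of every group that contains a 44 and has more than one row.
# A set is used so that later `key in changes` tests do not scan a list.
changes = {key for key, group in grouping
           if (group['MessageNum'] == 44).any() and len(group) > 1}
```

The `change_if_need_be` function needs no modification, since `in` works the same way on a set.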
Edit: Simplified (removed the class derived from dict) and consolidated:
>>> import pandas as pd
>>> df = pd.read_csv('cars.csv', sep=r'\s+')
>>> df
CarSerial Date MessageNum MsgId MsgCount MsgCat
0 15 10/14/2015 44 1 1 New
1 15 10/14/2015 21 2 2 Induced
2 15 10/14/2015 22 3 3 Induced
3 15 10/20/2015 30 5 1 New
4 22 5/1/2015 44 1 1 New
5 22 7/10/2015 44 1 1 New
6 22 1/4/2016 44 1 1 New
7 141 1/10/2016 17 9 1 New
8 141 1/10/2016 18 10 2 New
9 8 1/21/2016 44 1 1 New
10 8 2/4/2016 44 1 1 New
11 8 2/4/2016 30 5 2 Induced
12 8 2/4/2016 31 6 3 Induced
>>> grouping = df.groupby(df['CarSerial'].apply(lambda n: str(n)) + ' ' + df['Date'])
>>> changes = []
>>> for g in grouping:
...     if any(g[1].MessageNum == 44) and g[1].MessageNum.count()>1:
...         changes.append(g[0])
...
>>> def change_if_need_be(row):
...     key = str(row['CarSerial'])+' '+row['Date']
...     if key in changes:
...         return {44: 'Induced'}.get(row['MessageNum'], 'New')
...     else:
...         return 'New'
...
>>> df['MsgCat'] = df.apply(change_if_need_be, axis=1)
The result is the same.

How big is the DataFrame? Is it pandas?

About 10,000 rows. Yes, it is pandas.
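For a frame of that size, a vectorized variant may also be worth trying. What follows is a minimal sketch, not taken from the answer above: it assumes the same column names (CarSerial, Date, MessageNum) and rebuilds the question's rows inline instead of reading cars.csv. It labels a row 'Induced' when it is message 44 and its (CarSerial, Date) group holds more than one row, which is the same rule the answer applies row by row.

```python
import pandas as pd

# Rows copied from the question's data (only the columns the rule needs).
df = pd.DataFrame({
    'CarSerial': [15, 15, 15, 15, 22, 22, 22, 141, 141, 8, 8, 8, 8],
    'Date': ['10/14/2015', '10/14/2015', '10/14/2015', '10/20/2015',
             '5/1/2015', '7/10/2015', '1/4/2016', '1/10/2016', '1/10/2016',
             '1/21/2016', '2/4/2016', '2/4/2016', '2/4/2016'],
    'MessageNum': [44, 21, 22, 30, 44, 44, 44, 17, 18, 44, 44, 30, 31],
})

# Number of rows sharing each (CarSerial, Date) pair, broadcast back to every row.
group_size = df.groupby(['CarSerial', 'Date'])['MessageNum'].transform('size')

# A row is induced when it is message 44 and other messages share its group.
induced = (df['MessageNum'] == 44) & (group_size > 1)

df['MsgCat'] = 'New'
df.loc[induced, 'MsgCat'] = 'Induced'
print(df)
```

Grouping on the two columns directly avoids building a string key per row, and the boolean mask replaces the per-row `apply` call.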