Python: count and remove duplicates of each unique row in a DataFrame


The DataFrame consists of more than 150,000 rows and contains duplicates. The sample below shows all 25 columns of data, including the index. I want to:

1. Count the number of duplicates of each unique row

,Date,Time,Company,AV_ID,timestamp,Longitude,Latitude,Altitude,Roll,Pitch,Yaw,Roll Rate,Pitch Rate,Yaw Rate,Speed-x,Speed-y,Speed-z,Drive Mode,Throttle Actuator Value,Brake Light Condition,Brake Actuator Value,Steering Angle,Direction Indicator,Reverse Light Condition
0,29-Jan-2019,09:29:43.184,DEL,DEL0002,2019-01-29 09:33:33.425000,,,,,,,,0.0,,,2.22,,,9.25,,,,,
1,29-Jan-2019,09:29:43.184,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
2,29-Jan-2019,09:29:43.199,DEL,DEL0002,2019-01-29 09:33:33.425000,,,,,,,,0.0,,,2.22,,,9.25,,,,,
3,29-Jan-2019,09:29:43.199,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
4,29-Jan-2019,09:29:44.543,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,
5,29-Jan-2019,09:29:44.543,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
6,29-Jan-2019,09:29:44.574,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,
7,29-Jan-2019,09:29:44.574,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
8,29-Jan-2019,09:29:46.606,DEL,DEL0002,2019-01-29 09:33:37.425000,,,,,,,,0.0,,,2.22,,,5.48,,,,,
9,29-Jan-2019,09:29:46.606,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
10,29-Jan-2019,09:29:46.622,DEL,DEL0002,2019-01-29 09:33:37.425000,,,,,,,,0.0,,,2.22,,,5.48,,,,,
11,29-Jan-2019,09:29:46.622,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
12,29-Jan-2019,09:29:48.573,DEL,DEL0002,2019-01-29 09:33:39.422000,,,,,,,,0.0,,,1.94,,,6.02,,,,,
13,29-Jan-2019,09:29:48.573,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
14,29-Jan-2019,09:29:48.588,DEL,DEL0002,2019-01-29 09:33:39.422000,,,,,,,,0.0,,,1.94,,,6.02,,,,,
2. Remove all duplicates, comparing entire rows

3. Insert a new column showing the duplicate count of each unique row

So far I am able to find the duplicates as follows, but I cannot count the duplicates of each unique row, nor insert the count into a new column:

# Time conversion; `local` is an offset in milliseconds defined elsewhere
s = pd.to_numeric(mydataset['timestamp'], errors='coerce') + local
mydataset['timestamp'] = pd.to_datetime(s, unit='ms')

# Select the duplicated rows (every occurrence after the first)
duplicatedRows = mydataset[mydataset.duplicated()]
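As a side note, `duplicated()` marks every occurrence after the first, so the line above selects only the repeated rows. A quick check on toy data:

```python
import pandas as pd

# duplicated() returns False for the first occurrence of each row and
# True for every later repeat.
df = pd.DataFrame({'a': [1, 1, 2, 2, 2]})
flags = df.duplicated()
print(flags.tolist())   # [False, True, False, True, True]
```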
You can try grouping by all of the columns and counting the duplicates with size:
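A minimal sketch of that suggestion, on toy data (the column names are illustrative, taken from the sample above):

```python
import pandas as pd

# Group by every column; size() gives the number of occurrences of each
# unique row, which is the duplicate count the question asks for.
df = pd.DataFrame({
    'Company': ['DEL', 'DEL', 'DEL'],
    'Speed-x': [2.22, 2.22, 2.5],
})
counts = df.groupby(df.columns.tolist()).size().reset_index(name='count')
print(counts)
```

`reset_index(name='count')` turns the resulting Series back into a DataFrame, with the duplicate count in its own column.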


Assuming I have understood what you want, look at this subset of your data:

4,29-Jan-2019,09:29:44.543,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,
5,29-Jan-2019,09:29:44.543,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
6,29-Jan-2019,09:29:44.574,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,

If you want to treat the first and last of these rows as duplicates, you need to specify which columns to group by, because their Time values, 09:29:44.543 and 09:29:44.574, differ. Taking several columns as an example:

cols_to_groupby = ['Company', 'AV_ID', 'timestamp', 'Longitude', 'Latitude', 'Altitude']

# insert a new column with the count of duplicates;
# transform('size') broadcasts each group's size back onto every row
df['duplicate_count'] = df.groupby(cols_to_groupby)['Date'].transform('size')

# get rid of duplicates, keeping the first row of each group
df = df.drop_duplicates(subset=cols_to_groupby)
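A worked toy example of those two lines (illustrative column names, with `'Company'` standing in for any column outside the groupby keys):

```python
import pandas as pd

# Rows 0 and 1 share the same key values; row 2 differs.
df = pd.DataFrame({
    'Company': ['DEL', 'DEL', 'DEL'],
    'AV_ID': ['DEL0002'] * 3,
    'timestamp': ['t1', 't1', 't2'],
})
keys = ['Company', 'AV_ID', 'timestamp']

# transform('size') broadcasts each group's size back onto every row,
# then drop_duplicates keeps the first row of each group.
df['duplicate_count'] = df.groupby(keys)['Company'].transform('size')
df = df.drop_duplicates(subset=keys)

print(df['duplicate_count'].tolist())   # [2, 1]
```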

You can count the rows with duplicatedRows.count.

What I mean is that I want to count the occurrences of each identical row. For example, rows 1 and 2 contain exactly the same data, so that particular row would have a count of 2. Rows 3 to 5 are identical, so count = 3.

It gives me an error: ValueError: Length of passed values is 130322, index implies 0. Could it be because of the large amount of data?

It is probably because every row in your dataset has at least one NaN. If possible, do a fillna before the groupby.

Glad to hear it. Please consider accepting/upvoting the answer so that others can find the solution more easily.

This works! However, if I want to select more columns for the groupby, I cannot. Is there another way around it?

What problem do you get when you try to add more columns to the groupby?
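The fillna suggestion from the comments can be sketched like this, on toy data (in pandas 1.1+, `groupby(..., dropna=False)` is an alternative):

```python
import pandas as pd
import numpy as np

# By default, groupby drops rows whose key contains NaN, which likely
# explains the ValueError above: the transformed counts no longer cover
# the original index. Filling NaN with a sentinel value first keeps
# every row in some group.
df = pd.DataFrame({'a': ['x', 'x', np.nan], 'b': [1, 1, 2]})
filled = df.fillna('missing')
df['count'] = filled.groupby(['a', 'b'])['a'].transform('size')
print(df['count'].tolist())   # [2, 2, 1]
```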