Remove duplicate rows from a CSV file based on 2 columns using regular expressions in Python

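How can I remove duplicate rows from a CSV file based on two columns, where one of the columns uses a regular expression to decide what counts as a match, grouped by the first field (IPAddress)? Finally, I want to add a count field to each row that counts the duplicate rows:

CSV file: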

IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50

I want to match on IPAddress and Value1 (Value1 is a match if the first 5 characters match).

This would give me:

IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
**127.0.0.1, Test1ABA, 30, 40** (Line would be removed but counted)
127.0.0.1, Value1BBA, 40, 50, 2
**127.0.0.1, Value1BBA, 40, 50** (Line would be removed but counted)
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
**127.0.0.2, Value1BBA, 40, 50** (Line would be removed but counted)

New output:

IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
127.0.0.1, Value1BBA, 40, 50, 2
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1

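I tried using a set, but apparently a set cannot be indexed:

import csv
import re

entries = set()
writer = csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')
for row in list:  # `list` holds the rows read from the csv file
    key = (row[0], row[1])
    # this compares a match object (or None) against the stored tuples,
    # so it never detects a duplicate
    if re.match(r"(Test1)", key[1]) not in entries:
        entries.add(key)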

Pseudocode?:

# I want to iterate through rows of a csv file and
if row[0] and row[1][:5] match a previous entry:
    remove row
    add count
else:
    add row

Any help or guidance is much appreciated.

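You need a dictionary to keep track of the matches. You do not need a regular expression; you only need to track the first 5 characters. Store each row under its "key" (made up of the first column and the first 5 characters of the second column) and add a count. You have to count first, then write out the collected rows together with their counts.

If ordering matters, you can replace the dictionary with collections.OrderedDict() (in Python 3.7+ a plain dict already preserves insertion order); the code is otherwise the same: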

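import csv

rows = {}

# inputfilename is the path to the source csv file
with open(inputfilename, newline='') as inputfile:
    # skipinitialspace drops the blank after each comma in the sample data,
    # so row[1][:5] really is the first 5 characters of Value1
    reader = csv.reader(inputfile, skipinitialspace=True)
    headers = next(reader)  # collect first row as headers for the output
    for row in reader:
        key = (row[0], row[1][:5])
        if key not in rows:
            rows[key] = row + [0]  # keep the first row seen for this key
        rows[key][-1] += 1  # count

with open('myfilewithoutduplicates.csv', 'w', newline='') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(headers + ['Count'])
    writer.writerows(rows.values())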
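As a quick check of the dictionary approach against the sample data from the question, here is a minimal in-memory sketch. The function name `dedupe_with_counts` and its `prefix_len` parameter are made up for illustration and are not part of the answer above. It assumes Python 3.7+, where a plain dict keeps insertion order.

```python
import csv
import io

SAMPLE = """IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50
"""


def dedupe_with_counts(text, prefix_len=5):
    """Group rows by (first column, prefix of second column) and count them.

    Hypothetical helper: keeps the first row seen for each key and appends
    the number of rows that shared that key.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    headers = next(reader)
    rows = {}
    for row in reader:
        key = (row[0], row[1][:prefix_len])
        if key not in rows:
            rows[key] = row + [0]
        rows[key][-1] += 1
    return headers + ['Count'], list(rows.values())


# A 5-character prefix treats every "Value..." entry of one IP as the same group
headers, by_five = dedupe_with_counts(SAMPLE, prefix_len=5)

# A 6-character prefix keeps "Value1..." and "Value2..." apart
_, by_six = dedupe_with_counts(SAMPLE, prefix_len=6)

for row in by_six:
    print(', '.join(str(field) for field in row))
```

On this sample, a 6-character prefix reproduces the six rows and counts listed under "New output" in the question, while a strict 5-character prefix merges all the `Value...` rows for an address into a single group. Which rule is intended is for the asker to decide.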

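Note that rows such as Value1AAB appear to match on the first 6 characters. Can you elaborate on how these rows are matched? What is the rule for prefix equality? Should everything except the last 3 characters be compared?

Thanks for your help. I have never used numpy before. I'll take a look at it now.

You can use numpy: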

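import numpy as np

# import data from file (assume file called a.csv), store as record array;
# autostrip removes the blank after each comma, encoding gives str fields
a = np.genfromtxt('a.csv', delimiter=',', skip_header=1, dtype=None,
                  encoding='utf-8', autostrip=True)

# build a key per row: first column plus the first 6 chars of the 2nd column
# (the sample output only matches with 6 characters, not 5)
p = [ip + value[:6] for ip, value in zip(a['f0'], a['f1'])]

# compare the keys: m is the index of the first row for each unique key,
# c is the number of rows sharing that key
k, m, c = np.unique(p, return_index=True, return_counts=True)

# np.unique sorts by key, so restore the original file order
order = np.argsort(m)

# use indexes to create new list without dupes, with a per-row count
newlist = [a[v] for v in m[order]]
counts = c[order]

# total number of removed rows is the difference in lengths
count = len(a) - len(newlist)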
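Since the question asks for a regular expression and the comments ask which prefix rule applies, here is a small sketch that makes the rule pluggable. The patterns `FIRST_FIVE` and `ALL_BUT_LAST_THREE` and the helpers `make_key` and `count_groups` are hypothetical names chosen for illustration. Neither rule is confirmed by the asker.

```python
import re
from collections import OrderedDict

# Two candidate rules; group 1 of the pattern is the part that must be equal
FIRST_FIVE = r"(.{5})"
ALL_BUT_LAST_THREE = r"(.+).{3}$"


def make_key(ip, value, pattern):
    """Return (ip, matched prefix); fall back to the whole value if no match."""
    match = re.match(pattern, value)
    prefix = match.group(1) if match else value
    return (ip, prefix)


def count_groups(records, pattern):
    """Count how many (ip, value) records fall under each key, in first-seen order."""
    groups = OrderedDict()
    for ip, value in records:
        key = make_key(ip, value, pattern)
        groups[key] = groups.get(key, 0) + 1
    return groups


# The 127.0.0.2 rows from the question, reduced to the two key columns
records = [
    ("127.0.0.2", "Test1ABC"),
    ("127.0.0.2", "Value1AAB"),
    ("127.0.0.2", "Value2ABA"),
    ("127.0.0.2", "Value1BBA"),
]

by_first_five = count_groups(records, FIRST_FIVE)
by_all_but_last_three = count_groups(records, ALL_BUT_LAST_THREE)

print(dict(by_first_five))
print(dict(by_all_but_last_three))
```

With these four rows, the "all but the last 3 characters" rule yields the counts shown in the question's expected output (`Value1...` twice, `Value2...` once), whereas the 5-character rule groups all three `Value...` rows together.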