Python: using re while reading a CSV file
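I have a list of keywords, health_list (named healthy_list in the code below), and I want to check one column of a CSV file for these keywords. If at least one keyword from the list appears in that column, I write the whole row to a new CSV file.
I use re.search to check for the keywords, record the matching row numbers, and then use csv.writer to write the new CSV. But many rows that contain a keyword do not show up in my new CSV file. Any ideas?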
healthy_new=[]
with open("Data 2017.csv","rb") as f:
    csvreader=csv.reader(f,delimiter=",")
    next(csvreader)
    for line, row in enumerate(csvreader):
        for word in healthy_list:
            try:
                if (re.search(word,row[4].lower()) ):
                    healthy_new.append(line)
            except ValueError:
                continue
healthy_new=list(set(healthy_new))
....
f = open("Data 2017.csv", "r")
reader = csv.reader(f)
data = open("healthy_new_output.csv", "w")
w = csv.writer(data, delimiter=',')
for idx, row in enumerate(reader):
    idx+=-1
    if idx in healthy_new:
        my_row = row
        w.writerow(my_row)
Edit:
A snippet of Data 2017.csv:
healthy_list:
[...'diet', 'low-fat', 'light', 'diet', 'salad', 'salads', 'baked', 'grilled', 'whole grain']
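If you like, you can filter the rows with pandas and then write the result to a CSV file with the pandas.DataFrame.to_csv method.
Here is a basic illustration of how it works.
Data 2017.csv: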
name,age,description
Andy,15,Having a bad stomach
Bobby,21,Having a good stomach and a little flu
Connie,22,Not having anything particularly bad
Derry,12,Bad stomach & lightheaded
The filtering itself looks like this:
In []: import re
In []: import pandas as pd
In []: df = pd.read_csv('Data 2017.csv')
In []: word_flags = ['bad', 'flu', 'lightheaded']
In []: df_filtered = df[df.description.str.contains("|".join(word_flags), flags=re.IGNORECASE)]
In []: df_filtered
Out[]:
     name  age                             description
0    Andy   15                    Having a bad stomach
1   Bobby   21  Having a good stomach and a little flu
2  Connie   22    Not having anything particularly bad
3   Derry   12               Bad stomach & lightheaded
In []: word_flags = ['flu', 'foo', 'bar']
In []: df_filtered = df[df.description.str.contains("|".join(word_flags), flags=re.IGNORECASE)]
In []: df_filtered
Out[]:
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu
In []: df_filtered.to_csv("Filtered Data 2017.csv", index=False)
Filtered Data 2017.csv now contains:
name,age,description
Bobby,21,Having a good stomach and a little flu
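One caveat about the pattern itself: joining the keywords with "|" makes pandas read each keyword as a regular expression, and empty cells produce NaN in the mask rather than False. The sketch below is only an illustration under stated assumptions: the inline CSV text stands in for the example file, and the extra row with an empty description is made up to show the NaN case. It escapes each keyword with re.escape and passes na=False to str.contains.

```python
import io
import re

import pandas as pd

# Hypothetical inline copy of the example data, plus one made-up row
# whose description is empty (it is read as NaN).
csv_text = """name,age,description
Andy,15,Having a bad stomach
Bobby,21,Having a good stomach and a little flu
Connie,22,Not having anything particularly bad
Derry,12,Bad stomach & lightheaded
Eve,30,
"""
df = pd.read_csv(io.StringIO(csv_text))

word_flags = ["flu", "foo", "bar"]
# re.escape keeps characters such as "+" or "(" from being read as regex syntax.
pattern = "|".join(re.escape(word) for word in word_flags)

# na=False treats a missing description as "no match" instead of NaN.
mask = df["description"].str.contains(pattern, flags=re.IGNORECASE, na=False)
df_filtered = df[mask]
print(df_filtered)
```

Without na=False, a NaN in the mask can make the boolean indexing step fail, so this is worth considering whenever the column may contain empty cells.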
To address your problem specifically, see the snippet below, which checks every text column:
In []: word_flags = ['bad', 'flu', 'lightheaded']
In []: df2 = pd.DataFrame()
In []: for col in df.select_dtypes(object):
   ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
   ...:
In []: df2
Out[]:
     name  age                             description
0    Andy   15                    Having a bad stomach
1   Bobby   21  Having a good stomach and a little flu
2  Connie   22    Not having anything particularly bad
3   Derry   12               Bad stomach & lightheaded
In []: word_flags = ['flu', 'foo', 'bar']
In []: df2 = pd.DataFrame()
In []: for col in df.select_dtypes(object):
   ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
   ...:
In []: df2
Out[]:
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu
However, this approach only works cleanly when the keywords match in a single column. Suppose you define word_flags like this:
In []: word_flags = ['flu', 'foo', 'bar', 'bobby']
This produces duplicate records, because the same row matches in more than one column, and the result needs further cleanup:
In []: df2 = pd.DataFrame()
In []: for col in df.select_dtypes(object):
   ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
   ...:
In []: df2
Out[]:
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu
1  Bobby   21  Having a good stomach and a little flu
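One way to avoid the duplicates, offered as a sketch rather than as part of the answer above, is to build a single boolean mask that is OR-ed across the text columns and index the DataFrame once. The inline CSV text and the explicit text_columns list are assumptions made so the example runs on its own.

```python
import io
import re

import pandas as pd

# Hypothetical inline copy of the example data.
csv_text = """name,age,description
Andy,15,Having a bad stomach
Bobby,21,Having a good stomach and a little flu
Connie,22,Not having anything particularly bad
Derry,12,Bad stomach & lightheaded
"""
df = pd.read_csv(io.StringIO(csv_text))

word_flags = ["flu", "foo", "bar", "bobby"]
pattern = "|".join(re.escape(word) for word in word_flags)

# Assumption: the text columns are listed by hand instead of detected by dtype.
text_columns = ["name", "description"]

# Start from an all-False mask and OR in the matches from each column,
# so a row that matches in several columns is still selected only once.
mask = pd.Series(False, index=df.index)
for col in text_columns:
    mask = mask | df[col].str.contains(pattern, flags=re.IGNORECASE, na=False)

df2 = df[mask]
print(df2)
```

Calling drop_duplicates() on the concatenated result is another option, but it would also collapse rows that were genuinely duplicated in the input file.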
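Could we have an example of 'Data 2017.csv' and health_list? If you only want a string-containment test, why use re?
@Megalng This is the only method I have learned so far... Which one would you suggest?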
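As a hedged illustration of the containment test suggested in the comment, the sketch below uses the in operator instead of re and filters in a single pass, so no row numbers have to be matched up between two reads of the file. The helper name filter_rows, the sample rows, and the file names are hypothetical. It also assumes the file has a header row, which it copies to the output, and it opens both files in with blocks so the output is flushed and closed. An output file that is never closed is one possible cause of missing rows.

```python
import csv
import os
import tempfile


def filter_rows(src_path, dst_path, keywords, column):
    """Copy the header plus every row whose `column` cell contains a keyword."""
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # copy the header row unchanged
        for row in reader:
            if len(row) <= column:
                continue  # skip rows that are too short to have this column
            text = row[column].lower()
            # Plain substring test; keywords are expected in lower case.
            if any(word in text for word in keywords):
                writer.writerow(row)
                kept += 1
    return kept


# Hypothetical sample data, not taken from the real Data 2017.csv.
healthy_list = ["diet", "low-fat", "light", "salad", "salads",
                "baked", "grilled", "whole grain"]
rows = [
    ["id", "description"],
    ["1", "Grilled chicken salad"],
    ["2", "Double cheeseburger"],
    ["3", "Low-Fat yogurt"],
]

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "sample.csv")
    dst_path = os.path.join(tmp, "sample_healthy.csv")
    with open(src_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    kept = filter_rows(src_path, dst_path, healthy_list, column=1)
    with open(dst_path, newline="") as f:
        output_rows = list(csv.reader(f))

print(kept, output_rows)
```

For the file in the question, column=4 would correspond to row[4]. Note that a substring test also matches inside longer words, for example "light" inside "lightheaded".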