
Using re when reading a CSV file in Python


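I have a list of keywords, health_list, and I want to check one column of a CSV file for these keywords. If at least one keyword from the list appears in that column, I write the whole row to a new CSV file.

I use re.search to check for the keywords and record the matching row numbers, then use csv.writer to write the new CSV. However, many rows that contain a keyword do not show up in my new CSV file. Any ideas?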

healthy_new=[]
with open("Data 2017.csv","rb") as f:
    csvreader=csv.reader(f,delimiter=",")
    next(csvreader)
    for line, row in enumerate(csvreader):
        for word in healthy_list:
            try:
                if  (re.search(word,row[4].lower()) ):
                    healthy_new.append(line)
            except ValueError:
                continue 

healthy_new=list(set(healthy_new))

....

f = open("Data 2017.csv", "r")
reader = csv.reader(f)

data = open("healthy_new_output.csv", "w")
w = csv.writer(data, delimiter=',')
for idx, row in enumerate(reader):
    idx+=-1
    if idx in healthy_new:
        my_row = row
        w.writerow(my_row)

Edit: a snippet of Data 2017.csv

The keyword list:

 [...'diet', 'low-fat', 'light', 'diet', 'salad', 'salads', 'baked', 'grilled', 'whole grain']

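If you like, you can filter the rows with pandas and then write the result to a CSV file with the pandas.DataFrame.to_csv method.

Here is a basic illustration of how it works:

Data 2017.csv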

name,age,description
Andy,15,Having a bad stomach
Bobby,21,Having a good stomach and a little flu
Connie,22,Not having anything particularly bad
Derry,12,Bad stomach & lightheaded

Filtering on the description column works as follows:

In []: import re

In []: import pandas as pd

In []: df = pd.read_csv('Data 2017.csv')

In []: word_flags = ['bad', 'flu', 'lightheaded']

In []: df_filtered = df[df.description.str.contains("|".join(word_flags), flags=re.IGNORECASE)]

In []: df_filtered
Out[]: 
     name  age                             description
0    Andy   15                    Having a bad stomach
1   Bobby   21  Having a good stomach and a little flu
2  Connie   22    Not having anything particularly bad
3   Derry   12               Bad stomach & lightheaded

In []: word_flags = ['flu', 'foo', 'bar']

In []: df_filtered = df[df.description.str.contains("|".join(word_flags), flags=re.IGNORECASE)]

In []: df_filtered
Out[]: 
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu

df_filtered.to_csv("Filtered Data 2017.csv", index=False)

The output file now contains:

name,age,description
Bobby,21,Having a good stomach and a little flu

To address your problem specifically, see the snippet below, which checks every text column:

In []: word_flags = ['bad', 'flu', 'lightheaded']

In []: df2 = pd.DataFrame()

In []: for col in df.select_dtypes(object):
    ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
    ...:     

In []: df2
Out[]: 
     name  age                             description
0    Andy   15                    Having a bad stomach
1   Bobby   21  Having a good stomach and a little flu
2  Connie   22    Not having anything particularly bad
3   Derry   12               Bad stomach & lightheaded

In []: word_flags = ['flu', 'foo', 'bar']

In []: df2 = pd.DataFrame()

In []: for col in df.select_dtypes(object):
    ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
    ...:     

In []: df2
Out[]: 
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu

However, this approach only works cleanly when the keywords match in just one column. Suppose you define word_flags like this:

In []: word_flags = ['flu', 'foo', 'bar', 'bobby']

This produces duplicate records, which need further cleanup:

In []: df2 = pd.DataFrame()

In []: for col in df.select_dtypes(object):
    ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
    ...:     

In []: df2
Out[]: 
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu
1  Bobby   21  Having a good stomach and a little flu
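
One way to avoid the duplicates in the first place is to combine the per-column matches into a single boolean mask before indexing, instead of concatenating one filtered frame per column. The following is only a sketch under some assumptions: the sample rows are inlined through io.StringIO so that it runs on its own, and the names `text_columns`, `mask` and `df_any` are made up for illustration. It also uses `re.escape`, which the original snippet does not, so that a keyword containing regex metacharacters is matched literally.

```python
import io
import re

import pandas as pd

# Same sample rows as 'Data 2017.csv' above, inlined so the sketch is self-contained.
csv_text = """name,age,description
Andy,15,Having a bad stomach
Bobby,21,Having a good stomach and a little flu
Connie,22,Not having anything particularly bad
Derry,12,Bad stomach & lightheaded
"""
df = pd.read_csv(io.StringIO(csv_text))

word_flags = ['flu', 'foo', 'bar', 'bobby']
# Escape each keyword so characters such as '+' or '.' are treated literally.
pattern = "|".join(re.escape(word) for word in word_flags)

# Build one boolean mask per text column and OR them together row by row.
text_columns = ['name', 'description']  # assumed: the columns worth searching
mask = pd.Series(False, index=df.index)
for col in text_columns:
    mask |= df[col].str.contains(pattern, flags=re.IGNORECASE, na=False)

# A row that matches in several columns is still selected only once.
df_any = df[mask]
```

Alternatively, calling drop_duplicates() on the concatenated df2 would remove the repeated rows, though it would also collapse rows that are genuinely identical in the source file.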

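Can we have an example of 'Data 2017.csv' and health_list? And if all you want is a string containment test, why are you using re?

@Megalng This is the only way I have learned so far... Which one would you suggest?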
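
For reference, a plain containment test needs no regular expressions at all: Python's `in` operator checks whether one string occurs inside another, and it treats every character of the keyword literally. The sketch below is one possible shape for this, not the asker's code. It assumes Python 3 (the question opens the file in "rb" mode, which suggests Python 2), the function name `filter_rows` and its parameters are hypothetical, and it writes the matching rows in the same pass instead of recording row numbers first. Unlike the code in the question, it also copies the header row to the output.

```python
import csv


def filter_rows(in_path, out_path, keywords, column=4):
    """Copy the header plus every row whose `column` contains any keyword.

    Returns the number of data rows written.
    """
    keywords = [k.lower() for k in keywords]
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            if len(row) <= column:
                continue  # short or blank row: nothing to test
            text = row[column].lower()
            # Case-insensitive substring test, no regex involved.
            if any(k in text for k in keywords):
                writer.writerow(row)
                kept += 1
    return kept
```

A call such as filter_rows("Data 2017.csv", "healthy_new_output.csv", healthy_list) would then search the fifth column, matching the row[4] used in the question.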