Python: search one CSV column for strings and return a string from another column of the same row
I am trying to write a program in Python that searches 27,000 rows for each string in a list. Each string I am searching for is in one column, and another column holds an "id" value that I want to print whenever the string is found. The code I currently have counts the number of times the string appears in the document, but I still cannot find a way to return that specific value for each unique row in which the string is found.
import csv
fin = open('data.csv')
words = ["happy","sad","good","bad","sunny","rainy"]
found = {}
count = 0
for line in fin:
    for word in words:
        if word in line:
            count = count + 1
    found[word] = count
print(found)
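The main semantic problem with the code above is that printing the "found" dictionary yields only one entry from the "words" list, together with its count: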
for line in fin:               # loops over the lines of the file
    for word in words:         # loops over your word list
        if word in line:       # checks if the current word is in the line
            count = count + 1  # increments the global variable "count" every time any word in the list is found in a line, with no reset or save in between; at the end it is the number of times any word appeared in any line
    found[word] = count        # after all words are looped over, assigns the current "count" value to found[current_word]
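So what you are doing is assigning a fairly arbitrary value to a dictionary key, and that key just happens to be the last word you checked in each iteration.
That does not seem very useful to me. I guess you meant to do something like this: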
from collections import defaultdict
found = defaultdict(int)  # missing keys start at 0
for line in fin:
    for word in words:
        if word in line:
            found[word] += 1
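For a version that runs without a data.csv on disk, here is a minimal sketch of the same per-word counting. The sample rows and the `count_rows_per_word` helper are made up for illustration and are not part of the original answer. Note that `word in line` is a substring test, so "bad" would also match a line containing "badge".

```python
import io
from collections import defaultdict

def count_rows_per_word(lines, words):
    """Count, for each word, how many lines contain it as a substring."""
    found = defaultdict(int)
    for line in lines:
        for word in words:
            if word in line:
                found[word] += 1
    return dict(found)

# Hypothetical sample standing in for data.csv
sample = io.StringIO("happy day,1\nsad day,2\nhappy and sunny,3\n")
words = ["happy", "sad", "good", "bad", "sunny", "rainy"]

counts = count_rows_per_word(sample, words)
print(counts)  # {'happy': 2, 'sad': 1, 'sunny': 1}
```

Words that never occur are simply absent from the result; use `counts.get(word, 0)` if you want a zero for them.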
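You said you want to print the id of the row when a word is found. Assuming you have a comma-separated CSV file with only two columns, I would do it like this: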
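words = ["happy","sad","good","bad","sunny","rainy"]
found = {}
with open('data.csv') as fin:
    for line in fin:
        # strip the trailing newline, then split into the text column and the id column
        str1, id = line.strip().split(',')
        for w in words:
            if w in str1:
                print(id)
                found[w] = found.get(w, 0) + 1
                break  # stop at the first matching word for this row
print(found)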
If the file has more than two columns, you can do the following instead:
split_line = line.strip().split(',')
str1 = split_line[0]  # whichever column holds the text
id = split_line[1]    # whichever column holds the id
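Splitting on commas by hand breaks as soon as a quoted field contains a comma, which the standard `csv` module handles. Below is a hedged sketch that collects the IDs for every word instead of printing them. The helper name `ids_per_word`, the column positions, and the sample rows are assumptions for illustration only. Unlike the loop above, it does not `break`, so a row that contains two words is recorded under both.

```python
import csv
import io
from collections import defaultdict

def ids_per_word(rows, words, text_col=0, id_col=1):
    """Map each word to the IDs of the rows whose text column contains it."""
    ids = defaultdict(list)
    for row in rows:
        if len(row) <= max(text_col, id_col):
            continue  # skip blank or short rows
        for word in words:
            if word in row[text_col]:
                ids[word].append(row[id_col])
    return dict(ids)

# Hypothetical sample standing in for data.csv; the first field contains a comma
sample = io.StringIO('"happy, but rainy",1\nsad,2\nhappy,3\n')
words = ["happy", "sad", "good", "bad", "sunny", "rainy"]

result = ids_per_word(csv.reader(sample), words)
print(result)  # {'happy': ['1', '3'], 'rainy': ['1'], 'sad': ['2']}
```

With a real file, pass `csv.reader(open('data.csv', newline=''))` instead of the in-memory sample. The IDs come back as strings; convert them with `int()` if you need numbers.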
For something like this, I think using the pandas library will keep you sane. Assume a 15,000-row CSV file with two columns, Strings and ID:
In [1]: import pandas as pd
In [2]: words = ['happy','sad','good','bad','sunny','rainy']
In [3]: df = pd.read_csv('data.csv')
In [4]: df.head(5)
Out[4]:
  Strings  ID
0   happy   1
1     sad   2
2   happy   3
3     sad   4
4    good   5
In [5]: for word in words:
   ...:     print('{} : {}'.format(word, df['Strings'].str.lower().str.contains(word).sum()))
   ...:
happy : 2501
sad : 2500
good : 2500
bad : 2500
sunny : 2499
rainy : 2500
Alternatively, you can just create a pivot table, which gives similar results:
In [30]: df_pt = df.pivot_table(index='Strings',values='ID',aggfunc=len)
In [31]: df_pt
Out[31]:
Strings
bad 2500
good 2500
happy 2501
rainy 2500
sad 2500
sunny 2499
Name: ID, dtype: int64
If you need to get the IDs for each word, just select/index/filter the data:
In [6]: df_happy = df[df['Strings'] == 'happy']
In [7]: df_happy.head(5)
Out[7]:
   Strings  ID
0    happy   1
2    happy   3
12   happy  13
14   happy  15
18   happy  19
If you need it as a list, then:
In [8]: list_happy = df_happy['ID'].tolist()
In [9]: list_happy[:5]
Out[9]: [1, 3, 13, 15, 19]
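To get the ID list for every word in one pass rather than one word at a time, the filtering step can be wrapped in a loop. This is only a sketch: the small DataFrame is invented sample data standing in for `pd.read_csv('data.csv')`, and `ids_by_word` is a hypothetical name. It uses a case-insensitive substring match like the counting loop earlier, not the exact `==` comparison.

```python
import pandas as pd

# Hypothetical sample standing in for pd.read_csv('data.csv')
df = pd.DataFrame({
    "Strings": ["happy", "sad", "Happy days", "good", "sad"],
    "ID": [1, 2, 3, 4, 5],
})
words = ["happy", "sad", "good", "bad", "sunny", "rainy"]

ids_by_word = {}
for word in words:
    # regex=False treats the word as a literal substring, not a pattern
    mask = df["Strings"].str.lower().str.contains(word, regex=False)
    ids_by_word[word] = df.loc[mask, "ID"].tolist()

print(ids_by_word["happy"])  # [1, 3]
print(ids_by_word["bad"])    # []
```

Words with no matching rows get an empty list, so every word in the list is always a key in the result.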
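Obviously, I have trimmed some parts of the output, but the idea is the same.
Awesome. I am still new to Python, so thank you for the added comments and the solution. That solves the counting problem, but how do I return the "id" value for each row containing a "word", as I mentioned?
Just store them as you find them.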