Multi-word search not working properly (Python)
Tags: python, python-2.7
I am working on a project that requires me to search a file for multiple keywords. For example, if I have a file in which the word "tomato" appears 100 times, "bread" appears 500 times, and "pickle" appears 20 times, I want to be able to search the file for "tomato" and "bread" and get the number of times each appears. I found people on this site with the same problem, but in other languages. I have a working program that lets me search by column name and count how many times something appears in that column, but I would like to make it more precise. Here is my code:
import os

def start():
    location = raw_input("What is the folder containing the data you like processed located? ")
    #location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    if os.path.exists(location) == True: #Tests to see if user entered a valid path
        file_extension = raw_input("What is the file type (.txt for example)? ")
        search_for(location,file_extension)
    else:
        print "I'm sorry, but the file location you have entered does not exist. Please try again."
        start()

def search_for(location,file_extension):
    querylist = []
    n = 5
    while n == 5:
        search_query = raw_input("What would you like to search for in each file? Use 'Done' to indicate that you have finished your request. ")
        #list = ["CD90-N5722-15C", "CD90-NB810-4C", "CP90-N2475-8", "CD90-VN530-22B"]
        if search_query == "Done":
            print "Your queries are:",querylist
            print ""
            content = os.listdir(location)
            run(content,file_extension,location,querylist)
            n = 0
        else:
            querylist.append(search_query)
            continue

def run(content,file_extension,location,querylist):
    for item in content:
        if item.endswith(file_extension):
            search(location,item,querylist)
    quit()

def search(location,item,querylist):
    with open(os.path.join(location,item), 'r') as f:
        countlist = []
        for search in querylist: #any search value after the first one is incorrectly reporting "0"
            countsearch = 0
            for line in f:
                if search in line:
                    countsearch = countsearch + 1
            countlist.append(search)
            countlist.append(countsearch) #mechanism to update countsearch is not working for any value after the first
        print item, countlist

start()
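With this code, the last part of def search does not work properly. Whenever I enter several searches, every search after the first one returns 0, even though a search term occurs up to 500,000 times in a single file.
I was also wondering, since I have to index 5 files with 1,000,000 lines each, whether there is a way to write an additional function or something similar to count the occurrences across all files.
Because of the size and contents of the files, I cannot post them here. Any help would be appreciated.
Edit
I also have this code. If I use it, I get the correct count for each term, but it would be better to let the user enter any number of searches: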
def check_start():
    #location = raw_input("What is the folder containing the data you like processed located? ")
    location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    content = os.listdir(location)
    for item in content:
        if item.endswith("processed"):
            countcol1 = 0
            countcol2 = 0
            countcol3 = 0
            countcol4 = 0
            #print os.path.join(currentdir,item)
            with open(os.path.join(location,item), 'r') as f:
                for line in f:
                    if "CD90-N5722-15C" in line:
                        countcol1 = countcol1 + 1
                    if "CD90-NB810-4C" in line:
                        countcol2 = countcol2 + 1
                    if "CP90-N2475-8" in line:
                        countcol3 = countcol3 + 1
                    if "CD90-VN530-22B" in line:
                        countcol4 = countcol4 + 1
            print item, "CD90-N5722-15C", countcol1, "CD90-NB810-4C", countcol2, "CP90-N2475-8", countcol3, "CD90-VN530-22B", countcol4
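You are trying to iterate over the file more than once. After the first pass the file pointer is at the end of the file, so the subsequent searches fail because there is nothing left to read. If you add the line f.seek(0), the pointer is reset to the start of the file before the next read: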
def search(location,item,querylist):
    with open(os.path.join(location,item), 'r') as f:
        countlist = []
        for search in querylist:
            countsearch = 0
            for line in f: #reads the file through to the end
                if search in line:
                    countsearch = countsearch + 1
            countlist.append(search)
            countlist.append(countsearch)
            f.seek(0) #rewind so the next search term starts from the top of the file
        print item, countlist
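To see the effect in isolation, here is a minimal sketch that is not part of the original answer. It assumes a small temporary file with made-up contents, and the helper name `count_lines_containing` is hypothetical. It is written so that it should run on Python 3 as well as 2.7.

```python
import os
import tempfile


def count_lines_containing(f, term):
    """Count the lines of the open file object f that contain term."""
    return sum(1 for line in f if term in line)


# Hypothetical sample data, written to a temporary file for the demo.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("tomato bread\nbread\ntomato\nbread pickle\n")

with open(path, "r") as f:
    first = count_lines_containing(f, "tomato")        # reads to the end of the file
    without_seek = count_lines_containing(f, "bread")  # nothing left to read
    f.seek(0)                                          # rewind to the start
    with_seek = count_lines_containing(f, "bread")     # full pass again

os.remove(path)
```

Here `first` is 2, `without_seek` is 0 because the file object is already exhausted, and `with_seek` is 3 once the pointer has been reset.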
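Also, I have guessed at the indentation... you really shouldn't use tabs.
I'm not sure I fully understand your question, but how about something like this: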
def check_start():
    raw_search_terms = raw_input('Enter search terms separated by a comma:')
    search_term_list = raw_search_terms.split(',')
    #location = raw_input("What is the folder containing the data you like processed located? ")
    location = "C:/Code/Samples/Dates/2015-06-07/Large-Scale Data Parsing/Data Files"
    content = os.listdir(location)
    for item in content:
        if item.endswith("processed"):
            # create a dictionary of search terms with their counts (initialized to 0)
            search_term_count_dict = dict(zip(search_term_list, [0 for s in search_term_list]))
            # open the file (this step was missing in the snippet as posted)
            with open(os.path.join(location, item), 'r') as f:
                # a single pass over the file checks every term on each line
                for line in f:
                    for s in search_term_list:
                        if s in line:
                            search_term_count_dict[s] += 1
            print item
            for key, value in search_term_count_dict.iteritems():
                print key, value
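The question also asks about counting occurrences across all files. The following is only a sketch of one way to extend the single-pass idea above, not something from the original thread: the function names `count_terms_in_file` and `count_terms_in_folder` are hypothetical, and it assumes, like the code above, that a "count" means the number of lines containing the term. It avoids `print` and `raw_input` so that it should run on Python 3 as well as 2.7.

```python
import os
from collections import Counter


def count_terms_in_file(path, terms):
    """Return a Counter mapping each term to the number of lines containing it."""
    counts = Counter({term: 0 for term in terms})
    with open(path, "r") as f:
        for line in f:              # one pass over the file, however many terms
            for term in terms:
                if term in line:
                    counts[term] += 1
    return counts


def count_terms_in_folder(location, file_extension, terms):
    """Return (per_file, totals) for files in location ending in file_extension."""
    per_file = {}
    totals = Counter({term: 0 for term in terms})
    for item in sorted(os.listdir(location)):
        if item.endswith(file_extension):
            per_file[item] = count_terms_in_file(os.path.join(location, item), terms)
            totals.update(per_file[item])   # Counter.update adds the counts
    return per_file, totals
```

A call such as `per_file, totals = count_terms_in_folder(location, ".processed", ["CD90-N5722-15C", "CD90-NB810-4C"])` would give the per-file counts in `per_file` and the combined counts in `totals`. Because each file is read only once, no `f.seek(0)` is needed.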
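Comments:
Can you fix your code indentation?
@SiHa Thanks for letting me know, but I can't see where it is off. Can you tell me?
It looks like you need to indent the lines after the function definitions.
Oh! I didn't realize they were not indented automatically on the site. In my code they are fine; I think it may be because I used tabs.
Yes, that happens a lot. I have been bitten by the tab key myself. Regarding your code: in the second example, the one that works, it looks like you are counting instances of the terms CD90-N5722-15C, CD90-NB810-4C, and so on. If so, can't you replace those hard-coded values with variables? You could define those variables with raw_input so that the user can enter their own search terms.
Sorry! I really should use spaces, but PyCharm does not distinguish between the two; it replaces a tab with 4 spaces in the program. The f.seek(0) approach did not work for me; I still get 0 for every value after the first: out-30000000.txt.processed ['CD90-N5722-15C', 438956, 'CD90-NB810-4C', 0]
@Tobytoyo, the seek was in the wrong place. All it did was seek to the start of the file when the file was already at the start. You need to seek after each iteration over f.
@PadraicCunningham When I put it there I realized that the first seek was unnecessary, but I thought it made things clearer. Did moving it really need an edit?
@SiHa, you only had one seek, and your code did not work before I edited it, so I'm not sure exactly what you mean. How does seeking immediately after opening the file affect the code in the loop?
@SiHa, you were seeking immediately after opening the file, so it was basically a code doorstop.
Thanks for your answer, but it is not what I was looking for. SiHa and Cunningham's approach is exactly what I wanted. Thank you very much for your answer!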