Python: find the co-occurrence of given words in a line


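I have a text file in which each line contains a few words. Given a set of query words, I have to find the number of lines in the file in which the query words occur together, i.e. the number of lines containing two query words, the number of lines containing three query words, and so on.

I tried the following code. Note that rest(list, word) removes "word" from "list" and returns the updated list, and linecount is the number of lines in the raw data.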

raw=open("raw_dataset_1","r")
queryfile=open("queries","r")
query=queryfile.readline().split()
query_size=len(query)
two=0
three=0
four=0

while linecount>0:
    line=raw.readline().split()
    if query_size>=2:
        for word1 in query:
            beta=rest(query,word1)
            for word2 in beta:
                if (word1 in line) and (word2 in line):
                    two+=1
                    print line
    if (query_size>=3):
        for word3 in query:
            beta=rest(query,word3)
            for word4 in beta:
                gama=rest(beta,word4)
                for word5 in gama:
                    if (((word3 in line) and (word4 in line)) and (word5 in line)):
                        three+=1
                        print line
    linecount-=1

print two
print three

It works, although with some redundancy (I can divide "two" by 2 to get the required number).
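The factor of two appears because the nested loops visit every ordered pair of distinct query words, so a matching pair is counted once as (a, b) and once as (b, a). The following is a minimal sketch of that effect, assuming Python 3; the query words and the sample line are made up for illustration and are not taken from the original data.

```python
from itertools import combinations, permutations

query = ["sun", "tree", "sky"]   # hypothetical query words
line = "sun tree sky".split()    # hypothetical line from the data file

# Ordered pairs: what the nested loops in the question enumerate.
ordered = sum(1 for a, b in permutations(query, 2) if a in line and b in line)

# Unordered pairs: each pair of query words is visited exactly once.
unordered = sum(1 for a, b in combinations(query, 2) if a in line and b in line)

print(ordered, unordered)
```

Note that a single line containing three query words contributes three unordered pairs, so halving the counter yields a number of matching pairs rather than a number of lines.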

Is there a better way?

I would take a more general approach. Assuming query is your list of query words and raw_dataset_1 is the name of the file you are analyzing, I would do the following:

# wordcount[k] is the number of lines containing exactly k query words (k = 0..4).
wordcount = [0, 0, 0, 0, 0]
with open("raw_dataset_1") as raw:
    for line in raw:
        # Split the line into words so that a query word only matches a whole word,
        # not a substring of a longer word.
        words = line.split()
        # The list comprehension keeps only those query words that occur in the line.
        # We are not interested in which words they are, only in how many there are,
        # so we take the length of the list. [See list comprehension]
        number_query_words_found = len([query_word for query_word in query if query_word in words])
        if number_query_words_found < 5:
            # Increment the counter whose index equals the number of query words found.
            wordcount[number_query_words_found] += 1

print("Number of lines with 2 query words: %d" % wordcount[2])
print("Number of lines with 3 query words: %d" % wordcount[3])
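The fixed five-element list limits the count to at most four query words per line. As a sketch of one possible variation (not part of the original answer), collections.Counter can hold the same histogram without a fixed size. The function name count_lines_by_matches and the sample lines are assumptions made for illustration; the sample data follows the example given in the comments.

```python
from collections import Counter

def count_lines_by_matches(lines, query):
    """Map k -> number of lines that contain exactly k of the query words."""
    histogram = Counter()
    for line in lines:
        words = line.split()
        # Same counting idea as above: keep the query words present in the line.
        found = len([query_word for query_word in query if query_word in words])
        histogram[found] += 1
    return histogram

# Hypothetical in-memory data standing in for the lines of raw_dataset_1.
sample_lines = ["apple tree mango", "sun tree sky", "sun moon star"]
histogram = count_lines_by_matches(sample_lines, ["sun", "tree"])
print(histogram[2])
```

A Counter returns 0 for keys it has never seen, so looking up a count that no line produced does not raise an error.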

I would use sets for this:
with open("queries", "r") as queryfile:
    query_line = queryfile.readline()
query_words = query_line.split()
query_set = set(query_words)

# The counters must exist before the loop increments them.
two = three = four = 0

with open("raw_dataset_1", "r") as raw:
    for line in raw:  # Iterating over a file gives you one line at a time
        word_set = set(line.split())
        common_set = query_set.intersection(word_set)
        if len(common_set) == 2:
            two += 1
        elif len(common_set) == 3:
            three += 1
        elif len(common_set) == 4:
            four += 1

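Of course, you may want to save the matching lines to a result file or something similar instead of just counting the occurrences. But this should give you the general idea: using sets will greatly simplify your logic.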
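As a rough sketch of the "save the lines instead of counting" idea, the set intersection can be used to group the lines by how many distinct query words they contain. The function name group_lines_by_matches and the in-memory sample lines are assumptions made for illustration; with real data the lines would come from iterating over the open file.

```python
from collections import defaultdict

def group_lines_by_matches(lines, query_words):
    """Map k -> list of lines that contain exactly k distinct query words."""
    query_set = set(query_words)
    groups = defaultdict(list)
    for line in lines:
        # intersection() accepts any iterable, so the word list can be passed directly.
        common = query_set.intersection(line.split())
        groups[len(common)].append(line.rstrip("\n"))
    return groups

# Hypothetical data, following the example given in the comments.
sample_lines = ["apple tree mango\n", "sun tree sky\n", "sun moon star\n"]
groups = group_lines_by_matches(sample_lines, ["sun", "tree"])
print(groups[2])
```

The number of lines with exactly two query words is then len(groups[2]), and the lines themselves are available for writing to a result file.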
@Vidit No. I want to find the lines that contain more than one query word. Say my query is "sun tree" and I have three lines: "apple tree mango", "sun tree sky" and "sun moon star". Then the number of lines in which two query words occur together is 1, namely "sun tree sky" (this line contains both query words).

Could you explain this line: "len([query_word for query_word in query if query_word in line])"? I get NameError: name 'line' is not defined.

Sorry, the example is fixed now. It should read for line in file("raw_dataset_1").readlines(): …

This is an O(N^2) solution: for each word you need to check, you iterate over the line once. Using sets will give an O(N) solution.

@Alex: Thank you very much for your help :). This code is very good. I learned a lot.

@munn: Thanks a lot :). Yes, I used this idea here.