Using Python to find sentences that contain one of a list of keywords

python python-2.7


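I am using Python 2.7.

I want to go through a .txt file and keep only the sentences that contain one or more words from a list of keywords.

After that, I want to go through the remaining text again with another list of keywords, and repeat the process.

I want to save the result in that .txt file; the rest can be deleted.

I'm new to Python (but I love it!), so don't worry about hurting my feelings. Feel free to assume I know very little about Python and dumb it down a bit :)

This is what I have so far: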

import re

f = open('C:\\Python27\\test\\A.txt')

text = f.read()
define_words = 'contractual'
print re.findall(r"([^.]*?%s[^.]*\.)" % define_words,text)

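This works very well and picks out every sentence with "contractual" in it. If I put "contractual obligations" in there, it only picks out sentences where those two words appear next to each other.

What I'm stuck on is this: how do I turn that into a list of words that are each matched independently? For example "contractual", "obligations", "law", "employer", and so on.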

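EDIT for applepi's answer:

I ran a few small tests on this text:

"The quick brown fox jumps over the lazy dog

New line

Yet another nice new line."

If I put two words from the same sentence in the list, such as ['quick', 'brown'], I only get that one sentence back:

Output: ['T', 'h', 'e', 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o', 'w', 'n', 'f', 'o', 'x', 'y', 'j', 'u', 'm', 'p', 's', 'o', 'v', 'e', 'r', 'T', 'h', 'e', 'l', 'a', 'z', 'y', 'd', 'o', 'g', ...

So ['quick', 'other'] comes up with nothing.

['Yet', 'another'] gives:

Output: [' ', '\n', '\n', 'Y', 'e', 't', ' ', 'a', 'n', 'o', 't', 'h', 'e', 'r', 'n', 'i', 'c', 'e', 'n', 'e', 'w', 'l', 'i', 'n', 'e', ...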

print [sent for sent in text.split('.') 
        if any(word in sent for word in define_words.split()) ]

Or, if you change define_words to a list of strings:

# define_words = ['contractual', 'obligations']
define_words = 'contractual obligations'.split()

print [sent for sent in text.split('.') 
        if any(word in sent for word in define_words) ]

In fact, if you prefer, you can replace the containment test (word in sent) with an re operation.
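As a rough sketch of that suggestion, the substring test could be swapped for a compiled regular expression with word boundaries, so that a keyword such as 'law' does not also match 'lawyer'. The function name sentences_with_keywords and the sample text are made up for illustration, and the case-insensitive matching is an assumption rather than something the question asked for. The code is written to run on both Python 2.7 and Python 3.

```python
import re


def sentences_with_keywords(text, keywords):
    """Return the sentences in `text` that contain at least one whole-word keyword."""
    # re.escape protects against keywords that contain regex metacharacters.
    alternatives = "|".join(re.escape(word) for word in keywords)
    # \b...\b makes 'law' match as a whole word only (assumption: that is what is wanted).
    pattern = re.compile(r"\b(?:%s)\b" % alternatives, re.IGNORECASE)
    # Naive sentence split on '.', the same approach as the snippets above.
    sentences = [s.strip() for s in text.split('.')]
    return [s for s in sentences if s and pattern.search(s)]


# Illustrative sample text, not taken from the asker's file.
sample = "The contractual terms apply. The lawyer left. The law is clear."
print(sentences_with_keywords(sample, ['contractual', 'law']))
# prints ['The contractual terms apply', 'The law is clear']
```

Splitting on '.' will also break on abbreviations and decimal numbers, so this is only suitable for simple text.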

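I can't comment (I don't have enough reputation), so technically this answer is not an answer.

I'm not very familiar with regex, but assuming your re.findall() call works, you can use the following code: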

import re, itertools
from collections import Counter
f = open('C:\\Python27\\test\\A.txt')

text = f.read()
everything = []
define_words = ['contractual', 'obligation', 'law', 'employer']
for k in define_words:
    everything.append(re.findall(r"([^.]*?%s[^.]*\.)" % k,text))

everything = list(itertools.chain(*everything))
counts = Counter(everything)
everything = [value for value, count in counts.items() if count > 1]
everything = list(itertools.chain(*everything))
print everything

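This loops over the list of keywords and appends each result to a list, which produces a list of lists. I then keep only the duplicates (the good values) and flatten the list of lists into a single list.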
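To make the flattening steps concrete, here is a small sketch of what itertools.chain does at each stage. It probably also accounts for the single-character output reported in this thread: the first chain turns the list of lists into a list of sentence strings, but a second chain over those strings splits each one into characters, because a string is itself iterable. The data in per_keyword and the helper unique_in_order are hypothetical. Keeping every matching sentence once, instead of only those with count > 1, is an assumption about what "contains one or more keywords" should mean.

```python
import itertools

# Hypothetical per-keyword results, shaped like the lists re.findall returns in the loop.
per_keyword = [
    ['The quick brown fox.'],  # sentences matching 'quick'
    ['The quick brown fox.'],  # sentences matching 'brown'
]

# First chain: list of lists -> flat list of sentence strings (the intended step).
flat = list(itertools.chain(*per_keyword))

# Second chain: every string is iterable, so each sentence is split into characters.
chars = list(itertools.chain(*flat))


def unique_in_order(items):
    """Drop repeated items while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Assumption: a sentence matched by any keyword should be kept exactly once.
sentences = unique_in_order(flat)

print(flat)
print(chars[:3])
print(sentences)
```

Here flat holds the sentence twice, chars starts with 'T', 'h', 'e', and sentences holds the sentence once.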

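ERROR: the real error was that everything was a list of lists, which Counter(everything) does not allow. So I flattened it before calling Counter(). You may want to look into the RegEx operators. My comment would have been to use a for loop.

Looks really promising for something that isn't an answer :) I got this error:
Traceback (most recent call last):
  File "C:/PythonProjects/test.py", line 11, in <module>
    counts = Counter(all)
  File "C:\Python27\lib\collections.py", line 450, in __init__
    self.update(iterable, **kwds)
  File "C:\Python27\lib\collections.py", line 532, in update
    self[elem] = self_get(elem, 0) + 1
TypeError: unhashable type: 'list'

Sorry, change 'all' to something else, like 'everything'; it seems 'all' is a built-in function :)

Actually, that was not the error; I have described the real one in my answer.

I get a long list of single characters: [' ', 'F', 'a', 'c', 't', 'o', 'r', 's', ' ', 'w', 'h', 'i', 'c', 'h', ' ', 'l', 'e', 'd', ' ', 't', 'o', ' ' (longer than this :)) The good thing is that this only happens when at least one of the define_words is in the file. The list of single characters changes depending on the keywords I enter.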
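Pulling the thread together, the original goal (filter the file with one keyword list, filter what is left with another, then save the result back) might look roughly like the sketch below. The helper names keep_sentences and filter_file are made up for illustration. The UTF-8 encoding, the plain substring match and the reading of "repeat the process" as successive filtering of the kept sentences are all assumptions. The code is written to run on both Python 2.7 and Python 3.

```python
import io


def keep_sentences(text, keywords):
    """Keep only the sentences that contain at least one keyword (substring match)."""
    # Naive sentence split on '.', the same approach as the answers above.
    sentences = [s.strip() for s in text.split(u'.')]
    kept = [s for s in sentences if s and any(word in s for word in keywords)]
    # Re-attach the full stops that split('.') removed.
    return u' '.join(s + u'.' for s in kept)


def filter_file(path, keyword_lists):
    """Filter the file at `path` in place, once per keyword list, in the order given."""
    # Assumption: the file is UTF-8 text; change the encoding if it is not.
    with io.open(path, encoding='utf-8') as f:
        text = f.read()
    for keywords in keyword_lists:
        text = keep_sentences(text, keywords)
    # Overwrites the original file, so everything that was filtered out is lost.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return text
```

A call such as filter_file('C:\\Python27\\test\\A.txt', [['contractual', 'law'], ['employer']]) would first keep the sentences mentioning 'contractual' or 'law', and then keep only those among them that also mention 'employer'. Because the file is overwritten, it is worth trying this on a copy first.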
