Python: removing words from a list of text strings


I am trying to remove certain words (in addition to using stopwords) from a list of text strings, but for some reason it does not work.

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

exclude = ['am', 'there','here', 'for', 'of', 'user']

new_doc = [word for word in documents if word not in exclude]

print(new_doc)
Output:

['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']
As you can see, none of the words in exclude are removed from documents ("for" is a prime example).

It does work with this expression:

new_doc = [word for word in str(documents).split() if word not in exclude]
But then how do I get the original elements of documents back (albeit "cleaned" ones)?


I would greatly appreciate your help.

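You are looping over sentences, not words. To fix this, split each sentence, loop over its words with a nested loop, filter them, and then join the result: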

>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>> 
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>> 
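The same split, filter, and join idea can be wrapped in a small reusable function. This is only a minimal sketch and not part of the original answer: the name remove_words and the case_sensitive flag are hypothetical, and the exclusion list is converted to a set on the assumption that it may grow large, so that each membership test stays cheap.

```python
def remove_words(sentences, exclude, case_sensitive=True):
    """Return a new list of sentences with every word in `exclude` removed."""
    if case_sensitive:
        excluded = set(exclude)
        keep = lambda word: word not in excluded
    else:
        # Compare lowercased forms so "For" is treated like "for".
        excluded = {w.lower() for w in exclude}
        keep = lambda word: word.lower() not in excluded
    # Split each sentence into words, filter, and join back into one string.
    return [" ".join(w for w in s.split() if keep(w)) for s in sentences]


documents = ["Human machine interface for lab abc computer applications",
             "The EPS user interface management system"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

new_doc = remove_words(documents, exclude)
print(new_doc)
```

Because each sentence is rebuilt with ' '.join, the result has the same number of elements as documents, which is what the question asks for.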
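You can also use a regex and replace the exclude words with an empty string via the re.sub function, instead of a nested list comprehension with splitting and filtering: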

>>> import re
>>> 
>>> new_doc = [re.sub(r'|'.join(exclude),'',sent) for sent in documents]
>>> new_doc
['Human machine interface  lab abc computer applications', 'A survey   opinion  computer system response time', 'The EPS  interface management system', 'System and human system engineering testing  EPS', 'Relation   perceived response time to error measurement', 'The generation  random binary unordered trees', 'The intersection graph  paths in trees', 'Graph minors IV Widths  trees and well quasi ordering', 'Graph minors A survey']
>>> 

r'|'.join(exclude) joins the words with a pipe character (which denotes logical OR in a regex).
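One caveat: a plain alternation such as for|of also matches inside longer words (for example the "for" in "information"), and removing a word leaves its surrounding spaces behind, which is why the output above contains double spaces. The following is a hedged sketch of one way to handle both, assuming whole-word matching is what is wanted. The names pattern and clean are illustrative and do not come from the original answer.

```python
import re

exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# \b anchors each alternative at word boundaries; re.escape protects against
# regex metacharacters that might appear in the excluded words.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')


def clean(sentence):
    """Remove excluded whole words, then collapse the leftover whitespace."""
    stripped = pattern.sub('', sentence)
    return ' '.join(stripped.split())


print(clean("A survey of user opinion of computer system response time"))
print(clean("information for the user"))
```

Applying clean to every element, for instance [clean(sent) for sent in documents], keeps the list structure of documents intact.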

You should split each line into words before filtering:

new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]

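word is not a word; it is a whole line (e.g. "Human machine interface for lab abc computer applications"), so it will never appear in exclude.

@jornsharpe - I just added a correction, but the problem still exists (in a slightly different form).
Gotcha! Thanks!!
Great! Which approach do you think is more efficient for large texts?
@Toly Yep, I think so.
Would you use a regex or a nested comprehension for large text files?
@Toly Using a regex performs better than splitting, looping, and filtering.
@Toly You can use the timeit module to run a benchmark on both approaches.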
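As a rough sketch of how such a benchmark could be set up: the function names, the input size, and the repetition count below are arbitrary choices, and the timings depend on the machine, the Python version, and the data, so no result is claimed here. Note that the two approaches do not produce identical strings, since the regex version leaves extra spaces behind.

```python
import re
import timeit

# Repeat a few sample sentences to simulate a larger input.
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system"] * 100
exclude = ['am', 'there', 'here', 'for', 'of', 'user']


def with_comprehension():
    # Split each sentence, filter the words, and join them back together.
    return [' '.join([w for w in sent.split() if w not in exclude])
            for sent in documents]


def with_regex():
    # Replace every match of the alternation with an empty string.
    return [re.sub('|'.join(exclude), '', sent) for sent in documents]


timings = {}
for func in (with_comprehension, with_regex):
    # timeit.timeit calls the function `number` times and returns total seconds.
    timings[func.__name__] = timeit.timeit(func, number=10)
    print(func.__name__, timings[func.__name__])
```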