Remove all stop words defined in a file from the text of another file (Python)


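I have two text files:

Stopwords.txt --> contains one stop word per line

text.txt --> a large document file

I'm trying to remove every occurrence of the stop words (any word in the stopwords.txt file) from the text.txt file without using NLTK (this is a school assignment).

How would I go about this? Here is my code so far: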

import re

with open('text.txt', 'r') as f, open('stopwords.txt','r') as st:
    f_content = f.read()
    #splitting text.txt by non alphanumeric characters
    processed = re.split('[^a-zA-Z]', f_content)

    st_content = st.read()
    #splitting stopwords.txt by new line
    st_list = re.split('\n', st_content)
    #print(st_list) to check it was working

    #what I'm trying to do is: traverse through the text. If stopword appears, 
    #remove it. otherwise keep it. 
    for word in st_list:
        f_content = f_content.replace(word, "")
        print(f_content)

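However, when I run the code, it first takes a very long time to output anything, and when it finally does, it just prints the entire text file. (I'm new to Python, so please let me know if I'm doing something fundamentally wrong!)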

Here is what I use when I need to remove English stop words. I usually also use the corpus from nltk rather than my own stop word file:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()

## Remove stop words
## (`text` is assumed to already be a list of word tokens)
stops = set(stopwords.words("english"))
text = [ps.stem(w) for w in text if w not in stops and len(w) >= 3]
text = list(set(text)) #remove duplicates (note: this also loses word order)
text = " ".join(text)

For your particular case, I would do something like this:

stops = list_of_words_from_file

Let me know if that answers your question; I'm not sure whether the problem is reading from the file or the stemming.

Edit: to remove all the stop words defined in one file from the text of another file, you can use str.replace()
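One caveat with str.replace() is that it matches substrings, not whole words, so a stop word such as "a" is also removed from inside longer words. The sketch below contrasts that with a word-boundary regex. The sample sentence and stop word list are made up for illustration, and the regex version is an alternative I am assuming would suit the question, not something taken from the original answer.

```python
import re

# Hypothetical sample data, not from the question's files.
text = "The cat sat on the mat and then it left"
stopwords = ["the", "it", "a"]

# Naive approach: str.replace works on substrings, so it also removes
# letters inside longer words ("cat" becomes "ct", "then" becomes "n").
naive = text
for word in stopwords:
    naive = naive.replace(word, "")
print(naive)  # prints: The ct st on  mt nd n  left

# Word-boundary approach: \b only matches at the edge of a word,
# so only whole words are removed. re.escape guards against special characters.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, stopwords)) + r")\b",
    re.IGNORECASE,
)
# split()/join() collapses the extra spaces left behind by the removals.
whole_words = " ".join(pattern.sub("", text).split())
print(whole_words)  # prints: cat sat on mat and then left
```

Calling str.replace() once per stop word also rescans the whole text each time, which may be part of why the original loop feels slow on a large file.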

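Given that you're running into performance problems, I'd suggest using the subprocess library to call the sed Linux command.

I know Python is great for this kind of thing (and many others), but if you have a really large text.txt, I would try the old, ugly, powerful command-line tool sed.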

Try the following:

sed -f stopwords.sed text.txt > output_file.txt

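In the stopwords.sed file, each stop word must be on its own line, in the following format:

s|\<xxxxx\>||g

where "xxxxx" is the stop word itself. For example:

s|\<the\>||g

The line above removes every occurrence of the whole word "the" (without the quotes).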

It's worth a try.
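As a rough sketch of how the sed approach could be driven from Python, the code below builds the stopwords.sed script from the stop word list and then calls sed through subprocess. The helper names build_sed_script and remove_stopwords_with_sed are hypothetical, and the sketch assumes the stop words are plain alphabetic words (no regex metacharacters and no `|`) and that a sed supporting `\<` and `\>` (a GNU sed extension) is on the PATH.

```python
import subprocess


def build_sed_script(stopwords):
    # Build one substitution command per stop word, e.g. s|\<the\>||g
    # Assumes plain alphabetic words; blank lines are skipped.
    lines = []
    for word in stopwords:
        word = word.strip()
        if word:
            lines.append(r"s|\<" + word + r"\>||g")
    return "\n".join(lines) + "\n"


def remove_stopwords_with_sed(stopwords_path, text_path, output_path,
                              script_path="stopwords.sed"):
    # Write the generated script to disk so sed can read it with -f.
    with open(stopwords_path) as st:
        script = build_sed_script(st)
    with open(script_path, "w") as script_file:
        script_file.write(script)
    # Equivalent to: sed -f stopwords.sed text.txt > output_file.txt
    with open(output_path, "w") as out:
        subprocess.run(["sed", "-f", script_path, text_path],
                       stdout=out, check=True)
```

With the question's file names, this would be called as remove_stopwords_with_sed("stopwords.txt", "text.txt", "output_file.txt"). Note that sed matching is case-sensitive, so "The" would survive unless the script also covers capitalized forms.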

I think this approach works... but it is very slow, so if anyone has suggestions on how to make it more efficient, I'd really appreciate it:

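import re
from stemming.porter2 import stem as PT


with open('text.txt', 'r') as f, open('stopwords.txt','r') as st:

    f_content = f.read()
    # split on runs of non-letters and drop the empty strings this leaves behind
    processed = [x for x in re.split('[^a-zA-Z]+', f_content) if x]
    processed = [x.lower() for x in processed]
    processed = [PT(x) for x in processed]
    #print(processed)

    st_content = st.read()
    # a set gives fast membership tests
    st_list = set(st_content.split())

    # note: words were stemmed above, so stemmed forms are compared with the stop words
    clean_text = [x for x in processed if x not in st_list]
    print(clean_text)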

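It looks like you're calling st.read() outside the with block, which means st will already be closed. Also, what exactly is going wrong with this code so far?

Hey @jammydower, thanks for your reply! I've updated my original question with the problem ^. If you could help, I'd really appreciate it!!

This is the strategy I'm after, but we're not allowed to use NLTK... We were given a text file with a bunch of stop words that they want us to remove from the document file... any suggestions?!? Thanks

So you just need to remove certain words from the text? No stemming involved?

Yes. For example, if stopwords.txt is [a, the, it, from] and text.txt is [Hello I am from the UK], the result should be [Hello I am UK]. I also need to do stemming, but that's the next step! I've managed to get the stemming working, but I need to remove the stop words first.

Check the snippet I added in my edit. I haven't tested it, but it should work.

Hey! I think the idea behind the snippet is right, but the replace method didn't work because it takes two arguments (I've updated this in my code above), and I'm still having problems :(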
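The example from the comments can be sketched without NLTK as below. This is a minimal illustration, not the accepted method from the thread: remove_stopwords is a hypothetical helper, and it assumes words are runs of ASCII letters and that matching should ignore case. Stop words are removed before any stemming, as the asker wanted.

```python
import re


def remove_stopwords(text, stopwords):
    # A set gives constant-time membership checks, which matters for large files.
    stop_set = {w.lower() for w in stopwords}
    # Extract runs of letters; this avoids the empty strings re.split can produce.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Compare case-insensitively but keep the original spelling of kept words.
    return [t for t in tokens if t.lower() not in stop_set]


stopwords = ["a", "the", "it", "from"]
result = remove_stopwords("Hello I am from the UK", stopwords)
print(" ".join(result))  # prints: Hello I am UK
```

For the real files, the stop words could be read with open('stopwords.txt').read().split() and the function applied to the text one line at a time, so the whole document never has to be held in several copies at once.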
