
Python: stop words in a sentence tokenizer


I am using stopwords together with the sentence tokenizer, but when I print the filtered sentences the output still contains the stop words. The problem is that the stop words are not dropped from the output. How can I remove stop words when tokenizing into sentences?

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

userinput1 = input("Enter file name:")
myfile1 = open(userinput1).read()
stop_words = set(stopwords.words("english"))
word1 = nltk.sent_tokenize(myfile1)
filtration_sentence = []
for w in word1:
    word = sent_tokenize(myfile1)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)

userinput2 = input("Enter file name:")
myfile2 = open(userinput2).read()
stop_words = set(stopwords.words("english"))
word2 = nltk.sent_tokenize(myfile2)
filtration_sentence = []
for w in word2:
    word = sent_tokenize(myfile2)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

'''remove punctuation, lowercase, stem'''
def normalize(text):
    return stem_tokens(nltk.sent_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(myfile1, myfile2):
    tfidf = vectorizer.fit_transform([myfile1, myfile2])
    return ((tfidf * tfidf.T).A)[0, 1]

print(cosine_sim(myfile1, myfile2))

I don't think you can remove the stop words from a sentence directly. You have to split each sentence into words first, e.g. with nltk.word_tokenize, and then check whether each word is in the stop-word list. Here is an example:

import nltk
from nltk.corpus import stopwords
stopwords_en = set(stopwords.words('english'))

sents = nltk.sent_tokenize("This is an example sentence. We will remove stop words from this")

sents_rm_stopwords = []
for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w.lower() not in stopwords_en))
Output

['example sentence .', 'remove stop words']

Note that you can also remove punctuation by using string.punctuation:

import string
stopwords_punctuation = stopwords_en.union(string.punctuation) # merge the two sets together

Comments:

How do I use string.punctuation? @titipata
import string and then string.punctuation; you can then do stopwords_en.union(string.punctuation).
OK, I'm trying to implement that. One more question: my code above gives the cosine similarity between two documents, but I want it to show the similar sentences between the two documents. How can I print those out? @titipata
Boy, that is a harder question. You can compute a tf-idf cosine distance between whole documents, but for sentence similarity you would compute the Jaccard distance between the words of two sentences, or the similarity of their tf-idf vectors, and then sort the most similar pairs first. There are many options to choose from...
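Regarding that last comment, a minimal sketch of the pairwise idea is below. It is not part of the original answer: it assumes the two texts have already been read into myfile1 and myfile2 as in the question, scores every sentence pair both by the cosine similarity of tf-idf vectors (scikit-learn's cosine_similarity) and by the Jaccard overlap of their stop-word-filtered word sets, and then prints the highest-scoring pairs.

import itertools

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stopwords_en = set(stopwords.words('english'))

def jaccard(sent_a, sent_b):
    # Jaccard similarity of the lowercased, stop-word-free word sets of two sentences
    words_a = {w.lower() for w in nltk.word_tokenize(sent_a) if w.lower() not in stopwords_en}
    words_b = {w.lower() for w in nltk.word_tokenize(sent_b) if w.lower() not in stopwords_en}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def most_similar_sentences(text1, text2, top_n=5):
    # score every (sentence from text1, sentence from text2) pair and return the top_n pairs
    sents1 = nltk.sent_tokenize(text1)
    sents2 = nltk.sent_tokenize(text2)

    # tf-idf vectors for all sentences of both documents, built over one shared vocabulary
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform(sents1 + sents2)
    cos = cosine_similarity(tfidf[:len(sents1)], tfidf[len(sents1):])

    scored = []
    for i, j in itertools.product(range(len(sents1)), range(len(sents2))):
        scored.append((cos[i, j], jaccard(sents1[i], sents2[j]), sents1[i], sents2[j]))

    scored.sort(key=lambda pair: pair[0], reverse=True)  # rank pairs by cosine similarity
    return scored[:top_n]

# hypothetical usage with the two files read in the question
for cos_score, jac_score, s1, s2 in most_similar_sentences(myfile1, myfile2):
    print("cosine=%.2f jaccard=%.2f" % (cos_score, jac_score))
    print("  " + s1)
    print("  " + s2)

Sorting by the Jaccard score instead (or by a combination of the two) is just a matter of changing the sort key; which measure works better depends on the documents.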