Python 3.x: Best way to understand the input text before applying n-grams


python-3.x pandas nlp


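Currently I am reading text from an Excel file and applying bigrams to it. finalList, used in the sample code below, holds the list of input words read from the input Excel file.
Stop words were removed from the input with the help of the following library:

from nltk.corpus import stopwords
Bigram logic applied to the list of input words:

bigram=ngrams(finalList ,2)
Input text: I completed my end-to-end process
Current output: completed end, end end, end process
Desired output: completed end-to-end, end-to-end process

That means some groups of words, such as (end-to-end), should be treated as a single word.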
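The question does not show how finalList is built, so the following is only a minimal sketch of one way the current output could arise. It assumes a tokenizer that splits on every non-word character (hyphens included) and a tiny hypothetical stop word set; neither is taken from the asker's code.

```python
import re

text = "I completed my end-to-end process"

# Hypothetical stop word set, kept tiny for the example.
stop_words = {"i", "my", "to"}

# Assumed tokenizer: keeps runs of word characters only,
# so every hyphen acts as a separator.
tokens = re.findall(r"\w+", text)

# Drop the stop words, as the question describes.
final_list = [t for t in tokens if t.lower() not in stop_words]

# Pair each token with its successor to form bigrams.
bigrams = list(zip(final_list, final_list[1:]))
print(bigrams)
# [('completed', 'end'), ('end', 'end'), ('end', 'process')]
```

Under these assumptions "end-to-end" is broken into "end", "to", "end" before stop word removal, and dropping "to" leaves the "end end" pair seen in the current output.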
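To solve your problem, clean the unwanted characters (the trailing punctuation, which this answer loosely calls stop words) with a regular expression that leaves hyphens alone. See this example: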

import re

text = 'I completed my end-to-end process..:?'

# Remove runs of the unwanted characters. The hyphen is deliberately not
# in the character class, so "end-to-end" stays in one piece.
pattern = re.compile(r"[.:?]+")
new_text = pattern.sub('', text)
print(new_text)
# I completed my end-to-end process

# Now you can generate the bigrams manually.
# 1. Tokenize the new text
tok = new_text.split()
print(tok[:5])  # if the token list is huge, print only the first five
# ['I', 'completed', 'my', 'end-to-end', 'process']

# 2. Loop over the list and generate bigrams, store them in a list called bigrams
bigrams = []
for i in range(len(tok) - 1):  # -1 to avoid an index error
    bigrams.append(tok[i] + ' ' + tok[i + 1])

# 3. Print your bigrams
print(', '.join(bigrams))
# I completed, completed my, my end-to-end, end-to-end process
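The manual loop can be generalized to any n and combined with stop word removal, which is what the desired output in the question needs. This is a rough sketch rather than the answer's own code: the function name ngrams_from_text and the stop word set passed to it are made up for illustration.

```python
import re

def ngrams_from_text(text, n=2, stop_words=frozenset()):
    """Return n-grams as strings, keeping hyphenated words intact."""
    # Drop everything except word characters, whitespace and hyphens.
    cleaned = re.sub(r"[^\w\s-]", "", text)
    # Split on whitespace, then filter out the stop words.
    tokens = [t for t in cleaned.split() if t.lower() not in stop_words]
    # Slide a window of size n over the remaining tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

result = ngrams_from_text("I completed my end-to-end process..:?", 2, {"i", "my"})
print(result)
# ['completed end-to-end', 'end-to-end process']
```

Filtering stop words after tokenization, rather than before, keeps the "to" inside "end-to-end" from being removed.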

I hope this helps.
Check your tokenization? Use a proper tokenizer:
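As an illustration only, a hyphen-aware tokenizer could look like the sketch below. It uses a plain regular expression from the standard library instead of any particular library tokenizer, and the pattern and the name tokenize are assumptions, not taken from this answer.

```python
import re

# Assumed pattern: a token is a run of letters or digits, optionally
# continued by internal hyphens or apostrophes (end-to-end, isn't).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def tokenize(text):
    """Split text into tokens, keeping hyphenated compounds as one token."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("I completed my end-to-end process..:?")
print(tokens)
# ['I', 'completed', 'my', 'end-to-end', 'process']
```

Trailing punctuation never matches the pattern, so it is dropped without a separate cleaning step, and the resulting list can be passed to the bigram step in place of finalList.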