Python: Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words
I am building a chatbot in Python.
Code:
It runs fine, but every conversation prints the following warning:
/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
Here are some conversations from the CMD:
Bot: A chatbot is a piece of software that conducts a conversation via auditory or textual methods.
what is india
Bot: The wildlife of India, historically regarded with tolerance in Indian culture, is supported in protected habitats in these forests and elsewhere.
what is a chatbot
Bot: A chatbot is a piece of software that conducts a conversation via auditory or textual methods.

The reason is that you are using a custom tokenizer together with the default stop_words='english', so when the features are extracted a consistency check is run between the stop words and the tokenizer.
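Tokens like 'ha', 'u' and 'wa' are typical of a lemmatizer running inside the tokenizer ('has' -> 'ha', 'us' -> 'u', 'was' -> 'wa'). Here is a minimal sketch of both the warning and one common workaround; the toy lemmatizer below is an assumption standing in for your actual tokenizer, and the workaround is to pass a stop list pre-processed by that same tokenizer:

```python
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Toy stand-in for a lemmatizing tokenizer (assumption for illustration;
# NLTK's WordNetLemmatizer produces mappings such as 'was' -> 'wa').
LEMMA = {'has': 'ha', 'was': 'wa', 'us': 'u'}

def lemma_tokenize(text):
    return [LEMMA.get(w, w) for w in text.split()]

docs = ['hello how are you', 'this was a test']

# Custom tokenizer + stop_words='english' -> the consistency warning fires
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    TfidfVectorizer(tokenizer=lemma_tokenize, stop_words='english').fit(docs)
print(any('inconsistent' in str(w.message) for w in caught))  # prints: True

# Workaround: run the stop list through the same tokenizer first, so the
# stop words seen by the vectorizer match what the tokenizer produces.
fixed = {t for w in ENGLISH_STOP_WORDS for t in lemma_tokenize(w)}
TfidfVectorizer(tokenizer=lemma_tokenize, stop_words=list(fixed)).fit(docs)
```

Alternatively, you can simply silence the warning with the warnings module if the mismatch is harmless in your case.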
If you dig into the code of sklearn/feature_extraction/text.py, you will find this snippet performing the consistency check:
As you can see, it issues the warning whenever an inconsistency is found. Hope that helps.
def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
    """Check if stop words are consistent

    Returns
    -------
    is_consistent : True if stop words are consistent with the preprocessor
                    and tokenizer, False if they are not, None if the check
                    was previously performed, "error" if it could not be
                    performed (e.g. because of the use of a custom
                    preprocessor / tokenizer)
    """
    if id(self.stop_words) == getattr(self, '_stop_words_id', None):
        # Stop words were previously validated
        return None

    # NB: stop_words is validated, unlike self.stop_words
    try:
        inconsistent = set()
        for w in stop_words or ():
            tokens = list(tokenize(preprocess(w)))
            for token in tokens:
                if token not in stop_words:
                    inconsistent.add(token)
        self._stop_words_id = id(self.stop_words)

        if inconsistent:
            warnings.warn('Your stop_words may be inconsistent with '
                          'your preprocessing. Tokenizing the stop '
                          'words generated tokens %r not in '
                          'stop_words.' % sorted(inconsistent))
        return not inconsistent
    except Exception:
        # Failed to check stop words consistency (e.g. because a custom
        # preprocessor / tokenizer was used)
        self._stop_words_id = id(self.stop_words)
        return 'error'
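The loop above can be exercised on its own. A minimal replica (with a toy stemmer standing in for the tokenizer, an assumption for illustration) reproduces exactly the kind of token list shown in the warning:

```python
# Standalone replica of the consistency check above: tokenize each stop
# word and collect any tokens that fall outside the stop list.
def find_inconsistent(stop_words, preprocess, tokenize):
    inconsistent = set()
    for w in stop_words or ():
        for token in tokenize(preprocess(w)):
            if token not in stop_words:
                inconsistent.add(token)
    return sorted(inconsistent)

# Toy stemmer: strips trailing 's' characters (illustrative assumption)
stem_split = lambda text: [w.rstrip('s') or w for w in text.split()]

print(find_inconsistent({'has', 'was', 'the'}, str.lower, stem_split))
# prints: ['ha', 'wa']
```

Because 'has' and 'was' are stemmed to 'ha' and 'wa', which are not themselves in the stop set, they are reported as inconsistent, just as in the warning you see.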