Python: Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words
I am building a chatbot in Python.
Code:
It runs fine, but every conversation prints the following warning:
/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
Here are some conversations from the CMD:
Bot: A chatbot is a piece of software that conducts a conversation via auditory or textual methods.
what is india
Bot: The wildlife of India, historically regarded with tolerance in Indian culture, is supported in protected habitats in these forests and elsewhere.
what is a chatbot
Bot: A chatbot is a piece of software that conducts a conversation via auditory or textual methods.

The reason is that you are using a custom tokenizer together with the default stop_words='english', so when the features are extracted a consistency check is run between the stop words and the tokenizer.
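Tokens like 'ha', 'u' and 'wa' are typical of a lemmatizer running inside the tokenizer ('has' -> 'ha', 'us' -> 'u', 'was' -> 'wa'). Here is a minimal sketch of both the warning and one common workaround; the toy lemmatizer below is an assumption standing in for your actual tokenizer, and the workaround is to pass a stop list pre-processed by that same tokenizer:

```python
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Toy stand-in for a lemmatizing tokenizer (assumption for illustration;
# NLTK's WordNetLemmatizer produces mappings such as 'was' -> 'wa').
LEMMA = {'has': 'ha', 'was': 'wa', 'us': 'u'}

def lemma_tokenize(text):
    return [LEMMA.get(w, w) for w in text.split()]

docs = ['hello how are you', 'this was a test']

# Custom tokenizer + stop_words='english' -> the consistency warning fires
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    TfidfVectorizer(tokenizer=lemma_tokenize, stop_words='english').fit(docs)
print(any('inconsistent' in str(w.message) for w in caught))  # prints: True

# Workaround: run the stop list through the same tokenizer first, so the
# stop words seen by the vectorizer match what the tokenizer produces.
fixed = {t for w in ENGLISH_STOP_WORDS for t in lemma_tokenize(w)}
TfidfVectorizer(tokenizer=lemma_tokenize, stop_words=list(fixed)).fit(docs)
```

Alternatively, you can simply silence the warning with the warnings module if the mismatch is harmless in your case.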
If you dig into the code of sklearn/feature_extraction/text.py, you will find this snippet performing the consistency check:
As you can see, it issues the warning whenever an inconsistency is found. Hope that helps.
def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
    """Check if stop words are consistent

    Returns
    -------
    is_consistent : True if stop words are consistent with the preprocessor
                    and tokenizer, False if they are not, None if the check
                    was previously performed, "error" if it could not be
                    performed (e.g. because of the use of a custom
                    preprocessor / tokenizer)
    """
    if id(self.stop_words) == getattr(self, '_stop_words_id', None):
        # Stop words were previously validated
        return None

    # NB: stop_words is validated, unlike self.stop_words
    try:
        inconsistent = set()
        for w in stop_words or ():
            tokens = list(tokenize(preprocess(w)))
            for token in tokens:
                if token not in stop_words:
                    inconsistent.add(token)
        self._stop_words_id = id(self.stop_words)

        if inconsistent:
            warnings.warn('Your stop_words may be inconsistent with '
                          'your preprocessing. Tokenizing the stop '
                          'words generated tokens %r not in '
                          'stop_words.' % sorted(inconsistent))
        return not inconsistent
    except Exception:
        # Failed to check stop words consistency (e.g. because a custom
        # preprocessor / tokenizer was used)
        self._stop_words_id = id(self.stop_words)
        return 'error'
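The loop above can be exercised on its own. A minimal replica (with a toy stemmer standing in for the tokenizer, an assumption for illustration) reproduces exactly the kind of token list shown in the warning:

```python
# Standalone replica of the consistency check above: tokenize each stop
# word and collect any tokens that fall outside the stop list.
def find_inconsistent(stop_words, preprocess, tokenize):
    inconsistent = set()
    for w in stop_words or ():
        for token in tokenize(preprocess(w)):
            if token not in stop_words:
                inconsistent.add(token)
    return sorted(inconsistent)

# Toy stemmer: strips trailing 's' characters (illustrative assumption)
stem_split = lambda text: [w.rstrip('s') or w for w in text.split()]

print(find_inconsistent({'has', 'was', 'the'}, str.lower, stem_split))
# prints: ['ha', 'wa']
```

Because 'has' and 'was' are stemmed to 'ha' and 'wa', which are not themselves in the stop set, they are reported as inconsistent, just as in the warning you see.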