Python tf-idf scikit-learn: separate "word" from word

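I'm working on a text classification problem in which a word found in the form "word" (in double quotes) should carry a different importance than the same word found in the form word (without quotes), so I tried this code: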

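    import re
    from sklearn.feature_extraction.text import CountVectorizer

    sent1 = "The cat sat on my \"face\" face"
    sent2 = "The dog sat on my bed"
    content = [sent1, sent2]
    # Tokens: words of two or more characters, or a single !, ?, " or '
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
    vectorizer.fit(content)
    # Note: get_feature_names() was removed in scikit-learn 1.2;
    # newer versions provide get_feature_names_out() instead.
    print(vectorizer.get_feature_names())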

The result is

    ['"', 'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the']

whereas I would like to get

    ['bed', 'cat', 'dog', 'face', '"face"', 'my', 'on', 'sat', 'the']

You need to adjust the token_pattern parameter to suit your needs. The following works for the example provided:

    pattern = r"\S+[^!?.\s]"
    vectorizer = CountVectorizer(token_pattern=pattern)

However, you may need to refine the pattern further. A regular-expression tester may help you get the pattern right.
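To see where the pattern might need refining, it can help to run it outside the vectorizer. This is only a sketch: it assumes that CountVectorizer, with default settings, lowercases the text and then applies token_pattern via findall, and the second sentence is a made-up example that is not part of the question.

```python
import re

# Pattern from the answer above: a run of non-whitespace characters whose
# last character is not '!', '?', '.' or whitespace.
pattern = re.compile(r"\S+[^!?.\s]")

# Sentence from the question, lowercased as CountVectorizer does by default.
question_sentence = 'The cat sat on my "face" face'.lower()
question_tokens = pattern.findall(question_sentence)
print(question_tokens)

# Hypothetical sentence (not from the question) with extra punctuation.
tricky_sentence = 'Not my "face", a bed.'.lower()
tricky_tokens = pattern.findall(tricky_sentence)
print(tricky_tokens)
```

In the second sentence the comma stays attached to the quoted word and the single-character word "a" is dropped, which is the kind of case where the pattern would need further work.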

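Your token pattern,

    token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'"

looks for a word (\b\w\w+\b), or an exclamation mark, a question mark, or a quotation mark, so the quotes are always split off as separate tokens. Try something like

    token_pattern=r"(?u)\b\w\w+\b|\"\b\w\w+\b\"|!|\?|\'"

Note the part

    \"\b\w\w+\b\"

which looks for a word surrounded by double quotes.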
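The difference between the two patterns can be checked directly with the re module. This is a minimal sketch that applies both patterns to the lowercased sentence from the question, on the assumption that this mirrors what the vectorizer does internally; the variable names are illustrative only.

```python
import re

sentence = 'The cat sat on my "face" face'.lower()

# Pattern from the question: a bare double quote is a token of its own.
original = re.compile(r"(?u)\b\w\w+\b|!|\?|\"|\'")
# Revised pattern: a word wrapped in double quotes is matched as one token.
revised = re.compile(r"(?u)\b\w\w+\b|\"\b\w\w+\b\"|!|\?|\'")

original_tokens = original.findall(sentence)
revised_tokens = revised.findall(sentence)
print(original_tokens)
print(revised_tokens)
```

With the revised pattern the quoted and unquoted forms end up as two distinct tokens, '"face"' and 'face'. Note that it only covers a single quoted word; a quoted phrase of several words would still be split into plain words.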

This is a tokenization problem. Either fix token_pattern so that it captures the double-quote case, or pass a tokenizer callable to CountVectorizer.
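The callable route could look roughly like the sketch below. The helper name quote_aware_tokenizer and its regular expression are assumptions made for illustration, not something given in the question; the vocabulary is read from vocabulary_ so the example does not depend on which feature-name method the installed scikit-learn version offers.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical helper pattern: keeps "word" (with its quotes) as one token
# and otherwise matches plain words of two or more characters.
QUOTED_OR_PLAIN = re.compile(r'"\w\w+"|\b\w\w+\b')

def quote_aware_tokenizer(text):
    """Split text into tokens, keeping double-quoted words distinct."""
    return QUOTED_OR_PLAIN.findall(text)

content = ['The cat sat on my "face" face', "The dog sat on my bed"]

# token_pattern=None makes it explicit that the custom tokenizer is used
# instead of the regular-expression pattern.
vectorizer = CountVectorizer(tokenizer=quote_aware_tokenizer, token_pattern=None)
counts = vectorizer.fit_transform(content)

# The keys of vocabulary_ are the learned feature names.
features = sorted(vectorizer.vocabulary_)
print(features)
```

A callable is more verbose than a pattern, but it leaves room for logic that is awkward to express in a single regular expression, such as handling quoted phrases of several words.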