Python: CountVectorizer not generating bigrams when using the vocabulary parameter


I am trying to generate bigrams with CountVectorizer and join them back to the dataframe. However, it only gives me unigrams as output. I only want to create bigrams when a specific keyword is present, and I pass those keywords via the vocabulary parameter.

What I am trying to achieve is to eliminate the other words in the text corpus and generate n-grams only from the words listed in the vocabulary dictionary.

Input data

 Id Name
    1   Industrial  Floor chenidsd 34
    2   Industrial  Floor room   345
    3   Central District    46
    4   Central Industrial District  Bay
    5   Chinese District Bay
    6   Bay Chinese xrty
    7   Industrial  Floor chenidsd 34
    8   Industrial  Floor room   345
    9   Central District    46
    10  Central Industrial District  Bay
    11  Chinese District Bay
    12  Bay Chinese dffefef
    13  Industrial  Floor chenidsd 34
    14  Industrial  Floor room   345
    15  Central District    46
    16  Central Industrial District  Bay
    17  Chinese District Bay
    18  Bay Chinese grty
NLTK

import string
import nltk

new_stop_words = nltk.corpus.stopwords.words('english')
Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: "".join([item for item in x if not item.isdigit()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: "".join([item for item in x if item not in string.punctuation]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join([item for item in x.split() if item not in new_stop_words]))
Vocabulary definition

 english_corpus=['bay','central','chinese','district', 'floor','industrial','room']  
Bigram generator

cv = CountVectorizer(max_features=200, analyzer='word', vocabulary=english_corpus, ngram_range=(2, 2))
cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
for i, col in enumerate(cv.get_feature_names()):
    Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)
However, it only gives me unigrams as output. How can I fix this?

Output

In[26]:Nata.columns.tolist()
Out[26]:

['Id',
 'Name',
 'bay',
 'central',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']
TL;DR: see how the text is automatically lowercased, "tokenized" and stopword-filtered:

[out]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
['bay chinese',
 'central district',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']
If the ngram generation is done in the preprocessing step, simply override the analyzer parameter.

[out]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
['bay chinese',
 'central district',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']

You have misunderstood the meaning of the vocabulary parameter of CountVectorizer.

From the documentation:

vocabulary

Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

This means you will only ever get what is in the vocabulary as your get_feature_names(). If you want bigrams in your feature set, then you need to have bigrams in your vocabulary.

It does not generate the ngrams and then check whether the ngrams contain only words from your vocabulary.

In code, you will see that if you add bigrams to your vocabulary, they will appear in get_feature_names():

[out]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
['bay chinese',
 'central district',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']
So how do I get bigrams in my feature names, based on a list of unigrams? One possible solution: you have to write your own analyzer that does the ngram generation and checks that the generated ngrams are in the list of words you want to keep, e.g.:

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


from io import StringIO
from string import punctuation

from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()),2) 
            if not any([word for word in ng if word in stoplist or word.isdigit()])
           ]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()

See, you are making the same mistake of re-processing the same column multiple times.
@alvas, how can I generate bigrams and unigrams only when the words from the dictionary occur together?
@alvas, what I am trying to achieve is to eliminate the other words in the text corpus and build n-grams only from the list in the dictionary.
See the updated answer.
@alvas, thanks for the explanation. How can I modify it to create uni-, bi- and trigrams from the dictionary word list?
from nltk import everygrams, then use everygrams(word_tokenize(text.lower()), 1, 3) in place of ngrams().
That is a solid explanation, so it deserves an upvote.