Python: CountVectorizer does not generate bigrams when the vocabulary parameter is used
I am trying to generate bigrams with CountVectorizer and join them back onto a dataframe, but it only gives me unigrams as output. I only want to create bigrams when certain keywords are present, so I pass those keywords via the vocabulary parameter. What I am trying to achieve is to eliminate all other words in the text corpus and build n-grams only from the words listed in the vocabulary.
Input data
Id Name
1 Industrial Floor chenidsd 34
2 Industrial Floor room 345
3 Central District 46
4 Central Industrial District Bay
5 Chinese District Bay
6 Bay Chinese xrty
7 Industrial Floor chenidsd 34
8 Industrial Floor room 345
9 Central District 46
10 Central Industrial District Bay
11 Chinese District Bay
12 Bay Chinese dffefef
13 Industrial Floor chenidsd 34
14 Industrial Floor room 345
15 Central District 46
16 Central Industrial District Bay
17 Chinese District Bay
18 Bay Chinese grty
NLTK
words = nltk.corpus.stopwords.words('english')
Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join(item.lower() for item in x.split()))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join(ch for ch in x if not ch.isdigit()))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join(ch for ch in x if ch not in string.punctuation))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join(item for item in x.split() if item not in new_stop_words))
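For reference, here is a self-contained, runnable sketch of the cleaning pipeline above. The frame contents are hypothetical sample rows, and since `new_stop_words` is never defined in the question, a tiny stand-in stoplist is used so the sketch runs on its own:

```python
import string

import pandas as pd

# Stand-in for the question's undefined `new_stop_words`.
new_stop_words = {'the', 'a', 'of'}

# Hypothetical sample data in place of the question's `Nata` frame.
Nata = pd.DataFrame({'Name': ['Industrial Floor chenidsd 34',
                              'Central District 46']})

# Lowercase every token.
Nata['Clean_Name'] = Nata['Name'].apply(
    lambda x: ' '.join(item.lower() for item in x.split()))
# Strip digit characters, then punctuation characters.
Nata['Clean_Name'] = Nata['Clean_Name'].apply(
    lambda x: ''.join(ch for ch in x if not ch.isdigit()))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(
    lambda x: ''.join(ch for ch in x if ch not in string.punctuation))
# Drop stopwords.
Nata['Clean_Name'] = Nata['Clean_Name'].apply(
    lambda x: ' '.join(w for w in x.split() if w not in new_stop_words))

print(Nata['Clean_Name'].tolist())
# → ['industrial floor chenidsd', 'central district']
```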
Vocabulary definition
english_corpus=['bay','central','chinese','district', 'floor','industrial','room']
Bigram generator
cv = CountVectorizer(max_features=200, analyzer='word', vocabulary=english_corpus, ngram_range=(2, 2))
cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
for i, col in enumerate(cv.get_feature_names()):
    Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)
However, it only gives me unigrams as output. How can I fix this?
Output
In [26]: Nata.columns.tolist()
Out[26]:
['Id',
'Name',
'bay',
'central',
'chinese',
'district',
'floor',
'industrial',
'room']
TL;DR
See the following to understand how it automatically lowercases, tokenizes and removes stop words.
[out]:
['bay',
'bay chinese',
'central',
'central district',
'central industrial',
'chinese',
'chinese district',
'district',
'district bay',
'floor',
'floor room',
'industrial',
'industrial district',
'industrial floor',
'room']
If the n-gram generation is done in a preprocessing step, just override the analyzer parameter.
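A minimal sketch of that override: when analyzer is a callable, CountVectorizer skips its own tokenization and n-gram logic entirely and uses the callable's output as the tokens to count. The analyzer name and the sample sentence below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

def bigram_analyzer(text):
    """Emit space-joined bigrams; CountVectorizer counts these verbatim."""
    toks = text.lower().split()
    return [' '.join(pair) for pair in zip(toks, toks[1:])]

cv = CountVectorizer(analyzer=bigram_analyzer)
cv.fit(['Central Industrial District Bay'])
# The learned vocabulary now contains only the analyzer's bigrams.
print(sorted(cv.vocabulary_))
# → ['central industrial', 'district bay', 'industrial district']
```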
You misunderstood the meaning of the vocabulary parameter. From the documentation:

vocabulary : Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

This means that only what is in your vocabulary is considered for your feature names. If you want bigrams in your feature set, then you need to have bigrams in your vocabulary; it does not generate n-grams and then check whether the n-grams only contain words from your vocabulary.
You can see that if you add bigrams to the vocabulary, they will appear in get_feature_names():
[out]:
['bay chinese',
'central district',
'chinese',
'district',
'floor',
'industrial',
'room']
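A short sketch reproducing the list above (the two sample documents are hypothetical): the bigrams are placed directly in vocabulary, and ngram_range must still cover length-2 grams so that the analyzer actually emits bigrams for the vectorizer to count.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocab = ['bay chinese', 'central district', 'chinese', 'district',
         'floor', 'industrial', 'room']
# Without ngram_range=(1, 2) the analyzer would only ever produce
# unigrams, and the bigram vocabulary entries would never be matched.
cv = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2))
cv.fit_transform(['industrial floor room', 'bay chinese district'])
print(sorted(cv.vocabulary_))
# → ['bay chinese', 'central district', 'chinese', 'district',
#    'floor', 'industrial', 'room']
```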
So, how do I get bigrams among my feature names based on a list of unigrams (my vocabulary)?
One possible solution: you have to write your own analyzer with the n-gram generation, and check that the generated n-grams are in your list of words to keep, e.g.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from io import StringIO
from string import punctuation
from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + list(punctuation)
def preprocess(text):
    # Keep only the bigrams whose words are neither stopwords nor digits.
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()
Comments:
See, you are making the same mistake by reprocessing the same column multiple times.
@alvas, how can I generate bigrams and unigrams only when the dictionary words appear together?
@alvas, what I am trying to achieve is to eliminate the other words in the text corpus and build n-grams from the list in the dictionary.
See the updated answer.
@alvas, thanks for the explanation. How can I modify it to create uni-, bi- and trigrams from the dictionary word list?
For trigrams too: from nltk import everygrams, then use everygrams(word_tokenize(text.lower()), 1, 3) in place of ngrams().
This is a solid explanation, hence upvoted.
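The everygrams suggestion from the comments can be sketched as follows. The function name and sample sentence are hypothetical, and word_tokenize is replaced by str.split() here so the sketch needs no NLTK data downloads:

```python
from nltk import everygrams

def preprocess_1to3(text):
    """Emit all uni-, bi- and trigrams of the whitespace-split tokens."""
    return [' '.join(ng) for ng in everygrams(text.lower().split(), 1, 3)]

grams = preprocess_1to3('Central Industrial District')
print(sorted(grams))
# → ['central', 'central industrial', 'central industrial district',
#    'district', 'industrial', 'industrial district']
```

The same stoplist/digit filter from the answer's preprocess() could be applied inside the comprehension to keep only n-grams built from dictionary words.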