Python Sklearn CountVectorizer "empty vocabulary" error on a DataFrame when computing n-grams

python pandas dataframe scikit-learn

I have a DataFrame (data) with 3 records:

id    text
0001  The farmer plants grain
0002  The fisher catches tuna
0003  The police officer fights crime

I group this DataFrame by id:

data_grouped = data.groupby('id')

Describing the resulting groupby object shows that all records were retained.

I then run this code to find the n-grams in text and join them to id:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['text'])
    frequencies = sum(X).toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
    dfinner['id'] = id
    final = results.join(dfinner)

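When I run all of this code together, word_vectorizer raises an error stating "empty vocabulary; perhaps the documents only contain stop words". I know this error has come up in many other questions, but I could not find one that deals with a DataFrame.

To complicate matters further, the error does not always appear. I am pulling the data from a SQL database, and depending on how many records I pull, the error may or may not occur. For example, pulling the TOP 10 records causes the error, but TOP 5 does not.

Edit: full traceback: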
Traceback (most recent call last):

  File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
    runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
    X = word_vectorizer.fit_transform(group['cleanComments'])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"

ValueError: empty vocabulary; perhaps the documents only contain stop words
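
For reference, the message in the traceback can be reproduced without any SQL data. The sketch below assumes (it is not confirmed by the question) that one of the groups contains only a single-word text; the document "Thanks" is a made-up example. With ngram_range=(2,2) such a document yields no bigrams, so the vocabulary for that fit is empty.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-word document: it contains no bigrams at all.
single_word_docs = ["Thanks"]

vectorizer = CountVectorizer(stop_words=None, ngram_range=(2, 2), analyzer="word")

try:
    vectorizer.fit_transform(single_word_docs)
    error_message = None
except ValueError as exc:
    # scikit-learn raises ValueError when no feature could be extracted.
    error_message = str(exc)

print(error_message)
```

Note that the default token_pattern only keeps tokens of two or more characters, so a text such as "I do" also ends up with a single token and therefore no bigrams. That would be consistent with the error depending on which records are pulled.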

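I can see what is going on here, but one nagging question came up as I went through it: why are you doing this? I am not sure I see the value of fitting a CountVectorizer to each document in a collection. The usual idea is to fit it to the entire corpus and do the analysis from there. I understand that you may want to see which grams are present in each document, but there are other, easier and more optimized ways to do that. For example: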
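import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'id': [1,2,3], 'text': ['The farmer plants grain', 'The fisher catches tuna', 'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)
# On scikit-learn 1.2+ use cv.get_feature_names_out(); get_feature_names() was removed there.
print(cv.get_feature_names())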
['catches tuna',
 'farmer plants',
 'fights crime',
 'fisher catches',
 'officer fights',
 'plants grain',
 'police officer',
 'the farmer',
 'the fisher',
 'the police']
print(dt_mat.todense())
[[0 1 0 0 0 1 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 0]
 [0 0 1 0 1 0 1 0 0 1]]

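Here you can see the features extracted by the CountVectorizer and a matrix representation of which features are present in each document. dt_mat is the document-term matrix; it holds the count (frequency) of each gram in the vocabulary (the features) for each document. To map this back to the grams, or even put it into the DataFrame, you can do the following: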
df['grams'] = cv.inverse_transform(dt_mat)
print(df)
   id                             text  \
0   1          The farmer plants grain
1   2          The fisher catches tuna
2   3  The police officer fights crime

                                               grams
0          [plants grain, farmer plants, the farmer]
1         [catches tuna, fisher catches, the fisher]
2  [fights crime, officer fights, police officer,...

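Personally, this feels more meaningful, because you are fitting the CountVectorizer to the entire corpus rather than to one document at a time. You can still extract the same information (the frequencies and the grams), and it will be much faster as the number of documents grows.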
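
If the goal is the table the question describes (one row per id and n-gram with its frequency), a corpus-wide fit can be unpacked into that shape. The following is only a sketch under assumptions: the column names ngram and frequency and the variables index_to_gram, long_counts and per_id are illustrative and do not come from the original posts. It reads the fitted vocabulary_ mapping instead of get_feature_names() so that it does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    "id": ["0001", "0002", "0003"],
    "text": [
        "The farmer plants grain",
        "The fisher catches tuna",
        "The police officer fights crime",
    ],
})

cv = CountVectorizer(stop_words=None, ngram_range=(2, 2), analyzer="word")
dt_mat = cv.fit_transform(df["text"])

# vocabulary_ maps gram -> column index; invert it to look grams up by column.
index_to_gram = {col: gram for gram, col in cv.vocabulary_.items()}

# COO format exposes (row, col, count) triples for the non-zero cells only.
coo = dt_mat.tocoo()
long_counts = pd.DataFrame({
    "id": df["id"].to_numpy()[coo.row],
    "ngram": [index_to_gram[c] for c in coo.col],
    "frequency": coo.data,
})

# If an id spans several rows, add up its counts per gram.
per_id = long_counts.groupby(["id", "ngram"], as_index=False)["frequency"].sum()
print(per_id)
```

With this layout a document that has no bigrams simply contributes no rows. The fit itself only fails if the whole corpus contains no bigram at all.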
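Can you explain a bit more? What do you mean by "when I run all of this code together"? Also, please post the full stack trace of the error.

See my edit for the traceback. I also tried changing X = word_vectorizer.fit_transform(group['cleanComments']) to X = word_vectorizer.fit_transform(data['cleanComments']). That eliminated the error, but obviously it also loses the grouping, so every n-gram gets assigned to every id. Also, when I add print(final) to the loop, the output prints as I expect: each id's DataFrame contains only that id's n-grams. Ignore my remark about "running the code together"; I just meant that the error occurs when I execute the script.

It may be getting hung up when the DataFrame contains a row with no bigrams (i.e., only one word). Is there a way to put a check inside the for loop that says something like "if group['text'] contains fewer than two words, skip it"? I just don't know how to write that in Python. Or, better yet, is there a way to include unigrams when group['text'] has only one word?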
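
One possible way to do what the last comment asks is sketched below. It keeps the per-id loop from the question, catches the "empty vocabulary" error for a group, and retries that group with ngram_range=(1,2) so that unigrams are included. The helper ngram_counts, the frames list, and the sample row "Thanks" are hypothetical names and data introduced only for illustration; they are not part of the original thread.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: id "0002" has a single word, so it yields no bigrams.
data = pd.DataFrame({
    "id": ["0001", "0002", "0003"],
    "text": ["The farmer plants grain", "Thanks", "The police officer fights crime"],
})

def ngram_counts(texts, ngram_range):
    """Return a DataFrame of n-gram frequencies, or None if no n-gram is found."""
    vectorizer = CountVectorizer(stop_words=None, ngram_range=ngram_range, analyzer="word")
    try:
        X = vectorizer.fit_transform(texts)
    except ValueError as exc:
        if "empty vocabulary" not in str(exc):
            raise  # some other problem, e.g. missing values in the texts
        return None
    # Column sums are the total counts; vocabulary_ maps gram -> column index.
    totals = np.asarray(X.sum(axis=0)).ravel()
    rows = [(gram, int(totals[col])) for gram, col in sorted(vectorizer.vocabulary_.items())]
    return pd.DataFrame(rows, columns=["ngram", "frequency"])

frames = []
for group_id, group in data.groupby("id"):
    counts = ngram_counts(group["text"], (2, 2))
    if counts is None:
        # No bigrams in this group: fall back to unigrams plus bigrams.
        counts = ngram_counts(group["text"], (1, 2))
    if counts is None:
        continue  # nothing usable in this group (for example, empty text)
    counts["id"] = group_id
    frames.append(counts)

final = pd.concat(frames, ignore_index=True)
print(final)
```

To skip such groups instead of falling back to unigrams, the second call can be dropped so that the loop continues as soon as the first call returns None. Collecting the per-group frames in a list and concatenating them once also keeps every id's result, whereas assigning to final inside the loop keeps only the last group.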