Counting phrase frequencies in a Python DataFrame


My data is stored in a pandas DataFrame - see the reproducible example below. The real DataFrame will have more than 10k rows, with more words/phrases per row. I want to count how many times each two-word phrase occurs in the column ReviewContent. If this were a text file instead of a DataFrame column, I would use NLTK's collocations module (similar to this answer or this answer). My question is: how can I convert the column ReviewContent into a single corpus text?

import numpy as np
import pandas as pd

data = {'ReviewContent' : ['Great food',
                           'Low prices but above average food',
                           'Staff was the worst',
                           'Great location and great food',
                           'Really low prices',
                           'The daily menu is usually great',
                           'I waited a long time to be served, but it was worth it. Great food']}

df = pd.DataFrame(data)
Expected output:

[(('great', 'food'), 3), (('low', 'prices'), 2), ...]

I suggest using join:

corpus = ' '.join(df.ReviewContent)
The result looks like this:

In [102]: corpus
Out[102]: 'Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food'
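From here the usual NLTK pipeline applies to corpus. A minimal sketch (my addition, assuming NLTK and its punkt tokenizer data are available; see the comments below for a caveat about pairs that span two reviews):

import collections
import nltk

# Lowercase so 'Great food' and 'great food' are merged, as in the expected output.
words = nltk.word_tokenize(corpus.lower())
bigram_counts = collections.Counter(nltk.bigrams(words))

print(bigram_counts.most_common(2))
# [(('great', 'food'), 3), (('low', 'prices'), 2)]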

Using Pandas version 0.20.1+, you can create a SparseDataFrame directly from the sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) extracts two-word phrases (bigrams) only
cv = CountVectorizer(ngram_range=(2, 2))

# note: pd.SparseDataFrame was removed in pandas 1.0
r = pd.SparseDataFrame(cv.fit_transform(df.ReviewContent),
                       columns=cv.get_feature_names(),
                       index=df.index,
                       default_fill_value=0)
Result:

In [52]: r
Out[52]:
   above average  and great  average food  be served  but above  but it  daily menu  great food  great location  \
0              0          0             0          0          0       0           0           1               0
1              1          0             1          0          1       0           0           0               0
2              0          0             0          0          0       0           0           0               0
3              0          1             0          0          0       0           0           1               1
4              0          0             0          0          0       0           0           0               0
5              0          0             0          0          0       0           1           0               0
6              0          0             0          1          0       1           0           1               0

   is usually    ...     staff was  the daily  the worst  time to  to be  usually great  waited long  was the  was worth  \
0           0    ...             0          0          0        0      0              0            0        0          0
1           0    ...             0          0          0        0      0              0            0        0          0
2           0    ...             1          0          1        0      0              0            0        1          0
3           0    ...             0          0          0        0      0              0            0        0          0
4           0    ...             0          0          0        0      0              0            0        0          0
5           1    ...             0          1          0        0      0              1            0        0          0
6           0    ...             0          0          0        1      1              0            1        0          1

   worth it
0         0
1         0
2         0
3         0
4         0
5         0
6         1

[7 rows x 29 columns]
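If what you ultimately want is the [(phrase, count), ...] list from the question rather than a document-term matrix, summing the columns of the same matrix gets you there. A sketch (my addition) reusing cv from above; note that newer scikit-learn versions replace get_feature_names() with get_feature_names_out():

X = cv.fit_transform(df.ReviewContent)   # the same sparse matrix wrapped above
totals = X.sum(axis=0).A1                # corpus-wide count for each bigram
freqs = sorted(zip(cv.get_feature_names(), totals), key=lambda t: -t[1])
print(freqs[:2])
# [('great food', 3), ('low prices', 2)]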

If you just want to join all the strings from all rows into a single one:

text = ' '.join(df.ReviewContent)

Result:

In [57]: print(text)
Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food

As a sequence/iterable, df["ReviewContent"] has exactly the same structure as the result of applying nltk.sent_tokenize() to a text file: a list of strings, each containing one sentence. So use the same approach:

import collections
import nltk

counts = collections.Counter()
for sent in df["ReviewContent"]:
    words = nltk.word_tokenize(sent)
    counts.update(nltk.bigrams(words))

If you're not sure what to do next, that has nothing to do with working with a DataFrame. To count bigrams you don't need the collocations module, only nltk.bigrams() and a counting dictionary.
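As a small follow-up (my addition, not part of the original answer): Counter.most_common() returns exactly the [(bigram, count), ...] shape the question asks for, and lowercasing the tokens first reproduces the expected output, since 'Great food' and 'great food' are otherwise counted separately:

counts = collections.Counter()
for sent in df["ReviewContent"]:
    counts.update(nltk.bigrams(nltk.word_tokenize(sent.lower())))

print(counts.most_common(2))
# [(('great', 'food'), 3), (('low', 'prices'), 2)]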

This would work, but it creates "artificial" phrases: the last word of one review gets paired with the first word of the next review. I can probably work around that; if I don't receive a better answer, I will definitely accept this one.

Hope my answer answers your question "How can I convert the column ReviewContent into a single corpus text?"

I agree the artificial phrases are a drawback and wonder how others handle this. In the past I have tried joining the texts with an indicator token (such as ~) instead of a space, then using finder = BigramCollocationFinder.from_words(corpus), followed by a filter that removes the artificial phrases: finder.apply_word_filter(lambda w: w == '~')
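For illustration, a minimal sketch of that indicator-token trick (my reconstruction, not code from the thread; it assumes NLTK's punkt tokenizer data is available and that word_tokenize keeps ~ as its own token):

import nltk
from nltk.collocations import BigramCollocationFinder

# Join the reviews with a sentinel token so review boundaries stay visible.
corpus = ' ~ '.join(df.ReviewContent).lower()
words = nltk.word_tokenize(corpus)

finder = BigramCollocationFinder.from_words(words)
# Drop every bigram containing the sentinel, i.e. every pair spanning two reviews.
finder.apply_word_filter(lambda w: w == '~')

print(finder.ngram_fd.most_common(2))
# [(('great', 'food'), 3), (('low', 'prices'), 2)]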