Counting phrase frequencies in a Python DataFrame
My data is stored in a pandas DataFrame; see the reproducible example below. The real DataFrame will have more than 10k rows, and each row will contain more words/phrases. I want to count how many times each two-word phrase occurs in the column ReviewContent. If this were a text file rather than a column of a DataFrame, I would use NLTK's collocations module (similar to this answer or this answer). My question is: how can I convert the column ReviewContent into a single corpus text?
import numpy as np
import pandas as pd
data = {'ReviewContent' : ['Great food',
                           'Low prices but above average food',
                           'Staff was the worst',
                           'Great location and great food',
                           'Really low prices',
                           'The daily menu is usually great',
                           'I waited a long time to be served, but it was worth it. Great food']}
df = pd.DataFrame(data)
Expected output:
[(('great', 'food'), 3), (('low', 'prices'), 2), ...]
or
I would suggest using join:
corpus = ' '.join(df.ReviewContent)
The result:
In [102]: corpus
Out[102]: 'Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food'
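With corpus as a single string, the NLTK route mentioned in the question could then look like the sketch below (my addition, not part of this answer; the tokenization and case-folding choices are assumptions, and note the comments further down about "artificial" phrases across review boundaries):

import nltk

# Tokenize the joined corpus and count all two-word phrases.
words = nltk.word_tokenize(corpus.lower())
bigram_counts = nltk.FreqDist(nltk.bigrams(words))
print(bigram_counts.most_common(2))
# [(('great', 'food'), 3), (('low', 'prices'), 2)]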
With Pandas version 0.20.1+, you can create a SparseDataFrame directly from the sparse matrix:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))
r = pd.SparseDataFrame(cv.fit_transform(df.ReviewContent),
                       columns=cv.get_feature_names(),
                       index=df.index,
                       default_fill_value=0)
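A version note (my addition, not part of the original answer): pd.SparseDataFrame was removed in pandas 1.0, and CountVectorizer.get_feature_names() was removed in scikit-learn 1.2. On current versions, a sketch of the equivalent would be:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = cv.fit_transform(df.ReviewContent)   # scipy sparse matrix of per-review bigram counts
r = pd.DataFrame.sparse.from_spmatrix(bigram_matrix,
                                      columns=cv.get_feature_names_out(),
                                      index=df.index)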
Result:
In [52]: r
Out[52]:
above average and great average food be served but above but it daily menu great food great location \
0 0 0 0 0 0 0 0 1 0
1 1 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 1 1
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0
6 0 0 0 1 0 1 0 1 0
is usually ... staff was the daily the worst time to to be usually great waited long was the was worth \
0 0 ... 0 0 0 0 0 0 0 0 0
1 0 ... 0 0 0 0 0 0 0 0 0
2 0 ... 1 0 1 0 0 0 0 1 0
3 0 ... 0 0 0 0 0 0 0 0 0
4 0 ... 0 0 0 0 0 0 0 0 0
5 1 ... 0 1 0 0 0 1 0 0 0
6 0 ... 0 0 0 1 1 0 1 0 1
worth it
0 0
1 0
2 0
3 0
4 0
5 0
6 1
[7 rows x 29 columns]
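To get from this per-review matrix to the asker's expected overall counts, one could sum the columns (my addition, not in the original answer; CountVectorizer lowercases by default, so "Great food" and "great food" are already merged):

totals = r.sum().sort_values(ascending=False)
print(totals.head(2))
# great food    3
# low prices    2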
If you just want to join the strings from all rows into a single one, use the same join as shown above:

text = ' '.join(df.ReviewContent)

Result:
In [57]: print(text)
Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food
As a sequence/iterable, df["ReviewContent"] has exactly the same structure as the result of applying nltk.sent_tokenize() to a text file: a list of strings, each containing one sentence. So process it the same way:
import collections
import nltk

counts = collections.Counter()
for sent in df["ReviewContent"]:
    words = nltk.word_tokenize(sent)
    counts.update(nltk.bigrams(words))
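A hedged follow-up (my addition): to reproduce the asker's expected output exactly, case-fold before tokenizing so 'Great food' and 'great food' merge, then take Counter.most_common():

# Variant with case-folding, so bigrams differing only in case count as one.
counts = collections.Counter()
for sent in df["ReviewContent"]:
    counts.update(nltk.bigrams(nltk.word_tokenize(sent.lower())))

print(counts.most_common(2))
# [(('great', 'food'), 3), (('low', 'prices'), 2)]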
If you're not sure what to do next, that has nothing to do with using a DataFrame. To count bigrams you don't need the collocations module, just nltk.bigrams() and a counting dictionary.

This would work, but it produces "artificial" phrases: the last word of one review gets combined with the first word of the next. I can probably work around that, and if I don't receive a better answer I will certainly accept this one.

I hope my answer addresses your question "How can I convert the column ReviewContent into a single corpus text?"

I agree about the drawback of the artificial phrases and wonder how others handle it. In the past I have tried joining the texts with an indicator token (such as ~) instead of a space, then using finder = BigramCollocationFinder.from_words(corpus), followed by a filter to remove the artificial phrases: finder.apply_word_filter(lambda w: w == '~'), based on the example code.
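A minimal sketch of that separator-token trick (my reconstruction of the comment, not code posted in the thread; the sentinel token '~' and the use of nltk.word_tokenize are assumptions):

import nltk
from nltk.collocations import BigramCollocationFinder

# Join the reviews with a sentinel token instead of a plain space, so any
# bigram spanning two reviews necessarily contains the sentinel.
joined = ' ~ '.join(df.ReviewContent)
words = nltk.word_tokenize(joined.lower())

finder = BigramCollocationFinder.from_words(words)
# Drop every bigram containing the sentinel, i.e. the "artificial"
# phrases that cross review boundaries.
finder.apply_word_filter(lambda w: w == '~')

print(finder.ngram_fd.most_common(3))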