Python n-gram frequency count
I have a pandas DataFrame, and I want to compute 2-gram frequencies based on a text column:
text_column
This is a book
This is a book that is read
This is a book but he doesn't think this is a book
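The end result should be a frequency count of 2-grams, but the frequency should count how many documents contain each 2-gram, not how many times the 2-gram occurs.
So part of the result would be: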
2 gram Count
This is 3
a book 3
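"This is" and "a book" appear in all 3 texts. Although the third text contains each of them twice, I'm only interested in how many documents each 2-gram appears in, so the count is 3, not 4.
Any idea how I can do this?
Thanks
This is very C-style, but it works. The idea is to track the "current" bigrams of each document, making sure each bigram is added only once per document (cur_bigrams = set()), and after each document to increment a global frequency counter (bigram_freq) for every bigram found in it. A new DataFrame is then built from bigram_freq, which holds the counts across all documents.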
import pandas as pd

# df is the DataFrame from the question, with a "text_column" column
bigram_freq = {}  # bigram -> number of documents that contain it
for doc in df["text_column"]:
    cur_bigrams = set()
    words = doc.split(" ")
    bigrams = zip(words, words[1:])
    for bigram in bigrams:
        if bigram not in cur_bigrams:  # Add bigram, but only once/doc
            cur_bigrams.add(bigram)
    for bigram in cur_bigrams:
        if bigram in bigram_freq:
            bigram_freq[bigram] += 1
        else:
            bigram_freq[bigram] = 1
result_df = pd.DataFrame(columns=["2_gram", "count"])
row_list = []
for bigram, freq in bigram_freq.items():
    row_list.append([bigram[0] + " " + bigram[1], freq])
for i in range(len(row_list)):
    result_df.loc[i] = row_list[i]
print(result_df)
Output:
2_gram count
0 a book 3
1 is a 3
2 This is 3
3 is read 1
4 that is 1
5 book that 1
6 he doesn't 1
7 this is 1
8 book but 1
9 but he 1
10 think this 1
11 doesn't think 1
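You could probably trim the code a bit by using a more functional style and/or list comprehensions. I'll leave that as an exercise for the reader.
Pythonic answer (written generically, so it can be applied to a file/DataFrame/anything):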
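import collections

c = collections.Counter()
for i in fh:  # fh is any iterable of text lines, e.g. an open file
    x = i.rstrip().split(" ")
    c.update(set(zip(x[:-1], x[1:])))
Now c holds the frequency of each 2-gram.
Explanation:
zip() returns an iterator over tuples of length 2 (the 2-grams).
set() removes the duplicates within a line.
collections.Counter() creates the object that tracks how many times each tuple appears. You need to import collections to use it.
Yes, Python is awesome.
What have you tried so far?
I like your solution more than mine. +1
Thanks for your solution! Quick question: how would I convert this to trigrams?
@Abbey I think you just replace the last line with:
c.update(set(zip(x[:-2], x[1:-1], x[2:])))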
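For what it's worth, the bigram and trigram variants above can probably be folded into one helper. This is only a sketch: the function name ngram_document_frequency and the parameter n are made up for illustration and are not from either answer, and it uses split() (any whitespace) rather than split(" "), which is an assumption about how the text should be tokenized.

```python
import collections


def ngram_document_frequency(texts, n=2):
    """Count, for each n-gram, how many texts contain it at least once.

    Hypothetical helper: `texts` is any iterable of strings.
    """
    counts = collections.Counter()
    for text in texts:
        words = text.split()  # assumption: split on any whitespace
        # Zipping n staggered slices yields the n-grams as tuples;
        # set() keeps each n-gram only once per text.
        ngrams = set(zip(*(words[i:] for i in range(n))))
        counts.update(ngrams)
    return counts


# The three sample texts from the question
texts = [
    "This is a book",
    "This is a book that is read",
    "This is a book but he doesn't think this is a book",
]

bigram_counts = ngram_document_frequency(texts, n=2)
trigram_counts = ngram_document_frequency(texts, n=3)

print(bigram_counts[("This", "is")])   # number of texts containing "This is"
print(bigram_counts[("a", "book")])    # number of texts containing "a book"
print(trigram_counts[("is", "a", "book")])
```

Since the helper only needs an iterable of strings, it should also accept df["text_column"] directly, and the resulting counter can be turned into a DataFrame the same way as in the first answer. The matching is case-sensitive, so "This is" and "this is" are counted as different 2-grams, just like in the output shown earlier.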