Python n-gram frequency count (Python, pandas, n-gram)

python pandas

I have a pandas DataFrame, and I want to compute 2-gram frequencies based on a text column.

text_column
This is a book
This is a book that is read
This is a book but he doesn't think this is a book

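The end result should be a frequency count of 2-grams, where the frequency is the number of documents that contain a given 2-gram, not the total number of times the 2-gram occurs.
So part of the result would be: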

2 gram     Count
This is    3
a book     3
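"This is" and "a book" appear in all 3 texts. Even though the 3rd text contains each of them twice, I am only interested in how many documents each 2-gram appears in, so the count is 3, not 4.
Do you know how I can do this?

Thanks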
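This is very C-style, but it works. The idea is to track the "current" bigrams of each document, making sure each one is added only once per document (cur_bigrams = set()), and, after each document, to increment a global frequency counter (bigram_freq) for every bigram found in that document. Then a new DataFrame is built from the information in bigram_freq, which is the global counter across documents.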

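    import pandas as pd

    # Sample data from the question
    df = pd.DataFrame({"text_column": [
        "This is a book",
        "This is a book that is read",
        "This is a book but he doesn't think this is a book",
    ]})

    bigram_freq = {}  # global counter: bigram -> number of documents containing it
    for doc in df["text_column"]:
        cur_bigrams = set()
        words = doc.split(" ")
        bigrams = zip(words, words[1:])
        for bigram in bigrams:
            if bigram not in cur_bigrams:  # Add bigram, but only once/doc
                cur_bigrams.add(bigram)
        for bigram in cur_bigrams:
            if bigram in bigram_freq:
                bigram_freq[bigram] += 1
            else:
                bigram_freq[bigram] = 1

    # Build the result DataFrame from the global counter
    result_df = pd.DataFrame(columns=["2_gram", "count"])
    row_list = []
    for bigram, freq in bigram_freq.items():
        row_list.append([bigram[0] + " " + bigram[1], freq])
    for i in range(len(row_list)):
        result_df.loc[i] = row_list[i]

    print(result_df)
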
Output:

               2_gram  count
    0          a book      3
    1            is a      3
    2         This is      3
    3         is read      1
    4         that is      1
    5       book that      1
    6      he doesn't      1
    7         this is      1
    8        book but      1
    9          but he      1
    10     think this      1
    11  doesn't think      1

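You could probably use a more functional style and/or list comprehensions to condense the code a bit. I'll leave that as an exercise for the reader.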
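
For what it's worth, here is one way that exercise might be approached. This is only a sketch, not the answerer's code: the helper name doc_bigrams is made up for illustration, it splits on any whitespace rather than on a single space, and it relies on Series.explode() accepting sets, so the row order among tied counts is not guaranteed.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "text_column": [
        "This is a book",
        "This is a book that is read",
        "This is a book but he doesn't think this is a book",
    ]
})


def doc_bigrams(text):
    """Return the set of distinct 2-grams in one document (hypothetical helper)."""
    words = text.split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


# One set per row -> one row per (document, bigram) -> count rows per bigram.
counts = df["text_column"].apply(doc_bigrams).explode().value_counts()
result_df = counts.rename_axis("2_gram").reset_index(name="count")
print(result_df)
```

Because each document contributes a set, a bigram that repeats inside one text is still counted once for that text, which is the document-frequency behaviour the question asks for.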
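A Pythonic answer (written generically, so it can be applied to a file, a DataFrame, or anything else):

    import collections

    # fh is any iterable of text lines, e.g. an open file or df["text_column"]
    c = collections.Counter()
    for i in fh:
        x = i.rstrip().split(" ")
        c.update(set(zip(x[:-1], x[1:])))

Now c holds the frequency of each 2-gram, counted once per line.

Explanation:

Each line is split on spaces into a list.

zip() then returns an iterator over tuples of length 2 (the 2-grams).

The iterator is fed into set() to remove duplicates.

The set is then fed into a collections.Counter() object, which keeps track of how many times each tuple has occurred. You need to import collections to use it.

Now it is easy to list the contents of the counter or convert it to any other format you like (e.g. a DataFrame).

Yes, Python is awesome.

What have you tried so far?

I like your solution better than mine. +1

Thanks for your solution! Quick question: how would I turn this into trigrams?

@Abbey I think you'd just replace the last line with:

    c.update(set(zip(x[:-2], x[1:-1], x[2:])))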
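
The trigram tweak from the comments can probably be generalized to any n. The following is a rough sketch under a few assumptions: the function name document_frequency and its parameters are invented here, and it splits on any whitespace instead of a single space.

```python
import collections


def document_frequency(docs, n=2):
    """Count, for each n-gram, how many documents contain it at least once.

    Hypothetical helper: `docs` is any iterable of strings (file lines,
    a DataFrame column, a list), `n` is the n-gram length.
    """
    counts = collections.Counter()
    for doc in docs:
        words = doc.rstrip().split()
        # Zipping n staggered views of the word list yields every n-gram;
        # zip stops at the shortest view, so short documents yield nothing.
        ngrams = zip(*(words[i:] for i in range(n)))
        # set() removes repeats, so each document adds at most 1 per n-gram.
        counts.update(set(ngrams))
    return counts


docs = [
    "This is a book",
    "This is a book that is read",
    "This is a book but he doesn't think this is a book",
]
bigram_counts = document_frequency(docs, n=2)
trigram_counts = document_frequency(docs, n=3)
print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```

With n=2 this should behave like the loop in the answer; most_common() is just one convenient way to inspect the counter before converting it to another format.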