Python: count the frequency of each word and the number of distinct IDs associated with it
In addition to counting the frequency of words in a document, I would also like to count the number of distinct IDs each word is associated with. It is easier to explain with an example:
from collections import defaultdict
from pandas import DataFrame, Series

d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)
>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon
# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})
# turn it into a DataFrame (list() keeps this working with Python 3 dict views)
dwc = {"word": Series(list(wc.keys())),
       "count": Series(list(wc.values()))}
dfwc = DataFrame(dwc)
>>> dfwc
   count        word
0      2        kiwi
1      1  strawberry
2      3       lemon
3      3       apple
4      4      banana
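Ideally I would like this to happen at the same time as the word-frequency count, but I am not sure how to integrate it.
Any pointers would be greatly appreciated.

I'm not very experienced with pandas, but you could do something like this. This approach keeps a dict whose keys are the words and whose values are the sets of all IDs each word appears with.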
wc = defaultdict(int)
idc = defaultdict(set)
for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold one
        # of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)
After this, idc looks like:
defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})
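After adding lenidc.values() as a key of dwc (lenidc being the per-word ID counts, i.e. the sizes of the sets in idc) and initializing dfwc, I get: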
   count  ids        word
0      2    1        kiwi
1      1    1  strawberry
2      3    2       lemon
3      3    1       apple
4      4    3      banana
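The downside of this approach is that it uses two separate dicts (wc and idc), and there is no guarantee that their keys (the words) come out in the same order. So you will want to merge the dicts to remove that problem. This is how I would do it: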
# Assumed definition (not shown above): number of distinct IDs per word.
lenidc = {word: len(ids) for word, ids in idc.items()}
# Makes it so the values in the wc dict are a tuple in
# (word_count, id_count) form
for key, value in lenidc.items():
    wc[key] = (wc[key], value)
# Now, when you construct dwc, for count and ids you only want to use
# the first and second tuple elements respectively.
dwc = {"word": Series(list(wc.keys())),
       "count": Series([v[0] for v in wc.values()]),
       "ids": Series([v[1] for v in wc.values()])}
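One way to sidestep the key-ordering concern entirely is to keep both pieces of information in a single dict from the start. The following is only a sketch of that idea, not the code from the answer above; the names stats, row_id and text are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

# One entry per word: [occurrence count, set of IDs seen with that word].
stats = defaultdict(lambda: [0, set()])
for row_id, text in zip(df["ID"], df["words"]):
    for word in text.split():
        entry = stats[word]
        entry[0] += 1          # word frequency
        entry[1].add(row_id)   # a set ignores IDs it already holds

# Every row is built from the same dict item, so word, count and ids
# cannot get out of step with each other.
dfwc = pd.DataFrame(
    [(word, count, len(ids)) for word, (count, ids) in stats.items()],
    columns=["word", "count", "ids"],
)
print(dfwc)
```

Because each row comes from one dict item, no second merge pass is needed.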
There may be a slicker way to do this, but I would do it in two steps. First flatten the data, then build a new DataFrame with the information we want:
import pandas as pd

# make a new, flattened object
s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
index = s.index.get_level_values(0)
# .ix has been removed from pandas; .loc does the same label-based lookup
new = df.loc[index].copy()
new["words"] = s.values
# now group and build
grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})
which produces
>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1
Step by step. We start from the original DataFrame:
>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon
Pull apart the rows that hold several fruits:
>>> s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
>>> s
0  0         apple
   1        banana
   2         apple
   3    strawberry
   4        banana
   5         lemon
1  0         apple
2  0        banana
3  0        banana
   1         lemon
4  0          kiwi
5  0          kiwi
   1         lemon
dtype: object
Get the index that aligns this with the original frame:
>>> index = s.index.get_level_values(0)
>>> index
Int64Index([0, 0, 0, 0, 0, 0, 1, 2, 3, 3, 4, 5, 5], dtype=int64)
Then view the original frame from that perspective (.loc replaces the removed .ix indexer):
>>> new = df.loc[index].copy()
>>> new["words"] = s.values
>>> new
  ID       words
0  a       apple
0  a      banana
0  a       apple
0  a  strawberry
0  a      banana
0  a       lemon
1  a       apple
2  b      banana
3  c      banana
3  c       lemon
4  c        kiwi
5  c        kiwi
5  c       lemon
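As an aside, recent pandas versions (0.25 and later) offer DataFrame.explode, which can build the same one-word-per-row frame in a single step. This is a sketch under that version assumption, not part of the original answer; long_df is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

# Split each string into a list of words, then give every word its own row.
# explode() repeats the original row label and ID for each word.
long_df = df.assign(words=df["words"].str.split()).explode("words")
print(long_df)
```

The row labels repeat in the same way as in the frame built with stack() and get_level_values(0).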
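This looks much more like something we can work with. In my experience, half the work is getting the data into the right format. After that, it's easy: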
>>> grouped = new.groupby("words")["ID"]
>>> summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
>>> summary
            count  ids
words
apple           3    1
banana          4    3
kiwi            2    1
lemon           3    2
strawberry      1    1
>>> summary = summary.reset_index().rename(columns={"words": "word"})
>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1
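The group-and-build step can also be written with named aggregation, which computes both columns in one agg call. This is a sketch that assumes pandas 0.25 or later; the long frame is rebuilt here with a plain list comprehension only so that the example runs on its own.

```python
import pandas as pd

ids = ["a", "a", "b", "c", "c", "c"]
texts = ["apple banana apple strawberry banana lemon",
         "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"]

# One (ID, word) row per word occurrence.
long_df = pd.DataFrame(
    [(row_id, word) for row_id, text in zip(ids, texts) for word in text.split()],
    columns=["ID", "words"],
)

# count = number of rows per word, ids = number of distinct IDs per word.
summary = (
    long_df.groupby("words")["ID"]
    .agg(count="size", ids="nunique")
    .reset_index()
    .rename(columns={"words": "word"})
)
print(summary)
```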
Note that we could also have found this information simply by calling .describe() on the grouped object.
We could also have started from there and then pivoted to get the desired output.
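A rough sketch of the .describe() route follows. It assumes a recent pandas version, where describe() on a grouped column of strings returns one row per group with the columns count, unique, top and freq. As far as I know, older releases returned a stacked Series instead, which is probably why a pivot step is mentioned above.

```python
import pandas as pd

ids = ["a", "a", "b", "c", "c", "c"]
texts = ["apple banana apple strawberry banana lemon",
         "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"]

# One (ID, word) row per word occurrence.
long_df = pd.DataFrame(
    [(row_id, word) for row_id, text in zip(ids, texts) for word in text.split()],
    columns=["ID", "words"],
)

described = long_df.groupby("words")["ID"].describe()

# "count" is the word frequency and "unique" the number of distinct IDs.
summary = described[["count", "unique"]].rename(columns={"unique": "ids"})
print(summary)
```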
Are you sure the IDs column you gave in the expected output is correct? It looks like lemon has 2 and apple has 1.
You're absolutely right, I just corrected the numbers.
No problem! I also updated my answer with a pitfall I remembered and a solution for it.
Aha, I noticed that when I tried it on my data. I solved it in a less clever way: creating two separate DataFrames, one for wc and one for idc, and then merging the two on "word". Your solution is much more elegant. Thanks!
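For reference, the two-DataFrame approach described in the last comment might look roughly like this. It is a reconstruction from the comment's description, not the commenter's actual code; wc_df, idc_df and merged are hypothetical names.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

wc = defaultdict(int)    # word -> frequency
idc = defaultdict(set)   # word -> set of IDs
for row_id, text in zip(df["ID"], df["words"]):
    for word in text.split():
        wc[word] += 1
        idc[word].add(row_id)

# One frame per dict; merging on "word" aligns the rows by key,
# so the iteration order of the two dicts no longer matters.
wc_df = pd.DataFrame({"word": list(wc), "count": list(wc.values())})
idc_df = pd.DataFrame({"word": list(idc), "ids": [len(s) for s in idc.values()]})
merged = wc_df.merge(idc_df, on="word")
print(merged)
```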