How to count comma-separated duplicate values in a column in Python?
I have a dataframe column that looks like this:
1 Applied Learning, Literacy & Language
2 Literacy & Language, Special Needs
3 Math & Science, Literacy & Language
4 Literacy & Language, Math & Science
6 Math & Science, Applied Learning
7 Applied Learning
8 Literacy & Language
10 Math & Science...
Each row contains comma-separated values. What I want is to count the occurrences of every unique value. For example, Math & Science appears 4 times, so the count for Math & Science should be 4. I tried the following code:
cato = response['Category'].str.split(',')
cat_set = []
for i in cato.dropna():
    cat_set.extend(i)
plt1 = pd.Series(cat_set).value_counts().sort_values(ascending=False).to_frame()
The problem is that this code works for small datasets but takes a long time on large ones. Is there a solution?
Try using collections.Counter; it is built precisely for high performance on this kind of task.
Say you start with
df = pd.DataFrame({'Category': ['Applied Learning, Literacy & Language', 'Literacy & Language, Special Needs']})
Then do
import collections
import itertools
>>> collections.Counter(itertools.chain.from_iterable(v.split(',') for v in df.Category))
Counter({' Literacy & Language': 1,
' Special Needs': 1,
'Applied Learning': 1,
'Literacy & Language': 1})
This is one way, using collections.Counter and itertools.chain. Note that the whitespace needs special attention: without stripping, ' Literacy & Language' and 'Literacy & Language' are counted separately, as in the output above.
For performance, you should test and benchmark with your own data:
from collections import Counter
from itertools import chain
s = pd.Series(['Applied Learning, Literacy & Language', 'Literacy & Language, Special Needs',
'Math & Science, Literacy & Language', 'Literacy & Language, Math & Science',
'Math & Science, Applied Learning', 'Applied Learning', 'Literacy & Language',
'Math & Science'])
res = Counter(map(str.strip, chain.from_iterable(s.str.split(','))))
Counter({'Applied Learning': 3,
'Literacy & Language': 5,
'Math & Science': 4,
'Special Needs': 1})
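A pure-pandas alternative worth benchmarking against the Counter version (a sketch, assuming pandas >= 0.25 for Series.explode): split on commas, expand each list into its own row, strip whitespace, then count:

```python
import pandas as pd

s = pd.Series(['Applied Learning, Literacy & Language',
               'Literacy & Language, Special Needs',
               'Math & Science, Literacy & Language',
               'Literacy & Language, Math & Science',
               'Math & Science, Applied Learning',
               'Applied Learning', 'Literacy & Language',
               'Math & Science'])

# Split each cell into a list, explode lists into rows,
# strip surrounding whitespace, and count occurrences.
counts = s.str.split(',').explode().str.strip().value_counts()
print(counts.to_dict())
# {'Literacy & Language': 5, 'Math & Science': 4, 'Applied Learning': 3, 'Special Needs': 1}
```

This stays entirely inside pandas' vectorized string methods, which can matter on large columns, though which approach wins depends on your data.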
Another approach, using scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer=lambda x: [i.strip() for i in x.split(',')], lowercase=False)
counts = vec.fit_transform(df['Category'])  # actual counts; output is a sparse matrix
dict(zip(vec.get_feature_names(), counts.sum(axis=0).tolist()[0]))
The CountVectorizer here is a scikit-learn implementation used for bag-of-words modeling in natural language processing. You can use the counts object directly as a sparse matrix, which is very efficient for storage and computation, and you can run operations like .sum(axis=0), which sums column-wise. Once that is done, just zip the result with the vocabulary to get what you want.
Output:
{'Applied Learning': 3, 'Literacy & Language': 5, 'Math & Science': 4, 'Special Needs': 1}
This works over all the values in the column:
split = response['Category'].str.split(', ')
s = set()
for row in split:
    s.update(row)
for topic in s:
    df[topic] = response['Category'].map(lambda x: topic in x)
This results in df having:
0 Literacy & Language Math & Science \
0 Applied Learning, Literacy & Language True False
1 Literacy & Language, Special Needs True False
2 Math & Science, Literacy & Language True True
3 Literacy & Language, Math & Science True True
4 Math & Science, Applied Learning False True
5 Applied Learning False False
6 Literacy & Language True False
7 Math & Science False True
Applied Learning Special Needs
0 True False
1 False True
2 False False
3 False False
4 True False
5 True False
6 False False
7 False False
So you can count the sum of the True values:
for topic in s:
    print(topic, df[topic].sum())
Literacy & Language 5
Math & Science 4
Applied Learning 3
Special Needs 1
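The indicator-column idea above can also be done in one vectorized step with pandas' built-in Series.str.get_dummies (a sketch; note it splits on the exact separator string, here ', '):

```python
import pandas as pd

s = pd.Series(['Applied Learning, Literacy & Language',
               'Literacy & Language, Special Needs',
               'Math & Science, Literacy & Language',
               'Literacy & Language, Math & Science',
               'Math & Science, Applied Learning',
               'Applied Learning', 'Literacy & Language',
               'Math & Science'])

# One 0/1 indicator column per category, then sum each column.
dummies = s.str.get_dummies(sep=', ')
print(dummies.sum().to_dict())
# {'Applied Learning': 3, 'Literacy & Language': 5, 'Math & Science': 4, 'Special Needs': 1}
```

If the separator in your data is just ',' with inconsistent spacing, strip the pieces first (e.g. with str.split plus str.strip), since get_dummies matches the separator literally.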
I'd like to add another possible solution. I was also trying to solve this problem, and got similar results using a nested comprehension:
example = pd.Series(['a', 'b, c', 'a, b', 'b', 'd', 'c'])
values = []
[[values.append(key) for key in record.split(', ')] for record in example.values.tolist()]
series = pd.Series(values)
series.value_counts().sort_index()
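The side-effect comprehension above can be flattened into a plain nested comprehension that builds the list directly, which gives the same result and is arguably clearer:

```python
import pandas as pd

example = pd.Series(['a', 'b, c', 'a, b', 'b', 'd', 'c'])

# Flatten with a nested comprehension instead of appending for side effects.
values = [key for record in example for key in record.split(', ')]
counts = pd.Series(values).value_counts().sort_index()
print(counts.to_dict())
# {'a': 2, 'b': 3, 'c': 2, 'd': 1}
```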
Using CountVectorizer is very cool, but I think it's overkill here, since there's no need to distribute across instances. Nice, though. @AmiTavory Fully agreed - no doubt my first choice would have been collections.Counter, but since that was already taken, I couldn't resist putting this in :-P