Python pandas: .groupby().size() and percentages
I have a DataFrame that results from a df.groupby().size() operation and looks like this:
Localization RNA level
cytoplasm 1 Non-expressed 7
2 Very low 13
3 Low 8
4 Medium 6
5 Moderate 8
6 High 2
7 Very high 6
cytoplasm & nucleus 1 Non-expressed 5
2 Very low 8
3 Low 2
4 Medium 10
5 Moderate 16
6 High 6
7 Very high 5
cytoplasm & nucleus & plasma membrane 1 Non-expressed 6
2 Very low 3
3 Low 3
4 Medium 7
5 Moderate 8
6 High 4
7 Very high 1
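What I want to do is express each individual count (i.e. the last column, which comes from .size()) as a percentage of the total number of occurrences in the applicable Localization.
For example: the cytoplasm localization has 50 occurrences in total (7+13+8+6+8+2+6), so the Non-expressed and Very low RNA levels come to 14% and 26% respectively.
Is there a good way to do this? I have been using what I think is a very roundabout approach: creating a new DataFrame for each Localization and working from there. But there are many rows, and in the end all the resulting DataFrames have to be merged. I am hoping there is at least one smarter way.
Here is a complete example based on pandas functions.
The basic idea is to group the data by 'Localization' and apply a function to each group.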
import pandas as pd
from io import StringIO
#For Python 2, replace previous line with: from StringIO import StringIO
data = \
"""Localization,RNA level,Size
cytoplasm ,1 Non-expressed, 7
cytoplasm ,2 Very low ,13
cytoplasm ,3 Low , 8
cytoplasm ,4 Medium , 6
cytoplasm ,5 Moderate , 8
cytoplasm ,6 High , 2
cytoplasm ,7 Very high , 6
cytoplasm & nucleus ,1 Non-expressed, 5
cytoplasm & nucleus ,2 Very low , 8
cytoplasm & nucleus ,3 Low , 2
cytoplasm & nucleus ,4 Medium ,10
cytoplasm & nucleus ,5 Moderate ,16
cytoplasm & nucleus ,6 High , 6
cytoplasm & nucleus ,7 Very high , 5
cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
cytoplasm & nucleus & plasma membrane,2 Very low , 3
cytoplasm & nucleus & plasma membrane,3 Low , 3
cytoplasm & nucleus & plasma membrane,4 Medium , 7
cytoplasm & nucleus & plasma membrane,5 Moderate , 8
cytoplasm & nucleus & plasma membrane,6 High , 4
cytoplasm & nucleus & plasma membrane,7 Very high , 1"""
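# Create the dataframe
df = pd.read_csv(StringIO(data))
# .str.strip() and .astype() return new objects, so assign the results back
df['Localization'] = df['Localization'].str.strip()
df['RNA level'] = df['RNA level'].str.strip()
df['Size'] = df['Size'].astype(int)
# Each Size as a percentage of the total Size of its Localization
df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: 100 * x / x.sum())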
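The question starts from the output of .groupby().size(), which is a Series indexed by (Localization, RNA level) rather than a flat table. As a minimal sketch, assuming that Series is available under the hypothetical name `sizes` (rebuilt by hand here from the first two localizations in the question), the percentages can be computed directly on it by grouping on the first index level:

```python
import pandas as pd

levels = ["1 Non-expressed", "2 Very low", "3 Low", "4 Medium",
          "5 Moderate", "6 High", "7 Very high"]

# Hand-built stand-in for the result of df.groupby([...]).size();
# the counts are copied from the first two localizations in the question.
sizes = pd.Series(
    [7, 13, 8, 6, 8, 2, 6,      # cytoplasm
     5, 8, 2, 10, 16, 6, 5],    # cytoplasm & nucleus
    index=pd.MultiIndex.from_product(
        [["cytoplasm", "cytoplasm & nucleus"], levels],
        names=["Localization", "RNA level"],
    ),
)

# Total per Localization, broadcast back to every row of that group
totals = sizes.groupby(level="Localization").transform("sum")

# Each count as a percentage of its Localization total
percent = 100 * sizes / totals

result = pd.DataFrame({"Size": sizes, "Percent": percent})
print(result)
```

Because `transform("sum")` returns a Series with the same index as `sizes`, the division lines up row by row and no per-localization DataFrames need to be built or merged.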
You should use df['RNA level'].str.strip() for vectorized string cleaning (instead of converters), and df['Size'].astype(int) for vectorized int conversion. Your groupby can be collapsed to: df.groupby('Localization')['Size'].transform(lambda x: x/len(x))
Did you mean df.groupby('Localization')['Size'].transform(lambda x: x/sum(x))?
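The difference between the two one-liners in these comments can be checked on the cytoplasm counts from the question. This is only an illustrative sketch; the small frame and the variable names `by_len`, `by_sum` and `by_builtin` are made up for the comparison:

```python
import pandas as pd

# Cytoplasm counts from the question (total = 50, number of rows = 7)
df = pd.DataFrame({
    "Localization": ["cytoplasm"] * 7,
    "Size": [7, 13, 8, 6, 8, 2, 6],
})
grouped = df.groupby("Localization")["Size"]

# len(x) is the number of rows in the group, so this divides by 7
by_len = grouped.transform(lambda x: x / len(x))

# x.sum() is the group total, so this divides by 50 and gives fractions
by_sum = grouped.transform(lambda x: x / x.sum())

# Same fractions without a Python lambda, using the built-in aggregation
by_builtin = df["Size"] / grouped.transform("sum")

print(pd.DataFrame({"by_len": by_len, "by_sum": by_sum}))
```

Only the sum-based versions produce values that add up to 1 within a Localization; multiply by 100 to get the 14% and 26% figures from the question.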