Python: unstack and return value counts for each variable?
I have a DataFrame that records the responses of 19,717 people to a multiple-choice question about which programming languages they use. The first column is the respondent's gender, and the remaining columns are the languages they selected. The DataFrame looks like this; each response is recorded under the column of the same name, and an unselected response is stored as nan:
ID  Gender             Python  Bash  R    JavaScript  C++
0   Male               Python  nan   nan  JavaScript  nan
1   Female             nan     nan   R    JavaScript  C++
2   Prefer not to say  Python  Bash  nan  nan         nan
3   Male               nan     nan   nan  nan         nan
What I want is a table of counts broken down by gender. So, if 5000 men code in Python and 3000 women code in JS, I should get:
Gender             Python  Bash  R     JavaScript  C++
Male               5000    1000  800   1500        1000
Female             4000    500   1500  3000        800
Prefer Not To Say  2000    ...   ...   ...         860
I tried the following:
df.iloc[:, [*range(0, 13)]].stack().value_counts()
Male 16138
Python 12841
SQL 6532
R 4588
Female 3212
Java 2267
C++ 2256
Javascript 2174
Bash 2037
C 1672
MATLAB 1516
Other 1148
TypeScript 389
Prefer not to say 318
None 83
Prefer to self-describe 49
dtype: int64
This is not what I asked for above. Can this be done in pandas?

You can set Gender as the index and sum:
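s = df.set_index('Gender').iloc[:, 1:]  # Gender as index, ID column dropped
# sum(level=0) was removed in pandas 2.0; group on the index level instead
s.eq(s.columns).astype(int).groupby(level=0, sort=False).sum()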
Output:
Python Bash R JavaScript C++
Gender
Male 1 0 0 1 0
Female 0 0 1 1 1
Prefer not to say 1 1 0 0 0
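
To try this approach end to end, here is a minimal sketch on the four sample rows from the question. The DataFrame construction is an assumption about how the data is stored (real NaN for unselected answers, ID as an ordinary column), and groupby(level=0).sum() stands in for sum(level=0), which is not available on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; NaN marks an unselected answer (assumption)
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", np.nan, "Python", np.nan],
    "Bash": [np.nan, np.nan, "Bash", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "JavaScript": ["JavaScript", "JavaScript", np.nan, np.nan],
    "C++": [np.nan, "C++", np.nan, np.nan],
})

# Gender becomes the index; ID is not a language column, so drop it
s = df.set_index("Gender").drop(columns="ID")

# True where a cell equals its own column label, then sum per gender.
# sort=False keeps the genders in order of first appearance.
counts = s.eq(s.columns).astype(int).groupby(level=0, sort=False).sum()
print(counts)
```

Comparing against the column labels only counts exact matches, so an answer stored with different casing or stray whitespace would be counted as 0.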
Another idea is to count the values along axis 1, then:
[Out]
You can melt and then use crosstab:
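import numpy as np
import pandas as pd
# Long format: one row per (respondent, language) pair
df1 = pd.melt(df, id_vars=['ID', 'Gender'], var_name='Language', value_name='Choice')
# 1 where the answer matches the language column, 0 otherwise (including nan)
df1['Choice'] = np.where(df1['Choice'] == df1['Language'], 1, 0)
final = pd.crosstab(df1['Gender'], df1['Language'], values=df1['Choice'], aggfunc='sum')
print(final)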
Language Bash C++ JavaScript Python R
Gender
Female 0 1 1 0 1
Male 0 0 1 1 0
Prefer not to say 1 0 0 1 0
Let's do it in one line:
# drop('ID', 1) with a positional axis no longer works in pandas 2.0+
df.drop(columns='ID').melt('Gender').\
    query('variable==value').\
    groupby(['Gender','variable']).size().unstack(fill_value=0)
Out[120]:
variable Bash C++ JavaScript Python R
Gender
Female 0 1 1 0 1
Male 0 0 1 1 0
Prefer not to say 1 0 0 1 0
Assuming your nan is a real NaN (i.e. it is not the string 'nan'), we can take advantage of count, since it ignores NaN, to get the desired output:
df_out = df.iloc[:,2:].groupby(df.Gender, sort=False).count()
Out[175]:
Python Bash R JavaScript C++
Gender
Male 1 0 0 1 0
Female 0 0 1 1 1
Prefer not to say 1 1 0 0 0
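
If the unselected answers were read in as the literal string 'nan' rather than a real NaN, count would include them. The following is a minimal sketch under that assumption: it masks the strings first and then counts per gender. The sample frame and the names raw and langs are illustrative only.

```python
import pandas as pd

# Same sample rows, but unselected answers are the *string* "nan" (assumption)
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", "nan", "Python", "nan"],
    "Bash": ["nan", "nan", "Bash", "nan"],
    "R": ["nan", "R", "nan", "nan"],
    "JavaScript": ["JavaScript", "JavaScript", "nan", "nan"],
    "C++": ["nan", "C++", "nan", "nan"],
})

raw = df.iloc[:, 2:]              # language columns only
langs = raw.mask(raw.eq("nan"))   # turn the "nan" strings into real NaN

# count() skips NaN, so this is the number of selections per gender
df_out = langs.groupby(df["Gender"], sort=False).count()
print(df_out)
```

Without the mask step, every cell would be non-null and each count would simply equal the number of respondents of that gender.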
For some reason, this returns all 0s for every Gender index.
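
The comment does not say which data produced the zeros, so the following is only a guess at one possible cause: the equality-based answers count a cell only when it matches its column label exactly. If the labels differ from the stored answers (for example 'Javascript' in the value counts above versus a 'JavaScript' column, or stray whitespace), every comparison is False. A sketch with hypothetical column names, showing the mismatch and a variant that counts non-null cells instead:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column labels do not match the stored answers exactly
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Q1_Python": ["Python", np.nan, "Python"],
    "Q1_JavaScript": ["Javascript ", "Javascript ", np.nan],
})
langs = df.drop(columns="Gender")

# Exact comparison with the column labels: nothing matches, so all zeros
exact = langs.eq(langs.columns).groupby(df["Gender"]).sum()

# Counting non-null cells does not depend on the labels
selected = langs.notna().groupby(df["Gender"]).sum()

print(exact)
print(selected)
```

Counting non-null cells relies on unselected answers being real NaN values, so it is worth checking both conditions on the actual data.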