Python pandas: group by values in a list (concatenated string column)
I am trying to group by the items in a list within a DataFrame Series. The dataset being used is laid out roughly as follows:
            ...  LanguageWorkedWith  ...  ConvertedComp  ...
Respondent
1                Python;C                 50000
2                C++;C                    70000
I want to use groupby on the unique values in the list of languages used, and apply a mean aggregator function to ConvertedComp, something like:
LanguageWorkedWith
C++ 70000
C 60000
Python 50000
I have actually managed to produce the expected output, but my solution feels a bit clunky, and being new to pandas I suspect there is a better way. My solution is as follows:
import pandas as pd

# read csv
sos = pd.read_csv("developer_survey_2020/survey_results_public.csv", index_col='Respondent')
# separate string into list of strings, disregarding unanswered responses
temp = sos["LanguageWorkedWith"].dropna().str.split(';')
# create new DataFrame with respondent index and rows populated with known languages
langs_known = pd.DataFrame(temp.tolist(), index=temp.index)
# stack columns as rows, dropping old column names
stacked_responses = langs_known.stack().reset_index(level=1, drop=True)
# Re-index the sos DataFrame to match the stacked_responses dimension
# Concatenate the reindexed series to the ConvertedComp series column-wise
reindexed_pays = sos["ConvertedComp"].reindex(stacked_responses.index)
stacked_with_pay = pd.concat([stacked_responses, reindexed_pays], axis='columns')
# Remove rows with no salary data
# Rename columns
stacked_with_pay.dropna(how='any', inplace=True)
stacked_with_pay.columns = ["LWW", "Salary"]
# Group by LWW and apply median
lang_ave_pay = stacked_with_pay.groupby("LWW")["Salary"].median().sort_values(ascending=False).head()
Output:
LWW
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: Salary, dtype: float64
This matches the value computed when selecting a specific language directly: sos.loc[sos["LanguageWorkedWith"].str.contains('Perl').fillna(False), "ConvertedComp"].median()
Any tips on how to improve this, functions that provide this behaviour, etc. would be much appreciated.

In a DataFrame of only the target columns, split out the language names and combine them with the salary. The next step is to use melt to convert the data from a horizontal to a vertical format. Then group by the language name and take the median.
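The answer's code is not shown on this page; a minimal sketch of that split-and-melt idea, using a hypothetical toy DataFrame in place of the survey CSV (melt's ignore_index parameter needs pandas >= 1.1), might look like this:

```python
import pandas as pd

# Hypothetical stand-in for the survey data (the question reads
# developer_survey_2020/survey_results_public.csv instead).
sos = pd.DataFrame(
    {"LanguageWorkedWith": ["Python;C", "C++;C", None],
     "ConvertedComp": [50000, 70000, 60000]},
    index=pd.Index([1, 2, 3], name="Respondent"),
)

# One column per language, then melt the columns back into one vertical
# column while keeping the Respondent index.
wide = sos["LanguageWorkedWith"].str.split(";", expand=True)
long = wide.melt(ignore_index=False, value_name="LWW").dropna(subset=["LWW"])

# Re-attach the salary by index and take the median per language.
long["Salary"] = sos["ConvertedComp"].reindex(long.index)
lang_pay = long.groupby("LWW")["Salary"].median().sort_values(ascending=False)
print(lang_pay)
```

On this toy data the result is C++ 70000, C 60000, Python 50000, matching the expected output shown in the question.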
Nice, exploding the list and using melt is definitely better. I believe the reset_index isn't strictly necessary, since melt ignores the index by default; even with ignore_index set to False you still get the same result, just with the Respondent id as the index. Thanks again!

I noticed your point after posting. Thank you very much. If my answer helped you, please accept it as the correct answer and give it a +1.
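To illustrate the comment's point that the Respondent index survives, here is a sketch on hypothetical toy data (explode requires pandas >= 0.25): str.split followed by explode repeats the index once per language, so the salary column stays aligned with no reset_index or reindex step at all.

```python
import pandas as pd

# Hypothetical toy data with the same shape as the question's DataFrame.
sos = pd.DataFrame(
    {"LanguageWorkedWith": ["Python;C", "C++;C"],
     "ConvertedComp": [50000, 70000]},
    index=pd.Index([1, 2], name="Respondent"),
)

# explode() repeats each row's Respondent index once per list element,
# so ConvertedComp lines up with every language automatically.
exploded = (sos.assign(LWW=sos["LanguageWorkedWith"].str.split(";"))
               .explode("LWW"))
lang_pay = exploded.groupby("LWW")["ConvertedComp"].median()
```

Here exploded has index [1, 1, 2, 2], and the groupby gives C 60000, C++ 70000, Python 50000 on this toy data.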