Python pandas: group by values in a list (concatenated string column)
I am trying to group by the items in a list within a DataFrame Series. The dataset being used is laid out roughly as follows:
            ...  LanguageWorkedWith  ...  ConvertedComp  ...
Respondent
1                Python;C                 50000
2                C++;C                    70000
I want to use groupby on the unique values in the list of languages used, and apply a mean aggregator function to ConvertedComp, something like:
LanguageWorkedWith
C++ 70000
C 60000
Python 50000
I have actually managed to produce the expected output, but my solution feels a bit clunky, and being new to pandas I suspect there is a better way. My solution is as follows:
import pandas as pd

# read csv
sos = pd.read_csv("developer_survey_2020/survey_results_public.csv", index_col='Respondent')
# separate string into list of strings, disregarding unanswered responses
temp = sos["LanguageWorkedWith"].dropna().str.split(';')
# create new DataFrame with respondent index and rows populated with known languages
langs_known = pd.DataFrame(temp.tolist(), index=temp.index)
# stack columns as rows, dropping old column names
stacked_responses = langs_known.stack().reset_index(level=1, drop=True)
# Re-index the sos DataFrame to match the stacked_responses dimension
# Concatenate the reindexed series to the ConvertedComp series column-wise
reindexed_pays = sos["ConvertedComp"].reindex(stacked_responses.index)
stacked_with_pay = pd.concat([stacked_responses, reindexed_pays], axis='columns')
# Remove rows with no salary data
# Rename columns
stacked_with_pay.dropna(how='any', inplace=True)
stacked_with_pay.columns = ["LWW", "Salary"]
# Group by LWW and apply median
lang_ave_pay = stacked_with_pay.groupby("LWW")["Salary"].median().sort_values(ascending=False).head()
Output:
LWW
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: Salary, dtype: float64
This matches the value computed when selecting a specific language directly: sos.loc[sos["LanguageWorkedWith"].str.contains('Perl').fillna(False), "ConvertedComp"].median()
Any tips on how to improve this, functions that provide this behaviour, etc. would be much appreciated.

In a DataFrame of only the target columns, split out the language names and combine them with the salary. The next step is to use melt to convert the data from a horizontal to a vertical format. Then group by the language name and take the median.
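The answer's code is not shown on this page; a minimal sketch of that split-and-melt idea, using a hypothetical toy DataFrame in place of the survey CSV (melt's ignore_index parameter needs pandas >= 1.1), might look like this:

```python
import pandas as pd

# Hypothetical stand-in for the survey data (the question reads
# developer_survey_2020/survey_results_public.csv instead).
sos = pd.DataFrame(
    {"LanguageWorkedWith": ["Python;C", "C++;C", None],
     "ConvertedComp": [50000, 70000, 60000]},
    index=pd.Index([1, 2, 3], name="Respondent"),
)

# One column per language, then melt the columns back into one vertical
# column while keeping the Respondent index.
wide = sos["LanguageWorkedWith"].str.split(";", expand=True)
long = wide.melt(ignore_index=False, value_name="LWW").dropna(subset=["LWW"])

# Re-attach the salary by index and take the median per language.
long["Salary"] = sos["ConvertedComp"].reindex(long.index)
lang_pay = long.groupby("LWW")["Salary"].median().sort_values(ascending=False)
print(lang_pay)
```

On this toy data the result is C++ 70000, C 60000, Python 50000, matching the expected output shown in the question.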
Nice, exploding the list and using melt is definitely better. I believe the reset_index isn't strictly necessary, since melt ignores the index by default; even with ignore_index set to False you still get the same result, just with the Respondent id as the index. Thanks again!

I noticed your point after posting. Thank you very much. If my answer helped you, please accept it as the correct answer and give it a +1.
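To illustrate the comment's point that the Respondent index survives, here is a sketch on hypothetical toy data (explode requires pandas >= 0.25): str.split followed by explode repeats the index once per language, so the salary column stays aligned with no reset_index or reindex step at all.

```python
import pandas as pd

# Hypothetical toy data with the same shape as the question's DataFrame.
sos = pd.DataFrame(
    {"LanguageWorkedWith": ["Python;C", "C++;C"],
     "ConvertedComp": [50000, 70000]},
    index=pd.Index([1, 2], name="Respondent"),
)

# explode() repeats each row's Respondent index once per list element,
# so ConvertedComp lines up with every language automatically.
exploded = (sos.assign(LWW=sos["LanguageWorkedWith"].str.split(";"))
               .explode("LWW"))
lang_pay = exploded.groupby("LWW")["ConvertedComp"].median()
```

Here exploded has index [1, 1, 2, 2], and the groupby gives C 60000, C++ 70000, Python 50000 on this toy data.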