Python: expand a column of lists of lists into two new columns
I have a DataFrame like this:
name   id  apps
john   1   [[app1, v1], [app2, v2], [app3, v3]]
smith  2   [[app1, v1], [app4, v4]]
I want to expand the "apps" column so that the result looks like this:
name   id  app_name  app_version
john   1   app1      v1
john   1   app2      v2
john   1   app3      v3
smith  2   app1      v1
smith  2   app4      v4
Any help is greatly appreciated.

You can always fall back on a brute-force solution. Something like:
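import pandas as pd

# Collect one output row per [app_name, app_version] pair.
# Assumes df has the default 0..n-1 index, since rows are looked up by position i.
names, ids, app_names, app_versions = [], [], [], []
for i in range(len(df)):
    for app in df.loc[i, 'apps']:
        app_names.append(app[0])
        app_versions.append(app[1])
        names.append(df.loc[i, 'name'])
        ids.append(df.loc[i, 'id'])
df = pd.DataFrame({'name': names, 'id': ids,
                   'app_name': app_names, 'app_version': app_versions})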
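will do the job.
Note that I am assuming each value in df['apps'] is an actual list of [name, version] pairs. If the values are stored as strings, you need eval(df.loc[i, 'apps']) instead of df.loc[i, 'apps'].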
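If the column really does hold strings, ast.literal_eval from the standard library is a safer substitute for eval, because it only parses Python literals and does not execute arbitrary code. A minimal sketch, assuming every string is a valid literal with quoted items (the df_str frame below is made up for illustration):

```python
import ast

import pandas as pd

# Hypothetical frame whose 'apps' column holds string representations of lists.
df_str = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': ["[['app1', 'v1'], ['app2', 'v2']]", "[['app1', 'v1']]"],
})

# Parse each string into a real list of [app_name, app_version] pairs.
df_str['apps'] = df_str['apps'].apply(ast.literal_eval)

print(df_str.loc[0, 'apps'])
```

After this step the column can be fed to any of the approaches in this thread.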
You can use .apply(pd.Series) twice to get the intermediate steps you need, then merge the result back onto the original DataFrame:
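import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3', 'v3']],
             [['app1', 'v1'], ['app4', 'v4']]]
})

# First pass: one column per app, transposed and melted so that
# 'variable' holds the original row label and 'value' holds one [name, version] pair.
dftmp = df.apps.apply(pd.Series).T.melt().dropna()

# Second pass: split each pair into two columns, indexed by the original row label.
dfapp = (dftmp.value
         .apply(pd.Series)
         .set_index(dftmp.variable)
         .rename(columns={0: 'app_name', 1: 'app_version'})
         )

df[['name', 'id']].merge(dfapp, left_index=True, right_index=True)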
# returns:
    name  id app_name app_version
0   john   1     app1          v1
0   john   1     app2          v2
0   john   1     app3          v3
1  smith   2     app1          v1
1  smith   2     app4          v4
My suggestion (there may be a simpler way) is to use DataFrame.apply and pd.concat:
def expand_row(row):
    # Build one small DataFrame per input row; the scalar name/id values
    # are broadcast across all of that row's apps.
    return pd.DataFrame({
        'name': row['name'],  # not row.name, which is the name of the Series itself
        'id': row['id'],
        'app_name': [app[0] for app in row['apps']],
        'app_version': [app[1] for app in row['apps']]
    })

temp_dfs = df.apply(expand_row, axis=1).tolist()
expanded = pd.concat(temp_dfs)
expanded = expanded.reset_index(drop=True)  # renumber rows 0..n-1, discarding the old index
print(expanded)
#     name  id app_name app_version
# 0   john   1     app1          v1
# 1   john   1     app2          v2
# 2   john   1     app3          v3
# 3  smith   2     app1          v1
# 4  smith   2     app4          v4
Also, here is a pure-Python solution which, if my intuition is right, should be fast:
# Each row is [name, id, apps]; emit one flat record per app.
rows = df.values.tolist()
expanded = [[row[0], row[1], app[0], app[1]]
            for row in rows
            for app in row[2]]
df = pd.DataFrame(
    expanded, columns=['name', 'id', 'app_name', 'app_version'])
#     name  id app_name app_version
# 0   john   1     app1          v1
# 1   john   1     app2          v2
# 2   john   1     app3          v3
# 3  smith   2     app1          v1
# 4  smith   2     app4          v4
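The "should be fast" intuition is easy to check with timeit. The sketch below compares the list comprehension with a row-by-row .loc loop on synthetic data; the frame big, its size, and both helper function names are made up for the benchmark, and the actual timings depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

COLUMNS = ['name', 'id', 'app_name', 'app_version']

# Synthetic data: 500 rows, each with three [app_name, app_version] pairs.
big = pd.DataFrame({
    'name': [f'user{i}' for i in range(500)],
    'id': range(500),
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3', 'v3']]
             for _ in range(500)],
})


def with_comprehension(frame):
    """Flatten via plain Python lists, then build the DataFrame once."""
    rows = frame.values.tolist()
    expanded = [[row[0], row[1], app[0], app[1]]
                for row in rows
                for app in row[2]]
    return pd.DataFrame(expanded, columns=COLUMNS)


def with_loc_loop(frame):
    """Flatten with repeated .loc lookups (assumes a default 0..n-1 index)."""
    records = []
    for i in range(len(frame)):
        for app in frame.loc[i, 'apps']:
            records.append([frame.loc[i, 'name'], frame.loc[i, 'id'],
                            app[0], app[1]])
    return pd.DataFrame(records, columns=COLUMNS)


t_comp = timeit.timeit(lambda: with_comprehension(big), number=5)
t_loop = timeit.timeit(lambda: with_loc_loop(big), number=5)
print(f'list comprehension: {t_comp:.4f}s, .loc loop: {t_loop:.4f}s')
```

Both helpers produce the same table, so the comparison only measures how the rows are gathered.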
Another approach (which should also be fast):
Chaining pd.Series is easy to understand. If you would like to see more methods, look into unnesting.
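One possible reading of "chaining pd.Series" is sketched below. It is a reconstruction under assumptions rather than the answerer's exact code: the sample df is the one from the question, and the .dropna() call is added defensively because newer pandas versions may keep missing values when stacking.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3', 'v3']],
             [['app1', 'v1'], ['app4', 'v4']]],
})

expanded = (
    df.set_index(['name', 'id'])['apps']
      .apply(pd.Series)   # one column per position in the outer list
      .stack()            # long format: one [app_name, app_version] pair per row
      .dropna()           # drop padding for rows with fewer apps
      .apply(pd.Series)   # split each pair into two columns
      .rename(columns={0: 'app_name', 1: 'app_version'})
      .reset_index(level=['name', 'id'])  # bring name and id back as columns
      .reset_index(drop=True)
)
print(expanded)
```

As the comments further down point out, .apply(pd.Series) is slow, so this form trades speed for readability.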
Method 2: a slightly modified version of a function I wrote:
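def unnesting(df, explode):
    # Repeat each row label once per element of its list.
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten each listed column by concatenating its per-row lists.
    df1 = pd.concat([
        pd.DataFrame({x: sum(df[x].tolist(), [])}) for x in explode], axis=1)
    df1.index = idx
    # Join the untouched columns back on the repeated index.
    return df1.join(df.drop(columns=explode), how='left')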
Then apply it in either of the two ways shown after the comments.
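Use pd.DataFrame(df.apps.tolist()) instead of .apply(pd.Series), which is very slow.

Either way you are pulling the data out of the C-backed API into Python: .apply hides a for loop, while tolist pushes the wrapped objects back into Python. I haven't run any tests to see which is faster.

I did, which is why I commented. See also the details.

@James It is 1.1 s vs 900 µs, so it is roughly 1000 times faster, which is amazing.

Although this works, it may not be feasible for large DataFrames. One for loop in pandas is bad enough, so imagine two nested for loops ;} Always avoid iterating directly!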
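The swap suggested in the first comment can be sketched on a small column of pairs. The pairs Series here is made up for illustration; on data this small the speed difference will not be visible, only the equivalence of the results.

```python
import pandas as pd

# A column of [app_name, app_version] pairs, e.g. after unnesting the outer lists.
pairs = pd.Series([['app1', 'v1'], ['app2', 'v2'], ['app4', 'v4']])

# Slow on large data: constructs one pd.Series object per row.
slow = pairs.apply(pd.Series)

# Usually much faster: hand the plain Python lists to the DataFrame constructor.
fast = pd.DataFrame(pairs.tolist(), index=pairs.index)

print(fast)
```

Passing index=pairs.index keeps the original row labels, which matters when the result is joined back to other columns.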
# Option 1: unnest, then pull the two elements of each pair out with the .str accessor.
yourdf = unnesting(df, ['apps'])
yourdf['app_name'], yourdf['app_version'] = yourdf.apps.str[0], yourdf.apps.str[1]
yourdf
Out[548]:
         apps  id   name app_name app_version
0  [app1, v1]   1   john     app1          v1
0  [app2, v2]   1   john     app2          v2
0  [app3, v3]   1   john     app3          v3
1  [app1, v1]   2  smith     app1          v1
1  [app4, v4]   2  smith     app4          v4
# Option 2: reindex to add the two new (empty) columns, then fill them from the list of pairs.
yourdf = unnesting(df, ['apps']).reindex(
    columns=df.columns.tolist() + ['app_name', 'app_version'])
yourdf[['app_name', 'app_version']] = yourdf.apps.tolist()
yourdf
Out[567]:
         apps  id   name app_name app_version
0  [app1, v1]   1   john     app1          v1
0  [app2, v2]   1   john     app2          v2
0  [app3, v3]   1   john     app3          v3
1  [app1, v1]   2  smith     app1          v1
1  [app4, v4]   2  smith     app4          v4
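On pandas 0.25 or later, the built-in DataFrame.explode covers what the hand-written unnesting helper does. A minimal sketch, assuming the df from the question and pandas 1.1 or later for the ignore_index argument:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3', 'v3']],
             [['app1', 'v1'], ['app4', 'v4']]],
})

# One row per [app_name, app_version] pair; name and id are repeated.
exploded = df.explode('apps', ignore_index=True)

# Split each pair into two named columns, aligned on the new row index.
pairs = pd.DataFrame(exploded['apps'].tolist(),
                     columns=['app_name', 'app_version'],
                     index=exploded.index)

result = exploded.drop(columns='apps').join(pairs)
print(result)
```

This avoids both the explicit loops and .apply(pd.Series), in line with the advice in the comments.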