Python pandas: create one row per element when a cell contains lists/NaN/strings
Hi, I have a df like the one below:
index a b c d
0 xx aa av NaN
1 pp as ka [1,2,3,4]
2 pa aj q 1234
3 xq aq aq NaN
4 pn an kn [10,20,30,40]
5 px ax kx "00012"
I want to convert it into something like this:
index a b c d d-separated
0 xx aa av NaN NaN
1 pp as ka [1,2,3,4] 1
2 pp as ka [1,2,3,4] 2
3 pp as ka [1,2,3,4] 3
4 pp as ka [1,2,3,4] 4
5 pa aj q 1234 1234
6 xq aq aq NaN NaN
7 pn an kn [10,20,30,40] 10
8 pn an kn [10,20,30,40] 20
9 pn an kn [10,20,30,40] 30
10 pn an kn [10,20,30,40] 40
11 px ax kx "00012" "00012"
I have already looked at related questions, but my case is different from theirs, so those solutions do not work for me. Any help is appreciated.
This is a tricky problem, mainly because of the NaN values, so I first replace them with a fill value and change them back at the end:
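# spread each list in d over its own columns (NaN is filled with -999 first),
# then stack those columns back into one value per row
(df.join(df.fillna(-999)
           .d.apply(pd.Series))
   .drop(columns='d').set_index(['a', 'b', 'c'])
   .stack().reset_index()
   .drop(columns='level_3')
   .replace(-999, np.nan).rename(columns={0: 'd-separated'})
)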
a b c d-separated
0 xx aa av NaN
1 pp as ka 1
2 pp as ka 2
3 pp as ka 3
4 pp as ka 4
5 pa aj q 1234
6 xq aq aq NaN
7 pn an kn 10
8 pn an kn 20
9 pn an kn 30
10 pn an kn 40
11 px ax kx 00012
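However, this does lose the original d column: it contains unhashable types, so it cannot be set as an index level.

Keeping it is possible, but not trivial. For the index, the lists in d have to be converted to hashable tuples, and for the DataFrame constructor the scalars have to be converted to one-element lists: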
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
                   'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
                   'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
                   'd': [np.nan, [1,2,3,4], '1234', np.nan, [10, 20, 30, 40], '00012']})

# d1: every value as a list (NaN replaced by a placeholder first)
# d:  lists turned into tuples so the column can become an index level
s = (df.assign(d1=df['d'].fillna('NANval').apply(lambda x: x if isinstance(x, list) else [x]),
               d=df['d'].apply(lambda x: tuple(x) if isinstance(x, list) else x))
       .set_index(['a','b','c','d'])['d1']
     )
print (s)
a b c d
xx aa av NaN [NANval]
pp as ka (1, 2, 3, 4) [1, 2, 3, 4]
pa aj q 1234 [1234]
xq aq aq NaN [NANval]
pn an kn (10, 20, 30, 40) [10, 20, 30, 40]
px ax kx 00012 [00012]
Name: d1, dtype: object
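On pandas 0.25 or later, the expansion to one row per element can also be sketched more directly with DataFrame.explode. This is an alternative, not necessarily the code used above. It assumes pandas 1.1+ for the ignore_index argument. explode leaves scalars and NaN untouched, so neither the placeholder nor the tuple conversion is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
                   'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
                   'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
                   'd': [np.nan, [1, 2, 3, 4], '1234', np.nan,
                         [10, 20, 30, 40], '00012']})

# copy d into a new column, then give every list element its own row;
# strings and NaN are not list-like, so they are kept as single rows
df = (df.assign(**{'d-separated': df['d']})
        .explode('d-separated', ignore_index=True))
print(df)
```

The original d column is kept as it was, because only the copied column is exploded.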
df['d'] = df['d'].apply(lambda x: list(x) if isinstance(x, tuple) else x)
print (df)
a b c d d-separated
0 xx aa av NaN NaN
1 pp as ka [1, 2, 3, 4] 1
2 pp as ka [1, 2, 3, 4] 2
3 pp as ka [1, 2, 3, 4] 3
4 pp as ka [1, 2, 3, 4] 4
5 pa aj q 1234 1234
6 xq aq aq NaN NaN
7 pn an kn [10, 20, 30, 40] 10
8 pn an kn [10, 20, 30, 40] 20
9 pn an kn [10, 20, 30, 40] 30
10 pn an kn [10, 20, 30, 40] 40
11 px ax kx 00012 00012
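If necessary, the tuples in d can be converted back to lists at the end, which is what the df['d'].apply(...) line above does.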
First expand the data frame to the required size, repeating each row as many times as needed:
df1 = df.loc[df.index.repeat([len(x) if isinstance(x,list) else 1 for x in df.d])]
Now flatten column d and concatenate it with df1 from above:
d_sep= pd.DataFrame({'d_Sep':sum([x if isinstance(x,list) else [x] for x in df.d],[])})
df2 = pd.concat([df1.reset_index(drop=True),d_sep],axis=1)
a b c d d_Sep
0 xx aa av NaN NaN
1 pp as ka [1, 2, 3, 4] 1
2 pp as ka [1, 2, 3, 4] 2
3 pp as ka [1, 2, 3, 4] 3
4 pp as ka [1, 2, 3, 4] 4
5 pa aj q 1234 1234
6 xq aq aq NaN NaN
7 pn an kn [10, 20, 30, 40] 10
8 pn an kn [10, 20, 30, 40] 20
9 pn an kn [10, 20, 30, 40] 30
10 pn an kn [10, 20, 30, 40] 40
11 px ax kx 00012 00012
Sorry, but it shows ValueError: operands could not be broadcast together with shape (768329,) (2,)
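That message is what NumPy's repeat raises when the array of repeat counts does not have one entry per row. It may be worth checking that the list built from the d column is as long as the frame. If d is not a single column, for instance, iterating over it gives something else; that is only a guess, since the failing data is not shown. Below is a minimal sketch of the same repeat-and-flatten idea with that check added. It builds the wrapped lists once and uses itertools.chain instead of sum(..., []); the names as_lists and counts are made up for illustration:

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
                   'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
                   'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
                   'd': [np.nan, [1, 2, 3, 4], '1234', np.nan,
                         [10, 20, 30, 40], '00012']})

# wrap every non-list value so each row contributes at least one element
as_lists = [x if isinstance(x, list) else [x] for x in df['d']]
# repeat needs exactly one count per row
assert len(as_lists) == len(df)

counts = [len(x) for x in as_lists]
# repeat each row once per element, then attach the flattened values
df2 = df.loc[df.index.repeat(counts)].reset_index(drop=True)
df2['d_Sep'] = list(chain.from_iterable(as_lists))
print(df2)
```

Note that a row whose d is an empty list gets a count of 0 and is dropped by this approach.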