How do I efficiently create a pivot table in Python?

I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame({"c0": list('ABC'),
                   "c1": [" ".join(list('ab')), " ".join(list('def')), " ".join(list('s'))],
                   "c2": list('DEF')})

  c0     c1 c2
0  A    a b  D
1  B  d e f  E
2  C      s  F
I want to create a pivot table that looks like this:
      c2
c0 c1
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F
So the entries in c1 are split, and each resulting element is then used as its own entry in the MultiIndex.

My approach is as follows:
newdf = pd.DataFrame()

# Note: DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas.
for indi, rowi in df.iterrows():
    # get all single elements in the string
    n_elements = rowi['c1'].split()
    # only one element, so we can just add the entire row
    if len(n_elements) == 1:
        newdf = newdf.append(rowi)
    # more than one element
    else:
        for eli in n_elements:
            # re-index with negative labels so that loc can address the newly
            # added row; without this we would have identical index values
            if not newdf.empty:
                newdf = newdf.reset_index(drop=True)
                newdf.index = -1 * newdf.index - 1
            # add the entire row
            newdf = newdf.append(rowi)
            # replace the entire string by the single element
            newdf.loc[indi, 'c1'] = eli

print(newdf.reset_index(drop=True))
which yields

  c0 c1 c2
0  A  a  D
1  A  b  D
2  B  d  E
3  B  e  E
4  B  f  E
5  C  s  F
Then I can call

pd.pivot_table(newdf, index=['c0', 'c1'], aggfunc=lambda x: ' '.join(set(str(v) for v in x)))

which gives me the desired output (see above).
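For huge DataFrames this can be very slow, so I would like to know whether there is a more efficient way to do this.

This is how I get the result; in R this operation is called unnest.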
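# turn each c1 string into a list of its elements (this overwrites df.c1)
df.c1 = df.c1.str.split(' ')
# expand the lists into columns, stack them into rows, then rebuild the index
(df.set_index(['c0', 'c2'])['c1']
   .apply(pd.Series)
   .stack()
   .reset_index()
   .drop('level_2', axis=1)
   .rename(columns={0: 'c1'})
   .set_index(['c0', 'c1']))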
Out[208]:
      c2
c0 c1
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F
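
For readers on a newer pandas, the same unnest step can probably be written more directly. The following is a minimal sketch, not part of the original answers, and it assumes pandas 0.25 or later, where DataFrame.explode is available. It splits c1 into lists and gives each list element its own row.

```python
import pandas as pd

df = pd.DataFrame({
    "c0": list("ABC"),
    "c1": ["a b", "d e f", "s"],
    "c2": list("DEF"),
})

# Split each c1 string into a list, then give every list element its own row.
# assign() returns a new frame, so the original df is left unchanged.
newdf = (
    df.assign(c1=df["c1"].str.split())
      .explode("c1")
      .set_index(["c0", "c1"])
)
print(newdf)
```

Because the other columns are repeated automatically, no manual bookkeeping of row counts is needed here.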
Option 1

import numpy as np, pandas as pd

s = df.c1.str.split()
l = s.str.len()
newdf = df.loc[df.index.repeat(l)].assign(c1=np.concatenate(s)).set_index(['c0', 'c1'])
newdf

      c2
c0 c1
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F
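
To see why Option 1 works, it may help to look at the intermediate values one at a time. This is only an illustrative breakdown of the one-liner above; the names rows and tokens are introduced here and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "c0": list("ABC"),
    "c1": ["a b", "d e f", "s"],
    "c2": list("DEF"),
})

s = df["c1"].str.split()             # one list of elements per row
l = s.str.len()                      # number of elements in each row
rows = df.index.repeat(l)            # each row label repeated once per element
tokens = np.concatenate(s.tolist())  # all elements flattened into one array

print(l.tolist())       # how many times each row is repeated
print(rows.tolist())    # which original row each output row comes from
print(tokens.tolist())  # the value that replaces c1 in each output row

# Select the repeated rows and overwrite c1 with the flattened elements.
newdf = df.loc[rows].assign(c1=tokens).set_index(["c0", "c1"])
```

The repeated row labels and the flattened elements have the same length, which is what makes the assign step line up.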
Option 2

This should be a bit faster.

import numpy as np, pandas as pd

# np.char.split is the public name for np.core.defchararray.split
s = np.char.split(df.c1.values.astype(str), ' ')
l = [len(x) for x in s.tolist()]
r = np.arange(len(s)).repeat(l)
i = pd.MultiIndex.from_arrays([
    df.c0.values[r],
    np.concatenate(s)
], names=['c0', 'c1'])
newdf = pd.DataFrame({'c2': df.c2.values[r]}, i)
newdf

      c2
c0 c1
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F
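
Whether Option 2 is actually faster depends on the data and the installed versions, so it is worth measuring. The sketch below wraps both options in functions (option1 and option2 are names chosen here for illustration), checks that they agree, and times them on an enlarged copy of the frame. The repetition factor of 1000 is an arbitrary assumption, and no timings are claimed.

```python
import timeit

import numpy as np
import pandas as pd


def option1(df):
    # pandas string methods plus index repetition
    s = df["c1"].str.split()
    l = s.str.len()
    return (df.loc[df.index.repeat(l)]
              .assign(c1=np.concatenate(s.tolist()))
              .set_index(["c0", "c1"]))


def option2(df):
    # mostly NumPy: split, count, repeat positions, build the index directly
    s = np.char.split(df["c1"].values.astype(str), " ")
    l = [len(x) for x in s.tolist()]
    r = np.arange(len(s)).repeat(l)
    i = pd.MultiIndex.from_arrays(
        [df["c0"].values[r], np.concatenate(s.tolist())],
        names=["c0", "c1"],
    )
    return pd.DataFrame({"c2": df["c2"].values[r]}, i)


df = pd.DataFrame({
    "c0": list("ABC"),
    "c1": ["a b", "d e f", "s"],
    "c2": list("DEF"),
})

# Enlarge the frame so the timing is not dominated by fixed overhead.
big = pd.concat([df] * 1000, ignore_index=True)

r1, r2 = option1(big), option2(big)
same = (r1.index.tolist() == r2.index.tolist()
        and r1["c2"].tolist() == r2["c2"].tolist())
print("results agree:", same)

for name, func in [("option1", option1), ("option2", option2)]:
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(name, "seconds for 5 runs:", seconds)
```

Running this on data shaped like your own is the more reliable guide, since the balance between the two approaches can shift with row count and string length.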
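Works fine (upvoted), but it takes a while to finish ;). Just a cosmetic point: you can simply use split().

Very nice (upvoted), and still readable! For completeness, you could add the import numpy as np line and assign the result to newdf.

Hi Pir, I spent a lot of time reading this and working out how efficient your approach is! Thanks for sharing~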