Pandas/Python: converting two columns into a matrix, with column names in the matrix
I can successfully convert the two columns into a matrix with the following commands:
dfb = datab.parse("a")
dfb
Name Product
0 Mike Apple,pear
1 John Orange,Banana
2 Bob Banana
3 Connie Pear
pd.get_dummies(dfb.Product).groupby(dfb.Name).apply(max)
Apple,pear Banana Orange,Banana Pear
Name
Bob 0 1 0 0
Connie 0 0 0 1
John 0 0 1 0
Mike 1 0 0 0
However, the matrix I want looks like this:
Apple Banana Orange Pear
Name
Bob 0 1 0 0
Connie 0 0 0 1
John 0 1 1 0
Mike 1 0 0 1
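You need the following.
Solution that builds a new DataFrame with split, passes it to get_dummies, and finally groups by columns (hence axis=1 and level=0) and aggregates with max: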
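dfb = dfb.set_index('Name')
# the parentheses let the method chain continue on the next line
# note: groupby(axis=1) is deprecated from pandas 2.1; .T.groupby(level=0).max().T is the equivalent
df = (pd.get_dummies(dfb.Product.str.split(',', expand=True), prefix='', prefix_sep='')
        .groupby(axis=1, level=0).max())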
print (df)
Apple Banana Orange Pear
Name
Mike 1 0 0 1
John 0 1 1 0
Bob 0 1 0 0
Connie 0 0 0 1
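The sample data spells one product as lowercase "pear", which would get a column of its own unless the case is normalised first. Below is a minimal sketch of the same idea, assuming case-insensitive product names are what is wanted. It swaps in Series.str.get_dummies, so no grouping by columns is needed. The variable names products, dummies and result are illustrative only.

```python
import pandas as pd

dfb = pd.DataFrame({
    "Name": ["Mike", "John", "Bob", "Connie"],
    "Product": ["Apple,pear", "Orange,Banana", "Banana", "Pear"],
})

# Assumption: "pear" and "Pear" are the same product, so title-case first
products = dfb["Product"].str.title()

# One 0/1 column per comma-separated token, columns sorted alphabetically
dummies = products.str.get_dummies(sep=",")
dummies.index = dfb["Name"]

# Collapse repeated names (if any) so each name has exactly one row
result = dummies.groupby(level=0).max()
print(result)
```

Grouping on the index with max keeps the values at 0 or 1 even if a name appears on several rows.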
Solution using split and MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(dfb.Product.str.split(',')),
columns=mlb.classes_,
index=dfb.Name)
print (df)
Apple Banana Orange Pear
Name
Mike 1 0 0 1
John 0 1 1 0
Bob 0 1 0 0
Connie 0 0 0 1
If there are duplicates in the Name column:
df = df.groupby('Name').max()
print (df)
Apple Banana Orange Pear
Name
Bob 0 1 0 0
Connie 0 0 0 1
John 0 1 1 0
Mike 1 0 0 1
See the timings below.
Option 1
pir0 = lambda dfb: pd.get_dummies(dfb.Name).T.dot(
dfb.Product.str.title().str.get_dummies(','))
pir0(dfb)
Apple Banana Orange Pear
Bob 0 1 0 0
Connie 0 0 0 1
John 0 1 1 0
Mike 1 0 0 1
Option 2
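import numpy as np
from cytoolz import concat

def pir1(dfb):
    # integer codes and unique values for the names
    f0, u0 = pd.factorize(dfb.Name.values)
    # title-case each product string and split it into a list of products
    p = [x.title().split(',') for x in dfb.Product.values.tolist()]
    # number of products per row, used to repeat the name codes
    l = [len(y) for y in p]
    f1, u1 = pd.factorize(list(concat(p)))
    n, m = u0.size, u1.size
    # count each (name, product) pair through its flattened position
    return pd.DataFrame(
        np.bincount(f0.repeat(l) * m + f1, minlength=n * m).reshape(n, m),
        u0, u1)

pir1(dfb)
Apple Pear Orange Banana
Mike 1 1 0 0
John 0 0 1 1
Bob 0 0 0 1
Connie 0 1 0 0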
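The bincount step in pir1 may be easier to follow in isolation. The sketch below hard-codes the integer codes that factorize would produce for the sample data (an assumption made for illustration; the names rows, cols and flat are not from the answer) and shows how each (row, column) pair maps to one position in a flattened n * m array.

```python
import numpy as np

# Row codes: Mike=0, John=1, Bob=2, Connie=3 (one entry per product)
rows = np.array([0, 0, 1, 1, 2, 3])
# Column codes: Apple=0, Pear=1, Orange=2, Banana=3
cols = np.array([0, 1, 2, 3, 3, 1])
n, m = 4, 4

# Cell (r, c) of an n x m matrix sits at position r * m + c when flattened
flat = rows * m + cols

# Count how often each flattened position occurs, then restore the 2-D shape
counts = np.bincount(flat, minlength=n * m).reshape(n, m)
print(counts)
```

minlength guarantees that the result has n * m entries even when the last cells are empty, so the reshape cannot fail.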
Option 3
def pir2(dfb):
    f0, u0 = pd.factorize(dfb.Name.values)
    p = [x.title().split(',') for x in dfb.Product.values.tolist()]
    l = [len(y) for y in p]
    f1, u1 = pd.factorize(list(concat(p)))
    n, m = u0.size, u1.size
    # start from zeros and set a 1 at every (name, product) position
    a = np.zeros((n, m), dtype=int)
    a[f0.repeat(l), f1] = 1
    return pd.DataFrame(a, u0, u1)

pir2(dfb)
Apple Pear Orange Banana
Mike 1 1 0 0
John 0 0 1 1
Bob 0 0 0 1
Connie 0 1 0 0
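The two NumPy variants differ once a (name, product) pair occurs more than once: bincount counts the occurrences, while index assignment only sets a flag. A small sketch with made-up codes (rows, cols, counted and flags are illustrative names) makes the difference visible.

```python
import numpy as np

# The pair (row 1, column 0) appears twice on purpose
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 1, 0, 0])
n, m = 2, 2

# bincount approach: repeated pairs accumulate
counted = np.bincount(rows * m + cols, minlength=n * m).reshape(n, m)

# assignment approach: writing 1 twice still leaves 1
flags = np.zeros((n, m), dtype=int)
flags[rows, cols] = 1

print(counted)
print(flags)
```

If a strict 0/1 matrix is required, the assignment form gives it directly, whereas the counted form would need to be clipped at 1.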
Timing
Code below
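Although the OP did not say anything about it, if dfb.Name is not unique, these solutions will not aggregate over the duplicated rows. Try again after dfb = dfb.append(dfb).
Unless... the OP wants them aggregated... then sum. But yes, it seems so.
Are there duplicate values in the Name column, e.g. is the next row 4 Connie Orange?
Hmm, if the OP uses groupby on Name and aggregates with max, I think sum is wrong, because a 0/1 matrix requires aggregating with max. What do you think?
Yes, I agree... although it makes no difference for the OP's attempt. But it does suggest that this is probably what they want. I will switch to max and rerun.
@jezrael I reran the simulation. It is posted, please check.
If there are no duplicates, wait for divakar to check in.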
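To make the sum versus max point concrete, here is a small sketch that duplicates every row and aggregates both ways. It assumes pandas 2.0 or later, where DataFrame.append has been removed, so pd.concat is used to double the frame instead. The names doubled, by_sum and by_max are illustrative.

```python
import pandas as pd

dfb = pd.DataFrame({
    "Name": ["Mike", "John", "Bob", "Connie"],
    "Product": ["Apple,pear", "Orange,Banana", "Banana", "Pear"],
})

# Every name now appears twice (stand-in for dfb.append(dfb))
doubled = pd.concat([dfb, dfb], ignore_index=True)

# Title-case is an assumption so that "pear" and "Pear" share a column
dummies = doubled["Product"].str.title().str.get_dummies(sep=",")

by_sum = dummies.groupby(doubled["Name"]).sum()   # counts occurrences
by_max = dummies.groupby(doubled["Name"]).max()   # stays a 0/1 indicator

print(by_sum)
print(by_max)
```

With sum, Mike gets 2 under Apple and Pear. With max, the result is the same 0/1 matrix as for the frame without duplicates.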