Python: How do I use a combination or a string as the index of a DataFrame (pandas)?
I am trying to create a new column in a DataFrame using combinations of its columns. Since I don't know how to use the generated combinations as an index, I tried converting a combination to a string, but that doesn't work either.
import itertools as iter
def pset(lst):
    comb = (iter.combinations(lst, l) for l in range(2,3))
    return list(iter.chain.from_iterable(comb))
temp = pset(transactions)
t = str(temp[0]).strip(" ")
transactions[[t]]
This gives me an error:
KeyError: '["\'A\', \'B\'"] not in index'
Here A and B are columns in my DataFrame.
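To illustrate why the lookup above fails: a combination is a tuple of column labels, and converting it to text yields a single string that matches no column. A minimal sketch, assuming a small stand-in for the transactions DataFrame (the three-row data below is made up for illustration), shows that passing the tuple as a list selects the columns instead:

```python
import itertools
import pandas as pd

# Hypothetical stand-in for the question's `transactions` DataFrame.
transactions = pd.DataFrame({"A": [1, 1, 0], "B": [0, 1, 1], "C": [1, 1, 0]})

pairs = list(itertools.combinations(transactions.columns, 2))
first = pairs[0]                      # a tuple of two column labels

as_text = str(first)                  # one string, "('A', 'B')", not two labels
subset = transactions[list(first)]    # a list of labels selects both columns

# Number of rows in which every column of the pair equals 1.
both = int(subset.eq(1).all(axis=1).sum())
```

The joined string (for example "A,B") is still useful as a label for the result, just not as a key into the original columns.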
transaction dataset:
A,B,C,D,E,F,G
1,0,1,1,0,1,1
1,1,1,1,0,1,0
1,0,0,1,0,1,0
0,0,1,1,1,0,0
1,0,0,1,1,1,0
0,1,1,1,1,1,1
Expected output:
A,B A,C A,D
1 2 4
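You can get the expected output as follows, where df is the transaction dataset you posted. (This solution was written in Python 2.7, but I expect it to work in Python 3 as well.)
Output: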
A,B A,C A,D A,E A,F A,G B,C B,D B,E B,F B,G C,D C,E C,F C,G D,E D,F D,G E,F \
0 1 2 4 1 4 1 2 2 1 2 1 4 2 3 2 3 5 2 2
E,G F,G
0 1 2
Transpose if needed:
OT = Out.T
OT.columns = ["Count"]
Output:
Count
A,B 1
A,C 2
A,D 4
A,E 1
A,F 4
A,G 1
B,C 2
B,D 2
B,E 1
B,F 2
B,G 1
C,D 4
C,E 2
C,F 3
C,G 2
D,E 3
D,F 5
D,G 2
E,F 2
E,G 1
F,G 2
Edit:
Improved code that also works for higher dimensions:
import itertools as iter
import pandas as pd
import numpy as np
dim = 2
colComb = [a for a in iter.combinations(df.columns,dim)]
newCols = [','.join(colComb[i]) for i in range(len(colComb))]
Out = pd.DataFrame(columns = newCols)
for i in range(len(colComb)):
    # count the rows in which all `dim` columns of the combination are 1
    Out.loc[0,newCols[i]] = df[np.sum(df[list(colComb[i])],axis=1) == dim][colComb[i][0]].count()
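The snippet above assumes that df already exists. As a self-contained sketch of the same idea (not the answer's exact code), the helper below, whose name combo_counts is made up for illustration, loads the dataset from the question and counts the rows in which every column of a combination is 1:

```python
import io
import itertools
import pandas as pd

CSV = """A,B,C,D,E,F,G
1,0,1,1,0,1,1
1,1,1,1,0,1,0
1,0,0,1,0,1,0
0,0,1,1,1,0,0
1,0,0,1,1,1,0
0,1,1,1,1,1,1
"""
df = pd.read_csv(io.StringIO(CSV))

def combo_counts(frame, dim=2):
    """Count rows in which all columns of each dim-sized combination are 1."""
    counts = {}
    for combo in itertools.combinations(frame.columns, dim):
        mask = frame[list(combo)].eq(1).all(axis=1)   # True where every column is 1
        counts[",".join(combo)] = int(mask.sum())
    return pd.Series(counts, name="Count")

pair_counts = combo_counts(df, dim=2)     # index labels such as "A,B"
triple_counts = combo_counts(df, dim=3)   # index labels such as "A,B,C"
```

Returning a Series keyed by the joined labels gives the transposed layout directly; it assumes the data contains only 0 and 1.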
Edit:
Faster code for two dimensions:
cols = []
vals = []
for i in range(len(df.columns)):
    for j in range(i+1,len(df.columns)):
        cols.append(df.columns[i]+','+df.columns[j])
        vals.append(np.multiply(df[df.columns[i]],df[df.columns[j]]).sum())
Out = pd.DataFrame(columns=cols)
Out.loc[0] = vals
Out = Out.astype(int)
Another edit: a faster solution that handles higher dimensions:
colComb = [a for a in iter.combinations(df.columns,dim)]
cols = [','.join(colComb[i]) for i in range(len(colComb))]
vals = []
for C in colComb:
    # multiply the 0/1 columns of the combination element-wise, then sum
    v = df[C[0]]
    for i in range(1,len(C)):
        v = np.multiply(v,df[C[i]])
    vals.append(v.sum())
dd = pd.DataFrame(columns=cols)
dd.loc[0] = vals
dd = dd.astype(int)
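It should run at least 3-4 times faster.

Are you looking for pd.MultiIndex.from_product()?

I am implementing apriori, and I want to create a DataFrame with the same index that contains multiple columns (indices).

@YadyneshDesai - Can you add a sample DataFrame (5-6 rows) and the desired output?

@jezrael I have updated the question. The desired output is only a small part of what I actually need.

Thanks, but I am a bit confused - do you want combinations of all the columns A, B, C, D, E, F, G? Can you explain the numbers 1, 2, 4 in more detail?

Thanks. I will try it and get back to you in about two days.

In the code above, is the DataFrame index stored as a combination or as a string?

If you do OT = Out.T and then look at OT.index, you will see that the index is stored as strings. If you want those strings in a regular column, you can use reset_index() and fall back to a generic index.

This takes a lot of time if the dataset is large.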
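For the two-column case, one possible alternative to looping over pairs is a single matrix product: with 0/1 data, the product of the transposed frame with itself holds the co-occurrence count of every column pair. This is a different technique from the answer's code, sketched here on the question's dataset and not benchmarked; it also shows the reset_index() step mentioned above (the column name "pair" is made up for illustration):

```python
import io
import numpy as np
import pandas as pd

CSV = """A,B,C,D,E,F,G
1,0,1,1,0,1,1
1,1,1,1,0,1,0
1,0,0,1,0,1,0
0,0,1,1,1,0,0
1,0,0,1,1,1,0
0,1,1,1,1,1,1
"""
df = pd.read_csv(io.StringIO(CSV))

# co.loc[x, y] is the number of rows where x and y are both 1 (valid for 0/1 data only).
co = df.T.dot(df)

# Keep each unordered pair once: the entries strictly above the diagonal.
i, j = np.triu_indices(len(df.columns), k=1)
labels = [f"{df.columns[a]},{df.columns[b]}" for a, b in zip(i, j)]
OT = pd.DataFrame({"Count": co.to_numpy()[i, j]}, index=labels)

# The index holds plain strings; reset_index() moves them into a regular column.
flat = OT.rename_axis("pair").reset_index()
```

This only covers pairs; combinations of three or more columns still need a loop such as the one in the answer.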