Python pandas/NumPy: fastest way to create a ladder?
Tags: python, pandas, numpy, dataframe, vectorization
I have a DataFrame like:
color cost temp
0 blue 12.0 80.4
1 red 8.1 81.2
2 pink 24.5 83.5
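I'd like to create a "ladder" or "range" for each row, in 50-cent increments, from $0.50 below the current cost to $0.50 above it. My current code is similar to the following: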
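import numpy
import pandas

incremented_prices = []
df['original_idx'] = df.index  # remember each row's original label
for _, row in df.iterrows():  # iterrows() yields (label, row) pairs
    current_price = row['cost']
    # three rungs: cost - 0.5, cost, cost + 0.5
    more_costs = current_price + numpy.arange(-0.5, 0.51, 0.5)
    for cost in more_costs:
        row_c = row.copy()
        row_c['cost'] = cost
        incremented_prices.append(row_c)
# build one row per collected Series instead of concatenating them end to end
df_incremented = pandas.DataFrame(incremented_prices).reset_index(drop=True)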
This code will produce a DataFrame like:
color cost temp original_idx
0 blue 11.5 80.4 0
1 blue 12.0 80.4 0
2 blue 12.5 80.4 0
3 red 7.6 81.2 1
4 red 8.1 81.2 1
5 red 8.6 81.2 1
6 pink 24.0 83.5 2
7 pink 24.5 83.5 2
8 pink 25.0 83.5 2
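In the real problem, I'll make the range from -$50.00 to +$50.00, and I've found this to be very slow. Is there a faster, vectorized way?
You can try recreating the DataFrame with numpy.repeat: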
import numpy as np
import pandas as pd

# pd.np was removed from pandas, so NumPy is imported directly
cost_steps = np.arange(-0.5, 0.51, 0.5)
repeats = cost_steps.size
pd.DataFrame(dict(
    color = np.repeat(df.color.values, repeats),
    # vectorized: broadcasting adds every step to every cost in one go
    cost = (df.cost.values[:, None] + cost_steps).ravel(),
    temp = np.repeat(df.temp.values, repeats),
    original_idx = np.repeat(df.index.values, repeats)
))
Update for more columns:
import numpy as np
import pandas as pd

df1 = df.rename_axis("original_idx").reset_index()
cost_steps = np.arange(-0.5, 0.51, 0.5)
repeats = cost_steps.size
# repeat every non-cost column, then append the broadcast cost ladder as the last column
pd.DataFrame(np.hstack((np.repeat(df1.drop(columns="cost").values, repeats, axis=0),
                        (df1.cost.values[:, None] + cost_steps).reshape(-1, 1))),
             columns=df1.columns.drop("cost").tolist() + ["cost"])
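One thing to watch with the hstack version: stacking strings and numbers into a single array forces an object array, so the numeric columns of the result lose their numeric dtypes. The sketch below rebuilds the example on the three-column frame from the question and then calls `infer_objects()` to recover them. The names `other`, `stacked`, `raw` and `typed` are only illustrative and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["blue", "red", "pink"],
                   "cost": [12.0, 8.1, 24.5],
                   "temp": [80.4, 81.2, 83.5]})

df1 = df.rename_axis("original_idx").reset_index()
cost_steps = np.arange(-0.5, 0.51, 0.5)
other = df1.drop(columns="cost")

# Mixed strings and numbers make hstack return an object array.
stacked = np.hstack((np.repeat(other.values, cost_steps.size, axis=0),
                     (df1["cost"].values[:, None] + cost_steps).reshape(-1, 1)))
raw = pd.DataFrame(stacked, columns=other.columns.tolist() + ["cost"])

# infer_objects() converts object columns that hold only numbers
# back to numeric dtypes; string columns are left as they are.
typed = raw.infer_objects()
print(raw.dtypes)
print(typed.dtypes)
```

Whether the extra pass matters depends on what happens next; arithmetic or aggregation on `cost` is usually the point where it starts to.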
Here's an approach based on NumPy array initialization -
import numpy as np
import pandas as pd

increments = 0.5*np.arange(-1,2) # Edit the increments here
names = np.append(df.columns, 'original_idx')
M,N = df.shape
vals = df.values
cost_col_idx = (names == 'cost').argmax()
n = len(increments)
shp = (M,n,N+1)
b = np.empty(shp,dtype=object)
b[...,:-1] = vals[:,None]            # copy every original row n times
b[...,-1] = np.arange(M)[:,None]     # positional index of the source row
b[...,cost_col_idx] = vals[:,cost_col_idx].astype(float)[:,None] + increments
b = b.reshape(-1,N+1)                # flatten to (M*n, N+1)
df_out = pd.DataFrame(b, columns=names)
To make the increments run from -50 to +50 in steps of 0.5, use:
increments = 0.5*np.arange(-100,101)
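As a quick sanity check on that line, here is a small sketch that prints how many rungs each row gets and where the ladder starts and ends. Scaling an integer `arange` by 0.5 keeps the endpoints exact, whereas a float step passed directly to `arange` can make the last element ambiguous.

```python
import numpy as np

increments = 0.5 * np.arange(-100, 101)

print(increments.size)                  # 201 rungs per original row
print(increments[0], increments[-1])    # -50.0 50.0
```

With M original rows, the output frame therefore has 201 * M rows.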
Sample run -
In [200]: df
Out[200]:
color cost temp newcol
0 blue 12.0 80.4 mango
1 red 8.1 81.2 banana
2 pink 24.5 83.5 apple
In [201]: df_out
Out[201]:
color cost temp newcol original_idx
0 blue 11.5 80.4 mango 0
1 blue 12 80.4 mango 0
2 blue 12.5 80.4 mango 0
3 red 7.6 81.2 banana 1
4 red 8.1 81.2 banana 1
5 red 8.6 81.2 banana 1
6 pink 24 83.5 apple 2
7 pink 24.5 83.5 apple 2
8 pink 25 83.5 apple 2
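You could also rephrase this question as: how do I create a DF in which each row of my original DF is repeated N times? Then, this might be useful.
@Lev That would be part of it, but for each row I need a different price, based on the original price +/- some amount.
This is what I want, but I have 500 columns, so I don't want to type out each one. Is there a way to combine your answer with a 500-column DataFrame?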
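For the many-columns case raised in the last comment, one possible sketch is to repeat whole rows by position and then overwrite only the cost column, so no other column has to be named. The helper `make_ladder` and its parameters `cost_col`, `half_width` and `step` are hypothetical names chosen for this illustration; they do not come from any of the answers.

```python
import numpy as np
import pandas as pd

def make_ladder(df, cost_col="cost", half_width=0.5, step=0.5):
    """Repeat every row once per price step and shift cost_col by that step."""
    n_side = int(round(half_width / step))
    steps = step * np.arange(-n_side, n_side + 1)    # e.g. [-0.5, 0.0, 0.5]
    # Repeat rows by position; every column comes along with its dtype.
    out = df.iloc[np.repeat(np.arange(len(df)), steps.size)].copy()
    out["original_idx"] = out.index                  # label of the source row
    # Broadcasting: (rows, 1) + (steps,) -> (rows, steps), flattened row by row.
    out[cost_col] = (df[cost_col].to_numpy()[:, None] + steps).ravel()
    return out.reset_index(drop=True)

df = pd.DataFrame({"color": ["blue", "red", "pink"],
                   "cost": [12.0, 8.1, 24.5],
                   "temp": [80.4, 81.2, 83.5]})
ladder = make_ladder(df)
print(ladder)
```

For the -$50 to +$50 range, `make_ladder(df, half_width=50.0, step=0.5)` should give 201 rows per original row. Unlike the object-array approaches, this keeps each column's original dtype.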