Python: Expanding an array into columns in a dask DataFrame

python dask

I have Avro data with the following keys: "id, label, features". id and label are strings, while features is a buffer of floats.

from functools import partial

import numpy as np
import dask.bag as db

avros = db.read_avro('data.avro')
df = avros.to_dataframe()
# Decode each raw buffer into a float64 NumPy array
convert = partial(np.frombuffer, dtype='float64')
X = df.assign(features=lambda x: x.features.apply(convert, meta='float64'))
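
To make the decoding step concrete, here is a minimal sketch of what `np.frombuffer` does to a single record. The `raw` bytes are a hypothetical stand-in for one Avro `features` field, assuming the buffer holds native-endian float64 values as in the snippet above.

```python
import numpy as np

# Hypothetical stand-in for one record's 'features' field:
# three float64 values serialised as 24 raw bytes.
raw = np.array([1.0, 0.0, 1.0], dtype="float64").tobytes()

# np.frombuffer reinterprets the bytes as float64 without copying them.
decoded = np.frombuffer(raw, dtype="float64")

print(len(raw))   # 24
print(decoded)    # [1. 0. 1.]
```

Because the array is a view on an immutable bytes object, it is read-only; call `decoded.copy()` if the values need to be modified.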

I end up with this MCVE:

  label id         features
0  good  a  [1.0, 0.0, 0.0]
1   bad  b  [1.0, 0.0, 0.0]
2  good  c  [0.0, 0.0, 0.0]
3   bad  d  [1.0, 0.0, 1.0]
4  good  e  [0.0, 0.0, 0.0]

The result I want is:

  label id   f1   f2   f3
0  good  a  1.0  0.0  0.0
1   bad  b  1.0  0.0  0.0
2  good  c  0.0  0.0  0.0
3   bad  d  1.0  0.0  1.0
4  good  e  0.0  0.0  0.0

I tried something pandas-like, namely

df[['f1', 'f2', 'f3']] = df.features.apply(pd.Series)

but it does not work the way it does in pandas.

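I can work around it with a loop like this:

for i in range(len(features)):
    # i=i binds the current index; dask only calls the lambda at compute time
    df[f'f{i}'] = df.features.map(lambda x, i=i: x[i])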

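But in the real use case I have thousands of features, so this would traverse the dataset thousands of times.

What is the best way to achieve the desired result?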
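
A side note on the loop above: a lambda that refers to the loop variable looks that variable up when it is called, not when it is defined. Since dask defers the work until compute time, every column built that way may end up reading the last value of `i`. The plain-Python sketch below (no dask involved; `late`, `bound` and `row` are illustrative names) shows the effect and the usual default-argument fix.

```python
# Each lambda here refers to the variable i itself, not its value at definition time.
late = [lambda x: x[i] for i in range(3)]

# Binding i as a default argument freezes the current value for each lambda.
bound = [lambda x, i=i: x[i] for i in range(3)]

row = [10.0, 20.0, 30.0]
print([f(row) for f in late])   # [30.0, 30.0, 30.0]
print([f(row) for f in bound])  # [10.0, 20.0, 30.0]
```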

In [68]: import string
    ...: import numpy as np
    ...: import pandas as pd

In [69]: M, N = 100, 100
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [70]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 13.9 ms

In [71]: M, N = 1000, 1000
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [72]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 627 ms

In [73]: df1.shape
Out[73]: (1000, 1002)

Edit: this is roughly 2x faster than the original loop:

In [79]: df2 = df.copy()

In [80]: %%time
    ...: features = df2.pop('features')
    ...: for i in range(N):
    ...:     df2[f'f{i:04d}'] = features.map(lambda x: x[i])
    ...:
Wall time: 1.46 s

In [81]: df1.equals(df2)
Out[81]: True

Edit 2: a faster way to build the DataFrame gives about an 8x improvement over the original:

In [22]: df1 = df.copy()

In [23]: %%time
    ...: features = pd.DataFrame({f"f{i:04d}": np.asarray(row) for i, row in enumerate(df1.pop('features').to_numpy())})
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 165 ms
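
The timings above are for pandas, while the question is about dask. One way the same idea might carry over is to expand each partition once with `map_partitions`, so the data is traversed a single time rather than once per feature. The sketch below is an illustration under stated assumptions, not a tested dask recipe: the helper names (`feature_columns`, `expand_partition`, `expand_dask`) and the `n_features` parameter are hypothetical, every row is assumed to hold exactly `n_features` values, and that number is assumed to be known up front so the output columns can be declared in `meta`. It stacks the rows with `np.stack` rather than building a dict keyed by row number, since such a dict produces one column per row.

```python
import numpy as np
import pandas as pd


def feature_columns(n_features):
    """Column names f0000, f0001, ... (naming borrowed from the answer above)."""
    return [f"f{i:04d}" for i in range(n_features)]


def expand_partition(pdf, n_features):
    """Expand the 'features' column of one pandas partition into columns."""
    pdf = pdf.copy()  # leave the caller's partition untouched
    rows = pdf.pop("features").to_numpy()
    if len(rows):
        # Stack the per-row arrays into one (n_rows, n_features) block.
        block = np.stack([np.asarray(r, dtype="float64") for r in rows])
    else:
        # Empty partitions still need the right number of columns.
        block = np.empty((0, n_features), dtype="float64")
    wide = pd.DataFrame(block, index=pdf.index, columns=feature_columns(n_features))
    return pd.concat([pdf, wide], axis=1)


def expand_dask(ddf, n_features):
    """Apply expand_partition to every partition of a dask DataFrame (lazy)."""
    # meta describes the output: the untouched columns, then the new float columns.
    meta = {name: dtype for name, dtype in ddf.dtypes.items() if name != "features"}
    meta.update({name: "float64" for name in feature_columns(n_features)})
    return ddf.map_partitions(expand_partition, n_features, meta=meta)
```

With the frame from the question this would be called as `expand_dask(X, n_features=3)`; nothing is computed until `.compute()` or a similar call is made on the result.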

Possible duplicate. The solution proposed there seems to parse the series once for each feature. That is not too bad in the MCVE, but in the real world I have thousands of features, so it sounds computationally expensive. There is actually a more recent answer on that topic. It is close, but it works on strings, whereas my object is already a list.