Is there a more idiomatic way to select rows from a PyArrow table based on the contents of a column?


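I have a large PyArrow table with a column called index that I would like to use to partition the table; each separate value of index represents a different quantity in the table.

Is there an idiomatic way to select rows from a PyArrow table based on the contents of a column?

Here is an example table: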

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

# Example table for data schema
irow = np.arange(2**20)
dt = 17
df0 = pd.DataFrame({'timestamp': np.array((irow//2)*dt, dtype=np.int64),
                   'index':     np.array(irow%2, dtype=np.int16),
                   'value':     np.array(irow*0, dtype=np.int32)},
                   columns=['timestamp','index','value'])
ii = df0['index'] == 0
df0.loc[ii,'value'] = irow[ii]//2
ii = df0['index'] == 1
df0.loc[ii,'value'] = (np.sin(df0.loc[ii,'timestamp']*0.01)*10000).astype(np.int32)
table0 = pa.Table.from_pandas(df0)
print(df0)

# prints the following:
         timestamp  index   value
0                0      0       0
1                0      1       0
2               17      0       1
3               17      1    1691
4               34      0       2
...            ...    ...     ...
1048571    8912845      1    9945
1048572    8912862      0  524286
1048573    8912862      1    9978
1048574    8912879      0  524287
1048575    8912879      1    9723

[1048576 rows x 3 columns]

It's easy to do this selection in pandas:

print(df0[df0['index']==1])

# prints the following
         timestamp  index  value
1                0      1      0
3               17      1   1691
5               34      1   3334
7               51      1   4881
9               68      1   6287
...            ...    ...    ...
1048567    8912811      1   9028
1048569    8912828      1   9625
1048571    8912845      1   9945
1048573    8912862      1   9978
1048575    8912879      1   9723

[524288 rows x 3 columns]

But with PyArrow, I have to go back and forth between PyArrow and numpy or pandas:

value_index = table0.column('index').to_numpy()
# get values of the index column, convert to numpy format
row_indices = np.nonzero(value_index==1)[0]
# find matches and get their indices
selected_table = table0.take(pa.array(row_indices))
# use take() with those indices
v = selected_table.column('value')
print(v.to_numpy())

# which prints
[   0 1691 3334 ... 9945 9978 9723]

Is there a more direct way?

You don't need to convert to numpy to perform a boolean filter operation. You can use the equal and filter functions from the pyarrow.compute module:

import pyarrow.compute as pc
value_index = table0.column('index')
row_mask = pc.equal(value_index, pa.scalar(1, value_index.type))
selected_table = table0.filter(row_mask)

Hmm. When I try this, I get

ArrowNotImplementedError: Function equal has no kernel matching input types (array[int16], scalar[int64])

How do I cast an integer to value_index.type? Oh, I figured it out: value_index.type.to_pandas_dtype()(1)

Another way is pa.scalar(1, value_index.type).

I've accepted this, but for some reason it is actually slower than the sparse selection solution I posted... see