How do I convert a pandas DataFrame to a PyArrow table with a union type in its schema?


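I have a DataFrame with a column containing a list of dicts/structs. One of the keys (thing, in the example below) can have a value that is either an int or a string. Is there a way to define a PyArrow type that allows this DataFrame to be converted into a PyArrow table, for eventual output to a Parquet file?

I tried using pa.union for this, but I seem to be doing something that is unsupported or not implemented.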

import pandas as pd
import pyarrow as pa


df = pd.DataFrame(data={"id": [1, 2], "dict": [{"thing": 1}, {"thing": "two"}]})

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("dict", pa.struct([
        ("thing", pa.union([
            pa.field("int64", pa.int64()),
            pa.field("string", pa.string()),
        ], "sparse"))
    ]))
])

t = pa.Table.from_pandas(df, schema=schema)
Result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1394, in pyarrow.lib.Table.from_pandas
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in <listcomp>
    arrays = [convert_column(c, f)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 574, in convert_column
    raise e
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 568, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('sparse_union', 'Conversion failed for column dict with type object')

This does not appear to be implemented yet in pyarrow 2.0.0:

import pandas as pd
import pyarrow as pa

union  = pa.union([
            pa.field("int64", pa.int64()),
            pa.field("string", pa.string()),
        ], 'sparse')

pa.array([1, 'two'], union)
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-72-f7ec6792b124> in <module>
     10         ], 'sparse')
     11 
---> 12 pa.array([1, 'two'], union)

/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: sparse_union

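PyArrow has a built-in method for converting a DataFrame to a Table:

import pandas as pd
import pyarrow as pa
df = pd.DataFrame({
    'int': [1, 2],
    'str': ['a', 'b']
})
pa.Table.from_pandas(df)
<pyarrow.lib.Table object at 0x7f05d1fb1b40>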