Pandas: How do I convert a dataframe to a PyArrow table with a union type in the schema?
I have a dataframe with a column that contains a list of dicts/structs. The value of one of the keys (thing, in the example below) can be either an int or a string. Is there a way to define a PyArrow type that allows this dataframe to be converted to a PyArrow table, for eventual output to a Parquet file?
To do this, I tried using pa.union, but I seem to be doing something that is unsupported or not implemented.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={"id": [1, 2], "dict": [{"thing": 1}, {"thing": "two"}]})
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("dict", pa.struct([
        ("thing", pa.union([
            pa.field("int64", pa.int64()),
            pa.field("string", pa.string()),
        ], "sparse"))
    ]))
])
t = pa.Table.from_pandas(df, schema=schema)
Result:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1394, in pyarrow.lib.Table.from_pandas
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in <listcomp>
    arrays = [convert_column(c, f)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 574, in convert_column
    raise e
  File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 568, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('sparse_union', 'Conversion failed for column dict with type object')
Converting Python objects to a union type does not appear to be implemented yet in pyarrow 2.0.0:
import pandas as pd
import pyarrow as pa
union = pa.union([
    pa.field("int64", pa.int64()),
    pa.field("string", pa.string()),
], 'sparse')
pa.array([1, 'two'], union)
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-72-f7ec6792b124> in <module>
     10 ], 'sparse')
     11
---> 12 pa.array([1, 'two'], union)
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowNotImplementedError: sparse_union
PyArrow has a built-in method for converting a DataFrame to a table:
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     'int': [1, 2],
...     'str': ['a', 'b']
... })
>>> pa.Table.from_pandas(df)
<pyarrow.lib.Table object at 0x7f05d1fb1b40>