Exporting dataframes with nullable Int64 columns from pandas to R

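I am trying to export a dataframe that contains categorical data so that it can easily be read by R.

I am placing my bets on Apache Feather, but unfortunately the pandas `Int64` datatype does not seem to be implemented: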

from pyarrow import feather
import pandas as pd

col1 = pd.Series([0, None, 1, 23]).astype('Int64')
col2 = pd.Series([1, 3, 2, 1]).astype('Int64')

df = pd.DataFrame({'a': col1, 'b': col2})

feather.write_feather(df, '/tmp/foo')

This is the error message I get:

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-107-8cc611a30355> in <module>
----> 1 feather.write_feather(df, '/tmp/foo')

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write_feather(df, dest)
    181     writer = FeatherWriter(dest)
    182     try:
--> 183         writer.write(df)
    184     except Exception:
    185         # Try to make sure the resource is closed

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write(self, df)
     92         # TODO(wesm): Remove this length check, see ARROW-1732
     93         if len(df.columns) > 0:
---> 94             table = Table.from_pandas(df, preserve_index=False)
     95             for i, name in enumerate(table.schema.names):
     96                 col = table[i]

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    551     if nthreads == 1:
    552         arrays = [convert_column(c, f)
--> 553                   for c, f in zip(columns_to_convert, convert_fields)]
    554     else:
    555         from concurrent import futures

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    551     if nthreads == 1:
    552         arrays = [convert_column(c, f)
--> 553                   for c, f in zip(columns_to_convert, convert_fields)]
    554     else:
    555         from concurrent import futures

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    542             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    543                        .format(col.name, col.dtype),)
--> 544             raise e
    545         if not field_nullable and result.null_count > 0:
    546             raise ValueError("Field {} was non-nullable but pandas column "

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    536 
    537         try:
--> 538             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    539         except (pa.ArrowInvalid,
    540                 pa.ArrowNotImplementedError,

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column a with type Int64')

Is there a workaround that lets me use this special `Int64` datatype, preferably using pyarrow with the latest Arrow release (pyarrow 0.15.0)?

With pyarrow 0.15.0, and when using the pandas development version, this is now supported:

In [1]: from pyarrow import feather
   ...: import pandas as pd
   ...:
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64')
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
   ...:
   ...: df = pd.DataFrame({'a': col1, 'b': col2})
   ...:
   ...: feather.write_feather(df, '/tmp/foo')

In [2]: feather.read_table('/tmp/foo')
Out[2]:
pyarrow.Table
a: int64
b: int64

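You can see that the resulting Arrow table (when read back) correctly has integer columns. To get this in a released version, you will have to wait for pandas 1.0.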

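For now (without using pandas master), you have two workaround options:

1. Convert the column to an object dtype column (`df['a'] = df['a'].astype(object)`) and then write to feather. For such object columns (containing integers and missing values), pyarrow correctly infers that they are integers.
2. Monkeypatch pandas for now (until the next pandas release):

import pyarrow
pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type)

With that patch, writing nullable integer columns with pyarrow/feather should work out of the box (you still need the latest pyarrow, 0.15.0).

Note that reading the feather file back into a pandas DataFrame will, for now, still give a float column (if there are missing values), because that is the default conversion from Arrow integers to pandas. Work is ongoing to also preserve these specific pandas types when converting to pandas.

Could you open a JIRA issue or write to the dev@ Apache Arrow mailing list? For reference, see the Arrow to pandas conversion table.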