Exporting a DataFrame with nullable Int64 columns from pandas to R


I am trying to export a DataFrame that contains categorical data so that it can be read easily by R.

I placed my bet on Apache Feather, but unfortunately the pandas Int64 dtype does not seem to be implemented:

from pyarrow import feather
import pandas as pd

col1 = pd.Series([0, None, 1, 23]).astype('Int64')
col2 = pd.Series([1, 3, 2, 1]).astype('Int64')

df = pd.DataFrame({'a': col1, 'b': col2})

feather.write_feather(df, '/tmp/foo')

This is the error message you get:

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-107-8cc611a30355> in <module>
----> 1 feather.write_feather(df, '/tmp/foo')

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write_feather(df, dest)
    181     writer = FeatherWriter(dest)
    182     try:
--> 183         writer.write(df)
    184     except Exception:
    185         # Try to make sure the resource is closed

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write(self, df)
     92         # TODO(wesm): Remove this length check, see ARROW-1732
     93         if len(df.columns) > 0:
---> 94             table = Table.from_pandas(df, preserve_index=False)
     95             for i, name in enumerate(table.schema.names):
     96                 col = table[i]

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    551     if nthreads == 1:
    552         arrays = [convert_column(c, f)
--> 553                   for c, f in zip(columns_to_convert, convert_fields)]
    554     else:
    555         from concurrent import futures

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    551     if nthreads == 1:
    552         arrays = [convert_column(c, f)
--> 553                   for c, f in zip(columns_to_convert, convert_fields)]
    554     else:
    555         from concurrent import futures

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    542             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    543                        .format(col.name, col.dtype),)
--> 544             raise e
    545         if not field_nullable and result.null_count > 0:
    546             raise ValueError("Field {} was non-nullable but pandas column "

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    536 
    537         try:
--> 538             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    539         except (pa.ArrowInvalid,
    540                 pa.ArrowNotImplementedError,

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column a with type Int64')

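Is there a workaround that allows me to use this special Int64 dtype, preferably using pyarrow?

With the latest Arrow release (pyarrow 0.15.0), and when using the pandas development version, this is now supported: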

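In [1]: from pyarrow import feather
   ...: import pandas as pd
   ...:
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64')
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
   ...:
   ...: df = pd.DataFrame({'a': col1, 'b': col2})
   ...:
   ...: feather.write_feather(df, '/tmp/foo')

In [2]: feather.read_table('/tmp/foo')
Out[2]:
pyarrow.Table
a: int64
b: int64

You can see that the resulting Arrow table (when read back) correctly contains integer columns. To get this in a released version, you will need to wait for pandas 1.0.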

For now (without using pandas master), you have two workaround options:

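  • Convert the column to an object dtype column (df['a'] = df['a'].astype(object)) and then write to feather. For such object columns (containing integers and missing values), pyarrow will correctly infer that the values are integers.

  • Monkeypatch pandas for now (until the next pandas release):

    import pyarrow
    pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type)

    With that, writing nullable integer columns with pyarrow/feather should work out of the box (you still need the latest pyarrow 0.15.0).

Note that reading the feather file back into a pandas DataFrame will currently still result in a float column (if there are missing values), because that is the default conversion from Arrow integers to pandas. Work is in progress to also preserve these specific pandas types when converting to pandas.

Could you open a JIRA issue or write to the dev@ Apache Arrow mailing list? Cross-reference: the Arrow documentation includes a table of Arrow-to-pandas type conversions.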