Python: Exporting a DataFrame with null values and nullable Int64 columns from pandas to R
Tags: python, pandas, pyarrow, feather

I am trying to export a DataFrame that contains categorical data so that it can easily be read by R. I placed my bet on Apache Feather, but unfortunately the pandas Int64 data type does not seem to be implemented:
from pyarrow import feather
import pandas as pd
col1 = pd.Series([0, None, 1, 23]).astype('Int64')
col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
df = pd.DataFrame({'a': col1, 'b': col2})
feather.write_feather(df, '/tmp/foo')
This is the error message you get:
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-107-8cc611a30355> in <module>
----> 1 feather.write_feather(df, '/tmp/foo')
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write_feather(df, dest)
181 writer = FeatherWriter(dest)
182 try:
--> 183 writer.write(df)
184 except Exception:
185 # Try to make sure the resource is closed
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write(self, df)
92 # TODO(wesm): Remove this length check, see ARROW-1732
93 if len(df.columns) > 0:
---> 94 table = Table.from_pandas(df, preserve_index=False)
95 for i, name in enumerate(table.schema.names):
96 col = table[i]
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
551 if nthreads == 1:
552 arrays = [convert_column(c, f)
--> 553 for c, f in zip(columns_to_convert, convert_fields)]
554 else:
555 from concurrent import futures
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
551 if nthreads == 1:
552 arrays = [convert_column(c, f)
--> 553 for c, f in zip(columns_to_convert, convert_fields)]
554 else:
555 from concurrent import futures
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
542 e.args += ("Conversion failed for column {0!s} with type {1!s}"
543 .format(col.name, col.dtype),)
--> 544 raise e
545 if not field_nullable and result.null_count > 0:
546 raise ValueError("Field {} was non-nullable but pandas column "
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
536
537 try:
--> 538 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
539 except (pa.ArrowInvalid,
540 pa.ArrowNotImplementedError,
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column a with type Int64')
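Is there a workaround that allows me to use this special Int64 data type, preferably with pyarrow?

With the latest Arrow release (pyarrow 0.15.0) and the development version of pandas, this is now supported: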
In [1]: from pyarrow import feather
   ...: import pandas as pd
   ...:
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64')
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
   ...:
   ...: df = pd.DataFrame({'a': col1, 'b': col2})
   ...:
   ...: feather.write_feather(df, '/tmp/foo')

In [2]: feather.read_table('/tmp/foo')
Out[2]:
pyarrow.Table
a: int64
b: int64
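You can see that the resulting Arrow table (when read back) correctly contains integer columns.
So to get this in a released version, you will need to wait for pandas 1.0.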
For now (without using pandas master), you have two workaround options:
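- Convert the column to an object dtype column (df['a'] = df['a'].astype(object)) and then write to feather. For those object columns (containing integers and missing values), pyarrow will correctly infer that they hold integers.
- Monkeypatch pandas for now (until the next pandas release):
import pyarrow
pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type)
With that, writing nullable integer columns with pyarrow/feather should work out of the box (you still need the latest pyarrow 0.15.0).

Note that reading the feather file back into a pandas DataFrame will, for now, still result in a float column (if there are missing values), because that is the default conversion from Arrow integers to pandas. Work is under way so that these specific pandas types are also preserved when converting to pandas.

Could you open a JIRA issue or write to the dev@ Apache Arrow mailing list?
Xref: the Arrow-to-pandas conversion table is as follows: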