Python: Reading floats stored in decimal representation from a CSV with pandas

python pandas numpy csv

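I'm trying to read in the contents of a CSV file containing what I believe are IEEE 754 single-precision floats, stored in decimal format.

By default they are read in as int64. If I specify the data type with something like dtype={'col1': np.float32}, the dtype correctly shows as float32, but each integer is simply converted numerically to a float instead of having its bits reinterpreted, i.e. 1079762502 becomes 1.079763e+09 instead of 3.435441493988037.

I have successfully converted single values using either of the following approaches: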

from struct import unpack

v = 1079762502

print(unpack('>f', v.to_bytes(4, byteorder="big")))
print(unpack('>f', bytes.fromhex(str(hex(v)).split('0x')[1])))

which produces

(3.435441493988037,)
(3.435441493988037,)

However, I can't seem to do this in a vectorized way with pandas:

import pandas as pd
from struct import unpack

df = pd.read_csv('experiments/test.csv')

print(df.dtypes)
print(df)

df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
#df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))

print(df)

This prints the following and then throws an error:

col1    int64
dtype: object
         col1
0  1079762502
1  1079345162
2  1078565306
3  1078738012
4  1078635652

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-c06d0986cc96> in <module>
      7 print(df)
      8 
----> 9 df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
     10 #df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
     11 

~/anaconda3/envs/test/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180 
   5181     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'to_bytes'

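Or, if I try the second approach:

TypeError: 'Series' object cannot be interpreted as an integer

My Python knowledge is limited here. I suppose I could iterate over each row, convert to hex, then to a string, strip the 0x, unpack, and store the result. But that seems very convoluted and already takes several seconds on a small sample dataset, let alone on hundreds of thousands of entries. Am I missing something simple? Is there a better way to do this?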

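CSV is a text format, whereas IEEE 754 single-precision floating point is a binary number format. If you have a CSV, you have text, and it isn't in that format at all. If I understand correctly, you mean that your text represents integers (in decimal notation) that correspond to the 32-bit integer interpretation of the bits of 32-bit floats.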
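To make that relationship concrete, here is a minimal sketch using only the standard library. The helper names float_to_bits and bits_to_float are made up for illustration and are not part of any library; the sketch assumes every integer fits in 32 unsigned bits.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(n: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 single-precision float."""
    return struct.unpack(">f", struct.pack(">I", n))[0]


# 1079762502 is the first integer from the CSV in the question.
value = bits_to_float(1079762502)
print(value)                 # the float that the CSV integer encodes
print(float_to_bits(value))  # round-trips back to the same integer
```

Packing and unpacking with the same byte order on both sides means the result does not depend on the machine's endianness.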

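So, for starters, when you read the data from the CSV, pandas uses 64-bit integers by default. So convert to 32-bit integers, then reinterpret the bytes using .view. See: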
In [8]: df
Out[8]:
         col1
0  1079762502
1  1079345162
2  1078565306
3  1078738012
4  1078635652

In [9]: df.col1.astype(np.int32).view('f')
Out[9]:
0    3.435441
1    3.335940
2    3.150008
3    3.191184
4    3.166780
Name: col1, dtype: float32

Broken down into steps to aid understanding:
In [10]: import numpy as np

In [11]: arr = df.col1.values

In [12]: arr
Out[12]: array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652])

In [13]: arr.dtype
Out[13]: dtype('int64')

In [14]: arr_32 = arr.astype(np.int32)

In [15]: arr_32
Out[15]:
array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652],
      dtype=int32)

In [16]: arr_32.view('f')
Out[16]:
array([3.4354415, 3.33594  , 3.1500077, 3.191184 , 3.1667795],
      dtype=float32)
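The same steps can be wrapped in a small function and cross-checked against the struct-based conversion from the question. This is only a sketch: decode_float32_bits is a hypothetical helper name, and it casts to np.uint32 rather than np.int32 on the assumption that bit patterns with the sign bit set (negative floats) may be stored as unsigned integers of 2**31 or more.

```python
import struct

import numpy as np


def decode_float32_bits(values):
    """Reinterpret integers holding float32 bit patterns as float32 values."""
    # Assumption: every value fits in 32 bits. The cast narrows the 8-byte
    # integers to 4 bytes so that each element lines up with one float32.
    bits = np.asarray(values, dtype=np.int64).astype(np.uint32)
    # view() reinterprets the existing bytes; no numeric conversion happens.
    return bits.view(np.float32)


raw = [1079762502, 1079345162, 1078565306, 1078738012, 1078635652]
decoded = decode_float32_bits(raw)

# Reference values computed one at a time with struct, as in the question.
expected = [struct.unpack(">f", v.to_bytes(4, byteorder="big"))[0] for v in raw]

print(decoded)
print(np.array_equal(decoded, np.array(expected, dtype=np.float32)))
```

The narrowing cast matters: calling view(np.float32) directly on the int64 array would reinterpret each 8-byte integer as two float32 values and double the array's length.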

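Ah, perfect. .view is the magic I was missing to make it actually represent the values in the format I want, thank you very much. I can now simply use dtype={'col1': np.int32} when reading the data and set df['col1'] = df['col1'].view('f') to get a column of 32-bit float values, which is much simpler.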
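For reference, a self-contained sketch of that workflow. It reads from an in-memory string instead of experiments/test.csv so that it runs anywhere, and it goes through the underlying NumPy array rather than Series.view, because to my knowledge newer pandas releases deprecate Series.view. Treat that as an assumption to check against the pandas version you use.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the CSV file from the question (same five values).
csv_text = "col1\n1079762502\n1079345162\n1078565306\n1078738012\n1078635652\n"

# Read the column as 32-bit integers so each value already occupies 4 bytes.
df = pd.read_csv(io.StringIO(csv_text), dtype={"col1": np.int32})

# Reinterpret those 4 bytes as float32 via the underlying NumPy array.
df["col1"] = df["col1"].to_numpy(copy=True).view(np.float32)

print(df.dtypes)
print(df)
```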