Python NumPy: fixed-width string block to array


I have a block of strings, shown below. How can I read it into a NumPy array?

   5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00
   3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00
   2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00
   3.572444E+03   9.920473E+03   3.573251E+03   6.423813E+03   2.469338E+03   4.652253E+03   8.211962E+02  4.0000000E+00
   7.460966E+03   7.691966E+03   7.501826E+03   3.414511E+03   8.590221E+03   6.737868E+03   8.586273E+02  5.0000000E+00
   3.250046E+03   9.611985E+03   9.195165E+03   1.064800E+03   7.944535E+03   2.685740E+03   8.212849E+02  6.0000000E+00
   8.069926E+03   9.208576E+03   4.267749E+03   2.491888E+03   9.036555E+03   5.001732E+03   7.202407E+02  7.0000000E+00
   5.691460E+03   3.868344E+03   3.103342E+03   6.567618E+03   7.274860E+03   8.393253E+03   5.628069E+02  8.0000000E+00
   2.887292E+03   9.081563E+02   6.955551E+03   6.763133E+03   2.146178E+03   2.033861E+03   9.725472E+02  9.0000000E+00
   6.127778E+03   8.065057E+02   7.474341E+03   4.185868E+03   4.516230E+03   8.714840E+03   8.254562E+02  1.0000000E+01
   1.594643E+03   6.060956E+03   2.137153E+03   3.505950E+03   7.714227E+03   6.249693E+03   5.724376E+02  1.1000000E+01
   5.039059E+03   3.138161E+03   5.570104E+03   4.594189E+03   7.889644E+03   1.891062E+03   7.085753E+02  1.2000000E+01
   3.263593E+03   6.085087E+03   7.136061E+03   9.895028E+03   6.139666E+03   6.670919E+03   5.018248E+02  1.3000000E+01
   9.954830E+03   6.777074E+03   3.013747E+03   3.638458E+03   4.357685E+03   1.876539E+03   5.969378E+02  1.4000000E+01
   9.920853E+03   3.414156E+03   5.534430E+03   2.011815E+03   7.791122E+03   3.893439E+03   5.229754E+02  1.5000000E+01
   5.447470E+03   7.184321E+03   1.382575E+03   9.134295E+03   7.883753E+02   9.160537E+03   7.521197E+02  1.6000000E+01
   3.344917E+03   8.151884E+03   3.596052E+03   3.953284E+03   7.456115E+03   7.749632E+03   9.773521E+02  1.7000000E+01
   6.310496E+03   1.472792E+03   1.812452E+03   9.535100E+03   1.581263E+03   3.649150E+03   6.562440E+02  1.8000000E+01
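
I am trying to use native NumPy methods to speed up reading the data. I need to read several GB of data from a custom file format. I can seek to the region of the file that contains the text block shown above. Regular Python string operations on it are always possible, but I would like to know whether NumPy has a native method for reading fixed-width formatted data.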


I tried using np.frombuffer with dtype=float, but it did not work. With dtype='S15' the block does seem to be read, but the result shows up as bytes rather than numbers.

I just did a regular Python split and set the dtype to np.float32:

>>> y=np.array(x.split(), dtype=np.float32)
>>> y
array([  5.78032617e+03,   7.26118506e+03,   7.74918994e+03,
         8.48876953e+03,   5.40613379e+03,   2.82840991e+03,
         9.62095703e+02,   1.00000000e+00,   3.09737207e+03,
         3.88515991e+03,   5.43267822e+03,   8.06062793e+03,
         2.76845703e+03,   6.57425781e+03,   7.26859070e+02,
         2.00000000e+00,   2.06142896e+03,   4.66528223e+03,
         8.21411914e+03,   3.57937988e+03,   8.54205664e+03,
         2.08906201e+03,   8.82926270e+02,   3.00000000e+00], dtype=float32)
Also, I copied a block of the sample data and assigned it to the variable 'x'.

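OK, apart from splitting into lines, this does not rely on whitespace or use split(), and it keeps the shape of the array, but it still uses plain (non-NumPy) Python: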

>>> n=15
>>> x = (
...     '   5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00\n'
...     '   3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00\n'
...     '   2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00\n'
...     '   3.572444E+03   9.920473E+03   3.573251E+03   6.423813E+03   2.469338E+03   4.652253E+03   8.211962E+02  4.0000000E+00\n'
...     '   7.460966E+03   7.691966E+03   7.501826E+03   3.414511E+03   8.590221E+03   6.737868E+03   8.586273E+02  5.0000000E+00\n'
...     '   3.250046E+03   9.611985E+03   9.195165E+03   1.064800E+03   7.944535E+03   2.685740E+03   8.212849E+02  6.0000000E+00\n'
...     '   8.069926E+03   9.208576E+03   4.267749E+03   2.491888E+03   9.036555E+03   5.001732E+03   7.202407E+02  7.0000000E+00\n'
...     '   5.691460E+03   3.868344E+03   3.103342E+03   6.567618E+03   7.274860E+03   8.393253E+03   5.628069E+02  8.0000000E+00\n'
...     '   2.887292E+03   9.081563E+02   6.955551E+03   6.763133E+03   2.146178E+03   2.033861E+03   9.725472E+02  9.0000000E+00\n'
...     '   6.127778E+03   8.065057E+02   7.474341E+03   4.185868E+03   4.516230E+03   8.714840E+03   8.254562E+02  1.0000000E+01\n'
...     '   1.594643E+03   6.060956E+03   2.137153E+03   3.505950E+03   7.714227E+03   6.249693E+03   5.724376E+02  1.1000000E+01\n'
...     '   5.039059E+03   3.138161E+03   5.570104E+03   4.594189E+03   7.889644E+03   1.891062E+03   7.085753E+02  1.2000000E+01\n'
...     '   3.263593E+03   6.085087E+03   7.136061E+03   9.895028E+03   6.139666E+03   6.670919E+03   5.018248E+02  1.3000000E+01\n'
...     '   9.954830E+03   6.777074E+03   3.013747E+03   3.638458E+03   4.357685E+03   1.876539E+03   5.969378E+02  1.4000000E+01\n'
...     '   9.920853E+03   3.414156E+03   5.534430E+03   2.011815E+03   7.791122E+03   3.893439E+03   5.229754E+02  1.5000000E+01\n'
...     '   5.447470E+03   7.184321E+03   1.382575E+03   9.134295E+03   7.883753E+02   9.160537E+03   7.521197E+02  1.6000000E+01\n'
...     '   3.344917E+03   8.151884E+03   3.596052E+03   3.953284E+03   7.456115E+03   7.749632E+03   9.773521E+02  1.7000000E+01\n'
...     '   6.310496E+03   1.472792E+03   1.812452E+03   9.535100E+03   1.581263E+03   3.649150E+03   6.562440E+02  1.8000000E+01'
... )
>>> s=np.array([[y[i:i+n] for i in range(0, len(y) - n + 1, n)] for y in x.splitlines()], dtype=np.float32)
>>> s
array([[  5.78032617e+03,   7.26118506e+03,   7.74918994e+03,
          8.48876953e+03,   5.40613379e+03,   2.82840991e+03,
          9.62095703e+02,   1.00000000e+00],
       [  3.09737207e+03,   3.88515991e+03,   5.43267822e+03,
          8.06062793e+03,   2.76845703e+03,   6.57425781e+03,
          7.26859070e+02,   2.00000000e+00],
       [  2.06142896e+03,   4.66528223e+03,   8.21411914e+03,
          3.57937988e+03,   8.54205664e+03,   2.08906201e+03,
          8.82926270e+02,   3.00000000e+00],
       [  3.57244409e+03,   9.92047266e+03,   3.57325098e+03,
          6.42381299e+03,   2.46933789e+03,   4.65225293e+03,
          8.21196228e+02,   4.00000000e+00],
       [  7.46096582e+03,   7.69196582e+03,   7.50182617e+03,
          3.41451099e+03,   8.59022070e+03,   6.73786816e+03,
          8.58627319e+02,   5.00000000e+00],
       [  3.25004590e+03,   9.61198535e+03,   9.19516504e+03,
          1.06480005e+03,   7.94453516e+03,   2.68573999e+03,
          8.21284912e+02,   6.00000000e+00],
       [  8.06992578e+03,   9.20857617e+03,   4.26774902e+03,
          2.49188794e+03,   9.03655469e+03,   5.00173193e+03,
          7.20240723e+02,   7.00000000e+00],
       [  5.69145996e+03,   3.86834399e+03,   3.10334204e+03,
          6.56761816e+03,   7.27485986e+03,   8.39325293e+03,
          5.62806885e+02,   8.00000000e+00],
       [  2.88729199e+03,   9.08156311e+02,   6.95555078e+03,
          6.76313281e+03,   2.14617798e+03,   2.03386096e+03,
          9.72547180e+02,   9.00000000e+00],
       [  6.12777783e+03,   8.06505676e+02,   7.47434082e+03,
          4.18586816e+03,   4.51622998e+03,   8.71483984e+03,
          8.25456177e+02,   1.00000000e+01],
       [  1.59464294e+03,   6.06095605e+03,   2.13715308e+03,
          3.50594995e+03,   7.71422705e+03,   6.24969287e+03,
          5.72437622e+02,   1.10000000e+01],
       [  5.03905908e+03,   3.13816089e+03,   5.57010400e+03,
          4.59418896e+03,   7.88964404e+03,   1.89106201e+03,
          7.08575317e+02,   1.20000000e+01],
       [  3.26359302e+03,   6.08508691e+03,   7.13606104e+03,
          9.89502832e+03,   6.13966602e+03,   6.67091895e+03,
          5.01824799e+02,   1.30000000e+01],
       [  9.95483008e+03,   6.77707422e+03,   3.01374707e+03,
          3.63845801e+03,   4.35768506e+03,   1.87653894e+03,
          5.96937805e+02,   1.40000000e+01],
       [  9.92085254e+03,   3.41415601e+03,   5.53443018e+03,
          2.01181494e+03,   7.79112207e+03,   3.89343896e+03,
          5.22975403e+02,   1.50000000e+01],
       [  5.44747021e+03,   7.18432080e+03,   1.38257495e+03,
          9.13429492e+03,   7.88375305e+02,   9.16053711e+03,
          7.52119690e+02,   1.60000000e+01],
       [  3.34491699e+03,   8.15188379e+03,   3.59605200e+03,
          3.95328394e+03,   7.45611523e+03,   7.74963184e+03,
          9.77352112e+02,   1.70000000e+01],
       [  6.31049609e+03,   1.47279199e+03,   1.81245203e+03,
          9.53509961e+03,   1.58126294e+03,   3.64914990e+03,
          6.56244019e+02,   1.80000000e+01]], dtype=float32)
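
Slicing by column width matters most when neighbouring fields touch, which a fixed-width writer can produce when a value fills its whole field. The following is only a small sketch of that situation; the 10-character width and the sample values are made up for illustration and are not taken from the question's file.

```python
import numpy as np

WIDTH = 10  # characters per field in this toy example (hypothetical)

values = [1250.0, -1250.0, -2.5]
# Negative numbers use the full field width, so no space separates them.
line = "".join(f"{v:{WIDTH}.3E}" for v in values)

# Whitespace splitting sees one long token and cannot recover the fields.
tokens = line.split()

# Fixed-width slicing recovers every field regardless of spacing.
fields = [line[i:i + WIDTH] for i in range(0, len(line), WIDTH)]
parsed = np.array(fields, dtype=np.float64)

print(len(tokens), parsed)
```

Here split() returns a single token, while the slices convert cleanly to three numbers.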

If you need an array with dtype=float, one way is to convert the strings to float beforehand:

import numpy as np

string_list = ["1", "0.1", "1.345e003"]
array = np.array([float(string) for string in string_list])
array.dtype
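
As a side note, NumPy can usually parse numeric strings itself when a float dtype is requested, so the explicit float() call is one option rather than a requirement. A minimal sketch comparing the two routes (the variable names explicit and implicit are only for illustration):

```python
import numpy as np

string_list = ["1", "0.1", "1.345e003"]

# Route 1: convert each string with Python's float() first.
explicit = np.array([float(s) for s in string_list])

# Route 2: hand the strings to NumPy and let the dtype drive the parsing.
implicit = np.array(string_list, dtype=float)

print(explicit.dtype, np.array_equal(explicit, implicit))
```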


A few string operations can turn the data into strings that are convertible to float. For example:

import numpy as np

with open('data.txt', 'r') as f:
    data = f.readlines()

result = []
for line in data:
    splitted_data = line.split(' ')
    splitted_data = [item for item in splitted_data if item]
    splitted_data = [item.replace('E+', 'e') for item in splitted_data]

    result.append(splitted_data)

result = np.array(result, dtype = 'float64')
where data.txt contains the data you pasted in the question.
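
The replace('E+', 'e') step is probably optional: Python's float() already understands the upper-case E+03 exponent form and ignores surrounding whitespace, including a trailing newline. A quick check on a few values copied from the question:

```python
# Strings in the same notation as the question's data, one with stray whitespace.
samples = ["5.780326E+03", "1.0000000E+00", "  9.620957E+02\n"]

# float() parses the exponent form directly; no rewriting of 'E+' is needed.
values = [float(s) for s in samples]

print(values)
```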

In [294]: txt = """5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00
     ...:    3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00
     ...:    2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00
     ...:    """
With this copy-and-paste, I am assuming your block is a multiline string.

Treating it as a csv file:

In [296]: np.loadtxt(txt.splitlines())                                                         
Out[296]: 
array([[5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
        5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00],
       [3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
        2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00],
       [2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
        8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00]])
A lot goes on under the hood here, so this is not particularly fast. pandas has a faster csv reader.
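
For reference, a pandas version might look like the sketch below. It assumes pandas is installed and that the fields are always separated by whitespace, which is not guaranteed for a true fixed-width file; the three-column sample is a shortened stand-in for the question's data.

```python
import io

import numpy as np
import pandas as pd

# Shortened, whitespace-separated sample in the question's notation.
txt = """   5.780326E+03   7.261185E+03  1.0000000E+00
   3.097372E+03   3.885160E+03  2.0000000E+00
"""

# sep=r"\s+" splits on runs of whitespace; header=None keeps row 0 as data.
frame = pd.read_csv(io.StringIO(txt), sep=r"\s+", header=None)
arr = frame.to_numpy()

print(arr)
```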

fromstring works, but returns a 1-D array. You can reshape the result:

In [299]: np.fromstring(txt, sep='  ')
Out[299]: 
array([5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
       5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00,
       3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
       2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00,
       2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
       8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00])
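
To get the rows back, the flat result can be reshaped once the number of columns is known. A small sketch, using a made-up four-column block (the question's data has eight columns):

```python
import numpy as np

NCOLS = 4  # columns per row in this toy block (hypothetical)

txt = """1.0E+00   2.0E+00   3.0E+00   4.0E+00
   5.0E+00   6.0E+00   7.0E+00   8.0E+00
"""

# In text mode a space separator also matches newlines, so the result is flat.
flat = np.fromstring(txt, sep=" ")

# -1 lets NumPy infer the number of rows from the column count.
table = flat.reshape(-1, NCOLS)

print(table)
```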
This is a string, not a buffer, so frombuffer is the wrong tool.
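
That said, if the block is available as bytes and every field has exactly the same width, frombuffer can at least do the slicing, and a cast then does the number parsing. This is only a sketch under those assumptions; the toy block, WIDTH and NCOLS are illustrative, and the newlines have to be removed so that the buffer length is a multiple of the field width.

```python
import numpy as np

WIDTH = 15  # characters per field (matches the question's sample)
NCOLS = 3   # fields per row in this toy block (hypothetical)

# Toy fixed-width block: every field is exactly WIDTH characters wide.
rows = [[5780.326, 7261.185, 1.0], [3097.372, 3885.16, 2.0]]
text = "\n".join("".join(f"{v:{WIDTH}.6E}" for v in row) for row in rows)

# frombuffer needs bytes, and the length must be a multiple of the item
# size, so encode the text and drop the newlines first.
raw = text.encode("ascii").replace(b"\n", b"")
fields = np.frombuffer(raw, dtype=f"S{WIDTH}")  # array of 15-byte strings

# The S15 items are still bytes; the cast parses them as numbers.
block = fields.astype(np.float64).reshape(-1, NCOLS)

print(block)
```

The cast from byte strings to float still parses each field individually, so this is not necessarily faster than the other approaches here.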

This list comprehension works:

np.array([row.strip().split('  ') for row in txt.strip().splitlines()], float) 
I had to add strip to clear out the extra whitespace that otherwise produces empty lists or strings.


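At least on this small example, the list comprehension is not much slower than fromstring, and it is still a lot better than the more general loadtxt.

Thanks to @hpaulj's comments, this is the answer I ended up with: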

data = np.genfromtxt(f, delimiter=[15]*8, max_rows=18)
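
A self-contained way to try this call is to feed it an in-memory file. In the sketch below the block is shortened to three columns and three rows, with max_rows=2 so that the last row is left unread; those sizes and the sample values are assumptions for illustration only.

```python
import io

import numpy as np

WIDTH, NCOLS, NROWS = 15, 3, 2  # toy geometry (the question uses 15, 8, 18)

rows = [[5780.326, -7261.185, 1.0],
        [3097.372, 3885.16, 2.0],
        [2061.429, 4665.282, 3.0]]
# Render each value into a field of exactly WIDTH characters.
text = "".join("".join(f"{v:{WIDTH}.6E}" for v in row) + "\n" for row in rows)

f = io.StringIO(text)
# A list of integers as delimiter is read as a list of column widths.
data = np.genfromtxt(f, delimiter=[WIDTH] * NCOLS, max_rows=NROWS)

print(data)
```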
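More explanation:

Since I am reading this from a custom file format, I will also post how I did the whole thing. I did some initial processing of the file to find where the text blocks are located, which gave me a 'positions' array of offsets that I can seek to in order to start reading. I then read each text block using the method above: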

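import numpy as np

data = np.array([])
r = 18  # rows per block
c = 8   # columns per block
w = 15  # column width
# positions: file offsets of the text blocks, found in an earlier pass over the file
with open('mycustomfile.xyz') as f:
    for pos in positions:
        f.seek(pos)
        data = np.append(data, np.genfromtxt(f, delimiter=[w]*c, max_rows=r))
data = data.reshape((r*len(positions), c))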
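
The seek-and-read loop can be tried end to end on a throwaway file. The sketch below is an illustration under assumptions: the file layout (one header line before each block), the block sizes, and the helper format_block are all made up. It also collects the blocks in a list and stacks them once, which avoids the repeated copying that np.append does on every iteration.

```python
import os
import tempfile

import numpy as np

ROWS, COLS, WIDTH = 2, 3, 15  # toy block geometry (hypothetical values)

def format_block(values):
    """Render a 2-D list as fixed-width text, WIDTH characters per field."""
    return "".join("".join(f"{v:{WIDTH}.6E}" for v in row) + "\n" for row in values)

block_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
block_b = [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]

# Write a toy file: a header line followed by a block, twice, remembering
# the byte offset at which each block starts.
positions = []
with tempfile.NamedTemporaryFile("wb", suffix=".xyz", delete=False) as f:
    for block in (block_a, block_b):
        f.write(b"SOME HEADER LINE\n")
        positions.append(f.tell())
        f.write(format_block(block).encode("ascii"))
    path = f.name

# Seek to each block, read it with genfromtxt, and stack once at the end.
blocks = []
with open(path) as f:
    for pos in positions:
        f.seek(pos)
        blocks.append(np.genfromtxt(f, delimiter=[WIDTH] * COLS, max_rows=ROWS))
data = np.vstack(blocks)
os.remove(path)

print(data.shape)
```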

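I want to avoid the split() operation. Besides, this is a fixed-width format, so whitespace between fields is not always guaranteed. Most of the operations in this answer are done in native Python. I would like to know whether NumPy has any native method, such as frombuffer or fromstring.

"Block of strings" is not clear. Is it a multiline string? A csv file? Can you provide a sample that we can copy and paste? Keep in mind that NumPy's fast functions are numeric. String operations rely much more on native Python.

I have no votes left, but @hpaulj is asking some important questions.

@hpaulj, sorry that my question lacked context. I was hoping it would be simple enough to get some answers, and detailed enough for those answers to be useful to me. I have now added some details to the question, and I hope they answer your questions. Your comment about having to use native Python for string manipulation more or less answers my question. I will have to use native Python lists to split the fixed-width strings!

The delimiter parameter of genfromtxt can be used to specify column widths (loadtxt documents its delimiter as a string only). That is a good approach for me. This is what I ended up with:
np.genfromtxt(f, delimiter=[15]*8, max_rows=18)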