Python numpy trims trailing zeros from byte strings


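I am trying to serialize documents to byte strings and store them in a numpy array. spacy has a to_bytes function that produces a bytearray. I call str on this bytearray and insert the resulting string object into a numpy array. This works for most documents, but not for those that end with trailing zero bytes.

Reproduction: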

>>> import numpy as np
>>> b_arr = bytearray(b'\xca\x00\x00\x00n\xff\xff\xff\x19C\x98\xc9\x06\xb18{\xa5\xe0\xaf6\xe3\x9f\xa7\xad\x86\xd6\x8d\xc0\xe6Mo;{\x96xm\x80\xe5\x8c\x9f<!\xc33\x9dg\xd3\xb3D\xf6\xac\x03P\x8do\x07m$r)\x06XBI\xc87\xcao\x83\x1d\xe4\r]\x86\xda\xeb\xb8\x1f\xd5\xcb\xde\xaa\x85r\x0f\xf1=p\xd6\x01\xdc\x83Z|&\xeb\xce|\xf9o\xa0\xe99x\x87\x87\xac\x1b\x17\x08\x000\x92\x10A\x98\x10\x13\x89( 0\x88 "!*N\xf8\xe6\xf4\r\xb1e\xf0\x9d\xfd\x80\xa2G2\x18\xdesv\xec\x85\xf7\xb1\xb3\xb3\xa68\xa7n\xe8BF\xa6\xe0\xb1\x8d\x8d\x9c\xe5\x99\x9bV\xfcE`\x1cI\x92$I\x92$I\x92$%I\x92\xe4\xff\xff\x7f\xd1\xff\xf0T\xa6\xe8\n\x9a\xd3\xffMe0\xa9\x15\xf1|\x00')
>>> b_arr_text = str(b_arr)
>>> b_arr_np = np.asarray([b_arr_text], dtype=np.str)
>>> b_arr_text == b_arr_np[0]
Out[229]: False
>>> len(b_arr_text)
Out[230]: 206
>>> len(b_arr_np[0])
Out[231]: 205
>>> b_arr_np.dtype
Out[232]: dtype('S206')

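I assume numpy considers the trailing zeros irrelevant? However, I cannot deserialize these byte strings back into spacy document objects.

Is there a way to make numpy not trim the trailing zeros, or do I have to stick with Python lists in this case?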

This is normal behavior. After calling b_arr_np.tostring() you can see that all the trailing zeros are still in place:

b_arr = bytearray(b'\xca\x00\x00\x00')

b_arr_text = str(b_arr)

b_arr_np = np.asarray([b_arr_text], dtype=np.str)

b_arr_np
Out[303]: 
array(['\xca'], 
      dtype='|S4')

b_arr_np.tostring()
Out[304]: '\xca\x00\x00\x00'
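The same effect can be reproduced on Python 3 with a short sketch. It assumes a fixed-width bytes dtype ("S4") and uses tobytes, the current name for the deprecated tostring; the variable names are illustrative only.

```python
import numpy as np

raw = b"\xca\x00\x00\x00"

# Fixed-width bytes dtype: every item occupies exactly 4 bytes in the buffer.
arr = np.array([raw], dtype="S4")

# Reading an element back drops the trailing NUL bytes,
# because NUL is what numpy uses to pad short items.
item = arr[0]

# The underlying buffer still holds all 4 bytes.
buffer = arr.tobytes()

print(len(item), len(buffer))
```

The trimming happens when an element is read, not when it is stored, which is why the dtype width and the length of the returned string can disagree.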

Check the post on GitHub. The issue can be worked around by appending a trailing non-zero byte, or by using dtype=uint8 with b_arr:

b_arr_np = np.asarray([b_arr], dtype=np.uint8)

b_arr_np
Out[319]: array([[202,   0,   0,   0]], dtype=uint8)

b_arr_np.tostring()

Out[320]: '\xca\x00\x00\x00'
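For a single buffer, the uint8 round trip might look like the sketch below on Python 3. It uses np.frombuffer instead of np.asarray, which is a swapped-in but equivalent way to view the bytes as integers; the names are illustrative.

```python
import numpy as np

b_arr = bytearray(b"\xca\x00\x00\x00")

# View the raw bytes as unsigned 8-bit integers; zeros are ordinary values here.
as_uint8 = np.frombuffer(bytes(b_arr), dtype=np.uint8)

# Convert back to a byte string without losing the trailing zeros.
restored = as_uint8.tobytes()

print(as_uint8, restored)
```

This only gives a regular array when every buffer has the same length.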

You need the np.void dtype. A string or bytes array will always chop off the trailing zeros:

a = np.array([b"\x00\x00"], dtype=np.str)
a
# Out: array([''], dtype='<U2')
a[0]
# Out: ''
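The np.void counterpart of the example above appears to be missing from this page. A minimal sketch of what it might look like, assuming Python 3 and a recent numpy (the variable names are illustrative):

```python
import numpy as np

# np.void stores each item as an opaque block of raw bytes.
a = np.array([b"\x00\x00"], dtype=np.void)

# Each element comes back as a numpy.void scalar, zeros included.
elem = a[0]

# tobytes() turns the scalar into plain Python bytes.
as_bytes = elem.tobytes()

print(a.dtype, as_bytes)
```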

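With the np.void dtype, each array element is wrapped in a void(…), which complicates things slightly, but you can work around this with either of the following: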

a[0].item()
# Out: b'\x00\x00'

Or, for the whole array:

a = a.astype(object)
a
# Out: array([b'\x00\x00'], dtype=object)
a[0]
# Out: b'\x00\x00'

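If you replace the line

b_arr_np = np.asarray([b_arr_text], dtype=np.str)

with

b_arr_np = np.asarray([b_arr_text], dtype=np.void).astype(object)

then your example will behave as you expect.

Append a dummy non-zero byte before storing, and remove it after retrieving?

@nekomatic Thanks for the suggestion, I could certainly do that and it would work. Ideally, I would like to know why numpy does this, and I am particularly curious about the mismatch between the dtype string length and the size of the trimmed string.

Thanks @vadim shkaberda. That post explains everything I wanted to know. Unfortunately, your suggestion to use np.uint8 does not work for me, because I have bytearrays of mixed shapes:

np.asarray([bytearray(b'\xca\x00\x00\x00'), bytearray(b'\xca\x00\x00')], dtype=np.uint8)

raises a ValueError:

ValueError: setting an array element with a sequence.

As the post suggests, I could use the object dtype, but that loses any benefit of using numpy over a Python list. I think I will stick with the workaround suggested by @nekomatic and append a dummy byte, then remove it.

The problem with np.void is that it does not allow variable-length elements.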
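The dummy-byte workaround discussed in the comments could be sketched as follows on Python 3. SENTINEL, pack and unpack are hypothetical names introduced for illustration; they are not part of numpy or spacy.

```python
import numpy as np

SENTINEL = b"\x01"  # hypothetical marker; any single non-zero byte would do


def pack(blobs):
    """Store byte strings in a fixed-width bytes array, each ending with the sentinel."""
    return np.array([bytes(b) + SENTINEL for b in blobs], dtype=np.bytes_)


def unpack(arr):
    """Return the original byte strings with the sentinel removed."""
    # numpy strips only the NUL padding after the sentinel, so the
    # payload's own trailing zeros survive; then drop the last byte.
    return [bytes(item)[:-1] for item in arr]


docs = [bytearray(b"\xca\x00\x00\x00"), bytearray(b"\xca\x00\x00")]
packed = pack(docs)
restored = unpack(packed)
print(packed.dtype, restored)
```

Because the item width is taken from the longest element, this also handles inputs of different lengths, at the cost of padding the shorter ones.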