Why does the string change when using Python split?


python string python-3.x python-2.7


This is the output I get after splitting:

test_str = "Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at anу timе bеfore Арril 28? Indicаtоr: 60.76%"

print(test_str)
print(test_str.split('before '))


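The problem is caused by a mix of Latin and Cyrillic characters. They render identically in most fonts, but they are still different characters with different code points.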
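As a minimal Python 3 sketch of that point, the snippet below compares a Latin "e" with its Cyrillic look-alike and shows why a separator typed in Latin letters finds nothing to split on. The `sample` string is a shortened stand-in for the question's text, and the Cyrillic letter is written as an escape so it cannot be mistaken for the Latin one.

```python
# Latin "e" and Cyrillic "е" look alike but are distinct code points.
latin_e = "e"            # U+0065 LATIN SMALL LETTER E
cyrillic_ie = "\u0435"   # U+0435 CYRILLIC SMALL LETTER IE

print(latin_e == cyrillic_ie)                     # False
print(hex(ord(latin_e)), hex(ord(cyrillic_ie)))   # 0x65 0x435

# A separator typed with Latin letters does not match the look-alike word,
# so split() returns the whole string as a single element.
sample = "at any time b\u0435fore April 28"
parts = sample.split("before ")
print(len(parts))                                 # 1
```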

The output in the question comes from Python 2.7 (which is what the asker originally used), but the same behaviour is easy to reproduce in Python 3:

"['Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at an\xd1\x83 tim\xd0\xb5 b\xd0\xb5fore \xd0\x90\xd1\x80ril 28? Indic\xd0\xb0t\xd0\xber: 60.76%']"

The unicodedata module helps to get a better picture of what is actually happening:

>>> print(test_str.encode('UTF8'))
b'Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at an\xd1\x83 tim\xd0\xb5 b\xd0\xb5fore \xd0\x90\xd1\x80ril 28? Indic\xd0\xb0t\xd0\xber: 60.76%'

у 0x443 b'\xd1\x83' CYRILLIC SMALL LETTER U
е 0x435 b'\xd0\xb5' CYRILLIC SMALL LETTER IE
А 0x410 b'\xd0\x90' CYRILLIC CAPITAL LETTER A
р 0x440 b'\xd1\x80' CYRILLIC SMALL LETTER ER
о 0x43e b'\xd0\xbe' CYRILLIC SMALL LETTER O

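So the original text contains Cyrillic letters which, when compared, are not equal to their Latin equivalents even though they print the same. The problem has nothing to do with split; the original string is simply bad.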
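One way to spot such a string before splitting it is to list every non-ASCII character together with its position and Unicode name. The following is only a sketch for Python 3. The helper name `find_non_ascii` and the shortened `sample` are assumptions made for illustration and do not come from the original answers.

```python
import unicodedata

def find_non_ascii(text):
    """Return (index, character, code point, Unicode name) for each non-ASCII character."""
    return [
        (index, char, hex(ord(char)), unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if ord(char) > 127
    ]

# Shortened stand-in for the question's string, with the Cyrillic letters escaped.
sample = "at an\u0443 tim\u0435 b\u0435fore \u0410\u0440ril 28?"
hits = find_non_ascii(sample)
for hit in hits:
    print(hit)
```

An empty result means the text is pure ASCII, so a separator typed on a Latin keyboard can match it.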

Decode the string using "UTF-8":

Since the decoded string still contains some non-ASCII characters, we can transliterate it further.

Using unidecode:

Note: if you do not want to use unidecode, I found this article that explains another way in detail:

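Disclaimer: this is a copy of a question that @Aditya just asked (and deleted). I disagreed with the conclusion reached in the comments, so I am asking it again here.

This is just a different representation of the string, possibly related to your IDE's encoding. For me, it shows the expected result.

This problem only shows up in Python 2, though. Please be clear about which version you are using.

Also, the string was not split in two, which suggests that it contains non-standard (non-ASCII) characters. It is clearly an encoding matter.

These turn out to be Cyrillic (y, e, b, a, at various positions in the string).

The deleted question could have been reopened, or linked for users with 10k+ reputation; that might have been less confusing. Using the correct tags (that is, improving the original question) would also have helped.

So, if you have no control over the input string, how do you solve a problem like this...

Note that unidecode transliterates letters according to what they represent, whereas the string here uses letters for what they look like. For example, Cyrillic ER prints like p, but it becomes r. Anyway, a variant of unidecode might be...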
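A sketch of such a variant for Python 3 is shown below: it replaces letters by what they look like rather than by what they represent. The `LOOKALIKES` table and the `fold_lookalikes` helper are hypothetical names introduced here. The table covers only the letters found in this particular string and is not an exhaustive homoglyph list.

```python
# Hypothetical, non-exhaustive table: Cyrillic letters mapped to the Latin
# letters they visually resemble (not to their phonetic transliteration).
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER, looks like Latin p
    "\u0443": "y",  # CYRILLIC SMALL LETTER U, looks like Latin y
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
})

def fold_lookalikes(text):
    """Replace the Cyrillic look-alikes listed in LOOKALIKES with Latin letters."""
    return text.translate(LOOKALIKES)

# Shortened stand-in for the question's string, with the Cyrillic letters escaped.
sample = "at an\u0443 tim\u0435 b\u0435fore \u0410\u0440ril 28? Indic\u0430t\u043er: 60.76%"
cleaned = fold_lookalikes(sample)
print(cleaned)
print(cleaned.split("before "))
```

With this mapping the date reads "April" rather than the "Arril" that phonetic transliteration produces, at the cost of maintaining the table by hand.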

>>> import unicodedata
>>> for i in b'\xd1\x83\xd0\xb5\xd0\x90\xd1\x80\xd0\xbe'.decode('utf8'):
...     print(i, hex(ord(i)), i.encode('utf8'), unicodedata.name(i))

print test_str.decode("utf-8")
u'Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at an\u0443 tim\u0435 b\u0435fore \u0410\u0440ril 28? Indic\u0430t\u043er: 60.76%'

import unidecode
unidecode.unidecode(test_str.decode("utf-8"))
'Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at anu time before Arril 28? Indicator: 60.76%'
unidecode.unidecode(test_str.decode("utf-8")).split("before ")
['Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at anu time ',
 'Arril 28? Indicator: 60.76%']