Python: strings in pandas are not printed correctly

python string pandas encoding

I am using pandas to load a CSV file containing Twitter messages:

corpus = pd.read_csv(data_path, encoding='utf-8')

Here is a sample of the data:

label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""

When I try to print a comment, I get:

print(corpus.iloc[1]['comment'])
>> "i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."

The \xa0 is still in the output. But if I paste the string from the file and print it, I get the correct output:

print("""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.""")
>> i really don't understand your point.  It seems that you are mixing apples and oranges.

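Why do the two outputs differ, and is there a way to get the pandas string to print correctly? I would also like to know whether there is a better solution than replacing the sequences directly, because the data contains many other Unicode escapes such as \xe1, \u0111, \u01b0 and \u1edd.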

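The input file that pandas is loading must be plain ASCII, with \xa0 stored as four literal characters (a backslash, x, a, 0). If the file were real UTF-8, the UTF-8 decoder would load the bytes correctly. Because the file contains only the escaped text, pandas still loads it, but the \xa0 is read literally and is not converted to the intended Unicode non-breaking space.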

It works when you copy and paste because Python sees the escape inside a string literal and interprets it.
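As a small illustration of that difference, the sketch below compares a normal string literal, where the parser interprets the escape, with a raw string, which stands in for what pandas reads from an ASCII file. The variable names are made up for this example.

```python
# The same visible text, written two ways.
interpreted = "point.\xa0 It"   # the parser turns \xa0 into one character, U+00A0
literal = r"point.\xa0 It"      # raw string: backslash, x, a, 0 stay as four characters

print(len(interpreted), len(literal))   # 10 13
print(repr(interpreted))                # 'point.\xa0 It'
print(repr(literal))                    # 'point.\\xa0 It'

# Only the interpreted string contains a real non-breaking space.
print("\xa0" in interpreted, "\xa0" in literal)   # True False
```

The doubled backslash in the second repr is the sign that the text holds a literal backslash rather than the character it was meant to describe.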

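import pandas as pd
# In a string literal, Python turns \xa0 into a real non-breaking space (U+00A0).
data = {"label": 0, "date": "20120528192215Z", "comment": "\"i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.\""}
df = pd.DataFrame(index=[1], data=data)
# Saved as UTF-8, that character is written as the bytes C2 A0, not as the text "\xa0".
df.to_csv("/tmp/corpusutf8.csv", index=False, encoding="utf-8")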

If the CSV instead contains the literal text \xa0 and is plain ASCII, it is loaded as that literal text even though encoding="utf-8" is specified:

cat /tmp/corpusascii.csv
label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
file !$
file /tmp/corpusascii.csv
/tmp/corpusascii.csv: ASCII text
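The same check can be sketched in Python by looking at the raw bytes. This example writes two small temporary files (the file names and variable names are invented for the illustration, not taken from the question's data) and compares what ends up on disk.

```python
import os
import tempfile

text_real = "point.\xa0 It"       # contains a real U+00A0 character
text_literal = r"point.\xa0 It"   # contains the literal backslash escape, pure ASCII

tmpdir = tempfile.mkdtemp()
path_utf8 = os.path.join(tmpdir, "corpusutf8.csv")
path_ascii = os.path.join(tmpdir, "corpusascii.csv")

for path, text in [(path_utf8, text_real), (path_ascii, text_literal)]:
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)

with open(path_utf8, "rb") as fh:
    raw_utf8 = fh.read()
with open(path_ascii, "rb") as fh:
    raw_ascii = fh.read()

print(raw_utf8)    # the non-breaking space is stored as the two bytes C2 A0
print(raw_ascii)   # only ASCII bytes: a backslash followed by x, a, 0
print(raw_utf8.isascii(), raw_ascii.isascii())   # False True

# Decoding the ASCII file as UTF-8 succeeds, but nothing is converted.
print(raw_ascii.decode("utf-8") == text_literal)   # True
```

Every ASCII file is also valid UTF-8, which is why passing encoding="utf-8" raises no error here and changes nothing.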

Possible duplicate of: remove all Unicode characters from a column?

Reading the ASCII file with encoding="utf-8" leaves the escape as literal text:

df1 = pd.read_csv("/tmp/corpusascii.csv", encoding="utf-8")
df1
   label             date                                            comment
0      0  20120528192215Z  "i really don't understand your point.\xa0 It ...
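One possible way to convert such literal escapes after loading, without replacing each sequence by hand, is Python's built-in unicode_escape codec. This is a minimal sketch, not part of the original answer: decode_escapes is a hypothetical helper name, and it assumes every backslash in the data is meant as a Python-style escape. A stray backslash followed by an incomplete sequence such as \x alone would raise an error.

```python
def decode_escapes(text):
    r"""Interpret literal escapes such as \xa0 or \u0111 found in text."""
    # Characters that are already non-ASCII are first rewritten as escapes,
    # so the unicode_escape codec does not mangle them when decoding.
    return text.encode("ascii", "backslashreplace").decode("unicode_escape")


# A raw string stands in for what pandas loads from the ASCII file.
sample = r"point.\xa0 It seems \u0111\u01b0\u1edd\xe1"
fixed = decode_escapes(sample)

print(repr(sample))
print(repr(fixed))
print("\xa0" in fixed)   # True
```

On the DataFrame from the question this could be applied with something like corpus['comment'] = corpus['comment'].map(decode_escapes), assuming the column holds only strings and no missing values.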