Strings are not printed correctly in Python pandas
I am using pandas to load a CSV file that contains Twitter messages:
corpus = pd.read_csv(data_path, encoding='utf-8')
Here is a sample of the data:
label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
When I try to print a comment, I get:
print(corpus.iloc[1]['comment'])
>> "i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."
The \xa0 is still in the output. But if I paste the string from the file and print it, I get the correct output:
print("""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.""")
>> i really don't understand your point. It seems that you are mixing apples and oranges.
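I would like to know why the two outputs differ, and whether there is a way to get the pandas string printed correctly. I am also wondering whether there is a better solution than replacing the sequence directly, because the data contains many other Unicode representations such as \xe1, \u0111, \u01b0 and \u1edd.

The input file that pandas loaded must be plain ASCII. If it were UTF-8, the UTF-8 decoder would load the UTF-8 bytes correctly. If the file does not contain the character as UTF-8 bytes, pandas still loads it, and the escaped \xa0 is read literally (a backslash followed by x, a, 0) rather than being converted to the Unicode non-breaking space you want. It works when you copy and paste because Python interprets the escape when it sees it inside a string literal.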
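The difference between the two cases can be reproduced without pandas. The following is only an illustrative sketch with made-up variable names: a raw string stands in for the text read from the ASCII file, and an ordinary string literal stands in for the pasted text.

```python
# What is read from an ASCII file: backslash, 'x', 'a', '0' (four characters).
from_file = r"point.\xa0 It"
# What Python builds from a pasted string literal: a single U+00A0 character.
from_literal = "point.\xa0 It"

print(len(from_file), len(from_literal))  # the lengths differ by three
print(from_file == from_literal)          # the two strings are not equal
print(repr(from_literal))                 # repr shows the real character as \xa0
```

Printing `from_file` shows the backslash sequence as text, while printing `from_literal` shows a non-breaking space, which looks like an ordinary space.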
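import pandas as pd
# Build a one-row frame whose comment holds a real U+00A0 character
# (the escape is interpreted here because it is inside a string literal).
data = {u"label": 0, u"date": u"20120528192215Z", u"comment": u"\"i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.\""}
df = pd.DataFrame(index=[1], data=data)
# Write it out as genuine UTF-8.
df.to_csv("/tmp/corpusutf8.csv", index=False, encoding="utf-8")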
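If the CSV was written with the literal text \xa0 and is plain ASCII, it is still loaded as ASCII text even though the utf-8 encoding is specified. ASCII is a subset of UTF-8, so decoding succeeds and the escape sequence is left untouched: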
cat /tmp/corpusascii.csv
label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
file !$
file /tmp/corpusascii.csv
/tmp/corpusascii.csv: ASCII text
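The same check can be made from Python by looking at the raw bytes. This is a minimal sketch; `describe_escape` is a hypothetical helper, and the two byte strings are stand-ins for the contents of the ASCII and UTF-8 files discussed above.

```python
def describe_escape(raw: bytes) -> str:
    """Report whether raw bytes hold a real U+00A0 or only the literal text."""
    if b"\\xa0" in raw:
        return "literal backslash escape (plain ASCII text)"
    if b"\xc2\xa0" in raw:
        return "real non-breaking space encoded as UTF-8"
    return "neither"

# Stand-in for the ASCII file: backslash, x, a, 0 stored as four bytes.
ascii_bytes = rb"point.\xa0 It"
# Stand-in for the UTF-8 file: U+00A0 stored as the two bytes C2 A0.
utf8_bytes = "point.\xa0 It".encode("utf-8")

print(describe_escape(ascii_bytes))
print(describe_escape(utf8_bytes))
```

For a real file, the bytes could be obtained with `open(path, "rb").read()`.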
Comment: Possible duplicate of the question about removing all Unicode characters from a column?
df1 = pd.read_csv("/tmp/corpusascii.csv", encoding="utf-8")
df1
   label             date                                            comment
0      0  20120528192215Z  "i really don't understand your point.\xa0 It ...
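As for a more general fix than replacing each sequence by hand, one option is to let Python's `unicode_escape` codec interpret the literal escapes after loading. This is a sketch, not something taken from the answer above: `decode_escapes` is a hypothetical helper, and it assumes every backslash in the data really starts an escape sequence. Text containing other backslashes may be altered or raise an error.

```python
def decode_escapes(text: str) -> str:
    """Turn literal escapes such as \\xa0 or \\u0111 into real characters."""
    # backslashreplace rewrites genuine non-Latin-1 characters as \uXXXX,
    # so they survive the round trip through the unicode_escape codec.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")

# Raw string: the escapes are stored as literal text, as in the ASCII file.
raw = r"don't understand your point.\xa0 It seems \u0111"
fixed = decode_escapes(raw)
print(repr(fixed))
```

With a loaded frame, the helper might be applied to a column with something like `corpus["comment"].map(decode_escapes)`, provided the column contains only strings.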