Python: strings in pandas are not printed correctly

I am using pandas to load a CSV file that contains Twitter messages:

corpus = pd.read_csv(data_path, encoding='utf-8')
Here is a sample of the data:

label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
When I try to print the comment, I get:

print(corpus.iloc[1]['comment'])
>> "i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."
The \xa0 is still in the output. But if I paste the string from the file and print it, I get the correct output:

print("""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.""")
>> i really don't understand your point.  It seems that you are mixing apples and oranges.

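I would like to know why the two outputs differ, and whether there is a way to get the pandas string to print correctly. I am also wondering whether there is a better solution than a direct replace, because the data contains many other Unicode escapes such as \xe1, \u0111, \u01b0, \u1edd, etc.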

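The input file that pandas is loading must be plain ASCII, with \xa0 stored as four literal characters (backslash, x, a, 0). If the file contained a real UTF-8 encoded non-breaking space, the UTF-8 decoder would load those bytes correctly. Because it does not, pandas still loads the file, but the escaped \xa0 is read literally and is not converted to the desired Unicode non-breaking space.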

It works when you copy and paste because Python interprets the escape inside a string literal.
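The difference between the two cases can be made visible with a small sketch. The variable names here are made up for illustration, and a raw string stands in for text read from the ASCII file:

```python
# In source code, "\xa0" inside a normal string literal is an escape sequence:
# Python turns it into ONE character, U+00A0 (no-break space).
from_literal = "point.\xa0 It"

# Text read from an ASCII file holds the four characters \ x a 0 instead.
# A raw string reproduces that situation without touching the disk.
from_file = r"point.\xa0 It"

print(repr(from_literal), len(from_literal))
print(repr(from_file), len(from_file))

# The two values are different strings, which is why print() shows them differently.
print(from_literal == from_file)
```

The escape is processed by the Python parser when it reads source code, not by print() and not by pandas, so text that arrives from a file is never unescaped automatically.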

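import pandas as pd
# Build a one-row frame whose comment holds a real U+00A0 character
# (the \xa0 escape is interpreted here because it sits in a string literal).
data = {u"label": 0, u"date": u"20120528192215Z", u"comment": u"\"i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.\""}
df = pd.DataFrame(index=[1], data=data)
# Write it out as UTF-8, so the file contains the encoded no-break space.
df.to_csv("/tmp/corpusutf8.csv", index=False, encoding="utf-8")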
If the CSV was built with a literal \xa0 and is ASCII, it still loads as ASCII text even though encoding="utf-8" is specified:

cat /tmp/corpusascii.csv
label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
file !$
file /tmp/corpusascii.csv
/tmp/corpusascii.csv: ASCII text
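As a rough cross-check of what `file` reports, the raw bytes of the two kinds of file can be compared directly. This is only a sketch: the temporary file names are invented for the example and the content is a shortened version of the sample row.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
real_path = os.path.join(tmp_dir, "real_nbsp.csv")          # hypothetical name
literal_path = os.path.join(tmp_dir, "literal_escape.csv")  # hypothetical name

# File 1: contains an actual U+00A0 character, encoded as UTF-8.
with open(real_path, "w", encoding="utf-8") as f:
    f.write("comment\npoint.\xa0 It\n")

# File 2: contains the four ASCII characters backslash, x, a, 0.
with open(literal_path, "w", encoding="utf-8") as f:
    f.write("comment\n" + r"point.\xa0 It" + "\n")

with open(real_path, "rb") as f:
    real_bytes = f.read()
with open(literal_path, "rb") as f:
    literal_bytes = f.read()

# U+00A0 is stored as the two bytes C2 A0 in UTF-8.
print(b"\xc2\xa0" in real_bytes)
# The second file is pure ASCII and holds the escape as plain text.
print(literal_bytes.isascii(), b"\\xa0" in literal_bytes)
```

If a file looks like the second case, no choice of `encoding` in `pd.read_csv` will change the result, because there are no non-ASCII bytes to decode.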

Possible duplicate of "Remove all unicode characters from a column"?
df1 = pd.read_csv("/tmp/corpusascii.csv", encoding="utf-8")
df1
   label             date                                            comment
0      0  20120528192215Z  "i really don't understand your point.\xa0 It ...
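For the literal escapes that remain after loading, one possible alternative to replacing each sequence by hand is to decode every \xNN and \uNNNN pattern with a regular expression. This is a sketch, not part of the original answer: `decode_escapes` is a hypothetical helper, and it assumes every such pattern in the data really is meant as an escape.

```python
import re

# Matches a literal backslash followed by xNN or uNNNN (hex digits).
_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})|\\u([0-9a-fA-F]{4})")


def decode_escapes(text):
    """Replace literal \\xNN and \\uNNNN sequences with the characters they name."""
    def _to_char(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(_to_char, text)


# Raw strings stand in for values read from the ASCII CSV file.
raw_comment = r"i really don't understand your point.\xa0 It seems"
print(repr(decode_escapes(raw_comment)))
print(decode_escapes(r"\u0111\u01b0\u1edd"))
```

With a DataFrame loaded as in the question, the helper could be applied to the whole column with `corpus['comment'].map(decode_escapes)`, assuming the column holds only strings. Unlike `codecs.decode(text, 'unicode_escape')`, this approach leaves any genuine non-ASCII characters in the text untouched.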