Python pandas and unicode


python unicode pandas


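This is a string I got from pandas.DataFrame.to_json(). I put it into redis, fetch it from redis elsewhere, and try to read it with pandas.read_json():

It doesn't seem to contain any unicode. However, when trying .read_json() on it, I get: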

Traceback (most recent call last):
  File "./sqlprofile.py", line 160, in <module>
    maybe_save_dataframes(rconn, configd, results)
  File "./sqlprofile.py", line 140, in maybe_save_dataframes
    h5store.append(out_queue, df)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2717, in create_axes
    raise e
TypeError: [unicode] is not implemented as a table column
> /home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py(2717)create_axes()
-> raise e
(Pdb) locals()

How can I fix this? Is this a bug in pandas/pytables?

Environment:

Python 2.7

pandas==0.12.0

tables==3.0.0

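It looks like your round trip is producing some unicode. I'm not sure why, but it's easy to fix. In Python 2 you cannot store unicode in an HDFStore table (in Python 3 this works correctly). If you want, you can store it in the fixed format instead (it will be pickled).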

This infers the actual type of the object-dtype Series. They come out as unicode only if at least one string is unicode (otherwise they are inferred as string).
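As a rough illustration of what "the actual type of an object-dtype Series" means, here is a minimal sketch for a current pandas on Python 3 (an assumption; the answer above is about Python 2 and pandas 0.12). Python 3 has no separate unicode label, so the sketch shows the `string`, `bytes` and `mixed` labels instead. The sample values are made up.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# All three Series report dtype "object"; infer_dtype looks at the values inside.
text_col = pd.Series(["yy38.segm1.org", "kyy1.segm1.org"], dtype=object)
bytes_col = pd.Series([b"yy38.segm1.org", b"kyy1.segm1.org"], dtype=object)
mixed_col = pd.Series(["yy38.segm1.org", b"kyy1.segm1.org"], dtype=object)

text_kind = infer_dtype(text_col)    # every value is a str
bytes_kind = infer_dtype(bytes_col)  # every value is a bytes object
mixed_kind = infer_dtype(mixed_col)  # str and bytes together

print(text_col.dtype, text_kind, bytes_kind, mixed_kind)
```

The point is the same as in the answer: the column dtype alone says "object", and the inferred label depends on what the individual values are.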

Here's how to "fix" it:

In [28]: types = df.apply(lambda x: pd.lib.infer_dtype(x.values))

In [29]: types[types=='unicode']
Out[29]: 
args         unicode
host         unicode
kwargs       unicode
operation    unicode
thingy       unicode
dtype: object

In [30]: for col in types[types=='unicode'].index:
   ....:     df[col] = df[col].astype(str)
   ....:

It looks the same:

In [31]: df
Out[31]: 
  args                date            host kwargs     operation  status   thingy      time
0   [] 2013-12-02 00:33:59  yy38.segm1.org     {}       x_gbinf    -101  a13yy38  0.000801
1   [] 2013-12-02 00:33:59  kyy1.segm1.org     {}     x_initobj       1  a19kyy1  0.003244
2   [] 2013-12-02 00:34:00  yy10.segm1.org     {}  x_gobjParams    -101  a14yy10  0.002247
3   [] 2013-12-02 00:34:00  yy24.segm1.org     {}        gtfull    -101  a14yy24  0.002787
4   [] 2013-12-02 00:34:00  yy24.segm1.org     {}       x_gbinf    -101  a14yy24  0.001067
5   [] 2013-12-02 00:34:00  yy34.segm1.org     {}       gxyzinf    -101  a12yy34  0.002652
6   [] 2013-12-02 00:34:00  yy15.segm1.org     {}     deletemfg       1  a15yy15  0.004371
7   [] 2013-12-02 00:34:00  yy15.segm1.org     {}       gxyzinf    -101  a15yy15  0.000602

[8 rows x 8 columns]

But now the types are inferred correctly:

In [32]: df.apply(lambda x: pd.lib.infer_dtype(x.values))
Out[32]: 
args             string
date         datetime64
host             string
kwargs           string
operation        string
status          integer
thingy           string
time           floating
dtype: object

The solution above can raise errors on unicode special characters. A similar solution that converts unicode to strings without choking on those characters:

for col in types[types=='unicode'].index:
    df[col] = df[col].apply(lambda x: x.encode('utf-8').strip())

This is partly due to the way Python handles unicode. See the Python documentation for more information.
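To make the difference concrete, here is a small sketch written for Python 3 (an assumption; the thread itself is about Python 2.7). In Python 2, `str()` on a unicode object encodes implicitly with the ASCII codec, which is the step that fails on special characters. The sketch reproduces that failure explicitly and shows that encoding to UTF-8 succeeds. The sample text is made up.

```python
# A text value with a non-ASCII character and a trailing space.
text = u"caf\u00e9 "

# Encoding with ASCII fails, as the implicit conversion in Python 2 would.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding with UTF-8 succeeds; strip() then removes the trailing space.
utf8_bytes = text.encode("utf-8").strip()

print(ascii_ok, utf8_bytes)
```

Note that the result is a byte string, so any consumer that later needs text has to decode it with the same codec.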

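If there's no DFG at the start, read_json on the first string works fine for me in both 0.12 and 0.13rc.

Thanks! That works. It seems a strange omission, though, given Python's JSON decoder logic: "json.loads(s) ... Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table." According to that conversion table, even in Python 2.7 the only string-type object that JSON decoding produces is unicode.

This doesn't use Python's JSON decoding; it uses a custom decoder based on ujson.

I've never seen this before. In the environment where the problem occurs, print pd.get_option('display.encoding'); it should be the same as in ipython.

I think that is the default on some systems, IIRC. I think you can set it right after starting Python; it's somewhere in sys (you can also set it for pandas via set_option(...)). That might solve this for you.

In recent versions of pandas, this solution produces: FutureWarning: pandas.lib is deprecated and will be removed in a future version. You can access infer_dtype as pandas.api.types.infer_dtype. Replace df.apply(lambda x: pd.lib.infer_dtype(x.values)) with df.apply(lambda x: pd.api.types.infer_dtype(x.values)).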
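For readers on a current pandas, the per-column check from the answer might look like the sketch below. It assumes Python 3 and a pandas version that exposes `pandas.api.types.infer_dtype`. The small DataFrame is invented for illustration, reusing a few column names from the output above.

```python
import pandas as pd

# Hypothetical sample data; the string column is forced to object dtype
# so that it matches the situation described in the answer.
df = pd.DataFrame({
    "host": pd.Series(["yy38.segm1.org", "kyy1.segm1.org"], dtype=object),
    "date": pd.to_datetime(["2013-12-02 00:33:59", "2013-12-02 00:34:00"]),
    "status": [-101, 1],
    "time": [0.000801, 0.003244],
})

# Same idea as the original snippet, using the non-deprecated location.
types = df.apply(lambda x: pd.api.types.infer_dtype(x.values))
print(types)

# Columns whose values are all text, i.e. candidates for explicit conversion.
text_columns = list(types[types == "string"].index)
```

On Python 3 the label to filter on is `string` rather than `unicode`, since `str` is already a unicode type there.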
