Pandas: reading a MultiIndex DataFrame (the reverse of to_string())


I have a text file that looks like this:

test2.dat:

               col1      col2
idx1 idx2                    
a    0     0.256788  0.862771
     1     0.409944  0.785159
     2     0.822773  0.955309
b    0     0.159213  0.628662
     1     0.463844  0.667742
     2     0.292325  0.768051

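It was created by saving a MultiIndex DataFrame via

file.write(df.to_string())

Now I would like to reverse this operation. But when I try

pandas.read_csv(data, sep=r'\s+', index_col=[0, 1])

it throws an error:

ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 4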
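A minimal sketch of where the error comes from, assuming the same layout as test2.dat (the text is inlined and shortened to one index group so the snippet runs on its own). Splitting on whitespace yields two fields for each of the two header lines but four for the first data row, and the blank cells under idx1 leave the later rows with only three.

```python
import io

import pandas as pd

# Inlined, shortened copy of test2.dat so the example is self-contained.
TEXT = (
    "               col1      col2\n"
    "idx1 idx2                    \n"
    "a    0     0.256788  0.862771\n"
    "     1     0.409944  0.785159\n"
    "     2     0.822773  0.955309\n"
)

# Number of whitespace-separated fields on each line.
field_counts = [len(line.split()) for line in TEXT.splitlines()]
print(field_counts)

# The same call as in the question, applied to the inlined text.
try:
    pd.read_csv(io.StringIO(TEXT), sep=r"\s+", index_col=[0, 1])
    error_message = None
except pd.errors.ParserError as exc:
    error_message = str(exc)
print(error_message)
```

The field counts are not constant from line to line, which is why a plain whitespace-delimited parse cannot recover the table.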

Here is a small MWE:

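import pandas
import numpy as np
from itertools import product

# Build a 6-row frame indexed by (idx1, idx2) with two random columns
df1 = pandas.DataFrame(list(product(['a', 'b'], range(3))), columns=['idx1', 'idx2'])
df2 = pandas.DataFrame(np.random.rand(6, 2), columns=['col1', 'col2'])
df = pandas.concat([df1, df2], axis=1)
df.set_index(['idx1', 'idx2'], inplace=True)

# Save once as space-separated CSV and once as the to_string() text
df.to_csv('test.dat', sep=' ')
with open('test2.dat', 'w') as file:
    file.write(df.to_string())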

Note that test.dat, which was saved via df.to_csv(), looks quite different from test2.dat.

test.dat:

idx1 idx2 col1 col2
a 0 0.2567883353169065 0.862770538437793
a 1 0.40994403619942743 0.7851591115509821
a 2 0.8227727216889246 0.9553088749178045
b 0 0.1592133339255788 0.6286622783546136
b 1 0.4638439474864856 0.6677423709343185
b 2 0.2923252978245071 0.7680513714069206

Use read_fwf and set the column names with a list comprehension:

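import pandas as pd

df = pd.read_fwf('file.csv', header=[0, 1])
# flatten the two header rows, dropping the 'Unnamed' placeholders
df.columns = [y for x in df.columns for y in x if 'Unnamed' not in y]

# fill the blank cells of the first column with the previous value
# (idx1 is numeric in this sample, hence the cast to int)
first = df.columns[0]
df[first] = df[first].ffill().astype(int)
# set first 2 columns to MultiIndex
df = df.set_index(df.columns[:2].tolist())
print(df)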
             col1    col2
idx1 idx2                
1    1     0.1234  0.2345
     2     0.4567  0.2345
     3     0.1244  0.5332
2    1     0.4213  0.5233
     2     0.5423  0.5423
     3     0.5235  0.6233

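I decided to use a slight variation of jezrael's code that automatically handles the number of index levels. Note that df.columns initially has the form

[(x1, y1), (x2, y2), ..., (xn, yn)]

where n is the number of columns, xi is the label of column i in the first header row, and yi is its label in the second header row.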

df = pandas.read_fwf(f, header=[0, 1])
cols = [x for x, _ in df.columns if 'Unnamed' not in x]
idxs = [y for _, y in df.columns if 'Unnamed' not in y]
df.columns = idxs + cols
df[idxs] = df[idxs].ffill()
df.set_index(idxs, inplace=True)
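A round-trip check of this approach might look as follows. It is only a sketch: the wrapper name read_to_string_output is made up for illustration, the text is passed through io.StringIO instead of a file, and it assumes the index columns come before the data columns, as they do in to_string() output. Values are compared with a tolerance because to_string() rounds the floats to six decimal places by default.

```python
import io
from itertools import product

import numpy as np
import pandas as pd


def read_to_string_output(buffer):
    """Hypothetical helper: rebuild a MultiIndex frame from df.to_string() text."""
    df = pd.read_fwf(buffer, header=[0, 1])
    # Each column label is a pair (first header row, second header row);
    # the empty cell of the pair is reported as 'Unnamed: ...'.
    cols = [x for x, _ in df.columns if "Unnamed" not in x]
    idxs = [y for _, y in df.columns if "Unnamed" not in y]
    df.columns = idxs + cols
    # to_string() prints a repeated outer index label only once, so fill it down.
    df[idxs] = df[idxs].ffill()
    return df.set_index(idxs)


# Same construction as the MWE, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
index = pd.DataFrame(list(product(["a", "b"], range(3))), columns=["idx1", "idx2"])
values = pd.DataFrame(rng.random((6, 2)), columns=["col1", "col2"])
original = pd.concat([index, values], axis=1).set_index(["idx1", "idx2"])

restored = read_to_string_output(io.StringIO(original.to_string()))
print(restored)
```

The restored frame matches the original only up to the precision that to_string() printed, so this format is best suited to data where six decimal places are enough.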

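Thank you for the answer. The format comes from saving the exact same DataFrame via file.write(df.to_string()). The reason for doing so is that I want to save the data in a human-readable form. Unfortunately, pandas does not provide the reverse of to_string when a MultiIndex is used.

@Hyperplane - no, you need df.to_csv(file), and then read_csv works nicely.

But standard CSV is definitely not human-readable. I would really like to use whitespace as the separator and keep the nice vertical alignment that to_string provides.

@Hyperplane - added a possible solution, but a reverse function for reading files created by df.to_string() does not exist.

See my edit. to_csv does not really work for me because the output is not really human-readable (especially compared to the output of to_string). Thanks for your help anyway!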
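For completeness, the to_csv / read_csv route mentioned in the comments can be sketched as follows. This assumes the space-separated layout of test.dat and uses an in-memory buffer in place of a file. It round-trips the MultiIndex but gives up the aligned layout that to_string() provides.

```python
import io
from itertools import product

import numpy as np
import pandas as pd

# Same construction as the MWE, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
index = pd.DataFrame(list(product(["a", "b"], range(3))), columns=["idx1", "idx2"])
values = pd.DataFrame(rng.random((6, 2)), columns=["col1", "col2"])
original = pd.concat([index, values], axis=1).set_index(["idx1", "idx2"])

# Write space-separated text, then read it back with the first two
# columns restored as the MultiIndex.
buffer = io.StringIO()
original.to_csv(buffer, sep=" ")
buffer.seek(0)
restored = pd.read_csv(buffer, sep=" ", index_col=[0, 1])
print(restored)
```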