Python: How to move a multi-index DataFrame into an xarray DataArray


I am importing a CSV file into a pandas DataFrame. The CSV file looks like this:

Time,    Status, Variable, freq_1, freq_2, freq_3, .....
1/1/2000,  Hi,      A,      0.1,    3.3,    8.1, ....
1/1/2000,  Hi,      B,      2.4,    1.2,    1.3, ....
1/1/2000,  Lo,      A,      4.5,    6.9,    6.4, ....
1/1/2000,  Lo,      B,      7.1,    8.8,    2.3, ....
2/1/2000,  Hi,      A,      0.1,    3.3,    8.1, ....
2/1/2000,  Hi,      B,      2.4,    1.2,    1.3, ....
2/1/2000,  Lo,      A,      4.5,    6.9,    6.4, ....
2/1/2000,  Lo,      B,      7.1,    8.8,    2.3, ....
....
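I read it into a DataFrame with a MultiIndex, using Time, Status, and Variable as the index levels.

I now want to import the DataFrame into xarray using either pandas to_xarray or xarray from_dataframe. However, both methods seem to choke on the index and raise errors: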

TypeError: Could not convert tuple of form (dims, data[, attrs, encoding]): (0, DatetimeIndex(['2018-01-12 00:15:00', '2018-01-12 00:45:00',
               '2018-01-12 01:15:00', '2018-01-12 01:45:00',
               '2018-01-12 02:15:00', '2018-01-12 02:45:00',
               '2018-01-12 03:15:00', '2018-01-12 03:45:00',
               '2018-01-12 04:15:00', '2018-01-12 04:45:00',
               ...
               '2019-11-01 16:15:00', '2019-11-01 17:15:00',
               '2019-11-01 17:45:00', '2019-11-01 18:15:00',
               '2019-11-01 18:45:00', '2019-11-01 19:15:00',
               '2019-11-01 20:45:00', '2019-11-01 21:15:00',
               '2019-11-01 21:45:00', '2019-11-01 22:15:00'],
              dtype='datetime64[ns]', name=0, length=3100, freq=None)) to Variable.
ValueError: coords is not dict-like, but it has 4 items, which does not match the 2 dimensions of the data
I also tried using the xarray.DataArray constructor directly:

Mytime = PD.to_datetime(df[0],infer_datetime_format=True)
Myfreq = np.array([ 0,1,2,3...39])
OutDataArray = Xarray.DataArray(df[np.arange(3,43)], coords=[('time', Mytime ), ('freq', Myfreq ), ('Status', df[1]), ('variable', df[2])])
But this produces the error:

TypeError: Could not convert tuple of form (dims, data[, attrs, encoding]): (0, DatetimeIndex(['2018-01-12 00:15:00', '2018-01-12 00:45:00',
               '2018-01-12 01:15:00', '2018-01-12 01:45:00',
               '2018-01-12 02:15:00', '2018-01-12 02:45:00',
               '2018-01-12 03:15:00', '2018-01-12 03:45:00',
               '2018-01-12 04:15:00', '2018-01-12 04:45:00',
               ...
               '2019-11-01 16:15:00', '2019-11-01 17:15:00',
               '2019-11-01 17:45:00', '2019-11-01 18:15:00',
               '2019-11-01 18:45:00', '2019-11-01 19:15:00',
               '2019-11-01 20:45:00', '2019-11-01 21:15:00',
               '2019-11-01 21:45:00', '2019-11-01 22:15:00'],
              dtype='datetime64[ns]', name=0, length=3100, freq=None)) to Variable.
ValueError: coords is not dict-like, but it has 4 items, which does not match the 2 dimensions of the data
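So, if the DataFrame is two-dimensional, but one of its dimensions (the rows) actually consists of several dimensions specified by the DataFrame's MultiIndex, how do I import the DataFrame into xarray?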


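As requested, here is an example script that reproduces the problem. Note: you need to set a file name for the CSV file that holds the sample data: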

import numpy as np
import pandas as PD

# create some data
dt = PD.date_range(start='01/01/2000 00:00:00', end='02/01/2000 00:00:00', freq='30T')
ldt = len(dt)
# label columns (np.str was removed from NumPy, so build the strings directly)
vr1 = PD.Series(['apple'] * ldt)
vr2 = PD.Series(['orange'] * ldt)
vr3 = PD.Series(['peach'] * ldt)

# combine the data to mimic my file format
a = PD.Series([1,2,3,4], index=[7,2,8,9])
b = PD.Series([5,6,7,8], index=[7,2,8,9])
df1 = PD.DataFrame({'Time': dt,'Fruit':vr1, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df2 = PD.DataFrame({'Time': dt,'Fruit':vr2, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df3 = PD.DataFrame({'Time': dt,'Fruit':vr3, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df_unsorted = PD.concat([df1, df2, df3])
df = df_unsorted.sort_values(by=['Time', 'Fruit'])

# write the data to a csv file
filename = 'Give a file path/name here'
df.to_csv(filename, index=False)

#import the csv file and convert to an xarray
df2 = PD.read_csv(filename,  sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
da = df2.to_xarray()

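Your error appears to come from the fact that the columns and index read from the csv file are not named in the resulting DataFrame. Replace the last two lines of your code example with: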

df2 = PD.read_csv(filename,  sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
df2.columns = ['N1', 'N2', 'N3']
df2.index.names = ['time', 'fruit']
ds = df2.to_xarray()
This converts successfully to an xarray Dataset:

print(ds)

<xarray.Dataset>
Dimensions:  (fruit: 3, time: 1489)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ... 2000-02-01
  * fruit    (fruit) object 'apple' 'orange' 'peach'
Data variables:
    N1       (time, fruit) float64 0.114 0.3726 0.5072 ... 0.2065 0.9082 0.7945
    N2       (time, fruit) float64 0.7534 0.1107 0.8866 ... 0.4509 0.5218 0.1472
    N3       (time, fruit) float64 0.156 0.6498 0.3521 ... 0.3742 0.5899 0.607
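
Since the question asks for a DataArray rather than a Dataset, the following is a minimal sketch of one possible way to get there. It assumes the CSV has a usable header row (the small in-memory CSV and its column names here are made up for illustration), so the index names come straight from the file. The Dataset is then stacked into a single DataArray with Dataset.to_array, which adds a new dimension holding the former column names.

```python
import io

import pandas as pd
import xarray as xr  # DataFrame.to_xarray needs xarray to be installed

# Hypothetical in-memory CSV standing in for the real file (illustrative values).
csv_text = """Time,Fruit,N1,N2,N3
2000-01-01 00:00:00,apple,0.1,3.3,8.1
2000-01-01 00:00:00,orange,2.4,1.2,1.3
2000-01-01 00:30:00,apple,4.5,6.9,6.4
2000-01-01 00:30:00,orange,7.1,8.8,2.3
"""

# Read with the header row, then build a named MultiIndex from two columns.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Time"])
df = df.set_index(["Time", "Fruit"])

# Each index level becomes a dimension; each column becomes a data variable.
ds = df.to_xarray()

# Stack the data variables along a new "variable" dimension to get one DataArray.
da = ds.to_array(dim="variable")

print(da.sizes)
print(da.sel(variable="N1", Fruit="apple").values)
```

This only works cleanly when every column has the same dtype and every (Time, Fruit) pair occurs at most once; missing combinations are filled with NaN.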

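Can you provide a reproducible example? to_xarray usually works, so I think more details are needed.

So xarray cannot handle a pandas DataFrame with default column labels (i.e. [0,1,2,3,…]) read from a CSV file without a header row?

It looks that way. However, you can simplify this and avoid setting the column and index names manually by using the headers from the csv directly. I updated my answer.

Unfortunately, my CSV file header is not suitable for column naming.

@Dan If we use this method, how do we customize the xarray.Dataset built from the dataframe? Suppose coords should contain additional variables that are not in dims, and the data variable N1 should depend only on time and not on fruit.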
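
The last comment is not answered in the thread. As a rough sketch only, one way to approach it is to convert first and then adjust the Dataset. The coordinate name colour, its values, and the choice of averaging N1 over fruit are all assumptions made for illustration, not something taken from the original discussion.

```python
import numpy as np
import pandas as pd
import xarray as xr  # DataFrame.to_xarray needs xarray to be installed

# Small synthetic frame with a named (time, fruit) MultiIndex.
times = pd.date_range("2000-01-01", periods=4, freq="30min")
fruits = ["apple", "orange", "peach"]
index = pd.MultiIndex.from_product([times, fruits], names=["time", "fruit"])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((len(index), 3)), index=index,
                  columns=["N1", "N2", "N3"])

ds = df.to_xarray()

# Add a non-dimension coordinate that lives along the existing "fruit"
# dimension (hypothetical values, one per fruit).
ds = ds.assign_coords(colour=("fruit", ["green", "orange", "pink"]))

# Make N1 depend on time only, here by averaging over "fruit"
# (an assumption; selecting a single fruit would be another option).
ds["N1"] = ds["N1"].mean("fruit")

print(ds)
```

After this, N1 has only the time dimension while N2 and N3 keep (time, fruit), and colour appears under Coordinates without being a dimension.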