Pandas: problems when working with a multi-index DataFrame


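I have a huge DataFrame. I have built a small multi-index DataFrame here that resembles it. I need to get the number of NaNs for each index and each column.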

import numpy as np
import pandas as pd

temp = pd.DataFrame({'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
                     'industry': ['A', 'B', 'B', 'A', 'B'],
                     'price': [np.nan, 5, 6, 11, np.nan],
                     'shares': [100, 60, np.nan, 100, 62],
                     'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                                              '1990-04-01', '1990-08-01'])
                     })

# Use (tic, dates) as a two-level row index
temp.set_index(['tic', 'dates'], inplace=True)

This produces:

                industry  price  shares
tic  dates                             
IBM  1990-01-01        A    NaN   100.0
AAPL 1990-01-01        B    5.0    60.0
     1990-04-01        B    6.0     NaN
IBM  1990-04-01        A   11.0   100.0
AAPL 1990-08-01        B    NaN    62.0

Here are my questions:

1) Minor question: why doesn't the index work? I expected to see a single IBM and a single AAPL in the tic column.

2) How can I get, for each column, the ratio of NaNs to total data points for each tic? I need a DataFrame like this:

tic                                     IBM                AAPL
number of total NaNs                    1                  2
percentage of NaNs in 'price' column    50% (1 out of 2)   33.3% (1 out of 3)
percentage of NaNs in 'shares' column   0% (0 out of 2)    33.3% (1 out of 3)

3) How can I rank the tics by their ratio of NaNs in the price column?

4) How can I select the top n tics with the lowest ratio of NaNs in both columns?

5) How can I do all of the above between two dates?

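1) Why doesn't the index work? The index itself is fine; the rows are just not sorted, so equal tic labels are not adjacent and pandas prints each one again. Sort the index: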

temp.sort_index()
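As a small illustration of the sorting point: pandas only blanks out a repeated outer label when equal labels sit on consecutive rows. The sketch below rebuilds the sample frame from the question and checks the ordering before and after `sort_index()`. The name `sorted_temp` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# sort_index() returns a new frame; temp itself is left unchanged
sorted_temp = temp.sort_index()

print(temp.index.is_monotonic_increasing)         # rows are interleaved by tic
print(sorted_temp.index.is_monotonic_increasing)  # each tic is now contiguous
print(sorted_temp)
```

Note that `temp.sort_index()` on its own does not modify `temp`; assign the result or pass `inplace=True` if the sorted order should persist.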

2) How can I get the ratio of NaNs

grpd = temp.groupby(level='tic').agg(['size', 'count'])

count = grpd.xs('count', axis=1, level=1)  # non-null values per tic and column
size = grpd.xs('size', axis=1, level=1)    # total rows per tic
null_ratio = 1 - count / size              # share of NaNs

null_ratio
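An alternative way to reach the same ratios, sketched here under the assumption that only the `price` and `shares` columns matter: the mean of a boolean `isna()` mask per group is already the NaN ratio. The sketch also assembles a table shaped like the one in the question. The names `na_flags` and `summary` and the column labels are illustrative, not from the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

cols = ['price', 'shares']

# True where a value is missing; the mean of booleans is the NaN ratio
na_flags = temp[cols].isna()
null_ratio = na_flags.groupby(level='tic').mean()

summary = pd.DataFrame({
    # NaNs summed over both columns for each tic
    'total_nans': na_flags.groupby(level='tic').sum().sum(axis=1),
    'price_nan_ratio': null_ratio['price'],
    'shares_nan_ratio': null_ratio['shares'],
})

# Transpose so that each tic becomes a column, as in the question
print(summary.T)
```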

3) Rank by the null ratio in the price column

null_ratio.price.rank()

tic
AAPL    1.0
IBM     2.0
Name: price, dtype: float64
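By default `rank()` gives rank 1 to the smallest value, so here a lower rank means fewer missing prices. A minimal sketch of the ranking, with the ratio recomputed from an `isna()` mask rather than from the `grpd` object above; `price_ratio`, `ranks` and `ordered` are names chosen for this example only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# Share of missing prices per tic
price_ratio = temp['price'].isna().groupby(level='tic').mean()

# Rank 1 = lowest NaN ratio; ties share the lowest rank of their group
ranks = price_ratio.rank(method='min')

# The same ordering, expressed as a sorted Series instead of rank numbers
ordered = price_ratio.sort_values()

print(ranks)
print(ordered)
```

Pass `ascending=False` to `rank()` if the tic with the most missing prices should come first instead.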

4) How to select the top n tics with the lowest NaN ratio in both columns

null_ratio.price.nsmallest(1)

tic
AAPL    0.333333
Name: price, dtype: float64

5) Between two dates

temp.sort_index().loc[pd.IndexSlice[:, '1990-01-01':'1990-04-01'], :]
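To tie the date slice back to the earlier steps, the sketch below restricts the frame to a date window first and then recomputes the NaN ratios on that window. It uses `pd.Timestamp` bounds instead of strings, which is a choice made for this example; `window` and `window_ratio` are illustrative names. Slicing an inner index level needs a sorted index, which is why `sort_index()` comes first.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

idx = pd.IndexSlice
start, end = pd.Timestamp('1990-01-01'), pd.Timestamp('1990-04-01')

# All tics, dates from start to end (label slices include both endpoints)
window = temp.sort_index().loc[idx[:, start:end], :]

# NaN ratio per tic, computed only on the rows inside the window
window_ratio = window[['price', 'shares']].isna().groupby(level='tic').mean()

print(window)
print(window_ratio)
```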

You can sort the index by level to get the desired order.

temp.sort_index(level='tic', inplace=True)

temp.sort_index(level=['tic', 'dates'], inplace=True)

temp_grpd = temp.groupby(level='tic')  # assumed: this definition is missing from the original

df = pd.DataFrame({
    'total_missing': temp_grpd.apply(lambda x: x['price'].isnull().sum() + x['shares'].isnull().sum()),
    'pnt_missing_price': temp_grpd.apply(lambda x: x['price'].isnull().sum() / x.shape[0]),
    'pnt_missing_shares': temp_grpd.apply(lambda x: x['shares'].isnull().sum() / x.shape[0]),
    'total_records': temp_grpd.apply(lambda x: x.shape[0]),
})

If needed, you can transpose the DataFrame to match the format shown in your post, but it will probably be easier to work with in this format.

df['pnt_missing_price'].rank(ascending=False)

This question is not clearly defined. I think you may want something like the following, but I am not sure.

df['pnt_missing'] = df['total_missing'] / df['total_records']

df.sort_values('pnt_missing', ascending=True)

# nsmallest returns the values, so select rows by the resulting index
df.loc[df['pnt_missing'].nsmallest(5).index]

You already have a good answer.

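Thanks. Regarding item 4, this gives me the nsmallest based on the price column only. How do I get the nsmallest for both price and shares? These are the two most important columns: I multiply them to get a new column, market cap. So I only need to keep the tics with less than n% NaNs in both columns.

@st19297 If security A has price and shares null ratios of 0.1 and 0.2, and security B has 0.09 and 0.21, how would you rank them? It sounds like you need to think about your question a bit more. Or use a filter. You already asked 5 questions at once. If you have a new question, please ask it in a separate post. In fact, if you search for "how to filter on two columns?" I am sure it has been asked and answered before.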
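One possible reading of the "use a filter" suggestion, sketched under the assumption that a tic should be kept only when its NaN ratio is below a threshold in both `price` and `shares`. The threshold `max_ratio` and the names `keep` and `filtered` are hypothetical and chosen for illustration; this sidesteps the ranking ambiguity raised above because no ordering across the two columns is needed.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

max_ratio = 0.4  # hypothetical threshold: keep tics with under 40% NaNs

# NaN ratio per tic for the two columns of interest
null_ratio = temp[['price', 'shares']].isna().groupby(level='tic').mean()

# Tics whose ratio is below the threshold in BOTH columns
keep = null_ratio[(null_ratio < max_ratio).all(axis=1)].index

# Rows of the original frame that belong to the surviving tics
filtered = temp[temp.index.get_level_values('tic').isin(keep)]

print(list(keep))
print(filtered)
```

If a single ordering is still wanted, one option is to rank each tic by the worse of its two ratios, for example `null_ratio.max(axis=1).nsmallest(n)`, though whether that matches the intent depends on how the two columns should be weighed.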