SQL-like summary report using Python

python pandas

I often use dplyr to generate summary reports in a single statement, like this:

a <- group_by(data,x) 
b <- summarise(a, 
                 # count distinct y where value is not missing
                 y_distinct = n_distinct(y[is.na(y) == F]),
                 # count distinct z where value is not missing
                 z_distinct = n_distinct(z[is.na(z) == F]),
                 # count total number of values
                 total = n(),
                 # count y where value not missing
                 y_not_missing = length(y[is.na(y) == F]),
                 # count y where value is missing
                 y_missing = length(y[is.na(y) == T]))

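However, being new to Python, I cannot find the pandas equivalent and got lost in the documentation. I can produce each aggregate with a separate groupby->agg statement, but I need help producing the whole report in a single DataFrame (preferably with a single statement).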

Try the following:

In [18]: df
Out[18]:
   x    y    z
0  1  2.0  NaN
1  1  3.0  NaN
2  2  NaN  1.0
3  2  NaN  2.0
4  3  4.0  5.0

In [19]: def nulls(s):
    ...:     return s.isnull().sum()
    ...:

In [23]: r = df.groupby('x').agg(['nunique','size',nulls])

In [24]: r
Out[24]:
        y                  z
  nunique size nulls nunique size nulls
x
1       2    2   0.0       0    2   2.0
2       0    2   2.0       2    2   0.0
3       1    1   0.0       1    1   0.0

To flatten the columns:

In [25]: r.columns = r.columns.map('_'.join)

In [26]: r
Out[26]:
   y_nunique  y_size  y_nulls  z_nunique  z_size  z_nulls
x
1          2       2      0.0          0       2      2.0
2          0       2      2.0          2       2      0.0
3          1       1      0.0          1       1      0.0
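For a report whose columns match the dplyr example one-to-one, named aggregation may be closer to what the question asks for. This is a minimal sketch, assuming pandas 0.25 or later (where the `name=(column, function)` form of `agg` is available); the helper `count_missing` is a hypothetical name introduced here, not part of pandas.

```python
import numpy as np
import pandas as pd

# Same sample data as in the session above
df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],
    "y": [2.0, 3.0, np.nan, np.nan, 4.0],
    "z": [np.nan, np.nan, 1.0, 2.0, 5.0],
})


def count_missing(s):
    """Hypothetical helper: number of NaN values in a Series."""
    return s.isna().sum()


# One statement, one output column per keyword argument
report = df.groupby("x").agg(
    y_distinct=("y", "nunique"),     # distinct non-missing y values
    z_distinct=("z", "nunique"),     # distinct non-missing z values
    total=("y", "size"),             # rows per group, NaN included
    y_not_missing=("y", "count"),    # non-missing y values
    y_missing=("y", count_missing),  # missing y values
)
print(report)
```

With this data, the group `x == 2` has `y_distinct` equal to 0 and `y_missing` equal to 2, because both of its `y` values are NaN.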

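I believe you need to aggregate with several functions: size to count all values, count to count non-NaN values, nunique to count unique values, and a custom function to count NaNs: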

import numpy as np
import pandas as pd

df = pd.DataFrame({'y':[4,np.nan,4,5,5,4],
                   'z':[np.nan,8,9,4,2,3],
                   'x':list('aaaabb')})

print (df)
   x    y    z
0  a  4.0  NaN
1  a  NaN  8.0
2  a  4.0  9.0
3  a  5.0  4.0
4  b  5.0  2.0
5  b  4.0  3.0

# custom aggregator: number of NaN values in each group
f = lambda x: x.isnull().sum()
# the name becomes the column label; it counts nulls, so label it accordingly
f.__name__ = 'nulls'
df = df.groupby('x').agg(['nunique', f, 'count', 'size'])
df.columns = df.columns.map('_'.join)
print (df)
   y_nunique  y_nulls  y_count  y_size  z_nunique  z_nulls  z_count  z_size
x
a          2      1.0        3       4          3      1.0        3       4
b          2      0.0        2       2          2      0.0        2       2
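The difference between these aggregators is easiest to see on a single group. The sketch below applies the equivalent Series operations to the `y` values of group `a`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# y values of group 'a' from the example above
s = pd.Series([4, np.nan, 4, 5])

n_rows = s.size            # all rows, NaN included
n_present = s.count()      # non-NaN values only
n_unique = s.nunique()     # distinct non-NaN values (dropna=True by default)
n_missing = s.isna().sum() # NaN values only

# Present and missing values always add up to the row count
assert n_present + n_missing == n_rows
print(n_rows, n_present, n_unique, n_missing)  # 4 3 2 1
```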

If you already have a SQL query, you can use the pandasql module to apply it directly to a pandas.DataFrame:

import pandasql as ps
query = """select
    x,
    count(distinct(case when y is not null then y end)) as y_distinct,
    count(distinct(case when z is not null then z end)) as z_distinct,
    count(1) as total,
    count(case when y is not null then 1 end) as y_not_missing,
    count(case when y is null then 1 end) as y_missing
from df group by x"""  # df here is the name of the DataFrame
ps.sqldf(query, locals())
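If pandasql is not available, the same idea can be sketched with only pandas and the standard-library `sqlite3` module. Note this swaps in a different technique from the answer above: the DataFrame is copied into an in-memory SQLite database with `DataFrame.to_sql` and queried with `pandas.read_sql_query`. The sample data and the simplified query (relying on SQL `count` ignoring NULLs) are assumptions made for illustration.

```python
import sqlite3

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],
    "y": [2.0, 3.0, np.nan, np.nan, 4.0],
    "z": [np.nan, np.nan, 1.0, 2.0, 5.0],
})

# count(col) and count(distinct col) skip NULLs, so no CASE is needed
query = """
select
    x,
    count(distinct y)   as y_distinct,
    count(distinct z)   as z_distinct,
    count(*)            as total,
    count(y)            as y_not_missing,
    count(*) - count(y) as y_missing
from df
group by x
order by x
"""

conn = sqlite3.connect(":memory:")
try:
    # NaN values are written to the table as NULL
    df.to_sql("df", conn, index=False)
    result = pd.read_sql_query(query, conn)
finally:
    conn.close()

print(result)
```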

Can you add a data sample?

I think you forgot the groupby.

@jezrael, yes, it's fixed now, thanks!

Does this answer mean you will always apply every aggregator to every column? Does that really scale? What if we have 100 columns and each one needs a different aggregation?

@kamashay, you can always specify the list of columns to process:

df.groupby('x')[list of cols].agg(...)

or

df[list of cols].groupby('x').agg(...)

@MaxU, but doesn't that mean splitting a single report into N groupby-agg statements (again, computational overhead), producing N mini-reports? How would you join them into a single report? Is there really no single groupby-agg statement that handles this?

Thanks! I guess your answer is similar to MaxU's, so the same comment applies: can you set different aggregators for each column?

Yes, exactly. You can use

.agg({'a': 'nunique', 'b': 'sum', ...})

or use several functions for one column, like

.agg({'y': ['nunique', f, 'count', 'size'], 'z': 'sum'})

Thanks, this might be the single-statement solution I was looking for!
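The per-column dictionary form from the last comments can be sketched end to end as follows. The data is the sample from the answer above; `nulls` is a hypothetical helper name, and every column is given a list of aggregators (even a single one) so that the result predictably has two-level column labels that can be flattened with `'_'.join`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y": [4, np.nan, 4, 5, 5, 4],
    "z": [np.nan, 8, 9, 4, 2, 3],
    "x": list("aaaabb"),
})


def nulls(s):
    """Hypothetical helper: number of NaN values in a Series."""
    return s.isna().sum()


# Different aggregators per column, still a single groupby/agg statement
report = df.groupby("x").agg({
    "y": ["nunique", nulls, "count", "size"],
    "z": ["sum"],
})

# Flatten ('y', 'nunique') -> 'y_nunique', and so on
report.columns = report.columns.map("_".join)
print(report)
```

Columns that are not listed in the dictionary are simply left out of the report, which addresses the concern about wide tables where only some columns need summarising.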