Descriptive statistics from a frequency table in pandas

python pandas

I have a frequency table of test scores:

score    count
-----    -----
  77      1105
  78       940
  79      1222
  80      4339
etc

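I'd like to display basic statistics and a box plot for the sample that the frequency table summarizes. (For example, the mean of the example above is 79.16 and the median is 80.)

Is there a way to do this in pandas? All the examples I've seen assume a table of individual cases.

I suppose I could generate a list of individual scores, like this --

-- but I'd like to avoid that; the total frequency in the real, non-toy data set runs well into the billions.

Thanks for any help.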

(I think this is different from the question about applying weights to individual cases.)
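As a quick check on the figures quoted in the question, the sketch below computes the mean and median directly from the counts, without expanding them into individual scores. It assumes the four rows shown are the whole table (the original continues with "etc"), and the variable names are illustrative only.

```python
import numpy as np

# Only the four rows shown in the question; the real table continues ("etc").
scores = np.array([77, 78, 79, 80])
counts = np.array([1105, 940, 1222, 4339])

# Weighted mean: sum(score * count) / sum(count)
mean = np.average(scores, weights=counts)

# Median: the first score whose cumulative count reaches half of the total
cumulative = np.cumsum(counts)
median = scores[np.searchsorted(cumulative, counts.sum() / 2)]

print(round(float(mean), 2), int(median))
```

For these four rows the result agrees with the mean of 79.16 and median of 80 stated in the question.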

Here is a small function that computes descriptive statistics for a frequency distribution:

# from __future__ import division (for Python 2)
import numpy as np
import pandas as pd

def descriptives_from_agg(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # Sort by value so the cumulative counts below are in ascending order
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    # Population variance via E[x^2] - E[x]^2
    variance = ((freqs * values**2).sum() / count) - mean**2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    minimum = np.min(values)
    maximum = np.max(values)
    # Quartiles: first value whose cumulative count reaches the given fraction
    # (no interpolation, so results can differ from Series.describe())
    cumcount = np.cumsum(freqs)
    Q1 = values[np.searchsorted(cumcount, 0.25*count)]
    Q2 = values[np.searchsorted(cumcount, 0.50*count)]
    Q3 = values[np.searchsorted(cumcount, 0.75*count)]
    idx = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    result = pd.Series([count, mean, std, minimum, Q1, Q2, Q3, maximum], index=idx)
    return result

Demo:

np.random.seed(0)

val = np.random.normal(100, 5, 1000).astype(int)

pd.Series(val).describe()
Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64

vc = pd.Series(val).value_counts()  # the top-level pd.value_counts is deprecated in recent pandas
descriptives_from_agg(vc.index, vc.values)

Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64

Note that this does not handle NaN and has not been properly tested.
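The question also asks for a box plot. One possible route, sketched here, is to compute the five-number summary from the counts and pass it to matplotlib's `Axes.bxp`, which draws a box from precomputed statistics. The helper names `bxp_stats_from_freqs` and `plot_box` are made up for this example, the quartiles use the same cumulative-count rule as the function above, and the whiskers are assumed to run to the observed minimum and maximum with no outliers drawn.

```python
import numpy as np

def bxp_stats_from_freqs(values, freqs, label="score"):
    """Build the statistics dict that matplotlib's Axes.bxp expects."""
    values = np.asarray(values)
    freqs = np.asarray(freqs)
    # Sort by value so cumulative counts are in ascending order
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]
    cum = np.cumsum(freqs)
    total = cum[-1]
    # First value whose cumulative count reaches each fraction (no interpolation)
    q1, med, q3 = (values[np.searchsorted(cum, q * total)] for q in (0.25, 0.5, 0.75))
    # Assumption: whiskers span min..max and no fliers are shown
    return {"label": label, "q1": q1, "med": med, "q3": q3,
            "whislo": values[0], "whishi": values[-1], "fliers": []}

def plot_box(stats):
    # Imported lazily so the helper above works without matplotlib installed
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.bxp([stats], showfliers=False)
    return fig

# The four rows shown in the question
stats = bxp_stats_from_freqs([77, 78, 79, 80], [1105, 940, 1222, 4339])
```

Calling `plot_box(stats)` returns a figure that can be shown or saved. Because only the summary statistics are passed, memory use does not depend on the total frequency.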

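In my original question I said I didn't want to reconstruct the raw values from the frequency table, but as long as the result fits in memory I now think I'll go that route, especially since my actual use case involves more columns.

In case anyone is interested, here is my function for converting a frequency table into cases: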

In [5]: def freqs2cases(df, freq_col, cases_cols):
   ...:     def itcases():
   ...:         for i, row in df.iterrows():
   ...:             for j in range(int(row[freq_col])):
   ...:                 yield row[cases_cols]
   ...:     return pd.DataFrame(itcases())
   ...: 

In [8]: freq_df
Out[8]: 
  course  score  freq
0   math     75     3
1   math     81     4
2   chem     92     2
3   chem     66     3

In [9]: freqs2cases(freq_df, 'freq', ['course', 'score'])
Out[9]: 
  course  score
0   math     75
0   math     75
0   math     75
1   math     81
1   math     81
1   math     81
1   math     81
2   chem     92
2   chem     92
3   chem     66
3   chem     66
3   chem     66
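For larger tables, the row-by-row generator can be slow. A possible vectorized alternative, sketched below, repeats each index label according to its frequency and then selects those rows. The name `freqs2cases_repeat` is hypothetical, and the sketch assumes the frame has a unique index and non-negative integer frequencies.

```python
import pandas as pd

def freqs2cases_repeat(df, freq_col, cases_cols):
    """Expand a frequency table into one row per case without a Python-level loop."""
    # Repeat each index label as many times as its frequency
    repeated = df.index.repeat(df[freq_col].astype(int).to_numpy())
    # Selecting by the repeated labels duplicates the rows (assumes a unique index)
    return df.loc[repeated, cases_cols]

freq_df = pd.DataFrame({
    "course": ["math", "math", "chem", "chem"],
    "score": [75, 81, 92, 66],
    "freq": [3, 4, 2, 3],
})
cases = freqs2cases_repeat(freq_df, "freq", ["course", "score"])
print(cases)
```

The result keeps the original index labels, just as the `iterrows` version does, so `reset_index(drop=True)` can be applied afterwards if a fresh index is wanted.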

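You can do the following:

Using groupby, you can split the data on the 'score' column.

For each group, you can append [score] * count to a list.

sum_add is then a list of lists, so you can flatten it with itertools.chain.

Wrapping the result in pd.Series() lets you call .describe().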

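I think this may be the same as the question I linked: you need weighted descriptive statistics of the score column, with the weights given by the count column. Alas, I don't think that question has a satisfactory answer.

I agree that they ask very similar questions, but I don't know how the SAS proc works, so I'll post my answer here, since it might not meet those requirements.

Thanks! Your quick response saved me another two hours of trying to find a built-in way to do this.

Hi, welcome to Stack Overflow. When answering a question that already has many answers, please be sure to add some additional insight into why the response you're providing is substantive and not simply echoing what the original poster has already reviewed. This is especially important in "code-only" answers such as the one you've provided.

Thank you for the advice. My English isn't fluent, but I'll do my best.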

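import itertools
import pandas as pd

# df is the frequency table from the question, with 'score' and 'count' columns
sum_add = []
for idx, grp in df.groupby('score'):
    # repeat each score as many times as its count
    sum_add.append(list(grp['score']) * int(grp['count'].iloc[0]))
pd.Series(list(itertools.chain.from_iterable(sum_add))).describe()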