Python/Pandas: computing the mean, median, mode, and standard deviation of grouped data
I have the following data:
[4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
I need to build a count/frequency table from the data above, like this:
4.1 - 4.5: 8
4.6 - 5.0: 4
5.1 - 5.5: 10
5.6 - 6.0: 6
6.1 - 6.5: 7
6.6 - 7.0: 5
The closest I have been able to get is:
            counts  freqs
categories
[4.1, 4.6)       8  0.200
[4.6, 5.1)       4  0.100
[5.1, 5.6)      10  0.250
[5.6, 6.1)       6  0.150
[6.1, 6.6)       7  0.175
[6.6, 7.1)       5  0.125
using this code:
import pandas as pd
sr = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
ncut = pd.cut(sr, [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1],right=False)
srpd = pd.DataFrame(ncut.describe())
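I need to create a new column holding the midpoint of each "categories" value. For example, "[4.1, 4.6)" covers the counts/frequencies of the data from 4.1 to 4.5 (4.6 is excluded), so I need (4.1 + 4.5) / 2, which equals 4.3.
Here are my questions:
1) How can I access the values under the "categories" index so that I can use them in the calculation above?
2) Is there a way to display the ranges like this: 4.1-4.5, 4.6-5.0, and so on?
3) For grouped data like this, is there a simpler way to compute the mean, median, mode, etc.? Or do I have to write my own Python functions for it?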
Thanks.
I came up with a workaround of my own:
def buildFreqTable(data, width, numclass, pw):
    # Work on a sorted copy so the caller's list is left untouched
    data = sorted(data)
    minrange = []
    maxrange = []
    x_med = []
    count = []
    # Since data is sorted, the lowest value starts the first range
    f_data = data[0]
    for i in range(numclass):
        # minrange holds the minimum value for that row
        minrange.append(f_data)
        # maxrange holds the maximum value for that row; rounding guards
        # against floating-point drift (10 decimals is an arbitrary choice)
        maxrange.append(round(f_data + (width - pw), 10))
        # Midpoint of the range
        x_med.append(round((minrange[i] + maxrange[i]) / 2, 10))
        # Initialize the count for this class to 0; it is incremented below
        count.append(0)
        f_data = round(f_data + width, 10)
    # Tally the frequencies (numclass replaces the hard-coded 6)
    for x in data:
        for i in range(numclass):
            if minrange[i] <= x <= maxrange[i]:
                count[i] += 1
    # Now, create the pandas dataframe for easier manipulation
    freqtable = pd.DataFrame()
    freqtable['minrange'] = minrange
    freqtable['maxrange'] = maxrange
    freqtable['x'] = x_med
    freqtable['count'] = count
    # Return the table so the caller can actually use it
    return freqtable

freqtable = buildFreqTable(sr, 0.5, 6, 0.1)
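Once a table of class midpoints and counts exists, the usual textbook estimates for grouped data can be computed from it. Below is a minimal sketch, not part of the original answer. It assumes the limits and counts from the table in the question, a class width of 0.5, and class boundaries halfway between adjacent limits (so 0.05 below each lower limit). The interpolation formulas for the median and mode are the standard grouped-data ones, and the variable names are made up for illustration.

```python
import pandas as pd

# Frequency table from the question (limits and counts typed in by hand)
freqtable = pd.DataFrame({
    'lower': [4.1, 4.6, 5.1, 5.6, 6.1, 6.6],
    'upper': [4.5, 5.0, 5.5, 6.0, 6.5, 7.0],
    'count': [8, 4, 10, 6, 7, 5],
})
freqtable['x'] = (freqtable['lower'] + freqtable['upper']) / 2  # class midpoints

width = 0.5      # assumed class width
half_gap = 0.05  # assumed: half the gap between one upper limit and the next lower limit
f = freqtable['count']
n = f.sum()

# Grouped mean and sample standard deviation, treating every value as its class midpoint
grouped_mean = (f * freqtable['x']).sum() / n
grouped_std = ((f * (freqtable['x'] - grouped_mean) ** 2).sum() / (n - 1)) ** 0.5

# Grouped median: interpolate inside the first class whose cumulative count reaches n/2
cum = f.cumsum()
m = int((cum >= n / 2).idxmax())
cf_before = cum[m - 1] if m > 0 else 0
grouped_median = (freqtable['lower'][m] - half_gap) + (n / 2 - cf_before) / f[m] * width

# Grouped mode: interpolate inside the class with the highest count
k = int(f.idxmax())
d1 = f[k] - (f[k - 1] if k > 0 else 0)
d2 = f[k] - (f[k + 1] if k < len(f) - 1 else 0)
grouped_mode = (freqtable['lower'][k] - half_gap) + d1 / (d1 + d2) * width

print(grouped_mean, grouped_median, grouped_mode, grouped_std)
```

These are estimates from the binned table only. When the raw values are still available, computing the statistics on them directly is more accurate.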
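That said, I am still curious whether there is a simpler way to do this, or whether someone could refactor my code into something more "professional". Thanks.
Regarding your bins and labels question, how about the following: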
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
labels = ['{}-{}'.format(x, y-.1) for x, y in zip(bins[:], bins[1:])]
Then, instead of keeping the values in a list, put them in a Series:
sr = pd.Series([4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8])
ncut = pd.cut(sr, bins=bins, labels=labels, right=False)
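Define a lambda function to compute the relative frequency of each bin:
freq = lambda x: len(x) / len(sr)  # bin count divided by the total number of values
freq.__name__ = 'freq'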
Finally, use concat, groupby, and agg to get summary statistics for each bin:
pd.concat([ncut, sr], axis=1).groupby(0).agg(['size', 'std', 'mean', freq])
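For the first question (reading the values under "categories"), one option is to leave out `labels` so that the categories stay as an `IntervalIndex`, whose `left` and `right` attributes hold the bin edges. A minimal sketch follows. It assumes the data is recorded to one decimal place, so the inclusive upper limit of each class is `right - 0.1`. The column names are made up for illustration.

```python
import numpy as np
import pandas as pd

data = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
        5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
        5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
        6.7, 6.7, 6.8, 6.8]
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]

# Without labels, the categories are the intervals themselves
cut = pd.cut(pd.Series(data), bins=bins, right=False)
intervals = cut.cat.categories  # IntervalIndex of [4.1, 4.6), [4.6, 5.1), ...

table = pd.DataFrame({
    'lower': intervals.left.to_numpy(),
    # Assumption: one-decimal data, so the last value inside [a, b) is b - 0.1
    'upper': (intervals.right.to_numpy() - 0.1).round(1),
    # Category codes run 0..len(intervals)-1, so bincount tallies each bin in order
    'count': np.bincount(cut.cat.codes, minlength=len(intervals)),
})
table['x'] = ((table['lower'] + table['upper']) / 2).round(2)  # class midpoint
table['freq'] = table['count'] / table['count'].sum()
print(table)
```

Keeping the intervals and building the display labels afterwards avoids having to parse strings such as "4.1-4.5" back into numbers.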
Let's try:
l = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
6.7, 6.7, 6.8, 6.8]
s = pd.Series(l)
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
# Python 3.6+ f-string
labels = [f'{i}-{j-.1}' for i, j in zip(bins, bins[1:])]
(pd.concat([pd.cut(s, bins=bins, labels=labels, right=False), s], axis=1)
   .groupby(0)[1]
   .agg(['mean', 'median', pd.Series.mode, 'std'])
   .rename_axis('categories')
   .reset_index())
Output:
  categories      mean  median        mode       std
0    4.1-4.5  4.250000    4.25         4.1  0.151186
1    4.6-5.0  4.725000    4.70         4.6  0.150000
2    5.1-5.5  5.280000    5.30         5.3  0.131656
3    5.6-6.0  5.700000    5.65         5.6  0.126491
4    6.1-6.5  6.314286    6.30         6.2  0.121499
5    6.6-7.0  6.720000    6.70  [6.7, 6.8]  0.083666
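The last row shows that a bin can have more than one mode, which leaves the mode column with a mix of scalars and arrays. As a possible variation (not from the answers above), the sketch below groups the Series by the cut result directly, uses named aggregation, and joins tied modes into a string. The helper `modes_as_text`, the column names, and the `.1f` label formatting are assumptions made for illustration.

```python
import pandas as pd

values = pd.Series([4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
                    5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
                    5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
                    6.7, 6.7, 6.8, 6.8], name='value')
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
# Format the upper limit to one decimal so float noise never leaks into a label
labels = [f'{lo}-{hi - 0.1:.1f}' for lo, hi in zip(bins, bins[1:])]
groups = pd.cut(values, bins=bins, labels=labels, right=False).rename('categories')


def modes_as_text(x):
    """Hypothetical helper: return every tied mode of the bin as one string."""
    return ', '.join(str(v) for v in x.mode())


summary = values.groupby(groups, observed=False).agg(
    n='count', mean='mean', median='median', std='std', mode=modes_as_text)
# Relative frequency of each bin
summary['freq'] = summary['n'] / summary['n'].sum()
print(summary)
```

Returning a string keeps the mode column a single type. If the numeric modes are needed later, keeping `pd.Series.mode` as in the answer above may be the better choice.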