Python: How do I create a box plot from data with weights?
Tags: python, pandas, dataframe, data-visualization
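I have the following data: Name, the number of times the name occurred (Count), and a Score for each name. I want to create a box-and-whisker plot of Score in which each name's Score is weighted by its Count.
The result should be the same as if I had the data in raw (not frequency) form. However, I don't want to actually convert the data into that form, because its size would blow up very quickly.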
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = {
"Name":['Sara', 'John', 'Mark', 'Peter', 'Kate'],
"Count":[20, 10, 5, 2, 5],
"Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)
print(df)
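I don't know how to tackle this in Python. Any help is appreciated.

There are two ways to answer this question. The first is the one you might expect, but it is not a good solution, because it does not compute the confidence interval of the median. matplotlib does that with the code below (an excerpt from matplotlib/cbook/__init__.py), which works on the sample data itself. The second approach is therefore better than any custom code, because it relies on library code that is well tested.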
def boxplot_stats(X, whis=1.5, bootstrap=None, labels=None,
                  autorange=False):
    def _bootstrap_median(data, N=5000):
        # determine 95% confidence intervals of the median
        M = len(data)
        percentiles = [2.5, 97.5]
        bs_index = np.random.randint(M, size=(N, M))
        bsData = data[bs_index]
        estimate = np.median(bsData, axis=1, overwrite_input=True)
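The excerpt above resamples raw observations. As a rough sketch only, the same kind of bootstrap can be run on frequency data without expanding it: drawing n observations with replacement from the raw data is the same as drawing new counts from a multinomial distribution. The helper names `weighted_median` and `bootstrap_median_ci` are hypothetical and not part of matplotlib, and the median used here is the lower weighted median rather than the interpolated `np.median`, so the interval may differ slightly from what matplotlib reports.

```python
import numpy as np


def weighted_median(values, counts):
    """Lower median of frequency data; `values` must already be sorted."""
    cum = np.cumsum(counts)
    # first value whose cumulative count reaches half of the total
    return values[np.searchsorted(cum, 0.5 * cum[-1])]


def bootstrap_median_ci(values, counts, n_boot=2000, seed=0):
    """95% bootstrap interval for the median, computed from counts only."""
    values = np.asarray(values)
    counts = np.asarray(counts)
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    n = counts.sum()
    rng = np.random.default_rng(seed)
    # each row is one bootstrap sample, expressed as resampled counts
    resampled = rng.multinomial(n, counts / n, size=n_boot)
    medians = np.array([weighted_median(values, c) for c in resampled])
    return np.percentile(medians, [2.5, 97.5])


lo, hi = bootstrap_median_ci([2, 4, 7, 8, 7], [20, 10, 5, 2, 5])
print(lo, hi)
```

The two numbers could be passed as 'cilo' and 'cihi' in a stats dictionary if notches are wanted, with the caveat above about the median definition.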
First:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)
print(df)


def boxplot(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # sort the values and keep the frequencies aligned with them
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    variance = ((freqs * values ** 2).sum() / count) - mean ** 2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    cumcount = np.cumsum(freqs)
    print([std, variance])
    # weighted quartiles: first value whose cumulative count reaches the target
    Q1 = values[np.searchsorted(cumcount, 0.25 * count)]
    Q2 = values[np.searchsorted(cumcount, 0.50 * count)]
    Q3 = values[np.searchsorted(cumcount, 0.75 * count)]
    # interquartile range (IQR): difference between the 75th and 25th percentiles
    IQR = Q3 - Q1
    # whisker limits are 1.5 * IQR above Q3 and below Q1; each whisker ends at
    # the most extreme data value inside the limits, anything outside is a flier
    hi_limit = Q3 + 1.5 * IQR
    lo_limit = Q1 - 1.5 * IQR
    inside = values[(values >= lo_limit) & (values <= hi_limit)]
    fliers = values[(values < lo_limit) | (values > hi_limit)]
    stats = [{
        'label': 'Scores',       # tick label for the boxplot
        'mean': mean,            # arithmetic mean value
        'iqr': IQR,              # interquartile range
        # 'cilo' and 'cihi' (notch limits around the median) are not computed here
        'whishi': inside.max(),  # end of the upper whisker
        'whislo': inside.min(),  # end of the lower whisker
        'fliers': fliers,        # outliers
        'q1': Q1,                # first quartile (25th percentile)
        'med': Q2,               # 50th percentile
        'q3': Q3                 # third quartile (75th percentile)
    }]
    fs = 10  # fontsize
    _, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6))
    axes.bxp(stats)
    axes.set_title('Default', fontsize=fs)
    plt.show()


boxplot(df['Score'], df['Count'])
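The quartile lookup in the function above picks an observed value and does not interpolate, so it can differ from what `np.percentile` returns on the expanded data. The following is a minimal sketch, assuming integer counts, of a percentile on frequency data that mirrors NumPy's default linear interpolation; `freq_percentile` is a hypothetical helper name, not a NumPy or matplotlib function.

```python
import numpy as np


def freq_percentile(values, counts, q):
    """Percentile q (0-100) of frequency data, with linear interpolation."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts)
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    cum = np.cumsum(counts)
    # fractional position in the (never materialised) sorted raw data
    pos = q / 100 * (cum[-1] - 1)
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    # raw index k belongs to the first value whose cumulative count exceeds k
    v_lo = values[np.searchsorted(cum, lo, side='right')]
    v_hi = values[np.searchsorted(cum, hi, side='right')]
    return v_lo + (v_hi - v_lo) * (pos - lo)


scores = [2, 4, 7, 8, 7]
counts = [20, 10, 5, 2, 5]
raw = np.repeat(scores, counts)  # expanded here only to check the result
for q in (25, 50, 75):
    print(q, freq_percentile(scores, counts, q), np.percentile(raw, q))
```

The expansion with `np.repeat` is only there for the comparison; the helper itself works on the counts.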
Second:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cbook

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)
print(df)

labels = ['Scores']
# expand each score by its count to get the raw observations
data = df['Score'].repeat(df['Count']).tolist()
# compute the boxplot stats
stats = cbook.boxplot_stats(data, labels=labels, bootstrap=10000)
print(['stats :', stats])

fs = 10  # fontsize
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6))
axes.bxp(stats)
axes.set_title('Boxplot', fontsize=fs)
plt.show()
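I'm late to this question, but in case anyone else comes across it, this may be useful.
When the weights are integers, you can use reindex to expand the rows by their count and then call boxplot directly. I have done this on dataframes where a few thousand rows turned into a few hundred thousand without running into memory problems, especially when the reindexed dataframe is created inside a second function, so that it is never kept bound to a variable.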
import pandas as pd
import seaborn as sns

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)


def reindex_df(df, weight_col):
    """expand the dataframe to prepare for resampling
    result is 1 row per count per sample"""
    df = df.reindex(df.index.repeat(df[weight_col]))
    df.reset_index(drop=True, inplace=True)
    return df


# keep the original df untouched so it can be reused below
expanded = reindex_df(df, weight_col='Count')
sns.boxplot(x='Name', y='Score', data=expanded)
Or, if you are concerned about memory:
def weighted_boxplot(df, weight_col):
    sns.boxplot(x='Name',
                y='Score',
                data=reindex_df(df, weight_col=weight_col))


weighted_boxplot(df, 'Count')
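If only one overall box for Score is needed, a lighter variant is to expand just that column instead of the whole dataframe. This is a sketch under that assumption rather than part of the answer above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7],
})

# expand only the Score column: one entry per counted observation
scores = np.repeat(df["Score"].to_numpy(), df["Count"].to_numpy())

# for comparison, the fully expanded dataframe (all columns repeated)
expanded = df.loc[df.index.repeat(df["Count"])]

print(len(scores), scores.nbytes)
print(len(expanded), expanded.memory_usage(deep=True).sum())
```

The `scores` array can then be handed to `plt.boxplot(scores)` or `sns.boxplot(y=scores)` to draw a single box for the whole data set.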
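Interesting. Actually, I'm new to weighting. Is this essentially what passing a weights array does?
@WhiteTie I added print(df) after sns.boxplot(…) to help you understand the dataframe.
@WhiteTie The dataframe keeps its data in a Python dictionary. If you look at the source, class DataFrame(NDFrame): in pandas/core/frame.py, you'll see it.
Edit: I mean that I want a box plot not at the name level but at the aggregate level: a single box-and-whisker plot showing the mean, median, Q25 and so on. In other words, I want to summarize the whole data set. This shows something different. For example, the following gives the desired mean, but I'm still not sure how to create a box plot from it: desired_mean = sum(df['Count'] * df['Score']) / sum(df['Count'])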
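The weighted mean from the comment above can also be written with `np.average`, which accepts a `weights` argument. A small sketch using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7],
})

# explicit form: sum of count * score divided by the total count
desired_mean = (df['Count'] * df['Score']).sum() / df['Count'].sum()

# same quantity via NumPy's weighted average
np_mean = np.average(df['Score'], weights=df['Count'])

print(desired_mean, np_mean)
```

Both expressions give the mean of the expanded data without building it.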