Python: overlaying multiple histograms using pandas


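I have two or three CSV files with the same header and would like to draw the histograms for each column on the same plot, overlaying one another.

The code below gives two separate figures, each containing all the histograms for one file. Is there a compact way to plot them together on the same figure using pandas/matplotlib? I imagine something similar, but using dataframes.

Code:


which gives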

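The main problem of overlaying the histograms of two (or more) dataframes containing the same variables, in side-by-side plots within a single figure, has already been solved in the answer by Phillip Cloud.

This answer provides a solution to the issue raised by the author of the question (in the comments on the accepted answer): how to enforce the same number of bins and the same range for the variables common to both dataframes. This can be done by creating a list of bins shared by all variables of both dataframes. In fact, this answer goes a step further and adjusts the plots for cases where the different variables in each dataframe cover slightly different ranges (but still within the same order of magnitude), as in the following example: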

import numpy as np                   # v 1.19.2
import pandas as pd                  # v 1.1.3
import matplotlib.pyplot as plt      # v 3.3.2
from matplotlib.lines import Line2D

# Set seed for random data
rng = np.random.default_rng(seed=1)

# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
                        var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
                        var2 = rng.normal(loc=50, scale=10, size=df2_size)))

# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]

# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)

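It is worth noting that the seaborn package provides a more convenient way to create this kind of plot where, contrary to pandas, the bins are aligned automatically. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before: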

import seaborn as sns    # v 0.11.0

# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')

# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks to the 'common_bins' parameter of the seaborn histplot function
# (called here with 'kind='hist'') that is set to True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default, this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as it is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
                element='step', bins=bin_edges, fill=False, height=4,
                facet_kws=dict(sharex=False, sharey=False))

# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version) which is why this is done in the following loop,
# but note that you still need to set 'sharex=False' in displot, or else
# 'ax.set_xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)

# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)

plt.show()
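
To make the reshaping step used above easier to follow, here is a minimal sketch of the same `pd.concat` / `melt` calls applied to two tiny dataframes. The values are made up for illustration only; the column names `level_0`, `var_id` and `value` come from the calls themselves.

```python
import pandas as pd

# Two toy dataframes with the same columns (illustrative values only)
df1 = pd.DataFrame({"var1": [0.1, 0.2], "var2": [40.0, 41.0]})
df2 = pd.DataFrame({"var1": [0.3, 0.4, 0.5], "var2": [50.0, 51.0, 52.0]})

# Stack the dataframes; 'keys' labels each row with its source dataframe,
# and reset_index(level=0) turns that label into a column named 'level_0'
df_concat = pd.concat([df1, df2], keys=["df1", "df2"]).reset_index(level=0)

# Long format: one row per (source dataframe, variable, value)
df_melt = df_concat.melt(id_vars="level_0", var_name="var_id")

print(df_melt)
```

Each row of `df_melt` now carries its source dataframe in `level_0` (used for `hue`) and its variable name in `var_id` (used for `col`).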

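As you may notice, the histogram lines are cut off at the limits of the list of bin edges (not visible on the maximum side because of the scale). To get lines more similar to those in the pandas example, an empty bin can be added at each end of the bin list, like this: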

bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)

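This example also illustrates the limitation of this approach, which sets common bins for both facets. Because var1 and var2 cover somewhat different ranges and the 30 bins are spread over the combined range, the histogram of var1 contains rather few bins while the histogram of var2 contains slightly more bins than necessary. As far as I can tell, there is no straightforward way to assign a different list of bins to each facet when calling the plotting functions. So for cases where the variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or another plotting library.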
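
One possible way to build such a figure directly with matplotlib is sketched below. It is only an illustration under stated assumptions: the helper name `overlay_hists` is hypothetical, the data are regenerated with the same distributions as above, and the bin edges are computed separately for each variable from the pooled values of all dataframes with `np.histogram_bin_edges`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same kind of random data as in the examples above
rng = np.random.default_rng(seed=1)
df1 = pd.DataFrame({"var1": rng.exponential(scale=1.0, size=1000),
                    "var2": rng.normal(loc=40, scale=5, size=1000)})
df2 = pd.DataFrame({"var1": rng.exponential(scale=2.0, size=2000),
                    "var2": rng.normal(loc=50, scale=10, size=2000)})


def overlay_hists(frames, labels, nbins=30):
    """Hypothetical helper: one subplot per column, one step histogram per
    dataframe, with bin edges computed per column from the pooled values."""
    columns = list(frames[0].columns)
    fig, axs = plt.subplots(1, len(columns), figsize=(10, 4), squeeze=False)
    edges_by_col = {}
    for ax, col in zip(axs[0], columns):
        # Pool the values of this column across all dataframes so that the
        # bins are shared between dataframes but specific to the variable
        pooled = np.concatenate([f[col].to_numpy() for f in frames])
        edges = np.histogram_bin_edges(pooled, bins=nbins)
        edges_by_col[col] = edges
        for frame, label in zip(frames, labels):
            ax.hist(frame[col], bins=edges, histtype="step",
                    linewidth=2, alpha=0.7, label=label)
        ax.set_title(col)
    axs[0, -1].legend(frameon=False)
    return fig, edges_by_col


fig, edges_by_col = overlay_hists([df1, df2], ["df1", "df2"])
```

Calling `plt.show()` afterwards displays the figure. Each variable gets its own 30 bins, so var1 is no longer squeezed into the few bins that happen to fall within its range.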

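Cool. That looks like what I wanted to see! Is there a way to force the same number of bins (and range) for both dataframes? I am manually setting 20 bins for my df, but df2 may have a different range, so the image looks odd.

The Series hist method (the type of the values) can be called with a bins keyword argument. I'll add this to the answer.

Nice. I hadn't thought of that. However, because the ranges differ, I still don't get what I want (i.e.). I guess it might not be that simple.

Oh, I see. I think you want shared x-axes, and possibly y-axes. Check out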
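
The suggestion in the comments can be sketched roughly as follows. This is a minimal illustration with made-up data, not the commenter's code: it assumes that extra keyword arguments such as `range` are forwarded by `Series.hist` to matplotlib's `hist`, and uses the combined min/max so that both Series get the same 20 bins.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Two Series standing in for the same column of two CSV files (made-up data)
rng = np.random.default_rng(seed=1)
s1 = pd.Series(rng.normal(loc=40, scale=5, size=1000), name="df1")
s2 = pd.Series(rng.normal(loc=50, scale=10, size=2000), name="df2")

# Common range covering both Series, so that the 20 bins line up
lo = min(s1.min(), s2.min())
hi = max(s1.max(), s2.max())

# Draw both histograms on the same axes
fig, ax = plt.subplots(figsize=(6, 4))
for s in (s1, s2):
    s.hist(ax=ax, bins=20, range=(lo, hi), alpha=0.5, label=s.name, grid=False)
ax.legend(frameon=False)
```

Without the shared `range`, each Series would get 20 bins over its own min/max, which is what makes the overlaid bars look misaligned.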
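# Continuation of the first (pandas) example: uses df1, df2, bin_edges, step,
# vars_min and vars_max as defined there.
# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability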
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
               linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
         linewidth=lw, alpha=alpha, label='df2')

# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
         for h in handles]
plt.legend(lines, labels, frameon=False)

plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()