Python: Overlaying multiple histograms using pandas

python matplotlib statistics pandas

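I have two or three csv files with the same header and would like to draw the histograms for each column overlaying one another on the same plot.

The following code gives me two separate figures, each containing all the histograms for one of the files. Is there a compact way to plot them together on the same figure using pandas/matplotlib? I imagine something similar, but using dataframes.

Code:

gives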

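The main issue of overlaying the histograms of two (or more) dataframes containing the same variables in side-by-side plots within a single figure has already been solved in the answer by Phillip Cloud.

This answer provides a solution to the issue raised by the author of the question (in the comments on the accepted answer) regarding how to enforce the same number of bins and the same range for the variables common to both dataframes. This can be accomplished by creating a list of bin edges common to all variables of both dataframes. In fact, this answer goes a bit further by adjusting the plots for cases where the variables contained in each dataframe cover slightly different ranges (but still within the same order of magnitude), as in the following example: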

import numpy as np                   # v 1.19.2
import pandas as pd                  # v 1.1.3
import matplotlib.pyplot as plt      # v 3.3.2
from matplotlib.lines import Line2D

# Set seed for random data
rng = np.random.default_rng(seed=1)

# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
                        var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
                        var2 = rng.normal(loc=50, scale=10, size=df2_size)))

# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]

# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)

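It is worth noting that the seaborn package provides a more convenient way to create this kind of plot: contrary to pandas, the bins are aligned automatically. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before: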

import seaborn as sns    # v 0.11.0

# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')

# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks to the 'common_bins' parameter of the seaborn histplot function
# (called here with 'kind='hist'') that is set to True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default, this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as it is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
                element='step', bins=bin_edges, fill=False, height=4,
                facet_kws=dict(sharex=False, sharey=False))

# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version) which is why this is done in the following loop,
# but note that you still need to set 'sharex=False' in displot, or else
# 'ax.set_xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)

# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)

plt.show()
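To make the reshaping step in the seaborn example easier to follow, here is a minimal sketch on two tiny frames. The frame names `a` and `b` and their values are made up for illustration; only the `concat`/`reset_index`/`melt` chain mirrors the example.

```python
import pandas as pd

# Two tiny frames with the same columns (hypothetical values)
a = pd.DataFrame({"var1": [1.0, 2.0], "var2": [10.0, 20.0]})
b = pd.DataFrame({"var1": [3.0], "var2": [30.0]})

# 'keys' adds an outer index level naming the source frame; because that
# level is unnamed, reset_index(level=0) turns it into a column 'level_0'
wide = pd.concat([a, b], keys=["a", "b"]).reset_index(level=0)

# melt stacks the variable columns: one row per (source frame, variable, value)
long_df = wide.melt(id_vars="level_0", var_name="var_id")
print(long_df)
```

The resulting frame has one row per observation and variable, which is the layout `displot` needs in order to map `hue` to the source frame and `col` to the variable.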

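As you may notice in the seaborn figure, the histogram lines are cut off at the limits of the list of bin edges (not visible on the maximum side because of the scale). To get lines more similar to those in the pandas example, an empty bin can be added at each end of the list of bins, like this: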

bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)

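This example also illustrates the limits of setting common bins for both facets. Because the ranges of var1 and var2 differ somewhat and 30 bins are used to cover the combined range, the histogram for var1 contains quite few bins, while the histogram for var2 contains slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions (`df.hist()` and `sns.displot()` in the examples here). So for cases where the variables cover significantly different ranges, these figures have to be created from scratch using matplotlib or some other plotting library.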
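As a rough illustration of the "from scratch" route, the sketch below builds the figure directly with matplotlib and computes a separate list of bin edges for each variable. It is only one possible approach, not part of the original answer. The nested `data` dict and the `edges_by_var` name are assumptions introduced here, and the random samples simply imitate the dataframes used in the examples.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# Hypothetical stand-in for the columns of df1 and df2: {variable: {source: values}}
data = {
    "var1": {"df1": rng.exponential(1.0, 1000), "df2": rng.exponential(2.0, 2000)},
    "var2": {"df1": rng.normal(40, 5, 1000), "df2": rng.normal(50, 10, 2000)},
}
nbins = 30

fig, axs = plt.subplots(1, len(data), figsize=(10, 4))
edges_by_var = {}
for ax, (var, samples) in zip(axs, data.items()):
    # Bin edges are computed per variable from the pooled values of both
    # sources, so the two histograms in a subplot stay aligned with each other
    pooled = np.concatenate(list(samples.values()))
    edges = np.linspace(pooled.min(), pooled.max(), nbins + 1)
    edges_by_var[var] = edges
    for label, values in samples.items():
        ax.hist(values, bins=edges, histtype="step", linewidth=2,
                alpha=0.7, label=label)
    ax.set_title(var)
axs[-1].legend(frameon=False)
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or save the figure. Each subplot now uses all 30 bins over its own range, at the cost of the bins no longer being comparable across variables.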

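Cool. That looks like what I'd like to see! Is there a way to enforce the same number of bins (and range) for both dataframes? I'm manually setting 20 bins for my df, but df2 may have a different range, so the image looks weird.

The `Series` `hist` method (the type of `values`) can be called with a `bins` keyword argument. I'll add that to the answer.

Nice. I didn't think of that. However, I'm still not getting what I want (i.e. …) because of the different ranges. I guess it may not be so simple to do this.

Oh, I see. I think you want shared x and possibly y axes. Check out …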
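The `bins` suggestion from the comments can be sketched as follows: two Series are drawn on the same axes with one list of bin edges spanning both ranges. The series `s1` and `s2`, their random values, and the choice of 20 bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
# Hypothetical columns taken from two different dataframes
s1 = pd.Series(rng.normal(40, 5, 1000))
s2 = pd.Series(rng.normal(50, 10, 2000))

# One list of 21 edges (20 bins) covering the combined range of both series
edges = np.linspace(min(s1.min(), s2.min()), max(s1.max(), s2.max()), 21)

fig, ax = plt.subplots()
# Passing the same edges to both calls forces identical bins and range
s1.hist(ax=ax, bins=edges, alpha=0.5, label="df1")
s2.hist(ax=ax, bins=edges, alpha=0.5, label="df2")
ax.legend()
```

Passing an integer to `bins` instead would let each call pick its own range, which is what produces the misaligned bars described in the comment.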

With the dataframes and bin edges defined above, the pandas figure is created as follows:

# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
               linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
         linewidth=lw, alpha=alpha, label='df2')

# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
         for h in handles]
plt.legend(lines, labels, frameon=False)

plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()
