Python: cumulative plot with a "duration" variable


python pandas numpy matplotlib plot


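I'm having trouble plotting some data accurately.

I have count data measured at certain points in time, and I want to draw it as a cumulative plot. The data contain a few rare high values, which cause large jumps in the plot. The plot I currently produce (code below) tries to smooth the line with a rolling mean.

The interesting part is that I also have a duration variable: each count took some amount of time to measure. How can I draw the cumulative plot so that it takes the duration of each measurement into account? That should give a nicer and more accurate plot. What I mean is that, instead of a large value in the raw data causing a large jump in the cumulative plot, the large value is spread over the time it took to measure it, so the plot becomes smoother and has no jumps.

Here are the data and the code that generates the plot: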

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

start_times_test = pd.Series([10.5, 15.2, 15.7, 18.2, 23.0, 25.1, 26.4, 27.4, 31.5, 35.0, 39.4, 48.1])
duration_test = pd.Series([6.2, 2.1, 15.1, 2.7, 1.1, 4.7, 21.2, 6.0, 2.3, 6.2, 1.1, 3.2])
counts_test = pd.Series([7, 5, 130, 3, 2, 12, 262, 19, 5, 32, 3, 7, 10])
cumulative_count_test = (np.cumsum(counts_test))
start_dur_count_df_test = pd.concat([start_times_test, duration_test, counts_test, cumulative_count_test], axis = 1)
start_dur_count_df_test.columns = ["Start_time", 'Duration', "Counts", "Cumulative_counts"]
print(start_dur_count_df_test)
fig, ax = plt.subplots(1,1)
plot1, = ax.plot(start_dur_count_df_test["Start_time"], start_dur_count_df_test["Cumulative_counts"], c="blue", label="regular_plot")
plot2, = ax.plot(start_dur_count_df_test["Start_time"], start_dur_count_df_test["Cumulative_counts"].rolling(window=3).mean(), c="red", label="smoothed_with_rolling")
ax.legend()
plt.savefig('cumulative_plot_test.pdf')

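I don't know where your data come from, but I assume that counts_test depends on the measurement time duration_test. If so, plotting counts_test without accounting for the measurement time may not be the most accurate way to present the data.

One possible solution is to normalize counts_test by duration_test to obtain counts per unit of time. The values can then be plotted together, because they no longer depend on the measurement duration: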

cumulative_count_test = (np.cumsum(counts_test/duration_test))

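This makes the plot smoother and the plotted quantity more homogeneous, which is good, but the rightmost point of the plot no longer represents the total number of counts.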
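To make that caveat concrete, here is a minimal sketch that compares the last value of the two cumulative sums. It reuses the data from the post with the 12-element counts_test; the names rate_cumsum and count_cumsum are only illustrative.

```python
import numpy as np

# Data from the post (12 measurements each).
duration = np.array([6.2, 2.1, 15.1, 2.7, 1.1, 4.7, 21.2, 6.0, 2.3, 6.2, 1.1, 3.2])
counts = np.array([7, 5, 130, 3, 2, 12, 262, 19, 5, 32, 3, 7])

# Ordinary cumulative count: the last value is the total number of counts.
count_cumsum = np.cumsum(counts)

# Cumulative sum of rates (counts per unit of time): the last value is a
# sum of rates, which is a different quantity from the total count.
rate_cumsum = np.cumsum(counts / duration)

print("total counts:", count_cumsum[-1])
print("sum of rates:", rate_cumsum[-1])
```

The two final values differ, so the normalized curve is useful for comparing rates but cannot be read as a running total.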

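Edit 1

I'm not sure it can be called elegant, but one possible solution is to convert the timestamps to integers so that they can be used as indexes into a numpy array. You can then fill the array by looping over all the measurements: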

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

start_times_test = pd.Series([10.5, 15.2, 15.7, 18.2, 23.0, 25.1, 26.4, 27.4, 31.5, 35.0, 39.4, 48.1])
duration_test = pd.Series([6.2, 2.1, 15.1, 2.7, 1.1, 4.7, 21.2, 6.0, 2.3, 6.2, 1.1, 3.2])
counts_test = pd.Series([7, 5, 130, 3, 2, 12, 262, 19, 5, 32, 3, 7])

### NEW CODE STARTS HERE
## The scale factor is needed to transform the timestamps to integers
## which is needed to use them as array indexes
SCALE_FACTOR = 100
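## Scale first, then convert to int; casting before scaling would drop the decimals
start = np.round(start_times_test.values * SCALE_FACTOR).astype(int)
duration = np.round(duration_test.values * SCALE_FACTOR).astype(int)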
counts = counts_test.values
stops = start + duration

counts_in_bin = np.zeros(max(stops))

for start_index, stop_index, count in zip(start, stops, counts):
    counts_per_unit_of_time = count / (stop_index - start_index)
    counts_in_bin[start_index:stop_index] = counts_in_bin[start_index:stop_index] + counts_per_unit_of_time

cumulative_counts_new = np.cumsum(counts_in_bin)
time_bins = np.linspace(0, max(stops)/SCALE_FACTOR, counts_in_bin.shape[0])
### NEW CODE ENDS HERE

cumulative_count_test = (np.cumsum(counts_test))
start_dur_count_df_test = pd.concat([start_times_test, duration_test, counts_test, cumulative_count_test], axis = 1)
start_dur_count_df_test.columns = ["Start_time", 'Duration', "Counts", "Cumulative_counts"]
print(start_dur_count_df_test)
fig, ax = plt.subplots(1,1)
plot1, = ax.plot(start_dur_count_df_test["Start_time"], start_dur_count_df_test["Cumulative_counts"], c="blue", label="regular_plot")
plot2, = ax.plot(start_dur_count_df_test["Start_time"], start_dur_count_df_test["Cumulative_counts"].rolling(window=3).mean(), c="red", label="smoothed_with_rolling")
plot3, = ax.plot(time_bins, cumulative_counts_new, label='possible solution')
ax.set_xlim(min(start)/SCALE_FACTOR, max(stops)/SCALE_FACTOR)
ax.legend()

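Also, I removed the last element of counts_test, because its length was 13 rather than 12 like start_times_test and duration_test.

Well, your assumption is correct, and this is indeed not a perfect way to plot the data. Normalizing is a good idea too, but we can't do that because we need the exact count at the end. I think the best approach would be to distribute each count value over its whole duration somehow, but I lack the numpy/pandas knowledge to do it.

I see. Are start_times_test and duration_test in the same time units? In other words, do some measurements overlap?

Yes, the units are the same, and yes, the measurements overlap. My idea for solving this is a brute-force one: iterate over the counts and, for each, create a list showing the amount counted at each time step, then plot those lists. For example, with a count of 557, a duration of 10 and a start time of 13, it would be a list of 10 values [55.7, 55.7 … 55.7], which would be added to the cumulative plot from time 13 to 23. But that is a very ugly and error-prone way to do it, and I was hoping for a more elegant solution with numpy/pandas/matplotlib.

I edited the post with a new solution that more or less follows your suggestion. I don't know how to avoid the loop, though.

Thanks, this is a more elegant solution than what I had in mind.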
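On the question of avoiding the loop, one possible alternative (not from the original answer) is to use numpy broadcasting. If each count is spread evenly over its duration, then at time t a measurement has contributed its count multiplied by the fraction of its duration that has elapsed, clipped to the range 0 to 1. The sketch below assumes the 12-element data from the post; spread_cumulative is a hypothetical helper name, not a numpy or pandas function.

```python
import numpy as np

# Data from the post (12 measurements each).
start = np.array([10.5, 15.2, 15.7, 18.2, 23.0, 25.1, 26.4, 27.4, 31.5, 35.0, 39.4, 48.1])
duration = np.array([6.2, 2.1, 15.1, 2.7, 1.1, 4.7, 21.2, 6.0, 2.3, 6.2, 1.1, 3.2])
counts = np.array([7, 5, 130, 3, 2, 12, 262, 19, 5, 32, 3, 7], dtype=float)


def spread_cumulative(t, start, duration, counts):
    """Cumulative count at times t, spreading each count evenly over its duration.

    Hypothetical helper for illustration only.
    """
    t = np.asarray(t, dtype=float)
    # Fraction of each measurement completed at each time point.
    # Broadcasting gives an array of shape (len(t), len(start)).
    fraction = np.clip((t[:, None] - start[None, :]) / duration[None, :], 0.0, 1.0)
    # Weight each fraction by its count and sum over the measurements.
    return fraction @ counts


# Evaluate on a regular grid from the first start to the last stop.
t = np.linspace(start.min(), (start + duration).max(), 500)
cumulative = spread_cumulative(t, start, duration, counts)

print(cumulative[0], cumulative[-1])
```

The result can be drawn with ax.plot(t, cumulative) alongside the other curves. Because no rounding to integer bins is involved, the curve should end at the total count, and overlapping measurements simply add up. The intermediate array has one row per time point and one column per measurement, so for very large data sets the loop-based version may use less memory.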