Python pandas DataFrame: fill NA values based on group mean

python pandas dataframe

I want to fill the NA values of a pandas DataFrame column with values taken from a groupby object.

Let me illustrate with an example.

We have the following DataFrame:

|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1      | 1     | 1   | 14.3        |
| 2      | 1     | 1   | 14.8        |
| 3      | 1     | 2   | 13.1        |
|--------|-------|-----|-------------|

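We simply measure the temperature several times a day over a number of months. Now suppose that for some of our records the temperature reading failed, so we have an NA instead: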

|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1      | 1     | 1   | 14.3        |
| 2      | 1     | 1   | 14.8        |
| 3      | 1     | 2   | 13.1        |
| 4      | 1     | 2   | NA          |
| 5      | 1     | 3   | 14.8        |
| 6      | 1     | 4   | NA          |
|--------|-------|-----|-------------|

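We could just use pandas' .fillna(), but we want something a little more sophisticated. Since there are multiple readings per day (possibly 100 a day), we would like to take the daily average and use that as the fill value.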

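We can get the daily averages with a simple groupby:

avg_temp_by_month_day = df.groupby(['month', 'day'])['temperature'].mean()

This gives us the mean temperature for each day of each month. The question is how best to fill the NA values with these groupby values.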
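To make the shape of that groupby result concrete, here is a minimal sketch built on the six sample rows from the table above. It assumes the capitalised column names shown in the table (`Month`, `Day`, `Temperature`); the variable name `daily_mean` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 1],
    "Day": [1, 1, 2, 2, 3, 4],
    "Temperature": [14.3, 14.8, 13.1, np.nan, 14.8, np.nan],
})

# mean() skips NaN by default, so each (Month, Day) group is averaged
# over its valid readings only. The result is a Series with a
# (Month, Day) MultiIndex, one entry per group rather than per row.
daily_mean = df.groupby(["Month", "Day"])["Temperature"].mean()
print(daily_mean)
```

Note that day 4 has no valid reading at all, so its group mean is itself NaN, and any fill based on the daily mean will leave that row empty.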

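We could use apply(), but that is really slow (we have over 1 million records).

Is there a vectorized approach, perhaps using np.where(), or by creating another Series and merging it in?

What is a more efficient way to do this?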

Thanks, everyone!

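I'm not sure this is the fastest way, but where apply took about 1 hour, this takes about 20 seconds on 1M+ records. The code below has been updated to work with one or more columns.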

import pandas as pd

# df is the DataFrame from the question, with lowercase column names
local_avg_cols = ['temperature'] # can work with multiple columns

# Create groupby's to get local averages
local_averages = df.groupby(['month', 'day'])[local_avg_cols].mean()

# Convert to DataFrame and prepare for merge
local_averages = pd.DataFrame(local_averages, columns=local_avg_cols).reset_index()

# Merge into original dataframe
df = df.merge(local_averages, on=['month', 'day'], how='left', suffixes=('', '_avg'))

# Now overwrite na values with values from new '_avg' col
for col in local_avg_cols:
    df[col] = df[col].mask(df[col].isna(), df[col+'_avg'])
    
# Drop new avg cols
df = df.drop(columns=[col+'_avg' for col in local_avg_cols])

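If anyone finds a more efficient way to do this (more efficient in processing time, or simply more readable), I will unmark this answer and accept yours instead. Thanks, everyone!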
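As one possible alternative to the merge, and not something proposed in this thread, `groupby(...).transform("mean")` returns a value for every original row, so the result can be passed straight to `fillna()` or to the `np.where()` idea raised in the question. The sketch below assumes the capitalised column names from the question's table; `group_mean`, `filled` and `filled_where` are illustrative names, and no timing claim is made for it.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 1],
    "Day": [1, 1, 2, 2, 3, 4],
    "Temperature": [14.3, 14.8, 13.1, np.nan, 14.8, np.nan],
})

# transform("mean") broadcasts each group's mean back onto the rows of
# that group, so the result shares df's index and needs no merge.
group_mean = df.groupby(["Month", "Day"])["Temperature"].transform("mean")

# Variant 1: fillna only touches rows where Temperature is missing.
filled = df["Temperature"].fillna(group_mean)

# Variant 2: the same selection written with np.where.
filled_where = pd.Series(
    np.where(df["Temperature"].isna(), group_mean, df["Temperature"]),
    index=df.index,
    name="Temperature",
)
```

Both variants leave a row as NaN when its whole group has no valid reading, as with day 4 in the sample.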

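I would guess that two things could speed up your process. First, you don't need to convert the groupby result into a DataFrame. Second, you don't need the for loop.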

import pandas as pd
from numpy import nan

# Populating the dataset
data = {"Month": [1] * 6,
        "Day": [1, 1, 2, 2, 3, 4],
        "Temperature": [14.3, 14.8, 13.1, nan, 14.8, nan]}

# Creating the dataframe
df = pd.DataFrame(data)
# Per-day means as a Series named "Temperature", indexed by (Month, Day)
local_averages = df.groupby(['Month', 'Day'])['Temperature'].mean()
# merge accepts a named Series and can join on its index level names
df = df.merge(local_averages, on=['Month', 'Day'], how='left', suffixes=('', '_avg'))
# Filling the missing values of the Temperature column with what is available in Temperature_avg
df['Temperature'] = df['Temperature'].fillna(df['Temperature_avg'])
df = df.drop(columns="Temperature_avg")

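Groupby is a resource-intensive operation, so make the most of it when you do use it. Also, as you already know, loops are not a good idea when working with DataFrames. Finally, if your data is large, you may want to avoid creating extra variables from it. If my data had 1M rows and many columns, I would put the groupby directly inside the merge call.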
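The last suggestion, placing the groupby inside the merge so that no intermediate variable is kept, could look roughly like the sketch below. It assumes the capitalised column names from the sample data; the `keys` list and the use of `as_index=False` (so the group keys come back as ordinary columns) are choices made for this illustration and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 1],
    "Day": [1, 1, 2, 2, 3, 4],
    "Temperature": [14.3, 14.8, 13.1, np.nan, 14.8, np.nan],
})

keys = ["Month", "Day"]

# The groupby result is passed directly to merge, so it is never bound
# to a name. The overlapping column gets the "_avg" suffix on the right.
df = df.merge(
    df.groupby(keys, as_index=False)["Temperature"].mean(),
    on=keys,
    how="left",
    suffixes=("", "_avg"),
)

# Fill the gaps from the merged averages, then drop the helper column.
df["Temperature"] = df["Temperature"].fillna(df["Temperature_avg"])
df = df.drop(columns="Temperature_avg")
```

Assigning the result of `fillna()` back to the column avoids relying on `inplace=True` through attribute access, which newer pandas versions warn about.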

I have also been looking into using df.update(), but I'm not yet sure how to align the groupby object.
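On the alignment question, one way it might be handled is to use `transform("mean")` on a one-column selection, which yields a DataFrame with the same index and column name as the original, and then call `df.update()` with `overwrite=False` so that only values that are NA in `df` get replaced. This is a sketch under the assumption that the columns are named as in the question's table; `group_means` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 1],
    "Day": [1, 1, 2, 2, 3, 4],
    "Temperature": [14.3, 14.8, 13.1, np.nan, 14.8, np.nan],
})

# Selecting with a list keeps the result a DataFrame whose index and
# column label match df, which is what update() aligns on.
group_means = df.groupby(["Month", "Day"])[["Temperature"]].transform("mean")

# overwrite=False keeps existing values and fills only the NA cells.
# update() modifies df in place and returns None.
df.update(group_means, overwrite=False)
```

With the default `overwrite=True`, every reading would be replaced by its group mean, which is not what the question asks for.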
