Python: a fast way to fill missing values in a pandas DataFrame using a complex rule
In an m*(n+1) DataFrame data_df, there is a timestamp column whose values are possibly repeated integers in range(0, p) (they represent time; there are p unique values in total), and it has no missing values. The other columns, data_1, data_2, data_3, ..., data_n, each have some missing values.

I want to fill the missing values in each row of the data columns with specific numbers that depend on that row's timestamp value. To that end, I have obtained a p*n DataFrame, median_table. The values in row i of median_table are used to fill the missing values in the rows of data_df whose timestamp is i.

However, I can't come up with a fast and memory-friendly way to do this. Currently, I use the following code (median_table and data_df are already defined):
This is very inefficient. Another approach:
for _timestamp in median_table.timestamp:
    data_df.loc[data_df.timestamp == _timestamp] = \
        data_df.loc[data_df.timestamp == _timestamp]\
        .fillna(median_table.loc[_timestamp, :], inplace=False)
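works just as slowly for me.

Is there a faster way to do the same thing?

Try this approach: identify the NaNs in the data and merge them with the median table. I expect this to be faster than a for loop. I apologize if I have assumed the wrong data structure, but this should at least get you started: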
import pandas as pd
import numpy as np

# Create dummy dataframe
data_df = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "data": [1, 2, np.nan, np.nan]
})
print(data_df)
"""
Dataframe looks like:
timestamp data
1 1.0
2 2.0
3 NaN
4 NaN
"""

# Create dummy median table
median_table = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "missing_data": [100, 200, 300, 400]
})
print(median_table)
"""
Median table looks like:
timestamp missing_data
1 100
2 200
3 300
4 400
"""

# Find NaNs in "data" column in data_df
nan_indexes = data_df["data"].isnull()
nan_df = data_df[nan_indexes]
print(nan_df)
"""
nan_df looks like:
timestamp data
3 NaN
4 NaN
"""

# Merge nan_df with median_table based on timestamp column
new_df = pd.merge(left=nan_df, right=median_table, on="timestamp", how="left")
print(new_df)
"""
new_df looks like:
timestamp data missing_data
3 NaN 300
4 NaN 400
"""

# Clean up new_df
new_df = new_df[["timestamp", "missing_data"]]  # Discard "data" column
new_df.columns = ["timestamp", "data"]  # Rename "missing_data" column to "data"
print(new_df)
"""
new_df now looks like:
timestamp data
3 300
4 400
"""
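The snippet above stops at a table of replacement values and does not write them back into data_df. One possible way to finish the job for all n data columns at once is sketched below. It assumes that median_table is indexed by timestamp and uses the same column names as the data columns of data_df; that layout, the sample values, and the names data_cols, fill_values and filled_df are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: timestamps may repeat, data columns contain NaNs.
data_df = pd.DataFrame({
    "timestamp": [0, 1, 1, 2, 2],
    "data_1": [1.0, np.nan, 3.0, np.nan, 5.0],
    "data_2": [np.nan, 20.0, np.nan, 40.0, np.nan],
})

# Assumed layout: one row per timestamp, same column names as the data columns.
median_table = pd.DataFrame(
    {"data_1": [10.0, 11.0, 12.0], "data_2": [100.0, 110.0, 120.0]},
    index=pd.Index([0, 1, 2], name="timestamp"),
)

data_cols = list(median_table.columns)

# Look up the fill row for every row of data_df in one vectorized step,
# then give the result data_df's index so fillna can align cell by cell.
fill_values = median_table.loc[data_df["timestamp"].to_numpy(), data_cols]
fill_values.index = data_df.index

# Only the NaN cells are replaced; existing values are left untouched.
filled_df = data_df.copy()
filled_df[data_cols] = data_df[data_cols].fillna(fill_values)
print(filled_df)
```

This replaces the Python-level loop over timestamps with a single indexed lookup, at the cost of temporarily holding one extra m*n frame (fill_values) in memory.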
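Thank you very much. However, I just realized that using the DataFrame.groupby() method, instead of building a separate median table up front, may be very efficient.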
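A minimal sketch of that groupby idea follows. It assumes the fill values are simply the per-timestamp medians of data_df itself, which the thread suggests but does not confirm; the sample values and the names data_cols, group_medians and filled_df are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two timestamp groups.
data_df = pd.DataFrame({
    "timestamp": [0, 0, 0, 1, 1, 1],
    "data_1": [1.0, 3.0, np.nan, 10.0, np.nan, 30.0],
    "data_2": [np.nan, 2.0, 4.0, 5.0, 7.0, np.nan],
})

data_cols = ["data_1", "data_2"]

# transform("median") returns a frame with the same index as data_df, where
# each cell holds the median of its column within that row's timestamp group.
# NaNs are skipped when the median is computed.
group_medians = data_df.groupby("timestamp")[data_cols].transform("median")

# Fill only the missing cells with their group's median.
filled_df = data_df.copy()
filled_df[data_cols] = data_df[data_cols].fillna(group_medians)
print(filled_df)
```

Because transform already returns values aligned to the original rows, no separate median table or merge step is needed. A group whose column is entirely NaN would still be left unfilled.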