Python: fill in missing string values in one column guided by the numeric values in another column
Tags: python, pandas

I have a dataframe:
ID URINE_TEST UNIT VALUE
1 'alb' mg 1500
2 'alb' mg 1200
3 'alb' mg 1600
4 'alb' g 1.2
5 'alb' g 1.8
7 'alb' NaN 1300 <- should become mg
8 'crt' l 2.3
9 'crt' l 3.3
10 'crt' l 4.1
11 'crt' ml 2500
12 'crt' ml 3400
13 'crt' ml 2100
14 'crt' NaN 3.0 <-should become l
15 'crt' NaN 99 <-should stay as NaN (not inside any range)
But I really can't figure out a way to do this. Any help is appreciated.

Solution if the ID values are unique:
#filter NaNs rows by UNIT
df1 = df[df['UNIT'].isna()]
print (df1)
ID URINE_TEST UNIT VALUE
5 7 'alb' NaN 1300.0
12 14 'crt' NaN 3.0
13 15 'crt' NaN 99.0
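To make the filtering step above reproducible, the question's frame can be rebuilt by hand. This is a sketch under assumptions: the values are transcribed from the table in the question, the test names are stored without the surrounding quotes, and the name `missing` is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data transcribed from the question; NaN marks the missing units.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "URINE_TEST": ["alb"] * 6 + ["crt"] * 8,
    "UNIT": ["mg", "mg", "mg", "g", "g", np.nan,
             "l", "l", "l", "ml", "ml", "ml", np.nan, np.nan],
    "VALUE": [1500, 1200, 1600, 1.2, 1.8, 1300,
              2.3, 3.3, 4.1, 2500, 3400, 2100, 3.0, 99],
})

# Boolean indexing keeps the original index labels of the unit-less rows.
missing = df[df["UNIT"].isna()]
print(missing)
```

The filtered rows keep their original index labels, which is what later allows the filled units to be aligned back onto `df`.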
Or: a solution using merge and a left join, which is the most common approach:
df1 = df[df['UNIT'].isna()]
df2 = df.groupby(['URINE_TEST', 'UNIT']).VALUE.agg(['min','max']).reset_index()
df3 = df1.merge(df2, on='URINE_TEST', suffixes=('_',''))
df3 = df3.loc[df3['VALUE'].between(df3['min'], df3['max']), ['URINE_TEST','VALUE', 'UNIT']]
df3 = df1.merge(df3, on=['URINE_TEST','VALUE'], suffixes=('_',''), how='left')
print (df3)
ID URINE_TEST UNIT_ VALUE UNIT
0 7 'alb' NaN 1300.0 mg
1 14 'crt' NaN 3.0 l
2 15 'crt' NaN 99.0 NaN
df = (pd.concat([df.dropna(subset=['UNIT']), df3[df.columns]])
.sort_values('URINE_TEST')
.reset_index(drop=True))
print (df)
ID URINE_TEST UNIT VALUE
0 1 'alb' mg 1500.0
1 2 'alb' mg 1200.0
2 3 'alb' mg 1600.0
3 4 'alb' g 1.2
4 5 'alb' g 1.8
5 7 'alb' mg 1300.0
6 8 'crt' l 2.3
7 9 'crt' l 3.3
8 10 'crt' l 4.1
9 11 'crt' ml 2500.0
10 12 'crt' ml 3400.0
11 13 'crt' ml 2100.0
12 14 'crt' l 3.0
13 15 'crt' NaN 99.0
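The merge solution above hinges on an intermediate table of observed value ranges per test and unit. The following sketch shows that table and the range check on a reduced frame; the data is a made-up subset of the question's rows and the names `ranges`, `pairs`, and `hits` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "URINE_TEST": ["alb", "alb", "alb", "alb", "crt", "crt", "crt"],
    "UNIT": ["mg", "mg", "g", np.nan, "l", "l", np.nan],
    "VALUE": [1200.0, 1600.0, 1.2, 1300.0, 2.3, 4.1, 99.0],
})

# groupby drops NaN keys by default, so only rows with a known UNIT
# contribute to the observed [min, max] range of each (test, unit) pair.
ranges = (df.groupby(["URINE_TEST", "UNIT"])["VALUE"]
            .agg(["min", "max"])
            .reset_index())
print(ranges)

# Pair every unit-less row with each candidate range of the same test,
# then keep only the pairs whose VALUE lies inside the range (inclusive).
missing = df[df["UNIT"].isna()]
pairs = missing.merge(ranges, on="URINE_TEST", suffixes=("_", ""))
hits = pairs[pairs["VALUE"].between(pairs["min"], pairs["max"])]
print(hits[["URINE_TEST", "VALUE", "UNIT"]])
```

Here the value 1300 falls inside the observed mg range and is matched, while 99 falls inside no range and produces no match, so a later left join leaves its unit as NaN.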
An alternative that matches on the unique index of df1:
df1 = df[df['UNIT'].isna()]
df2 = df.groupby(['URINE_TEST', 'UNIT']).VALUE.agg(['min','max']).reset_index()
#add index to columns by reset_index()
df3 = df1.reset_index().merge(df2, on='URINE_TEST', suffixes=('_',''))
s = df3[df3['VALUE'].between(df3['min'], df3['max'])].set_index(['index'])['UNIT']
print (s)
index
5 mg
12 l
Name: UNIT, dtype: object
df['UNIT'] = df['UNIT'].fillna(s)
print (df)
ID URINE_TEST UNIT VALUE
0 1 'alb' mg 1500.0
1 2 'alb' mg 1200.0
2 3 'alb' mg 1600.0
3 4 'alb' g 1.2
4 5 'alb' g 1.8
5 7 'alb' mg 1300.0
6 8 'crt' l 2.3
7 9 'crt' l 3.3
8 10 'crt' l 4.1
9 11 'crt' ml 2500.0
10 12 'crt' ml 3400.0
11 13 'crt' ml 2100.0
12 14 'crt' l 3.0
13 15 'crt' NaN 99.0
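Because the index-based variant never looks at ID, it can be wrapped in a small function and tried on data with repeated IDs. This is only a sketch: `fill_unit_by_range` is a hypothetical helper name, the frame is assumed to have an unnamed default index and no column called `index`, and the `drop_duplicates` step is an added guard for overlapping ranges that the original answer does not contain.

```python
import numpy as np
import pandas as pd

def fill_unit_by_range(df):
    """Hypothetical helper: fill missing UNIT from observed value ranges."""
    ranges = (df.groupby(["URINE_TEST", "UNIT"])["VALUE"]
                .agg(["min", "max"])
                .reset_index())
    # reset_index() keeps the original row labels in an "index" column
    # so the matches can be aligned back without relying on ID.
    missing = df[df["UNIT"].isna()].reset_index()
    pairs = missing.merge(ranges, on="URINE_TEST", suffixes=("_", ""))
    hits = pairs[pairs["VALUE"].between(pairs["min"], pairs["max"])]
    # Assumption: if two ranges of one test overlap, keep the first match
    # so the lookup Series has unique labels.
    lookup = hits.drop_duplicates("index").set_index("index")["UNIT"]
    out = df.copy()
    out["UNIT"] = out["UNIT"].fillna(lookup)
    return out

# Made-up data with repeated IDs.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "URINE_TEST": ["alb", "alb", "alb", "alb", "alb"],
    "UNIT": ["mg", "mg", "g", np.nan, np.nan],
    "VALUE": [1200.0, 1600.0, 1.5, 1300.0, 99.0],
})
result = fill_unit_by_range(df)
print(result)
```

The function returns a copy, so the input frame is left unchanged.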
Assuming I understand your conditions correctly and your values are of dtype float:
# List for new unit values.
NEW_UNIT = []
# Loop over the rows and pick a unit from the hard-coded value ranges of each test.
for index, row in df.iterrows():
    if row['URINE_TEST'] == 'alb':
        if (row['VALUE'] >= 1200) and (row['VALUE'] <= 1600):
            NEW_UNIT.append('mg')
        # Upper bound 1.8 so that the 'g' row with value 1.8 in the sample data matches.
        elif (row['VALUE'] >= 1.2) and (row['VALUE'] <= 1.8):
            NEW_UNIT.append('g')
        else:
            NEW_UNIT.append(float('NaN'))
    elif row['URINE_TEST'] == 'crt':
        # The 'ml' rows in the sample data span 2100 to 3400.
        if (row['VALUE'] >= 2100) and (row['VALUE'] <= 3400):
            NEW_UNIT.append('ml')
        elif (row['VALUE'] >= 2.3) and (row['VALUE'] <= 4.1):
            NEW_UNIT.append('l')
        else:
            NEW_UNIT.append(float('NaN'))
    else:
        # Unknown test: append NaN so the list stays as long as the dataframe.
        NEW_UNIT.append(float('NaN'))
# Replace unit column with the updated unit values
df['UNIT'] = NEW_UNIT
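The same hard-coded conditions can also be written without a row loop, filling only the rows where UNIT is missing. This is a sketch, not part of the original answer: the ranges are read off the sample data in the question, the frame is a made-up subset, and the names `conditions`, `units`, and `guessed` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "URINE_TEST": ["alb", "alb", "crt", "crt", "crt"],
    "UNIT": ["mg", np.nan, "l", np.nan, np.nan],
    "VALUE": [1500.0, 1300.0, 2.3, 3.0, 99.0],
})

is_alb = df["URINE_TEST"].eq("alb")
is_crt = df["URINE_TEST"].eq("crt")
value = df["VALUE"]

# One boolean mask per (test, range); between() is inclusive on both ends.
conditions = [
    is_alb & value.between(1200, 1600),
    is_alb & value.between(1.2, 1.8),
    is_crt & value.between(2100, 3400),
    is_crt & value.between(2.3, 4.1),
]
units = ["mg", "g", "ml", "l"]

# Start from all-NaN guesses and write a unit wherever its mask is True.
guessed = pd.Series(np.nan, index=df.index, dtype=object)
for condition, unit in zip(conditions, units):
    guessed.loc[condition] = unit

# Only rows with a missing UNIT are filled; known units are left untouched.
df["UNIT"] = df["UNIT"].fillna(guessed)
print(df)
```

Unlike the loop, this variant never overwrites a unit that is already present.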
As I understand your logic, you only want to fill NaN where the value falls inside a min-max range and leave the other NaN unchanged. I think you can use sort_values, ffill, and a loc assignment with a custom mask to set NaN back on the values that fall outside the min-max ranges:
df1 = df.sort_values(['VALUE', 'UNIT'])
m1 = df1.UNIT.shift() != df1.UNIT.shift(-1)
m2 = df1.UNIT.isna()
m3 = df1.VALUE != df1.VALUE.shift()
df1['UNIT'] = df1.UNIT.ffill()
df1.loc[m1 & m2 & m3, 'UNIT'] = np.nan
df = df1.reindex(df.index)
Out[130]:
ID URINE_TEST UNIT VALUE
0 1 'alb' mg 1500.0
1 2 'alb' mg 1200.0
2 3 'alb' mg 1600.0
3 4 'alb' g 1.2
4 5 'alb' g 1.8
5 7 'alb' mg 1300.0
6 8 'crt' l 2.3
7 9 'crt' l 3.3
8 10 'crt' l 4.1
9 11 'crt' ml 2500.0
10 12 'crt' ml 3400.0
11 13 'crt' ml 2100.0
12 14 'crt' l 3.0
13 15 'crt' NaN 99.0
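To see what the three masks in this answer do, it can help to print them next to the shifted columns. The sketch below uses a made-up frame that is already sorted by VALUE; `prev_unit`, `next_unit`, and `keep_nan` are illustrative names.

```python
import numpy as np
import pandas as pd

# Already sorted by VALUE, as sort_values would leave it.
df = pd.DataFrame({
    "UNIT": ["l", np.nan, "l", np.nan, "mg", np.nan, "mg"],
    "VALUE": [2.3, 3.0, 4.1, 99.0, 1200.0, 1300.0, 1600.0],
})

prev_unit = df["UNIT"].shift()      # unit of the row above
next_unit = df["UNIT"].shift(-1)    # unit of the row below

m1 = prev_unit != next_unit                 # the two neighbours disagree
m2 = df["UNIT"].isna()                      # the current unit is missing
m3 = df["VALUE"] != df["VALUE"].shift()     # value differs from the row above

# A missing unit is reset to NaN after ffill only when all three hold,
# i.e. when it sits between two different units.
keep_nan = m1 & m2 & m3
print(df.assign(prev=prev_unit, next=next_unit, keep_nan=keep_nan))
```

Only the row with value 99 is flagged, because it lies between an l row and an mg row. Note that the answer sorts by VALUE across all tests without grouping by URINE_TEST, so it appears to rely on the value ranges of different tests not interleaving.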
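You can use the DataFrame.apply() function to clean the data and get the desired result. You can read more about df.apply() in the documentation.
A rough solution is shown below, assuming the data is named urine_data: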
import numpy as np

# Dictionary of all tests with their (unit, min, max) options
test_dic = {'alb': [('mg', 1200, 1800), ('g', 1.2, 1.8)], 'crt': [('l', 2.3, 4.1), ('ml', 2100, 3400)]}

# Applied to each row of the dataframe
def fill_unit(row):
    test = row['URINE_TEST']  # get test
    value = row['VALUES']     # get value (this answer's frame names the column VALUES)
    unit = row['UNIT']        # get initial unit
    if test in test_dic:
        if test_dic[test][0][1] <= value <= test_dic[test][0][2]:
            unit = test_dic[test][0][0]
        elif test_dic[test][1][1] <= value <= test_dic[test][1][2]:
            unit = test_dic[test][1][0]
        else:
            unit = np.nan
    return unit

urine_data['UNIT'] = urine_data.apply(fill_unit, axis=1)
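Comments:

Possible, but do you have to build the dictionary by hand every time you get a new dataframe? I think the OP is asking how to derive it from the dataframe.

Yes, the above is just a plain example.

Answers should not be hard-coded. Also, avoid using apply, since it can be very slow. It is better to use vectorized functions for most problems. I suggest…

Thanks for your answer. However, answer snippets on SO should generally work programmatically. For example, these min/max values should be computed automatically. The answer should also work for any number of tests in URINE_TEST.

What does `m1 = df1.UNIT.shift() != df1.UNIT.shift(-1)` do? Your solution is definitely the most elegant, but it is cryptic to me.

@Kaisar: It compares the previous row and the next row. If they differ and the current row is NaN, it further checks on df1.VALUE…

The IDs are not unique. It would be great if you could change your solution to be ID-agnostic. Thanks

@Kaisar - The answer was edited with 2 new ID-agnostic solutions.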
Output of the DataFrame.apply() solution (urine_data after filling UNIT):
URINE_TEST UNIT VALUES
0 alb mg 1500.0
1 alb mg 1200.0
2 alb mg 1600.0
3 alb g 1.2
4 alb g 1.8
5 alb mg 1300.0
6 crt l 2.3
7 crt l 3.3
8 crt l 4.1
9 crt ml 2500.0
10 crt ml 3400.0
11 crt ml 2100.0
12 crt l 3.0
13 crt NaN 99.0
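The comments ask for the min/max values to be computed rather than hard-coded. One way this might look, keeping the apply-based shape of the answer, is sketched below. It is not the answerer's code: the frame is a made-up subset, the column name VALUES follows this answer's output, and the dictionary is built from whatever rows already have a unit.

```python
import numpy as np
import pandas as pd

urine_data = pd.DataFrame({
    "URINE_TEST": ["alb", "alb", "alb", "alb", "crt", "crt", "crt"],
    "UNIT": ["mg", "mg", "g", np.nan, "l", "l", np.nan],
    "VALUES": [1200.0, 1800.0, 1.2, 1300.0, 2.3, 4.1, 99.0],
})

# Build {test: [(unit, min, max), ...]} from rows whose unit is known.
stats = urine_data.groupby(["URINE_TEST", "UNIT"])["VALUES"].agg(["min", "max"])
test_dic = {}
for (test, unit), row in stats.iterrows():
    test_dic.setdefault(test, []).append((unit, row["min"], row["max"]))

def fill_unit(row):
    # Keep a unit that is already present.
    if pd.notna(row["UNIT"]):
        return row["UNIT"]
    # Otherwise return the first unit whose range contains the value.
    for unit, low, high in test_dic.get(row["URINE_TEST"], []):
        if low <= row["VALUES"] <= high:
            return unit
    return np.nan

urine_data["UNIT"] = urine_data.apply(fill_unit, axis=1)
print(urine_data)
```

This works for any number of tests and units per test, but it still calls a Python function per row, so the merge-based solutions are likely to scale better on large frames.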