Python: conditionally deleting rows does not work as expected
I have a dataframe with a Sample column containing duplicate samples (ending in _2) and a same column detailing which one is the original sample. The New Category column contains a mutation type, where Pathogenic/Likely Pathogenic is the most damaging and Likely Benign is the least damaging. A simplified/basic version of my dataframe is shown below:
import pandas as pd

df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS']
                  ])
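I want to conditionally delete either a sample or its duplicate, depending on which of the two has the least damaging mutation in New Category. For example, if a sample is Likely Benign and its duplicate has a Pathogenic/Likely Pathogenic variant, then I want to drop the sample row.
I tried to do this by passing the dataframe to a function that returns a list of the indexes of the rows to delete, and then dropping those rows: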
def get_unwanted_duplicates_ix(df):
    # filter df for samples that have a duplicate
    same_only = df.groupby("same").filter(lambda x: len(x) > 1)
    list_index_to_delete = []
    for num in range(0, same_only.shape[0] - 1):
        # .iloc[...] is the current spelling of the removed .irow(...)
        row1 = same_only.iloc[num]
        row2 = same_only.iloc[num + 1]
        index = list(same_only.index.values)[num]
        if row1['Sample'] + "_2" == row2['Sample'] or \
                row1['Sample'] == row2['Sample'] + "_2":
            if row1['New Category'] == row2['New Category']:
                list_index_to_delete.append(index + 1)
            elif row1['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row2['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index + 1)
            elif row2['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row1['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index)
            elif row1['New Category'] == "VUS" \
                    and row2['New Category'] != "VUS":
                list_index_to_delete.append(index + 1)
            elif row2['New Category'] == "VUS" \
                    and row1['New Category'] != "VUS":
                list_index_to_delete.append(index)
            elif row1['New Category'] == 'Likely Benign' \
                    and row2['New Category'] == 'Likely Benign':
                list_index_to_delete.append(index + 1)
            else:
                list_index_to_delete.append(index + 1)
    return list_index_to_delete

unwanted = get_unwanted_duplicates_ix(df)
df = df.drop(df.index[unwanted])
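The function above is a mess and, unsurprisingly, it does not work the way I hoped. A pointer in the right direction would be great.
First, replace the mutation severity with integers, stored in a New Category code column (a higher value means more damaging).
The next command depends on whether you want to keep multiple rows that share the same severity. If you do, group by the same column and select the rows with the maximum severity code: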
df[df.groupby('same')['New Category code'].transform('max') == df['New Category code']]
      Sample      same                  New Category  New Category code
0   HG_12_34  HG_12_34  Pathogenic/Likely Pathogenic                  3
2    KD_89_9   KD_89_9                 Likely Benign                  1
3  KD_98_9_2   KD_89_9                 Likely Benign                  1
5  LG_3_45_2   LG_3_45                           VUS                  2
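The command for the integer-encoding step is not shown above. A minimal sketch of that step followed by the group-maximum filter might look like the following. The mapping (Likely Benign = 1, VUS = 2, Pathogenic/Likely Pathogenic = 3) is an assumption inferred from the output table, and the names severity, group_max and kept are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['Sample', 'same', 'New Category'],
    data=[
        ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
        ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
        ['KD_89_9', 'KD_89_9', 'Likely Benign'],
        ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
        ['LG_3_45', 'LG_3_45', 'Likely Benign'],
        ['LG_3_45_2', 'LG_3_45', 'VUS'],
    ],
)

# Assumed encoding, inferred from the output table: higher = more damaging.
severity = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}
df['New Category code'] = df['New Category'].map(severity)

# Broadcast each group's maximum code back onto its rows, then keep the
# rows that match it. Rows tied for the maximum are all kept.
group_max = df.groupby('same')['New Category code'].transform('max')
kept = df[df['New Category code'] == group_max]
print(kept)
```

Because the filter is a boolean mask on the original frame, the surviving rows keep their original index labels, which makes it easy to see which rows were dropped.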
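If you do not (that is, exactly one row should always be kept per group), sort the values by severity in ascending order instead and take the last row of each group, i.e. df.sort_values('New Category code').groupby('same').last() (thanks to @JonClements for the suggestion).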
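Comment: Is this what you want, or do you not want to group by the same column? If not, please add the desired output to the question.
Comment: Rather than transforming and comparing against the maximum (which returns several samples for groups that have more than one maximum), I would suggest sorting by New Category code in descending order and then applying groupby('same').first() (or sorting in ascending order and then applying .last(), take your pick).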
df.sort_values('New Category code').groupby('same').last()
             Sample                  New Category  New Category code
same
HG_12_34   HG_12_34  Pathogenic/Likely Pathogenic                  3
KD_89_9   KD_98_9_2                 Likely Benign                  1
LG_3_45   LG_3_45_2                           VUS                  2
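Note that groupby('same').last() moves same into the index, and on a tie it keeps whichever row happens to sort last. As a hedged alternative that is not part of the answer above, the sketch below uses idxmax to pick one row per group while keeping the original index. It assumes the same integer encoding as the output table and assumes that, on a tie, the row appearing first (the original sample here) should be kept. The names severity, keep_ix and one_per_group are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Sample': ['HG_12_34', 'HG_12_34_2', 'KD_89_9',
               'KD_98_9_2', 'LG_3_45', 'LG_3_45_2'],
    'same': ['HG_12_34', 'HG_12_34', 'KD_89_9',
             'KD_89_9', 'LG_3_45', 'LG_3_45'],
    'New Category': ['Pathogenic/Likely Pathogenic', 'Likely Benign',
                     'Likely Benign', 'Likely Benign',
                     'Likely Benign', 'VUS'],
})

# Assumed encoding (higher = more damaging), as implied by the output table.
severity = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}
df['New Category code'] = df['New Category'].map(severity)

# idxmax returns the index label of the first maximum within each group,
# so a tie is resolved in favour of the row that appears first.
keep_ix = df.groupby('same')['New Category code'].idxmax()
one_per_group = df.loc[keep_ix]
print(one_per_group)
```

If the duplicate (the _2 row) should win ties instead, sorting and taking the last row per group, as in the answer, is the simpler choice.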