Python: drop rows with duplicate column values, choosing which row to keep based on a condition in a column
I have a dataframe such as:
COL1 COL2 COL3 COL4 COL4bis COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 COL13
APE.1:8-9(+):Canis_lups SEQ1 0.171 1041 243 0 436 1476 1485 194 487 1091 3.305000e-05 52
APE.1:8-9(+):Canis_lups YP_SEQ1 0.171 1041 243 0 436 1476 1485 194 487 1091 3.305000e-05 52
APE.1:8-9(+):Canis_lups SEQ2 0.20 1081 248 1 436 1476 1485 194 497 1091 0.305000e-08 51
APZ.1:1-1(-):Felis_catus SEQ1 0.184 732 184 0 61 792 1071 233 458 1308 2.275000e-03 45
OKI:3946-7231(-):Ratus SEQ3 0.185 852 203 0 388 1239 3285 194 443 1091 5.438000e-05 53
OKI:3946-7231(-):Ratus XP_SEQ3 0.185 852 203 0 388 1239 3285 194 443 1091 5.438000e-05 53
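I would like to remove rows that have exactly the same values in COL1 and in COL3 through COL13 (that is, in every column except COL2).
To decide which COL2 value to keep, I keep the row whose COL2 starts with a prefix from this list:
prefix_list = ['AC_', 'NC_', 'YP_']
If no row in the group has a prefix from prefix_list, I keep the first one.
In this example, the expected result is: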
APE.1:8-9(+):Canis_lups YP_SEQ1 0.171 1041 243 0 436 1476 1485 194 487 1091 3.305000e-05 52
APE.1:8-9(+):Canis_lups SEQ2 0.20 1081 248 1 436 1476 1485 194 497 1091 0.305000e-08 51
APZ.1:1-1(-):Felis_catus SEQ1 0.184 732 184 0 61 792 1071 233 458 1308 2.275000e-03 45
OKI:3946-7231(-):Ratus XP_SEQ3 0.185 852 203 0 388 1239 3285 194 443 1091 5.438000e-05 53
If I understand correctly, this should do it:
import pandas as pd

# NOTE: I've only created a dataframe with 6 columns, but the code still applies to your dataframe of 13 columns
# data
d = {'COL1': ['APE.1:8-9(+):Canis_lups', 'APE.1:8-9(+):Canis_lups', 'APE.1:8-9(+):Canis_lups', 'APZ.1:1-1(-):Felis_catus', 'OKI:3946-7231(-):Ratus', 'OKI:3946-7231(-):Ratus'],
     'COL2': ['SEQ1', 'YP_SEQ1', 'SEQ1', 'SEQ1', 'SEQ3', 'XP_SEQ3'],
     'COL3': [0.171, 0.171, 0.20, 0.184, 0.185, 0.185],
     'COL4': [243, 243, 248, 184, 203, 203],
     'COL5': [0, 0, 1, 0, 0, 0],
     'COL6': [436, 436, 436, 61, 388, 388]}
# create data frame
df = pd.DataFrame(data=d)
# list of prefixes
prefix_list = ['AC_', 'NC_', 'YP_']
# columns that define a duplicate (everything except COL2); compare names with !=, not "is not"
groupingColumns = [c for c in df.columns if c != 'COL2']
# flag each row with 1 if COL2 starts with one of the prefixes, else 0
df['prefix_check'] = df['COL2'].apply(lambda x: int(x.startswith(tuple(prefix_list))))
# sort descending, with prefix_check ahead of COL2 so that flagged rows come first
# within each duplicate group (unflagged ties fall back to descending COL2)
sortColumns = ['COL1', 'prefix_check'] + [c for c in df.columns if c not in ('COL1', 'prefix_check')]
df = df.sort_values(by=sortColumns, ascending=False)
# drop duplicates based on the other columns and keep the first row of each group
# (the flagged one, if the group has one)
output = df.drop_duplicates(subset=groupingColumns, keep='first').reset_index(drop=True)
# remove check column
output = output.drop(['prefix_check'], axis=1)
print(output)
COL1 COL2 COL3 COL4 COL5 COL6 ..........
0 OKI:3946-7231(-):Ratus XP_SEQ3 0.185 203 0 388 ..........
1 APZ.1:1-1(-):Felis_catus SEQ1 0.184 184 0 61 ..........
2 APE.1:8-9(+):Canis_lups YP_SEQ1 0.171 243 0 436 ..........
3 APE.1:8-9(+):Canis_lups SEQ1 0.200 248 1 436 ..........
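A variant that follows the stated rule literally (prefer a row whose COL2 starts with a listed prefix, otherwise keep the first row in original order) could look like the sketch below. The function name dedupe_prefer_prefix and the helper column _unflagged are made up for illustration, and the sample frame is the six-column subset used above, with COL2 of the third row set to SEQ2 as in the question. Note that under this literal reading the OKI group keeps SEQ3 (its first row), not XP_SEQ3, because XP_ is not in prefix_list.

```python
import pandas as pd


def dedupe_prefer_prefix(df, id_col, prefixes):
    """Drop rows that match on every column except id_col.

    Within each duplicate group, keep the first row whose id_col starts
    with one of `prefixes`; if there is none, keep the first row.
    """
    key_cols = [c for c in df.columns if c != id_col]
    # False for rows with a listed prefix, True otherwise
    unflagged = df[id_col].map(lambda s: not str(s).startswith(tuple(prefixes)))
    # A stable sort on a single key moves flagged rows to the front
    # while preserving the original relative order everywhere else
    ordered = df.assign(_unflagged=unflagged).sort_values('_unflagged', kind='stable')
    kept = ordered.drop_duplicates(subset=key_cols, keep='first')
    # Restore the original row order and remove the helper column
    return kept.drop(columns='_unflagged').sort_index()


df = pd.DataFrame({
    'COL1': ['APE.1:8-9(+):Canis_lups'] * 3
            + ['APZ.1:1-1(-):Felis_catus']
            + ['OKI:3946-7231(-):Ratus'] * 2,
    'COL2': ['SEQ1', 'YP_SEQ1', 'SEQ2', 'SEQ1', 'SEQ3', 'XP_SEQ3'],
    'COL3': [0.171, 0.171, 0.20, 0.184, 0.185, 0.185],
    'COL4': [243, 243, 248, 184, 203, 203],
    'COL5': [0, 0, 1, 0, 0, 0],
    'COL6': [436, 436, 436, 61, 388, 388],
})

result = dedupe_prefer_prefix(df, 'COL2', ['AC_', 'NC_', 'YP_'])
print(result)
```

Unlike the snippet above, this keeps the rows in their original order and does not depend on how the COL2 strings happen to sort.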