Python: assess whether the fields of duplicate samples contain different data, and copy the data over if so? (python, python-3.x, pandas)

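I want to assess whether samples and their duplicates (which end in _2) have data entered in their Age, Family History and Diagnosis fields. If a sample has entries and its duplicate does not (all "-" entries), I want to copy the entries from the sample into the duplicate's fields. The check should also work the other way round: if the duplicate has entries and the sample does not, copy them into the sample's fields.

Basically, I want input_df to end up looking like desired_df.

My really inefficient and incomplete attempt at this is detailed below: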

def testing(duplicate, df):
    ''' Check for differences in phenotype data between duplicates and
        return the original sample name if the duplicate has no phenotype
        data but the original sample does
    '''
    # only assess the duplicate
    if duplicate['Sample'][:-2] in list(df['Sample'].unique()):

        # get the original sample's row
        sam = df[df['Sample'] == duplicate['Sample'][:-2]]

        # store the Age, Family History and Diagnosis in a list for each sample
        # (positions 2, 3 and 4, so the slice has to be 2:5, not 2:4)
        sam_pheno = sam.iloc[0].iloc[2:5].fillna("-").tolist()
        duplicate_pheno = duplicate.iloc[2:5].fillna("-").tolist()

        # if the duplicate sample has nothing in these fields then return the
        # original sample name
        if set(duplicate_pheno) == {"-"} and len(set(sam_pheno)) > 1:
            return duplicate['Sample'][:-2]

# This creates a column called Pheno which holds the name of the sample that
# contains the phenotype data the pair should share. The intention is to
# somehow copy over the phenotype data from the sample named in the Pheno
# field. However, I have no idea how to do this.
input_df['Pheno'] = input_df.apply(lambda x: testing(x, input_df), axis=1)

You can use:

import numpy as np

# replace all '-' values with NaN
input_df = input_df.replace('-', np.nan)
# duplicates: Sample ends with '_2' and is longer than 7 characters
mask = (input_df.Sample.str.endswith('_2')) & (input_df.Sample.str.len() > 7)
# new column 'same': the Sample name with the last 2 chars (_2) removed
# (.loc replaces the removed .ix indexer)
input_df.loc[mask, 'same'] = input_df.loc[mask, 'Sample'].str[:-2]
# fill the remaining NaN in 'same' from the Sample column
input_df['same'] = input_df['same'].combine_first(input_df['Sample'])
# sort so each sample sits next to its duplicate; NaN rows sort last
input_df = input_df.sort_values(['same', 'Family History'], ascending=False)
# replace NaN by forward filling
pheno_cols = ['Age', 'Family History', 'Diagnosis']
input_df[pheno_cols] = input_df[pheno_cols].ffill()
# restore the original row order
input_df.sort_index(inplace=True)
# remove the helper column 'same'
input_df.drop('same', axis=1, inplace=True)

print (input_df)     
       Sample     Date Age Family History                    Diagnosis
0    HG_12_34  12/3/12  23              Y           Jerusalem Syndrome
1     LG_3_45   3/4/12  45              N               Paris Syndrome
2  HG_12_34_2   4/5/13  23              Y           Jerusalem Syndrome
3     KD_89_9   8/9/12  54              Y              Chronic Hiccups
4   KD_98_9_2   6/1/13  54              Y              Chronic Hiccups
5   LG_3_45_2   4/4/10  59              N  Dangerous Sneezing Syndrome
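
A variation on the same idea, as a minimal sketch rather than the answer's own method: group each sample with its duplicate and fill in both directions inside the group, so no sorting is needed and values never leak between unrelated samples. The question's input_df is not shown, so the frame below is hypothetical, with names chosen so that every duplicate matches an original; the helper name fill_between_duplicates is likewise made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the question does not show its input_df.
input_df = pd.DataFrame({
    "Sample": ["HG_12_34", "LG_3_45", "HG_12_34_2",
               "KD_89_9", "KD_89_9_2", "LG_3_45_2"],
    "Date": ["12/3/12", "3/4/12", "4/5/13", "8/9/12", "6/1/13", "4/4/10"],
    "Age": ["23", "45", "-", "-", "54", "59"],
    "Family History": ["Y", "N", "-", "-", "Y", "N"],
    "Diagnosis": ["Jerusalem Syndrome", "Paris Syndrome", "-", "-",
                  "Chronic Hiccups", "Dangerous Sneezing Syndrome"],
})

pheno_cols = ["Age", "Family History", "Diagnosis"]


def fill_between_duplicates(df, cols, suffix="_2"):
    """Copy missing phenotype values between a sample and its duplicate."""
    out = df.copy()
    out[cols] = out[cols].replace("-", np.nan)
    # Same duplicate test as the answer above: suffix plus a length check.
    is_dup = out["Sample"].str.endswith(suffix) & (out["Sample"].str.len() > 7)
    # Grouping key: the sample name with the duplicate suffix stripped.
    key = out["Sample"].where(~is_dup, out["Sample"].str[:-len(suffix)])
    # Fill forwards then backwards, but only within each sample group.
    out[cols] = out.groupby(key)[cols].transform(lambda s: s.ffill().bfill())
    return out


result = fill_between_duplicates(input_df, pheno_cols)
print(result)
```

Pairs in which both rows already hold data (LG_3_45 and LG_3_45_2 here) are left unchanged, because ffill and bfill only touch missing values.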


This is similar to jezrael's answer, but I thought I would leave it here anyway since I had already written it.

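import numpy as np
import pandas as pd

input_df = input_df.replace('-', np.nan)

# duplicates: Sample ends with '_2' and is longer than 7 characters
# (named is_dup to avoid shadowing the built-in filter)
is_dup = (input_df["Sample"].str.endswith('_2')) & (input_df["Sample"].str.len() > 7)

samples1 = input_df[~is_dup].copy()
samples2 = input_df[is_dup].copy()

# strip only the trailing '_2'; str.replace("_2", "") would also remove
# any '_2' that occurs inside the sample name
samples2['Sample'] = samples2["Sample"].str[:-2]

samples1 = samples1.set_index("Sample")
samples2 = samples2.set_index("Sample")

# fill the gaps in each frame from the other, aligned on the sample name
samples1 = samples1.combine_first(samples2)
samples2 = samples2.combine_first(samples1)

samples1 = samples1.reset_index()
samples2 = samples2.reset_index()

# restore the duplicate suffix
samples2["Sample"] = samples2["Sample"] + "_2"

samples = pd.concat([samples1, samples2])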
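
One detail of combine_first worth keeping in mind with this approach: the result is indexed by the union of both indexes, so a duplicate whose stripped name matches no original would show up as an extra row. The sketch below uses small hypothetical frames (the names originals and duplicates and the sample ZZ_00_00 are invented for illustration) to show that behaviour, and one possible way to restrict the result to the rows of the left frame with reindex.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, already indexed by the stripped sample name.
originals = pd.DataFrame(
    {"Age": ["23", np.nan],
     "Diagnosis": ["Jerusalem Syndrome", np.nan]},
    index=["HG_12_34", "KD_89_9"],
)
duplicates = pd.DataFrame(
    {"Age": [np.nan, "54", "31"],
     "Diagnosis": [np.nan, "Chronic Hiccups", "Paris Syndrome"]},
    index=["HG_12_34", "KD_89_9", "ZZ_00_00"],
)

# Gaps in originals are filled from duplicates, but the unmatched
# ZZ_00_00 row is carried over as well (union of the two indexes).
unioned = originals.combine_first(duplicates)

# Reindexing the right-hand frame first keeps only the rows of originals.
restricted = originals.combine_first(duplicates.reindex(originals.index))

print(unioned)
print(restricted)
```

Whether the extra rows are wanted depends on the data; if every duplicate is known to have an original, the two results are the same.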

The desired_df referred to in the question:

print (desired_df)
       Sample     Date Age Family History                    Diagnosis
0    HG_12_34  12/3/12  23              Y           Jerusalem Syndrome
1     LG_3_45   3/4/12  45              N               Paris Syndrome
2  HG_12_34_2   4/5/13  23              Y           Jerusalem Syndrome
3     KD_89_9   8/9/12  54              Y              Chronic Hiccups
4   KD_98_9_2   6/1/13  54              Y              Chronic Hiccups
5   LG_3_45_2   4/4/10  59              N  Dangerous Sneezing Syndrome