Python: compare text in two files and append text to a field

I have two files. File A looks like this:

ProbeID rsID    chr bp  strand  alleleA alleleB
SNP_A-1780270   rs987435    7   78599583    -   C   G
SNP_A-1780271   rs345783    15  33395779    -   C   G
SNP_A-1780272   rs955894    1   189807684   -   G   T
SNP_A-1780274   rs6088791   20  33907909    -   A   G
SNP_A-1780277   rs11180435  12  75664046    +   C   T
SNP_A-1780278   rs17571465  1   218890658   -   A   T
SNP_A-1780283   rs17011450  4   127630276   -   C   T
SNP_A-1780285   rs6919430   6   90919465    +   A   C
SNP_A-1780286   rs41528453  --- --- --- A   G
SNP_A-1780287   rs2342723   16  5748791 +   C   T

File B looks like this:

ProbeID call
SNP_A-1780270   2
SNP_A-1780271   0
SNP_A-1780272   2
SNP_A-1780274   1
SNP_A-1780277   0
SNP_A-1780278   2
SNP_A-1780283   2
SNP_A-1780285   2
SNP_A-1780286   0
SNP_A-1780287   0

I want output that looks like this:

ProbeID call    genotype
SNP_A-1780270   2   G G
SNP_A-1780271   0   C C
SNP_A-1780272   2   T T 
SNP_A-1780274   1   A G
SNP_A-1780277   0   C C
SNP_A-1780278   2   T T
SNP_A-1780283   2   T T 
SNP_A-1780285   2   C C
SNP_A-1780286   0   A A
SNP_A-1780287   0   C C

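Essentially, this matches the ProbeID across the two files and checks the corresponding value in the call column of file B. When call = 0, the value of alleleA is printed twice in the adjacent column. When call = 1, the values of alleleA and alleleB are printed. When call = 2, the value of alleleB is printed twice.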
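The rule described above can be written as one small function. This is only a minimal sketch: the name genotype and its argument names are hypothetical and do not come from the original post.

```python
def genotype(call, allele_a, allele_b):
    """Return the two-allele string for a call of 0, 1 or 2 (hypothetical helper)."""
    if call == 0:
        return f"{allele_a} {allele_a}"  # alleleA printed twice
    if call == 1:
        return f"{allele_a} {allele_b}"  # alleleA followed by alleleB
    if call == 2:
        return f"{allele_b} {allele_b}"  # alleleB printed twice
    raise ValueError(f"unknown call: {call!r}")


# First data row of the example: alleleA = C, alleleB = G, call = 2
print(genotype(2, "C", "G"))
```

Raising on any other value makes unexpected calls visible instead of silently producing an empty genotype.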

You can probably do this quite easily with a nested dictionary:

# fileA and fileB are the paths to the two input files
data = {}
with open(fileA) as fA:
    header = next(fA).split()
    attributes = header[1:]            # column names after ProbeID
    for line in fA:
        lst = line.split()
        data[lst[0]] = dict(zip(attributes, lst[1:]))

with open(fileB) as fB:
    header = next(fB).split()          # consume the header line
    for line in fB:
        ID, call = line.split()
        data[ID]['call'] = int(call)

Now you can iterate over data and print only what you need.
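The iteration step might look like the sketch below. To keep it runnable without real files, two sample rows from each file are inlined with io.StringIO; that, the PICK lookup table and the rows list are assumptions made for illustration, not part of the original answer.

```python
import io

# Two sample rows from each file, inlined so the sketch runs without real files.
file_a = io.StringIO(
    "ProbeID rsID chr bp strand alleleA alleleB\n"
    "SNP_A-1780270 rs987435 7 78599583 - C G\n"
    "SNP_A-1780274 rs6088791 20 33907909 - A G\n"
)
file_b = io.StringIO(
    "ProbeID call\n"
    "SNP_A-1780270 2\n"
    "SNP_A-1780274 1\n"
)

# Build the nested dictionary: ProbeID -> {column name: value}
data = {}
attributes = next(file_a).split()[1:]
for line in file_a:
    fields = line.split()
    data[fields[0]] = dict(zip(attributes, fields[1:]))

next(file_b)  # skip the header line
for line in file_b:
    probe_id, call = line.split()
    data[probe_id]["call"] = int(call)

# Which two columns make up the genotype for each call value.
PICK = {
    0: ("alleleA", "alleleA"),
    1: ("alleleA", "alleleB"),
    2: ("alleleB", "alleleB"),
}

rows = []
for probe_id, rec in data.items():
    first, second = PICK[rec["call"]]
    rows.append((probe_id, rec["call"], f"{rec[first]} {rec[second]}"))

print("ProbeID\tcall\tgenotype")
for probe_id, call, geno in rows:
    print(f"{probe_id}\t{call}\t{geno}")
```

Because the lookup is keyed by ProbeID, this version does not require the two files to list the probes in the same order.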

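Alternatively, if the rows of the two files correspond exactly, you can process them one line at a time with itertools.izip on Python 2, or with the plain built-in zip on Python 3: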

# Python 3: the built-in zip is lazy. On Python 2, use itertools.izip instead.
with open(fileA) as fA, open(fileB) as fB:
    header_a = next(fA).split()
    header_b = next(fB).split()
    attrib_a = header_a[1:]
    attrib_b = header_b[1:]
    for line_a, line_b in zip(fA, fB):
        dat_a = line_a.split()
        dat_b = line_b.split()
        assert dat_a[0] == dat_b[0]  # make sure they're the same ID
        dat = dict(zip(attrib_a, dat_a[1:]))
        dat.update(zip(attrib_b, dat_b[1:]))
        # call is still a string here, so compare against '0', '1', '2'
        if dat['call'] == '0':
            print(dat_a[0], dat['call'], dat['alleleA'], dat['alleleA'])
        elif dat['call'] == '1':
            print(dat_a[0], dat['call'], dat['alleleA'], dat['alleleB'])
        elif dat['call'] == '2':
            print(dat_a[0], dat['call'], dat['alleleB'], dat['alleleB'])
        else:
            raise AssertionError("Unknown call")

Using pandas yields:

               call alleleA alleleB
ProbeID                            
SNP_A-1780270     2       G       G
SNP_A-1780271     0       C       C
SNP_A-1780272     2       T       T
SNP_A-1780274     1       A       G
SNP_A-1780277     0       C       C
SNP_A-1780278     2       T       T
SNP_A-1780283     2       T       T
SNP_A-1780285     2       C       C
SNP_A-1780286     0       A       A
SNP_A-1780287     0       C       C
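The code that produced the table above is not shown here. A minimal pandas sketch that gives the same layout for a single pair of files could look like this; the inlined sample rows and the variable names are assumptions for illustration, and pandas must be installed.

```python
import io

import pandas as pd

# Three sample rows from each file, inlined so the sketch runs without real files.
file_a = io.StringIO(
    "ProbeID rsID chr bp strand alleleA alleleB\n"
    "SNP_A-1780270 rs987435 7 78599583 - C G\n"
    "SNP_A-1780271 rs345783 15 33395779 - C G\n"
    "SNP_A-1780274 rs6088791 20 33907909 - A G\n"
)
file_b = io.StringIO(
    "ProbeID call\n"
    "SNP_A-1780270 2\n"
    "SNP_A-1780271 0\n"
    "SNP_A-1780274 1\n"
)

# The columns are separated by a variable amount of whitespace.
A = pd.read_csv(file_a, sep=r"\s+").set_index("ProbeID")
B = pd.read_csv(file_b, sep=r"\s+").set_index("ProbeID")
C = pd.concat([A, B], axis=1)  # rows are aligned on ProbeID

# call == 0: both columns show alleleA; call == 2: both show alleleB.
C.loc[C["call"] == 0, "alleleB"] = C["alleleA"]
C.loc[C["call"] == 2, "alleleA"] = C["alleleB"]

result = C[["call", "alleleA", "alleleB"]]
print(result)
```

Rows with call == 1 are left untouched, so they keep alleleA and alleleB as read from file A.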

If you have many B files, you might use something like this:

import pandas as pd

A = pd.read_csv('FileA', delimiter=r'\s+')
A = A.set_index('ProbeID')

BFiles = ['FileB1', 'FileB2', 'FileB3']
for i, bfile in enumerate(BFiles):
    B = pd.read_csv(bfile, delimiter=r'\s+')    # read the current B file
    B = B.set_index('ProbeID')
    C = pd.concat([A, B], axis=1)               # align rows on ProbeID

    # .loc avoids chained assignment, which may not modify C
    idx = C['call'] == 0
    C.loc[idx, 'alleleB'] = C.loc[idx, 'alleleA']
    idx = C['call'] == 2
    C.loc[idx, 'alleleA'] = C.loc[idx, 'alleleB']
    cfile = 'FileC{i}'.format(i=i)
    # write the selected columns as tab-separated text
    C[['call', 'alleleA', 'alleleB']].to_csv(cfile, sep='\t')

Change cfile to whatever is appropriate.

Here is an R solution:

my.data <- merge(df1, df2, by = "ProbeID")

# select rows based on call
zero <- my.data$call == 0
one <- my.data$call == 1
two <- my.data$call == 2

# subset rows based on previous condition and calculate genotype
my.data[zero, "genotype"] <- paste(my.data$alleleA[zero], my.data$alleleA[zero], sep = " ")
my.data[one, "genotype"] <- paste(my.data$alleleA[one], my.data$alleleB[one], sep = " ")
my.data[two, "genotype"] <- paste(my.data$alleleB[two], my.data$alleleB[two], sep = " ")

my.data[, c("ProbeID", "call", "genotype")]


        ProbeID call genotype
1  SNP_A-1780270    2      G G
2  SNP_A-1780271    0      C C
3  SNP_A-1780272    2      T T
4  SNP_A-1780274    1      A G
5  SNP_A-1780277    0      C C
6  SNP_A-1780278    2      T T
7  SNP_A-1780283    2      T T
8  SNP_A-1780285    2      C C
9  SNP_A-1780286    0      A A
10 SNP_A-1780287    0      C C

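At some point I should learn to use this magical pandas I keep hearing about…

@mgilson: I'm just learning it myself. It's fun, and potentially useful.

@unutbu When I have the information stored in files, I tried using this but had no success. Also, if there are many B files that I want to do this for, writing the output to a different file each time, how can I do that?

@jules: The file uses a variable number of spaces to separate the data. Fortunately, pandas.read_csv can accept a regex pattern as the delimiter, so you can read file A into a DataFrame with pd.read_csv('FileA', delimiter=r'\s+').

When I have many files that can take the place of file B, I tried using the second approach. Essentially I used import glob, list = glob.glob('/*.txt') and for filename in list:, but I get an error at header_b = next(fB).split(): StopIteration. How can I fix this?

@jules - As far as I can tell, that means one of the files is completely empty. You can put that line in a try/except clause. In this case you would use except StopIteration: continue, which carries on with the for loop over the files without doing anything for the file that caused the problem.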
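The try/except suggestion from the last comment might be sketched as follows. The temporary directory, the file names b1.txt and b2.txt, and the results and skipped containers are hypothetical, chosen only so that the example runs on its own.

```python
import glob
import os
import tempfile

results = {}   # file name -> {ProbeID: call}
skipped = []   # names of files that turned out to be empty

# Demo data in a throwaway directory: one normal call file and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "b1.txt"), "w") as f:
        f.write("ProbeID call\nSNP_A-1780270 2\nSNP_A-1780271 0\n")
    open(os.path.join(tmp, "b2.txt"), "w").close()  # completely empty

    for filename in sorted(glob.glob(os.path.join(tmp, "*.txt"))):
        with open(filename) as fB:
            try:
                header_b = next(fB).split()  # StopIteration if the file is empty
            except StopIteration:
                skipped.append(os.path.basename(filename))
                continue  # move on to the next file
            calls = {}
            for line in fB:
                probe_id, call = line.split()
                calls[probe_id] = int(call)
            results[os.path.basename(filename)] = calls

print(results)
print("skipped:", skipped)
```

Recording the skipped names, rather than ignoring them silently, makes it easier to find out afterwards which input files were empty.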