Python: compare text in two files and append text in a field
I have two files. File A looks like this:
ProbeID rsID chr bp strand alleleA alleleB
SNP_A-1780270 rs987435 7 78599583 - C G
SNP_A-1780271 rs345783 15 33395779 - C G
SNP_A-1780272 rs955894 1 189807684 - G T
SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T
SNP_A-1780285 rs6919430 6 90919465 + A C
SNP_A-1780286 rs41528453 --- --- --- A G
SNP_A-1780287 rs2342723 16 5748791 + C T
File B looks like this:
ProbeID call
SNP_A-1780270 2
SNP_A-1780271 0
SNP_A-1780272 2
SNP_A-1780274 1
SNP_A-1780277 0
SNP_A-1780278 2
SNP_A-1780283 2
SNP_A-1780285 2
SNP_A-1780286 0
SNP_A-1780287 0
I would like output that looks like this:
ProbeID call genotype
SNP_A-1780270 2 G G
SNP_A-1780271 0 C C
SNP_A-1780272 2 T T
SNP_A-1780274 1 A G
SNP_A-1780277 0 C C
SNP_A-1780278 2 T T
SNP_A-1780283 2 T T
SNP_A-1780285 2 C C
SNP_A-1780286 0 A A
SNP_A-1780287 0 C C
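Essentially, this matches the ProbeIDs in the two lists and checks the corresponding value in the call column of file B. When call = 0, the value of alleleA is printed twice in the adjacent column. When call = 1, the values of alleleA and alleleB are printed. When call = 2, the value of alleleB is printed twice.
Using a nested dictionary, you can probably do this quite easily: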
data = {}
with open(fileA) as fA:
    header = next(fA).split()
    attributes = header[1:]  # every column name except ProbeID
    for line in fA:
        lst = line.split()
        # map ProbeID -> {column name: value}
        data[lst[0]] = dict(zip(attributes, lst[1:]))
with open(fileB) as fB:
    header = next(fB).split()  # skip the header line of file B
    for line in fB:
        ID, call = line.split()
        data[ID]['call'] = int(call)
Now you can iterate over data and print only what you need.
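The iteration step is left open above, so here is one way it might look. This is only a sketch: `load` and `genotype` are hypothetical helper names, and `io.StringIO` buffers holding a few rows from the question stand in for the real files.

```python
import io

# A few sample rows copied from the question; StringIO stands in for real files.
FILE_A = io.StringIO(
    "ProbeID rsID chr bp strand alleleA alleleB\n"
    "SNP_A-1780270 rs987435 7 78599583 - C G\n"
    "SNP_A-1780271 rs345783 15 33395779 - C G\n"
    "SNP_A-1780274 rs6088791 20 33907909 - A G\n"
)
FILE_B = io.StringIO(
    "ProbeID call\n"
    "SNP_A-1780270 2\n"
    "SNP_A-1780271 0\n"
    "SNP_A-1780274 1\n"
)


def load(file_a, file_b):
    """Build the nested dictionary keyed by ProbeID (hypothetical helper)."""
    data = {}
    attributes = next(file_a).split()[1:]
    for line in file_a:
        fields = line.split()
        data[fields[0]] = dict(zip(attributes, fields[1:]))
    next(file_b)  # skip the header line of file B
    for line in file_b:
        probe_id, call = line.split()
        data[probe_id]["call"] = int(call)
    return data


def genotype(record):
    """Map call 0/1/2 to the allele pair described in the question."""
    a, b = record["alleleA"], record["alleleB"]
    pairs = {0: (a, a), 1: (a, b), 2: (b, b)}
    return " ".join(pairs[record["call"]])


data = load(FILE_A, FILE_B)
rows = [(probe_id, rec["call"], genotype(rec)) for probe_id, rec in data.items()]

print("ProbeID call genotype")
for probe_id, call, geno in rows:
    print(probe_id, call, geno)
```

A ProbeID that appears in file A but not in file B would have no 'call' key and raise a KeyError in `genotype`, so real data may need a check for that case.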
Alternatively, if the lines of the two files correspond exactly, you can process them one line at a time with zip (on Python 2, use itertools.izip instead, so that the files are not read into memory all at once):
with open(fileA) as fA, open(fileB) as fB:
    header_a = next(fA).split()
    header_b = next(fB).split()
    attrib_a = header_a[1:]
    attrib_b = header_b[1:]
    for line_a, line_b in zip(fA, fB):
        dat_a = line_a.split()
        dat_b = line_b.split()
        assert dat_a[0] == dat_b[0]  # make sure they're the same ID
        dat = dict(zip(attrib_a, dat_a[1:]))
        dat.update(zip(attrib_b, dat_b[1:]))
        # 'call' is still a string here, so compare against '0', '1', '2'
        if dat['call'] == '0':
            print(dat_a[0], dat['call'], dat['alleleA'], dat['alleleA'])
        elif dat['call'] == '1':
            print(dat_a[0], dat['call'], dat['alleleA'], dat['alleleB'])
        elif dat['call'] == '2':
            print(dat_a[0], dat['call'], dat['alleleB'], dat['alleleB'])
        else:
            raise AssertionError("Unknown call")
Using pandas yields:
call alleleA alleleB
ProbeID
SNP_A-1780270 2 G G
SNP_A-1780271 0 C C
SNP_A-1780272 2 T T
SNP_A-1780274 1 A G
SNP_A-1780277 0 C C
SNP_A-1780278 2 T T
SNP_A-1780283 2 T T
SNP_A-1780285 2 C C
SNP_A-1780286 0 A A
SNP_A-1780287 0 C C
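The code that produced this table is not included here. A table of this shape could plausibly be built along the following lines. Treat it as a sketch under assumptions rather than the original answer's code: `io.StringIO` buffers with three rows from the question stand in for the files, and the variable names are illustrative.

```python
import io

import pandas as pd

# Three sample rows from the question; StringIO stands in for real files.
file_a = io.StringIO(
    "ProbeID rsID chr bp strand alleleA alleleB\n"
    "SNP_A-1780270 rs987435 7 78599583 - C G\n"
    "SNP_A-1780271 rs345783 15 33395779 - C G\n"
    "SNP_A-1780274 rs6088791 20 33907909 - A G\n"
)
file_b = io.StringIO(
    "ProbeID call\n"
    "SNP_A-1780270 2\n"
    "SNP_A-1780271 0\n"
    "SNP_A-1780274 1\n"
)

# The columns are separated by a variable number of spaces, hence the regex.
A = pd.read_csv(file_a, sep=r"\s+").set_index("ProbeID")
B = pd.read_csv(file_b, sep=r"\s+").set_index("ProbeID")

# Concatenating along axis=1 lines the rows up by ProbeID.
C = pd.concat([A, B], axis=1)

# call == 0: both alleles are alleleA; call == 2: both are alleleB.
# Rows with call == 1 are left untouched (alleleA, alleleB).
C.loc[C["call"] == 0, "alleleB"] = C["alleleA"]
C.loc[C["call"] == 2, "alleleA"] = C["alleleB"]

result = C[["call", "alleleA", "alleleB"]]
print(result)
```

Assigning through `.loc` with a boolean mask updates `C` itself; chained indexing such as `C['alleleB'][idx] = ...` may instead operate on a temporary copy.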
If you have many B files, you might use something like the following:
import pandas as pd

A = pd.read_csv('FileA', delimiter=r'\s+')
A = A.set_index(['ProbeID'])
BFiles = ['FileB1', 'FileB2', 'FileB3']
for i, bfile in enumerate(BFiles):
    B = pd.read_csv(bfile, delimiter=r'\s+')  # read the current B file
    B = B.set_index(['ProbeID'])
    C = pd.concat([A, B], axis=1)  # align rows on ProbeID
    # call == 0: both alleles are alleleA
    idx = C['call'] == 0
    C.loc[idx, 'alleleB'] = C.loc[idx, 'alleleA']
    # call == 2: both alleles are alleleB
    idx = C['call'] == 2
    C.loc[idx, 'alleleA'] = C.loc[idx, 'alleleB']
    cfile = 'FileC{i}'.format(i=i)
    with open(cfile, 'w') as f:
        # write the table as text; a DataFrame cannot be passed to f.write directly
        f.write(C[['call', 'alleleA', 'alleleB']].to_string())
Change cfile to whatever is appropriate.
Here is an R solution:
my.data <- merge(df1, df2, by = "ProbeID")
# select rows based on call
zero <- my.data$call == 0
one <- my.data$call == 1
two <- my.data$call == 2
# subset rows based on previous condition and calculate genotype
my.data[zero, "genotype"] <- paste(my.data$alleleA[zero], my.data$alleleA[zero], sep = " ")
my.data[one, "genotype"] <- paste(my.data$alleleA[one], my.data$alleleB[one], sep = " ")
my.data[two, "genotype"] <- paste(my.data$alleleB[two], my.data$alleleB[two], sep = " ")
my.data[, c("ProbeID", "call", "genotype")]
ProbeID call genotype
1 SNP_A-1780270 2 G G
2 SNP_A-1780271 0 C C
3 SNP_A-1780272 2 T T
4 SNP_A-1780274 1 A G
5 SNP_A-1780277 0 C C
6 SNP_A-1780278 2 T T
7 SNP_A-1780283 2 T T
8 SNP_A-1780285 2 C C
9 SNP_A-1780286 0 A A
10 SNP_A-1780287 0 C C
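At some point I should learn to use this magical pandas I keep hearing about…
@mgilson: I'm only teaching myself. It's fun, and potentially useful.
@unutbu I tried using this with my information stored in files, but it didn't work. Also, if there are many file Bs that I want to do this for, writing to a different output file each time, how could that be done?
@jules: The file uses a variable number of spaces to separate the data. Fortunately, pandas.read_csv can accept a regular-expression pattern as the delimiter, so you can read file A into a DataFrame with pd.read_csv('FileA', delimiter=r'\s+').
I tried using the second one when I have many files that can take the place of file B. Essentially, I used import glob; list = glob.glob('/*.txt'); for filename in list: but I get an error at header_b = next(fB).split(): StopIteration. How can I fix this?
@jules: As far as I can tell, that means one of the files is completely empty. You can put it in a try/except clause: in this case you would use except StopIteration: continue, which carries on with the for loop over the files without doing anything with the problematic file.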
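The try/except suggestion from the last comment might look roughly like this. It is a sketch only: the temporary directory, the file names `b1.txt` and `empty.txt`, and the `results` and `skipped` containers are made up for the demonstration.

```python
import glob
import os
import tempfile

results = {}   # file name -> {ProbeID: call}
skipped = []   # names of files that turned out to be empty

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical input: one normal B file and one completely empty file.
    with open(os.path.join(tmp, "b1.txt"), "w") as f:
        f.write("ProbeID call\nSNP_A-1780270 2\nSNP_A-1780271 0\n")
    open(os.path.join(tmp, "empty.txt"), "w").close()

    for filename in sorted(glob.glob(os.path.join(tmp, "*.txt"))):
        name = os.path.basename(filename)
        with open(filename) as fB:
            try:
                # next() raises StopIteration when the file has no lines at all
                header_b = next(fB).split()
            except StopIteration:
                skipped.append(name)
                continue  # move on to the next file
            calls = {
                probe_id: int(call)
                for probe_id, call in (line.split() for line in fB if line.strip())
            }
        results[name] = calls

print(results)
print("skipped:", skipped)
```

Recording the skipped names, rather than skipping silently, makes it easier to see afterwards which inputs were empty.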