Finding matches between two files using Python


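I am analyzing sequencing data and have several candidate genes whose functions I need to find.

After editing an available human database, I want to compare my candidate genes against the database and output the function of each candidate gene.

I only have basic Python skills, so I thought a script might speed up the work of looking up the candidate genes' functions.

File 1, which contains the candidate genes, looks like this: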

Gene
AQP7
RLIM
SMCO3
COASY
HSPA6

The database, file2.csv, looks like this:

Gene   function 
PDCD6  Programmed cell death protein 6 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a

Desired output:

 Gene(from file1) ,function(matching from file2)

I tried the following code:

file1 = 'file1.csv'
file2 = 'file2.csv'
output = 'file3.txt'

with open(file1) as inf:
    match = set(line.strip() for line in inf)

with open(file2) as inf, open(output, 'w') as outf:
    for line in inf:
        if line.split(' ',1)[0] in match:
            outf.write(line)

All I get is an empty file.

I also tried using the set intersection function:

with open('file1.csv', 'r') as ref:
    with open('file2.csv','r') as com:
       with open('common_genes_function','w') as output:
           same = set(ref).intersection(com)
                print same

That did not work either.

Please help, otherwise I will have to do this by hand.

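I suggest using the pandas merge function. However, it requires a clear separator between the "Gene" and "function" columns. In my example I assume the separator is a tab: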

import pandas as pd

# File paths taken from the question; adjust them to your setup
filepath1 = 'file1.csv'
filepath2 = 'file2.csv'
filepath3 = 'file3.txt'

# Load both files as pandas DataFrames (tab-separated, as assumed above)
file1 = pd.read_csv(filepath1, sep='\t')
file2 = pd.read_csv(filepath2, sep='\t')

# An inner merge on the 'Gene' column keeps only the genes present in both files
file3 = pd.merge(file1, file2, how='inner', on=['Gene'], suffixes=['1', '2'])
file3.to_csv(filepath3, sep=',', index=False)
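Because file2 in the question contains repeated rows for the same gene, an inner merge would repeat those genes in the output. Below is a minimal sketch of one way to handle that. It assumes tab-separated input and that keeping the first function listed for each gene is acceptable; the in-memory sample data and the variable names are illustrative and do not come from the real files.

```python
import io
import pandas as pd

# Illustrative in-memory stand-ins for file1.csv and file2.csv (tab-separated).
file1_text = "Gene\nAQP7\nCDC2\n"
file2_text = (
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
)

candidates = pd.read_csv(io.StringIO(file1_text), sep="\t")
database = pd.read_csv(io.StringIO(file2_text), sep="\t")

# Keep one row per gene so repeated database entries do not repeat in the output.
database = database.drop_duplicates(subset="Gene")

# Inner merge keeps only candidate genes that also appear in the database.
merged = pd.merge(candidates, database, how="inner", on="Gene")
print(merged.to_csv(index=False))
```

If candidate genes that are missing from the database should still be listed, how="left" would keep them with an empty function.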

With basic Python, you could try the following:

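import re

# Build a gene -> function lookup from the database file
gene_function = {}
with open('file2.csv', 'r') as infile:
    lines = [line.strip() for line in infile.readlines()[1:]]  # skip the header
    for line in lines:
        match = re.search(r"(\w+)\s+(.*)", line)
        if match is None:  # skip blank or malformed lines
            continue
        gene = match.group(1)
        function = match.group(2)
        if gene not in gene_function:  # keep the first function listed for a gene
            gene_function[gene] = function

# Print the function of each candidate gene found in the database
with open('file1.csv', 'r') as infile:
    genes = [i.strip() for i in infile.readlines()[1:]]
    for gene in genes:
        if gene in gene_function:
            print("{}, {}".format(gene, gene_function[gene]))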
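The same lookup can also be sketched without regular expressions, producing rows in the Gene,function layout the question asks for. This is only a sketch: it assumes the gene name is the first whitespace-delimited token on each line of file2, and the function names load_functions and match_genes are made up for illustration.

```python
def load_functions(lines):
    """Map each gene to its first listed function; the header line is skipped."""
    gene_function = {}
    for line in list(lines)[1:]:
        parts = line.strip().split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            gene, function = parts
            gene_function.setdefault(gene, function)  # keep the first occurrence
    return gene_function


def match_genes(gene_lines, gene_function):
    """Return 'Gene,function' rows for candidate genes found in the mapping."""
    rows = []
    for line in list(gene_lines)[1:]:
        gene = line.strip()
        if gene in gene_function:
            rows.append("{},{}".format(gene, gene_function[gene]))
    return rows


# Illustrative data; in practice pass open('file2.csv') and open('file1.csv').
database = [
    "Gene   function",
    "PDCD6  Programmed cell death protein 6",
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a",
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a",
]
candidates = ["Gene", "AQP7", "PDCD6"]

for row in match_genes(candidates, load_functions(database)):
    print(row)
```

Splitting on whitespace rather than a single space also copes with tabs or runs of spaces between the two columns.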

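Have you tried looking at Python's csv module? It has many methods that make parsing CSV files easy. You could load the genes from file1 into an array and then match each item in the array against the data that the csv module loads into memory.

How do you associate the genes in file1 with the functions in file2? Are the CDC2 and PDCD genes in file1?

The genes in file1 should all appear in file2, because file2 is the complete human database. The data shown above is only part of the content.

In file2.csv, does each gene have a unique function?

Some genes are duplicated because they differ in other categories, but I removed the other categories because I only need the function. That is why you can see duplicated genes in the example above.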
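The csv-module suggestion from the comments could look roughly like the following. This is a sketch under assumptions the question does not confirm: that file2 is tab-delimited with a Gene and a function column, and that the first function listed for a duplicated gene is the one to keep. The helper name lookup_functions is hypothetical.

```python
import csv
import io


def lookup_functions(candidates_file, database_file, out_file):
    """Write 'Gene,function' rows for candidates found in the database."""
    # Read candidate genes, skipping the header row and blank lines.
    rows = list(csv.reader(candidates_file))
    genes = [row[0].strip() for row in rows[1:] if row]

    # Build gene -> function, keeping the first entry for duplicated genes.
    gene_function = {}
    for record in csv.DictReader(database_file, delimiter="\t"):
        gene_function.setdefault(record["Gene"], record["function"])

    writer = csv.writer(out_file)
    writer.writerow(["Gene", "function"])
    for gene in genes:
        if gene in gene_function:
            writer.writerow([gene, gene_function[gene]])


# Illustrative in-memory files; real use would pass open(..., newline="") handles.
file1 = io.StringIO("Gene\nAQP7\nPDCD6\n")
file2 = io.StringIO(
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
)
output = io.StringIO()
lookup_functions(file1, file2, output)
print(output.getvalue())
```

Using csv.writer for the output means a function description that itself contains commas is quoted automatically, so the result stays a valid two-column CSV.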