Python: parsing between two lines that share the same keyword
I know how to parse between two lines when the starting "target word" and the ending "target word" are different, e.g. if I want to parse between X and Y:
parse = False
for line in open(sys.argv[1]):
    if Y in line:
        parse = False
    if parse:
        print(line)
    if X in line:
        parse = True
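I'm stuck on a slightly different problem, where the word I want to parse between is the same word at both ends. In this example there are 4 different homolog groups, and I want to extract the human/mouse pairs within each group, so I want to turn this file: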
1:_HomoloGene:_141209.Gene_conserved_in_Mammals
LOC102724657 Homo_sapiens
Gm12569 Mus_musculus
2:_HomoloGene:_141208.Gene_conserved_in_Euarchontoglires
LOC102724737 Homo_sapiens
LOC102636216 Mus_musculus
3:_HomoloGene:_141152.Gene_conserved_in_Euarchontoglires
LOC728763 Homo_sapiens
E030010N07Rik Mus_musculus
E030010N09Rik Mus_musculus
E030010N010Rik Mus_musculus
E030010N08Rik Mus_musculus
LOC102551034 Rattus_norvegicus
4:_HomoloGene:_141054.Gene_conserved_in_Boreoeutheria
LOC102723572 Homo_sapiens
LOC102157295 Canis_lupus_familiaris
LOC102633228 Mus_musculus
into a Homo_sapiens / Mus_musculus comparison like this:
Homo_sapiens Mus_musculus
LOC102724657 Gm12569
LOC102724737 LOC102636216
LOC728763 E030010N07Rik
LOC728763 E030010N09Rik
LOC728763 E030010N010Rik
LOC728763 E030010N08Rik
LOC102723572 LOC102633228
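I don't have any nearly working code to show; this is one example of what I tried (I also tried regular expressions and splitting the lines on the word "HomoloGene"):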
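Thanks.

The commented code below produces the desired result on your example. To understand it, you may need to read up on the features it relies on, such as defaultdict, itertools.product, and argument-list unpacking.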
import sys
import re
from collections import defaultdict
import itertools

# define the pair of words we want to compare
compare = ['Homo_sapiens', 'Mus_musculus']

# define some regular expressions to split up the input data file
# this searches for a digit, a colon, and matches the rest of the line
group_re = re.compile(r"\n?\d+:.*\n")
# this matches non-whitespace, followed by whitespace, and then non-whitespace,
# returning the two non-whitespace sections
line_re = re.compile(r"(\S+)\s+(\S+)")

# to store our resulting comparisons
comparison = []

# open and read in the datafile
datafile = open(sys.argv[1]).read()

# use our regular expression to split the datafile into homolog groups
for dataset in group_re.split(datafile):
    # ignore empty matches
    if dataset.strip() == '':
        continue
    # split our group into lines
    dataset = dataset.split('\n')
    # use our regular expression to match each line, pulling out the two bits of data
    dataset = [line_re.match(line).groups() for line in dataset if line.strip() != '']
    # build a dictionary to store our words
    words = defaultdict(list)
    # loop through our group dataset, grouping each line by its word
    for v, k in dataset:
        words[k].append(v)
    # add the results to our output list. Note here we are unpacking an argument list
    comparison += itertools.product(*[words[w] for w in compare])

# print out the words we wanted to compare
print('\t'.join(compare))
# loop through our output dataset
for combination in comparison:
    # print each comparison, spaced with a tab character
    print('\t'.join(combination))
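The last step of the loop combines itertools.product with argument-list unpacking. As a minimal sketch of just that step, assuming a hand-built words dictionary (the sample values are copied from group 3 of the question's data, shortened to two mouse genes):

```python
from collections import defaultdict
import itertools

compare = ['Homo_sapiens', 'Mus_musculus']

# Hand-built stand-in for the per-group dictionary (illustrative subset of group 3).
words = defaultdict(list)
words['Homo_sapiens'].append('LOC728763')
words['Mus_musculus'].extend(['E030010N07Rik', 'E030010N09Rik'])
words['Rattus_norvegicus'].append('LOC102551034')

# product(*[a, b]) is the same call as product(a, b): every human gene is
# paired with every mouse gene; species not named in compare are ignored.
pairs = list(itertools.product(*[words[w] for w in compare]))

# If one of the species is missing from a group, the product is empty.
no_pairs = list(itertools.product(['LOC1'], []))

for human, mouse in pairs:
    print(human + '\t' + mouse)
```

Because words is a defaultdict, looking up a species that never appeared in the group simply gives an empty list, so such a group contributes no rows.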
This is a two-part problem. First extract the homolog groups into a dictionary, then iterate over the groups and print the pairs.
#!/usr/bin/env python3
import re

# Opens the text file
with open("genes.txt", "r") as f:
    data = {}
    # reads the lines
    for line in f:
        # When there is a : at the line start -> new group
        match = re.search(r"^([0-9]+):", line)
        if match:
            # extracts the group number and puts it to the dict
            group = match.group(1)
            # adds the species as entries with empty lists as values
            data[group] = {"Homo_sapiens": [], "Mus_musculus": []}
        else:
            # splits the line (also removes the \n)
            text = line.split()
            # skip blank or malformed lines; if the species is in the group,
            # add the gene name to the list
            if len(text) >= 2 and text[1] in data[group]:
                data[group][text[1]].append(text[0])

# Here you go with your parsed data
print(data)

# Now we feed it into the text format you want
print("Homo_sapiens\t\tMus_musculus")
# go through groups (dicts keep insertion order on Python 3.7+, so this is file order)
for gr in data:
    # go through the Hs genes
    for hs_gene in data[gr]["Homo_sapiens"]:
        # get all the associated Ms genes
        for ms_gene in data[gr]["Mus_musculus"]:
            # print the pairs
            print(hs_gene + "\t\t" + ms_gene)
Hope this helps.
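Since every header line both closes the previous group and opens the next one, the same idea can also be written as a small generator. This is only a sketch and not part of either answer: iter_groups and human_mouse_pairs are hypothetical helper names, and it assumes that header lines are the only lines starting with digits followed by a colon.

```python
import re

HEADER = re.compile(r"^\d+:")  # assumed header shape, e.g. "1:_HomoloGene:_..."

def iter_groups(lines):
    """Yield one list of (gene, species) tuples per homolog group."""
    current = None
    for line in lines:
        if HEADER.match(line):
            # A header ends the previous group and starts a new one.
            if current is not None:
                yield current
            current = []
        elif current is not None:
            parts = line.split()
            if len(parts) >= 2:  # ignore blank or malformed lines
                current.append((parts[0], parts[1]))
    if current is not None:
        yield current  # the last group has no header after it

def human_mouse_pairs(lines, first="Homo_sapiens", second="Mus_musculus"):
    """Return (first-species gene, second-species gene) pairs, group by group."""
    pairs = []
    for group in iter_groups(lines):
        a = [gene for gene, species in group if species == first]
        b = [gene for gene, species in group if species == second]
        pairs.extend((x, y) for x in a for y in b)
    return pairs

sample = """1:_HomoloGene:_141209.Gene_conserved_in_Mammals
LOC102724657 Homo_sapiens
Gm12569 Mus_musculus
4:_HomoloGene:_141054.Gene_conserved_in_Boreoeutheria
LOC102723572 Homo_sapiens
LOC102157295 Canis_lupus_familiaris
LOC102633228 Mus_musculus
""".splitlines()

for human, mouse in human_mouse_pairs(sample):
    print(human + "\t" + mouse)
```

To run it on a real file, an open file object can be passed in place of sample, since the helpers only iterate over lines.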
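Don't you think the group number could exceed 9? s/if match != None:/if match:/. You forgot to remove the old definition of group, so your code is still buggy.

Thanks. The group definition was indeed a leftover from an older version. if match != None isn't really wrong, though: if there is no match, the match object is None. I guess it's a matter of taste. Or is there a reason not to use != None that I haven't thought of?

No functional difference, but it is needlessly verbose and "unpythonic".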
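To illustrate the point made in these comments: re.search returns either a match object or None, and a match object is always truthy, so a plain truth test selects the same lines as an explicit comparison with None. A short sketch, using two lines from the question's data with the group number changed to a two-digit value for illustration:

```python
import re

lines = ["12:_HomoloGene:_141209.Gene_conserved_in_Mammals",
         "LOC102724657 Homo_sapiens"]

results = []
for line in lines:
    match = re.search(r"^([0-9]+):", line)
    # Both tests agree for every possible result of re.search.
    assert bool(match) == (match is not None)
    if match:
        results.append(("header", match.group(1)))
    else:
        results.append(("gene line", None))

print(results)
```

When an explicit comparison is wanted, PEP 8 recommends spelling it `is not None` rather than `!= None`.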