Python: repeatedly extracting structured data between two words
The input is defined as follows:
SEQUENCE
ATTCGGTCTAATGACGGACGCTCTA
423575
user_name
029708252
END
SEQUENCE
GCAAGTCTAATGACGGACGCTCTGA
423600
user_name2
03276541
SEQUENCE
GTAAGATCTAATGACGGACGCTCCA
423625
user_name3
00923408271
END
SEQUENCE
GGCTATTAAGGGGTCGGACGCTCGC
423650
user_name4
00923408271
SEQUENCE
GTAACTAAACTTTAACGGACGCTCC
423675
user_name5
0653053443
END
SEQUENCE
The data is structured as follows:
SEQUENCE
string1
number1
string2
number2
END
SEQUENCE
Or the same block without the END line: only sometimes is there an END before the next SEQUENCE.
I have thousands of such blocks to analyze. Is it possible to extract the data in the blocks to a txt file like the following?
ATTCGGTCTAATGACGGACGCTCTA 423575 user_name 029708252
GCAAGTCTAATGACGGACGCTCTGA 423600 user_name2 03276541
GTAAGATCTAATGACGGACGCTCCA 423625 user_name3 00923408271
GGCTATTAAGGGGTCGGACGCTCGC 423650 user_name4 00923408271
GTAACTAAACTTTAACGGACGCTCC 423675 user_name5 0653053443
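Each line holds the data from one of the blocks.
I tried sed to loop over all the blocks, but I only got alternate matches: sed -n -e '/SEQUENCE/,/SEQUENCE/p' input
I am open to exploring other languages such as Python.
I would use Python's re module for this: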
import re
data = '''SEQUENCE
ATTCGGTCTAATGACGGACGCTCTA
423575
user_name
029708252
END
SEQUENCE
GCAAGTCTAATGACGGACGCTCTGA
423600
user_name2
03276541
SEQUENCE
GTAAGATCTAATGACGGACGCTCCA
423625
user_name3
00923408271
END
SEQUENCE
GGCTATTAAGGGGTCGGACGCTCGC
423650
user_name4
00923408271
SEQUENCE
GTAACTAAACTTTAACGGACGCTCC
423675
user_name5
0653053443
END
SEQUENCE'''
for record in re.findall(r'SEQUENCE\n(.+)\n(.+)\n(.+)\n(.+)', data):
    print(*record, sep='\t')  # one tab-separated line per block
Output:
ATTCGGTCTAATGACGGACGCTCTA 423575 user_name 029708252
GCAAGTCTAATGACGGACGCTCTGA 423600 user_name2 03276541
GTAAGATCTAATGACGGACGCTCCA 423625 user_name3 00923408271
GGCTATTAAGGGGTCGGACGCTCGC 423650 user_name4 00923408271
GTAACTAAACTTTAACGGACGCTCC 423675 user_name5 0653053443
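Explanation: I use a pattern with capturing groups. In Python's re, a dot (.) by default matches anything except a newline, so each (.+) captures one whole line and the pattern captures the four lines after every SEQUENCE. With a pattern like this, re.findall returns a list of 4-tuples, so I unpack each one with * and tell print to use \t as the separator.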
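As a rough sketch of the same idea packaged for reuse, the snippet below anchors the pattern to line boundaries with re.MULTILINE and returns one tab-separated line per block. The names BLOCK and blocks_to_lines and the shortened sample are illustrative assumptions, not part of the answer above.

```python
import re

# Four non-empty lines following a line that is exactly "SEQUENCE".
# re.MULTILINE lets ^ and $ match at the start and end of every line.
BLOCK = re.compile(r'^SEQUENCE\n(.+)\n(.+)\n(.+)\n(.+)$', re.MULTILINE)


def blocks_to_lines(text):
    """Return one tab-separated line per SEQUENCE block found in text."""
    return ['\t'.join(record) for record in BLOCK.findall(text)]


# Shortened sample: one block closed by END, one closed by the next SEQUENCE.
sample = (
    "SEQUENCE\nATTCGGTCTAATGACGGACGCTCTA\n423575\nuser_name\n029708252\nEND\n"
    "SEQUENCE\nGCAAGTCTAATGACGGACGCTCTGA\n423600\nuser_name2\n03276541\n"
    "SEQUENCE\n"
)

result = blocks_to_lines(sample)
print('\n'.join(result))
```

Joining the returned list with '\n' and writing it to a file would give the txt layout asked for in the question. Like the original pattern, this assumes every block has exactly four lines.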
I would first read the parts between SEQUENCE and END and store them, then feed them into a data frame:
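# f is the whole input as a single string; 'input.txt' is a placeholder file name
with open('input.txt') as fh:
    f = fh.read()

out = []
curr = []
lines = f.split('\n')
for l in lines:
    if "SEQ" in l or "END" in l:  # a marker line closes the current block
        if len(curr) > 0:
            out.append(curr)
        curr = []
    else:
        try:
            curr.append(int(l))  # numeric lines become ints
        except ValueError:
            curr.append(l)  # everything else stays a string
data = {"string1": [], "number1": [], "string2": [], "number2": []}
for case in out:
    if len(case) == 4:  # keep only well-formed four-line blocks
        data["string1"].append(case[0])
        data["string2"].append(case[2])
        data["number1"].append(case[1])
        data["number2"].append(case[3])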
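The result is a dictionary that you can use as a data frame in its own right, or convert directly into whichever data structure you prefer (numpy, pandas, etc.):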
{'string1': ['ATTCGGTCTAATGACGGACGCTCTA', 'GCAAGTCTAATGACGGACGCTCTGA', 'GTAAGATCTAATGACGGACGCTCCA', 'GGCTATTAAGGGGTCGGACGCTCGC', 'GTAACTAAACTTTAACGGACGCTCC'],
'number1': [423575, 423600, 423625, 423650, 423675],
'string2': ['user_name', 'user_name2', 'user_name3', 'user_name4', 'user_name5'],
'number2': [29708252, 3276541, 923408271, 923408271, 653053443]}
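Note that this script only picks up well-formed blocks of exactly four lines; all other entries are discarded. If that is not what you want, add an else branch after if len(case)==4:. Note also that int() drops leading zeros, which is why 029708252 appears as 29708252 in the result above.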
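One possible shape for that else branch is sketched below: malformed blocks are collected separately rather than dropped, and every value is kept as a string so leading zeros survive. The function names parse_blocks and split_blocks and the shortened sample are hypothetical, chosen only for illustration.

```python
def parse_blocks(text):
    """Collect the lines found between SEQUENCE/END marker lines."""
    blocks, curr = [], []
    for line in text.splitlines():
        line = line.strip()
        if line in ("SEQUENCE", "END"):
            if curr:                # a marker closes the block collected so far
                blocks.append(curr)
            curr = []
        elif line:
            curr.append(line)       # kept as str, so leading zeros survive
    if curr:                        # lines after the last marker, if any
        blocks.append(curr)
    return blocks


def split_blocks(blocks, size=4):
    """Separate well-formed blocks from the rest instead of dropping them."""
    good, bad = [], []
    for block in blocks:
        if len(block) == size:
            good.append(block)
        else:
            bad.append(block)       # the "else" branch: keep for inspection
    return good, bad


# Shortened sample: the second block is deliberately missing two lines.
sample = (
    "SEQUENCE\nATTCGGTCTAATGACGGACGCTCTA\n423575\nuser_name\n029708252\nEND\n"
    "SEQUENCE\nGCAAGTCTAATGACGGACGCTCTGA\n423600\n"
    "SEQUENCE\n"
)

good, bad = split_blocks(parse_blocks(sample))
print(good)
print(bad)
```

Comparing whole lines against the markers, rather than testing for a substring, also avoids treating a data line that merely contains "END" as a block boundary.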
Could you please try the following, written and tested with the shown samples in GNU awk. The comments explain each step:
awk -v RS="SEQUENCE\n" -v FS="\n|END" '   ## Start the awk program: records are separated by SEQUENCE plus a newline, fields by a newline or the END keyword.
{
  $1=$1                 ## Reassign the 1st field so the record is rebuilt using the new FS and OFS.
}
NF>1{                   ## Only act on records that have more than one field.
  sub(/ +$/,"")         ## Remove trailing spaces from the record.
  print                 ## Print the current record.
}
' Input_file            ## Input_file is the name of the input file.
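For readers more comfortable with Python, the record-separator idea can be mirrored roughly as below: split the text on "SEQUENCE\n" to get records, split each record on newlines to get fields, and keep records with more than one field. The function name awk_like and the shortened sample are assumptions made for illustration; unlike FS="\n|END", this sketch only drops lines that are exactly END, so an END occurring inside a data line is left alone.

```python
def awk_like(text):
    """Mimic RS="SEQUENCE\\n" with newline-separated fields, END lines dropped."""
    out = []
    for record in text.split("SEQUENCE\n"):          # one record per SEQUENCE
        fields = [f for f in record.split("\n")
                  if f and f not in ("END", "SEQUENCE")]
        if len(fields) > 1:                           # same role as NF>1
            out.append(" ".join(fields))              # OFS is a single space
    return out


# Shortened sample with one END-terminated block and one without END.
sample = (
    "SEQUENCE\nATTCGGTCTAATGACGGACGCTCTA\n423575\nuser_name\n029708252\nEND\n"
    "SEQUENCE\nGCAAGTCTAATGACGGACGCTCTGA\n423600\nuser_name2\n03276541\n"
    "SEQUENCE\n"
)

rows = awk_like(sample)
print("\n".join(rows))
```

Filtering a bare SEQUENCE field as well covers input whose final SEQUENCE line has no trailing newline.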
Try this simple approach of reading it as a text file, processing it, and writing it back out:
filename = 'sample.txt'
outfile = 'processed_sample.txt'
with open(filename) as f:
    content = [i.strip() for i in f.readlines()]  # read as a list and strip \n
content = [i for i in content if i != 'END' and i != 'SEQUENCE']  # remove SEQUENCE and END tokens
content = [' '.join(content[i:i + 4]) for i in range(0, len(content), 4)]  # join every 4 lines into one record
content
This gives you a list like the following:
['ATTCGGTCTAATGACGGACGCTCTA 423575 user_name 029708252',
'GCAAGTCTAATGACGGACGCTCTGA 423600 user_name2 03276541',
'GTAAGATCTAATGACGGACGCTCCA 423625 user_name3 00923408271',
'GGCTATTAAGGGGTCGGACGCTCGC 423650 user_name4 00923408271',
'GTAACTAAACTTTAACGGACGCTCC 423675 user_name5 0653053443']
Next, you can write it to another text file as follows:
with open(outfile, "w") as out:
    out.write("\n".join(content))
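With thousands of blocks, the same filter-then-chunk idea can also be written lazily, so the whole file never has to sit in a list. The sketch below is one way to do that with itertools.islice; the names MARKERS and iter_records are hypothetical, and io.StringIO stands in for an open file. As with the list version, fixed-size chunking assumes every block has exactly four data lines, so one malformed block would shift all the records after it.

```python
import io
from itertools import islice

MARKERS = {"SEQUENCE", "END"}


def iter_records(lines, size=4):
    """Yield one space-joined record per `size` data lines, reading lazily."""
    stripped = (line.strip() for line in lines)
    data = (s for s in stripped if s and s not in MARKERS)  # drop marker lines
    while True:
        chunk = list(islice(data, size))   # take the next `size` data lines
        if not chunk:                      # input exhausted
            return
        yield " ".join(chunk)


# io.StringIO stands in for an open file object in this sketch.
sample = io.StringIO(
    "SEQUENCE\nATTCGGTCTAATGACGGACGCTCTA\n423575\nuser_name\n029708252\nEND\n"
    "SEQUENCE\nGCAAGTCTAATGACGGACGCTCTGA\n423600\nuser_name2\n03276541\n"
    "SEQUENCE\n"
)

records = list(iter_records(sample))
print(records)
```

Passing a real file object instead of the StringIO and writing each yielded record followed by a newline would stream the input straight to the output file.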
Another option is to use grep and paste.
Comments:
What is the format of the input? A text file with newline-separated SEQUENCE, String1, number1…?
@Akshay A txt file with newlines.
@ankit7540, sorry, your sample of expected output is not clear; please add it more clearly and let us know.
Are there always two strings and two numbers, or could there be more or fewer? Is the required string always before your number? If so, I think you can parse it easily.
If I am right, I can assign the file contents to a Python variable using with open('input.txt', 'r') as myfile: data = myfile.readlines().
Interesting and simple answer, I did not know paste existed. Upvoted.