Mapping data from a text file and creating a multidimensional array in MATLAB
I have just written a MATLAB file that reads data from a text file containing [Term] entries and generates vectors from it. The text file holds information about is_a relations and part_of relations (bioinformatics domain). The results are as follows:
s=
'[Term]'
'id: GO:0008150'
'name: biological_process'
'namespace: biological_process'
[1x180 char]
[1x445 char]
'[Term]'
'id: GO:0016740'
'name: transferase activity'
'namespace: molecular_function'
'xref: Reactome:REACT_25050 "Molybdenum ion transfer onto molybdopterin, Homo sapiens"'
'//is_a: GO:0003674 ! molecular_function'
'is_a: GO:0008150 ! molecular_function (added by Zaid, To be Removed Later)'
'//relationship: part_of GO:0008150 ! biological_process'
'[Term]'
'id: GO:0016787'
'name: hydrolase activity'
'namespace: molecular_function'
'xref: Reactome:REACT_110436 "Hydrolysis of phosphatidylcholine, Bos taurus"'
[1x92 char]
'xref: Reactome:REACT_87959 "Hydrolysis of phosphatidylcholine, Gallus gallus"'
'//is_a: GO:0003674 ! molecular_function'
'is_a: GO:0016740 ! molecular_function (added by Zaid, to be removed later)'
'relationship: part_of GO:0008150 ! biological_process'
'[Term]'
'id: GO:0006810'
'name: transport'
'namespace: biological_process'
'alt_id: GO:0015457'
'alt_id: GO:0015460'
[1x255 char]
'subset: goslim_aspergillus'
'synonym: "transport accessory protein activity" RELATED [GOC:mah]'
'is_a: GO:0016787 ! biological_process'
'relationship: part_of GO:0008150 ! biological_process'
'[Term]'
'id: GO:0006412'
'name: translation'
'namespace: biological_process'
'alt_id: GO:0006416'
[1x522 char]
'subset: gosubset_prok'
'synonym: "protamine kinase activity" NARROW []'
'is_a: GO:0016740 ! transferase activity'
'//relationship: part_of GO:0006464 ! cellular protein modification process'
'[Term]'
'id: GO:0016779'
'name: nucleotidyltransferase activity'
'namespace: molecular_function'
'is_a: GO:0016740 ! transferase activity'
'[Term]'
'id: GO:0004386'
'helicases, Xenopus tropicalis"'
[1x100 char]
'is_a: GO:0016787 ! hydrolase activity'
'[Term]'
'id: GO:0003774'
'name: motor activity'
'namespace: molecular_function'
[1x178 char]
'is_a: GO:0016787 ! hydrolase activity'
[1x110 char]
'[Term]'
'id: GO:0016298'
'name: lipase activity'
'namespace: molecular_function'
'holesterol ester + H2O -> cholesterol + fatty acid, Caenorhabditis elegans"'
'is_a: GO:0016787 ! hydrolase activity'
'[Term]'
'id: GO:0016192'
'name: vesicle-mediated transport'
'namespace: biological_process'
'alt_id: GO:0006899'
[1x429 char]
'subset: goslim_aspergillus'
'synonym: "vesicular transport" EXACT [GOC:mah]'
'is_a: GO:0006810 ! transport'
'[Term]'
'id: GO:0005215'
'name: transporter activity'
'namespace: molecular_function'
[1x92 char]
'//is_a: GO:0003674 ! molecular_function'
'is_a: GO:0006412 ! molecular_function (to be removed later)'
'relationship: part_of GO:0006810 ! transport'
'[Term]'
'id: GO:0030533'
'name: triplet codon-amino acid adaptor activity'
'namespace: molecular_function'
'is_a: GO:0004672 ! RNA binding (added by Zaid, to be removed later)'
'relationship: part_of GO:0005215 ! translation'
GO_Terms =
'GO:0008150'
'GO:0016740'
'GO:0016787'
'GO:0006810'
'GO:0006412'
'GO:0004672'
'GO:0016779'
'GO:0004386'
'GO:0003774'
'GO:0016298'
'GO:0016192'
'GO:0005215'
'GO:0030533'
is_a_relations =
'GO:0008150'
'GO:0016740'
'GO:0016787'
'GO:0008150'
'GO:0016740'
'GO:0016740'
'GO:0016787'
'GO:0016787'
'GO:0016787'
'GO:0006810'
'GO:0006412'
'GO:0004672'
part_of_relations =
'GO:0008150'
'GO:0008150'
'GO:0006810'
'GO:0016192'
'GO:0006810'
'GO:0005215'
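I want to collect this data in a multidimensional array whose first column is "GO_Term", second column is "is_a_relations", and third column is "part_of_relations".
The problem is that not every [Term] in the text file has values for the second and third columns (the is_a and part_of relations).
So how can I map each GO term to the is_a and part_of relations (if any) found in its own [Term] paragraph of the text file?
In this case you have to go term by term and build the map along the way: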
% find start and end positions of every [Term] marker in s
terms = [find(~cellfun('isempty', regexp(s, '\[Term\]'))); numel(s)+1];
% for every [Term] section, run the previously implemented regexps
% and save the results into a map - a cell array with 3 columns
map = cell(0,3);
for term=1:numel(terms)-1
    % extract the lines of a single [Term] section
    s_term = s(terms(term):terms(term+1)-1);
    % match regexps
    % generate the GO_Terms vector for this section
    tok = regexp(s_term, '^id: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    GO_Terms = cellfun(@(x)x{1}, {tok{idx}})';
    % generate the is_a relations vector for this section
    tok = regexp(s_term, '^is_a: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    is_a_relations = cellfun(@(x)x{1}, {tok{idx}})';
    % generate the part_of relations vector for this section
    tok = regexp(s_term, '^relationship: part_of (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    part_of_relations = cellfun(@(x)x{1}, {tok{idx}})';
    % map. note the end+1 - here we create a new map row. Only once!
    map{end+1,1} = GO_Terms;
    map{end, 2} = is_a_relations;
    map{end, 3} = part_of_relations;
end
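map is now a cell array with 3 columns. Some entries are empty, which means that this particular [Term] entry has no corresponding value.
Sounds good, but why does displaying the map array show only the cell dimensions ({1x1 cell}) and not the GO term (GO:*****)?
@Gloria check map{1,1} and possibly map{2,2} :) This is a good example of why you should read the documentation before you start programming.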
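The {1x1 cell} display raised in the comments comes from each map entry being a cell array itself, so a second level of brace indexing is needed to reach the strings. The following is a minimal sketch rather than the original answer's code. It assumes a two-term sample of s copied from the listing above, and the names nonEmpty, extract, starts, hasValue, and flat are hypothetical helpers introduced only for illustration. It also swaps in strcmp to locate the [Term] lines and the 'once' option of regexp, so every entry holds plain strings.

```matlab
% Two-term sample in the same shape as s: a column cell array of lines.
s = {
    '[Term]'
    'id: GO:0008150'
    'name: biological_process'
    '[Term]'
    'id: GO:0016787'
    'name: hydrolase activity'
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0016740 ! molecular_function (added by Zaid, to be removed later)'
    'relationship: part_of GO:0008150 ! biological_process'
    };

% Hypothetical helpers: keep non-empty cells, then pull the first token
% of every matching line as a plain character vector.
nonEmpty = @(c) c(~cellfun('isempty', c));
extract  = @(lines, pat) cellfun(@(t) t{1}, ...
    nonEmpty(regexp(lines, pat, 'tokens', 'once')), 'UniformOutput', false);

% Start index of every [Term] section, plus a sentinel one past the end.
starts = [find(strcmp(s, '[Term]')); numel(s) + 1];

% One row per term: id, is_a parents, part_of parents.
map = cell(numel(starts) - 1, 3);
for k = 1:numel(starts) - 1
    block = s(starts(k):starts(k+1) - 1);
    map{k, 1} = extract(block, '^id: (GO:\w*)');
    map{k, 2} = extract(block, '^is_a: (GO:\w*)');
    map{k, 3} = extract(block, '^relationship: part_of (GO:\w*)');
end

% Each entry is a cell array, so use two levels of braces to get a string.
firstId   = map{1, 1}{1};   % id of the first term
secondIsA = map{2, 2}{1};   % first is_a parent of the second term

% Logical mask of the entries that actually hold a value.
hasValue = ~cellfun('isempty', map);

% Optional flattened copy for display: unwrap entries holding exactly one
% string and show missing entries as ''. Entries with several values stay
% nested as cell arrays.
flat = map;
single = cellfun(@numel, map) == 1;
flat(single) = cellfun(@(c) c{1}, map(single), 'UniformOutput', false);
flat(~hasValue) = {''};
disp(flat)
```

Because of the ^ anchor, the commented-out '//is_a:' line is skipped, matching the behaviour of the patterns in the answer. Keeping the nested cells in map, and flattening only a copy for display, leaves room for terms that have more than one is_a or part_of relation.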