Mapping data from a text file and creating a multidimensional array in MATLAB


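I have just created a MATLAB file that reads data from a text file containing [Term] entries and generates vectors from it. The text file holds information about is_a and part_of relationships (bioinformatics domain).

The result is as follows: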

s= 



   '[Term]'    
    'id: GO:0008150'
    'name: biological_process'
    'namespace: biological_process'
    [1x180 char]
    [1x445 char]

    '[Term]'    
    'id: GO:0016740'
    'name: transferase activity'
    'namespace: molecular_function'
    'xref: Reactome:REACT_25050 "Molybdenum ion transfer onto molybdopterin, Homo sapiens"'
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0008150 ! molecular_function (added by Zaid, To be Removed Later)'
    '//relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0016787'
    'name: hydrolase activity'
    'namespace: molecular_function'
    'xref: Reactome:REACT_110436 "Hydrolysis of phosphatidylcholine, Bos taurus"'
    [1x92  char]
    'xref: Reactome:REACT_87959 "Hydrolysis of phosphatidylcholine, Gallus gallus"'
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0016740 ! molecular_function (added by Zaid, to be removed later)'
    'relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0006810'
    'name: transport'
    'namespace: biological_process'
    'alt_id: GO:0015457'
    'alt_id: GO:0015460'
    [1x255 char]
    'subset: goslim_aspergillus'
    'synonym: "transport accessory protein activity" RELATED [GOC:mah]'
    'is_a: GO:0016787 ! biological_process'
    'relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0006412'
    'name: translation'
    'namespace: biological_process'
    'alt_id: GO:0006416'
    [1x522 char]
    'subset: gosubset_prok'
    'synonym: "protamine kinase activity" NARROW []'
    'is_a: GO:0016740 ! transferase activity'
    '//relationship: part_of GO:0006464 ! cellular protein modification process'

    '[Term]'    
    'id: GO:0016779'
    'name: nucleotidyltransferase activity'
    'namespace: molecular_function'
    'is_a: GO:0016740 ! transferase activity'

    '[Term]'    
    'id: GO:0004386'
    'helicases, Xenopus tropicalis"'
    [1x100 char]
    'is_a: GO:0016787 ! hydrolase activity'

    '[Term]'    
    'id: GO:0003774'
    'name: motor activity'
    'namespace: molecular_function'
    [1x178 char]
    'is_a: GO:0016787 ! hydrolase activity'
    [1x110 char]

    '[Term]'    
    'id: GO:0016298'
    'name: lipase activity'
    'namespace: molecular_function'
    'holesterol ester + H2O -> cholesterol + fatty acid, Caenorhabditis elegans"'
    'is_a: GO:0016787 ! hydrolase activity'

    '[Term]'    
    'id: GO:0016192'
    'name: vesicle-mediated transport'
    'namespace: biological_process'
    'alt_id: GO:0006899'
    [1x429 char]
    'subset: goslim_aspergillus'
    'synonym: "vesicular transport" EXACT [GOC:mah]'
    'is_a: GO:0006810 ! transport'

    '[Term]'    
    'id: GO:0005215'
    'name: transporter activity'
    'namespace: molecular_function'
    [1x92  char]
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0006412 ! molecular_function (to be removed later)'
    'relationship: part_of GO:0006810 ! transport'

    '[Term]'    
    'id: GO:0030533'
    'name: triplet codon-amino acid adaptor activity'
    'namespace: molecular_function'
    'is_a: GO:0004672 ! RNA binding (added by Zaid, to be removed later)'
    'relationship: part_of GO:0005215 ! translation'





GO_Terms = 

    'GO:0008150'
    'GO:0016740'
    'GO:0016787'
    'GO:0006810'
    'GO:0006412'
    'GO:0004672'
    'GO:0016779'
    'GO:0004386'
    'GO:0003774'
    'GO:0016298'
    'GO:0016192'
    'GO:0005215'
    'GO:0030533'


is_a_relations = 

    'GO:0008150'
    'GO:0016740'
    'GO:0016787'
    'GO:0008150'
    'GO:0016740'
    'GO:0016740'
    'GO:0016787'
    'GO:0016787'
    'GO:0016787'
    'GO:0006810'
    'GO:0006412'
    'GO:0004672'


part_of_relations = 

    'GO:0008150'
    'GO:0008150'
    'GO:0006810'
    'GO:0016192'
    'GO:0006810'
    'GO:0005215'
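I want to collect this data in a multidimensional array whose first column is "GO_Term", whose second column is "is_a_relations", and whose third column is "part_of_relations".

The problem is that not every [Term] in the text file has values for the second and third columns (the is_a and part_of relations).
So how can I map each GO term to the is_a and part_of relations (if any) found in its own [Term] paragraph of the text file?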

In that case you have to go through the file term by term and build the map along the way:

% find start and end positions of every [Term] marker in s 
terms = [find(~cellfun('isempty', regexp(s, '\[Term\]'))); numel(s)+1];

% for every [Term] section, run the previously implemented regexps
% and save the results into a map - a cell array with 3 columns
map = cell(0,3);
for term=1:numel(terms)-1
    % extract single [Term]  data
    s_term = s(terms(term):terms(term+1)-1);

    % match regexps
    %To generate the GO_Terms vector from the text file
    tok = regexp(s_term, '^id: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok); 
    GO_Terms   = cellfun(@(x)x{1}, {tok{idx}})';

    %To generate the is_a relations vector from the text file
    tok = regexp(s_term, '^is_a: (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    is_a_relations  = cellfun(@(x)x{1}, {tok{idx}})';

    %To generate the part_of relations vector from the text file
    tok = regexp(s_term, '^relationship: part_of (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    part_of_relations = cellfun(@(x)x{1}, {tok{idx}})';

    % fill the map: end+1 creates a new row, so it is used only once per term
    map{end+1,1} = GO_Terms;
    map{end,  2} = is_a_relations;
    map{end,  3} = part_of_relations;
end

map is now a cell array with 3 columns. Some entries are empty, which means that the corresponding [Term] entry has no value for that column.
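To make the empty entries concrete, here is a minimal sketch of the same term-by-term idea run on a small made-up input. The toy cell array s, the patterns list, and the has_is_a / has_part_of variables are illustrative assumptions and not part of the original code. Unlike the loop above, this variant stores plain strings in each map entry instead of nested token cells.

```matlab
% Toy input (hypothetical, shortened from the listing in the question)
s = {'[Term]'; 'id: GO:0008150'; 'name: biological_process'; ...
     '[Term]'; 'id: GO:0016740'; 'is_a: GO:0008150 ! example'; ...
     '[Term]'; 'id: GO:0016787'; 'is_a: GO:0016740 ! example'; ...
     'relationship: part_of GO:0008150 ! biological_process'};

% Row index of every [Term] marker, plus a sentinel one past the end
terms = [find(~cellfun('isempty', regexp(s, '\[Term\]'))); numel(s)+1];

% One pattern per map column: id, is_a, part_of
patterns = {'^id: (GO:\w*)', '^is_a: (GO:\w*)', ...
            '^relationship: part_of (GO:\w*)'};

map = cell(numel(terms)-1, 3);
for t = 1:numel(terms)-1
    % Lines belonging to this [Term] section only
    s_term = s(terms(t):terms(t+1)-1);
    for c = 1:3
        tok = regexp(s_term, patterns{c}, 'tokens', 'once');
        tok = tok(~cellfun('isempty', tok));     % keep matching lines only
        % Unwrap each token so the entry is a cell array of strings
        map{t,c} = cellfun(@(x) x{1}, tok, 'UniformOutput', false);
    end
end

% Logical masks telling which terms actually have relations
has_is_a    = ~cellfun('isempty', map(:,2));
has_part_of = ~cellfun('isempty', map(:,3));
```

In this toy input the first term has neither relation, so map{1,2} and map{1,3} stay empty, while the third term has both. The masks give a quick way to select only the rows that carry a given relation.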

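Sounds good, but why does displaying the map array show only cell dimensions such as {1x1 cell} instead of the GO terms (GO:*****)?

@Gloria Check map{1,1} and perhaps map{2,2}. This is a good example. You should read the documentation before you start programming.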
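The {1x1 cell} display appears because each map entry is itself a cell array, and MATLAB prints only the size of nested cells. The sketch below assumes a small hand-written map in that nested shape (the values are made up for illustration) and shows how to index into an entry and how to build a flattened copy, here called flat (a hypothetical name), that prints as readable text.

```matlab
% Hypothetical two-row map in the nested shape produced by the loop above
map = { {'GO:0016740'}, {'GO:0008150'}, {}; ...
        {'GO:0016787'}, {'GO:0016740'}, {'GO:0008150'} };

inner = map{2,3};       % still a cell: this is what displays as {1x1 cell}
id    = map{2,3}{1};    % a second level of indexing yields the string

% Build a flat copy in which every entry is plain text
flat = cell(size(map));
for k = 1:numel(map)
    if isempty(map{k})
        flat{k} = '';                               % term has no such relation
    else
        flat{k} = strjoin(map{k}(:)', ', ');        % join multiple IDs
    end
end

disp(flat)
```

Joining with a comma keeps terms that have several is_a or part_of relations on a single row, at the cost of having to split the text again if the individual IDs are needed later.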