R中的xml2:从父级提取子级属性(所有属性的名称都相同)
我有以下xml,其中节点可以有相同的名称,但它们的属性可能不同R中的xml2:从父级提取子级属性(所有属性的名称都相同),r,xml2,R,Xml2,我有以下xml,其中节点可以有相同的名称,但它们的属性可能不同 <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5"> <protein> <sequence md5="6e7e4fcef214ab5c
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">
<protein>
<sequence md5="6e7e4fcef214ab5cf97714e899af0b96">MATLRAMLKNAFILFLFTLTIMAKTVFSQQCGTTGCAANLCCSRYGYCGTTDAYCGTGCRSGPCSSSTTPIPPTPSGGAGGLNADPRDTIENVVTPAFFDGIMSKVGNGCPAKGFYTRQAFIAAAQSFDAYKGTVAKREIAAMLAQFSHESGSFCYKEEIARGKYCSPSTAYPCTPGKDYYGRGPIQITWNYNYGAAGKFLGLPLLTDPDMVARSPQVAFQCAMWFWNLNVRPVLDQGFGATTRKINGGECNGRRPAAVQSRVNYYLEFCRTLGITPGANLSC</sequence>
<xref id="AT2G43620.1"/>
<matches>
<hmmer3-match evalue="1.5E-8" score="34.9">
<signature ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1">
<entry ac="IPR001002" desc="Chitin-binding, type 1" name="Chitin-bd_1" type="DOMAIN">
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0008061" name="chitin binding"/>
</entry>
<models>
<model ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1"/>
</models>
<signature-library-release library="PFAM" version="31.0"/>
</signature>
<locations>
<hmmer3-location env-end="64" env-start="31" score="34.9" evalue="1.5E-8" hmm-start="2" hmm-end="38" hmm-length="0" start="31" end="64"/>
</locations>
</hmmer3-match>
<hmmer3-match evalue="1.4E-46" score="159.2">
<signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19">
<entry ac="IPR000726" desc="Glycoside hydrolase, family 19, catalytic" name="Glyco_hydro_19_cat" type="DOMAIN">
<go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0016998" name="cell wall macromolecule catabolic process"/>
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0004568" name="chitinase activity"/>
<go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0006032" name="chitin catabolic process"/>
<pathway-xref db="MetaCyc" id="PWY-7822" name="Chitin degradation III (Serratia)"/>
<pathway-xref db="MetaCyc" id="PWY-6855" name="Chitin degradation I (archaea)"/>
<pathway-xref db="MetaCyc" id="PWY-6902" name="Chitin degradation II (Vibrio)"/>
<pathway-xref db="KEGG" id="00520+3.2.1.14" name="Amino sugar and nucleotide sugar metabolism"/>
</entry>
<models>
<model ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19"/>
</models>
<signature-library-release library="PFAM" version="31.0"/>
</signature>
<locations>
<hmmer3-location env-end="228" env-start="94" score="131.4" evalue="4.3E-38" hmm-start="2" hmm-end="155" hmm-length="0" start="94" end="228"/>
<hmmer3-location env-end="283" env-start="237" score="28.0" evalue="1.7E-6" hmm-start="185" hmm-end="232" hmm-length="0" start="237" end="283"/>
</locations>
</hmmer3-match>
</matches>
</protein>
</protein-matches>
看起来比赛增加了一倍,因为:
xml_find_all(hmmer3_match, "//*[name()='signature']")
#output
{xml_nodeset (2)}
[1] <signature ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1">\n <entry ac="IPR001002" desc="Chitin-binding, type 1" name="Chi" ...
[2] <signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19">\n <entry ac="IPR000726" desc="Glycoside hydrolase, family 19, catalytic" ...
我对这件事的了解非常有限,我觉得我遗漏了一些相当简单的东西
我正在寻找一种通用解决方案,它可以处理任意数量的hmmer3 match
节点,这些节点具有任意数量的go-xref
,位置
谢谢。使用相对XPath:
'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">
<protein>
<sequence md5="6e7e4fcef214ab5cf97714e899af0b96">MATLRAMLKNAFILFLFTLTIMAKTVFSQQCGTTGCAANLCCSRYGYCGTTDAYCGTGCRSGPCSSSTTPIPPTPSGGAGGLNADPRDTIENVVTPAFFDGIMSKVGNGCPAKGFYTRQAFIAAAQSFDAYKGTVAKREIAAMLAQFSHESGSFCYKEEIARGKYCSPSTAYPCTPGKDYYGRGPIQITWNYNYGAAGKFLGLPLLTDPDMVARSPQVAFQCAMWFWNLNVRPVLDQGFGATTRKINGGECNGRRPAAVQSRVNYYLEFCRTLGITPGANLSC</sequence>
<xref id="AT2G43620.1"/>
<matches>
<hmmer3-match evalue="1.5E-8" score="34.9">
<signature ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1">
<entry ac="IPR001002" desc="Chitin-binding, type 1" name="Chitin-bd_1" type="DOMAIN">
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0008061" name="chitin binding"/>
</entry>
<models>
<model ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1"/>
</models>
<signature-library-release library="PFAM" version="31.0"/>
</signature>
<locations>
<hmmer3-location env-end="64" env-start="31" score="34.9" evalue="1.5E-8" hmm-start="2" hmm-end="38" hmm-length="0" start="31" end="64"/>
</locations>
</hmmer3-match>
<hmmer3-match evalue="1.4E-46" score="159.2">
<signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19">
<entry ac="IPR000726" desc="Glycoside hydrolase, family 19, catalytic" name="Glyco_hydro_19_cat" type="DOMAIN">
<go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0016998" name="cell wall macromolecule catabolic process"/>
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0004568" name="chitinase activity"/>
<go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0006032" name="chitin catabolic process"/>
<pathway-xref db="MetaCyc" id="PWY-7822" name="Chitin degradation III (Serratia)"/>
<pathway-xref db="MetaCyc" id="PWY-6855" name="Chitin degradation I (archaea)"/>
<pathway-xref db="MetaCyc" id="PWY-6902" name="Chitin degradation II (Vibrio)"/>
<pathway-xref db="KEGG" id="00520+3.2.1.14" name="Amino sugar and nucleotide sugar metabolism"/>
</entry>
<models>
<model ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19"/>
</models>
<signature-library-release library="PFAM" version="31.0"/>
</signature>
<locations>
<hmmer3-location env-end="228" env-start="94" score="131.4" evalue="4.3E-38" hmm-start="2" hmm-end="155" hmm-length="0" start="94" end="228"/>
<hmmer3-location env-end="283" env-start="237" score="28.0" evalue="1.7E-6" hmm-start="185" hmm-end="232" hmm-length="0" start="237" end="283"/>
</locations>
</hmmer3-match>
</matches>
</protein>
</protein-matches>' -> txt
'
MatlRamlknafilflftltimaktVfsqqqcgtgCaanlcsrygycgtdAyCgCgCgCrsCsSsTpPtpSgAggregatdPgTienVvTpPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgPgTgTgTgPgTgTgTgTgTgTgTgPgPgTgPgTgTgPgTgTgTgTfAfAfAfAfA
'->txt
实际代码:
library(purrr)
doc <- read_xml(txt)
xml_find_all(doc, ".//*[name()='hmmer3-match']") %>%
map(xml_find_all, ".//*[name()='signature']") -> sig
sig
## [[1]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00187" desc="Chitin recognition protein" name="Chit ...
##
## [[2]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_1 ...
hmmer <- xml_find_all(doc, ".//*[name()='hmmer3-match']")
sig <- lapply(hmmer, xml_find_all, ".//*[name()='signature']")
sig
## [[1]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00187" desc="Chitin recognition protein" name="Chit ...
##
## [[2]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_1 ...
hmmer <- xml_find_all(doc, ".//*[name()='hmmer3-match']")
sig <- list()
for (i in 1:length(hmmer)) {
sig_match <- xml_find_all(hmmer[[i]], ".//*[name()='signature']")
sig <- c(sig, sig_match)
}
sig
## [[1]]
## {xml_node}
## <signature ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1">
## [1] <entry ac="IPR001002" desc="Chitin-binding, type 1" name="Chitin-bd_ ...
## [2] <models>\n <model ac="PF00187" desc="Chitin recognition protein" na ...
## [3] <signature-library-release library="PFAM" version="31.0"/>
##
## [[2]]
## {xml_node}
## <signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19">
## [1] <entry ac="IPR000726" desc="Glycoside hydrolase, family 19, catalyti ...
## [2] <models>\n <model ac="PF00182" desc="Chitinase class I" name="Glyco ...
## [3] <signature-library-release library="PFAM" version="31.0"/>
库(purrr)
doc%
映射(xml\u find\u all,“.//*[name()='signature']”->sig
信号
## [[1]]
##{xml_nodeset(1)}
##[1]您可能知道为什么lappy
和for
循环方法不起作用吗?我用lappy
和for
扩展了答案。我没想到for
方法会产生xml\u节点
s而不是xml\u节点集
s,这三个目标几乎相当。感谢您的支持,以及我上一个问题中的打嗝套件建议。在这个项目中你帮了我很多。
'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">
<protein>
<sequence md5="6e7e4fcef214ab5cf97714e899af0b96">MATLRAMLKNAFILFLFTLTIMAKTVFSQQCGTTGCAANLCCSRYGYCGTTDAYCGTGCRSGPCSSSTTPIPPTPSGGAGGLNADPRDTIENVVTPAFFDGIMSKVGNGCPAKGFYTRQAFIAAAQSFDAYKGTVAKREIAAMLAQFSHESGSFCYKEEIARGKYCSPSTAYPCTPGKDYYGRGPIQITWNYNYGAAGKFLGLPLLTDPDMVARSPQVAFQCAMWFWNLNVRPVLDQGFGATTRKINGGECNGRRPAAVQSRVNYYLEFCRTLGITPGANLSC</sequence>
<xref id="AT2G43620.1"/>
<matches>
<hmmer3-match evalue="1.5E-8" score="34.9">
<signature ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1">
<entry ac="IPR001002" desc="Chitin-binding, type 1" name="Chitin-bd_1" type="DOMAIN">
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0008061" name="chitin binding"/>
</entry>
<models>
<model ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1"/>
</models>
<signature-library-release library="PFAM" version="31.0"/>
</signature>
<locations>
<hmmer3-location env-end="64" env-start="31" score="34.9" evalue="1.5E-8" hmm-start="2" hmm-end="38" hmm-length="0" start="31" end="64"/>
</locations>
</hmmer3-match>
<hmmer3-match evalue="1.4E-46" score="159.2">
<signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19">
<entry ac="IPR000726" desc="Glycoside hydrolase, family 19, catalytic" name="Glyco_hydro_19_cat" type="DOMAIN">
<go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0016998" name="cell wall macromolecule catabolic process"/>
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0004568" name="chitinase activity"/>
<go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0006032" name="chitin catabolic process"/>
<pathway-xref db="MetaCyc" id="PWY-7822" name="Chitin degradation III (Serratia)"/>
<pathway-xref db="MetaCyc" id="PWY-6855" name="Chitin degradation I (archaea)"/>
<pathway-xref db="MetaCyc" id="PWY-6902" name="Chitin degradation II (Vibrio)"/>
<pathway-xref db="KEGG" id="00520+3.2.1.14" name="Amino sugar and nucleotide sugar metabolism"/>
</entry>
<models>
<model ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19"/>
</models>
<signature-library-release library="PFAM" version="31.0"/>
</signature>
<locations>
<hmmer3-location env-end="228" env-start="94" score="131.4" evalue="4.3E-38" hmm-start="2" hmm-end="155" hmm-length="0" start="94" end="228"/>
<hmmer3-location env-end="283" env-start="237" score="28.0" evalue="1.7E-6" hmm-start="185" hmm-end="232" hmm-length="0" start="237" end="283"/>
</locations>
</hmmer3-match>
</matches>
</protein>
</protein-matches>' -> txt
library(purrr)
doc <- read_xml(txt)
xml_find_all(doc, ".//*[name()='hmmer3-match']") %>%
map(xml_find_all, ".//*[name()='signature']") -> sig
sig
## [[1]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00187" desc="Chitin recognition protein" name="Chit ...
##
## [[2]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_1 ...
hmmer <- xml_find_all(doc, ".//*[name()='hmmer3-match']")
sig <- lapply(hmmer, xml_find_all, ".//*[name()='signature']")
sig
## [[1]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00187" desc="Chitin recognition protein" name="Chit ...
##
## [[2]]
## {xml_nodeset (1)}
## [1] <signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_1 ...
hmmer <- xml_find_all(doc, ".//*[name()='hmmer3-match']")
sig <- list()
for (i in 1:length(hmmer)) {
sig_match <- xml_find_all(hmmer[[i]], ".//*[name()='signature']")
sig <- c(sig, sig_match)
}
sig
## [[1]]
## {xml_node}
## <signature ac="PF00187" desc="Chitin recognition protein" name="Chitin_bind_1">
## [1] <entry ac="IPR001002" desc="Chitin-binding, type 1" name="Chitin-bd_ ...
## [2] <models>\n <model ac="PF00187" desc="Chitin recognition protein" na ...
## [3] <signature-library-release library="PFAM" version="31.0"/>
##
## [[2]]
## {xml_node}
## <signature ac="PF00182" desc="Chitinase class I" name="Glyco_hydro_19">
## [1] <entry ac="IPR000726" desc="Glycoside hydrolase, family 19, catalyti ...
## [2] <models>\n <model ac="PF00182" desc="Chitinase class I" name="Glyco ...
## [3] <signature-library-release library="PFAM" version="31.0"/>