XML file parsing: how to get started
I need some help parsing an XML file. This is my first time doing this kind of work, and I would appreciate any advice or help. I have a large file like this:
<Response success="true" start_row="0" num_rows="100" total_rows="100">
<ncbi-genes>
<ncbi-gene>
<acronym>Accn1</acronym>
<alias-tags>BNC1 BNaC1 ACIC2 ASIC2 Mdeg BNaC1a</alias-tags>
<data-sets>
<data-set>
<blue-channel nil="true"/>
<delegate type="boolean">true</delegate>
<specimen>
<chemotherapy nil="true"/>
<donor-id type="integer">9456</donor-id>
<donor>
<age-id type="integer">1</age-id>
<condition-description>TS26</condition-description>
<age>
<age-group-id type="integer">1</age-group-id>
<days type="float">18.5</days>
</age>
</donor>
</specimen>
<differential-expression-rankings type="array">
<differential-expression-ranking>
<structure>
<acronym>PPH</acronym>
<name>prepontine hindbrain</name>
</structure>
</differential-expression-ranking>
<differential-expression-ranking>
<structure>
<acronym>p3</acronym>
<name>prosomere 3</name>
</structure>
</differential-expression-ranking>
</differential-expression-rankings>
</data-set>
<data-set>
(...same fields as before...)
</data-set>
</data-sets>
</ncbi-gene>
</ncbi-genes>
</Response>
This loop doesn't work. I think the problem is the hyphen in ncbi-genes. I changed this field to NCBIGENES, and now the error is:
Not a HASH reference at xml_parser.pl line 19.
HASH(0x29d7ca0)
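Something goes wrong when I access the hash...
As I said, I'm new to this kind of data, and this is my first time using the XML modules, so any advice on how to get oriented would be greatly appreciated.
Thanks in advance.

Here is a quick example of parsing with XML::LibXML. XML::LibXML gives you easy access to XPath, an XML query language that lets you select sets of nodes by tag name, value, attributes, and/or relationship to other nodes. With XPath it is easy to pick out all x nodes under a y node, or all x nodes that have an attribute z and a child node id with value w, or similarly complex queries.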
use strict;
use warnings;
use feature qw(say);
use Data::Dumper;
use XML::LibXML;
my $tree = XML::LibXML->load_xml( IO => \*DATA );
## make sure that we have some genes!
die "Could not find any genes!" if ! $tree->exists('//ncbi-gene');
# for every 'ncbi-gene' node:
for my $gene ( $tree->findnodes('//ncbi-gene') ) {
my %data;
# is there an acronym as direct child of the node?
$data{acronym} = $gene->findvalue('acronym') if $gene->exists('acronym');
# find the donor age in days using the path specified
# to get the value of each node, run to_literal on it
$data{donor_age_days} = [ map { $_->to_literal }
$gene->findnodes('data-sets/data-set/specimen/donor/age/days') ];
# find all the 'name' nodes under a 'structure' node that is a descendant of $gene
$data{structures} = [ map { $_->to_literal }
$gene->findnodes('descendant::structure/name') ];
# this will find any 'name' node under a structure node anywhere in the tree
$data{all_structures} = [ map { $_->to_literal }
$gene->findnodes('//structure/name') ];
# an example of using findvalue on a query that matches several nodes: the
# text of all matched nodes is concatenated into a single string.
$data{acronyms_str} = [ $gene->findvalue('//structure/acronym') ];
say Dumper( \%data );
}
__DATA__
<Response success="true" start_row="0" num_rows="100" total_rows="100">
<ncbi-genes>
<ncbi-gene>
<acronym>Accn1</acronym>
<alias-tags>BNC1 BNaC1 ACIC2 ASIC2 Mdeg BNaC1a</alias-tags>
<data-sets>
<data-set>
<blue-channel nil="true"/>
<delegate type="boolean">true</delegate>
<specimen>
<chemotherapy nil="true"/>
<donor-id type="integer">9456</donor-id>
<donor>
<age-id type="integer">1</age-id>
<condition-description>TS26</condition-description>
<age>
<age-group-id type="integer">1</age-group-id>
<days type="float">18.5</days>
</age>
</donor>
</specimen>
<differential-expression-rankings type="array">
<differential-expression-ranking>
<structure>
<acronym>PPH</acronym>
<name>prepontine hindbrain</name>
</structure>
</differential-expression-ranking>
<differential-expression-ranking>
<structure>
<acronym>p3</acronym>
<name>prosomere 3</name>
</structure>
</differential-expression-ranking>
</differential-expression-rankings>
</data-set>
<data-set>
(...same fields as before...)
</data-set>
</data-sets>
</ncbi-gene>
<ncbi-favourite-places>
<structure>
<name>Eiffel Tower</name>
</structure>
</ncbi-favourite-places>
</ncbi-genes>
</Response>
Output:
$VAR1 = {
  'acronym' => 'Accn1',
  'donor_age_days' => [
    '18.5'
  ],
  'structures' => [
    'prepontine hindbrain',
    'prosomere 3'
  ],
  'acronyms_str' => [
    'PPHp3'
  ],
  'all_structures' => [
    'prepontine hindbrain',
    'prosomere 3',
    'Eiffel Tower'
  ]
};

There are some good XPath tutorials, which should come in handy when navigating XML documents. Note that libxml2, the library XML::LibXML is built on, only implements XPath 1.0.

This script doesn't have 26 lines; which line is the error on? Have you read the Status section of this module's documentation? Side note: use quotes, $genelist->{'ncbi-gene'}, to avoid having to rename elements.

@cucurbit: for user-friendliness, you may find it better to start with XML::LibXML.

"Large" is not meaningful on its own. It could mean anything between 1Mb and 100Gb. That is a big difference, and it may affect how you go about it.

It works really well, thank you very much. Now I'll try to understand all the lines. Is there a way to group the structures by age? I mean, I'd like to know which structures are included for each age. Maybe save the age as a key and build an array of structures for that key?

Yes. You will probably want to go through each data-set in turn, to make sure each age is associated with the right structures. I'll add an example to my answer.

Here is a quick example of collecting data for each data-set node:
for my $gene ( $tree->findnodes('//ncbi-gene') ) {
my $data;
for my $ds ( $gene->findnodes('data-sets/data-set')) {
# get the age in days -- assumes there is only one age per <data-set>
my $age = $ds->findvalue('specimen/donor/age/days');
# get the structures associated with that age
my @structures = map { $_->to_literal }
$ds->findnodes('descendant::structure/name');
# you can now save them however you like--e.g.
push @{$data->{$age}}, @structures;
}
}
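For anyone who wants to run the grouping end to end, here is a minimal self-contained sketch of the same idea. It is not part of the original answer: the %by_gene hash, the cut-down sample document, and the printing loop are illustrative assumptions. It also uses an XPath predicate, data-set[specimen/donor/age/days], so that data-sets with no recorded age are skipped instead of being filed under an empty key.

```perl
use strict;
use warnings;
use feature qw(say);
use XML::LibXML;

# Cut-down sample in the same shape as the question's file (illustrative data).
my $xml = <<'XML';
<Response success="true">
  <ncbi-genes>
    <ncbi-gene>
      <acronym>Accn1</acronym>
      <data-sets>
        <data-set>
          <specimen><donor><age><days type="float">18.5</days></age></donor></specimen>
          <differential-expression-rankings type="array">
            <differential-expression-ranking>
              <structure><acronym>PPH</acronym><name>prepontine hindbrain</name></structure>
            </differential-expression-ranking>
            <differential-expression-ranking>
              <structure><acronym>p3</acronym><name>prosomere 3</name></structure>
            </differential-expression-ranking>
          </differential-expression-rankings>
        </data-set>
        <data-set>
          <!-- no age recorded: skipped by the predicate below -->
          <differential-expression-rankings type="array"/>
        </data-set>
      </data-sets>
    </ncbi-gene>
  </ncbi-genes>
</Response>
XML

my $tree = XML::LibXML->load_xml( string => $xml );

# %by_gene (hypothetical name) maps: gene acronym => { age in days => [ structure names ] }
my %by_gene;

for my $gene ( $tree->findnodes('//ncbi-gene') ) {
    my $acronym = $gene->findvalue('acronym');

    # the predicate in [...] keeps only data-sets that actually contain an age
    for my $ds ( $gene->findnodes('data-sets/data-set[specimen/donor/age/days]') ) {
        my $age = $ds->findvalue('specimen/donor/age/days');
        my @structures = map { $_->to_literal }
            $ds->findnodes('descendant::structure/name');
        push @{ $by_gene{$acronym}{$age} }, @structures;
    }
}

# print one line per gene and age, with ages in numeric order
for my $acronym ( sort keys %by_gene ) {
    for my $age ( sort { $a <=> $b } keys %{ $by_gene{$acronym} } ) {
        say "$acronym, $age days: ", join( ', ', @{ $by_gene{$acronym}{$age} } );
    }
}
```

With this sample the script should print a single line, "Accn1, 18.5 days: prepontine hindbrain, prosomere 3". Keying on the gene acronym first is only one possible layout; if ages should be pooled across genes, drop the outer key and push onto a hash keyed by age alone.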