Retrieve all values from an XML file in R

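I have the following XML output. The goal is to extract all values from the String rows and store them in a data frame, one row per String element. I am using R with the XML package. My code uses many for loops but fails to extract the values. Is there a better function or piece of code for this? The values I want to extract and store in the data frame are the "WC", "CONTENT", "HEIGHT", "WIDTH", "VPOS" and "HPOS" attributes of each String row.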

<Layout>
<Page ID="Page1" PHYSICAL_IMG_NR="1" HEIGHT="3440" WIDTH="2352">
    <BottomMargin HEIGHT="3440" WIDTH="2352" VPOS="0" HPOS="0">
        <TextBlock ID="Page1_Block1" HEIGHT="222" WIDTH="586" VPOS="466" HPOS="891" language="nl">
            <Shape>
                <Polygon POINTS="908,503 1489,503 1489,710 908,710 908,503"/>
            </Shape>
            <TextLine HEIGHT="35" WIDTH="264" VPOS="472" HPOS="902">
                <String WC="0.8519999981" CONTENT="SHELL" HEIGHT="30" WIDTH="92" VPOS="472" HPOS="902"/>
                <SP WIDTH="20" VPOS="474" HPOS="995"/>
                <String WC="0.5462499857" CONTENT="MAATVELD" HEIGHT="32" WIDTH="150" VPOS="475" HPOS="1016"/>
            </TextLine>
            <TextLine HEIGHT="36" WIDTH="227" VPOS="511" HPOS="901">
                <String WC="0.5287500024" CONTENT="RIJKSWEG" HEIGHT="34" WIDTH="150" VPOS="511" HPOS="901"/>
                <SP WIDTH="20" VPOS="516" HPOS="1052"/>
                <String WC="0.296666652" CONTENT="A20" HEIGHT="31" WIDTH="55" VPOS="515" HPOS="1073"/>
            </TextLine>
            <TextLine HEIGHT="42" WIDTH="418" VPOS="550" HPOS="900">
                <String WC="0.4427272677" CONTENT="NIEUWERKERK" HEIGHT="36" WIDTH="207" VPOS="550" HPOS="900"/>
                <SP WIDTH="21" VPOS="556" HPOS="1108"/>
                <String WC="0.2633333206" CONTENT="A/D" HEIGHT="31" WIDTH="54" VPOS="557" HPOS="1130"/>
                <SP WIDTH="20" VPOS="558" HPOS="1185"/>
                <String WC="0.4916666746" CONTENT="IJSSEL" HEIGHT="33" WIDTH="112" VPOS="559" HPOS="1206"/>
            </TextLine>
            <TextLine HEIGHT="51" WIDTH="570" VPOS="591" HPOS="898">
                <String WC="0.4333333373" CONTENT="BTW" HEIGHT="31" WIDTH="54" VPOS="591" HPOS="899"/>
                <SP WIDTH="21" VPOS="592" HPOS="954"/>
                <String WC="0.6039999723" CONTENT="Shop:" HEIGHT="38" WIDTH="87" VPOS="593" HPOS="975"/>
                <SP WIDTH="27" VPOS="595" HPOS="1063"/>
                <String WC="0.4900000095" CONTENT="NL" HEIGHT="30" WIDTH="34" VPOS="596" HPOS="1091"/>
                <SP WIDTH="21" VPOS="597" HPOS="1126"/>
                <String WC="0.6335294247" CONTENT="81.82.19.233.B.01" HEIGHT="39" WIDTH="321" VPOS="597" HPOS="1147"/>
            </TextLine>
            <TextLine HEIGHT="44" WIDTH="304" VPOS="631" HPOS="897">
                <String WC="0.6299999952" CONTENT="BTW" HEIGHT="31" WIDTH="54" VPOS="631" HPOS="898"/>
                <SP WIDTH="21" VPOS="632" HPOS="953"/>
                <String WC="0.5933333039" CONTENT="Types:" HEIGHT="39" WIDTH="106" VPOS="633" HPOS="974"/>
                <SP WIDTH="27" VPOS="636" HPOS="1081"/>
                <String WC="0.6980000138" CONTENT="0.1.3" HEIGHT="32" WIDTH="92" VPOS="636" HPOS="1109"/>
            </TextLine>
        </TextBlock>
        <TextBlock ID="Page1_Block3" HEIGHT="218" WIDTH="378" VPOS="782" HPOS="902" language="nl">
            <Shape>
                <Polygon POINTS="927,819 1300,819 1300,1027 927,1027 927,819"/>
            </Shape>
            <TextLine HEIGHT="28" WIDTH="117" VPOS="788" HPOS="914">
                <String WC="0.5157142878" CONTENT="ARTTKFI" HEIGHT="28" WIDTH="117" VPOS="788" HPOS="914"/>
            </TextLine>
            <TextLine HEIGHT="45" WIDTH="361" VPOS="828" HPOS="912">
                <String WC="0.3000000119" CONTENT="D2Gb" HEIGHT="32" WIDTH="73" VPOS="828" HPOS="913"/>
                <SP WIDTH="21" VPOS="830" HPOS="986"/>
                <String WC="0.5366666913" CONTENT="Brd" HEIGHT="31" WIDTH="56" VPOS="831" HPOS="1007"/>
                <SP WIDTH="22" VPOS="832" HPOS="1063"/>
                <String WC="0.6299999952" CONTENT="Tonijn" HEIGHT="38" WIDTH="112" VPOS="833" HPOS="1085"/>
                <SP WIDTH="21" VPOS="836" HPOS="1197"/>
                <String WC="0.6100000143" CONTENT="MSC" HEIGHT="32" WIDTH="55" VPOS="836" HPOS="1218"/>
            </TextLine>
            <TextLine HEIGHT="46" WIDTH="353" VPOS="867" HPOS="911">
                <String WC="0.4950000048" CONTENT="D2Gb" HEIGHT="33" WIDTH="74" VPOS="867" HPOS="911"/>
                <SP WIDTH="21" VPOS="870" HPOS="985"/>
                <String WC="0.4659999907" CONTENT="Extra" HEIGHT="32" WIDTH="94" VPOS="871" HPOS="1006"/>
                <SP WIDTH="21" VPOS="873" HPOS="1100"/>
                <String WC="0.5537499785" CONTENT="Speltbol" HEIGHT="40" WIDTH="143" VPOS="873" HPOS="1121"/>
            </TextLine>
            <TextLine HEIGHT="39" WIDTH="279" VPOS="907" HPOS="909">
                <String WC="0.2820000052" CONTENT="CCola" HEIGHT="34" WIDTH="95" VPOS="907" HPOS="909"/>
                <SP WIDTH="20" VPOS="910" HPOS="1004"/>
                <String WC="0.474999994" CONTENT="Zero" HEIGHT="33" WIDTH="75" VPOS="910" HPOS="1024"/>
                <SP WIDTH="21" VPOS="912" HPOS="1099"/>
                <String WC="0.4275000095" CONTENT="33cl" HEIGHT="33" WIDTH="68" VPOS="913" HPOS="1120"/>
            </TextLine>
            <TextLine HEIGHT="45" WIDTH="286" VPOS="947" HPOS="908">
                <String WC="0.6075000167" CONTENT="Puro" HEIGHT="33" WIDTH="75" VPOS="947" HPOS="908"/>
                <SP WIDTH="21" VPOS="949" HPOS="983"/>
                <String WC="0.4560000002" CONTENT="Cappuccino" HEIGHT="42" WIDTH="190" VPOS="950" HPOS="1004"/>
            </TextLine>
        </TextBlock>
    </BottomMargin>
 </Page>
</Layout>      

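You can use the rvest package (I put your data into test.xml):

library(rvest)
library(xml2)  # provides read_xml() and xml_attrs()

# read the XML file
test <- read_xml("test.xml")
# select every <String> node and return its attributes as a list of named vectors
# (xml_nodes() is the older rvest name for html_elements())
test %>% xml_nodes("String") %>% xml_attrs()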
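
The pipeline above yields a list with one named character vector per String node, not a data frame. A minimal sketch of the remaining step follows, using xml2 directly and a two-row inline sample instead of test.xml so that it runs on its own. The names `xml_sample`, `attrs` and `num_cols` are illustrative, and the row binding assumes every String node carries the same attributes in the same order.

```r
library(xml2)

# Inline sample in the same shape as the question's XML (assumption: stands in for test.xml)
xml_sample <- paste0(
  '<Layout><TextLine HEIGHT="35" WIDTH="264" VPOS="472" HPOS="902">',
  '<String WC="0.8519999981" CONTENT="SHELL" HEIGHT="30" WIDTH="92" VPOS="472" HPOS="902"/>',
  '<SP WIDTH="20" VPOS="474" HPOS="995"/>',
  '<String WC="0.5462499857" CONTENT="MAATVELD" HEIGHT="32" WIDTH="150" VPOS="475" HPOS="1016"/>',
  '</TextLine></Layout>'
)

doc <- read_xml(xml_sample)

# One named character vector per <String> node
attrs <- xml_attrs(xml_find_all(doc, "//String"))

# Bind the vectors row-wise; rbind() does not match by name,
# so this relies on a consistent attribute order
df <- as.data.frame(do.call(rbind, attrs), stringsAsFactors = FALSE)

# Everything except CONTENT is numeric
num_cols <- c("WC", "HEIGHT", "WIDTH", "VPOS", "HPOS")
df[num_cols] <- lapply(df[num_cols], as.numeric)

str(df)
```

If some String nodes might lack an attribute, calling `xml_attr()` once per attribute name is a safer route, since it returns NA for missing attributes.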

Alternatively, consider a solution with the XML package:

library(XML)

doc <- xmlParse("Input.xml")

stringdata <- t(xpathSApply(doc, "//String", xmlAttrs))
df <- data.frame(stringdata, stringsAsFactors = FALSE)

# CONVERT CHARACTER COLUMNS TO NUMERIC
df[, c(1,3:6)] <- sapply(df[, c(1,3:6)], function(x) as.numeric(x))
head(df)

#          WC     CONTENT HEIGHT WIDTH VPOS HPOS
# 1 0.8520000       SHELL     30    92  472  902
# 2 0.5462500    MAATVELD     32   150  475 1016
# 3 0.5287500    RIJKSWEG     34   150  511  901
# 4 0.2966667         A20     31    55  515 1073
# 5 0.4427273 NIEUWERKERK     36   207  550  900
# 6 0.2633333         A/D     31    54  557 1130
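
The positional index `c(1, 3:6)` in the conversion step depends on the attributes always appearing in the order WC, CONTENT, HEIGHT, WIDTH, VPOS, HPOS. As a hedged variant, the same conversion can select columns by name. The sketch below parses a short inline sample with `asText = TRUE` so it runs without Input.xml; `xml_sample` and `num_cols` are illustrative names, not part of the original answer.

```r
library(XML)

# Inline sample in the same shape as the question's XML (assumption: stands in for Input.xml)
xml_sample <- paste0(
  '<Layout><TextLine HEIGHT="36" WIDTH="227" VPOS="511" HPOS="901">',
  '<String WC="0.5287500024" CONTENT="RIJKSWEG" HEIGHT="34" WIDTH="150" VPOS="511" HPOS="901"/>',
  '<SP WIDTH="20" VPOS="516" HPOS="1052"/>',
  '<String WC="0.296666652" CONTENT="A20" HEIGHT="31" WIDTH="55" VPOS="515" HPOS="1073"/>',
  '</TextLine></Layout>'
)

doc <- xmlParse(xml_sample, asText = TRUE)

# Attribute matrix: one column per <String> node, transposed to one row per node
stringdata <- t(xpathSApply(doc, "//String", xmlAttrs))
df <- data.frame(stringdata, stringsAsFactors = FALSE)

# Convert every column except CONTENT, selected by name rather than position
num_cols <- setdiff(names(df), "CONTENT")
df[num_cols] <- lapply(df[num_cols], as.numeric)

str(df)
```

Selecting by name keeps the conversion correct if the attribute order in the file ever changes.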