XPath: Tika or JAXP or both


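To better understand my predicament, please refer to this thread ;)

As mentioned in the thread above, I decided to use Tika to provide a common interface for parsing documents and extracting their content. I have now decided to convert each document to XML/HTML using the appropriate ContentHandler.

Here is the sample output: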

    File type is application/vnd.openxmlformats-officedocument.wordprocessingml.document
    Handler <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="cp:revision" content="2" />
    <meta name="meta:last-author" content="ogilvie.f" />
    <meta name="Last-Author" content="ogilvie.f" />
    <meta name="meta:save-date" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Name" content="Microsoft Office Word" />
    <meta name="Author" content="ogilvie.f" />
    <meta name="dcterms:created" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Version" content="12.0000" />
    <meta name="Character-Count-With-Spaces" content="21667" />
    <meta name="date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:Template" content="Normal" />
    <meta name="meta:line-count" content="153" />
    <meta name="creator" content="ogilvie.f" />
    <meta name="publisher" content="Procter &amp; Gamble" />
    <meta name="Word-Count" content="3240" />
    <meta name="meta:paragraph-count" content="43" />
    <meta name="Creation-Date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:AppVersion" content="12.0000" />
    <meta name="meta:author" content="ogilvie.f" />
    <meta name="Line-Count" content="153" />
    <meta name="extended-properties:Application" content="Microsoft Office Word" />
    <meta name="Paragraph-Count" content="43" />
    <meta name="Last-Save-Date" content="2012-04-24T15:24:00Z" />
    <meta name="Last-Printed" content="2012-03-29T15:06:00Z" />
    <meta name="Revision-Number" content="2" />
    <meta name="meta:print-date" content="2012-03-29T15:06:00Z" />
    <meta name="meta:creation-date" content="2012-04-24T15:24:00Z" />
    <meta name="dcterms:modified" content="2012-04-24T15:24:00Z" />
    <meta name="Template" content="Normal" />
    <meta name="Page-Count" content="15" />
    <meta name="meta:character-count" content="18470" />
    <meta name="dc:creator" content="ogilvie.f" />
    <meta name="meta:word-count" content="3240" />
    <meta name="extended-properties:Company" content="Procter &amp; Gamble" />
    <meta name="Last-Modified" content="2012-04-24T15:24:00Z" />
    <meta name="custom:ContentTypeId" content="0x010100832DCE57D1DD144A851051A25C75E147" />
    <meta name="modified" content="2012-04-24T15:24:00Z" />
    <meta name="xmpTPg:NPages" content="15" />
    <meta name="dc:publisher" content="Procter &amp; Gamble" />
    <meta name="Character Count" content="18470" />
    <meta name="meta:page-count" content="15" />
    <meta name="meta:character-count-with-spaces" content="21667" />
    <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
    <title></title>
    </head>
    <body><p class="body_Text"><b>CONFIDENTIAL</b></p>
    <table><tbody><tr>  <td><p>principle</p>
</td>   <td><p>optimum</p>
</td>   <td><p>rationale</p>
</td></tr>
<tr>    <td><p>Number of  suppliers</p>
</td>   <td><p class="list_Paragraph">2-3 per plant</p>
<p class="list_Paragraph">&gt;80% with 5 per region/country cluster</p>
</td>   <td><p class="list_Paragraph">Competition is local</p>
<p class="list_Paragraph">Scale the spend with central accounts</p>
</td></tr>
<tr>    <td><p>Global/local suppliers</p>
</td>   <td><p>Regional is sufficient</p>
</td>   <td><p class="list_Paragraph">No advantage to global as scale is regional only and there is limited IP to transfer.</p>
<p class="list_Paragraph">Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.</p>
</td></tr>
<tr>    <td><p>Approach to suppliers</p>
</td>   <td><p>collaborative</p>
</td>   <td><p>Competition to drive price is clear; preferential and value-add deals require collaboration</p>
</td></tr>
<tr>    <td><p>Make v buy</p>
</td>   <td><p>buy</p>
</td>   <td><p>Multiple suppliers; commoditised technologies</p>
</td></tr>
<tr>    <td><p>Distance of suppliers to plant</p>
</td>   <td><p class="list_Paragraph">Max 300km for boxes (300miles in NA); up to 1000km for paper reels.</p>
<p class="list_Paragraph">Can be longer for specialist print grades or to countries with no high quality local supply</p>
</td>   <td><p class="list_Paragraph">Economic max as high volume product (air in the fluting)</p>
<p class="list_Paragraph">Need recent built paper machines to produce paper strong enough to run on high-speed corrugators</p>
</td></tr>
<tr>    <td><p>Type of suppliers</p>
</td>   <td><p class="list_Paragraph">Integrated with containerboard making</p>
<p />
<p class="list_Paragraph">Corrugators on-site</p>
</td>   <td><p class="list_Paragraph">To assure supply and avoid being leveraged by paper making scale</p>
<p class="list_Paragraph">Cost structure not competitive if have to buy in board (shipping air)</p>
</td></tr>
<tr>    <td><p>Purchase of feedstocks</p>
</td>   <td><p>Not if integrated suppliers</p>
</td>   <td><p>Integrated suppliers have 20x our scale</p>
</td></tr>
<tr>    <td><p>Length and nature of contracts</p>
</td>   <td><p>Multiple year (2-3), but with fixed glidepath pricing/value every year</p>
</td>   <td><p>Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.</p>
</td></tr>
<tr>    <td><p>Specifications</p>
</td>   <td><p class="list_Paragraph">Standard board weights</p>
<p />
<p />
<p class="list_Paragraph">Tailored box sizes</p>
</td>   <td><p class="list_Paragraph">Paper scale much higher so uneconomic to make tailored weight</p>
<p class="list_Paragraph">Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.</p>
</td></tr>
<tr>    <td><p>Terms</p>
</td>   <td><p>Standard, including payment terms</p>
</td>   <td><p>High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.</p>
</td></tr>
</tbody></table>
    <p>date</p>
    </td></tr>
    </tbody></table>
    <p />
    <p />
    <p>1</p>
    <p class="footer" />
    </body></html>

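The challenge begins when I want to extract elements from the handler. I was advised to use XPath and to get the tables via regex. I get the concept, but I can't do that using Tika as [...]

After reading up on it, I am wondering whether I should drop Tika altogether in favor of JAXP, or use a combination of the two (?)

Can anybody tell me where my assumptions and direction are wrong, and how I should proceed?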
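On the "combination" option: Tika parsers write SAX events to whatever `org.xml.sax.ContentHandler` they are given, and a JAXP `TransformerHandler` is itself a `ContentHandler`, so one plausible way to combine the two is to let a `TransformerHandler` collect the events into a DOM tree that XPath can query afterwards. The sketch below shows only the JAXP half, using the JDK alone. The class and method names (`SaxToDomBridge`, `domBuildingHandler`, `toDom`) are hypothetical, and a plain JDK `XMLReader` stands in for the Tika parse step, which is an assumption made so the example runs without Tika on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: turns a stream of SAX events into a DOM Document.
final class SaxToDomBridge {

    /** Returns a ContentHandler that appends every SAX event it receives to {@code target}. */
    static TransformerHandler domBuildingHandler(Document target) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            // With no stylesheet, the handler performs an identity transform.
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new DOMResult(target));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not create the DOM-building handler", e);
        }
    }

    /**
     * Stand-in for the Tika step (assumption): here a JDK XMLReader produces the
     * SAX events. With Tika, the same handler would be passed to the parser instead.
     */
    static Document toDom(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document target = dbf.newDocumentBuilder().newDocument();

            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            reader.setContentHandler(domBuildingHandler(target));
            reader.parse(new InputSource(new StringReader(xhtml)));
            return target;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build a DOM from the SAX events", e);
        }
    }

    public static void main(String[] args) {
        Document doc = toDom("<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>CONFIDENTIAL</p></body></html>");
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```

The design point is that nothing has to be serialized to a string and re-parsed: the document format is handled by whatever emits the SAX events, and everything after that is ordinary JAXP.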

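Your question is not a good fit for this site, for several reasons. First, we do not recommend tools here. Second, which tool to use is highly subjective and depends on many factors. You are also missing the really important details (e.g. what does "can't do" mean? What is not working, are there any error messages, etc.?)

Besides, I don't really understand your problem. So you now have this XML document and want to extract some information from it. That should be a simple task with XPath (or with XPath 3.0 or XQuery if it is more complex). Since you wrote something about regex, I'd usually say: don't parse XML with regex. It simply cannot be done correctly!

I agree the question is not "straightforward", but please check the background thread and the other threads mentioned, which should make things clear. I am also not expecting anyone to "recommend" a tool. I just want to determine which tool suits my scenario (described in detail in the background thread).

First, you should include all the details in the question itself so that we don't have to check the background thread. But I did look at it and still have the questions I raised in my second comment. Also, it is obvious that you are looking for a tool recommendation, as the title indicates: you want a recommendation to use Tika or JAXP. We can't decide that for you.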
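To illustrate the XPath suggestion from the comments, here is a minimal sketch that pulls the table rows out of XHTML shaped like the sample output, using only JAXP from the JDK. The class name `XhtmlTableXPath`, the method `tableRows`, and the prefix `h` are hypothetical. The one detail that is easy to miss is that the output declares the XHTML namespace, so an unprefixed expression such as `//table` would match nothing in a namespace-aware DOM; the prefix must be bound through a `NamespaceContext`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts table rows from namespaced XHTML with XPath.
final class XhtmlTableXPath {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Binds the arbitrary prefix "h" to the XHTML namespace. */
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "h" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                    ? Collections.singletonList("h").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    /** Returns one list of whitespace-normalized cell texts per table row. */
    static List<List<String>> tableRows(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for namespace-qualified XPath
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(XHTML_CONTEXT);

            // Note: this also matches rows of nested tables, if there are any.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//h:table//h:tr", doc, XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = (NodeList) xpath.evaluate(
                        "h:td", rows.item(i), XPathConstants.NODESET);
                List<String> row = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    // normalize-space joins the <p> children and trims line breaks
                    row.add(xpath.evaluate("normalize-space(.)", cells.item(j)));
                }
                result.add(row);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath on the XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<table><tbody><tr><td><p>Make v buy</p>\n</td>"
                + "<td><p>buy</p>\n</td></tr></tbody></table></body></html>";
        System.out.println(tableRows(xhtml));
    }
}
```

No regex is involved: the row and cell structure comes from the element tree, and `normalize-space` handles the stray line breaks inside the cells.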