PHP: XPath expression for the first sentence in a paragraph

I'm looking for an XPath expression that returns the first sentence in a paragraph.

<p>
A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>
I've tried several things, none of which worked:

$expression = '/html/body/div/div/div/div/p//text()';

Do I need to use something like //p[ends-with, or perhaps substring-before?

You won't be able to parse natural language with XPath, but you can get the substring up to and including the first period as follows:

substring(/p,1,string-length(substring-before(/p,"."))+1)
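Note that this may not be the "first sentence" if an abbreviation or some other lexical use of a period appears before the end of the first sentence, or if the first sentence ends with a different punctuation mark.

Or, more concisely: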

concat(substring-before(/p, "."), ".")

Credit: this clever idea was suggested in the comments.
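To make the caveat about abbreviations concrete, here is a minimal sketch that evaluates the concise expression through PHP's DOMXPath. The helper name first_period_chunk and the "Dr. Smith" sample are made up for illustration, and the normalize-space() wrapper is an addition (not part of the expressions above) that trims the text and collapses the line breaks inside the paragraph.

```php
<?php
// Hypothetical helper: returns everything up to and including the first
// period of the root <p> element, using only XPath 1.0 string functions.
function first_period_chunk(string $xml): string
{
    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);
    // normalize-space() is an assumption added here; it strips the leading
    // line break and collapses whitespace runs to single spaces.
    return $xpath->evaluate(
        'concat(substring-before(normalize-space(/p), "."), ".")'
    );
}

// The paragraph from the question.
$question = "<p>\nA federal agency is recommending that White House adviser Kellyanne Conway be \nremoved from federal service saying she violated the Hatch Act on numerous \noccasions. The office is unrelated to Robert Mueller and his investigation.\n</p>";

// Made-up sample where the first period belongs to an abbreviation.
$abbreviation = '<p>Dr. Smith arrived late. The meeting had already started.</p>';

var_dump(first_period_chunk($question));
var_dump(first_period_chunk($abbreviation)); // cut short at "Dr."
```

For the question's paragraph this yields the full first sentence, while the second sample stops at "Dr.". Also note that substring-before() returns an empty string when the text contains no period, so the expression would then return just ".".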

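There is no really good way to do this at the XPath level. PHP only has XPath 1.0, which supports only basic string operations, and nothing in it takes the locale or language into account. PHP itself, however, provides that functionality in ext/intl.

So use DOM + XPath to fetch the text content of the paragraph element node as a string, then extract the first sentence from that string.

IntlBreakIterator can split a string according to locale- and language-specific rules.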

$html = <<<'HTML'
<p>
A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>
HTML;

$document = new DOMDocument();
$document->loadXML($html);
$xpath = new DOMXpath($document);

// fetch the first paragraph in the document as string
$summary = $xpath->evaluate('string((//p)[1])');
// create a break iterator for en_US sentences.
$breaker = IntlBreakIterator::createSentenceInstance('en_US');
// remove line breaks (the wrapped lines already end with a space) before feeding it to the breaker
$breaker->setText(str_replace(["\r\n", "\n"], '', $summary));

$firstSentence = '';
// iterate the sentences
foreach ($breaker->getPartsIterator() as $sentence) {
  $firstSentence = $sentence;
  // break after the first sentence
  break;
}

var_dump($firstSentence);

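Additionally, DOMXpath lets you register PHP functions and call them from XPath expressions. That is an option if you need the logic at the XPath level, for example to use it in conditions.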
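As a rough sketch of that approach: DOMXPath::registerPhpFunctions() exposes PHP functions to XPath under the http://php.net/xpath namespace, where they can be called via php:functionString(). The helper first_sentence and the sample document below are hypothetical, and the helper assumes ext/intl is installed.

```php
<?php
// Hypothetical helper (not from the answer above): returns the first
// sentence of $text using IntlBreakIterator. Requires ext/intl.
function first_sentence(string $text): string
{
    $breaker = IntlBreakIterator::createSentenceInstance('en_US');
    // collapse whitespace runs, including line breaks, to single spaces
    $breaker->setText(trim(preg_replace('/\s+/', ' ', $text)));
    foreach ($breaker->getPartsIterator() as $sentence) {
        // the first part is the first sentence; drop its trailing space
        return trim($sentence);
    }
    return '';
}

// Made-up sample document with two paragraphs.
$document = new DOMDocument();
$document->loadXML(
    '<div><p>First one. Second one.</p><p>Other text. More.</p></div>'
);

$xpath = new DOMXPath($document);
// bind the "php" prefix and allow only the one function to be called
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('first_sentence');

// use the PHP function inside an XPath condition
$nodes = $xpath->query(
    '//p[php:functionString("first_sentence", string(.)) = "First one."]'
);

echo $nodes->length, "\n";
```

Passing a function name to registerPhpFunctions() restricts which PHP functions the expression may call, which seems the safer choice when the XPath comes from outside the application.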

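I suppose I could also explode() it and take the first element of the array. Maybe I'm trying to do too much with XPath.

The above is as good as you'll get in a single simple XPath; explode() won't do any better. To really get it right you would need an NLP library that works at the semantic, sentence level rather than the lexical, punctuation level.

I think substring-before(/p, ".") would be enough. The rest of the expression, which also fetches the period, may obscure the substring-before() semantics.

@Alejandro: Indeed, I originally wrote just substring-before(/p, "."), but then I saw that the OP's "the result should be" output included the period, so I thought I'd go the extra step. You're right, though, that substring-before() is the key part.

You could just add the dot back: concat(substring-before(/p, "."), ".")

You need to be clear about which XPath version you are using. This kind of thing is much easier in XPath 2.0 or later. You mention ends-with(), which requires XPath 2.0, but you also mention PHP, which suggests you are limited to 1.0.

This is a useful addition that leverages the host language.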
Output of the IntlBreakIterator example above:

string(164) "A federal agency is recommending that White House adviser Kellyanne Conway be removed from federal service saying she violated the Hatch Act on numerous occasions. "