PHP: using regular expressions to scrape data from a web page


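I am trying to scrape data from a web page using regular expressions, because my DOM-based attempt gives warnings. So I would like to know: is it possible to use regex to scrape the date, the review comments, and the rating value from this page?

Below is the DOM example. It gives an error on the full page, although the same approach works on smaller markup.

Is it possible to do this with regular expressions?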

<?php
$html= file_get_contents('http://www.yelp.com/biz/franchino-san-francisco?start=80');

$html = escapeshellarg($html) ;
$html = nl2br($html);

$classname = 'rating-qualifier';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");

if ($results->length > 0) {
    echo $review = $results->item(0)->nodeValue;
}


$classname = 'review_comment ieSucks';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");

if ($results->length > 0) {
    echo $review = $results->item(0)->nodeValue;
}

$meta = $dom->documentElement->getElementsByTagName("meta");
echo $meta->item(0)->getAttribute('content');
?>

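See this:
Try running the code on your local machine. What error do you get?
Using only "regular" regular expressions, this is possible only if the site's structure is guaranteed never to change and is completely known to you, because HTML is not a regular language.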
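To illustrate the point about structure changes, here is a minimal sketch. The two markup strings and the helper `extractRating()` are hypothetical and not taken from the page in question; they only show how a pattern written against one layout can silently stop matching when an attribute is added or reordered.

```php
<?php
// Hypothetical markup: the same value rendered in two slightly different layouts.
$original = '<span class="rating-qualifier">5.0 star rating</span>';
$changed  = '<span data-id="7" class="rating-qualifier badge">5.0 star rating</span>';

// A pattern written against the first layout only.
$pattern = '~<span class="rating-qualifier">(.*?)</span>~s';

/** Return the captured text, or null when the pattern does not match. */
function extractRating(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? trim($m[1]) : null;
}

var_dump(extractRating($pattern, $original)); // the pattern matches this layout
var_dump(extractRating($pattern, $changed));  // no match: extra attribute and extra class
```

The second call fails without any warning, which is why a regex-only scraper tends to need constant maintenance on pages you do not control.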
@aelor: It gives errors about malformed HTML, such as: Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 756 in F:\wamp\www\htdocs\thenwat\yelp.php on line 23
libxml_use_internal_errors(true) can be used to suppress these errors. The solution above was taken from one of your replies, only in a different thread.
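A minimal sketch of the DOM approach with warnings suppressed follows. It makes several assumptions: the inline `$html` sample is hypothetical and stands in for the fetched page, the helper `textByClass()` is illustrative, and the class names are the ones from the question. It leaves out `escapeshellarg()` and `nl2br()`, since those are a shell-quoting helper and a newline-to-`<br>` helper and do not repair HTML. It also parses the document once and matches class tokens, so that an element with `class="review_comment ieSucks"` is found by either token.

```php
<?php
// Hypothetical sample standing in for the fetched page. A bare "&" is the kind
// of input that typically produces the htmlParseEntityRef warning.
$html = <<<'HTML'
<html><head><meta name="description" content="Sample listing"></head>
<body>
  <div class="review">
    <span class="rating-qualifier">2014-05-01</span>
    <p class="review_comment ieSucks">Great pasta & friendly staff</p>
  </div>
</body></html>
HTML;

/** Return the trimmed text of every element carrying the given class token. */
function textByClass(DOMXPath $xpath, string $class): array
{
    // Token match: class="a b" matches "a", which an exact @class='a' test would miss.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->nodeValue);
    }
    return $out;
}

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);              // parse once and reuse for every query
$parseErrors = libxml_get_errors(); // kept so they can be logged if needed
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath    = new DOMXPath($dom);
$dates    = textByClass($xpath, 'rating-qualifier');
$comments = textByClass($xpath, 'review_comment');

print_r($dates);
print_r($comments);
```

Restoring the previous error-handling setting afterwards keeps the suppression local to this parse, so unrelated XML handling elsewhere in the script still reports errors as before.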