PHP: scraping an image URL from a Wikipedia page


php regex


I created a regex that extracts an image URL from a page's source code.

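<?php
// Returns the first PNG/JPG URL found in $html, or null if there is none.
// $url is the address of the page, used to resolve relative matches.
function get_logo($html, $url)
{
    // First pass: absolute http(s) URLs ending in png or jpg.
    if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
        echo "First"; // debug output: the first pattern matched
        return $matches[0][0];
    }

    // Second pass: any non-whitespace run ending in png or jpg, possibly relative.
    if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches)) {
        echo "Second"; // debug output: the fallback pattern matched
        // url_to_absolute() is not defined in this snippet; it must exist elsewhere.
        return url_to_absolute($url, $matches[0][0]);
    }

    return null;
}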

Why try to parse HTML with regular expressions when PHP's DOMDocument class can parse HTML easily?
<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://www.wikipedia.org/");

$images = $doc->getElementsByTagName("img");

foreach( $images as $image ) {
    echo $image->getAttribute("src");
    echo "<br>";
}

?>
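The loop above prints every src it finds, while the question asks for a single image URL. A minimal sketch of one way to narrow it down follows. The function name `first_image_src` and the sample markup are made up for illustration, and the choice to accept only paths ending in png, jpg, or jpeg is an assumption carried over from the question's regex, not something the answer specifies.

```php
<?php
// Hypothetical helper (not from the original thread): returns the src of the
// first <img> whose URL path ends in .png, .jpg or .jpeg, or null if none does.
function first_image_src(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    // Buffer libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        // Check only the path component so a query string does not interfere.
        $path = (string) parse_url($src, PHP_URL_PATH);
        if (preg_match('/\.(?:png|jpe?g)$/i', $path)) {
            return $src;
        }
    }

    return null;
}

// Sample markup invented for this example.
$sample = '<html><body><img src="/static/spacer.gif">'
        . '<img src="//upload.example.org/logo.png?v=2"></body></html>';
echo first_image_src($sample), PHP_EOL;
```

The returned value may still be relative or protocol-relative, so it would need to be resolved against the page URL, as the question does with url_to_absolute().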


Use XPath instead; it will easily give you the result.

You should not use regular expressions to parse HTML or XML. Use a tool like this one, which provides proper functionality for parsing these kinds of files.

+1 for each of the suggestions above. DOM parsing is usually easier to implement, read, understand, and maintain. Also, as with any website content scraping, be sure to check the target site's terms of use to make sure you are not violating them.

Why not use $image->getAttribute('src') instead of a foreach loop over all attributes?

@AeroX: Can you modify it to get just one image URL?

@Programming_crazy: The code above gives you the contents of the src attribute of the img tags. It is only an example of a first step, not a gift-wrapped solution.

@casimirithippolyte Thanks for the tip on $image->getAttribute('src'). I have edited the answer to use that method.
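For reference, the XPath suggestion from the comments could look roughly like the sketch below. The function name `first_img_src_xpath` is hypothetical, and taking the first img in document order is an assumption about what "just one image URL" means here; a real page may need a more specific expression.

```php
<?php
// Hypothetical helper (not from the original thread): uses an XPath query to
// return the src attribute of the first <img> in the markup, or null if none.
function first_img_src_xpath(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    // Buffer libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // (//img/@src)[1] selects the first src attribute in document order;
    // string() turns it into a plain string, or '' when nothing matches.
    $src = $xpath->evaluate('string((//img/@src)[1])');

    return $src !== '' ? $src : null;
}

// Sample markup invented for this example.
$src = first_img_src_xpath('<div><img src="one.png"><img src="two.png"></div>');
var_dump($src);
```

Changing the expression, for example to filter on an attribute of the img element, is enough to target a different image without touching the rest of the function.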