PHP regex to match URLs but not images

php regex

I am trying to use preg_match_all to extract all URLs from a block of HTML code. I am also trying to ignore all images.

Sample HTML block:

$html = '<p>This is a test</p><br>http://www.facebook.com<br><img src="http://www.google.com/photo.jpg">www.yahoo.com https://www.aol.com<br>';

Google is excluded because its URL contains the .jpg image extension. The problem occurs when I add an image like this to $html:

<img src="http://www.google.com/image%201.jpg">

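Any idea how to grab only the URLs that are not images, even when they contain the special characters that URLs usually have?

With DOM you can work with the structure of the HTML document. In your case, you need to identify the parts you want to fetch URLs from.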

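1. Load the HTML using DOM
2. Use XPath to fetch URLs from link href attributes, only if needed
3. Use XPath to fetch the text nodes from the DOM
4. Use RegEx on the text node values to match URLs

Here is an example implementation: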

$html = <<<'HTML'
  <p>This is a test</p>
  <br>
  http://www.facebook.com
  <br>
  <img src="http://www.google.com/photo.jpg">
  www.yahoo.com 
  https://www.aol.com
  <a href="http://www.google.com">Link</a>
  <!-- http://comment.ingored.url -->
  <br>
HTML;

$urls = array();

// parse the markup into a DOM tree and prepare an XPath context for queries
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// fetch urls from link href attributes
foreach ($xpath->evaluate('//a[@href]/@href') as $href) {
  $urls[] = $href->value;
}

// fetch urls inside text nodes
$pattern = '(
 (?:(?:https?://)|(?:www\.))
 (?:[^"\'\\s]+)
)xS';
foreach ($xpath->evaluate('/html/body//text()') as $text) {
  $matches = array();
  preg_match_all($pattern, $text->nodeValue, $matches);
  foreach ($matches[0] as $href) {
    $urls[] = $href;
  }
}

var_dump($urls);

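Stop using regex. This question was already asked earlier today.

%20 is the URL encoding of a space. Your regex is most likely matching on whitespace rather than on the literal text %20.

It stops at the space because that is what the regex says: [^]+.

"Stop using regex": do you have any other suggestion?

No, it does not match a space, as shown in the example above. See $html.

How would you match URLs other than those in an a href when using DOM?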
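As a possible answer to the last comment, URLs stored in other attributes can be selected with XPath in the same way as the link href values. The following is only a sketch: the sample markup and the choice of the img src attribute are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = '<p><a href="http://www.google.com">Link</a>'
      . '<img src="http://www.google.com/image%201.jpg"></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$linkUrls = [];
$imageUrls = [];

// Attribute nodes are selected directly, so no regex is involved here.
foreach ($xpath->evaluate('//a/@href') as $attribute) {
    $linkUrls[] = $attribute->value;
}

// Image sources go into a separate list, so they are easy to ignore later.
foreach ($xpath->evaluate('//img/@src') as $attribute) {
    $imageUrls[] = $attribute->value;
}

var_dump($linkUrls, $imageUrls);
```

Keeping one list per attribute means the image URLs never mix with the link URLs, and the encoded %20 stays untouched because the attribute value is read as a whole.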

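For reference, the image tag from the question and, after it, the truncated URL that the asker's regex apparently extracted from it:

<img src="http://www.google.com/image%201.jpg">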

http://www.google.com/image
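To illustrate the point made in the comments about %20 versus a real space, here is a small check using the pattern from the example implementation. The input string is a hypothetical one chosen for this sketch; the asker's own regex is not shown in the thread, so this only demonstrates how a character class that excludes quotes and whitespace behaves.

```php
<?php
// Pattern copied from the example implementation: a scheme or "www." prefix,
// followed by any run of characters that are neither quotes nor whitespace.
$pattern = '(
 (?:(?:https?://)|(?:www\.))
 (?:[^"\'\\s]+)
)xS';

// Hypothetical input: one URL with an encoded space, one with a literal space.
$text = 'see http://www.google.com/image%201.jpg and http://www.google.com/image 1.jpg';

preg_match_all($pattern, $text, $matches);

// The first URL is matched in full; the second stops at the literal space.
var_dump($matches[0]);
```

The literal text %20 consists of ordinary non-whitespace characters, so this class keeps matching through it, while a real space ends the match.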

Output of the example implementation:

array(4) {
  [0]=>
  string(21) "http://www.google.com"
  [1]=>
  string(23) "http://www.facebook.com"
  [2]=>
  string(13) "www.yahoo.com"
  [3]=>
  string(19) "https://www.aol.com"
}
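The DOM approach skips img tags, but a link or a plain-text URL could still point at an image file. One option is to filter the collected URLs by path extension afterwards. This is a minimal sketch, not part of the original answer: the helper name isImageUrl, the extension list, and the sample URLs are all assumptions, and the arrow function needs PHP 7.4 or later.

```php
<?php
/**
 * Hypothetical helper: true if the URL path ends in a known image extension.
 * The extension list is an assumption and can be adjusted as needed.
 */
function isImageUrl(string $url): bool
{
    $imageExtensions = ['jpg', 'jpeg', 'png', 'gif', 'webp', 'svg'];

    // parse_url() separates the path from any query string or fragment.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path component (or unparsable URL): not an image
    }

    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return in_array($extension, $imageExtensions, true);
}

// Hypothetical list of extracted URLs, including two image URLs.
$urls = [
    'http://www.google.com',
    'http://www.facebook.com',
    'www.yahoo.com',
    'http://www.google.com/image%201.jpg',
    'https://www.aol.com/photo.PNG?size=large',
];

// array_values() reindexes the array after the image URLs are dropped.
$nonImages = array_values(
    array_filter($urls, fn (string $url): bool => !isImageUrl($url))
);

var_dump($nonImages);
```

Checking the parsed path rather than the whole string means a query string or an encoded character such as %20 does not affect the extension test.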