How can I use PHP preg_match_all to distinguish anchor elements that are identified by an attribute of an inner HTML element?


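I have sets of HTML anchor elements that each contain an image element. For each set, using PHP-CLI, I want to extract the URLs and classify them by type. An anchor's type can only be determined from an attribute of its child image element. If each set held only one anchor of each type, this would be easy. My problem arises when two anchors of the same type are separated by one or more anchors of other types. My non-greedy parenthesized subpattern seems to turn greedy and expands until it finds the second relevant child attribute. In my test script I am trying to pick the "Userlink" URLs out from among the other types, using the following simple pattern: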

#<a href="(.*?)" custattr="value1"><img alt="Userlink"#

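(Sorry, the actual HTML is all on a single line like this.)

My subpattern captures from the start of the first "Userlink" URL to the end of the last one.

I have tried many different lookaheads, and I am not sure whether I should list them all here. So far they have either returned no matches at all or returned the same result as above.

Here is my test script (run from a Bash shell):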

#!/usr/bin/php
<?php
    $lines = 0;
    $input = "";
    $matches = array();

    // Read all of STDIN; compare against false so a final line of "0" is not dropped.
    while (($line = fgets(STDIN)) !== false) {
        $input .= $line;
        $lines++;
    }
    fwrite(STDERR, "Processing $lines\n");

    $pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';

    if (preg_match_all($pcre, $input, $matches)) {
        fwrite(STDERR, "\$matches has " . count($matches) . " elements\n");
        // $matches[1] holds the text captured by the first subpattern.
        foreach ($matches[1] as $match) {
            fwrite(STDOUT, $match . "\n");
        }
    }
?>



This regular expression should work for your case:

<a href="([^"]*?)"[^>]*\><img alt="Userlink"

Test it:
$pcre = '/<a href="([^"]*?)"[^>]*\><img alt="Userlink"/';
if (preg_match_all($pcre,$input,$matches)){
    var_dump($matches);
    //$matches[1] will be the array containing the urls.
}
/*
    OUTPUT- 
    array
      0 => 
        array
          0 => string '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
          1 => string '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
      1 => 
        array
          0 => string 'http://www.userlink1.com/my/page.html' (length=37)
          1 => string 'http://www.userlink2.com/my/page.html' (length=37)
*/
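
To illustrate why the character class helps, here is a minimal sketch that runs the question's lazy pattern and a character-class variant against the same input. A lazy quantifier only keeps the match as short as possible from the position where the match started; it does not stop `.` from running past the closing quote and into the following anchors. The `$input` string is an assumption, since the actual HTML is not shown above, and the "Otherlink" anchor and its URL are hypothetical.

```php
<?php
// Hypothetical single-line input: two "Userlink" anchors separated by an anchor of another type.
$input = '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" /></a>'
       . '<a href="http://www.other.example/page.html" custattr="value1"><img alt="Otherlink" /></a>'
       . '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" /></a>';

// Pattern from the question: (.*?) may cross quotes and tag boundaries.
$lazyPattern = '~<a href="(.*?)" custattr="value1"><img alt="Userlink"~';

// Character-class variant: [^"]* can never leave the href value,
// and [^>]* can never leave the opening <a ...> tag.
$classPattern = '~<a href="([^"]*)"[^>]*><img alt="Userlink"~';

preg_match_all($lazyPattern, $input, $lazy);
preg_match_all($classPattern, $input, $fixed);

echo "Lazy captures:\n";
foreach ($lazy[1] as $capture) {
    echo "  ", $capture, "\n";
}

echo "Character-class captures:\n";
foreach ($fixed[1] as $capture) {
    echo "  ", $capture, "\n";
}
```

With this input, the lazy pattern's second capture begins at the Otherlink URL and runs through to the end of the second Userlink URL, while the character-class pattern captures only the two Userlink URLs.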

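Use a parser.

As Ed Cottrell's link says, if you only want to find the href contents, using the DOM is a good choice.

Even if I don't need to identify or use the HTML elements themselves and will throw them all away, would an HTML parser still be better?

Don't use the ungreedy .*? here. Use a greedy character class such as [^"]* instead.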

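I have taken the liberty of changing your variable names:

$pattern = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';

// preg_match_all returns the number of full matches (0 if none, false on error).
if ($nb = preg_match_all($pattern, $input, $matches)) {
    fwrite(STDERR, "\$matches has " . $nb . " elements\n");
    // $matches[1] holds the captured URLs.
    fwrite(STDOUT, implode("\n", $matches[1]) . "\n");
}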
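
For the parser route suggested in the comments, a rough sketch using PHP's DOM extension might look like the following. It assumes the dom extension is available and that every anchor of interest has an `<img>` child carrying the `alt` attribute. The `$input` string and the "Otherlink" type are hypothetical, since the real HTML is not shown in the question.

```php
<?php
// Hypothetical input: the real HTML is not shown, so this sample is an assumption.
$input = '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" /></a>'
       . '<a href="http://www.other.example/page.html" custattr="value1"><img alt="Otherlink" /></a>'
       . '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" /></a>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($input);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Group every href by the alt attribute of the anchor's child <img>.
$byType = [];
foreach ($xpath->query('//a[@href][img/@alt]') as $a) {
    $type = $xpath->evaluate('string(img/@alt)', $a);
    $byType[$type][] = $a->getAttribute('href');
}

print_r($byType);
```

Because the XPath expression is evaluated per anchor element, a URL can never spill over from one anchor into the next, and all types are classified in a single pass rather than with one pattern per type.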