提取网址&;从Php或cURL获取的网页链接锚定文本

提取网址&;从Php或cURL获取的网页链接锚定文本,curl,dom,hyperlink,extract,Curl,Dom,Hyperlink,Extract,Php大师 下面是从谷歌获取链接的代码 <?php # Use the Curl extension to query Google and get back a page of results $url = "http://www.google.com"; $ch = curl_init(); $timeout = 5; curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); c

Php大师

下面是从谷歌获取链接的代码

<?php

# Use the Curl extension to query Google and get back a page of results
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";

?>
现在,我还是一个学习者,需要你的帮助。 我想转换上面的代码,以便使用DOM能够从选定网页上的所有链接中提取所有URL及其锚文本,无论这些链接采用何种格式。格式,例如:

<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
更新:

<?php

/*
Using PHP's DOM functions to
  fetch hyperlinks and their anchor text
*/


$dom = new DOMDocument;
$dom->loadHTML(file_get_contents('https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl')); 

// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $anchor = $link->nodeValue;
    echo $href,"\t",$anchor,"\n";
}
echo '</pre>';

?>

让我们使用cURL而不是
file\u get\u contents
,因为它是处理HTTPS请求的更好选项。另外,添加警告抑制控件以避免有关损坏HTML的消息


在第一种情况下,只要使用
echo$link->nodeValue显示锚文本即可。对于regexp的情况,它相当复杂,因为href属性可能位于下一行,所以您应该进行多行匹配。如果可能的话,总是支持KISS原则,但是regex案例是一个很好的家庭作业练习;)。各位,你们知道有什么好的php网络爬虫免费软件/gpl等吗?我也可以检查源代码并从中学习(cURL、DOM等)。斯皮德正在使用不推荐的东西,所以今晚就放弃了。别忘了回复我原来的帖子。路易斯·穆尼奥斯,我按照你的建议做了,但我看到了完整的白色空白页。查看我的原始帖子标题“第一次编辑”下的详细信息。请使用stackoverflow.com而不是fiverr进行检查,这可能是一个动态构建的页面,因此可能需要javascript呈现。仍然没有luck Muiz。查看我的op进行第二次编辑。路易斯·穆尼奥斯,我还没有开始学习oop风格或pdo。我很快就要开始了。到目前为止,仍然使用mysqli和procedral风格。如果您不介意,出于我们的学习目的,我们可以看看上面提到的代码的过程式版本吗?在我看来,答案中的代码是过程式的,只是一个普通的php脚本。没有涉及任何课程。路易斯·穆尼奥斯,我在我的原始帖子上添加了“第三次编辑”。谢谢你的帮助。非常感谢。第二个看起来更好。是时候开始挖掘基本编程方面的设计和良好实践了。花同样的时间至少做一个项目的流程图,也就是说,先把想法写在纸上。此外,尝试开始将基本概念应用为封装,以避免重复代码。
http://stackoverflow.com<br>
A programmer's forum<br>
<br>
http://google.com<br>
A searchengine<br>
<br>
http://yahoo.com<br>
An Index<br>
<br>
<?php

$curl = curl_init('http://stackoverflow.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);

$page = curl_exec($curl);

if(curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}

curl_close($curl);

$regex = '<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else 
    print "Not found"; 

?>
<?php

# Use the Curl extension to query Google and get back a page of results
$url = "http://fiverr.com/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from Devshed Forum.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";
        echo $link->nodeValue;      
}

?>
**Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 97 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 123 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 149 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 149 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 159 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 159 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 162 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 162 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 174 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 174 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 179 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 179 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 184 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 185 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 348 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 356 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 356 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 358 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 358 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 361 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 838 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 848 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 848 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID display-name already defined in Entity, line: 895 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID m-address already defined in Entity, line: 899 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1155 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1155 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 1168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 1175 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1208 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1208 in C:\xampp\htdocs\cURL\crawler.php on line 194**
<?php

/*
Using PHP's DOM functions to
  fetch hyperlinks and their anchor text
*/


$dom = new DOMDocument;
$dom->loadHTML(file_get_contents('https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl')); 

// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $anchor = $link->nodeValue;
    echo $href,"\t",$anchor,"\n";
}
echo '</pre>';

?>
<?php 

include('simple_html_dom.php'); 

$current_link_crawling_level = 0; 
$link_crawling_level_max = 2;

if($current_link_crawling_level == $link_crawling_level_max)
{
    echo "link crawling depth level reached!"; 
    sleep(5);
    exit(); 
}
else
{
    $url = 'http://php.net/manual-lookup.php? 
pattern=str_get_html&scope=quickref'; 
    $curl = curl_init($url); 
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
    $response_string = curl_exec($curl); 

    $html = str_get_html($response_string);

    $current_link_crawling_level++; 

    //to fetch all hyperlinks from the webpage 
    $links = array(); 
    foreach($html->find('a') as $a) 
    { 
        $links[] = $a->href; 
        echo "Value: $a<br />\n"; 
        print_r($links); 

        sleep(1);

        $url = '$value'; 
        $curl = curl_init($a); 
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
        $response_string = curl_exec($curl); 

        $html = str_get_html($response_string);

        $current_link_crawling_level++; 

        //to fetch all hyperlinks from the webpage 
        $links = array(); 
        foreach($html->find('a') as $a) 
        { 
            $links[] = $a->href; 
            echo "Value: $a<br />\n";
            print_r($links); 

            sleep(1);           
        } 
    echo "Value: $a<br />\n";
    print_r($links); 
    }
}

?>
<?php 

include('simple_html_dom.php'); 

$current_link_crawling_level = 0; 
$link_crawling_level_max = 2;

if($current_link_crawling_level == $link_crawling_level_max)
{
    echo "link crawling depth level reached!"; 
    sleep(5);
    exit(); 
}
else
{
    $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; 
    $curl = curl_init($url); 
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
    $response_string = curl_exec($curl); 

    $html = str_get_html($response_string);

    $current_link_crawling_level++; 

    //to fetch all hyperlinks from the webpage 
    // Hide HTML warnings
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    if($dom->loadHTML($html, LIBXML_NOWARNING))
    {
        // echo Links and their anchor text
        echo '<pre>';
        echo "Link\tAnchor\n";
        foreach($dom->getElementsByTagName('a') as $link) 
        {
            $href = $link->getAttribute('href');
            $anchor = $link->nodeValue;
            echo $href,"\t",$anchor,"\n";

            sleep(1);

            $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; 
            $curl = curl_init($url); 
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
            curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
            $response_string = curl_exec($curl); 

            $html = str_get_html($response_string);

            $current_link_crawling_level++; 

            //to fetch all hyperlinks from the webpage 
            // Hide HTML warnings
            libxml_use_internal_errors(true);
            $dom = new DOMDocument;
            if($dom->loadHTML($html, LIBXML_NOWARNING))
            {
                // echo Links and their anchor text
                echo '<pre>';
                echo "Link\tAnchor\n";
                foreach($dom->getElementsByTagName('a') as $link) 
                {
                    $href = $link->getAttribute('href');
                    $anchor = $link->nodeValue;
                    echo $href,"\t",$anchor,"\n";

                    sleep(1);
                }
                echo '</pre>';
            }
            else
            {
                echo "Failed to load html.";
            }
        }
    }
    else
    {
        echo "Failed to load html.";
    }
}
?>
<?php

/*
Using PHP's DOM functions to
fetch hyperlinks and their anchor text
*/

$url = 'https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$data = curl_exec($curl);

// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($data, LIBXML_NOWARNING)){
    // echo Links and their anchor text
    echo '<pre>';
    echo "Link\tAnchor\n";
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        echo $href,"\t",$anchor,"\n";
    }
    echo '</pre>';
}else{
    echo "Failed to load html.";

}
?>