Extract URLs & Anchor Texts from links on a webpage fetched by PHP or cURL
Tags: curl, dom, hyperlink, extract
Here is the code that fetches the links from Google:
<?php
# Use the Curl extension to query Google and get back a page of results
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    echo $link->getAttribute('href');
    echo "<br />";
}
?>
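I am still a learner and need your help.
I want to convert the code above so that, using the DOM, it extracts every URL together with its anchor text from all links on a chosen web page, whatever format those links are written in. Formats such as: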
<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
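For illustration, here is a minimal sketch of how the three formats above could be handled in one pass. It assumes the markup is already in a string; `extract_links()` is a hypothetical helper name, not something from the original post. Because the DOM parser exposes `href` as an attribute, the order of the other attributes (`class`, `id`, `onclick`) should not matter.

```php
<?php
// Sample markup taken from the three link formats listed above.
$html = <<<'HTML'
<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
HTML;

/**
 * Hypothetical helper (not from the original post): returns one
 * ['href' => ..., 'anchor' => ...] pair per <a> element that has an href.
 */
function extract_links(string $html): array
{
    // Buffer parser messages instead of printing them as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // skip anchors that have no target
        }
        $links[] = [
            'href'   => $a->getAttribute('href'),
            'anchor' => trim($a->textContent),
        ];
    }
    return $links;
}

foreach (extract_links($html) as $link) {
    echo $link['href'], "\t", $link['anchor'], "\n";
}
```

The same function could be fed the string returned by `curl_exec()` instead of the inline sample.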
Update:
<?php
/*
Using PHP's DOM functions to
fetch hyperlinks and their anchor text
*/
$dom = new DOMDocument;
$dom->loadHTML(file_get_contents('https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl'));
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
$anchor = $link->nodeValue;
echo $href,"\t",$anchor,"\n";
}
echo '</pre>';
?>
Let's use cURL instead of file_get_contents, since it is the better option for handling HTTPS requests. Also add warning suppression to avoid the messages about broken HTML.
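In the first case, just use echo $link->nodeValue to display the anchor text. The regexp case is fairly complicated, because the href attribute may sit on the next line, so you would have to do multi-line matching. Favor the KISS principle whenever possible, but the regex case is a good homework exercise ;).
Folks, do you know of any good PHP web crawler that is free software/GPL, etc.? I could also check the source code and learn from it (cURL, DOM, etc.). Sphider is using deprecated stuff, so I gave up on it tonight. Don't forget to reply to my original post.
Luis Muñoz, I did as you suggested, but I see a completely blank white page. See the details in my original post under the heading "First Edit".
Please check with stackoverflow.com instead of fiverr; that may be a dynamically built page, so it may need JavaScript rendering.
Still no luck, Muiz. See my OP for the second edit.
Luis Muñoz, I have not started learning OOP style or PDO yet. I will start soon. So far I am still using mysqli and procedural style. If you don't mind, for our learning purposes, could we see a procedural version of the code mentioned above?
It seems to me that the code in the answer is procedural, just a plain PHP script. No classes are involved.
Luis Muñoz, I added a "Third Edit" to my original post. Thanks for your help. Much appreciated.
The second one looks better. It is time to start digging into design and good practices in basic programming. Spend some of that time drawing at least a flowchart of the project, that is, put the ideas on paper first. Also, try to start applying basic concepts such as encapsulation to avoid repeating code.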
http://stackoverflow.com<br>
A programmer's forum<br>
<br>
http://google.com<br>
A searchengine<br>
<br>
http://yahoo.com<br>
An Index<br>
<br>
<?php
$curl = curl_init('http://stackoverflow.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if(curl_errno($curl)) // check for execution errors
{
echo 'Scraper error: ' . curl_error($curl);
exit;
}
curl_close($curl);
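// A PCRE pattern needs delimiters (~ here), and quotes inside a single-quoted string must be escaped.
$regex = '~<\s*a\s+[^>]*href\s*=\s*["\']?([^"\' >]+)["\' >]~i';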
if ( preg_match($regex, $page, $list) )
echo $list[0];
else
print "Not found";
?>
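As a rough sketch of the regex route, the following uses `preg_match_all()` so that every link is captured rather than only the first, and the `s` flag so that an anchor spanning several lines still matches. The sample string and the variable names are assumptions made for illustration. The pattern only handles quoted `href` values, so the DOM approach remains the more robust choice.

```php
<?php
// Illustrative input: the second link has its href on a new line.
$page = "<a href=\"http://example1.com\">Test 1</a>\n"
      . "<a class=\"foo\"\n   href='http://example2.com'>Test\n 2</a>";

// Delimiter ~; flags: i = case-insensitive, s = dot also matches newlines.
// Group 1 = opening quote, group 2 = URL, group 3 = inner HTML of the link.
$regex = '~<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a>~is';

$found = [];
if (preg_match_all($regex, $page, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Drop nested tags and collapse whitespace in the anchor text.
        $anchor  = trim(preg_replace('/\s+/', ' ', strip_tags($m[3])));
        $found[] = [$m[2], $anchor];
    }
}

foreach ($found as [$href, $anchor]) {
    echo $href, "\t", $anchor, "\n";
}
```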
<?php
# Use the cURL extension to fetch the Fiverr home page
$url = "http://fiverr.com/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Fiverr.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->getAttribute('href');
echo "<br />";
echo $link->nodeValue;
}
?>
**Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 97 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 123 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 149 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 149 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 159 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 159 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 162 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 162 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 174 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 174 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 179 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 179 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 184 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 185 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 348 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 356 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 356 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 358 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 358 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 361 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 838 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 845 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 848 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 848 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 851 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID display-name already defined in Entity, line: 895 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID m-address already defined in Entity, line: 899 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1155 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1155 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 1168 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1172 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 1175 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 1208 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 1208 in C:\xampp\htdocs\cURL\crawler.php on line 194**
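Warnings like the ones above come from libxml, the parser behind `DOMDocument`, complaining about markup it does not recognise. A minimal sketch of one way to keep them off the page while still being able to inspect them is shown below. The sample markup is made up for illustration, and how many messages libxml reports depends on the installed libxml version.

```php
<?php
// Deliberately imperfect markup: HTML5 tags and a bare "&".
$html = '<nav><a href="/gigs">Browse & buy <svg><path d="M0 0"/></svg></a></nav>';

libxml_use_internal_errors(true);   // buffer libxml messages instead of emitting PHP warnings
$dom    = new DOMDocument();
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors();      // array of LibXMLError objects, possibly empty
libxml_clear_errors();              // empty the buffer so it does not keep growing

echo count($errors), " parser message(s) buffered\n";
foreach ($dom->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href'), "\t", trim($link->textContent), "\n";
}
```

Unlike the `@` operator, this approach leaves the messages available in `$errors` for logging if they are ever needed.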
<?php
include('simple_html_dom.php');
$current_link_crawling_level = 0;
$link_crawling_level_max = 2;
if($current_link_crawling_level == $link_crawling_level_max)
{
echo "link crawling depth level reached!";
sleep(5);
exit();
}
else
{
$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
$current_link_crawling_level++;
//to fetch all hyperlinks from the webpage
$links = array();
foreach($html->find('a') as $a)
{
$links[] = $a->href;
echo "Value: $a<br />\n";
print_r($links);
sleep(1);
$url = $a->href; // follow the link that was just found
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
$current_link_crawling_level++;
//to fetch all hyperlinks from the webpage
$links = array();
foreach($html->find('a') as $a)
{
$links[] = $a->href;
echo "Value: $a<br />\n";
print_r($links);
sleep(1);
}
echo "Value: $a<br />\n";
print_r($links);
}
}
?>
<?php
include('simple_html_dom.php');
$current_link_crawling_level = 0;
$link_crawling_level_max = 2;
if($current_link_crawling_level == $link_crawling_level_max)
{
echo "link crawling depth level reached!";
sleep(5);
exit();
}
else
{
$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
$current_link_crawling_level++;
//to fetch all hyperlinks from the webpage
// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($html, LIBXML_NOWARNING))
{
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link)
{
$href = $link->getAttribute('href');
$anchor = $link->nodeValue;
echo $href,"\t",$anchor,"\n";
sleep(1);
$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
$current_link_crawling_level++;
//to fetch all hyperlinks from the webpage
// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($html, LIBXML_NOWARNING))
{
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link)
{
$href = $link->getAttribute('href');
$anchor = $link->nodeValue;
echo $href,"\t",$anchor,"\n";
sleep(1);
}
echo '</pre>';
}
else
{
echo "Failed to load html.";
}
}
}
else
{
echo "Failed to load html.";
}
}
?>
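The copy-pasted fetch-and-parse steps in the script above could be folded into functions, which is one way to apply the encapsulation advice from the comments. The sketch below is only an illustration under stated assumptions: `fetch_page()`, `page_links()` and `crawl()` are hypothetical names, only absolute http(s) links are followed, and the fetcher is passed in as a callable so the logic can be tried without touching the network.

```php
<?php
/** Hypothetical helper: download a page with cURL, or return null on failure. */
function fetch_page(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
    $body = curl_exec($curl);
    return is_string($body) && $body !== '' ? $body : null;
}

/** Hypothetical helper: list [href, anchor] pairs found in an HTML string. */
function page_links(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings off the page
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [$a->getAttribute('href'), trim($a->textContent)];
    }
    return $links;
}

/**
 * Visit $url, then each absolute http(s) link on it, at most $maxDepth pages deep.
 * $fetch is any callable(string): ?string, so the network can be swapped out.
 */
function crawl(string $url, int $maxDepth, callable $fetch, int $depth = 0, array &$seen = []): array
{
    if ($depth >= $maxDepth || isset($seen[$url])) {
        return []; // depth limit reached, or page already visited
    }
    $seen[$url] = true;
    $html = $fetch($url);
    if ($html === null) {
        return [];
    }
    $found = [];
    foreach (page_links($html) as [$href, $anchor]) {
        $found[] = ['page' => $url, 'href' => $href, 'anchor' => $anchor];
        if (preg_match('~^https?://~i', $href)) {
            $found = array_merge($found, crawl($href, $maxDepth, $fetch, $depth + 1, $seen));
        }
    }
    return $found;
}

// Usage with an in-memory stand-in for the network (made-up pages):
$pages = [
    'http://a.test/' => '<a href="http://b.test/">B</a>',
    'http://b.test/' => '<a href="http://a.test/">back to A</a>',
];
$fake   = fn(string $url): ?string => $pages[$url] ?? null;
$result = crawl('http://a.test/', 2, $fake);

foreach ($result as $row) {
    echo $row['page'], "\t", $row['href'], "\t", $row['anchor'], "\n";
}
```

For a real run, `crawl($url, 2, 'fetch_page')` would use the cURL helper instead of the stand-in. A polite crawler would also want a delay between requests and a check of robots.txt, which this sketch leaves out.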
<?php
/*
Using PHP's DOM functions to
fetch hyperlinks and their anchor text
*/
$url = 'https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$data = curl_exec($curl);
// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($data, LIBXML_NOWARNING)){
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
$anchor = $link->nodeValue;
echo $href,"\t",$anchor,"\n";
}
echo '</pre>';
}else{
echo "Failed to load html.";
}
?>