PHP: Extracting data from HTML tags

php html


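I have the code below, which tries to extract the value of the content attribute from an HTML page. It does not give the result I expect; all I get is a blank page.

Any help? Where is the problem?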

$url = "https://fr-ca.wordpress.org";
$html = file_get_contents($url);
# Create a DOM parser object
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('meta') as $key) {
    echo "";
    $tab[] = $key->getAttribute('content');
}
$reg = '';
if (preg_match_all($reg, $html, $ar)) {
    print_r($ar);
}

Try this:

$html = '<meta name="generator" content="WP 4.5"/>';

// The subject must be the HTML string ($html), not the $tab array
preg_match_all( '#<meta.*?content=[\'"](.*?)[\'"]\s*/>#i', $html, $results );
print_r( $results[1] ); // contains array of captures.
if( $results[1] ) {
    // code here...
}

$html = '';
preg_match_all('/content="(.*)"/i', $html, $matches);
if (isset($matches[1])) {
    print_r($matches[1]);
}


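This is a regular expression that looks for meta tags and captures the value of the content attribute. It uses a few wildcards to allow for variations such as different name values or extra whitespace.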

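$html = file_get_contents($url);

libxml_use_internal_errors(true); // suppress warnings caused by malformed HTML
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);

// A name attribute on a <div>???
$node = $xpath->query('//div[@name="changeable_text"]')->item(0);

// item(0) returns null when nothing matches; DOM nodes expose textContent, not Content
if ($node !== null) {
    echo $node->textContent;
}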
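The XPath query above targets a div, while the question asks about meta tags. A minimal sketch of the same DOMXPath approach applied to meta tags might look like the following. The inline $html string is a made-up sample standing in for the downloaded page, and the variable names are illustrative only.

```php
<?php
// Hypothetical inline sample standing in for a downloaded page.
$html = '<html><head>'
      . '<meta charset="utf-8">'
      . '<meta name="generator" content="WP 4.5"/>'
      . '<meta name="description" content="Example page">'
      . '</head><body></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$contents = [];
// Select only <meta> elements that actually carry a content attribute.
foreach ($xpath->query('//meta[@content]') as $meta) {
    $contents[] = $meta->getAttribute('content');
}

print_r($contents);
```

Filtering with //meta[@content] skips tags such as the charset declaration, which have no content attribute and would otherwise contribute empty strings.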

Please use it like this:
// Assumption: find() comes from the PHP Simple HTML DOM Parser library,
// so simple_html_dom.php must be available for str_get_html() to exist.
include 'simple_html_dom.php';

function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait while connecting
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    return @curl_exec($ch);
}

// curl_exec() returns a string, so parse it before calling find()
$html = str_get_html(getHTML("http://www.website.com", 10));

// Find all images on webpage
foreach ($html->find("img") as $element) {
    echo $element->src . '<br>';
}

// Find all links on webpage
foreach ($html->find("a") as $element) {
    echo $element->href . '<br>';
}
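If a third-party parser is not an option, roughly the same image and link listing can be sketched with the built-in DOMDocument class. The helper name collectAttribute and the inline $page sample are assumptions made for illustration; in practice the string would be whatever the download step returned.

```php
<?php
/**
 * Hypothetical helper: collects one attribute from every element
 * with the given tag name, skipping elements that lack it.
 */
function collectAttribute(string $html, string $tag, string $attribute): array
{
    libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $values = [];
    foreach ($doc->getElementsByTagName($tag) as $element) {
        if ($element->hasAttribute($attribute)) {
            $values[] = $element->getAttribute($attribute);
        }
    }
    return $values;
}

// Made-up sample markup standing in for a fetched page.
$page = '<html><body><img src="logo.png"><a href="/about">About</a><a>no link</a></body></html>';

$images = collectAttribute($page, 'img', 'src');
$links  = collectAttribute($page, 'a', 'href');

print_r($images);
print_r($links);
```

Anchors without an href are skipped, so the result only contains values that were really present in the markup.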


I think you want to print $ar, not $tab. I believe the preg_match arguments are regex, source, results.

You should not use regex for HTML scraping.

@DarkBee Not sure if this is too off topic, but can you recommend a best general approach for HTML scraping? Just use a DOM parser, as in your example code?

Hi @smoqadam, thanks for your answer. The situation is actually different, because every website produces different output. As in the example above: if I print match, the value I want is at index $match[1][4], but if I change the URL I need to print a different index, $match[1][x]. How can I solve this?

Maybe using php simple html dom parser would be better.

Thank you so much, this is exactly what I was looking for :) preg_match_all('.#I', $html, $results);

Please also check this URL:-
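One possible way around the changing numeric index raised in the comments is to key the values by the meta tag's name rather than by position. The sketch below is only an illustration: the function name metaByName, the fallback to the property attribute, and the $sample markup are all assumptions, not part of the original answers.

```php
<?php
/**
 * Hypothetical helper: returns meta tags as [name => content],
 * so a value can be looked up by name on any site.
 */
function metaByName(string $html): array
{
    libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $result = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Some tags (e.g. Open Graph) use "property" instead of "name".
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
        if ($key !== '' && $meta->hasAttribute('content')) {
            $result[strtolower($key)] = $meta->getAttribute('content');
        }
    }
    return $result;
}

// Made-up sample markup standing in for a fetched page.
$sample = '<html><head>'
        . '<meta name="Generator" content="WP 4.5"/>'
        . '<meta property="og:title" content="Home">'
        . '</head><body></body></html>';

$meta = metaByName($sample);
echo $meta['generator'] ?? 'not found';
```

Looking up $meta['generator'] works the same regardless of how many other meta tags a given site places before it.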