Unable to parse HTML content with PHP because of special characters


I am trying to scrape a website with cURL. So far I have written the following:

Curl class:

<?php

class Curl
{       

    public $cookieJar = "";
    public $curl; // cURL handle, created in get() and postForm()

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup()
    {


        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.


        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
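        // Use the property set in the constructor; a bare $cookieJar is undefined in this scope.
        curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);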
        curl_setopt($this->curl,CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl,CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl,CURLOPT_RETURNTRANSFER, true);  
    }


    function get($url)
    { 
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

    function getAll($reg,$str)
    {
        preg_match_all($reg,$str,$matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer='')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}

?>

Then I call this Curl class from my PHP file:

include_once("curl.php");
$curl = new Curl();
$html = $curl->get("www.somewebsite.com");
$html = htmlentities($html);
//echo $html;
$pattern = htmlentities("<span class=\"review-text\">");
function get_string_between($string, $start, $end)
{
    $string = " ".$string;
    $ini = strpos($string,$start);
    if ($ini == 0)
        return "";
    $ini += strlen($start);
    $len = strpos($string,$end,$ini) - $ini;
    return substr($string,$ini,$len);
}
echo get_string_between($html, '<span class=\"review-text\">', '<\/span>');
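Now I am trying to get the string between two strings, and I get a blank page. However, when I look at the HTML content, I can clearly spot the string.

The HTML content is very large, and I am trying to search through this huge file and extract the content.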


I even tried replacing “”.

There is a better way to get the value of an HTML tag: use the DOM.

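// $html must be the raw cURL response, not the output of htmlentities()
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
@$dom->loadHTML($html); // @ silences warnings about malformed markup
$spans = $dom->getElementsByTagName('span');
foreach ($spans as $span) {
    // Print the text of every <span class="review-text">
    if ($span->getAttribute('class') == 'review-text') {
        print $span->nodeValue;
    }
}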
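A slightly fuller sketch of the same loop, runnable on its own. The helper name extractReviewTexts and the sample markup are made up for illustration. The <?xml encoding="UTF-8"> prefix is not from the original post: it is a commonly used workaround for loadHTML() picking a non-UTF-8 charset when the page does not declare one, and it assumes the fetched page really is UTF-8.

```php
<?php
// Hypothetical helper: returns the text of every <span class="review-text">.
function extractReviewTexts(string $html): array
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: the input is UTF-8. The prefix hints that encoding to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($dom->getElementsByTagName('span') as $span) {
        if ($span->getAttribute('class') === 'review-text') {
            $texts[] = trim($span->nodeValue);
        }
    }
    return $texts;
}

// Made-up sample markup, standing in for the raw cURL response.
$sample = '<div><span class="review-text">Très bon café</span><span class="other">skip</span></div>';
print_r(extractReviewTexts($sample));
```

Returning an array rather than printing inside the loop keeps the parsing separate from the output, which may be easier to debug on a very large page.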
Or there is another way:

$dompath = new DOMXPath($dom);
// item(0) is null when nothing matches, so check before reading nodeValue
$review_div = $dompath->query('//*[@class="review-text"]')->item(0);
$string = $review_div ? $review_div->nodeValue : '';
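The XPath variant can be sketched as a self-contained function. The name firstReviewText and the sample markup are hypothetical. The contains(concat(...)) expression is a standard XPath 1.0 idiom for matching one class among several; it is an addition here, on the assumption that the real span may carry more than one class.

```php
<?php
// Hypothetical helper: text of the first <span> whose class list contains "review-text",
// or null when there is no such element.
function firstReviewText(string $html): ?string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ silences warnings about malformed markup

    $xpath = new DOMXPath($dom);
    // Pad the class attribute with spaces so "review-text" matches only as a whole word.
    $nodes = $xpath->query(
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' review-text ')]"
    );

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Made-up sample markup, standing in for the raw cURL response.
$sample = '<p><span class="big review-text">Good service</span></p>';
var_dump(firstReviewText($sample));
```

An exact comparison such as @class="review-text" would miss the sample above because of the extra "big" class, which is the case this idiom is meant to cover.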

Hope this helps.
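As for why the string search in the question prints nothing, a small check on made-up markup points to two likely causes. First, htmlentities() has already turned < and " into &lt; and &quot; before the search runs. Second, inside single quotes \" stays a literal backslash followed by a quote, so the needle would not match the raw HTML either. The sample string below is an assumption, not the real page.

```php
<?php
$raw     = '<span class="review-text">Nice place</span>'; // made-up sample markup
$escaped = htmlentities($raw);                            // what the question searches in
$needle  = '<span class=\"review-text\">';                // needle exactly as written in the question

var_dump($escaped);                                       // shows the entity-escaped form
var_dump(strpos($escaped, $needle));                      // question's needle vs. escaped HTML
var_dump(strpos($raw, $needle));                          // question's needle vs. raw HTML
var_dump(strpos($raw, '<span class="review-text">'));     // needle without backslashes vs. raw HTML
```

Only the last call finds a match. Searching the raw response with an unescaped needle, and calling htmlentities() only when echoing the result, should avoid both problems.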

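Use htmlspecialchars.

@mohamadatat: It doesn't seem to work that way either.

This method gives me a 500 Internal Server Error. Any idea why?

A 500 error is server-side, which means something went wrong on the server. I don't think it has anything to do with the PHP script. Or, if you are echoing the cURL response, it may be that the page you are trying to crawl returned the error.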
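One way to tell the two cases in the last comment apart is to read the HTTP status that cURL itself received. The sketch below is only that: fetchWithStatus and classifyStatus are hypothetical helper names, not part of the Curl class in the question. If the remote page answers with a status below 400 and the browser still shows a 500, the error more likely comes from your own script, for example a parse error such as a missing semicolon while display_errors is off.

```php
<?php
// Hypothetical helper: fetch a URL and keep transport errors and the HTTP status apart.
function fetchWithStatus(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);

    return [
        'body'   => $body === false ? null : $body,
        'error'  => curl_error($ch),                             // empty string when the transfer succeeded
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 when no response was received
    ];
}

// Hypothetical helper: say which side an HTTP status points to.
function classifyStatus(int $status): string
{
    if ($status === 0) {
        return 'no response';          // DNS failure, timeout, refused connection, ...
    }
    if ($status >= 500) {
        return 'remote server error';  // the crawled page itself failed
    }
    if ($status >= 400) {
        return 'client error';         // bad URL, forbidden, not found, ...
    }
    return 'ok';                       // anything below 400
}
```

A call such as $r = fetchWithStatus("http://www.somewebsite.com"); followed by echo classifyStatus($r['status']); would then show whether the crawled site or the local script is the one failing.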