PHP scraping with DOM and multi-process fopen() calls returns an error while parsing HTML (Call to a member function getElementsByTagName() on a non-object)

Tags: php, web-scraping, multiprocessing


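I'll try to be a bit clearer in the text than in the title.

I've built a PHP page that scrapes another internet site and stores the results in an array rather than in a database (this is repeated 155 times; the repeated calls are driven by another array).

To get results faster, I implemented another PHP page that calls the scraping page several times (about 5) using fopen(), splitting the original array into 5 parts.

When I call the scraping page directly, one call after another for all 155 items, everything works. But when I go through fopen(), it (sometimes) starts returning the following error:

 Fatal error: Call to a member function getElementsByTagName() on a non-object

So I guess it is related to the "multiprocessing" approach: if I launch too many scrapes at the same time, the error comes back.

So I tried calling the scraping page 2 or 3 times, letting the script rest (sleep(1)), and then calling the scraping page the remaining 2 or 3 times. With this approach the whole script sometimes works perfectly, and other times I get the same error again.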

Here is part of my code. From the scraping page (scraper script), the error is always related to this line:

 $taxrows = $taxtable->getElementsByTagName("tr");

From the multiprocess page (multiprocess script):

 if (($totC > 150) && ($totC <= 200)) {
    echo "<br>do something it's between 151-200";

    // more than 150 countries: split the list into 5 arrays
    $part1 = array();
    $part2 = array();
    $part3 = array();
    $part4 = array();
    $part5 = array();

    list($part1, $part2, $part3, $part4, $part5) = array_chunk($countryList, ceil(count($countryList) / 5));

    echo "<br><br>ARRAY 1: <br>";
    print_r($part1);
    echo "<br>total count for part1 = ".count($part1);
    $data1 = extractTax($server,$part1); sleep(1);
    echo "<br><br>ARRAY 2: <br>";
    print_r($part2);
    echo "<br>total count for part2 = ".count($part2);
    $data2 = extractTax($server,$part2); sleep(1);  

    resp($data1);
    echo_flush();
    resp($data2);
    echo_flush();       

    echo "<br><br>ARRAY 3: <br>";
    print_r($part3);
    echo "<br>total count for part3 = ".count($part3);
    $data3 = extractTax($server,$part3); sleep(1);
    echo "<br><br>ARRAY 4: <br>";
    print_r($part4);
    echo "<br>total count for part4 = ".count($part4);
    $data4 = extractTax($server,$part4); sleep(1);

    resp($data3);       
    echo_flush();
    resp($data4);       
    echo_flush();       

    echo "<br><br>ARRAY 5: <br>";
    print_r($part5);
    echo "<br>total count for part5 = ".count($part5);
    $data5 = extractTax($server,$part5); sleep(1);

    resp($data5);       
    echo_flush(); 
}

function extractTax($server, $cList) {
    echo "<br><br><p><i>***** Country List Updater ******</i></p><br>";
    echo "<i>***** Server $server *****</i><br>";

    echo "<br><p class='start'><i>** Launched process $server **</i></p>";
    $cLists = base64_encode(serialize($cList));
    $url = "[...url...]/cData.php?server=".$server."&cList=".$cLists;
    $child = fopen($url, 'r');
    if ($child == TRUE) {
        echo "<br>Worked! Move on...<br>";
    } else {
        $i = 0;
        // Retry at most 3 times, waiting 60 seconds before each new attempt
        while ($child == FALSE && $i < 3) {
            echo "There's a problem with fopen(), waiting for next try<br>";
            sleep(60);
            $i++;
            echo "<br>Attempt $i/3 (after the 3rd, I'll move on)<br>";
            $child = fopen($url, 'r');
        }
        if ($child == TRUE) {
            echo "<br>Finally worked! Moving on...<br>";
        }
        if ($child == FALSE && $i == 3) {
            echo "After 3 unsuccessful attempts, I'm moving on...<br>";
        }
    }
    return $child;
}

function resp($data) {
    // get response from child (if any) as soon as it's ready:
    $response = stream_get_contents($data);
    echo "<br><p class='buytitles'><b>+++This is RESPONSE from process+++</b></p>";
    echo "<br>".$response;
    echo "<br><p class='buyendtitles'><b>---RESPONSE END process ---</b><br></p>";
    fclose($data);
    echo_flush();
}
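
One detail in extractTax() that may be worth checking, although nothing in the question confirms it is the cause: the base64 payload is appended to the URL without URL encoding. Base64 output can contain "+", "/" and "=", and PHP reads a raw "+" in a query string as a space, so the list can arrive altered. The sketch below is only an illustration under assumptions: buildChildUrl() and decodeChildList() are hypothetical helpers, the host is a placeholder, and the way cData.php actually decodes cList is not shown in the question.

```php
<?php
// Hypothetical helper: builds the child URL with a properly encoded query string.
function buildChildUrl(string $baseUrl, string $server, array $cList): string
{
    $payload = base64_encode(serialize($cList));
    // http_build_query() percent-encodes "+", "/" and "=" in the payload.
    $query = http_build_query(
        ['server' => $server, 'cList' => $payload],
        '',
        '&',
        PHP_QUERY_RFC3986
    );
    return $baseUrl . '?' . $query;
}

// Hypothetical counterpart for the child page: decodes the list or throws.
function decodeChildList(string $queryString): array
{
    parse_str($queryString, $params); // same decoding rules PHP applies to $_GET
    $raw = base64_decode($params['cList'] ?? '', true);
    if ($raw === false || $raw === '') {
        throw new InvalidArgumentException('cList is missing or not valid base64');
    }
    // Only plain arrays of scalars are expected, so objects are not allowed.
    $list = unserialize($raw, ['allowed_classes' => false]);
    if (!is_array($list)) {
        throw new InvalidArgumentException('cList could not be unserialized');
    }
    return $list;
}

// Placeholder host: the real URL is elided in the question.
$url = buildChildUrl('http://example.invalid/cData.php', '1', ['Italy', 'France']);
$decoded = decodeChildList(parse_url($url, PHP_URL_QUERY));
```

If the child page receives a damaged list, it may request the wrong pages and get HTML without the expected table, which would be consistent with an intermittent failure, but that remains a guess.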

 if (!$taxtable) throw new SomeException();

Put the scraping logic in a function, call that function inside a try block, and check for errors that way.
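
A minimal sketch of that suggestion, under stated assumptions: findTaxRows() is a hypothetical name, and since the question does not show how $taxtable is looked up, the sketch simply takes the first table in the document. The point is only that a missing table becomes a catchable exception instead of a fatal error.

```php
<?php
// Hypothetical helper: returns the <tr> rows of the tax table,
// or throws when the response does not contain the table.
function findTaxRows(string $html): DOMNodeList
{
    if (trim($html) === '') {
        throw new RuntimeException('Empty response from the remote site');
    }

    // Collect libxml warnings internally so sloppy HTML does not print notices.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        throw new RuntimeException('Response could not be parsed as HTML');
    }

    // Assumption: the tax table is the first <table>; the real lookup is not shown.
    $taxtable = $doc->getElementsByTagName('table')->item(0);
    if ($taxtable === null) {
        throw new RuntimeException('Tax table not found in the response');
    }

    return $taxtable->getElementsByTagName('tr');
}

try {
    // Example of a response without the expected table (e.g. an error page).
    $rows = findTaxRows('<html><body><p>Service unavailable</p></body></html>');
} catch (RuntimeException $e) {
    // Log and skip (or retry) this country instead of stopping the whole run.
    echo 'Skipped: ' . $e->getMessage() . "\n";
}
```

Logging the raw HTML inside the catch block would also show what the remote site actually returned when the table is missing.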

I can't really help you scrape a web page whose data I don't know. Can you provide sample data for your curl request and a sample table?