PHP scraping with DOM and multi-process fopen() returns an error when parsing HTML (Call to a member function getElementsByTagName() on a non-object)
Tags: php, web-scraping, multiprocessing
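I'll try to be clearer in the text than in the title.

I have built a PHP page that scrapes another internet site and stores the results in an array rather than in a database (the scrape is repeated 155 times; these calls are driven by another array).

To get results faster, I implemented another PHP page that uses fopen() to call the "scraping page" several times (about 5), splitting the original array into 5 parts.

When I call the scraping page directly, with the 155 iterations running one after another, everything works. But when I call it through fopen(), it (sometimes) starts returning the following error: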
Fatal error: Call to a member function getElementsByTagName() on a non-object
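So I suspect the "multiprocessing" approach is the cause: if I launch too many scrapes at the same time, it returns the error.
I therefore tried calling the "scraping page" 2 or 3 times, letting the script rest (sleep(1)), and then calling the scraping page the remaining 2 or 3 times.
With this change the whole script sometimes runs perfectly, and other times I get the same error again.
Here is part of my code.
From the "scraping" page (the scraper script):
The error always relates to this part of the code: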
$taxrows = $taxtable->getElementsByTagName("tr");
From the multi-process page (the multi-process script):
if (($totC > 150) && ($totC <= 200)) {
    echo "<br>do something it's between 151-200";
    // More than 150 entries: split the country list into 5 arrays
    $part1 = array();
    $part2 = array();
    $part3 = array();
    $part4 = array();
    $part5 = array();
    list($part1, $part2, $part3, $part4, $part5) = array_chunk($countryList, (int) ceil(count($countryList) / 5));

    echo "<br><br>ARRAY 1: <br>";
    print_r($part1);
    echo "<br>total count for part1 = ".count($part1);
    $data1 = extractTax($server, $part1);
    sleep(1);

    echo "<br><br>ARRAY 2: <br>";
    print_r($part2);
    echo "<br>total count for part2 = ".count($part2);
    $data2 = extractTax($server, $part2);
    sleep(1);

    // Collect the first two responses before launching the next children
    resp($data1);
    echo_flush();
    resp($data2);
    echo_flush();

    echo "<br><br>ARRAY 3: <br>";
    print_r($part3);
    echo "<br>total count for part3 = ".count($part3);
    $data3 = extractTax($server, $part3);
    sleep(1);

    echo "<br><br>ARRAY 4: <br>";
    print_r($part4);
    echo "<br>total count for part4 = ".count($part4);
    $data4 = extractTax($server, $part4);
    sleep(1);

    resp($data3);
    echo_flush();
    resp($data4);
    echo_flush();

    echo "<br><br>ARRAY 5: <br>";
    print_r($part5);
    echo "<br>total count for part5 = ".count($part5);
    $data5 = extractTax($server, $part5);
    sleep(1);

    resp($data5);
    echo_flush();
}
function extractTax($server, $cList) {
    echo "<br><br><p><i>***** Country List Updater ******</i></p><br>";
    echo "<i>***** Server $server *****</i><br>";
    echo "<br><p class='start'><i>** Launched process $server **</i></p>";
    $cLists = base64_encode(serialize($cList));
    $url = "[...url...]/cData.php?server=".$server."&cList=".$cLists;
    // fopen() returns a stream resource on success and false on failure
    $child = fopen($url, 'r');
    if ($child !== false) {
        echo "<br>Worked! Move on...<br>";
    } else {
        $i = 0;
        // Retry at most 3 times, waiting 60 seconds before each attempt
        while ($child === false && $i < 3) {
            echo "There's a problem with fopen(), waiting for next try<br>";
            sleep(60);
            $i++;
            echo "<br>Attempt $i/3 (after the 3rd, I'll move on)<br>";
            $child = fopen($url, 'r');
        }
        if ($child !== false) {
            echo "<br>Finally worked! Moving on...<br>";
        } else {
            echo "After 3 unsuccessful attempts, I'm moving on...<br>";
        }
    }
    return $child;
}
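function resp($data) {
    // extractTax() returns false when every fopen() attempt failed
    if ($data === false) {
        echo "<br>No response: the child process could not be opened<br>";
        return;
    }
    // Read the child's whole response; this blocks until the child has finished
    $response = stream_get_contents($data);
    echo "<br><p class='buytitles'><b>+++This is RESPONSE from process+++</b></p>";
    echo "<br>".$response;
    echo "<br><p class='buyendtitles'><b>---RESPONSE END process ---</b><br></p>";
    fclose($data);
    echo_flush();
}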
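A side note on how the country list reaches the child page: the payload travels in a query string, where the base64 characters '+', '/' and '=' have to be percent-encoded (PHP decodes a bare '+' in a query string as a space). Whether this is related to the error above is not known. The following is only a sketch of one way to pass the list safely; `buildWorkerUrl()` and `decodeCountryList()` are hypothetical names, `$baseUrl` stands in for the elided URL, and JSON is swapped in for serialize() because calling unserialize() on request input is generally discouraged.

```php
<?php
// Parent side (hypothetical helper): build the URL for one child request.
// http_build_query() percent-encodes '+', '/' and '=' in the base64 payload.
function buildWorkerUrl(string $baseUrl, string $server, array $cList): string
{
    $payload = base64_encode(json_encode($cList));
    return $baseUrl . '?' . http_build_query([
        'server' => $server,
        'cList'  => $payload,
    ]);
}

// Child side (hypothetical helper): decode the cList parameter and
// reject anything that is not valid base64-encoded JSON holding an array.
function decodeCountryList(string $param): array
{
    $json = base64_decode($param, true); // strict mode: false on invalid input
    $list = $json === false ? null : json_decode($json, true);
    if (!is_array($list)) {
        throw new InvalidArgumentException('Malformed cList parameter');
    }
    return $list;
}
```

In the child page this would be called as decodeCountryList($_GET['cList']), so a damaged parameter fails loudly instead of producing an empty or partial country list.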
if (!$taxtable) throw new SomeException();
Put the scraping logic into a function, then try that function and check for errors that way.
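The suggestion above could look roughly like this. It is a minimal sketch, not the asker's code: `extractTaxRows()` is a hypothetical name, and taking the first `<table>` in the page is an assumption, because the question does not show how `$taxtable` is obtained. The idea is that a DOM lookup returns null when the fetched HTML does not contain the expected table (for example, if the remote site answers with an error page under concurrent load), and checking for null turns the fatal error into an exception the caller can handle.

```php
<?php
// Hypothetical helper: parse one fetched page and return its table rows.
// Assumption: the tax table is the first <table> element in the document.
function extractTaxRows(string $html): array
{
    if (trim($html) === '') {
        throw new RuntimeException('Empty response from the remote site');
    }

    $doc = new DOMDocument();
    // Collect libxml warnings about malformed HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) returns null when the page contains no <table> at all.
    $taxtable = $doc->getElementsByTagName('table')->item(0);
    if ($taxtable === null) {
        throw new RuntimeException('Tax table not found in the fetched HTML');
    }

    $rows = [];
    foreach ($taxtable->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}
```

Inside the scraping loop, each call can then be wrapped in try { ... } catch (RuntimeException $e) { ... } so that a failed country is logged or retried instead of stopping the whole run.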
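I really can't help you clean up the scraping of a web page whose data I don't know. Can you give sample data from your curl request and a sample table?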