PHP forum web crawler runs only once and prints only one table row


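I have a web crawler script that I am using to fetch data from forum sites. So far I have tried it on many sites and it works fine on all of them except one, where the code outputs only a single link from one part of the table and ignores the rest.

The forum site's HTML looks like this: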

<table summary="forum topics">
    <tbody>
      <tr class="first even topic-3162  row-read">
        <td class="status firstcol"></td>
        <td class="feeds">
          <a href="/feed/get/type/rss/source/lead/id/3162" title="RSS feed" rel="nofollow" class="rss"><span>RSS</span></a>
          <a href="/subscriptions/add/leadid/3162/backto/1" title="subscribe by email" class="email"><span>Email</span></a>
        </td>
        <td class="topic-titles">
          <img src="http://www.ezboard.com/images/posticons/pi_sunglasses.gif" alt="posticon" class="posticon">
          <a href="/topic/3162/step1-prep-diary-mommy-style" title="warning: long post and actually">step1 prep diary - mommy style</a>
        </td>
        <td class="replies">4</td>
        <td class="kudos">0</td>                                                
        <td class="latest lastcol">
          <p class="user-name">
            <a href="/profile/mini/override_id/9695204" title="User Info" class="grayout">
             <img src="http://static.yuku.com/common/bypass/images/user_info_icon.gif" title="View user info." alt="User Info"></a> -->
            <a href="http://mommyduck.pinoyimgforum.yuku.com" title="mommyduck's Profile">mommyduck</a>
          </p>
          <p class="date">Jul 14 13 12:49 AM</p>
        </td>
        <td class="author lastcol">
          <p class="user-name">
           <a href="/profile/mini/override_id/9695204" title="User Info" class="grayout">
            <img src="http://static.yuku.com/common/bypass/images/user_info_icon.gif" title="View user info." alt="User Info"></a> -->
           <a href="http://mommyduck.pinoyimgforum.yuku.com" title="mommyduck's Profile">mommyduck</a></p>
        </td>
      </tr>

      <tr class="first odd topic-425  row-hot row-read">
        <td class="status firstcol">
         <img src="http://static.yuku.com/domainskins/bypass/img/ezboard/hottopic.gif" class="icon icon-hot-read" title="This is a hot topic with no new posts" alt="Hot Topic w/ No New Posts">
        </td>
        <td class="feeds">
         <a href="/feed/get/type/rss/source/lead/id/425" title="RSS feed" rel="nofollow" class="rss">
          <span>RSS</span>
         </a>
         <a href="/subscriptions/add/leadid/425/backto/1" title="subscribe by email" class="email"><span>Email</span></a>
        </td>
        <td class="topic-titles">
         <a href="/topic/425/tips-by-a-97er" title="I took my Step1 exam</a>
         <span class="topic-pager">stuff</span>
        </td>
        <td class="replies">46</td>
        <td class="kudos">0</td>                                                
        <td class="latest lastcol">
          <p class="user-name">
            <a href="/profile/mini/override_id/9695204" title="User Info" class="grayout">
             <img src="http://static.yuku.com/common/bypass/images/user_info_icon.gif" title="View user info." alt="User Info">
            </a>
            <a href="http://mommyduck.pinoyimgforum.yuku.com" title="mommyduck's Profile">mommyduck</a>
          </p>
          <p class="date">Jul 11 13  1:16 AM</p>
        </td>
        <td class="author lastcol">
          <p class="user-name">
           <a href="/profile/mini/override_id/2996016" title="User Info" class="grayout">
            <img src="http://static.yuku.com/common/bypass/images/user_info_icon.gif" title="View user info." alt="User Info">
           </a>
           <a href="http://roxter.e.yuku.com" title="roxter's Profile">roxter</a>
          </p>
        </td>
      </tr>
    </tbody>
</table>
My PHP page code looks like this:

<?php
    function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL,$url);
    $result=curl_exec($ch);
    curl_close($ch);
    return $result;
    }
    $returned_content = get_data('http://example.com');
    $first_step = explode( '<table summary="forum topics"' , $returned_content );
    $second_step = explode('</table>', $first_step[1]);

    $third_step = explode('<tr>', $second_step[0]);
    //print_r($third_step);
    foreach ($third_step as $key=>$element) {
    $child_first = explode( '<td class="topic-titles"' , $element );
    $child_second = explode( '</td>' , $child_first[1] );
    $child_third = explode( '<a href=' , $child_second[0] );
    $child_fourth = explode( '</a>' , $child_third[1] );
    $final = "<a href=".$child_fourth[0]."</a></br>";
    ?>
    <li target="_blank" class="itemtitle">
        <span class="item_new"></span><?php echo $final?>
    </li>
    <?php       
        }
    ?>      
    </ul>        
    <div style="clear:both"></div>
    </div>
    </div>
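If I run this code, it outputs only one link from the HTML above and ignores the other one.

Note: the code above is only meant to print the links inside the cells with class="topic-titles".

Any suggestions would be appreciated.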

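You are splitting the table into rows with explode('<tr>', $second_step[0]).

However, in the fetched table only one element is a tr without a class, i.e. a bare <tr>. Every other row opens with <tr class="...">, so the delimiter never matches it.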
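The effect of the delimiter can be reproduced on a cut-down table. The following is a minimal sketch, assuming a hypothetical two-row string in $html modeled on the markup in the question, in which no row is a bare <tr>. It only compares the chunk counts produced by the two delimiters; it is not the asker's full script.

```php
<?php
// Hypothetical two-row snippet modeled on the forum table in the question.
$html = '<tbody>'
      . '<tr class="first even topic-3162 row-read"><td class="topic-titles">'
      . '<a href="/topic/3162/step1-prep-diary-mommy-style">step1 prep diary - mommy style</a></td></tr>'
      . '<tr class="first odd topic-425 row-hot row-read"><td class="topic-titles">'
      . '<a href="/topic/425/tips-by-a-97er">tips by a 97er</a></td></tr>'
      . '</tbody>';

// The literal string "<tr>" never occurs here because every row has a class
// attribute, so explode() hands back the whole input as one chunk.
$exact = explode('<tr>', $html);

// Splitting on "<tr" (no closing bracket) matches rows with or without attributes.
$loose = explode('<tr', $html);

echo count($exact), "\n"; // chunks produced by the exact delimiter
echo count($loose), "\n"; // chunks produced by the looser delimiter
```

With a single chunk, the foreach loop in the question runs once and picks up only the first topic-titles cell it finds. With the looser delimiter, index 0 holds whatever precedes the first row, so that chunk contains no topic-titles cell and has to be skipped.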


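If you must use this string-splitting method for scraping, base your explode on <tr rather than <tr>, so that rows carrying attributes are matched as well.

You mean …? But the same code works for two other sites... If I have to edit the anchor line, what am I missing?

@harishk Use an HTML library such as simple_html_dom or PHP's native DOM. Don't use regular expressions or ad hoc PHP string functions to scrape websites.

Yes, I also have code that uses the DOM. I just wanted to see how this approach would look.

Hey man, it works fine now.. but I am actually getting an error, Undefined offset: 1, along with the anchor links.

That is most likely because you are simply echoing the scraped HTML, and because you explode on partial HTML tags you end up producing all sorts of half-formed tags. Try wrapping all of the scraped data in PHP's htmlspecialchars function and look at what you are actually printing. Similarly, you will probably find that $child_first[1] and $child_third[1] refer to array elements that do not exist for some chunks; again, inspect what you are scraping to confirm this.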
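The DOM-based approach suggested in the comments could look roughly like the sketch below. It uses PHP's bundled DOMDocument and DOMXPath classes. The function name extract_topic_links and the sample string in $html are assumptions made for illustration; in the real script the input would be the string returned by get_data().

```php
<?php
// Hypothetical sample input modeled on the forum table in the question.
$html = '<table summary="forum topics"><tbody>'
      . '<tr class="first even topic-3162 row-read"><td class="topic-titles">'
      . '<a href="/topic/3162/step1-prep-diary-mommy-style">step1 prep diary - mommy style</a></td></tr>'
      . '<tr class="first odd topic-425 row-hot row-read"><td class="topic-titles">'
      . '<a href="/topic/425/tips-by-a-97er">tips by a 97er</a></td></tr>'
      . '</tbody></table>';

// Hypothetical helper: returns the href and text of every link found
// inside a td.topic-titles cell of the "forum topics" table.
function extract_topic_links(string $html): array
{
    $doc = new DOMDocument();
    // Forum markup is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "topic-titles" as a whole class token, not as a substring.
    $query = '//table[@summary="forum topics"]'
           . '//td[contains(concat(" ", normalize-space(@class), " "), " topic-titles ")]'
           . '//a[@href]';

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

foreach (extract_topic_links($html) as $link) {
    // Escape scraped values before echoing them back into a page.
    printf(
        "<li class=\"itemtitle\"><a href=\"%s\">%s</a></li>\n",
        htmlspecialchars($link['href'], ENT_QUOTES),
        htmlspecialchars($link['text'], ENT_QUOTES)
    );
}
```

Because the parser locates rows and cells by structure rather than by exact substrings, rows with or without a class attribute are handled alike, and a row without a topic link simply contributes nothing instead of triggering an undefined offset notice.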