PHP crawler program (crawling a single website)

php parsing web-crawler

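I am working on a crawler project and need your help; this is my first project. The task is to scrape data from 'http://justdial.com'. For example, I want to get the city name (Bangalore), the category (hotels), the hotel name, the address, and the phone number.

I have written code that extracts a tag's content by the tag's "id", the way I extract the address in the code below: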

<?php

$url = "http://www.justdial.com/Bangalore/hotels";
$original_file = file_get_contents($url);
// Keep only the <div> tags so the pattern below has less markup to deal with
$stripped_file = strip_tags($original_file, "<div>");

// Pattern that captures the content of every <div class="logoDesc">
$newlines = "'<div class=\"logoDesc\">(.*?)</div>'si";
// Note: this runs on the pattern string itself, not on the page, and leaves the pattern unchanged
$newlines = preg_replace('#<div(?:[^>]*)>.</div>#u', '', $newlines);

preg_match_all($newlines, $stripped_file, $matches);


// DEBUGGING

// $matches[0] now contains the complete matched <div class="logoDesc"> elements
// $matches[1] now contains only the captured content inside those elements

header("Content-type: text/plain"); // Set the content type to plain text so the print below is easy to read
$path = $matches;

print_r($path); // View the array to see if it worked
?>

You should not use regular expressions to parse HTML. You should use something like DOMDocument instead. A small example of it in use:
<?php
   $str = '<h1>T1</h1>Lorem ipsum.<h1>T2</h1>The quick red fox...<h1>T3</h1>... jumps over the lazy brown FROG';
   $DOM = new DOMDocument;
   $DOM->loadHTML($str);

   //get all H1
   $items = $DOM->getElementsByTagName('h1');

   //display all H1 text
   for ($i = 0; $i < $items->length; $i++)
        echo $items->item($i)->nodeValue . "<br/>";
?>
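
Applied to the question's case, a minimal sketch might look like the following. It assumes the addresses sit in <div class="logoDesc"> elements, as the pattern in the question suggests; the sample markup and the helper name extractByClass are made up for illustration, and the real page structure may differ.

```php
<?php
// Hypothetical sample markup; the real justdial.com page structure is an assumption.
$html = '<div class="listing">'
      . '<div class="logoDesc">12 MG Road, Bangalore</div>'
      . '<div class="other">ignored</div>'
      . '<div class="logoDesc">45 Brigade Road, Bangalore</div>'
      . '</div>';

// Return the trimmed text of every <div> whose class list contains $class.
// extractByClass is a hypothetical helper, not part of any library.
function extractByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$addresses = extractByClass($html, 'logoDesc');
print_r($addresses);
```

For the live page, the string passed to extractByClass would be the result of file_get_contents($url) from the question, without running strip_tags first, since the parser needs the full markup.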


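Do you mean strip_tags()?
What does $path contain? Please show us a dump.
Have you tried the database code? Does the data need to go from the database to Excel, or can the Excel sheet be generated at the same time? Does it have to be xls, or is csv enough?
Do you mean ... and ...?
I used HTML parsing to parse the content in PHP. Here is the code.
Hello @wayne, I have included the HTML parser to parse the content in PHP. I don't want to use a database; I want to use Notepad. When the 'justdial.com' page is run, the data must be stored in Notepad and then moved from Notepad into an Excel sheet.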
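
If CSV turns out to be enough, as one of the comments above asks, the extracted fields could be written straight to a plain text file that both Notepad and Excel can open. This is only a rough sketch: the rows, the column names, the output path, and the helper name writeCsv are all hypothetical placeholders, not data from the site.

```php
<?php
// Hypothetical rows; in practice these would come from the parsed page.
$rows = [
    ['Bangalore', 'hotels', 'Hotel Example One', '12 MG Road', '080-0000000'],
    ['Bangalore', 'hotels', 'Hotel Example Two', '45 Brigade Road', '080-1111111'],
];

// Write a header line followed by one CSV line per row.
// writeCsv is a hypothetical helper built on PHP's fputcsv().
function writeCsv(string $path, array $header, array $rows): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    // Separator, enclosure and escape are passed explicitly (these are the usual defaults).
    fputcsv($handle, $header, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($handle, $row, ',', '"', '\\');
    }
    fclose($handle);
}

// The output location is an assumption; any writable path works.
$file = sys_get_temp_dir() . '/hotels.csv';
writeCsv($file, ['city', 'category', 'name', 'address', 'phone'], $rows);
```

The resulting file is plain text, so it can be inspected in Notepad and opened directly in Excel without a database in between.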