Web scraping in PHP: removing links that have no id/class attached


php


Hi, I am scraping a website, but the result contains far more information than I need. Here is my code:

<?php
require('phpQuery.php');
$url = 'http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A';
$html = file_get_contents($url);
$pq = phpQuery::newDocumentHTML($html);
echo $pq['#CompanylistResults'];
?>

The result is:

<table id="CompanylistResults">
<tbody>
<tr>
<tr>
<td>
<a target="_blank" rel="nofollow" href="http://www.1800flowers.com">1-800 FLOWERS.COM, Inc.</a>
</td>
<td>
<td style="">$100.55M</td>
<td style="display:none"></td>
<td>United States</td>
<td>1999</td>
<td style="width:105px">Other Specialty Stores</td>



All I need is the text "1-800 FLOWERS.COM, Inc." and "$100.55M". How can I get just those?

Try the following code:

// The URL to scrape
$uri = 'http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A';
// Fetch the page's HTML
$get = file_get_contents($uri);

// Locate the two fragments to keep: the company link and the market-cap cell
$pos1 = strpos($get, "<a target=\"_blank\" rel=\"nofollow\" href=\"http://www.1800flowers.com\">");
$pos2 = strpos($get, "</a>", $pos1);

$pos3 = strpos($get, "<td style=\"\">");
$pos4 = strpos($get, "</td>", $pos3);

// Cut out each fragment (opening tag included, closing tag excluded)
$text = substr($get, $pos1, $pos2 - $pos1);
$text3 = substr($get, $pos3, $pos4 - $pos3);

// Strip the leftover opening tags so that only the values remain
$text1 = strip_tags($text);   // company name
$text2 = strip_tags($text3);  // market cap
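
String matching like this only finds one hard-coded company. As an alternative, here is a minimal sketch that uses PHP's built-in DOM extension to pull the link text and one cell from every row of the table. The sample HTML is a cleaned-up stand-in for the output in the question, and the assumption that the market cap sits in the third cell of each row comes from that snippet only; the live page may differ. The function name extractCompanies is made up for this example.

```php
<?php
// Hypothetical sample row modelled on the question's output (tidied so it parses predictably).
$html = <<<'HTML'
<table id="CompanylistResults">
  <tr>
    <td><a target="_blank" rel="nofollow" href="http://www.1800flowers.com">1-800 FLOWERS.COM, Inc.</a></td>
    <td></td>
    <td style="">$100.55M</td>
    <td style="display:none"></td>
    <td>United States</td>
    <td>1999</td>
    <td style="width:105px">Other Specialty Stores</td>
  </tr>
</table>
HTML;

// Illustrative helper (not part of any library): returns name, url and market cap per row.
function extractCompanies(string $html): array
{
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $companies = [];
    foreach ($xpath->query('//table[@id="CompanylistResults"]//tr') as $row) {
        $link = $xpath->query('td[1]/a', $row)->item(0); // first cell: company link
        $cap  = $xpath->query('td[3]', $row)->item(0);   // assumption: third cell is the market cap
        if ($link === null || $cap === null) {
            continue; // skip header rows or rows with a different layout
        }
        $companies[] = [
            'name'      => trim($link->textContent),
            'url'       => $link->getAttribute('href'),
            'marketCap' => trim($cap->textContent),
        ];
    }
    return $companies;
}

$companies = extractCompanies($html);
print_r($companies);
```

To run this against the real page, pass the string returned by file_get_contents($uri) to extractCompanies() instead of the sample.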


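This kind of financial information is available from dozens of APIs, with no scraping needed. The page you show has a "Download this list" link that provides a CSV file.

API??? Actually, I want to build a link from these two pieces of text and display it on a website.

You need to explain better what your code snippet is doing, so that we can see without any doubt how it answers the question that was asked.

Right... In $uri you put the URL you need to scrape. $get (file_get_contents) pulls the HTML from that URL. $pos1 and $pos2 mark where the wanted data starts and ends (the same goes for $pos3 and $pos4). $text takes the markup between $pos1 and $pos2 (likewise $text3 between $pos3 and $pos4). Running strip_tags() on those gives you the values.

I added some comments to the code. If I misunderstood anything, feel free to correct the edit.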
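
The explanation above can be condensed into one small helper, sketched here under some assumptions: textBetween is a made-up name, the sample string stands in for the downloaded page, and the "name (market cap)" link format is only a guess at what the asker wants to display. Unlike the answer's code, it checks for strpos() returning false when a marker is missing.

```php
<?php
// Illustrative helper: tag-free text between the first $start marker and the next $end marker.
// Returns null if either marker cannot be found.
function textBetween(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start); // begin just after the start marker
    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }
    return trim(strip_tags(substr($haystack, $from, $to - $from)));
}

// Hypothetical stand-in for the HTML returned by file_get_contents($uri).
$get = '<td><a target="_blank" rel="nofollow" href="http://www.1800flowers.com">1-800 FLOWERS.COM, Inc.</a></td>'
     . '<td style="">$100.55M</td>';

$url  = 'http://www.1800flowers.com';
$name = textBetween($get, 'href="' . $url . '">', '</a>');
$cap  = textBetween($get, '<td style="">', '</td>');

// Build the link, escaping every value before it goes into HTML.
$link = sprintf(
    '<a href="%s">%s (%s)</a>',
    htmlspecialchars($url),
    htmlspecialchars((string) $name),
    htmlspecialchars((string) $cap)
);
echo $link, PHP_EOL;
```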