Getting inner text using cURL in PHP


This is the HTML text on the website that I want to scrape:

1,000 Places To See Before You Die

I am using the code like this:

foreach($html->find('ul.listings li a') as $e)
echo $e->innertext. '<br/>';


The output I get is

 999: Whats Your Emergency<span class="epnum">2012</span>

which includes the span. Please help me with this.

You can use strip_tags():
echo trim(strip_tags($e->innertext));

Or try preg_replace() to remove the unwanted tag together with its content:
echo preg_replace('/<span[^>]*>([\s\S]*?)<\/span[^>]*>/', '', $e->innertext);
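
The difference between the two suggestions can be checked without the scraping library. The following is a minimal sketch that assumes the inner text is exactly the string shown in the question; the variable names are illustrative only. It suggests that strip_tags() keeps the year, while the regex drops the whole span.

```php
<?php
// Sample markup modelled on the output shown in the question.
$inner = '999: Whats Your Emergency<span class="epnum">2012</span>';

// strip_tags() drops the tags but keeps the text inside them.
$stripped = trim(strip_tags($inner));

// The regex removes each <span>...</span> together with its content.
$withoutSpan = trim(preg_replace('/<span[^>]*>[\s\S]*?<\/span[^>]*>/', '', $inner));

echo $stripped, "\n";     // 999: Whats Your Emergency2012
echo $withoutSpan, "\n";  // 999: Whats Your Emergency
```

The regex only handles simple, non-nested spans; markup that differs from this sample may need a real parser.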

First, check the HTML. Right now it looks like this:
  $string = '<ul class="listings">
               <li>
                  <a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
 1,000 Places To See Before You Die
                    <span class="epnum">2009</span>
                 </a>
             </li>';

Use plaintext instead:

echo $e->plaintext;

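The year will still appear, but you can remove it with a regexp.
Example from the documentation:
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"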
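
The regexp step mentioned in this answer is not spelled out. One possible sketch follows, assuming the plaintext value is the link text followed by a four-digit year; the sample string and the pattern are assumptions, not part of the original answer.

```php
<?php
// Hypothetical plaintext value: link text followed by the year from the span.
$plain = "1,000 Places To See Before You Die 2009";

// Remove a trailing four-digit year and any whitespace around it.
$title = preg_replace('/\s*\d{4}\s*$/', '', trim($plain));

echo $title, "\n"; // 1,000 Places To See Before You Die
```

A title that legitimately ends in a four-digit number would be truncated by this pattern, so it only suits listings where the year is always appended.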
Why not use DOMDocument and get the title attribute?
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';

$dom = new DOMDocument;
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
$text = $xpath->query('//ul[@class="listings"]/li/a/@title')->item(0)->nodeValue;
echo $text;
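
If the title attribute is not available, a variation on the same DOMDocument approach is to select only the anchor's direct text nodes with XPath text(). This is a sketch under two assumptions: the list is closed with </ul>, and libxml warnings are collected internally rather than printed.

```php
<?php
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
</ul>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// text() selects only the anchor's direct text nodes, so the <span> is skipped.
$text = '';
foreach ($xpath->query('//ul[@class="listings"]/li/a/text()') as $node) {
    $text .= $node->nodeValue;
}

echo trim($text), "\n"; // 1,000 Places To See Before You Die
```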

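I can think of two ways to solve this. One is to read the title attribute from the anchor tag. Of course, not everyone sets a title attribute on anchor tags, and when they do, its value may differ from the link text. The other solution is to take the innertext property and then replace each child of the anchor tag with an empty string.
So, either do this: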
$e->title;

or this:
$text = $e->innertext;
foreach ($e->children() as $child)
{
    $text = str_replace($child, '', $text);
}

That said, it might be a good idea to use it instead.
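
The two options in this answer can also be combined: use the title attribute when it is set and fall back to the anchor's own text otherwise. The sketch below uses PHP's built-in DOM extension rather than simple_html_dom so that it runs on its own; linkLabel() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: prefer the title attribute, else use the anchor's own text nodes.
function linkLabel(DOMElement $a): string
{
    if ($a->hasAttribute('title') && trim($a->getAttribute('title')) !== '') {
        return trim($a->getAttribute('title'));
    }
    $text = '';
    foreach ($a->childNodes as $child) {
        // Keep direct text nodes only; element children such as <span> are skipped.
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    return trim($text);
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<ul class="listings">'
    . '<li><a href="#" title="With Title">Ignored<span class="epnum">2009</span></a></li>'
    . '<li><a href="#">No Title <span class="epnum">2012</span></a></li>'
    . '</ul>'
);

foreach ($dom->getElementsByTagName('a') as $a) {
    echo linkLabel($a), "\n";
}
```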
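Using echo trim(strip_tags($e->innertext)); does not solve the problem; it is no use.
Try using the second example.
@MateiMihai I'm pretty sure whoever gave you the -1 did so because you shouldn't use regular expressions on HTML. It's bad advice. It may work now, but tomorrow he'll be back.
Have you tested this code?? It generates: Entity: line 7: parser error: Premature end of data in tag ul line 1
This really is a clean approach. It's a pity that noobs can't appreciate a good answer. +1.
Yes, exactly :). I could have given him an answer with simplehtmldom, but that is not the right way to parse HTML, because it is a custom library. Besides, using regular expressions on HTML is very bad.
This should be the accepted answer, not the one that parses with regular expressions.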

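// Alternative for the DOMDocument example (reuses its $xpath):
// read the anchor's full text content and keep only the first line.
$text = explode("\n", trim($xpath->query('//ul[@class="listings"]/li/a')->item(0)->nodeValue));
echo $text[0];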
