Symfony crawler: accessing a nested div
Tags: symfony, phpunit, domcrawler
I'm desperately trying to access the content of a nested div:
<tr>
<th class="monthCellContent" style="vertical-align : top">
<div class="monthEventWrapper">
<div class="monthEvent">
<a class="event"
href="/event/1"
title="test title updated - test place - 09:00-10:00">
09:00
<span class="showForMediumInline">
test title updated test place
</span>
</a>
</div>
</div>
</th>
</tr>
But I can't get at
<div class="monthEvent">
I have tried every variation of
foreach ($items as $item) {
foreach ($item->childNodes as $child) {
$value .= $paragraph->ownerDocument->saveHTML($child);
}
}
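and
$crawler->filterXPath('//div[@class="monthEvent"]')
with no luck.
The HTML passes validation and there is no JavaScript involved.
Thanks

Here is one possible solution: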
<?php
use Symfony\Component\DomCrawler\Crawler;
require_once(__DIR__ . '/../vendor/autoload.php');
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<tr>
<th class="monthCellContent" style="vertical-align : top">
<div class="monthEventWrapper">
<div class="monthEvent">
<a class="event"
href="/event/1"
title="test title updated - test place - 09:00-10:00">
09:00
<span class="showForMediumInline">
test title updated test place
</span>
</a>
</div>
</div>
</th>
</tr>
</body>
</html>
HTML;
$crawler = new Crawler($html);
// Select every link inside the wrapper div.
$crawlerFiltered = $crawler->filter('div[class="monthEventWrapper"] a');

// Remove every line break together with the whitespace that follows it.
function removeLeadingAndTrailingWhiteCharsAndNewLine(string $text) : string
{
    $pattern = '/(?:\r\n[\s]+|\n[\s]+)/s';
    return preg_replace($pattern, '', $text);
}

// Cut the child's text off the end of the parent's text.
function subtractSpan(string $text, string $textToSubtract) : string
{
    $length = strlen($text) - strlen($textToSubtract);
    // trim() guards against a stray space left between the time and the span text.
    return trim(substr($text, 0, $length));
}

$results = [];
$childResults = [];
for ($i = 0; $i < count($crawlerFiltered); $i++) {
    // eq() returns a Crawler for the single node at this position.
    $results[] = removeLeadingAndTrailingWhiteCharsAndNewLine($crawlerFiltered->eq($i)->text());
    $children = $crawlerFiltered->eq($i)->children();
    for ($j = 0; $j < count($children); $j++) {
        $childResults[] = removeLeadingAndTrailingWhiteCharsAndNewLine($children->eq($j)->text());
    }
}

// The link text also contains the span text, so strip it to keep only the time.
$results[0] = subtractSpan($results[0], $childResults[0]);

echo 'Parent Nodes:' . PHP_EOL;
var_export($results);
echo PHP_EOL;
echo 'Child Nodes:' . PHP_EOL;
var_export($childResults);
echo PHP_EOL;
echo 'Time: ';
echo $results[0];
echo PHP_EOL;
echo 'Text: ';
echo $childResults[0];
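Note: I used a for loop with ->eq() so that each item is a Crawler instance, rather than foreach, which yields DOMNode objects.
Note: the code assumes that the wanted text part (09:00) comes first in the link text.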
Output:
Parent Nodes:
array (
0 => '09:00',
)
Child Nodes:
array (
0 => 'test title updated test place',
)
Time: 09:00
Text: test title updated test place
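
The assumption that the time comes first in the link text can be sidestepped by reading only the text nodes that are direct children of the link, so the position of the time no longer matters. What follows is a minimal sketch, not the answer's own method: it assumes PHP's built-in DOM extension instead of DomCrawler, uses a trimmed-down copy of the markup (the table row is left out), and introduces a hypothetical helper named directText().

```php
<?php
// Hypothetical helper: collect only the text that sits directly inside a node,
// skipping text that belongs to child elements such as the <span>.
function directText(DOMNode $node): string
{
    $parts = [];
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->textContent);
            if ($text !== '') {
                $parts[] = $text;
            }
        }
    }
    return implode(' ', $parts);
}

// Reduced copy of the markup from the question (assumption: no table row).
$html = <<<'HTML'
<div class="monthEventWrapper">
  <div class="monthEvent">
    <a class="event" href="/event/1">
      09:00
      <span class="showForMediumInline">
        test title updated test place
      </span>
    </a>
  </div>
</div>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Same XPath idea as in the question, extended down to the link.
$link = $xpath->query('//div[@class="monthEvent"]/a')->item(0);

$time = directText($link);
// Query the span relative to the link and trim the surrounding whitespace.
$title = trim($xpath->query('span', $link)->item(0)->textContent);

echo 'Time: ' . $time . PHP_EOL;
echo 'Text: ' . $title . PHP_EOL;
```

With a Crawler, the same helper should be usable on the underlying node, for example directText($crawler->filter('div.monthEvent a')->getNode(0)), since getNode() returns a plain DOMNode. No string lengths are subtracted, so the result does not depend on where the span sits inside the link.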