regex PHP: get specific content from a block of code from another website

php regex

I have a website from which I want to grab specific content from 7 posts. All 7 posts have the same HTML layout (see below):


Z的(矢沢永吉）
Z的2015之旅
2015年6.月
4.日 (木) 19:00開演
10500新加坡元　7500日元(全席指定・消費税込）
※注意事項の詳細をより必ずご確認ください
2015年5.月
16日(土)
　06-6344-3326

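From this layout and code I only want to get the H3 and the first table. What regex approach should I use to get the desired result?

Also, there are 7 posts like the code above, and I have to get the H3 and the first table from each of them.

I have tested this, but I am not sure whether it is correct:

But as you can see, I would also have to include data I do not want, such as H4, DT and IMG :(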

I don't think regex is your best option here. If you can get away without regex, I would do that. Take a look at a PHP web scraper:

$crawler = $client->request('GET', 'http://www.example.com/some-page');
$heading = $crawler->filter('h3')->first();
$table = $crawler->filter('table')->first();

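Not only is this more readable, it also makes things easier to fix when the HTML structure changes.

If you must go with regex, you could do something like this for the h3 (not tested, but something along these lines):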

$html = preg_replace_callback(
    '/<h3>(.*?)<\/h3>/u',
    function ($match) {
        // Keep only the text between the tags
        return $match[1];
    },
    $html
);

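For the table it is similar, except that the pattern has to span multiple lines. In PCRE that means the `s` modifier, which makes `.` match newlines; `m` only changes how `^` and `$` behave. You could add it to the h3 pattern as well, but judging by your example you don't need it there.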
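As a rough sketch of the table case, the pattern below uses the `s` modifier together with a non-greedy `.*?`, so that the match stops at the first closing tag. The sample markup and the variable names `$h3` and `$firstTable` are made up for illustration, since the real page layout is not shown here.

```php
<?php
// Hypothetical sample markup; the real page layout is an assumption.
$html = <<<HTML
<h3>Sample
heading</h3>
<table class="a">
<tr><td>first</td></tr>
</table>
<table>
<tr><td>second</td></tr>
</table>
HTML;

// s: "." also matches newlines; i: tag names are matched case-insensitively.
// The non-greedy .*? stops at the first closing tag instead of the last one.
$h3 = preg_match('/<h3\b[^>]*>(.*?)<\/h3>/si', $html, $m) ? trim($m[1]) : null;
$firstTable = preg_match('/<table\b[^>]*>.*?<\/table>/si', $html, $m) ? $m[0] : null;

echo $h3, "\n";
echo $firstTable, "\n";
```

This only holds up while tables are not nested inside each other; a nested table would end the match at the inner closing tag.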

Have you read this?

I don't think that is what I want.

I know, it just points out that parsing HTML with regular expressions is a bad idea. They are not the right tool for this job. If you are sure the HTML will not change over time and you only need those strings, happy coding :-)

Yes, I am sure the HTML will not change. So what do you suggest now? :)

In that case I would rather use DOMDocument to parse the HTML. In my project I use it together with curl to fetch the titles of linked URLs. See line 630 here:
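To make the DOMDocument suggestion concrete, here is a minimal sketch that pulls the first h3 and the first table out of one post. The sample markup and the helper name `extractHeadingAndFirstTable` are hypothetical; fetching the page (for example with curl) is left out.

```php
<?php
// Hypothetical sample markup standing in for one fetched post.
$html = <<<HTML
<div class="post">
  <h3>Sample heading</h3>
  <h4>Unwanted subheading</h4>
  <table><tr><td>first table</td></tr></table>
  <table><tr><td>second table</td></tr></table>
</div>
HTML;

// Hypothetical helper: returns the text of the first h3 and the markup of the first table.
function extractHeadingAndFirstTable(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog hints that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $h3 = $xpath->query('(//h3)[1]')->item(0);
    $table = $xpath->query('(//table)[1]')->item(0);

    return [
        'heading' => $h3 ? trim($h3->textContent) : null,
        'table'   => $table ? $doc->saveHTML($table) : null,
    ];
}

$result = extractHeadingAndFirstTable($html);
echo $result['heading'], "\n";
echo $result['table'], "\n";
```

For the 7 posts, the same function could be called once per fetched page. The unwanted H4, DT and IMG elements never enter the result because only the two selected nodes are read.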

// I'm assuming you can get the HTML into a variable somehow
// I did my testing w/ a local file with your HTML content
$data = file_get_contents('foo.html');

$h3_content = array();
$table_content = array();

// h3 content is easy to grab, but it could be on multiple lines!
// I didn't account for multiline here:
preg_match('/<h3>([^<]+)<\/h3>/', $data, $h3_content);

// Without extra modifiers "." does not match newlines, so the closing
// table tag is hard to reach unless the entire HTML is on one line.
// Make everything one line first. A new variable is not required; it is
// used only to make explicit that the original HTML has been munged.
$data2 = str_replace("\n", '', $data);

// To separate the tables, put a newline after each closing tag
$data2 = str_replace('</table>', "</table>\n", $data2);
// Now each table ends its own line and the regex is easy
preg_match_all('/(<table.+<\/table>)/m', $data2, $table_content);

echo $h3_content[1], "\n";
// $table_content[0] holds the full matches; index 1 is the second table found
echo $table_content[0][1], "\n";