PHP: alternatives to preg_match for getting data from an external website?


I want to grab the contents of a specific div in an external web page. The div looks like this:

<dt>Win rate</dt><dd><div>50%</div></dd>

My target is the "50%". At the moment I am using the following PHP code to extract the content:

function getvalue($parameter,$content){
    preg_match($parameter, $content, $match);
    return $match[1];
    };
$parameter = '#<dt>Score</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');

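Everything works fine. The problem is that this approach takes too much time, especially when I need to run it many times on different $content values.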

Is there a better (faster, simpler, etc.) way to accomplish the same thing? Thanks.

You can use DOMDocument and navigate to the given node:

$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);

Now, to reach the desired node, you can use a method such as getElementsByTagName, e.g.:

$dds = $doc->getElementsByTagName('dd');
foreach ($dds as $dd) {
    // process each element here, extracting the inner div and its inner HTML...
}
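
To make the DOM approach concrete, here is a minimal sketch that goes straight to the value with DOMXPath instead of looping over every dd. The helper name domFindValue and the sample markup are assumptions for illustration, and the label is assumed not to contain double quotes.

```php
<?php
// Hypothetical helper: return the text of the <div> inside the first <dd>
// that follows <dt>$label</dt>, or null if the page has no such pair.
// Assumes $label contains no double quotes (it is placed inside an XPath string).
function domFindValue(string $html, string $label): ?string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = sprintf('//dt[normalize-space(.)="%s"]/following-sibling::dd[1]/div', $label);
    $nodes = $xpath->query($query);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = '<dl><dt>Win rate</dt><dd><div>50%</div></dd></dl>';
echo domFindValue($html, 'Win rate'), "\n"; // prints 50%
```

The XPath query keeps the lookup tied to the label text rather than to the position of the dd in the page, so extra dd elements elsewhere should not affect it.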

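EDIT: I see @pebbl has made a point about DomDocument being slower. That is true; however, parsing HTML with preg_match is asking for trouble. In that case I would also recommend looking at an event-driven SAX XML parser. It is more lightweight, faster, and uses less memory because it does not build a tree. You can take a look at one such parser.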
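
As a rough illustration of the event-driven idea, the sketch below uses PHP's bundled xml_parser functions. It is only a sketch under assumptions: the helper name saxFindValue is made up, the xml extension must be available, and the input must be well-formed XML (XHTML-style markup), which many real pages are not.

```php
<?php
// Hypothetical helper: stream through well-formed markup and return the text of
// the first <div> that follows <dt>$label</dt>, without building a tree.
function saxFindValue(string $xml, string $label): ?string
{
    $current   = '';     // name of the element most recently opened
    $dtText    = '';     // text collected inside the current <dt>
    $armed     = false;  // true once a <dt> with the wanted label has closed
    $capturing = false;  // true while inside the target <div>
    $buffer    = '';     // text collected inside the target <div>
    $value     = null;

    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Called for every opening tag.
        function ($p, string $name, array $attrs) use (&$current, &$dtText, &$armed, &$capturing, &$buffer, &$value) {
            $current = $name;
            if ($name === 'dt') {
                $dtText = '';
            } elseif ($name === 'div' && $armed && $value === null) {
                $capturing = true;
                $buffer = '';
            }
        },
        // Called for every closing tag.
        function ($p, string $name) use ($label, &$current, &$dtText, &$armed, &$capturing, &$buffer, &$value) {
            if ($name === 'dt') {
                $armed = (trim($dtText) === $label);
            } elseif ($name === 'div' && $capturing) {
                $value = trim($buffer);
                $capturing = false;
            } elseif ($name === 'dd') {
                $armed = false;
            }
            $current = '';
        }
    );

    // Text may arrive in several chunks, so it is appended rather than assigned.
    xml_set_character_data_handler($parser, function ($p, string $data) use (&$current, &$dtText, &$capturing, &$buffer) {
        if ($current === 'dt') {
            $dtText .= $data;
        }
        if ($capturing) {
            $buffer .= $data;
        }
    });

    xml_parse($parser, $xml, true);
    return $value;
}

echo saxFindValue('<dl><dt>Win rate</dt><dd><div>50%</div></dd></dl>', 'Win rate'), "\n";
```

Because no tree is built, memory use stays low, but the state tracking has to be written by hand, and a page that is not well-formed will stop the parser early.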

Instead of trying to avoid preg_match, why not trim the document contents down in size? For example, you could discard everything before a given point in the document.

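One way to picture the trimming idea is sketched below. The helper name getValueNear, the choice of marker, and the 200-byte window are all assumptions for illustration; the right marker depends on the page being read.

```php
<?php
// Hypothetical helper: cut $content down to a short window that starts at
// $marker, then run the regular expression on that window only.
function getValueNear(string $content, string $marker, string $pattern, int $window = 200): ?string
{
    $start = strpos($content, $marker);
    if ($start === false) {
        return null; // marker not present on this page
    }

    // Everything before the marker is discarded; only $window bytes are kept.
    $snippet = substr($content, $start, $window);

    if (preg_match($pattern, $snippet, $match)) {
        return $match[1];
    }
    return null;
}

$content = '<html><head><title>Stats</title></head><body><dl>'
         . '<dt>Win rate</dt><dd><div>50%</div></dd></dl></body></html>';

// (.*?) is non-greedy, so the match stops at the first closing </div>.
$pattern = '#<dt>Win rate</dt><dd><div>(.*?)</div></dd>#';

echo getValueNear($content, '<dt>Win rate</dt>', $pattern), "\n"; // prints 50%
```

The regular expression then only ever sees a few hundred bytes, however large the downloaded page is.
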
There are three main things you can do to improve the speed of your code:
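1. Offload the external page loading to another time (i.e. use cron)
On a Linux-based server I would know what to suggest, but given that you are using Windows I am not sure what the equivalent is. Cron on Linux lets you trigger scripts at scheduled time offsets (in the background), i.e. without using a browser. Basically, I would recommend creating a script whose sole purpose is to fetch the website pages at a particular time offset (depending on how often you need to update your data) and then write those pages to files on your local system.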
$listOfSites = array(
  'http://www.something.com/page.htm',
  'http://www.something-else.co.uk/index.php',
);

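$dirToContainSites = getcwd() . '/sites';

/// make sure the target directory exists before writing to it
if ( !is_dir($dirToContainSites) ) {
  mkdir($dirToContainSites, 0755, true);
}

foreach ( $listOfSites as $site ) {
  $content = file_get_contents( $site );

  /// skip this site if the download failed, rather than storing an empty file
  if ( $content === false ) {
    continue;
  }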

  /// i've just simply converted the URL into a filename here, there are
  /// better ways of handling this, but this at least keeps things simple.
  /// the following just converts any non letter or non number into an
  /// underscore... so, http___www_something_com_page_htm
  $file_name = preg_replace('/[^a-z0-9]/i','_', $site);

  file_put_contents( $dirToContainSites . '/' . $file_name, $content );
}

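Once this script is created, you need to set up your server to execute it as regularly as you need. You can then modify the front-end script that displays your stats to read from the local files, which should give a significant speed-up.
You can read the files back by scanning the directory. A simpler method (but more prone to problems) is to step through your array of sites again, convert each URL to a file name using the preg_replace above, and then check whether that file exists in the folder.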
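
The simpler method could look roughly like the sketch below. The function names cacheFileFor and readCachedSite are made up for illustration; the file-name conversion is the same preg_replace used in the fetch script.

```php
<?php
// Same URL-to-file-name conversion as in the fetch script.
function cacheFileFor(string $site, string $dir): string
{
    return $dir . '/' . preg_replace('/[^a-z0-9]/i', '_', $site);
}

// Hypothetical helper: return the stored copy of a page, or null if the
// scheduled script has not saved it yet (or the file cannot be read).
function readCachedSite(string $site, string $dir): ?string
{
    $path = cacheFileFor($site, $dir);
    if (!is_file($path)) {
        return null;
    }
    $content = file_get_contents($path);
    return $content === false ? null : $content;
}

// Usage: the front-end script reads the local copy instead of the remote page.
$dirToContainSites = getcwd() . '/sites';
$html = readCachedSite('http://www.something.com/page.htm', $dirToContainSites);
if ($html === null) {
    echo "No cached copy yet\n";
}
```

Returning null for a missing file lets the front-end decide what to do when the scheduled fetch has not run yet.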
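2. Cache the results of calculating your stats
It is quite likely that this is a stats page you will want to access fairly often (not as often as a public page, but still). If the same page is visited more often than the cron-based script runs, there is no reason to do all the calculation again. So basically, all you have to do to cache your output is something like the following: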
$cachedVersion = getcwd() . '/cached/stats.html';

/// check to see if there is a cached version of this page
if ( file_exists($cachedVersion) ) {
  /// if so, load it and echo it to the browser
  echo file_get_contents($cachedVersion);
}
else {
  /// start output buffering so we can catch what we send to the browser
  ob_start();

  /// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL

  /// end output buffering and grab the contents so we now have a string
  /// of the page we've just generated
  $content = ob_get_contents(); ob_end_clean();

  /// write the content to the cached file for next time
  file_put_contents($cachedVersion, $content);

  echo $content;
}

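Once you start caching things, you need to be aware of when you should delete or clear your cache; if you don't, your stats output will never change. In this situation the best time to clear the cache is when you fetch the external web pages again, so you should add a line that deletes the cached file to the bottom of your "cron" script.
There are other speed improvements you could make to the caching system (you could even record the modified times of the external web pages and load them only when they have been updated), but I have tried to keep things simple to explain.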
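
A minimal sketch of clearing the cache, assuming the same cached/stats.html path as the caching example; the helper name clearStatsCache is made up for illustration.

```php
<?php
// Hypothetical helper: remove the cached stats page so that the next visit
// rebuilds it from the freshly fetched files.
function clearStatsCache(string $cachedVersion): bool
{
    if (!file_exists($cachedVersion)) {
        return true; // nothing to clear
    }
    return unlink($cachedVersion);
}

// At the bottom of the "cron" script, after the pages have been re-fetched:
clearStatsCache(getcwd() . '/cached/stats.html');
```

Deleting the file is enough here because the caching code already regenerates the page whenever the cached version is missing.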
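3. Don't use an HTML parser for this situation
Scanning an HTML file for one particular unique value does not require a fully-fledged or even a lightweight HTML parser. Using RegExp incorrectly seems to be one of those things that many beginner programmers fall into, and it is a frequently asked question. This has led many more experienced coders to a knee-jerk reaction in which they automatically follow this logic: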
if ( $askedAboutUsingRegExpForHTML ) {
  $automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
  $soundAdvice = $think->about( $theSituation );
  print $soundAdvice;
}

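HTML parsers should be used when the target in your markup is not unique, or when the pattern you want to match relies on rules so flimsy that it breaks the moment an extra tag or character appears. They should be used to make your code more reliable, not to speed things up. Even parsers that do not build a tree of all the elements still use some form of string searching or regular expression notation, so unless the library code you are using has been compiled in an extremely optimised manner, it will not beat well-coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I have seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously, if your HTML is more complex and <dt>Win rate</dt><dd><div>50%</div></dd> could appear at random multiple times, this will cause problems, but even so, an HTML parser would have the same problem.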

$offset = 0;

/// loop through the occurrences of 'Win rate'
while ( ($p = stripos($html, 'win rate', $offset)) !== FALSE ) {

  /// grab a short snippet of the surrounding HTML to speed up the RegExp;
  /// the third argument of substr() is a length, not an end position
  $snippet = substr($html, $p, 50);

  /// I've extended your RegExp to try and account for white space that could
  /// occur around the elements. The following won't take into account any random
  /// attributes that may appear, so if you find some pages aren't working, echo
  /// out the $snippet var using something like
  /// "echo '<pre>'.htmlspecialchars($snippet).'</pre>';"
  /// and that should show you what is appearing that is breaking the RegExp.

  if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
    /// once you are here your % value will be in $regs[1];
    break; /// exit the while loop as we have found our 'Win rate'
  }

  /// move the offset past this occurrence, otherwise stripos() would keep
  /// finding the same position and the loop would never end
  $offset = $p + 1;
}