HTML表到数组PHP_Php_Html_Regex_Html Table

HTML表到数组PHP

php html regex

HTML表到数组PHP,php,html,regex,html-table,Php,Html,Regex,Html Table,我在网上有一个学校日历，但我想在我自己的应用程序中有它。不幸的是，我不能让它与PHP和正则表达式一起工作问题是表格单元格没有被平均分割，并且每个类都会发生变化。你可以找到时间表和我尝试的正则表达式是： <td rowspan='(?:[0-9]{1,3})' class='value'>(.+?) (.+?) (.+?) </td> 我希望这个问题足够清楚，因

我在网上有一个学校日历，但我想在我自己的应用程序中有它。不幸的是，我不能让它与PHP和正则表达式一起工作

问题是表格单元格没有被平均分割，并且每个类都会发生变化。你可以找到时间表和

我尝试的正则表达式是：

<td rowspan='(?:[0-9]{1,3})' class='value'>(.+?)<br/>(.+?)<br/>(.+?)<br/><br/><br/></td>

我希望这个问题足够清楚，因为我不是英国人；）

请使用HTML解析器提取值。PHP简单HTML解析器值得一试：

祝你好运，这会很棘手。。。仅仅“使用HTML解析器”实际上并不能避免主要问题，这是使用行跨度的表的本质。虽然使用HTML解析器解析大量HTML总是一个很好的建议，但如果您可以将HTML分解为更小、可靠的块，那么使用其他技术进行解析总是会更为优化（但显然更容易在HTML中出现细微的意外差异）

使桌子正常化如果是我，我会从能够检测表的开始和结束位置的东西开始（因为如果不需要的话，即使使用HTML解析器，我也不想解析整个页面）：

$table=$start=$end=false；
///“Vrijdag”应该足够独特，但如果它出现在其他地方，它将失败
$pos=strpos（$html，'Vrijdag'）；
///根据可靠的标签找到起点和终点
如果（$pos！==false）{
$start=stripos（$html，，$pos）；
如果（$start！==false）{
$end=stripos（$html，，$start）；
}
}
如果（$start！==false&&$end！==false）{
///我们现在可以抓取表$html；
$table=substr（$html，$start，$end-$start）；
}

然后，由于细胞在垂直方向上的排列方式很随意（但在水平方向上似乎是一致的），我会选择一个“日”列并向下排列

if ( $table ) {
  /// break apart based on rows
  $rows = preg_split('#</tr>#i', $table);
  ///
  foreach ( $rows as $key => $row ) {
    $rows[$key] = preg_split('#</td>#i', $row);
  }
}

if（$table）{
///按行拆分
$rows=preg#u split（“##i”，$table）；
///
foreach（$key=>$row的行）{
$rows[$key]=预分割（“##i”，$row）；
}
}

以上内容应该会给你一些类似的信息：

array (
  '0' => array (
    '0' => "<td class='heading'>1",
    '1' => "<td rowspan='1' class='empty'>"
    '2' => "<td rowspan='5' class='value'>3D<br/>009<br/>Hk<br/><br/><br/>"
    ...
  ),
  '0' => array (
    '0' => "<td class='heading'>2",
    '1' => "<td rowspan='2' class='empty'>"
    '2' => "<td rowspan='3' class='value'>Hk<br/>"
    ...
  ),
)

数组(
“0”=>数组(
'0' => "1",
'1' => ""
“2”=>“3D
009
香港



”
...
),
“0”=>数组(
'0' => "2",
'1' => ""
“2”=>“香港
”
...
),
)

现在，您可以跨每一行进行扫描，并且在预匹配行跨度的位置，您必须在下面的行中（在正确的位置）创建该单元格信息的副本，以便实际创建完整的表结构（没有行跨度）

///此处无法使用foreach，因为我们要修改循环中的数组
$lof=计数（$rows）；
对于（$rkey=0；$rkey$单元格）{
if（preg_match（'/rowspan=.（[0-9]+）./'，$cell，$regs））{
$rowspan=（int）$regs[1]；
如果（$rowspan>1）{
///这里有一个陷阱，我后来意识到我正在构建
///一个类似于“14$2”的替换模式，这意味着
///系统试图在偏移量14处找到一个组。为了避开此问题
///问题是，PHP允许使用{}包装组引用号。
///现在我们得到了在一个文本数字周围插入的“$1”和“$2”的值
$newcell=preg_replace（'/（rowspan=）[0-9]+（.）/，'${1}'。（$rowspan-1）。'${2}'，$cell）；
阵列拼接（$rows[$rkey+1]，$ckey，$newcell）；
}
}
}
}

以上内容应该使表格正常化，这样行间距就不再是问题了

（请注意，上面的代码是理论代码，我已经手动键入了它，还没有对它进行测试——我将很快进行测试）

测试后我已经更新了上面的一些小错误，即错误地获取某些函数的php参数。。。在对这些数据进行排序后，它似乎起到了作用：

/// grab the html
$html = file_get_contents('http://www.cibap.nl/beheer/modules/roosters/create_rooster.php?element=CR13A&soort=klas&week=37&jaar=2012');

/// start with nothing
$table = $start = $end = false;
/// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
$pos = strpos($html, 'Vrijdag');

/// find your start and end based on reliable tags
if ( $pos !== false ) {
  $start = stripos($html, '<tr>', $pos);
  if ( $start !== false ) {
    $end = stripos($html, '</table>', $start);
  }
}

/// make sure we have a start and end
if ( $start !== false && $end !== false ) {
  /// we can now grab our table $html;
  $table = substr($html, $start, $end - $start);
  /// convert brs to something that wont be removed by strip_tags
  $table = preg_replace('#<br ?/>#i', "\n", $table);
}

if ( $table ) {
  /// break apart based on rows (a close tr is quite reliable to find)
  $rows = preg_split('#</tr>#i', $table);
  /// break apart the cells (a close td is quite reliable to find)
  foreach ( $rows as $key => $row ) {
    $rows[$key] = preg_split('#</td>#i', $row);
  }
}
else {
  /// create so we avoid errors
  $rows = array();
}

/// changed this here from a foreach to a for because it seems
/// foreach was working from a copy of $rows and so any modifications
/// we made to $rows while the loop was happening were ignored.
$lof = count($rows);
for ( $rkey=0; $rkey<$lof; $rkey++ ) {
  /// pull out the row
  $row = $rows[$rkey];
  /// step each cell in the row
  foreach ( $row as $ckey => $cell ) {
    /// pull out our rowspan value
    if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
      /// if rowspan is greater than one (i.e. spread across multirows)
      $rowspan = (int) $regs[1];
      if ( $rowspan > 1 ) {
        /// then copy this cell into the next row down, but decrease it's rowspan
        /// so that when we find it in the next row we know how many more times
        /// it should span down.
        $newcell = preg_replace('/( rowspan=.)([0-9]+)(.)/', '${1}'.($rowspan-1).'${3}', $cell);
        array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
      }
    }
  }
}

/// now finally step the normalised table and get rid of the unwanted tags 
/// that remain at the same time split our values in to something more useful
foreach ( $rows as $rkey => $row ) {
  foreach ( $row as $ckey => $cell ) {
    $rows[$rkey][$ckey] = preg_split('/\n+/',trim(strip_tags( $cell )));
  }
}

echo '<xmp>';
print_r($rows);
echo '</xmp>';

///获取html
$html=文件\u获取\u内容（'http://www.cibap.nl/beheer/modules/roosters/create_rooster.php?element=CR13A&soort=klas&week=37&jaar=2012');
///白手起家
$table=$start=$end=false；
///“Vrijdag”应该足够独特，但如果它出现在其他地方，它将失败
$pos=strpos（$html，'Vrijdag'）；
///根据可靠的标签找到起点和终点
如果（$pos！==false）{
$start=stripos（$html，，$pos）；
如果（$start！==false）{
$end=stripos（$html，，$start）；
}
}
///确保我们有一个开始和结束
如果（$start！==false&&$end！==false）{
///我们现在可以抓取表$html；
$table=substr（$html，$start，$end-$start）；
///将brs转换为不会被带标签移除的内容
$table=preg#u replace（“##i”、“\n”、$table）；
}
若有（$表）{
///根据行进行拆分（找到一个接近的tr是非常可靠的）
$rows=preg#u split（“##i”，$table）；
///拆开电池（找到一个接近的td非常可靠）
foreach（$key=>$row的行）{
$rows[$key]=预分割（“##i”，$row）；
}
}
否则{
///创建这样我们就能避免错误
$rows=array（）；
}
///把这个从foreach改为for，因为看起来
///foreach使用的是$rows的副本，因此任何修改
///当循环发生时，我们对$rows进行了修改，但被忽略。
$lof=计数（$rows）；
对于（$rkey=0；$rkey$单元格）{
///拿出我们的rowspan值
if（preg_match（'/rowspan=.（[0-9]+）./'，$cell，$regs））{
///如果行跨度大于一（即跨多行分布）
$rowspan=（int）$regs[1]；
如果（$rowspan>1）{
///然后将此单元格复制到下一行，但减小其行跨度
///所以当我们在下一行找到它时，我们知道还有多少次
///它应该向下延伸。
$newcell=preg_replace（'/（rowspan=）（[0-9]+）（../”，“${1}.”（$rowspan-1）。“${3}”和$cell）；
阵列拼接（$rows[$rkey+1]，$ckey，0，$newcell）；
}
}
}
}
///现在，最后一步是标准化的表格，去掉不需要的标签
///同时，这也将我们的价值观分割成更有用的东西
foreach（$行作为$rkey=>$行
array (
  '0' => array (
    '0' => "<td class='heading'>1",
    '1' => "<td rowspan='1' class='empty'>"
    '2' => "<td rowspan='5' class='value'>3D<br/>009<br/>Hk<br/><br/><br/>"
    ...
  ),
  '0' => array (
    '0' => "<td class='heading'>2",
    '1' => "<td rowspan='2' class='empty'>"
    '2' => "<td rowspan='3' class='value'>Hk<br/>"
    ...
  ),
)

/// can't use foreach here because we want to modify the array within the loop
$lof = count($rows);
for ( $rkey=0; $rkey<$lof; $rkey++ ) {
  /// pull out the row
  $row = $rows[$rkey];
  foreach ( $row as $ckey => $cell ) {
    if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
      $rowspan = (int) $regs[1];
      if ( $rowspan > 1 ) {
        /// there was a gotcha here, I realised afterwards i was constructing
        /// a replacement pattern that looked like this '$14$2'. Which meant
        /// the system tried to find a group at offset 14. To get around this
        /// problem, PHP allows the group reference numbers to be wraped with {}.
        /// so we now get the value of '$1' and '$2' inserted around a literal number
        $newcell = preg_replace('/( rowspan=.)[0-9]+(.)/', '${1}'.($rowspan-1).'${2}', $cell);
        array_splice( $rows[$rkey+1], $ckey, $newcell );
      }
    }
  }
}

/// grab the html
$html = file_get_contents('http://www.cibap.nl/beheer/modules/roosters/create_rooster.php?element=CR13A&soort=klas&week=37&jaar=2012');

/// start with nothing
$table = $start = $end = false;
/// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
$pos = strpos($html, 'Vrijdag');

/// find your start and end based on reliable tags
if ( $pos !== false ) {
  $start = stripos($html, '<tr>', $pos);
  if ( $start !== false ) {
    $end = stripos($html, '</table>', $start);
  }
}

/// make sure we have a start and end
if ( $start !== false && $end !== false ) {
  /// we can now grab our table $html;
  $table = substr($html, $start, $end - $start);
  /// convert brs to something that wont be removed by strip_tags
  $table = preg_replace('#<br ?/>#i', "\n", $table);
}

if ( $table ) {
  /// break apart based on rows (a close tr is quite reliable to find)
  $rows = preg_split('#</tr>#i', $table);
  /// break apart the cells (a close td is quite reliable to find)
  foreach ( $rows as $key => $row ) {
    $rows[$key] = preg_split('#</td>#i', $row);
  }
}
else {
  /// create so we avoid errors
  $rows = array();
}

/// changed this here from a foreach to a for because it seems
/// foreach was working from a copy of $rows and so any modifications
/// we made to $rows while the loop was happening were ignored.
$lof = count($rows);
for ( $rkey=0; $rkey<$lof; $rkey++ ) {
  /// pull out the row
  $row = $rows[$rkey];
  /// step each cell in the row
  foreach ( $row as $ckey => $cell ) {
    /// pull out our rowspan value
    if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
      /// if rowspan is greater than one (i.e. spread across multirows)
      $rowspan = (int) $regs[1];
      if ( $rowspan > 1 ) {
        /// then copy this cell into the next row down, but decrease it's rowspan
        /// so that when we find it in the next row we know how many more times
        /// it should span down.
        $newcell = preg_replace('/( rowspan=.)([0-9]+)(.)/', '${1}'.($rowspan-1).'${3}', $cell);
        array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
      }
    }
  }
}

/// now finally step the normalised table and get rid of the unwanted tags 
/// that remain at the same time split our values in to something more useful
foreach ( $rows as $rkey => $row ) {
  foreach ( $row as $ckey => $cell ) {
    $rows[$rkey][$ckey] = preg_split('/\n+/',trim(strip_tags( $cell )));
  }
}

echo '<xmp>';
print_r($rows);
echo '</xmp>';