How to convert a structured text file into a PHP multidimensional array


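I have 100 files, each containing x number of news articles. The articles are structured in sections marked by the following abbreviations:

HD BY WC PD SN SC PG LA CY LP TD CO IN NS RE IPC PUB AN

where [LP] and [TD] can contain any number of paragraphs.

A typical article looks like this: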

HD Corporate News: Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
WC 421 words
PD 12 July 2011
SN The Wall Street Journal
SC J
PG B7
LA English
CY (Copyright (c) 2011, Dow Jones & Company, Inc.) 

LP 

Alcoa Inc.'s profit more than doubled in the second quarter, but the giant 
aluminum producer managed only to meet analysts' recently lowered forecasts.

Alcoa serves as a bellwether for U.S. corporate earnings because it is the 
first major company to report and draws demand from a wide range of 
industries.

TD 

The results marked an early test of how corporate optimism is holding up 
in the face of bleak economic news.

License this article from Dow Jones Reprint 
Service[http://www.djreprints.com/link/link.html?FACTIVA=wjco20110712000115]

CO 
almam : ALCOA Inc

IN 
i2245 : Aluminum | i22 : Primary Metals | i224 : Non-ferrous Metals | imet 
  : Metals/Mining

NS 
c15 : Performance | c151 : Earnings | c1521 : Analyst 
Comment/Recommendation | ccat : Corporate/Industrial News | c152 : 
Earnings Projections | ncat : Content Types | nfact : Factiva Filters | 
nfce : FC&E Exclusion Filter | nfcpin : FC&E Industry News Filter

RE 
usa : United States | use : Northeast U.S. | uspa : Pennsylvania | namz : 
North America

IPC 
DJCS | EWR | BSC | NND | CNS | LMJ | TPT

PUB 
Dow Jones & Company, Inc.

AN 
Document J000000020110712e77c00035

After each article there are 4 line breaks before the next article begins. I need to put these articles into an array, like this:

$articles = array(
  [0] => array (
    [HD] => Corporate News: Alcoa earnings Soar; Outlook...
    [BY] => By James R. Hagerty...
    ...
    [AN] => Document J000000020110712e77c00035
  )
)

[EDIT]
Thanks to @Casimir et Hippolyte, I now have:

$path = "C:/path/to/textfiles/";

if ($handle = opendir($path)) {
  while (false !== ($file = readdir($handle))) {
    if ('.' === $file) continue;
    if ('..' === $file) continue;

    $text = file_get_contents($path . $file);
    $subjects = explode("\r\n\r\n\r\n\r\n", $text);

    $pattern = <<<'LOD'
        ~
        # definition
        (?(DEFINE)(?<fieldname>(?<=^|\n)(?>HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)))
        # pattern
        \G(?<key>\g<fieldname>)\s++(?<value>[^\n]++(?>\n{1,2}+(?!\g<fieldname>) [^\n]++ )*+)(?>\n{1,3}|$)
        ~x 
LOD;

    $result = array();
    foreach($subjects as $i => $subject) {
      if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
          $result[$i][$match['key']] = $match['value'];
        }
      }
    }
  }
  closedir($handle);
  echo '<pre>';
  print_r($result);
}
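
However, no matches are found and no errors are reported. Can anyone tell me what is going wrong here?

One approach is to use explode to separate each block and a regex to extract the fields: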

\G                        # this forces the match to be contiguous to the
                          # precedent match or the start of the string (no gap)
(?<key> \g<fieldname> )   # a capturing group named "key" for the fieldname
\s++                      # one or more white characters
(?<value>                 # open a capturing group named "value" for the
                          # field content
    [^\n]++               # all characters except newlines 1 or more times
    (?>                   # open an atomic group
        \n{1,2}+          # one or two newlines to allow paragraphs (LP & TD) 
        (?!\g<fieldname>) # but not followed by a fieldname (only a check)
        [^\n]++           # all characters except newlines 1 or more times
    )*+                   # close the atomic group and repeat 0 or more times
)                         # close the capture group "value"
(?>\n{1,3}|$)             # between 1 or 3 newlines max. or the end of the
                          # string (necessary if i want contiguous matches)

The x at the end of $pattern enables verbose mode in the regex (you can put comments inside with # and lay out the pattern with whitespace as you like).

Note: this pattern does not care about the order of the fields or whether a given field is present.
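For better readability, I use (?(DEFINE)...) to declare the fieldname subpattern once and then refer to it with \g<fieldname>.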
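
Because the pattern accepts fields in any order and tolerates missing ones, the parsed articles may not all have the same keys. If a uniform shape is wanted, one possible follow-up step is sketched below. The helper name normalizeArticle is hypothetical and not part of the original answer, and filling absent fields with null is an assumption about what is useful downstream.

```php
<?php
// Hypothetical helper: give every parsed article the same keys, in a fixed
// order, using null for fields that were absent in the source text.
function normalizeArticle(array $article, array $fields): array
{
    // array_fill_keys builds [field => null, ...] in the order of $fields;
    // array_replace then overwrites those nulls with the values found.
    return array_replace(array_fill_keys($fields, null), $article);
}

$fields = ['HD', 'BY', 'AN'];                  // shortened list for the demo
$parsed = ['AN' => 'Document 1', 'HD' => 'Title one'];

var_dump(normalizeArticle($parsed, $fields));
```

Keys that are not listed in $fields are kept and appended after the known ones, so nothing the regex captured is lost.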
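
Comments:

You might want to link to the Heredoc string quoting documentation. If someone pastes this into something like Dreamweaver, all sorts of errors will show up.

Thanks! But running this example on the data in the question returns zero matches. $subjects contains one document (as shown in $text in the question), but the pattern does not match anything?

@Pr0no: I have tested it with the text sample and it works well. I will post the data sample.

I have updated the question to reflect your answer. I cannot get it to work. Where am I going wrong?

@Pr0no: The first delimiter ~ must come right after 'LOD', at the very start of the next line (no spaces or tabs before it). Since the text files use \r\n for line breaks, you must replace \n in the pattern with \r\n, see the edit.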
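
As an alternative to rewriting the pattern for \r\n, the line endings could be normalized before splitting and matching. This is only a sketch of that idea: the helper name normalizeNewlines is hypothetical, and it assumes the files contain no lone \r characters that are meant to be kept.

```php
<?php
// Hypothetical helper: convert Windows (\r\n) and bare \r line endings to \n.
function normalizeNewlines(string $text): string
{
    // Replace "\r\n" first so a CRLF pair does not turn into two newlines.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$text = "HD Title\r\nAN \r\nDocument 1\r\n\r\n\r\n\r\nHD Other\r\nAN \r\nDocument 2";

// After normalizing, the article separator is four "\n" characters.
$subjects = explode("\n\n\n\n", normalizeNewlines($text));
echo count($subjects), "\n"; // prints 2
```

With the text normalized this way, a pattern written only in terms of \n should apply without further changes.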
$pattern = <<<'LOD'
~
# definition
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
                  (?>HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)
    )
)

# pattern
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\r\n]++ 
    (?> (?>\r?\n){1,2}+ (?!\g<fieldname>) [^\r\n]++ )*+
)
(?>(?>\r?\n){1,3}|$)
~x
LOD;
$subjects = explode("\r\n\r\n\r\n\r\n", $text);
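
Putting the edited pattern and the explode call together, a self-contained sketch might look like the following. The function name parseArticles and the tiny two-article sample are assumptions made for illustration. Trimming each value is also an addition, since several lines in the real data end with a trailing space.

```php
<?php
// Hypothetical helper: split a file's text into articles and parse each
// article into an array of [field abbreviation => value].
function parseArticles(string $text): array
{
    // Nowdoc with the ~ delimiter at the start of the line, as advised above.
    $pattern = <<<'LOD'
~
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
                  (?>HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)
    )
)
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\r\n]++
    (?> (?>\r?\n){1,2}+ (?!\g<fieldname>) [^\r\n]++ )*+
)
(?>(?>\r?\n){1,3}|$)
~x
LOD;

    $articles = [];
    // Articles are separated by four consecutive \r\n line breaks.
    foreach (explode("\r\n\r\n\r\n\r\n", $text) as $subject) {
        $article = [];
        // \G needs the first field name at offset 0, so trim each block.
        if (preg_match_all($pattern, trim($subject), $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                $article[$match['key']] = trim($match['value']);
            }
        }
        if ($article !== []) {
            $articles[] = $article;
        }
    }
    return $articles;
}

// Made-up sample in the same layout as the real files.
$sample = "HD Title one\r\nBY By Someone\r\nLP \r\n\r\nFirst paragraph.\r\n\r\n"
        . "Second paragraph.\r\n\r\nAN \r\nDocument 1\r\n\r\n\r\n\r\n"
        . "HD Title two\r\nAN \r\nDocument 2";

print_r(parseArticles($sample));
```

Collecting the results inside a function avoids resetting $result for every file, so the arrays returned for each file can be merged into one list by the caller.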