Using a regular expression on multi-line text in PHP


I extracted unformatted text data from a PDF, and it looks like this:

AB01234 This could be a

long question with multiple

new lines a)these b)are c)the responses which could

contains new lines d)either b

AB01235 This is another question with same multiple

response a) one b) two c) three d) four c

...
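My goal is to group the question ID, the question, the answers, and the final character, which is the correct answer. Is there a way to do this with a regular expression?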

{
   [0] => 'AB01234',
   [1] => 'This could be a long question with multiple new lines',
   [2] => 'these',
   [3] => 'are',
   [4] => 'the responses which could contains new lines',
   [5] => 'either',
   [6] => 'b'
}

I would not try to do this with a single regular expression. The input varies too much. I would tidy up the text like this:

$text = '
    AB01234 This could be a
    long question with multiple
    new lines a)these b)are c)the responses which could
    contains new lines d)either b
    AB01235 This is another question with same multiple
    response a) one b) two c) three d) four c
';
$text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' QUESTION\1 ', $text);
$text = preg_replace('/([a-z]\))/', ' ANSWER\1 ', $text);
$text = trim(preg_replace('/\s+/', ' ', $text));
print($text);
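You will see that the text is now fairly clean. It is a single line. The whitespace has been cleaned up. You also have unambiguous QUESTION and ANSWER markers. You can change them to anything you like, such as !@$@QUESTION. They must be something that will never appear in the text.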
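The requirement that the markers never occur in the text can be checked before tagging. The following is a minimal sketch of the same cleanup wrapped in a function; the name `markText` and its parameters are hypothetical, and the arrow functions assume PHP 7.4 or later. It uses `preg_replace_callback` so that a marker containing `$` or `\` is inserted literally rather than being read as a backreference.

```php
<?php
// Hypothetical helper: tags question IDs and answer labels, then collapses whitespace.
function markText(string $text, string $qMark = 'QUESTION', string $aMark = 'ANSWER'): string
{
    // Refuse markers that already occur in the input; they would split the text in the wrong places.
    foreach ([$qMark, $aMark] as $mark) {
        if (strpos($text, $mark) !== false) {
            throw new InvalidArgumentException("Marker '$mark' already occurs in the text");
        }
    }
    // Put the question marker in front of every ID such as AB01234.
    $text = preg_replace_callback('/[A-Z]{2}[0-9]{5}/', fn($m) => " {$qMark}{$m[0]} ", $text);
    // Put the answer marker in front of every label such as a) or b).
    $text = preg_replace_callback('/[a-z]\)/', fn($m) => " {$aMark}{$m[0]} ", $text);
    // Collapse newlines and runs of spaces into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$raw = "AB01234 This could be a\nlong question a)these b)are\nc)those d)either b";
echo markText($raw), "\n";
// QUESTIONAB01234 This could be a long question ANSWERa) these ANSWERb) are ANSWERc) those ANSWERd) either b
```

Note that `[a-z]\)` also matches inside ordinary text such as "(optional)", so this sketch assumes a lowercase letter followed by a closing parenthesis only ever appears as an answer label.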

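Now you could try a regular expression, but at this point splitting is easier because you have marked the delimiters. I use explode and implode a lot in this example, in case you have not seen them much. You do not have to use them. You could use a regular expression or substrings instead.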

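$questions = array();
// Each chunk after a QUESTION marker holds one ID, its question text, and its answers.
$qas = explode("QUESTION", $text);
foreach($qas as $qa)
{
    $qa = trim($qa);
    if($qa == "") continue; // skip the empty chunk before the first marker
    $answers = explode("ANSWER", $qa);
    $q = array();
    foreach($answers as $i=>$answer)
    {
        // The first word is the question ID (chunk 0) or a label such as "a)" (answer chunks).
        $a = explode(' ', trim($answer));
        if($i == 0) $q[] = $a[0];
        array_shift($a);
        $q[] = implode(' ', $a);
    }
    // The last answer still ends with the correct-answer letter; split it off.
    $last = explode(' ', array_pop($q));
    $correct = array_pop($last);
    $q[] = implode(' ', $last);
    $q[] = $correct;
    $questions[] = $q;
}
print_r($questions);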

Now you should have the array you want.
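For readers less familiar with explode and implode, the label-stripping step can be seen in isolation. This is only an illustrative sketch; the variable names `$parts` and `$rest` are made up for the example.

```php
<?php
// Split an answer chunk into words.
$parts = explode(' ', 'a) the responses which could');
// Drop the first word, which is the "a)" label.
array_shift($parts);
// Join the remaining words back into one string.
$rest = implode(' ', $parts);
echo $rest, "\n"; // the responses which could
```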

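What does your code (and regex) look like? And what exactly are the rules for identifying each part of the text?

Actually, I can only identify the question ID, with ([A-Z]{2}[0-9]{5}). The next part would be the question itself. More precisely, the text consists of: question, a) text, b) text, c) text, d) text, correct answer, and then it repeats. The question can be extracted by finding everything between the question ID (the first group match) and a). The same goes for answers a through c. Answer d can be extracted from d) up to the single character (the correct answer) that comes before the next question ID.

Assuming you only search within a single question and put the whole string on one line, your regex would look like this: ...

This may be a partial solution, because I have many more questions concatenated together, and the PDF parser probably failed in some places; in fact, some words contain newlines (e.g. apple was extracted as [\n]pple). Anyway, thanks for the one-line solution. The implode and explode functions are my last resort. I would like to do this with a regex :)

I would still clean up the text as shown. Then you can use the markers in the regex to make it simpler, for example "/QUESTION([A-Z]{2}[0-9]{5}.*)ANSWER(....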
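One way the marker-based regex idea could be completed is sketched below. It is not the commenter's actual pattern, which is cut off in the thread. It assumes the text has already been cleaned into a single line with QUESTION and ANSWER markers, and that every question has exactly four answers, a) to d), followed by a single correct-answer letter from a to d.

```php
<?php
// Marked, single-line text as produced by the cleanup step.
$text = 'QUESTIONAB01234 This could be a long question with multiple new lines '
      . 'ANSWERa) these ANSWERb) are ANSWERc) the responses which could contains new lines '
      . 'ANSWERd) either b '
      . 'QUESTIONAB01235 This is another question with same multiple response '
      . 'ANSWERa) one ANSWERb) two ANSWERc) three ANSWERd) four c';

// Groups: 1 = ID, 2 = question, 3 to 6 = answers a) to d), 7 = correct-answer letter.
// The lookahead requires the letter to be the last token before the next QUESTION or the end.
$pattern = '/QUESTION([A-Z]{2}[0-9]{5})\s+(.*?)\s*'
         . 'ANSWERa\)\s*(.*?)\s*ANSWERb\)\s*(.*?)\s*'
         . 'ANSWERc\)\s*(.*?)\s*ANSWERd\)\s*(.*?)\s+([a-d])\s*(?=QUESTION|$)/';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$questions = array();
foreach ($matches as $m) {
    array_shift($m);   // drop the full match, keep the seven captured groups
    $questions[] = $m;
}
print_r($questions);
```

A question with a different number of answers, or with a word broken by the PDF parser in the middle of a marker, would simply not match, so comparing the number of matches with the number of QUESTION markers is a cheap sanity check.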