Using regular expressions on multiline text in PHP


php regex


I extracted unformatted text data from a PDF, as shown below:

AB01234 This could be a

long question with multiple

new lines a)these b)are c)the responses which could

contains new lines d)either b

AB01235 This is another question with same multiple

response a) one b) two c) three d) four c

...


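My goal is to group the question ID, the question, the answers, and the correct answer (the last character). Is there any way to do this with a regular expression?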

{
   [0] => 'AB01234',
   [1] => 'This could be a long question with multiple new lines',
   [2] => 'these',
   [3] => 'are',
   [4] => 'the responses which could contains new lines',
   [5] => 'either',
   [6] => 'b'
}

I wouldn't try to do this with a single regular expression. The input varies too much. I would clean up the text like this:


$text = '
    AB01234 This could be a
    long question with multiple
    new lines a)these b)are c)the responses which could
    contains new lines d)either b
    AB01235 This is another question with same multiple
    response a) one b) two c) three d) four c
';
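// Mark each question ID (two capital letters followed by five digits) with a QUESTION delimiter.
$text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' QUESTION\1 ', $text);
// Mark each answer label such as "a)" with an ANSWER delimiter.
$text = preg_replace('/([a-z]\))/', ' ANSWER\1 ', $text);
// Collapse every run of whitespace, newlines included, into a single space.
$text = trim(preg_replace('/\s+/', ' ', $text));
print($text);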

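You'll see that the text is now fairly clean. It is a single line. The whitespace has been cleaned up. You also have clear QUESTION and ANSWER markers. You can change them to anything you like, such as !@$@QUESTION. They just have to be something that will never appear in the text.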
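If you are unsure whether a marker is safe, you can check the raw text before inserting it. This is only a minimal sketch: markersAreSafe is a hypothetical helper name, and the sample string is a shortened version of the text above. The check is case-sensitive, which matches how explode treats the markers later.

```php
<?php
// Hypothetical helper: true when none of the markers already occur in the raw text.
function markersAreSafe(string $text, array $markers): bool
{
    foreach ($markers as $marker) {
        // strpos is case-sensitive, so the lowercase word "question" does not collide with "QUESTION".
        if (strpos($text, $marker) !== false) {
            return false; // the marker collides with real content
        }
    }
    return true;
}

$raw = "AB01234 This could be a\nlong question a)these b)are c)more d)either b";
var_dump(markersAreSafe($raw, array('QUESTION', 'ANSWER'))); // bool(true)
var_dump(markersAreSafe($raw, array('question')));           // bool(false)
```

If the check fails, pick a different marker before running the preg_replace steps.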

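Now you could try a regular expression, but at this point splitting is easier because you have marked the delimiters. In this example I use explode and implode a lot, in case you haven't seen them much. You don't have to use them. You could use regular expressions or substrings instead.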

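$questions = array();
$qas = explode("QUESTION", $text);
foreach($qas as $qa)
{
    // Skip the empty piece in front of the first QUESTION marker.
    if(trim($qa) == "") continue;
    $answers = explode("ANSWER", $qa);
    $q = array();
    foreach($answers as $i=>$answer)
    {
        $a = explode(' ', trim($answer));
        // The first piece starts with the question ID; keep it.
        if($i == 0) $q[] = $a[0];
        // Drop the leading token (the ID, or an answer label such as "a)").
        array_shift($a);
        $q[] = implode(' ', $a);
    }
    // The last answer still ends with the correct-answer letter; split it off.
    $last = explode(' ', array_pop($q));
    $correct = array_pop($last);
    $q[] = implode(' ', $last);
    $q[] = $correct;
    $questions[] = $q;
}
print_r($questions);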

Now you should have the array you wanted.
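As mentioned, explode is not the only option. Here is a rough sketch of the same split done with preg_split, assuming the text has already been cleaned up and marked as above (the sample string is shortened). Splitting on the marker together with the answer label removes the label in the same step. Note that the last piece still carries the trailing correct-answer letter.

```php
<?php
// Cleaned-up text in the shape produced by the preg_replace steps (shortened sample).
$text = 'QUESTIONAB01234 This could be a long question ANSWERa) these ANSWERb) are ANSWERc) the responses ANSWERd) either b';

$rows = array();
// PREG_SPLIT_NO_EMPTY drops the empty piece in front of the first QUESTION marker.
foreach (preg_split('/\s*QUESTION/', $text, -1, PREG_SPLIT_NO_EMPTY) as $qa) {
    // Split on "ANSWER" plus the label ("a)", "b)", ...) so the label is discarded.
    $parts = preg_split('/\s*ANSWER[a-z]\)\s*/', trim($qa));
    // The first piece is "ID question text": cut it at the first space.
    $head = array_shift($parts);
    $space = strpos($head, ' ');
    $row = array(substr($head, 0, $space), substr($head, $space + 1));
    // The remaining pieces are the answers; the last one still ends with the correct letter.
    $rows[] = array_merge($row, $parts);
}
print_r($rows);
```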

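What does your code (and regular expression) look like? And what exactly are the rules for identifying the individual parts of the text?

Actually, I can only identify the question ID, with ([A-Z]{2}[0-9]{5}). The next part is the question itself. To be more precise, the text consists of repetitions of: question ID, question, a) text b) text c) text d) text, correct answer. The question can be extracted by taking everything between the question ID (the first group match) and a). The same goes for answers a to c. Answer d can be extracted from d) up to the single character (the correct answer) that precedes the next question ID.

Assuming you only search within one question and put the whole string on a single line, your regular expression would look like this: [...]

This could be a partial solution, because I have many more questions concatenated together, and the PDF parser probably failed in places; in fact, some words contain newlines (for example, apple was extracted as a[\n]pple). Anyway, thanks for the one-line solution. The implode/explode functions are my last resort; I would like to do this with a regular expression :)

I would still clean up the text as shown. Then you can use the markers in the regular expression to make it simpler, for example '/QUESTION([A-Z]{2}[0-9]{5}.*)ANSWER(....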
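For what it's worth, here is one way the marker-based regular expression could look once the text is cleaned up. It is not the truncated pattern from the comment above, only a sketch under some assumptions: every question has exactly the four answers a) to d), the correct answer is a single letter from a to d, and the markers were inserted by the preg_replace steps shown earlier.

```php
<?php
// Cleaned-up, single-line text in the shape produced by the preg_replace steps.
$text = 'QUESTIONAB01234 This could be a long question with multiple new lines '
      . 'ANSWERa) these ANSWERb) are ANSWERc) the responses which could contains new lines '
      . 'ANSWERd) either b '
      . 'QUESTIONAB01235 This is another question with same multiple response '
      . 'ANSWERa) one ANSWERb) two ANSWERc) three ANSWERd) four c';

// Groups: ID, question, answers a to d, correct letter.
// The lookahead anchors the correct letter to the next QUESTION marker or the end of the text.
$pattern = '/QUESTION([A-Z]{2}[0-9]{5}) (.*?) ANSWERa\) (.*?) ANSWERb\) (.*?)'
         . ' ANSWERc\) (.*?) ANSWERd\) (.*?) ([a-d])(?= QUESTION|$)/';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$questions = array();
foreach ($matches as $m) {
    array_shift($m); // drop the full match and keep the seven captured groups
    $questions[] = $m;
}
print_r($questions);
```

Each entry in $questions then has the layout asked for in the question: ID, question, four answers, correct letter. A question that does not follow the assumed layout is simply not matched, so it is worth comparing the number of matches with the number of question IDs.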