Regular expressions over multi-line text in PHP
I extracted unformatted text data from a PDF, like this:
AB01234 This could be a
long question with multiple
new lines a)these b)are c)the responses which could
contains new lines d)either b
AB01235 This is another question with same multiple
response a) one b) two c) three d) four c
...
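My goal is to group the question ID, the question text, the answers, and the correct answer (the last character). Is there a way to do this with a regular expression? The result should look like this: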
{
[0] => 'AB01234',
[1] => 'This could be a long question with multiple new lines',
[2] => 'these',
[3] => 'are',
[4] => 'the responses which could contains new lines',
[5] => 'either',
[6] => 'b'
}
I wouldn't try to do this with a single regular expression. The input varies too much. I would clean up the text first, like this:
$text = '
AB01234 This could be a
long question with multiple
new lines a)these b)are c)the responses which could
contains new lines d)either b
AB01235 This is another question with same multiple
response a) one b) two c) three d) four c
';
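// Mark every question ID (two capital letters plus five digits) with QUESTION.
$text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' QUESTION\1 ', $text);
// Mark every answer label such as "a)" with ANSWER.
$text = preg_replace('/([a-z]\))/', ' ANSWER\1 ', $text);
// Collapse all whitespace, including newlines, into single spaces.
$text = trim(preg_replace('/\s+/', ' ', $text));
print($text);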
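You'll see that the text is now fairly clean. It is a single line, and the whitespace has been tidied up. You also have explicit QUESTION and ANSWER markers. You can change them to anything you like, for example !@$@QUESTION. They just have to be something that will never appear in the text itself.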
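To make the choice of markers explicit, here is a minimal sketch. The helper name `markText()` and its parameters are hypothetical and not part of the original answer. It applies the same three cleanup steps, but takes the two markers as arguments and refuses any marker that already occurs in the input:

```php
<?php
// Hypothetical helper (not from the original answer): the same three cleanup
// steps, with the two marker strings passed in as parameters.
function markText(string $text, string $qMark = 'QUESTION', string $aMark = 'ANSWER'): string
{
    // A marker that already occurs in the input would cause false splits later.
    foreach ([$qMark, $aMark] as $mark) {
        if (strpos($text, $mark) !== false) {
            throw new InvalidArgumentException("Marker already occurs in the text: $mark");
        }
    }
    // Callbacks keep the marker literal even if it contains "$" or "\".
    $text = preg_replace_callback('/[A-Z]{2}[0-9]{5}/', function ($m) use ($qMark) {
        return ' ' . $qMark . $m[0] . ' ';
    }, $text);
    $text = preg_replace_callback('/[a-z]\)/', function ($m) use ($aMark) {
        return ' ' . $aMark . $m[0] . ' ';
    }, $text);
    // Collapse every run of whitespace (including newlines) into one space.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Shortened sample input, made up for illustration.
$sample = "AB01234 This could be a\nlong question a)these b)are\nd)either b\n";
$marked = markText($sample, '!@$@QUESTION', '!@$@ANSWER');
echo $marked, "\n";
```

The callbacks are only there so that a marker containing `$` or `\` is never interpreted as a backreference in the replacement string. With plain alphabetic markers the original `preg_replace` calls behave the same way.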
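Now you could try a regular expression, but at this point explode is easier because you have marked the delimiters. I use explode and implode a lot in this example, in case you haven't seen them much. You don't have to use them; you could use a regex or substr instead.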
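$questions = array();
// Each chunk after a QUESTION marker holds one ID, its question text, and its answers.
$qas = explode("QUESTION", $text);
foreach ($qas as $qa)
{
    $qa = trim($qa);
    if ($qa == "") continue; // skip the empty chunk before the first marker
    $answers = explode("ANSWER", $qa);
    $last = count($answers) - 1;
    $q = array();
    foreach ($answers as $i => $answer)
    {
        $a = explode(' ', trim($answer));
        // The first token is the question ID (chunk 0) or an answer label such as "a)".
        if ($i == 0) $q[] = $a[0];
        array_shift($a);
        // The final answer chunk ends with the correct-answer letter; split it off.
        $correct = ($i > 0 && $i == $last) ? array_pop($a) : null;
        $q[] = implode(' ', $a);
        if ($correct !== null) $q[] = $correct;
    }
    $questions[] = $q;
}
print_r($questions);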
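Now you should have the array you want.
What does your code (and regex) look like? And what exactly are the rules for identifying the individual parts of the text?
Actually I can only identify the question ID, with ([A-Z]{2}[0-9]{5}). The next part is the question itself. More precisely, the text consists of: question a) text b) text c) text d) text correct-answer, repeated. The question can be extracted by taking everything between the question ID (the first group match) and a). The same goes for answers a to c. Answer d can be extracted from d) onward, matching one character (the correct answer) just before the next question ID.
Assuming you search within a single question only and put the whole string on one line, your regex would look something like this:
That may be a partial solution, since I have many more questions concatenated together, and the PDF parser probably fails in places; in fact, some words contain newlines (for example, apple was extracted as [\n]pple). Anyway, thanks for the one-line solution. The implode and explode functions are my last resort. I would like to achieve this with a regex :)
I would still clean up the text as shown. Then you can use the markers in the regex to make it simpler, e.g. "/QUESTION([A-Z]{2}[0-9]{5}.*)ANSWER(....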
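One way to follow that suggestion is sketched below. The pattern is an assumption and not the truncated expression from the comment above. It expects the cleaned, single-line text with QUESTION and ANSWER markers, exactly four answers labelled a) to d), and a single lowercase letter as the correct answer at the end of each question:

```php
<?php
// Cleaned text as produced by the three preg_replace calls in the answer.
$text = 'QUESTIONAB01234 This could be a long question with multiple new lines '
      . 'ANSWERa) these ANSWERb) are ANSWERc) the responses which could contains new lines '
      . 'ANSWERd) either b QUESTIONAB01235 This is another question with same multiple response '
      . 'ANSWERa) one ANSWERb) two ANSWERc) three ANSWERd) four c';

// Assumed pattern: ID, question text, four answers, then one letter that sits
// right before the next QUESTION marker or the end of the string.
$pattern = '/QUESTION([A-Z]{2}[0-9]{5})\s+(.*?)\s*'
         . 'ANSWERa\)\s*(.*?)\s*ANSWERb\)\s*(.*?)\s*'
         . 'ANSWERc\)\s*(.*?)\s*ANSWERd\)\s*(.*?)\s+([a-z])\s*(?=QUESTION|$)/';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$questions = array();
foreach ($matches as $m) {
    array_shift($m);      // drop the full match, keep the seven groups
    $questions[] = $m;
}
print_r($questions);
```

Because the pattern hard-codes four answers, a question with a missing or mangled label (which the PDF extraction issues mentioned above make likely) simply will not match. Comparing `count($questions)` with the number of QUESTION markers is a cheap way to spot such cases.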