Split a string into smaller parts with constraints [PHP regex HTML]

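I need to split a long string into an array, subject to the following constraints:

The input is an HTML string, either a whole page or a fragment.
Each part (new string) has a limited number of characters (e.g. no more than 8000).

Each part may contain several sentences (delimited by a full stop), but no partial sentence, except in the last part of the string (since the last part may have no full stop).

The string contains HTML tags, but a tag must not be split: each HTML tag has to stay complete. A start tag and its end tag may, however, end up in different segments/chunks.

If any sentence in the middle is longer than the required length, the leading and trailing tags and whitespace should go into separate array elements. If the sentence is still too long after that, split it across multiple array elements :(

Please note: there is no need to parse the HTML, only the tags.

I think a regular expression with preg_split can do this. Please help me find a suitable one. Solutions that do not use a regular expression are welcome too.
Thanks,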

Sadi
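Unfortunately, HTML is not a regular language, which means you cannot parse it with a single regular expression. On the other hand, if the input always looks similar, or you only need to parse certain parts, that is not much of a problem. Iterating over the following regular expression yields element names and their contents: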

'~<(?P<element>\w+)(?P<attributes>[^>]*)>(?:(?P<content>.*?)</(?P=element)>)?~'
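To make "iterating over the regular expression" concrete, here is a minimal sketch using preg_match_all. The helper name listElements, the sample markup, the \w+ element name, the named backreference for the closing tag, and the s modifier are all assumptions made for illustration, not part of the answer.

```php
<?php
// Hypothetical helper: list each matched element with its content.
function listElements(string $html): array
{
    // Assumed variant of the pattern: \w+ matches the element name, and
    // (?P=element) requires the closing tag to repeat that name.
    // The s modifier lets .*? span line breaks.
    $pattern = '~<(?P<element>\w+)(?P<attributes>[^>]*)>'
             . '(?:(?P<content>.*?)</(?P=element)>)?~s';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $out = [];
    foreach ($matches as $m) {
        // The content group is absent when the element has no closing tag.
        $out[] = [$m['element'], $m['content'] ?? ''];
    }
    return $out;
}

print_r(listElements('<p class="intro">Hello <b>world</b>.</p><br>'));
```

Nested elements such as the `<b>` here are swallowed into the content of the outer match, which is one reason this only suits input with a predictable shape.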

Correct me if I am wrong, but I do not think you can do this with a simple regexp. In a full regexp implementation you could use something like this:

$parts = preg_split("/(?<!<[^>]*)\./", $input);

But PHP does not allow variable-length lookbehind, so this will not work. Apparently only the JGsoft and .NET regexp engines allow it.
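As an aside that is not from the thread: PCRE, which backs PHP's preg functions, offers backtracking control verbs that can stand in for the missing lookbehind. The sketch below first matches a whole tag and throws it away with (*SKIP)(*F), so only full stops outside tags act as split points. The helper name splitOutsideTags is hypothetical, and a tag is assumed to be anything between `<` and the next `>`.

```php
<?php
// Hypothetical helper: split on full stops that are not inside <...> tags.
function splitOutsideTags(string $input): array
{
    // Left branch: consume a complete tag, then (*SKIP)(*F) rejects it as a
    // match and prevents the engine from retrying at positions inside it.
    // Right branch: a literal full stop, which becomes the split point.
    return preg_split('~<[^>]*>(*SKIP)(*F)|\.~', $input);
}

$input = 'this is a sentence. this is an html <a href="a.b.c">tag. and the closing tag</a>. hooray';
print_r(splitOutsideTags($input));
```

Like the function that follows in this answer, this drops the full stops themselves and does nothing about the length limit.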
My approach would be:

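function splitStringUp($input, $maxlen) {
    // Start from the pieces between full stops, then glue pieces back together.
    $parts = explode(".", $input);
    $i = 0;
    while ($i < count($parts)) {
        // Guard: never merge past the last piece (avoids an endless loop
        // when the input ends inside an unclosed tag).
        $hasNext = $i < count($parts) - 1;
        if ($hasNext && preg_match("/<[^>]*$/", $parts[$i])) {
            // The piece ends inside an unclosed tag, so the full stop
            // belonged to the tag: rejoin it with the next piece.
            array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i + 1]);
        } elseif ($hasNext && strlen($parts[$i] . "." . $parts[$i + 1]) < $maxlen) {
            // The next sentence still fits within the limit: merge it in.
            array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i + 1]);
        } else {
            $i++;
        }
    }
    return $parts;
}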

You did not mention what you want to happen when a single sentence is longer than 8000 characters, so this simply leaves such sentences intact.

Sample output:

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 8000);
array(1) {
  [0]=>
  string(114) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag. and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 80);
array(2) {
  [0]=>
  string(81) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag"
  [1]=>
  string(32) " and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 40);
array(4) {
  [0]=>
  string(18) "this is a sentence"
  [1]=>
  string(25) " this is another sentence"
  [2]=>
  string(36) " this is an html <a href="a.b.c">tag"
  [3]=>
  string(32) " and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 0);
array(5) {
  [0]=>
  string(18) "this is a sentence"
  [1]=>
  string(25) " this is another sentence"
  [2]=>
  string(36) " this is an html <a href="a.b.c">tag"
  [3]=>
  string(24) " and the closing tag</a>"
  [4]=>
  string(7) " hooray"
}

Actually I do not care about the HTML. I care about the tags. A tag starts with < and ends with >. That is enough. Any solution other than a regular expression is fine too. I will try your answer. Thanks for your time :)
Oh! Do not forget the length of each new string. That is the most important part.
Sorry! I forgot to mention that. I will update this.
It looks like your solution drops the full stops :P
Adding the full stop back should not be a problem (I think) :)
Yes, just add a . at the end of each part :)
Hi, please add this constraint: if any sentence in the middle is longer than the required length, the leading and trailing tags and whitespace should go into separate array elements. Even after that, if the sentence is still too long, split it across multiple array elements :(
Sorry, I do not know what you mean. It sounds complicated, and you should be able to modify my code yourself to get it done. After all, it contains all the elements you need.
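The comments above ask for the full stops to be kept and for every part to respect the length limit. One possible way to get both is sketched below: tokenize into complete tags and text, cut the text after each full stop, then pack whole sentences greedily. The function names splitIntoSentences and packSentences are hypothetical. The sketch also assumes that a tag is anything between `<` and the next `>`, and that a sentence longer than the limit is left intact, as in the answer, so it does not implement the extra over-long-sentence constraint.

```php
<?php
// Hypothetical helper: cut the input into sentences. Tags stay whole and each
// full stop stays attached to the sentence it ends.
function splitIntoSentences(string $input): array
{
    // Alternate between complete tags and runs of plain text.
    $tokens = preg_split('/(<[^>]*>)/', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $sentences = [];
    $current = '';
    foreach ($tokens as $token) {
        if ($token[0] === '<' && substr($token, -1) === '>') {
            $current .= $token; // a tag never ends a sentence
            continue;
        }
        // Cut the text after every full stop (fixed-length lookbehind).
        foreach (preg_split('/(?<=\.)/', $token, -1, PREG_SPLIT_NO_EMPTY) as $piece) {
            $current .= $piece;
            if (substr($piece, -1) === '.') {
                $sentences[] = $current;
                $current = '';
            }
        }
    }
    if ($current !== '') {
        $sentences[] = $current; // trailing text without a full stop
    }
    return $sentences;
}

// Hypothetical helper: greedily pack whole sentences into parts of at most
// $maxLen bytes (strlen counts bytes, not characters). A single sentence
// longer than $maxLen becomes a part of its own, unsplit.
function packSentences(array $sentences, int $maxLen): array
{
    $parts = [];
    $current = '';
    foreach ($sentences as $sentence) {
        if ($current !== '' && strlen($current) + strlen($sentence) > $maxLen) {
            $parts[] = $current;
            $current = '';
        }
        $current .= $sentence;
    }
    if ($current !== '') {
        $parts[] = $current;
    }
    return $parts;
}

$input = 'this is a sentence. this is another sentence. this is an html '
       . '<a href="a.b.c">tag. and the closing tag</a>. hooray';
print_r(packSentences(splitIntoSentences($input), 40));
```

Because nothing is discarded, joining the parts with implode('', $parts) reproduces the input exactly, which makes the result easy to check.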