Java: how to detect whitespace that is not inside single or double quotes

java regex


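I am trying to create a Java regular expression that replaces every run of whitespace in a string with a single space, unless the whitespace appears between quotes (single or double).

If I were only looking for double quotes, I could use a lookahead: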

text.replaceAll("\\s+ (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", " ");

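If I were only looking for single quotes, I could use a similar pattern.

The trick is handling both.

I had the bright idea of running the double-quote pattern first and then the single-quote pattern but, of course, that ends up replacing all of the whitespace regardless of the quotes.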
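
A minimal sketch of why chaining the two passes fails. It uses the double-quote pattern exactly as written in the question plus a single-quote variant built by analogy; the variant, the class name and the method name are assumptions made for illustration. When the remaining text holds no double quote, the double-quote lookahead succeeds everywhere, so that pass also collapses the whitespace inside '...' (and the single-quote pass does the same inside "...").

```java
class ChainedLookaheads {
    // Pattern from the question: whitespace after which the rest of the
    // string contains an even number of double quotes.
    static final String OUTSIDE_DOUBLE = "\\s+ (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
    // Assumed single-quote counterpart, built by analogy (not from the question).
    static final String OUTSIDE_SINGLE = "\\s+ (?=(?:[^']*'[^']*')*[^']*$)";

    // Runs the double-quote pass first, then the single-quote pass.
    static String chained(String text) {
        return text.replaceAll(OUTSIDE_DOUBLE, " ")
                   .replaceAll(OUTSIDE_SINGLE, " ");
    }

    public static void main(String[] args) {
        // The second pass ignores double quotes and collapses inside "...".
        System.out.println(chained("a   b   \"c    d\"   e")); // a b "c d" e
        // The first pass ignores single quotes and collapses inside '...'.
        System.out.println(chained("a   b   'c    d'   e"));   // a b 'c d' e
    }
}
```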

Here are some tests and the expected results:

a   b   c    d   e   -->  a b c d e
a   b   "c    d"   e -->  a b "c    d" e
a   b   'c    d'   e -->  a b 'c    d' e
a   b   "c    d'   e -->  a b "c d' e    (Can't mix and match quotes)

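Is there a way to do this with a Java regular expression?

Assume that invalid input has already been validated separately, so none of the following will occur: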

a "b c ' d
a 'b " c' d
a 'b c d

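I would suggest normalizing the string delimiters first: use a regex replace to swap the alternative quote character for the one you standardize on. Say you settle on the double quote ("). You can then split the string on "; every odd element is quoted content and every even element is unquoted. Run the whitespace regex replace on the even elements only and rebuild the string from the modified array.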
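
The split-and-rebuild idea could be sketched as follows. This is only an illustration under stated assumptions: the input is assumed to use double quotes only (the normalization step is left out), the quotes are assumed to be balanced, and the class and method names are hypothetical. With zero-based array indexes, the unquoted pieces are the ones at even indexes.

```java
import java.util.StringJoiner;

class SplitOnQuotes {
    // Assumes balanced double quotes and no escaped quotes.
    static String collapseOutsideDoubleQuotes(String text) {
        // Limit -1 keeps trailing empty pieces, so the quotes can be put back exactly.
        String[] parts = text.split("\"", -1);
        StringJoiner out = new StringJoiner("\"");
        for (int i = 0; i < parts.length; i++) {
            // Even index: outside quotes, collapse whitespace. Odd index: quoted, keep as is.
            out.add(i % 2 == 0 ? parts[i].replaceAll("\\s+", " ") : parts[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseOutsideDoubleQuotes("a   b   \"c    d\"   e")); // a b "c    d" e
    }
}
```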

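Edit: since @DeanTaylor fixed his regex, I am fixing (amending) this one as well,
in case somebody decides to use it on unbalanced quotes.

The original test for balanced quotes had an atomic group.
I never added it to the parsing logic, so it has been added now. That is all.

You can match either quotes or whitespace in an alternation
and check which group matched to decide what to replace.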
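
A rough sketch of that alternation approach, assuming the same quoted-string sub-patterns used elsewhere in this answer; the class name, method name and group layout are illustrative choices, not the author's code. Group 1 captures a complete quoted string and is copied through unchanged; anything else that matched is a whitespace run and becomes one space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationReplace {
    // Group 1: a complete double- or single-quoted string (backslash escapes allowed).
    // Group 2: a run of whitespace that is not part of a complete quoted string.
    private static final Pattern QUOTED_OR_SPACE = Pattern.compile(
        "(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*')|(\\s+)");

    static String collapse(String text) {
        Matcher m = QUOTED_OR_SPACE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Decide per match: keep quoted text, replace whitespace with one space.
            String replacement = (m.group(1) != null) ? m.group(1) : " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("a   b   \"c    d\"   e")); // a b "c    d" e
        System.out.println(collapse("a   b   'c    d'   e"));   // a b 'c    d' e
    }
}
```

Because an opening quote with no partner never completes group 1, whitespace after a mismatched quote is treated as unquoted in this sketch.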

Or use this regex, which captures both and avoids having to make that decision.

Find:

\G((?>"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^"'\s]+)*)\s+

"\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+"
Replace: $1 followed by a single space
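
A small usage sketch for this find/replace pair. The pattern string is assembled from the commented, expanded form of this regex (the listing that starts with \G further down the page); the class and method names are hypothetical, and the replacement "$1 " assumes that the matched whitespace should become exactly one space.

```java
class GAnchoredReplace {
    // \G chains each match to the end of the previous one; group 1 holds the
    // quoted strings and non-whitespace text that precede a whitespace run.
    static final String FIND =
        "\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+";

    static String collapse(String text) {
        // Put back group 1 and replace the trailing whitespace run with one space.
        return text.replaceAll(FIND, "$1 ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("a   b   c    d   e"));     // a b c d e
        System.out.println(collapse("a   b   \"c    d\"   e")); // a b "c    d" e
        System.out.println(collapse("a   b   'c    d'   e"));   // a b 'c    d' e
    }
}
```

Because of \G, matching stops at the first quote that has no partner, so text after an unbalanced quote is left untouched in this sketch.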



Note: before using the regex above, you can test the string for balanced quotes.

This tests the string; if it passes, the quotes are balanced.
^(?>(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^"'])+)$

"^(?>(?:\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'])+)$"
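
A hedged sketch of such a balance check. The exact pattern below is an assumption: it follows the idea described in this answer (an atomic group around complete quoted strings and single non-quote characters, anchored at both ends), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class BalancedQuotes {
    // Atomic group: once a stretch of text has been consumed the engine does not
    // retry it, so an opening quote without a partner makes the whole match fail.
    private static final Pattern BALANCED = Pattern.compile(
        "^(?>(?:\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'])+)$");

    static boolean isBalanced(String text) {
        return BALANCED.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("a \"b c\" d")); // true
        System.out.println(isBalanced("a 'b c d"));    // false
    }
}
```

Note that a quote character of the other kind inside a quoted string counts as ordinary content here, so a 'b " c' d passes this check.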


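Update: tests against @DeanTaylor's new answer
Example 1 - for the string Word1  Word2 (two spaces between the words)

This version takes about 27 steps
@DeanTaylor's version takes about 29 steps

Example 2 - for the string "Example"  AnotherWord (two spaces between the words)

This version takes about 51 steps
@DeanTaylor's version takes 36 steps (presumably because of the unrolled loop)

Example 3 - for a file from WordPress

This version takes about 315647 steps
@DeanTaylor's version takes 122701 steps (Dean's version does not process single spaces)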

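Other Example 3 tests: generating a permalink on regex101.com makes the page unresponsive, which shows what a piece of junk it really is.

EDIT - NOTE - this answer has a bug/flaw
It requires whitespace between the closing quote (" or ') and the character that follows in order to match a quoted string correctly. As a result, this answer does not handle some text correctly.
It may well have other flaws, but this is one of them.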
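
The flaw can be reproduced with a short sketch. It assumes the pattern and the "$1$2 " replacement from the first replaceAll listing on this page; the class name, method name and sample strings are made up for illustration. When a character follows the closing quote directly, the quoted alternative cannot be followed by whitespace, so the engine falls back to matching the whitespace inside the quotes.

```java
class FirstRegexFlaw {
    // Pattern from the first replaceAll listing (replacement "$1$2 ").
    static final String FIND =
        "([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)";

    static String collapse(String text) {
        return text.replaceAll(FIND, "$1$2 ");
    }

    public static void main(String[] args) {
        // Space after the closing quote: the quoted text is preserved.
        System.out.println(collapse("\"a  b\" c  d")); // "a  b" c d
        // No space after the closing quote: the quoted text is collapsed too.
        System.out.println(collapse("\"a  b\"c  d"));  // "a b"c d
    }
}
```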
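EDIT - alternative answer
I have added a version without this flaw.
Leaving this one here for posterity.
Supports
Escaped quotes via \" and \', and multi-line quotes
Regex

([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)

Replacement
"$1$2 " (yes, there is a space at the end)

Code and human-readable breakdown: see the replaceAll listing that uses "$1$2 " near the end of this page.

Supports

Escaped quotes via \" and \', and multi-line quotes
Unmatched quotes; a quote is terminated by the end of the string
Additional optimizations for large files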
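
A brief sketch exercising two of these points, assuming the \G-anchored pattern and the "$1 " replacement from the second replaceAll listing on this page; the class name, method name and sample inputs are hypothetical. In this sketch an unmatched quote stops the \G chain, so everything from that quote to the end of the string is left as it is.

```java
class PossessiveRegexDemo {
    // Pattern from the second replaceAll listing: possessive group of
    // non-whitespace text, single spaces and complete quoted strings,
    // followed by the whitespace run to collapse.
    static final String FIND =
        "\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)";

    static String collapse(String text) {
        return text.replaceAll(FIND, "$1 ");
    }

    public static void main(String[] args) {
        // Escaped quote inside a quoted string: the quoted text is preserved.
        System.out.println(collapse("x   \"a \\\"  b\"   y")); // x "a \"  b" y
        // Unmatched quote: text from the quote onwards is left untouched.
        System.out.println(collapse("a   \"b   c"));            // a "b   c
    }
}
```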

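Optimizations
Several optimizations to reduce the number of steps:
Example 1 - for the string Word1  Word2 (two spaces between the words)

[...] takes [...]
This version takes only [...]

Example 2 - for the string "Example"  AnotherWord (two spaces between the words)

[...] takes [...]
This version takes only [...]

Example 3 - for a file from WordPress

[...] causes an error
This version takes only [...]

Regex

\G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)

Replacement
"$1 " (yes, there is a space at the end)

Code and human-readable breakdown: see the replaceAll listing that uses "$1 " at the end of this page.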
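This is a common problem that you cannot solve with lookahead. The only solution is to match the quoted parts so that you can move past them.

What about " "? Which ones should be replaced and which not? I don't think a regex can do this; it does not look like a regular language.

@tobias_k It really does not need to be that robust, but in your example the double quotes are unbalanced. I am assuming there will be no nesting and that quotes will be balanced.

"I am assuming there will be no nesting and that quotes will be balanced" is not what you show in your last example, a b "c d' e, where " and ' have no partner. And what about escapes? Can your input contain escaped quotes, as in a "b\"c d" e?

FYI, this ([^\s"'\\]+)* serves no purpose. Also, this regex only matches quotes when they sit right next to whitespace, which means it will match the part highlighted here: " "asdf. Nice unrolled loop, though.

Good point @sln, I can confirm that this does require a space after the closing quote, so ""some text does not match.

@sln The element ([^\s"'\\]+)* is actually a performance element; in my quick test suite it halved the processing time for a 40MB file - migh
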
 \G                            # Must match where last match left off
                               # (This will stop the match if there is a quote unbalance)
 (                             # (1 start), quotes or non-whitespace 
      (?>                           # Atomic cluster to stop backtracking if quote unbalance
           "
           (?: \\ [\S\s] | [^"\\] )*     # Double quoted text
           "
        |                              # or,
           '
           (?: \\ [\S\s] | [^'\\] )*     # Single quoted text
           ' 
        |                              # or,
           [^"'\s]+                      # Not quotes nor whitespace
      )*                            # End Atomic cluster, do 0 to many times
 )                             # (1 end)
 \s+                           # The whitespaces outside of quotes

([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)

try {
    String resultString = subjectString.replaceAll("([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)", "$1$2 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used the replacement text
}

// ([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)
// 
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
// 
// Match the regex below and capture its match into backreference number 1 «([^\s"'\\]+)*»
//    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//       You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «*»
//       Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
//    Match any single character NOT present in the list below «[^\s"'\\]+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//       A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//       A single character from the list “"'” «"'»
//       The backslash character «\\»
// Match the regex below and capture its match into backreference number 2 «("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*»
//    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//       You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «*»
//       Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
//    Match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
//       Match the character “"” literally «"»
//       Match any single character NOT present in the list below «[^"\\]*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          The literal character “"” «"»
//          The backslash character «\\»
//       Match the regular expression below «(?:\\.[^"\\]*)*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          Match the backslash character «\\»
//          Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//          Match any single character NOT present in the list below «[^"\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “"” «"»
//             The backslash character «\\»
//       Match the character “"” literally «"»
//    Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
//       Match the character “'” literally «'»
//       Match any single character NOT present in the list below «[^'\\]*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          The literal character “'” «'»
//          The backslash character «\\»
//       Match the regular expression below «(?:\\.[^'\\]*)*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          Match the backslash character «\\»
//          Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//          Match any single character NOT present in the list below «[^'\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “'” «'»
//             The backslash character «\\»
//       Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 3 «(\s+)»
//    Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»

\G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)

try {
    String resultString = subjectString.replaceAll("\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)", "$1 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used the replacement text
}

// \G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
// 
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
// 
// Assert position at the end of the previous match (the start of the string for the first attempt) «\G»
// Match the regex below and capture its match into backreference number 1 «((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)»
//    Match the regular expression below «(?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+»
//       Between zero and unlimited times, as many times as possible, without giving back (possessive) «*+»
//       Match this alternative (attempting the next alternative only if this one fails) «[^\s"']+»
//          Match any single character NOT present in the list below «[^\s"']+»
//             Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//             A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//             A single character from the list “"'” «"'»
//       Or match this alternative (attempting the next alternative only if this one fails) « (?!\s)»
//          Match the character “ ” literally « »
//          Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\s)»
//             Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//       Or match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
//          Match the character “"” literally «"»
//          Match any single character NOT present in the list below «[^"\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “"” «"»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^"\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^"\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “"” «"»
//                The backslash character «\\»
//          Match the character “"” literally «"»
//       Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
//          Match the character “'” literally «'»
//          Match any single character NOT present in the list below «[^'\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “'” «'»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^'\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^'\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “'” «'»
//                The backslash character «\\»
//          Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 2 «(\s+)»
//    Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»