Java: how to detect whitespace that is not inside single or double quotes

Tags: java, regex, regex-negation, regex-lookarounds

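I'm trying to create a Java regular expression that replaces every run of whitespace in a string with a single space, unless the whitespace occurs between quotes (single or double).

If I'm only looking for double quotes, I can use a lookahead:

text.replaceAll("\\s+ (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", " ");

If I'm only looking for single quotes, I can use a similar pattern.

The trick is handling both.

My bright idea was to run the double-quote pattern first and then the single-quote pattern but, of course, that ends up replacing all the whitespace regardless of the quotes.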
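To make the failure of the two-pass idea concrete, here is a minimal sketch. It assumes the lookahead from the question written without the literal space after `\\s+`, plus a single-quote twin built the same way; the class and method names are made up for illustration.

```java
final class LookaheadChaining {
    // Whitespace followed by an even number of double quotes up to the end
    // of input, i.e. whitespace that is outside any double-quoted span.
    static final String OUTSIDE_DOUBLE = "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
    // The same idea for single quotes (assumed twin of the pattern above).
    static final String OUTSIDE_SINGLE = "\\s+(?=(?:[^']*'[^']*')*[^']*$)";

    static String doubleOnly(String text) {
        return text.replaceAll(OUTSIDE_DOUBLE, " ");
    }

    static String chained(String text) {
        // The second pass knows nothing about double quotes, so it also
        // collapses the whitespace that the first pass deliberately kept.
        return doubleOnly(text).replaceAll(OUTSIDE_SINGLE, " ");
    }
}
```

With the input `a   b   "c    d"   e`, `doubleOnly` leaves the quoted run alone, while `chained` collapses it as well, which is the problem described above.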

Here are some tests and the expected results:

a   b   c    d   e   -->  a b c d e
a   b   "c    d"   e -->  a b "c    d" e
a   b   'c    d'   e -->  a b 'c    d' e
a   b   "c    d'   e -->  a b "c d' e    (Can't mix and match quotes)
Is there a way to do this with a Java regular expression?

Assume that invalid input is validated separately, so none of the following will occur:

a "b c ' d
a 'b " c' d
a 'b c d

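I suggest normalizing the string's quoting first. Use a regex replace to turn the alternative quote character into the standard one. Say you settle on the double quote ("). You can then split the string on " so that, counting from zero, the odd elements are the quoted content and the even elements are unquoted. Run the whitespace replace on the even elements only and rebuild the string from the modified array.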
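The split-and-rebuild idea could look roughly like this. It is only a sketch under the question's assumptions (balanced, non-nested quotes); the class and method names are invented here. Note the side effect of normalizing: single quotes come back as double quotes, and an apostrophe inside a double-quoted string would throw the even/odd counting off.

```java
final class NormalizeAndSplit {
    static String collapseOutsideQuotes(String text) {
        // Normalize: treat every single quote as a double quote.
        String normalized = text.replace('\'', '"');
        // Limit -1 keeps trailing empty strings so the pieces rejoin exactly.
        String[] parts = normalized.split("\"", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append('"'); // put back the delimiter removed by split
            }
            // Even indexes (0-based) are unquoted text, odd indexes are quoted.
            out.append(i % 2 == 0 ? parts[i].replaceAll("\\s+", " ") : parts[i]);
        }
        return out.toString();
    }
}
```

For example, `a   b   'c    d'   e` becomes `a b "c    d" e`, with the quote style changed by the normalization step.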

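Edit: since @DeanTaylor fixed his regex, I'll fix (amend) this one too,
in case someone decides to use it on unbalanced quotes.

The original test for balanced quotes had an atomic group.
I never added it to the parse logic, so that has now been added. That's all.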


You can match either a quoted string or whitespace in an alternation,
then check which group matched to decide what to replace.
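A rough sketch of that alternation approach, assuming the same quoted-string sub-patterns used elsewhere in this answer; the class name and the `collapse` helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareCollapse {
    // Group 1: a complete double- or single-quoted string (backslash escapes allowed).
    // Otherwise: a run of whitespace outside any complete quoted string.
    private static final Pattern TOKEN = Pattern.compile(
        "(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*')|\\s+");

    static String collapse(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A quoted string is copied through unchanged; whitespace becomes one space.
            String replacement = (m.group(1) != null) ? m.group(1) : " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because an unpaired quote never completes group 1, this version also produces the question's fourth expected result (`a b "c d' e`) for mismatched quotes.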

Or use this regex to capture both and avoid making that decision.

Find:
\G((?>"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^"'\s]+)*)\s+

As a Java string:
"\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+"

Replace:
"$1 " (note the trailing space)
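Applied from Java, the Find/Replace pair might be wired up as below. This is a sketch: the pattern is the one-line form of the commented `\G` regex listed further down this page, the replacement is assumed to be `$1` plus one space, and the class name is made up.

```java
final class AnchoredCollapse {
    // \G chains each match to the previous one; the atomic group swallows
    // quoted strings and unquoted words, and the trailing \s+ is what gets replaced.
    static final String FIND =
        "\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+";

    static String collapse(String text) {
        // $1 restores the text before the whitespace; the literal space replaces the run.
        return text.replaceAll(FIND, "$1 ");
    }
}
```

Since `\G` only matches where the previous match ended, replacement simply stops at the first unpaired quote instead of touching what follows it.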


Note - before using the regex above, you can test the string for balanced quotes.
If the string passes this test, its quotes are balanced:

^(?>(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^"']+)+)$

As a Java string:
"^(?>(?:\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"']+)+)$"
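A balance check of this kind could be run before the replacement, along these lines. The pattern below is an assumption: an atomic-group test built from the same quoted-string pieces as the Find regex. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

final class BalancedQuotes {
    // Assumed pattern: the whole input must consist of complete quoted strings
    // and runs of non-quote characters; the atomic group prevents backtracking
    // into a partial match once an unpaired quote is reached.
    private static final Pattern BALANCED = Pattern.compile(
        "^(?>(?:\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"']+)+)$");

    static boolean isBalanced(String text) {
        return BALANCED.matcher(text).matches();
    }
}
```

Inputs such as `a 'b c d` or `a "b c ' d` from the question's "will not occur" list are rejected, so the caller can skip the replacement for them.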


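Update - testing @DeanTaylor's new answer

Example 1 - for the string
Word1  Word2
(two spaces between the words)
  • This version takes about 27 steps
  • @DeanTaylor's version takes about 29 steps
Example 2 - for the string
"example"  another word
(two spaces between the words)
  • This version takes about 51 steps
  • @DeanTaylor's version takes 36 steps (presumably because of the unrolled loop)
Example 3 - for a WordPress file
  • This version takes about 315647 steps
  • @DeanTaylor's version takes 122701 steps (Dean's version does not process single spaces)
A further example 3 test was to generate a permalink on regex101.com.
The page became unresponsive, which shows what a piece of junk it really is.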

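EDIT - Note - this answer has a bug/flaw: it requires whitespace between the closing quote and the character that follows it in order to match the quoted string correctly. As a result, this answer does not handle some text correctly.

It may have more flaws, but that is one of them.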
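The flaw is easy to reproduce. The sketch below uses the regex and replacement from this answer's Java listing (the `$1$2 ` version); the class name and the sample inputs are made up for the demonstration.

```java
final class TrailingSpaceFlaw {
    // Regex copied from the Java listing for this (first) version of the answer.
    static final String FIRST_VERSION =
        "([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)";

    static String collapse(String text) {
        return text.replaceAll(FIRST_VERSION, "$1$2 ");
    }
}
```

With whitespace after the closing quote, `a   b   "c    d"   e` is handled as expected. When the closing quote is followed directly by another character, as in `"a  b"c`, the quoted string is never matched as a unit and the two spaces inside the quotes are collapsed to one.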

EDIT - Alternate answer: I have added another answer that does not have this flaw.

Leaving this one here for posterity.

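Supports
This version supports escaped quotes via \" and \' as well as multi-line quoted strings.

Regex
([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)

Replace
"$1$2 " (yes, there is a space at the end)

Code and readable breakdown
Listed with the other code at the end of this page.

Supports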
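  • Escaped quotes via \" and \', and multi-line quoted strings
  • Unmatched quotes, where the quoted string is terminated by the end of the input
  • Additional optimizations for large files
Optimizations
Several optimizations reduce the number of steps: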

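Example 1 - for the string
Word1  Word2
(two spaces between the words)
  • Accepted answer: … steps
  • This version: only … steps
Example 2 - for the string
"example"  another word
(two spaces between the words)
  • Accepted answer: … steps
  • This version: only … steps
Example 3 - for a WordPress file
  • Accepted answer: causes an error
  • This version: only … steps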
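Regex
\G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)

Replace
"$1 " (yes, there is a space at the end)

Code and readable breakdown
Listed with the other code at the end of this page.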
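For a quick check of this second version against the question's test cases, something like the following could be used. The regex and replacement are taken from this answer's Java listing; the class name is invented.

```java
final class PossessiveCollapse {
    // \G plus a possessive group: words, lone spaces, and complete quoted
    // strings are consumed without backtracking; group 2 is the whitespace run.
    static final String SECOND_VERSION =
        "\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)";

    static String collapse(String text) {
        return text.replaceAll(SECOND_VERSION, "$1 ");
    }
}
```

The ` (?!\s)` alternative lets a single space pass through as part of group 1, so only runs of two or more whitespace characters (or a lone tab or newline) trigger a replacement. A closing quote followed directly by a character, as in `"a  b"c   d`, no longer breaks the quoted string.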
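This is a common problem that you can't solve with a lookahead. The only solution is to match the quoted parts so that you can step past them.

What about …? Which whitespace gets replaced and which doesn't? I don't think a regex can do this; it doesn't look like a regular language.

@tobias_k It really doesn't need to be that robust, but in your example the double quotes are unbalanced. I'm assuming there will be no nesting and that the quotes will be balanced.

"I'm assuming there will be no nesting and that the quotes will be balanced" - that is not what you show in your last example, a b "c d' e, where the quotes are not paired.

And what about escaping? Can your input contain escaped quotes, as in a "b\"c d" e?

FYI, this ([^\s"'\\]+)* serves no purpose. Also, this regex only matches a quoted string when the closing quote is immediately followed by whitespace, which means it will not treat the quotes in something like ""asdf as a quoted string. Nice unrolled loop, though.

Good point @sln. I can confirm that this does require whitespace after the closing quote, so ""some text does not match.

@sln The element ([^\s"'\\]+)* is actually a performance element; in my quick test suite it halved the processing time for a 40MB file - migh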
Expanded, commented form of the \G Find regex with the atomic group (free-spacing mode):

 \G                            # Must match where last match left off
                               # (This will stop the match if there is a quote unbalance)
 (                             # (1 start), quotes or non-whitespace 
      (?>                           # Atomic cluster to stop backtracking if quote unbalance
           "
           (?: \\ [\S\s] | [^"\\] )*     # Double quoted text
           "
        |                              # or,
           '
           (?: \\ [\S\s] | [^'\\] )*     # Single quoted text
           ' 
        |                              # or,
           [^"'\s]+                      # Not quotes nor whitespace
      )*                            # End Atomic cluster, do 0 to many times
 )                             # (1 end)
 \s+                           # The whitespaces outside of quotes
Java code, first version (the one that needs whitespace after the closing quote):

try {
    // $1 = unquoted text, $2 = quoted string; the whitespace run (group 3) becomes one space
    String resultString = subjectString.replaceAll("([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)", "$1$2 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}

Readable breakdown, first version:
// ([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)
// 
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
// 
// Match the regex below and capture its match into backreference number 1 «([^\s"'\\]+)*»
//    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//       You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «*»
//       Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
//    Match any single character NOT present in the list below «[^\s"'\\]+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//       A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//       A single character from the list “"'” «"'»
//       The backslash character «\\»
// Match the regex below and capture its match into backreference number 2 «("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*»
//    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//       You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «*»
//       Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
//    Match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
//       Match the character “"” literally «"»
//       Match any single character NOT present in the list below «[^"\\]*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          The literal character “"” «"»
//          The backslash character «\\»
//       Match the regular expression below «(?:\\.[^"\\]*)*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          Match the backslash character «\\»
//          Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//          Match any single character NOT present in the list below «[^"\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “"” «"»
//             The backslash character «\\»
//       Match the character “"” literally «"»
//    Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
//       Match the character “'” literally «'»
//       Match any single character NOT present in the list below «[^'\\]*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          The literal character “'” «'»
//          The backslash character «\\»
//       Match the regular expression below «(?:\\.[^'\\]*)*»
//          Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//          Match the backslash character «\\»
//          Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//          Match any single character NOT present in the list below «[^'\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “'” «'»
//             The backslash character «\\»
//       Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 3 «(\s+)»
//    Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Java code, second version (alternate answer):

try {
    // $1 = everything up to the whitespace run; the run itself (group 2) becomes one space
    String resultString = subjectString.replaceAll("\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)", "$1 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}

Readable breakdown, second version:
// \G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
// 
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
// 
// Assert position at the end of the previous match (the start of the string for the first attempt) «\G»
// Match the regex below and capture its match into backreference number 1 «((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)»
//    Match the regular expression below «(?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+»
//       Between zero and unlimited times, as many times as possible, without giving back (possessive) «*+»
//       Match this alternative (attempting the next alternative only if this one fails) «[^\s"']+»
//          Match any single character NOT present in the list below «[^\s"']+»
//             Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//             A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//             A single character from the list “"'” «"'»
//       Or match this alternative (attempting the next alternative only if this one fails) « (?!\s)»
//          Match the character “ ” literally « »
//          Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\s)»
//             Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//       Or match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
//          Match the character “"” literally «"»
//          Match any single character NOT present in the list below «[^"\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “"” «"»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^"\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^"\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “"” «"»
//                The backslash character «\\»
//          Match the character “"” literally «"»
//       Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
//          Match the character “'” literally «'»
//          Match any single character NOT present in the list below «[^'\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “'” «'»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^'\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^'\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “'” «'»
//                The backslash character «\\»
//          Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 2 «(\s+)»
//    Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»