Java: how to detect whitespace that is outside single or double quotes
I am trying to create a Java regular expression that replaces every run of whitespace in a string with a single space, unless the whitespace appears between quotes (single or double). If I were only looking for double quotes, I could use a lookahead:
text.replaceAll("\\s+ (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", " ");
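If I were only looking for single quotes, I could use a similar pattern.
The trick is handling both.
I had the bright idea of running the double-quote pattern first and then the single-quote pattern, but of course that ends up replacing all whitespace regardless of the quotes.
Here are some tests and the expected results: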
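To make that failure concrete, here is a minimal sketch of the two-pass idea. The single-quote pattern is an assumption about what "a similar pattern" would look like, and the class and method names are made up for illustration.

```java
final class TwoPassDemo {
    // Pattern from the question: a whitespace run (ending in a space) that is
    // followed by an even number of double quotes, i.e. outside double quotes.
    static final String OUTSIDE_DOUBLE = "\\s+ (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Assumed single-quote analogue of the pattern above.
    static final String OUTSIDE_SINGLE = "\\s+ (?=(?:[^']*'[^']*')*[^']*$)";

    // Runs the double-quote pass, then the single-quote pass.
    static String twoPass(String text) {
        return text.replaceAll(OUTSIDE_DOUBLE, " ").replaceAll(OUTSIDE_SINGLE, " ");
    }

    public static void main(String[] args) {
        System.out.println(twoPass("a  b  'c   d'  e"));   // a b 'c d' e
        System.out.println(twoPass("a  b  \"c   d\"  e")); // a b "c d" e
    }
}
```

In both calls the spacing inside the quotes is lost, because each pass treats the other kind of quote as ordinary text.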
a b c d e --> a b c d e
a b "c d" e --> a b "c d" e
a b 'c d' e --> a b 'c d' e
a b "c d' e --> a b "c d' e (Can't mix and match quotes)
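Is there a way to do this with a Java regular expression?
Assume that invalid input has been validated separately, so none of the following will occur: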
a "b c ' d
a 'b " c' d
a 'b c d
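I suggest normalizing the string delimiters. Use a regex replace to substitute one standard delimiter for the other. Say you decide on the double quote ("). You can then split the string on "; all the odd elements are quoted content and the even elements are unquoted. Run the regex replace only on the even elements and rebuild the string from the modified array.

Edit: since @DeanTaylor fixed his regex, I will fix (modify) this one too, in case someone decides to use it on unbalanced quotes. The original test for balanced quotes had an atomic group. I never added that to the parsing logic, so I have added it now. That is all.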
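Going back to the split-based suggestion above, it can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, quotes are assumed to be balanced, and normalizing means single quotes come back as double quotes in the output.

```java
final class SplitCollapse {
    // Normalizes ' to ", splits on ", and collapses whitespace only in the
    // pieces that sit outside quotes (indices 0, 2, 4, ... of the array).
    static String collapse(String text) {
        String normalized = text.replace('\'', '"');
        // limit -1 keeps trailing empty pieces, so the string can be rebuilt exactly
        String[] parts = normalized.split("\"", -1);
        for (int i = 0; i < parts.length; i += 2) {
            parts[i] = parts[i].replaceAll("\\s+", " ");
        }
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        System.out.println(collapse("a  b \"c   d\"  e")); // a b "c   d" e
        System.out.println(collapse("a  'c   d'  e"));     // a "c   d" e
    }
}
```

The second call shows the cost of normalizing: the whitespace inside the quotes survives, but the original single quotes do not.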
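You can match either quotes or whitespace in an alternation, then check which group matched to decide what to replace.
Or use this regex to consume both and avoid having to make that decision.
Find: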
\G((?>"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^"'\s]+)*)\s+
"\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+"
Replace: "$1 " (group 1 followed by a single space)
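As a quick check, the find/replace pair can be exercised from Java like this. It is a minimal sketch: the pattern is the single-line form of the commented regex shown further down, and the class and method names are only illustrative.

```java
import java.util.regex.Pattern;

final class GAnchoredCollapse {
    // \G chains each match to the end of the previous one; group 1 consumes
    // quoted strings and non-whitespace text, then \s+ takes the whitespace run.
    private static final Pattern FIND = Pattern.compile(
        "\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+");

    static String collapse(String text) {
        return FIND.matcher(text).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("a  b   \"c   d\"  e")); // a b "c   d" e
        System.out.println(collapse("a  b  'c   d'  e"));    // a b 'c   d' e
        // Unbalanced quote: matching stops at the stray quote and the rest is left alone.
        System.out.println(collapse("a  \"b   c"));          // a "b   c
    }
}
```

Because of \G, nothing after the first position that fails to match is touched, which is what stops the replacement at an unbalanced quote.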
Note: before using the regex above, you can test the string for balanced quotes.
This regex tests the string; if it matches, the quotes are balanced.
^(?>(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*')|[^"']+)+$
"^(?>(?:\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*')|[^\"']+)+$"
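Update: testing @DeanTaylor's new answer
Example 1: for the string Word1  Word2
(two spaces between the words)
- This version takes about 27 steps
- @DeanTaylor's version takes about 29 steps
Example 2: for the string "example"  another word (two spaces between the words)
- This version takes about 51 steps
- @DeanTaylor's version takes 36 steps (presumably because of the unrolled loop)
Example 3: for a file from WordPress
- This version takes about 315,647 steps
- @DeanTaylor's version takes 122,701 steps (Dean's version does not handle single spaces)
Further Example 3 tests would mean generating a permalink on regex101.com.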
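The page becomes unresponsive, which shows that it really is a piece of junk.

EDIT - Note: this answer has a bug/flaw.
It requires whitespace between the closing quote (" or ') and the character that follows it in order to match the quoted string correctly. As a result, this answer cannot handle some text correctly.
It may have more flaws, but that is one of them.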
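The flaw is easy to reproduce. The sketch below uses the regex and replacement from this answer's Java snippet; the class name and the test strings are made up for illustration.

```java
import java.util.regex.Pattern;

final class FirstRegexFlaw {
    // Regex from this (flawed) answer, unchanged.
    private static final Pattern FIND = Pattern.compile(
        "([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)");

    static String collapse(String text) {
        return FIND.matcher(text).replaceAll("$1$2 ");
    }

    public static void main(String[] args) {
        // Whitespace follows the closing quote: the quoted text is preserved.
        System.out.println(collapse("\"a  b\" c  d")); // "a  b" c d
        // No whitespace after the closing quote: the quoted text gets collapsed too.
        System.out.println(collapse("\"a  b\"c  d"));  // "a b"c d
    }
}
```

In the second call the engine cannot finish a match that starts at the opening quote, so it restarts inside the quoted text and treats it as unquoted.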
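EDIT - alternative answer
I have added another answer that does not have this flaw.
I am leaving this one here for posterity.
Supports
This one supports escaped quotes via \" and \', and multi-line quotes.
Regex
Replacement
"$1$2 "
(yes, there is a space at the end)
Code
Readable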
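Supports
- Escaped quotes via \" and \', and multi-line quotes
- Unmatched quotes, where the quoted section is terminated by the end of the string
- Additional optimizations for large files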
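Optimizations
Several optimizations reduce the number of steps:
Example 1: for the string Word1  Word2
(two spaces between the words)
- The accepted answer takes [...] steps
- This version takes only [...] steps
Example 2: for the string "example"  another word (two spaces between the words)
- The accepted answer takes [...] steps
- This version takes only [...] steps
Example 3: for a file from WordPress
- The accepted answer results in an error
- This version takes only [...] steps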
Regex
Replacement
"$1 "
(yes, there is a space at the end)
Code
Readable
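This is a common problem that you cannot solve with lookahead. The only solution is to match the quoted parts in order to get past them.

And what about " "? Which ones get replaced and which don't? I don't think a regex can do this; it doesn't seem to be a regular language.

@tobias_k It really doesn't need to be that robust, but in your example the double quotes are unbalanced. I'm assuming there will be no nesting and that quotes will be balanced.

"I'm assuming there will be no nesting and that quotes will be balanced": that is not what you show in your last example, a b "c d' e, where " and ' are unpaired. And what about escapes? Can your input contain escaped quotes, such as a "b \" c d" e?

FYI, this ([^\s"'\\]+)* serves no purpose. Also, this regex only matches quotes when they are right next to whitespace, which means it will match the part highlighted here: " "asdf. Nice unrolled loop, though.

Good point @sln, I can confirm that this does require a space after the closing quote, so ""some text does not match.

@sln the element ([^\s"'\\]+)* is actually a performance element; in my quick test suite it halved the processing time for a 40MB file - migh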
\G # Must match where last match left off
# (This will stop the match if there is a quote unbalance)
( # (1 start), quotes or non-whitespace
(?> # Atomic cluster to stop backtracking if quote unbalance
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| # or,
'
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| # or,
[^"'\s]+ # Not quotes nor whitespace
)* # End Atomic cluster, do 0 to many times
) # (1 end)
\s+ # The whitespaces outside of quotes
([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)
// Requires: import java.util.regex.PatternSyntaxException;
try {
    String resultString = subjectString.replaceAll("([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)", "$1$2 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}
// ([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)
//
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
//
// Match the regex below and capture its match into backreference number 1 «([^\s"'\\]+)*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
// Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
// Match any single character NOT present in the list below «[^\s"'\\]+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
// A single character from the list “"'” «"'»
// The backslash character «\\»
// Match the regex below and capture its match into backreference number 2 «("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
// Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
// Match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
// Match the character “"” literally «"»
// Match any single character NOT present in the list below «[^"\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “"” «"»
// The backslash character «\\»
// Match the regular expression below «(?:\\.[^"\\]*)*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the backslash character «\\»
// Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
// Match any single character NOT present in the list below «[^"\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “"” «"»
// The backslash character «\\»
// Match the character “"” literally «"»
// Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
// Match the character “'” literally «'»
// Match any single character NOT present in the list below «[^'\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “'” «'»
// The backslash character «\\»
// Match the regular expression below «(?:\\.[^'\\]*)*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the backslash character «\\»
// Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
// Match any single character NOT present in the list below «[^'\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “'” «'»
// The backslash character «\\»
// Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 3 «(\s+)»
// Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
\G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
// Requires: import java.util.regex.PatternSyntaxException;
try {
    String resultString = subjectString.replaceAll("\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)", "$1 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}
// \G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
//
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
//
// Assert position at the end of the previous match (the start of the string for the first attempt) «\G»
// Match the regex below and capture its match into backreference number 1 «((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)»
// Match the regular expression below «(?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+»
// Between zero and unlimited times, as many times as possible, without giving back (possessive) «*+»
// Match this alternative (attempting the next alternative only if this one fails) «[^\s"']+»
// Match any single character NOT present in the list below «[^\s"']+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
// A single character from the list “"'” «"'»
// Or match this alternative (attempting the next alternative only if this one fails) « (?!\s)»
// Match the character “ ” literally « »
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\s)»
// Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
// Or match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
// Match the character “"” literally «"»
// Match any single character NOT present in the list below «[^"\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “"” «"»
// The backslash character «\\»
// Match the regular expression below «(?:\\.[^"\\]*)*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the backslash character «\\»
// Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
// Match any single character NOT present in the list below «[^"\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “"” «"»
// The backslash character «\\»
// Match the character “"” literally «"»
// Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
// Match the character “'” literally «'»
// Match any single character NOT present in the list below «[^'\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “'” «'»
// The backslash character «\\»
// Match the regular expression below «(?:\\.[^'\\]*)*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the backslash character «\\»
// Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
// Match any single character NOT present in the list below «[^'\\]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// The literal character “'” «'»
// The backslash character «\\»
// Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 2 «(\s+)»
// Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
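For completeness, here is a small self-contained check of the \G-anchored, possessive pattern explained above. It is a sketch only: the class name and the sample inputs are illustrative, and the regex and replacement are copied from the snippet.

```java
import java.util.regex.Pattern;

final class PossessiveCollapse {
    // Group 1 possessively consumes words, single spaces, and quoted strings;
    // group 2 is the whitespace run that gets replaced by one space.
    private static final Pattern FIND = Pattern.compile(
        "\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)");

    static String collapse(String text) {
        return FIND.matcher(text).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("a  b   \"c   d\"  e")); // a b "c   d" e
        // No whitespace after the closing quote: the quoted text is still preserved.
        System.out.println(collapse("\"a  b\"c  d"));        // "a  b"c d
        // An escaped quote inside a quoted string does not end the string.
        System.out.println(collapse("\"a \\\"  b\"  c"));    // "a \"  b" c
    }
}
```

The ` (?!\s)` alternative lets a lone space be consumed as part of group 1, so only runs that actually need changing cost a separate match.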