Java: regex to split camelCase or TitleCase (advanced)

Tags: java, regex, camelcasing, title-case

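I found a brilliant regex for extracting the parts of a camelCase or TitleCase expression:

    (?<!^)(?=[A-Z])

It works as expected:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
For example, in Java: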

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}   
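
To see what the longer expression changes, here is a small comparison sketch. It runs the question's regex and the regex from this answer over the example words used throughout the thread. The class name `SplitComparison` and the helper `split` are made up for illustration and are not part of the original answer.

```java
import java.util.Arrays;

final class SplitComparison {
    // Regex from the question: split before every uppercase letter except at the start.
    static final String ORIGINAL = "(?<!^)(?=[A-Z])";
    // Regex from this answer: additionally keeps runs of uppercase letters together.
    static final String ANSWER = "(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])";

    // Hypothetical helper: split one word with the given regex.
    static String[] split(String regex, String word) {
        return word.split(regex);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String word : samples) {
            System.out.println(word
                    + " | original: " + Arrays.toString(split(ORIGINAL, word))
                    + " | answer: " + Arrays.toString(split(ANSWER, word)));
        }
    }
}
```

With the question's regex, VALUE falls apart into single letters and eclipseRCPExt becomes eclipse, R, C, P, Ext; the answer's regex keeps VALUE whole and yields eclipse, RCP, Ext.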

You seem to be making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:

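    (?<=[a-z])(?=[A-Z])

Here is how this regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

The only difference from the desired output is eclipseRCPExt, which I would argue is split correctly here.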

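Addendum: improved version

Note: this answer recently got an upvote, and I realized that there is a better way.

By adding a second alternative to the regex above, all of the OP's test cases are split correctly:

    (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit 2013-08-24: added the improved version to handle the RCPExt -> RCP / Ext case.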
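
A minimal Java sketch of the two-alternative split described in this answer. The regex is the one discussed here (lowercase followed by uppercase, or the end of an uppercase run before a capitalized word); the class name `CamelCaseSplitter` and its `split` method are illustrative assumptions.

```java
import java.util.Arrays;

final class CamelCaseSplitter {
    // First alternative: lowercase followed by uppercase (camel|Value).
    // Second alternative: end of an uppercase run before a capitalized word (RCP|Ext).
    private static final String BOUNDARY = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Hypothetical helper: returns the parts of a camelCase or TitleCase identifier.
    static String[] split(String identifier) {
        return identifier.split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("eclipseRCPExt"))); // [eclipse, RCP, Ext]
        System.out.println(Arrays.toString(split("VALUE")));         // [VALUE]
    }
}
```

Both alternatives are zero-width, so no characters are consumed and the parts concatenate back to the original string.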

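Another solution would be to use a dedicated method.

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own, which I've tested and which seems to do exactly what you're looking for: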

    ((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))
    
Here is an example of using it:

    ; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
    ;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
    ;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
    ;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
    newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
    newString := Trim(newString)
    
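Here I am separating each word with a space, so here are some examples of how strings are transformed:

  • ThisIsATitleCaseString => This Is A Title Case String
  • andThisOneIsCamelCase => and This One Is Camel Case

The solution above does what the original post asks for, but I also needed a regex to handle camel and Pascal case strings that include numbers, so I came up with this variation: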

    ((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))
    
And an example of using it:

    ; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
    ;   (^[a-z]+)                               Match against any lower-case letters at the start of the command.
    ;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
    ;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
    ;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
    newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
    newString := Trim(newString)
    
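And here are some examples of how strings containing numbers are transformed by this regex:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too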
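
Since the question is about Java, the same replace-and-trim idea can be sketched with `String.replaceAll`. This is a port made for illustration, not code from the answer; the class name `WordSpacer` and the method `spaceOut` are assumptions.

```java
final class WordSpacer {
    // The numeric variant of the regex, written as a Java string literal.
    private static final String WORDS =
            "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))";

    // Appends a space after every matched word, then trims the trailing space.
    static String spaceOut(String camelOrPascal) {
        return camelOrPascal.replaceAll(WORDS, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceOut("my2Variables"));  // my 2 Variables
        System.out.println(spaceOut("eclipseRCPExt")); // eclipse RCP Ext
    }
}
```

Calling `spaceOut(...).split(" ")` afterwards would give the words as an array, at the cost of a second pass over the string.
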
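To handle more letters than just A-Z, the ASCII ranges can be replaced with Unicode character classes such as \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter).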
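
One way this could look in Java is sketched below. The expression is an assumption: it is the two-alternative lookaround split from earlier in the thread with `[a-z]` and `[A-Z]` swapped for the Unicode categories `\p{Ll}` and `\p{Lu}`, and it is not confirmed to be the exact expression this answer had in mind. The class name `UnicodeCamelSplitter` is made up.

```java
import java.util.Arrays;

final class UnicodeCamelSplitter {
    // \p{Ll} matches any lowercase letter, \p{Lu} any uppercase letter (Unicode general categories).
    private static final String BOUNDARY =
            "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}\\p{Ll})";

    // Hypothetical helper: splits identifiers that may contain non-ASCII letters.
    static String[] split(String identifier) {
        return identifier.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // A German identifier with umlauts, written with escapes so the file encoding does not matter.
        System.out.println(Arrays.toString(split("gr\u00f6\u00dfe\u00c4nderung")));
        System.out.println(Arrays.toString(split("eclipseRCPExt")));
    }
}
```

The first call splits between the lowercase e and the uppercase A-umlaut, which the ASCII-only `[a-z]`/`[A-Z]` version would miss.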

For Java, you can use the following expression (backslashes are doubled, as in a Java string literal):

    (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
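
A short sketch of how this expression might be used with `String.split`. The regex is copied unchanged from above; the class name `DigitAwareSplitter` and the sample identifiers are assumptions chosen for illustration.

```java
import java.util.Arrays;

final class DigitAwareSplitter {
    // The expression above, unchanged; "\\d" in Java source is the regex \d.
    private static final String BOUNDARY =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])"
            + "|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)";

    // Hypothetical helper: splits on case changes and on digit/non-digit transitions.
    static String[] split(String identifier) {
        return identifier.split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("my2Variables")));  // [my, 2, Variables]
        System.out.println(Arrays.toString(split("eclipseRCPExt"))); // [eclipse, RCP, Ext]
    }
}
```

Note that the third alternative, `(?=[A-Z][a-z])`, also matches at index 0 of a TitleCase word. On Java 8 and later, `String.split` does not produce an empty leading element for such a zero-width match; older versions did.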
    

Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):

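    import java.util.LinkedList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String test = "_eclipse福福RCPExt";
    // Group 1 is either a (capitalized) lowercase word or an acronym with optional trailing digits
    Pattern componentPattern = Pattern.compile("_?(\\p{Upper}?\\p{Lower}++|(?:\\p{Upper}(?!\\p{Lower}))+\\p{Digit}*)", Pattern.COMMENTS);
    Matcher componentMatcher = componentPattern.matcher(test);
    List<String> components = new LinkedList<>();
    int endOfLastMatch = 0;
    while (componentMatcher.find()) {
        // matches should be consecutive
        if (componentMatcher.start() != endOfLastMatch) {
            // do something horrible if you don't want garbage in between;
            // we're lenient though: any Chinese characters are lucky and get through as one group
            String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
            components.add(startOrInBetween);
        }
        components.add(componentMatcher.group(1));
        endOfLastMatch = componentMatcher.end();
    }
    if (endOfLastMatch != test.length()) {
        // keep any unmatched tail; start() must not be called here because find() has already failed
        String end = test.substring(endOfLastMatch);
        components.add(end);
    }
    System.out.println(components);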
    
This outputs [eclipse, 福福, RCP, Ext]. Converting the list to an array is of course simple.

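Brief

Both top answers here rely on positive lookbehinds, which are not supported by all regex flavors. The regex below captures both PascalCase and camelCase and can be used in multiple languages.

Note: I do realize this question is about Java; however, I also see this post mentioned in several other questions tagged for different languages, as well as some comments on this question asking for the same.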

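Code

    ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Explanation

  • Match one or more uppercase alpha characters: [A-Z]+
  • Or match zero or one uppercase alpha character, [A-Z]?, followed by one or more lowercase alpha characters, [a-z]+
  • Ensure that what follows is an uppercase alpha character, [A-Z], or a word boundary, \b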
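
Because this regex matches the words themselves rather than the gaps between them, in Java it pairs with `Matcher.find` instead of `String.split`. The sketch below assumes that usage; the class name `WordMatcher` and the method `words` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordMatcher {
    // The lookbehind-free regex from this answer; "\\b" in Java source is the regex \b.
    private static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    // Hypothetical helper: collects every matched word in order of appearance.
    static List<String> words(String identifier) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(identifier);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
    }
}
```

Characters that the pattern does not match, such as digits or underscores, are simply skipped by this loop rather than returned.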

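I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels above works with the Microsoft flavor of regex.

I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters:

    ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b)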

This is able to split strings such as:

    DrivingB2BTradeIn2019Onwards

into

    Driving B2B Trade In 2019 Onwards
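
The numeric variant can be tried in Java with a replace-and-trim sketch like the one below. This is an illustrative port, not the code from this answer; the class name `NumericWordSpacer` and the method `spaceOut` are assumptions.

```java
final class NumericWordSpacer {
    // Digits are treated like uppercase letters, so "B2B" and "2019" stay together.
    private static final String WORD = "([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\\b)";

    // Appends a space after every matched word, then trims the trailing space.
    static String spaceOut(String identifier) {
        return identifier.replaceAll(WORD, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceOut("DrivingB2BTradeIn2019Onwards")); // Driving B2B Trade In 2019 Onwards
    }
}
```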

You can use StringUtils.splitByCharacterTypeCamelCase(String) from Apache Commons Lang.

A JavaScript (TypeScript) solution:

    /**
     * howToDoThis ===> ["", "how", "To", "Do", "This"]
     * @param word word to be split
     */
    export const splitCamelCaseWords = (word: string) => {
        if (typeof word !== 'string') return [];
        return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
    };
    

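It seems you may need a conditional modifier on ^, plus a conditional case for another capital letter in the negative lookbehind. I haven't tested this for sure, but I think it would be your best bet for fixing the problem.

If someone is checking your input.

In this case I need RCP and Ext to be separated, because I can. In this case I would prefer ECLIPSE_RCP_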
Results for ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Sample input:

    eclipseRCPExt
    SomethingIsWrittenHere
    TEXTIsWrittenHERE
    VALUE
    loremIpsum

Sample output:

    eclipse
    RCP
    Ext

    Something
    Is
    Written
    Here

    TEXT
    Is
    Written
    HERE

    VALUE

    lorem
    Ipsum
    