How to split text by numbers and word groups in Java

java regex

Suppose I have a string that contains some comma-separated numbers and some text:

  my_string =  "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB"

I want to extract them into an array, split into numbers and groups of words:

 resultArray = {"2", "Marine Cargo", "14,642", "10,528", "16,016",
                "more text", "8,609", "argA", "2,106", "argB"};

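Note 0: there may be several spaces between entries; these should be ignored.

Note 1: "Marine Cargo" and "more text" are not split into separate strings, because each is a group of words with no number in between.

argA and argB, on the other hand, are separate entries because there is a number between them.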

If whitespace is your only problem: String.split takes a regular expression as its argument, so you can do this:

    my_list = Arrays.asList(my_string.split("\\s+"));

However, this does not solve the whole problem, as pointed out in the comments.
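To illustrate why splitting on whitespace alone is not enough, here is a minimal sketch (the class and method names are made up for this example). It yields twelve tokens, with Marine and Cargo as separate entries instead of one group:

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceSplitDemo {
    // Splits on runs of whitespace only; word groups are NOT kept together.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String myString = "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB";
        // Prints: [2, Marine, Cargo, 14,642, 10,528, 16,016, more, text, 8,609, argA, 2,106, argB]
        System.out.println(splitOnWhitespace(myString));
    }
}
```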

You can do it like this:

    List<String> strings = new ArrayList<>();
    String prev = null;
    for (String w: my_string.split("\\s+")) {
        if (w.matches("\\d+(?:,\\d+)?")) {
            if (prev != null) {
                strings.add(prev);
                prev = null;
            }
            strings.add(w);
        } else if (prev == null) {
            prev = w;
        } else {
            prev += " " + w;
        }
    }
    if (prev != null) {
        strings.add(prev);
    }
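For reference, the same token-grouping idea as a self-contained sketch. The class name TokenGrouper and the method group are hypothetical, and the number test is loosened to \d+(?:,\d+)* on the assumption that values with several comma groups, such as 1,234,567, should also count as numbers:

```java
import java.util.ArrayList;
import java.util.List;

class TokenGrouper {
    // Numeric tokens stand alone; consecutive non-numeric tokens are joined with one space.
    static List<String> group(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder words = new StringBuilder();
        for (String w : text.trim().split("\\s+")) {
            // Assumption: digits with any number of ",digits" groups are numbers.
            if (w.matches("\\d+(?:,\\d+)*")) {
                if (words.length() > 0) {
                    out.add(words.toString()); // flush the pending word group
                    words.setLength(0);
                }
                out.add(w);
            } else {
                if (words.length() > 0) {
                    words.append(' ');
                }
                words.append(w);
            }
        }
        if (words.length() > 0) {
            out.add(words.toString()); // flush a trailing word group
        }
        return out;
    }

    public static void main(String[] args) {
        String myString = "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB";
        // Prints: [2, Marine Cargo, 14,642, 10,528, 16,016, more text, 8,609, argA, 2,106, argB]
        System.out.println(group(myString));
    }
}
```

Since the grouping is decided token by token, word groups of any length stay together without changing the regex.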

You can try splitting with this regular expression:

([\d,]+|[a-zA-Z]+ *[a-zA-Z]*) //note the spacing between + and *.

[0-9,]+ matches one or more digits and commas.
[a-zA-Z]+ *[a-zA-Z]* matches a word, followed by spaces (if any) and a second word (if any).

String regEx = "[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*";

You use it like this:

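import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitExample {
  public static void main(String[] args) {
    String input = "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB";
    System.out.println("Return Value :");

    // Either a run of digits and commas, or a word optionally followed by spaces and a second word
    Pattern pattern = Pattern.compile("[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*");

    List<String> result = new ArrayList<>();
    Matcher m = pattern.matcher(input);
    while (m.find()) {
      // The > and < markers make the exact extent of each match visible
      System.out.println(">" + m.group(0) + "<");
      result.add(m.group(0));
    }
  }
}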
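One detail worth noting with this pattern: because the " *" part can match even when no second word follows, a single word directly followed by a number (argA in the sample) is matched together with its trailing space. A minimal sketch that collects the matches and trims each one; the class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherSplitDemo {
    private static final Pattern TOKEN = Pattern.compile("[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*");

    // Collects every match and trims it, since a lone word can carry a trailing space.
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB";
        // Prints: [2, Marine Cargo, 14,642, 10,528, 16,016, more text, 8,609, argA, 2,106, argB]
        System.out.println(extract(input));
    }
}
```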

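I like the previous answer and want to add to it. His solution only matches when the non-numeric part consists of one or two words.

If you also want to capture parts made up of three or more words, you have to change the regex slightly to: [\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*

The non-capturing group (?: *[a-zA-Z]) is repeated an unlimited number of times, so it captures each non-numeric part in full, however many words it contains.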
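A short sketch of the extended pattern, written out as [\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*, applied to a three-word group. The class name, method name, and the second test string are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiWordSplitDemo {
    // Each repetition of (?: *[a-zA-Z]) consumes optional spaces plus ONE letter,
    // so a match always ends on a letter and never carries trailing spaces.
    private static final Pattern TOKEN =
            Pattern.compile("[\\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*");

    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [7, North Sea Cargo, 1,234, argA, 42]
        System.out.println(extract("7 North Sea Cargo   1,234 argA 42"));
    }
}
```

Note that runs of several spaces inside a word group are kept as they appear in the input; collapsing them would need an extra replaceAll(" +", " ") step.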

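In "2 Marine Cargo", how do we know that Cargo belongs to Marine?

Is 14,642 a single value or two values, e.g. 14 and 642, in your expected output?

There is not enough information to really solve this. How do we know that Marine Cargo should be a single element? Shouldn't it be joined with the 2, giving "2 Marine Cargo"? Then, can we assume that each element is 20-25 characters long with some padding added?

@Stephanhogenya 14,642 is a single value.

I rephrased your question and title. Your previous title was too general and the question too vague, which led to a wave of downvotes. Hope that helps.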

1st Alternative [0-9,]+
Match a single character present in the list below [0-9,]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
, matches the character , literally (case sensitive)


2nd Alternative [a-zA-Z]+ *[a-zA-Z]*
Match a single character present in the list below [a-zA-Z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
' ' matches the space character literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [a-zA-Z]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)