Java 在自然语言中断时拆分字符串_Java_Regex_String

Java 在自然语言中断时拆分字符串

java regex string

Java 在自然语言中断时拆分字符串,java,regex,string,Java,Regex,String,概述我将字符串发送到一个文本到语音服务器，该服务器接受的最大长度为300个字符。由于网络延迟，返回的每段讲话之间可能会有延迟，因此我希望尽可能在最“自然停顿”的时候打断讲话每个服务器请求都要花费我的钱，因此理想情况下，我会发送尽可能长的字符串，最多允许发送最多个字符以下是我当前的实现：然后使用匹配器解析位置： Matcher matcher = pFirstChoice.matcher(input); if (matcher.find()) { spli

概述

我将字符串发送到一个文本到语音服务器，该服务器接受的最大长度为300个字符。由于网络延迟，返回的每段讲话之间可能会有延迟，因此我希望尽可能在最“自然停顿”的时候打断讲话

每个服务器请求都要花费我的钱，因此理想情况下，我会发送尽可能长的字符串，最多允许发送最多个字符

以下是我当前的实现：

然后使用匹配器解析位置：

    Matcher matcher = pFirstChoice.matcher(input);

    if (matcher.find()) {
        splitLocation = matcher.start();
    }

在我当前的实现中，我的替代方法是存储每个分隔符的位置，然后选择最接近的

MAX\u outrance\u LENGTH

我尝试了各种方法将

MIN\u outrance\u LENGTH

和

MAX\u outrance\u LENGTH

应用于模式，因此它只捕获这些值之间的值，并使用lookarounds反向迭代

？，我会这样做：
Pattern p = Pattern.compile(".{1,299}(?:[.!?]\\s+|\\n|$)", Pattern.DOTALL);
Matcher matcher = p.matcher(text);
while (matcher.find()) {
    speakableUtterances.add(matcher.group().trim());
}

正则表达式的解释：
.{1,299}                 any character between 1 and 299 times (matching the most amount possible)
(?:[.!?]\\s+|\\n|$)      followed by either .!? and whitespaces, a newline or the end of the string

可以考虑将标点扩展到<代码> \p{t} < />代码>，参见javADoc，
您可以在上看到一个工作示例。
定义了如何将文本分解为句子和其他逻辑组件。下面是一些可用的伪代码：
// tests two consecutive codepoints within the text to detect the end of sentences
boolean continueSentence(Text text, Range range1, Range range2) {
    Code code1 = text.code(range1), code2 = text.code(range2);

    // 0.2  sot ÷   
    if (code1.isStartOfText())
        return false;

    // 0.3      ÷    eot
    if (code2.isEndOfText())
        return false;

    // 3.0  CR  ×    LF
    if (code1.isCR() && code2.isLF())
        return true;

    // 4.0  (Sep | CR | LF) ÷   
    if (code1.isSep() || code1.isCR() || code1.isLF())
        return false;

    // 5.0      ×    [Format Extend]
    if (code2.isFormat() || code2.isExtend())
        return true;

    // 6.0  ATerm   ×    Numeric
    if (code1.isATerm() && (code2.isDigit() || code2.isDecimal() || code2.isNumeric()))
        return true;

    // 7.0  Upper ATerm ×    Upper
    if (code2.isUppercase() && code1.isATerm()) {
        Range range = text.previousCode(range1);
        if (range.isValid() && text.code(range).isUppercase())
            return true;
    }

    boolean allow_STerm = true, return_value = true;

    // 8.0  ATerm Close* Sp*    ×    [^ OLetter Upper Lower Sep CR LF STerm ATerm]* Lower
    Range range = range2;
    Code code = code2;
    while (!code.isOLetter() && !code.isUppercase() && !code.isLowercase() && !code.isSep() && !code.isCR() && !code.isLF() && !code.isSTerm() && !code.isATerm()) {
        if (!(range = text.nextCode(range)).isValid())
            break;
        code = text.code(range);
    }
    range = range1;
    if (code.isLowercase()) {
        code = code1;
        allow_STerm = true;
        goto Sp_Close_ATerm;
    }
    code = code1;

    // 8.1  (STerm | ATerm) Close* Sp*  ×    (SContinue | STerm | ATerm)
    if (code2.isSContinue() || code2.isSTerm() || code2.isATerm())
        goto Sp_Close_ATerm;

    // 9.0  ( STerm | ATerm ) Close*    ×    ( Close | Sp | Sep | CR | LF )
    if (code2.isClose())
        goto Close_ATerm;

    // 10.0 ( STerm | ATerm ) Close* Sp*    ×    ( Sp | Sep | CR | LF )
    if (code2.isSp() || code2.isSep() || code2.isCR() || code2.isLF())
        goto Sp_Close_ATerm;

    // 11.0 ( STerm | ATerm ) Close* Sp* (Sep | CR | LF)?   ÷   
    return_value = false;

    // allow Sep, CR, or LF zero or one times
    for (int iteration = 1; iteration != 0; iteration--) {
        if (!code.isSep() && !code.isCR() && !code.isLF()) goto Sp_Close_ATerm;
        if (!(range = text.previousCode(range)).isValid()) goto Sp_Close_ATerm;
        code = text.code(range);
    }

Sp_Close_ATerm:
    // allow zero or more Sp
    while (code.isSp() && (range = text.previousCode(range)).isValid())
        code = text.code(range);

Close_ATerm:
    // allow zero or more Close
    while (code.isClose() && (range = text.previousCode(range)).isValid())
        code = text.code(range);

    // require STerm or ATerm
    if (code.isATerm() || (allow_STerm && code.isSTerm()))
        return return_value;

    // 12.0     ×    Any
    return true;
}

然后你可以像这样重复句子：
// pass in a range of (0, 0) to get the range of the first sentence
// returns a range with a length of 0 if there are no more sentences
Range nextSentence(Text text, Range range) {
try_again:
    range = text.nextCode(new Range(range.start + range.length, 0));
    if (!range.isValid())
        return range;
    Range next = text.nextCode(range);
    long start = range.start;
    while (next.isValid()) && text.continueSentence(range, next))
        next = text.nextCode(range = next);
    range = new Range(start, range.start + range.length - start);

    Range range2 = text.trimRange(range);
    if (!range2.isValid())
        goto try_again;

    return range2;
}

其中：

范围定义为从>=开始到<开始+长度的范围
text.trimRange删除空白字符（可选）
所有的Code.is[Type]函数都是查找。例如，您将在其中一些文件中看到一些代码点被定义为“CR”、“Sep”、“StartOfText”等
Text.code（range）在range.start处对文本中的代码点进行解码。不使用长度
Text.nextCode和Text.previousCode根据当前代码点的范围返回字符串中下一个或上一个代码点的范围。如果该方向上没有代码点，则返回一个无效的范围，即长度为0的范围

该标准还定义了迭代的方法，以及。
哇，好问题。我希望看到一些答案尽快出现，因为这似乎正是regex
tag-guys喜欢的挑战类型。太棒了！也非常快，谢谢。唯一的问题是最小匹配长度为200个字符。我想我能做到这一点的唯一方法是检查组（）的长度，然后使用另一种模式返回到我的第二选择匹配？是的，我想如果你想返回，你需要做类似的事情。但它变得更加复杂，因为您需要调整另一个匹配器。您可以使用find（index）
、start（）
和end（）。
// tests two consecutive codepoints within the text to detect the end of sentences
boolean continueSentence(Text text, Range range1, Range range2) {
    Code code1 = text.code(range1), code2 = text.code(range2);

    // 0.2  sot ÷   
    if (code1.isStartOfText())
        return false;

    // 0.3      ÷    eot
    if (code2.isEndOfText())
        return false;

    // 3.0  CR  ×    LF
    if (code1.isCR() && code2.isLF())
        return true;

    // 4.0  (Sep | CR | LF) ÷   
    if (code1.isSep() || code1.isCR() || code1.isLF())
        return false;

    // 5.0      ×    [Format Extend]
    if (code2.isFormat() || code2.isExtend())
        return true;

    // 6.0  ATerm   ×    Numeric
    if (code1.isATerm() && (code2.isDigit() || code2.isDecimal() || code2.isNumeric()))
        return true;

    // 7.0  Upper ATerm ×    Upper
    if (code2.isUppercase() && code1.isATerm()) {
        Range range = text.previousCode(range1);
        if (range.isValid() && text.code(range).isUppercase())
            return true;
    }

    boolean allow_STerm = true, return_value = true;

    // 8.0  ATerm Close* Sp*    ×    [^ OLetter Upper Lower Sep CR LF STerm ATerm]* Lower
    Range range = range2;
    Code code = code2;
    while (!code.isOLetter() && !code.isUppercase() && !code.isLowercase() && !code.isSep() && !code.isCR() && !code.isLF() && !code.isSTerm() && !code.isATerm()) {
        if (!(range = text.nextCode(range)).isValid())
            break;
        code = text.code(range);
    }
    range = range1;
    if (code.isLowercase()) {
        code = code1;
        allow_STerm = true;
        goto Sp_Close_ATerm;
    }
    code = code1;

    // 8.1  (STerm | ATerm) Close* Sp*  ×    (SContinue | STerm | ATerm)
    if (code2.isSContinue() || code2.isSTerm() || code2.isATerm())
        goto Sp_Close_ATerm;

    // 9.0  ( STerm | ATerm ) Close*    ×    ( Close | Sp | Sep | CR | LF )
    if (code2.isClose())
        goto Close_ATerm;

    // 10.0 ( STerm | ATerm ) Close* Sp*    ×    ( Sp | Sep | CR | LF )
    if (code2.isSp() || code2.isSep() || code2.isCR() || code2.isLF())
        goto Sp_Close_ATerm;

    // 11.0 ( STerm | ATerm ) Close* Sp* (Sep | CR | LF)?   ÷   
    return_value = false;

    // allow Sep, CR, or LF zero or one times
    for (int iteration = 1; iteration != 0; iteration--) {
        if (!code.isSep() && !code.isCR() && !code.isLF()) goto Sp_Close_ATerm;
        if (!(range = text.previousCode(range)).isValid()) goto Sp_Close_ATerm;
        code = text.code(range);
    }

Sp_Close_ATerm:
    // allow zero or more Sp
    while (code.isSp() && (range = text.previousCode(range)).isValid())
        code = text.code(range);

Close_ATerm:
    // allow zero or more Close
    while (code.isClose() && (range = text.previousCode(range)).isValid())
        code = text.code(range);

    // require STerm or ATerm
    if (code.isATerm() || (allow_STerm && code.isSTerm()))
        return return_value;

    // 12.0     ×    Any
    return true;
}

// pass in a range of (0, 0) to get the range of the first sentence
// returns a range with a length of 0 if there are no more sentences
Range nextSentence(Text text, Range range) {
try_again:
    range = text.nextCode(new Range(range.start + range.length, 0));
    if (!range.isValid())
        return range;
    Range next = text.nextCode(range);
    long start = range.start;
    while (next.isValid()) && text.continueSentence(range, next))
        next = text.nextCode(range = next);
    range = new Range(start, range.start + range.length - start);

    Range range2 = text.trimRange(range);
    if (!range2.isValid())
        goto try_again;

    return range2;
}