Python 3.x: cutting a long string into paragraphs that contain complete sentences

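My task is to translate very long texts (more than 50k symbols) using online translation APIs (Google, Yandex, etc.). All of them limit the length of a request. So I want to cut my text into a list of strings, each shorter than that limit, while keeping every sentence uncut.

For example, if I want to process this text with a limit of 300 symbols:

The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs. These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.

I should get this output: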

['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.', 
'These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java.', 
'Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.)', 
'Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages.', 
'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']

What is the most Pythonic way to do this? Is there a regex that can achieve it?

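Regex is not the right tool for parsing sentences out of a paragraph. You should look at a dedicated sentence tokenizer instead.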
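To see why a plain regex is brittle for this job, here is a minimal sketch. The helper naive_split and the sample string are hypothetical, made up purely for illustration; the pattern splits after '.', '!' or '?' when whitespace follows, which is roughly what a first regex attempt tends to look like.

```python
import re

def naive_split(text):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text)

# Made-up sample: two real sentences that also contain two abbreviations.
sample = "Dr. Smith released v1.0 on Feb. 3, 2013. It requires Java 8+."
parts = naive_split(sample)

for p in parts:
    print(repr(p))
```

The two sentences come back as four fragments, because the pattern cannot tell an abbreviation's period from the end of a sentence. A purpose-built sentence splitter is designed to handle such cases.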

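One way to aggregate sentences by cumulative length is to use a generator function.

The function g below yields a joined string whenever adding the next sentence would push the length past 300 characters, and once more when the end of the iterable is reached. It assumes that no single sentence exceeds the 300-character limit.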

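def g(sents, limit=300):
    """Yield space-joined groups of whole sentences, each at most `limit`
    characters long (assuming no single sentence is longer than `limit`)."""
    chunk = []        # sentences collected for the current group
    text_length = 0   # always equals len(' '.join(chunk))
    for s in sents:
        # The joining space only counts when the group already has content.
        added = len(s) + (1 if chunk else 0)
        if chunk and text_length + added > limit:
            yield ' '.join(chunk)
            chunk, text_length = [s], len(s)
        else:
            chunk.append(s)
            text_length += added
    if chunk:
        yield ' '.join(chunk)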

The sentence aggregator can be called like this:

for s in g(sents):
    print(s)
outputs:
The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!
We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.
This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+.
(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.

Checking the length of each text segment shows that every segment has fewer than 300 characters:

[len(s) for s in g(sents)]
#outputs:
[100, 268, 244, 276, 289]
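The aggregator relies on the assumption that no single sentence is longer than the limit. One way that assumption might be relaxed is sketched below: oversized sentences are broken at whitespace before they are aggregated. The name split_oversized and its limit parameter are hypothetical additions, not part of the answer, and a single word longer than the limit is still passed through unchanged.

```python
import textwrap

def split_oversized(sents, limit=300):
    """Yield sentences unchanged, except that any sentence longer than
    `limit` is broken at whitespace into pieces of at most `limit` chars."""
    for s in sents:
        if len(s) <= limit:
            yield s
        else:
            # Break only at whitespace: words and hyphenated terms stay
            # intact, so one overlong word would still exceed the limit.
            yield from textwrap.wrap(
                s, width=limit,
                break_long_words=False, break_on_hyphens=False,
            )

# Tiny demonstration with an artificially small limit.
pieces = list(split_oversized(["short one.", "this sentence is far too long."], limit=12))
print(pieces)
```

The result can then be handed to the aggregator, for example as g(list(split_oversized(sents))). Pieces cut this way are no longer complete sentences, so this is only a fallback for the rare oversized case.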

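Thanks for your answer! But what is the fastest way to join sentences so that each string is as long as possible while still staying under 300? In fact, the sentences can get mixed up; that is not a big problem.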
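If sentence order really does not matter, this follow-up turns into a bin-packing problem. A common heuristic for it is first-fit decreasing: sort the sentences from longest to shortest and put each one into the first chunk that still has room. The sketch below is one possible approach and does not come from the answer above; pack_sentences and limit are hypothetical names, and the heuristic is not guaranteed to produce the smallest possible number of chunks.

```python
def pack_sentences(sents, limit=300):
    """First-fit decreasing packing. Returns a list of strings, each at most
    `limit` chars (unless a single sentence is itself longer than `limit`)."""
    chunks = []    # list of lists of sentences
    lengths = []   # lengths[i] == len(' '.join(chunks[i]))
    for s in sorted(sents, key=len, reverse=True):
        for i, used in enumerate(lengths):
            # +1 accounts for the joining space.
            if used + 1 + len(s) <= limit:
                chunks[i].append(s)
                lengths[i] = used + 1 + len(s)
                break
        else:
            # No existing chunk has room: open a new one.
            chunks.append([s])
            lengths.append(len(s))
    return [' '.join(c) for c in chunks]

# Tiny demonstration with an artificially small limit.
demo = pack_sentences(["aaaa.", "bb.", "cccccc.", "d."], limit=10)
print(demo)
```

Called as pack_sentences(sents), it would return chunks that fill the limit more tightly than in-order aggregation usually does, at the cost of scrambling the sentence order, which may hurt translation quality when sentences depend on their neighbours.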
