C#: Splitting an article into sentences using delimiters

I have a small assignment. I have an article in this format:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE>    CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.

Reuter
&#3;</BODY></TEXT>
</REUTERS>
OK, I found a solution using the ideas I received here. I used the overload of Split that takes a string array, like this:

.Split(new string[] { ". " }, StringSplitOptions.None);

It looks much better now.
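
As a minimal sketch of that overload (the class name SplitDemo, the method SplitSentences and the sample text are made up for illustration), the string-array form of Split accepts several delimiters at once, so a period followed by a line break can be handled as well. Split consumes the delimiter, so the period is dropped from every sentence that is followed by one:

```csharp
using System;

static class SplitDemo
{
    // Splits on a period followed by a space or by a line break.
    // The delimiter strings are consumed, so those periods are not kept.
    public static string[] SplitSentences(string text)
    {
        var delimiters = new[] { ". ", ".\r\n", ".\n" };
        return text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // Hypothetical sample text, not the real article.
        string body = "Standard Oil owns a 55.0 pct interest. "
                    + "The venture will be called BP/Standard.\nIt starts soon.";

        foreach (string sentence in SplitSentences(body))
        {
            Console.WriteLine("[" + sentence + "]");
        }
    }
}
```

Because "55.0" has no space or line break after its dot, it is not treated as a sentence boundary here.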

You can also use a regular expression that looks for sentence terminators followed by whitespace:

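// Requires: using System.Text.RegularExpressions;
// Split at whitespace that directly follows '.', '!' or '?'.
var pattern = @"(?<=[.!?])\s+";
var sentences = Regex.Split(input, pattern);

foreach (var sentence in sentences) {
    // Wrap each sentence in a <sentence> element and write it out.
    var node = string.Format("\t \t <sentence>{0}</sentence>", sentence);
    file.WriteLine(node);
}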
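
The article text is hard-wrapped, so a sentence can span several lines. A possible way to deal with that, sketched here under the assumption that line breaks inside the body carry no meaning, is to collapse every run of whitespace into one space before applying the pattern above. The class name RegexSentenceSplitter, the method name and the sample text are illustrative only:

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSentenceSplitter
{
    public static string[] SplitSentences(string body)
    {
        // Join hard-wrapped lines: every run of whitespace becomes one space.
        string flat = Regex.Replace(body, @"\s+", " ").Trim();

        // Split at whitespace that directly follows '.', '!' or '?'.
        // The lookbehind keeps the terminator attached to its sentence.
        return Regex.Split(flat, @"(?<=[.!?])\s+");
    }

    static void Main()
    {
        // Hypothetical sample modelled on the article body.
        string body = "Standard Oil Co and BP North America\n"
                    + "Inc said they plan to form a venture.\n"
                    + "It owns a 55.0 pct interest in Standard Oil.";

        foreach (string sentence in SplitSentences(body))
        {
            Console.WriteLine("<sentence>" + sentence + "</sentence>");
        }
    }
}
```

This still splits after abbreviations that end in a period and are followed by a space, so it is a heuristic rather than a full sentence detector.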

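I would make a list of the index points of all "." characters.

For each index point, check whether there is a digit on each side. If there are digits on both sides, remove that index point from the list.

Then, when producing the output, simply use the substring function with the remaining index points to get each sentence on its own.

Rough code follows (it is late):

Now we just remove all the positions where the "." has a digit on both sides: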

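// Assumption: indexesToRemove holds the character positions themselves,
// so remove by value; RemoveAt would treat each number as a list index.
foreach (int indexPoint in indexesToRemove)
{
    IndexPoints.Remove(indexPoint);
}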

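Now, when you write the sentences out in the new file format, just loop over the remaining index points and take
sentences.Substring(lastIndexPoint + 1, currentIndexPoint - lastIndexPoint)
for each one. Note that Substring takes a start index and a length, not an end index.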
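
The steps above can be sketched end to end as follows. This is one possible reading of the idea, not the answerer's original code: it skips decimal points while collecting the index points instead of removing them afterwards, and the names IndexPointSplitter and SplitSentences as well as the sample text are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class IndexPointSplitter
{
    public static List<string> SplitSentences(string text)
    {
        // Collect the position of every '.' that is not a decimal point,
        // i.e. that does not have a digit on both sides.
        var indexPoints = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != '.') continue;
            bool digitBefore = i > 0 && char.IsDigit(text[i - 1]);
            bool digitAfter = i + 1 < text.Length && char.IsDigit(text[i + 1]);
            if (!(digitBefore && digitAfter)) indexPoints.Add(i);
        }

        // Cut the text at the remaining positions, keeping each period.
        var sentences = new List<string>();
        int start = 0;
        foreach (int end in indexPoints)
        {
            string sentence = text.Substring(start, end - start + 1).Trim();
            if (sentence.Length > 0) sentences.Add(sentence);
            start = end + 1;
        }

        // Keep any trailing text that has no final period.
        string rest = text.Substring(start).Trim();
        if (rest.Length > 0) sentences.Add(rest);
        return sentences;
    }

    static void Main()
    {
        // Hypothetical sample text.
        string body = "It owns a 55.0 pct interest in Standard Oil. "
                    + "The venture will be operated by Standard Oil.";

        foreach (string sentence in SplitSentences(body))
        {
            Console.WriteLine("<sentence>" + sentence + "</sentence>");
        }
    }
}
```

Unlike the Split overload, this keeps the period at the end of each sentence and does not need a space after it.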

I spent a lot of time on this. I thought you might like to see it, since it does not use any clumsy code. It produces output that is 99% similar to yours:

<articles>
    <article id="2">
        <subject>STANDARD OIL &lt;SRD&gt; TO FORM FINANCIAL UNIT</subject>
        <sentence>Standard Oil Co and BP North America</sentence>
        <sentence>Inc said they plan to form a venture to manage the money market</sentence>
        <sentence>borrowing and investment activities of both companies.</sentence>
        <sentence>BP North America is a subsidiary of British Petroleum Co</sentence>
        <sentence>Plc &lt;BP&gt;, which also owns a 55.0 pct interest in Standard Oil.</sentence>
        <sentence>The venture will be called BP/Standard Financial Trading</sentence>
        <sentence>and will be operated by Standard Oil under the oversight of a</sentence>
        <sentence>joint management committee.</sentence>
    </article>
</articles>

I hope you like it :)

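Split by dot then space, since most sentences have a space after the full stop. You could also include the line break.

This looks like a job for regex.

But the Split method takes a char[], so I can't pass ". " because it is a string, not a char.

Use ". ".ToCharArray()

That would still create a char array with two elements, the dot and the space, and the text would be split on either of them, so it would not solve the problem.

@YuvalHaran As you eventually realized, Split can also take a string array.

If this works for you, then something is wrong with your question: in your question, the dots that end the sentences are not followed by a space. In any case, this seems unreliable, because a period that is not followed by a space may still end a sentence, for example when it is followed by a closing quote and then a space.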
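
To illustrate the last point, here is a hedged sketch of a pattern that also accepts a closing quote or bracket between the terminator and the whitespace. It relies on .NET allowing variable-length lookbehind. The class name QuoteAwareSplitter and the sample text are hypothetical, and the set of closing characters is only an assumption about what the input may contain:

```csharp
using System;
using System.Text.RegularExpressions;

static class QuoteAwareSplitter
{
    // A terminator, then an optional closing quote or bracket, then whitespace.
    // In the verbatim string, "" stands for a single double-quote character.
    private const string Pattern = @"(?<=[.!?][""')\]]?)\s+";

    public static string[] SplitSentences(string text)
    {
        return Regex.Split(text.Trim(), Pattern);
    }

    static void Main()
    {
        // Hypothetical sample text.
        string text = "He said \"stop.\" Then he left. (Really.) Yes";

        foreach (string sentence in SplitSentences(text))
        {
            Console.WriteLine("[" + sentence + "]");
        }
    }
}
```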
。这仍然会创建一个包含元素的char数组。点和空间,会被这两个分开,所以它不会解决这个问题problem@YuvalHaran正如您最终意识到的,拆分也可以使用字符串数组。如果这对您有效,那么您的问题就出了问题。在你的问题中,句子结尾的点后面没有空格。无论如何,这似乎是不可靠的,因为一个句号后面没有空格可能仍然会结束一个句子,例如,如果它后面有一个结束引号,然后是一个空格。如果这对你有效,那么你的问题就出了问题。在你的问题中,句子结尾的点后面没有空格。无论如何,这似乎是不可靠的,因为一个句号后面没有空格可能仍然会结束一个句子,例如,如果它后面有一个结束引号,然后是空格。