C#: Splitting an article into sentences using delimiters

c# c#-4.0

I have a small assignment, and I have an article in this format:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE>    CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.

Reuter
&#3;</BODY></TEXT>
</REUTERS>

OK, using the ideas I received here I found a solution. I used the overload of Split that takes a string array, like this:

.Split(new string[] { ". " }, StringSplitOptions.None);

It looks much better now.
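To make the behaviour of that overload concrete, here is a minimal sketch. The class and method names are made up for illustration, and the extra ".\r\n" and ".\n" delimiters are my own assumption (the sample article breaks lines right after a period), not part of the snippet above. Note that the delimiter itself is dropped from each piece, so only a final period with nothing after it survives.

```csharp
using System;

public static class DotSpaceSplitDemo
{
    // Hypothetical helper: splits on a period followed by a space or a line break.
    // A period inside a number such as "55.0" is followed by a digit, so it is left alone.
    public static string[] SplitOnDotSpace(string text)
    {
        string[] delimiters = new string[] { ". ", ".\r\n", ".\n" };
        return text.Split(delimiters, StringSplitOptions.None);
    }

    public static void Main()
    {
        string body = "Plc owns a 55.0 pct interest in Standard Oil.\n"
                    + "The venture will be called BP/Standard Financial Trading. It is new.";

        foreach (string part in SplitOnDotSpace(body))
        {
            // The matched delimiter (period plus whitespace) is not part of the result.
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

If the trailing periods matter for the output file, they have to be appended again after splitting, which is one reason the regex approach may be more convenient.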

You can also use a regular expression to find sentence terminators that are followed by whitespace:

var pattern = @"(?<=[\.!\?])\s+";
var sentences = Regex.Split(input, pattern);

foreach (var sentence in sentences) {
    //do something with the sentence
    var node = string.Format("\t \t <sentence>{0}</sentence>", sentence);
    file.WriteLine(node);
}
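For anyone who wants to try this in isolation, the following is a self-contained sketch of the same pattern. The wrapper class, the method name and the sample text are assumptions for illustration only. The lookbehind `(?<=[.!?])` is zero-width, so the terminator stays attached to its sentence and only the whitespace is consumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSentenceDemo
{
    // Split wherever whitespace directly follows '.', '!' or '?'.
    private static readonly Regex Boundary = new Regex(@"(?<=[.!?])\s+");

    // Hypothetical helper: trims first so trailing whitespace yields no empty entry.
    public static string[] SplitSentences(string input)
    {
        return Boundary.Split(input.Trim());
    }

    public static void Main()
    {
        string body = "It owns a 55.0 pct interest in Standard Oil.\n"
                    + "Is the venture new? Yes!";

        foreach (string sentence in SplitSentences(body))
        {
            Console.WriteLine("<sentence>" + sentence + "</sentence>");
        }
    }
}
```

The period in "55.0" is followed by a digit rather than whitespace, so it does not produce a split. Abbreviations followed by a space would still be split, so this is a heuristic rather than a real sentence tokenizer.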

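I would build a list of the index positions of every "." character.
For each index point, check whether there is a digit on each side; if both sides are digits, remove that index point from the list.
Then, on output, just use Substring with the remaining index points to pull out each sentence individually.
Rough code follows (it's late):
Now we simply remove every position where the "." has a digit on both sides: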
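// Assumes indexesToRemove holds '.' positions (the values stored in IndexPoints),
// so remove by value; RemoveAt would treat each position as a list index.
foreach (int indexPoint in indexesToRemove)
{
    IndexPoints.Remove(indexPoint);
}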

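Now, when you write the sentences out in the new file format, just loop over the index points and take sentences.Substring(lastIndexPoint + 1, currentIndexPoint - lastIndexPoint) for each one. Note that the second argument of Substring is a length, not an end index.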
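Since the code in this answer is only partly shown, here is one way the three steps could fit together. This is a sketch under my own assumptions, not the answerer's actual code: the class name, the method name and the use of `List<T>.RemoveAll` are all illustrative choices.

```csharp
using System;
using System.Collections.Generic;

public static class IndexPointDemo
{
    // Hypothetical helper implementing the index-point idea described above.
    public static List<string> SplitByIndexPoints(string text)
    {
        // Step 1: record the position of every '.' in the text.
        List<int> indexPoints = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '.') indexPoints.Add(i);
        }

        // Step 2: drop each '.' that has a digit on both sides, e.g. "55.0".
        indexPoints.RemoveAll(i =>
            i > 0 && i < text.Length - 1 &&
            char.IsDigit(text[i - 1]) && char.IsDigit(text[i + 1]));

        // Step 3: cut the text at the remaining positions.
        List<string> sentences = new List<string>();
        int start = 0;
        foreach (int end in indexPoints)
        {
            // Substring takes (startIndex, length); +1 keeps the '.' itself.
            string sentence = text.Substring(start, end - start + 1).Trim();
            if (sentence.Length > 0) sentences.Add(sentence);
            start = end + 1;
        }

        // Keep any trailing text that has no closing '.'.
        string rest = text.Substring(start).Trim();
        if (rest.Length > 0) sentences.Add(rest);
        return sentences;
    }

    public static void Main()
    {
        foreach (string s in SplitByIndexPoints("It owns 55.0 pct. Done."))
        {
            Console.WriteLine(s);
        }
    }
}
```

Filtering the list in one pass avoids the index-shifting problems that come with removing items inside a loop.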
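I spent a lot of time on this. You may want to take a look, since it doesn't really use any clunky code, and it produces output that is 99% similar to yours: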
<articles>
    <article id="2">
        <subject>STANDARD OIL &lt;SRD&gt; TO FORM FINANCIAL UNIT</subject>
        <sentence>Standard Oil Co and BP North America</sentence>
        <sentence>Inc said they plan to form a venture to manage the money market</sentence>
        <sentence>borrowing and investment activities of both companies.</sentence>
        <sentence>BP North America is a subsidiary of British Petroleum Co</sentence>
        <sentence>Plc &lt;BP&gt;, which also owns a 55.0 pct interest in Standard Oil.</sentence>
        <sentence>The venture will be called BP/Standard Financial Trading</sentence>
        <sentence>and will be operated by Standard Oil under the oversight of a</sentence>
        <sentence>joint management committee.</sentence>
    </article>
</articles>

I hope you like it :)
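The code for this answer is not included on this page, so the following is only a guess at how output of that shape could be produced, using LINQ to XML (`System.Xml.Linq`). The class name, method name and parameters are hypothetical. One advantage over writing tags with `string.Format` is that characters such as `<` in the title are escaped automatically.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class ArticleXmlDemo
{
    // Hypothetical helper: wraps one article's subject and sentences in the
    // <articles>/<article>/<sentence> layout shown in the output above.
    public static XElement BuildArticles(string id, string subject, IEnumerable<string> sentences)
    {
        XElement article = new XElement("article",
            new XAttribute("id", id),
            new XElement("subject", subject));

        foreach (string sentence in sentences)
        {
            article.Add(new XElement("sentence", sentence));
        }

        return new XElement("articles", article);
    }

    public static void Main()
    {
        XElement root = BuildArticles(
            "2",
            "STANDARD OIL <SRD> TO FORM FINANCIAL UNIT",
            new[]
            {
                "Standard Oil Co and BP North America",
                "Inc said they plan to form a venture to manage the money market"
            });

        // XElement.ToString() returns indented XML with special characters escaped.
        Console.WriteLine(root);
    }
}
```

To write to a file instead of the console, `root.Save(path)` can be used with a path of your choosing.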
Split by ". " (a dot followed by a space), since most sentences have a space after the period. You could also include line breaks.
This looks like a job for regex.
The Split method takes a char[] though, so I can't pass ". ", because it is a string and not a char.
Use ". ".ToCharArray().
That would still create a char array containing two elements, the dot and the space, and the text would be split on either of them, so it wouldn't solve the problem.
@YuvalHaran As you eventually realized, Split can also take a string array.
If this works for you, then something is wrong with your question. In your question, the dots that end sentences are not followed by a space. In any case, this seems unreliable, because a period that is not followed by a space may still end a sentence, for example when it is followed by a closing quote and then a space.
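The difference between the two overloads discussed in these comments is easy to check with a small sketch (class and method names are made up for illustration):

```csharp
using System;

public static class SplitOverloadDemo
{
    // ". ".ToCharArray() is { '.', ' ' }, so every dot AND every space is a separator.
    public static string[] SplitOnEitherChar(string text)
    {
        return text.Split(". ".ToCharArray());
    }

    // The string[] overload only splits where the whole sequence ". " occurs.
    public static string[] SplitOnWholeString(string text)
    {
        return text.Split(new string[] { ". " }, StringSplitOptions.None);
    }

    public static void Main()
    {
        string text = "A b. C d";
        Console.WriteLine(SplitOnEitherChar(text).Length);   // word-level pieces plus an empty entry
        Console.WriteLine(SplitOnWholeString(text).Length);  // sentence-level pieces
    }
}
```

With the char array the text falls apart into words (plus an empty entry between the dot and the space), while the string array keeps "A b" and "C d" intact.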