C#: How to find the 10 most common words in a text


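I have an arbitrary text in a txt file and I need to find the 10 most common words. How should I do that? I think I should read the sentences from one period to the next and put them into an array, but I don't really know how.

You can do it with Linq. Try something like this: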

var words = "two one three one three one";
var orderedWords = words
  .Split(' ')
  .GroupBy(x => x)
  .Select(x => new { 
    KeyField = x.Key, 
    Count = x.Count() })
  .OrderByDescending(x => x.Count)
  .Take(10);
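
To make the result of the query above easier to inspect, here is a minimal sketch that wraps it in a method and formats each entry as text. The class name LinqWordCount, the method name TopWords and the "word: count" format are assumptions made for this illustration, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqWordCount
{
    // Hypothetical wrapper around the query above: returns the n most
    // frequent words as "word: count" strings, most frequent first.
    public static List<string> TopWords(string words, int n)
    {
        return words
            .Split(' ')
            .GroupBy(x => x)
            .Select(x => new { KeyField = x.Key, Count = x.Count() })
            .OrderByDescending(x => x.Count)
            .Take(n)
            .Select(x => $"{x.KeyField}: {x.Count}")
            .ToList();
    }
}
```

For the sample string "two one three one three one", TopWords(words, 10) yields "one: 3", "three: 2" and "two: 1", in that order. For a real file, the text could be loaded first with File.ReadAllText.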

Convert all the data to a string and split it into an array.

For example:

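using System;
using System.Collections.Generic;
using System.Linq;

char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = "one\ttwo three:four,five six seven";

string[] words = text.Split(delimiterChars);

// Count how many times each word occurs.
var dict = new Dictionary<string, int>();
foreach (var value in words)
{
    if (dict.ContainsKey(value))
        dict[value]++;
    else
        dict[value] = 1;
}

// Print the 10 most frequent words together with their counts.
foreach (var pair in dict.OrderByDescending(p => p.Value).Take(10))
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}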

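As for me, the most difficult part of the task is splitting the initial text into words. Words in a natural language (English, for example) are a very complicated thing: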
Forget-me-not     // 1 word (a nice blue flower) 
Do not Forget me! // 4 words
Cannot            // 1 word or shall we split "cannot" into "can" + "not"?
May not           // 2 words
George W. Bush    // Is "W" a word?
W.A.S.P.          // ...If it is, is it equal to "W" in the "W.A.S.P"?
Donald Trump      // Homonyms: name
Spades is a trump // ...and a special suit in a game of cards
It's an IT; it is // "It" and "IT" are different (IT is an acronym), "It" and "it" are same

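Another problem is this: you probably want to count "It" and "it" as one and the same word, but "IT" as a different one, since it is an acronym. As a first attempt, I suggest something like this: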
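// Requires: System.Globalization, System.IO, System.Linq, System.Text.RegularExpressions
var top10words = File
  .ReadLines(@"C:\MyFile.txt")
  // Extract the words of each line: letters, dashes and apostrophes only
  .SelectMany(line => Regex
    .Matches(line, @"[A-Za-z-']+")
    .OfType<Match>()
    // "it" and "It" both become "It"; an all-caps "IT" is left unchanged
    .Select(match => CultureInfo.InvariantCulture.TextInfo.ToTitleCase(match.Value)))
  .GroupBy(word => word)
  .Select(chunk => new {
     word = chunk.Key,
     count = chunk.Count()})
  .OrderByDescending(item => item.count)
  .ThenBy(item => item.word)
  .Take(10);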


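In my solution I've assumed that:

Words can contain only A..Z, a..z, - (dash) and ' (apostrophe) letters
TitleCase is used to separate all-caps acronyms from regular words ("It" and "it" are treated as the same word, while "IT" is treated as a different one)
In case of a tie (two or more words have the same frequency), the tie is broken in alphabetical order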
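
The TitleCase assumption can be checked in isolation. The sketch below applies the same regular expression and the same ToTitleCase call to a single line. The class name TitleCaseDemo and the method name NormalizeWords are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class TitleCaseDemo
{
    // Hypothetical helper: extracts the words of one line and normalizes
    // each of them with the invariant culture's ToTitleCase.
    public static string[] NormalizeWords(string line)
    {
        return Regex
            .Matches(line, @"[A-Za-z-']+")
            .Cast<Match>()
            .Select(m => CultureInfo.InvariantCulture.TextInfo.ToTitleCase(m.Value))
            .ToArray();
    }
}
```

For the input "It is an IT; it is", the words "It" and "it" both come out as "It", while "IT" stays "IT" because ToTitleCase leaves words written entirely in uppercase unchanged. Note that this also means any other word written in all caps (a shouted "STOP", say) is counted separately from its lowercase form.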
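
Here is a combined method I wrote based on the other answers provided here. I suppose the delimiter characters will depend on your use case, though. In your case, you can pass 10 for the numWords parameter.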

public static Dictionary<string, int> WordCount(string text, int numWords = int.MaxValue)
{
    var delimiterChars = new char[] { ' ', ',', ':', '\t', '\"', '\r', '{', '}', '[', ']', '=', '/' };
    return text
        .Split(delimiterChars)
        .Where(x => x.Length > 0)
        .Select(x => x.ToLower())
        .GroupBy(x => x)
        .Select(x => new { Word = x.Key, Count = x.Count() })
        .OrderByDescending(x => x.Count)
        .Take(numWords)
        .ToDictionary(x => x.Word, x => x.Count);
}
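
One caveat about returning a Dictionary: its documentation states that the order in which items are enumerated is undefined, so the descending order by count is not guaranteed to survive ToDictionary. A minimal sketch of a variant that returns an ordered list instead is shown below. The names WordCounter and TopWords, the reduced delimiter set, the removal of empty entries and the alphabetical tie-break are assumptions made for this illustration, not part of the method above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Hypothetical variant of WordCount: returns an ordered list of
    // (word, count) pairs, most frequent first, ties broken alphabetically.
    public static List<KeyValuePair<string, int>> TopWords(string text, int numWords = int.MaxValue)
    {
        // Assumed delimiter set; adjust it to the use case.
        var delimiterChars = new[] { ' ', ',', '.', ':', '\t', '\r', '\n' };
        return text
            // RemoveEmptyEntries drops the empty strings that appear
            // between two consecutive delimiters such as ", ".
            .Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(x => x.ToLowerInvariant())
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key, StringComparer.Ordinal)
            .Take(numWords)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToList();
    }
}
```

With this sketch, TopWords("one, two, three, four, four, five", 1) returns a single pair with the key "four" and the value 2.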

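
What have you tried so far?

Split the text into words, group by these words, order by count (descending), take the top 10.

ToList() is redundant: ...words.Split(' ').GroupBy(x => x)....

Quite right, Dmitry, it isn't needed. I've edited the code sample.

If you are given an "arbitrary text in a txt file", your current routine will run into difficulties: you have to remove all punctuation (commas, full stops etc.); you have to deal with case, e.g. "One is one, not a plus" - the word One appears twice.

Counter example: var words = "two to one, three to one and three. One, one and three"; the most common word should be "one", but "three" is returned.

Counter example: text = "one, two, three, four, four, five"; the expected result is "four" on top. The actual result is the empty string.

Regex them all.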