C#: Google-style search query tokenization and string splitting

I want to tokenize a search query the way Google does. For example, given the following search query:

the quick "brown fox" jumps over the "lazy dog"

I want a string array containing the following tokens:

the
quick
brown fox
jumps
over
the
lazy dog

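As you can see, tokens keep the spaces that appear inside double quotes.

I am looking for examples of how to do this in C#, preferably without regular expressions, but if a regex is the most sensible and efficient approach, so be it.

I would also like to know how to extend this to handle other special characters, for example a - in front of a term to force it to be excluded from the search results, and so on.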

Go through the string one character at a time and build up each token as you go (roughly, in pseudocode terms).
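A character-by-character pass like the one described above might look like the following in C#. This is only a minimal sketch, not the answerer's original pseudocode: the names `QueryTokenizer`, `Tokenize` and `Flush` are made up for illustration, and it assumes a double quote is only ever used to delimit a multi-word token. It accumulates characters in a `StringBuilder` rather than a `string`, to avoid allocating a new string per character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QueryTokenizer
{
    // Splits on whitespace, except while inside a pair of double quotes.
    public static List<string> Tokenize(string query)
    {
        var tokens = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (c == '"')
            {
                // A quote toggles quoted mode; a closing quote ends the token.
                inQuotes = !inQuotes;
                if (!inQuotes) Flush(tokens, word);
            }
            else if (char.IsWhiteSpace(c) && !inQuotes)
            {
                // Unquoted whitespace ends the current token, if there is one.
                Flush(tokens, word);
            }
            else
            {
                word.Append(c);
            }
        }

        // Emit whatever is left (this also covers an unterminated quote).
        Flush(tokens, word);
        return tokens;
    }

    // Hypothetical helper: moves the buffered characters into the token list.
    private static void Flush(List<string> tokens, StringBuilder word)
    {
        if (word.Length == 0) return;
        tokens.Add(word.ToString());
        word.Clear();
    }
}
```

Because the quote characters themselves are never appended, a leading - stays attached to the token that follows it (for example, -"lazy cat" becomes the single token -lazy cat), so the caller can check for it afterwards.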

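So far, this looks like a good candidate for a regex. If it gets significantly more complicated, a more elaborate tokenizing scheme may be needed, but you should avoid that route unless it is necessary, because it is a lot more work. (On the other hand, for complex patterns a regex quickly turns into a dog and should likewise be avoided.)

This regex should solve your problem: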

("[^"]+"|\w+)\s*

Here is a C# example of its usage:

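using System.Text.RegularExpressions;

string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";

MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] would also include the
    // trailing whitespace. Trim('"') drops the surrounding quotes.
    string token = m.Groups[1].Value.Trim('"');
}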

The real benefit of this approach is that it can easily be extended to cover your "-" requirement, like so:

string data = "the quick \"brown fox\" jumps over " +
              "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";

MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] excludes the trailing whitespace; a leading '-' marks an excluded term.
    string token = m.Groups[1].Value;
}

Now, I hate reading regex as much as the next guy, but if you split it up, this one is quite easy to read:

(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*

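Explanation

If possible, match a minus sign followed by a " and then everything up to the next "

Otherwise, match a " followed by everything up to the next "

Otherwise, match a - followed by any word characters

Otherwise, match as many word characters as possible

Group the result

Swallow any whitespace characters that follow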
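To turn those matches into something a search backend could consume, one option is to strip the quotes and record the leading minus as a flag. The following is a rough sketch under that assumption; `SearchTerm`, `SearchQueryParser` and `Parse` are illustrative names that do not appear in the answer, and only the regex itself is taken from it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical result type: the term text plus whether it was prefixed with '-'.
public sealed class SearchTerm
{
    public string Text { get; }
    public bool Exclude { get; }

    public SearchTerm(string text, bool exclude)
    {
        Text = text;
        Exclude = exclude;
    }
}

public static class SearchQueryParser
{
    // Same alternation as above; group 1 holds the token without trailing whitespace.
    private static readonly Regex TokenPattern =
        new Regex(@"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*");

    public static List<SearchTerm> Parse(string query)
    {
        var terms = new List<SearchTerm>();
        foreach (Match m in TokenPattern.Matches(query))
        {
            string token = m.Groups[1].Value;

            // A leading '-' means "exclude this term"; remove it from the text.
            bool exclude = token.StartsWith("-", StringComparison.Ordinal);
            if (exclude) token = token.Substring(1);

            // Drop the surrounding quotes of a phrase, if any.
            terms.Add(new SearchTerm(token.Trim('"'), exclude));
        }
        return terms;
    }
}
```

One caveat with this pattern: \w does not match a hyphen, so a word such as well-known is split into well followed by the excluded term known. If hyphenated words matter, the pattern would need adjusting.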

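A few days ago I was trying to figure out how to do exactly this. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser, which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure, it looks a bit odd to use "Microsoft.VisualBasic" in a C# program, but it works, and as far as I know it is part of the .NET Framework.

To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure whether that is the best way.

Edit: I don't think this handles your "-" requirement, so the regex solution may be better.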
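For reference, the TextFieldParser approach could be sketched roughly as follows. It assumes a reference to the Microsoft.VisualBasic assembly, and it feeds the parser through a `StringReader` (TextFieldParser has a constructor that accepts a `TextReader`) instead of the MemoryStream round trip; the wrapper class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserTokenizer
{
    public static string[] Tokenize(string query)
    {
        // StringReader avoids having to encode the string into a byte stream.
        using (var parser = new TextFieldParser(new StringReader(query)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields parses one line; it returns null when there is no data.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

Two things to watch for, as far as I know: consecutive spaces produce empty fields, and a quote that is not closed properly makes ReadFields throw a MalformedLineException, so real input probably needs some cleanup and error handling around this.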

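I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's answer. I thought I would share it here even though the question was asked for C#. Hope that's OK.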

public static final List<String> convertQueryToWords(String q) {
    List<String> words = new ArrayList<>();
    Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
    Matcher matcher = pattern.matcher(q);
    while (matcher.find()) {
        MatchResult result = matcher.toMatchResult();
        if (result != null && result.group() != null) {
            if (result.group().contains("\"")) {
                words.add(result.group().trim().replaceAll("\"", "").trim());
            } else {
                words.add(result.group().trim());
            }
        }
    }
    return words;
}

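Can the double quote character (") be used anywhere in your syntax other than to mark a multi-word token?

In my case, it can't.

I guess this is roughly what I had in mind for when a regex isn't enough. However, I'd strongly recommend that word not be a string. Because strings are immutable, you would be allocating strings like crazy. Better to make word a StringBuilder, or even just a char array.

You're right. But it's pseudocode. It's about the principle.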
