C# ElasticSearch document structure that does not index https/http separately

c# nest

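I have been building an ElasticSearch index of web pages to power an online site search. I have a C# class that I have built and decorated with some NEST attributes, but I am still somewhat unsure whether I have covered everything I might need.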

Here is my class:

[ElasticType(IdProperty = "url_id")]
public class WebPage
{
    /// <summary>
    /// The last time this document was indexed
    /// </summary>
    public string dateScanned { get; set; }

    /// <summary>
    /// The ACTUAL mime type returned.  Can be something like application/vnd.openxmlformats-officedocument.presentationml.presentation
    /// </summary>
    [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] //so we can do aggregates / faceted search on it
    public string mimeType { get; set; }

    /// <summary>
    /// Human-friendly type.  Like: HTML, DOC, PPT
    /// </summary>
    [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] //so we can do aggregates / faceted search on it
    public string shortMimeType { get; set; }

    /// <summary>
    /// The URL without protocol.  Prevents indexing http:// and https:// as two separate index pages
    /// This is used as the ID field in ES.
    /// </summary>
    public string url_id { get; set; }

    /// <summary>
    /// The url we use when building a link.  DOES include protocol
    /// </summary>
    public string url { get; set; }

    //the rest are your standard fields for a simple "document"
    public string body { get; set; }
    public string keywords { get; set; }
    public string description { get; set; }
    public string title { get; set; }
}

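One problem I ran into is that if I use the full URL as the ElasticSearch ID, I can end up with two entries for the same page, i.e. one for its http:// address and one for its https:// address.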

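To prevent this, I decided to store the URL without the protocol in the url_id field (www.example.com/ in the example above) and use that as the ES identifier. I then also store the full URL in the url field, which is the URL printed to the page at query time.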
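The protocol stripping described above could be sketched as follows. This is only an illustration, not the asker's actual code: the `UrlKey` class and `ToUrlId` method are hypothetical names, and lowercasing the host, keeping non-default ports, and dropping the fragment are assumptions about what should count as "the same page".

```csharp
using System;

public static class UrlKey
{
    // Hypothetical helper: builds a url_id by removing the scheme, so that
    // the http:// and https:// variants of a page map to the same ID.
    public static string ToUrlId(string url)
    {
        var uri = new Uri(url, UriKind.Absolute);

        // Host names are case-insensitive, so normalize them explicitly.
        var host = uri.Host.ToLowerInvariant();

        // Assumption: a non-default port identifies a different site, so keep it.
        var authority = uri.IsDefaultPort ? host : host + ":" + uri.Port;

        // PathAndQuery excludes the fragment (#...), which never reaches the server.
        return authority + uri.PathAndQuery;
    }
}
```

With this sketch, `UrlKey.ToUrlId("http://www.example.com/")` and `UrlKey.ToUrlId("https://www.example.com/")` both yield `www.example.com/`, which can then be assigned to `url_id` before indexing.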

The problem I see is that the url field will sometimes point to http and other times to https, depending on which variant was indexed last.

Is there a better way to handle the protocol storage problem?

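Can you clarify why this behavior is a problem, and what the ideal behavior would be in your case?

This is a search engine, so I want to eliminate results in the result set that are essentially duplicates, and show the user only one result for each unique piece of content.

Yes, and you achieve that by cutting the protocol off in url_id. So what is the problem?

You could always make url a collection, e.g. a HashSet, and store both https and http. Also, you may want to consider setting up appropriate analysis for the url field, using something like the path hierarchy tokenizer or the UAX email URL tokenizer, depending on how you expect the documents to be searched.

@imotov - I guess there is no problem, but I was worried that I was overlooking something fundamental. Thanks.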
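The comment's suggestion of storing url as a collection might look roughly like the sketch below. The `WebPageUrls` class, the `urls` property, and the `PreferredUrl` method are hypothetical names, and preferring the https variant when building a link is an assumption rather than something stated in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical variant of the WebPage document: still one entry per url_id,
// but every full URL seen for that page is kept instead of only the last one.
public class WebPageUrls
{
    public string url_id { get; set; }

    // Ordinal comparison, because URL paths can be case-sensitive.
    public HashSet<string> urls { get; set; } = new HashSet<string>(StringComparer.Ordinal);

    // Assumption: link to the https variant when one is known,
    // otherwise fall back to any stored URL (null if the set is empty).
    public string PreferredUrl()
    {
        return urls.FirstOrDefault(u => u.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            ?? urls.FirstOrDefault();
    }
}
```

Because indexing a document under an existing ID replaces the stored document, the crawler would have to merge a newly seen URL into the existing set before re-indexing; otherwise the set would again hold only the variant that was indexed last.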