Java Lucene 8.4.1 - LatLonShape.createIndexableFields vs. RecursivePrefixTreeStrategy.createIndexableFields


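I am using Lucene version 8.4.1 and have some questions about spatial indexing. They concern indexing performance first and spatial search later. My test data is about 10,000 polygons, which is a small dataset.

First, my setup: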

            // JtsSpatialContext is needed to index polygons
            this.ctx = JtsSpatialContext.GEO;
            SpatialPrefixTree tree = new GeohashPrefixTree(this.ctx, MAX_LEVEL);
            this.strategy = new RecursivePrefixTreeStrategy(tree, GEOMETRY_FIELDNAME);
            this.shapeReader = this.ctx.getFormats().getWktReader();
            // Creating the path for lucene index files
            Path path = Paths.get(INDEX_FOLDER);
            this.dir = SimpleFSDirectory.open(path);
            
            // preparing IndexWriter
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());           
            config.setOpenMode(OpenMode.CREATE);
            config.setRAMBufferSizeMB(256.0);
            config.setUseCompoundFile(false);           
            config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
            
            LogMergePolicy policy = new LogDocMergePolicy();
            policy.setMergeFactor(15);
            config.setMergePolicy(policy);
            
            this.indexWriter = new IndexWriter(dir, config);

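As you can see, I am using JtsSpatialContext to index the spatial data. The configuration is still somewhat magic to me; this one gave me the best results. MAX_LEVEL of the GeohashPrefixTree is set to 11. Also, this.shapeReader = this.ctx.getFormats().getWktReader() is used to get rid of the deprecation warning shown when using ctx.readFromWkt. I looked at SpatialExample.java in the official Lucene GitHub repo.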
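To get a feel for what MAX_LEVEL = 11 means, here is a minimal sketch that uses only the standard geohash layout (5 bits per character, alternating longitude and latitude bits, longitude first) to compute the cell size per level and a rough count of cells needed to tile the test polygon's bounding box. The class and method names are made up for illustration and are not part of Lucene. The count is only an estimate: as far as I know, the strategy's distErrPct setting can make RecursivePrefixTreeStrategy stop at a coarser level than MAX_LEVEL, so the number of cells actually indexed may be lower.

```java
public class GeohashCellSize {
    // Each geohash character encodes 5 bits (base-32 alphabet).
    private static final int BITS_PER_CHAR = 5;

    /** Cell width in degrees of longitude for a geohash of the given length. */
    static double cellWidthDegrees(int level) {
        int totalBits = level * BITS_PER_CHAR;
        int lonBits = (totalBits + 1) / 2; // longitude takes the first bit, so it rounds up
        return 360.0 / Math.pow(2, lonBits);
    }

    /** Cell height in degrees of latitude for a geohash of the given length. */
    static double cellHeightDegrees(int level) {
        int latBits = (level * BITS_PER_CHAR) / 2;
        return 180.0 / Math.pow(2, latBits);
    }

    /** Rough number of cells needed to tile a box; ignores how the box aligns with the grid. */
    static long estimateCellsForBox(double widthDeg, double heightDeg, int level) {
        long cols = (long) Math.ceil(widthDeg / cellWidthDegrees(level));
        long rows = (long) Math.ceil(heightDeg / cellHeightDegrees(level));
        return cols * rows;
    }

    public static void main(String[] args) {
        // Bounding box of the test polygon used in the question.
        double width = 9.0843574 - 9.0842201;
        double height = 48.8033461997411 - 48.803237199741126;
        for (int level = 5; level <= 11; level++) {
            System.out.printf("level %2d: cell %.3e x %.3e deg, ~%d cells for the bounding box%n",
                    level, cellWidthDegrees(level), cellHeightDegrees(level),
                    estimateCellsForBox(width, height, level));
        }
    }
}
```

The estimate grows by roughly a factor of 32 per level once the cells are smaller than the polygon, which is one plausible reason why a high MAX_LEVEL makes prefix-tree indexing of small polygons expensive.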

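Now, as I said, I want to index 10,000 polygons, which is a small dataset for my use case. I have two ways of indexing this data, referred to below as Case A and Case B.

Here is the logic I use to add these polygons to the index: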

        // Start Case A
        List<String> testDataCaseA = new ArrayList<>();     
        for (int i = 0; i < 10000; i++) {
            testDataCaseA.add("POLYGON((9.0842201 48.80324419974113,9.084344 48.803237199741126,9.0843574 48.80333909974109,9.0842334 48.8033461997411,9.0842201 48.80324419974113))");
        }
        
        long startCaseA = System.nanoTime();

        testDataCaseA.parallelStream().forEach(current -> {
            try {
                this.indexWriter.addDocument(createDocumentCaseA(current));
            } catch (InvalidShapeException | IOException | java.text.ParseException e) {
                logger.error(e.toString());
            }
        });
        
        // Divide by a double so the result is not truncated to whole milliseconds
        double elapsedTimeCaseA = (System.nanoTime() - startCaseA) / 1_000_000.0;
        logger.trace("Elapsed Time: " + elapsedTimeCaseA + "ms");
        // End Case A
        
        // Deleting the index
        this.indexWriter.deleteAll();
        
        // Start Case B
        List<String> testDataCaseB = new ArrayList<>(); 
        for (int i = 0; i < 10000; i++) {
            testDataCaseB.add("{\"type\":\"Polygon\",\"coordinates\":[[[9.0842201,48.80324419974113],[9.084344,48.803237199741126],[9.0843574,48.80333909974109],[9.0842334,48.8033461997411],[9.0842201,48.80324419974113]]]}");
        }
        
        long startCaseB = System.nanoTime();
        
        testDataCaseB.parallelStream().forEach(current -> {
            try {
                this.indexWriter.addDocument(createDocumentCaseB(current));
            } catch (java.text.ParseException | IOException e) {
                logger.error(e.toString());
            }
        });
        
        // Divide by a double so the result is not truncated to whole milliseconds
        double elapsedTimeCaseB = (System.nanoTime() - startCaseB) / 1_000_000.0;
        logger.trace("Elapsed Time: " + elapsedTimeCaseB + "ms");
        // End Case B

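The difference between the two variants is striking:

Case A elapsed time: 41522.0ms

Case B elapsed time: 168.0ms

OK, I thought: "Fine, then I'll go with Case B and all is well." But my question is about understanding: what is the "right way" to do this? Also, when it comes to spatial search, in Case A I have the method strategy.makeQuery(SpatialArgs), whereas in Case B I need to use LatLonShape.createXYQuery(something).

Which way should I choose? Am I missing something in the Lucene documentation?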

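Take a look at the links in that article. They explain the difference between the two approaches (essentially triangles versus rectangles) and the impressive improvement LatLonShape offers over RecursivePrefixTreeStrategy when indexing polygons.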
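The "triangles" side of that comparison can be illustrated with a small sketch. As I understand it, LatLonShape tessellates each polygon into triangles and indexes every triangle as a multi-dimensional point. The code below is not Lucene's tessellator (which, to my knowledge, uses ear clipping and handles concave polygons and holes); it is a simplified fan triangulation that only works for convex rings, and the class name is hypothetical. It is enough to show that the test polygon from the question breaks down into just two triangles.

```java
import java.util.ArrayList;
import java.util.List;

public class FanTriangulation {
    /**
     * Splits a closed, convex ring (last vertex repeats the first) into triangles.
     * Each triangle is returned as {x0, y0, x1, y1, x2, y2}.
     */
    static List<double[]> triangulate(double[][] closedRing) {
        int n = closedRing.length - 1; // number of distinct vertices
        if (n < 3) {
            throw new IllegalArgumentException("need at least 3 distinct vertices");
        }
        List<double[]> triangles = new ArrayList<>();
        double[] a = closedRing[0]; // every triangle shares the first vertex
        for (int i = 1; i < n - 1; i++) {
            double[] b = closedRing[i];
            double[] c = closedRing[i + 1];
            triangles.add(new double[] {a[0], a[1], b[0], b[1], c[0], c[1]});
        }
        return triangles;
    }

    public static void main(String[] args) {
        // The test polygon from the question, as (lon, lat) pairs.
        double[][] ring = {
            {9.0842201, 48.80324419974113},
            {9.084344, 48.803237199741126},
            {9.0843574, 48.80333909974109},
            {9.0842334, 48.8033461997411},
            {9.0842201, 48.80324419974113}
        };
        System.out.println(triangulate(ring).size() + " triangles"); // prints "2 triangles"
    }
}
```

Two triangles per document, compared with the many grid cells a deep prefix tree may need to cover the same shape, is consistent with the large gap between the Case A and Case B timings.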

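In case anyone is interested, I also looked at Lucene's query performance relative to various other () options, and it holds up really well.

Tested against an index of the Geofabrik OSM England land-use polygons, querying each index with the same 10,000 random points/polygons: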

Benchmark                                  Mode  Cnt   Score    Error  Units
GeotoolsBenchmark.pointIntersectsQuery    thrpt    3  13.640 ± 83.265  ops/s
GeotoolsBenchmark.polygonIntersectsQuery  thrpt    3   0.101 ±  0.422  ops/s
LuceneBenchmark.pointIntersectsQuery      thrpt    3   0.108 ±  0.514  ops/s
LuceneBenchmark.polygonIntersectsQuery    thrpt    3   0.092 ±  0.117  ops/s
MongoDbBenchmark.pointQuery               thrpt    3   0.095 ±  0.049  ops/s
MongoDbBenchmark.polygonQuery             thrpt    3   0.028 ±  0.022  ops/s
PostgisBenchmark.pointQuery               thrpt    3   0.091 ±  0.142  ops/s
PostgisBenchmark.polygonQuery             thrpt    3   0.065 ±  0.068  ops/s
