Neo4j and Java: fast random sample of an Iterable<Relationship>
I wrote a traversal in Java that returns an Iterable<Relationship>. In the worst case it holds 850,784 relationships, which is too large to materialize.
Goal: I just want to sample 20 relationships (without replacement), and I want to do it fast.
Solution 1: calling toList() or putting the results into any kind of collection takes too long (> 1 minute). I know I could then use shuffle() and so on, but that is unacceptable.
Solution 2: to work directly on the Iterable, I used the Guava collect library. I timed each of the following 3 steps in milliseconds (using System.nanoTime() and dividing by 1000000). Getting the size of the Iterable, which I need for the random number generator, turned out to be the real bottleneck.
/* TRAVERSAL: 5 ms */
Iterable<Relationship> simrels = traversal1.traverse(user).relationships();

/* GET ITERABLE SIZE: 74669 ms */
int simrelssize = com.google.common.collect.Iterables.size(simrels);

/* RANDOM SAMPLE OF 20: 28321 ms */
long seed = System.nanoTime();
int[] idxs = new int[20];
Random randomGenerator = new XSRandom(seed);
for (int i = 0; i < idxs.length; ++i) {
    idxs[i] = randomGenerator.nextInt(simrelssize);
}
// Sort the indices so the Iterable only needs to be walked once.
Arrays.sort(idxs);
List<Relationship> simrelslist2 = new ArrayList<Relationship>();
Iterator<Relationship> it = simrels.iterator();
int current = -1;
for (int idx : idxs) {
    if (idx == current) {
        continue; // nextInt() can repeat an index; skip the duplicate
    }
    // Advance a single iterator by the gap between consecutive sorted indices;
    // calling Iterables.get(simrels, ...) for each index would restart the
    // lazy traversal from the beginning every time.
    com.google.common.collect.Iterators.advance(it, idx - current - 1);
    simrelslist2.add(it.next());
    current = idx;
}
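A side note on the index generation above: nextInt() can draw the same index twice, so the sample is effectively with replacement. One way to draw k distinct, already-sorted indices in O(k) time is Robert Floyd's sampling algorithm; the sketch below is illustrative (class and method names are made up, not from the original code):

```java
import java.util.Random;
import java.util.SortedSet;
import java.util.TreeSet;

public class FloydSample {
    // Robert Floyd's sampling: k distinct indices drawn uniformly from [0, n),
    // returned sorted so a lazy Iterable can then be walked in a single pass.
    public static SortedSet<Integer> distinctIndices(int n, int k, Random rnd) {
        SortedSet<Integer> chosen = new TreeSet<>();
        for (int i = n - k; i < n; i++) {
            int j = rnd.nextInt(i + 1);  // uniform in [0, i]
            if (!chosen.add(j)) {
                chosen.add(i);           // j already taken: i itself is still free
            }
        }
        return chosen;
    }
}
```

Each of the k iterations adds exactly one new index, so the result always has exactly k distinct elements.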
neo4j.properties
# Enable this to be able to upgrade a store from an older version.
#allow_store_upgrade=true
# The amount of memory to use for mapping the store files, either in bytes or
# as a percentage of available memory. This will be clipped at the amount of
# free memory observed when the database starts, and automatically be rounded
# down to the nearest whole page. For example, if "500MB" is configured, but
# only 450MB of memory is free when the database starts, then the database will
# map at most 450MB. If "50%" is configured, and the system has a capacity of
# 4GB, then at most 2GB of memory will be mapped, unless the database observes
# that less than 2GB of memory is free when it starts.
#mapped_memory_total_size=50%
# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0
# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons. To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true".
#keep_logical_logs=7 days
# Autoindexing
# Enable auto-indexing for nodes, default is false.
#node_auto_indexing=true
# The node property keys to be auto-indexed, if enabled.
#node_keys_indexable=name,age
# Enable auto-indexing for relationships, default is false.
#relationship_auto_indexing=true
# The relationship property keys to be auto-indexed, if enabled.
#relationship_keys_indexable=name,age
# Enable shell server so that remote clients can connect via Neo4j shell.
#remote_shell_enabled=true
# The network interface IP the shell will listen on (use 0.0.0.0 for all interfaces).
#remote_shell_host=127.0.0.1
# The port the shell will listen on, default is 1337.
#remote_shell_port=1337
# The type of cache to use for nodes and relationships.
#cache_type=hpc
# Maximum size of the heap memory to dedicate to the cached nodes.
#node_cache_size=
# Maximum size of the heap memory to dedicate to the cached relationships.
#relationship_cache_size=
# Enable online backups to be taken from this database.
online_backup_enabled=true
# Port to listen to for incoming backup requests.
online_backup_server=127.0.0.1:6362
# Uncomment and specify these lines for running Neo4j in High Availability mode.
# See the High availability setup tutorial for more details on these settings
# http://neo4j.com/docs/2.2.0-M02/ha-setup-tutorial.html
# ha.server_id is the number of each instance in the HA cluster. It should be
# an integer (e.g. 1), and should be unique for each cluster instance.
#ha.server_id=
# ha.initial_hosts is a comma-separated list (without spaces) of the host:port
# where the ha.cluster_server of all instances will be listening. Typically
# this will be the same for all cluster instances.
#ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001
# IP and port for this instance to listen on, for communicating cluster status
# information with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.cluster_server=192.168.0.1:5001
# IP and port for this instance to listen on, for communicating transaction
# data with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.server=192.168.0.1:6001
# The interval at which slaves will pull updates from the master. Comment out
# the option to disable periodic pulling of updates. Unit is seconds.
ha.pull_interval=10
# Amount of slaves the master will try to push a transaction to upon commit
# (default is 1). The master will optimistically continue and not fail the
# transaction even if it fails to reach the push factor. Setting this to 0 will
# increase write performance when writing through master but could potentially
# lead to branched data (or loss of transaction) if the master goes down.
#ha.tx_push_factor=1
# Strategy the master will use when pushing data to slaves (if the push factor
# is greater than 0). There are two options available "fixed" (default) or
# "round_robin". Fixed will start by pushing to slaves ordered by server id
# (highest first) improving performance since the slaves only have to cache up
# one transaction at a time.
#ha.tx_push_strategy=fixed
# Policy for how to handle branched data.
#branched_data_policy=keep_all
# Clustering timeouts
# Default timeout.
#ha.default_timeout=5s
# How often heartbeat messages should be sent. Defaults to ha.default_timeout.
#ha.heartbeat_interval=5s
# Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
#heartbeat_timeout=11s
You have a Traverser:
Traverser traverser = traversal1.traverse(user);
int size = traverser.metadata().getNumberOfRelationshipsTraversed();
Iterable<Relationship> simrels = traverser.relationships();
Now that you have the size, you can optimize your random picker. The reason Neo4j returns an Iterable is that it performs the traversal while you iterate. To take a sample, I'm afraid you have to "visit" every relationship. Yes, you can skip some, but at the end of the day you still have to iterate over everything.
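The point about lazy iteration can be illustrated without Neo4j: a lazy Iterable produces its elements only while being consumed, so even just counting it costs a full pass. A minimal sketch (the names LazyCount, lazyRange, and size are invented for this illustration):

```java
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyCount {
    // A lazy Iterable that records how many elements it has produced,
    // mimicking a traversal that only runs while you iterate over it.
    public static Iterable<Integer> lazyRange(int n, AtomicInteger produced) {
        return () -> new Iterator<Integer>() {
            private int next = 0;
            public boolean hasNext() { return next < n; }
            public Integer next() { produced.incrementAndGet(); return next++; }
        };
    }

    // Counting the elements forces the entire traversal to run.
    public static int size(Iterable<?> items) {
        int count = 0;
        for (Object ignored : items) { count++; }
        return count;
    }
}
```

Nothing is produced until iteration starts; computing the size consumes every element, which is why the Iterables.size() step in the question dominates the runtime.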
We use a "reservoir sampling" algorithm. For the reasons above, I'm not sure it will perform any better. That said, with a warm cache you should be able to sample 1M relationships in well under a second. If it takes longer, you may need to tweak your memory settings a bit.

Can't you get the size when the Iterable is created? Also, could you just arbitrarily cut the iterable down to its first 20 elements?

@Olivier: I can't take the first 20 elements, because the order in the iterable is imposed by the traversal and I can't change that. As for the size, there is no size() or similar method available after .relationships().

Wow, thanks! I didn't know about Traverser... OK, I get a NullPointerException on int size = traverser.metadata().getNumberOfRelationshipsTraversed();. I'll investigate, but if you have any idea please don't hesitate. I'm not very good at debugging with Eclipse; I set a breakpoint on that line and I can see the traverser "framework" internals, but I can't spot what causes this NullPointerException. Can anyone help? The program runs fine and I have no problem with the traversal results... so why can't I retrieve its size? I get Thread.dispatchUncaughtException(Throwable) line: not available, with arg0 set to NullPointerException (id=89).

OK, thanks. I ran a sample with Iterable<Relationship> simrels2 = com.graphaware.common.util.IterableUtils.random(simrels, 20); and got the result in 71658 ms, which is nice because I don't need the size of the Iterable. By the way, what do you think of Olivier's answer above?

Too slow. Could you share the contents of your neo4j.properties and neo4j-wrapper.properties files? As for Olivier's answer, I don't see how it would help. There is no way around visiting all the relationships.

Thanks Michal, please find the file contents above.

Are you using Neo4j 2.2.x? Could you try putting something like dbms.pagecache.memory=6g into neo4j.properties, maybe reduce the heap size a bit, and re-test the sampling? I'm interested in the timings of subsequent runs (i.e. not the first one against a cold cache).

I'm using version 2.2.0-M02. Honestly, I can't find the dbms.pagecache.memory parameter in any of the config files in the conf folder...
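The "reservoir sampling" mentioned in the comments can be sketched in plain Java. This is a generic one-pass version (Algorithm R), not GraphAware's actual IterableUtils.random implementation: it needs no size up front and samples k items without replacement while visiting each element exactly once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSample {
    // Algorithm R: one pass over the Iterable, O(n) time, O(k) memory,
    // and no need to know the total size in advance.
    public static <T> List<T> sample(Iterable<T> items, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T item : items) {
            if (seen < k) {
                reservoir.add(item);           // fill the reservoir first
            } else {
                int j = rnd.nextInt(seen + 1); // keep item with probability k/(seen+1)
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
            seen++;
        }
        return reservoir;
    }
}
```

Because it is a single pass, it avoids both the separate Iterables.size() pass and the repeated restarts of Iterables.get(), at the cost of still having to visit every relationship once.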