Neo4j and Java: fast random sample of an Iterable<Relationship>

Tags: java, neo4j, guava

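I wrote a traversal in Java that returns an Iterable<Relationship>. In the worst case it holds 850784 relationships, which is an unwieldy size.

Goal: I want to sample just 20 relationships (without replacement), and I want to do it fast.

Solution 1: Calling toList() or copying the result into some kind of Collection takes too long (> 1 minute). I know I could then use shuffle() and so on, but that cost is unacceptable.

Solution 2: To do this directly on the Iterable, I used the Guava collect library. I timed each of the 3 steps below in milliseconds (using System.nanoTime() and dividing by 1000000). I need the size of the Iterable for the random number generator, and getting it is the real bottleneck.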

    /* TRAVERSAL: 5 ms */
    Iterable<Relationship> simrels = traversal1.traverse(user).relationships();

    /* GET ITERABLE SIZE: 74669 ms */
    int simrelssize = com.google.common.collect.Iterables.size(simrels);

    /* RANDOM SAMPLE OF 20: 28321 ms*/
    long seed = System.nanoTime();
    int[] idxs = new int[20];
    Random randomGenerator = new XSRandom(seed);
    for (int i = 0; i < idxs.length; ++i){
        int randomInt = randomGenerator.nextInt(simrelssize);
        idxs[i]=randomInt;
    }
    Arrays.sort(idxs);

    List<Relationship> simrelslist2 = new ArrayList<Relationship>();
    for(int i = 0; i < idxs.length; ++i){
        if (i > 0) {
            int pos = idxs[i]-idxs[i-1];
            simrelslist2.add(com.google.common.collect.Iterables.get(simrels, pos));
        }
        else{
            simrelslist2.add(com.google.common.collect.Iterables.get(simrels, idxs[i]));
        }
    }
neo4j.properties

# Enable this to be able to upgrade a store from an older version.
#allow_store_upgrade=true

# The amount of memory to use for mapping the store files, either in bytes or
# as a percentage of available memory. This will be clipped at the amount of
# free memory observed when the database starts, and automatically be rounded
# down to the nearest whole page. For example, if "500MB" is configured, but
# only 450MB of memory is free when the database starts, then the database will
# map at most 450MB. If "50%" is configured, and the system has a capacity of
# 4GB, then at most 2GB of memory will be mapped, unless the database observes
# that less than 2GB of memory is free when it starts.
#mapped_memory_total_size=50%

# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0

# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons. To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true".
#keep_logical_logs=7 days

# Autoindexing

# Enable auto-indexing for nodes, default is false.
#node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled.
#node_keys_indexable=name,age

# Enable auto-indexing for relationships, default is false.
#relationship_auto_indexing=true

# The relationship property keys to be auto-indexed, if enabled.
#relationship_keys_indexable=name,age

# Enable shell server so that remote clients can connect via Neo4j shell.
#remote_shell_enabled=true
# The network interface IP the shell will listen on (use 0.0.0.0 for all interfaces).
#remote_shell_host=127.0.0.1
# The port the shell will listen on, default is 1337.
#remote_shell_port=1337

# The type of cache to use for nodes and relationships.
#cache_type=hpc

# Maximum size of the heap memory to dedicate to the cached nodes.
#node_cache_size=

# Maximum size of the heap memory to dedicate to the cached relationships.
#relationship_cache_size=

# Enable online backups to be taken from this database.
online_backup_enabled=true

# Port to listen to for incoming backup requests.
online_backup_server=127.0.0.1:6362


# Uncomment and specify these lines for running Neo4j in High Availability mode.
# See the High availability setup tutorial for more details on these settings
# http://neo4j.com/docs/2.2.0-M02/ha-setup-tutorial.html

# ha.server_id is the number of each instance in the HA cluster. It should be
# an integer (e.g. 1), and should be unique for each cluster instance.
#ha.server_id=

# ha.initial_hosts is a comma-separated list (without spaces) of the host:port
# where the ha.cluster_server of all instances will be listening. Typically
# this will be the same for all cluster instances.
#ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001

# IP and port for this instance to listen on, for communicating cluster status
# information with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.cluster_server=192.168.0.1:5001

# IP and port for this instance to listen on, for communicating transaction
# data with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.server=192.168.0.1:6001

# The interval at which slaves will pull updates from the master. Comment out
# the option to disable periodic pulling of updates. Unit is seconds.
ha.pull_interval=10

# Amount of slaves the master will try to push a transaction to upon commit
# (default is 1). The master will optimistically continue and not fail the
# transaction even if it fails to reach the push factor. Setting this to 0 will
# increase write performance when writing through master but could potentially
# lead to branched data (or loss of transaction) if the master goes down.
#ha.tx_push_factor=1

# Strategy the master will use when pushing data to slaves (if the push factor
# is greater than 0). There are two options available "fixed" (default) or
# "round_robin". Fixed will start by pushing to slaves ordered by server id
# (highest first) improving performance since the slaves only have to cache up
# one transaction at a time.
#ha.tx_push_strategy=fixed

# Policy for how to handle branched data.
#branched_data_policy=keep_all

# Clustering timeouts
# Default timeout.
#ha.default_timeout=5s

# How often heartbeat messages should be sent. Defaults to ha.default_timeout.
#ha.heartbeat_interval=5s

# Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
#heartbeat_timeout=11s

You have a Traverser:

    Traverser traverser = traversal1.traverse(user);
    int size = traverser.metadata().getNumberOfRelationshipsTraversed();
    Iterable<Relationship> simrels = traverser.relationships();

Now that you have the size, you can optimize your random selector.
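One way to use a known size, sketched here as an illustration rather than as the answerer's own code: draw the distinct positions up front, then walk the Iterable once and pick elements as their positions come up. The class name IndexSampler and the method sampleWithKnownSize are hypothetical, and the sketch assumes the size passed in matches the number of elements the Iterable actually yields.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

// Hypothetical helper, not part of Neo4j or Guava.
final class IndexSampler {

    /** Picks up to k distinct positions in [0, size) and collects those elements in a single pass. */
    static <T> List<T> sampleWithKnownSize(Iterable<T> source, int size, int k, Random rnd) {
        int n = Math.max(0, Math.min(k, size));

        // A TreeSet keeps the chosen positions distinct (no replacement) and sorted.
        TreeSet<Integer> chosen = new TreeSet<>();
        while (chosen.size() < n) {
            chosen.add(rnd.nextInt(size));
        }

        List<T> sample = new ArrayList<>(n);
        Iterator<Integer> wanted = chosen.iterator();
        if (!wanted.hasNext()) {
            return sample;
        }

        int next = wanted.next();
        int pos = 0;
        // One iterator over the source; stop as soon as the last chosen position is reached.
        for (T item : source) {
            if (pos == next) {
                sample.add(item);
                if (!wanted.hasNext()) {
                    break;
                }
                next = wanted.next();
            }
            pos++;
        }
        return sample;
    }
}
```

Because a single iterator is reused, the walk never restarts from the first element, and it ends early once the largest chosen position has been collected.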

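The reason Neo4j returns an Iterable is that it performs the traversal as you iterate. To sample, I'm afraid you have to "visit" every single relationship. Yes, you can skip some, but at the end of the day you still need to iterate over all of them.

We use the "reservoir sampling" algorithm. For the reason above, I'm not sure it will perform any better. That said, with hot caches you should be able to sample 1M relationships in under 1 second. If it takes longer, you may need to tweak your memory settings a bit.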
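For readers unfamiliar with reservoir sampling, here is a minimal sketch of the textbook version (often called Algorithm R). It is not the answerer's library code: the class name ReservoirSampler and the method sample are hypothetical, and the sketch assumes the source yields fewer than Integer.MAX_VALUE elements. It draws k elements without replacement in one pass and never needs the size of the Iterable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper illustrating textbook reservoir sampling (Algorithm R).
final class ReservoirSampler {

    /** Returns up to k elements chosen uniformly without replacement, in a single pass. */
    static <T> List<T> sample(Iterable<T> source, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(Math.max(k, 0));
        if (k <= 0) {
            return reservoir;
        }

        int seen = 0; // number of elements visited so far (assumed to stay below Integer.MAX_VALUE)
        for (T item : source) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k elements.
                reservoir.add(item);
            } else {
                // Keep the new element with probability k / seen,
                // replacing a uniformly chosen slot of the reservoir.
                int j = rnd.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }
}
```

Under these assumptions a call such as ReservoirSampler.sample(simrels, 20, new Random()) would return 20 relationships, but it still visits every element, which is consistent with the point above that the traversal cost cannot be avoided.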

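Comments:

Can't you get the size of the Iterable when it is created? Also, could you arbitrarily cut the iterable down to its first 20 elements?

@Olivier: I can't take the first 20 elements, because the order in the iterable is imposed by the traversal and I can't change that. As for the size, there is no size() or similar method after .relationships().

Wow, thanks! I didn't know about Traverser...

OK, I get a NullPointerException on int size = traverser.metadata().getNumberOfRelationshipsTraversed(). I'll investigate, but if you have an idea, please don't hesitate.

I'm not very good at debugging with Eclipse. I added a breakpoint on the corresponding line and I can see the traverser "framework" contents, but I can't detect the "out of bounds" cause of this NullPointerException. Can someone help? The program runs fine and I have no problem with the traversal results... so why can't I retrieve the size? I get Thread.dispatchUncaughtException(Throwable) line: not available, with arg0 set to NullPointerException (id=89).

OK, thanks. I ran a sample with Iterable<Relationship> simrels2 = com.graphaware.common.util.IterableUtils.random(simrels, 20) and got the result in 71658 ms, which is good because I don't need the Iterable size information. By the way, what is your comment on Olivier's answer above?

That's too slow. Could you share the contents of your neo4j.properties and neo4j-wrapper.properties files? As for Olivier's answer, I don't see how it would help. There is no way around visiting all the relationships.

Thanks Michal, please find the file contents above.

Are you using Neo4j 2.2.x? Could you try putting something like dbms.pagecache.memory=6g in neo4j.properties, or reducing the heap size a little, and then retest the sampling? I'm interested in the timings of subsequent runs (i.e. not the first one, with cold caches) using Neo4j 2.2.0-M02.

Honestly, I can't find the dbms.pagecache.memory parameter in any of the config files in the conf folder...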