Neo4J collaborative filtering slower than expected
I'm implementing a recommendation system on a Neo4J graph. I've just started working on the query I plan to use, but it's executing much more slowly than I expected.
Stats
Neo4J Version: 2.3.1
Nodes: 820K
Relationships: 7.6M
Indexes
ON :LIKES(created_at) ONLINE
ON :Product(id) ONLINE
ON :Product(created_at) ONLINE
ON :User(id) ONLINE
ON :User(date_joined) ONLINE
No constraints
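I've done quite a bit of research on query optimization, and as far as I can see I'm not making any of the common mistakes in the query structure (but I'm no expert).
Here's a developer console with a test dataset:
Query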
MATCH (u1:User {id: {user_id}})-[l1:LIKES]->(p1:Product)
WITH u1, l1, p1
ORDER BY p1.created_at DESC
LIMIT 10
MATCH (p1)<-[:LIKES]-(u2:User)
WHERE NOT u1=u2
WITH u1, l1, p1, u2, COUNT(u2) as rating
ORDER BY rating DESC
LIMIT 50
MATCH (u2)-[l2:LIKES]->(recommendation:Product)
WHERE NOT (p1)=(recommendation)
WITH recommendation, COUNT(recommendation) as weight
RETURN recommendation.id as id
ORDER BY weight DESC
LIMIT {limit}
Query profile output (against a copy of our production dataset)
+-------------------+----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| Operator          | EstimatedRows  | Rows   | DbHits  | Identifiers                                | Other                                                   |
+-------------------+----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +ProduceResults   |              7 |    100 |       0 | id                                         | id                                                      |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Projection       |              7 |    100 |       0 | anon[382], id, recommendation, weight      | anon[382]                                               |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Top              |              7 |    100 |       0 | anon[382], recommendation, weight          | Literal(100); weight                                    |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Projection       |              7 | 129342 |  129342 | anon[382], recommendation, weight          | recommendation.id; weight                               |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +EagerAggregation |              7 | 129342 |       0 | recommendation, weight                     | recommendation                                          |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Filter           |             44 | 442432 |  471953 | l1, l2, p1, rating, recommendation, u1, u2 | Ands(NOT(p1 == recommendation), recommendation:Product) |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Expand(All)      |             44 | 472039 |  472089 | l1, l2, p1, rating, recommendation, u1, u2 | (u2)-[l2:LIKES]->(recommendation)                       |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Top              |             10 |     50 |       0 | l1, p1, rating, u1, u2                     | Literal(50); rating                                     |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +EagerAggregation |             10 |    527 |       0 | l1, p1, rating, u1, u2                     | u1, l1, p1, u2                                          |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Filter           |             92 |    563 |     563 | anon[82], anon[119], l1, p1, u1, u2        | Ands(NOT(u1 == u2), u2:User)                            |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Expand(All)      |             92 |    574 |     584 | anon[82], anon[119], l1, p1, u1, u2        | (p1)<-[:LIKES]-(u2)                                     |
| |                 +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +NodeIndexSeek    |              1 |      1 |       2 | u1                                         | :User(id)                                               |
+-------------------+----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
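I've seen case studies where people are using Neo4j for real-time collaborative filtering, so I'd think it must be possible to run this kind of query on a dataset of this size. Am I being unrealistic? We're running this on an Amazon EC2 compute-optimized node (c4.large), so I'd expect it to perform reasonably well.
I'm scratching my head here, so any input is much appreciated.
Cheers,
David
[Aside: When the developer console is reopened, the indexes are not re-created, so they have to be re-created manually.]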
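I don't know if this will be good enough for you, but you can eliminate about 44% of the DB hits in the profile results simply by not specifying labels for most of the nodes in the query (p1, u2, and recommendation):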
MATCH (u1:User {id: {user_id}})-[l1:LIKES]->(p1)
WITH u1, l1, p1
ORDER BY p1.created_at DESC
LIMIT 10
MATCH (p1)<-[:LIKES]-(u2)
WHERE NOT u1=u2
WITH u1, l1, p1, u2, COUNT(u2) as rating
ORDER BY rating DESC
LIMIT 50
MATCH (u2)-[l2:LIKES]->(recommendation)
WHERE NOT (p1)=(recommendation)
WITH recommendation, COUNT(recommendation) as weight
RETURN recommendation.id as id
ORDER BY weight DESC
LIMIT {limit}
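As a rough sanity check on the "about 44%" figure, the DB hits in the profile posted in the question can be added up. This is only a sketch under two assumptions: that the figure can be applied to that posted profile (the answer does not say which run it was measured on), and that the DB hits of the two Filter operators come from the label checks (recommendation:Product and u2:User) rather than from the node comparisons. The operator names and numbers are copied from the profile; the variable names are made up for illustration.

```python
# DB hits per operator, copied from the PROFILE output in the question.
profile_db_hits = [
    ("ProduceResults", 0),
    ("Projection", 0),
    ("Top", 0),
    ("Projection", 129342),
    ("EagerAggregation", 0),
    ("Filter", 471953),       # includes the recommendation:Product label check
    ("Expand(All)", 472089),
    ("Top", 0),
    ("EagerAggregation", 0),
    ("Filter", 563),          # includes the u2:User label check
    ("Expand(All)", 584),
    ("NodeIndexSeek", 2),
]

# Total DB hits for the whole plan.
total_hits = sum(hits for _, hits in profile_db_hits)

# Assumption: all Filter DB hits are due to label checks, so dropping
# the labels from the pattern would remove them.
filter_hits = sum(hits for op, hits in profile_db_hits if op == "Filter")

filter_share = filter_hits / total_hits

print(f"total DB hits:  {total_hits}")
print(f"Filter DB hits: {filter_hits}")
print(f"share:          {filter_share:.1%}")
```

On these numbers the two Filter operators account for 472516 of 1074533 DB hits, which is roughly 44%. Note that dropping the labels is only safe if LIKES relationships always run from User nodes to Product nodes; otherwise the unlabeled query could return different results.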