Neo4j Cypher: how to normalize PageRank scores

I have a number of papers in Neo4j that cite one another.

The data looks like this:

{"title": "TitleWave", "year": 2010, "references": ["002", "003"], "id": "001"}
{"title": "Title002", "year": 2005, "references": ["003", "004"], "id": "002"}
{"title": "RealTitle", "year": 2000,  "references": ["004", "001"], "id": "003"}
{"title": "Title004", "year": 2014, "references": ["001", "002"], "id": "004"}

I created the relationships like this:

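// Load each JSON record and create one Paper node per id
CALL apoc.load.json('file.txt') YIELD value AS q
MERGE (p:Paper {id: q.id}) ON CREATE SET
  p.title = q.title,
  p.year = q.year,  // stored so the p.year < 2015 filter in the PageRank query can match
  p.refs = q.references
WITH p
// Link each paper to every paper it references
UNWIND p.refs AS ref
MATCH (p2:Paper {id: ref})
MERGE (p)-[:CITES]->(p2);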

I want to run the algo.pageRank.stream procedure to get a set of PageRank scores and then normalize them, on a large dataset. Can I do this efficiently in a single query?

This runs the PageRank algorithm, but does not normalize the scores:

CALL algo.pageRank.stream(
'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
WITH *,
node.title AS title,  
score AS page_rank,
log(score) AS impact
ORDER BY impact DESC
LIMIT 100
RETURN title, page_rank, impact;

Any suggestions would be greatly appreciated.

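Can you elaborate on what you mean by "normalize"? Does the normalization need to be linear (n/max), or can it be logarithmic/exponential (n/(n+100))? Does it need to be normalized to a value between 0 and 1?

In the failing query, the new variables are not available until after the WITH clause that defines them, so you need another WITH to do the division. Note that everything would then normalize to 1, because max() would take the max of a single element for each row. (It is an aggregation, so it only aggregates across rows in which every other projected value is the same.) For each row it is therefore effectively n/n.

I'm not sure what the best approach is, but to start with I'm thinking of normalizing between 0 and 1. I realize that taking the log may also produce some scores below 0...

Got it. I want to find the maximum across all scores, not just the score of that one row. Do you suggest I collect all the scores and then find the max of that collection?

Then would n/(n+100) work for your normalization? (It converts linear growth into a percentage, so 100 = 50%, 200 = 66.67%, 900 = 90%, 9000 = 98.9%. You can change the constant 100 to whatever value you like.) Normalizing each value on its own is much cheaper than collapsing all values to get the max and then expanding them all again.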
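The two points made in the comments can be checked on a handful of literal values, without running PageRank at all. The following is a minimal sketch: the input list is made-up sample data standing in for scores, and the column names ratio_to_max and saturating are chosen here for illustration only.

```cypher
// Made-up sample values standing in for scores (not real PageRank output)
UNWIND [100.0, 200.0, 900.0, 9000.0] AS n
// n is projected next to the aggregate, so it acts as the grouping key:
// max(n) is computed per distinct n, not over the whole list
WITH n, max(n) AS max_val
RETURN n,
       n / max_val     AS ratio_to_max,  // n / n, so 1.0 on every row
       n / (n + 100.0) AS saturating     // per-row value, no aggregation needed
ORDER BY n;
```

The saturating column reproduces the percentages quoted in the comment (0.5 for 100, 0.9 for 900, and so on). Raw PageRank scores are usually far smaller than 100, so the constant would probably have to be chosen on the scale of the actual scores.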

The failing query referred to in the comments, followed by the error it raises:

CALL algo.pageRank.stream(
'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
WITH *,
node.title AS title, 
score AS page_rank,
log(score) AS impact,
max(log(score)) as max_val,
impact / max_val as impact_norm
ORDER BY impact_norm DESC
LIMIT 100
RETURN title, page_rank, impact_norm;

Variable `impact` not defined (line 18, column 1 (offset: 539))
"impact / max_val as impact_norm"
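One way to get a maximum over all rows, as discussed in the comments, is to aggregate in one WITH that has no grouping key and then expand the rows again. The sketch below assumes the same procedure call and YIELD columns as the question; the names rows, max_score and page_rank_norm are introduced here for illustration. It divides the raw score rather than log(score), because, as noted in the comments, the log can be negative.

```cypher
CALL algo.pageRank.stream(
  'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) AS id',
  'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) AS source, id(p2) AS target',
  {graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
// Stage 1: only aggregates are projected, so there is no grouping key
// and max(score) spans every row of the stream
WITH collect({title: node.title, page_rank: score}) AS rows,
     max(score) AS max_score
// Stage 2: expand back to one row per paper; max_score is now a plain variable
UNWIND rows AS row
RETURN row.title AS title,
       row.page_rank AS page_rank,
       row.page_rank / max_score AS page_rank_norm
ORDER BY page_rank_norm DESC
LIMIT 100;
```

This holds every score in memory at once in the collected list, which is the cost the last comment warns about. On a very large graph the per-row n/(n+k) form avoids that step.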