Neo4j Cypher: how to normalize PageRank scores


I have a large number of papers in Neo4j that cite one another.

The data looks like this:

{"title": "TitleWave", "year": 2010, "references": ["002", "003"], "id": "001"}
{"title": "Title002", "year": 2005, "references": ["003", "004"], "id": "002"}
{"title": "RealTitle", "year": 2000,  "references": ["004", "001"], "id": "003"}
{"title": "Title004", "year": 2014, "references": ["001", "002"], "id": "004"}
I built the relationships as follows:

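// Load one JSON object per line and create a Paper node for each.
CALL apoc.load.json('file.txt') YIELD value AS q
MERGE (p:Paper {id: q.id})
ON CREATE SET
  p.title = q.title,
  p.year = q.year,        // needed by the p.year < 2015 filter in the PageRank query
  p.refs = q.references
WITH p
// One row per referenced id, then link to the matching paper.
UNWIND p.refs AS ref
MATCH (p2:Paper {id: ref})
MERGE (p)-[:CITES]->(p2);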
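I want to run the algo.pageRank.stream function to get a set of PageRank scores and then normalize them, on a large dataset. Can I do this efficiently in a single query?

This runs the PageRank algorithm but does not normalize: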

CALL algo.pageRank.stream(
'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
WITH *,
node.title AS title,  
score AS page_rank,
log(score) AS impact
ORDER BY impact DESC
LIMIT 100
RETURN title, page_rank, impact;

Any suggestions would be greatly appreciated.

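Can you elaborate on what you mean by "normalize"? Does the normalization need to be linear, n/max, or can it be logarithmic/exponential, n/(n+100)? Does it need to be normalized to a value between 0 and 1?

In the failing query, new variables only become available after the WITH clause that defines them, so you need another WITH to do the division. Note that everything will then normalize to 1, because MAX takes the max of a single element on every row. (It is an aggregation, so it only aggregates across rows where everything else is the same.) For each row, the result is effectively n/n.

Not sure what the best approach is, but to start with I was thinking of normalizing between 0 and 1. I realize that taking the log may also produce some scores below 0...

Got it. I want to find the maximum across all scores, not just that row's score. Do you suggest I collect all the scores and then find the max in that collection?

So would n/(n+100) work for your normalization? (It converts linear growth into a percentage: 100 = 50%, 200 = 66.67%, 900 = 90%, 9000 = 98.9%. You can change the constant 100 to whatever you want.) Normalizing each value on its own is much cheaper than collapsing all values to get the max and then expanding them all again.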
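As a quick check of the n/(n+100) suggestion, here is a minimal, self-contained sketch that needs no graph data. The four input values are the ones quoted in the comment; the name k for the constant is only an illustrative choice, not something from the original thread.

```cypher
// Hypothetical input values, taken from the percentages quoted in the comment.
UNWIND [100.0, 200.0, 900.0, 9000.0] AS n
// k is the tunable constant (100 in the comment); the name is illustrative.
WITH n, 100.0 AS k
// Each row is normalized on its own, so no aggregation is required.
RETURN n, n / (n + k) AS normalized
ORDER BY n;
```

Because a value equal to k always maps to 0.5, the constant is probably best chosen close to a score you consider typical for your data; with a constant far above the actual PageRank scores, every result would sit near 0.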
This is the failing query discussed in the comments, followed by the error it raises:

CALL algo.pageRank.stream(
'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
{graph:'cypher', iterations:20, write:false, concurrency:20})
YIELD node, score
WITH *,
node.title AS title, 
score AS page_rank,
log(score) AS impact,
max(log(score)) as max_val,
impact / max_val as impact_norm
ORDER BY impact_norm DESC
LIMIT 100
RETURN title, page_rank, impact_norm;
Variable `impact` not defined (line 18, column 1 (offset: 539))
"impact / max_val as impact_norm"
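One possible way to divide by the maximum over all rows, sketched under the assumption that the procedure yields node and score exactly as in the queries above: aggregate everything into a single row with collect() and max(), then UNWIND back to one row per paper. The names rows, row, max_score and page_rank_norm are illustrative. This version normalizes the raw score rather than log(score), since the logarithm can be negative and would not map cleanly onto 0 to 1.

```cypher
CALL algo.pageRank.stream(
  'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) AS id',
  'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) AS source, id(p2) AS target',
  {graph: 'cypher', iterations: 20, write: false, concurrency: 20})
YIELD node, score
// No grouping key here, so both aggregates run over every row:
// the result is a single row holding all scores plus the global maximum.
WITH collect({title: node.title, page_rank: score}) AS rows,
     max(score) AS max_score
// Expand back to one row per paper; max_score is in scope on each of them.
UNWIND rows AS row
RETURN row.title AS title,
       row.page_rank AS page_rank,
       row.page_rank / max_score AS page_rank_norm
ORDER BY page_rank_norm DESC
LIMIT 100;
```

The collect() step holds every result row in memory at once, which is the cost the last comment warns about; on a very large graph the per-row n/(n+100) form avoids it.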