CALL algo.pageRank(
'MATCH (u:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
(:Category {name: $category})
WITH u, count(*) AS reviews
WHERE reviews >= $cutOff
RETURN id(u) AS id',
'MATCH (u1:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
(:Category {name: $category})
MATCH (u1)-[:FRIENDS]->(u2)
RETURN id(u1) AS source, id(u2) AS target', (*)
{graph: "cypher", write: true, writeProperty: "hotelPageRank",
params: {category: "Hotels", cutOff: 3}}
The node-statement is clear. However, in the relationship-statement, it seems like it could include duplicate relationships. For example, User
A
writes 2 Review
s; User
A
also has 2 Friends (2 other User
s, e.g. B
and C
). Then, we end up with 4 relationship pairs A
-B
, A
-C
, A
-B
, A
-C
, as in (*).
Question: Does APOC's PageRank ignore such duplicate relationships? I.e. it will see only two pairs A
-B
, A
-C
. Or this implementation considers duplicate relationships as "weight"? I.e. if 3 pairs A
-B
are returned, it considers single A
-B
relationship with weight 3. Or there is no such pre-processing, and PageRank itself will consider each pair as a separate link?
I have consulted the official documentation (below), but found no insight on this issue. Any help is appreciated.
The PageRank algorithm
You should probably start using the Graph Data Science library, which is the successor of the Graph Algorithms library and more mature product as well. The book you are reading is slowly being updated to use the GDS library. The PageRank version in the GDS library does support duplicate relationships between a single pair of nodes, so-called multigraph. You can also take a look at this blog post for more information.
The APOC's PageRank treats duplicate links as separate links, implying that in the procedure apoc.algo.pageRankWithCypher()
: if the Cypher query accidentially returns duplicate links, like the example in the book, it can affect the final PageRanks. Said in other words, the procedure returns different PageRanks, compared to if duplicate links are purged (to only one).
Proof
In the procedure's Java source code, the number of out-links for each node is counted with line sourceDegreeData.increment(sourceIndex);
(link). The nodes will share their own PageRank equally to all out-links. It is showed that the code doesn't do any deduplication, i.e. if it encounters A->B
, A->C
, A->B
, it considers 3 relationships. In this context, the number of out-links for each node is called degree for that node. In our example, the degree for node A
is 3.
This (no deduplication) holds true also in other part of the code. The NumLinks
, as in the original PageRank formular (link), is calculated in function getTotalWeightForNode()
(link). Later, when PageRanks are computed iteratively in parallel in function startIteration()
(link) and iterateParallel()
(link), the PageRank for the current node relies on its degree. Again, no code snippet indicates deduplication.
In addition, without yielding weights for each relationship in Cypher query part of apoc.algo.pageRankWithCypher()
, the default weight is 1.
An example
Assuming an empty graph, execute this query:
CREATE (a:Node {name: "a"}), (b:Node {name: "b"}), (c: Node {name: "c"}), (x:Node {name: "x"}), (y:Node {name: "y"}), (a)-[:CONNECT]->(b), (b)-[:CONNECT]->(c), (x)-[:CONNECT]->(a), (y)-[:CONNECT]->(c)
The graph:
Note: Although it shows relationship direction, we ignore the direction and consider each pair to have two-way relationship.
Computer the PageRanks, the 1st assumes no duplicate links; the 2nd intentionally produces duplicate links for some nodes.
// 1st
CALL apoc.algo.pageRankWithCypher({iterations: 2, node_cypher: 'MATCH (n) RETURN ID(n) as id', rel_cypher: 'MATCH (n)-[]-(m) RETURN ID(n) AS source, ID(m) AS target ORDER BY source', write: true, property:'pageRank1', numCpu: 1})
// 2nd
CALL apoc.algo.pageRankWithCypher({iterations: 2, node_cypher: 'MATCH (n) RETURN ID(n) as id', rel_cypher: 'MATCH (n)-[]-(m) RETURN ID(n) AS source, ID(m) AS target ORDER BY source UNION ALL MATCH (n)-[]-(m {name: "b"}) RETURN ID(n) AS source, ID(m) AS target ORDER BY source', write: true, property:'pageRank2', numCpu: 1})
After that, we can compare property PageRank1
and PageRank2
for each node, and see the difference.
Final words
Is it a big deal to have unintended duplicate links when computing PageRank? I believe so, because the PageRank formular assumes distinctive out-links:
duplicate links on the same page are treated as a single link
PageRank
For reproducibility, the Neo4j version tested here is v3.5.17
. APOC version v3.5.0.9
(its corresponding source code on GitHub, branch 3.5
.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.