Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

Neo4j - How does APOC's PageRank handle duplicate relationships in relationship-statement?

Ask Question

In the book " Graph Algorithms, Practical Examples in Apache Spark & Neo4J (05-2019, Mark Needham, Amy E. Hodler) ", page 155, there is an example of using APOC's PageRank algorithm to calculate PageRanks for certain users.

The code snippet:

CALL algo.pageRank(
    'MATCH (u:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
                                    (:Category {name: $category})
    WITH u, count(*) AS reviews
    WHERE reviews >= $cutOff
    RETURN id(u) AS id',
    'MATCH (u1:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
                                    (:Category {name: $category})
    MATCH (u1)-[:FRIENDS]->(u2)
    RETURN id(u1) AS source, id(u2) AS target', (*)
    {graph: "cypher", write: true, writeProperty: "hotelPageRank",
    params: {category: "Hotels", cutOff: 3}}

The node-statement is clear. However, in the relationship-statement, it seems like it could include duplicate relationships. For example, User A writes 2 Reviews; User A also has 2 Friends (2 other Users, e.g. B and C). Then, we end up with 4 relationship pairs A-B, A-C, A-B, A-C, as in (*).

Question: Does APOC's PageRank ignore such duplicate relationships? I.e. it will see only two pairs A-B, A-C. Or this implementation considers duplicate relationships as "weight"? I.e. if 3 pairs A-B are returned, it considers single A-B relationship with weight 3. Or there is no such pre-processing, and PageRank itself will consider each pair as a separate link?

I have consulted the official documentation (below), but found no insight on this issue. Any help is appreciated.

The PageRank algorithm

You should probably start using the Graph Data Science library, which is the successor of the Graph Algorithms library and more mature product as well. The book you are reading is slowly being updated to use the GDS library. The PageRank version in the GDS library does support duplicate relationships between a single pair of nodes, so-called multigraph. You can also take a look at this blog post for more information.

The APOC's PageRank treats duplicate links as separate links, implying that in the procedure apoc.algo.pageRankWithCypher(): if the Cypher query accidentially returns duplicate links, like the example in the book, it can affect the final PageRanks. Said in other words, the procedure returns different PageRanks, compared to if duplicate links are purged (to only one).

Proof

In the procedure's Java source code, the number of out-links for each node is counted with line sourceDegreeData.increment(sourceIndex); (link). The nodes will share their own PageRank equally to all out-links. It is showed that the code doesn't do any deduplication, i.e. if it encounters A->B, A->C, A->B, it considers 3 relationships. In this context, the number of out-links for each node is called degree for that node. In our example, the degree for node A is 3.

This (no deduplication) holds true also in other part of the code. The NumLinks, as in the original PageRank formular (link), is calculated in function getTotalWeightForNode() (link). Later, when PageRanks are computed iteratively in parallel in function startIteration() (link) and iterateParallel() (link), the PageRank for the current node relies on its degree. Again, no code snippet indicates deduplication.

In addition, without yielding weights for each relationship in Cypher query part of apoc.algo.pageRankWithCypher(), the default weight is 1.

An example

Assuming an empty graph, execute this query:

CREATE (a:Node {name: "a"}), (b:Node {name: "b"}), (c: Node {name: "c"}), (x:Node {name: "x"}), (y:Node {name: "y"}), (a)-[:CONNECT]->(b), (b)-[:CONNECT]->(c), (x)-[:CONNECT]->(a), (y)-[:CONNECT]->(c)

The graph:

Note: Although it shows relationship direction, we ignore the direction and consider each pair to have two-way relationship.

Computer the PageRanks, the 1st assumes no duplicate links; the 2nd intentionally produces duplicate links for some nodes.

// 1st
CALL apoc.algo.pageRankWithCypher({iterations: 2, node_cypher: 'MATCH (n) RETURN ID(n) as id', rel_cypher: 'MATCH (n)-[]-(m) RETURN ID(n) AS source, ID(m) AS target ORDER BY source', write: true, property:'pageRank1', numCpu: 1})
// 2nd
CALL apoc.algo.pageRankWithCypher({iterations: 2, node_cypher: 'MATCH (n) RETURN ID(n) as id', rel_cypher: 'MATCH (n)-[]-(m) RETURN ID(n) AS source, ID(m) AS target ORDER BY source UNION ALL MATCH (n)-[]-(m {name: "b"}) RETURN ID(n) AS source, ID(m) AS target ORDER BY source', write: true, property:'pageRank2', numCpu: 1})

After that, we can compare property PageRank1 and PageRank2 for each node, and see the difference.

Final words

Is it a big deal to have unintended duplicate links when computing PageRank? I believe so, because the PageRank formular assumes distinctive out-links:

duplicate links on the same page are treated as a single link

PageRank

For reproducibility, the Neo4j version tested here is v3.5.17. APOC version v3.5.0.9 (its corresponding source code on GitHub, branch 3.5.

Thanks for contributing an answer to Stack Overflow!

  • Please be sure to answer the question. Provide details and share your research!

But avoid

  • Asking for help, clarification, or responding to other answers.
  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.