(Day 300) Trying out Neo4j's GraphRAG package

Ivan Ivanov · October 27, 2024

applying-knowledge nlp gnn

Hello :) Today is Day 300!

A quick summary of today:

path finding in neo4j
streamed

Path Finding with GDS

Before streaming today I had some free time and decided to do these ~1hr course from Neo4j.

Shortest path calculation

Graphs allow for calculating the shortest paths by representing cities, intersections, or other points as nodes and the connections between them (such as roads or routes) as edges. For instance, in a transportation network, one can determine the fastest or least expensive route between cities by considering attributes like travel time or distance. Graph databases like Neo4j can quickly identify these optimal paths by leveraging each relationship’s properties, such as cost or travel time.

Graphs vs. Relational dbs

In traditional relational databases, pathfinding across multiple connections is challenging. We must hypothesize table joins and create complex queries, which can be computationally expensive. Moreover, because relational databases lack inherent awareness of connections between data points, traversing unknown or numerous relationships adds to the complexity and processing time.

In contrast, graph databases like Neo4j are designed to handle paths and relationships directly, often achieving the same result with a few lines of code, making them ideal for finding optimal paths.

Performance advantages of graph dbs

Neo4j and similar graph databases use a concept called Index-free Adjacency, where each node knows all incoming and outgoing relationships. This design avoids costly table joins, allowing traversals through nodes via efficient in-memory pointer chasing. As a result, query performance remains consistent, regardless of data size, which isn’t the case with relational databases that slow down as they grow. This feature is particularly advantageous for real-time pathfinding and large datasets.

Applications of graph pathfinding

logistics and routing: determining the most efficient delivery routes
infrastructure management: mapping out the best paths for utilities and road planning
social and networking: identifying the shortest paths to make new contacts or connections
web link analysis: finding optimal link structures for SEO or content discovery

Shortest paths with Cypher

Unweighted shortest path

Cypher supports the calculation of the shortest unweighted path between a pair of nodes with the shortestPath() function. In an unweighted path, the traversal of each relationship has an identical cost, so the shortest path between two nodes will always be the sum of the total relationships in a path between them.

The shortestPath() function expects a definition of a Cypher path between an origin and destination node. The calculation is then made by traversing through the graph from both nodes, finding where the two paths meet in the middle. The traversal is breadth first, meaning that all relationships from a node will be followed before traversing further through the graph.

The state of the traversal is held in memory, pruning or removing paths that are longer than the current shortest path as it goes. The overall shortest path is calculated as the shortest path length between the two nodes, or in other words, the smallest number of relationships in the path.

The output of the function is the single shortest path between the two nodes.

Example:

MATCH (source:Airport {city:"Baltimore"}),
      (target:Airport {city:"Frankfurt"}) 
MATCH p = shortestPath((source)-[:HAS_ROUTE*1..10]->(target)) 
RETURN p

It is good practice to specify a range of relationships (i.e. [:HAS_ROUTE*1..10]). This ensures that excessively long relationships are not queried or returned - improving query performance and constraining the result.

For multiple relationships:

MATCH (source:Airport {city:"Baltimore"}),
      (target:Airport {city:"Frankfurt"})
MATCH p=shortestPath((source)-[:HAS_ROUTE|CAN_WALK_TO|CAN_SWIM_TO*1..10]->(target))
RETURN p

Or for finding all shortest paths:

MATCH (source:Airport {city:"Baltimore"}),
      (target:Airport {city:"Frankfurt"})
MATCH p = allShortestPaths((source)-[:HAS_ROUTE*1..10]->(target))
RETURN p

This will return all paths that contain the shortest number of relationships

There was a simple challenge after this: How many flights you have to take to get from BNA to HKT?

MATCH (source:Airport {iata: 'BNA'}), (target:Airport {iata: 'HKT'})  
MATCH p=shortestPath((source)-[:HAS_ROUTE*]->(target))
RETURN length(p) AS result

-> 3

And also: How many options do you have to get from Nashville to Phuket with the smallest amount of flights?

MATCH (source:Airport {iata: 'BNA'}), (target:Airport {iata: 'HKT'})  
MATCH p=allShortestPaths((source)-[:HAS_ROUTE*]->(target))
RETURN count(p) AS result

-> 206

Using the Graph Data Science (GDS) library for weighted relationships

In Neo4j, relationship properties in a property graph, called ‘weights’, represent traversal costs like distance, time, or financial cost. For example, a flight route between Paris and Lisbon may be cheaper via Madrid due to combined relationship weights, allowing applications to find the most cost-effective, time-saving, or eco-friendly routes.

To use weighted shortest-path algorithms, Neo4j’s GDS library requires creating an in-memory graph projection using gds.graph.project(). We can specify properties (like cost or distance) for relationships, allowing flexible path calculations.

Common Weighted Shortest Path Algorithms

Dijkstra’s algorithm: calculates the shortest path between two nodes, working only with positive weights. In Neo4j, it’s accessed with gds.shortestPath.dijkstra.stream() and returns nodes and costs in the path
Yen’s k-Shortest Paths Algorithm: an extension of Dijkstra’s, Yen’s algorithm finds multiple shortest paths (up to a specified k number). It ensures each discovered path is unique, ideal for alternative route suggestions

These algorithms are valuable in scenarios where users seek efficient paths based on cost, distance, or other weighted factors, such as in transportation networks or logistics planning.

Next, is a challenge using Djikstra’s algorithm:

First, a graph projection is created:

CALL gds.graph.project(
  'routes',
  'Airport',
  'HAS_ROUTE',
  {relationshipProperties:'distance'}
);

Then, we can run the below to find the shortest weighted path:

MATCH (source:Airport {iata: "BNA"}),
      (target:Airport {iata:"HKT"})

CALL gds.shortestPath.dijkstra.stream(
  'routes',
  {
    sourceNode:source,
    targetNode:target,
    relationshipWeightProperty:'distance'
  }
) YIELD totalCost
RETURN totalCost;

-> 9387.0

Another challenge: In cases where the shortest path isn’t ideal (i.e. due to inconvenient timing or insufficient services), providing multiple route options can enhance user choice. Using Yen’s algorithm, we can calculate the three shortest paths between Nashville (BNA) and Phuket (HKT) in Neo4j. The Cypher query below identifies the total cost of the third shortest path by setting k:3

MATCH (source:Airport {iata: "BNA"}),
      (target:Airport {iata:"HKT"})

CALL gds.shortestPath.yens.stream(
  'routes',
  {
    sourceNode:source,
    targetNode:target,
    relationshipWeightProperty:'distance',
    k:3
  }
) YIELD totalCost
RETURN totalCost

Stream

On stream I watched GraphRAG Python Package: Accelerate GenAI with Knowledge Graphs, which covers neo4j’s GraphRAG python package. It was around 30 mins, and after that I decided to check the example notebook by meself. However, instead of the provided pdf files I downloaded 3 random papers from sciencedirect. However, if you check the stream, it did not work for some reason. I believe because the code for the creation of the nodes and edges is generated through an LLM. And even though in the sample notebook they manage to get that working, I used different pdfs and also maybe because LLMs are not deterministic. But I could not generate nodes and edges for the knowledge graph for my 3 papers of choice. Still a cool package that will for sure become better (I think it was officially released recently)

I did not have anything special planned for day 300 😆 just normal study day

That is all for today!

See you tomorrow :)