(Day 278) Before Machine Learning Volume 2 - Calculus + Neo4j GDS

Ivan Ivanov · October 5, 2024

Hello :) Today is Day 278!

A quick summary of today:

After reading the 1st volume on linear algebra, I decided to continue, as I love going back to basics and seeing how different books introduce them. I also like the titles of this series - Before Machine Learning - which are straight to the point and clear.

Chapter 1 - Why Calculus?

In 1 sentence: It is the ‘magic’ power behind the optimization and functionality of machine learning algorithms

Chapter 2 - Pointing Fingers and Crossing Lines: The last breath of just linearity

It starts by describing the point (yes, a single point), a small dot that can be assigned coordinates in any space. With two points, we can define a line, and by calculating the slope (rise over run), we arrive at the equation y=mx+b, where m is the slope and b is the y-intercept. This formula gives us a way to describe any straight line and sets the stage for exploring more complex concepts.
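
As a quick illustration (my own sketch, not from the book), here is how the slope and intercept fall out of two points in Python; the points are made up:

# Two example points (made-up values, just for illustration)
x1, y1 = 1.0, 3.0
x2, y2 = 4.0, 9.0

# Slope: rise over run
m = (y2 - y1) / (x2 - x1)

# Intercept: solve y1 = m*x1 + b for b
b = y1 - m * x1

print(f"y = {m}x + {b}")   # y = 2.0x + 1.0
print(m * 10 + b)          # the line evaluated at x = 10 -> 21.0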

Chapter 3 - Changing Times and Tangent Lines: The Derivative

The derivative measures how much a function’s output changes in response to a small change in its input.

image

We want to understand what happens to the rate of change when h gets very close to, but never equals, 0 - meaning that h is an arbitrarily small (but nonzero) number.

image
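
To make the shrinking-h idea concrete, here is a small numerical sketch of my own (the function and the point are arbitrary choices):

def f(x):
    return x**2

x = 3.0   # the true derivative of x**2 at x = 3 is 6
for h in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    rate = (f(x + h) - f(x)) / h   # difference quotient
    print(h, rate)                 # the quotient equals 6 + h, so it approaches 6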

Then, it introduced some of the rules of calculus (a quick check in code follows the list):

  • constant rule
  • product rule
  • quotient rule
  • sum rule
  • subtraction rule
  • power rule
  • chain rule (a bit later in the chapter)
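
As a quick check of a couple of these rules (my own sketch, assuming SymPy is available):

import sympy as sp

x = sp.symbols('x')

# Product rule: d/dx [u*v] = u'*v + u*v'
u, v = sp.sin(x), x**2
lhs = sp.diff(u * v, x)
rhs = sp.diff(u, x) * v + u * sp.diff(v, x)
print(sp.simplify(lhs - rhs))   # 0 -> both sides agree

# Chain rule: d/dx sin(g(x)) = cos(g(x)) * g'(x)
g = x**2 + 1
print(sp.diff(sp.sin(g), x))    # 2*x*cos(x**2 + 1)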

It introduced limits and rules for them:

image

L’Hopital’s rule

This is a technique in calculus used to find the limit of indeterminate forms, such as 0/0 or ∞/∞. The rule says that if you get one of these forms when calculating a limit, you can take the derivative of the numerator and the derivative of the denominator separately, and then calculate the limit again.

If you have a limit that looks like this:

lim (x → c) [f(x) / g(x)] = 0/0 or ∞/∞

You can apply L’Hopital’s Rule by taking the derivatives:

lim (x → c) [f(x) / g(x)] = lim (x → c) [f'(x) / g'(x)]

If the result is still indeterminate, you can repeat the process by taking more derivatives.

For example, to find the limit of:

lim (x → 0) [sin(x) / x]

At x = 0, it results in 0/0, so we apply L’Hopital’s Rule:

The derivative of sin(x) is cos(x). The derivative of x is 1. Now the limit becomes:

lim (x → 0) [cos(x) / 1] = cos(0) = 1

So, the limit is 1.
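
The same limit can be verified symbolically (my own sketch, assuming SymPy is available):

import sympy as sp

x = sp.symbols('x')

# The limit directly
print(sp.limit(sp.sin(x) / x, x, 0))   # 1

# The L'Hopital step done explicitly: differentiate numerator and denominator
print(sp.limit(sp.diff(sp.sin(x), x) / sp.diff(x, x), x, 0))   # cos(0)/1 = 1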

Second derivative

We know that the first derivative gives the slope of the tangent line at any given point on the function. However, the 2nd derivative is about understanding the variation of this slope along the function’s path. To think of it visually, imagine drawing tangent lines at various points along a curve. The slope of these lines could change: they might get steeper, remain constant, or become flatter. The second derivative is a tool that quantifies this change. If it’s positive, the slopes of these tangent lines are getting steeper, indicating that the function is bending or curving upward. On the contrary, if it’s negative, the slopes are becoming less steep or more negative, which signifies that the function is curving downward
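
A tiny illustration of this sign rule (my own example, assuming SymPy): x**2 curves upward everywhere, while -x**2 curves downward everywhere.

import sympy as sp

x = sp.symbols('x')

# f(x) = x**2 curves upward everywhere: f''(x) > 0
print(sp.diff(x**2, x, 2))    # 2

# f(x) = -x**2 curves downward everywhere: f''(x) < 0
print(sp.diff(-x**2, x, 2))   # -2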

Taylor series

An infinite series allows us to represent or approximate functions using polynomials. Among these, the Taylor series stands out. It enables us to create precise polynomial approximations of functions around a specific point.

At its core, these series are straightforward; each term is added to the sum of the preceding terms

image

The Taylor series is useful especially for solving differential equations, approximating difficult functions or analyzing the behavior of functions around a specific point.
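
For example (my own sketch, assuming SymPy), the Taylor expansion of sin(x) around 0 gives the familiar polynomial, and truncating it already approximates the function well near the expansion point:

import sympy as sp

x = sp.symbols('x')

# Taylor series of sin(x) around 0, terms up to (but not including) x**8
print(sp.series(sp.sin(x), x, 0, 8))
# x - x**3/6 + x**5/120 - x**7/5040 + O(x**8)

# The truncated polynomial is a good approximation near the expansion point
poly = sp.series(sp.sin(x), x, 0, 8).removeO()
print(float(poly.subs(x, 0.5)), float(sp.sin(0.5)))   # both ≈ 0.4794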

Critical points

The Newton-Raphson method

image

The next iterate is obtained by subtracting, from the previous point, the ratio of the function value to the derivative value at that same location.
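
Written out (my reconstruction of the update rule described here, consistent with the formula used later in the chapter):

x_n = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})}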

For a more intuitive view, let’s consider the ratio. The term f(x_{n-1}) gives the value of f(x) at that point, and if we are looking for the root, this value indicates how far we are from it. The derivative at this point, which represents the slope of the tangent line at x_{n-1}, acts like a guide, showing us the direction to move. In essence, we are ‘normalizing our error’ (how far off our current guess is) by considering the “rate of change” (how quickly the function is changing). This ratio helps determine how much we should adjust our guess to move closer to the root.

Put simply, a root point in Newton’s method is the value where the function equals zero

To successfully find the root of a function, the function must be:

  • differentiable
  • the derivative can’t be zero at the approximation

For classifying a critical point a (a point where f'(a) = 0):

  • if f'(x) is decreasing, i.e., f''(x) < 0, then the point a with f'(a) = 0 is a maximum, and f(x) has the shape of a ∩ around a, where a belongs to the domain of the function

  • if f'(x) is increasing, i.e., f''(x) > 0, then the point a with f'(a) = 0 is a minimum, and f(x) has the shape of a ∪ around a, where a belongs to the domain of the function

To understand the significance of the second derivative f''(x), we need to explore its roots. By plotting f(x) and f''(x), we can see where f''(x) crosses the x-axis; these roots of f''(x) mark the points where the curvature of f(x) may change.

image

These points where f''(x) = 0 are important because they signal changes in the curvature of f(x), which could mean the function is switching between a bowl (concave up) and a dome (concave down). These are called inflection points.

To confirm whether a root of f''(x) at x = a is an inflection point:

  • confirmed inflection point: if f''(x) changes sign around x = a, then x = a is an inflection point, showing a change in curvature
  • not an inflection point: if f''(x) does not change sign around x = a, then x = a is not an inflection point, meaning the curvature remains unchanged

The sign of f''(x) around the root helps determine whether the curvature of f(x) changes at that point.
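
A quick check of this sign test (my own example, assuming SymPy): x**3 has a genuine inflection point at 0, while x**4 does not.

import sympy as sp

x = sp.symbols('x')

for f_expr in (x**3, x**4):
    f2 = sp.diff(f_expr, x, 2)
    # Sign of f''(x) just to the left and right of the root x = 0
    left, right = f2.subs(x, -0.1), f2.subs(x, 0.1)
    print(f_expr, "f'' =", f2, "sign change at 0:", bool(left * right < 0))
# x**3: f'' = 6*x changes sign   -> inflection point at 0
# x**4: f'' = 12*x**2 does not   -> no inflection point at 0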

Gradient descent

  1. Select a starting point
  2. Pick a learning rate
  3. Update the value of the estimate x_n
  4. Check the stopping conditions

I found that the series has an accompanying GitHub repo with sample code showcasing some of the concepts, and part of chapter 3 is showcased through code as well.

Code for chapter 3

First, we define the functions and derivatives:

import numpy as np               # needed for np.abs in the stopping checks below
import matplotlib.pyplot as plt  # needed for the convergence plot at the end

def f(x):
    return x**4 - 4*x**3 + 4*x**2

def df(x):  # First derivative
    return 4*x**3 - 12*x**2 + 8*x

def df2(x):  # Second derivative
    return 12*x**2 - 24*x + 8

1. Newton’s Method for Root Finding:

Function and Derivative:

The function we’re considering is:

f(x) = x^4 - 4x^3 + 4x^2

Its derivative is:

f’(x) = 4x^3 - 12x^2 + 8x

Method Overview:

The Newton-Raphson method updates our estimate for the root using the formula:

x_{n+1} = x_n - \frac{f(x_n)}{f’(x_n)}

This method iteratively refines the root approximation.

Stopping Criteria:

We use two stopping criteria:

  1. When the absolute value of f(x) is less than 10^{-5}, indicating proximity to a root
  2. After 10 iterations, to prevent endless loops in case the method doesn’t converge

def newton_method(x0, tol=1e-5, max_iter=10):
    """
    Newton's method for root finding.
    
    Parameters:
    - x0: Initial guess
    - tol: Tolerance level (default is 1e-5)
    - max_iter: Maximum number of iterations (default is 10)

    Returns:
    - Root of the function f(x) if found within max_iter iterations and tolerance, else returns the latest estimate
    """
    x = x0
    for i in range(max_iter):
        # Compute the next approximation using Newton's method formula
        x_next = x - f(x)/df(x)
        
        # Check the stopping criterion
        if abs(f(x_next)) < tol:
            print(f"Root found at x = {x_next}, after {i+1} iterations.")
            return x_next
        
        # Update x for the next iteration
        x = x_next
        
    print(f"Stopped after {max_iter} iterations. Latest estimate for root is x = {x}.")
    return x

Using the below:

x0 = 0.1
root = newton_method(x0)

We get: Root found at x = 0.001407967242655117, after 6 iterations.

2. Minimization with Gradient Descent

Method Overview:

Gradient Descent updates our estimate for the local minimum using the rule:

x_{n+1} = x_n - \alpha f'(x_n)

where \alpha is the learning rate. This method iteratively shifts the current estimate in the direction of the steepest decrease of the function.

Stopping Criteria:

The algorithm stops when the difference between the function values in consecutive steps is smaller than 10^{-5}. If this doesn’t occur, the method will stop after 10,000 iterations as a fallback.

def gradient_descent_minimize(learning_rate=0.01, tol=1e-5, x0=0.1):
    """
    Function to perform minimization using gradient descent.
    
    Parameters:
    - learning_rate: Step size for each iteration.
    - tol: Tolerance value for stopping criterion.
    - x0: Initial guess.
    
    Returns:
    - x: Approximate minimum point.
    - errors: List containing value of f(x) for each iteration.
    """
    x = x0
    errors = [f(x)]
    
    # Run gradient descent until change in f(x) value is smaller than tol
    while True:
        x_new = x - learning_rate * df(x)
        if np.abs(f(x_new) - f(x)) < tol:
            break
        x = x_new
        errors.append(f(x))
    
    return x, errors

From:

min_gd, gd_errors = gradient_descent_minimize()
print(f"Approximate minimum point using Gradient Descent: x = {min_gd}")

We get: Approximate minimum point using Gradient Descent: x = 0.00387025495013399

3. Minimization with Newton’s Method

Method Overview:

Newton’s method for optimization updates our estimate for the local minimum using the rule:

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}

This method tries to approximate the function locally by a quadratic function and then jumps straight to the minimum of that quadratic.

Stopping Criteria:

The algorithm stops when the difference between the function values in consecutive steps is smaller than 10^{-5}. If this doesn’t occur, the method will stop after 10 iterations as a fallback.

def newton_method_optimize(tol=1e-5, x0=0.1):
    """
    Function to perform minimization using Newton's method.
    
    Parameters:
    - tol: Tolerance value for stopping criterion.
    - x0: Initial guess.
    
    Returns:
    - x: Approximate minimum point.
    - errors: List containing value of f(x) for each iteration.
    """
    x = x0
    errors = [f(x)]
    
    # Run Newton's method until change in f(x) value is smaller than tol
    while True:
        x_new = x - df(x)/df2(x)
        if np.abs(f(x_new) - f(x)) < tol:
            break
        x = x_new
        errors.append(f(x))
    
    return x, errors

# Initial guess
x0 = 0.1
min_point, errors = newton_method_optimize(x0=x0)
print(f"Approximate minimum point using Newton's method for optimization: x = {min_point:.5f}")

We get: Approximate minimum point using Newton's method for optimization: x = -0.00055

4. Comparison between Gradient Descent and Newton’s Method for Optimization

min_gd, gd_errors = gradient_descent_minimize()
min_newton, newton_errors = newton_method_optimize()

# Compare number of iterations taken to converge
print(f"Gradient Descent took {len(gd_errors)} iterations.")
print(f"Newton's method took {len(newton_errors)} iterations.")

# Plot the convergence of the methods
plt.figure(figsize=(10, 6))
plt.plot(gd_errors, label='Gradient Descent', linestyle='--')
plt.plot(newton_errors, label="Newton's Method", linestyle='-')
plt.title("Convergence of Gradient Descent vs Newton's Method")
plt.xlabel("Iterations")
plt.ylabel("Value of f(x)")
plt.legend()
plt.grid(True)
plt.show()

image

Gradient Descent took 42 iterations.

Newton’s method took 3 iterations.

While Newton’s method can be faster in terms of iterations, its computational complexity, memory requirements, and sensitivity make Gradient Descent a more practical choice for many applications, particularly in high-dimensional spaces or when dealing with non-convex functions.

Chapter 4 - Cleaning Up The Derivatives Debris: The Integral

How can we calculate the area under a function?

image

We can use rectangles to approximate it

image

The base of each rectangle is a small change in x (dx) and the height is f(x). To get a better approximation we can increase the number of rectangles:

image

And even more

image

As the number of rectangles grows and dx shrinks, the sum of the areas f(x)·dx approaches the exact area under the curve.

Fundamental theorem of calculus

image

Where F is an anti-derivative with formula:

image

Integration and differentiation are inverse operations; integrating a function and then differentiating the result returns the original function.
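
As a small check (my own sketch, assuming SymPy), integrating the chapter's example function and then differentiating the result returns the original expression, and the definite integral matches F(b) - F(a):

import sympy as sp

x = sp.symbols('x')
f_expr = x**4 - 4*x**3 + 4*x**2

F = sp.integrate(f_expr, x)       # an anti-derivative F(x)
print(F)                          # x**5/5 - x**4 + 4*x**3/3
print(sp.expand(sp.diff(F, x)))   # differentiating gives back x**4 - 4*x**3 + 4*x**2

# Definite integral via the fundamental theorem: F(b) - F(a)
a, b = 0, 2
print(sp.integrate(f_expr, (x, a, b)), F.subs(x, b) - F.subs(x, a))   # 16/15 both times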


Chapter 5 (last chapter) is for tomorrow:

image image

Introduction to Neo4j Graph Data Science

Today I did this course. It was very short and did not contain that much info.

image

The main thing, I think, is Native vs Cypher graph projections.

In Neo4j, Native and Cypher Projections are two different methods of working with graph data, especially when it comes to accessing and manipulating it. Here’s a breakdown of each:

1. Native

Native refers to the core functionality provided by Neo4j, which operates directly on the underlying graph database structure. This includes:

  • Data Model: Native access utilizes the property graph model where nodes and relationships can have properties associated with them.
  • Performance: Accessing data natively can be faster for certain operations since it leverages Neo4j’s optimized internal storage and indexing mechanisms.
  • API Access: You can interact with the graph directly using the Neo4j Java API or other drivers for various programming languages. This approach is usually more complex but offers high performance for large datasets and complex queries.
  • Lower-Level Operations: Native access allows for more control over transaction management, caching, and other low-level operations, which can be beneficial for advanced use cases.

2. Cypher Projections

Cypher Projections are a more abstracted way of working with Neo4j data through the Cypher query language. Projections allow you to define a subset of data and work with it in a more user-friendly manner:

  • Declarative Syntax: Cypher uses a more intuitive, SQL-like syntax, making it easier for users to query and manipulate graph data without needing to understand the underlying data structures
  • Data Retrieval: We can create projections that define which parts of the graph to retrieve and how to structure the results. This is useful for getting specific views of the data without the overhead of fetching everything
  • Read-Only: Cypher projections are generally read-only, meaning you can retrieve and manipulate the data in-memory but can’t directly modify the underlying graph through projections
  • Flexibility: Projections allow you to transform the data in various ways, such as aggregating properties, filtering nodes, and modifying relationships, all in a single query

Use Cases

  • Native might be more suitable for applications that require high performance and involve complex transactions or batch processing
  • Cypher Projections are often preferred for data exploration, analytics, and scenarios where ease of use and readability are more important than raw performance

Graph Data Science Fundamentals

Algorithm Tiers and Execution Modes

Tiers

GDS algorithms are classified into three tiers: alpha, beta, and production.

  • Production-quality: Indicates that the algorithm has been tested in regard to stability and scalability. Algorithms in this tier are prefixed with gds..

  • Beta: Indicates that the algorithm is a candidate for the production-quality tier. Algorithms in this tier are prefixed with gds.beta..

  • Alpha: Indicates that the algorithm is experimental and might be changed or removed at any time. Algorithms in this tier are prefixed with gds.alpha..

Execution Modes

GDS algorithms have four execution modes, which determine how the results of the algorithm are handled.

  • stream: Returns the result of the algorithm as a stream of records.

  • stats: Returns a single record of summary statistics, but does not write to the Neo4j database or modify any data.

  • mutate: Writes the results of the algorithm to the in-memory graph projection and returns a single record of summary statistics.

  • write: Writes the results of the algorithm back to the Neo4j database and returns a single record of summary statistics.

Memory Estimation

Before running an algorithm, we can use .estimate to estimate the memory required to run the analytics/workflow on the projection.

Overall Algorithm Syntax

CALL gds[.<tier>].<algorithm>.<execution-mode>[.<estimate>](
	graphName: STRING,
	configuration: MAP
)

Centrality and Importance

Centrality algorithms are used to determine the importance of distinct nodes in a graph.

Common use cases of centrality include:

  • Recommendations: Identify and recommend the most influential or popular items in your content or product offering catalog

  • Supply chain analytics: find the most critical node in your supply chain, whether it be a supplier in a network, a raw material that is part of a manufactured product, or a port in a route

  • Fraud & Anomaly Detection: Find users with many shared identifiers or who otherwise act as a bridge between many communities

Degree Centrality

Degree centrality is one of the most ubiquitous and simple centrality algorithms. It counts the number of relationships a node has. In the GDS implementation, we specifically calculate out-degree centrality which is the count of outgoing relationships from a node

To use a GDS algorithm, you must first create a graph projection. A projection is an in-memory graph you can quickly query and manipulate.

Create a graph projection of Actor and Movie nodes:

CALL gds.graph.project(
  'proj',
  ['Actor','Movie'],
  'ACTED_IN'
  );

Then stream the degree centrality to find the actors who have acted in the most movies:

//get top 5 most prolific actors (those in the most movies)
//using degree centrality which counts number of `ACTED_IN` relationships
CALL gds.degree.stream('proj')
YIELD nodeId, score
RETURN
  gds.util.asNode(nodeId).name AS actorName,
  score AS numberOfMoviesActedIn
ORDER BY numberOfMoviesActedIn DESCENDING, actorName LIMIT 5

We get:

image

PageRank

PageRank is a good algorithm for measuring the influence of nodes in a directed graph, particularly where the relationships imply some form of flow of movement such as in payment networks, supply chain and logistics, communications, routing, and graphs of website and links.

In summary, PageRank estimates the importance of a node by counting the number of incoming relationships from neighboring nodes weighted by the importance and out-degree centrality of those neighbors. The underlying assumption is that more important nodes are likely to have proportionately more incoming relationships from other important nodes.
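
To make the idea tangible, here is a minimal power-iteration sketch of PageRank on a toy directed graph (my own illustration of the general algorithm, not the GDS implementation; the nodes and the damping factor are made up):

# Toy directed graph: node -> nodes it links to (made-up data)
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping, iterations = 0.85, 50
nodes = list(graph)
rank = {n: 1.0 / len(nodes) for n in nodes}

for _ in range(iterations):
    new_rank = {}
    for n in nodes:
        # Contributions from nodes linking to n, each weighted by its own
        # rank divided by its out-degree
        incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
        new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
    rank = new_rank

print(rank)   # "C" ends up highest: it receives links from the most (and best-connected) nodes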

First, create the graph projection. We can use a Cypher projection to create an in-memory graph with :DIRECTED_ACTOR relationships between two (:Person) nodes. This graph can be traversed to understand the influence across directors and actors.

// create Cypher projection for network of people directing actors
// filter to recent high grossing movies
MATCH (source:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(target)
WHERE m.year >= 1990 AND m.revenue >= 10000000
WITH source, target, count(*) as actedWithCount
WITH gds.graph.project(
  'proj',
  source,
  target,
  {
    relationshipType: "DIRECTED_ACTOR"
  }
) as g
RETURN
  g.graphName AS graph, g.nodeCount AS node, g.relationshipCount AS rels

Next, stream PageRank to find the top 5 most influential people in the director-actor network.

CALL gds.pageRank.stream('proj')
YIELD nodeId, score
RETURN
  gds.util.asNode(nodeId).name AS personName,
  score AS influence
ORDER BY influence DESCENDING, personName LIMIT 5

image

Other Centrality Algorithms

Other GDS production tier centrality algorithms include:

  • Betweenness Centrality: Measures the extent to which a node stands between the other nodes in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another.

  • Eigenvector Centrality: Measures the transitive influence of nodes. Similar to PageRank, but works only on the largest eigenvector of the adjacency matrix so does not converge in the same way and tends to more strongly favor high degree nodes. It can be more appropriate in certain use cases, particularly those with undirected relationships.

  • Article Rank: A variant of PageRank which assumes that relationships originating from low-degree nodes have a higher influence than relationships from high-degree nodes.

Path Finding

Path finding algorithms find the shortest path between two or more nodes or evaluate the availability and quality of paths.

Common use cases of path finding are:

  • Supply chain analytics: Identifying the fastest path between an origin and a destination or between a raw material and a finished product

  • Customer Journey: Analyzing the events that make up a customer’s experience. In healthcare for example, this can be the experience of an in-patient from admission to discharge.

Dijkstra Source-Target Shortest Path

A common, industry standard, path finding algorithm is Dijkstra. It computes the shortest path between a source and a target node. Like many other path finding algorithms in GDS, Dijkstra supports weighted relationships to account for distance or another cost property when comparing paths.
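
For intuition about what the algorithm does, here is a compact Python sketch of classic Dijkstra on a made-up weighted graph (my own illustration, not GDS code):

import heapq

def dijkstra(graph, source, target):
    """Shortest weighted path from source to target.
    graph: dict node -> list of (neighbor, weight) pairs."""
    dist = {source: 0.0}
    prev = {}
    pq = [(0.0, source)]
    visited = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

# Hypothetical weighted graph (edges listed in both directions)
graph = {
    "Kevin":  [("Movie1", 1.0)],
    "Movie1": [("Kevin", 1.0), ("Alice", 1.0)],
    "Alice":  [("Movie1", 1.0), ("Movie2", 1.0)],
    "Movie2": [("Alice", 1.0), ("Denzel", 1.0)],
    "Denzel": [("Movie2", 1.0)],
}
print(dijkstra(graph, "Kevin", "Denzel"))
# (['Kevin', 'Movie1', 'Alice', 'Movie2', 'Denzel'], 4.0)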

First, create the graph projection.

CALL gds.graph.project('proj',
    ['Person','Movie'],
    {
        ACTED_IN:{orientation:'UNDIRECTED'},
        DIRECTED:{orientation:'UNDIRECTED'}
    }
);

Then we can run Dijkstra’s shortest path.

MATCH (kevin:Actor{name : 'Kevin Bacon'})
MATCH (denzel:Actor{name : 'Denzel Washington'})

CALL gds.shortestPath.dijkstra.stream(
    'proj',
    {
        sourceNode:kevin,
        targetNode:denzel
    }
)

YIELD sourceNode, targetNode, path
RETURN sourceNode, targetNode, nodes(path) as path;

Resulting path:

image

Other Path Finding Algorithms

Other GDS production tier Path Finding algorithms can be split into a few different subcategories that are listed below:

Shortest path between two nodes:

  • A* Shortest Path: An extension of Dijkstra that uses a heuristic function to speed up computation.

  • Yen’s Algorithm Shortest Path: An extension of Dijkstra that allows you to find multiple, the top k, shortest paths.

Shortest path between a source node and multiple other target nodes:

  • Dijkstra Single-Source Shortest Path: Dijkstra implementation for shortest path between one source and multiple targets.

  • Delta-Stepping Single-Source Shortest Path: Parallelized shortest path computation. Computes faster than Dijkstra single-source shortest Path but uses more memory.

General path search between a source node and multiple other target nodes:

  • Breadth First Search: Searches paths in order of increasing distance from the source node on each iteration.

  • Depth First Search: Searches as far as possible along a single multi-hop path on each iteration.

Community Detection

Community detection algorithms are used to evaluate how groups of nodes may be clustered or partitioned in the graph. Much of the community detection functionality in GDS is focused on distinguishing and assigning ids to these node groups for downstream analytics, visualization, or other processing.

Common use cases of community detection include:

  • Fraud detection: Finding fraud rings by identifying accounts that have frequent suspicious transactions and/or share identifiers between one another.

  • Customer 360: Disambiguating multiple records and interactions into a single customer profile so an organization has an aggregated source of truth for each customer.

  • Market segmentation: dividing a target market into approachable subgroups based on priorities, behaviors, interests, and other criteria.

Louvain Community Detection

A common community detection algorithm is Louvain. Louvain maximizes a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities. This means evaluating how much more densely connected the nodes within a community are, compared to how connected they would be in a random network.
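
To ground the term, here is a small sketch of the modularity formula itself (my own illustration, not the GDS/Louvain implementation); it compares the edges actually inside communities against what a random graph with the same node degrees would be expected to have:

def modularity(adj, community):
    """adj: dict node -> set of neighbours (undirected graph),
    community: dict node -> community id."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # number of edges
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)

# Two triangles joined by a single edge (made-up example)
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
good = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
bad = {1: "a", 2: "b", 3: "a", 4: "b", 5: "a", 6: "b"}
print(round(modularity(adj, good), 3))   # 0.357  -> dense, well-separated communities
print(round(modularity(adj, bad), 3))    # -0.214 -> worse than a random split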

Louvain optimizes this modularity with a hierarchical clustering approach that recursively merges communities together. There are multiple parameters that can be used to tune Louvain to control its performance and the number and size of communities produced. This includes the maximum number of iterations and hierarchical levels to use as well as the tolerance parameter for assessing convergence/stopping conditions.

An additional important consideration is that Louvain is a stochastic algorithm. As such, the community assignments may change a bit when re-run. When the graph does not have a naturally well-defined community structure the changes between runs can become more significant. Louvain includes a seedProperty parameter which can be used to assign initial community ids and help with consistency between runs. Also, if consistency is important, other community detection algorithms, such as Weakly Connected Components (WCC), take a more deterministic partitioning approach to assigning communities and thus will not change between runs.

First create a graph projection with movies, actors, and directors. Project the relationships with an UNDIRECTED orientation as that works best with the Louvain algorithm.

CALL gds.graph.project('proj', ['Movie', 'Person'], {
    ACTED_IN:{orientation:'UNDIRECTED'},
    DIRECTED:{orientation:'UNDIRECTED'}
});

Then we can run Louvain. Here we will run Louvain in mutate mode to save community Ids and return high level statistics on the community counts, distribution, modularity score, and information for how Louvain processed the graph.

CALL gds.louvain.mutate('proj', {mutateProperty:'communityId'})

Here is the output:

mutateMillis: 0
nodePropertiesWritten: 28172
modularity: 0.7080752619422075
modularities: [0.4735378349269174, 0.610543949698786, 0.6984121667978503, 0.707715829688635, 0.7080752619422075]
ranLevels: 5
communityCount: 504
communityDistribution: {
  "min": 1,
  "p5": 3,
  "max": 2752,
  "p999": 2752,
  "p99": 986,
  "p1": 1,
  "p10": 4,
  "p90": 154,
  "p50": 6,
  "p25": 6,
  "p75": 8,
  "p95": 310,
  "mean": 55.8968253968254
}
postProcessingMillis: 27
preProcessingMillis: 0
computeMillis: 13469
configuration: {
  "mutateProperty": "communityId",
  "jobId": "2a32b9a2-1489-41ed-9afd-9429bbe9a7f7",
  "sudo": false,
  "maxIterations": 10,
  "maxLevels": 10,
  "seedProperty": null,
  "logProgress": true,
  "nodeLabels": ["*"],
  "concurrency": 4,
  "includeIntermediateCommunities": false,
  "relationshipTypes": ["*"],
  "tolerance": 0.0001,
  "consecutiveIds": false
}

We can verify the communityId node properties in the projection with a stream operation.

CALL gds.graph.nodeProperty.stream('proj','communityId', ['Person'])
YIELD nodeId, propertyValue
WITH gds.util.asNode(nodeId) AS n, propertyValue AS communityId
WHERE n:Person
RETURN n.name, communityId LIMIT 10

image

Here we can see each node’s name and the id of the community it belongs to.

Other Community Detection Algorithms

Below are some of the other production tier community detection algorithms. A full list of all community detection algorithms can be found in the Community Detection algorithms documentation.

  • Label Propagation: Similar intent as Louvain. Fast algorithm that parallelizes well. Great for large graphs.

  • Weakly Connected Components (WCC): Partitions the graph into sets of connected nodes such that: a) Every node is reachable from any other node in the same set; b) No path exists between nodes from different sets

  • Triangle Count: Counts the number of triangles for each node. Can be used to detect the cohesiveness of communities and stability of the graph.

  • Local Clustering Coefficient: Computes the local clustering coefficient for each node in the graph which is an indicator for how the node clusters with its neighbors.

Node Embedding

The goal of node embedding is to compute low-dimensional vector representations of nodes such that similarity between vectors (e.g. dot product) approximates similarity between nodes in the original graph. These vectors, also called embeddings, can be extremely useful for exploratory data analysis, similarity measurements, and machine learning.

image

Of course, in real-world problems node embeddings will usually be larger than 2 dimensions, often ending up in the hundreds or larger, especially when applied to bigger graphs with millions or billions of nodes. Node embedding also doesn’t have to base similarity strictly on node proximity in the graph. While similarity based on distance in relationship hops and common neighbors is perhaps most common in application, node embedding can also consider node properties and other “global-view” node attributes when calculating embedding vectors.

Use Cases

  • EDA
  • Similarity measurement
  • Features for ML

FastRP

GDS offers a custom implementation of a node embedding technique called Fast Random Projection (FastRP).

FastRP leverages probabilistic sampling techniques to generate sparse representations of the graph, allowing for extremely fast calculation of embedding vectors that are comparable in quality to those produced with traditional random walk and neural net techniques such as Node2vec and GraphSage. This makes FastRP a great choice for getting started with exploring embedding on a graph in GDS.

Other Node Embedding Algorithms

GDS has also implemented Node2Vec, which computes a vector representation of a node based on random walks in the graph, and GraphSage, which is an inductive modeling approach for computing node embeddings using node properties and graph structure.

Similarity

Similarity algorithms, as the name implies, are used to infer similarity between pairs of nodes. In GDS these algorithms run over the graph projection in bulk. When similar node pairs are identified according to the user specified metric and threshold, a relationship with a similarity score property is drawn between the pair. Depending on which execution mode is used when running the algorithm, these similarity relationships can be streamed, mutated to the in-memory graph, or written back to the database.

Common use cases for similarity include:

  • Fraud detection: finding potential fraud user accounts by analyzing how similar a set of new user accounts is to flagged accounts

  • Recommendation Systems: In an online retail store, identifying items that pair to the one currently being viewed by a user to inform impressions and increase rate of purchase

  • Entity Resolution: Identify nodes that are similar to one another based on activity or identifying information in the graph

Similarity Algorithms in GDS

  • Node Similarity: Determines similarity between nodes based on the relative proportion of shared neighboring nodes in the graph. Node Similarity is a good choice where explainability is important, and you can narrow down the universe of comparisons to a subset of your data. Examples of narrowing down include focusing on just single communities, newly added nodes, or nodes within a specific proximity to a subgraph of interest.

  • K-Nearest Neighbor (KNN): Determines similarity based off node properties. The GDS KNN implementation can scale well for global inference over large graphs when tuned appropriately. It can be used in conjunction with embeddings and other graph algorithms to determine the similarity between nodes based on proximity in the graph, node properties, community structure, importance/centrality, etc.

Choice of Similarity Metric

Both Node Similarity and KNN provide choices between different similarity metrics. Node Similarity has choices between Jaccard and Overlap similarity. For KNN, the choice of metric is driven by the node property types: lists of integers are subject to Jaccard and Overlap, lists of floating point numbers to Cosine Similarity, Pearson, and Euclidean. Using different metrics will of course alter the similarity score and change the interpretation slightly.
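
For reference, these metrics are simple to compute directly; here is my own sketch of the underlying formulas (not the GDS similarity functions), with made-up inputs:

import math

def jaccard(a, b):
    """Jaccard similarity between two sets of ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def overlap(a, b):
    """Overlap coefficient: shared items divided by the smaller set."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))

def cosine(u, v):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(jaccard([1, 2, 3], [2, 3, 4]))             # 0.5
print(overlap([1, 2, 3], [2, 3, 4]))             # 0.666...
print(cosine([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # ≈ 0.5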

Controlling Scope of Comparisons

Comparing every node to every other node in the graph is a computationally expensive task of roughly O(n^2) complexity. The GDS implementations for both Node Similarity and KNN have internal mechanisms to intelligently select node pairs for comparison allowing them to work faster and scale better. They also have parameters that can be adjusted by the user to tune how node pairs are sampled and selected for comparison.

Controlling Scope of Results

For similarity comparisons we may also want to control the number of results returned so that only the most relevant node pairs are considered. Both Node Similarity and KNN have a topK parameter to limit the number of similarity comparisons returned per node. With Node Similarity there is also the capability to limit the results globally, as opposed to just on a per-node basis.

Applied Example with KNN

Create a projection

CALL gds.graph.project('proj', ['Movie', 'Person'], {
    ACTED_IN:{orientation:'UNDIRECTED'},
    DIRECTED:{orientation:'UNDIRECTED'}
});

Run FastRP in mutate mode so the embeddings will be saved in the projection

CALL gds.fastRP.mutate('proj',  {
    embeddingDimension:64,
    randomSeed:7474,
    mutateProperty:'embedding'
})

After that we can run similarity. We will use the default cosine metric.

CALL gds.knn.stream('proj', {nodeLabels:['Person'], nodeProperties:['embedding'], topK:1})
YIELD  node1, node2, similarity
RETURN gds.util.asNode(node1).name AS actorName1,
    gds.util.asNode(node2).name AS actorName2,
    similarity
LIMIT 10

Here is a sample out of the 10

actorName1              actorName2              similarity
"François Lallement"    "Gian Maria Volontè"    0.701564371585846
"Jules-Eugène Legris"   "Giacomo Baessato"      0.7108281850814819

Similarity Functions

In addition to the node similarity and KNN algorithms, GDS also provides a set of functions that can be used to calculate similarity between two arrays of numbers using various similarity metrics including jaccard, overlap, pearson, cosine similarity and a few others.

Machine Learning Overview

GDS focuses on offering managed pipelines for end-to-end ML workflows. Data selection, feature engineering, data splitting, hyperparameter configuration, and training steps are coupled together within the pipeline object to track the end-to-end steps needed.

There are currently two supported types of ML pipelines:

  • Node Classification Pipelines: Supervised binary and multi-class classification for nodes

  • Link Prediction Pipelines: Supervised prediction for whether a relationship or “link” should exist between pairs of nodes

Node Classification Pipeline

Node Classification Pattern in GDS

image

In practice, steps 1-6 will be executed automatically by the pipeline. We will just be responsible for providing configuration and hyperparameters for them. So at a high level, the workflow for node classification will look like the following (it is the same for link prediction as well):

  1. Project a graph and configure the pipeline (the order doesn’t matter).

  2. Execute the pipeline with a train command.

  3. Predict on a projected graph with the predict command. The predictions can then be written back to the database if desired using graph write operations.

GDS currently offers a binary classifier where the target is a 0-1 indicator, 0 for no link, 1 for a link. This type of link prediction works really well on an undirected graph where you are predicting one type of relationship between nodes of a single label, such as for social network and entity resolution problems.

Link Prediction Pattern in GDS

image

There is an additional feature-input set in the relationship splits, which now comes before the node property and feature generation steps. In short, this is to handle data leakage issues, whereby model features are calculated using the relationships you are trying to predict. Such a situation would allow the model to use information in the features that would normally not be available, resulting in overly optimistic performance metrics.

image

Creating a pipeline in code seems to involve a lot of steps, but it’s something I need to explore further by myself.


That is all for today!

See you tomorrow :)