(Day 274) Finished the Graph Algorithms for Data Science book + stream

Ivan Ivanov · October 1, 2024

theory gnn applying-knowledge

Hello :) Today is Day 274!

A quick summary of today:

covered the last chapters of the graph algs for DS book
did a stream from 6 to ~7.30pm here

Ch. 10 Link prediction (link)

Over time people might make new friends and cut ties with some. In the above case, Luke and Ravij might become friends given their common friend Surya. This is one case of a link prediction tasks.

Another case would be in drugs and symptoms

10.1 Link Prediction Workflow

Networks often follow certain organizing principles, either intentionally or unintentionally, which can be leveraged to predict the likelihood of new links forming between node pairs. For instance, in a social network, similar-age people might be more likely to become friends. However, solely considering personal characteristics can be limiting, as relationships may depend on additional factors like shared social circles. Nodes with more common connections (ie. friends) or higher degrees of connectedness are more likely to form links.

Key Approaches to Link Prediction:

Node Property Combination: This involves using attributes like the age difference between nodes, node degrees, or cosine similarity of node embeddings. The properties can be concatenated or otherwise combined to improve prediction
Distance-Based Metrics: Features such as the shortest path between nodes (number of hops) indicate how close two nodes are, with shorter distances suggesting a higher probability of a future connection
Neighborhood Overlap: The more overlapping connections two nodes have, the higher the likelihood of them forming a link. For example, common friends can be quantified using the Jaccard similarity index

Prediction Methods:

Unsupervised Approach: Calculate link prediction metrics between node pairs and rank the most likely future links
Classification Model: Train a classification model on link prediction metrics. This method can identify patterns and assign weights to different metrics, capturing more complex relationships between node pairs that might be missed in an unsupervised approach

Going back to being a DS at Twitch. If two channels share a significant audience, this information is used to provide recommendations to users in the “Users of This Channel Also Watch” section. One way to improve these recommendations is by predicting future audience overlap between channels and using these predictions to enhance recommendations. This would allow recommending both channels with existing overlaps and channels likely to share audiences in the future.

The network of Twitch channels can be represented as a graph, where edges indicate audience overlap. Solid lines represent existing shared audiences and are used to populate current recommendations. As a data scientist, you can leverage these relationships to predict future audience overlap, which would then power the recommendation engine.

10.2 Dataset split

Time-based split

Random split

Negative samples

When training a binary classifier like the link prediction model, we should include both positive and negative examples in the training and test sets. Without negative examples, the model cannot learn to differentiate between the two outputs and might produce inaccurate predictions.

For the coding exercises

Create a feature set

run_query("""
MATCH (s1:Stream)-[:SHARED_AUDIENCE]->(s2:Stream)
WITH s1, s2
WHERE rand() <= 0.9
MERGE (s1)-[:FEATURE_REL]->(s2);
""")

Create train/test set from the remaining 10%

train_test_size = run_query("""
MATCH (s1)-[:SHARED_AUDIENCE]->(s2)
WHERE NOT EXISTS {(s1)-[:FEATURE_REL]->(s2)}
MERGE (s1)-[r:TEST_TRAIN]->(s2)
RETURN count(r) AS result;
""")

Create negative train/test

run_query("""
MATCH (s1:Stream),(s2:Stream)
WHERE NOT EXISTS {(s1)-[:SHARED_AUDIENCE]-(s2)}
      AND s1 < s2
      AND rand() > 0.9
WITH s1,s2
LIMIT 13082
MERGE (s1)-[:NEGATIVE_TEST_TRAIN]->(s2);
""")

10.3 Network feature engineering

Next, we will produce network features that capture the closeness or similarity of pairs of nodes in the network. The idea is that the closer or more similar a pair of nodes are given the network metrics, the more likely they are to form a future connection. The future connections will then be used to provide better recommendations to Twitch users.

Network distance

The idea behind the network distance is that the closer the two nodes are in the network, the more likely they are to form future connections

Calculating the network distance between pairs of nodes in the train and test sets

run_query("""
MATCH (s1)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2)    #1
MATCH p = shortestPath((s1)-[:FEATURE_REL*]-(s2))      #2
WITH r, length(p) AS networkDistance                   #3
SET r.networkDistance = networkDistance                #4
""")

Matches all the pairs of nodes connected with the TEST_TRAIN or NEGATIVE_TEST_TRAIN relationships
Identifies the shortest path between the pairs of nodes. The shortest path is constrained to be only allowed to traverse the FEATURE_REL relationships to prevent data leakage.
Calculates the length of the shortest paths using the length() function
Stores the network distance result as the relationship property

Preferential attachment

Another popular metric used in link prediction is preferential attachment. Preferential attachment is an underlying organizing principle occurring in real-world networks, where nodes with a higher number of relationships are more likely to make new relationships

run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2)
WITH r, count{ (s1)-[:FEATURE_REL]-() } *
        count{ (s2)-[:FEATURE_REL]-() } AS preferentialAttachment
SET r.preferentialAttachment = preferentialAttachment
""")

Common neighbors

The more common neighbors two nodes have, the higher the chance is of a link forming in the future

run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2)
OPTIONAL MATCH (s1)-[:FEATURE_REL]-(neighbor)-[:FEATURE_REL]-(s2)
WITH r, count(distinct neighbor) AS commonNeighbor
SET r.commonNeighbor = commonNeighbor
""")

Adamic-Adar index

The idea behind the Adamic-Adar index is that the smaller the node degree common neighbors between a pair of nodes have, the more likely it is that they will form a connection in the future

In case B, the nodes A and B are more likely to become friends, than in case A.

run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2:Stream)
OPTIONAL MATCH (s1)-[:FEATURE_REL]-(neighbor)-[:FEATURE_REL]-(s2)
WITH r, collect(distinct neighbor) AS commonNeighbors             
UNWIND commonNeighbors AS cn
WITH r, count{ (cn)-[:FEATURE_REL]-() } AS neighborDegree         
WITH r, sum(1 / log(neighborDegree)) AS adamicAdar                
SET r.adamicAdar = adamicAdar;                                    
""")

Clustering coefficient of common neighbors

A clustering coefficient measures the connectedness of the neighbors of a particular node, with the values ranging from 0 to 1. A value of 0 indicates the neighboring nodes have no connections with each other. On the other hand, a value of 1 indicates the network of neighbors forms a complete graph, where all the neighbors are connected.

run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2:Stream)
OPTIONAL MATCH (s1)-[:FEATURE_REL]-(neighbor)-[:FEATURE_REL]-(s2)
WITH r, collect(distinct neighbor) AS commonNeighbors,
        count(distinct neighbor) AS commonNeighborCount             #1
OPTIONAL MATCH (x)-[cr:FEATURE_REL]->(y)                            #2
WHERE x IN commonNeighbors AND y IN commonNeighbors
WITH r, commonNeighborCount, count(cr) AS commonNeighborRels
WITH r, CASE WHEN commonNeighborCount < 2 THEN 0 ELSE               #3
  toFloat(commonNeighborRels) / (commonNeighborCount *
                 (commonNeighborCount - 1) / 2) END as clusteringCoefficient
SET r.clusteringCoefficient = clusteringCoefficient                 #4
""")

Identifies and counts the number of common neighbors between a pair of nodes
Identifies all existing relationships between common neighbors
Calculates the clustering coefficient by dividing the number of existing relationships by the number of potential relationships
Stores the results as a relationship property

After all the features, here is an example link with its features

10.4 Link prediction classification model

Retrieving link prediction features and class output

data = run_query("""
MATCH (s1)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2)
WITH r.networkDistance AS networkDistance,
     r.preferentialAttachment AS preferentialAttachment,
     r.commonNeighbor AS commonNeighbor,
     r.adamicAdar AS adamicAdar,
     r.clusteringCoefficient AS clusteringCoefficient,
     CASE WHEN r:TEST_TRAIN THEN 1 ELSE 0 END as output
RETURN networkDistance, preferentialAttachment, commonNeighbor,
       adamicAdar, clusteringCoefficient, output
""")

The output column is is a 0 or 1. Positive examples are tagged with the TEST_TRAIN relationship type and are represented with a value of 1, while the negative examples are marked with NEGATIVE_TEST_TRAIN and are represented as 0.

Splitting the train and test sets and training the link prediction model

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
 
X = data.drop("output", axis=1)
y = data["output"].to_list()
 
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=0)
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

Generating the classification report

Evaluating feature importance

Ch. 11 Knowledge graph completion

The difference between link prediction and completion is that the first is a workflow to predict future links, while the latter deals with predicting missing links

The link prediction features used in chapter 10 do not differentiate between various node or relationship types. For example, the number of common neighbors does not differentiate between different relationship or node types. Therefore, the link prediction features used in chapter 10 work best with monopartite or homogeneous graphs

11.1 Knowledge graph embedding model

Triple

Knowledge graph embedding models use triples to describe graphs. A triple consists of two nodes, known as a head (h) and tail (t), and a labeled-directed relationship (r).

TransE

The objective of the TransE method is to calculate low-dimensional vector representations, also known as embeddings, for all the nodes and relationships in the graph. The TransE method is frequently used to demonstrate knowledge graph embeddings, as it is simple to illustrate and relatively cheap to calculate.

The TransE method tries to produce embeddings so that for every triple in the training set, it minimizes the distance between the sum of the head and the relationship to the tail embedding. This optimization score can be written as h + r ≈ t. On the other hand, if a relationship between the head and tail node does not exist, then the sum of head and relation embedding should not be close to the tail (h + r != t).

Limitations

Symmetric Relationships TransE struggles with symmetric relationships, where two entities are mutually related. For example:
- Tomaž, SIBLING, Blaž and Blaž, SIBLING, Tomaž.

In TransE, relationships are represented as vectors. The issue arises because the same relationship vector cannot point in opposite directions, meaning that TransE cannot properly encode relationships that are symmetric by nature.

Composition Relationships TransE can handle composition relationships, where one relationship can be derived by combining others. For example:
- John, MOTHER, Alicia, Alicia, SPOUSE, Michael, leads to John, FATHER, Michael.

In this case, the vectors for these relationships can be composed to represent such patterns, so TransE supports composite relations.

1-to-N Relationships TransE has significant limitations with 1-to-N relationships, where one entity is related to multiple others, like:
- Surya, FRIEND, Rajiv and Surya, FRIEND, Jane.

To tackle these limitations, later PairRE was developed and it is capable of encoding symmetry, composition, and 1-to-N relationships.

Streamed today

This is a resource that I need to read again - Accelerate Fraud Detection with Graph Databases

And I covered part 1 of this blog series going over using neo4j for fraud detection

That is all for today!

See you tomorrow :)