Hello :) Today is Day 274!
A quick summary of today:
- covered the last chapters of the graph algs for DS book
- did a stream from 6 to ~7.30pm here
Ch. 10 Link prediction (link)
Over time, people might make new friends and cut ties with others. In the example above, Luke and Rajiv might become friends given their common friend Surya. This is one example of a link prediction task.
Another example would be predicting links between drugs and symptoms
10.1 Link Prediction Workflow
Networks often follow certain organizing principles, either intentionally or unintentionally, which can be leveraged to predict the likelihood of new links forming between node pairs. For instance, in a social network, similar-age people might be more likely to become friends. However, solely considering personal characteristics can be limiting, as relationships may depend on additional factors like shared social circles. Nodes with more common connections (i.e., friends) or higher degrees of connectedness are more likely to form links.
Key Approaches to Link Prediction:
- Node Property Combination: This involves using attributes like the age difference between nodes, node degrees, or cosine similarity of node embeddings. The properties can be concatenated or otherwise combined to improve prediction
- Distance-Based Metrics: Features such as the shortest path between nodes (number of hops) indicate how close two nodes are, with shorter distances suggesting a higher probability of a future connection
- Neighborhood Overlap: The more overlapping connections two nodes have, the higher the likelihood of them forming a link. For example, common friends can be quantified using the Jaccard similarity index
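The Jaccard similarity mentioned above is easy to sketch in plain Python. This is a minimal illustration (the friend names are made up for the example):

```python
# Jaccard similarity of two nodes' neighbor sets:
# |intersection| / |union|, ranging from 0 (no overlap) to 1 (identical).
def jaccard_similarity(neighbors_a: set, neighbors_b: set) -> float:
    union = neighbors_a | neighbors_b
    if not union:  # both nodes have no neighbors
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)

# One shared friend (Surya) out of three distinct friends overall.
print(jaccard_similarity({"Surya", "Ana"}, {"Surya", "Maria"}))  # 1/3 ≈ 0.333
```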
Prediction Methods:
- Unsupervised Approach: Calculate link prediction metrics between node pairs and rank the most likely future links
- Classification Model: Train a classification model on link prediction metrics. This method can identify patterns and assign weights to different metrics, capturing more complex relationships between node pairs that might be missed in an unsupervised approach
Going back to being a DS at Twitch. If two channels share a significant audience, this information is used to provide recommendations to users in the “Users of This Channel Also Watch” section. One way to improve these recommendations is by predicting future audience overlap between channels and using these predictions to enhance recommendations. This would allow recommending both channels with existing overlaps and channels likely to share audiences in the future.
The network of Twitch channels can be represented as a graph, where edges indicate audience overlap. Solid lines represent existing shared audiences and are used to populate current recommendations. As a data scientist, you can leverage these relationships to predict future audience overlap, which would then power the recommendation engine.
Building a link prediction model to recommend Twitch channels
10.2 Dataset split
- Time-based split
- Random split
- Negative samples
When training a binary classifier like the link prediction model, we should include both positive and negative examples in the training and test sets. Without negative examples, the model cannot learn to differentiate between the two outputs and might produce inaccurate predictions.
For the coding exercises
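The snippets below all call a `run_query` helper. The book defines its own version; this is just a minimal sketch of what such a helper could look like, assuming the official `neo4j` driver and `pandas` are installed, and with a placeholder URI and credentials:

```python
# Hypothetical sketch of a run_query helper for the Cypher snippets below.
# The bolt URI, username, and password are placeholder assumptions.
def run_query(query, params=None):
    """Run a Cypher statement against a local Neo4j instance and
    return the results as a pandas DataFrame."""
    import pandas as pd
    from neo4j import GraphDatabase  # imported lazily so the helper is easy to stub

    with GraphDatabase.driver(
        "bolt://localhost:7687", auth=("neo4j", "password")
    ) as driver, driver.session() as session:
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())
```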
- Create a feature set
run_query("""
MATCH (s1:Stream)-[:SHARED_AUDIENCE]->(s2:Stream)
WITH s1, s2
WHERE rand() <= 0.9
MERGE (s1)-[:FEATURE_REL]->(s2);
""")
- Create train/test set from the remaining 10%
train_test_size = run_query("""
MATCH (s1)-[:SHARED_AUDIENCE]->(s2)
WHERE NOT EXISTS {(s1)-[:FEATURE_REL]->(s2)}
MERGE (s1)-[r:TEST_TRAIN]->(s2)
RETURN count(r) AS result;
""")
- Create negative train/test
run_query("""
MATCH (s1:Stream),(s2:Stream)
WHERE NOT EXISTS {(s1)-[:SHARED_AUDIENCE]-(s2)}
AND id(s1) < id(s2) // avoid generating each pair twice
AND rand() > 0.9
WITH s1,s2
LIMIT 13082
MERGE (s1)-[:NEGATIVE_TEST_TRAIN]->(s2);
""")
10.3 Network feature engineering
Next, we will produce network features that capture the closeness or similarity of pairs of nodes in the network. The idea is that the closer or more similar a pair of nodes are given the network metrics, the more likely they are to form a future connection. The future connections will then be used to provide better recommendations to Twitch users.
Network distance
The idea behind the network distance is that the closer the two nodes are in the network, the more likely they are to form future connections
Calculating the network distance between pairs of nodes in the train and test sets
run_query("""
MATCH (s1)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2) // 1
MATCH p = shortestPath((s1)-[:FEATURE_REL*]-(s2)) // 2
WITH r, length(p) AS networkDistance // 3
SET r.networkDistance = networkDistance // 4
""")
- Matches all the pairs of nodes connected with the TEST_TRAIN or NEGATIVE_TEST_TRAIN relationships
- Identifies the shortest path between the pairs of nodes. The shortest path is constrained to be only allowed to traverse the FEATURE_REL relationships to prevent data leakage.
- Calculates the length of the shortest paths using the length() function
- Stores the network distance result as the relationship property
Preferential attachment
Another popular metric used in link prediction is preferential attachment. Preferential attachment is an underlying organizing principle occurring in real-world networks, where nodes with a higher number of relationships are more likely to make new relationships
run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2)
WITH r, count{ (s1)-[:FEATURE_REL]-() } *
count{ (s2)-[:FEATURE_REL]-() } AS preferentialAttachment
SET r.preferentialAttachment = preferentialAttachment
""")
Common neighbors
The more common neighbors two nodes have, the higher the chance is of a link forming in the future
run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2)
OPTIONAL MATCH (s1)-[:FEATURE_REL]-(neighbor)-[:FEATURE_REL]-(s2)
WITH r, count(distinct neighbor) AS commonNeighbor
SET r.commonNeighbor = commonNeighbor
""")
Adamic-Adar index
The idea behind the Adamic-Adar index is that the smaller the node degree common neighbors between a pair of nodes have, the more likely it is that they will form a connection in the future
In case B, nodes A and B are more likely to become friends than in case A, because their common neighbor has a lower degree.
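Before the Cypher version, here is a toy Python sketch of the Adamic-Adar computation (the graph and node names are made up for illustration):

```python
import math

# Adamic-Adar index: sum over common neighbors of 1 / log(degree of that
# neighbor). Low-degree common neighbors contribute more to the score.
def adamic_adar(graph: dict, a: str, b: str) -> float:
    common = set(graph[a]) & set(graph[b])
    return sum(1 / math.log(len(graph[cn])) for cn in common)

# Hypothetical graph as an adjacency dict.
graph = {
    "A": ["x", "y"],
    "B": ["x", "y"],
    "x": ["A", "B"],            # degree 2: strong signal
    "y": ["A", "B", "C", "D"],  # degree 4: weaker signal
    "C": ["y"],
    "D": ["y"],
}
print(adamic_adar(graph, "A", "B"))  # 1/log(2) + 1/log(4) ≈ 2.164
```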
run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2:Stream)
OPTIONAL MATCH (s1)-[:FEATURE_REL]-(neighbor)-[:FEATURE_REL]-(s2)
WITH r, collect(distinct neighbor) AS commonNeighbors
UNWIND commonNeighbors AS cn
WITH r, count{ (cn)-[:FEATURE_REL]-() } AS neighborDegree
WITH r, sum(1 / log(neighborDegree)) AS adamicAdar
SET r.adamicAdar = adamicAdar;
""")
Clustering coefficient of common neighbors
A clustering coefficient measures the connectedness of the neighbors of a particular node, with the values ranging from 0 to 1. A value of 0 indicates the neighboring nodes have no connections with each other. On the other hand, a value of 1 indicates the network of neighbors forms a complete graph, where all the neighbors are connected.
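The formula behind this is the number of existing links among the neighbors divided by the k·(k-1)/2 possible links. A small illustrative sketch (node names are made up):

```python
from itertools import combinations

# Clustering coefficient of a set of neighbors: existing links among them
# divided by the k*(k-1)/2 links a complete graph would have.
def clustering_coefficient(neighbors: list, edges: set) -> float:
    k = len(neighbors)
    if k < 2:  # fewer than two neighbors: no possible links
        return 0.0
    links = sum(
        1 for u, v in combinations(neighbors, 2)
        if (u, v) in edges or (v, u) in edges
    )
    return links / (k * (k - 1) / 2)

# Three common neighbors with one link among them: 1 of 3 possible links.
print(clustering_coefficient(["x", "y", "z"], {("x", "y")}))  # 1/3 ≈ 0.333
```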
run_query("""
MATCH (s1:Stream)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2:Stream)
OPTIONAL MATCH (s1)-[:FEATURE_REL]-(neighbor)-[:FEATURE_REL]-(s2)
WITH r, collect(distinct neighbor) AS commonNeighbors,
count(distinct neighbor) AS commonNeighborCount // 1
OPTIONAL MATCH (x)-[cr:FEATURE_REL]->(y) // 2
WHERE x IN commonNeighbors AND y IN commonNeighbors
WITH r, commonNeighborCount, count(cr) AS commonNeighborRels
WITH r, CASE WHEN commonNeighborCount < 2 THEN 0 ELSE // 3
toFloat(commonNeighborRels) / (commonNeighborCount *
(commonNeighborCount - 1) / 2) END AS clusteringCoefficient
SET r.clusteringCoefficient = clusteringCoefficient // 4
""")
- Identifies and counts the number of common neighbors between a pair of nodes
- Identifies all existing relationships between common neighbors
- Calculates the clustering coefficient by dividing the number of existing relationships by the number of potential relationships
- Stores the results as a relationship property
After computing all the features, here is an example link with its feature values
10.4 Link prediction classification model
- Retrieving link prediction features and class output
data = run_query("""
MATCH (s1)-[r:TEST_TRAIN|NEGATIVE_TEST_TRAIN]->(s2)
WITH r.networkDistance AS networkDistance,
r.preferentialAttachment AS preferentialAttachment,
r.commonNeighbor AS commonNeighbor,
r.adamicAdar AS adamicAdar,
r.clusteringCoefficient AS clusteringCoefficient,
CASE WHEN r:TEST_TRAIN THEN 1 ELSE 0 END AS output
RETURN networkDistance, preferentialAttachment, commonNeighbor,
adamicAdar, clusteringCoefficient, output
""")
The output column is a 0 or 1. Positive examples are tagged with the TEST_TRAIN relationship type and are represented with a value of 1, while negative examples are marked with NEGATIVE_TEST_TRAIN and represented as 0.
- Splitting the train and test sets and training the link prediction model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X = data.drop("output", axis=1)
y = data["output"].to_list()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0)
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
- Generating the classification report
- Evaluating feature importance
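These two steps can be sketched as follows. This is an illustrative, self-contained version using a tiny synthetic dataset; with the real data you would reuse `rfc`, `X_test`, and `y_test` from the training snippet above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the link prediction features (made-up values).
X = pd.DataFrame({
    "networkDistance": [1, 1, 2, 4, 5, 6, 1, 5, 2, 6],
    "commonNeighbor":  [9, 8, 7, 0, 1, 0, 9, 0, 8, 1],
})
y = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)

# 1. Classification report: precision, recall, and F1 per class.
print(classification_report(y_test, rfc.predict(X_test)))

# 2. Feature importance: how much each feature contributed to the splits.
for feature, importance in zip(X.columns, rfc.feature_importances_):
    print(f"{feature}: {importance:.3f}")
```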
Ch. 11 Knowledge graph completion
The difference between link prediction and knowledge graph completion is that the former is a workflow to predict future links, while the latter deals with predicting missing links.
The link prediction features used in chapter 10 do not differentiate between various node or relationship types. For example, the number of common neighbors does not differentiate between different relationship or node types. Therefore, the link prediction features used in chapter 10 work best with monopartite or homogeneous graphs
11.1 Knowledge graph embedding model
Triple
Knowledge graph embedding models use triples to describe graphs. A triple consists of two nodes, known as a head (h) and tail (t), and a labeled-directed relationship (r).
TransE
The objective of the TransE method is to calculate low-dimensional vector representations, also known as embeddings, for all the nodes and relationships in the graph. The TransE method is frequently used to demonstrate knowledge graph embeddings, as it is simple to illustrate and relatively cheap to calculate.
The TransE method tries to produce embeddings so that, for every triple in the training set, the sum of the head and relationship embeddings is close to the tail embedding. This objective can be written as h + r ≈ t. Conversely, if a relationship between a head and tail node does not exist, the sum of the head and relationship embeddings should be far from the tail embedding (h + r ≠ t).
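The scoring idea is simple enough to sketch in a few lines of Python. The embedding values below are made up; in practice they would be learned by minimizing this distance over the training triples:

```python
import numpy as np

# TransE scoring sketch: a triple (h, r, t) is plausible when the
# distance ||h + r - t|| is small.
def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """L2 distance of h + r from t; lower means more plausible."""
    return float(np.linalg.norm(h + r - t))

h = np.array([0.1, 0.3])         # head entity embedding (hypothetical)
r = np.array([0.5, -0.1])        # relationship embedding (hypothetical)
t_true = np.array([0.6, 0.2])    # tail satisfying h + r ≈ t
t_false = np.array([-0.9, 0.8])  # unrelated tail

print(transe_score(h, r, t_true))   # ≈ 0.0
print(transe_score(h, r, t_false))  # ≈ 1.616
```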
Limitations
- Symmetric Relationships
TransE struggles with symmetric relationships, where two entities are mutually related. For example:
- Tomaž, SIBLING, Blaž and Blaž, SIBLING, Tomaž.
In TransE, relationships are represented as vectors. The issue arises because the same relationship vector cannot point in opposite directions, meaning that TransE cannot properly encode relationships that are symmetric by nature.
- Composition Relationships
TransE can handle composition relationships, where one relationship can be derived by combining others. For example:
- John, MOTHER, Alicia and Alicia, SPOUSE, Michael lead to John, FATHER, Michael.
In this case, the relationship vectors can be composed (added together) to represent such patterns, so TransE supports composition relations.
- 1-to-N Relationships
TransE has significant limitations with 1-to-N relationships, where one entity is related to multiple others, like:
- Surya, FRIEND, Rajiv and Surya, FRIEND, Jane.
Because h + r produces a single fixed vector, all of Surya's friends would be pushed toward nearly identical tail embeddings, even though they are distinct entities.
To tackle these limitations, PairRE was later developed; it is capable of encoding symmetric, composition, and 1-to-N relationships.
Streamed today
This is a resource that I need to read again - Accelerate Fraud Detection with Graph Databases
And I covered part 1 of this blog series, going over using Neo4j for fraud detection
That is all for today!
See you tomorrow :)