(Day 272) A bit more of Graph Algorithms for Data Science

Ivan Ivanov · September 29, 2024

theory gnn

Hello :) Today is Day 272!

A quick summary of today:

read a bit more of Graph Algorithms for Data Science

Ch. 2 Representing network structure: Designing your first graph model

2.1 Graph terminology

Directed vs. undirected graph ((A) represents a directed graph; (B) represents an undirected graph)

Weighted vs. unweighted graphs ((A) represents an unweighted graph; (B) represents a weighted graph)

Bipartite vs. monopartite graphs ((A) represents a monopartite graph; (B) represents a bipartite graph)

Multigraph vs. simple graph ((A) represents a simple graph; (B) represents a multigraph)

A complete graph (a graph in which each node is connected to all the other nodes)

2.2 Network representations

This section introduced Cypher’s syntax for text-based network representations, used to describe nodes and relationships. Nodes are enclosed in parentheses with labels (e.g., Person) and properties (e.g., {name: “Thomas”})

Example Cypher representation of a work relationship:

(:Person {name:"Tomaz"})-[:WRITES_FOR {since: 2020}]->(:Organization {name:"Manning Publications"})

Labeled-Property Graph Model

The labeled-property graph (LPG) model consists of nodes with labels and properties, and directed relationships. For example

(:Person {name:"Tomaz"})<-[:WRITES_FOR {since:2020}]-(:Organization {name:"Manning Publications"})

In bidirectional relationships like friendship, direction is less important:

(:Person {name:"Thomas"})-[:FRIEND {since: 2016}]->(:Person {name:"Elaine"})

2.3 Designing your first labeled-property graph model

Now, imagine a client asks to perform a network analysis of Twitter. They’ll provide all the necessary data, and my job is to represent it as a graph and extract insights through network analysis. I’ll use the LPG model to represent Twitter. A good approach is to work backward, starting with the questions I want to answer. In this case, no specific questions were provided, so it’s up to me as the network scientist to uncover as many insights as possible. To begin, I can define the key elements of the Twitter domain, such as:

Users can follow other users (follower network).
Users can publish tweets (user-tweet network).
Users can retweet posts from others (retweet network).

Follower network

On Twitter, you have the option to follow other users. By following users, you are subscribing to their activity and indicating that you would like to see their tweets on your feed. The follower-network specification is as follows: A user can follow other users.

There is no correct or perfect way to design a graph model, but some models can be useful for a specific scenario. When designing a graph model, try to answer the following questions:

How many different types of nodes are present?
What kind of properties do these nodes have?
Which property would you use to store the unique identifier of nodes?
What type of relationships are present?
Is a single relationship type enough to accurately describe your domain?
Does the relationship direction hold any semantic value?
How do properties qualify or quantify relationships?

In the follower-network specification, both the subject and the object of the sentence are users and can be represented as nodes. Relationships can be used to represent the verb of the specification sentence. Here, it makes sense to represent a follow interaction as a relationship between two users. For instance:

(:User{id:"Vanessa", registeredAt:"2019-03-05"})-[:FOLLOWS{since:"2020-01-01"}]->(:User{id:"Thomas", registeredAt:"2011-03-05"})

User-tweet network

Tweets are the primary way to share content on Twitter. The simplest description of the user-tweet network is as follows: A user can publish a tweet.

For instance:

(:User)-[:PUBLISH]->(:Tweet)

We can add PUBLISHED_BY relationship too

And also LIKE

Retweet network

The only remaining task specification is the definition of a graph model for retweets. When users strongly react to a tweet, they might want to share it with their followers to amplify its reach. In this case, they have the option of retweeting the original tweet. Optionally, users can like the retweet, and those likes do not count toward the original tweet. The task specification is defined as follows: A user can retweet posts from other users.

(:User)-[:PUBLISH]->(:Tweet)<-[:RETWEETS]-(:User)

Representing graph schema

The beauty of the graph approach to data modeling is that we can connect new information to an existing graph. We can now combine all the graph model decisions so far into a single graph model.

The above is a labeled-property graph representing a Twitter network, where users can follow one another as well as publish and retweet posts

2.4 Extracting knowledge from text

Links

The first information we can extract from tweet content is any links included in the tweet

We could either store the URL as a tweet node property or store it as a separate node and add a relationship between the link and the tweet

Considerations:

Standardization: If values like links in tweets are not standardized, storing them as separate nodes can reduce traversal efficiency. Standardized values allow faster queries and should be represented as single nodes in the graph
Node Representation: A real-world entity should only be represented by one node. Storing non-standardized data as multiple nodes (e.g., the same website as different nodes) can lead to invalid results
Super Nodes: Unspecific information, like gender, is better stored as node properties to avoid “super nodes”—nodes connected to a large portion of the graph—which can negatively impact query performance

Hashtags

People use the hashtag symbol (#) before a relevant keyword or phrase in their tweet to categorize them. A tweet can contain many hashtags; it makes sense to store them as separate nodes and connect tweets to them

An important thing to remember is to avoid generic relationship types. For example HAS. HAS can relate to many relationships so it is best to be more descriptive.

Mentions

A user can mention other users in their tweet by using the mention symbol (@). A mention can be understood as an invitation to comment or a callout, while at other times, it can be used to notify users to look at specific content. We already have users defined in the graph schema, so it only makes sense to connect tweets to mentioned users

Like the hashtag network, the mention network is also a classic bipartite network.

Final Twitter social network schema

Designing a graph model schema is an iterative process where additional data enriches the graph over time. A self-describing graph model is recommended to avoid the need for a separate schema manual. The schema may evolve based on query performance needs.

For example, in a Twitter network model, nodes include users, tweets, hashtags, and links, with six types of relationships. Inferred relationships should be left out until they can be instantiated.

A key theme in network analysis is converting indirect relationships to direct ones, such as turning retweet links into direct amplification relationships between users. Monopartite projections are often used for applying graph algorithms to networks.

Ch. 3 Your First Steps With Cypher Query Language

As the title suggests this was an introduction to Cypher - the language used to interact with Neo4j. I just read this part, as I have already used Neo4j, but also covered Neo4j’s official Intro to Cypher course

Here is a summary of the chapter:

Cypher syntax uses parentheses, (), to encapsulate a node.
Cypher syntax uses square brackets, [], to encapsulate a relationship.
A relationship cannot exist on its own but needs to be described with adjacent nodes.
In Neo4j, all relationships are stored as directed, although you can ignore the direction at query time.
It is not advisable to ever create any nodes without node labels.
The CREATE clause is used to create data.
The MATCH clause is used to identify existing patterns in the database.
The WHERE clause can be used in combination with MATCH or WITH clauses to specify various filters.
If a MATCH clause doesn’t find the specified graph pattern, the whole Cypher statement returns no results.
You can use OPTIONAL MATCH when you are not certain whether a graph pattern exists in the database, but you still want to return other variables in the query output.
The WITH clause can be used to filter, aggregate, select, paginate, or limit intermediate rows in the Cypher statement.
The SET clause is used to add node properties and labels.
The REMOVE clause is used to remove node properties and labels.
The DELETE clause is used to delete nodes and relationships.
If you want to delete a node with existing relationship patterns, you need to use the DETACH DELETE clause.
The MERGE clause is a combination of MATCH and CREATE clauses that ensures the specified graph pattern exists in the database.
The MERGE clause is frequently used for data import, as it allows for idempotent queries and automatic deduplication.
In Neo4j, you can define unique constraints that ensure unique values for the specified node property of a particular node label.
It is recommended to import nodes and relationships separately into the database to improve performance and data quality.
You can easily evaluate the graph schema of existing data with the db.schema.visualization() procedure.

Tomorrow I will continue with this book. And also will be my 1st stream on youtube 🥳 from 6PM KST

That is all for today!

See you tomorrow :)