(Day 259) ML monitoring pipelines + Going deeper into neo4j

Ivan Ivanov · September 16, 2024

Hello :) Today is Day 259!

A quick summary of today:

5.1. Introduction to data and ML pipeline testing

When to perform testing

The ML lifecycle involves many steps that require testing to ensure our ML models function properly. Critical areas to test include the following stages of the ML lifecycle:

  • During the feature engineering stage: testing the input data quality as it affects the whole pipeline.

  • During model training (or retraining): model quality checks.

  • During model serving: validating incoming data and model outputs.

  • During performance monitoring: continuously testing the model quality to detect and resolve potential issues.

How to perform testing

There are different types of checks you can use to test data and ML pipelines:

Individual tests

A test is a metric with a condition. You can perform a certain evaluation or measurement on top of a data batch and compare it against a threshold or expectation. You can formulate almost anything as a test: assertions on feature values, expectations about model quality on a specific segment, etc. Whatever you can measure, you can design as a test.

Tests can be column-level (when metrics are calculated for a specific feature or column) or dataset-level (in this case, you calculate metrics for the whole dataset).
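For instance, a minimal sketch using Evidently's test API (as used throughout this module; current_df is a placeholder DataFrame, age a hypothetical column, and the thresholds are illustrative):

from evidently.test_suite import TestSuite
from evidently.tests import TestColumnShareOfMissingValues, TestNumberOfRows

suite = TestSuite(tests=[
    TestColumnShareOfMissingValues(column_name='age', lte=0.1),  # column-level
    TestNumberOfRows(gte=1000),                                  # dataset-level
])
suite.run(reference_data=None, current_data=current_df)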

Test suites

Individual tests can be grouped into test suites. For each test in a test suite, you can define test criticality and set alerting conditions: for example, based on the number of failed critical tests.

When you create a test, you must define the test conditions. There are two main strategies you can use for establishing test conditions:

Reference-based conditions

You can use a reference dataset to derive conditions automatically rather than set them manually for each individual test. This works well for certain types of checks, such as testing column types (easy to derive from a reference example), and for ad hoc testing, such as when you import a new batch of data and want to explore the test results visually right away. However, be careful when designing alerting: auto-generated test conditions are not perfect and may produce false alerts or miss real issues.

Manually defined conditions

With this approach, you specify conditions for each test manually. This method does not require additional data and can be great for encoding specific conditions based on domain expertise.
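A hedged sketch contrasting the two strategies (reference_df/current_df and the threshold are illustrative):

from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMean

# Manually defined condition: an explicit threshold from domain expertise
manual_suite = TestSuite(tests=[TestColumnValueMean(column_name='age', gte=30)])
manual_suite.run(reference_data=None, current_data=current_df)

# Reference-based condition: no threshold given, so Evidently derives the
# expected range for the mean from the reference dataset
auto_suite = TestSuite(tests=[TestColumnValueMean(column_name='age')])
auto_suite.run(reference_data=reference_df, current_data=current_df)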

Test automation

image

Recording test results

image

Example use case

image

image

5.2. Train and evaluate an ML model (code practice)

Using the bank marketing dataset, fetched from OpenML via sklearn:

Bank client data:

  • Age (numeric)
  • Job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
  • Marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’ ; note: ‘divorced’ means divorced or widowed)
  • Education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
  • Default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
  • Balance: average yearly balance, in euros (numeric)
  • Housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
  • Loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
  • Contact: contact communication type (categorical: ‘cellular’,’telephone’)
  • Day: last contact day of the month (numeric)
  • Month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  • Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:

  • Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  • Previous: number of contacts performed before this campaign and for this client (numeric)
  • Poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)

Doing some feature engineering:

import pandas as pd

def feature_engineering(raw_data: pd.DataFrame) -> pd.DataFrame:
    preprocessed_data = raw_data.copy(deep=True)

    preprocessed_data.columns = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 
                              'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'class']

    #client data preprocessing
    preprocessed_data['has_default'] = preprocessed_data.default.apply(
        lambda x : 0 if x == 'no' else 1 if x == 'yes' else -1
    )

    preprocessed_data['has_housing_loan'] = preprocessed_data.housing.apply(
        lambda x : 0 if x == 'no' else 1 if x == 'yes' else -1
    )

    preprocessed_data['has_personal_loan'] = preprocessed_data.loan.apply(
        lambda x : 0 if x == 'no' else 1 if x == 'yes' else -1
    )

    marital_dummies = pd.get_dummies(preprocessed_data.marital, prefix = 'marital')
    preprocessed_data = pd.concat([preprocessed_data, marital_dummies], axis = 1)

    job_dummies = pd.get_dummies(preprocessed_data.job, prefix = 'job')
    preprocessed_data = pd.concat([preprocessed_data, job_dummies], axis = 1)

    edu_dummies = pd.get_dummies(preprocessed_data.education, prefix = 'edu')
    preprocessed_data = pd.concat([preprocessed_data, edu_dummies], axis = 1)

    preprocessed_data.drop(columns = ['default', 'housing', 'loan', 'marital', 'job', 'education'], inplace=True)

    # last contact data preprocessing
    contact_dummies = pd.get_dummies(preprocessed_data.contact, prefix = 'contact_type')
    preprocessed_data = pd.concat([preprocessed_data, contact_dummies], axis = 1)

    month_dummies = pd.get_dummies(preprocessed_data.month, prefix = 'month')
    preprocessed_data = pd.concat([preprocessed_data, month_dummies], axis = 1)   

    preprocessed_data.drop(columns = ['contact', 'month'], inplace=True)
    
    # other attributes preprocessing
    poutcome_dummies = pd.get_dummies(preprocessed_data.poutcome, prefix = 'prev_camp_outcome')
    preprocessed_data = pd.concat([preprocessed_data, poutcome_dummies], axis = 1)
    preprocessed_data.drop(columns = ['poutcome'], inplace=True)

    #target preprocessing
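    # note: the OpenML version of this dataset encodes the class label as '1' (no) / '2' (yes)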
    preprocessed_data['target'] = preprocessed_data['class'].apply(lambda x : 0 if x == '1' else 1)
    preprocessed_data.drop(columns = ['class'], inplace=True)
    
    return preprocessed_data

Then, we split the data into train, reference, and prod simulation sets:

train_data = bank_marketing_data[:5000]

reference_data = bank_marketing_data[5000:7000]

prod_simulation_data = bank_marketing_data[7000:]

The model used is a simple random forest classifier:

from sklearn import ensemble

model = ensemble.RandomForestClassifier(random_state=42, n_estimators=100)
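The fitting step isn't shown in the snippet; a minimal sketch under the assumption that the model is trained on the processed training split, with prediction columns added because the ClassificationPreset report below expects both target and prediction:

processed_train = feature_engineering(train_data)
processed_reference = feature_engineering(reference_data)

# assumes the dummy-encoded columns line up across the two splits
features = processed_train.columns.drop('target')
model.fit(processed_train[features], processed_train['target'])

processed_train['prediction'] = model.predict(processed_train[features])
processed_reference['prediction'] = model.predict(processed_reference[features])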

To create a simple report:

from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

model_quality_report = Report(metrics=[ClassificationPreset()])
model_quality_report.run(reference_data=processed_train, current_data=processed_reference)
model_quality_report.show(mode='inline')

The output report is:

image image image image

5.3. Test input data quality, stability and drift (code practice)

Raw data checks:

from evidently.test_suite import TestSuite
from evidently.test_preset import (
    DataStabilityTestPreset,
    DataQualityTestPreset,
    DataDriftTestPreset,
)

batch_size = 2000  # same batch size as in the automation scripts below

data_stability_suite = TestSuite(tests=[DataStabilityTestPreset()])
data_stability_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
data_stability_suite.show(mode='inline')

image

For any test we can see details:

image

data_quality_suite = TestSuite(tests=[DataQualityTestPreset()])
data_quality_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
data_quality_suite.show(mode='inline')

image

data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
data_drift_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
data_drift_suite.show(mode='inline')

image

Preprocessed data checks

We preprocess the data using our feature_engineering function:

processed_reference_data = feature_engineering(reference_data)
processed_prod_simulation_data = feature_engineering(prod_simulation_data)

Then we run the data drift test preset on the processed data:

data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
data_drift_suite.run(reference_data=processed_reference_data, current_data=processed_prod_simulation_data[:batch_size])
data_drift_suite.show(mode='inline')

image

We can run quick programmatic checks on the test pass rate to drive warnings/alerts, like below:

data_drift_suite.as_dict()['summary']['all_passed'] == True # False
data_drift_suite.as_dict()['summary']['by_status']['SUCCESS'] > 40 # True

5.4. Test ML model outputs and quality (code practice)

Load the trained model and add predictions for the reference and prod simulation data created earlier:
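The loading step itself isn't shown; a minimal sketch, assuming the model trained in 5.2 was saved with joblib (the path is hypothetical):

import joblib

model = joblib.load('models/bank_marketing_rf.joblib')  # hypothetical path

With the model in hand, we add the prediction columns: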

processed_reference_data['prediction'] = model.predict(processed_reference_data.iloc[:, :-1])
processed_prod_simulation_data['prediction'] = model.predict(processed_prod_simulation_data.iloc[:, :-1])

Model output checks

from evidently.test_preset import NoTargetPerformanceTestPreset

no_target_performance_suite = TestSuite(tests=[NoTargetPerformanceTestPreset()])
no_target_performance_suite.run(reference_data=processed_reference_data, current_data=processed_prod_simulation_data[2*batch_size:3*batch_size])
no_target_performance_suite.show(mode='inline')

image

Model quality checks

from evidently.test_preset import BinaryClassificationTestPreset

model_performance_suite = TestSuite(tests=[BinaryClassificationTestPreset()])
model_performance_suite.run(reference_data=processed_reference_data, current_data=processed_prod_simulation_data[0*batch_size:1*batch_size])
model_performance_suite.show(mode='inline')

image

5.5. Design a custom test suite with Evidently (code practice)

Here is an example custom suite (condition parameters such as eq, gt, and gte encode the thresholds):

from evidently.tests import (
    TestShareOfMissingValues,
    TestPrecisionScore,
    TestRecallScore,
    TestAccuracyScore,
)

custom_performance_suite = TestSuite(tests=[
    #TestColumnsType(),
    #TestShareOfDriftedColumns(lt=0.5),
    TestShareOfMissingValues(eq=0),
    TestPrecisionScore(gt=0.5),
    TestRecallScore(gt=0.3),
    TestAccuracyScore(gte=0.75),
])

custom_performance_suite.run(reference_data=processed_reference_data, current_data=processed_prod_simulation_data[:batch_size])
custom_performance_suite.show(mode='inline')

image

Running and logging reports using Airflow and MLflow

Adding monitoring using Airflow:

import os
import pandas as pd 

from datetime import datetime, timedelta
from sklearn import datasets

from airflow import DAG
from airflow.operators.python_operator import PythonOperator 

from evidently.test_suite import TestSuite 
from evidently.test_preset import DataDriftTestPreset

default_args = {
	"start_date" : datetime(2023, 1, 1),
	"owner" : "emeli",
	"retries" : 2,
	"retry_delay" : timedelta(minutes = 5),
}

dir_path = "reports"
file_path = "data_drift_suite.html"

def load_data_execute(**context):
	bank_marketing = datasets.fetch_openml(name='bank-marketing', as_frame='auto')
	bank_marketing_data = bank_marketing.frame 
	context["ti"].xcom_push(key="data_frame", value=bank_marketing_data)

def drift_analysis_execute(**context):
	data = context["ti"].xcom_pull(key="data_frame")

	reference_data = data[5000:7000]
	prod_simulation_data = data[7000:]
	batch_size = 2000

	data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
	data_drift_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])

	if not data_drift_suite.as_dict()['summary']['all_passed']:
		return "create_report"

def create_report_execute(**context):
	data = context["ti"].xcom_pull(key="data_frame")

	reference_data = data[5000:7000]
	prod_simulation_data = data[7000:]
	batch_size = 2000

	data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
	data_drift_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])

	os.makedirs(dir_path, exist_ok=True)
	data_drift_suite.save_html(os.path.join(dir_path, file_path))


with DAG(
	dag_id = "evidently_conditional_drift",
	schedule_interval = "@daily",
	default_args = default_args,
	catchup = False,
	) as dag:

	load_data = PythonOperator(
		task_id = "load_data",
		python_callable = load_data_execute,
		provide_context = True
		)

	drift_analysis = PythonOperator(
		task_id = "drift_analysis",
		python_callable = drift_analysis_execute,
		provide_context = True
		)

	create_report = PythonOperator(
		task_id = "create_report",
		python_callable = create_report_execute,
		provide_context = True
		)

load_data >> drift_analysis >> create_report
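Note that a plain PythonOperator ignores the task id returned by drift_analysis_execute, so the early return above does not actually skip report creation. For real conditional routing the task would need to be a BranchPythonOperator; a sketch extending the DAG above, assuming a sibling skip_report task (e.g. a DummyOperator) exists:

from airflow.operators.python_operator import BranchPythonOperator

def drift_branch_execute(**context):
	data = context["ti"].xcom_pull(key="data_frame")
	suite = TestSuite(tests=[DataDriftTestPreset()])
	suite.run(reference_data=data[5000:7000], current_data=data[7000:9000])
	# the returned task_id decides which downstream task actually runs
	if not suite.as_dict()["summary"]["all_passed"]:
		return "create_report"
	return "skip_report"  # hypothetical no-op task

drift_branch = BranchPythonOperator(
	task_id="drift_branch",
	python_callable=drift_branch_execute,
	provide_context=True,
)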

image

Log artifacts using MLflow

import os
import pandas as pd 

from datetime import datetime, timedelta
from sklearn import datasets
import mlflow

from evidently.test_suite import TestSuite 
from evidently.test_preset import DataDriftTestPreset

bank_marketing = datasets.fetch_openml(name='bank-marketing', as_frame='auto')
bank_marketing_data = bank_marketing.frame
reference_data = bank_marketing_data[5000:7000]
prod_simulation_data = bank_marketing_data[7000:]
batch_size = 2000 

#set experiment
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Drift Tests")

for batch_id in range(10):
	with mlflow.start_run() as run:
		data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
		data_drift_suite.run(reference_data=reference_data,
			current_data=prod_simulation_data[batch_id * batch_size : (batch_id + 1) * batch_size])

		data_drift_suite.save_html("data_drift_suite.html")

		#Log parameters
		mlflow.log_param("batch_id", f"batch_{batch_id}")
		mlflow.log_param("success_tests", data_drift_suite.as_dict()['summary']['success_tests'])
		mlflow.log_param("failed_tests", data_drift_suite.as_dict()['summary']['failed_tests'])

		#Log report
		mlflow.log_artifact("data_drift_suite.html")

		print(run.info)

image
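To browse the logged runs, we can point the MLflow UI at the same backend store with mlflow ui --backend-store-uri sqlite:///mlflow.db.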


Below is module 6.

6.1. How to deploy a live ML monitoring dashboard

Why build a monitoring dashboard?

Tests and reports are great for running structured checks and exploring and debugging your data and ML models. However, when you run tests or build an ad-hoc report, you evaluate only a specific batch of data. This makes your monitoring “static”: you can get alerts but have no visibility into metric evolution and trends.

That is where the ML monitoring dashboard comes into play, as it:

  • Tracks metrics over time,

  • Aids in trend analysis and provides insights,

  • Provides a shared UI to enhance visibility for stakeholders: data scientists, model users, product managers, business teams, etc.

image

ML monitoring architecture recap

image

6.2. ML model monitoring dashboard with Evidently. Batch architecture (code practice)

import datetime

from sklearn import datasets

from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, DatasetDriftMetric

from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

from evidently.ui.dashboards import CounterAgg
from evidently.ui.dashboards import DashboardPanelCounter
from evidently.ui.dashboards import DashboardPanelPlot
from evidently.ui.dashboards import PanelValue
from evidently.ui.dashboards import PlotType
from evidently.ui.dashboards import ReportFilter
from evidently.ui.dashboards import DashboardPanelTestSuite
from evidently.ui.dashboards import TestFilter
from evidently.ui.dashboards import TestSuitePanelType
from evidently.renderers.html_widgets import WidgetSize

from evidently.ui.workspace import Workspace
from evidently.ui.workspace import WorkspaceBase

bank_marketing = datasets.fetch_openml(name='bank-marketing', as_frame='auto')
bank_marketing_data = bank_marketing.frame

reference_data = bank_marketing_data[5000:7000]
prod_simulation_data = bank_marketing_data[7000:]
batch_size = 2000

WORKSPACE = "bank_data"

YOUR_PROJECT_NAME = "Bank Marketing Classification"
YOUR_PROJECT_DESCRIPTION = "Test project using Bank Marketing dataset"


def create_data_quality_report(i: int):
    report = Report(
        metrics=[
            DatasetDriftMetric(),
            ColumnDriftMetric(column_name="Class"),
        ],
        timestamp=datetime.datetime.now() + datetime.timedelta(days=i),
    )

    report.run(reference_data=reference_data, current_data=prod_simulation_data[i * batch_size : (i + 1) * batch_size])
    return report

def create_data_drift_test_suite(i: int):
    suite = TestSuite(
        tests=[
            DataDriftTestPreset()
        ],
        timestamp=datetime.datetime.now() + datetime.timedelta(days=i),
        tags = []
    )

    suite.run(reference_data=reference_data, current_data=prod_simulation_data[i * batch_size : (i + 1) * batch_size])
    return suite

def create_project(workspace: WorkspaceBase):
    project = workspace.create_project(YOUR_PROJECT_NAME)
    project.description = YOUR_PROJECT_DESCRIPTION
    project.dashboard.add_panel(
        DashboardPanelCounter(
            filter=ReportFilter(metadata_values={}, tag_values=[]),
            agg=CounterAgg.NONE,
            title="Bank Marketing Dataset",
        )
    )
    
    project.dashboard.add_panel(
        DashboardPanelPlot(
            title="Target Drift",
            filter=ReportFilter(metadata_values={}, tag_values=[]),
            values=[
                PanelValue(
                    metric_id="ColumnDriftMetric",
                    metric_args={"column_name.name": "Class"},
                    field_path=ColumnDriftMetric.fields.drift_score,
                    legend="target: Class",
                ),
            ],
            plot_type=PlotType.LINE,
            size=WidgetSize.HALF
        )
    )

    project.dashboard.add_panel(
        DashboardPanelPlot(
            title="Dataset Drift",
            filter=ReportFilter(metadata_values={}, tag_values=[]),
            values=[
                PanelValue(metric_id="DatasetDriftMetric", field_path="share_of_drifted_columns", legend="Drift Share"),
            ],
            plot_type=PlotType.BAR,
            size=WidgetSize.HALF
        )
    )

    project.dashboard.add_panel(
        DashboardPanelTestSuite(
            title="Data Drift tests",
            filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
            size=WidgetSize.HALF
        )
    )

    project.dashboard.add_panel(
        DashboardPanelTestSuite(
            title="Data Drift tests: detailed",
            filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
            size=WidgetSize.HALF,
            panel_type=TestSuitePanelType.DETAILED

        )
    )

    project.save()
    return project


def create_demo_project(workspace: str):
    ws = Workspace.create(workspace)
    project = create_project(ws)

    for i in range(0, 10):
        report = create_data_quality_report(i=i)
        ws.add_report(project.id, report)

        suite = create_data_drift_test_suite(i=i)
        ws.add_test_suite(project.id, suite)


if __name__ == "__main__":
    create_demo_project(WORKSPACE)

First, execute the above script; then we can run evidently ui --workspace bank_data (by default the UI is served on localhost:8000).

image

And when we go to the browser:

image

image

We can view individual reports and tests:

image

6.3. ML model monitoring dashboard with Evidently. Online architecture (code practice)

import datetime
import os.path
import time
import pandas as pd

from requests.exceptions import RequestException
from sklearn import datasets

from evidently.collector.client import CollectorClient
from evidently.collector.config import CollectorConfig, IntervalTrigger, ReportConfig

from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

from evidently.ui.dashboards import DashboardPanelTestSuite 
from evidently.ui.dashboards import ReportFilter
from evidently.ui.dashboards import TestFilter
from evidently.ui.dashboards import TestSuitePanelType
from evidently.renderers.html_widgets import WidgetSize
from evidently.ui.workspace import Workspace

COLLECTOR_ID = "default"
COLLECTOR_TEST_ID = "default_test"

PROJECT_NAME = "Bank Marketing: online service "
WORKSACE_PATH = "bank_data"

client = CollectorClient("http://localhost:8001")

bank_marketing = datasets.fetch_openml(name="bank-marketing", as_frame="auto")
bank_marketing_data = bank_marketing.frame
reference_data = bank_marketing_data[5000:5500]
prod_simulation_data = bank_marketing_data[7000:]
mini_batch_size = 50

def setup_test_suite():
	suite = TestSuite(tests=[DataDriftTestPreset()], tags=[])
	suite.run(reference_data=reference_data, current_data=prod_simulation_data[:mini_batch_size])
	return ReportConfig.from_test_suite(suite)

def workspace_setup():
	ws = Workspace.create(WORKSPACE_PATH)
	project = ws.create_project(PROJECT_NAME)
	project.dashboard.add_panel(
		DashboardPanelTestSuite(
			title="Data Drift Tests",
			filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
			size=WidgetSize.HALF
		)
	)
	project.dashboard.add_panel(
		DashboardPanelTestSuite(
			title="Data Drift Tests",
			filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
			size=WidgetSize.HALF,
			panel_type=TestSuitePanelType.DETAILED
		)
	)
	project.save()

def setup_config():
	ws = Workspace.create(WORKSPACE_PATH)
	project = ws.search_project(PROJECT_NAME)[0]

	test_conf = CollectorConfig(trigger=IntervalTrigger(interval=5),
		report_config=setup_test_suite(), project_id=str(project.id))

	client.create_collector(COLLECTOR_TEST_ID, test_conf)
	client.set_reference(COLLECTOR_TEST_ID, reference_data)

def send_data():
	print("Start sending data")
	for i in range(50):
		try:
			data = prod_simulation_data[i * mini_batch_size : (i + 1) * mini_batch_size]
			client.send_data(COLLECTOR_TEST_ID, data)
			print("sent")
		except RequestException as e:
			print(f"collector service is not available: {e.__class__.__name__}")
		time.sleep(1)

def main():
	if not os.path.exists(WORKSPACE_PATH) or len(Workspace.create(WORKSPACE_PATH).search_project(PROJECT_NAME)) == 0:
		workspace_setup()

	setup_config()
	send_data()

if __name__ == '__main__':
	main()

This is the same as before, but now the Evidently UI updates as new data comes in; this is called ‘near real-time’ monitoring.


Overall, what a great course. Definitely going in my ‘courses to recommend’ list.

Neo4j courses

Today I found that Neo4j offers free courses and tests.

Neo4j Fundamentals

image

Graph thinking:

  • where graph theory originated (Euler’s Seven Bridges of Königsberg; the city is now Kaliningrad, Russia)
  • what are nodes and edges
  • directed vs undirected graphs
  • shortest path algorithms
  • graph use cases (e.g. recommendations, network dependency analysis, supply chain management)

Property graphs:

image

RDBMS vs Native Graph db

image

image

  1. Neo4j offers faster queries due to index-free adjacency: relationships are stored at write time, unlike relational databases, where joins are computed at read time.

  2. It simplifies relationship modeling, making it more intuitive than using pivot tables for many-to-many relationships in relational databases.

The last bit was using Cypher to do basic node and relationship querying on a movie dataset.
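For example, a basic traversal on the movie dataset follows stored ACTED_IN relationships rather than computing a join at read time:

MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)
RETURN m.title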

image

Cypher Fundamentals

image image

Sample graph structure

image

Retrieval exercise for Kevin Bacon’s birth year:

MATCH (p:Person)
WHERE p.name = "Kevin Bacon"
RETURN p.born

-> 1958

Another exercise: How many people directed the movie Cloud Atlas?

MATCH (m:Movie {title: 'Cloud Atlas'})<-[:DIRECTED]-(p) 
RETURN count(p)

-> 3

Another exercise: How many actors in the movie The Matrix were born after 1960?

MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE a.born > 1960 AND m.title = 'The Matrix'
RETURN count(a)

-> 4

Create a node exercise

MERGE (p:Person {name: "Daniel Kaluuya"})

-> Added 1 label, created 1 node, set 1 property, completed after 98 ms.

Both MERGE and CREATE can be used, but MERGE first checks whether a node with that property already exists and, if it does, does nothing (using MERGE is best practice).

Another exercise:

  1. Find the Person node for Daniel Kaluuya.

  2. Create the Movie node, Get Out.

  3. Add the ACTED_IN relationship between Daniel Kaluuya and the movie, Get Out.

MATCH (p:Person {name: "Daniel Kaluuya"})
MERGE (m:Movie {title: "Get Out"})
MERGE (p)-[:ACTED_IN]->(m)

-> Added 1 label, created 1 node, set 1 property, created 1 relationship, completed after 22 ms.

Setting properties exercise: Write the Cypher code to find this Movie node and add the tagline and released properties for the node, with the values tagline: ‘Gripping, scary, witty and timely!’ and released: 2017.

MATCH (m:Movie {title: 'Get Out'})
SET m.tagline = 'Gripping, scary, witty and timely!'
SET m.released = 2017
RETURN m.title, m.tagline, m.released

In Neo4j we can create nodes and relationships with MERGE on individual lines, or in one line:

MERGE (p:Person {name: 'Michael Caine'})-[:ACTED_IN]->(m:Movie {title: 'The Cider House Rules'})
RETURN p, m

and this is what happens under the hood:

  1. Neo4j will attempt to find a Person node with the name Michael Caine.

  2. If it does not exist, it creates the node.

  3. Then, it will attempt to expand the ACTED_IN relationships in the graph for this node.

  4. If there are any ACTED_IN relationships from this node, it looks for a Movie with the title ‘The Cider House Rules’.

  5. If there is no node for the Movie, it creates the node.

  6. If there is no relationship between the two nodes, it then creates the ACTED_IN relationship between them.

However, this can result in an error if only part of the pattern already exists. For example, if our graph has a node for Michael Caine but none for the movie, Neo4j tries to create the entire pattern and fails because the Person node for Michael Caine already exists. That is why it is best practice to use MERGE on individual lines when adding nodes/relationships:

// Find or create a person with this name
MERGE (p:Person {name: 'Michael Caine'})

// Find or create a movie with this title
MERGE (m:Movie {title: 'The Cider House Rules'})

// Find or create a relationship between the two nodes
MERGE (p)-[:ACTED_IN]->(m)

To remove properties we can use REMOVE; to remove a node or relationship we use DELETE.

If a node still has relationships, DELETE will fail with an error. To avoid this, we can use DETACH DELETE, which first deletes any relationships and then the node.
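For example, reusing the Daniel Kaluuya node from above (the born property here is hypothetical):

// REMOVE drops a single property from the node
MATCH (p:Person {name: 'Daniel Kaluuya'})
REMOVE p.born

// DELETE alone fails while relationships remain;
// DETACH DELETE removes the relationships first, then the node
MATCH (p:Person {name: 'Daniel Kaluuya'})
DETACH DELETE p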

image

Graph Data Modelling Fundamentals

Course structure:

  1. Intro
  2. Node modelling
  3. Relationships modelling
  4. Testing the model
  5. Refactoring the graph
  6. Eliminating duplicate data
  7. Using specific relationships
  8. Adding intermediate nodes

Data modeling process

Here are the steps to create a graph data model:

  • Understand the domain and define specific use cases (questions) for the application.

  • Develop the initial graph data model:

    • Model the nodes (entities).

    • Model the relationships between nodes.

  • Test the use cases against the initial data model.

  • Create the graph (instance model) with test data using Cypher.

  • Test the use cases, including performance against the graph.

  • Refactor (improve) the graph data model due to a change in the key use cases or for performance reasons.

  • Implement the refactoring on the graph and retest using Cypher.

Graph data modeling is an iterative process. Your initial graph data model is a starting point, but as you learn more about the use cases or if the use cases change, the initial graph data model will need to change. In addition, you may find that especially when the graph scales, you will need to modify the graph (refactor) to achieve the best performance for your key use cases.

Refactoring is very common in the development process. A Neo4j graph has an optional schema which is quite flexible, unlike the schema in an RDBMS. A Cypher developer can easily modify the graph to represent an improved data model.

Types of models

When performing the graph data modeling process for an application, you will need at least two types of models:

  • Data model

  • Instance model

Data model

The data model describes the labels, relationships, and properties for the graph. It does not have specific data that will be created in the graph.

Here is an example of a data model:

image

There is nothing that uniquely identifies a node with a given label. A graph data model, however, is important because it defines the names that will be used for labels, relationship types, and properties when the graph is created and used by the application.

Instance model

An important part of the graph data modeling process is to test the model against the use cases. To do this, you need to have a set of sample data that you can use to see if the use cases can be answered with the model.

Here is an example of an instance model:

image

In this instance model, we have created some instances of Person and Movie nodes, as well as their relationships. Having this type of instance model will help us to test our use cases.

Modelling nodes

Defining labels

Entities are the dominant nouns in your application use cases:

  • What ingredients are used in a recipe?

  • Who is married to this person?

The entities of your use cases will be the labeled nodes in the graph data model.

In the Movie domain, we use the nouns in our use cases to define the labels, for example:

  • What people acted in a movie?

  • What person directed a movie?

  • What movies did a person act in?

Here are some of the labeled nodes that we will start with.

image

Properties for nodes

Depending on our use cases, we need to decide on node properties:

Use case 1: What people acted in a movie?
  • Retrieve a movie by its title.
  • Return the names of the actors.

Use case 2: What person directed a movie?
  • Retrieve a movie by its title.
  • Return the name of the director.

Use case 3: What movies did a person act in?
  • Retrieve a person by their name.
  • Return the titles of the movies.

Use case 5: Who was the youngest person to act in a movie?
  • Retrieve a movie by its title.
  • Evaluate the ages of the actors.
  • Return the name of the youngest actor.

Use case 7: What is the highest rated movie in a particular year?
  • Retrieve all movies released in a particular year.
  • Evaluate the IMDb ratings.
  • Return the movie title.

Use case 8: What drama movies did an actor act in?
  • Retrieve the actor by name.
  • Evaluate the genres for the movies the actor acted in.
  • Return the movie titles.
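To make use case 1 concrete, a sketch of the Cypher it implies (the movie title is illustrative):

MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: 'Apollo 13'})
RETURN p.name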

Here is the initial data model

image

And an initial instance model

image
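A minimal sketch of how a few such instances could be created with Cypher (the property values are illustrative):

MERGE (p:Person {name: 'Tom Hanks', born: 1956})
MERGE (m:Movie {title: 'Apollo 13', released: 1995, imdbRating: 7.6})
MERGE (p)-[:ACTED_IN {role: 'Jim Lovell'}]->(m)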

Modelling relationships

Will continue tomorrow ^^

Pen and paper exercises in ML: Optimisation

Optimisation P1 Optimisation P2


That is all for today!

See you tomorrow :)