Hello :) Today is Day 259!
A quick summary of today:
- Module 5: ML pipelines validation and testing
- starting Neo4j’s free courses
5.1. Introduction to data and ML pipeline testing
When to perform testing
ML lifecycle involves various steps that require testing to ensure that our ML models function properly. Critical areas to test include the following steps of the ML lifecycle:
-
During the feature engineering stage: testing the input data quality as it affects the whole pipeline.
-
During model training (or retraining): model quality checks.
-
During model serving: validating incoming data and model outputs.
-
During performance monitoring: continuously testing the model quality to detect and resolve potential issues.
How to perform testing
There are different types of checks you can use to test data and ML pipelines:
Individual tests A test is a metric with a condition. You can perform a certain evaluation or measurement on top of a data batch and compare it against a threshold or expectation. You can formulate almost anything as a test: assertions on feature values, expectations about model quality on a specific segment, etc. Whatever you can measure, you can design as a test.
Tests can be column-level (when metrics are calculated for a specific feature or column) or dataset-level (in this case, you calculate metrics for the whole dataset).
Test suites Individual tests can be grouped into test suites. For each test in a test suite, you can define test criticality and set alerting conditions: for example, based on the number of failed critical tests.
When you create a test, you must define the test conditions. There are two main strategies you can use for establishing test conditions:
Reference-based conditions You can use a reference dataset to derive conditions automatically rather than set conditions manually for each individual test. This is a great option for certain types of checks, such as testing column types (which are easy to derive from a reference example) and for ad hoc testing such as when you import a new batch of data and can immediately visually explore the test results. However, be careful when designing alerting, as auto-generated test conditions are not perfect and may be prone to false alerts or missed issues.
Manually defined conditions With this approach, you specify conditions for each test manually. This method does not require additional data and can be great for encoding specific conditions based on domain expertise.
Test automation
Recording test results
Example use case
5.2. Train and evaluate an ML model (code practice)
Using a bank marketing dataset from sklearn:
Bank client data:
- Age (numeric)
- Job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
- Marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’ ; note: ‘divorced’ means divorced or widowed)
- Education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
- Default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
- Balance: average yearly balance, in euros (numeric)
- Housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
- Loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
Related with the last contact of the current campaign:
- Contact: contact communication type (categorical: ‘cellular’,’telephone’)
- Day: ast contact day of the month (numeric)
- Month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
- Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other attributes:
- Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- Previous: number of contacts performed before this campaign and for this client (numeric)
- Poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)
Doing some feature engineering:
def feature_engineering(raw_data: pd.DataFrame) -> pd.DataFrame:
preprocessed_data = raw_data.copy(deep = True)
preprocessed_data.columns = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact',
'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'class']
#client data preprocessing
preprocessed_data['has_default'] = preprocessed_data.default.apply(
lambda x : 0 if x == 'no' else 1 if x == 'yes' else -1
)
preprocessed_data['has_housing_loan'] = preprocessed_data.housing.apply(
lambda x : 0 if x == 'no' else 1 if x == 'yes' else -1
)
preprocessed_data['has_personal_loan'] = preprocessed_data.loan.apply(
lambda x : 0 if x == 'no' else 1 if x == 'yes' else -1
)
marital_dummies = pd.get_dummies(preprocessed_data.marital, prefix = 'marital')
preprocessed_data = pd.concat([preprocessed_data, marital_dummies], axis = 1)
job_dummies = pd.get_dummies(preprocessed_data.job, prefix = 'job')
preprocessed_data = pd.concat([preprocessed_data, job_dummies], axis = 1)
edu_dummies = pd.get_dummies(preprocessed_data.education, prefix = 'edu')
preprocessed_data = pd.concat([preprocessed_data, edu_dummies], axis = 1)
preprocessed_data.drop(columns = ['default', 'housing', 'loan', 'marital', 'job', 'education'], inplace=True)
# last contact data preprocessing
contact_dummies = pd.get_dummies(preprocessed_data.contact, prefix = 'contact_type')
preprocessed_data = pd.concat([preprocessed_data, contact_dummies], axis = 1)
month_dummies = pd.get_dummies(preprocessed_data.month, prefix = 'month')
preprocessed_data = pd.concat([preprocessed_data, month_dummies], axis = 1)
preprocessed_data.drop(columns = ['contact', 'month'], inplace=True)
# other attributes preprocessing
poutcome_dummies = pd.get_dummies(preprocessed_data.poutcome, prefix = 'prev_camp_outcome')
preprocessed_data = pd.concat([preprocessed_data, poutcome_dummies], axis = 1)
preprocessed_data.drop(columns = ['poutcome'], inplace=True)
#target preprocessing
preprocessed_data['target'] = preprocessed_data['class'].apply(lambda x : 0 if x == '1' else 1)
preprocessed_data.drop(columns = ['class'], inplace=True)
return preprocessed_data
Then, splitting into train, reference, and prod simulation
train_data = bank_marketing_data[:5000]
reference_data = bank_marketing_data[5000:7000]
prod_simulation_data = bank_marketing_data[7000:]
The model used is a simple RF classifier
model = ensemble.RandomForestClassifier(random_state=42, n_estimators=100)
To create a simple report:
model_quality_report = Report(metrics=[ClassificationPreset()])
model_quality_report.run(reference_data=processed_train, current_data=processed_reference)
model_quality_report.show(mode='inline')
The output report is:
5.3. Test input data quality, stability and drift (code practice)
Raw data checks:
data_stability_suite = TestSuite(tests=[DataStabilityTestPreset()])
data_stability_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
data_stability_suite.show(mode='inline')
For any test we can see details:
data_quality_suite = TestSuite(tests=[DataQualityTestPreset()])
data_quality_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
data_quality_suite.show(mode='inline')
data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
data_drift_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
data_drift_suite.show(mode='inline')
Preprocessed data checks
We preprocess the data using our feature_engineering function
processed_reference_data = feature_engineering(reference_data)
processed_prod_simulation_data = feature_engineering(prod_simulation_data)
Run the data drift preset report code
data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
data_drift_suite.run(reference_data=processed_reference_data, current_data=processed_prod_simulation_data[:batch_size])
data_drift_suite.show(mode='inline')
We can run some quick checks for warnings/alerts to check tests pass rate like below
data_drift_suite.as_dict()['summary']['all_passed'] == True # False
data_drift_suite.as_dict()['summary']['by_status']['SUCCESS'] > 40 # True
5.4. Test ML model outputs and quality (code practice)
Load the trained model and add predictions for reference and the prod simulation data that was created earlier:
processed_data_reference['prediction'] = model.predict(processed_data_reference.iloc[:, :-1])
processed_data_prod_simulation['prediction'] = model.predict(processed_data_prod_simulation.iloc[:, :-1])
Model output checks
no_target_performance_suite = TestSuite(tests=[NoTargetPerformanceTestPreset()])
no_target_performance_suite.run(reference_data=processed_data_reference, current_data=processed_data_prod_simulation[2*batch_size:3*batch_size])
no_target_performance_suite.show(mode='inline')
Model quality checks
model_performance_suite = TestSuite(tests=[BinaryClassificationTestPreset()])
model_performance_suite.run(reference_data=processed_data_reference, current_data=processed_data_prod_simulation[0*batch_size:1*batch_size])
model_performance_suite.show(mode='inline')
5.5. Design a custom test suite with Evidently (code practice)
Here is an example custome suite
custom_performance_suite = TestSuite(tests=[
#TestColumnsType(),
#TestShareOfDriftedColumns(ls=0.5),
TestShareOfMissingValues(eq=0),
TestPrecisionScore(gt=0.5),
TestRecallScore(gt=0.3),
TestAccuracyScore(gte=0.75),
])
custom_performance_suite.run(reference_data=processed_reference, current_data=processed_prod_simulation[:batch_size])
custom_performance_suite.show(mode='inline')
Running and logging reports using Airflow and mlflow
Adding monitoring using Airlow
import os
import pandas as pd
from datetime import datetime, timedelta
from sklearn import datasets
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
default_args = {
"start_date" : datetime(2023, 1, 1),
"owner" : "emeli",
"retries" : 2,
"retry_delay" : timedelta(minutes = 5),
}
dir_path = "reports"
file_path = "data_drift_suite.html"
def load_data_execute(**context):
bank_marketing = datasets.fetch_openml(name='bank-marketing', as_frame='auto')
bank_marketing_data = bank_marketing.frame
context["ti"].xcom_push(key="data_frame", value=bank_marketing_data)
def drift_analysis_execute(**context):
data = context["ti"].xcom_pull(key="data_frame")
reference_data = data[5000:7000]
prod_simulation_data = data[7000:]
batch_size = 2000
data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
data_drift_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
if not data_drift_suite.as_dict()['summary']['all_passed']:
return "create_report"
def create_report_execute(**context):
data = context["ti"].xcom_pull(key="data_frame")
reference_data = data[5000:7000]
prod_simulation_data = data[7000:]
batch_size = 2000
data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
data_drift_suite.run(reference_data=reference_data, current_data=prod_simulation_data[:batch_size])
try:
os.mkdir(dir_path)
except OSError:
print("Creation of the directory {} failed".format(dir_path))
data_drift_suite.save_html(os.path.join(dir_path, file_path))
with DAG(
dag_id = "evidently_conditional_drift",
schedule_interval = "@daily",
default_args = default_args,
catchup = False,
) as dag:
load_data = PythonOperator(
task_id = "load_data",
python_callable = load_data_execute,
provide_context = True
)
drift_analysis = PythonOperator(
task_id = "drift_analysis",
python_callable = drift_analysis_execute,
provide_context = True
)
create_report = PythonOperator(
task_id = "create_report",
python_callable = create_report_execute,
provide_context = True
)
load_data >> drift_analysis >> [create_report]
Log artifacts using mlflow
import os
import pandas as pd
from datetime import datetime, timedelta
from sklearn import datasets
import mlflow
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
bank_marketing = datasets.fetch_openml(name='bank-marketing', as_frame='auto')
bank_marketing_data = bank_marketing.frame
reference_data = bank_marketing_data[5000:7000]
prod_simulation_data = bank_marketing_data[7000:]
batch_size = 2000
#set experiment
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Drift Tests")
for batch_id in range(10):
with mlflow.start_run() as run:
data_drift_suite = TestSuite(tests=[DataDriftTestPreset()])
data_drift_suite.run(reference_data=reference_data,
current_data=prod_simulation_data[batch_id * batch_size : (batch_id + 1) * batch_size])
data_drift_suite.save_html("data_drift_suite.html")
#Log parameters
mlflow.log_param("batch_id", f"batch_{batch_id}")
mlflow.log_param("success_tests", data_drift_suite.as_dict()['summary']['success_tests'])
mlflow.log_param("failed_tests", data_drift_suite.as_dict()['summary']['failed_tests'])
#Log report
mlflow.log_artifact("data_drift_suite.html")
print(run.info)
Below is module 6.
6.1. How to deploy a live ML monitoring dashboard
Why build a monitoring dashboard?
Tests and reports are great for running structured checks and exploring and debugging your data and ML models. However, when you run tests or build an ad-hoc report, you evaluate only a specific batch of data. This makes your monitoring “static”: you can get alerts but have no visibility into metric evolution and trends.
That is where the ML monitoring dashboard comes into play, as it:
-
Tracks metrics over time,
-
Aids in trend analysis and provides insights,
-
Provides a shared UI to enhance visibility for stakeholders, e.g., data scientists, model users, product managers, business stakeholders, etc.
ML monitoring architecture recap
6.2. ML model monitoring dashboard with Evidently. Batch architecture (code practice)
import datetime
from sklearn import datasets
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, DatasetDriftMetric
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
from evidently.ui.dashboards import CounterAgg
from evidently.ui.dashboards import DashboardPanelCounter
from evidently.ui.dashboards import DashboardPanelPlot
from evidently.ui.dashboards import PanelValue
from evidently.ui.dashboards import PlotType
from evidently.ui.dashboards import ReportFilter
from evidently.ui.dashboards import DashboardPanelTestSuite
from evidently.ui.dashboards import TestFilter
from evidently.ui.dashboards import TestSuitePanelType
from evidently.renderers.html_widgets import WidgetSize
from evidently.ui.workspace import Workspace
from evidently.ui.workspace import WorkspaceBase
bank_marketing = datasets.fetch_openml(name='bank-marketing', as_frame='auto')
bank_marketing_data = bank_marketing.frame
reference_data = bank_marketing_data[5000:7000]
prod_simulation_data = bank_marketing_data[7000:]
batch_size = 2000
WORKSPACE = "bank_data"
YOUR_PROJECT_NAME = "Bank Marketing Classification"
YOUR_PROJECT_DESCRIPTION = "Test project using Bank Marketing dataset"
def create_data_quality_report(i: int):
report = Report(
metrics=[
DatasetDriftMetric(),
ColumnDriftMetric(column_name="Class"),
],
timestamp=datetime.datetime.now() + datetime.timedelta(days=i),
)
report.run(reference_data=reference_data, current_data=prod_simulation_data[i * batch_size : (i + 1) * batch_size])
return report
def create_data_drift_test_suite(i: int):
suite = TestSuite(
tests=[
DataDriftTestPreset()
],
timestamp=datetime.datetime.now() + datetime.timedelta(days=i),
tags = []
)
suite.run(reference_data=reference_data, current_data=prod_simulation_data[i * batch_size : (i + 1) * batch_size])
return suite
def create_project(workspace: WorkspaceBase):
project = workspace.create_project(YOUR_PROJECT_NAME)
project.description = YOUR_PROJECT_DESCRIPTION
project.dashboard.add_panel(
DashboardPanelCounter(
filter=ReportFilter(metadata_values={}, tag_values=[]),
agg=CounterAgg.NONE,
title="Bank Marketing Dataset",
)
)
project.dashboard.add_panel(
DashboardPanelPlot(
title="Target Drift",
filter=ReportFilter(metadata_values={}, tag_values=[]),
values=[
PanelValue(
metric_id="ColumnDriftMetric",
metric_args={"column_name.name": "Class"},
field_path=ColumnDriftMetric.fields.drift_score,
legend="target: Class",
),
],
plot_type=PlotType.LINE,
size=WidgetSize.HALF
)
)
project.dashboard.add_panel(
DashboardPanelPlot(
title="Dataset Drift",
filter=ReportFilter(metadata_values={}, tag_values=[]),
values=[
PanelValue(metric_id="DatasetDriftMetric", field_path="share_of_drifted_columns", legend="Drift Share"),
],
plot_type=PlotType.BAR,
size=WidgetSize.HALF
)
)
project.dashboard.add_panel(
DashboardPanelTestSuite(
title="Data Drift tests",
filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
size=WidgetSize.HALF
)
)
project.dashboard.add_panel(
DashboardPanelTestSuite(
title="Data Drift tests: detailed",
filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
size=WidgetSize.HALF,
panel_type=TestSuitePanelType.DETAILED
)
)
project.save()
return project
def create_demo_project(workspace: str):
ws = Workspace.create(workspace)
project = create_project(ws)
for i in range(0, 10):
report = create_data_quality_report(i=i)
ws.add_report(project.id, report)
suite = create_data_drift_test_suite(i=i)
ws.add_report(project.id, suite)
if __name__ == "__main__":
create_demo_project(WORKSPACE)
First, execute the above script, and then we can run evidently ui --workspace bank_data
And when we go to the browser:
We can view indiviaul reports and tests:
6.3. ML model monitoring dashboard with Evidently. Online architecture (code practice)
import datetime
import os.path
import time
import pandas as pd
from requests.exceptions import RequestException
from sklearn import datasets
from evidently.collector.client import CollectorClient
from evidently.collector.config import CollectorConfig, IntervalTrigger, ReportConfig
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
from evidently.ui.dashboards import DashboardPanelTestSuite
from evidently.ui.dashboards import ReportFilter
from evidently.ui.dashboards import TestFilter
from evidently.ui.dashboards import TestSuitePanelType
from evidently.renderers.html_widgets import WidgetSize
from evidently.ui.workspace import Workspace
COLLECTOR_ID = "default"
COLLECTOR_TEST_ID = "default_test"
PROJECT_NAME = "Bank Marketing: online service "
WORKSACE_PATH = "bank_data"
client = CollectorClient("http://localhost:8001")
bank_marketing = datasets.fetch_openml(name="bank-marketing", as_frame="auto")
bank_marketing_data = bank_marketing.frame
reference_data = bank_marketing_data[5000:5500]
prod_simulation_data = bank_marketing_data[7000:]
mini_batch_size = 50
def setup_test_suite():
suite = TestSuite(tests=[DataDriftTestPreset()], tags=[])
suite.run(reference_data=reference_data, current_data=prod_simulation_data[:mini_batch_size])
return ReportConfig.from_test_suite(suite)
def workspace_setup():
ws = Workspace.create(WORKSACE_PATH)
project = ws.create_project(PROJECT_NAME)
project.dashboard.add_panel(
DashboardPanelTestSuite(
title="Data Drift Tests",
filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
size=WidgetSize.HALF
)
)
project.dashboard.add_panel(
DashboardPanelTestSuite(
title="Data Drift Tests",
filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
size=WidgetSize.HALF,
panel_type=TestSuitePanelType.DETAILED
)
)
project.save()
def setup_config():
ws = Workspace.create(WORKSACE_PATH)
project = ws.search_project(PROJECT_NAME)[0]
test_conf = CollectorConfig(trigger=IntervalTrigger(interval=5),
report_config=setup_test_suite(), project_id=str(project.id))
client.create_collector(COLLECTOR_TEST_ID, test_conf)
client.set_reference(COLLECTOR_TEST_ID, reference_data)
def send_data():
print("Start sending data")
for i in range(50):
try:
data = prod_simulation_data[i * mini_batch_size : (i + 1) * mini_batch_size]
client.send_data(COLLECTOR_TEST_ID, data)
print("sent")
except RequestException as e:
print(f"collector service is not available: {e.__class__.__name__}")
time.sleep(1)
def main():
if not os.path.exists(WORKSACE_PATH) or len(Workspace.create(WORKSACE_PATH).search_project(PROJECT_NAME)) == 0:
workspace_setup()
setup_config()
send_data()
if __name__ == '__main__':
main()
This is the same as before, but now we update the evidently UI as new data is coming in. It is called ‘near real-time’ monitoring.
Overall, what a great course. Definitely going in my ‘courses to recommend’ list.
Neo4j courses
Today I found neo4j offers free courses and tests.
- neo4j certified professional (its free and I get a free t-shirt if I succeed)
- neo4j graph data science certification
Neo4j Fundamentals
Graph thinking:
- where graph theory originated (Russia)
- what are nodes and edges
- directed vs undirected graphs
- shortest path algorithms
- graph use cases (i.e. recommendations, network dependency analysis, supply chain management)
Property graphs:
RDBMS vs Native Graph db
-
Neo4j offers faster queries due to index-free adjacency, as relationships are stored during writing, unlike relational databases where joins are computed during reading.
-
It simplifies relationship modeling, making it more intuitive than using pivot tables for many-to-many relationships in relational databases.
The last bit was using Cypher to do basic node and relationship querying on a movie dataset.
Cyper Fundamentals
Sample graph structure
Retrieving exercise for Kevin Bacon’s birht year:
MATCH (p:Person)
where p.name = "Kevin Bacon"
RETURN p.born
-> 1958
Another exercise: How many people directed the movie Cloud Atlas?
MATCH (m:Movie {title: 'Cloud Atlas'})<-[:DIRECTED]-(p)
RETURN count(p)
-> 3
Another exercise: How many actors in the movie The Matrix were born after 1960?
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE a.born > 1960 AND m.title = 'The Matrix'
RETURN count(a)
-> 4
Create a node exercise
MERGE (p:Person {name: "Daniel Kaluuya"})
Added 1 label, created 1 node, set 1 property, completed after 98 ms. MERGE and CREATE can be used but MERGE first checks if a node with that property exists, and if it does, it does nothing (using MERGE is best practice)
Another exercise:
-
Find the Person node for Daniel Kaluuya.
-
Create the Movie node, Get Out.
-
Add the ACTED_IN relationship between Daniel Kaluuya and the movie, Get Out.
MATCH (p:Person {name: "Daniel Kaluuya"})
MERGE (m:Movie {title: "Get Out"})
MERGE (p)-[:ACTED_IN]->(m)
-> Added 1 label, created 1 node, set 1 property, created 1 relationship, completed after 22 ms.
Setting properties exercise: Write the Cypher code to find this Movie node and add the tagline and released properties for the node with the values tagline: Gripping, scary, witty and timely! and released: 2017
MATCH (m:Movie {title: 'Get Out'})
SET m.tagline = 'Gripping, scary, witty and timely!'
SET m.released = 2017
RETURN m.title, m.tagline, m.released
In neo4j we can create nodes and relationships on individual lines with MERGE, or in 1 line:
MERGE (p:Person {name: 'Michael Caine'})-[:ACTED_IN]->(m:Movie {title: 'The Cider House Rules'})
RETURN p, m
and this is what heppens under the hood:
-
Neo4j will attempt to find a Person node with the name Michael Caine.
-
If it does not exist, it creates the node.
-
Then, it will attempt to expand the ACTED_IN relationships in the graph for this node.
-
If there are any ACTED_IN relationships from this node, it looks for a Movie with the title ‘The Cider House Rules’.
-
If there is no node for the Movie, it creates the node.
-
If there is no relationship between the two nodes, it then creates the ACTED_IN relationship between them.
However this can result in an error if one of the items does not exist. I.e. if our graph has a node for Michael, but there is not for the movie. It tried to create the entire pattern, but it failed because the node Person with Michael Caine already exists. That is why it is best practice to use MERGE on individual lines when adding nodes/relationships
// Find or create a person with this name
MERGE (p:Person {name: 'Michael Caine'})
// Find or create a movie with this title
MERGE (m:Movie {title: 'The Cider House Rules'})
// Find or create a relationship between the two nodes
MERGE (p)-[:ACTED_IN]->(m)
To remove properties we can use REMOVE, to remove node/edge we use DELETE
If a node has a relationship, the operation will result in an error. To avoid this, we can use DETACH DELETE which 1st deletes any relationship and then the node
Graph Data Modelling Fundamentals
Course structure:
- Intro
- Node modelling
- Relationships modelling
- Testing the model
- Refactoring the graph
- Eliminating duplicate data
- Using specific relationships
- Adding intermediate nodes
Data modeling process
Here are the steps to create a graph data model:
-
Understand the domain and define specific use cases (questions) for the application.
-
Develop the initial graph data model:
-
Model the nodes (entities).
-
Model the relationships between nodes.
-
-
Test the use cases against the initial data model.
-
Create the graph (instance model) with test data using Cypher.
-
Test the use cases, including performance against the graph.
-
Refactor (improve) the graph data model due to a change in the key use cases or for performance reasons.
-
Implement the refactoring on the graph and retest using Cypher.
Graph data modeling is an iterative process. Your initial graph data model is a starting point, but as you learn more about the use cases or if the use cases change, the initial graph data model will need to change. In addition, you may find that especially when the graph scales, you will need to modify the graph (refactor) to achieve the best performance for your key use cases.
Refactoring is very common in the development process. A Neo4j graph has an optional schema which is quite flexible, unlike the schema in an RDBMS. A Cypher developer can easily modify the graph to represent an improved data model.
Types of models
When performing the graph data modeling process for an application, you will need at least two types of models:
-
Data model
-
Instance model
Data model
The data model describes the labels, relationships, and properties for the graph. It does not have specific data that will be created in the graph.
Here is an example of a data model:
There is nothing that uniquely identifies a node with a given label. A graph data model, however is important because it defines the names that will be used for labels, relationship types, and properties when the graph is created and used by the application.
Instance model
An important part of the graph data modeling process is to test the model against the use cases. To do this, you need to have a set of sample data that you can use to see if the use cases can be answered with the model.
Here is an example of an instance model:
In this instance model, we have created some instances of Person and Movie nodes, as well as their relationships. Having this type of instance model will help us to test our use cases.
Modelling nodes
Defining labels
Entities are the dominant nouns in your application use cases:
-
What ingredients are used in a recipe?
-
Who is married to this person?
The entities of your use cases will be the labeled nodes in the graph data model.
In the Movie domain, we use the nouns in our use cases to define the labels, for example:
-
What people acted in a movie?
-
What person directed a movie?
-
What movies did a person act in?
Here are some of the labeled nodes that we will start with.
Properties for nodes
Depending on our use cases we need to decide on node properties
Use Case | Steps Required |
---|---|
1: What people acted in a movie? | Retrieve a movie by its title. |
Return the names of the actors. | |
2: What person directed a movie? | Retrieve a movie by its title. |
Return the name of the director. | |
3: What movies did a person act in? | Retrieve a person by their name. |
Return the titles of the movies. | |
5: Who was the youngest person to act in a movie? | Retrieve a movie by its title. |
Evaluate the ages of the actors. | |
Return the name of the youngest actor. | |
7: What is the highest rated movie in a particular year? | Retrieve all movies released in a particular year. |
Evaluate the IMDb ratings. | |
Return the movie title. | |
8: What drama movies did an actor act in? | Retrieve the actor by name. |
Evaluate the genres for the movies the actor acted in. | |
Return the movie titles. |
Here is the initial data model
And an initial instance model
Modelling relationships
Will continue tomorrow ^^
Pen and paper exercises in ML: Optimisation
That is all for today!
See you tomorrow :)