JSON-LD Validation & Python: SHACL & Pydantic Generation
Hey there, fellow data enthusiasts! Have you ever found yourself wrestling with JSON-LD documents and wishing you could seamlessly validate them and bring them to life as Python objects? Well, you're not alone! This is a common challenge, especially when dealing with formats like RO-Crate metadata. In this article, we'll dive into the fascinating world of generating both SHACL (Shapes Constraint Language) and Pydantic models for the same JSON-LD documents. This approach allows us to not only validate the RDF graphs described by the document but also to easily work with the data in Python.
The Challenge: Validating and Using JSON-LD Data
Let's face it: working with JSON-LD can be a bit like navigating a complex maze. You've got your data, represented in a structured, graph-based format, and you need to ensure it's valid, conforms to specific rules, and is easily usable within your Python code. This is where SHACL and Pydantic come into play: SHACL validates the structure and content of the RDF graph your JSON-LD document describes, while Pydantic lets you define Python classes that represent the same data, making it straightforward to work with in code.
Imagine you have a JSON-LD document describing biological samples. This document might include information about the sample's title, organism classification, and a description. You want to ensure that the document follows the rules defined for biological samples, like having a title and organism classification. Also, you want to easily access these properties in your Python code.
The core of the issue lies in the nuances of JSON-LD, especially when dealing with flattened documents. In the example provided, the organismClassification field in the BioSample class points to another object (a Taxon), which has its own properties. The challenge is to represent these relationships correctly in both SHACL (for validation) and Pydantic (for Python class generation).
Here’s a breakdown of the key elements we need to address:
- Validation: Ensuring your data structure and content are correct. SHACL is used for validating the RDF graph described by the JSON-LD document.
- Data Representation: Converting the JSON-LD data into a format that's easy to use in Python. Pydantic is used to generate Python classes based on the data structure.
- Complex Relationships: Handling the relationships between different data objects (like BioSample and Taxon). The organismClassification field in the BioSample class references a Taxon object, creating a relationship that needs to be properly defined.
So, how do we tackle this complex problem? Let's explore some strategies and tools to make this process smoother.
Deep Dive into SHACL and Pydantic
Let's get down to the nitty-gritty of SHACL and Pydantic. We'll cover the fundamental concepts and how they relate to the problem of JSON-LD validation and data modeling.
SHACL: The Validator
SHACL (Shapes Constraint Language) is a W3C recommendation for validating RDF graphs. Think of it as a set of rules and constraints that define the shape and structure of your data. For example, you can define that a BioSample must have a title (a required property) and its organismClassification must be of type Taxon.
Here's a simplified SHACL shape for our BioSample example:
```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex: <http://example.org/> .

ex:BioSampleShape
    a sh:NodeShape ;
    sh:targetClass ex:BioSample ;
    sh:property [
        sh:path ex:title ;
        sh:name "Title" ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path ex:organismClassification ;
        sh:class ex:Taxon ;
    ] .
```
In this SHACL shape, we define that a BioSample should have a property (title) that is required (minCount 1) and the organismClassification should be of the Taxon type. This shape can then be used to validate JSON-LD documents that conform to this structure.
Pydantic: The Data Modeler
Pydantic is a Python library that allows you to define data models (Python classes) with data validation and parsing capabilities. Pydantic leverages Python type hints to ensure that the data conforms to the defined structure. It's like having a blueprint for your data, ensuring data integrity and making it easier to work with. Using a Pydantic model for your data has some advantages:
- Data Validation: Pydantic automatically validates the data against the defined types and constraints.
- Data Parsing: It can parse JSON, dictionaries, and other data formats into Python objects.
- Developer Experience: Pydantic models are easy to define and use, which improves the developer experience.
Here's an example of a simple Pydantic model for BioSample:
```python
from typing import Optional

from pydantic import BaseModel


class Taxon(BaseModel):
    scientificName: str


class BioSample(BaseModel):
    title: str
    organismClassification: Taxon
    biologicalEntityDescription: Optional[str] = None
```
In this example, the BioSample model includes a title (a required string), an organismClassification (a Taxon object), and an optional biologicalEntityDescription. Pydantic handles the data validation and parsing automatically. This allows you to easily parse your JSON-LD data into instances of the BioSample class, making it easy to access the data.
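A quick sketch of what that looks like in practice: constructing a model from plain dictionaries (the shape you get from a JSON document), with Pydantic coercing the nested dictionary into a Taxon and raising a ValidationError when a required field is missing:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Taxon(BaseModel):
    scientificName: str


class BioSample(BaseModel):
    title: str
    organismClassification: Taxon
    biologicalEntityDescription: Optional[str] = None


# Nested dictionaries are validated and coerced into the nested model
sample = BioSample(
    title="The Example Bio Sample",
    organismClassification={"scientificName": "Example example"},
)
print(sample.organismClassification.scientificName)  # Example example

# A missing required field is caught at construction time
try:
    BioSample(organismClassification={"scientificName": "Example example"})
except ValidationError:
    print("missing required field: title")
```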
Bringing SHACL and Pydantic Together
Now, let's explore how to bring SHACL and Pydantic together to achieve the goal of validating and using JSON-LD data. The key is to generate SHACL shapes and Pydantic models that align with the structure of your JSON-LD documents. This will allow you to validate your data and work with it as Python objects.
The LinkML Approach
One approach is to use LinkML (the Linked Data Modeling Language). LinkML is a data modeling language designed to describe data structures in a way that lends itself to both data validation and code generation. It lets you define classes, slots, and relationships declaratively.
Here's a LinkML schema snippet for the example provided:
```yaml
id: https://bia_rocrate/schema
name: bia_rocrate_schema
prefixes:
  schema: http://schema.org/
  bia: http://bia/
  dc: http://purl.org/dc/terms/
  csvw: https://www.w3.org/ns/csvw#
  obo: http://purl.obolibrary.org/obo/
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  BioSample:
    class_uri: bia:BioSample
    slots:
      - title
    attributes:
      biologicalEntityDescription:
        slot_uri: bia:biologicalEntityDescription
      organismClassification:
        slot_uri: bia:organismClassification
        range: Taxon
        required: true
  Taxon:
    class_uri: bia:Taxon
    attributes:
      scientificName:
        slot_uri: bia:scientificName
        required: false
slots:
  title:
    slot_uri: schema:name
    range: string
    required: true
```
This LinkML schema defines the classes (BioSample, Taxon), their attributes (like title, scientificName), and the relationships between them (the organismClassification relationship between BioSample and Taxon).
Generating SHACL Shapes from LinkML
LinkML can be used to generate SHACL shapes. The generated shapes mirror the structure defined in your LinkML schema, allowing you to validate your JSON-LD documents against the rules you've defined and ensuring the data conforms to the expected structure. The linkml package ships a SHACL generator, exposed as the gen-shacl command, that produces the shapes automatically from a LinkML schema.
Generating Pydantic Models from LinkML
Similarly, LinkML can generate Pydantic models via the gen-pydantic generator, also part of the linkml package. You can use the generated Pydantic models to parse and work with your JSON-LD data in Python, turning documents into instances of Python classes that are much easier to work with.
Tools and Libraries
Here are some libraries and tools that can help you:
- linkml: The main LinkML toolkit, which includes generators such as gen-shacl (SHACL shapes) and gen-pydantic (Pydantic models).
- linkml-runtime: A library for working with LinkML schemas and data at runtime in Python.
- rdflib: A Python library for working with RDF data, including JSON-LD parsing and serialization.
- pyshacl: A Python implementation of SHACL for validating RDF graphs.
Step-by-Step Guide: Generating SHACL and Pydantic
Let’s walk through the steps to generate SHACL and Pydantic models from your JSON-LD documents. This will help you get a better understanding of the entire process.
1. Define Your LinkML Schema: Create a LinkML schema that describes your data structure, including classes, slots, ranges, and any other relevant details. This schema serves as the blueprint for both your SHACL shapes and Pydantic models, and is the foundation upon which your validation and Python object creation will be built.
2. Generate SHACL Shapes: Use gen-shacl (or a similar tool) to generate SHACL shapes from your LinkML schema. These shapes specify the validation rules for your data; validate your JSON-LD documents against them.
3. Generate Pydantic Models: Use gen-pydantic (or a similar tool) to generate Pydantic models from your LinkML schema. These models define the Python classes that will represent your data.
4. Parse and Validate Your Data: Load your JSON-LD document into an RDF graph with rdflib, validate the graph against your SHACL shapes (for example with pyshacl), and create instances of your Pydantic models so the data is also validated against the defined types and constraints.
Here's an example of how you might use these tools and Python code:
```python
import json

from pydantic import BaseModel, ValidationError
from rdflib import Graph


# 1. Define the Pydantic models for the schema
#    (in practice these would be generated from the LinkML schema with gen-pydantic)
class Taxon(BaseModel):
    scientificName: str


class BioSample(BaseModel):
    title: str
    organismClassification: Taxon


# 2. Load your JSON-LD data; note that the context maps each term to an IRI,
#    so that rdflib can turn the document into triples (unmapped terms would
#    silently be dropped during parsing)
json_ld_data = {
    "@context": {
        "bia": "http://bia/",
        "schema": "http://schema.org/",
        "title": "schema:name",
        "scientificName": "bia:scientificName",
        "organismClassification": {
            "@id": "bia:organismClassification",
            "@type": "@id",
        },
    },
    "@graph": [
        {
            "@id": "_:BioSampleExample",
            "@type": ["bia:BioSample"],
            "organismClassification": {"@id": "_:TaxonExample"},
            "title": "The Example Bio Sample",
        },
        {
            "@id": "_:TaxonExample",
            "@type": ["bia:Taxon"],
            "scientificName": "Example example",
        },
    ],
}

# 3. Parse the JSON-LD data into an RDF graph (e.g. for SHACL validation)
graph = Graph().parse(data=json.dumps(json_ld_data), format="json-ld")

# 4. Create Python objects using the Pydantic models; the document is
#    flattened, so the reference to the nested Taxon is resolved by hand here
try:
    sample_node, taxon_node = json_ld_data["@graph"]
    bio_sample = BioSample(
        title=sample_node["title"],
        organismClassification=Taxon(
            scientificName=taxon_node["scientificName"]
        ),
    )
    print(bio_sample)
except ValidationError as e:
    print(f"Validation error: {e}")
```
Tips and Best Practices
- Start Simple: Begin with a simple LinkML schema and gradually add complexity as needed. This helps keep the process manageable.
- Test Early and Often: Test your SHACL shapes and Pydantic models with different data variations to ensure they work as expected.
- Handle Complex Relationships: Carefully define relationships between classes in your LinkML schema, especially when dealing with nested objects and references. The relationships are the core of the validation and data modeling process.
- Use Descriptive Names: Choose meaningful names for classes, slots, and properties to make your code more readable and maintainable.
- Document Your Schema: Document your LinkML schema to explain your data model's purpose, design, and validation rules.
Conclusion: Your Path to Data Mastery
Congratulations! You've successfully navigated the challenges of generating SHACL and Pydantic models for JSON-LD documents. You now have the tools and knowledge to validate your data and work with it easily in Python. This approach ensures data integrity, improves the development experience, and makes working with complex data structures a breeze.
By following these steps, you can create a robust and efficient workflow for working with JSON-LD data. Embrace the power of SHACL and Pydantic and unlock the full potential of your data!
I hope this helps you get started and provides some inspiration. Feel free to ask more questions.