Knowledge graphs transform scattered information into interconnected, semantically rich data. They represent entities, relationships, and context, making it possible to reason across domains rather than handling isolated tables or documents. The ontology provides the shared vocabulary – but in practice, the real challenge is turning heterogeneous data into RDF.

This challenge is universal: onboarding CSV dumps from ERP systems, consuming JSON from REST APIs, ingesting XML from legacy partners, synchronizing events from streaming pipelines, or converting relational schemas. Every source comes with its quirks, and RDF systems expect normalized, semantic triples. The question becomes: How can we instantiate our ontology efficiently, repeatably, and at scale?
Modern enterprises rarely deal with clean, uniform datasets. Instead, they operate on heterogeneous data that evolves over time, spans organizational boundaries, and reflects different modeling assumptions. Turning this heterogeneous data into consistent Knowledge Graphs is therefore not a one-off task, but a continuous engineering effort.
The Data Integration Problem: Data Variety Meets Semantic Rigor
Integrating heterogeneous data requires more than scripting. We need repeatable, auditable transformations that link raw data to ontology terms. CSV columns must become predicates. JSON arrays must be navigated with paths. XML hierarchies need to be flattened or preserved. SQL tables need URI templates and join logic to enable robust Data Ingestion pipelines.
Without a structured approach, pipelines become brittle and opaque. Ontology changes ripple into half-forgotten scripts. Teams duplicate logic. Debugging becomes guesswork. A declarative layer is needed to keep semantics separate from execution mechanics.

What makes this problem particularly challenging is that heterogeneous data often changes independently of the ontology. New fields appear, formats evolve, and source systems are replaced, all while downstream Knowledge Graphs must remain stable and trustworthy.
Two Architectural Paths: Imperative Code vs. Mapping Languages
Imperative pipelines rely on custom code written in Python or Java to parse inputs and emit RDF. They offer fine-grained control but tangle data logic with semantic logic. Over time, they can become hard to maintain, reuse, or adapt to changing ontologies.
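For illustration, here is a minimal imperative sketch using rdflib, assuming a semicolon-delimited persons.csv like the one used later in this post (the file name, delimiter, and column names are assumptions):

import csv
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.com#")
SCHEMA = Namespace("http://schema.org/")

g = Graph()
with open("persons.csv", newline="") as f:
    for row in csv.DictReader(f, delimiter=";"):
        # Parsing, URI construction, and semantics are all tangled together here.
        person = EX[f"person_{row['ID']}"]
        g.add((person, RDF.type, SCHEMA.Person))
        g.add((person, SCHEMA.givenName, Literal(row["FirstName"])))

g.serialize(destination="persons.ttl", format="turtle")

Every new source, column, or ontology change means editing code like this, which is exactly where the coupling problems start.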
Declarative mapping languages describe what the RDF should look like, not how to generate it. Execution engines process these specifications automatically. This separation improves reusability, testability, clarity, and alignment with the ontology.
From a lifecycle and DataOps perspective, declarative mappings are easier to version, review, lint, and test. They work well with virtual graphs, enabling queries over the source while postponing full materialization.
For these reasons, we generally prefer mapping languages for the transformation layer, while still allowing imperative preprocessing for special cases.
Mapping Languages: R2RML vs RML vs YARRRML
Mapping languages function similarly to transformation languages like XSLT: instead of step‑by‑step control flow, you define rules that match patterns in the source and produce RDF triples accordingly.
A brief landscape:
- R2RML: W3C recommendation for mapping relational databases to RDF.
- RML: An extension of R2RML supporting CSV, JSON, XML, and other formats.
- YARRRML: A human‑friendly YAML syntax that compiles into RML.
Where R2RML is SQL-centric, RML covers multi-format ecosystems. YARRRML lowers the barrier to authoring complex RML mappings.
YARRRML: Making Data Ingestion and Mappings Author‑Friendly
YARRRML exists to make knowledge graph mappings easier to read and write. Its YAML syntax reduces verbosity and uses indentation to keep the structure clear. Under the hood, YARRRML is translated into RML, so it stays interoperable with existing mapping engines.
Example: Mapping CSV to RDF with YARRRML
Let’s take a simple CSV:
ID;FirstName;LastName;Email
1;Ada;Lovelace;ada@example.org
2;Alan;Turing;alan@example.org
We want to generate simple RDF for schema:Person instances.
prefixes:
  ex: http://example.com#
  schema: http://schema.org/

mappings:
  persons:
    sources:
      - access: persons.csv
        referenceFormulation: csv
        delimiter: ;
    s: ex:person_$(ID)
    po:
      - p: rdf:type
        o: schema:Person
      - p: schema:givenName
        o: $(FirstName)
This compiles to RML, which mapping engines execute to produce the RDF. Expressed directly in RML (Turtle), the same logic requires considerably more boilerplate and is much harder to read.

Tooling in the RML Ecosystem
Morph‑KGC
Morph‑KGC is a modern, efficient, and flexible R2RML/RML engine designed for materializing knowledge graphs from multiple data sources. It focuses on:
- high performance
- compatibility with RML and YARRRML
- easy setup through a small configuration file
- clean integration with Python and data engineering workflows.
A basic Morph‑KGC configuration is a small INI‑style file and might look like this (the CSV location itself comes from the mapping's source declaration):
[CONFIGURATION]
output_file: output.nt

[PersonsCSV]
mappings: mappings.rml.ttl
Running Morph‑KGC materializes the RDF with minimal overhead.
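For Python-centric workflows, the same materialization can also be driven through Morph‑KGC's Python API. A minimal sketch, assuming the configuration above is saved as config.ini:

import morph_kgc

# Materialize the knowledge graph described by the configuration file.
# The result is an rdflib Graph that can be queried, validated, or serialized.
graph = morph_kgc.materialize("config.ini")

print(f"Generated {len(graph)} triples")
graph.serialize(destination="output.ttl", format="turtle")  # e.g. re-serialize to Turtle

The same run can be triggered from the command line with python3 -m morph_kgc config.ini, which makes it easy to embed in scripts and schedulers.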
RMLStreamer
For large‑scale workloads or continuous ingestion, RMLStreamer builds on Apache Flink. It executes RML mappings in distributed or streaming contexts, enabling near‑real‑time generation of knowledge graph fragments from high‑volume inputs.
Neonto editor
During the development phase, tools like the Neonto editor help design and refine mappings visually, reducing trial‑and‑error and bridging the gap between domain experts and engineers.
Why Morph‑KGC + YARRRML Works Well
This combination offers a practical balance:
- YARRRML provides human‑friendly mapping design.
- RML ensures interoperability and long-term stability.
- Morph‑KGC materializes RDF quickly and consistently.
- RMLStreamer scales when workloads grow.
- Everything remains version-controlled, testable, and reproducible (see the test sketch below).

As a result, RML-based pipelines provide a robust foundation for scalable Data Ingestion, even as data volume, velocity, and variety increase. Together, these tools create a sustainable, maintainable, and transparent data-to-knowledge-graph pipeline.
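To make the testability point concrete, here is a minimal sketch of a mapping test, assuming pytest, Morph‑KGC's Python API, a dedicated test configuration, and the person example from above (all file names are illustrative):

import morph_kgc
from rdflib import Namespace, RDF

SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.com#")

def test_persons_mapping():
    # Materialize a small, fixed test dataset using the test configuration.
    g = morph_kgc.materialize("tests/config.test.ini")

    ada = EX["person_1"]
    # The mapping should type each row as schema:Person ...
    assert (ada, RDF.type, SCHEMA.Person) in g
    # ... and carry over the first name (None acts as a wildcard in rdflib).
    assert (ada, SCHEMA.givenName, None) in g

Running such tests on every mapping change keeps refactorings of the ontology or the sources from silently breaking the graph.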
Putting Everything Together
A robust workflow might look like this:
- Model your domain with an ontology
- Inventory your data sources
- Write mappings in YARRRML
- Convert to RML (Morph‑KGC can also consume YARRRML files directly)
- Use Morph‑KGC to materialize the first RDF graph
- Validate the result with SHACL (see the sketch after this list)
- Automate everything in CI/CD
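
For the validation step, a minimal sketch using pySHACL, assuming the materialized graph was written to output.nt and the constraints live in shapes.ttl (both names are illustrative):

from pyshacl import validate
from rdflib import Graph

# Load the materialized graph and the SHACL shapes.
data_graph = Graph().parse("output.nt")
shapes_graph = Graph().parse("shapes.ttl")

# Run SHACL validation; fail the pipeline if the graph does not conform.
conforms, _, report_text = validate(data_graph, shacl_graph=shapes_graph)
print(report_text)
if not conforms:
    raise SystemExit("SHACL validation failed")

Wired into CI/CD, the non-zero exit code stops faulty data from ever reaching the Knowledge Graph.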

Want to learn how to build production-ready ETL pipelines with YARRRML? Book a call with us.
