YAML Schema

pyconveyor lets you define the shape of LLM output entirely in YAML — no Python files required. Field descriptions become part of the prompt automatically.

Simple inline schema

The quickest way to add a schema is to write field names and type strings directly inside a step:

steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      title: str
      authors: list[str]
      doi: str | None
      publication_year: int
    max_attempts: 3

Supported types:

Type string       Python type
str               str
int               int
float             float
bool              bool
str | None        Optional[str]
int | None        Optional[int]
float | None      Optional[float]
bool | None       Optional[bool]
list[str]         list[str]
list[int]         list[int]
list[float]       list[float]
list[bool]        list[bool]
dict[str, str]    dict[str, str]
dict[str, int]    dict[str, int]

For anything more complex (unions other than X | None, other generics, custom validators), use a Pydantic model (see When to use Python models below).
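As a rough illustration of the table above, the mapping from type strings to Python types can be written as a plain lookup. This is only a sketch: `TYPE_TABLE` and `resolve_type` are hypothetical names and are not part of pyconveyor, whose real parser may work differently.

```python
from typing import Optional

# Scalar type strings from the table above.
SCALARS = {"str": str, "int": int, "float": float, "bool": bool}

# Build the lookup explicitly so that only the documented strings resolve.
TYPE_TABLE = {}
for name, py_type in SCALARS.items():
    TYPE_TABLE[name] = py_type
    TYPE_TABLE[f"{name} | None"] = Optional[py_type]
    TYPE_TABLE[f"list[{name}]"] = list[py_type]
TYPE_TABLE["dict[str, str]"] = dict[str, str]
TYPE_TABLE["dict[str, int]"] = dict[str, int]


def resolve_type(type_string):
    """Hypothetical helper: look up the Python type for a schema type string."""
    key = " ".join(type_string.split())  # collapse runs of whitespace
    if key not in TYPE_TABLE:
        raise ValueError(f"unsupported type string: {type_string!r}")
    return TYPE_TABLE[key]
```

Anything outside the table, such as `set[str]`, raises a `ValueError` in this sketch, which mirrors the advice to switch to a Pydantic model for more complex types.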


Rich field format

Add a description, constraints, and failure behaviour by expanding each field from a type string to a dict:

steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      title:
        type: str
        description: "Paper title exactly as written, including subtitle."
        min_length: 1
      authors:
        type: list[str]
        description: "All author names in publication order."
        min_items: 1
      doi:
        type: str | None
        description: "DOI if listed. Null if not found."
        pattern: "^10\\.[0-9]{4,}/.+$"
        on_fail: null
      publication_year:
        type: int
        description: "Four-digit year of publication."

Rich field keys

Key          Required  Description
type         yes       Type string (same as the simple format).
description  no        Human-readable description. Injected into the LLM prompt automatically.
pattern      no        Regex pattern. The entire string value must match it.
min_length   no        Minimum string length (strings only).
max_length   no        Maximum string length (strings only).
min_items    no        Minimum list length (lists only).
max_items    no        Maximum list length (lists only).
on_fail      no        What to do when a constraint is violated: error (default), null, or warn.
vocab        no        Vocabulary filename in the vocabularies/ directory (e.g. organism → vocabularies/organism.yaml), or an inline dict {terms: [...], description: ...}. Vocab normalisation runs automatically: fuzzy matches are normalised and novel values are captured as suggestions.
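To make the vocab row more concrete, here is a minimal sketch of fuzzy normalisation using the standard library's `difflib`. The function name, the 0.8 cutoff, and the return shape are assumptions made for illustration; pyconveyor's actual matching rules are not described here and may differ.

```python
import difflib


def normalise_term(value, terms, cutoff=0.8):
    """Hypothetical helper: return (normalised_value, suggestion_or_None)."""
    lookup = {term.lower(): term for term in terms}
    key = value.strip().lower()
    if key in lookup:
        # Exact match, ignoring case and surrounding whitespace.
        return lookup[key], None
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    if close:
        # Fuzzy match: replace the value with the canonical term.
        return lookup[close[0]], None
    # Novel value: keep it unchanged and report it as a suggestion.
    return value, value


terms = ["Escherichia coli", "Ideonella sakaiensis"]
canonical, suggestion = normalise_term("Ideonela sakaiensis", terms)
```

In this sketch a misspelt term is mapped onto the vocabulary entry, while a value that resembles nothing in the list is returned as-is together with a suggestion that a reviewer could later add to the vocabulary.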

on_fail values

Value            Behaviour
error (default)  Raise a ValidationError; triggers a retry if max_attempts > 1.
null             Silently coerce the invalid value to None.
warn             Log a warning and keep the value as-is.
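The three behaviours can be sketched for a single `pattern` constraint as follows. `check_pattern` and `FieldError` are hypothetical names, and letting `None` pass unchecked is an assumption; the sketch only mirrors the table and is not pyconveyor's implementation.

```python
import logging
import re

logger = logging.getLogger("schema_sketch")


class FieldError(ValueError):
    """Stand-in for the validation error that would trigger a retry."""


def check_pattern(name, value, pattern, on_fail="error"):
    """Hypothetical helper: apply a pattern constraint with an on_fail policy."""
    if on_fail not in ("error", "null", "warn"):
        raise ValueError(f"unknown on_fail value: {on_fail!r}")
    # Assumption: None always passes; otherwise the whole string must match.
    if value is None or re.fullmatch(pattern, value):
        return value
    if on_fail == "null":
        return None  # silently coerce the invalid value
    if on_fail == "warn":
        logger.warning("%s: %r does not match %s", name, value, pattern)
        return value  # keep the value as-is
    raise FieldError(f"{name}: {value!r} does not match {pattern}")


DOI_PATTERN = r"^10\.[0-9]{4,}/.+$"
cleaned = check_pattern("doi", "not available", DOI_PATTERN, on_fail="null")
```

Here `cleaned` ends up as `None`, whereas the same call with the default `on_fail="error"` would raise and leave the decision to the retry logic.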

Use on_fail: null for fields where the model occasionally returns an invalid value and a null is an acceptable fallback:

accession_id:
  type: str | None
  description: "Database accession (e.g. WP_123456). Null if not reported."
  pattern: "^[A-Za-z0-9][A-Za-z0-9_.]{2,39}$"
  on_fail: null       # bad accession → null, no retry

Use on_fail: error (the default) for fields that must be correct:

organism_name:
  type: str
  description: "Genus + species binomial."
  min_length: 1      # empty string → retry
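The interaction between on_fail: error and max_attempts can be pictured as a simple retry loop. This is a sketch under stated assumptions: `run_with_retries`, `call_model`, and `require_non_empty` are hypothetical, and raising a `RuntimeError` once attempts run out is a guess at reasonable behaviour, not documented pyconveyor behaviour.

```python
def run_with_retries(call_model, validate, max_attempts=3):
    """Call the model until its output validates or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = call_model(attempt)  # hypothetical: returns parsed model output
        try:
            return validate(raw)  # raises ValueError on a failed constraint
        except ValueError as exc:
            last_error = exc  # remember why this attempt failed
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


def require_non_empty(output):
    """Mimics min_length: 1 with the default on_fail: error."""
    if len(output["organism_name"]) < 1:
        raise ValueError("organism_name: shorter than min_length 1")
    return output


# Simulated model replies: the first is invalid, the second is accepted.
replies = [{"organism_name": ""}, {"organism_name": "Ideonella sakaiensis"}]
result = run_with_retries(lambda attempt: replies[attempt - 1], require_non_empty)
```

With `max_attempts=1` the same empty first reply would fail immediately, which is why the table ties retries to max_attempts > 1.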

Nested objects

Use type: list with an items: block to define a list of structured objects:

steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      records:
        type: list
        description: "One record per organism mentioned in the paper."
        min_items: 1
        items:
          organism_name:
            type: str
            description: "Genus + species binomial."
            min_length: 1
          plastic:
            type: str
            description: "ISO polymer code (e.g. PET, PLA)."
          confidence:
            type: float
            description: "Extraction confidence 0.0–1.0."

Items can themselves have type: list with an items: block for deeper nesting, though two levels is enough for almost all extraction tasks.
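To show how a nested items: block can be checked recursively, here is a minimal sketch that walks a schema dict (as it would look after YAML parsing) alongside the model output. `validate_object` is a hypothetical helper that only handles plain scalar types, min_items, and lists of objects; it is not pyconveyor's validator.

```python
# JSON numbers may arrive as int, so "float" accepts both.
SCALARS = {"str": str, "int": int, "float": (int, float), "bool": bool}


def validate_object(fields, data, path=""):
    """Hypothetical helper: return a list of error strings for `data`."""
    if not isinstance(data, dict):
        return [f"{path.rstrip('.') or '<root>'}: expected an object"]
    errors = []
    for name, spec in fields.items():
        if isinstance(spec, str):  # simple format: bare type string
            spec = {"type": spec}
        where = f"{path}{name}"
        if name not in data:
            errors.append(f"{where}: missing")
            continue
        value = data[name]
        if spec["type"] == "list" and "items" in spec:
            if not isinstance(value, list):
                errors.append(f"{where}: expected a list")
                continue
            if len(value) < spec.get("min_items", 0):
                errors.append(f"{where}: expected at least {spec['min_items']} items")
            for index, item in enumerate(value):
                # Recurse into each object, extending the error path.
                errors.extend(validate_object(spec["items"], item, f"{where}[{index}]."))
        else:
            expected = SCALARS.get(spec["type"])
            if expected is not None and not isinstance(value, expected):
                errors.append(f"{where}: expected {spec['type']}")
    return errors
```

Because the function calls itself for every items: block, deeper nesting needs no extra code, only a longer error path such as `records[0].confidence`.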


Top-level schema: block

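For complex schemas shared across multiple steps, define the schema once at the top level of the pipeline file. In the example below the step still references a Python class by module:ClassName, and that class is built from the top-level block: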

schema:
  records:
    type: list
    description: "One record per item."
    min_items: 1
    items:
      name:
        type: str
        description: "Item name."
      value:
        type: float
        description: "Numeric value."
  metadata:
    type: str
    description: "Document-level metadata."

steps:
  - name: extract
    type: ensemble
    schema: src.extractor:ExtractionResult  # still references Python class
    prompt: prompts/extract.j2
    members:
      - model: model_a
        name: primary
        required: true
      - model: model_b
        name: reviewer
        required: false
    judge:
      model: model_b
      condition: all_succeeded

The top-level schema: block is typically loaded by your application code:

import yaml
from pathlib import Path
from pyconveyor.schema_builder import yaml_dict_to_model

raw = yaml.safe_load(Path("pipeline.yaml").read_text())
ExtractionResult = yaml_dict_to_model("ExtractionResult", raw["schema"])

This keeps the schema as the single source of truth: edit the YAML and the Pydantic model is rebuilt from it the next time the pipeline file is loaded.


How descriptions reach the prompt

When a field has a description, pyconveyor renders a schema hint and makes it available as {{ schema_hint }} in your Jinja2 prompt templates:

{# prompts/extract.j2 #}
You are a scientific literature extractor.

{{ schema_hint }}

---
{{ ctx.document }}

schema_hint renders as:

Return a JSON object with the following fields:
- "records": array of objects (required)
    One record per organism mentioned in the paper.
    - "organism_name": string (required)
        Genus + species binomial.
    - "plastic": string (required)
        ISO polymer code (e.g. PET, PLA).
    - "confidence": number (required)
        Extraction confidence 0.0–1.0.
- "metadata": string (required)
    Document-level metadata.

Nested fields are indented one level deeper than their parent. Fields without descriptions appear as a single line.

If {{ schema_hint }} is absent from the template, no hint is injected — the schema is still enforced for validation and retries, just not described in the prompt.
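The layout of the hint shown above can be reproduced with a short recursive renderer. This is a sketch inferred from the sample output only: `render_schema_hint` is a hypothetical name, the "integer" and "boolean" labels are assumptions, and every field is marked (required) because the sample shows no optional fields.

```python
# Assumed JSON-style labels; only "string" and "number" appear in the sample.
JSON_NAMES = {"str": "string", "int": "integer", "float": "number", "bool": "boolean"}


def render_fields(fields, indent=0):
    """Render one level of fields, recursing into items: blocks."""
    pad = " " * indent
    lines = []
    for name, spec in fields.items():
        if isinstance(spec, str):  # simple format: bare type string
            spec = {"type": spec}
        nested = spec.get("items")
        kind = "array of objects" if nested else JSON_NAMES.get(spec["type"], spec["type"])
        lines.append(f'{pad}- "{name}": {kind} (required)')
        if "description" in spec:
            # Descriptions sit one level deeper than the field line.
            lines.append(f'{pad}    {spec["description"]}')
        if nested:
            lines.extend(render_fields(nested, indent + 4))
    return lines


def render_schema_hint(fields):
    """Hypothetical helper: build the text substituted for schema_hint."""
    header = "Return a JSON object with the following fields:"
    return "\n".join([header, *render_fields(fields)])
```

A field without a description contributes a single line, and each nested level adds four spaces of indentation, matching the two rules stated above.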


Using Pydantic models directly

YAML schemas cover the common cases. For anything beyond what YAML supports, pass a Pydantic BaseModel subclass as the schema instead:

steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema: schemas:ExtractionResult   # module:ClassName

# schemas.py
from pydantic import BaseModel, field_validator

class EntryRecord(BaseModel):
    organism: str
    confidence: float

    @field_validator("confidence")
    @classmethod
    def _range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence out of range")
        return v

class ExtractionResult(BaseModel):
    entries: list[EntryRecord]
    supplements_required: bool

model_to_schema_hint() works with hand-written models too — any field defined with Field(description=...) gets rendered in {{ schema_hint }}:

from pydantic import BaseModel, Field

# EntryRecord is the model defined in schemas.py above.
class ExtractionResult(BaseModel):
    entries: list[EntryRecord] = Field(..., description="One entry per organism.")
    supplements_required: bool = Field(..., description="True if key data is only in supplements.")

When to use Python models

Prefer YAML schemas when:

  • You want the schema to be readable and editable without touching Python
  • You need field descriptions to appear in prompts
  • Simple type + constraint combinations are sufficient (min_length, pattern, on_fail)

Use Python Pydantic models when:

  • You need cross-field validation (@model_validator)
  • You need computed fields or custom coercion beyond on_fail
  • You want to reuse the model in downstream code (type annotations, serialisation)
  • Your schema is deeply nested or recursive

Both approaches can be mixed freely — a Pydantic model can reference YAML-generated sub-models and vice versa.
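
As an example of the cross-field validation mentioned in the list above, here is a minimal Pydantic v2 sketch. The `PageRange` model and its fields are invented for illustration and do not come from pyconveyor; the point is only that a rule spanning two fields cannot be expressed with per-field YAML constraints.

```python
from pydantic import BaseModel, ValidationError, model_validator


class PageRange(BaseModel):
    """Hypothetical model: two fields whose values constrain each other."""

    first_page: int
    last_page: int

    @model_validator(mode="after")
    def _check_order(self):
        # Runs after both fields have been validated individually.
        if self.last_page < self.first_page:
            raise ValueError("last_page must not precede first_page")
        return self


valid = PageRange(first_page=101, last_page=118)

try:
    PageRange(first_page=118, last_page=101)
    rejected = False
except ValidationError:
    rejected = True
```

Such a model would be referenced from a step with the module:ClassName form shown earlier.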