YAML Schema¶
pyconveyor lets you define the shape of LLM output entirely in YAML — no Python files required. Field descriptions become part of the prompt automatically.
Simple inline schema¶
The quickest way to add a schema is to write field names and type strings directly inside a step:
steps:
- name: extract
type: llm
model: default
prompt: prompts/extract.j2
schema:
title: str
authors: list[str]
doi: str | None
publication_year: int
max_attempts: 3
Supported types:
| Type string | Python type |
|---|---|
str |
str |
int |
int |
float |
float |
bool |
bool |
str \| None |
Optional[str] |
int \| None |
Optional[int] |
float \| None |
Optional[float] |
bool \| None |
Optional[bool] |
list[str] |
list[str] |
list[int] |
list[int] |
list[float] |
list[float] |
list[bool] |
list[bool] |
dict[str, str] |
dict[str, str] |
dict[str, int] |
dict[str, int] |
For anything more complex — unions, generics, custom validators — use a Pydantic model (see When to use Python models below).
Rich field format¶
Add a description, constraints, and failure behaviour by expanding each field from a type string to a dict:
steps:
- name: extract
type: llm
model: default
prompt: prompts/extract.j2
schema:
title:
type: str
description: "Paper title exactly as written, including subtitle."
min_length: 1
authors:
type: list[str]
description: "All author names in publication order."
min_items: 1
doi:
type: str | None
description: "DOI if listed. Null if not found."
pattern: "^10\\.[0-9]{4,}/.+$"
on_fail: null
publication_year:
type: int
description: "Four-digit year of publication."
Rich field keys¶
| Key | Required | Description |
|---|---|---|
type |
yes | Type string (same as the simple format) |
description |
no | Human-readable description. Injected into the LLM prompt automatically. |
pattern |
no | Regex pattern. The value must match the full string. |
min_length |
no | Minimum string length (strings only). |
max_length |
no | Maximum string length (strings only). |
min_items |
no | Minimum list length (lists only). |
max_items |
no | Maximum list length (lists only). |
on_fail |
no | What to do when a constraint is violated: error (default), null, or warn. |
vocab |
no | Filename in vocabularies/ directory (e.g. organism → vocabularies/organism.yaml) or inline dict {terms: [...], description: ...}. Vocab normalisation runs automatically — fuzzy matches are normalised, novel values are captured as suggestions. |
on_fail values¶
| Value | Behaviour |
|---|---|
error (default) |
Raise a ValidationError — triggers a retry if max_attempts > 1 |
null |
Silently coerce the invalid value to None |
warn |
Log a warning and keep the value as-is |
Use on_fail: null for fields where bad model output is expected sometimes and a null is acceptable:
accession_id:
type: str | None
description: "Database accession (e.g. WP_123456). Null if not reported."
pattern: "^[A-Za-z0-9][A-Za-z0-9_.]{2,39}$"
on_fail: null # bad accession → null, no retry
Use on_fail: error (the default) for fields that must be correct:
organism_name:
type: str
description: "Genus + species binomial."
min_length: 1 # empty string → retry
Nested objects¶
Use type: list with an items: block to define a list of structured objects:
steps:
- name: extract
type: llm
model: default
prompt: prompts/extract.j2
schema:
records:
type: list
description: "One record per organism mentioned in the paper."
min_items: 1
items:
organism_name:
type: str
description: "Genus + species binomial."
min_length: 1
plastic:
type: str
description: "ISO polymer code (e.g. PET, PLA)."
confidence:
type: float
description: "Extraction confidence 0.0–1.0."
Items can themselves have type: list with an items: block for deeper nesting, though two levels is enough for almost all extraction tasks.
Top-level schema: block¶
For complex schemas shared across multiple steps, define the schema at the top level and reference it by name:
schema:
records:
type: list
description: "One record per item."
min_items: 1
items:
name:
type: str
description: "Item name."
value:
type: float
description: "Numeric value."
metadata:
type: str
description: "Document-level metadata."
steps:
- name: extract
type: ensemble
schema: src.extractor:ExtractionResult # still references Python class
prompt: prompts/extract.j2
members:
- model: model_a
name: primary
required: true
- model: model_b
name: reviewer
required: false
judge:
model: model_b
condition: all_succeeded
The top-level schema: block is typically loaded by your application code:
import yaml
from pathlib import Path
from pyconveyor.schema_builder import yaml_dict_to_model
raw = yaml.safe_load(Path("pipeline.yaml").read_text())
ExtractionResult = yaml_dict_to_model("ExtractionResult", raw["schema"])
This keeps the schema as the single source of truth: edit the YAML and the Pydantic model updates automatically.
How descriptions reach the prompt¶
When a field has a description, pyconveyor renders a schema hint and makes it available as {{ schema_hint }} in your Jinja2 prompt templates:
{# prompts/extract.j2 #}
You are a scientific literature extractor.
{{ schema_hint }}
---
{{ ctx.document }}
schema_hint renders as:
Return a JSON object with the following fields:
- "records": array of objects (required)
One record per organism mentioned in the paper.
- "organism_name": string (required)
Genus + species binomial.
- "plastic": string (required)
ISO polymer code (e.g. PET, PLA).
- "confidence": number (required)
Extraction confidence 0.0–1.0.
- "metadata": string (required)
Document-level metadata.
Nested fields are indented one level deeper than their parent. Fields without descriptions appear as a single line.
If {{ schema_hint }} is absent from the template, no hint is injected — the schema is still enforced for validation and retries, just not described in the prompt.
Using Pydantic models directly¶
YAML schemas cover the common cases. For anything beyond what YAML supports, pass a Pydantic BaseModel subclass as the schema instead:
steps:
- name: extract
type: llm
model: default
prompt: prompts/extract.j2
schema: schemas:ExtractionResult # module:ClassName
# schemas.py
from pydantic import BaseModel, field_validator
from typing import Optional
class EntryRecord(BaseModel):
organism: str
confidence: float
@field_validator("confidence")
@classmethod
def _range(cls, v: float) -> float:
if not 0.0 <= v <= 1.0:
raise ValueError("confidence out of range")
return v
class ExtractionResult(BaseModel):
entries: list[EntryRecord]
supplements_required: bool
model_to_schema_hint() works with hand-written models too — any field defined with Field(description=...) gets rendered in {{ schema_hint }}:
from pydantic import BaseModel, Field
class ExtractionResult(BaseModel):
entries: list[EntryRecord] = Field(..., description="One entry per organism.")
supplements_required: bool = Field(..., description="True if key data is only in supplements.")
When to use Python models¶
Prefer YAML schemas when:
- You want the schema to be readable and editable without touching Python
- You need field descriptions to appear in prompts
- Simple type + constraint combinations are sufficient (
min_length,pattern,on_fail)
Use Python Pydantic models when:
- You need cross-field validation (
@model_validator) - You need computed fields or custom coercion beyond
on_fail - You want to reuse the model in downstream code (type annotations, serialisation)
- Your schema is deeply nested or recursive
Both approaches can be mixed freely — a Pydantic model can reference YAML-generated sub-models and vice versa.