Structured Outputs or Structured Generation in Large Language Models: Why not both?

Alonso Silva Allende (Github: alonsosilvaallende)

Introduction

hi 👋, I’m Alonso

Slides:

https://alonsosilvaallende.github.io/2025-PyCon-Chile

This presentation:

https://github.com/alonsosilvaallende/2025-PyCon-Chile/blob/main/PyConChile.ipynb

Unstructured text → Structured data

Application: Extraction

Unstructured text:

“My name is John Smith and you can contact me at sales@example.com and she is Jane Doe and can be contacted at support@example.com”

→ Structured data:

First name | Last name | Email
John       | Smith     | sales@example.com
Jane       | Doe       | support@example.com
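
In Python, this target structure can be written as a Pydantic schema, the same approach used later in this talk (a minimal sketch; the Contact/Contacts class and field names are illustrative):

from pydantic import BaseModel, Field

class Contact(BaseModel):
    first_name: str = Field(..., description="The person's first name")
    last_name: str = Field(..., description="The person's last name")
    email: str = Field(..., description="The person's email address")

class Contacts(BaseModel):
    contacts: list[Contact] = Field(..., description="All contacts mentioned in the text")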

Application: Classification

Unstructured text → Structured data

Email                                                              | Department
I would like to have more information related to the new product. | Sales
I cannot exit Vim on my computer. Could you help me with that?    | IT
Are there any openings at your company?                           | HR
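
For classification, the target structure can be a single constrained field; a minimal Pydantic sketch using the departments from the table above (the Routing class name is illustrative):

from typing import Literal

from pydantic import BaseModel, Field

class Routing(BaseModel):
    department: Literal["Sales", "IT", "HR"] = Field(
        ..., description="The department that should handle the email"
    )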

Application: Knowledge Graph Generation

Unstructured text:

“Alice loves Bob but she hates Charles”

→ Structured data:

Subject | Relation | Object
Alice   | loves    | Bob
Alice   | hates    | Charles
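
A minimal Pydantic sketch of such a graph schema (the Edge/KnowledgeGraph names and fields are illustrative):

from pydantic import BaseModel, Field

class Edge(BaseModel):
    subject: str = Field(..., description="Source entity, e.g. 'Alice'")
    relation: str = Field(..., description="Relation, e.g. 'loves' or 'hates'")
    object: str = Field(..., description="Target entity, e.g. 'Bob'")

class KnowledgeGraph(BaseModel):
    edges: list[Edge] = Field(..., description="All (subject, relation, object) triples")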

Application: Agents / MCP Servers

Unstructured text:

“What’s the temperature in San Francisco now? How about tomorrow?”

→ Structured data:

Tool                    | Tool arguments
get_current_temperature | {'location': 'San Francisco'}
get_temperature_by_date | {'location': 'San Francisco', 'date': '2025-11-10'}
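
Such tools can be declared, for example, as Pydantic models exposed through OpenAI-style function tools, using the same pydantic_function_tool helper that appears later in this talk (the field names and date format below are assumptions):

from pydantic import BaseModel, Field
from openai import pydantic_function_tool

class get_current_temperature(BaseModel):
    """Get the current temperature at a location."""
    location: str = Field(..., description="City name, e.g. 'San Francisco'")

class get_temperature_by_date(BaseModel):
    """Get the forecast temperature at a location for a given date."""
    location: str = Field(..., description="City name, e.g. 'San Francisco'")
    date: str = Field(..., description="Date in YYYY-MM-DD format")

# The class name becomes the tool name, matching the table above
tools = [
    pydantic_function_tool(get_current_temperature),
    pydantic_function_tool(get_temperature_by_date),
]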

Unstructured text → Structured data

  • 3 ways to go from unstructured text to structured data using language models:
    • Modify the weights (SFT, RLVR, etc.)
    • Improve the prompt (Instructor, DSPy, BAML, etc.)
    • Constrain the generation (Outlines, XGrammar, Guidance, etc.)


These three approaches (modifying weights, improving prompts, and constraining generation) are complementary; they should not be viewed as mutually exclusive.

Framework vs Library

  • A framework enforces a way of doing things.
  • A library tries to do one thing well and gets out of your way.

New Library: Litelines

Installation:

Terminal
pip install litelines

Documentation: https://tinyurl.com/litelines

Getting Started colab notebooks:

https://tinyurl.com/litelines-hf

License: Apache-2.0

Using the transformers library

1. Load a model and its tokenizer

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-1.7B"  # https://huggingface.co/Qwen/Qwen3-1.7B
device = torch.device("cuda")  # "cuda", "cpu" or "mps"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

2. Prepare the inputs and generate a response

user_input = "Hello"
messages = [{"role": "user", "content": user_input}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
    tools=[],
    return_tensors="pt",
    return_dict=True,
).to(model.device)
prompt_length = inputs["input_ids"].shape[-1]
generated = model.generate(
    **inputs,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.1,
    logits_processor=[],
    max_new_tokens=100,
)
# Drop the prompt and the trailing <|im_end|> token
tokenizer.decode(generated[0][prompt_length:-1])
'Hello! How can I assist you today? 😊'

3. Wrap generation into a helper function

def generate_response(user_input, tools=[],
                      logits_processor=[], enable_thinking=False):
    messages = [{"role": "user", "content": user_input}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
        tools=tools,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    prompt_length = inputs["input_ids"].shape[-1]
    generated = model.generate(
        **inputs,
        do_sample=True,  # sampling must be enabled for temperature to take effect
        temperature=0.1,
        logits_processor=logits_processor,
        max_new_tokens=100,
    )
    # Drop the prompt and the trailing <|im_end|> token
    return tokenizer.decode(generated[0][prompt_length:-1])

Generate a response

user_input = "Hello"
generate_response(user_input)
'Hello! How can I assist you today? 😊'

Generate a streamed response

from threading import Thread

from transformers import TextIteratorStreamer

def generate_streamed_response(user_input,
                      tools=[],
                      logits_processor=[], enable_thinking=False):
    messages = [{"role": "user", "content": user_input}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
        tools=tools,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    # A fresh streamer per call avoids leftover state between generations
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    generation_kwargs = dict(
        inputs,
        streamer=streamer,
        logits_processor=logits_processor,
        max_new_tokens=100,
        do_sample=True,  # sampling must be enabled for temperature to take effect
        temperature=0.1,
    )
    # Run generation in a background thread and consume tokens as they arrive
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    assistant_response = ""
    for chunk in streamer:
        clean_chunk = chunk.split("<|im_end|>")[0]  # strip the end-of-turn token
        assistant_response += clean_chunk
        print(clean_chunk, end="")
    thread.join()
    return assistant_response

Generate a streamed response

generate_streamed_response("Write an haiku about Chile");
Cielo azul,  
Río cristalino,  
Tierra en silencio.

Structured generation

Structured generation

  • It consists of masking forbidden tokens (subwords) at each generation step, so the output is guaranteed to match a given pattern or schema.
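
As an illustration of the mechanism (not the litelines implementation), here is a toy transformers LogitsProcessor that always allows only a fixed set of token ids; a real schema processor additionally updates the allowed set at every step by tracking a finite-state machine over the pattern or schema:

import torch
from transformers import LogitsProcessor

class AllowedTokensProcessor(LogitsProcessor):
    """Toy processor: mask every token that is not in `allowed_token_ids`."""

    def __init__(self, allowed_token_ids):
        self.allowed_token_ids = list(allowed_token_ids)

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_token_ids] = 0.0  # keep allowed tokens, forbid the rest
        return scores + mask

# Example: only allow the tokens of " Yes" and " No", plus end-of-sequence
allowed = [
    tid
    for word in (" Yes", " No")
    for tid in tokenizer.encode(word, add_special_tokens=False)
]
allowed.append(tokenizer.eos_token_id)
processor_sketch = AllowedTokensProcessor(allowed)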

Structured generation

from litelines.transformers import SchemaProcessor

processor = SchemaProcessor(response_format=r"A\.|B\.|C\.", 
                            tokenizer=tokenizer)
processor.show_graph()

Structured generation

user_input = """Which key combination exits Vim?

A) Ctrl+X
B) Esc then :q!
C) Alt+F4

Answer:"""
generate_response(user_input, logits_processor=[processor])
'B.'
processor.show_graph()

Structured generation

processor = SchemaProcessor(response_format=r"Yes\.|No\.", 
                            tokenizer=tokenizer)
processor.show_graph()

Structured generation

user_input = """Give me an N and an O"""
generate_response(user_input, logits_processor=[processor])
'No.'

https://molab.marimo.io/notebooks/nb_sbGwmqQdNC5NPQjg7Eu8qi

Structured generation

from pydantic import BaseModel, Field
from typing import Literal

class Sentiment(BaseModel):
    key: Literal["ham", "spam"] = Field(..., description="Whether the message is spam or ham")

processor = SchemaProcessor(response_format=Sentiment, tokenizer=tokenizer, whitespace_pattern="")
processor.show_graph()
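
A usage sketch, assuming pydantic v2 and an input that is clearly spam (the exact JSON text emitted depends on the model, but the processor constrains it to validate against Sentiment):

user_input = "Classify this message: 'Congratulations, you won a free cruise! Click here.'"
assistant_response = generate_response(user_input, logits_processor=[processor])
# Expected shape of the constrained output (illustrative): {"key":"spam"}
Sentiment.model_validate_json(assistant_response)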

Structured outputs

Structured outputs

  • It consists of using the function/tool-calling capabilities of language models to structure their responses.
  • Used by most major LLM providers and APIs (e.g., OpenAI-style tool calling).

Structured outputs

user_input = "Extract Jason is 25 years old"

from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(..., description="The person's name")
    age: int = Field(..., description="The person's age in years")

from openai import pydantic_function_tool

tool = pydantic_function_tool(Person)

assistant_response = generate_response(user_input, tools=[tool])
print(assistant_response)
<tool_call>
{"name": "Person", "arguments": {"name": "Jason", "age": 25}}
</tool_call>
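
The tool call is still plain text; here is a sketch of turning it into a validated Person object (the <tool_call> wrapper comes from Qwen's chat template, so the parsing below is tied to this setup):

import json

# Extract the JSON payload between the <tool_call> ... </tool_call> tags
payload = assistant_response.split("<tool_call>")[1].split("</tool_call>")[0]
tool_call = json.loads(payload)
person = Person(**tool_call["arguments"])
print(person)  # name='Jason' age=25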

Structured outputs

user_input = "What's the sentiment of the text: That's awesome!"

from typing import Literal

class Sentiment(BaseModel):
    label: Literal["Positive", "Negative"] = Field(
        ..., description="The sentiment conveyed by the text"
    )

tool = pydantic_function_tool(Sentiment)
assistant_response = generate_response(
    user_input, 
    tools=[tool],
)
print(assistant_response)
<tool_call>
{"name": "Sentiment", "arguments": {"label": "Positive"}}
</tool_call>

Structured outputs sometimes fail

user_input = "That's awesome!"

from typing import Literal

class Sentiment(BaseModel):
    label: Literal["Positive", "Negative"] = Field(
        ..., description="The sentiment conveyed by the text"
    )

tool = pydantic_function_tool(Sentiment)
assistant_response = generate_response(
    user_input, 
    tools=[tool],
)
print(assistant_response)
The sentiment of the text "That's awesome!" is Positive.

Structured outputs + structured generation

Structured outputs + structured generation

  • There is no good reason not to combine structured outputs with structured generation.

Structured outputs + structured generation

user_input = "That's awesome!"

class Sentiment(BaseModel):
    label: Literal["Positive", "Negative"] = Field(
        ..., description="The sentiment conveyed by the text"
    )

tool = pydantic_function_tool(Sentiment)
processor = SchemaProcessor(
    response_format=Sentiment, tokenizer=tokenizer, include_tool_call=True
)
assistant_response = generate_response(
    user_input, 
    tools=[tool], 
    logits_processor=[processor]
)
print(assistant_response)
<tool_call>
{"name": "Sentiment", "arguments": {"label": "Positive"}}
</tool_call>

Structured outputs + structured generation

user_input = "Extract Jason is 25 years old"

class Person(BaseModel):
    name: str = Field(..., description="The person's name")
    age: int = Field(..., description="The person's age in years")

tool = pydantic_function_tool(Person)
processor = SchemaProcessor(
    response_format=Person, tokenizer=tokenizer, include_tool_call=True
)
assistant_response = generate_response(
    user_input, 
    tools=[tool], 
    logits_processor=[processor]
)
print(assistant_response)
<tool_call>
{"name": "Person", "arguments": {"name": "Jason", "age": 25}}
</tool_call>

Streamed structured outputs + structured generation

assistant_response = generate_streamed_response(
    user_input, 
    tools=[tool], 
    logits_processor=[processor]
)
<tool_call>
{"name": "Person", "arguments": {"name": "Jason", "age": 25}}
</tool_call>

Visualize the processor

processor.show_graph()

Experiments

  • SST-2 dataset (binary sentiment classification)
  • Accuracy without → with the logits processor (see the evaluation sketch below):
    • Qwen2.5-0.5B: 78.14% → 84.17%
    • Qwen2.5-1.5B: 82.57% → 90.6%
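
This is not the exact benchmarking code, but a sketch of what such an evaluation loop can look like with the pieces above (the dataset id, prompt wording, subset size, and tool-call parsing are assumptions):

import json

from datasets import load_dataset

dataset = load_dataset("sst2", split="validation")

tool = pydantic_function_tool(Sentiment)

correct, total = 0, 100
for example in dataset.select(range(total)):  # small subset for illustration
    # A fresh processor per example, in case it keeps per-generation state
    processor = SchemaProcessor(
        response_format=Sentiment, tokenizer=tokenizer, include_tool_call=True
    )
    response = generate_response(
        f"What's the sentiment of the text: {example['sentence']}",
        tools=[tool],
        logits_processor=[processor],
    )
    payload = response.split("<tool_call>")[1].split("</tool_call>")[0]
    predicted = json.loads(payload)["arguments"]["label"]
    expected = "Positive" if example["label"] == 1 else "Negative"
    correct += predicted == expected

print(f"Accuracy: {correct / total:.2%}")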

Conclusions

Conclusions

  • Litelines is a new library for structured generation
  • It makes it easy to combine structured generation with structured outputs

Perspectives

  • Support batched generation with the transformers library
  • Add vLLM support
  • Build a high-level library similar to Marvin or Instructor