Getting started
This guide will walk you through the basics of using Litelines to get structured generation from language models. By the end, you'll understand how to:
- Install Litelines
- Generate a basic structured response
- Generate a basic streamed structured response
Installation
To install Litelines, use pip:
pip install litelines
or, if you prefer uv:
uv pip install litelines
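To verify the installation, import the package (the import name matches the package; the __version__ attribute is an assumption, and a plain import also suffices as a check):
import litelines
print(litelines.__version__)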
Your First Structured Generation
Let's start with a simple example.
Download a model and its tokenizer
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = torch.device("cuda") # "cuda", "mps", or "cpu"
model_id = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
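If you're not sure which accelerator is available on your machine, you can pick the device programmatically instead of hard-coding it:
# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU.
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)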
Prepare the inputs to the LLM
user_input = "What is the sentiment of the following text: 'Awesome'"
messages = [{"role": "user", "content": user_input}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)
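To see the exact prompt string the chat template produces, you can render it without tokenizing:
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)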
Define a Pydantic schema describing the required JSON
from typing import Literal
from pydantic import BaseModel, Field
class Sentiment(BaseModel):
    """Correctly inferred `Sentiment`."""

    label: Literal["positive", "negative"] = Field(
        ..., description="Sentiment of the text"
    )
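If you're curious what JSON Schema Pydantic derives from this model (the schema the processor constrains generation to), you can print it; the output below is for Pydantic v2 and may vary slightly by version:
from pprint import pprint
pprint(Sentiment.model_json_schema())
# {'description': 'Correctly inferred `Sentiment`.',
#  'properties': {'label': {'description': 'Sentiment of the text',
#                           'enum': ['positive', 'negative'],
#                           'title': 'Label',
#                           'type': 'string'}},
#  'required': ['label'],
#  'title': 'Sentiment',
#  'type': 'object'}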
Define the processor and visualize it
from litelines.transformers import SchemaProcessor
processor = SchemaProcessor(response_format=Sentiment, tokenizer=tokenizer)
processor.show_graph()
Generate a structured response
generated = model.generate(**inputs, logits_processor=[processor])
print(tokenizer.decode(generated[0][inputs['input_ids'].shape[-1]:]))
# {"label": "positive"}
Visualize the selected path
processor.show_graph()
Your First Streamed Structured Generation
Since Litelines gives you the logits processor itself, you can use it anywhere a logits_processor argument is accepted. In particular, you can stream a response exactly as you normally would; just don't forget to pass the processor.
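Note that show_graph() above displayed the path selected during the previous generation, so the processor appears to carry state between runs. To be safe, this example re-creates it before streaming (an assumption; Litelines may reset it automatically):
processor = SchemaProcessor(response_format=Sentiment, tokenizer=tokenizer)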
from threading import Thread
from transformers import TextIteratorStreamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generation_kwargs = dict(
    inputs, streamer=streamer, logits_processor=[processor], max_new_tokens=100
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
assistant_response = ""
for chunk in streamer:
    # Strip any trailing special tokens (EOS/pad) from the final chunk.
    if tokenizer.eos_token in chunk or tokenizer.pad_token in chunk:
        chunk = chunk.split(tokenizer.eos_token)[0]
        chunk = chunk.split(tokenizer.pad_token)[0]
    assistant_response += chunk
    print(chunk, end="")
thread.join()
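Once the stream finishes, the accumulated string is the same schema-conforming JSON, so you can validate it exactly as before:
print()  # end the streamed line
print(Sentiment.model_validate_json(assistant_response))
# label='positive'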