Understanding Structured Outputs

In my previous post, I explained that we can provide a description of a function to the language model and the language model will produce a call to that function (its name and its arguments) even if the function itself has not been implemented!

This feature has several exciting applications. It is the power behind structured outputs libraries such as Instructor and Marvin.

In this post, I start with a basic example of extracting information with structured outputs, then give a slightly more complex one that extracts the personal information of several people from a text. After that, I show an example of text classification with structured outputs. Finally, I explain how to do structured outputs in WebAssembly (optional, but fun if you want to play with structured outputs in the browser, within this post itself).

The language model has been trained (fine-tuned) to determine, given a prompt and a list of function descriptions:

which function to use
which arguments to use for that function

Basic Example

Let’s start with a basic example to see an interesting application.

Let’s download a small language model (1.7B parameters) and its tokenizer:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, cache_dir="/big_storage/llms/hf_models/"
).to("cuda")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

We can describe a function (which we won’t implement) that extracts the city, the state, and the country from the text.

from pydantic import BaseModel, Field
import json
from openai import pydantic_function_tool

class city_extractor(BaseModel):
    """Extracts the correctly inferred city, state and country name from the text with all the required parameters with correct types."""

    city: str = Field(..., description="city name, e.g. Berkeley")
    state: str = Field(..., description="state name, e.g. California")
    country: str = Field(..., description="country name, e.g United States")

tool = pydantic_function_tool(city_extractor)
print(json.dumps(tool, indent=4))

{
    "type": "function",
    "function": {
        "name": "city_extractor",
        "strict": true,
        "parameters": {
            "description": "Extracts the correctly inferred city, state and country name from the text with all the required parameters with correct types.",
            "properties": {
                "city": {
                    "description": "city name, e.g. Berkeley",
                    "title": "City",
                    "type": "string"
                },
                "state": {
                    "description": "state name, e.g. California",
                    "title": "State",
                    "type": "string"
                },
                "country": {
                    "description": "country name, e.g United States",
                    "title": "Country",
                    "type": "string"
                }
            },
            "required": [
                "city",
                "state",
                "country"
            ],
            "title": "city_extractor",
            "type": "object",
            "additionalProperties": false
        },
        "description": "Extracts the correctly inferred city, state and country name from the text with all the required parameters with correct types."
    }
}
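The tool description printed above is ordinary JSON Schema wrapped in a small envelope, so it does not depend on any particular helper. As a minimal sketch, assuming every field is a required string, a similar structure can be assembled with the standard library alone. The helper name `make_tool` and its signature are hypothetical, and the result omits the `title` keys that Pydantic adds.

```python
import json


def make_tool(name: str, description: str, fields: dict[str, str]) -> dict:
    """Build a tool description for string-only fields (hypothetical helper)."""
    properties = {
        field: {"description": text, "type": "string"}
        for field, text in fields.items()
    }
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(fields),  # every field is mandatory
                "additionalProperties": False,
            },
        },
    }


manual_tool = make_tool(
    "city_extractor",
    "Extracts the city, state and country name from the text.",
    {
        "city": "city name, e.g. Berkeley",
        "state": "state name, e.g. California",
        "country": "country name, e.g. United States",
    },
)
print(json.dumps(manual_tool, indent=4))
```

A dictionary built this way should be usable wherever the `tool` variable is used in this post, since the chat template only sees the resulting JSON.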

We can provide the text “I live in the big apple” together with the function description above. We obtain the following response:

user_input = "I live in the big apple"

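def generate_response(user_input, tool):
    """Stream the model's reply to `user_input` with `tool` available and return it."""
    messages = [
        {"role": "user", "content": user_input},
    ]

    # Render the conversation and the tool description with the model's chat template.
    prompt = tokenizer.apply_chat_template(
        messages, tools=[tool], tokenize=False, add_generation_prompt=True, enable_thinking=True
    )

    model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Greedy decoding: with do_sample=False the sampling parameters below are not used.
    generation_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=4 * 1024,
        do_sample=False,
        temperature=1.0,
        top_p=1.0,
        top_k=50,
    )

    from threading import Thread

    # Generate in a background thread so that the streamer can be consumed here.
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Print each chunk as it arrives and accumulate the full response.
    assistant_response = ""
    for chunk in streamer:
        assistant_response += chunk
        print(chunk, end="")

    thread.join()

    return assistant_response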

assistant_response = generate_response(user_input, tool)

<think>
Okay, the user says, "I live in the big apple." Let me think about how to handle this. The city_extractor function is supposed to get the city, state, and country from the text. 

First, the user mentions "the big apple." The Big Apple is a common nickname for New York City. So, the city would be New York. But wait, the function might expect the full name. However, "New York" is the correct city name. The state would be New York, but since it's a city, maybe the state is not needed? Wait, the function requires all three: city, state, country. But in this case, the user is referring to a city, not a state. So maybe the state is not applicable here. But the function's parameters require all three. Hmm.

Wait, the function's description says it extracts the city, state, and country. But if the user is referring to a city, maybe the state is not present. However, the function might still require the state. But in this case, the user is in New York, which is a state. So the state would be New York, and the country is the United States. 

But the user's statement is "I live in the big apple," which is a city. So the city is New York, state is New York (since it's a city in the state), and country is United States. But the function might not require the state if it's a city. However, the function's parameters require all three. So maybe the state is still needed. 

Alternatively, maybe the function is designed to handle cases where the city is the state. But I need to check the function's parameters. The function's properties include state as a string, so even if it's a city, the state is required. But the user is in New York, which is a state. So the state would be New York, and the country is United States. 

So the city_extractor function would return city: "New York", state: "New York", country: "United States". But I need to make sure that the function can handle that. The function's strict parameter is set to true, so it should correctly infer the parameters. 

Therefore, the correct call would be to city_extractor with city: "New York", state: "New York", country: "United States".
</think>

<tool_call>
{"name": "city_extractor", "arguments": {"city": "New York", "state": "New York", "country": "United States"}}
</tool_call><|im_end|>

Inside the tool_call XML tags, we get the information:

assistant_response_clean = assistant_response.split("<tool_call>")[-1].split("</tool_call>")[0]
assistant_response_json = json.loads(assistant_response_clean)

for key in assistant_response_json['arguments']:
    print(f"{key}: {assistant_response_json['arguments'][key]}")

city: New York
state: New York
country: United States
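The two `split` calls above assume that the response contains exactly one well-formed tool call. A small model can also answer in plain text, emit invalid JSON, or mention the tag inside its reasoning. The following is a minimal sketch of a more defensive parser; the function name `parse_tool_calls` is hypothetical, and it assumes the `<think>` and `<tool_call>` tags look like the ones in the output above.

```python
import json
import re

# Non-greedy match of whatever sits between the opening and closing tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call found after the reasoning block."""
    # Ignore anything the model wrote while thinking.
    answer = text.split("</think>")[-1]
    calls = []
    for match in TOOL_CALL_RE.finditer(answer):
        try:
            call = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing
        if isinstance(call, dict) and "name" in call and "arguments" in call:
            calls.append(call)
    return calls


sample = (
    "<think>\nThe Big Apple is New York City.\n</think>\n\n"
    "<tool_call>\n"
    '{"name": "city_extractor", "arguments": {"city": "New York", '
    '"state": "New York", "country": "United States"}}\n'
    "</tool_call><|im_end|>"
)
calls = parse_tool_calls(sample)
print(calls[0]["arguments"])
```

An empty list then signals that the model did not produce a usable call, which the caller can handle by retrying or falling back to a default.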

This is amazing! From the text “I live in the big apple”, the language model was able to infer that the city I was referring to was New York, the state New York, and the country United States!

I want to stress again that all I needed was to provide the language model with a description of the function, without ever implementing it.

Extractor

A slightly more complicated, but interesting, example is a personal information extractor. The logic is quite similar to the previous example. We need to describe a function whose arguments are the pieces of information we want to extract (first_name, last_name, email). Here is the description of a function that does that:

from typing import List

class Person(BaseModel):
    """It extracts the first name, the last name, and the email address mentioned in the text."""

    first_name: str = Field(..., description="the first name")
    last_name: str = Field(..., description="the last name")
    email: str = Field(..., description="the email address")


class PeopleList(BaseModel):
    people: List[Person] = Field(
        ..., description="List of people mentioned in the text"
    )

tool = pydantic_function_tool(PeopleList)
print(json.dumps(tool, indent=4))

{
    "type": "function",
    "function": {
        "name": "PeopleList",
        "strict": true,
        "parameters": {
            "$defs": {
                "Person": {
                    "description": "It extracts the first name, the last name, and the email address mentioned in the text.",
                    "properties": {
                        "first_name": {
                            "description": "the first name",
                            "title": "First Name",
                            "type": "string"
                        },
                        "last_name": {
                            "description": "the last name",
                            "title": "Last Name",
                            "type": "string"
                        },
                        "email": {
                            "description": "the email address",
                            "title": "Email",
                            "type": "string"
                        }
                    },
                    "required": [
                        "first_name",
                        "last_name",
                        "email"
                    ],
                    "title": "Person",
                    "type": "object",
                    "additionalProperties": false
                }
            },
            "properties": {
                "people": {
                    "description": "List of people mentioned in the text",
                    "items": {
                        "$ref": "#/$defs/Person"
                    },
                    "title": "People",
                    "type": "array"
                }
            },
            "required": [
                "people"
            ],
            "title": "PeopleList",
            "type": "object",
            "additionalProperties": false
        }
    }
}
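In this description the nested `Person` schema is stored once under `$defs` and referenced from `people.items` through `"$ref": "#/$defs/Person"`. To make that link concrete, here is a minimal sketch that inlines such local references. The function `resolve_refs` is a hypothetical helper written for this illustration: it assumes references are local (starting with `#/`), does not handle escaped path segments, drops any keys sitting next to a `$ref`, and would loop forever on a recursive schema.

```python
def resolve_refs(node, root):
    """Recursively replace local {"$ref": "#/..."} nodes with their targets."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            # Walk the path "#/$defs/Person" one segment at a time from the root.
            target = root
            for part in ref[2:].split("/"):
                target = target[part]
            return resolve_refs(target, root)
        return {key: resolve_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, root) for item in node]
    return node


# Trimmed copy of the "parameters" object printed above.
parameters = {
    "$defs": {
        "Person": {
            "type": "object",
            "properties": {
                "first_name": {"type": "string"},
                "last_name": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["first_name", "last_name", "email"],
        }
    },
    "type": "object",
    "properties": {
        "people": {"type": "array", "items": {"$ref": "#/$defs/Person"}}
    },
    "required": ["people"],
}

inlined = resolve_refs(parameters["properties"], parameters)
print(inlined["people"]["items"]["required"])
```

After inlining, `people.items` holds the full `Person` object schema, which is how the model is expected to read the description.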

We could have a text that says:

“My name is John Doe and you can contact me at sales@example.com and she is Jane Doe and can be contacted at support@example.com”

We can see that this small language model is able to extract the personal information from that text without any problem.

user_input = "My name is John Doe and you can contact me at sales@example.com and she is Jane Doe and can be contacted at support@example.com"

assistant_response = generate_response(user_input, tool)

<think>
Okay, let me see. The user provided their name and contact info, and also mentioned someone else, Jane Doe with another email. I need to extract this into a list of people.

First, the function PeopleList is available. It requires a list of Person objects, each with first_name, last_name, and email. The user's message has two entries: John Doe and Jane Doe. 

John Doe has first_name "John", last_name "Doe", and email "sales@example.com". Then Jane Doe has first_name "Jane", last_name "Doe", and email "support@example.com". 

I need to make sure each person's details are correctly mapped. The function's parameters are strict, so I have to check that all required fields are present. Both entries have first_name, last_name, and email, so that's good. 

I should structure the people array with each person as an object in the array. The function's example shows the people array as a list of Person objects. So the output should be two Person objects in the people array. 

No other information is needed here. The user didn't mention any other people, so the list should only include John and Jane. 

Double-checking the parameters: the function's properties are correct, and the emails are correctly assigned. Alright, that's all.
</think>

<tool_call>
{"name": "PeopleList", "arguments": {"people": [{"first_name": "John", "last_name": "Doe", "email": "sales@example.com"}, {"first_name": "Jane", "last_name": "Doe", "email": "support@example.com"}]}}
</tool_call><|im_end|>

Inside the tool_call XML tags, we get the information we wanted to extract:

assistant_response_clean = assistant_response.split("<tool_call>")[-1].split("</tool_call>")[0]
assistant_response_json = json.loads(assistant_response_clean)

for people in assistant_response_json['arguments']['people']:
    print(f"First name: {people['first_name']}, Last name: {people['last_name']}, E-mail: {people['email']}")

First name: John, Last name: Doe, E-mail: sales@example.com
First name: Jane, Last name: Doe, E-mail: support@example.com
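As far as I can tell, the generation code in this post does not constrain decoding to the schema, so the arguments are only as reliable as the model. It is therefore worth checking them before using them. With Pydantic v2 this would normally be done with `PeopleList.model_validate(...)`; the sketch below does a similar, much cruder check with the standard library only. The names `REQUIRED_FIELDS` and `validate_people` are hypothetical.

```python
REQUIRED_FIELDS = ("first_name", "last_name", "email")


def validate_people(arguments: dict) -> list[dict]:
    """Keep only entries where every required field is a non-empty string."""
    people = arguments.get("people")
    if not isinstance(people, list):
        raise ValueError("'people' must be a list")
    valid = []
    for entry in people:
        if not isinstance(entry, dict):
            continue  # ignore anything that is not a JSON object
        values = [entry.get(field) for field in REQUIRED_FIELDS]
        if all(isinstance(value, str) and value.strip() for value in values):
            # Drop unexpected extra keys and surrounding whitespace.
            valid.append({field: entry[field].strip() for field in REQUIRED_FIELDS})
    return valid


arguments = {
    "people": [
        {"first_name": "John", "last_name": "Doe", "email": "sales@example.com"},
        {"first_name": "Jane", "last_name": "Doe"},  # email is missing
    ]
}
print(validate_people(arguments))
```

Whether an incomplete entry should be dropped, as here, or should trigger a new request to the model depends on the application.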

This is great. We can imagine several interesting applications. For example, we could use a small language model to automatically extract the information we are interested in from candidates’ CVs and store it in a database, instead of asking candidates to enter that information themselves. Similarly, we could extract specific information from websites (as an example, I saw an application that extracted just the required ingredients from a collection of recipe websites).
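To make the database idea concrete, here is a minimal sketch that stores extracted records in SQLite using Python's built-in `sqlite3` module. The table name `candidates` and its columns are assumptions made for this illustration, and the records are written out literally rather than taken from a model response.

```python
import sqlite3

# Records in the same shape as the "people" list returned by the model.
people = [
    {"first_name": "John", "last_name": "Doe", "email": "sales@example.com"},
    {"first_name": "Jane", "last_name": "Doe", "email": "support@example.com"},
]

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
conn.execute(
    "CREATE TABLE candidates ("
    "first_name TEXT NOT NULL, last_name TEXT NOT NULL, email TEXT PRIMARY KEY)"
)
# Named placeholders keep the extracted strings out of the SQL text itself.
conn.executemany(
    "INSERT OR IGNORE INTO candidates VALUES (:first_name, :last_name, :email)",
    people,
)
conn.commit()

rows = conn.execute(
    "SELECT first_name, last_name, email FROM candidates ORDER BY first_name"
).fetchall()
print(rows)
```

Using the email address as the primary key together with `INSERT OR IGNORE` means that extracting the same person twice does not create a duplicate row.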

Classifier

Another example which is quite common for structured outputs is classification. Suppose we want to determine if an e-mail we received should be forwarded to the IT department or to the Sales department.

Before language models, this task required building a dataset of emails with their corresponding labels and then training a model to do the classification. Now, we just need to provide a succinct description. That’s neat!

We can describe a function that requires as arguments the information we want (IT department or Sales department). Here is the description of a function that does that:

from typing import Literal

from pydantic import BaseModel, Field

class Classifier(BaseModel):
    """Correctly inferred `team` the email should be directed to with all the required parameters with correct types."""

    team: Literal["IT department", "Sales department"] = Field(
        ..., description="Team at which should be the email should be directed to"
    )
tool = pydantic_function_tool(Classifier)
print(json.dumps(tool, indent=4))

{
    "type": "function",
    "function": {
        "name": "Classifier",
        "strict": true,
        "parameters": {
            "description": "Correctly inferred `team` the email should be directed to with all the required parameters with correct types.",
            "properties": {
                "team": {
                    "description": "Team at which should be the email should be directed to",
                    "enum": [
                        "IT department",
                        "Sales department"
                    ],
                    "title": "Team",
                    "type": "string"
                }
            },
            "required": [
                "team"
            ],
            "title": "Classifier",
            "type": "object",
            "additionalProperties": false
        },
        "description": "Correctly inferred `team` the email should be directed to with all the required parameters with correct types."
    }
}

We could have an email with the content:

“I would like to have more information related to the new product.”

This small language model is able to classify this text without any problem:

user_input = "I would like to have more information related to the new product."

assistant_response = generate_response(user_input, tool)

<think>
Okay, the user wants more information about a new product. Let me see. The available tool is a Classifier that determines the team to send the email to. The team options are IT department or Sales department.

Hmm, the user's request is about product information. Typically, Sales departments handle product details and customer inquiries. IT might be involved if there's a technical aspect, but the user didn't mention anything about technical issues. So, the most appropriate team here would be the Sales department. The Classifier function needs to be called with "Sales department" as the team parameter. I should make sure the JSON is correctly formatted with the team key and the appropriate value.
</think>

<tool_call>
{"name": "Classifier", "arguments": {"team": "Sales department"}}
</tool_call><|im_end|>

Inside the tool_call XML tags, we get the correct department:

assistant_response_clean = assistant_response.split("<tool_call>")[-1].split("</tool_call>")[0]
assistant_response_json = json.loads(assistant_response_clean)

print(assistant_response_json['arguments']['team'])

Sales department

Let’s try another one:

user_input = "I cannot exit Vim in my computer. Could you help me with that?"

assistant_response = generate_response(user_input, tool)

<think>
Okay, the user is having trouble exiting Vim. Let me think about how to help them. First, I need to determine which team they're dealing with because the Classifier function is supposed to figure that out. The team options are IT department or Sales department. But how does that relate to exiting Vim?

Hmm, maybe the user is using Vim for work, so they might be in the Sales department. But I'm not sure. The Classifier function requires the team to be specified. Since the user is asking about exiting Vim, which is a text editor, perhaps the IT department is more involved with technical support. But the Sales department might be more about business processes. 

Wait, the function's description says it's for correctly inferred team. The user's query is about a technical issue, so maybe the IT department is the right team. But the user didn't mention anything about IT. This is a bit confusing. The function needs to be called with the team parameter. Since the user is asking about exiting Vim, which is a technical problem, the team might be IT. 

So, I should call the Classifier function with the team as "IT department" to get the appropriate support. Even though the user didn't explicitly mention IT, the context of the problem suggests it. Therefore, the tool call would be to the Classifier function with team: "IT department".
</think>

<tool_call>
{"name": "Classifier", "arguments": {"team": "IT department"}}
</tool_call><|im_end|>

Again, we get the correct department:

assistant_response_clean = assistant_response.split("<tool_call>")[-1].split("</tool_call>")[0]
assistant_response_json = json.loads(assistant_response_clean)

print(assistant_response_json['arguments']['team'])

IT department
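The `enum` in the description tells the model which labels are allowed, but as far as I can tell nothing in the generation code used here enforces it, so a small model could return a slightly different string. Below is a minimal sketch of a guard that maps the answer onto an allowed label or falls back to a default. The names `ALLOWED_TEAMS` and `normalize_team`, and the fallback value, are hypothetical.

```python
ALLOWED_TEAMS = ("IT department", "Sales department")


def normalize_team(raw: str, fallback: str = "needs manual review") -> str:
    """Map the model's answer onto an allowed label, or return the fallback."""
    # Tolerate stray whitespace, surrounding quotes and different casing.
    cleaned = raw.strip().strip("'\"").casefold()
    for team in ALLOWED_TEAMS:
        if cleaned == team.casefold():
            return team
    return fallback


print(normalize_team("IT department"))
print(normalize_team(" sales Department "))
print(normalize_team("HR"))
```

Routing unknown labels to a manual review queue is only one option; another is to ask the model again with a more explicit prompt.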

In this example, we have seen only a simple classifier with two classes, but the same approach works with many more. Similarly, we can use a whole decision tree to classify the text and take different actions at the leaf nodes (to minimize delay, we could make asynchronous calls to the language model). For instance, we could first determine which department the email should be sent to and then, for each department, the type of question the user is asking (e.g. for Sales, which product it relates to; for IT, which program it relates to).
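The decision tree idea can be sketched with `asyncio`. In the sketch below, `classify` is a hypothetical stand-in for a structured-output call to the language model: it uses naive keyword matching so that the example runs without a model. The tree, the labels and the keywords are all invented for illustration.

```python
import asyncio

# Second-level labels for each department (invented for this sketch).
TREE = {
    "IT department": ["Vim", "Email client", "Other program"],
    "Sales department": ["New product", "Existing product"],
}
KEYWORDS = {
    "IT department": ("vim", "computer", "password"),
    "Sales department": ("product", "price", "quote"),
    "Vim": ("vim",),
    "Email client": ("outlook", "thunderbird"),
    "New product": ("new product",),
}


async def classify(text: str, labels: list[str]) -> str:
    """Hypothetical stand-in for one classification call to the model."""
    await asyncio.sleep(0)  # yield control, as a real model call would
    lowered = text.lower()
    for label in labels:
        if any(word in lowered for word in KEYWORDS.get(label, ())):
            return label
    return labels[-1]  # no keyword matched: fall back to the last label


async def route(text: str) -> tuple[str, str]:
    """Walk the tree: first pick the department, then the topic within it."""
    team = await classify(text, list(TREE))
    topic = await classify(text, TREE[team])
    return team, topic


async def route_many(texts: list[str]) -> list[tuple[str, str]]:
    # Emails are independent, so their trees can be walked concurrently.
    return await asyncio.gather(*(route(text) for text in texts))


emails = [
    "I cannot exit Vim in my computer. Could you help me with that?",
    "I would like to have more information related to the new product.",
]
results = asyncio.run(route_many(emails))
print(results)
```

The two levels of one email must run in sequence, because the second call depends on the first, so the concurrency gain comes from processing several emails at once. Inside a notebook, where an event loop is already running, `await route_many(emails)` would replace `asyncio.run(...)`.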

WebAssembly

We can do structured outputs in the browser thanks to WebAssembly!

We first need to install some packages:

We can download a quantized version of a small language model (0.6B parameters, not everything will work but most of it will). This step should take a few minutes the first time (later it should be stored in the cache):

We can send a prompt:

We can define a tool and tokenize the prompt:

This is the generation part and it can take a few minutes:

We decode the assistant response:

This is the correct extracted information:

This tiny language model (quantized 0.6B parameters) was able to extract the information required!

Feel free to modify the code and/or prompts and run them in the browser!