Structured code generation for domain-specific languages

Alonso SILVA (joint work with Denes TORNYI)

Introduction

hi 👋, I’m Alonso

Unstructured data → Structured data

Application: Extraction

Unstructured data:

“My name is John Smith and you can contact me at sales@example.com and she is Jane Doe and can be contacted at support@example.com”

→ Structured data:

First name | Last name | Email
John       | Smith     | sales@example.com
Jane       | Doe       | support@example.com

Application: Classification

Unstructured data → Structured data

Email                                                              | Department
I would like to have more information related to the new product. | Sales
Are there any openings at your company?                           | HR
I cannot exit Vim in my computer. Could you help me with that?     | IT

Application: Knowledge Graph Generation

Unstructured data:

“Alice loves Bob but she hates Charles”

→ Structured data:

Subject | Relation | Object
Alice   | loves    | Bob
Alice   | hates    | Charles

Application: Agents / MCP Servers

Unstructured data:

“What’s the temperature in San Francisco now? How about tomorrow?”

→ Structured data:

Tool                    | Tool arguments
get_current_temperature | {'location': 'San Francisco'}
get_temperature_by_date | {'location': 'San Francisco', 'date': '2025-12-03'}

Application: Code generation

Unstructured data:

“Make a program in Python that prints ‘Hello World’”

→ Structured data:

print("Hello World")

Unstructured data → Structured data

  • 3 ways to go from unstructured data to structured data using language models:
    • Modify the weights (SFT, RLVR, etc.)
    • Improve the prompt (Instructor, DSPy, BAML, etc.)
    • Constrain the generation (Outlines, Xgrammar, Guidance, etc.)


The three approaches—modifying weights, improving prompts, and constraining generation—are complementary and should NOT be viewed as mutually exclusive.

Motivational example

La disparition (A Void), Georges Perec's novel written entirely without the letter "e"

GPT-4o fails to follow this instruction

While a 0.5B model can do it

How is this possible?

This needs some explanation:

For the next token, the LLM outputs a probability distribution over the entire vocabulary.

The next token is then sampled from this distribution.

However, some of these rules can be bent. Others can be broken.

https://alonsosilva-nexttokenprediction.hf.space
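
Here is a minimal sketch of how the rules can be bent: before sampling, mask out every token whose text contains the forbidden letter, then renormalize and sample from what remains. The model name and the bare generation loop below are assumptions for illustration (any small causal LM and a proper chat prompt would do); this is not the speaker's exact code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Precompute the ids of tokens whose surface form does not contain "e"
forbidden = "e"
allowed_ids = [
    token_id
    for token, token_id in tokenizer.get_vocab().items()
    if forbidden not in tokenizer.convert_tokens_to_string([token]).lower()
]

prompt = "A short story without that fifth symbol: "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(30):
        logits = model(input_ids).logits[0, -1]      # distribution over the next token
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_ids] = 0.0                      # bend the rules: forbid some tokens
        probs = torch.softmax(logits + mask, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))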

Nobody has problems similar to not being allowed to use a letter.

Really?

One might think that both problems are unrelated.

My goal is to show that they are very similar.

A small (related) detour

Deterministic Finite Automaton (DFA)

A deterministic finite automaton \(\mathcal{A}\) is a tuple \((Q, \Sigma, \delta, q_0, F)\) where:

  • \(Q\) is a finite set of states (in the example, \(Q=\{0,1\}\))
  • \(\Sigma\) is a finite set of input symbols (\(\Sigma=\{a,b\}\))
  • \(\delta:Q\times \Sigma\to Q\) is the transition function
  • \(q_0\in Q\) is the start state (\(0\))
  • \(F\subseteq Q\) is the set of accept states (\(\{1\}\))
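
A minimal sketch of this definition in Python. The transition table below is an assumption, since the slide's diagram is an image; it accepts exactly the strings over \(\{a,b\}\) that end in \(b\).

def accepts(delta, start, accept, s):
    """Run the string s through a DFA given by its transition table."""
    state = start
    for symbol in s:
        state = delta[(state, symbol)]
    return state in accept

# Illustrative transition table over Q = {0, 1} and Sigma = {a, b}
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}

print(accepts(delta, start=0, accept={1}, s="aab"))  # True
print(accepts(delta, start=0, accept={1}, s="aba"))  # False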

Nondeterministic Finite Automaton (NFA)

A nondeterministic finite automaton \(\mathcal{A}\) is a tuple \((Q, \Sigma, \delta, q_0, F)\) where:

  • \(Q\) is a finite set of states
  • \(\Sigma\) is a finite set of input symbols
  • \(\delta:Q\times \Sigma_\varepsilon\to \mathcal{P}(Q)\) is the transition function
  • \(q_0\in Q\) is the start state
  • \(F\subseteq Q\) is the set of accept states

Theorem

Every nondeterministic finite automaton (NFA) has an equivalent deterministic finite automaton (DFA).
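
A minimal sketch of the idea behind the proof, the subset construction, run on the fly: a DFA state is the set of NFA states that could currently be active. The small NFA below is an assumption for illustration, not the slide's example; it accepts strings over \(\{a,b\}\) that end in \(ab\).

nfa_delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
start, accept = 0, {2}

def step(states, symbol):
    """One DFA step: the union of all NFA moves from the current set of states."""
    return frozenset(s2 for s in states for s2 in nfa_delta.get((s, symbol), set()))

def dfa_accepts(s):
    current = frozenset({start})   # a DFA state is a set of NFA states
    for symbol in s:
        current = step(current, symbol)
    return bool(current & accept)

print(dfa_accepts("aab"))  # True
print(dfa_accepts("aba"))  # False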

Regular Expressions (regex)

We need to talk about Regular Expressions (regex)

Let \(A\) and \(B\) be languages. We define the regular operations union, concatenation, and star as follows:

  • Union: \(A\cup B:=\{x: x\in A\text{ or }x\in B\}\)
  • Concatenation: \(A\circ B:=\{xy: x\in A, y\in B\}\)
  • Star: \(A^*:=\{x_1x_2\cdots x_k: k\ge0, x_i\in A\}\)
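
A minimal sketch of these three operations on tiny finite languages, using Python sets of strings (the star is truncated at \(k=2\) so the result stays finite):

A = {"a", "ab"}
B = {"b"}

union = A | B                                             # A ∪ B
concatenation = {x + y for x in A for y in B}             # A ∘ B
star_up_to_2 = {""} | A | {x + y for x in A for y in A}   # A*, truncated at k = 2

print(union)          # {'a', 'ab', 'b'}
print(concatenation)  # {'ab', 'abb'}
print(star_up_to_2)   # {'', 'a', 'ab', 'aa', 'aab', 'aba', 'abab'}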

Regular Expressions

We say that \(R\) is a regular expression if \(R\) is

  • \(a\) for some \(a\) in the alphabet \(\Sigma\),
  • \(\varepsilon\),
  • \(\emptyset\),
  • \((R_1\cup R_2)\), where \(R_1\) and \(R_2\) are regular expressions,
  • \((R_1\circ R_2)\), where \(R_1\) and \(R_2\) are regular expressions, or
  • \((R_1)^*\), where \(R_1\) is a regular expression

Theorem

Any regular expression can be converted into a deterministic finite automaton that recognizes the language it describes, and vice versa.
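
For instance, the regular expression (a|b)*b describes the same language as the two-state DFA sketched earlier (strings over \(\{a,b\}\) ending in \(b\)); Python's re module checks membership on the regex side:

import re

pattern = re.compile(r"(a|b)*b")
for s in ["aab", "aba", "b", ""]:
    print(repr(s), bool(pattern.fullmatch(s)))
# 'aab' True, 'aba' False, 'b' True, '' False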

Regular Expression (regex)

Deterministic Finite Automata (DFA)

Fast, High-Fidelity LLM Decoding with Regex Constraints by Vivien Tran-Thien


From a regular expression (regex) to a deterministic finite automaton (DFA)

For example, for sentiment analysis:

Life is hard

What’s in a “label”?

From character-level DFA to token-level DFA

Deterministic finite transducer

https://hf.co/datasets/alonsosilva/litelines-notebooks
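
A minimal sketch of the lift from a character-level DFA to a token-level one, using a character-level DFA for the regex yes|no and a toy vocabulary (both are assumptions for illustration; in practice the tokenizer's real vocabulary is used). For each DFA state we precompute which tokens can be consumed without leaving the automaton; that precomputed table is what the mask reads at generation time.

# Character-level DFA for the regex "yes|no"; states 4 and 5 are accepting
delta = {(0, "y"): 1, (0, "n"): 2, (1, "e"): 3, (3, "s"): 4, (2, "o"): 5}
states = {0, 1, 2, 3, 4, 5}

toy_vocab = ["y", "n", "ye", "yes", "es", "e", "s", "o", "no", "x"]

def run(state, token):
    """Feed a whole token, character by character; None if the DFA gets stuck."""
    for ch in token:
        if (state, ch) not in delta:
            return None
        state = delta[(state, ch)]
    return state

# allowed[state] = tokens the model may emit when the generation is in that state
allowed = {s: [t for t in toy_vocab if run(s, t) is not None] for s in states}
print(allowed[0])  # ['y', 'n', 'ye', 'yes', 'no']
print(allowed[1])  # ['es', 'e']
print(allowed[3])  # ['s']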

Pumping lemma

If \(A\) is a regular language, then there is a number \(p\) (the pumping length) where if \(s\) is any string in \(A\) of length at least \(p\), then \(s\) may be divided into three pieces, \(s=xyz\), satisfying the following conditions:

  • for each \(i\ge0, xy^iz\in A\),
  • \(\lvert y\rvert>0\),
  • \(\lvert xy\rvert\le p.\)

Let \(B\) be the language \(\{0^n1^n: n\ge0\}\).

\(B\) is not regular: taking \(s=0^p1^p\), the condition \(\lvert xy\rvert\le p\) forces \(y\) to consist only of \(0\)s, so pumping \(y\) changes the number of \(0\)s without changing the number of \(1\)s, producing strings outside \(B\).

Structured code generation

The lifecycle of a Python program

Abstract Syntax Tree (AST)

Consider the following source code:

print(
    "Hello", "World"
)
Hello World

Abstract Syntax Tree (AST)

source = bytes(
    """
print(
    "Hello", "World"
)
""",
    "utf8",
)

Abstract Syntax Tree

import ast

node = ast.parse(source)
print(ast.dump(node, indent=4))
Module(
    body=[
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Constant(value='Hello'),
                    Constant(value='World')],
                keywords=[]))],
    type_ignores=[])

Abstract Syntax Tree

ast.unparse(node)
"print('Hello', 'World')"

Concrete Syntax Tree

module(
  expression_statement(
    call(
      identifier: 'print'
      argument_list(
        (: '('
        string(
          string_start: '"'
          string_content: 'Hello'
          string_end: '"'
        )
        ,: ','
        string(
          string_start: '"'
          string_content: 'World'
          string_end: '"'
        )
        ): ')'
      )
    )
  )
)

Concrete Syntax Tree

Unlike the AST, the concrete syntax tree keeps every token and the original formatting, so the exact source text can be recovered:

'print(\n    "Hello", "World"\n)\n'

Parser

The parser attempts to:

  • determine whether the lexer's output (a sequence of tokens) is valid in the grammar
  • if it is valid, determine how it can be derived
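
As a minimal sketch with Python's own parser: a program with an unbalanced parenthesis has no derivation in the grammar and is rejected, while the fixed version parses.

import ast

try:
    ast.parse('print("Hello", "World"')    # missing ")": no valid derivation
except SyntaxError as e:
    print("rejected by the parser:", e.msg)

tree = ast.parse('print("Hello", "World")')  # valid: a derivation exists
print(type(tree).__name__)                   # Module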

Context-Free Grammars (CFGs)

Remember our previous example

Let \(B\) be the language \(\{0^n1^n: n\ge0\}\).

\(B\) is not regular.

Context-Free Grammars

The following is an example of a context-free grammar \(G\)

\(A\to 0A1\)

\(A\to B\)

\(B\to \#\)

The grammar \(G\) can generate the string \(000\#111\).

\(A\) \(\implies\) \(0A1\) \(\implies\) \(00A11\) \(\implies\) \(000A111\) \(\implies\) \(000B111\) \(\implies\) \(000\#111\)
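
A minimal sketch of the same grammar made executable with the Lark parsing library (an assumption: Lark is not part of the slide; rule names are lowercased because Lark requires it):

from lark import Lark

grammar = r"""
    a: "0" a "1" | b
    b: "#"
"""
parser = Lark(grammar, start="a")
print(parser.parse("000#111").pretty())  # a derivation exists, so parsing succeeds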

Context-Free Grammars

A context-free grammar is a \(4\)-tuple \((V,\Sigma,R,S)\) where

  • \(V\) is a finite set of variables
  • \(\Sigma\) is a finite set, disjoint from \(V\), called the terminals
  • \(R\) is a finite set of rules, with each rule being a variable and a string of variables and terminals, and
  • \(S\in V\) is the start variable

Pushdown Automaton (PDA)

A pushdown automaton is a \(6\)-tuple \((Q,\Sigma,\Gamma,\delta,q_0,F)\) where \(Q\), \(\Sigma\), \(\Gamma\) and \(F\) are finite sets, and

  • \(Q\) is the set of states
  • \(\Sigma\) is the set of input symbols
  • \(\Gamma\) is the stack alphabet
  • \(\delta:Q\times\Sigma_\varepsilon\times\Gamma_\varepsilon\to\mathcal{P}(Q\times\Gamma_\varepsilon)\) is the transition function
  • \(q_0\in Q\) is the start state, and
  • \(F\subseteq Q\) is the set of accept states

Pushdown Automaton (PDA)

We can also use a state diagram to describe a PDA. The following is a state diagram for the PDA that recognizes \(\{0^n1^n: n\ge0\}\).
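
A minimal sketch of the idea behind that PDA, written directly with an explicit stack (the slide's state diagram is an image, so this is an illustration rather than a transcription): push a marker for every 0, pop one for every 1, and accept when the input is consumed and the stack is empty.

def recognizes_0n1n(s: str) -> bool:
    stack = []
    i = 0
    while i < len(s) and s[i] == "0":   # push a marker for each leading 0
        stack.append("0")
        i += 1
    while i < len(s) and s[i] == "1":   # pop one marker for each 1
        if not stack:
            return False
        stack.pop()
        i += 1
    return i == len(s) and not stack    # all input read, stack empty

for s in ["", "01", "0011", "001", "0101"]:
    print(repr(s), recognizes_0n1n(s))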

Theorem

A language is context free if and only if some pushdown automaton recognizes it.

Conclusions and Perspectives

Conclusions

  • The three approaches to go from unstructured data to structured data—modifying weights, improving prompts, and constraining generation—are complementary and should NOT be viewed as mutually exclusive.
  • Several problems in constrained generation (producing valid JSON, valid tool calls) can be mapped to a regular expression, or equivalently to a deterministic finite automaton (DFA).
  • Other problems cannot be captured by a DFA but can be mapped to a context-free grammar, or equivalently to a pushdown automaton (PDA).

Perspectives

  • Add semantic knowledge to the masking.