Boosting retrieval-augmented generation with knowledge graphs

Alonso Silva (Github: alonsosilvaallende)

Hi 👋, I’m Alonso

Introduction

Data

import polars as pl
import string

df = pl.read_csv(
    "https://drive.google.com/uc?export=download&id=1uD3h7xYxr9EoZ0Ggoh99JtQXa3AxtxyU"
)
df = df.with_columns(
    pl.Series("Album", [string.capwords(album) for album in df["Album"]])
)
df = df.with_columns(
    pl.Series("Song", [string.capwords(song) for song in df["Song"]])
)
df = df.with_columns(pl.col("Lyrics").fill_null("None"))
df.head()
shape: (5, 4)
Song                    Lyrics                            Album              Artist
str                     str                               str                str
"Don't Tread On Me"     "Liberty or death what we so pr…  "The Black Album"  "Metallica"
"Nothing Else Matters"  "So close no matter how far cou…  "The Black Album"  "Metallica"
"Through The Never"     "All that is was and will be un…  "The Black Album"  "Metallica"
"My Friend Of Misery"   "You just stood there screaming…  "The Black Album"  "Metallica"
"Wherever I May Roam"   "...and the road becomes my bri…  "The Black Album"  "Metallica"

Potential kinds of queries

queries = [
    "Which song is about a boy who is having nightmares?",
    """Which song is about a guy who is so badly wounded in war 
    so he no longer has any senses?""",
    "How many songs does the reload album have?",
    "Which songs does the black album have?"
]

Vector-only retrieval

Vector database: https://lancedb.com/

Vector-only retrieval

df = df.with_columns(
    text=pl.lit("# ")
    + pl.col("Album")
    + pl.lit(": ")
    + pl.col("Song")
    + pl.lit("\n\n")
    + pl.col("Lyrics")
)

df.head()
shape: (5, 5)
Song                    Lyrics                            Album              Artist       text
str                     str                               str                str          str
"Don't Tread On Me"     "Liberty or death what we so pr…  "The Black Album"  "Metallica"  "# The Black Album: Don't Tread…
"Nothing Else Matters"  "So close no matter how far cou…  "The Black Album"  "Metallica"  "# The Black Album: Nothing Els…
"Through The Never"     "All that is was and will be un…  "The Black Album"  "Metallica"  "# The Black Album: Through The…
"My Friend Of Misery"   "You just stood there screaming…  "The Black Album"  "Metallica"  "# The Black Album: My Friend O…
"Wherever I May Roam"   "...and the road becomes my bri…  "The Black Album"  "Metallica"  "# The Black Album: Wherever I …

Vector-only retrieval

print(df.select("text")[1].item()[:300])
print("...")
# The Black Album: Nothing Else Matters

So close no matter how far
couldn't be much more from the heart
forever trusting who we are
and nothing else matters
never opened myself this way
life is ours
we live it our way
all these words I don't just say
and nothing else matters
trust I seek and I find
...

Vector-only retrieval

# Initialize vector database
import shutil
import lancedb

shutil.rmtree("lancedb_explorer", ignore_errors=True)
db = lancedb.connect("lancedb_explorer")

Vector-only retrieval

from lancedb.embeddings import get_registry

embeddings = (
    get_registry()
    .get("sentence-transformers")
    .create(name="TaylorAI/gte-tiny", device="cuda")
)

Vector-only retrieval

from lancedb.pydantic import LanceModel, Vector

class Songs(LanceModel):
    Song: str
    Lyrics: str
    Album: str
    Artist: str
    text: str = embeddings.SourceField()
    vector: Vector(embeddings.ndims()) = embeddings.VectorField()

Vector-only retrieval

# Add dataframe to the vector database
table = db.create_table("Songs", schema=Songs)
table.add(data=df)

Vector-only retrieval

query = "Which song is about a boy who is having nightmares?"
results = table.search(query).limit(10).to_polars()
results.select(["Song","Album", "Lyrics", "_distance", "vector"])
shape: (10, 5)
Song                            Album                Lyrics                            _distance  vector
str                             str                  str                               f32        array[f32, 384]
"Enter Sandman"                 "The Black Album"    "Say your prayers little one do…  0.228461   [-0.019242, 0.004355, … 0.019346]
"The Thing That Should Not Be"  "S&m"                "Messenger of fear in sight Dar…  0.229785   [-0.0343, -0.004135, … 0.017745]
"Hero Of The Day"               "S&m"                "Mama they try and break me The…  0.233728   [-0.06035, 0.001112, … 0.052594]
"The Thing That Should Not Be"  "Master Of Puppets"  "Messenger of fear in sight Dar…  0.241206   [-0.039418, 0.001882, … 0.016381]
"Hero Of The Day"               "Load"               "Mama they try and break me The…  0.245303   [-0.07442, -0.000922, … 0.047706]
"Invisible Kid"                 "St. Anger"          "Invisible Kid Never see what h…  0.2473     [-0.033447, -0.034929, … 0.047731]
"Enter Sandman"                 "S&m"                "Let loose man! Say your prayer…  0.250452   [-0.024093, 0.032401, … 0.021124]
"Sad But True"                  "S&m"                "Hey, hey, hey, hey, hey, hey, …  0.250919   [-0.046668, -0.00912, … 0.024796]
"My Friend Of Misery"           "The Black Album"    "You just stood there screaming…  0.253463   [-0.0754, -0.001582, … 0.026701]
"One"                           "S&m"                "I can't remember anything Can'…  0.253472   [-0.055845, -0.000173, … 0.016305]

Vector-only retrieval

query = "Which song is about a boy who is having nightmares?"
results = table.search(query).limit(10).to_polars()
results["Song"].to_list()
['Enter Sandman',
 'The Thing That Should Not Be',
 'Hero Of The Day',
 'The Thing That Should Not Be',
 'Hero Of The Day',
 'Invisible Kid',
 'Enter Sandman',
 'Sad But True',
 'My Friend Of Misery',
 'One']
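
The `_distance` column in the results ranks each row by how far its stored vector is from the embedded query. As a rough sketch of what a nearest-neighbor search does, assuming a squared Euclidean distance and made-up two-dimensional vectors (the names `squared_l2`, `brute_force_search` and the toy rows are illustrative and not part of LanceDB, whose actual metric and index depend on its configuration):

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vector, rows, limit=10):
    """Return the `limit` rows closest to the query as (distance, name) pairs."""
    scored = [(squared_l2(query_vector, vector), name) for name, vector in rows]
    return sorted(scored)[:limit]


# Toy 2-dimensional "embeddings" (hypothetical values for illustration only)
toy_rows = [
    ("song A", [1.0, 0.0]),
    ("song B", [0.0, 1.0]),
    ("song C", [0.9, 0.1]),
]
toy_results = brute_force_search([1.0, 0.0], toy_rows, limit=2)
nearest_names = [name for _, name in toy_results]
print(nearest_names)
```

A real vector database would typically avoid scanning every row, but the ranking idea is the same: smaller distance means a closer match.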

Vector-only retrieval

query = """Which song is about a guy who is so badly wounded in war
so he no longer has any senses?"""
results = table.search(query).limit(10).to_polars()
results["Song"].to_list()
['Phantom Lord',
 'One',
 'No Remorse',
 'One',
 'My Friend Of Misery',
 "Low Man's Lyric",
 'The Unforgiven',
 'Sad But True',
 "Don't Tread On Me",
 '2x4']
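
The same title can show up more than once because the dataset holds a song once per album it appears on. If only distinct titles are wanted, one possible post-processing step is to drop repeats while keeping the ranking order. A minimal sketch (the helper name `unique_in_order` is illustrative):

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict preserves insertion order, so the ranking is kept
    return list(dict.fromkeys(items))


ranked_songs = [
    "Phantom Lord", "One", "No Remorse", "One", "My Friend Of Misery",
    "Low Man's Lyric", "The Unforgiven", "Sad But True",
    "Don't Tread On Me", "2x4",
]
distinct_songs = unique_in_order(ranked_songs)
print(distinct_songs)
```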

Vector-only retrieval

https://alonsosilva-song-finder.hf.space

Retrieval Augmented Generation (RAG)

import llama_cpp

# Hermes 3 by NousResearch
model_id = "NousResearch/Hermes-3-Llama-3.1-8B"
# llama-cpp-python: Python bindings for llama.cpp
llm = llama_cpp.Llama(
    "/big_storage/llms/models/Hermes-3-Llama-3.1-8B.Q6_K.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        model_id
    ),
    n_gpu_layers=-1,
    flash_attn=True,
    n_ctx=8192,
    verbose=False,
    chat_format="chatml-function-calling"
)

Retrieval Augmented Generation (RAG)

def get_relevant_texts(query, limit=5):
    results = (
        table.search(query)
             .limit(limit)
             .to_polars()
    )
    # Iterate over the rows actually returned, which may be fewer than `limit`
    return " ".join(text + "\n\n---\n\n" for text in results["text"])

print(get_relevant_texts(query))
# Kill 'em All: Phantom Lord

Sound is ripping through your ears
The deafening sound of metal nears
Your bodies waiting for his whips
The taste of leather on your lips
Hear the cry of war
Louder than before
With his sword in hand
To control the land
Crushing metal strikes
On this frightening night
Fall onto your knees
For the phantom lord
Victims falling under chains
You hear them crying dying pains
The fist of terrors breaking through
Now there's nothing you can do
Hear the cry of war
Louder than before
With his sword in hand
To control the land
Crushing metal strikes
On this frightening night
Fall onto your knees
For the phantom lord
The leather armies have prevailed
The phantom lord has never failed
Smoke is lifting from the ground
The rising volume metal sound
Hear the cry of war
Louder than before
With his sword in hand
To control the land
Crushing metal strikes
On this frightening night
Fall onto your knees
For the phantom lord
Fall to your knees
And bow to the phantom lord

---

 # And Justice For All..: One

I Can't Remember Anything
Can't Tell If this Is True or Dream
Deep down Inside I Feel to Scream
this Terrible Silence Stops Me
Now That the War Is Through with Me
I'm Waking up I Can Not See
That There Is Not Much Left of Me
Nothing Is Real but Pain Now
Hold My Breath as I Wish for Death
Oh Please God,wake Me
Back in the Womb its Much Too Real
in Pumps Life That I must Feel
but Can't Look Forward to Reveal
Look to the Time When I'll Live
Fed Through the Tube That Sticks in Me
Just like a Wartime Novelty
Tied to Machines That Make Me Be
Cut this Life off from Me
Hold My Breath as I Wish for Death
Oh Please God,wake Me
Now the World Is Gone I'm Just One
Oh God,help Me Hold My Breath as I Wish for Death
Oh Please God Help Me
Darkness
Imprisoning Me
All That I See
Absolute Horror
I Cannot Live
I Cannot Die
Trapped in Myself
Body My Holding Cell
Landmine
Has Taken My Sight
Taken My Speech
Taken My Hearing
Taken My Arms
Taken My Legs
Taken My Soul
Left Me with Life in Hell

---

 # Kill 'em All: No Remorse

No mercy for what we are doing
No thought to even what we have done
We don't need to feel the sorrow
No remorse for the helpless one
War without end
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
Blood feeds the war machine
As it eats its way across the land
We don't need to feel the sorrow
No remorse is the one command
War without end
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
Only the strong survive
No one to save the weaker race
We are ready to kill all comers
Like a loaded gun right at your face
War without end
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
Attack
Bullets are flying
People are dying
With madness surrounding all hell's breaking loose
Soldiers are hounding
Bodies are mounting
Cannons are shouting to take their abuse
With war machines going
Blood starts to flowing
No mercy given to anyone hear
The furious fighting
Swords are like lighting
It all becomes frightening to you
Know death is near
No remorse

---

 # S&m: One

I can't remember anything
Can't tell if this is true or dream
Deep down
inside I feel to scream
This terrible silence stops me
Now that the war is through with me
I'm waking up, I cannot see
That
there's not much left of me
Nothing is real but pain now
Hold my breath as I wish for death
Oh please God, wake me
Back in the womb it's much to real
In pumps life that I must feel
But
can't look forward to reveal
Look to the time when I lived
Fed through the tube that sticks in me
Just like a wartime novelty
Tied
to machines that make me be
Cut this shit out from me
Hold my breath as I wish for death
Oh please God, wake me
Please God
wake me!
Now the world is gone I'm just one
Oh God, help me
Hold my breath as I
wish for death
Oh please God, help me, help me!!
Darkness
Imprisoning me
All that I see
Absolute horror
I cannot
live
I cannot die
Trapped in myself
Body my holding cell
Landmine
Has taken my sight
Taken my speech
Taken my
hearing
Taken my arms
Taken my legs
Taken my soul
Left me with life
in hell 
no, NO, NO NO NO NO NO!!!

---

 # The Black Album: My Friend Of Misery

You just stood there screaming
fearing no one was listening to you
they say the empty can rattles the most
the sound of your voice must soothe you
hearing only what you want to hear
and knowing only what you've heard
you you're smothered in tragedy
you're out to save the world
misery
you insist that the weight of the world
should be on your shoulders
misery
there's much more to life than what you see
my friend of misery
you still stood there screaming
no one caring about these words you tell
my friend before your voice is gone
one man's fun is another's hell
these times are sent to try men's souls
but something's wrong with all you see
you you'll take it on all yourself
remember
misery loves company
misery
you insist that the weight of the world
should be on your shoulders
misery
there's much more to life than what you see
my friend of misery
you just stood there creaming
my friend of misery

---

Retrieval Augmented Generation (RAG)

def build_prompt(query):
    return (
        "Answer the question based only on the following context:\n\n"
        + get_relevant_texts(query)
        + "\n\nQuestion: "
        + query
    )

print(build_prompt(query))
Answer the question based only on the following context:

# Kill 'em All: Phantom Lord

Sound is ripping through your ears
The deafening sound of metal nears
Your bodies waiting for his whips
The taste of leather on your lips
Hear the cry of war
Louder than before
With his sword in hand
To control the land
Crushing metal strikes
On this frightening night
Fall onto your knees
For the phantom lord
Victims falling under chains
You hear them crying dying pains
The fist of terrors breaking through
Now there's nothing you can do
Hear the cry of war
Louder than before
With his sword in hand
To control the land
Crushing metal strikes
On this frightening night
Fall onto your knees
For the phantom lord
The leather armies have prevailed
The phantom lord has never failed
Smoke is lifting from the ground
The rising volume metal sound
Hear the cry of war
Louder than before
With his sword in hand
To control the land
Crushing metal strikes
On this frightening night
Fall onto your knees
For the phantom lord
Fall to your knees
And bow to the phantom lord

---

 # And Justice For All..: One

I Can't Remember Anything
Can't Tell If this Is True or Dream
Deep down Inside I Feel to Scream
this Terrible Silence Stops Me
Now That the War Is Through with Me
I'm Waking up I Can Not See
That There Is Not Much Left of Me
Nothing Is Real but Pain Now
Hold My Breath as I Wish for Death
Oh Please God,wake Me
Back in the Womb its Much Too Real
in Pumps Life That I must Feel
but Can't Look Forward to Reveal
Look to the Time When I'll Live
Fed Through the Tube That Sticks in Me
Just like a Wartime Novelty
Tied to Machines That Make Me Be
Cut this Life off from Me
Hold My Breath as I Wish for Death
Oh Please God,wake Me
Now the World Is Gone I'm Just One
Oh God,help Me Hold My Breath as I Wish for Death
Oh Please God Help Me
Darkness
Imprisoning Me
All That I See
Absolute Horror
I Cannot Live
I Cannot Die
Trapped in Myself
Body My Holding Cell
Landmine
Has Taken My Sight
Taken My Speech
Taken My Hearing
Taken My Arms
Taken My Legs
Taken My Soul
Left Me with Life in Hell

---

 # Kill 'em All: No Remorse

No mercy for what we are doing
No thought to even what we have done
We don't need to feel the sorrow
No remorse for the helpless one
War without end
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
Blood feeds the war machine
As it eats its way across the land
We don't need to feel the sorrow
No remorse is the one command
War without end
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
Only the strong survive
No one to save the weaker race
We are ready to kill all comers
Like a loaded gun right at your face
War without end
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
No remorse no repent
We don't care what it meant
Another day another death
Another sorrow another breath
Attack
Bullets are flying
People are dying
With madness surrounding all hell's breaking loose
Soldiers are hounding
Bodies are mounting
Cannons are shouting to take their abuse
With war machines going
Blood starts to flowing
No mercy given to anyone hear
The furious fighting
Swords are like lighting
It all becomes frightening to you
Know death is near
No remorse

---

 # S&m: One

I can't remember anything
Can't tell if this is true or dream
Deep down
inside I feel to scream
This terrible silence stops me
Now that the war is through with me
I'm waking up, I cannot see
That
there's not much left of me
Nothing is real but pain now
Hold my breath as I wish for death
Oh please God, wake me
Back in the womb it's much to real
In pumps life that I must feel
But
can't look forward to reveal
Look to the time when I lived
Fed through the tube that sticks in me
Just like a wartime novelty
Tied
to machines that make me be
Cut this shit out from me
Hold my breath as I wish for death
Oh please God, wake me
Please God
wake me!
Now the world is gone I'm just one
Oh God, help me
Hold my breath as I
wish for death
Oh please God, help me, help me!!
Darkness
Imprisoning me
All that I see
Absolute horror
I cannot
live
I cannot die
Trapped in myself
Body my holding cell
Landmine
Has taken my sight
Taken my speech
Taken my
hearing
Taken my arms
Taken my legs
Taken my soul
Left me with life
in hell 
no, NO, NO NO NO NO NO!!!

---

 # The Black Album: My Friend Of Misery

You just stood there screaming
fearing no one was listening to you
they say the empty can rattles the most
the sound of your voice must soothe you
hearing only what you want to hear
and knowing only what you've heard
you you're smothered in tragedy
you're out to save the world
misery
you insist that the weight of the world
should be on your shoulders
misery
there's much more to life than what you see
my friend of misery
you still stood there screaming
no one caring about these words you tell
my friend before your voice is gone
one man's fun is another's hell
these times are sent to try men's souls
but something's wrong with all you see
you you'll take it on all yourself
remember
misery loves company
misery
you insist that the weight of the world
should be on your shoulders
misery
there's much more to life than what you see
my friend of misery
you just stood there creaming
my friend of misery

---



Question: Which song is about a guy who is so badly wounded in war
so he no longer has any senses?

Retrieval Augmented Generation (RAG)

def generate_response(query):
    prompt = build_prompt(query)
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}], 
        temperature=0,
        seed=42
    )
    return response["choices"][0]["message"]["content"]

Retrieval Augmented Generation (RAG)

query = """Which song is about a guy who is so badly wounded in war
so he no longer has any senses?"""
print(generate_response(query))
Based on the context provided, the song that is about a guy who is so badly wounded in war that he no longer has any senses is "And Justice For All..: One". The lyrics mention losing sight, speech, hearing, arms, legs, and soul, being trapped in his body, and being left with life in hell.

Retrieval Augmented Generation (RAG)

query = "Which song is about a boy who is having nightmares?"
print(generate_response(query))
The song that is about a boy who is having nightmares is "The Black Album: Enter Sandman". The lyrics mention "Say your prayers little one / don't forget / my son / to include everyone / tuck you in / warm within / keep you free from sin / till the sandman he comes", suggesting a child being comforted before going to sleep and having nightmares. The lyrics also mention "dreams of war / dreams of liars / dreams of dragon's fire / and of things that will bite", which further supports the theme of nightmares.

Retrieval Augmented Generation (RAG)

https://alonsosilva-song-finder-bot.hf.space

Retrieval Augmented Generation (RAG)

query = "How many songs does the black album have?"
print(generate_response(query))
Based on the provided context, the Black Album appears to have 5 songs. The songs are:

1. My Friend Of Misery
2. Nothing Else Matters 
3. Don't Tread On Me
4. Sad But True
5. Holier Than Thou

Retrieval Augmented Generation (RAG)

df.filter(pl.col("Album") == "The Black Album")["Song"].to_list()
["Don't Tread On Me",
 'Nothing Else Matters',
 'Through The Never',
 'My Friend Of Misery',
 'Wherever I May Roam',
 'The Unforgiven',
 'The Struggle Within',
 'Of Wolf And Man',
 'The God That Failed',
 'Enter Sandman',
 'Sad But True',
 'Holier Than Thou']
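
The mismatch follows from the retrieval step: `get_relevant_texts` passes only five texts to the model, so the model cannot see more than five songs, whatever the album contains. Comparing the model's list with the list from the dataframe makes the gap explicit (a small sketch; the variable names are illustrative):

```python
# Songs listed in the model's answer
llm_listed = [
    "My Friend Of Misery", "Nothing Else Matters", "Don't Tread On Me",
    "Sad But True", "Holier Than Thou",
]
# Songs returned by filtering the dataframe on the album name
black_album = [
    "Don't Tread On Me", "Nothing Else Matters", "Through The Never",
    "My Friend Of Misery", "Wherever I May Roam", "The Unforgiven",
    "The Struggle Within", "Of Wolf And Man", "The God That Failed",
    "Enter Sandman", "Sad But True", "Holier Than Thou",
]
# Songs the model never had a chance to mention
missing = [song for song in black_album if song not in llm_listed]
print(len(llm_listed), len(black_album), len(missing))
```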

Graph-only retrieval

Graph database: https://kuzudb.com/

Graph-only retrieval

# Initialize graph database
import shutil
import kuzu

shutil.rmtree("kuzudb_explorer", ignore_errors=True)
db = kuzu.Database("kuzudb_explorer")
conn = kuzu.Connection(db)

Graph-only retrieval

# Create nodes schema
conn.execute("CREATE NODE TABLE ARTIST(name STRING, PRIMARY KEY (name))")
conn.execute("CREATE NODE TABLE ALBUM(name STRING, PRIMARY KEY (name))")
conn.execute(
    "CREATE NODE TABLE SONG(ID SERIAL, name STRING, lyrics STRING, PRIMARY KEY (ID))"
)
# Create edges schema
conn.execute("CREATE REL TABLE IN_ALBUM(FROM SONG TO ALBUM)")
conn.execute("CREATE REL TABLE FROM_ARTIST(FROM ALBUM TO ARTIST)");

Graph-only retrieval

# Insert nodes
for artist in df["Artist"].unique():
    conn.execute(f"CREATE (artist:ARTIST {{name: '{artist}'}})")

for album in df["Album"].unique():
    conn.execute(f"""CREATE (album:ALBUM {{name: "{album}"}})""")

for song, lyrics in df.select(["Song", "text"]).unique().rows():
    replaced_lyrics = lyrics.replace('"', "'")
    conn.execute(
        f"""CREATE (song:SONG {{name: "{song}", lyrics: "{replaced_lyrics}"}})"""
    )

Graph-only retrieval

# Insert edges
for song, album, lyrics in df.select(["Song", "Album", "text"]).rows():
    replaced_lyrics = lyrics.replace('"', "'")
    conn.execute(
        f"""
        MATCH (song:SONG), (album:ALBUM) 
        WHERE song.name = "{song}" AND song.lyrics = "{replaced_lyrics}" AND album.name = "{album}"
        CREATE (song)-[:IN_ALBUM]->(album)
        """
    )
for album, artist in df.select(["Album", "Artist"]).unique().rows():
    conn.execute(
        f"""
        MATCH (album:ALBUM), (artist:ARTIST)
        WHERE album.name = "{album}" AND artist.name = "{artist}"
        CREATE (album)-[:FROM_ARTIST]->(artist)
        """
    )

Graph-only retrieval

response = conn.execute(
    """
    MATCH (a:ALBUM {name: 'The Black Album'})<-[:IN_ALBUM]-(s:SONG) 
    RETURN s.name
    """
)

df_response = response.get_as_pl()

df_response["s.name"].to_list()
["Don't Tread On Me",
 'The Unforgiven',
 'The God That Failed',
 'My Friend Of Misery',
 'Enter Sandman',
 'Sad But True',
 'Holier Than Thou',
 'Nothing Else Matters',
 'Of Wolf And Man',
 'Wherever I May Roam',
 'The Struggle Within',
 'Through The Never']

Graph-only retrieval

from langchain_community.graphs import KuzuGraph

graph = KuzuGraph(db)
print(graph.get_schema)
Node properties: [{'properties': [('name', 'STRING')], 'label': 'ARTIST'}, {'properties': [('name', 'STRING')], 'label': 'ALBUM'}, {'properties': [('ID', 'SERIAL'), ('name', 'STRING'), ('lyrics', 'STRING')], 'label': 'SONG'}]
Relationships properties: [{'properties': [], 'label': 'FROM_ARTIST'}, {'properties': [], 'label': 'IN_ALBUM'}]
Relationships: ['(:ALBUM)-[:FROM_ARTIST]->(:ARTIST)', '(:SONG)-[:IN_ALBUM]->(:ALBUM)']

Graph-only retrieval

def generate_kuzu_prompt(user_query):
    return """Task: Generate Kùzu Cypher statement to query a graph database.

Instructions:
Generate the Kùzu dialect of Cypher with the following rules in mind:
1. Do not omit the relationship pattern. Always use `()-[]->()` instead of `()->()`.
2. Do not include triple backticks ``` in your response. Return only Cypher.
3. Do not return any notes or comments in your response.


Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:\n""" + graph.get_schema + """\nExample:
The question is:\n"Which songs does the load album have?"
MATCH (a:ALBUM {name: 'Load'})<-[:IN_ALBUM]-(s:SONG) RETURN s.name

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:\n""" + user_query

Graph-only retrieval

def generate_final_prompt(query,cypher_query,col_name,_values):
    return f"""You are an assistant that helps to form nice and human understandable answers.
The information part contains the provided information that you must use to construct an answer.
The provided information is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
Make the answer sound as a response to the question. Do not mention that you based the result on the given information.
Here is an example:

Question: Which managers own Neo4j stocks?
Context:[manager:CTL LLC, manager:JANE STREET GROUP LLC]
Helpful Answer: CTL LLC, JANE STREET GROUP LLC owns Neo4j stocks.

Follow this example when generating answers.
If the provided information is empty, say that you don't know the answer.
Query:\n{cypher_query}
Information:
[{col_name}: {_values}]

Question: {query}
Helpful Answer:
"""

Graph-only retrieval

def generate_kg_response(query):
    prompt = generate_kuzu_prompt(query)
    cypher_query_response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}], 
        temperature=0,
        seed=42
    )
    cypher_query = cypher_query_response["choices"][0]["message"]["content"]
    response = conn.execute(cypher_query)
    # Use a name that cannot be confused with the global songs dataframe `df`
    result_df = response.get_as_pl()
    col_name = result_df.columns[0]
    _values = result_df[col_name].to_list()
    final_prompt = generate_final_prompt(query,cypher_query,col_name,_values)
    final_response = llm.create_chat_completion(
        messages=[{"role": "user", "content": final_prompt}], 
        temperature=0,
        seed=42
    )
    return final_response["choices"][0]["message"]["content"]
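
The prompt asks the model not to wrap its answer in triple backticks, but nothing enforces that. One defensive option is to clean the generated text before passing it to `conn.execute`. A minimal sketch, assuming the only wrapping to remove is a Markdown code fence with an optional language tag (the helper name `clean_cypher` is hypothetical):

```python
import re

FENCE = "`" * 3  # three backticks


def clean_cypher(raw_response):
    """Strip surrounding whitespace and an optional Markdown code fence."""
    text = raw_response.strip()
    # Drop an opening fence, together with a language tag line if there is one
    text = re.sub(r"^" + re.escape(FENCE) + r"(?:[A-Za-z]+\n)?", "", text)
    # Drop a closing fence at the very end
    text = re.sub(re.escape(FENCE) + r"$", "", text)
    return text.strip()


fenced = FENCE + "cypher\nMATCH (s:SONG) RETURN s.name\n" + FENCE
print(clean_cypher(fenced))
print(clean_cypher("  MATCH (s:SONG) RETURN s.name  "))
```

Text that is already a bare Cypher statement passes through unchanged apart from the outer whitespace.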

Graph-only retrieval

query = "How many songs does the black album have?"
print(generate_kg_response(query))
The Black Album has 12 songs.

Graph-only retrieval

query = "Which songs does the black album have?"
print(generate_kg_response(query))
The Black Album has the following songs: "Don't Tread On Me", "The Unforgiven", "The God That Failed", "My Friend Of Misery", "Enter Sandman", "Sad But True", "Holier Than Thou", "Nothing Else Matters", "Of Wolf And Man", "Wherever I May Roam", "The Struggle Within", and "Through The Never".

Graph-only retrieval

https://alonsosilva-song-finder-graph.hf.space

Combining graph and vector retrieval

Classification

def generate_hermes_prompt(user_prompt):
    return (
        "<|im_start|>system\n"
        "You are a world class AI model classifier"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        + user_prompt
        + "<|im_end|>"
        + "\n<|im_start|>assistant\n"
    )

Classification

from outlines import generate, models
from outlines.samplers import greedy

def get_classification(query):
    model = models.LlamaCpp(llm)
    generator = generate.choice(model, ["YES", "NO"], sampler=greedy())
    prompt = "Answer only YES or NO. Is the question '" + query + \
             "' related to the content of a song?"
    prompt = generate_hermes_prompt(prompt)
    response = generator(prompt, max_tokens=1024, temperature=0, seed=42)
    return response

Classification

queries = [
    "Which song is about a boy who is having nightmares?",
    "Which song is about a guy who is so badly wounded in war so he no longer has any senses?",
    "How many songs does the black album have?",
    "Which songs does the black album have?"
]
for query in queries:
    print(f"**{query}**")
    print(get_classification(query))
    print()
**Which song is about a boy who is having nightmares?**
YES

**Which song is about a guy who is so badly wounded in war so he no longer has any senses?**
YES

**How many songs does the black album have?**
NO

**Which songs does the black album have?**
NO

Combining graph and vector retrieval

def get_final_response(query):
    query_class = get_classification(query)
    if query_class == 'YES':
        response = generate_response(query)
    else:
        response = generate_kg_response(query)
    return response
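
The routing logic can be tried out without a model or a database by passing the three components in as functions. The sketch below uses stub functions as hypothetical stand-ins for `get_classification`, `generate_response` and `generate_kg_response`; the fallback to vector retrieval when the graph path raises is an added assumption, not something the function above does:

```python
def route_query(query, classify, answer_with_vectors, answer_with_graph):
    """Send content questions to vector RAG and the rest to the graph."""
    if classify(query) == "YES":
        return answer_with_vectors(query)
    try:
        return answer_with_graph(query)
    except Exception:
        # Assumption: if the generated Cypher fails, fall back to vector RAG
        return answer_with_vectors(query)


# Stub components (hypothetical, for illustration only)
def stub_classify(query):
    return "YES" if "about" in query else "NO"


def stub_vectors(query):
    return "vector answer"


def stub_graph(query):
    if "broken" in query:
        raise RuntimeError("invalid Cypher")
    return "graph answer"


for q in [
    "Which song is about a boy who is having nightmares?",
    "How many songs does the black album have?",
    "broken query",
]:
    print(q, "->", route_query(q, stub_classify, stub_vectors, stub_graph))
```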

Combining graph and vector retrieval

queries = [
    "Which song is about a boy who is having nightmares?",
    "Which song is about a guy who is so badly wounded in war so he no longer has any senses",
    "How many songs does the black album have?",
    "Which songs does the black album have?"
]
for query in queries:
    print(f"**{query}**")
    print(get_final_response(query))
    print()
**Which song is about a boy who is having nightmares?**
The song that is about a boy who is having nightmares is "The Black Album: Enter Sandman". The lyrics mention "Say your prayers little one / don't forget / my son / to include everyone / tuck you in / warm within / keep you free from sin / till the sandman he comes", suggesting a child being comforted before going to sleep and having nightmares. The lyrics also mention "dreams of war / dreams of liars / dreams of dragon's fire / and of things that will bite", which further supports the theme of nightmares.

**Which song is about a guy who is so badly wounded in war so he no longer has any senses**
Based on the context provided, the song that is about a guy who is so badly wounded in war that he no longer has any senses is "And Justice For All..: One". The lyrics describe a person who has lost their sight, speech, hearing, arms, legs, and soul due to a landmine explosion, and is trapped in their body, experiencing absolute horror and unable to live or die.

**How many songs does the black album have?**
The Black Album has 12 songs.

**Which songs does the black album have?**
The Black Album has the following songs: "Don't Tread On Me", "The Unforgiven", "The God That Failed", "My Friend Of Misery", "Enter Sandman", "Sad But True", "Holier Than Thou", "Nothing Else Matters", "Of Wolf And Man", "Wherever I May Roam", "The Struggle Within", and "Through The Never".

Combining graph and vector retrieval

https://alonsosilva-song-finder-graphrag.hf.space

What is Structured Generation?

The problem

scikit-llm: https://skllm.beastbyte.ai/

Show me the prompt!

scikit-llm prompts: https://github.com/iryna-kondr/scikit-llm/blob/main/skllm/prompts/templates.py

What is NOT Structured Generation?

Structured Outputs (in the style of Instructor/Marvin libraries)

Structured Outputs in Code

from typing import Literal

from pydantic import BaseModel, Field


class Sentiment(BaseModel):
    """Correctly inferred `Sentiment` with all the required parameters
    with correct types."""

    label: Literal["Positive", "Negative"] = Field(
        ..., description="Sentiment of the text"
    )

Structured Outputs in Code

from langchain_core.utils.function_calling import convert_to_openai_tool

tools = [convert_to_openai_tool(Sentiment)]
tools
[{'type': 'function',
  'function': {'name': 'Sentiment',
   'description': 'Correctly inferred `Sentiment` with all the required parameters\nwith correct types.',
   'parameters': {'properties': {'label': {'description': 'Sentiment of the text',
      'enum': ['Positive', 'Negative'],
      'type': 'string'}},
    'required': ['label'],
    'type': 'object'}}}]

Structured Outputs in Code

user_input = "You are great"

output = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": (
                "What is the sentiment conveyed in the following text: "
                f"{user_input}."
            ),
        },
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "Sentiment"}},
)

output["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
'{ "label": "Positive"}'
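
The arguments come back as a JSON string, so the caller still has to parse and check them. A minimal sketch using only the standard library (the helper `parse_sentiment` and the set `ALLOWED_LABELS` are illustrative names; with Pydantic available, validating against the `Sentiment` model would be the more natural choice):

```python
import json

ALLOWED_LABELS = {"Positive", "Negative"}


def parse_sentiment(arguments):
    """Parse tool-call arguments and check the label against the allowed values."""
    data = json.loads(arguments)  # raises json.JSONDecodeError on malformed JSON
    label = data.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return label


print(parse_sentiment('{ "label": "Positive"}'))
```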

What is Structured Generation?

https://dottxt-ai.github.io/outlines/

Observation: LLMs are samplers

https://alonsosilva-nexttokenprediction.hf.space
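
To make the observation concrete: at each step a language model produces a probability for every token, and a decoding rule picks one of them. The sketch below contrasts greedy picking with random sampling on a made-up distribution (the probabilities and the helper names are invented for illustration):

```python
import random

# Made-up next-token probabilities for illustration
next_token_probs = {"Positive": 0.6, "Negative": 0.3, "Neutral": 0.1}


def greedy_pick(probs):
    """Always take the most likely token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw a token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(42)
samples = [sample_pick(next_token_probs, rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in next_token_probs}
print(greedy_pick(next_token_probs), counts)
```

Greedy picking always returns the same token, while sampling returns every token with nonzero probability some of the time, including ones that are not wanted.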

We need to talk about Regular Expressions (regex)

Regular Expression (regex)

Deterministic Finite Automata (DFA)

Fast, High-Fidelity LLM Decoding with Regex Constraints by Vivien Tran-Thien

Back to the problem

scikit-llm prompts: https://github.com/iryna-kondr/scikit-llm/blob/main/skllm/prompts/templates.py

From regular expression (regex) to deterministic finite automata (DFA)

For example, for sentiment analysis:

Life is hard

What’s in a “label”?
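
One way to read the question above: a DFA works on characters, but a language model emits tokens, and the same string can usually be spelled by several token sequences. A sketch with a made-up vocabulary (real tokenizers have far larger vocabularies; `tokenizations` and `toy_vocab` are hypothetical):

```python
def tokenizations(text, vocab):
    """Return every way to write `text` as a sequence of tokens from `vocab`."""
    if text == "":
        return [[]]
    results = []
    for token in vocab:
        if token and text.startswith(token):
            for rest in tokenizations(text[len(token):], vocab):
                results.append([token] + rest)
    return results


toy_vocab = ["label", "lab", "el", "la", "bel", "l", "a", "b", "e"]
for sequence in tokenizations("label", toy_vocab):
    print(sequence)
```

Every printed sequence joins back to the same string, so a token-level constraint has to accept all of them, not only the single-token spelling.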

Structured Generation in Code

import outlines

import transformers

outlines_tokenizer = outlines.models.TransformerTokenizer(
    transformers.AutoTokenizer.from_pretrained(model_id)
)

outlines_logits_processor = outlines.processors.JSONLogitsProcessor(
    Sentiment,
    outlines_tokenizer,
    # I like to give the language model some structure but also some leeway.
    whitespace_pattern=r"[\n\t ]*",
)

Structured Generation in Code

user_input = "You are great"

output = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": (
                "What is the sentiment conveyed in the following text: "
                f"{user_input}."
            ),
        },
    ],
    logits_processor=transformers.LogitsProcessorList(
        [outlines_logits_processor]
    ),
)

output["choices"][0]["message"]["content"]
'{ "label": "Positive" }'
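
Conceptually, the JSON schema is compiled into a regular expression that every generated string must match. The pattern below is a hand-written approximation for the `Sentiment` schema, reusing the `whitespace_pattern` shown earlier; the regex that Outlines actually builds may differ in its details.

```python
import re

WHITESPACE = r"[\n\t ]*"  # same pattern as `whitespace_pattern`

# Hand-written approximation of a regex for {"label": "Positive" | "Negative"}
SENTIMENT_REGEX = (
    r"\{" + WHITESPACE
    + r'"label"' + WHITESPACE + ":" + WHITESPACE
    + r'"(Positive|Negative)"' + WHITESPACE
    + r"\}"
)

candidates = [
    '{ "label": "Positive" }',
    '{"label":"Negative"}',
    '{"label": "Neutral"}',
]
for candidate in candidates:
    matches = re.fullmatch(SENTIMENT_REGEX, candidate) is not None
    print(candidate, "->", matches)
```

The first two candidates match and the third does not, since `Neutral` is not one of the allowed labels.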

Structured Outputs + Structured Generation

user_input = "You are great"
output = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": (
                "What is the sentiment conveyed in the following text: "
                f"{user_input}."
            ),
        },
    ],
    tools=tools,  # Structured Outputs
    tool_choice={"type": "function", "function": {"name": "Sentiment"}},
    logits_processor=transformers.LogitsProcessorList(  # Structured Generation
        [outlines_logits_processor]
    ),
)
output["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
'{ "label": "Positive"}'

Knowledge Graph Generation

Consider a text

with open('./assets/curie.txt', "r") as f:
    curie_text = f.read()
print(curie_text)
When Marie Curie came to the United States for the first time, in May 1921, she had already discovered the elements radium and polonium, coined the term “radio-active” and won the Nobel Prize—twice. But the Polish-born scientist, almost pathologically shy and accustomed to spending most of her time in her Paris laboratory, was stunned by the fanfare that greeted her. In 1898 she indeed identified one of the substances and named it polonium, after her homeland. Five months later, she identified a second element, which the world came to know as radium. Curie described the elements she studied as “radio-active.”

In 1894, she met Pierre Curie, a 35-year-old physicist at a French technical college who had been studying crystals and magnetism. More than a decade before, he and his brother Jacques had discovered piezoelectricity, the electric charge produced in solid materials under pressure. Pierre was taken by Marie’s uncommon intellect and drive, and he proposed to her. They were married in 1895 in a civil service attended by family and a few friends.

Both Curies shared the Nobel Prize in physics with Becquerel in 1903. It was the first Nobel to be awarded to a woman. In 1911, rumors spread that Curie was having an affair with the prominent physicist Paul Langevin, a man five years her junior who had been Pierre’s student and had worked closely with Albert Einstein. The front-page coverage of the scandal threatened to overshadow another news story later that year: her second Nobel Prize.

This one, in chemistry, was for the discovery of polonium and radium. In her acceptance speech in Stockholm, she paid tribute to her husband but also made clear that her work was independent from his, spelling out their separate contributions and describing the discoveries she had made after his death.

Chunk the text

texts = curie_text.split("\n\n")
texts
['When Marie Curie came to the United States for the first time, in May 1921, she had already discovered the elements radium and polonium, coined the term “radio-active” and won the Nobel Prize—twice. But the Polish-born scientist, almost pathologically shy and accustomed to spending most of her time in her Paris laboratory, was stunned by the fanfare that greeted her. In 1898 she indeed identified one of the substances and named it polonium, after her homeland. Five months later, she identified a second element, which the world came to know as radium. Curie described the elements she studied as “radio-active.”',
 'In 1894, she met Pierre Curie, a 35-year-old physicist at a French technical college who had been studying crystals and magnetism. More than a decade before, he and his brother Jacques had discovered piezoelectricity, the electric charge produced in solid materials under pressure. Pierre was taken by Marie’s uncommon intellect and drive, and he proposed to her. They were married in 1895 in a civil service attended by family and a few friends.',
 'Both Curies shared the Nobel Prize in physics with Becquerel in 1903. It was the first Nobel to be awarded to a woman. In 1911, rumors spread that Curie was having an affair with the prominent physicist Paul Langevin, a man five years her junior who had been Pierre’s student and had worked closely with Albert Einstein. The front-page coverage of the scandal threatened to overshadow another news story later that year: her second Nobel Prize.',
 'This one, in chemistry, was for the discovery of polonium and radium. In her acceptance speech in Stockholm, she paid tribute to her husband but also made clear that her work was independent from his, spelling out their separate contributions and describing the discoveries she had made after his death.\n']

from pydantic import BaseModel, Field
from typing import Optional

class Node(BaseModel):
    """Node of the Knowledge Graph"""
    id: int = Field(
        ..., description="Unique identifier of the node starting at zero"
    )
    type: str = Field(..., description="Type of the node")
    label: str = Field(..., description="Label of the node")

class Edge(BaseModel):
    """Edge of the Knowledge Graph"""
    source: int = Field(..., description="Unique source of the edge")
    target: int = Field(..., description="Unique target of the edge")
    type: str = Field(..., description="Type of the edge")
    label: str = Field(..., description="Label of the edge")

from typing import List
from langchain_core.utils.function_calling import convert_to_openai_tool

class KnowledgeGraph(BaseModel):
    """Generated Knowledge Graph"""
    nodes: List[Node] = Field(
        ..., description="List of nodes of the knowledge graph")
    edges: List[Edge] = Field(
        ..., description="List of edges of the knowledge graph")

tools = [convert_to_openai_tool(KnowledgeGraph)]

def build_messages(text: str):
    messages = [
        {
            "role": "system",
            "content": (
                "You are a world class AI model who answers questions " 
                "in JSON with correct Pydantic schema. "
                "Here's the json schema you must adhere to:\n<schema>\n"
                f"{KnowledgeGraph.schema_json()}\n</schema>"
            )
        },{
            "role": "user",
            "content": (
                "Describe the following text as a detailed " 
                f"knowledge graph:\n{text}"
            )
        }]
    return messages

def generate_assistant_message(messages):
    outlines_logits_processor = outlines.processors.JSONLogitsProcessor(
        KnowledgeGraph,
        outlines_tokenizer,
        whitespace_pattern=r"[\n\t ]*"
    )
    output = llm.create_chat_completion(
        messages=messages,
        tools=tools,  # Structured Outputs
        tool_choice={"type": "function", 
                     "function": {"name": "KnowledgeGraph"}},
        logits_processor=transformers.LogitsProcessorList(  # Structured Generation
            [outlines_logits_processor]),
        temperature=0,
        seed=42,
        max_tokens=5000,
    )
    return output["choices"][0]["message"]

def generate_kg(text):
    messages = build_messages(text)
    assistant_message = generate_assistant_message(messages)
    return assistant_message["tool_calls"][0]["function"]["arguments"]
    
generate_kg("Alice loves Bob")
'{"nodes": [{"id": 0, "type": "Person", "label": "Alice"}, {"id": 1, "type": "Person", "label": "Bob"}], "edges": [{"source": 0, "target": 1, "type": "LOVES", "label": "loves"}]}'
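
The schema constrains the shape of the JSON, but it cannot guarantee that every edge points at a node that exists, and the later merging step looks nodes up by position (`nodes[i][edge['source']]`). A small sanity check could look like this; `check_graph` is a hypothetical helper, not part of the talk's code:

```python
import json


def check_graph(kg_json: str):
    """Return a list of problems found in a generated knowledge graph."""
    kg = json.loads(kg_json)
    problems = []
    ids = [node["id"] for node in kg["nodes"]]
    # Positional lookup only works if ids are 0, 1, 2, ... in order.
    if ids != list(range(len(ids))):
        problems.append(f"node ids are not 0..{len(ids) - 1} in order: {ids}")
    known = set(ids)
    for edge in kg["edges"]:
        for end in ("source", "target"):
            if edge[end] not in known:
                problems.append(
                    f"edge {edge['type']} has unknown {end} {edge[end]}"
                )
    return problems


example = (
    '{"nodes": [{"id": 0, "type": "Person", "label": "Alice"}, '
    '{"id": 1, "type": "Person", "label": "Bob"}], '
    '"edges": [{"source": 0, "target": 1, "type": "LOVES", "label": "loves"}]}'
)
print(check_graph(example))
```

An empty list means the graph is safe to index by position.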

import json

nodes, edges = [], []
for text in texts:
    kg = json.loads(generate_kg(text))  # parse the JSON string once
    nodes.append(kg["nodes"])
    edges.append(kg["edges"])

Node and Edge Type Candidates

from collections import Counter

print(
    Counter([node["type"].upper() for i in range(len(texts)) 
             for node in nodes[i]])
)
print(
    Counter([edge["type"].upper() for i in range(len(texts)) 
             for edge in edges[i]])
)
Counter({'PERSON': 11, 'CONCEPT': 3, 'ELEMENT': 2, 'COUNTRY': 2, 'AWARD': 2, 'EVENT': 2, 'CHEMICALELEMENT': 2})
Counter({'DISCOVERED': 4, 'STUDIES': 3, 'AWARDED': 2, 'DESCRIBED_AS': 2, 'PARTICIPATES_IN': 2, 'AWARD': 2, 'RECEIVED': 2, 'POSTHUMOUS': 2, 'VISITED': 1, 'ORIGIN': 1, 'NAMED_AFTER': 1, 'MEETS': 1, 'PROPOSES': 1, 'MARRIED_TO': 1, 'RELATIVE': 1, 'DISCOVERS': 1, 'MARRIAGE': 1, 'AFFAIR': 1, 'COLLABORATION': 1, 'MARRIED': 1})
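
The counts show many near-synonyms (`DISCOVERED` / `DISCOVERS`, `MARRIED_TO` / `MARRIAGE` / `MARRIED`). The talk goes on to constrain the types at generation time. A different, purely post-hoc option is to map raw types onto a few canonical ones after the fact. The mapping below is a hypothetical judgment call, shown only as a sketch:

```python
from collections import Counter

# Hypothetical mapping from raw edge types to canonical ones (an assumption
# made for illustration, not something produced by the model).
EDGE_TYPE_MAP = {
    "DISCOVERED": "DISCOVERED",
    "DISCOVERS": "DISCOVERED",
    "AWARDED": "AWARDED",
    "AWARD": "AWARDED",
    "VISITED": "VISITED",
    "MARRIED_TO": "INTERPERSONALRELATIONSHIP",
    "MARRIED": "INTERPERSONALRELATIONSHIP",
    "MARRIAGE": "INTERPERSONALRELATIONSHIP",
    "AFFAIR": "INTERPERSONALRELATIONSHIP",
    "RELATIVE": "INTERPERSONALRELATIONSHIP",
}


def canonical_type(raw_type: str) -> str:
    """Map a raw edge type to its canonical type, defaulting to OTHER."""
    return EDGE_TYPE_MAP.get(raw_type.upper(), "OTHER")


# A subset of the edge-type counts printed above
raw_counts = Counter(
    {"DISCOVERED": 4, "DISCOVERS": 1, "AWARDED": 2, "AWARD": 2,
     "MARRIED_TO": 1, "MARRIAGE": 1, "STUDIES": 3}
)

canonical_counts = Counter()
for raw_type, count in raw_counts.items():
    canonical_counts[canonical_type(raw_type)] += count
print(canonical_counts)
```

Constraining the types during generation avoids maintaining such a mapping by hand.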

from typing import Literal
from pydantic import BaseModel, Field


# Restrict node and edge types to a small fixed set informed by the candidates above.
class Node(BaseModel):
    """Node of the Knowledge Graph"""
    id: int = Field(
        ..., description="Unique identifier of the node starting at zero")
    type: Literal[
        "PERSON", "AWARD", "DISCOVERY", "LOCATION", "OTHER"
    ] = Field(..., description="Type of the node")
    label: str = Field(..., description="Label of the node")


class Edge(BaseModel):
    """Edge of the Knowledge Graph"""
    source: int = Field(..., description="Unique source of the edge")
    target: int = Field(..., description="Unique target of the edge")
    type: Literal[
        "DISCOVERED", "AWARDED", "INTERPERSONALRELATIONSHIP", "VISITED", "OTHER"
    ] = Field(..., description="Type of the edge")
    label: str = Field(..., description="Label of the edge")

# KnowledgeGraph and `tools` must be redefined after this so that they pick up
# the new Node and Edge classes before the graphs are regenerated.

from graphviz import Digraph

dot = Digraph()
i = 0
for node in nodes[i]:
    if node['type'] != "OTHER":
        dot.node(str(node['id']), node['type']+"\n"+node['label'], shape='circle', width='1', height='1')
for edge in edges[i]:
    if edge["type"] != "OTHER": #and nodes[i][edge["source"]]["type"]!= "OTHER" and nodes[i][edge["target"]]["type"]!= "OTHER":
        dot.edge(str(edge['source']), str(edge['target']), label=edge['type']+"\n"+ edge['label'])
print(texts[i])
dot
When Marie Curie came to the United States for the first time, in May 1921, she had already discovered the elements radium and polonium, coined the term “radio-active” and won the Nobel Prize—twice. But the Polish-born scientist, almost pathologically shy and accustomed to spending most of her time in her Paris laboratory, was stunned by the fanfare that greeted her. In 1898 she indeed identified one of the substances and named it polonium, after her homeland. Five months later, she identified a second element, which the world came to know as radium. Curie described the elements she studied as “radio-active.”

from graphviz import Digraph

dot = Digraph()
i = 1
for node in nodes[i]:
    if node['type'] != "OTHER":
        dot.node(str(node['id']), node['type']+"\n"+node['label'], shape='circle', width='1', height='1')
for edge in edges[i]:
    if edge["type"] != "OTHER": #and nodes[i][edge["source"]]["type"]!= "OTHER" and nodes[i][edge["target"]]["type"]!= "OTHER":
        dot.edge(str(edge['source']), str(edge['target']), label=edge['type']+"\n"+ edge['label'])
print(texts[i])
dot
In 1894, she met Pierre Curie, a 35-year-old physicist at a French technical college who had been studying crystals and magnetism. More than a decade before, he and his brother Jacques had discovered piezoelectricity, the electric charge produced in solid materials under pressure. Pierre was taken by Marie’s uncommon intellect and drive, and he proposed to her. They were married in 1895 in a civil service attended by family and a few friends.

from graphviz import Digraph

dot = Digraph()
i = 2
for node in nodes[i]:
    if node['type'] != "OTHER":
        dot.node(str(node['id']), node['type']+"\n"+node['label'], shape='circle', width='1', height='1')
for edge in edges[i]:
    if edge["type"] != "OTHER": #and nodes[i][edge["source"]]["type"]!= "OTHER" and nodes[i][edge["target"]]["type"]!= "OTHER":
        dot.edge(str(edge['source']), str(edge['target']), label=edge['type']+"\n"+ edge['label'])
print(texts[i])
dot
Both Curies shared the Nobel Prize in physics with Becquerel in 1903. It was the first Nobel to be awarded to a woman. In 1911, rumors spread that Curie was having an affair with the prominent physicist Paul Langevin, a man five years her junior who had been Pierre’s student and had worked closely with Albert Einstein. The front-page coverage of the scandal threatened to overshadow another news story later that year: her second Nobel Prize.

from graphviz import Digraph

dot = Digraph()
i = 3
for node in nodes[i]:
    if node['type'] != "OTHER":
        dot.node(str(node['id']), node['type']+"\n"+node['label'], shape='circle', width='1', height='1')
for edge in edges[i]:
    if edge["type"] != "OTHER":
        dot.edge(str(edge['source']), str(edge['target']), label=edge['type']+"\n"+ edge['label'])
print(texts[i])
dot
This one, in chemistry, was for the discovery of polonium and radium. In her acceptance speech in Stockholm, she paid tribute to her husband but also made clear that her work was independent from his, spelling out their separate contributions and describing the discoveries she had made after his death.

import pandas as pd

df_nodes = pd.DataFrame(nodes[0])
df_edges = pd.DataFrame({
    'source': [nodes[0][edge['source']]['label'] for edge in edges[0]],
    'target': [nodes[0][edge['target']]['label'] for edge in edges[0]],
    'type': [edge['type'] for edge in edges[0]],
    'label': [edge['label'] for edge in edges[0]]
})
for i in range(1,len(texts)):
    df_nodes_aux = pd.DataFrame({
        'id': [node['id']+len(df_nodes) for node in nodes[i]],
        'label': [node['label'] for node in nodes[i]],
        'type': [node['type'] for node in nodes[i]]
    })
    df_edges_aux = pd.DataFrame({
        'source': [nodes[i][edge['source']]['label'] for edge in edges[i]],
        'target': [nodes[i][edge['target']]['label'] for edge in edges[i]],
        'type': [edge['type'] for edge in edges[i]],
        'label': [edge['label'] for edge in edges[i]]
    })
    df_nodes = pd.concat([df_nodes, df_nodes_aux])
    df_edges = pd.concat([df_edges, df_edges_aux])

df_nodes=df_nodes[df_nodes["type"]!="OTHER"].drop_duplicates(subset="label")
df_edges=df_edges[df_edges["type"]!="OTHER"]
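
The merge identifies nodes across chunks by label rather than by their per-chunk id. The same logic in plain Python, on made-up data, may make the steps easier to follow; `merge_graphs` is a hypothetical name and the toy chunks are not model output:

```python
def merge_graphs(node_chunks, edge_chunks):
    """Merge per-chunk graphs, identifying nodes by label instead of local id."""
    node_types = {}  # label -> type; the first non-OTHER occurrence wins
    merged_edges = []  # (source label, edge type, target label)
    for chunk_nodes, chunk_edges in zip(node_chunks, edge_chunks):
        by_id = {node["id"]: node for node in chunk_nodes}
        for node in chunk_nodes:
            if node["type"] != "OTHER":
                node_types.setdefault(node["label"], node["type"])
        for edge in chunk_edges:
            if edge["type"] == "OTHER":
                continue
            merged_edges.append(
                (by_id[edge["source"]]["label"],
                 edge["type"],
                 by_id[edge["target"]]["label"])
            )
    return node_types, merged_edges


toy_nodes = [
    [{"id": 0, "type": "PERSON", "label": "Marie Curie"},
     {"id": 1, "type": "DISCOVERY", "label": "Polonium"}],
    [{"id": 0, "type": "PERSON", "label": "Pierre Curie"},
     {"id": 1, "type": "PERSON", "label": "Marie Curie"}],
]
toy_edges = [
    [{"source": 0, "target": 1, "type": "DISCOVERED", "label": "discovered"}],
    [{"source": 0, "target": 1, "type": "INTERPERSONALRELATIONSHIP", "label": "married"}],
]
node_types, merged_edges = merge_graphs(toy_nodes, toy_edges)
print(node_types)
print(merged_edges)
```

Note that "Marie Curie" appears in both chunks under different ids but ends up as a single node.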

Insert nodes to the graph db

import kuzu

db = kuzu.Database()
conn = kuzu.Connection(db)
# create schema
conn.execute("CREATE NODE TABLE Person(name STRING, PRIMARY KEY (name))");
conn.execute("CREATE NODE TABLE Award(name STRING, PRIMARY KEY (name))");
conn.execute("CREATE NODE TABLE Discovery(name STRING, PRIMARY KEY (name))");
conn.execute("CREATE NODE TABLE Location(name STRING, PRIMARY KEY (name))");
# copy from dataframes
df_nodes_person = df_nodes[df_nodes["type"] == "PERSON"][["label"]]
df_nodes_award = df_nodes[df_nodes["type"] == "AWARD"][["label"]]
df_nodes_discovery = df_nodes[df_nodes["type"] == "DISCOVERY"][["label"]]
df_nodes_location = df_nodes[df_nodes["type"] == "LOCATION"][["label"]]
conn.execute("COPY Person FROM df_nodes_person");
conn.execute("COPY Award FROM df_nodes_award");
conn.execute("COPY Discovery FROM df_nodes_discovery");
conn.execute("COPY Location FROM df_nodes_location");

Check for nodes

result = conn.execute("MATCH (person:Person) RETURN person.name;")
while result.has_next():
    print(result.get_next())
['Marie Curie']
['Pierre Curie']
['Henri Becquerel']
['Paul Langevin']

Insert edges to the graph db

conn.execute("CREATE REL TABLE Awarded(FROM Person TO Award)"); # create schema
conn.execute("CREATE REL TABLE Discovered(FROM Person TO Discovery)");
conn.execute("CREATE REL TABLE Interpersonal_Relationship(FROM Person TO Person)");
conn.execute("CREATE REL TABLE Visited(FROM Person TO Location)");
def filter_edges(edge_type: str, source: str, target: str):
    """Keep edges of `edge_type` whose endpoints are nodes of the expected types."""
    candidates = df_edges[df_edges["type"] == edge_type]
    source_labels = set(df_nodes[df_nodes["type"] == source]["label"])
    target_labels = set(df_nodes[df_nodes["type"] == target]["label"])
    # Positional boolean mask: df_edges has repeated index values after pd.concat
    mask = (
        candidates["source"].isin(source_labels).to_numpy()
        & candidates["target"].isin(target_labels).to_numpy()
    )
    return candidates[mask]
df_edges_awarded = filter_edges("AWARDED", "PERSON", "AWARD")[["source", "target"]]
df_edges_discovered = filter_edges("DISCOVERED", "PERSON", "DISCOVERY")[["source", "target"]]
df_edges_visited = filter_edges("VISITED", "PERSON", "LOCATION")[["source", "target"]]
df_edges_interpersonal_relationship = filter_edges("INTERPERSONALRELATIONSHIP", "PERSON", "PERSON")[["source", "target"]]
conn.execute("COPY Discovered FROM df_edges_discovered"); # copy from dataframes
conn.execute("COPY Awarded FROM df_edges_awarded");
conn.execute("COPY Visited FROM df_edges_visited");
conn.execute("COPY Interpersonal_Relationship FROM df_edges_interpersonal_relationship");

Check for edges

result = conn.execute("MATCH (a)-[:DISCOVERED]->(c) RETURN a.name, c.name;")
while result.has_next():
    print(result.get_next())
['Marie Curie', 'Polonium']
['Marie Curie', 'Radium']
['Pierre Curie', 'Piezoelectricity']

KG-based Agents

Tools

Tool: Query Checker

def build_kuzu_query_checker_prompt(cypher_query):
    return (
    f"\n{cypher_query}\n"
    "Double check the Kùzu dialect of Cypher for the query above "
    "for common mistakes, including:\n"
    "- Using the correct number of arguments for functions\n"
    "- Casting to the correct data type\n"
    "- Do not omit the relationship pattern.\n" 
    "- Always use `()-[]->()` instead of `()->()`.\n\n"
    "If there are any of the above mistakes, rewrite the query. "
    "If there are no mistakes, just reproduce the original query.\n\n"
    "Output the final Kùzu Cypher query only.\n\n"
    "Kùzu Cypher Query: "
    )

Tool: Query Checker

def query_checker_tool(cypher_query):
    user_prompt = build_kuzu_query_checker_prompt(cypher_query)
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": user_prompt}]
    )
    return response["choices"][0]["message"]["content"]


query_checker_tool(
"MATCH (p:Person)-[:Discovered]-<(:Discovery {name: 'Polonium'})\nRETURN *"
)
"MATCH (p:Person)-[:Discovered]->(d:Discovery {name: 'Polonium'})\nRETURN *"

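The LLM-based checker can be complemented by a cheap deterministic pre-check for the two arrow mistakes named in the prompt. This is a heuristic sketch with regular expressions, not a Cypher parser: it can be fooled by string literals that contain arrow-like text, and `lint_cypher` is a hypothetical helper.

```python
import re


def lint_cypher(query: str):
    """Return a list of arrow-related problems found in a Cypher query."""
    problems = []
    # ")->(" or ")<-(" means the relationship pattern [] was omitted.
    if re.search(r"\)\s*(->|<-)\s*\(", query):
        problems.append("relationship pattern omitted: use ()-[]->()")
    # "]-<" is never a valid way to close a relationship pattern.
    if re.search(r"\]\s*-\s*<", query):
        problems.append("malformed arrow after relationship pattern")
    return problems


broken = "MATCH (p:Person)-[:Discovered]-<(:Discovery {name: 'Polonium'})\nRETURN *"
fixed = "MATCH (p:Person)-[:Discovered]->(d:Discovery {name: 'Polonium'})\nRETURN *"
print(lint_cypher(broken))
print(lint_cypher(fixed))
```

Only queries that fail such a check would then need the more expensive LLM rewrite.
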
Tool: Query Generator + Executor

from langchain_community.graphs import KuzuGraph

graph_db_schema = KuzuGraph(db).get_schema
print(graph_db_schema)
Node properties: [{'properties': [('name', 'STRING')], 'label': 'Person'}, {'properties': [('name', 'STRING')], 'label': 'Discovery'}, {'properties': [('name', 'STRING')], 'label': 'Award'}, {'properties': [('name', 'STRING')], 'label': 'Location'}]
Relationships properties: [{'properties': [], 'label': 'Awarded'}, {'properties': [], 'label': 'Discovered'}, {'properties': [], 'label': 'Interpersonal_Relationship'}, {'properties': [], 'label': 'Visited'}]
Relationships: ['(:Person)-[:Awarded]->(:Award)', '(:Person)-[:Discovered]->(:Discovery)', '(:Person)-[:Interpersonal_Relationship]->(:Person)', '(:Person)-[:Visited]->(:Location)']

Tool: Query Generator + Executor

def build_kuzu_prompt(query, graph_db_schema=graph_db_schema):
    return """Task: Generate Kùzu Cypher statement to query a graph database.

Instructions:
Generate the Kùzu dialect of Cypher with the following rules in mind:
1. Do not omit the relationship pattern. Always use `()-[]->()` instead of `()->()`.
2. Do not include triple backticks ``` in your response. Return only Cypher.
3. Do not return any notes or comments in your response.

Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:\n""" + graph_db_schema + """
\nExample:
The question is:\n"Which songs does the load album have?"
MATCH (a:ALBUM {name: 'Load'})<-[:IN_ALBUM]-(s:SONG) RETURN s.name
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:\n""" + query

Tool: Query Generator + Executor

def query_generator_tool(query):
    user_prompt = build_kuzu_prompt(query)
    output = llm.create_chat_completion(
        messages = [{"role": "user", "content": user_prompt}]
    )
    cypher_query = output['choices'][0]['message']['content']
    cypher_query = query_checker_tool(cypher_query)
    print(f"\x1b[33m Cypher query: {cypher_query} \x1b[0m")
    response = conn.execute(f"{cypher_query}");
    df = response.get_as_pl()
    col_name = df.columns[0]
    _values = df[col_name].to_list()
    return f"[{col_name}: {_values}]"
    
query_generator_tool("What discovery is Marie Curie famous for?")
 Cypher query: MATCH (p:Person {name: 'Marie Curie'})-[:Discovered]->(d:Discovery) RETURN d.name 
"[d.name: ['Radium', 'Polonium']]"

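Since the generated Cypher is executed directly, one optional safeguard is to refuse anything that looks like a write. The sketch below is a conservative keyword heuristic under that assumption; `is_read_only` and `WRITE_CLAUSES` are hypothetical names and the keyword list is not an official Kùzu list.

```python
import re

# Keywords that modify data or schema (an illustrative, non-exhaustive list)
WRITE_CLAUSES = {
    "CREATE", "MERGE", "DELETE", "DETACH", "SET",
    "REMOVE", "DROP", "ALTER", "COPY",
}


def is_read_only(cypher_query: str) -> bool:
    """Return True if no write keyword appears outside of string literals."""
    # Blank out quoted strings so that {name: 'Set'} does not trip the check.
    without_strings = re.sub(r"'[^']*'|\"[^\"]*\"", "''", cypher_query)
    words = set(re.findall(r"[A-Za-z_]+", without_strings.upper()))
    return not words.intersection(WRITE_CLAUSES)


print(is_read_only(
    "MATCH (p:Person {name: 'Marie Curie'})-[:Discovered]->(d:Discovery) RETURN d.name"
))
print(is_read_only("MATCH (p:Person) DETACH DELETE p"))
```

The check errs on the side of rejecting: a label or property that happens to be named like a keyword would also be refused.
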
Tool: Conversational Response

def conversational_response(text):
    return text

ReAct Agent

from typing import Literal

from pydantic import BaseModel, Field


class Reason_and_Act(BaseModel):
    Scratchpad: str = Field(
        ...,
        description="Information from the Observation useful to answer the question",
    )
    Thought: str = Field(
        ...,
        description="It describes your thoughts about the question you have been asked",
    )
    Action: Literal["conversational_response", "query_generator", "query_checker"] = (
        Field(..., description="The action to take")
    )
    Action_Input: str = Field(..., description="The arguments of the Action.")

ReAct Agent

class Final_Answer(BaseModel):
    Scratchpad: str = Field(
        ...,
        description="Information from the Observation useful to answer the question",
    )
    Final_Answer: str = Field(
        ..., description="Answer to the question grounded on the Observation"
    )

ReAct Agent

class Decision(BaseModel):
    Decision: Reason_and_Act | Final_Answer

ReAct Agent

from outlines import generate, models
from outlines.fsm.json_schema import build_regex_from_schema, convert_json_schema_to_str

model = models.LlamaCpp(llm)
schema_str = convert_json_schema_to_str(json_schema=Decision.schema_json())
regex_str = build_regex_from_schema(schema_str)

ReAct Agent

def generate_hermes_prompt(question, schema=""):
    return (
        "<|im_start|>system\n"
        "You are a world class AI model who answers questions in JSON with correct Pydantic schema."
        f"\nHere's the JSON schema you must adhere to:\n<schema>\n{schema}\n</schema>\n"
        "You run in a loop of Scratchpad, Thought, Action, Action Input, PAUSE, Observation."
        "\nAt the end of the loop you output a Final Answer. "
        "\n- Use Scratchpad to store the information from the observation useful to answer the question"
        "\n- Use Thought to describe your thoughts about the question you have been asked "
        "and reflect carefully about the Observation if it exists. "
        "\n- Use Action to run one of the actions available to you. "
        "\n- Use Action Input to input the arguments of the selected action - then return PAUSE. "
        "\n- Observation will be the result of running those actions. "
        "\nYour available actions are:\n"
        "query_generator:\n" 
        "e.g. query_generator: Who is Marie Curie related to?\n"
        "Returns a detailed and correct Kùzu Cypher query\n"
        "query_checker:\n"
        "e.g. query_checker: MATCH (a:ALBUM {name: 'The Black Album'})<-[:IN_ALBUM]-(s:SONG) RETURN COUNT(s)\n"
        "Returns a detailed and correct Kùzu Cypher query after double checking the query for common mistakes\n"
        "conversational_response:\n"
        "e.g. conversational_response: Hi!\n"
        "Returns a conversational response to the user\n"
        "DO NOT TRY TO GUESS THE ANSWER. Begin! <|im_end|>"
        "\n<|im_start|>user\n" + question + "<|im_end|>"
        "\n<|im_start|>assistant\n"
    )

ReAct Agent

class ChatBot:
    def __init__(self, prompt=""):
        self.prompt = prompt

    def __call__(self, user_prompt):
        self.prompt += user_prompt
        result = self.execute()
        return result
        
    def execute(self):
        generator = generate.regex(model, regex_str)
        result = generator(self.prompt, max_tokens=1024, temperature=0, seed=42)
        return result

ReAct Agent

import json

def query(question, max_turns=5):
    i = 0
    next_prompt = (
        "\n<|im_start|>user\n" + question + "<|im_end|>"
        "\n<|im_start|>assistant\n"
    )
    previous_actions = []
    while i < max_turns:
        i += 1
        prompt = generate_hermes_prompt(question=question, schema=Decision.schema_json())
        bot = ChatBot(prompt=prompt)
        result = bot(next_prompt)
        json_result = json.loads(result)['Decision']
        if "Final_Answer" not in list(json_result.keys()):
            # i starts at 1 here, so the first turn (no Observation yet) keeps an empty scratchpad
            scratchpad = json_result['Scratchpad'] if i > 1 else ""
            thought = json_result['Thought']
            action = json_result['Action']
            action_input = json_result['Action_Input']
            print(f"\x1b[34m Scratchpad: {scratchpad} \x1b[0m")
            print(f"\x1b[34m Thought: {thought} \x1b[0m")
            print(f"\x1b[36m  -- running {action}: {str(action_input)}\x1b[0m")
            if action + ": " + str(action_input) in previous_actions:
                observation = "You already ran that action. **TRY A DIFFERENT ACTION INPUT.**"
            else:
                if action=="query_checker":
                    try:
                        observation = query_checker_tool(str(action_input))
                    except Exception as e:
                        observation = f"{e}"
                elif action=="query_generator":
                    try:
                        observation = query_generator_tool(str(action_input))
                    except Exception as e:
                        observation = f"{e}"
                elif action=="conversational_response":
                    try:
                        observation = conversational_response(str(action_input))
                        observation += "\nAnswer to the user."
                    except Exception as e:
                        observation = f"{e}"
            print()
            print(f"\x1b[33m Observation: {observation} \x1b[0m")
            print()
            previous_actions.append(action + ": " + str(action_input))
            next_prompt += (
                "\nScratchpad: " + scratchpad +
                "\nThought: " + thought +
                "\nAction: " + action  +
                "\nAction Input: " + action_input +
                "\nObservation: " + str(observation)
            )
        else:
            scratchpad = json_result["Scratchpad"]
            final_answer = json_result["Final_Answer"]
            print(f"\x1b[34m Scratchpad: {scratchpad} \x1b[0m")
            print(f"\x1b[34m Final Answer: {final_answer} \x1b[0m")
            return final_answer
    print("\nFinal Answer: I am sorry, but I am unable to answer your question. Please provide more information or a different question.")
    return "No answer found"
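
The control flow of the loop can be tried without a model by replacing the constrained LLM call with a scripted function and the tools with stubs. This is a simplified sketch of the same idea that uses a dispatch table instead of the `if`/`elif` chain; `run_agent`, `scripted_decide` and `stub_tools` are hypothetical stand-ins, not the talk's code.

```python
import json


def run_agent(question, decide, tools, max_turns=5):
    """Generic ReAct loop: `decide` maps the transcript so far to a JSON decision."""
    transcript = f"Question: {question}"
    seen = set()
    for _ in range(max_turns):
        decision = json.loads(decide(transcript))["Decision"]
        if "Final_Answer" in decision:
            return decision["Final_Answer"]
        action, action_input = decision["Action"], decision["Action_Input"]
        if (action, action_input) in seen:
            observation = "You already ran that action. Try a different action input."
        else:
            seen.add((action, action_input))
            try:
                observation = tools[action](action_input)
            except Exception as exc:  # feed tool errors back as the observation
                observation = str(exc)
        transcript += (
            f"\nAction: {action}\nAction Input: {action_input}"
            f"\nObservation: {observation}"
        )
    return "No answer found"


# Stand-ins for the real tools and for the constrained LLM call
stub_tools = {
    "query_generator": lambda q: "[d.name: ['Piezoelectricity']]",
    "conversational_response": lambda text: text,
}


def scripted_decide(transcript):
    if "Observation:" not in transcript:
        return json.dumps({"Decision": {
            "Scratchpad": "",
            "Thought": "I should query the database.",
            "Action": "query_generator",
            "Action_Input": "Which discovery did Pierre Curie make?",
        }})
    return json.dumps({"Decision": {
        "Scratchpad": "Piezoelectricity",
        "Final_Answer": "Pierre Curie discovered piezoelectricity.",
    }})


print(run_agent("Which discovery did Pierre Curie make?", scripted_decide, stub_tools))
```

Because every decision is parsed from JSON that follows the `Decision` schema, the loop never has to guess at free-form model output.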

ReAct Agent

query("Query the database to see which discovery did Pierre Curie do?")
 Scratchpad:  
 Thought: To find out which discovery Pierre Curie is known for, I should query the database for information about his scientific contributions. 
  -- running query_generator: MATCH (p:PERSON {name: 'Pierre Curie'})-[:DISCOVERED]->(d:DISCOVERY) RETURN d.name
 Cypher query: MATCH (p:PERSON {name: 'Pierre Curie'})-[:DISCOVERED]->(d:DISCOVERY) RETURN d.name 

 Observation: [d.name: ['Piezoelectricity']] 

 Scratchpad: Pierre Curie is known for discovering piezoelectricity. 
 Final Answer: Pierre Curie discovered piezoelectricity. 
'Pierre Curie discovered piezoelectricity.'

ReAct Agent

query("Query the database to see which discovery did Marie Curie do?")
 Scratchpad:  
 Thought: To find out which discovery Marie Curie is known for, I will query the database to search for her discoveries. 
  -- running query_generator: MATCH (p:PERSON {name: 'Marie Curie'})-[:DISCOVERED]->(d:DISCOVERY) RETURN d.name
 Cypher query: MATCH (p:PERSON {name: 'Marie Curie'})-[:DISCOVERED]->(d:DISCOVERY) RETURN d.name 

 Observation: [d.name: ['Radium', 'Polonium']] 

 Scratchpad: Marie Curie is known for discovering Radium and Polonium. 
 Final Answer: Marie Curie discovered Radium and Polonium. 
'Marie Curie discovered Radium and Polonium.'

ReAct Agent

query("Tell me a joke.")
 Scratchpad:  
 Thought: To provide a joke, I will use my knowledge base to find a suitable joke that is appropriate for all ages. 
  -- running conversational_response: Tell a family-friendly joke.

 Observation: Tell a family-friendly joke.
Answer to the user. 

 Scratchpad: To provide a joke, I will use my knowledge base to find a suitable joke that is appropriate for all ages. 
 Final Answer: Why don't scientists trust atoms? Because they make up everything! 
"Why don't scientists trust atoms? Because they make up everything!"

Conclusions

  • The goal is to give the LLM better context before it generates its response
  • Combining graph + vector retrieval helps us achieve this goal better than vector-only or graph-only retrieval
  • Structured Outputs + Structured Generation help us generate Knowledge Graphs from unstructured data
  • Structured Outputs + Structured Generation help us create KG-based Agents that can query graph databases as well as call other tools

How was my presentation?

Please leave your anonymous feedback at https://www.admonymous.co/alonsosilva