This tutorial demonstrates how to use Pixeltable’s built-in llama.cpp
integration to run local LLMs efficiently.
Important Notes
- Models are automatically downloaded from Hugging Face and cached
locally
- Different quantization levels are available for performance/quality
tradeoffs
- Consider memory usage when choosing models and quantizations
Set Up Environment
First, let’s install Pixeltable with llama.cpp support:
%pip install -qU pixeltable llama-cpp-python huggingface-hub
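As noted above, GGUF repos on the Hugging Face Hub typically provide several
quantization levels (for example Q4_K_M, Q5_K_M, Q8_0), and the choice trades
quality against download size and memory use. If you’d like to see which
quantizations a repo offers before picking one, you can list its GGUF files with
huggingface_hub. This is an optional sketch, using the same Qwen repo we’ll work
with later in the tutorial.
# Optional: list the GGUF files (and thus quantization levels) in a model repo.
from huggingface_hub import list_repo_files

gguf_files = [
    f for f in list_repo_files('Qwen/Qwen2.5-0.5B-Instruct-GGUF')
    if f.endswith('.gguf')
]
print(gguf_files)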
Create a Table for Chat Completions
Now let’s create a table that will contain our inputs and responses.
import pixeltable as pxt
from pixeltable.functions import llama_cpp
pxt.drop_dir('llama_demo', force=True)
pxt.create_dir('llama_demo')
t = pxt.create_table('llama_demo.chat', {'input': pxt.String})
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `llama_demo`.
Created table `chat`.
Next, we add a computed column that calls the Pixeltable
create_chat_completion UDF, which adapts the corresponding llama.cpp
API call. In our examples, we’ll use pretrained models from the Hugging
Face Hub. llama.cpp makes this easy: you specify a repo_id (taken from the
model’s URL) and a filename from the model repo, and the model is then
downloaded and cached automatically.
(If this is your first time using Pixeltable, the
Pixeltable
Fundamentals tutorial contains more details about table creation,
computed columns, and UDFs.)
For this demo we’ll use Qwen2.5-0.5B, a very small (0.5-billion
parameter) model that still produces decent results. We’ll use Q5_K_M
(5-bit) quantization, which gives an excellent balance of quality and
efficiency.
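For reference, the llama.cpp call that this UDF adapts looks roughly like the
sketch below when used directly through llama-cpp-python. You don’t need to run
it; it’s only meant to show what Pixeltable is wrapping (the repo and filename
pattern are the same ones we use in the next cell).
# Rough sketch of the direct llama-cpp-python equivalent (illustrative only;
# Pixeltable handles model loading and the API call inside the computed column).
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id='Qwen/Qwen2.5-0.5B-Instruct-GGUF',
    filename='*q5_k_m.gguf',  # downloaded from Hugging Face and cached locally
)
response = llm.create_chat_completion(
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'What is the capital of France?'},
    ]
)
print(response['choices'][0]['message']['content'])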
# Add a computed column that uses llama.cpp for chat completion
# against the input.
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': t.input}
]
t.add_computed_column(result=llama_cpp.create_chat_completion(
    messages,
    repo_id='Qwen/Qwen2.5-0.5B-Instruct-GGUF',
    repo_filename='*q5_k_m.gguf'
))
# Extract the output content from the JSON structure returned
# by llama_cpp.
t.add_computed_column(output=t.result.choices[0].message.content)
Added 0 column values with 0 errors.
Added 0 column values with 0 errors.
UpdateStatus(num_rows=0, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])
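The value stored in result is the JSON structure that llama.cpp returns, in the
familiar OpenAI chat-completion format; the path expression
choices[0].message.content extracts the assistant’s reply from it. Roughly, a
stored result looks like this (a sketch of the shape only, with made-up values
and some fields omitted):
# Approximate shape of a create_chat_completion result (values are illustrative):
{
    'id': 'chatcmpl-...',
    'object': 'chat.completion',
    'model': '...q5_k_m.gguf',
    'choices': [{
        'index': 0,
        'message': {'role': 'assistant', 'content': 'The capital of France is Paris.'},
        'finish_reason': 'stop'
    }],
    'usage': {'prompt_tokens': 25, 'completion_tokens': 8, 'total_tokens': 33}
}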
Test Chat Completion
Let’s try a simple query:
# Test with a simple question
t.insert([
    {'input': 'What is the capital of France?'},
    {'input': 'What are some edible species of fish?'},
    {'input': 'Who are the most prominent classical composers?'}
])
Computing cells: 100%|████████████████████████████████████████████| 9/9 [00:03<00:00, 2.25 cells/s]
Inserting rows into `chat`: 3 rows [00:00, 1112.74 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 9/9 [00:03<00:00, 2.25 cells/s]
Inserted 3 rows with 0 errors.
UpdateStatus(num_rows=3, num_computed_values=9, num_excs=0, updated_cols=[], cols_with_excs=[])
t.select(t.input, t.output).collect()
Comparing Models
Local model frameworks like llama.cpp make it easy to compare the
output of different models. Let’s compare Qwen’s responses with those of
a somewhat larger model, Llama-3.2-1B. As always, when we add a new
computed column to our table, it’s automatically evaluated against the
existing rows.
t.add_computed_column(result_l3=llama_cpp.create_chat_completion(
    messages,
    repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF',
    repo_filename='*Q5_K_M.gguf'
))
t.add_computed_column(output_l3=t.result_l3.choices[0].message.content)
t.select(t.input, t.output, t.output_l3).collect()
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:08<00:00, 2.74s/ cells]
Added 3 column values with 0 errors.
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 349.89 cells/s]
Added 3 column values with 0 errors.
Just for fun, let’s try a different system prompt that gives the model a
different persona.
messages_teacher = [
    {'role': 'system',
     'content': 'You are a patient school teacher. '
                'Explain concepts simply and clearly.'},
    {'role': 'user', 'content': t.input}
]
t.add_computed_column(result_teacher=llama_cpp.create_chat_completion(
    messages_teacher,
    repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF',
    repo_filename='*Q5_K_M.gguf'
))
t.add_computed_column(output_teacher=t.result_teacher.choices[0].message.content)
t.select(t.input, t.output_teacher).collect()
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:06<00:00, 2.30s/ cells]
Added 3 column values with 0 errors.
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 605.33 cells/s]
Added 3 column values with 0 errors.
Additional Resources