This tutorial demonstrates how to use Pixeltable’s built-in llama.cpp
integration to run local LLMs efficiently.
Important Notes
- Models are automatically downloaded from Hugging Face and cached
locally
- Different quantization levels are available for performance/quality
tradeoffs
- Consider memory usage when choosing models and quantizations
Set Up Environment
First, let’s install Pixeltable with llama.cpp support:
%pip install -qU pixeltable llama-cpp-python huggingface-hub
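As noted above, GGUF repos on the Hugging Face Hub typically provide several
quantization levels (for example Q4_K_M, Q5_K_M, Q8_0), and the choice trades
quality against download size and memory use. If you’d like to see which
quantizations a repo offers before picking one, you can list its GGUF files with
huggingface_hub. This is an optional sketch, using the same Qwen repo we’ll work
with later in the tutorial.
# Optional: list the GGUF files (and thus quantization levels) in a model repo.
from huggingface_hub import list_repo_files

gguf_files = [
    f for f in list_repo_files('Qwen/Qwen2.5-0.5B-Instruct-GGUF')
    if f.endswith('.gguf')
]
print(gguf_files)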
Create a Table for Chat Completions
Now let’s create a table that will contain our inputs and responses.
import pixeltable as pxt
from pixeltable.functions import llama_cpp
pxt.drop_dir('llama_demo', force=True)
pxt.create_dir('llama_demo')
t = pxt.create_table('llama_demo.chat', {'input': pxt.String})
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `llama_demo`.
Created table `chat`.
Next, we add a computed column that calls the Pixeltable
create_chat_completion UDF, which adapts the corresponding llama.cpp
API call. In our examples, we’ll use pretrained models from the Hugging
Face Hub. llama.cpp makes this easy: you specify a repo_id (taken from the
model’s URL) and a filename from the model repo, and the model is then
downloaded and cached automatically.
(If this is your first time using Pixeltable, the
Pixeltable
Fundamentals tutorial contains more details about table creation,
computed columns, and UDFs.)
For this demo we’ll use Qwen2.5-0.5B, a very small (0.5-billion
parameter) model that still produces decent results. We’ll use Q5_K_M
(5-bit) quantization, which gives an excellent balance of quality and
efficiency.
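For reference, the llama.cpp call that this UDF adapts looks roughly like the
sketch below when used directly through llama-cpp-python. You don’t need to run
it; it’s only meant to show what Pixeltable is wrapping (the repo and filename
pattern are the same ones we use in the next cell).
# Rough sketch of the direct llama-cpp-python equivalent (illustrative only;
# Pixeltable handles model loading and the API call inside the computed column).
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id='Qwen/Qwen2.5-0.5B-Instruct-GGUF',
    filename='*q5_k_m.gguf',  # downloaded from Hugging Face and cached locally
)
response = llm.create_chat_completion(
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'What is the capital of France?'},
    ]
)
print(response['choices'][0]['message']['content'])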
# Add a computed column that uses llama.cpp for chat completion
# against the input.
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': t.input}
]
t.add_computed_column(result=llama_cpp.create_chat_completion(
    messages,
    repo_id='Qwen/Qwen2.5-0.5B-Instruct-GGUF',
    repo_filename='*q5_k_m.gguf'
))
# Extract the output content from the JSON structure returned
# by llama_cpp.
t.add_computed_column(output=t.result.choices[0].message.content)
Added 0 column values with 0 errors.
Added 0 column values with 0 errors.
UpdateStatus(num_rows=0, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])
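The value stored in result is the JSON structure that llama.cpp returns, in the
familiar OpenAI chat-completion format; the path expression
choices[0].message.content extracts the assistant’s reply from it. Roughly, a
stored result looks like this (a sketch of the shape only, with made-up values
and some fields omitted):
# Approximate shape of a create_chat_completion result (values are illustrative):
{
    'id': 'chatcmpl-...',
    'object': 'chat.completion',
    'model': '...q5_k_m.gguf',
    'choices': [{
        'index': 0,
        'message': {'role': 'assistant', 'content': 'The capital of France is Paris.'},
        'finish_reason': 'stop'
    }],
    'usage': {'prompt_tokens': 25, 'completion_tokens': 8, 'total_tokens': 33}
}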
Test Chat Completion
Let’s try a simple query:
# Test with a simple question
t.insert([
    {'input': 'What is the capital of France?'},
    {'input': 'What are some edible species of fish?'},
    {'input': 'Who are the most prominent classical composers?'}
])
Computing cells: 100%|████████████████████████████████████████████| 9/9 [00:03<00:00, 2.25 cells/s]
Inserting rows into `chat`: 3 rows [00:00, 1112.74 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 9/9 [00:03<00:00, 2.25 cells/s]
Inserted 3 rows with 0 errors.
UpdateStatus(num_rows=3, num_computed_values=9, num_excs=0, updated_cols=[], cols_with_excs=[])
t.select(t.input, t.output).collect()
Comparing Models
Local model frameworks like llama.cpp make it easy to compare the
output of different models. Let’s compare Qwen’s responses with those of
a somewhat larger model, Llama-3.2-1B. As always, when we add a new
computed column to our table, it’s automatically evaluated against the
existing rows.
t.add_computed_column(result_l3=llama_cpp.create_chat_completion(
    messages,
    repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF',
    repo_filename='*Q5_K_M.gguf'
))
t.add_computed_column(output_l3=t.result_l3.choices[0].message.content)
t.select(t.input, t.output, t.output_l3).collect()
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:08<00:00, 2.74s/ cells]
Added 3 column values with 0 errors.
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 349.89 cells/s]
Added 3 column values with 0 errors.
Just for fun, let’s try a different system prompt that gives the model a
different persona.
messages_teacher = [
    {'role': 'system',
     'content': 'You are a patient school teacher. '
                'Explain concepts simply and clearly.'},
    {'role': 'user', 'content': t.input}
]
t.add_computed_column(result_teacher=llama_cpp.create_chat_completion(
    messages_teacher,
    repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF',
    repo_filename='*Q5_K_M.gguf'
))
t.add_computed_column(output_teacher=t.result_teacher.choices[0].message.content)
t.select(t.input, t.output_teacher).collect()
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:06<00:00, 2.30s/ cells]
Added 3 column values with 0 errors.
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 605.33 cells/s]
Added 3 column values with 0 errors.
Additional Resources