Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Pixeltable unifies data and computation into a table interface. In this tutorial, we'll take a closer look at how Hugging Face datasets integrate with Pixeltable, and how Hugging Face models can be incorporated into Pixeltable workflows to run locally.
%pip install -qU pixeltable datasets torch transformers tiktoken spacy
Now let’s load the Hugging Face dataset, as described in the Hugging Face documentation.
import datasets

padoru = (
    datasets.load_dataset("not-lain/padoru", split='train')
    .select_columns(['Image', 'ImageSize', 'Name', 'ImageSource'])
)
The loaded dataset retains Hugging Face's split information, i.e., whether the data belongs to the train, test, or validation split.
padoru
Dataset({
    features: ['Image', 'ImageSize', 'Name', 'ImageSource'],
    num_rows: 382
})
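When a table is created from this dataset (next section), Pixeltable infers a column type for each Hugging Face feature. The sketch below is our own simplified approximation of that mapping, not Pixeltable's actual implementation; the type names on the right are real Pixeltable column types.

```python
# Illustrative only: an approximate mapping from Hugging Face feature reprs
# to Pixeltable column types. Pixeltable's real inference is more involved.
FEATURE_TO_PXT_TYPE = {
    "Image": "pxt.Image",          # PIL images map to Pixeltable's Image type
    "Value('string')": "pxt.String",
    "Value('int64')": "pxt.Int",
    "Value('float64')": "pxt.Float",
    "Value('bool')": "pxt.Bool",
}

def infer_column_type(feature_repr: str) -> str:
    """Look up an (approximate) Pixeltable column type for a HF feature."""
    return FEATURE_TO_PXT_TYPE.get(feature_repr, "pxt.Json")  # fallback

print(infer_column_type("Image"))            # -> pxt.Image
print(infer_column_type("Value('string')"))  # -> pxt.String
```

For the padoru dataset above, this would give an Image column for 'Image' and String columns for 'Name' and 'ImageSource'.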

Create a Pixeltable Table from a Hugging Face Dataset

Now we create a table; Pixeltable maps the column types automatically. Check out other ways to bring data into Pixeltable with the pixeltable.io module, which supports CSV, Parquet, Pandas DataFrames, JSON, and other formats.
import pixeltable as pxt

pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
t = pxt.create_table('hf_demo.padoru', source=padoru)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `hf_demo`.
Created table `padoru_tmp_8951741`.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:08<00:00, 60.68 cells/s]
Inserting rows into `padoru_tmp_8951741`: 126 rows [00:07, 17.34 rows/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:07<00:00, 69.45 cells/s]
Inserting rows into `padoru_tmp_8951741`: 126 rows [00:06, 19.92 rows/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:07<00:00, 70.04 cells/s]
Inserting rows into `padoru_tmp_8951741`: 126 rows [00:06, 19.96 rows/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|██████████████████████████████████████████| 16/16 [00:00<00:00, 66.64 cells/s]
Inserting rows into `padoru_tmp_8951741`: 4 rows [00:00, 3502.55 rows/s]
Inserted 4 rows with 0 errors.
t.head(3)

Leveraging Hugging Face Models with Pixeltable’s Embedding Functionality

Pixeltable contains built-in adapters for certain model families, so all we have to do is call the corresponding Pixeltable function for Hugging Face. A nice thing about Hugging Face models is that they run locally, so you don't need an account with a service provider to use them. Pixeltable can also create and populate an index with table.add_embedding_index() for string and image embeddings. The index definition is persisted as part of the table's metadata, which allows Pixeltable to maintain the index as the table is updated. In this example we use CLIP, but you can use any embedding function you like via Pixeltable's UDF mechanism (described in detail in our guide to user-defined functions).
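Under the hood, a similarity search over a CLIP index boils down to comparing embedding vectors, typically with cosine similarity. The pure-Python sketch below is illustrative only; Pixeltable's actual index implementation is more sophisticated.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# vectors pointing the same way score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # -> 0.0
```

An embedding index precomputes one such vector per row, so a query only needs to embed the query item and compare it against the stored vectors.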
from pixeltable.functions.huggingface import clip
import PIL.Image

# create embedding index on the 'Image' column
t.add_embedding_index(
    'Image',
    embedding=clip.using(model_id='openai/clip-vit-base-patch32')
)
Computing cells: 100%|████████████████████████████████████████| 382/382 [00:16<00:00, 22.63 cells/s]
sample_img = t.select(t.Image).head(1)[0]['Image']

sim = t.Image.similarity(sample_img)

# use 'similarity()' in the order_by() clause and apply a limit in order to utilize the index
t.order_by(sim, asc=False).limit(3).select(t.Image, sim=sim).collect()
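Conceptually, order_by(sim, asc=False).limit(3) is a top-k nearest-neighbor lookup: rank every row by its similarity score and keep the three best. A pure-Python sketch of that ranking step (illustrative only; the row names and scores below are made up, and the index lets Pixeltable avoid scoring every row):

```python
import heapq

# hypothetical precomputed similarity scores, keyed by row name
scores = {"row_a": 0.91, "row_b": 0.34, "row_c": 0.78, "row_d": 0.88, "row_e": 0.12}

# top-3 rows by similarity, highest first -- equivalent to
# order_by(sim, asc=False).limit(3) over these scores
top3 = heapq.nlargest(3, scores.items(), key=lambda kv: kv[1])
print(top3)  # -> [('row_a', 0.91), ('row_d', 0.88), ('row_c', 0.78)]
```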
You can learn more about working with indexes in our tutorial: Working with Embedding and Vector Indexes.