This documentation page is also available as an interactive notebook. You can launch it in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation.
Pixeltable unifies data and computation in a single table interface. In this tutorial, we'll take a closer look at the Hugging Face integration: how to load Hugging Face datasets into Pixeltable, and how to incorporate Hugging Face models into Pixeltable workflows to run them locally.
%pip install -qU pixeltable datasets torch transformers tiktoken spacy
Now let's load the Hugging Face dataset, as described in the Hugging Face documentation.
import datasets
padoru = (
    datasets.load_dataset("not-lain/padoru", split='train')
    .select_columns(['Image', 'ImageSize', 'Name', 'ImageSource'])
)
The resulting dataset preserves the Hugging Face metadata, including whether the data belongs to the train, test, or validation split.
Dataset({
    features: ['Image', 'ImageSize', 'Name', 'ImageSource'],
    num_rows: 382
})
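When a table is created from a Hugging Face dataset, each feature type is mapped to a corresponding Pixeltable column type. The exact mapping is handled internally by Pixeltable; the sketch below (with hypothetical type names, not Pixeltable's real internals) just illustrates the idea:

```python
# Hypothetical sketch of how Hugging Face feature types could map to
# Pixeltable column types. The real mapping lives inside pixeltable.io;
# this is only an illustration of the concept.
FEATURE_TO_COLUMN_TYPE = {
    'Image': 'pxt.Image',           # PIL images -> Pixeltable image columns
    'Value(int64)': 'pxt.Int',      # scalar integers -> integer columns
    'Value(string)': 'pxt.String',  # strings -> string columns
}

def column_type_for(feature: str) -> str:
    """Look up the (hypothetical) Pixeltable column type for a feature."""
    # fall back to a JSON column for nested or unrecognized features
    return FEATURE_TO_COLUMN_TYPE.get(feature, 'pxt.Json')

print(column_type_for('Image'))           # -> pxt.Image
print(column_type_for('Value(string)'))   # -> pxt.String
print(column_type_for('Sequence(int64)')) # -> pxt.Json (fallback)
```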
Create a Pixeltable Table from a Hugging Face Dataset
Now we create a table from the dataset; Pixeltable maps the Hugging Face column types to its own types as needed. Check out the other ways to bring data into Pixeltable with pixeltable.io, such as CSV, Parquet, pandas, JSON, and others.
import pixeltable as pxt
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
t = pxt.create_table('hf_demo.padoru', source=padoru)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `hf_demo`.
Created table `padoru_tmp_8951741`.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:08<00:00, 60.68 cells/s]
Inserting rows into `padoru_tmp_8951741`: 126 rows [00:07, 17.34 rows/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:07<00:00, 69.45 cells/s]
Inserting rows into `padoru_tmp_8951741`: 126 rows [00:06, 19.92 rows/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:07<00:00, 70.04 cells/s]
Inserting rows into `padoru_tmp_8951741`: 126 rows [00:06, 19.96 rows/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|██████████████████████████████████████████| 16/16 [00:00<00:00, 66.64 cells/s]
Inserting rows into `padoru_tmp_8951741`: 4 rows [00:00, 3502.55 rows/s]
Inserted 4 rows with 0 errors.
Leveraging Hugging Face Models with Pixeltable’s Embedding Functionality
Pixeltable contains built-in adapters for certain model families, so all we have to do is call the corresponding Pixeltable function for Hugging Face. A nice thing about Hugging Face models is that they run locally, so you don't need an account with a service provider in order to use them.
Pixeltable can also create and populate an index with
table.add_embedding_index() for string and image embeddings. That
definition is persisted as part of the table’s metadata, which allows
Pixeltable to maintain the index in response to updates to the table.
In this example we are using CLIP. You can use any embedding function you like via Pixeltable's UDF mechanism (described in detail in our guide to user-defined functions).
from pixeltable.functions.huggingface import clip
import PIL.Image
# create an embedding index on the 'Image' column
t.add_embedding_index(
    'Image',
    embedding=clip.using(model_id='openai/clip-vit-base-patch32')
)
Computing cells: 100%|████████████████████████████████████████| 382/382 [00:16<00:00, 22.63 cells/s]
sample_img = t.select(t.Image).head(1)[0]['Image']
sim = t.Image.similarity(sample_img)
# use 'similarity()' in the order_by() clause and apply a limit in order to utilize the index
t.order_by(sim, asc=False).limit(3).select(t.Image, sim=sim).collect()
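Conceptually, a similarity query like this ranks rows by the similarity (typically cosine similarity) between the query's embedding and each row's embedding, with the vector index making that ranking efficient. A minimal pure-Python sketch of the ranking step, independent of Pixeltable's actual implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy "embeddings": the query vector is closest in direction to vec1
query = [1.0, 0.0]
vecs = {'vec1': [0.9, 0.1], 'vec2': [0.0, 1.0], 'vec3': [-1.0, 0.0]}

# rank descending by similarity, analogous to order_by(sim, asc=False)
ranked = sorted(vecs, key=lambda k: cosine_similarity(query, vecs[k]), reverse=True)
print(ranked)  # -> ['vec1', 'vec2', 'vec3']
```

An index avoids computing this similarity against every row; that's why the tutorial applies `order_by()` with a `limit()`, which lets Pixeltable serve the query from the index.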
You can learn more about how to leverage indexes in our tutorial: Working with Embedding and Vector Indexes.