In this tutorial, we’ll build an end-to-end workflow for creating and
indexing audio transcriptions of video data. We’ll demonstrate how
Pixeltable can be used to:
1. Extract audio data from video files
2. Transcribe the audio using OpenAI Whisper
3. Build a semantic index of the transcriptions, using the Huggingface sentence_transformers models
4. Search this index
The tutorial assumes you’re already somewhat familiar with Pixeltable.
If this is your first time using Pixeltable, the Pixeltable
Basics
tutorial is a great place to start.
Create a Table for Video Data
Let’s first install the Python packages we’ll need for the demo. We’re
going to use the popular Whisper library, running locally. Later in the
demo, we’ll see how to use the OpenAI API endpoints as an alternative.
%pip install -q pixeltable openai openai-whisper sentence-transformers spacy
Now we create a Pixeltable table to hold our videos.
import numpy as np
import pixeltable as pxt
pxt.drop_dir('transcription_demo', force=True) # Ensure a clean slate for the demo
pxt.create_dir('transcription_demo')
# Create a table to store our videos and workflow
video_table = pxt.create_table(
    'transcription_demo.video_table',
    {'video': pxt.Video}
)
video_table
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `transcription_demo`.
Created table `video_table`.
Next, let’s insert some video files into the table. In this demo, we’ll
be using one-minute excerpts from a Lex Fridman podcast, and we’ll begin
by inserting two of them into our new table. Our videos here are given as
https links, but Pixeltable also accepts local files and S3 URLs as input.
videos = [
    'https://github.com/pixeltable/pixeltable/raw/release/docs/resources/audio-transcription-demo/'
    f'Lex-Fridman-Podcast-430-Excerpt-{n}.mp4'
    for n in range(3)
]
video_table.insert({'video': video} for video in videos[:2])
video_table.show()
Computing cells: 100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 26.25 cells/s]
Inserting rows into `video_table`: 2 rows [00:00, 1073.67 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 25.69 cells/s]
Inserted 2 rows with 0 errors.
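As noted above, https links aren’t the only option: local files and s3:// URIs can be inserted in exactly the same way. Here’s a minimal sketch; the local path and bucket name below are hypothetical placeholders, not files shipped with this demo:
video_table.insert([
    {'video': '/path/to/local/video.mp4'},       # local file (placeholder path)
    {'video': 's3://my-bucket/videos/clip.mp4'}   # S3 URI (placeholder bucket)
])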
Now we’ll add another column to hold extracted audio from our videos.
The new column is an example of a computed column: it’s updated
automatically based on the contents of another column (or columns). In
this case, the value of the audio column is defined to be the audio
track extracted from whatever’s in the video column.
from pixeltable.functions.video import extract_audio
video_table.add_computed_column(
    audio=extract_audio(video_table.video, format='mp3')
)
video_table.show()
Computing cells: 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 2.20 cells/s]
Added 2 column values with 0 errors.
If we look at the structure of the video table, we see that the new
column is a computed column.
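One way to do that is simply to display the table handle again, as we did right after creating the table; the listing should now include the new audio column:
video_table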
We can also add another computed column to extract metadata from the
audio streams.
from pixeltable.functions.audio import get_metadata
video_table.add_computed_column(
    metadata=get_metadata(video_table.audio)
)
video_table.show()
Computing cells: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 289.64 cells/s]
Added 2 column values with 0 errors.
Create Transcriptions
Now we’ll add a step to create transcriptions of our videos. As
mentioned above, we’re going to use the Whisper library for this,
running locally. Pixeltable has a built-in function,
whisper.transcribe, that serves as an adapter for the Whisper
library’s transcription capability. All we have to do is add a computed
column that calls this function:
from pixeltable.functions import whisper
video_table.add_computed_column(
    transcription=whisper.transcribe(
        audio=video_table.audio,
        model='base.en'
    )
)
video_table.select(
    video_table.video,
    video_table.transcription.text
).show()
Computing cells: 100%|████████████████████████████████████████████| 2/2 [00:04<00:00, 2.11s/ cells]
Added 2 column values with 0 errors.
In order to index the transcriptions, we’ll first need to split them
into sentences. We can do this using Pixeltable’s built-in
StringSplitter iterator.
from pixeltable.iterators.string import StringSplitter
sentences_view = pxt.create_view(
    'transcription_demo.sentences_view',
    video_table,
    iterator=StringSplitter.create(
        text=video_table.transcription.text,
        separators='sentence'
    )
)
Inserting rows into `sentences_view`: 25 rows [00:00, 9187.56 rows/s]
Created view `sentences_view` with 25 rows, 0 exceptions.
The StringSplitter creates a new view, with the audio transcriptions
broken into individual, one-sentence chunks.
sentences_view.select(
    sentences_view.pos,
    sentences_view.text
).show(8)
Add an Embedding Index
Next, let’s use the Huggingface sentence_transformers library to
create an embedding index of our sentences, attaching it to the text
column of our sentences_view.
from pixeltable.functions.huggingface import sentence_transformer
sentences_view.add_embedding_index(
    'text',
    embedding=sentence_transformer.using(model_id='intfloat/e5-large-v2')
)
Computing cells: 100%|██████████████████████████████████████████| 25/25 [00:03<00:00, 8.18 cells/s]
We can do a simple lookup to test our new index. The following snippet
returns the results of a nearest-neighbor search on the input “What is
happiness?”
sim = sentences_view.text.similarity('What is happiness?')
(
    sentences_view
    .order_by(sim, asc=False)
    .limit(10)
    .select(sentences_view.text, similarity=sim)
    .collect()
)
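If you expect to run lookups like this repeatedly, it can be convenient to wrap the query in a small helper. This is just ordinary Python around the same Pixeltable calls shown above; the function name search_sentences is ours, not part of Pixeltable:
def search_sentences(query: str, limit: int = 10):
    # Nearest-neighbor search over the embedding index on sentences_view.text
    sim = sentences_view.text.similarity(query)
    return (
        sentences_view
        .order_by(sim, asc=False)
        .limit(limit)
        .select(sentences_view.text, similarity=sim)
        .collect()
    )

search_sentences('What is happiness?')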
Incremental Updates
Incremental updates are a key feature of Pixeltable. Whenever a new
video is added to the original table, all of its downstream computed
columns are updated automatically. Let’s demonstrate this by adding a
third video to the table and seeing how the updates propagate through to
the index.
video_table.insert([{'video': videos[2]}])
Computing cells: 100%|████████████████████████████████████████████| 5/5 [00:01<00:00, 2.58 cells/s]
Inserting rows into `video_table`: 1 rows [00:00, 277.18 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 5/5 [00:01<00:00, 2.57 cells/s]
Inserting rows into `sentences_view`: 8 rows [00:00, 978.69 rows/s]
Inserted 9 rows with 0 errors.
UpdateStatus(num_rows=9, num_computed_values=5, num_excs=0, updated_cols=[], cols_with_excs=[])
video_table.select(
    video_table.video,
    video_table.metadata,
    video_table.transcription.text
).show()
sim = sentences_view.text.similarity('What is happiness?')
(
    sentences_view
    .order_by(sim, asc=False)
    .limit(20)
    .select(sentences_view.text, similarity=sim)
    .collect()
)
We can see the new results showing up in sentences_view.
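As a quick sanity check, we can also count the rows in the view; the total should now cover the sentences from all three videos. (This assumes Pixeltable’s count() method, which returns the number of rows in a table or view.)
sentences_view.count()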
Using the OpenAI API
That concludes the portion of the tutorial that uses the locally installed
Whisper library. Sometimes, though, it may be preferable to call the OpenAI
API rather than run a model locally. In this section, we’ll show how to do
that in Pixeltable, simply by using a different function to construct our
computed columns.
Since this section relies on calling out to the OpenAI API, you’ll need
to have an API key, which you can enter below.
import os
import getpass
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
OpenAI API Key: ········
from pixeltable.functions import openai
video_table.add_computed_column(
    transcription_from_api=openai.transcriptions(
        video_table.audio,
        model='whisper-1'
    )
)
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:12<00:00, 4.14s/ cells]
Added 3 column values with 0 errors.
UpdateStatus(num_rows=3, num_computed_values=3, num_excs=0, updated_cols=[], cols_with_excs=[])
Now let’s compare the results from the local model and the API
side-by-side.
video_table.select(
    video_table.video,
    video_table.transcription.text,
    video_table.transcription_from_api.text
).show()
They look pretty similar, which isn’t surprising, since the OpenAI
transcriptions endpoint runs on Whisper.
One difference is that the local library returns considerably more
information about the internal behavior of the model. Note that we’ve
been selecting video_table.transcription.text in the preceding
queries, which pulls out just the text field of the transcription
results. The full result is a sizable JSON structure that includes a
lot of additional metadata. To see all of it, we can select
video_table.transcription instead. Here’s what that looks like (we’ll
select just one row, since it’s a lot of output):
video_table.select(
    video_table.transcription,
    video_table.transcription_from_api
).show(1)
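If you only need one piece of that metadata, you can drill into the JSON with the same dotted path syntax we used for .text. For example, the standard Whisper result dict also includes a language field, so (assuming that key is present in the local library’s output) we could select just that:
video_table.select(
    video_table.transcription.language
).show(1)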