This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.
In this tutorial, we’ll demonstrate how to use Pixeltable to do
frame-by-frame object detection, made simple through Pixeltable’s
video-related functionality: * automatic frame extraction * running
complex functions against frames (in this case, the YOLOX object
detection models) * reassembling frames back into videos We’ll be
working with a single video file from Pixeltable’s test data repository.
This tutorial assumes you’re at least somewhat familiar with Pixeltable;
a good place to learn more is the Pixeltable
Documentation.
Creating a tutorial directory and table
First, let’s make sure the packages we need for this tutorial are
installed: Pixeltable itself, PyTorch, and the YOLOX object detection
library.
%pip install -qU pixeltable pixeltable-yolox
All data in Pixeltable is stored in tables, which in turn reside in
directories. We’ll begin by creating a detection_demo directory and a
table to hold our videos, with a single column of type pxt.Video.
import pixeltable as pxt
pxt.create_dir('detection_demo', if_exists='replace_force')
videos_table = pxt.create_table(
'detection_demo.videos',
{'video': pxt.Video}
)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory ‘detection_demo’.
Created table `videos`.
In order to interact with the frames, we take advantage of Pixeltable’s
component view concept: we create a “view” of our video table that
contains one row for each frame of each video in the table. Pixeltable
provides the built-in FrameIterator class for this.
from pixeltable.iterators import FrameIterator
frames_view = pxt.create_view(
'detection_demo.frames',
videos_table,
# `fps` determines the frame rate; a value of `0`
# indicates the native frame rate of the video.
iterator=FrameIterator.create(video=videos_table.video, fps=0)
)
Created view `frames` with 0 rows, 0 exceptions.
You’ll see that neither the videos table nor the frames view has any
actual data yet, because we haven’t yet added any videos to the table.
However, the frames view is now configured to automatically track the
videos table as new data shows up.
The new view is automatically configured with six columns: - pos - a
system column that is part of every component view - video - the
column inherited from our base table (all base table columns are visible
in any of its views) - frame_idx, pos_msec, pos_frame, frame -
these four columns are created by the FrameIterator class.
Let’s have a look at the new view:
We’ll now insert a single row into the videos table, containing a video
of a busy intersection in Bangkok.
videos_table.insert([
{
'video': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/bangkok.mp4'
}
])
Inserting rows into `videos`: 1 rows [00:00, 321.33 rows/s]
Inserting rows into `frames`: 461 rows [00:00, 14957.29 rows/s]
Inserted 462 rows with 0 errors.
UpdateStatus(num_rows=462, num_computed_values=2, num_excs=0, updated_cols=[], cols_with_excs=[])
Notice that both the videos table and frames view were automatically
updated, expanding the single video into 461 rows in the view. Let’s
have a look at videos first.
Now let’s peek at the first five rows of frames:
frames_view.select(
frames_view.pos,
frames_view.frame,
frames_view.frame.width,
frames_view.frame.height
).show(5)
One advantage of using Pixeltable’s component view mechanism is that
Pixeltable does not physically store the frames. Instead, Pixeltable
re-extracts the frames on retrieval using the frame index, which can be
done very efficiently and avoids any storage overhead (which can be
quite substantial for video frames).
Object Detection with Pixeltable
Now let’s apply an object detection model to our frames. Pixeltable
includes built-in support for a number of models; we’re going to use the
YOLOX family of models, which are lightweight models with solid
performance. We first import the yolox Pixeltable function.
from pixeltable.functions.yolox import yolox
Pixeltable functions operate on columns and expressions using standard
Python function call syntax. Here’s an example that shows how we might
experiment with applying one of the YOLOX models to the first few frames
in our video, using Pixeltable’s powerful select comprehension.
# Show the results of applying the `yolox_tiny` model
# to the first few frames in the table.
frames_view.select(
frames_view.frame,
yolox(frames_view.frame, model_id='yolox_tiny')
).head(3)
It may appear that we just ran the YOLOX inference over the entire view
of 461 frames, but remember that Pixeltable evaluates expressions
lazily: in this case, it only ran inference over the 3 frames that we
actually displayed.
The inference output looks like what we’d expect, so let’s add a
computed column that runs inference over the entire view (computed
columns are discussed in detail in the Computed
Columns
tutorial). Remember that once a computed column is created, Pixeltable
will update it incrementally any time new rows are added to the view.
This is a convenient way to incorporate inference (and other operations)
into data workflows.
This will cause Pixeltable to run inference over all 461 frames, so
please be patient.
# Create a computed column to compute detections using the `yolox_tiny`
# model.
# We'll adjust the confidence threshold down a bit (the default is 0.5)
# to pick up even more bounding boxes.
frames_view.add_computed_column(detections_tiny=yolox(
frames_view.frame, model_id='yolox_tiny', threshold=0.25
))
Added 461 column values with 0 errors.
UpdateStatus(num_rows=461, num_computed_values=461, num_excs=0, updated_cols=[], cols_with_excs=[])
The new column is now part of the schema of the frames view:
The data in the computed column is now stored for fast retrieval.
frames_view.select(
frames_view.frame,
frames_view.detections_tiny
).show(3)
Now let’s create a new set of images, in which we superimpose the
detected bounding boxes on top of the original images. We’ll use the
handy built-in draw_bounding_boxes UDF for this. We could create a new
computed column to hold the superimposed images, but we don’t have to;
sometimes it’s easier just to use a select comprehension, as we did
when we were first experimenting with the detection model.
import pixeltable.functions as pxtf
frames_view.select(
frames_view.frame,
pxtf.vision.draw_bounding_boxes(
frames_view.frame,
frames_view.detections_tiny.bboxes,
width=4
)
).show(1)
Our select comprehension ranged over the entire table, but just as
before, Pixeltable computes the output lazily: image operations are
performed at retrieval time, so in this case, Pixeltable drew the
annotations just for the one frame that we actually displayed.
Looking at individual frames gives us some idea of how well our
detection algorithm works, but it would be more instructive to turn the
visualization output back into a video.
We do that with the built-in function make_video(), which is an
aggregation function that takes a frame index (actually: any expression
that can be used to order the frames; a timestamp would also work) and
an image, and then assembles the sequence of images into a video.
frames_view.group_by(videos_table).select(
pxt.functions.video.make_video(
frames_view.pos,
pxtf.vision.draw_bounding_boxes(
frames_view.frame,
frames_view.detections_tiny.bboxes,
width=4
)
)
).show(1)
Comparing Object Detection Models
The detections that we get out of yolox_tiny are passable, but a
little choppy. Suppose we want to experiment with a more powerful object
detection model, to see if there is any improvement in detection
quality. We can create an additional column to hold the new inferences.
The larger model takes longer to download and run, so please be patient.
# Here we use the larger `yolox_m` (medium) model.
frames_view.add_computed_column(detections_m=yolox(
frames_view.frame, model_id='yolox_m', threshold=0.25
))
Added 461 column values with 0 errors.
UpdateStatus(num_rows=461, num_computed_values=461, num_excs=0, updated_cols=[], cols_with_excs=[])
Let’s see the results of the two models side-by-side.
frames_view.group_by(videos_table).select(
pxt.functions.video.make_video(
frames_view.pos,
pxtf.vision.draw_bounding_boxes(
frames_view.frame,
frames_view.detections_tiny.bboxes,
width=4
)
),
pxt.functions.video.make_video(
frames_view.pos,
pxtf.vision.draw_bounding_boxes(
frames_view.frame,
frames_view.detections_m.bboxes,
width=4
)
)
).show(1)
Running the videos side-by-side, we can see that the larger model is
higher in quality: less flickering, with more stable boxes from frame to
frame.
Evaluating Models Against a Ground Truth
In order to do a quantitative evaluation of model performance, we need a
ground truth to compare them against. Let’s generate some (synthetic)
“ground truth” data by running against the largest YOLOX model
available. It will take even longer to cache and evaluate this model.
frames_view.add_computed_column(detections_x=yolox(
frames_view.frame, model_id='yolox_x', threshold=0.25
))
Added 461 column values with 0 errors.
UpdateStatus(num_rows=461, num_computed_values=461, num_excs=0, updated_cols=[], cols_with_excs=[])
Let’s have a look at our enlarged view, now with three detections
columns.
We’re going to be evaluating the generated detections with the
commonly-used mean average
precision
metric (mAP).
The mAP metric is based on per-frame metrics, such as true and false
positives per detected class, which are then aggregated into a single
(per-class) number. In Pixeltable, functionality is available via the
eval_detections() and mean_ap() built-in functions.
from pixeltable.functions.vision import eval_detections, mean_ap
frames_view.add_computed_column(eval_yolox_tiny=eval_detections(
pred_bboxes=frames_view.detections_tiny.bboxes,
pred_labels=frames_view.detections_tiny.labels,
pred_scores=frames_view.detections_tiny.scores,
gt_bboxes=frames_view.detections_x.bboxes,
gt_labels=frames_view.detections_x.labels
))
frames_view.add_computed_column(eval_yolox_m=eval_detections(
pred_bboxes=frames_view.detections_m.bboxes,
pred_labels=frames_view.detections_m.labels,
pred_scores=frames_view.detections_m.scores,
gt_bboxes=frames_view.detections_x.bboxes,
gt_labels=frames_view.detections_x.labels
))
Added 461 column values with 0 errors.
Added 461 column values with 0 errors.
UpdateStatus(num_rows=461, num_computed_values=461, num_excs=0, updated_cols=[], cols_with_excs=[])
Let’s take a look at the output.
frames_view.select(
frames_view.eval_yolox_tiny,
frames_view.eval_yolox_m
).show(1)
The computation of the mAP metric is now simply a query over the
evaluation output, aggregated with the mean_ap() function.
frames_view.select(
mean_ap(frames_view.eval_yolox_tiny),
mean_ap(frames_view.eval_yolox_m)
).show()
This two-step process allows you to compute mAP at every granularity:
over your entire dataset, only for specific videos, only for videos that
pass a certain filter, etc. Moreover, you can compute this metric any
time, not just during training, and use it to guide your understanding
of your dataset and how it affects the quality of your models.