Section 2: Computed Columns
Welcome to Section 2 of the Pixeltable Fundamentals tutorial, Computed Columns. In the previous section, Tables and Data Operations, we learned how to create tables, populate them with data, and query and manipulate their contents. In this section, we’ll introduce one of Pixeltable’s most essential and powerful concepts: computed columns. We’ll learn how to:
- Add computed columns to a table
- Use computed columns for complex operations such as image processing and model inference
First, make sure Pixeltable is installed, along with the transformers library, which
we’ll need for this tutorial section.
Computed Columns
Let’s start with a simple example that illustrates the basic concepts behind computed columns. We’ll use a table of world population data for our example. Remember that you can import datasets into a Pixeltable table by using pxt.create_table() with the source parameter.
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory ‘fundamentals’.
Created table ‘population’.
Inserting rows into `population`: 234 rows [00:00, 6850.71 rows/s]
Inserted 234 rows with 0 errors.
Also recall that pop_t.head() returns the first few rows of a table,
and typing the table name pop_t by itself gives the schema.
Suppose we want to know how much each population changed from one year
to the next. We could select() such a quantity into a Pixeltable
Query, giving it the name yoy_change (year-over-year change):
Added 234 column values with 0 errors.
234 rows updated, 468 values computed.
As soon as the column is added, Pixeltable will (by default)
automatically compute its value for all rows in the table, storing the
results in the new column. If we now inspect the schema of pop_t, we
see the new column and its definition.
Added 234 column values with 0 errors.
234 rows updated, 468 values computed.
In traditional data workflows, it is commonplace
to recompute entire pipelines when the input dataset is changed or
enlarged. In Pixeltable, by contrast, all updates are applied
incrementally. When new data appear in a table or existing data are
altered, Pixeltable will recompute only those rows that are dependent on
the changed data.
Inserting rows into `population`: 1 rows [00:00, 228.35 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 5 values computed.
Observe that the computed columns yoy_change and yoy_percent_change
have been automatically updated in response to the new data.
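To make the incremental behavior concrete, here is a minimal plain-Python sketch of the idea. It is a conceptual model only, not Pixeltable's implementation: the MiniTable class, its method signatures, the pop_2023 column, the yoy_percent_change name, and all figures are assumptions made for illustration.

```python
class MiniTable:
    """Toy in-memory table; a conceptual stand-in, not Pixeltable itself."""

    def __init__(self):
        self.rows = []            # each row is a dict: column name -> value
        self.computed = {}        # computed column name -> function(row)
        self.values_computed = 0  # how many computed values were evaluated

    def add_computed_column(self, name, fn):
        # Store the definition, then backfill every existing row once.
        self.computed[name] = fn
        for row in self.rows:
            row[name] = fn(row)
            self.values_computed += 1

    def insert(self, **stored):
        # Only the new row is evaluated; existing rows are left untouched.
        row = dict(stored)
        for name, fn in self.computed.items():
            row[name] = fn(row)
            self.values_computed += 1
        self.rows.append(row)


# Hypothetical figures, for illustration only.
t = MiniTable()
t.insert(country='A', pop_2022=100, pop_2023=110)
t.insert(country='B', pop_2022=200, pop_2023=190)

t.add_computed_column('yoy_change', lambda r: r['pop_2023'] - r['pop_2022'])
t.add_computed_column(
    'yoy_percent_change', lambda r: 100 * r['yoy_change'] / r['pop_2022']
)
backfilled = t.values_computed  # 2 existing rows x 2 computed columns

t.insert(country='C', pop_2022=50, pop_2023=55)
incremental = t.values_computed - backfilled  # work done for the new row only
```

In this model, adding the columns evaluates every existing row, while the later insert evaluates just the two computed values belonging to the new row.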
Remember that all tables in Pixeltable are
persistent. This includes computed columns: when you create a
computed column, its definition is stored in the database. You can think
of computed columns as setting up a persistent compute workflow: if you
close your notebook or restart your Python instance, computed columns
(along with the relationships between them, and any data contained in
them) will be preserved.
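One way to picture a persistent compute workflow is to store each column's definition next to the data, so that a fresh session can pick up where the last one left off. The sketch below does this with a JSON file. The file layout, the tiny definition format, and the pop_2023 column are all hypothetical; Pixeltable keeps its state in its own database, and this is only an analogy.

```python
import json
import os
import tempfile

# Operations allowed in stored definitions (hypothetical mini-format).
OPS = {'sub': lambda a, b: a - b}


def evaluate(definition, row):
    """Evaluate a stored definition such as ['sub', 'pop_2023', 'pop_2022']."""
    op, left, right = definition
    return OPS[op](row[left], row[right])


def save(path, rows, definitions):
    with open(path, 'w') as f:
        json.dump({'rows': rows, 'definitions': definitions}, f)


def load(path):
    with open(path) as f:
        state = json.load(f)
    return state['rows'], state['definitions']


path = os.path.join(tempfile.mkdtemp(), 'population.json')

# "Session 1": define a computed column, then persist data and definition.
definitions = {'yoy_change': ['sub', 'pop_2023', 'pop_2022']}
rows = [{'pop_2022': 100, 'pop_2023': 110}]
for row in rows:
    for name, definition in definitions.items():
        row[name] = evaluate(definition, row)
save(path, rows, definitions)

# "Session 2": nothing is left in memory; reload and keep working.
rows2, definitions2 = load(path)
new_row = {'pop_2022': 50, 'pop_2023': 55}
for name, definition in definitions2.items():
    new_row[name] = evaluate(definition, new_row)
rows2.append(new_row)
```

Because the definition survives along with the data, the second "session" can compute yoy_change for a new row without redefining the column.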
Recomputing Columns
From time to time you might need to recompute the data in an existing computed column. Perhaps the code for one of your UDFs has changed, and you want to recompute a column that uses that UDF in order to pick up the new logic. Or perhaps you want to re-run a nondeterministic computation such as model inference. The command to do this is recompute_columns(). It won’t do much in the current example, because
all our computations are simple and deterministic, but for demonstration
purposes here’s what it looks like:
Inserting rows into `population`: 235 rows [00:00, 8795.92 rows/s]
235 rows updated, 940 values computed.
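The case where a recompute matters, changed logic over unchanged inputs, can be modeled in a few lines of plain Python. This is a sketch under stated assumptions: CachedColumn and the two label functions are hypothetical names, and the model says nothing about how Pixeltable tracks UDF versions internally.

```python
class CachedColumn:
    """Toy model of a stored computed column (not Pixeltable's implementation)."""

    def __init__(self, inputs, fn):
        self.inputs = list(inputs)
        self.fn = fn
        self.values = [fn(x) for x in self.inputs]  # computed once, then stored

    def recompute(self):
        # Re-run the current function over the unchanged inputs.
        self.values = [self.fn(x) for x in self.inputs]


def label_v1(pop_change):
    return 'growing' if pop_change > 0 else 'shrinking'


def label_v2(pop_change):
    # Revised logic: zero change gets its own label.
    if pop_change == 0:
        return 'flat'
    return 'growing' if pop_change > 0 else 'shrinking'


col = CachedColumn([10, 0, -5], label_v1)
before = list(col.values)

col.fn = label_v2         # the logic changes, the inputs do not
stale = list(col.values)  # stored values still reflect the old logic
col.recompute()
after = list(col.values)  # now consistent with the new logic
```

Until recompute() runs, the stored values stay as they were, which is why an explicit recompute step is needed when only the function changes.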
If you modify the data that a computed
column depends on, Pixeltable will recompute automatically; so
recompute_columns() is primarily useful when the input data
remains the same, but your UDF business logic changes.
A More Complex Example: Image Processing
In the Tables and Data Operations tutorial, we saw how media data such as images can be inserted into Pixeltable tables, alongside more traditional structured data. Let’s explore another example that uses computed columns for image processing operations. In this example, we’ll create the table directly by providing a schema, rather than importing it from a CSV like before.
Created table ‘image_ops’.
Inserting rows into `image_ops`: 3 rows [00:00, 1133.39 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 6 values computed.
One available image operation is
get_metadata(), which returns a dictionary with various metadata about
the image. Let’s go ahead and make this a computed column.
“UDF” is standard terminology in databases,
meaning “User-Defined Function”. Technically speaking, the
get_metadata() function isn’t user-defined; it’s built in
to the Pixeltable library. But we’ll consistently refer to Pixeltable
functions as “UDFs” in order to clearly distinguish them from ordinary
Python functions. Later in this tutorial, we’ll see how to turn (almost)
any Python function into a Pixeltable UDF.
Added 3 column values with 0 errors.
Image operations, of course, can also return new images.
Added 3 column values with 0 errors.
3 rows updated, 3 values computed.
Added 3 column values with 0 errors.
In addition to get_metadata(), convert(), and rotate(), Pixeltable has a
sizable library of other common image operations that can be used as
UDFs in computed columns. For the most part, the image UDFs are analogs
of the operations provided by the Pillow library
(in fact, Pixeltable is just using Pillow under the covers). You can
read more about the provided image (and other) UDFs in the
Pixeltable SDK Documentation.
Image Detection
In addition to simple operations like rotate() and convert(), the
Pixeltable API includes UDFs for various off-the-shelf image models.
Let’s look at one example: object detection using the ResNet-50 model.
Model inference is a UDF too, and it can be inserted into a computed
column like any other.
This one may take a little more time to compute, since it involves first
downloading the ResNet-50 model (if it isn’t already cached), then
running inference on the images in our table.
Added 3 column values with 0 errors.
3 rows updated, 3 values computed.
We can use the draw_bounding_boxes() API to superimpose
bounding boxes on the images, using different colors to distinguish
different object categories.
Added 3 column values with 0 errors.
It can be a little hard to see what’s going on, so let’s zoom in on just
one image. If you select a single image in a notebook, Pixeltable will
enlarge its display:
Inserting rows into `image_ops`: 2 rows [00:00, 944.77 rows/s]
Inserted 2 rows with 0 errors.
2 rows inserted, 14 values computed.
It bears repeating that Pixeltable is
persistent! Anything you put into a table, including computed
columns, will be saved in persistent storage. This includes inference
outputs such as t.detections, as well as generated images
such as t.image_with_bb. (Later we’ll see how to tune this
behavior in cases where it might be undesirable to store
everything, but the default behavior is that computed column
output is always persisted.)
Expressions
Let’s have a closer look at that call to draw_bounding_boxes() in the
last example.
draw_bounding_boxes() is, of
course, a UDF, and its first argument is a column reference of the sort
we’ve used many times now: t.source, the source image. The other two
arguments are more than simple column references, though: they’re
compound expressions that include the column reference t.detections
along with a suffix (.boxes or .label_text) that tells Pixeltable to
look inside the dictionary stored in t.detections.
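To see how a suffix such as .boxes can mean "look inside the dictionary", here is a small plain-Python sketch of a deferred column reference. The ColumnRef class and its evaluate() method are hypothetical, and the row contents are made up; this illustrates the general technique, not Pixeltable's internals.

```python
class ColumnRef:
    """Deferred reference to a column, optionally with a path into its value."""

    def __init__(self, column, path=()):
        self._column = column
        self._path = tuple(path)

    def __getattr__(self, key):
        # Invoked only for attributes not found normally, e.g. `.boxes`.
        if key.startswith('_'):
            raise AttributeError(key)
        # Nothing is looked up yet; we just record one more path step.
        return ColumnRef(self._column, self._path + (key,))

    def evaluate(self, row):
        # Resolve the recorded path against one concrete row.
        value = row[self._column]
        for key in self._path:
            value = value[key]
        return value


detections = ColumnRef('detections')
boxes = detections.boxes        # still an expression, not data
labels = detections.label_text

# Hypothetical row; the keys mirror the suffixes named above.
row = {'detections': {'boxes': [[10, 20, 110, 220]], 'label_text': ['cat']}}
boxes_value = boxes.evaluate(row)
labels_value = labels.evaluate(row)
```

The point of the design is that detections.boxes describes where to look, and the lookup itself happens later, once per row.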
These are all examples of Pixeltable expressions. In fact, we’ve
seen other types of Pixeltable expressions as well, without explicitly
calling them out:
- Calls to a UDF are expressions, such as t.source.rotate(10), or the draw_bounding_boxes() example above;
- Arithmetic operations are expressions, such as the year-over-year
calculation in our first example:
100 * pop_t.yoy_change / pop_t.pop_2022.
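Arithmetic like the formula above can yield an expression object instead of a number when the operands overload Python's operators. The following sketch shows that general technique with hypothetical Col, Const, and BinOp classes; it is an illustration under those assumptions and does not reproduce Pixeltable's own expression classes.

```python
import operator


class Expr:
    """Base class: arithmetic on expressions builds a new expression."""

    def __sub__(self, other):
        return BinOp('-', self, wrap(other))

    def __mul__(self, other):
        return BinOp('*', self, wrap(other))

    def __rmul__(self, other):
        # Handles a plain number on the left, as in `100 * column`.
        return BinOp('*', wrap(other), self)

    def __truediv__(self, other):
        return BinOp('/', self, wrap(other))


class Col(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, row):
        return row[self.name]

    def __repr__(self):
        return self.name


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self, row):
        return self.value

    def __repr__(self):
        return repr(self.value)


class BinOp(Expr):
    OPS = {'-': operator.sub, '*': operator.mul, '/': operator.truediv}

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, row):
        return self.OPS[self.op](self.left.evaluate(row), self.right.evaluate(row))

    def __repr__(self):
        return f'({self.left!r} {self.op} {self.right!r})'


def wrap(value):
    """Turn plain Python values into constant expressions."""
    return value if isinstance(value, Expr) else Const(value)


# Building the expression performs no arithmetic on data.
expr = 100 * Col('yoy_change') / Col('pop_2022')

# Evaluation happens per row (hypothetical figures).
result = expr.evaluate({'yoy_change': 5, 'pop_2022': 50})
```

Here expr is a small tree describing the calculation, which is what allows the same definition to be stored and applied to every row, including rows inserted later.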