Computed Columns

Section 2: Computed Columns

Welcome to Section 2 of the Pixeltable Fundamentals tutorial, Computed Columns. In the previous section, Tables and Data Operations, we learned how to create tables, populate them with data, and query and manipulate their contents. In this section, we’ll introduce one of Pixeltable’s most essential and powerful concepts: computed columns. We’ll learn how to:

Add computed columns to a table
Use computed columns for complex operations such as image processing and model inference

Next, let’s ensure the Pixeltable library is installed in your environment, along with PyTorch and the Hugging Face transformers library, which we’ll need for this tutorial section.

%pip install -qU pixeltable torch transformers

Let’s start with a simple example that illustrates the basic concepts behind computed columns. We’ll use a table of world population data for our example. Remember that you can import datasets into a Pixeltable table by using pxt.create_table() with the source parameter.

import pixeltable as pxt

pxt.create_dir('fundamentals', if_exists='ignore')
pop_t = pxt.create_table(
    'fundamentals.population',
    source='https://github.com/pixeltable/pixeltable/raw/release/docs/resources/world-population-data.csv',
    if_exists='replace'
)

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory ‘fundamentals’.
Created table ‘population’.
Inserting rows into `population`: 234 rows [00:00, 6850.71 rows/s]
Inserted 234 rows with 0 errors.

Also recall that pop_t.head() returns the first few rows of a table, and typing the table name pop_t by itself gives the schema.

pop_t.head(5)

pop_t

Now let’s suppose we want to add a new column for the year-over-year population change from 2022 to 2023. In the previous tutorial section, Tables and Data Operations, we saw how one might select() such a quantity into a Pixeltable Query, giving it the name yoy_change (year-over-year change):

pop_t.select(pop_t.country, yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).head(5)

A computed column is a way of turning such a selection into a new, permanent column of the table. Here’s how it works:

pop_t.add_computed_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))

Added 234 column values with 0 errors.
234 rows updated, 468 values computed.

As soon as the column is added, Pixeltable will (by default) automatically compute its value for all rows in the table, storing the results in the new column. If we now inspect the schema of pop_t, we see the new column and its definition.

pop_t

The new column can be queried in the usual manner.

pop_t.select(pop_t.country, pop_t.yoy_change).head(5)

The output is identical to the previous example, but now we’re retrieving the stored values from the database instead of computing them on the fly. Computed columns can be “chained” with other computed columns. Here’s an example that expresses population change as a percentage:

pop_t.add_computed_column(
    yoy_percent_change=(100 * pop_t.yoy_change / pop_t.pop_2022)
)

Added 234 column values with 0 errors.
234 rows updated, 468 values computed.

pop_t

pop_t.select(pop_t.country, pop_t.yoy_change, pop_t.yoy_percent_change).head(5)

Although computed columns appear superficially similar to Queries, there is a key difference. Because computed columns are a permanent part of the table, they will be automatically updated any time new data is added to the table. These updates will propagate through any other computed columns that are “downstream” of the new data, ensuring that the entire table is kept up to date.

In traditional data workflows, it is commonplace to recompute entire pipelines when the input dataset is changed or enlarged. In Pixeltable, by contrast, all updates are applied incrementally. When new data appear in a table or existing data are altered, Pixeltable will recompute only those rows that are dependent on the changed data.
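To make the idea of incremental updates concrete, here is a minimal plain-Python sketch. It is a toy model under stated assumptions, not Pixeltable's implementation: the `ToyTable` class, its attributes, and the sample rows are all hypothetical. It backfills a computed column once when the column is added, and afterwards evaluates only the newly inserted row.

```python
# Toy model only: a plain-Python stand-in for a table with computed columns.
# The class, its attributes, and the sample data are hypothetical.

class ToyTable:
    def __init__(self):
        self.rows = []            # each row is a dict of column name -> value
        self.computed = {}        # computed column name -> function of a row
        self.values_computed = 0  # how many computed values have been produced

    def add_computed_column(self, name, fn):
        # Register the definition, then backfill every existing row once.
        self.computed[name] = fn
        for row in self.rows:
            row[name] = fn(row)
            self.values_computed += 1

    def insert(self, **stored):
        # Only the new row is evaluated; earlier rows are left untouched.
        row = dict(stored)
        # Dicts preserve insertion order, so upstream columns are evaluated
        # before the columns that depend on them.
        for name, fn in self.computed.items():
            row[name] = fn(row)
            self.values_computed += 1
        self.rows.append(row)


table = ToyTable()
table.insert(country='A', pop_2022=100, pop_2023=110)
table.insert(country='B', pop_2022=200, pop_2023=190)

table.add_computed_column('yoy_change', lambda r: r['pop_2023'] - r['pop_2022'])
table.add_computed_column(
    'yoy_percent_change', lambda r: 100 * r['yoy_change'] / r['pop_2022']
)
backfilled = table.values_computed  # 2 rows x 2 computed columns

table.insert(country='C', pop_2022=50, pop_2023=55)
incremental = table.values_computed - backfilled  # work done for the new row only
```

In this sketch the third insert evaluates two values (one per computed column) rather than re-running all three rows, which is the behavior the paragraph above describes.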

Let’s see how this works in practice. For purposes of illustration, we’ll add an entry for California to the table, as if it were a country.

pop_t.insert(
    country='California',
    pop_2023=39110000,
    pop_2022=39030000,
)

Inserting rows into `population`: 1 rows [00:00, 228.35 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 5 values computed.

Observe that the computed columns yoy_change and yoy_percent_change have been automatically updated in response to the new data.

pop_t.tail(5)
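As a sanity check, the values for the new California row can be worked out by hand with ordinary Python arithmetic. This sketch only reuses the two population figures from the insert() call above and the two column formulas defined earlier; it does not touch the table.

```python
# Population figures copied from the insert() call above.
pop_2023 = 39_110_000
pop_2022 = 39_030_000

# Same formulas as the yoy_change and yoy_percent_change computed columns.
yoy_change = pop_2023 - pop_2022
yoy_percent_change = 100 * yoy_change / pop_2022

print(yoy_change)                    # 80000
print(round(yoy_percent_change, 3))  # 0.205
```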

Remember that all tables in Pixeltable are persistent. This includes computed columns: when you create a computed column, its definition is stored in the database. You can think of computed columns as setting up a persistent compute workflow: if you close your notebook or restart your Python instance, computed columns (along with the relationships between them, and any data contained in them) will be preserved.

Recomputing Columns

From time to time you might need to recompute the data in an existing computed column. Perhaps the code for one of your UDFs has changed, and you want to recompute a column that uses that UDF in order to pick up the new logic. Or perhaps you want to re-run a nondeterministic computation such as model inference. The command to do this is recompute_columns(). It won’t do much in the current example, because all our computations are simple and deterministic, but for demonstration purposes here’s what it looks like:

pop_t.recompute_columns(pop_t.yoy_change, pop_t.yoy_percent_change)

Inserting rows into `population`: 235 rows [00:00, 8795.92 rows/s]
235 rows updated, 940 values computed.

pop_t.tail(5)

As expected, it looks the same.

If you modify the data that a computed column depends on, Pixeltable will recompute automatically; so recompute_columns() is primarily useful when the input data remains the same, but your UDF business logic changes.
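The distinction can be illustrated with a small plain-Python sketch. It is a toy model, not Pixeltable code: the two labeling functions and the input list are hypothetical. The point is that values stored under the old logic stay as they are until something explicitly recomputes them, even though the inputs never changed.

```python
# Toy model only: stored results do not change when the function that produced
# them is edited; they change when the column is recomputed. All names here
# are hypothetical and are not part of Pixeltable.

def make_label_v1(change):
    return 'growing' if change > 0 else 'shrinking'

def make_label_v2(change):
    # Revised logic: treat zero change as its own category.
    if change == 0:
        return 'flat'
    return 'growing' if change > 0 else 'shrinking'

inputs = [80_000, 0, -1_200]                  # input data, never modified

stored = [make_label_v1(c) for c in inputs]   # values computed with the old logic
before = list(stored)                         # snapshot: still reflects v1

stored = [make_label_v2(c) for c in inputs]   # explicit recompute with the new logic
```

Here `before` and `stored` differ only for the zero-change input, which is the one case where the revised logic disagrees with the original.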

A More Complex Example: Image Processing

In the Tables and Data Operations tutorial, we saw how media data such as images can be inserted into Pixeltable tables, alongside more traditional structured data. Let’s explore another example that uses computed columns for image processing operations. In this example, we’ll create the table directly by providing a schema, rather than importing it from a CSV like before.

t = pxt.create_table('fundamentals.image_ops', {'source': pxt.Image})

Created table ‘image_ops’.

url_prefix = 'https://github.com/pixeltable/pixeltable/raw/release/docs/resources/images'
images = ['000000000139.jpg', '000000000632.jpg', '000000000872.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in images)

Inserting rows into `image_ops`: 3 rows [00:00, 1133.39 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 6 values computed.

t.collect()

What are some things we might want to do with these images? A fairly basic one is to extract metadata. Pixeltable provides the built-in UDF get_metadata(), which returns a dictionary with various metadata about the image. Let’s go ahead and make this a computed column.

“UDF” is standard terminology in databases, meaning “User-Defined Function”. Technically speaking, the get_metadata() function isn’t user-defined; it’s built in to the Pixeltable library. But we’ll consistently refer to Pixeltable functions as “UDFs” in order to clearly distinguish them from ordinary Python functions. Later in this tutorial, we’ll see how to turn (almost) any Python function into a Pixeltable UDF.

t.add_computed_column(metadata=t.source.get_metadata())
t.collect()

Added 3 column values with 0 errors.

Image operations, of course, can also return new images.

t.add_computed_column(rotated=t.source.rotate(10))

Added 3 column values with 0 errors.
3 rows updated, 3 values computed.

t.collect()

Or, perhaps we want to rotate our images and fill them in with a transparent background rather than black. We can do this by chaining image operations, adding a transparency layer before doing the rotation.

t.add_computed_column(rotated_transparent=t.source.convert('RGBA').rotate(10))
t.collect()

Added 3 column values with 0 errors.

In addition to get_metadata(), convert(), and rotate(), Pixeltable has a sizable library of other common image operations that can be used as UDFs in computed columns. For the most part, the image UDFs are analogs of the operations provided by the Pillow library (in fact, Pixeltable is just using Pillow under the covers). You can read more about the provided image (and other) UDFs in the Pixeltable SDK Documentation.

Let’s have a look at our table schema.

t

Image Detection

In addition to simple operations like rotate() and convert(), the Pixeltable API includes UDFs for various off-the-shelf image models. Let’s look at one example: object detection using the DETR model with a ResNet-50 backbone. Model inference is a UDF too, and it can be inserted into a computed column like any other. This one may take a little more time to compute, since it involves first downloading the model weights (if they aren’t already cached), then running inference on the images in our table.

from pixeltable.functions.huggingface import detr_for_object_detection

t.add_computed_column(detections=detr_for_object_detection(
    t.source,
    model_id='facebook/detr-resnet-50',
    threshold=0.8
))

Added 3 column values with 0 errors.
3 rows updated, 3 values computed.

t.select(t.source, t.detections).collect()

It’s great that the DETR model gave us so much information about the images, but it’s not exactly in human-readable form. Those are JSON structures that encode bounding boxes, confidence scores, and categories for each detected object. Let’s do something more useful with them: we’ll use Pixeltable’s draw_bounding_boxes() API to superimpose bounding boxes on the images, using different colors to distinguish different object categories.
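Before drawing anything, it can help to see how such a structure might be inspected in plain Python. The sample below is hypothetical: `boxes` and `label_text` are the field names used later in this tutorial, while the `scores` key and every number are assumptions made for illustration, not output copied from the model.

```python
from collections import Counter

# Hypothetical sample shaped like the description above. `boxes` and
# `label_text` appear later in this tutorial; `scores` and all values
# are assumed for illustration.
detection = {
    'boxes': [
        [10.0, 20.0, 110.0, 220.0],
        [150.0, 40.0, 300.0, 200.0],
        [5.0, 5.0, 50.0, 60.0],
    ],
    'scores': [0.98, 0.91, 0.85],
    'label_text': ['person', 'chair', 'person'],
}

# Tally detected objects per category.
counts = Counter(detection['label_text'])

# Pair each label with its score and box, highest confidence first.
ranked = sorted(
    zip(detection['label_text'], detection['scores'], detection['boxes']),
    key=lambda item: item[1],
    reverse=True,
)

for label, score, box in ranked:
    print(f'{label:<8} {score:.2f} {box}')
```

The lists are parallel: the i-th box, score, and label all describe the same detected object, which is why they can be zipped together.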

from pixeltable.functions.vision import draw_bounding_boxes

t.add_computed_column(image_with_bb=draw_bounding_boxes(
    t.source, t.detections.boxes, t.detections.label_text, fill=True
))
t.select(t.source, t.image_with_bb).collect()

Added 3 column values with 0 errors.

It can be a little hard to see what’s going on, so let’s zoom in on just one image. If you select a single image in a notebook, Pixeltable will enlarge its display:

t.select(t.image_with_bb).head(1)

Let’s check in on our schema. We now have five computed columns, all derived from the single source column.

t

And as always, when we add new data to the table, its computed columns are updated automatically. Let’s try this on a few more images.

more_images = ['000000000108.jpg', '000000000885.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in more_images)

Inserting rows into `image_ops`: 2 rows [00:00, 944.77 rows/s]
Inserted 2 rows with 0 errors.
2 rows inserted, 14 values computed.

t.select(t.source, t.image_with_bb, t.detections.label_text, t.metadata).tail(2)

It bears repeating that Pixeltable is persistent! Anything you put into a table, including computed columns, will be saved in persistent storage. This includes inference outputs such as t.detections, as well as generated images such as t.image_with_bb. (Later we’ll see how to tune this behavior in cases where it might be undesirable to store everything, but the default behavior is that computed column output is always persisted.)

Expressions

Let’s have a closer look at that call to draw_bounding_boxes() in the last example.

draw_bounding_boxes(t.source, t.detections.boxes, t.detections.label_text, fill=True)

There are a couple of things going on. draw_bounding_boxes() is, of course, a UDF, and its first argument is a column reference of the sort we’ve used many times now: t.source, the source image. The other two arguments are more than simple column references, though: they’re compound expressions that include the column reference t.detections along with a suffix (.boxes or .label_text) that tells Pixeltable to look inside the dictionary stored in t.detections. These are all examples of Pixeltable expressions. In fact, we’ve seen other types of Pixeltable expressions as well, without explicitly calling them out:

Calls to a UDF are expressions, such as t.source.rotate(10), or the draw_bounding_boxes() example above;
Arithmetic operations are expressions, such as the year-over-year percentage calculation in our first example: 100 * pop_t.yoy_change / pop_t.pop_2022.
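One way to build intuition for why an arithmetic formula can act as a column definition is to look at how Python operator overloading produces a description of a computation instead of a number. The sketch below is a toy model, not Pixeltable's implementation; the `Expr`, `Const`, `Col`, and `BinOp` classes and the `wrap` helper are all hypothetical.

```python
import operator

# Toy model only: these classes are hypothetical and exist just to show
# deferred evaluation through operator overloading.

class Expr:
    def __sub__(self, other):
        return BinOp(operator.sub, '-', self, wrap(other))

    def __mul__(self, other):
        return BinOp(operator.mul, '*', self, wrap(other))

    def __rmul__(self, other):
        # Handles `100 * expr`, where the left operand is a plain number.
        return BinOp(operator.mul, '*', wrap(other), self)

    def __truediv__(self, other):
        return BinOp(operator.truediv, '/', self, wrap(other))


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def eval(self, row):
        return self.value

    def __repr__(self):
        return repr(self.value)


class Col(Expr):
    def __init__(self, name):
        self.name = name

    def eval(self, row):
        return row[self.name]

    def __repr__(self):
        return self.name


class BinOp(Expr):
    def __init__(self, fn, symbol, left, right):
        self.fn, self.symbol, self.left, self.right = fn, symbol, left, right

    def eval(self, row):
        return self.fn(self.left.eval(row), self.right.eval(row))

    def __repr__(self):
        return f'({self.left} {self.symbol} {self.right})'


def wrap(value):
    # Promote plain Python values to constant expressions.
    return value if isinstance(value, Expr) else Const(value)


pop_2022, pop_2023 = Col('pop_2022'), Col('pop_2023')

# Builds an expression tree; no arithmetic happens on this line.
percent = 100 * (pop_2023 - pop_2022) / pop_2022

# The tree is evaluated later, once per row.
result = percent.eval({'pop_2022': 200, 'pop_2023': 210})
print(percent, '->', result)
```

Because the formula is kept as a tree rather than a value, it can be stored alongside the table and applied to rows that do not exist yet, which is the property computed columns rely on.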

In the next tutorial section, we’ll learn more about the various types of Pixeltable expressions and how to use them:

Queries and Expressions
