This tutorial demonstrates how to integrate Pixeltable with Label
Studio to manage annotation data seamlessly across the annotation
workflow. We’ll assume that you’re at least somewhat familiar with
Pixeltable and have read the Pixeltable Basics tutorial.
This tutorial can only be run in a local Pixeltable installation, not
in Colab or Kaggle, since it relies on spinning up a locally running
Label Studio instance. See the Installation Guide for instructions on
how to set up a local Pixeltable instance.
To begin, let’s ensure the requisite dependencies are installed.
%pip install -qU pixeltable label-studio label-studio-sdk torch transformers
Set up Label Studio
Now let’s spin up a Label Studio server process. (If you’re already
running Label Studio, you can choose to skip this step, and instead
enter your existing Label Studio URL and access token in the subsequent
step.) Be patient, as it may take a minute or two to start.
This will open a new browser window containing the Label Studio
interface. If you’ve never run Label Studio before, you’ll need to
create an account; a link to create one will appear in the Label Studio
browser window. Everything is running locally in this tutorial, so the
account will exist only on your local system.
import subprocess
ls_process = subprocess.Popen(['label-studio'], stderr=subprocess.PIPE)
Performing system checks…
System check identified no issues (1 silenced).
August 14, 2024 - 04:24:46
Django version 3.2.25, using settings 'label_studio.core.settings.label_studio'
Starting development server at http://0.0.0.0:8080/
Quit the server with CONTROL-C.
If for some reason the Label Studio browser window failed to open, you
can always access it at: http://localhost:8080/
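Because startup can take a minute or two, it can help to poll the server URL before moving on. The following is a minimal sketch using only the Python standard library; `wait_for_server` is a hypothetical helper written for this illustration, not part of Pixeltable or Label Studio, and the default timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers an HTTP request or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet (for example, connection refused); retry shortly.
            time.sleep(interval)
    return False
```

In this tutorial's setup, a call such as `wait_for_server('http://localhost:8080/')` would return `True` once the Label Studio server responds, or `False` if it has not responded within the timeout.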
Once you’ve created an account in Label Studio, you’ll need to locate
your API key. In the Label Studio browser window, log in, and click on
“Account & Settings” in the top right. Copy the Access Token from the
interface.
Next, we configure Pixeltable to communicate with Label Studio. Run the
following command, pasting in the API key that you copied from the Label
Studio interface.
import getpass
import os
if 'LABEL_STUDIO_URL' not in os.environ:
    os.environ['LABEL_STUDIO_URL'] = 'http://localhost:8080/'
if 'LABEL_STUDIO_API_KEY' not in os.environ:
    os.environ['LABEL_STUDIO_API_KEY'] = getpass.getpass('Label Studio API key: ')
Label Studio API key: ········
Create a Table to Store Videos
Now we create the master table that will hold our videos to be
annotated. This only needs to be done once, when we initially set up the
workflow.
import pixeltable as pxt
schema = {
    'video': pxt.Video,
    'date': pxt.Timestamp
}
# Before creating the table, we drop the `ls_demo` dir and all its contents,
# in order to ensure a clean environment for the demo.
pxt.drop_dir('ls_demo', force=True)
pxt.create_dir('ls_demo')
videos_table = pxt.create_table('ls_demo.videos', schema)
Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `ls_demo`.
Created table `videos`.
Populate It with Data
Now let’s add some videos to the table to populate it. For this
tutorial, we’ll use some randomly selected videos from the Multimedia
Commons archive. The table also contains a date field, for which we’ll
use a fixed date (but in a production setting, it would typically be the
date on which the video was imported).
from datetime import datetime
url_prefix = 'http://multimedia-commons.s3-website-us-west-2.amazonaws.com/data/videos/mp4/'
files = [
    '122/8ff/1228ff94bf742242ee7c88e4769ad5d5.mp4',
    '2cf/a20/2cfa205eae979b31b1144abd9fa4e521.mp4',
    'ffe/ff3/ffeff3c6bf57504e7a6cecaff6aefbc9.mp4',
]
today = datetime(2024, 4, 22)
videos_table.insert({'video': url_prefix + file, 'date': today} for file in files)
Inserting rows into `videos`: 3 rows [00:00, 993.05 rows/s]
Inserted 3 rows with 0 errors.
UpdateStatus(num_rows=3, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])
Let’s have a look at the table now.
videos_table.head()
Create a Label Studio Project
Next we’ll create a new Label Studio project and link it to a new view
on the Pixeltable table. You can link a Label Studio project to either a
table or a view. For tables that are expecting a lot of input data, it’s
often easier to link to views. In this example, we’ll create a view that
filters the table down by date.
# Create a view to filter on the specified date
v = pxt.create_view(
    'ls_demo.videos_2024_04_22',
    videos_table.where(videos_table.date == today)
)
# Create a new Label Studio project and link it to the view. The
# configuration uses Label Studio's standard XML format. This only
# needs to be done once: after the view and project are linked,
# the relationship is stored indefinitely in Pixeltable's metadata.
label_config = '''
<View>
  <Video name="video" value="$video"/>
  <Choices name="video-category" toName="video" showInLine="true">
    <Choice value="city"/>
    <Choice value="food"/>
    <Choice value="sports"/>
  </Choices>
</View>
'''
pxt.io.create_label_studio_project(v, label_config)
Inserting rows into `videos_2024_04_22`: 3 rows [00:00, 1864.69 rows/s]
Created view `videos_2024_04_22` with 3 rows, 0 exceptions.
Added 3 column values with 0 errors.
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 857.44 cells/s]
Linked external store `ls_project_0` to table `videos_2024_04_22`.
Created 3 new task(s) in LabelStudioProject `videos_2024_04_22`.
SyncStatus(external_rows_created=3, external_rows_deleted=0, external_rows_updated=0, pxt_rows_updated=0, num_excs=0)
If you look in the Label Studio UI now, you’ll see that there’s a new
project with the name videos_2024_04_22, with three tasks, one for
each of the videos in the view. If you want to create the project
without populating it with tasks (yet), you can set
sync_immediately=False in the call to create_label_studio_project().
You can always sync the table and project by calling v.sync().
Note also that we didn’t have to specify an explicit mapping between
Pixeltable columns and Label Studio data fields. This is because, by
default, Pixeltable assumes the Pixeltable and Label Studio field names
coincide. The data field in the Label Studio project has the name
$video, which Pixeltable maps, by default, to the column in
ls_demo.videos_2024_04_22 that is also called video. If you want to
override this behavior to specify an explicit mapping of columns to
fields, you can do that with the col_mapping parameter of
create_label_studio_project().
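To make the default name matching concrete, here is a small sketch that lists the `$field` references in a labeling configuration and pairs them with same-named columns. It uses only the standard library and illustrates the idea; it is not Pixeltable's actual implementation, and `data_fields` and `default_col_mapping` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

label_config = '''
<View>
  <Video name="video" value="$video"/>
  <Choices name="video-category" toName="video" showInLine="true">
    <Choice value="city"/>
    <Choice value="food"/>
    <Choice value="sports"/>
  </Choices>
</View>
'''


def data_fields(config_xml: str) -> list[str]:
    """Return the names of all `$field` references in a labeling config."""
    root = ET.fromstring(config_xml.strip())
    fields = []
    for element in root.iter():
        value = element.get('value', '')
        if value.startswith('$'):
            fields.append(value[1:])  # drop the leading '$'
    return fields


def default_col_mapping(config_xml: str, columns: list[str]) -> dict[str, str]:
    """Pair each column with the data field of the same name, if one exists."""
    fields = set(data_fields(config_xml))
    return {col: col for col in columns if col in fields}


print(data_fields(label_config))                             # ['video']
print(default_col_mapping(label_config, ['video', 'date']))  # {'video': 'video'}
```

Here the `date` column is left out because the configuration has no `$date` field, while `video` matches by name.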
Inspecting the view, we also see that Pixeltable created an additional
column on the view, annotations, which will hold the output of our
annotations workflow. The name of the output column can also be
overridden by specifying a dict entry in col_mapping of the form
{'my_col_name': 'annotations'}.
Add Some Annotations
Now, let’s add some annotations to our Label Studio project to simulate
a human-in-the-loop workflow. In the Label Studio UI, click on the new
videos_2024_04_22 project, and click on any of the three tasks. Select
the appropriate category (“city”, “food”, or “sports”), and click
“Submit”.
Import the Annotations Back to Pixeltable
Now let’s try importing annotations from Label Studio back to our view.
v = pxt.get_table('ls_demo.videos_2024_04_22')
v.sync()
Created 0 new task(s) in LabelStudioProject `videos_2024_04_22`.
Updated annotation(s) from 1 task(s) in LabelStudioProject `videos_2024_04_22`.
SyncStatus(external_rows_created=0, external_rows_deleted=0, external_rows_updated=0, pxt_rows_updated=1, num_excs=0)
Let’s see what effect that had. You’ll see that any videos that you
annotated now have their annotations field populated in the view.
v.select(v.video, v.annotations).head()
Parse Annotations with a Computed Column
Pixeltable pulls in all sorts of metadata from Label Studio during a
sync: everything that Label Studio reports back about the annotations,
including things like the user account that created the annotations.
Let’s say that all we care about is the annotation value. We can add a
computed column to our table to pull it out.
v.add_computed_column(
    video_category=v.annotations[0].result[0].value.choices[0]
)
v.select(v.video, v.annotations, v.video_category).head()
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 394.63 cells/s]
Added 3 column values with 0 errors.
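The path `annotations[0].result[0].value.choices[0]` is easier to read next to the JSON it walks. The sketch below shows a trimmed, hypothetical annotations value in the general shape Label Studio uses for a `Choices` result, and the equivalent lookup in plain Python. The sample values are invented for illustration, and returning `None` for rows without annotations is a choice made for this sketch.

```python
# Hypothetical `annotations` value for one row, trimmed to the relevant fields.
annotations = [
    {
        'id': 1,
        'completed_by': 1,
        'result': [
            {
                'from_name': 'video-category',
                'to_name': 'video',
                'type': 'choices',
                'value': {'choices': ['sports']},
            }
        ],
    }
]


def first_choice(annotations):
    """Return the first selected choice, or None if the row is not annotated."""
    try:
        return annotations[0]['result'][0]['value']['choices'][0]
    except (TypeError, IndexError, KeyError):
        return None


print(first_choice(annotations))  # sports
print(first_choice(None))         # None
```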
Another useful operation is the get_metadata function, which returns
information about the video itself, such as the resolution and codec
(independent of Label Studio). Let’s add another computed column to hold
such metadata.
from pixeltable.functions.video import get_metadata
v.add_computed_column(video_metadata=get_metadata(v.video))
v.select(v.video, v.annotations, v.video_category, v.video_metadata).head()
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 138.97 cells/s]
Added 3 column values with 0 errors.
Preannotations with Pixeltable and Label Studio
Frame extraction is another common operation in labeling workflows. In
this example, we’ll extract frames from our videos into a view, then use
an object detection model to generate preannotations for each frame. The
following code uses a Pixeltable FrameIterator to automatically
extract frames into a new view, which we’ll call frames_2024_04_22.
from datetime import datetime
from pixeltable.iterators import FrameIterator
today = datetime(2024, 4, 22)
videos_table = pxt.get_table('ls_demo.videos')
# Create the view, using a `FrameIterator` to extract frames with a sample rate
# of `fps=0.25`, or 1 frame per 4 seconds of video. Setting `fps=0` would use the
# native framerate of the video, extracting every frame.
frames = pxt.create_view(
    'ls_demo.frames_2024_04_22',
    videos_table.where(videos_table.date == today),
    iterator=FrameIterator.create(video=videos_table.video, fps=0.25)
)
Inserting rows into `frames_2024_04_22`: 13 rows [00:00, 5434.66 rows/s]
Created view `frames_2024_04_22` with 13 rows, 0 exceptions.
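As a rough guide to how many frames a given sample rate yields, the arithmetic can be sketched as follows. This assumes one sample every `1/fps` seconds starting at time 0, which is a simplification; the frames that `FrameIterator` actually selects may differ. `sample_timestamps` is a hypothetical helper, and the 18-second duration is an invented example.

```python
import math


def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Timestamps (in seconds) sampled every 1/fps seconds, starting at 0."""
    if fps <= 0:
        raise ValueError('fps must be positive in this sketch')
    num_samples = math.ceil(duration_s * fps)
    return [i / fps for i in range(num_samples)]


# An 18-second clip sampled at fps=0.25 gives one frame every 4 seconds.
print(sample_timestamps(18.0, 0.25))  # [0.0, 4.0, 8.0, 12.0, 16.0]
```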
# Show just the first 3 frames in the table, to avoid cluttering the notebook
frames.select(frames.frame).head(3)
Now we’ll use the DETR object detection model (with a ResNet-50
backbone) to generate preannotations. We do this by creating a new
computed column.
from pixeltable.functions.huggingface import detr_for_object_detection
# Run the DETR object detection model (ResNet-50 backbone) against each frame to generate bounding boxes
frames.add_computed_column(
    detections=detr_for_object_detection(
        frames.frame,
        model_id='facebook/detr-resnet-50',
        threshold=0.95
    )
)
frames.select(frames.frame, frames.detections).head(3)
Computing cells: 100%|██████████████████████████████████████████| 13/13 [00:06<00:00, 2.10 cells/s]
Added 13 column values with 0 errors.
We’d like to send these detections to Label Studio as preannotations,
but they’re not quite ready. Label Studio expects preannotations in
standard COCO format, but the Hugging Face library outputs them in its
own custom format. We can use Pixeltable’s handy detr_to_coco function
to do the conversion, using another computed column.
from pixeltable.functions.huggingface import detr_to_coco
frames.add_computed_column(
    preannotations=detr_to_coco(frames.frame, frames.detections)
)
frames.select(frames.frame, frames.detections, frames.preannotations).head(3)
Computing cells: 100%|██████████████████████████████████████████| 13/13 [00:00<00:00, 42.79 cells/s]
Added 13 column values with 0 errors.
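The heart of such a conversion is turning corner-style boxes into COCO's `[x, y, width, height]` form. The following is a minimal sketch of that step in plain Python. It assumes the detections store boxes as `[x_min, y_min, x_max, y_max]` alongside a parallel `labels` list; the dictionary layouts, sample values, and the name `corners_to_coco` are assumptions made for illustration, not a statement of what `detr_to_coco` does internally.

```python
# Hypothetical detection output for one 640x360 frame.
detections = {
    'scores': [0.99],
    'labels': [3],
    'boxes': [[10.0, 20.0, 110.0, 70.0]],  # [x_min, y_min, x_max, y_max]
}


def corners_to_coco(detections: dict, width: int, height: int) -> dict:
    """Convert corner-style boxes to COCO-style [x, y, width, height] boxes."""
    annotations = [
        {'bbox': [x1, y1, x2 - x1, y2 - y1], 'category': label}
        for (x1, y1, x2, y2), label in zip(detections['boxes'], detections['labels'])
    ]
    return {'image': {'width': width, 'height': height}, 'annotations': annotations}


result = corners_to_coco(detections, width=640, height=360)
print(result['annotations'][0]['bbox'])  # [10.0, 20.0, 100.0, 50.0]
```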
Create a Label Studio Project for Frames
With our data workflow set up and the COCO preannotations prepared, all
that’s left is to create a corresponding Label Studio project. Note how
Pixeltable automatically maps RectangleLabels preannotation fields to
columns, just like it does with data fields. Here, Pixeltable interprets
the name="preannotations" attribute in RectangleLabels to mean, “map
these rectangle labels to the preannotations column in my linked table
or view”.
The Label values car, person, and train are standard COCO object
identifiers used by many off-the-shelf object detection models. You can
find the complete list of them here, and include as many as you wish:
https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/coco-categories.csv
frames_config = '''
<View>
  <Image name="frame" value="$frame"/>
  <RectangleLabels name="preannotations" toName="frame">
    <Label value="car" background="blue"/>
    <Label value="person" background="red"/>
    <Label value="train" background="green"/>
  </RectangleLabels>
</View>
'''
pxt.io.create_label_studio_project(frames, frames_config)
Added 13 column values with 0 errors.
Computing cells: 100%|██████████████████████████████████████████| 13/13 [00:00<00:00, 42.09 cells/s]
Linked external store `ls_project_0` to table `frames_2024_04_22`.
Created 13 new task(s) in LabelStudioProject `frames_2024_04_22`.
SyncStatus(external_rows_created=13, external_rows_deleted=0, external_rows_updated=0, pxt_rows_updated=0, num_excs=0)
If you go into Label Studio and open up the new project, you can see the
effect of adding the DETR preannotations to our workflow.
Incremental Updates
As we saw in the Pixeltable
Basics
tutorial, adding new data to Pixeltable results in incremental updates
of everything downstream. We can see this by inserting a new video into
our base videos table: all of the downstream views and computed columns
are updated automatically, including the video metadata, frames, and
preannotations.
The update may take some time, so please be patient (it involves a
sequence of operations, including frame extraction and object
detection).
videos_table.insert(
    video=url_prefix + '22a/948/22a9487a92956ac453a9c15e0fc4dd4.mp4',
    date=today
)
Inserting rows into `videos`: 1 rows [00:00, 808.31 rows/s]
Inserting rows into `videos_2024_04_22`: 1 rows [00:00, 849.57 rows/s]
Inserting rows into `frames_2024_04_22`: 5 rows [00:00, 3225.89 rows/s]
Inserted 7 rows with 0 errors.
UpdateStatus(num_rows=7, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])
Note that incremental updates do not automatically sync the tables
with the remote Label Studio projects. To issue a sync, we have to
call the sync() method on each view separately. Tasks will be created
only for the newly added rows in the videos and frames views, not the
existing ones.
v.sync()
frames.sync()
Created 1 new task(s) in LabelStudioProject `videos_2024_04_22`.
Created 5 new task(s) in LabelStudioProject `frames_2024_04_22`.
SyncStatus(external_rows_created=5, external_rows_deleted=0, external_rows_updated=0, pxt_rows_updated=0, num_excs=0)
Deleting a Project
To remove a Label Studio project from a table or view, use
unlink_external_stores(), as demonstrated by the following example. If
you specify delete_external_data=True, then the Label Studio project
will also be deleted, along with all existing data and annotations (be
careful!). If delete_external_data=False, then the Label Studio project
will be unlinked from Pixeltable, but the project and data will remain
in Label Studio (so you’ll need to delete the project manually if you
later want to get rid of it).
v.external_stores # Get a list of all external stores for `v`
['ls_project_0']
v.unlink_external_stores('ls_project_0', delete_external_data=True)
Deleted Label Studio project: videos_2024_04_22
Unlinked external store from table `videos_2024_04_22`: ls_project_0
All of the examples so far in this tutorial use HTTP file uploads to
send media data to Label Studio. This is the simplest method and the
easiest to configure, but it’s undesirable for complex projects or
projects with a lot of data. In fact, the Label Studio documentation
includes this specific warning: “Uploading data works fine for proof of
concept projects, but it is not recommended for larger projects.”
In Pixeltable, you can configure linked Label Studio projects to use
URLs for media data (instead of file uploads) by specifying the
media_import_method='url' argument in create_label_studio_project.
This is recommended for all production applications, and is mandatory
for projects whose input configuration is more complex than a single
media file (in the Label Studio parlance, projects with more than one
“data key”).
If media_import_method='url', then Pixeltable will simply pass the
media data URLs directly to Label Studio. If the URLs are http:// or
https:// URLs, then nothing more needs to be done.
Label Studio also supports s3:// URLs with credentialed access. To use
them, you’ll need to configure access to your bucket in the project
configuration. The simplest way to do this is by specifying an
s3_configuration in create_label_studio_project. Here’s an example,
though it won’t work directly in this demo notebook, since it relies on
having an access key. (If your AWS credentials are stored in
~/.aws/credentials, then you can omit the access key and secret, and
Pixeltable will fill them in automatically.)
pxt.io.create_label_studio_project(
    v,
    label_config,
    media_import_method='url',
    s3_configuration={
        'bucket': 'pxt-test',
        'aws_access_key_id': my_key,  # placeholder: your AWS access key ID
        'aws_secret_access_key': my_secret,  # placeholder: your AWS secret access key
    }
)
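The distinction between URL types can be summarized in a few lines. This sketch classifies a media URL by scheme according to the rules described above: `http://` and `https://` URLs need nothing further, while `s3://` URLs need bucket access to be configured. `needs_storage_config` is a hypothetical helper, and rejecting other schemes is a choice made for this illustration only.

```python
from urllib.parse import urlparse


def needs_storage_config(url: str) -> bool:
    """Return True if the media URL requires credentialed storage access."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ('http', 'https'):
        return False  # passed straight through to Label Studio
    if scheme == 's3':
        return True  # bucket access must be configured for the project
    raise ValueError(f'unsupported URL scheme in this sketch: {url!r}')


print(needs_storage_config('https://example.com/videos/a.mp4'))  # False
print(needs_storage_config('s3://pxt-test/videos/a.mp4'))        # True
```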
Before you can set up credentialed S3 access, you’ll need to configure
your S3 bucket to work with Label Studio; the details are described in
the Label Studio docs on Amazon S3.
For the full documentation on create_label_studio_project usage, see
the Pixeltable SDK docs for create_label_studio_project().
Notebook Cleanup
That’s the end of the tutorial! To conclude, let’s terminate the running
Label Studio process. (Of course, feel free to leave it running if you
want to play around with it some more.)
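One way to shut the server down is to ask the process to exit and fall back to a hard kill if it does not comply. The sketch below uses only the standard `subprocess` module; `stop_process` is a hypothetical helper, and the 10-second grace period is an arbitrary choice.

```python
import subprocess
import sys


def stop_process(process: subprocess.Popen, timeout: float = 10.0) -> int:
    """Ask `process` to exit, escalating to a hard kill if it does not comply."""
    process.terminate()  # polite request (SIGTERM on POSIX systems)
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()  # forceful stop
        process.wait()
    return process.returncode


# In this notebook, the call would be:
# stop_process(ls_process)
```

Waiting after `terminate()` makes sure the process has actually exited before the notebook moves on.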