Learn about Pixeltable tables and data operations with this in-depth guide.

What are Tables?

Tables are the fundamental data storage units in Pixeltable. They function similarly to SQL database tables but with enhanced capabilities designed specifically for AI and ML workflows. Each table consists of columns with defined data types and can store both structured data and unstructured media assets. In Pixeltable, tables:
  • Persist across sessions, meaning your data remains available even after restarting your environment
  • Maintain strong typing for data consistency
  • Support operations like filtering, querying, and transformation
  • Can handle specialized data types for machine learning and media processing
  • Group logically into directories (namespaces) for organization
Creating a table requires defining a name and schema that describes its structure:
import pixeltable as pxt

# Create a directory to organize tables
pxt.create_dir('example')

# Create a table with a defined schema
films = pxt.create_table('example.films', {
    'title': pxt.String,
    'year': pxt.Int,
    'plot': pxt.String,
    'budget': pxt.Float,
    'revenue': pxt.Float
})

Primary Keys

Primary keys uniquely identify rows in a table and enable efficient lookups and updates. You can designate one or more columns as primary keys when creating a table:
# Single primary key
products = pxt.create_table('inventory.products', {
    'sku': pxt.String,
    'name': pxt.String,
    'price': pxt.Float
}, primary_key='sku')

# Composite primary key (multiple columns)
order_items = pxt.create_table('sales.order_items', {
    'order_id': pxt.Int,
    'product_id': pxt.Int,
    'quantity': pxt.Int
}, primary_key=['order_id', 'product_id'])
With primary keys defined, you can use batch_update to efficiently update or insert (upsert) rows:
# Update existing rows or insert new ones based on primary key
products.batch_update([
    {'sku': 'ABC123', 'name': 'Widget', 'price': 9.99},
    {'sku': 'DEF456', 'name': 'Gadget', 'price': 19.99}
])
Primary key columns cannot be nullable and must have unique values across all rows.
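The upsert behavior of batch_update can be pictured as a merge into a dict keyed by the primary key. The following is a plain-Python sketch of that semantics (not Pixeltable code), using the products example above:

```python
# Plain-Python sketch of upsert semantics: merge incoming rows into a
# dict keyed by the primary key ('sku'). Rows whose key already exists
# are updated in place; the rest are inserted as new rows.
existing = {
    'ABC123': {'sku': 'ABC123', 'name': 'Widget', 'price': 8.99},
}
incoming = [
    {'sku': 'ABC123', 'name': 'Widget', 'price': 9.99},   # updates existing row
    {'sku': 'DEF456', 'name': 'Gadget', 'price': 19.99},  # inserted as new row
]
for row in incoming:
    existing[row['sku']] = row

print(len(existing))                # 2 rows after the upsert
print(existing['ABC123']['price'])  # 9.99 (updated in place)
```

Because the merge is keyed, re-running the same batch is idempotent: applying it twice leaves the table in the same state as applying it once.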

Type System

  • Basic Types
  • Media Types
  • ML Types
  • JSON Schema Validation
# Schema definition
table = pxt.create_table('example', {
    'text': pxt.String,     # Text data
    'count': pxt.Int,       # Integer numbers
    'score': pxt.Float,     # Decimal numbers
    'active': pxt.Bool,     # Boolean values
    'created': pxt.Timestamp # Date/time values
})
Table and directory names can include hyphens (e.g., my-table, my-project.sub-dir). This is useful for matching external naming conventions.

Column Casting

Pixeltable allows you to explicitly cast column values to ensure they conform to the expected type. This is particularly useful when working with computed columns or transforming data from external sources.
# Cast columns to different types
table.update({
    'int_score': table.score.astype(pxt.Int),        # Cast float to integer
    'string_count': table.count.astype(pxt.String),  # Cast integer to string
})

# Using casting in computed columns
films.add_computed_column(
    budget_category=films.budget.astype(pxt.String) + ' million'
)

# Casting in expressions
films.where(films.revenue.astype(pxt.Int) > 100).collect()
Column casting helps maintain data consistency and prevents type errors when processing your data.

Data Operations

Query Operations

Filter and retrieve data:
# Basic row count
films.count()  # Returns total number of rows

# Basic filtering
films.where(films.budget >= 200.0).collect()

# Select specific columns
films.select(films.title, films.year).collect()

# Limit results
films.limit(5).collect()  # First 5 rows (no specific order)
films.head(5)  # First 5 rows by insertion order
films.tail(5)  # Last 5 rows by insertion order

# Order results
films.order_by(films.budget, asc=False).limit(5).collect()

String Operations

Manipulate text data:
# String contains
films.where(films.title.contains('Inception')).collect()

# String replacement
films.update({
    'plot': films.plot.replace('corporate secrets', 'subconscious secrets')
})

# String functions
films.update({
    'title': films.title.upper(),        # Convert to uppercase
    'length': films.title.len()          # Get string length
})

Insert Operations

Add new data:
# Insert single row
films.insert(
    title='Inside Out 2',
    year=2024,
    plot='Emotions navigate puberty',
    budget=200.0
)

# Insert multiple rows
films.insert([
    {
        'title': 'Jurassic Park',
        'year': 1993,
        'plot': 'Dinosaur theme park disaster',
        'budget': 63.0
    },
    {
        'title': 'Titanic',
        'year': 1997,
        'plot': 'Ill-fated ocean liner romance',
        'budget': 200.0
    }
])

Import from Source

Create tables or insert data directly from external sources:
import pixeltable as pxt
import pandas as pd

# Create a table from a CSV file
table = pxt.create_table('world_population', source='https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/world-population-data.csv')

# Create a table from a pandas DataFrame
df = pd.DataFrame({
    'cca3': ['FRA', 'DEU', 'ITA'],
    'country': ['France', 'Germany', 'Italy'],
    'continent': ['Europe', 'Europe', 'Europe'],
    'pop_2023': [68_000_000, 83_000_000, 59_000_000]
})
table = pxt.create_table('europe_population', source=df)

# Insert data from a pandas DataFrame into an existing table
new_df = pd.DataFrame({
    'cca3': ['ESP', 'GBR', 'POL'],
    'country': ['Spain', 'United Kingdom', 'Poland'],
    'continent': ['Europe', 'Europe', 'Europe'],
    'pop_2023': [47_000_000, 67_000_000, 38_000_000]
})
table.insert(new_df)
Pixeltable supports importing from various data sources:
  • CSV files (.csv)
  • Excel files (.xls, .xlsx)
  • Parquet files (.parquet, .pq, .parq)
  • JSON files (.json)
  • Pandas DataFrames
  • Pixeltable Queries
  • Hugging Face datasets

Update Operations

Modify existing data:
# Update all rows
films.update({
    'budget': films.budget * 1.1  # Increase all budgets by 10%
})

# Conditional updates
films.where(
    films.year < 2000
).update({
    'plot': films.plot + ' (Classic Film)'
})

# Batch updates for multiple rows (matches on the table's primary key;
# this example assumes films was created with an 'id' primary-key column)
updates = [
    {'id': 1, 'budget': 175.0},
    {'id': 2, 'budget': 185.0}
]
films.batch_update(updates)

Delete Operations

Remove data with conditions:
# Delete specific rows
films.where(
    films.year < 1995
).delete()

# Delete with complex conditions
films.where(
    (films.budget < 100.0) &
    (films.year < 2000)
).delete()

# WARNING: Delete all rows (use with caution!)
# films.delete()  # Without where clause deletes all rows

Column Operations

Manage table structure:
# Add new column
films.add_column(rating=pxt.String)

# Drop column
films.drop_column('rating')

# View schema
films.describe()

Versioning

Manage table versions:
# Revert the last operation
films.revert()  # Cannot be undone!

# Revert multiple times to go back further
films.revert()
films.revert()  # Goes back two operations

Export Operations

Extract data for analysis:
# Get results as Python objects
result = films.limit(5).collect()
first_row = result[0]  # Get first row as dict
titles = result['title']  # Get list of values for one column

# Convert to Pandas
df = result.to_pandas()
df['revenue'].describe()  # Get statistics for revenue column

Join Tables

Combine data from multiple tables using different join types.
import pixeltable as pxt

# Define the customers table
customers = pxt.create_table(
    "customers",
    {"customer_id": pxt.Int, "name": pxt.String, "total_spent": pxt.Float},
    if_exists="replace",
)

# Define the orders table
orders = pxt.create_table(
    "orders",
    {"order_id": pxt.Int, "customer_id": pxt.Int, "amount": pxt.Float},
    if_exists="replace",
)


# Populate the tables with sample data
customers.insert([
    {'customer_id': 1, 'name': 'Alice Johnson', 'total_spent': 250.0},
    {'customer_id': 2, 'name': 'Bob Smith', 'total_spent': 180.0},
    {'customer_id': 3, 'name': 'Carol White', 'total_spent': 320.0},
    {'customer_id': 4, 'name': 'David Brown', 'total_spent': 150.0},
    {'customer_id': 5, 'name': 'Eve Davis', 'total_spent': 90.0}
])

orders.insert([
    {'order_id': 101, 'customer_id': 1, 'amount': 75.0},
    {'order_id': 102, 'customer_id': 1, 'amount': 30.0},
    {'order_id': 103, 'customer_id': 2, 'amount': 120.0},
    {'order_id': 104, 'customer_id': 4, 'amount': 60.0}
])

Inner Join

Returns only matching records from both tables.
inner_join_result = customers.join(
    orders,
    on=customers.customer_id == orders.customer_id,
    how='inner'
).select(
    customers.name,
    orders.amount
)
inner_df = inner_join_result.collect()
print(inner_df)
# Output will show only customers with matching orders (customer_id 1, 2, 4)

Left Outer Join

Returns all records from the left table and matching records from the right table.
left_join_result = customers.join(
    orders,
    on=customers.customer_id == orders.customer_id,
    how='left'
).select(
    customers.name,
    orders.amount
)
left_df = left_join_result.collect()
print(left_df)
# Output will show all customers (1-5), with null for amount where no order exists

Full Outer Join

Returns all records from both tables, with nulls where there’s no match.
full_join_result = customers.join(
    orders,
    on=customers.customer_id == orders.customer_id,
    how='full_outer'
).select(
    customers.name,
    orders.amount
)
full_df = full_join_result.collect()
print(full_df)
# Output will show all customers and all orders, with nulls where no match exists

Cross Join

Returns all possible combinations of records from both tables.
cross_join_result = customers.join(
    orders,
    how='cross'
).select(
    customers.name,
    orders.amount
)
cross_df = cross_join_result.collect()
print(cross_df)
# Output will show 5 customers x 4 orders = 20 combinations
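The row counts produced by the four join types can be checked with a plain-Python sketch of the join logic over the same sample data (an illustration of the semantics, not how Pixeltable executes joins):

```python
# Sample data from the examples above, as plain Python structures.
customers = {1: 'Alice Johnson', 2: 'Bob Smith', 3: 'Carol White',
             4: 'David Brown', 5: 'Eve Davis'}
orders = [(101, 1, 75.0), (102, 1, 30.0), (103, 2, 120.0), (104, 4, 60.0)]

# Inner join: only orders whose customer_id matches a customer.
inner = [(customers[cid], amt) for _, cid, amt in orders if cid in customers]

# Left outer join: inner matches plus unmatched customers with None.
matched = {cid for _, cid, _ in orders}
left = inner + [(name, None) for cid, name in customers.items()
                if cid not in matched]

# Full outer join: left result plus orphan orders (none in this data,
# since every order references a known customer).
full = left + [(None, amt) for _, cid, amt in orders if cid not in customers]

# Cross join: every customer paired with every order.
cross = [(name, amt) for name in customers.values() for _, _, amt in orders]

print(len(inner))  # 4  (customer 1 matches twice)
print(len(left))   # 6  (4 matches + customers 3 and 5 with None)
print(len(full))   # 6  (equals left here: no orphan orders)
print(len(cross))  # 20 (5 customers x 4 orders)
```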

Error Handling

When inserting data or evaluating computed columns, errors may occur (e.g., invalid media files, API failures, type mismatches). Pixeltable provides the on_error parameter to control how these errors are handled:
# Abort on any error (default behavior)
t.insert([...], on_error='abort')

# Continue processing and log errors
t.insert([...], on_error='ignore')
When using on_error='ignore', cells with errors will have None values, but you can access error details through special column properties:
# Create a table with a computed column that might fail
t = pxt.create_table('app.docs', {'url': pxt.String})
t.add_computed_column(content=some_function_that_might_fail(t.url))

# Insert data, ignoring errors
t.insert([{'url': 'https://example.com/valid'}, {'url': 'invalid-url'}], on_error='ignore')

# Query error information for failed rows
errors = t.where(t.content == None).select(
    t.url,
    t.content.errortype,   # Exception class name
    t.content.errormsg     # Error message
).collect()

print(errors)
# Shows which URLs failed and why
Use on_error='ignore' in production pipelines to ensure partial failures don’t halt your entire workflow. You can then query and retry failed rows later.
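The record-and-continue behavior behind on_error='ignore' can be sketched in plain Python. This is an illustration of the pattern, not Pixeltable internals; the process(url) helper is hypothetical, standing in for a computed column that might fail:

```python
# Hypothetical helper standing in for a computed column that can fail.
def process(url: str) -> str:
    if not url.startswith('http'):
        raise ValueError(f'invalid URL: {url}')
    return f'content of {url}'

rows = [{'url': 'https://example.com/valid'}, {'url': 'invalid-url'}]
results = []
for row in rows:
    try:
        results.append({'url': row['url'], 'content': process(row['url']),
                        'errortype': None, 'errormsg': None})
    except Exception as exc:
        # Mirror on_error='ignore': record the error, don't abort the batch.
        results.append({'url': row['url'], 'content': None,
                        'errortype': type(exc).__name__, 'errormsg': str(exc)})

# Failed rows keep their inputs plus error details, so they can be
# queried and retried later.
failed = [r for r in results if r['content'] is None]
print(len(failed))             # 1
print(failed[0]['errortype'])  # ValueError
```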

Table Management

Move and Rename

Move tables between directories or rename them using pxt.move():
# Move a table to a different directory
pxt.move('old_dir.my_table', 'new_dir.my_table')

# Rename a table (same directory)
pxt.move('my_dir.old_name', 'my_dir.new_name')

# Handle existing paths gracefully
pxt.move('source.table', 'dest.table', if_exists='ignore')

List Contents

Browse your Pixeltable namespace with pxt.ls():
# List contents of root directory
pxt.ls()
#    Name      Kind    Version  Base
#    my_dir    dir
#    my_table  table   5

# List contents of a subdirectory
pxt.ls('my_dir')

# Programmatic access to directory contents
contents = pxt.get_dir_contents('my_dir')
print(contents['tables'])  # List of table paths
print(contents['dirs'])    # List of subdirectory paths

Data Export

Export data to various formats for external use:
import pixeltable as pxt
from pixeltable import io

t = pxt.get_table('my_app.data')

# Export to Parquet file
io.export_parquet(t, 'output.parquet')

# Export a query result to Parquet
query = t.where(t.score > 0.9).select(t.id, t.embedding)
io.export_parquet(query, 'high_scores.parquet')

# Export to LanceDB for vector search
io.export_lancedb(t, 'lance_db_path', 'table_name')

# Convert to a Pandas DataFrame, then write CSV
df = t.collect().to_pandas()
df.to_csv('output.csv')

# Export images as FiftyOne dataset
io.export_images_as_fo_dataset(t, 'fiftyone_dataset')

Best Practices

Schema Definition

  • Use clear naming for directories and tables
  • Document computed column dependencies

Application Code

  • Use get_table() to fetch existing tables
  • Use batch operations for multiple rows

Common Patterns

  1. Create table.py for structure
  2. Test schema and workflow
  3. Create app.py for usage
  4. Deploy both files
# table.py - Run once to set up
pxt.create_table(..., if_exists="ignore")

# app.py - Production code
# get_table raises an error if the table doesn't exist, so run table.py first
table = pxt.get_table("myapp.mytable")

Additional Resources

Remember that Pixeltable automatically handles versioning and lineage tracking. Every operation is recorded and can be reverted if needed.