Skip to main content

Python API Reference

The surface a Pond author touches is intentionally tiny: one decorator to register Ripples, and one handle passed into every Ripple at runtime.

from duckstring import ripple

@ripple

Registers a function as a Ripple of the Pond. The Catchment discovers a Pond's topology at deploy time by importing src/pond.py (or the ripples path declared in pond.toml) and reading these registrations.

@ripple
def daily_sales(pond): ...

@ripple(parents=[daily_sales])
def join_lines(pond): ...

@ripple(parents=[daily_sales], name="join_lines_v2")
def join_lines_impl(pond): ...
ParameterDefaultMeaning
parents[]Ripples (the decorated function objects) that must complete, within the same Pond Run, before this one starts. Ripples with no parent relationship run in parallel. All parent edges are required.
namethe function's nameThe Ripple's registered name — used in topology, run history, and duckstring get/query.

The decorated function must accept exactly one argument: the Pond handle. It returns nothing — output happens through pond.write_table. The decorator returns the function unchanged, so parents can reference it directly.

note

src/pond.py is imported both at deploy time (to read the topology) and at execution time. Keep module level free of side effects — work belongs inside Ripple bodies.

The Pond handle

Each Ripple invocation receives a fresh Pond — the runtime handle bound to the Pond's working database.

AttributeTypeMeaning
pond.namestrThe Pond's name
pond.versionstrThe deployed version executing
pond.conduckdb.DuckDBPyConnectionA connection to the Pond's private working database
pond.fdatetimeThe run's freshness (tz-aware UTC) — the natural watermark/provenance stamp, stable across retries and crash recovery (see Incremental Ripples)
pond.rootPathThe Catchment root (rarely needed directly)

pond.read_table(ref)

Returns a DuckDB relation for a table. The reference form decides where it reads from:

own = pond.read_table("daily_sales") # this Pond's table — live, from the working DB
src = pond.read_table("transactions.transaction") # a Source's table — its published Parquet snapshot
  • Bare name — a table this Pond wrote, read live from its working database. This is how intermediate state flows between Ripples in a run, and how a Ripple builds on its own previous output (see Incremental Ripples).
  • source_pond.table — a Source's published output: the Parquet snapshot exported by its last successful run. Reads never touch the Source's live database, so they see only consistent, completed data and never contend with the Source's execution. The table is also registered as a view under its own name, so plain SQL can reference it directly — FROM "transaction" after the read above.

Raises FileNotFoundError if a Source table has no exported snapshot yet — i.e. the Source hasn't completed a successful run.

pond.write_table(name, relation)

Publishes a relation as a table of this Pond, atomically:

agg = pond.con.sql('SELECT product_id, SUM(quantity) AS qty FROM "transaction" GROUP BY 1')
pond.write_table("daily_sales", agg)

The write is build-then-swap: the relation materialises into a temporary table which then replaces the target in one transaction. Readers see the old table or the new one, never anything in between. Concurrent write conflicts (other Ripples writing their own tables to the same database) are retried with backoff automatically — they queue rather than fail.

Each successful Pond Run ends with every table exported to Parquet (ponds/{pond}/data/{table}.parquet) — that export is what Sinks and queries consume.

pond.con — direct DuckDB

pond.con is an ordinary DuckDB connection, with the full SQL and Python-API surface. SQL sees every table this Pond has written, plus a view for each Source table read with read_table:

pond.read_table("transactions.transaction") # registers the view `transaction`
agg = pond.con.sql("""
SELECT product_id, SUM(quantity) AS total
FROM "transaction"
GROUP BY product_id
""")
pond.write_table("totals", agg)

Relations are lazy — pond.con.sql(...) builds a query plan, and nothing executes until the result is consumed (here, by write_table). Chains of relations compose into a single optimised query, and the relation API (.filter, .aggregate, .union, …) composes the same way.

One DuckDB feature to avoid inside Ripples: replacement scans — referencing a Python variable as a table name in SQL (FROM raw for a local named raw). That resolves by scanning Python stack frames, which is unreliable under the threaded executor Ripples run in. Reference registered names as above, or compose relations with the relation API instead.

Anything that produces a DuckDB relation works as write_table input, which is also the bridge for non-SQL transforms:

import pandas as pd

df = pd.DataFrame(fetch_from_api()) # arbitrary Python
pond.write_table("snapshot", pond.con.from_df(df))

@puddle and the Puddle handle

Registers a function in src/puddles.py (or the puddles path in pond.toml) as a Puddle — a local snapshot of the Source table it names, materialised by duckstring pond hydrate:

from duckstring import puddle

@puddle("transactions.transaction") # one table of a Source
def transactions(p):
return p.con.sql("SELECT range AS id FROM range(50)")

@puddle("products") # a whole Source — name each table
def products(p):
p.write_table("product", p.con.sql("SELECT 1 AS id"))

The handle p:

AttributeMeaning
p.target / p.source / p.tableThe target as declared / its Source / its table (None for whole-Source puddles).
p.conA scratch in-memory DuckDB connection.
p.pathThe destination directory (puddles/ponds/{source}/data/) — write any non-table artifact there directly.
p.write_table([name,] relation)Export a relation as a table's Parquet snapshot (atomic). Accepts anything write_table on a Pond accepts.
p.write_path(path)Copy a parquet/csv file or glob in.
p.catchment(name=None)A Catchment client bound to the Source: .get([table]) fetches a table, .query(sql) runs SQL against the Source's exported tables, .tables() lists them.

Returning a relation is shorthand for p.write_table(relation); returning a path string for p.write_path(path). Puddle code never runs on a Catchment — only pond hydrate imports it.

Execution environment

Facts about how Ripple code runs, occasionally relevant when writing it:

  • Threads, one process. A Pond's Ripples execute in a thread pool inside the Pond's worker process. Independent Ripples genuinely overlap (DuckDB releases the GIL for query work); module-level mutable state in pond.py is shared and best avoided.
  • One database per Pond. All of a Pond's Ripples share one working database; each invocation gets its own connection to it. Cross-Pond access is only ever via read_table("source.table") — Parquet snapshots, not the Source's database.
  • Failures are exceptions. A Ripple fails by raising. The exception's message and traceback are captured into run history, and the immediate-retry budget governs re-attempts. Write Ripples idempotently — a retry re-runs the whole function. write_table's replace semantics make the common derive-and-replace case idempotent by construction, and the same atomicity makes self-read appends replay-safe when the increment is computed from the previous state (see Incremental Ripples). What needs care is anything with external side effects — a retry repeats them (see External Pipelines for the ensure-then-poll shape).