# Data Pipeline Constitution (v1)

Derived from first principles. Paste into `.specify/memory/constitution.md` via
`/speckit.constitution`. Preserve the exact phrasing of each principle; do not
paraphrase into generalities.

Source: https://daita.io/blog/spec_kit_constitution_first_principles
Companion: https://alexsavio.github.io/first-principles-data-engineering.html

## Principles

### 1. Contract before pipeline

Every producer output and every consumer input is defined as a schema with a
version before any transformation is written. Pipelines are built against
contracts, not against live production tables. Schema changes are
expand-then-contract: add the new shape, migrate consumers, then remove the
old one.
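
As a minimal sketch of what a versioned contract might look like (the
`Contract` helper, field names, and types are illustrative, not from any
particular library), the contract is plain data that both producer and
pipeline import:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    name: str
    version: int
    fields: dict[str, str]  # column name -> logical type

# Expand-then-contract in action: v2 adds amount_minor_units while the
# deprecated float amount column is still served to unmigrated consumers.
ORDERS_V2 = Contract(
    name="orders",
    version=2,
    fields={
        "order_id": "string",
        "customer_id": "string",
        "amount_minor_units": "int64",
        "amount": "float64",          # deprecated; dropped in the contract step
        "created_at": "timestamp_utc",
    },
)

def conforms(observed: dict[str, str], contract: Contract) -> bool:
    """A batch conforms when every contracted column is present with the expected type."""
    return all(observed.get(col) == typ for col, typ in contract.fields.items())
```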

### 2. Semantic reconciliation at the boundary

Shared column names (`customer_id`, `timestamp`, `amount`) do not imply shared
meaning. Reconciliation (unit conversion, timezone normalization, identity
resolution) happens at the boundary owned by the party that understands both
sides. No "silent" cross-domain joins.

### 3. Provenance is non-negotiable

Every row has a traceable origin: source system, ingest time, pipeline
version, upstream record id. If a downstream report is wrong, you must be
able to walk backward to the originating record within minutes, not days.
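
One way this can look at ingest time (a sketch; the underscore-prefixed
column names and the version string are conventions assumed here, not
mandated by the principle):

```python
from datetime import datetime, timezone

PIPELINE_VERSION = "ingest-orders@1.4.2"  # hypothetical version identifier

def with_provenance(record: dict, source_system: str, upstream_id: str) -> dict:
    """Stamp every ingested row with enough origin metadata to walk backward later."""
    return {
        **record,
        "_source_system": source_system,
        "_upstream_record_id": upstream_id,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_pipeline_version": PIPELINE_VERSION,
    }
```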

### 4. Collect with purpose

Only ingest data with an identified consumer and use case. "Collect
everything in case we need it" is rejected. New sources require a named
consumer and a written retention policy. Unused tables are deleted, not
archived in perpetuity.
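
The consumer and retention policy can be as small as a registration record
checked into the repo; this sketch (names and fields invented here) shows the
minimum it should carry:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRegistration:
    source: str
    consumer: str        # team or report that will actually use the data
    use_case: str
    retention_days: int  # after this window the data is deleted, not archived

CLICKSTREAM = SourceRegistration(
    source="web_clickstream",
    consumer="growth-analytics",
    use_case="weekly funnel conversion report",
    retention_days=90,
)
```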

### 5. Test the transformation, not the job runner

Unit tests cover pure transformation logic (SQL expressions, UDFs, mapping
functions) on fixture rows. Integration tests cover source-to-sink contracts.
"Job ran green" is not a correctness signal. A green pipeline run without a
failing test for the bug you just fixed is not green.
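
A sketch of the intended level of testing, using an invented pure function
and pytest-style fixture rows:

```python
def net_amount(gross_cents: int, tax_rate: float) -> int:
    """Pure mapping logic: strip tax from a gross amount, rounded to whole cents."""
    return round(gross_cents / (1 + tax_rate))

def test_net_amount_strips_tax():
    # Fixture row that reproduces the rounding bug before the fix counts as done.
    assert net_amount(gross_cents=11900, tax_rate=0.19) == 10000

def test_net_amount_zero_tax_is_identity():
    assert net_amount(gross_cents=500, tax_rate=0.0) == 500
```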

### 6. Observability before optimization

Continuous drift monitoring (row counts, null rates, distribution shifts,
freshness) precedes performance tuning. You cannot optimize what you cannot
observe. Every pipeline emits structured events for start, finish, row counts
in/out, and errors.
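
A minimal sketch of those structured events using stdlib logging and JSON
(event and field names are illustrative, not a prescribed schema):

```python
import json
import logging
import time

log = logging.getLogger("pipeline.events")

def emit(event: str, **fields) -> None:
    """Emit one structured event as a JSON line."""
    log.info(json.dumps({"event": event, "ts": time.time(), **fields}))

def run_step(name: str, rows_in: list, transform) -> list:
    emit("step_started", step=name, rows_in=len(rows_in))
    try:
        rows_out = transform(rows_in)
    except Exception as exc:
        emit("step_failed", step=name, error=repr(exc))
        raise
    emit("step_finished", step=name, rows_in=len(rows_in), rows_out=len(rows_out))
    return rows_out
```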

### 7. Do not distribute by default

A single-node pipeline (DuckDB, Polars, DataFusion, Postgres) is the starting
point. Moving to Spark, Flink, or a distributed warehouse requires a written
justification tied to working-set overflow, genuine parallel compute needs,
or regulatory data locality. Cluster complexity is a cost, not a default.
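
For scale, what a single-node starting point can look like: DuckDB over
Parquet on local disk (file paths and columns are illustrative):

```python
import duckdb

# One process, one file-backed database, no cluster.
con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT order_date, count(*) AS orders, sum(amount) AS revenue
    FROM read_parquet('data/orders/*.parquet')
    GROUP BY order_date
""")
```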

### 8. Idempotency is a correctness property

Every job is safe to re-run: the same inputs always produce the same outputs.
Upserts over inserts. Deterministic partitioning. No side effects outside the
declared outputs.
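
One common shape for this, sketched against DuckDB with an assumed
`daily_orders` table and a date-partitioned Parquet layout: delete then
insert for the partition inside a single transaction, so re-running a day
leaves the table in the same state:

```python
import duckdb

def load_day(con: duckdb.DuckDBPyConnection, day: str) -> None:
    """Re-runnable partition load: same day in, same table state out."""
    # day comes from pipeline config, not user input; partition path is assumed.
    path = f"data/orders/{day}/*.parquet"
    con.execute("BEGIN")
    con.execute("DELETE FROM daily_orders WHERE order_date = ?", [day])
    con.execute(f"INSERT INTO daily_orders SELECT * FROM read_parquet('{path}')")
    con.execute("COMMIT")
```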

### 9. Commands are discoverable; local dev matches CI

Every pipeline operation (backfill, incremental run, test, lint, deploy) is a
single named command runnable locally with the same arguments CI uses. No
orchestrator-only
steps, no hidden env-var contracts. If a new contributor cannot run the full
pipeline against a fixture dataset in 30 seconds, the interface is broken.
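
A sketch of a single entry point that a laptop and CI invoke identically (the
module name, subcommands beyond those listed above, flags, and defaults are
invented for illustration):

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("backfill", "incremental", "test", "lint", "deploy"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--data", default="fixtures/", help="input dataset; defaults to the fixture set")
    args = parser.parse_args()
    print(f"running {args.command} against {args.data}")  # dispatch to the real implementation here

if __name__ == "__main__":
    main()
```

CI would then call, say, `python pipeline.py test --data fixtures/` with
exactly the same arguments a contributor types locally.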
