Provenance and Auditing — How Tango Keeps Data Consistent

Federal procurement data doesn't come from one place. Tango ingests from SAM.gov, USAspending, FPDS, agency feeds, and more. Those sources overlap, disagree, and update on different cadences. To keep the dataset consistent and debuggable, we track which source last updated each field and which fields changed on each row. We call these field provenance and change data capture (CDC). Provenance and changelog tables are not yet exposed in the public API, but they matter to you anyway: they're why Tango can explain behavior, avoid regressions, and improve reliability. In this post we'll cover why this matters, what we track and where, how it works under the hood, and what questions we can answer (and may expose later).

Why it matters

When multiple upstream systems feed the same kind of data (e.g. entity names, agency codes), you have to decide who wins when they disagree—and you have to do it in a way you can explain and debug.

Data changes for all kinds of reasons:

  • Good reasons: a company updates its legal name or address in SAM.gov; an agency reorgs.
  • Less good reasons: someone in the government typed in the wrong thing and it had to be corrected in a later feed.
  • No obvious reason: a value flips with no identifiable source or explanation. And sometimes, yes, even we can screw something up through data enrichment or a loading error.

We need to be able to track all of the changes regardless of the reason, so we can tell a legitimate update from a correction from a bug (including our own), and so we don't overwrite good data with bad.

Without provenance and change logs, these problems show up in the data:

  • Mystery changes: A field flips from one value to another and you can't tell whether it was a correction, a bad ingest, or a source update.
  • Regressions: A high-quality source gets overwritten by a lower-quality one because there's no notion of "this field came from SAM; don't replace it with USAspending unless we say so."
  • Unclear readiness: Downstream consumers can't easily know "is this row stable for this sync?" or "what actually changed in the last run?"

What we track (and where)

To ensure that we can effectively keep track of changes, Tango uses two related ideas:

  • Field provenance: "Which source last updated this specific field, and when?" That drives priority rules so we don't overwrite authoritative data with weaker sources.
  • Change data capture (CDC): "Which fields changed on this row, by what operation (INSERT/UPDATE), and from what source?" That gives us an audit trail and lets us answer "what changed?" and "did this loader do anything?"

Together, they let us explain behavior (why a value changed or didn't), prevent regressions (authoritative sources win per field), and improve reliability (clear guarantees for "data ready" and "what changed").
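To make the two ideas concrete, here is a minimal sketch of what one record of each kind might hold. The class names, field names, and example values are assumptions for illustration only; they are not Tango's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldProvenance:
    """Hypothetical record: which source last wrote one field of one row."""
    row_key: str
    field_name: str
    source: str
    updated_at: datetime


@dataclass(frozen=True)
class ChangeLogEntry:
    """Hypothetical CDC record: what one operation changed on one row."""
    row_key: str
    operation: str  # "INSERT" or "UPDATE"
    source: str
    changed_fields: tuple[str, ...]
    recorded_at: datetime


# Illustrative values only; the key and timestamp are made up.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
prov = FieldProvenance("UEI-EXAMPLE", "legal_business_name", "SAM", now)
entry = ChangeLogEntry("UEI-EXAMPLE", "UPDATE", "SAM", ("legal_business_name",), now)
```

The provenance record answers a question about one field; the changelog entry answers a question about one write to a row. Keeping them separate means a priority check only needs the former, while an audit only needs the latter.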

Today we track provenance and auditing for:

  • Organizations — Agencies, departments, offices. CDC changelog: OrganizationChangeLog. Source/field priority: OrganizationSourceFieldAuthority.
  • Entities — Companies (UEI, etc.). CDC changelog: EntityChangeLog. Source/field priority: EntitySourceFieldAuthority.
  • Entity relationships — Links between entities. CDC changelog: EntityRelationshipChangeLog.

So when we ingest from SAM, USAspending, or an agency feed, we record which source touched which fields and which rows changed. That's what lets us enforce "SAM wins for legal_business_name" or "don't overwrite this org's name with a stale FPDS value."
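A rule like "SAM wins for legal_business_name" can be sketched as a per-field ranking of sources. The table below, the function name, and the relative ranking of USAspending and FPDS are assumptions made up for this example, not Tango's real configuration.

```python
# Hypothetical priority table: a lower number means more authoritative
# for that field. Only "SAM first" comes from the rule quoted above.
FIELD_AUTHORITY = {
    "legal_business_name": {"SAM": 0, "USAspending": 1, "FPDS": 2},
}
DEFAULT_PRIORITY = 99  # assumed rank for sources with no explicit rule


def may_overwrite(field_name, incoming_source, current_source):
    """Return True if incoming_source is allowed to replace the current value."""
    if current_source is None:
        return True  # nothing has written this field yet
    ranks = FIELD_AUTHORITY.get(field_name, {})
    incoming = ranks.get(incoming_source, DEFAULT_PRIORITY)
    current = ranks.get(current_source, DEFAULT_PRIORITY)
    # Equal rank is allowed so a source can refresh its own value.
    return incoming <= current
```

In this sketch, a field with no explicit rule falls back to "last writer wins", because every source gets the same default rank. Whether that is the right default is a design choice; a stricter system could refuse the write instead.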

Over time, we will expand provenance and auditing to all data models within Tango.

How it works (high level)

At ingest time we don't just overwrite the target table. We:

  1. Parse and normalize: The loader reads the upstream file or API and writes into a staging (or temp) representation.
  2. Compute a diff: We compare staging to the current target rows and figure out what actually changed.
  3. Write the audit trail: We insert ChangeLog rows (which fields changed, INSERT vs UPDATE, which source, batch/job id) and update field provenance (which source last wrote each field).
  4. Apply the update: We upsert or update the target table so the canonical data reflects the chosen source and the changelog reflects what happened.

So: we always diff before we write. That means we know exactly what changed and why, and we can enforce source/field rules so higher-trust sources aren't overwritten by lower-trust ones.
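The diff-before-write flow can be sketched in a few lines. This is a simplified in-memory illustration, not Tango's loader: plain dicts stand in for the staging and target tables, parsing and normalization (step 1) are assumed to have already happened, source priority checks are left out, and every name here is hypothetical.

```python
def diff_row(current, staged):
    """Return {field: (old, new)} for staged fields that differ from current."""
    if current is None:
        # New row: every staged field counts as changed.
        return {name: (None, value) for name, value in staged.items()}
    return {
        name: (current.get(name), value)
        for name, value in staged.items()
        if current.get(name) != value
    }


def apply_batch(target, staging, source, batch_id, changelog, provenance):
    """Diff staging against target, log the changes, then apply them."""
    for key, staged in staging.items():
        current = target.get(key)
        changes = diff_row(current, staged)  # step 2: compute a diff
        if not changes:
            continue  # no-op: nothing to log, nothing to write
        # Step 3: write the audit trail and update field provenance.
        changelog.append({
            "row_key": key,
            "operation": "INSERT" if current is None else "UPDATE",
            "source": source,
            "batch_id": batch_id,
            "changed_fields": sorted(changes),
        })
        for name in changes:
            provenance[(key, name)] = source
        # Step 4: apply the update to the target.
        new_values = {name: new for name, (_, new) in changes.items()}
        target.setdefault(key, {}).update(new_values)


# Usage with made-up data: one existing organization row gets a new name.
target = {"12345": {"name": "Old Office Name"}}
changelog, provenance = [], {}
apply_batch(target, {"12345": {"name": "New Office Name"}},
            "SAM", "batch-1", changelog, provenance)
```

Because unchanged rows never reach the changelog, re-running the same batch produces no new entries, which is what makes "did this loader do anything?" answerable.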

Example questions we can answer

Even though provenance and changelog tables aren't in the public API yet, this machinery is what lets us (and in the future, possibly you) answer:

  • "What last updated an Organization's name—and from which source?"
  • "What last updated legal_business_name for entity UEI X?"
  • "Which fields changed on organization FH key 12345 in the last sync?"
  • "Did loader X actually change data, or was it a no-op?"
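As a rough illustration, questions like these reduce to simple lookups over changelog entries. The sketch below assumes each entry is a dict with row_key, source, batch_id, and changed_fields keys, appended in time order; that shape, the helper names, and the sample data are all hypothetical.

```python
def last_update_for_field(changelog, row_key, field_name):
    """Most recent entry that touched field_name on row_key, or None."""
    for entry in reversed(changelog):  # assumes entries are in time order
        if entry["row_key"] == row_key and field_name in entry["changed_fields"]:
            return entry
    return None


def fields_changed_in_batch(changelog, row_key, batch_id):
    """Set of fields that a given batch changed on a given row."""
    changed = set()
    for entry in changelog:
        if entry["row_key"] == row_key and entry["batch_id"] == batch_id:
            changed.update(entry["changed_fields"])
    return changed


def loader_was_noop(changelog, batch_id):
    """True if the batch wrote no changelog entries at all."""
    return not any(entry["batch_id"] == batch_id for entry in changelog)


# Made-up sample data for one organization row.
changelog = [
    {"row_key": "12345", "operation": "INSERT", "source": "FPDS",
     "batch_id": "batch-1", "changed_fields": ["code", "name"]},
    {"row_key": "12345", "operation": "UPDATE", "source": "USAspending",
     "batch_id": "batch-2", "changed_fields": ["name"]},
]
name_update = last_update_for_field(changelog, "12345", "name")
```

With this sample data, name_update is the batch-2 entry, so the name was last written by the second source, while the code field still traces back to the original insert.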

That's useful for support, for debugging, and for building tooling (exports, compliance reports, or opt-in API surfaces) that needs to explain or audit data lineage. Keeping track of all of these changes is complicated and nuanced, and it is another reason we believe public-sector builders should use Tango instead of rolling their own.

Key takeaways

  1. Multiple sources, one dataset: Tango ingests from SAM, USAspending, FPDS, and agency feeds; provenance and CDC keep the result consistent and auditable.
  2. Two concepts: Field provenance answers "who last updated this field and when?" Change data capture answers "which fields changed on this row, and from what source?"
  3. All kinds of changes: Data changes for good reasons (company name update), less good ones (government typo corrected), no obvious reason, or our own mistakes in enrichment; we track every change so we can tell legitimate updates from corrections from bugs—including ours.
  4. Why it matters to you: We use this to explain behavior, prevent overwriting good data with bad, and improve reliability; you benefit from more consistent API responses even before we expose provenance in the API.
  5. Where we use it today: Organizations, entities, and entity relationships all have changelogs, and organizations and entities also have source/field authority; we diff before we write and record every change.
  6. Not in the API yet: Provenance and changelog tables are internal today; we're laying groundwork for possibly exposing them in a controlled way later (e.g. support tooling, exports, or opt-in endpoints).

If you're building on Tango and care about data quality, consistency, or debugging why a value changed, provenance and auditing are how we keep the house in order under the hood. For the full technical outline, see Provenance & Auditing in the Tango API docs.

Ready to Get Started with Tango?

If you're working with federal procurement data, Tango provides a unified, developer-friendly API that combines federal procurement data sets and improves on them. Skip the complexity of scraping and joining multiple government APIs yourself.
