Data Quality Monitoring Dashboard for SaaS

Jan 16, 2026 · 18 min read

Data quality problems are uniquely insidious because they fail silently. A database table that stops updating doesn't throw an exception — it just serves stale data until someone notices the numbers look wrong. A schema change in an upstream system doesn't alert downstream consumers — it just starts producing nulls or incorrect values in reports that someone questions three days later during a planning meeting.

By the time a data quality issue is discovered through normal operations, it's often been compounding for 48–72 hours. The affected data has already been used to make decisions, sent to downstream systems, displayed to customers, and in some cases exported or acted upon. The remediation isn't just fixing the pipeline — it's assessing the blast radius of how many downstream consumers were fed bad data and for how long.

A data quality monitoring dashboard doesn't prevent all data quality issues. Pipelines break, schemas change, upstream systems behave unexpectedly. What it does is reduce the discovery time from days to minutes, which dramatically reduces the blast radius of any individual failure.

The three failure modes that matter most

Data quality failures in production SaaS systems cluster into three categories, each with distinct detection characteristics and remediation patterns.

Freshness failures occur when data that should update on a defined schedule stops updating, while the pipeline or job continues to run without reporting an error. A pipeline that pulls data from a third-party API might run successfully — the job completes, the scheduler shows green — but receive zero records because the API's authentication token expired and the error was swallowed rather than surfaced. The table stops updating. Customers see dashboards frozen at yesterday's state. No alert fires because no job failed.

Freshness failures are the most common category and often the most impactful because they affect customer-visible data. A customer-facing analytics dashboard showing data that's 18 hours old — when the product promises near-real-time data — is a support ticket and a trust problem. The longer the freshness failure goes undetected, the more customers are affected and the harder the recovery conversation becomes.

Completeness failures occur when data arrives but is structurally incomplete. An ETL job that should produce 10,000 rows from an upstream export produces 400 because a filter condition changed without notice, or because the upstream system was migrated and the export scope changed. The pipeline runs, records land in the table, but a significant fraction of the expected data is missing. Aggregates computed on incomplete data produce incorrect results, and the error doesn't surface until someone compares the output to an external source and notices the discrepancy.

Null rate failures are a sub-type of completeness failures. A field that should always be populated — a user's account ID linking a usage event to its account — becomes null for 15% of records after an upstream schema change. Every aggregate that groups usage by account now silently under-counts for 15% of events. The query doesn't error. The dashboard doesn't show an error. The numbers are just wrong, and they'll be wrong in every report until the root cause is found and the affected records are backfilled.

Validity failures occur when data arrives complete but contains values outside the acceptable domain. A revenue figure that should be a positive dollar amount contains a negative value because a refund was processed and the accounting system doesn't distinguish between the transaction types. A timestamp field contains a date in 1970 because a null was coerced to Unix epoch zero. An integer ID field contains a string because a system migration changed the data type and the ETL didn't handle the coercion.

Validity failures are the hardest to detect because they require domain knowledge to identify — you need to know that a negative revenue figure is impossible, that a 1970 timestamp is invalid, that user IDs should always be positive integers. That domain knowledge has to be encoded explicitly in check rules. It doesn't surface automatically from a generic monitoring system.

What a data quality monitoring dashboard checks

A data quality dashboard defines explicit expectations for critical tables and columns, then checks actual data against those expectations on a schedule. The check schedule should align with the update frequency of the data: hourly for tables that update continuously, daily for batch jobs, and real-time webhook checks for the most critical customer-facing data.

Freshness checks verify the maximum age of the most recent record in each monitored table. Each table has a configured maximum freshness gap: "this table should have a record newer than 2 hours at all times during business hours." The check compares the current timestamp to the most recent record's timestamp and fires an alert if the gap exceeds the threshold. For tables with variable update patterns — higher frequency during business hours, lower at night — the threshold can be time-of-day aware.
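
As a concrete sketch — assuming a DB-API-style warehouse connection, timezone-aware timestamps, and illustrative table and column names — a freshness check reduces to one query and a comparison:

```python
from datetime import datetime, timezone

def check_freshness(conn, table: str, ts_column: str, max_age_hours: float) -> dict:
    """Fail if the newest record in `table` is older than the configured gap.
    Table and column names come from trusted, version-controlled config,
    never from user input."""
    cur = conn.cursor()
    cur.execute(f"SELECT MAX({ts_column}) FROM {table}")
    latest = cur.fetchone()[0]
    if latest is None:  # an empty table counts as a freshness failure
        return {"check": "freshness", "table": table, "passed": False,
                "expected": f"<= {max_age_hours}h", "actual": "no records"}
    age_hours = (datetime.now(timezone.utc) - latest).total_seconds() / 3600
    return {"check": "freshness", "table": table,
            "expected": f"<= {max_age_hours}h",
            "actual": f"{age_hours:.1f}h",
            "passed": age_hours <= max_age_hours}
```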

Row count checks compare each batch's output record count to the historical baseline for that pipeline. The baseline is calculated as the rolling 30-day average and standard deviation of that pipeline's run counts, producing a statistical range rather than a fixed number. A pipeline that normally produces 40,000–55,000 rows should alert on anything outside that range — both below (possible data loss) and above (possible duplicate ingestion). For pipelines with strong seasonality, the baseline should be calculated against the same day-of-week and hour-of-day in prior periods, not a simple rolling average.
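
A minimal version of that baseline logic, assuming the trailing run counts have already been pulled from the results store:

```python
import statistics

def check_row_count(history: list[int], current: int, n_sigmas: float = 3.0) -> dict:
    """Flag a batch whose row count falls outside mean +/- n_sigmas of the
    trailing window (e.g. the last 30 runs). Needs at least two historical
    runs to compute a standard deviation."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    low, high = mean - n_sigmas * stdev, mean + n_sigmas * stdev
    return {"check": "row_count",
            "expected": f"{low:.0f}-{high:.0f}",
            "actual": current,
            # out-of-range low suggests data loss; high suggests duplicates
            "passed": low <= current <= high}
```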

Null rate checks establish expected null rates for columns where nullness carries meaning. For a column that should always be populated — a required foreign key, a field that's always present in the source data — the expected null rate is 0%, and any nulls should trigger an immediate alert. For optional fields, the expected null rate might be 20–30% based on historical patterns, and the check fires when the actual null rate deviates significantly from the expected range in either direction (a higher rate suggests data loss; a lower rate suggests a schema change that made the field required).
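
The check itself is one aggregate query. This sketch makes the same assumptions as above (DB-API connection, names from config); for required columns, pass an expected band of (0.0, 0.0) so any null fails:

```python
def check_null_rate(conn, table: str, column: str,
                    expected_low: float, expected_high: float) -> dict:
    """Compare the column's observed null fraction to its expected band."""
    cur = conn.cursor()
    cur.execute(
        f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) "
        f"FROM {table}"
    )
    actual = cur.fetchone()[0] or 0.0  # empty table reads as 0% nulls
    return {"check": "null_rate", "table": table, "column": column,
            "expected": f"{expected_low:.0%}-{expected_high:.0%}",
            "actual": f"{actual:.1%}",
            "passed": expected_low <= actual <= expected_high}
```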

Value range checks encode domain knowledge as explicit constraints. Revenue fields must be greater than 0. Age fields must be between 0 and 150. Status fields must be in a defined set of accepted values. Timestamp fields must be after the product's launch date and before the current time plus a small buffer for timezone differences. These rules are the formal encoding of what "valid data" means for your domain.
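
One way to make those rules executable is to store each one in the check config as a violation predicate — a WHERE clause that matches invalid rows. The sketch below assumes that convention; the predicates shown are illustrative:

```python
def check_value_range(conn, table: str, violation_predicate: str) -> dict:
    """Count rows that violate a domain rule. The predicate expresses the
    violation, e.g. "revenue <= 0" or "event_ts < '2019-01-01'"; it comes
    from the versioned check config, not from user input."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {violation_predicate}")
    violations = cur.fetchone()[0]
    return {"check": "value_range", "table": table,
            "expected": 0, "actual": violations,
            "passed": violations == 0}
```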

Referential integrity checks verify that foreign keys point to records that exist. An event table where account_id references an account that doesn't exist in the accounts table is a sign of a data pipeline bug — either the account was deleted without cascading the deletion to event tables, or the join condition in the ETL was written incorrectly. These orphaned records are invisible in aggregate reports but can cause significant issues when someone joins the event table to the account table and expects all events to match.
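
An orphan check is an anti-join. This sketch parameterizes the child and parent table and key names, which would come from the check config:

```python
def check_referential_integrity(conn, child: str, fk: str,
                                parent: str, pk: str) -> dict:
    """Count child rows whose foreign key matches no parent row (orphans).
    Names are illustrative and come from trusted config."""
    cur = conn.cursor()
    cur.execute(
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON p.{pk} = c.{fk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    orphans = cur.fetchone()[0]
    return {"check": "referential_integrity", "table": child,
            "expected": 0, "actual": orphans, "passed": orphans == 0}
```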

Schema drift detection compares the schema of each table's current state to a stored baseline. New columns that weren't expected, missing columns that should be present, changed data types or lengths — all of these represent upstream schema changes that may have downstream consequences for the pipelines consuming that data.
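
Against a Postgres-compatible warehouse, the live schema can be read from information_schema and diffed against the stored baseline. A minimal sketch (the %s placeholder is psycopg-style; adjust for your driver):

```python
def check_schema_drift(conn, table: str, baseline: dict[str, str]) -> dict:
    """Diff live column names/types against a stored baseline mapping of
    column name -> data type; report added, missing, and retyped columns."""
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s",
        (table,),
    )
    live = dict(cur.fetchall())
    added = sorted(set(live) - set(baseline))
    missing = sorted(set(baseline) - set(live))
    retyped = sorted(c for c in set(baseline) & set(live)
                     if baseline[c] != live[c])
    return {"check": "schema_drift", "table": table,
            "expected": "no drift",
            "actual": {"added": added, "missing": missing, "retyped": retyped},
            "passed": not (added or missing or retyped)}
```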

Building the check framework

The check framework is the code layer that executes checks on a schedule and collects results. For teams already using dbt, dbt tests are the natural starting point: they run as part of the dbt build and write results to a structured output file. For teams using Great Expectations, an expectation suite per dataset serves the same purpose. For teams without an existing data testing framework, custom SQL checks running in a scheduler are sufficient for most use cases.

Regardless of the check framework, the check results need to land in a central store — a database table — that the dashboard queries. Each check result record contains: the check type, the table and column checked, the expected value or range, the actual value observed, whether the check passed or failed, and the timestamp of the check run.
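
One reasonable shape for that results table, held as a constant in the check runner — Postgres-flavored DDL, with column names that are one choice among many rather than a standard:

```python
RESULTS_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS dq_check_results (
    check_type  TEXT        NOT NULL,  -- freshness, row_count, null_rate, ...
    table_name  TEXT        NOT NULL,
    column_name TEXT,                  -- NULL for table-level checks
    expected    TEXT        NOT NULL,  -- serialized value or range
    actual      TEXT        NOT NULL,
    passed      BOOLEAN     NOT NULL,
    checked_at  TIMESTAMPTZ NOT NULL DEFAULT now()
)
"""
```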

The dashboard then queries this results table. The main view shows all checks across all monitored datasets, grouped by table, with passing, failing, and not-yet-run status. A filter for "currently failing" produces the triage list for the on-call data engineer. A trend view for any specific check shows the historical pass/fail pattern, which is useful for distinguishing a new failure from a recurring known issue.
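
The "currently failing" triage view is then a query over that table: take the latest result per distinct check and keep the failures. A Postgres-flavored sketch — DISTINCT ON picks the most recent run per check; warehouses without it would use a ROW_NUMBER() window function instead:

```python
CURRENTLY_FAILING = """
SELECT *
FROM (
    SELECT DISTINCT ON (check_type, table_name, column_name) *
    FROM dq_check_results
    ORDER BY check_type, table_name, column_name, checked_at DESC
) latest
WHERE NOT passed
"""
```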

Check configuration — which tables to monitor, which columns to check, what the acceptable ranges are — should live in version-controlled YAML or similar configuration files, not in the dashboard's UI. Configuration as code means that adding a new table to monitoring is a pull request, not a form submission, and the check history is tied to the configuration version.
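
A hypothetical shape for such a config — the field names are illustrative, not a fixed schema. It's loaded inline here for the sketch; in practice it lives in a versioned .yml file:

```python
import yaml  # PyYAML

MONITORING_CONFIG = yaml.safe_load("""
tables:
  billing_events:
    owner: billing-data
    alert_channel: "#billing-data-alerts"
    severity: critical
    checks:
      freshness: {ts_column: updated_at, max_age_hours: 2}
      null_rate: {column: account_id, expected: [0.0, 0.0]}
      row_count: {n_sigmas: 3}
""")
```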

Alerting design: making the tool useful, not noisy

A data quality dashboard that requires someone to log in and check it proactively is only marginally more useful than no monitoring at all. Checks need to push alerts to the channels where your team already operates.

The alert destination for a failed check should be the team responsible for the data pipeline that feeds the affected table, not a general data-engineering channel. Routing all alerts to a single channel trains the team to ignore them. A freshness failure on the billing table should page the billing data engineer. A null rate failure on the product analytics events table should notify the product engineering team. The routing configuration is part of the check configuration: each monitored table has an owner team and an alert channel.

Severity tiers prevent alert fatigue. Critical alerts represent customer-facing impact: a freshness failure on a table that powers a live customer dashboard, a null rate failure on a table used in billing calculations. Critical alerts should page the on-call engineer immediately, at any hour, and remain active until acknowledged. High alerts represent broken analytics or reporting that affects internal decision-making but not customer-facing data. These route to Slack with a mention, targeting the responsible team. Low alerts represent stale internal reports or quality issues in non-critical data. These aggregate into a daily digest rather than generating immediate notifications.
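
Routing then reduces to a small function over the per-table config, matching the hypothetical config shape sketched earlier:

```python
def route_alert(result: dict, config: dict) -> dict:
    """Map a failed check to a destination using the table's owner team,
    alert channel, and severity from the monitoring config."""
    table_cfg = config["tables"][result["table"]]
    severity = table_cfg["severity"]
    if severity == "critical":
        # page the owning team's on-call; stays active until acknowledged
        return {"action": "page", "target": table_cfg["owner"]}
    if severity == "high":
        return {"action": "slack", "target": table_cfg["alert_channel"]}
    return {"action": "daily_digest", "target": table_cfg["alert_channel"]}
```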

Alert deduplication is essential. A freshness failure that persists for 4 hours should not generate 4 hours of repeated alerts — it should generate one initial alert and one "still failing after 2 hours" escalation. Without deduplication, teams mute the alert channel during incidents, which defeats the purpose entirely.
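
A minimal dedup state machine — in-memory here for the sketch; a production runner would persist this state so restarts don't re-fire initial alerts:

```python
from datetime import datetime, timedelta, timezone

# dedup state keyed by check identity, e.g. "freshness:billing_events"
_active: dict[str, dict] = {}

def should_notify(check_key: str,
                  escalate_after: timedelta = timedelta(hours=2)) -> str | None:
    """Return 'initial' on the first failure, 'escalation' once the failure
    has persisted past the window, and None for every run in between."""
    now = datetime.now(timezone.utc)
    state = _active.get(check_key)
    if state is None:
        _active[check_key] = {"first_seen": now, "escalated": False}
        return "initial"
    if not state["escalated"] and now - state["first_seen"] >= escalate_after:
        state["escalated"] = True
        return "escalation"
    return None

def resolve(check_key: str) -> None:
    """Clear dedup state once the check passes again."""
    _active.pop(check_key, None)
```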

Integrating with open-source tools

Great Expectations, dbt tests, and Soda are the three most widely used open-source data quality frameworks. All three handle the check logic well and are worth using as the foundation rather than writing check execution logic from scratch.

dbt tests run as part of the standard dbt build and support the core check types — not null, unique, accepted values, and referential integrity — out of the box. Custom checks require writing custom SQL macros, which is manageable for teams already comfortable in dbt. The main limitation is that dbt tests run on the batch build schedule, so they're not suitable for near-real-time monitoring of continuously updated tables.

Great Expectations offers more expressive check types and a more sophisticated expectation management system. It runs independently of the transformation layer, which makes it suitable for monitoring source tables before they enter the transformation pipeline. The configuration overhead is higher than dbt tests, and the open-source version requires building your own result storage and visualization layer.

Soda is positioned as a monitoring-focused tool with built-in alerting and a cleaner configuration language than Great Expectations. The open-source version handles most standard checks; the commercial version adds anomaly detection and integrations.

The choice between these tools is primarily a function of your existing stack. Teams using dbt should start with dbt tests and add a custom check runner for real-time monitoring needs. Teams without dbt should evaluate Soda for the monitoring-focused workflow and Great Expectations for more expressive check types.

The data quality dashboard sits on top of whichever framework you choose, consuming check results and presenting them in a unified view with alert routing, trend analysis, and the check configuration interface. The dashboard doesn't replace the framework — it makes the framework's output operationally useful.

What good monitoring coverage looks like

Not all tables need the same depth of monitoring. The investment in check configuration should be proportional to the business impact of data quality failures in that table.

Tier 1 tables are those whose data quality failures have direct, immediate customer-facing impact: billing tables, customer-visible analytics tables, integration sync tables. These tables should have full coverage: freshness, row count, null rates for all critical fields, value range constraints, and referential integrity checks. They should have critical-severity alerting and should be checked on a tight schedule — every 15–30 minutes for continuously updated tables, immediately after each batch job run for batch tables.

Tier 2 tables power internal reporting and analytics — data that affects business decisions but isn't directly customer-facing. These warrant freshness checks, row count anomaly detection, and null rate monitoring for critical columns. Alerting should be high severity with a Slack notification.

Tier 3 tables are supporting data — configuration tables, lookup tables, historical archives — where failures affect data completeness but not immediately actionable decisions. Basic freshness and row count checks with daily digest alerting are sufficient.

Starting with Tier 1 tables and expanding from there is the right approach. Building complete coverage for 5 critical tables gives you more operational value than building partial coverage for 50 tables. The monitoring is credible when the team trusts that an alert means an actual problem — and that trust is built by starting narrow, tuning the thresholds until false positive rates are low, and then expanding.

The teams that get the most value from data quality monitoring are the ones who respond to every alert and close the loop. When a check fails, the investigation findings — what caused the failure, how it was fixed, how long the affected data was bad, what downstream systems were affected — should be documented in the monitoring tool as an incident record attached to the failed checks. That history is what enables the team to distinguish a new failure pattern from a recurring issue, and it's what makes the monitoring tool progressively more useful over time as the institutional knowledge about failure patterns accumulates in the tool rather than in individual engineer memory.

Bad data causing silent errors in your product or analytics?

We build data quality monitoring dashboards for SaaS engineering and data teams — automated checks on freshness, completeness, and validity across your critical data pipelines.