Bad Data Was Getting Through, and Auditors Were Asking Questions
The company's sequencing pipelines were producing results, but quality problems kept surfacing too late. A run with poor mapping quality or uneven coverage would make it all the way to variant calling before anyone noticed. By then, the choices were bad: rerun the sample and lose time, or push forward with suspect data and hope it held up.
The gruntwork of checking each run manually was eating hours. Worse, when auditors or regulatory reviewers asked for evidence that QC had been performed, the answer was a patchwork of logs and notes that did not inspire confidence. There was no single system that applied consistent checks, recorded what it found, and produced a report that could stand on its own.
They needed something purpose-built. Not another wrapper around samtools and Picard, but a layer that would enforce thresholds, catch anomalies early, and leave behind documentation with clear provenance.
Sequoia Built the QC Layer End to End
Sequoia Applied Technologies is a Santa Clara software engineering firm that builds production software for life sciences and healthcare companies. The project was not a blank slate: the client already had pipeline infrastructure, and the QC layer had to fit into it. The client handed us the requirements and we owned the implementation through delivery.
The system reads BAM or CRAM files after alignment and runs a battery of checks: mapping quality, coverage depth and uniformity, duplicate rates, insert size distributions, GC bias, and contamination estimates. Each metric has configurable thresholds. When a value falls outside the acceptable range, the sample gets flagged as warn or fail, and the details go into a structured report.
The output is JSON and CSV for pipeline consumption, plus a batch summary that rolls up results across samples. For audit purposes, the system can export PDF reports showing exactly which thresholds were applied, what values were observed, and whether the run passed. Everything is timestamped. Everything is traceable.
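As a rough illustration of the reporting step, the sketch below bundles observed values, applied thresholds, and statuses into a timestamped per-sample record, then renders JSON-ready dictionaries, a CSV table, and a batch roll-up. The function names and field names (build_report, to_csv, batch_summary, generated_at) are hypothetical and not taken from the delivered system, and PDF export is left out because it would need a third-party library.

```python
import csv
import io
import json
from collections import Counter
from datetime import datetime, timezone

SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def build_report(sample_id, metrics, thresholds, statuses):
    """Bundle observed values, applied thresholds, and outcomes for one sample."""
    return {
        "sample_id": sample_id,
        # UTC timestamp so reports from different hosts are comparable.
        "generated_at": datetime.now(timezone.utc).isoformat(),
        # The overall status is the worst status of any single check.
        "overall": max(statuses.values(), key=SEVERITY.__getitem__),
        "checks": [
            {
                "metric": name,
                "observed": metrics[name],
                "threshold": thresholds[name],
                "status": statuses[name],
            }
            for name in sorted(metrics)
        ],
    }


def to_csv(reports):
    """Flatten reports into CSV text, one row per sample and metric."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sample_id", "generated_at", "metric", "observed", "threshold", "status"])
    for report in reports:
        for check in report["checks"]:
            writer.writerow([
                report["sample_id"], report["generated_at"], check["metric"],
                check["observed"], check["threshold"], check["status"],
            ])
    return buf.getvalue()


def batch_summary(reports):
    """Roll up overall statuses across all samples in a batch."""
    counts = Counter(report["overall"] for report in reports)
    return {"samples": len(reports), "pass": counts["pass"],
            "warn": counts["warn"], "fail": counts["fail"]}
```

Calling json.dumps on a report gives the JSON form for pipeline consumption; the same dictionaries feed the CSV and the batch summary, so all three outputs describe the same observed values.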
We wrote it in Python. It slots in after the aligner and produces its output without requiring the existing infrastructure to change.
What the System Does
The QC layer sits between alignment and downstream analysis. It consumes aligned reads, gathers metrics, applies rules, and emits structured results. The design is modular so individual checks can be adjusted or extended without touching the rest.
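One way to get that kind of modularity is a small registry in which each check is an independent function. The sketch below is only an illustration of the idea; the registry, the decorator, and the raw-count field names (total_reads, mapped_reads, duplicate_reads) are assumptions, not the delivered design.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a check name to a function that turns
# raw per-sample counts into a single numeric metric.
CHECKS: Dict[str, Callable[[dict], float]] = {}


def register(name: str):
    """Decorator that adds a check function to the registry under `name`."""
    def wrap(func):
        CHECKS[name] = func
        return func
    return wrap


@register("pct_mapped")
def pct_mapped(raw: dict) -> float:
    return 100.0 * raw["mapped_reads"] / raw["total_reads"]


@register("duplication_rate")
def duplication_rate(raw: dict) -> float:
    return 100.0 * raw["duplicate_reads"] / raw["mapped_reads"]


def run_checks(raw: dict) -> Dict[str, float]:
    """Run every registered check; adding a check never touches this function."""
    return {name: func(raw) for name, func in CHECKS.items()}
```

Adding or replacing a check then means writing one decorated function, while the code that runs and reports the checks stays the same.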
Alignment metrics: percentage of reads mapped, percentage properly paired, mean mapping quality score, and rates of secondary, supplementary, and chimeric alignments. These metrics surface aligner performance issues and sample contamination before they become expensive problems.
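Most of these alignment metrics can be derived from the FLAG and MAPQ fields of each alignment record. The sketch below works on plain (flag, mapq) tuples so it runs without a BAM reader; the flag bit values come from the SAM specification, while the function name and the choice of denominators are assumptions made for illustration. The chimeric rate is omitted because its definition varies between tools.

```python
from statistics import mean

# SAM flag bits as defined in the SAM format specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def alignment_metrics(records):
    """Summarize (flag, mapq) tuples, one tuple per alignment record."""
    records = list(records)
    total = len(records)
    secondary = sum(1 for f, _ in records if f & SECONDARY)
    supplementary = sum(1 for f, _ in records if f & SUPPLEMENTARY)
    # Count each read once by dropping secondary and supplementary records.
    primary = [(f, q) for f, q in records if not f & (SECONDARY | SUPPLEMENTARY)]
    mapped = [(f, q) for f, q in primary if not f & UNMAPPED]
    paired = [f for f, _ in primary if f & PAIRED]
    proper = [f for f in paired if f & PROPER_PAIR and not f & UNMAPPED]

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "pct_mapped": pct(len(mapped), len(primary)),
        "pct_properly_paired": pct(len(proper), len(paired)),
        "mean_mapq": mean([q for _, q in mapped]) if mapped else 0.0,
        # Assumption: secondary/supplementary rates are relative to all records.
        "pct_secondary": pct(secondary, total),
        "pct_supplementary": pct(supplementary, total),
    }
```

In a real pipeline the tuples would come from whatever BAM or CRAM reader the pipeline already uses.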
Coverage metrics: mean and median depth, uniformity across the target region, and percentage of bases above specified depth thresholds. Low coverage regions are counted and flagged. Uneven coverage can indicate library prep issues, capture failures, or sample degradation.
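The coverage summary can be sketched from a list of per-base depths. Several choices here are assumptions rather than the system's actual definitions: uniformity is taken as the percentage of bases with depth at least 0.2 times the mean (one common convention), a low coverage region is a contiguous run of bases below a cutoff, and the depths are assumed to be consecutive positions within a single target.

```python
from statistics import mean, median


def coverage_metrics(depths, thresholds=(10, 20, 30), low_cutoff=10,
                     uniformity_fraction=0.2):
    """Summarize per-base depths for one contiguous target region."""
    depths = list(depths)
    if not depths:
        raise ValueError("no depth values supplied")
    n = len(depths)
    mean_depth = mean(depths)

    # Percentage of bases at or above each depth threshold.
    pct_at_least = {
        t: 100.0 * sum(1 for d in depths if d >= t) / n for t in thresholds
    }
    # Assumed uniformity definition: share of bases >= fraction * mean depth.
    floor = uniformity_fraction * mean_depth
    uniformity = 100.0 * sum(1 for d in depths if d >= floor) / n

    # Count contiguous runs of bases below the low-coverage cutoff.
    low_regions = 0
    in_low = False
    for d in depths:
        if d < low_cutoff:
            if not in_low:
                low_regions += 1
            in_low = True
        else:
            in_low = False

    return {
        "mean_depth": mean_depth,
        "median_depth": median(depths),
        "pct_at_least": pct_at_least,
        "uniformity": uniformity,
        "low_regions": low_regions,
    }
```

Per-base depths of this kind are what tools such as mosdepth or samtools depth produce; how they are loaded is left out of the sketch.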
Library metrics: insert size mean and standard deviation, duplication rate, and GC bias score. These metrics reveal whether the library prep worked as expected. Anomalies here often correlate with problems that only manifest later in variant calling.
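Insert size statistics and duplication rate can be sketched from the FLAG and TLEN fields of primary, mapped, paired reads. The duplicate flag bit and the sign convention for TLEN come from the SAM specification; the function name and the assumption that duplicates have already been marked upstream are illustrative. GC bias is not covered here because it needs reference sequence content.

```python
from statistics import mean, stdev

DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def library_metrics(records):
    """Summarize (flag, tlen) tuples for primary, mapped, paired reads."""
    records = list(records)
    # TLEN is positive for the leftmost read of a pair and negative for the
    # rightmost, so keeping positive values counts each pair once.
    inserts = [tlen for _, tlen in records if tlen > 0]
    duplicates = sum(1 for flag, _ in records if flag & DUPLICATE)
    return {
        "insert_size_mean": mean(inserts) if inserts else 0.0,
        # Sample standard deviation; needs at least two pairs.
        "insert_size_sd": stdev(inserts) if len(inserts) > 1 else 0.0,
        "duplication_rate": 100.0 * duplicates / len(records) if records else 0.0,
    }
```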
Contamination estimates and base quality recalibration summaries round out the checks. Each sample receives a pass, warn, or fail flag. Output is JSON and CSV for pipelines, with PDF export for audits, all timestamped and traceable from raw data to final status.
Thresholds are configurable per project. A typical starting point might be at least 95% mapped reads, a mean MAPQ of at least 30, uniformity above 80%, and a duplication rate below 15%. These numbers are not dogma. The system is built to let teams tune them based on their own platforms and protocols.
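To make the threshold logic concrete, here is a minimal sketch of how the starting values above could be expressed as configuration and turned into pass, warn, and fail flags. The limits come from the text; the warn margins, the rule layout, the handling of values exactly on a limit, and the function names are assumptions added for illustration.

```python
# Limits are the starting points named in the text; warn margins are hypothetical.
DEFAULT_THRESHOLDS = {
    "pct_mapped":       {"kind": "min", "limit": 95.0, "warn_margin": 2.0},
    "mean_mapq":        {"kind": "min", "limit": 30.0, "warn_margin": 5.0},
    "uniformity":       {"kind": "min", "limit": 80.0, "warn_margin": 5.0},
    "duplication_rate": {"kind": "max", "limit": 15.0, "warn_margin": 3.0},
}

SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def evaluate(metrics, thresholds=DEFAULT_THRESHOLDS):
    """Return a pass/warn/fail status for each configured metric."""
    statuses = {}
    for name, rule in thresholds.items():
        value = metrics[name]
        # Headroom is the distance from the limit on the acceptable side.
        if rule["kind"] == "min":
            headroom = value - rule["limit"]
        else:
            headroom = rule["limit"] - value
        if headroom < 0:
            statuses[name] = "fail"
        elif headroom < rule["warn_margin"]:
            statuses[name] = "warn"
        else:
            statuses[name] = "pass"
    return statuses


def overall_status(statuses):
    """A sample is only as good as its worst check."""
    return max(statuses.values(), key=SEVERITY.__getitem__)


# Per-project tuning: copy the defaults and override one rule.
EXOME_THRESHOLDS = {
    **DEFAULT_THRESHOLDS,
    "duplication_rate": {"kind": "max", "limit": 25.0, "warn_margin": 5.0},
}
```

Keeping the rules as data rather than code is what lets a team change a limit for one project without touching the evaluation logic.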
Problems Caught Earlier, Documentation That Holds Up
After deployment, problematic runs stopped slipping through to variant calling. The QC layer flagged issues at the alignment stage, which meant decisions could be made before wasting compute on downstream analysis or, worse, before suspect results reached a report.
The audit story changed too. When reviewers asked how quality was assured, the answer was a PDF with timestamps, thresholds, observed values, and pass/fail determinations. Not a folder of logs. Not a verbal explanation. A document with clear provenance that could be archived alongside the data it described.
The compliance team stopped dreading those conversations.
Common Questions About NGS Alignment QC Software
What is alignment-based QC in NGS workflows?
Alignment-based QC evaluates how well sequencing reads map to a reference genome. It checks metrics like mapping quality scores, coverage depth and uniformity, duplicate rates, insert size distributions, and contamination estimates. The goal is to flag problematic samples before they reach variant calling or other downstream analysis, where bad data can produce misleading results or require costly reruns.
Why would an NGS company need custom QC software instead of using existing tools?
Standard QC tools like samtools flagstat, Picard, and mosdepth produce raw metrics, but they do not enforce thresholds, generate structured reports, or integrate into a documentation workflow. A custom layer can apply site-specific pass/warn/fail logic, aggregate results across batches, export audit-ready reports, and fit cleanly into an existing pipeline without manual fiddling.
What metrics does alignment QC typically evaluate?
Common metrics include mapped read percentage, properly paired percentage, mean mapping quality score, secondary and supplementary alignment rates, chimeric read frequency, coverage mean and median, coverage uniformity, percentage of bases above a depth threshold, duplication rate, insert size mean and standard deviation, GC bias, and contamination estimates. Each metric has configurable thresholds that determine whether a sample passes, triggers a warning, or fails outright.
How does QC software help with audits and regulatory compliance?
Regulatory and audit contexts require documented evidence that quality checks were performed and that results met defined criteria. QC software can produce timestamped reports showing which thresholds were applied, what values were observed, and whether the sample passed. These reports can be exported as JSON, CSV, or PDF and archived alongside the data they describe, creating a clear provenance trail.
Can alignment QC run on both short-read and long-read data?
Yes. The underlying metrics are derived from BAM or CRAM files regardless of the sequencing platform. Short-read data from Illumina instruments and long-read data from PacBio or Oxford Nanopore can both be evaluated. Some metrics, such as insert size and properly paired percentage, apply only to paired-end data, and some thresholds may differ between platforms, but the QC layer itself works across both and is configurable per project.
What kind of companies does Sequoia Applied Technologies work with on genomics software?
Sequoia Applied Technologies is a Santa Clara software engineering firm that works with life sciences and healthcare companies on bioinformatics pipelines, clinical data systems, and laboratory software. Engagements range from new product builds to integration of existing tools into production workflows. Sequoia has experience with NGS data processing, restricted network deployments, and regulated environments requiring audit documentation.