Duplicate Sprawl and Inconsistent Ingredient Names
The company's product catalog had grown over years of acquisitions and data imports. Ingredient lists came from OCR scans of product labels, manual entry by different teams, and feeds from external data providers. The result was 140,000 distinct ingredient records, many of them duplicates of one another that differed only in spelling, character encoding, or punctuation.
This was not just a data hygiene problem. Users of the company's app depended on accurate allergen information to make safe purchasing decisions. When the same ingredient appeared under multiple names, the safe list logic could miss it entirely. Clinicians using the web portal saw different results than patients using the mobile app because the two platforms queried different subsets of the messy data.
The company brought Sequoia Applied Technologies in to fix the data at its source, not just paper over the symptoms with better search. They needed a single source of truth that both platforms could rely on.
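Measurable Outcomes
140,000 ingredient records reduced to 40,000 clean, deduplicated entries, served to both the mobile app and the web portal from a single dataset.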
Parse, Normalize, Standardize, Validate
Sequoia Applied Technologies started with an audit. The team sized the duplication volume, flagged character encoding problems, and pulled a representative sample to work from. The goal was to understand the shape of the mess before writing any cleanup code.
The pipeline worked in stages. Combined ingredient strings were split into individual tokens. Character encodings were normalized to a single standard, which eliminated a whole class of duplicates caused by OCR artifacts and encoding mismatches. The normalized tokens were then mapped to canonical INCI (International Nomenclature of Cosmetic Ingredients) names, the international standard for cosmetic ingredient naming, and each entry received a stable ID.
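The parse and normalize stages could look roughly like the sketch below. It is illustrative only and is not Sequoia's implementation: the function names, the choice of Unicode NFKC normalization, and the specific cleanup rules are all assumptions made for the example.

```python
import re
import unicodedata


def split_ingredients(label_text: str) -> list[str]:
    """Split a combined ingredient string on commas and semicolons,
    ignoring separators that appear inside parentheses."""
    tokens, current, depth = [], [], 0
    for ch in label_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch in ",;" and depth == 0:
            tokens.append("".join(current))
            current = []
        else:
            current.append(ch)
    tokens.append("".join(current))
    # Drop empty fragments left by trailing or doubled separators.
    return [t.strip() for t in tokens if t.strip()]


def normalize_token(token: str) -> str:
    """Reduce a token to a comparison key (assumed rules, not a standard)."""
    # NFKC folds compatibility forms such as full-width letters and
    # non-breaking spaces into their plain equivalents.
    text = unicodedata.normalize("NFKC", token)
    text = text.casefold()
    # Map Unicode hyphen and dash variants (U+2010 to U+2015) to ASCII "-".
    text = re.sub(r"[\u2010-\u2015]", "-", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Trim stray spaces, periods, and asterisks from the edges.
    return text.strip(" .*")
```

Two records that differ only in case, width, spacing, or trailing punctuation produce the same key, which is how apparent duplicates collapse before any name mapping takes place.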
Deterministic validation checks ran against the output before it went into production. The API layer exposed the cleaned data to both the mobile app and the web portal, so both platforms query the same underlying dataset. Patients and clinicians now see the same ingredient names, the same safe labels, and the same search results.
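The case study does not list the specific checks, so the following is only a sketch of what deterministic validation might involve. The record shape, the ID format, and the three rules (unique IDs, unique canonical names, each alias owned by exactly one ingredient) are assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalRecord:
    ingredient_id: str          # hypothetical stable ID format
    inci_name: str              # canonical display name
    aliases: tuple[str, ...]    # normalized spellings that map to this record


def validate(records: list[CanonicalRecord]) -> list[str]:
    """Return a list of problems; an empty list means the batch passes."""
    problems = []

    # Rule 1: every stable ID appears exactly once.
    id_counts = Counter(r.ingredient_id for r in records)
    problems += [f"duplicate id: {i}" for i, n in sorted(id_counts.items()) if n > 1]

    # Rule 2: no two records share a canonical name (case-insensitive).
    name_counts = Counter(r.inci_name.casefold() for r in records)
    problems += [
        f"duplicate canonical name: {name}"
        for name, n in sorted(name_counts.items())
        if n > 1
    ]

    # Rule 3: an alias may belong to only one ingredient.
    alias_owner: dict[str, str] = {}
    for r in records:
        if not r.inci_name.strip():
            problems.append(f"empty canonical name: {r.ingredient_id}")
        for alias in r.aliases:
            key = alias.casefold()
            owner = alias_owner.setdefault(key, r.ingredient_id)
            if owner != r.ingredient_id:
                problems.append(
                    f"alias '{key}' maps to both {owner} and {r.ingredient_id}"
                )
    return problems
```

Because the checks involve no randomness or fuzzy scoring, the same input always yields the same list of problems, which makes a failed release straightforward to reproduce and explain.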
How the Pipeline Works
Audit: Measure the duplication volume, flag character encoding problems, and choose a working sample. The audit determines the scope of the cleanup and identifies which categories of errors are most common.
Parse and normalize: Break combined ingredient strings into individual tokens. Normalize character encoding to a single standard. This step alone collapses many apparent duplicates into identical records.
Standardize: Map parsed tokens to canonical INCI names using a reference database. Assign stable IDs to each ingredient so downstream systems can reference them reliably even if display names change.
Validate and release: Run deterministic checks against the output. Expose the cleaned data through an API that serves both app and web. Provide release support and monitor for regressions.
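As a minimal sketch of the standardization step, the example below maps normalized tokens to canonical records through an alias table. The reference entries, the `ING-` ID format, and the function names are hypothetical; a real reference database would be far larger and would come from a maintained INCI source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ingredient:
    ingredient_id: str   # assigned once, never derived from the display name
    inci_name: str       # display name, free to change without breaking references


# Hypothetical reference data for illustration only.
REFERENCE = {
    "ING-000001": Ingredient("ING-000001", "Parfum"),
    "ING-000002": Ingredient("ING-000002", "Aqua"),
}
ALIAS_TO_ID = {
    "parfum": "ING-000001",
    "fragrance": "ING-000001",
    "aqua": "ING-000002",
    "water": "ING-000002",
}


def standardize(normalized_token: str) -> Optional[Ingredient]:
    """Look up a normalized token; return None when it is not recognized."""
    ingredient_id = ALIAS_TO_ID.get(normalized_token)
    return REFERENCE[ingredient_id] if ingredient_id else None


def standardize_all(tokens: list[str]) -> tuple[list[Ingredient], list[str]]:
    """Split tokens into matched records and unmatched leftovers for review."""
    matched, unmatched = [], []
    for token in tokens:
        hit = standardize(token)
        if hit is None:
            unmatched.append(token)
        else:
            matched.append(hit)
    return matched, unmatched
```

Keeping the ID separate from the name is the design choice that matters here: downstream systems store `ingredient_id`, so a later correction to `inci_name` does not invalidate saved safe lists. Unmatched tokens are returned rather than guessed at, leaving them for manual review.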
One Source of Truth for Allergen Decisions
The cleanup was not cosmetic. When ingredient data is riddled with duplicates, safe list logic fails in ways that are hard to debug and impossible to explain to users. A patient who marks "fragrance" as unsafe might still see products containing "parfum" because the system did not know they were the same thing.
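The fragrance and parfum failure can be reproduced in a few lines. This sketch compares a safe list that matches on raw names with one that matches on canonical IDs. The alias table and ID values are made up for the example and do not reflect the company's actual data.

```python
# Hypothetical alias table: several spellings resolve to one stable ID.
ALIAS_TO_ID = {
    "fragrance": "ING-000001",
    "parfum": "ING-000001",
    "aqua": "ING-000002",
    "glycerin": "ING-000003",
}


def flagged_by_name(product_ingredients: list[str], unsafe_names: list[str]) -> list[str]:
    """Naive check: flag an ingredient only if its name matches exactly."""
    unsafe = {name.casefold() for name in unsafe_names}
    return [i for i in product_ingredients if i.casefold() in unsafe]


def flagged_by_id(product_ingredients: list[str], unsafe_names: list[str]) -> list[str]:
    """ID-based check: resolve both sides to stable IDs before comparing."""
    unsafe_ids = {
        ALIAS_TO_ID[name.casefold()]
        for name in unsafe_names
        if name.casefold() in ALIAS_TO_ID
    }
    return [
        i for i in product_ingredients
        if ALIAS_TO_ID.get(i.casefold()) in unsafe_ids
    ]


product = ["Aqua", "Glycerin", "Parfum"]
print(flagged_by_name(product, ["fragrance"]))  # prints []
print(flagged_by_id(product, ["fragrance"]))    # prints ['Parfum']
```

The name-based check reports the product as safe even though it contains the ingredient the patient marked. The ID-based check catches it because both spellings resolve to the same record.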
After Sequoia Applied Technologies finished the pipeline, the company's app and web portal queried the same 40,000 clean ingredient records. Search performance improved because queries hit a smaller, denoised dataset. Safe labels became trustworthy because the underlying data was consistent.
The pipeline also created a foundation for future work. With stable IDs and INCI-aligned names, the company can now build features like cross-reactivity warnings, ingredient trend analysis, and clinician-specific views on data it can trust.
Common Questions About Ingredient Data Cleanup
How does Sequoia reduce ingredient data duplication?
Sequoia Applied Technologies parses combined ingredient strings into clean tokens, normalizes character sets to handle OCR noise and encoding inconsistencies, and maps results to canonical INCI names with stable IDs. Deterministic validation checks are applied before the cleaned data is exposed through the API. In this engagement, the pipeline reduced 140,000 unique ingredient records to 40,000 clean, deduplicated entries.
What is the output of the allergen management pipeline?
The output is a single source of truth for ingredient data, accessible across both app and web platforms. Each ingredient has an INCI-aligned name and a stable ID. Search performance improves because queries hit a cleaner, smaller dataset. Patients and clinicians see clear safe labels rather than inconsistent or duplicate entries.
Can we keep our current ingredient data provider?
Yes. Sequoia Applied Technologies works with your existing data feed. The pipeline enriches and cleans the data you already use rather than replacing the source. This approach minimizes disruption while delivering the data quality improvements needed for reliable allergen guidance.
Will patient data be processed in this pipeline?
The allergen management pipeline focuses on product data, not patient data. Patient information stays out of scope unless you request features that require it, in which case the appropriate privacy and compliance controls are designed into the system from the start.
How do you measure success for ingredient data cleanup?
Sequoia Applied Technologies tracks search success rate, duplicate ratio, time to safe decision for end users, and app store feedback trends related to allergen features. These metrics provide a clear picture of whether the data quality improvements are translating into better outcomes for patients and clinicians.
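The case study does not define how duplicate ratio is calculated. One plausible definition, assumed here, is the share of records that are redundant copies of another record. The sketch below applies that definition to the figures from this engagement, treating every removed record as a duplicate.

```python
def duplicate_ratio(total_records: int, distinct_ingredients: int) -> float:
    """Fraction of records that are redundant under the assumed definition:
    (total - distinct) / total."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if not 0 <= distinct_ingredients <= total_records:
        raise ValueError("distinct_ingredients must be between 0 and total_records")
    return (total_records - distinct_ingredients) / total_records


# Figures from this engagement: 140,000 records before, 40,000 after.
before = duplicate_ratio(140_000, 40_000)
print(f"{before:.1%}")  # prints 71.4%
```

Under that assumption, roughly 71% of the original records were redundant. Tracking the same ratio after each data import gives an early signal when a new feed starts reintroducing duplicates.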
What kind of companies does Sequoia work with on life sciences data problems?
Sequoia Applied Technologies is a Santa Clara, California software engineering firm that works with life sciences, healthcare, and product companies on data cleanup, mobile app development, cloud infrastructure, and AI integration. The firm has delivered similar work for clinical trial software, healthcare app rescues, and regulated data systems.