Our projects in backend & data engineering
Backend & Data Engineering Case Studies
Scheduled Data Ingestion from 43 Sources for U.S. Public Records Company
Challenge:
43 data vendors, unstable delivery patterns, thousands of ingestion jobs, and billions of records. Files changed, failed, or arrived late.
Solution:
Automated ingestion system with per-vendor logic, continuous endpoint checks, retries, and file integrity verification (sketched below).
Result:
6,400+ jobs per year with a 0.09% failure rate. Billions of records ingested monthly.
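
For illustration, a minimal sketch of the retry-and-verify pattern behind those jobs, assuming an HTTP endpoint and a vendor-supplied SHA-256 checksum (both illustrative; the real per-vendor logic varied):

```python
import hashlib
import time
from pathlib import Path

import requests


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large vendor drops fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_with_retry(url: str, dest: Path, expected_sha256: str,
                     attempts: int = 5, backoff: float = 30.0) -> None:
    """Download one vendor file, verify integrity, and retry on any failure."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=120)
            resp.raise_for_status()
            dest.write_bytes(resp.content)
            if sha256_of(dest) == expected_sha256:
                return  # verified delivery
            raise IOError("checksum mismatch: file corrupt or truncated")
        except Exception:
            if attempt == attempts:
                raise  # let the scheduler record the job as failed
            time.sleep(backoff * attempt)  # linear backoff between retries
```
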
Monthly Data Collection Across 60+ Sources for U.S. Data Aggregator
Challenge:
Public data across 60+ websites, frequent website changes, inconsistent formats, and the need for a stable monthly feed for a new search feature.
Solution:
Modular ingestion with internal orchestration, continuous monitoring, normalization, state-level consolidation, and schema-stable CSV delivery (sketched below).
Result:
A reliable monthly dataset, a new search capability launched, and a pipeline that has run for 10+ years without breaking downstream systems.
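
A sketch of the schema-stable delivery idea: freeze the column list and fail loudly on drift. The column names below are illustrative, not the real feed:

```python
import csv
from pathlib import Path

# Frozen delivery schema: downstream systems rely on these exact columns in
# this exact order. (Column names here are illustrative assumptions.)
DELIVERY_SCHEMA = ["state", "entity_id", "name", "status", "updated_at"]


def write_monthly_feed(rows: list[dict], dest: Path) -> None:
    """Consolidate normalized records into a schema-stable CSV."""
    with dest.open("w", newline="", encoding="utf-8") as f:
        # extrasaction="raise" makes unexpected extra columns fail loudly
        writer = csv.DictWriter(f, fieldnames=DELIVERY_SCHEMA,
                                extrasaction="raise")
        writer.writeheader()
        writer.writerows(rows)


def verify_feed(dest: Path) -> None:
    """Refuse to ship a file whose header drifted from the frozen schema."""
    with dest.open(newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    if header != DELIVERY_SCHEMA:
        raise ValueError(f"schema drift detected: {header}")
```
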
Indian Voters ETL Pipeline for a Political Data & Consulting Firm
Challenge:
1 billion voter records, 22 local languages, and 36 input formats (PDFs & handwritten forms).
Solution:
OCR for text recognition, ML-based transliteration, and validation against postal data (sketched below).
Result:
Transliterated, searchable database in a single format.
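
A hedged sketch of the three steps, swapping in pytesseract for OCR and the rule-based indic-transliteration package where production used trained ML models; the postal lookup is hypothetical:

```python
import pytesseract
from PIL import Image
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate


def extract_text(scan_path: str) -> str:
    # OCR one scanned Devanagari form; production covered 22 languages with
    # per-script models, this sketch shows Hindi only.
    return pytesseract.image_to_string(Image.open(scan_path), lang="hin")


def to_latin(text: str) -> str:
    # Rule-based Devanagari-to-Latin (IAST) transliteration stands in here
    # for the trained ML transliteration model used in production.
    return transliterate(text, sanscript.DEVANAGARI, sanscript.IAST)


def pin_matches_district(record: dict, postal_index: dict[str, str]) -> bool:
    # Hypothetical validation: a record's PIN code must resolve, via postal
    # data, to the district the record claims.
    return postal_index.get(record.get("pin", "")) == record.get("district")
```
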
Data Ingestion & Processing of 150M Records for People Lookup Platform
Challenge:
A 150M-record dataset with a daily update of 1M records.
Solution:
Automated data pipeline with daily ingestion, unpacking, and validation. Data processing with Airflow and AWS Glue jobs in Scala (sketched below).
Result:
A searchable database in Snowflake cloud data storage.
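
A minimal Airflow sketch of the daily trigger, assuming Airflow 2.4+ and the Amazon provider's GlueJobOperator; the DAG id, Glue job name, and arguments are illustrative assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="daily_people_ingest",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Run the Scala Glue job that unpacks and validates the daily 1M-record
    # delta before it is loaded into Snowflake.
    process = GlueJobOperator(
        task_id="process_daily_delta",
        job_name="people_daily_delta",            # hypothetical Glue job name
        script_args={"--delivery_date": "{{ ds }}"},
    )
```
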
Web Data Collection for AI-powered Vehicle Parts Procurement Platform
Challenge:
Multiple automotive catalog sources, millions of potential API calls, evolving scope, strict onboarding deadlines, and tight budget constraints.
Solution:
Phased engagement with free loading module development, validated sample delivery, and controlled parallel data collection aligned with business priorities (sketched below).
Result:
313K+ records loaded. Budget-controlled rollout aligned with two customer onboarding milestones.
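
The controlled-collection step can be pictured as a concurrency- and budget-capped fetcher; the limits and the aiohttp client below are illustrative assumptions, not the production setup:

```python
import asyncio

import aiohttp

MAX_CONCURRENCY = 10   # keep parallelism polite and predictable (assumed)
CALL_BUDGET = 50_000   # hard cap agreed for one rollout phase (assumed)


async def collect(urls: list[str]) -> list[bytes]:
    """Fetch catalog pages in parallel without exceeding the call budget."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    results: list[bytes] = []

    async def fetch(session: aiohttp.ClientSession, url: str) -> None:
        async with sem:  # at most MAX_CONCURRENCY requests in flight
            async with session.get(url) as resp:
                results.append(await resp.read())

    async with aiohttp.ClientSession() as session:
        # Truncating to CALL_BUDGET enforces the per-phase spend ceiling.
        await asyncio.gather(*(fetch(session, u) for u in urls[:CALL_BUDGET]))
    return results
```
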
4B+ U.S. Voter and Mover Records ETL Pipeline for Identity Intelligence Company
Challenge:
Five disconnected voter and mover data sources. Billions of records. No shared format. High duplication risk. All of it built when big-data tooling was immature and hardware was expensive.
Solution:
A staged ETL pipeline with source-level logic, data standardization, identity resolution, and centralized Solr indexing (sketched below).
Result:
4+ billion records processed and unified. Five external sources integrated into one searchable dataset.
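
For illustration, a minimal sketch of the final indexing stage using the pysolr client; the core name and document shape are hypothetical, and identity resolution happens upstream:

```python
import pysolr

# Hypothetical Solr core; each resolved identity arrives as one dict.
solr = pysolr.Solr("http://localhost:8983/solr/identities", timeout=60)


def index_batch(identities: list[dict]) -> None:
    """Push a batch of unified identities into the central Solr index."""
    solr.add(identities)


def finish() -> None:
    solr.commit()  # make the indexed batch searchable
```
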
Entity Resolution for B2B Data Intelligence Platform
Challenge:
Unify records from three sources.
Solution:
A pipeline with probabilistic matching and normalization across name, contact, work, and education fields (sketched below).
Result:
A 76% dedupe rate at a ≥85% match threshold; 400M records matched in 40 minutes.
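
As a sketch of the pairwise scoring step only (at 400M records, production also needs blocking so not every pair is compared), a weighted probabilistic match with the ≥85% threshold might look like this; the fields and weights are illustrative assumptions:

```python
from difflib import SequenceMatcher

# Illustrative field weights; the production model weighted many more
# signals across name, contact, work, and education fields.
WEIGHTS = {"name": 0.4, "email": 0.3, "company": 0.2, "school": 0.1}
THRESHOLD = 0.85  # the ">=85% match" cut-off from the result above


def similarity(a: str, b: str) -> float:
    """String similarity on normalized (lowercased) values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted probabilistic score over the configured fields."""
    return sum(w * similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in WEIGHTS.items())


def is_duplicate(rec_a: dict, rec_b: dict) -> bool:
    return match_score(rec_a, rec_b) >= THRESHOLD
```
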
PDF Data Extraction with AI for a Leading B2B Data Intelligence Platform
Challenge:
Extract data from 18K scanned PDFs in 30 formats.
Solution:
A Gemini model on Vertex AI trained for pattern-based extraction. Automated data loading and processing with Airflow DAGs in Python (sketched below).
Result:
Text files with business contact data.
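
A hedged sketch of the extraction call via the Vertex AI Python SDK; the project, region, model id, and prompt are assumptions, and the trained production model and Airflow wiring are omitted:

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Hypothetical project and region.
vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")  # illustrative model id


def extract_contacts(pdf_bytes: bytes) -> str:
    """Ask the model to pull contact fields out of one scanned PDF."""
    pdf = Part.from_data(pdf_bytes, mime_type="application/pdf")
    prompt = ("Extract every business contact as one line of "
              "'name<TAB>company<TAB>phone<TAB>email'.")
    return model.generate_content([pdf, prompt]).text
```
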
Legacy Address Parsing System Modernization for People Lookup Platform
Challenge:
Modernize an address parsing, verification, and cleaning system that struggled under high traffic and a 1M-record dataset.
Solution:
System migration from MSSQL to Redis. Pre-compiled MSSQL queries. New address parsing algorithms. Internal caching, dedupe, and indexing (sketched below).
Result:
A stable, easily scalable system for 1M+ records: 2x faster data processing, a 40% reduction in processing traffic, and a 12% error reduction.
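
A minimal sketch of the Redis-backed caching and dedupe idea, with a stand-in parser; the key format and TTL are illustrative assumptions:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def parse_address(raw: str) -> dict:
    # Stand-in for the new parsing algorithms; the real parser is omitted.
    parts = raw.split(",")
    return {"street": parts[0].strip(), "rest": ",".join(parts[1:]).strip()}


def parse_cached(raw: str) -> dict:
    """Serve repeat addresses from Redis instead of re-parsing them."""
    key = "addr:" + raw.strip().lower()   # normalized key dedupes variants
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    parsed = parse_address(raw)
    r.set(key, json.dumps(parsed), ex=86_400)  # 24h TTL, illustrative
    return parsed
```
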