
Excel-to-Database Processing Pipeline

Case Study — Batch Data Processing

Role: Full-Stack Developer
Duration: 3 months
Team: 2 developers
Status: In production

The Challenge

A company's operations team was manually importing Excel spreadsheets into their database — copy-pasting rows, reformatting data, fixing inconsistencies, and running validation by hand. This process took 4-6 hours per import, happened 3-4 times per week, and was error-prone. Data inconsistencies from manual entry were causing downstream reporting issues.

The Approach

I designed a fully automated batch processing pipeline to replace the manual workflow:

  1. Analysis — Mapped all Excel formats, identified validation rules, documented edge cases
  2. Schema Design — Designed a normalized database schema that could handle all import variations with proper constraints
  3. Processing Pipeline — Built a Node.js batch processor: file upload, parsing (xlsx), validation, transformation, staged insert, confirmation
  4. Error Handling — Each row validates independently so one bad row doesn't block the entire batch. Errors are collected and reported with row numbers and specific failure reasons
  5. Admin Dashboard — Built a simple UI for the ops team: drag-and-drop upload, progress tracking, error review, and one-click confirmation
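The error handling in step 4 can be sketched as a small per-row validator that collects errors instead of throwing, so one bad row never aborts the batch. The names here (`RowError`, `validateRow`, the `sku`/`quantity` fields) are illustrative, not taken from the production code:

```typescript
// Illustrative row-level error report (all names are hypothetical)
interface RowError {
  rowNumber: number // 1-based row in the source spreadsheet
  field: string     // column that failed validation
  message: string   // specific failure reason, shown to the ops team
}

// Validate one row; collect errors rather than throwing,
// so one bad row never blocks the rest of the batch.
function validateRow(row: Record<string, unknown>, rowNumber: number): RowError[] {
  const errors: RowError[] = []
  if (typeof row.quantity !== 'number' || row.quantity < 0) {
    errors.push({ rowNumber, field: 'quantity', message: 'expected a non-negative number' })
  }
  if (typeof row.sku !== 'string' || row.sku.trim() === '') {
    errors.push({ rowNumber, field: 'sku', message: 'expected a non-empty string' })
  }
  return errors
}
```

Returning a (possibly empty) error array per row is what makes the row-numbered error report in the dashboard cheap to build.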

Technical Decisions

  • Streaming parser for large files (50K+ rows) to avoid memory issues
  • Staged inserts with transaction rollback for data integrity
  • Configurable validation rules so the ops team can adjust without code changes
  • Background job queue for processing so the UI stays responsive
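The configurable-rules decision can be sketched as data-driven rule specs that are compiled into validator functions at import time; because the specs are plain data (stored as JSON), the ops team can edit them without a code deploy. `RuleSpec` and `compileRule` are hypothetical names, not the production API:

```typescript
// Hypothetical data-driven rule config (names are illustrative).
// Specs are plain JSON, editable without a code change.
type RuleSpec =
  | { field: string; kind: 'required' }
  | { field: string; kind: 'pattern'; regex: string }
  | { field: string; kind: 'range'; min: number; max: number }

// Compile a spec into a validator: returns an error message, or null if the row passes.
function compileRule(spec: RuleSpec): (row: Record<string, unknown>) => string | null {
  switch (spec.kind) {
    case 'required':
      return row =>
        row[spec.field] == null || row[spec.field] === ''
          ? `${spec.field} is required`
          : null
    case 'pattern': {
      const re = new RegExp(spec.regex)
      return row =>
        typeof row[spec.field] === 'string' && re.test(row[spec.field] as string)
          ? null
          : `${spec.field} does not match ${spec.regex}`
    }
    case 'range':
      return row =>
        typeof row[spec.field] === 'number' &&
        (row[spec.field] as number) >= spec.min &&
        (row[spec.field] as number) <= spec.max
          ? null
          : `${spec.field} must be between ${spec.min} and ${spec.max}`
  }
}
```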

Code: Pipeline Architecture

The core processing pipeline — each row validates independently so one bad row never blocks the batch:

// Simplified batch processor pipeline
async function processBatch(file: Buffer, config: ImportConfig) {
  const rows = parseExcel(file, { streaming: true })
  const results: ProcessResult = { success: [], errors: [] }
  let rowOffset = 0 // track absolute row numbers across chunks

  for await (const chunk of batch(rows, 500)) {
    const validated = chunk.map((row, i) => {
      const errors = config.rules
        .map(rule => rule.validate(row))
        .filter(Boolean)
      return errors.length
        ? { row, index: rowOffset + i, errors }
        : { row, index: rowOffset + i, data: config.transform(row) }
    })
    rowOffset += chunk.length

    const good = validated.filter(r => 'data' in r)
    const bad = validated.filter(r => 'errors' in r)

    if (good.length > 0) {
      await db.transaction(async (tx) => {
        await tx.insert(records).values(
          good.map(r => r.data)
        )
      })
    }

    results.success.push(...good)
    results.errors.push(...bad)
  }

  return results // e.g. 49,847 rows inserted, 153 rows rejected with details
}
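The `batch()` helper used above isn't shown in the snippet. A minimal sketch, assuming it only needs to group a (possibly async) row stream into fixed-size chunks, could be an async generator like this:

```typescript
// Minimal chunking helper: groups an iterable or async iterable of rows
// into arrays of at most `size` items, yielding each chunk as it fills.
async function* batch<T>(
  rows: Iterable<T> | AsyncIterable<T>,
  size: number
): AsyncGenerator<T[]> {
  let chunk: T[] = []
  for await (const row of rows) {
    chunk.push(row)
    if (chunk.length === size) {
      yield chunk
      chunk = []
    }
  }
  if (chunk.length > 0) yield chunk // flush the final partial chunk
}
```

Yielding chunks lazily is what keeps memory flat on 50K+ row files: only one 500-row chunk is materialized at a time, regardless of file size.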

Results

  • 15 min average processing time (was 4-6 hours)
  • 99.8% data accuracy (was ~95%)
  • Zero manual data entry
  • 12 hrs saved per week

Key Takeaways

  • The hardest part wasn't the code — it was understanding all the edge cases in the Excel files
  • Streaming processing is essential once files exceed 10K rows
  • Showing row-level errors with context is worth the extra effort — it builds trust with the ops team
  • A simple drag-and-drop UI removes all friction from adoption

Need to automate a manual data workflow?

Let's Talk