
Excel-to-Database Processing Pipeline

Case Study — Batch Data Processing

Role: Full-Stack Developer
Duration: 3 months
Team: 2 developers
Status: In production

The Challenge

A company's operations team was manually importing Excel spreadsheets into their database — copy-pasting rows, reformatting data, fixing inconsistencies, and running validation by hand. This process took 4-6 hours per import, happened 3-4 times per week, and was error-prone. Data inconsistencies from manual entry were causing downstream reporting issues.

The Approach

I designed a fully automated batch processing pipeline to replace the manual workflow:

  1. Analysis — Mapped all Excel formats, identified validation rules, documented edge cases
  2. Schema Design — Designed a normalized database schema that could handle all import variations with proper constraints
  3. Processing Pipeline — Built a Node.js batch processor: file upload, parsing (xlsx), validation, transformation, staged insert, confirmation
  4. Error Handling — Each row validates independently so one bad row doesn't block the entire batch. Errors are collected and reported with row numbers and specific failure reasons
  5. Admin Dashboard — Built a simple UI for the ops team: drag-and-drop upload, progress tracking, error review, and one-click confirmation
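The error handling in step 4 can be sketched as a small per-row validator that collects errors instead of throwing, so one bad row never aborts the batch. The names here (`RowError`, `validateRow`, the `sku`/`quantity` fields) are illustrative, not taken from the production code:

```typescript
// Illustrative row-level error report (all names are hypothetical)
interface RowError {
  rowNumber: number // 1-based row in the source spreadsheet
  field: string     // column that failed validation
  message: string   // specific failure reason, shown to the ops team
}

// Validate one row; collect errors rather than throwing,
// so one bad row never blocks the rest of the batch.
function validateRow(row: Record<string, unknown>, rowNumber: number): RowError[] {
  const errors: RowError[] = []
  if (typeof row.quantity !== 'number' || row.quantity < 0) {
    errors.push({ rowNumber, field: 'quantity', message: 'expected a non-negative number' })
  }
  if (typeof row.sku !== 'string' || row.sku.trim() === '') {
    errors.push({ rowNumber, field: 'sku', message: 'expected a non-empty string' })
  }
  return errors
}
```

Returning a (possibly empty) error array per row is what makes the row-numbered error report in the dashboard cheap to build.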

Technical Decisions

  • Streaming parser for large files (50K+ rows) to avoid memory issues
  • Staged inserts with transaction rollback for data integrity
  • Configurable validation rules so the ops team can adjust without code changes
  • Background job queue for processing so the UI stays responsive
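The configurable-rules decision can be sketched as data-driven rule specs that are compiled into validator functions at import time; because the specs are plain data (stored as JSON), the ops team can edit them without a code deploy. `RuleSpec` and `compileRule` are hypothetical names, not the production API:

```typescript
// Hypothetical data-driven rule config (names are illustrative).
// Specs are plain JSON, editable without a code change.
type RuleSpec =
  | { field: string; kind: 'required' }
  | { field: string; kind: 'pattern'; regex: string }
  | { field: string; kind: 'range'; min: number; max: number }

// Compile a spec into a validator: returns an error message, or null if the row passes.
function compileRule(spec: RuleSpec): (row: Record<string, unknown>) => string | null {
  switch (spec.kind) {
    case 'required':
      return row =>
        row[spec.field] == null || row[spec.field] === ''
          ? `${spec.field} is required`
          : null
    case 'pattern': {
      const re = new RegExp(spec.regex)
      return row =>
        typeof row[spec.field] === 'string' && re.test(row[spec.field] as string)
          ? null
          : `${spec.field} does not match ${spec.regex}`
    }
    case 'range':
      return row =>
        typeof row[spec.field] === 'number' &&
        (row[spec.field] as number) >= spec.min &&
        (row[spec.field] as number) <= spec.max
          ? null
          : `${spec.field} must be between ${spec.min} and ${spec.max}`
  }
}
```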

Code: Pipeline Architecture

The core processing pipeline — each row validates independently so one bad row never blocks the batch:

// Simplified batch processor pipeline
async function processBatch(file: Buffer, config: ImportConfig) {
  const rows = parseExcel(file, { streaming: true })
  const results: ProcessResult = { success: [], errors: [] }
  let rowOffset = 0 // track absolute row numbers across chunks

  for await (const chunk of batch(rows, 500)) {
    const validated = chunk.map((row, i) => {
      const errors = config.rules
        .map(rule => rule.validate(row))
        .filter(Boolean)
      return errors.length
        ? { row, index: rowOffset + i, errors }
        : { row, index: rowOffset + i, data: config.transform(row) }
    })
    rowOffset += chunk.length

    const good = validated.filter(r => 'data' in r)
    const bad = validated.filter(r => 'errors' in r)

    if (good.length > 0) {
      await db.transaction(async (tx) => {
        await tx.insert(records).values(
          good.map(r => r.data)
        )
      })
    }

    results.success.push(...good)
    results.errors.push(...bad)
  }

  return results // e.g. 49,847 rows inserted, 153 rows rejected with details
}
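The `batch()` helper used above isn't shown in the snippet. A minimal sketch, assuming it only needs to group a (possibly async) row stream into fixed-size chunks, could be an async generator like this:

```typescript
// Minimal chunking helper: groups an iterable or async iterable of rows
// into arrays of at most `size` items, yielding each chunk as it fills.
async function* batch<T>(
  rows: Iterable<T> | AsyncIterable<T>,
  size: number
): AsyncGenerator<T[]> {
  let chunk: T[] = []
  for await (const row of rows) {
    chunk.push(row)
    if (chunk.length === size) {
      yield chunk
      chunk = []
    }
  }
  if (chunk.length > 0) yield chunk // flush the final partial chunk
}
```

Yielding chunks lazily is what keeps memory flat on 50K+ row files: only one 500-row chunk is materialized at a time, regardless of file size.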

Results

  • 15 min average processing time (was 4-6 hours)
  • 99.8% data accuracy (was ~95%)
  • Zero manual data entry
  • 12 hrs saved per week

Key Takeaways

  • The hardest part wasn't the code — it was understanding all the edge cases in the Excel files
  • Streaming processing is essential once files exceed 10K rows
  • Showing row-level errors with context is worth the extra effort — it builds trust with the ops team
  • A simple drag-and-drop UI removes all friction from adoption

Need to automate a manual data workflow?

Let's Talk