Every spreadsheet that survives long enough becomes a database. The signs are unmistakable: multiple people editing simultaneously and stepping on each other's changes, VLOOKUP chains spanning five sheets because data is duplicated everywhere, a row count approaching Excel's 1M limit, or someone asking "can we run a report that..." and the answer requires a pivot table of a pivot table.
The migration from spreadsheet to database isn't primarily a technical challenge — the tools exist and work well. The challenge is translating between two fundamentally different data models. Spreadsheets are flexible: any cell can contain any type, columns can be added anywhere, and "schema" is whatever the person typing decided. Databases are rigid: every column has a type, every row must conform, and the schema is defined before data enters. This rigidity is the entire point — it prevents the data quality issues that made the spreadsheet unmanageable — but it means the migration requires decisions about types, constraints, and structure that the spreadsheet never forced you to make.
CSV as the Bridge Format
Regardless of your source spreadsheet format (XLSX, XLS, ODS, Google Sheets), the migration typically flows through CSV:
XLSX/ODS/Google Sheets → CSV → Database Import Tool → PostgreSQL/MySQL/SQLite

Why CSV as the intermediate? Every database has a native CSV import tool. PostgreSQL has \COPY, MySQL has LOAD DATA, SQLite has .import. These tools are fast, well-tested, and handle the mechanics of bulk loading. Going directly from XLSX to database is possible (some tools like DBeaver support it), but CSV gives you an inspection point: you can open the CSV, verify the data looks correct, and fix issues before they hit the database.
Convert XLSX to CSV or convert ODS to CSV as the first step. For Google Sheets, export as CSV from File > Download. Then inspect the CSV for the common pitfalls before importing: encoding issues, date formats, leading zeros, and empty fields.
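That pre-import inspection can be scripted; a minimal sketch using only Python's standard library (the inspect_csv helper and its heuristics are illustrative, not a standard tool):

```python
import csv
import io
import re

def inspect_csv(text, sample_rows=100):
    """Flag common pre-import pitfalls: leading zeros, ambiguous
    slash-dates, and empty fields (illustrative heuristics)."""
    issues = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    for row_num, row in enumerate(reader, start=2):
        if row_num - 1 > sample_rows:
            break
        for col, value in zip(header, row):
            if re.fullmatch(r'0\d+', value):
                issues.append((row_num, col, 'leading zero: ' + value))
            if re.fullmatch(r'\d{1,2}/\d{1,2}/\d{2,4}', value):
                issues.append((row_num, col, 'ambiguous date: ' + value))
            if value == '':
                issues.append((row_num, col, 'empty field: NULL or empty string?'))
    return issues

sample = 'zip,signup,email\n02134,3/19/2026,a@b.com\n90210,,c@d.com\n'
for row, col, msg in inspect_csv(sample):
    print(row, col, msg)
```

Running it on the sample flags the zero-padded ZIP, the ambiguous date, and the empty field before any of them reach the database.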
Type Inference: The Hardest Problem
CSV carries no type information. The string 42 could be an integer, a float, a string, or a ZIP code. The string 2026-03-19 could be a date or a product code. The string TRUE could be a boolean or someone's surname.
Database import tools try to infer types, and they frequently guess wrong. The common failures:
- ZIP codes become integers. 02134 → 2134. The fix: declare the column as VARCHAR/TEXT before import.
- Mixed-type columns default to TEXT. A column that's 99% numeric with one text value ("N/A" in row 5,000) gets typed as TEXT. The fix: clean the data first, replacing "N/A" with NULL.
- Dates in ambiguous formats. 01/02/03 — is this January 2, 2003? February 1, 2003? 2001-02-03? The fix: standardize dates to ISO 8601 (YYYY-MM-DD) before import.
- Booleans as strings. CSV has no boolean type. "True", "true", "TRUE", "Yes", "yes", "Y", "1" are all common representations. The fix: standardize to a consistent value and let the database CAST to boolean.
The safest approach: define your schema explicitly before importing, rather than relying on auto-detection.
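The ZIP-code failure is easy to reproduce, and the fix mirrors declaring the column as VARCHAR; a sketch with pandas (column names are illustrative):

```python
import io
import pandas as pd

csv_data = 'customer_code,zip\n7,02134\n8,10001\n'

# Default inference: the zip column is read as int64, dropping the leading zero.
inferred = pd.read_csv(io.StringIO(csv_data))
print(inferred['zip'].tolist())        # [2134, 10001]

# Explicit string dtypes, the CSV-side equivalent of declaring VARCHAR.
explicit = pd.read_csv(io.StringIO(csv_data),
                       dtype={'customer_code': str, 'zip': str})
print(explicit['zip'].tolist())        # ['02134', '10001']
```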
Defining the Schema First
Write the CREATE TABLE statement before running the import. Examine the spreadsheet and decide, column by column:
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
customer_code VARCHAR(20) NOT NULL, -- was 'ID' column, leading zeros
name VARCHAR(255) NOT NULL,
email VARCHAR(255),
phone VARCHAR(20), -- NOT integer, preserve formatting
signup_date DATE, -- standardize format first
revenue DECIMAL(10,2), -- NOT float, money needs precision
is_active BOOLEAN DEFAULT true,
notes TEXT -- unbounded text
);

Key decisions in this schema:
- customer_code is VARCHAR, not INTEGER. Even if it looks numeric, identifiers should be strings. You don't do arithmetic on customer codes.
- revenue is DECIMAL, not FLOAT. Floating-point arithmetic produces rounding errors (0.1 + 0.2 = 0.30000000000000004). For money, use DECIMAL or NUMERIC.
- phone is VARCHAR. Phone numbers have leading zeros, country codes with +, and formatting characters. Never store as integer.
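The DECIMAL-vs-FLOAT point can be checked directly in Python, whose float is the same IEEE 754 double most databases use for FLOAT:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so sums drift.
print(0.1 + 0.2)                      # 0.30000000000000004
print(0.1 + 0.2 == 0.3)               # False

# DECIMAL/NUMERIC semantics: exact base-10 arithmetic.
print(Decimal('0.10') + Decimal('0.20'))                      # 0.30
print(Decimal('0.10') + Decimal('0.20') == Decimal('0.30'))   # True
```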
NULL vs. Empty String: The Eternal Ambiguity
In spreadsheets, an empty cell is just empty. In databases, there's a meaningful distinction between NULL (unknown/missing) and '' (empty string, deliberately blank). CSV conflates the two: an empty field (a,,c) could mean either.
This matters for queries. WHERE email IS NOT NULL still includes empty strings. WHERE email != '' excludes NULLs as well as empties, because any comparison with NULL evaluates to unknown. If you don't handle the distinction during import, your query results will be wrong.
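The difference shows up in any SQL engine; a minimal demonstration with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (email TEXT)')
# One real address, one empty string, one NULL.
conn.executemany('INSERT INTO t VALUES (?)', [('a@b.com',), ('',), (None,)])

not_null = conn.execute("SELECT COUNT(*) FROM t WHERE email IS NOT NULL").fetchone()[0]
not_empty = conn.execute("SELECT COUNT(*) FROM t WHERE email != ''").fetchone()[0]
both = conn.execute(
    "SELECT COUNT(*) FROM t WHERE email IS NOT NULL AND email != ''").fetchone()[0]

print(not_null, not_empty, both)   # 2 1 1
```

Only the combined predicate isolates the genuinely present address, which is why the NULL-vs-empty decision has to be made at import time.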
The standard approaches:
- PostgreSQL \COPY: With CSV format, an unquoted empty field is read as NULL by default, while a quoted empty string ("") is read as ''. Use FORCE_NULL (column1, column2) to treat quoted empty values as NULL in specific columns.
- MySQL LOAD DATA: An empty field becomes '' for string columns and 0 for numeric columns; only \N is read as NULL. Convert empties explicitly with a SET clause, e.g. SET email = NULLIF(@email, '').
- Pandas to_sql: df.to_sql() maps NaN to NULL. Replace empty strings with NaN before import: df.replace('', pd.NA, inplace=True).
The decision depends on your data semantics. A blank email field probably means "no email provided" (NULL). A blank notes field might mean "no notes" (empty string) or "notes not captured" (NULL). Make this decision per-column and enforce it during the import.
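One way to enforce a per-column decision during a pandas import (the column semantics here are illustrative):

```python
import io
import pandas as pd

csv_data = 'name,email,notes\nAda,,\nBob,bob@x.com,called twice\n'

# keep_default_na=False reads empty fields as '' instead of NaN,
# so the NULL-vs-empty decision stays explicit and per-column.
df = pd.read_csv(io.StringIO(csv_data), keep_default_na=False)

# Blank email means "not provided" -> NULL; blank notes means "no notes" -> keep ''.
df['email'] = df['email'].replace('', pd.NA)

print(df['email'].isna().tolist())   # [True, False]
print(df['notes'].tolist())          # ['', 'called twice']
```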
Date Format Standardization
Spreadsheet dates are a minefield. A single spreadsheet might contain dates in multiple formats: 3/19/2026 (US), 19/03/2026 (European), March 19, 2026 (human-readable), and 46100 (Excel serial number). All of these need to become 2026-03-19 (ISO 8601) before database import.
Steps for reliable date migration:
- Identify all date columns in the spreadsheet. Check for date-like values in supposedly non-date columns (text fields that contain dates, etc.).
- Determine the source format for each column. Is it M/D/YYYY? D/M/YYYY? YYYY-MM-DD? Excel serial number? Mixed?
- Convert to ISO 8601 before export. In Excel: format the column as YYYY-MM-DD before saving as CSV. In pandas: pd.to_datetime(df['date'], format='%m/%d/%Y').dt.strftime('%Y-%m-%d').
- Handle timezone information if present. Decide whether to store as DATE (date only), TIMESTAMP (date + time, no timezone), or TIMESTAMPTZ (date + time + timezone). For most business data, TIMESTAMPTZ in UTC is safest.
Common gotcha: Excel's date serial numbers are different on Mac vs. Windows. Windows uses the 1900 date system (1 = January 1, 1900). Mac historically used the 1904 date system (0 = January 1, 1904). The same serial number produces a date 4 years and 1 day apart. Check the workbook's date system in File > Options > Advanced.
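Both date systems can be handled with pandas' origin parameter; a sketch (the serial value is illustrative):

```python
import pandas as pd

serial = 46100  # an Excel date serial found in a CSV export

# 1900 system (Windows default). The 1899-12-30 origin compensates for
# Excel's phantom 1900-02-29, so it is exact for serials >= 61.
d1900 = pd.to_datetime(serial, unit='D', origin='1899-12-30')

# 1904 system (legacy Mac workbooks): serial 0 = 1904-01-01.
d1904 = pd.to_datetime(serial, unit='D', origin='1904-01-01')

print(d1900.date())            # 2026-03-19
print((d1904 - d1900).days)    # 1462 days: the "4 years and 1 day" gap
```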
Schema Design from Spreadsheet Structure
Spreadsheets encourage denormalization — everything in one flat table. Databases reward normalization — splitting data into related tables with foreign keys. The migration is an opportunity to fix structural problems.
Common spreadsheet patterns that should become separate tables:
- Repeating groups. Columns like product_1, qty_1, price_1, product_2, qty_2, price_2 should become an order_items table with one row per item.
- Lookup values. A "Status" column with values "Active", "Inactive", "Pending" should become a statuses reference table with a foreign key.
- Duplicated data. Customer name and address repeated on every order row should be a customers table joined to an orders table.
- Multi-valued cells. A cell containing "tag1, tag2, tag3" should become a tags junction table.
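The repeating-groups case can often be unpivoted mechanically; a sketch with pandas wide_to_long (the column layout is illustrative):

```python
import pandas as pd

# Flat spreadsheet rows with repeating product/qty groups.
orders = pd.DataFrame({
    'order_id':  [1, 2],
    'product_1': ['widget', 'bolt'],
    'qty_1':     [3, 10],
    'product_2': ['gear', None],
    'qty_2':     [1.0, None],
})

# Reshape to one row per line item: the future order_items table.
items = (pd.wide_to_long(orders, stubnames=['product', 'qty'],
                         i='order_id', j='item', sep='_')
         .reset_index()
         .dropna(subset=['product']))   # drop unused repeating slots

print(items[['order_id', 'item', 'product', 'qty']].to_string(index=False))
```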
The normalization decision is a trade-off. Fully normalized schemas are cleaner but require joins for every query. For analytics, a denormalized (or star schema) approach may be better. For operational data, normalization prevents update anomalies and enforces consistency.
Import Tools: pgloader, DBeaver, and Programmatic Options
PostgreSQL — pgloader:
pgloader is the most powerful tool for CSV-to-PostgreSQL migration. It handles type casting, data transformation, and error handling in a single configuration:
LOAD CSV
FROM 'data.csv' (customer_code, name, email, signup_date, revenue)
INTO postgresql:///mydb?customers
WITH truncate,
fields terminated by ',',
fields optionally enclosed by '"'
SET client_encoding to 'UTF-8'
BEFORE LOAD DO
$$ CREATE TABLE IF NOT EXISTS customers (...) $$;

DBeaver (GUI):
DBeaver's Data Import wizard reads CSV, XLSX, and JSON files directly. It auto-detects types (with manual override), maps columns to existing tables or creates new ones, and handles encoding. Best for one-time imports where the visual preview is valuable.
Python pandas:
import pandas as pd
from sqlalchemy import create_engine
df = pd.read_csv('data.csv', dtype={'customer_code': str, 'phone': str})
df['signup_date'] = pd.to_datetime(df['signup_date'])
df.replace('', pd.NA, inplace=True)
engine = create_engine('postgresql://user:pass@localhost/mydb')
df.to_sql('customers', engine, if_exists='replace', index=False)

pandas to_sql is convenient but slow for large datasets (by default it issues one INSERT per row). For large imports, use method='multi' for batch inserts, or COPY-based loading via psycopg2's cursor.copy_expert().
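Because to_sql also accepts a plain sqlite3 connection, the same flow can be rehearsed against an in-memory database before pointing it at PostgreSQL; a sketch:

```python
import io
import sqlite3
import pandas as pd

csv_data = 'customer_code,revenue\n00042,19.99\n00043,250.00\n'
df = pd.read_csv(io.StringIO(csv_data), dtype={'customer_code': str})

# to_sql accepts a DBAPI sqlite3 connection as well as a SQLAlchemy engine,
# handy for a dry run of the load logic.
conn = sqlite3.connect(':memory:')
df.to_sql('customers', conn, if_exists='replace', index=False)

back = pd.read_sql('SELECT customer_code, revenue FROM customers', conn)
print(back['customer_code'].tolist())   # ['00042', '00043'], leading zeros intact
```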
Data Cleaning During Migration
Migration is the right time to clean data, because you're already touching every row. The common cleaning operations:
- Trim whitespace. Leading and trailing spaces cause join failures and duplicate groups. df.columns = df.columns.str.strip() for headers; df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x) for values.
- Standardize case. "United States", "united states", "UNITED STATES" should be one value. Decide on a convention (title case for display, lowercase for matching) and apply it.
- Remove duplicates. Spreadsheets accumulate duplicates from copy-paste operations. df.drop_duplicates(subset=['customer_code']) before import.
- Validate constraints. Before importing, check that required fields aren't empty, email addresses match a pattern, dates are within expected ranges, and foreign keys reference valid values. It's much easier to fix a CSV than to clean up a half-imported database.
- Handle encoding issues. Read the CSV with explicit encoding. If you see mojibake (garbled characters), the encoding is wrong. Try UTF-8, UTF-8-BOM, cp1252, then latin-1.
Generate the clean CSV, convert to SQL INSERT statements for review, then import into the database. Or clean in pandas and write directly to the database via to_sql().
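Several of these cleaning steps can be combined into one short pandas pass (the data and the email pattern are illustrative):

```python
import io
import pandas as pd

csv_data = ('customer_code,name,email\n'
            ' 001 ,Ada Lovelace,ada@example.com\n'
            '002,  Bob  ,not-an-email\n'
            '002,  Bob  ,not-an-email\n')

df = pd.read_csv(io.StringIO(csv_data), dtype=str)

# Trim whitespace in headers and values.
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip())

# Remove duplicates on the business key.
df = df.drop_duplicates(subset=['customer_code'])

# Validate before import: flag rows whose email fails a simple pattern.
bad = df[~df['email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')]
print(len(df), len(bad))   # 2 rows survive dedup, 1 fails validation
```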
ETL Patterns for Ongoing Spreadsheet Imports
If spreadsheet-to-database is a recurring task (weekly reports, monthly data dumps), build an ETL (Extract, Transform, Load) pipeline:
- Extract: Convert XLSX to CSV or read directly with pandas/openpyxl. For Google Sheets, use the API.
- Transform: Clean types, standardize dates, handle NULLs, validate constraints. Log every transformation for auditing.
- Load: Truncate-and-reload for full refreshes, or upsert (INSERT ... ON CONFLICT DO UPDATE in PostgreSQL) for incremental updates.
For recurring imports, invest in schema validation. Define expected column names, types, and constraints in a config file. Before import, the pipeline validates the spreadsheet against the config and rejects files that don't conform. This catches upstream changes (someone renamed a column, changed a date format, added a new status value) before they corrupt the database.
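A minimal version of such a validation gate, with the expected schema inlined here rather than in a config file (the column contract is illustrative):

```python
import pandas as pd

# Expected contract for the recurring file; in practice this lives in config.
EXPECTED = {
    'customer_code': 'object',
    'signup_date': 'datetime64[ns]',
    'revenue': 'float64',
}

def validate(df, expected=EXPECTED):
    """Return a list of errors where columns or dtypes drifted from the contract."""
    errors = []
    missing = set(expected) - set(df.columns)
    extra = set(df.columns) - set(expected)
    if missing:
        errors.append(f'missing columns: {sorted(missing)}')
    if extra:
        errors.append(f'unexpected columns: {sorted(extra)}')
    for col, dtype in expected.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            errors.append(f'{col}: expected {dtype}, got {df[col].dtype}')
    return errors

good = pd.DataFrame({'customer_code': ['A1'],
                     'signup_date': pd.to_datetime(['2026-03-19']),
                     'revenue': [19.99]})
print(validate(good))                                     # []
print(validate(good.rename(columns={'revenue': 'rev'})))  # two errors
```

Rejecting a nonconforming file before the Load step is what keeps a renamed column or changed date format from silently corrupting the database.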
Key tools: great_expectations or pandera in Python for data validation, dbt for SQL-based transformations, and simple cron + bash scripts for lightweight recurring imports.
Migrating from spreadsheet to database is a one-way door — once the data is structured, typed, and constrained, going back to the spreadsheet chaos feels wrong. The migration effort pays off immediately in query performance, data consistency, and the ability to answer questions that were impossible with VLOOKUP.
The common mistake is treating the migration as purely technical: export CSV, import to database, done. The real work is the schema design (what types, what constraints, how many tables) and the data cleaning (standardize dates, handle NULLs, fix encoding). Spend 80% of your time on those decisions and 20% on the import mechanics. A well-designed schema with clean data is worth more than any tool or optimization.