PDF text extraction has two failure modes: (1) the PDF has selectable text, but the layout confuses parsers (multi-column papers, for example), and (2) the PDF is a scanned image with no text layer. Pure-Go libraries handle case 1 with limitations and cannot handle case 2 at all. pdftotext is more robust on case 1, and the ChangeThisFile API adds an OCR fallback for case 2.

Method 1: ledongthuc/pdf (pure Go)

ledongthuc/pdf is a widely used pure-Go PDF reader. It has no native dependencies, so it runs on Lambda, Alpine, or anywhere else Go runs.

go get github.com/ledongthuc/pdf

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"

    "github.com/ledongthuc/pdf"
)

func pdfToText(inPath, outPath string) error {
    f, r, err := pdf.Open(inPath)
    if err != nil {
        return fmt.Errorf("open: %w", err)
    }
    defer f.Close()

    var buf bytes.Buffer
    b, err := r.GetPlainText()
    if err != nil {
        return fmt.Errorf("text: %w", err)
    }
    if _, err := io.Copy(&buf, b); err != nil {
        return err
    }

    return os.WriteFile(outPath, buf.Bytes(), 0o644)
}

func main() {
    if err := pdfToText("document.pdf", "document.txt"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}

Three things to know:

  • It only reads selectable text. Scanned PDFs (image-only) return empty strings — there's no OCR.
  • Output order is not visual order. Text comes out in the order it is stored in each page's content stream, so multi-column papers may interleave columns and footnotes may end up mid-paragraph.
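  • Encrypted PDFs need a password. pdf.Open does not take one; open the file yourself and call pdf.NewReaderEncrypted(f, size, pw), where pw is a func() string that supplies a password on each attempt. Have it return "" once you have nothing left to try, otherwise it keeps being called.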

Method 2: pdftotext via os/exec (highest quality)

Poppler's pdftotext CLI handles complex layouts (multi-column, tables, footnotes) much better than any pure-Go library. It's the standard tool for PDF text extraction.

apt install poppler-utils  # provides pdftotext
# macOS: brew install poppler

package main

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "time"
)

func pdfToText(ctx context.Context, inPath, outPath string) error {
    // -layout preserves visual layout (columns, tables)
    // -nopgbrk omits form-feed characters between pages
    cmd := exec.CommandContext(ctx,
        "pdftotext",
        "-layout",
        "-nopgbrk",
        inPath, outPath,
    )

    out, err := cmd.CombinedOutput()
    if err != nil {
        return fmt.Errorf("pdftotext: %w (%s)", err, out)
    }
    return nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
    defer cancel()

    if err := pdfToText(ctx, "document.pdf", "document.txt"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}

Useful flags:

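  • -layout: preserves the physical layout, so columns and tables stay side by side on each line. Without it, pdftotext undoes the layout and emits text in reading order, which is often what you want for multi-column prose but loses table alignment.
  • -nopgbrk: omits the form-feed (\f) character between pages. Easier to grep.
  • -raw: content-stream order (no layout analysis). It sometimes undoes column formatting, but the order is whatever the PDF producer wrote, so check the result before relying on it.
  • -table: a table mode found in Xpdf's pdftotext. The Poppler build shipped in poppler-utils does not have it; with Poppler, -layout is the closest option for tables.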
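
If you need per-page text, leave -nopgbrk off and split on the form feed instead. A minimal sketch, assuming pdftotext ends each page (usually including the last) with \f; splitPages is a hypothetical helper name.

```go
package main

import (
    "fmt"
    "strings"
)

// splitPages splits pdftotext output produced WITHOUT -nopgbrk into one
// string per page. A trailing form feed is trimmed first so it does not
// produce an empty final page.
func splitPages(text string) []string {
    text = strings.TrimSuffix(text, "\f")
    if text == "" {
        return nil
    }
    return strings.Split(text, "\f")
}

func main() {
    pages := splitPages("page one\n\fpage two\n\f")
    fmt.Println(len(pages)) // 2
    for i, p := range pages {
        fmt.Printf("page %d: %q\n", i+1, p)
    }
}
```

A blank page in the middle of a document shows up as an empty string, which keeps page numbers aligned with the PDF.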

Method 3: ChangeThisFile API (with OCR fallback)

If your PDFs include scanned documents (image-only, no text layer), neither of the above works. The API runs pdftotext first; if the result is empty or near-empty, it falls back to OCR. Free tier covers 1,000 conversions/month.

package main

import (
    "bytes"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
    "time"
)

const apiKey = "ctf_sk_your_key_here"

func pdfToText(inPath, outPath string) error {
    body := &bytes.Buffer{}
    w := multipart.NewWriter(body)

    f, err := os.Open(inPath)
    if err != nil {
        return err
    }
    defer f.Close()

    fw, err := w.CreateFormFile("file", "input.pdf")
    if err != nil {
        return err
    }
    if _, err := io.Copy(fw, f); err != nil {
        return err
    }
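    if err := w.WriteField("source", "pdf"); err != nil {
        return err
    }
    if err := w.WriteField("target", "txt"); err != nil {
        return err
    }
    // Close writes the trailing multipart boundary; the body is invalid without it.
    if err := w.Close(); err != nil {
        return err
    }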

    req, err := http.NewRequest("POST", "https://changethisfile.com/v1/convert", body)
    if err != nil {
        return err
    }
    req.Header.Set("Authorization", "Bearer "+apiKey)
    req.Header.Set("Content-Type", w.FormDataContentType())

    client := &http.Client{Timeout: 120 * time.Second}
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        msg, _ := io.ReadAll(resp.Body)
        return fmt.Errorf("api %d: %s", resp.StatusCode, msg)
    }

    out, err := os.Create(outPath)
    if err != nil {
        return err
    }
    defer out.Close()
    _, err = io.Copy(out, resp.Body)
    return err
}

func main() {
    if err := pdfToText("scanned-receipt.pdf", "receipt.txt"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}

For PDFs with selectable text the API uses pdftotext (instant). For scanned PDFs, it runs Tesseract OCR which adds 5-30s depending on page count. Set the client timeout to 120s minimum.

When to use each

Approach              | Best for                                         | Tradeoff
ledongthuc/pdf        | Single-binary deploys, simple text-layer PDFs    | No OCR, layout-fidelity gaps on complex PDFs
pdftotext via os/exec | Self-hosted services with mostly text-layer PDFs | ~10MB install (poppler-utils), no OCR
ChangeThisFile API    | Mixed input including scanned PDFs, no infra     | Network call, slower for OCR (5-30s)

CLI alternatives: pdftotext, ocrmypdf

For shell pipelines and one-offs:

# Plain text extraction
pdftotext -layout document.pdf document.txt

# For scanned PDFs, add OCR with ocrmypdf:
ocrmypdf scanned.pdf ocrd.pdf  # adds text layer to scanned PDF
pdftotext ocrd.pdf scanned.txt   # then extract

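# Or skip ocrmypdf: render pages to PNG, then OCR each image with tesseract.
# -r 300 raises the render resolution from the 150 DPI default, which usually helps OCR.
pdftoppm -png -r 300 scanned.pdf page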
for img in page-*.png; do tesseract "$img" - >> scanned.txt; done

From Go, you can chain these the same way:

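// Inside a func returning error: try pdftotext first; if the output is tiny, fall back to ocrmypdf.
if err := exec.Command("pdftotext", "-layout", "input.pdf", "out.txt").Run(); err != nil {
    return fmt.Errorf("pdftotext: %w", err)
}
info, err := os.Stat("out.txt")
if err != nil {
    return err
}
if info.Size() < 100 {
    // Likely scanned: add a text layer with OCR, then extract again.
    if err := exec.Command("ocrmypdf", "input.pdf", "ocrd.pdf").Run(); err != nil {
        return fmt.Errorf("ocrmypdf: %w", err)
    }
    if err := exec.Command("pdftotext", "-layout", "ocrd.pdf", "out.txt").Run(); err != nil {
        return fmt.Errorf("pdftotext after OCR: %w", err)
    }
}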

Production tips

  • Detect scanned PDFs early. Run pdftotext, check if output is <200 bytes for a multi-page document — that's the signal it's image-only. Either reject the file or fall back to OCR.
  • For multi-column PDFs, avoid -raw. -raw emits text in content-stream order, which can interleave columns. Use -layout when you need the columns kept side by side, or the default mode when you want them unrolled into reading order.
  • OCR is slow. Tesseract takes ~2-5s per page on a typical server. For 100-page documents, async with a job queue is the right pattern (the API supports this via /v1/jobs).
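  • Encrypted PDFs need passwords. With ledongthuc/pdf, pass a password callback to pdf.NewReaderEncrypted; pdftotext takes -upw (user password) and -opw (owner password). Check the returned error or exit status in both cases so a wrong password is not mistaken for an empty document.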
  • Set a timeout. Some malformed PDFs can hang text extractors. 60s is a reasonable bound for typical documents under 100 pages.

For text-layer PDFs, either ledongthuc/pdf or pdftotext works. For unknown input that may include scans, the API's automatic OCR fallback is the simplest path. Free tier covers 1,000 conversions/month.