PDF text extraction has two failure modes: (1) the PDF has selectable text but its layout confuses parsers (multi-column papers, for example), and (2) the PDF is a scanned image with no text layer. The pure-Go libraries handle case 1 with limitations and can't handle case 2 at all. pdftotext is more robust on case 1; the ChangeThisFile API adds an OCR fallback for case 2.
Method 1: ledongthuc/pdf (pure Go)
ledongthuc/pdf is a widely used pure-Go PDF reader. It has no native dependencies, so it runs on Lambda, Alpine, or anywhere else Go runs.
go get github.com/ledongthuc/pdf
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"

	"github.com/ledongthuc/pdf"
)

func pdfToText(inPath, outPath string) error {
	f, r, err := pdf.Open(inPath)
	if err != nil {
		return fmt.Errorf("open: %w", err)
	}
	defer f.Close()

	var buf bytes.Buffer
	b, err := r.GetPlainText()
	if err != nil {
		return fmt.Errorf("text: %w", err)
	}
	if _, err := io.Copy(&buf, b); err != nil {
		return err
	}
	return os.WriteFile(outPath, buf.Bytes(), 0o644)
}

func main() {
	if err := pdfToText("document.pdf", "document.txt"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
Three things to know:
- It only reads selectable text. Scanned PDFs (image-only) return empty strings — there's no OCR.
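- Output follows the PDF's internal text order, not the visual layout. Multi-column papers may interleave columns, and footnotes may end up mid-paragraph.
- Encrypted PDFs need a password. pdf.Open takes no password argument; for protected documents, open the file yourself and call pdf.NewReaderEncrypted(f, size, pw), where pw is a func() string that supplies the password.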
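Because image-only PDFs come back as empty or near-empty text, a cheap check on the extracted string can flag them before they reach downstream code. The following is a minimal sketch, not part of ledongthuc/pdf: the helper name `looksScanned` and the threshold of 20 letters or digits per page are assumptions chosen for illustration. The page count would typically come from the reader (for example its NumPage method).

```go
package main

import (
	"fmt"
	"unicode"
)

// minCharsPerPage is an assumed threshold, not a library constant.
// Tune it against your own documents.
const minCharsPerPage = 20

// looksScanned reports whether extracted text is too sparse to have come
// from a real text layer. pages is the document's page count; values
// below 1 are treated as 1.
func looksScanned(text string, pages int) bool {
	if pages < 1 {
		pages = 1
	}
	n := 0
	for _, r := range text {
		// Count only letters and digits so whitespace and form feeds
		// left behind by an empty extraction do not inflate the total.
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			n++
		}
	}
	return n < pages*minCharsPerPage
}

func main() {
	fmt.Println(looksScanned("", 3))
	fmt.Println(looksScanned("The quick brown fox jumps over the lazy dog.", 1))
}
```

Counting letters and digits rather than bytes keeps a page of stray whitespace from passing as real text.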
Method 2: pdftotext via os/exec (highest quality)
Poppler's pdftotext CLI handles complex layouts (multi-column, tables, footnotes) much better than any pure-Go library. It's the standard tool for PDF text extraction.
apt install poppler-utils # provides pdftotext
# macOS: brew install poppler
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

func pdfToText(ctx context.Context, inPath, outPath string) error {
	// -layout preserves visual layout (columns, tables)
	// -nopgbrk omits form-feed characters between pages
	cmd := exec.CommandContext(ctx,
		"pdftotext",
		"-layout",
		"-nopgbrk",
		inPath, outPath,
	)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("pdftotext: %w (%s)", err, out)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := pdfToText(ctx, "document.pdf", "document.txt"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
Useful flags:
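- -layout: maintains the original physical layout. Columns and tables stay side by side on each output line, as they appear on the page.
- -nopgbrk: omits the form-feed (\f) character between pages. Easier to grep.
- -raw: keeps text in content-stream order, with no layout analysis. Results depend on how the PDF was produced.
- (no mode flag): the default undoes the physical layout and emits text in reading order, which usually suits plain prose.
- -table: table mode, optimized for tabular data. This flag comes from Xpdf's pdftotext; Poppler's build (the one in poppler-utils) may not offer it, so check pdftotext -h first.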
Method 3: ChangeThisFile API (with OCR fallback)
If your PDFs include scanned documents (image-only, no text layer), neither of the above works. The API runs pdftotext first; if the result is empty or near-empty, it falls back to OCR. Free tier covers 1,000 conversions/month.
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"time"
)

const apiKey = "ctf_sk_your_key_here"

func pdfToText(inPath, outPath string) error {
	// Build the multipart body in memory: the file plus two form fields.
	body := &bytes.Buffer{}
	w := multipart.NewWriter(body)

	f, err := os.Open(inPath)
	if err != nil {
		return err
	}
	defer f.Close()

	fw, err := w.CreateFormFile("file", "input.pdf")
	if err != nil {
		return err
	}
	if _, err := io.Copy(fw, f); err != nil {
		return err
	}
	if err := w.WriteField("source", "pdf"); err != nil {
		return err
	}
	if err := w.WriteField("target", "txt"); err != nil {
		return err
	}
	// Close writes the trailing boundary; without it the body is incomplete.
	if err := w.Close(); err != nil {
		return err
	}

	req, err := http.NewRequest("POST", "https://changethisfile.com/v1/convert", body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", w.FormDataContentType())

	client := &http.Client{Timeout: 120 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("api %d: %s", resp.StatusCode, msg)
	}

	out, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	if err := pdfToText("scanned-receipt.pdf", "receipt.txt"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
For PDFs with selectable text, the API uses pdftotext (instant). For scanned PDFs, it runs Tesseract OCR, which adds 5-30s depending on page count. Set the client timeout to at least 120s.
When to use each
| Approach | Best for | Tradeoff |
|---|---|---|
| ledongthuc/pdf | Single-binary deploys, simple text-layer PDFs | No OCR, layout-fidelity gaps on complex PDFs |
| pdftotext via os/exec | Self-hosted services with mostly text-layer PDFs | ~10MB install (poppler-utils), no OCR |
| ChangeThisFile API | Mixed input including scanned PDFs, no infra | Network call, slower for OCR (5-30s) |
CLI alternatives: pdftotext, ocrmypdf
For shell pipelines and one-offs:
# Plain text extraction
pdftotext -layout document.pdf document.txt
# For scanned PDFs, add OCR with ocrmypdf:
ocrmypdf scanned.pdf ocrd.pdf # adds text layer to scanned PDF
pdftotext ocrd.pdf scanned.txt # then extract
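# Or without ocrmypdf: rasterize the pages, then OCR each image with tesseract
# (-r 300 raises the resolution from pdftoppm's 150 DPI default)
pdftoppm -r 300 scanned.pdf page -png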
for img in page-*.png; do tesseract "$img" - >> scanned.txt; done
From Go, you can chain these the same way:
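// Try pdftotext first; if the output is tiny, OCR with ocrmypdf and extract again (fragment of a func returning error).
if err := exec.Command("pdftotext", "-layout", "input.pdf", "out.txt").Run(); err != nil {
	return fmt.Errorf("pdftotext: %w", err)
}
info, err := os.Stat("out.txt")
if err != nil {
	return fmt.Errorf("stat: %w", err)
}
if info.Size() < 100 { // likely scanned: fall back to OCR
	if err := exec.Command("ocrmypdf", "input.pdf", "ocrd.pdf").Run(); err != nil {
		return fmt.Errorf("ocrmypdf: %w", err)
	}
	if err := exec.Command("pdftotext", "-layout", "ocrd.pdf", "out.txt").Run(); err != nil {
		return fmt.Errorf("pdftotext after OCR: %w", err)
	}
}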
Production tips
- Detect scanned PDFs early. Run pdftotext, check if output is <200 bytes for a multi-page document — that's the signal it's image-only. Either reject the file or fall back to OCR.
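- For multi-column PDFs, pick the mode deliberately. -layout keeps the columns side by side as they appear on the page; -raw emits text in content-stream order, which can interleave columns unpredictably. pdftotext's default mode (no flag) tries to undo the columns and output them in reading order.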
- OCR is slow. Tesseract takes ~2-5s per page on a typical server. For 100-page documents, async with a job queue is the right pattern (the API supports this via /v1/jobs).
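- Encrypted PDFs need passwords. ledongthuc/pdf has NewReaderEncrypted, which takes a password callback; pdftotext takes the -upw (user password) and -opw (owner password) flags. Check the returned error or exit status to detect a wrong password.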
- Set a timeout. Some malformed PDFs can hang text extractors. 60s is a reasonable bound for typical documents under 100 pages.
For text-layer PDFs, ledongthuc/pdf or pdftotext both work. For unknown input that may include scans, the API's automatic OCR fallback is the simplest path. Free tier covers 1,000 conversions/month.