JavaScript PDF parsing splits cleanly into three approaches: hit a hosted API, run pdf-parse in Node, or run pdf.js (which works in both Node and the browser). Each one fits a different shape of project — server-side ingestion pipelines, serverless functions where cold-start matters, or fully client-side apps where files cannot leave the user's device.

This guide shows working code for all three, with error handling and the failure modes you'll actually hit on user-uploaded PDFs.

Method 1: ChangeThisFile API (works anywhere fetch works)

If you want zero dependencies and no native binaries, hit the API. Get a free key at changethisfile.com/api — 1,000 conversions/month on the free tier.

import fs from "node:fs";

const API_KEY = "sk_test_your_key_here";

async function pdfToText(pdfPath) {
  const fileBuffer = fs.readFileSync(pdfPath);
  const form = new FormData();
  form.append("file", new Blob([fileBuffer], { type: "application/pdf" }), "input.pdf");
  form.append("source", "pdf");
  form.append("target", "txt");

  const response = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
    body: form,
  });

  if (!response.ok) {
    const err = await response.json().catch(() => ({}));
    throw new Error(`Conversion failed (${response.status}): ${err.error || "unknown"}`);
  }

  return await response.text();
}

const text = await pdfToText("./invoice.pdf");
console.log(text.slice(0, 500));
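Any network-backed conversion call can fail transiently. A sketch of a retry wrapper with exponential backoff, assuming 429 and 5xx responses are safe to retry (confirm against the API's actual retry semantics):

```javascript
// Sketch: retry transient HTTP failures with exponential backoff.
// Assumes 429 and 5xx are retryable; with retries = 3 this makes
// at most 4 attempts total (500ms, 1s, 2s between them).
async function fetchWithRetry(doFetch, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    const response = await doFetch();
    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt >= retries) return response;
    await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
}

// Usage: wrap the conversion request from pdfToText above.
// const response = await fetchWithRetry(() =>
//   fetch("https://changethisfile.com/v1/convert", { method: "POST", headers, body: form })
// );
```

Do not retry 4xx errors other than 429 — a malformed PDF will fail the same way every time.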

The same code works in browsers — just swap the file source for a File from an <input type="file">:

async function pdfToTextBrowser(file) {
  const form = new FormData();
  form.append("file", file);
  form.append("source", "pdf");
  form.append("target", "txt");

  const response = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
    body: form,
  });

  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return await response.text();
}

One caveat for browser usage: do not ship your API key in the client bundle. Either proxy through your backend, or use scoped session keys (contact api@changethisfile.com for the per-user key flow).
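A minimal backend proxy can be a single Node handler that streams the browser's multipart upload straight through and attaches the key server-side. This is a sketch using only Node built-ins (the `/convert` route and env var name are arbitrary choices, not part of the API):

```javascript
import http from "node:http";

const API_KEY = process.env.CHANGETHISFILE_KEY; // secret stays on the server

// Hypothetical minimal proxy: the browser POSTs its FormData here,
// and we forward the body unchanged with the Authorization header added.
const server = http.createServer(async (req, res) => {
  if (req.method !== "POST" || req.url !== "/convert") {
    res.writeHead(404).end();
    return;
  }
  const upstream = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "content-type": req.headers["content-type"], // preserves the multipart boundary
    },
    body: req,      // stream the incoming body straight through
    duplex: "half", // required by Node's fetch when the body is a stream
  });
  res.writeHead(upstream.status, { "content-type": "text/plain" });
  res.end(await upstream.text());
});

server.listen(3000);
```

The browser code then posts to `/convert` on your own origin with no Authorization header at all.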

Method 2: pdf-parse (Node.js, lightweight)

For Node-only extraction with minimal dependencies, pdf-parse is the simplest option. It is a thin wrapper around pdfjs-dist optimized for plain text extraction.

npm install pdf-parse

import fs from "node:fs/promises";
import pdfParse from "pdf-parse";

async function pdfToText(pdfPath) {
  const buffer = await fs.readFile(pdfPath);
  const data = await pdfParse(buffer);
  return data.text;
}

const text = await pdfToText("./invoice.pdf");
console.log(text.slice(0, 500));

pdf-parse also returns metadata (page count, author, title, etc.) on the same response object, which is useful if you are building a search index that needs more than the body text.
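As a sketch, here is one way to shape that response object into an indexing record. The field names `numpages`, `info`, and `text` come from pdf-parse's response shape; `Title` and `Author` keys under `info` exist only if the PDF sets them:

```javascript
// Sketch: build a search-index record from pdf-parse's response object.
// data.numpages, data.info, and data.text are pdf-parse response fields;
// info.Title and info.Author are optional PDF document metadata.
function toIndexRecord(data) {
  return {
    pages: data.numpages,
    title: data.info?.Title ?? null,
    author: data.info?.Author ?? null,
    body: data.text.trim(),
  };
}

// Usage with the extraction above:
// const record = toIndexRecord(await pdfParse(buffer));
```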

The main downside: pdf-parse pulls in pdfjs-dist as a dependency, which is large (~3MB). On serverless platforms with cold-start sensitivity (Lambda, Vercel functions), this matters.

Method 3: pdf.js (browser + Node, full PDF rendering)

Mozilla's pdf.js is the same library that powers Firefox's PDF viewer. It works in browsers (no server needed) and in Node. It is more verbose than pdf-parse but gives you complete control — page-by-page extraction, character positioning, font information.

npm install pdfjs-dist

import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs";

async function pdfToText(arrayBuffer) {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  const textParts = [];

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items.map(item => item.str).join(" ");
    textParts.push(pageText);
  }

  return textParts.join("\n\n");
}

// Browser usage:
// const arrayBuffer = await file.arrayBuffer();
// const text = await pdfToText(arrayBuffer);
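One caveat with the `join(" ")` above: getTextContent() returns positioned fragments with no newlines, so multi-line layouts flatten into one long line per page. A sketch that restores line breaks from each item's y coordinate (`item.transform[5]` in pdf.js's text-item format), assuming items arrive in reading order:

```javascript
// Sketch: rebuild line breaks from pdf.js text items. Each item's
// transform is [scaleX, skewY, skewX, scaleY, x, y]; a jump in y
// beyond the tolerance is treated as a new line.
function itemsToLines(items, yTolerance = 2) {
  const lines = [];
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5];
    if (lastY === null || Math.abs(y - lastY) > yTolerance) {
      lines.push([]); // y moved: start a new line
      lastY = y;
    }
    lines[lines.length - 1].push(item.str);
  }
  return lines.map(parts => parts.join(" "));
}
```

Swap this in for the `join(" ")` inside the page loop when you need paragraph structure rather than a bag of words.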

For pure-browser extraction with no backend, pdf.js is the only real option. It is what powers many "drag a PDF onto this page" extraction tools.

When to use each

| Approach | Best for | Tradeoff |
| --- | --- | --- |
| ChangeThisFile API | Production pipelines, varied input quality, serverless | Per-call cost, network dependency |
| pdf-parse | Node scripts, simple extraction, low setup | ~3MB dependency, slow on big PDFs |
| pdf.js | Pure-browser apps, custom layout extraction | Verbose API, heavier in bundles |

For server-side ingestion of user uploads — the most common case — the API wins on operational simplicity. No native binaries to install. No edge cases to handle in your own code. For client-side privacy-first apps where files cannot leave the user's browser, pdf.js is the only option.

Handling scanned PDFs

None of these methods handle scanned PDFs (image-only, no text layer). For OCR in JavaScript, Tesseract.js works in both browsers and Node, but it is slow (multiple seconds per page) and accuracy depends on scan quality.

import Tesseract from "tesseract.js";

// First convert PDF pages to images using pdf.js render(),
// then OCR each page image:
const { data: { text } } = await Tesseract.recognize(imageBlob, "eng");
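Putting those two steps together, a sketch of the full browser pipeline. This assumes `pdfjsLib` and `Tesseract` are already loaded (via script tags or bundler imports); `scale: 2` is an arbitrary choice trading speed for OCR accuracy:

```javascript
// Sketch: rasterize each PDF page with pdf.js, then OCR the canvas
// with Tesseract.js. Browser-only (uses document.createElement).
async function ocrPdf(arrayBuffer) {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  const pages = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: 2 }); // higher scale = better OCR, slower
    const canvas = document.createElement("canvas");
    canvas.width = viewport.width;
    canvas.height = viewport.height;
    await page.render({ canvasContext: canvas.getContext("2d"), viewport }).promise;
    const { data } = await Tesseract.recognize(canvas, "eng");
    pages.push(data.text);
  }
  return pages.join("\n\n");
}
```

Budget for this being slow: a 20-page scan at several seconds per page is a minute-plus of blocking work, so run it behind a progress indicator or in a worker.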

The ChangeThisFile API does not currently OCR scanned PDFs — that is on the roadmap as a separate endpoint.

For Node-side ingestion pipelines with predictable PDFs, pdf-parse is fast and simple. For varied user uploads where input quality is unpredictable, the API removes a class of operational headaches. Grab a free API key for 1,000 conversions/month and try it without committing.