JavaScript PDF parsing splits cleanly into three approaches: hit a hosted API, run pdf-parse in Node, or run pdf.js (which works in both Node and the browser). Each one fits a different shape of project — server-side ingestion pipelines, serverless functions where cold-start matters, or fully client-side apps where files cannot leave the user's device.

This guide shows working code for all three, with error handling and the failure modes you'll actually hit on user-uploaded PDFs.

Method 1: ChangeThisFile API (works anywhere fetch works)

If you want zero dependencies and no native binaries, hit the API. Get a free key at changethisfile.com/api — 1,000 conversions/month on the free tier.

import fs from "node:fs";

const API_KEY = "sk_test_your_key_here";

async function pdfToText(pdfPath) {
  const fileBuffer = fs.readFileSync(pdfPath);
  const form = new FormData();
  form.append("file", new Blob([fileBuffer], { type: "application/pdf" }), "input.pdf");
  form.append("source", "pdf");
  form.append("target", "txt");

  const response = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
    body: form,
  });

  if (!response.ok) {
    const err = await response.json().catch(() => ({}));
    throw new Error(`Conversion failed (${response.status}): ${err.error || "unknown"}`);
  }

  return await response.text();
}

const text = await pdfToText("./invoice.pdf");
console.log(text.slice(0, 500));
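Any network-backed conversion call can fail transiently. A sketch of a retry wrapper with exponential backoff, assuming 429 and 5xx responses are safe to retry (confirm against the API's actual retry semantics):

```javascript
// Sketch: retry transient HTTP failures with exponential backoff.
// Assumes 429 and 5xx are retryable; with retries = 3 this makes
// at most 4 attempts total (500ms, 1s, 2s between them).
async function fetchWithRetry(doFetch, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    const response = await doFetch();
    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt >= retries) return response;
    await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
}

// Usage: wrap the conversion request from pdfToText above.
// const response = await fetchWithRetry(() =>
//   fetch("https://changethisfile.com/v1/convert", { method: "POST", headers, body: form })
// );
```

Do not retry 4xx errors other than 429 — a malformed PDF will fail the same way every time.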

The same code works in browsers — just swap the file source for a File from an <input type="file">:

async function pdfToTextBrowser(file) {
  const form = new FormData();
  form.append("file", file);
  form.append("source", "pdf");
  form.append("target", "txt");

  const response = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
    body: form,
  });

  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return await response.text();
}

One caveat for browser usage: do not ship your API key in the client bundle. Either proxy through your backend, or use scoped session keys (contact api@changethisfile.com for the per-user key flow).
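A minimal backend proxy can be a single Node handler that streams the browser's multipart upload straight through and attaches the key server-side. This is a sketch using only Node built-ins (the `/convert` route and env var name are arbitrary choices, not part of the API):

```javascript
import http from "node:http";

const API_KEY = process.env.CHANGETHISFILE_KEY; // secret stays on the server

// Hypothetical minimal proxy: the browser POSTs its FormData here,
// and we forward the body unchanged with the Authorization header added.
const server = http.createServer(async (req, res) => {
  if (req.method !== "POST" || req.url !== "/convert") {
    res.writeHead(404).end();
    return;
  }
  const upstream = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "content-type": req.headers["content-type"], // preserves the multipart boundary
    },
    body: req,      // stream the incoming body straight through
    duplex: "half", // required by Node's fetch when the body is a stream
  });
  res.writeHead(upstream.status, { "content-type": "text/plain" });
  res.end(await upstream.text());
});

server.listen(3000);
```

The browser code then posts to `/convert` on your own origin with no Authorization header at all.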

Method 2: pdf-parse (Node.js, lightweight)

For Node-only extraction with minimal dependencies, pdf-parse is the simplest option. It is a thin wrapper around pdfjs-dist optimized for plain text extraction.

npm install pdf-parse

import fs from "node:fs/promises";
import pdfParse from "pdf-parse";

async function pdfToText(pdfPath) {
  const buffer = await fs.readFile(pdfPath);
  const data = await pdfParse(buffer);
  return data.text;
}

const text = await pdfToText("./invoice.pdf");
console.log(text.slice(0, 500));

pdf-parse also returns metadata (page count, author, title, etc.) on the same response object, which is useful if you are building a search index that needs more than the body text.
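As a sketch, here is one way to shape that response object into an indexing record. The field names `numpages`, `info`, and `text` come from pdf-parse's response shape; `Title` and `Author` keys under `info` exist only if the PDF sets them:

```javascript
// Sketch: build a search-index record from pdf-parse's response object.
// data.numpages, data.info, and data.text are pdf-parse response fields;
// info.Title and info.Author are optional PDF document metadata.
function toIndexRecord(data) {
  return {
    pages: data.numpages,
    title: data.info?.Title ?? null,
    author: data.info?.Author ?? null,
    body: data.text.trim(),
  };
}

// Usage with the extraction above:
// const record = toIndexRecord(await pdfParse(buffer));
```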

The main downside: pdf-parse pulls in pdfjs-dist as a dependency, which is large (~3MB). On serverless platforms with cold-start sensitivity (Lambda, Vercel functions), this matters.

Method 3: pdf.js (browser + Node, full PDF rendering)

Mozilla's pdf.js is the same library that powers Firefox's PDF viewer. It works in browsers (no server needed) and in Node. It is more verbose than pdf-parse but gives you complete control — page-by-page extraction, character positioning, font information.

npm install pdfjs-dist

import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs";

async function pdfToText(arrayBuffer) {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  const textParts = [];

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items.map(item => item.str).join(" ");
    textParts.push(pageText);
  }

  return textParts.join("\n\n");
}

// Browser usage:
// const arrayBuffer = await file.arrayBuffer();
// const text = await pdfToText(arrayBuffer);
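One caveat with the `join(" ")` above: getTextContent() returns positioned fragments with no newlines, so multi-line layouts flatten into one long line per page. A sketch that restores line breaks from each item's y coordinate (`item.transform[5]` in pdf.js's text-item format), assuming items arrive in reading order:

```javascript
// Sketch: rebuild line breaks from pdf.js text items. Each item's
// transform is [scaleX, skewY, skewX, scaleY, x, y]; a jump in y
// beyond the tolerance is treated as a new line.
function itemsToLines(items, yTolerance = 2) {
  const lines = [];
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5];
    if (lastY === null || Math.abs(y - lastY) > yTolerance) {
      lines.push([]); // y moved: start a new line
      lastY = y;
    }
    lines[lines.length - 1].push(item.str);
  }
  return lines.map(parts => parts.join(" "));
}
```

Swap this in for the `join(" ")` inside the page loop when you need paragraph structure rather than a bag of words.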

For pure-browser extraction with no backend, pdf.js is the only real option. It is what powers many "drag a PDF onto this page" extraction tools.

When to use each

| Approach | Best for | Tradeoff |
| --- | --- | --- |
| ChangeThisFile API | Production pipelines, varied input quality, serverless | Per-call cost, network dependency |
| pdf-parse | Node scripts, simple extraction, low setup | ~3MB dependency, slow on big PDFs |
| pdf.js | Pure-browser apps, custom layout extraction | Verbose API, heavier in bundles |

For server-side ingestion of user uploads — the most common case — the API wins on operational simplicity. No native binaries to install. No edge cases to handle in your own code. For client-side privacy-first apps where files cannot leave the user's browser, pdf.js is the only option.

Handling scanned PDFs

None of these methods handle scanned PDFs (image-only, no text layer). For OCR in JavaScript, Tesseract.js works in both browsers and Node, but it is slow (multiple seconds per page) and accuracy depends on scan quality.

import Tesseract from "tesseract.js";

// First convert PDF pages to images using pdf.js render(),
// then OCR each page image:
const { data: { text } } = await Tesseract.recognize(imageBlob, "eng");
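Putting those two steps together, a sketch of the full browser pipeline. This assumes `pdfjsLib` and `Tesseract` are already loaded (via script tags or bundler imports); `scale: 2` is an arbitrary choice trading speed for OCR accuracy:

```javascript
// Sketch: rasterize each PDF page with pdf.js, then OCR the canvas
// with Tesseract.js. Browser-only (uses document.createElement).
async function ocrPdf(arrayBuffer) {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  const pages = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: 2 }); // higher scale = better OCR, slower
    const canvas = document.createElement("canvas");
    canvas.width = viewport.width;
    canvas.height = viewport.height;
    await page.render({ canvasContext: canvas.getContext("2d"), viewport }).promise;
    const { data } = await Tesseract.recognize(canvas, "eng");
    pages.push(data.text);
  }
  return pages.join("\n\n");
}
```

Budget for this being slow: a 20-page scan at several seconds per page is a minute-plus of blocking work, so run it behind a progress indicator or in a worker.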

The ChangeThisFile API does not currently OCR scanned PDFs — that is on the roadmap as a separate endpoint.

For Node-side ingestion pipelines with predictable PDFs, pdf-parse is fast and simple. For varied user uploads where input quality is unpredictable, the API removes a class of operational headaches. Grab a free API key for 1,000 conversions/month and try it without committing.