Custom PDF Text Extraction with Node.js

Learn what a custom PDF text extraction API is, why you might build one, and how to create it with Node.js, TypeScript, and pdf-parse. Covers setup, core implementation, page‑range extraction, a metadata endpoint, search, testing, and deployment.

13 February 2026 by Suraj Barman

What is a Custom PDF Text Extractor?

A custom PDF text extractor is a program that reads a PDF file and returns its textual content, metadata, or specific page ranges on demand. Unlike generic libraries, a custom implementation can be tailored to the exact needs of an application, minimizing dependencies and avoiding unnecessary functionality.

Why Build a Custom Extractor?

Full control over the technology stack and performance characteristics.
Ability to expose only the required API surface (e.g., text, metadata, page‑range, search).
Reduced bundle size and a smaller security surface compared to monolithic libraries.
Tighter integration with business logic such as authentication, rate‑limiting, or caching.

How to Set Up a TypeScript‑Based Node.js Project

Initialize the project: npm init -y
Install runtime dependencies: npm install express cors express-fileupload pdf-parse
Install development dependencies: npm install -D typescript ts-node nodemon @types/node @types/express @types/cors @types/express-fileupload
Generate a TypeScript configuration file: npx tsc --init and adjust the target/module as needed.
Create a basic Express server that reads the PORT environment variable and starts listening.
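
As a rough illustration of the last step, the port-reading part can be isolated in a small helper. This is only a sketch: resolvePort and the 3000 fallback are assumptions made for the example, and the Express wiring is left out so the snippet has no dependencies.

```typescript
// Hypothetical helper: turn the raw PORT value into a usable port number.
// `env` stands in for process.env so the function is easy to test.
function resolvePort(
  env: Record<string, string | undefined>,
  fallback: number = 3000, // assumed default, not taken from the article
): number {
  const raw = env["PORT"];
  if (raw === undefined || raw.trim() === "") return fallback;

  const port = Number(raw);
  // Reject non-integers and values outside the valid TCP port range.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT value: ${raw}`);
  }
  return port;
}
```

In the server entry point this would typically be called as resolvePort(process.env), with the result passed to Express's app.listen().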

Core Implementation: The Basic Parse Function

The central function accepts a Uint8Array representing the uploaded file, creates a pdf‑parse instance, and returns an object containing:

text – the extracted text (empty string if none).
info – author, title, creation date, etc.
numpages – total page count.

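Both getText() and getInfo() are asynchronous and must be awaited. The handler yields to the event loop while it waits, although CPU‑heavy parsing of very large files can still delay other requests.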
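
The function described above might look roughly like the sketch below. The PdfParserLike interface and the createParser factory are hypothetical stand-ins for the library instance, and the result shapes (a text field, an info object, a total page count) are assumptions that should be checked against the installed pdf-parse version.

```typescript
// Assumed shape of the parser instance; the real pdf-parse types may differ.
interface PdfParserLike {
  getText(): Promise<{ text?: string }>;
  getInfo(): Promise<{ info?: Record<string, unknown>; total?: number }>;
}

interface ParsedPdf {
  text: string;
  info: Record<string, unknown>;
  numpages: number;
}

// `createParser` is a hypothetical factory wrapping the library constructor.
async function parsePdf(
  data: Uint8Array,
  createParser: (data: Uint8Array) => PdfParserLike,
): Promise<ParsedPdf> {
  if (data.byteLength === 0) throw new Error("Empty file");

  const parser = createParser(data);
  const textResult = await parser.getText();
  const infoResult = await parser.getInfo();

  return {
    text: textResult.text ?? "", // empty string if no text was found
    info: infoResult.info ?? {},
    numpages: infoResult.total ?? 0,
  };
}
```

Passing the factory in as a parameter keeps the function independent of the library, which also makes it straightforward to substitute a fake parser in tests.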

Adding Page‑Specific Extraction

Expose a second endpoint that accepts startPage and endPage query parameters.
Validate the range against the document’s total page count.
Return only the text of the requested pages, together with the original metadata.
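
A minimal sketch of the validation and slicing steps follows. It assumes the text is already available as one string per page; parsePageRange and slicePages are illustrative names, and treating a missing parameter as "first page" or "last page" is a design choice made for the example.

```typescript
interface PageRange {
  start: number;
  end: number;
}

// Parse and validate 1-based, inclusive page bounds taken from query strings.
function parsePageRange(
  startPage: string | undefined,
  endPage: string | undefined,
  totalPages: number,
): PageRange {
  const start = startPage === undefined ? 1 : Number(startPage);
  const end = endPage === undefined ? totalPages : Number(endPage);

  if (!Number.isInteger(start) || !Number.isInteger(end)) {
    throw new RangeError("startPage and endPage must be integers");
  }
  if (start < 1 || end > totalPages || start > end) {
    throw new RangeError(
      `Page range must satisfy 1 <= startPage <= endPage <= ${totalPages}`,
    );
  }
  return { start, end };
}

// Assumes per-page text is available, where pages[0] holds page 1.
function slicePages(pages: string[], range: PageRange): string {
  return pages.slice(range.start - 1, range.end).join("\n\n");
}
```

A RangeError thrown here would normally be translated into a 400 response by the route handler.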

Providing a Lightweight Metadata‑Only Endpoint

Implement a function that calls pdfParse.getInfo() without invoking getText().
Normalize date strings and supply placeholder values for missing fields.
Return a concise JSON object that can be used for preview or indexing.
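
The normalization step could be sketched as below. PDF date strings follow the pattern D:YYYYMMDDHHmmSS plus an optional UTC offset. The sketch assumes the parser surfaces the standard Info dictionary keys (Title, Author, CreationDate, ModDate) unchanged; the function names and the "Untitled"/"Unknown" placeholders are illustrative choices.

```typescript
// Matches PDF dates such as "D:20240131120000+05'30'" or "D:2024".
const PDF_DATE =
  /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+-])?(\d{2})?'?(\d{2})?'?$/;

// Convert a PDF date string to ISO 8601 (UTC); null if it is not recognisable.
function normalizePdfDate(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const match = PDF_DATE.exec(value.trim());
  if (match === null) return null;

  // Missing components fall back to the defaults given by the PDF date format.
  const part = (index: number, fallback: number): number =>
    match[index] === undefined ? fallback : Number(match[index]);

  const month = part(2, 1);
  const day = part(3, 1);
  if (month < 1 || month > 12 || day < 1 || day > 31) return null;

  const localMs = Date.UTC(part(1, 0), month - 1, day, part(4, 0), part(5, 0), part(6, 0));
  const sign = match[7] === "-" ? -1 : match[7] === "+" ? 1 : 0;
  const offsetMinutes = sign * (part(8, 0) * 60 + part(9, 0));
  // Local time = UTC + offset, so subtract the offset to get UTC.
  return new Date(localMs - offsetMinutes * 60000).toISOString();
}

interface PdfMetadata {
  title: string;
  author: string;
  createdAt: string | null;
  modifiedAt: string | null;
  pages: number;
}

function normalizeMetadata(info: Record<string, unknown>, pages: number): PdfMetadata {
  const text = (value: unknown, placeholder: string): string =>
    typeof value === "string" && value.trim() !== "" ? value.trim() : placeholder;

  return {
    title: text(info["Title"], "Untitled"),
    author: text(info["Author"], "Unknown"),
    createdAt: normalizePdfDate(info["CreationDate"]),
    modifiedAt: normalizePdfDate(info["ModDate"]),
    pages,
  };
}
```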

Implementing Full‑Text Search

Write a search utility that iterates over each page, applies a case‑sensitive or case‑insensitive pattern, and captures a 100‑character context around each match.
Return a structured result with total matches, the query, and an array of matches (page, position, snippet).
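
One way to sketch such a utility is shown below. It assumes per-page text is already available as an array and treats the query as a literal string rather than a regular expression. The "100‑character context" is read here as up to 50 characters on each side of the match, which is an interpretation, and searchPages is an illustrative name.

```typescript
interface SearchMatch {
  page: number; // 1-based page number
  position: number; // character offset of the match within the page text
  snippet: string;
}

interface SearchResult {
  query: string;
  totalMatches: number;
  matches: SearchMatch[];
}

function searchPages(
  pages: string[],
  query: string,
  caseSensitive: boolean = false,
  contextChars: number = 100,
): SearchResult {
  const matches: SearchMatch[] = [];
  if (query.length === 0) return { query, totalMatches: 0, matches };

  const half = Math.floor(contextChars / 2);
  const needle = caseSensitive ? query : query.toLowerCase();

  pages.forEach((pageText, index) => {
    const haystack = caseSensitive ? pageText : pageText.toLowerCase();
    let position = haystack.indexOf(needle);
    while (position !== -1) {
      const start = Math.max(0, position - half);
      const end = Math.min(pageText.length, position + query.length + half);
      matches.push({ page: index + 1, position, snippet: pageText.slice(start, end) });
      // Continue after the current match so matches do not overlap.
      position = haystack.indexOf(needle, position + needle.length);
    }
  });

  return { query, totalMatches: matches.length, matches };
}
```

Lower-casing can change the length of a few non-ASCII characters, so offsets may drift slightly for such text in case-insensitive mode; a production version would need a more careful comparison.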

Handling Common Edge Cases

Corrupted or malformed PDFs – wrap parsing in try‑catch and return a 4xx client error (for example, 400 Bad Request) instead of letting the request crash.
Password‑protected PDFs – either reject them or accept a password and pass it to the parser if supported.
Scanned image‑based PDFs – integrate an OCR library (e.g., Tesseract) for text extraction.
Special‑character encoding – ensure the application uses UTF‑8 and test with non‑Latin scripts.
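
Two of these cases, malformed input and scanned documents, can be handled in a thin wrapper around the parser call. The sketch below is one possible shape: safeExtract and the extract callback are hypothetical names, and flagging "pages but no text" as an OCR candidate is a heuristic, not a guarantee.

```typescript
type ExtractionOutcome =
  | { ok: true; status: 200; text: string; needsOcr: boolean }
  | { ok: false; status: 400; error: string };

// `extract` is a hypothetical function that wraps the actual parser call.
async function safeExtract(
  data: Uint8Array,
  extract: (data: Uint8Array) => Promise<{ text: string; numpages: number }>,
): Promise<ExtractionOutcome> {
  // A PDF normally starts with the "%PDF-" signature; reject anything else early.
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d];
  if (!signature.every((byte, i) => data[i] === byte)) {
    return { ok: false, status: 400, error: "File is not a PDF" };
  }

  try {
    const { text, numpages } = await extract(data);
    // Pages without any text often indicate a scanned, image-only document.
    const needsOcr = numpages > 0 && text.trim() === "";
    return { ok: true, status: 200, text, needsOcr };
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown parsing error";
    return { ok: false, status: 400, error: `Could not parse PDF: ${message}` };
  }
}
```

A route handler can then send outcome.status directly and decide whether to queue the file for OCR when needsOcr is true.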

Testing with Jest and Supertest

Install test dependencies: npm install --save-dev jest ts-jest @types/jest supertest @types/supertest
Write unit tests for each endpoint (upload, metadata, page‑range, search) using Supertest to simulate HTTP requests.
Mock the file‑upload middleware and the pdf‑parse library to keep tests fast and deterministic.
Run the suite with npm test and generate a coverage report to ensure critical paths are exercised.

Deploying the API

Build the TypeScript source: npm run build (typically compiles to a dist folder).
Start the compiled server with npm start or use a platform‑specific start script.
Configure environment variables for PORT and any secrets added later (e.g., an OCR service key).
Consider production enhancements: rate‑limiting, structured logging (Winston/Pino), monitoring (Sentry), and optional caching of metadata.
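
To make the rate‑limiting suggestion concrete, here is a minimal fixed-window limiter kept in process memory. It is a sketch under simplifying assumptions: FixedWindowLimiter is an illustrative name, state is lost on restart, and it does not coordinate across multiple server instances, where a shared store or an existing middleware would be the usual choice.

```typescript
// Fixed-window rate limiter kept in memory (single process only).
class FixedWindowLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request identified by `key` may proceed.
  // `now` is injectable so the behaviour can be tested without real timers.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (entry === undefined || now - entry.windowStart >= this.windowMs) {
      // First request, or the previous window has expired: start a new one.
      this.hits.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

The key would typically be a client IP address or an authenticated user ID, and a false result would map to a 429 response.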

Integrating into a SaaS Application

Expose the API behind an authenticated gateway to enforce per‑user quotas.
Cache metadata for frequently accessed documents to reduce processing time.
Extend the service to handle other document types (DOCX, XLSX) by following the same pattern.
Offer asynchronous processing via a job queue for large files or batch uploads.
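
As an illustration of the metadata caching idea above, a small in-memory cache with a time-to-live might look like this. TtlCache is a hypothetical name, and the choice of key (for example, a hash of the file contents or a document ID) is left to the caller.

```typescript
// In-memory cache whose entries expire after a fixed time-to-live.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // `now` is injectable so expiry can be tested without waiting.
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // drop stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

A handler would first call get() with the document key and only run the parser, followed by set(), on a miss. For multiple server instances, an external cache would be needed instead.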