Name: pdf-text-extractor
Author: Michael-laffin

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Key Features

Text Extraction: Extract text from PDFs without external tools, support both text-based and scanned PDFs, preserve document structure and formatting, extract in milliseconds for text-based PDFs
OCR Support: Use Tesseract.js for scanned documents, support multiple languages (English, Spanish, French, German), configure the OCR quality/speed trade-off, fall back to embedded text extraction when possible
Batch Processing: Process multiple PDFs at once, batch extraction for document workflows, progress tracking for large files, error handling and retry logic
Output Options: Plain text output, JSON output with metadata, markdown conversion, HTML output (preserving links)
Utility Features: Page-by-page extraction, character/word counting, language detection, metadata extraction (author, title, creation date)
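
The batch-processing features listed above (progress tracking, error handling, retry logic) can be illustrated with a small sketch. This is not the tool's actual API, which is not documented here: `process_batch`, `extract`, `max_attempts`, `delay_seconds`, and `on_progress` are all hypothetical names, and `extract` stands in for whatever function turns one PDF path into text.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


def process_batch(
    paths: Iterable[str],
    extract: Callable[[str], str],  # hypothetical: maps a PDF path to its text
    max_attempts: int = 3,          # assumed retry limit, not a documented default
    delay_seconds: float = 0.0,     # assumed pause between retries
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Extract text from each path, retrying failures, and report progress."""
    path_list = list(paths)
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}

    for done, path in enumerate(path_list, start=1):
        for attempt in range(1, max_attempts + 1):
            try:
                texts[path] = extract(path)
                break  # success, stop retrying this file
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this file but keep processing the rest.
                    errors[path] = str(exc)
                else:
                    time.sleep(delay_seconds)
        if on_progress is not None:
            on_progress(done, len(path_list))

    return texts, errors
```

A caller could pass `on_progress=lambda done, total: print(f"{done}/{total}")` to show progress, then inspect the returned `errors` mapping to see which files failed after all retries.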

How It Works

PDF-Text-Extractor combines embedded text extraction with OCR. For text-based PDFs, it reads the text directly from the file's embedded text layer. For scanned documents, which contain only page images, it runs OCR with Tesseract.js.
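
One way to picture the choice between the two paths is a per-page check: use the embedded text if a page has enough of it, otherwise run OCR. The sketch below is an illustration under assumptions, not the tool's documented logic. The `min_chars` threshold is an assumed heuristic, and `ocr_page` is a hypothetical callback standing in for the Tesseract.js step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageResult:
    page_number: int
    text: str
    method: str  # "embedded" or "ocr"


def extract_pages(
    embedded_texts: List[str],       # embedded text found on each page ("" if none)
    ocr_page: Callable[[int], str],  # hypothetical: runs OCR on a 1-based page number
    min_chars: int = 20,             # assumed threshold for "has a usable text layer"
) -> List[PageResult]:
    """Prefer embedded text per page and fall back to OCR when it is missing."""
    results: List[PageResult] = []
    for page_number, embedded in enumerate(embedded_texts, start=1):
        cleaned = embedded.strip()
        if len(cleaned) >= min_chars:
            # Text-based page: the embedded layer is used as-is.
            results.append(PageResult(page_number, cleaned, "embedded"))
        else:
            # Likely a scanned page: only this page pays the OCR cost.
            results.append(PageResult(page_number, ocr_page(page_number).strip(), "ocr"))
    return results
```

Deciding per page rather than per file means a mixed document, such as a typed report with a scanned appendix, only pays the OCR cost on the pages that need it.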

Use Cases

Digitize documents: Extract text from PDFs to make them searchable and editable
Process invoices: Extract text from invoices to automate data entry
Analyze content: Extract text from PDFs to analyze their content and structure
