
pdf-text-extractor

Extract text from PDFs with OCR support, perfect for document digitization.

5,803 downloads · 42 installs · 11 stars
v1.0.0
Michael-laffin · Development · 3/2/2026

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Key Features

  • Text Extraction: Extract text from PDFs without external tools, support both text-based and scanned PDFs, preserve document structure and formatting, extract quickly (milliseconds for text-based PDFs)
  • OCR Support: Use Tesseract.js for scanned documents, support multiple languages (English, Spanish, French, German), configure OCR quality/speed, fall back to text extraction when possible
  • Batch Processing: Process multiple PDFs at once, batch extraction for document workflows, progress tracking for large files, error handling and retry logic
  • Output Options: Plain text output, JSON output with metadata, markdown conversion, HTML output (preserving links)
  • Utility Features: Page-by-page extraction, character/word counting, language detection, metadata extraction (author, title, creation date)
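To make the JSON output and the character/word counting listed above more concrete, here is a minimal sketch in TypeScript. The tool's actual output schema is not documented on this page, so the `PageReport` and `DocumentReport` shapes, the `countWords` and `buildReport` helpers, and all field names are assumptions made for illustration only.

```typescript
// Hypothetical report shapes; the tool's real JSON schema is not documented here.
interface PageReport {
  page: number;       // 1-based page number
  characters: number; // UTF-16 code units, as reported by String.length
  words: number;      // whitespace-separated tokens
  text: string;
}

interface DocumentReport {
  pageCount: number;
  totalCharacters: number;
  totalWords: number;
  pages: PageReport[];
}

// Count whitespace-separated tokens; an empty or blank page has zero words.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Build a per-page and per-document summary from already extracted page texts.
function buildReport(pageTexts: string[]): DocumentReport {
  const pages: PageReport[] = pageTexts.map((text, index) => ({
    page: index + 1,
    characters: text.length,
    words: countWords(text),
    text: text,
  }));
  return {
    pageCount: pages.length,
    totalCharacters: pages.reduce((sum, p) => sum + p.characters, 0),
    totalWords: pages.reduce((sum, p) => sum + p.words, 0),
    pages: pages,
  };
}

// Example: one page with text, one blank page.
const report = buildReport(["Invoice 42\nTotal due: 100 EUR", ""]);
console.log(JSON.stringify(report, null, 2));
```

Keeping the counts next to the page text means a caller can filter or paginate results without re-reading the PDF. A real implementation might count Unicode code points rather than UTF-16 code units.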

How It Works

PDF-Text-Extractor combines embedded text extraction with OCR. For text-based PDFs, it reads the text directly from the file. For scanned documents, which contain page images rather than text, it runs OCR with Tesseract.js.
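The choice between the two paths can be sketched as a small routing function. This is not the tool's actual implementation: the `PageInput` and `PageResult` types, the `extractPage` function, and the `MIN_EMBEDDED_CHARS` threshold are hypothetical, and the OCR step is passed in as a plain callback so the sketch runs without Tesseract.js installed. In practice the OCR call would be asynchronous.

```typescript
type Method = "embedded" | "ocr";

interface PageInput {
  // Text already read from the PDF page; "" when the page is only an image.
  embeddedText: string;
}

interface PageResult {
  method: Method;
  text: string;
}

// Hypothetical threshold: pages with less embedded text than this are treated as scans.
const MIN_EMBEDDED_CHARS = 20;

// Prefer the embedded text; call the (expensive) OCR step only when it is missing.
function extractPage(
  page: PageInput,
  runOcr: () => string,
  minChars: number = MIN_EMBEDDED_CHARS
): PageResult {
  const embedded = page.embeddedText.trim();
  if (embedded.length >= minChars) {
    return { method: "embedded", text: embedded };
  }
  return { method: "ocr", text: runOcr() };
}

// Stand-in for an OCR engine; a real caller would wrap Tesseract.js here.
const fakeOcr = (): string => "text recognised from the page image";

const textPage = extractPage(
  { embeddedText: "This page already has a selectable text layer." },
  fakeOcr
);
const scannedPage = extractPage({ embeddedText: "  " }, fakeOcr);

console.log(textPage.method, scannedPage.method);
```

Checking for embedded text first keeps text-based PDFs on the fast path and reserves OCR, which is far slower, for pages that need it.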

Use Cases

  • Digitize documents: Extract text from PDFs to make them searchable and editable
  • Process invoices: Extract text from invoices to automate data entry
  • Analyze content: Extract text from PDFs to analyze their content and structure
