pdf-text-extractor
Extract text from PDFs with OCR support, perfect for document digitization.
5,803 downloads · 42 installs · 11 stars
v1.0.0
Overview
PDF-Text-Extractor extracts text content from PDF files without requiring any external tools to be installed. It supports both embedded text extraction (for text-based PDFs) and OCR via Tesseract.js (for scanned documents).
Key Features
- Text Extraction: Extracts text from PDFs without external tools, handles both text-based and scanned PDFs, preserves document structure and formatting, and runs in milliseconds on text-based PDFs
- OCR Support: Uses Tesseract.js for scanned documents, supports multiple languages (English, Spanish, French, German), offers configurable OCR quality/speed, and falls back to direct text extraction when a page has embedded text
- Batch Processing: Processes multiple PDFs at once for document workflows, with progress tracking for large files, error handling, and retry logic
- Output Options: Plain text output, JSON output with metadata, markdown conversion, HTML output (preserving links)
- Utility Features: Page-by-page extraction, character/word counting, language detection, metadata extraction (author, title, creation date)
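The output options and counting utilities listed above can be illustrated with a small sketch. The listing does not document the tool's actual API or output schema, so the `ExtractedPage`, `DocumentMetadata`, and `ExtractionSummary` shapes and the `summarize`, `toPlainText`, `toJson`, and `toMarkdown` helpers below are all hypothetical names chosen for illustration.

```typescript
// Hypothetical shapes: the real tool's output schema is not documented in the listing.
interface ExtractedPage {
  pageNumber: number;
  text: string;
}

interface DocumentMetadata {
  title?: string;
  author?: string;
  creationDate?: string;
}

interface ExtractionSummary {
  metadata: DocumentMetadata;
  pageCount: number;
  wordCount: number;
  charCount: number;
  pages: ExtractedPage[];
}

// Count whitespace-separated tokens; an empty or blank string has zero words.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Aggregate page-by-page results into one summary object.
// charCount is the JavaScript string length (UTF-16 code units), whitespace included.
function summarize(
  pages: ExtractedPage[],
  metadata: DocumentMetadata = {}
): ExtractionSummary {
  let wordCount = 0;
  let charCount = 0;
  for (const page of pages) {
    wordCount += countWords(page.text);
    charCount += page.text.length;
  }
  return { metadata, pageCount: pages.length, wordCount, charCount, pages };
}

// Plain text output: page texts separated by a blank line.
function toPlainText(pages: ExtractedPage[]): string {
  return pages.map((p) => p.text).join("\n\n");
}

// JSON output with metadata and counts.
function toJson(summary: ExtractionSummary): string {
  return JSON.stringify(summary, null, 2);
}

// Markdown output: a title heading followed by one section per page.
function toMarkdown(summary: ExtractionSummary): string {
  const title = summary.metadata.title || "Untitled document";
  const sections = summary.pages.map(
    (p) => `## Page ${p.pageNumber}\n\n${p.text}`
  );
  return [`# ${title}`, ...sections].join("\n\n");
}
```

Keeping the per-page text in the summary means a single extraction pass could feed every output format, which is one plausible way to support page-by-page extraction alongside whole-document output.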
How It Works
PDF-Text-Extractor combines two methods. For text-based PDFs, it reads the embedded text layer directly. For scanned documents, where pages are images with no text layer, it runs OCR with Tesseract.js.
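The choice between the two methods can be sketched as a per-page decision. This is a minimal illustration and not the tool's actual logic: the `PageInput` shape, the `chooseStrategy` and `planExtraction` functions, and the 20-character threshold are all assumptions.

```typescript
// Hypothetical input: embeddedText is whatever the PDF's text layer yields for a page.
interface PageInput {
  pageNumber: number;
  embeddedText: string;
}

type Strategy = "embedded" | "ocr";

interface PagePlan {
  pageNumber: number;
  strategy: Strategy;
}

// Assumed threshold: pages with fewer non-whitespace characters than this
// are treated as scanned images that need OCR.
const MIN_EMBEDDED_CHARS = 20;

// Pick direct extraction when the page has a usable text layer, OCR otherwise.
function chooseStrategy(
  page: PageInput,
  minChars: number = MIN_EMBEDDED_CHARS
): Strategy {
  const visible = page.embeddedText.replace(/\s+/g, "");
  return visible.length >= minChars ? "embedded" : "ocr";
}

// Build an extraction plan for a whole document, one decision per page.
function planExtraction(pages: PageInput[]): PagePlan[] {
  return pages.map((page) => ({
    pageNumber: page.pageNumber,
    strategy: chooseStrategy(page),
  }));
}
```

Deciding per page instead of per document would let a mixed PDF (for example, a typed report with a scanned appendix) use the fast path where it can and send only the image-only pages to the slower OCR step.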
Use Cases
- Digitize documents: Extract text from PDFs to make them searchable and editable
- Process invoices: Extract text from invoices to automate data entry
- Analyze content: Extract text from PDFs to analyze their content and structure