Automated Document Parser Documentation
Welcome to the Automated Document Parser documentation!
Automated Document Parser is a document processing library built on top of LangChain. It detects file types automatically and loads a range of document formats with a single call, which simplifies building RAG (Retrieval-Augmented Generation) applications.
Features
Automatic File Type Detection - Intelligently detects and processes 10+ file formats
Multiple PDF Loaders - 9 different PDF loading methods for various use cases
Modular Architecture - Clean separation with file_load/ and pdf_load/ modules
Easy Integration - Simple API that works seamlessly with LangChain
Structured Data Support - Handle CSV, JSON, and other structured formats
Extensible Design - Easy to add custom loaders
Supported File Types
- Text Files:
  - .txt - Plain text files
  - .md - Markdown files
- Structured Data:
  - .csv - CSV files with encoding support
  - .json - JSON files with jq schema filtering
- Documents:
  - .docx - Microsoft Word documents
  - .html - HTML files
- PDF Files (9 methods):
  - pypdf - Basic PDF text extraction (default)
  - unstructured - Advanced OCR and layout detection
  - amazon_textract - AWS Textract for high-accuracy OCR
  - mathpix - Specialized for mathematical formulas
  - pdfplumber - High accuracy text and table extraction
  - pypdfium2 - Google PDFium library
  - pymupdf - PyMuPDF (fitz) backend
  - pymupdf4llm - LLM-optimized extraction
  - opendataloader - Advanced multi-format parsing
Quick Start
Installation
pip install automated-document-parser
Basic Usage
Step 1: Automatic File Type Detection
The parser automatically detects file types and loads them with sensible defaults:
from automated_document_parser import DocumentParser
# Initialize the parser
parser = DocumentParser()
# Parse a single document (auto-detects file type)
documents = parser.parse("path/to/document.pdf")
# Parse multiple documents (auto-detects each file type)
files = ["doc1.txt", "data.csv", "report.pdf"]
results = parser.parse_multiple(files)
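The library's detection logic is not shown on this page. As a rough mental model only, extension-based detection could be sketched as follows. The mapping, the category names, and the helpers `detect_kind` and `group_by_kind` are hypothetical and are not part of the library's API; only the extensions come from the list of supported file types.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a loader category.
# The extensions come from "Supported File Types"; the category names
# are made up for this sketch.
EXTENSION_TO_KIND = {
    ".txt": "text",
    ".md": "text",
    ".csv": "csv",
    ".json": "json",
    ".docx": "docx",
    ".html": "html",
    ".pdf": "pdf",
}


def detect_kind(path: str) -> str:
    """Return the loader category for a path, based on its extension only."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_KIND[suffix]
    except KeyError:
        label = suffix or "(no extension)"
        raise ValueError(f"Unsupported file type: {label}") from None


def group_by_kind(paths: list[str]) -> dict[str, list[str]]:
    """Group paths by detected category, keeping input order within each group."""
    groups: dict[str, list[str]] = {}
    for path in paths:
        groups.setdefault(detect_kind(path), []).append(path)
    return groups
```

Matching on a lower-cased suffix keeps `report.PDF` and `report.pdf` in the same category. A real implementation might also inspect file contents, which this sketch does not attempt.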
Step 2: Specify Loading Methods
Loading methods and parameters passed to parse_multiple apply to every file of the matching type:
# Specify PDF loading method for all PDFs
files = ["doc1.txt", "data.csv", "report.pdf", "paper.pdf"]
results = parser.parse_multiple(
    files,
    pdf_loader_method="pdfplumber",  # Applies to all PDF files
)

# Specify encoding for text files
results = parser.parse_multiple(
    files,
    encoding="utf-8",  # Applies to all text files
)

# Combine multiple parameters
results = parser.parse_multiple(
    files,
    pdf_loader_method="pymupdf",
    encoding="utf-8",
)
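One way to picture how a shared parameter reaches only the matching files is a per-extension filter over the keyword arguments. This is a sketch under assumptions, not the library's implementation: the `ACCEPTED_OPTIONS` table and the `options_for` helper are hypothetical, and only the parameter names `pdf_loader_method` and `encoding` come from the examples above.

```python
from pathlib import Path

# Hypothetical routing table: which keyword arguments each extension accepts.
# The table is an assumption for this sketch; the real rules may differ.
ACCEPTED_OPTIONS = {
    ".pdf": {"pdf_loader_method"},
    ".txt": {"encoding"},
    ".md": {"encoding"},
    ".csv": {"encoding"},
}


def options_for(path: str, **shared) -> dict:
    """Keep only the shared options that apply to this file's extension."""
    accepted = ACCEPTED_OPTIONS.get(Path(path).suffix.lower(), set())
    return {name: value for name, value in shared.items() if name in accepted}


if __name__ == "__main__":
    shared = {"pdf_loader_method": "pymupdf", "encoding": "utf-8"}
    for name in ["doc1.txt", "data.csv", "report.pdf"]:
        print(name, options_for(name, **shared))
```

With this kind of filtering, passing both parameters in one call is harmless: each file sees only the options that make sense for its type.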
Single File with Parameters
# Parse single PDF with specific method
documents = parser.parse(
    "document.pdf",
    pdf_loader_method="pdfplumber",
)

# Parse text file with encoding
documents = parser.parse(
    "file.txt",
    encoding="utf-8",
)
Advanced PDF Loading
For more control, use PDFLoader directly:
from automated_document_parser.loaders import PDFLoader
# Use specific PDF loading method
loader = PDFLoader("document.pdf", method="pdfplumber")
documents = loader.load()
# Use Mathpix for mathematical content
loader = PDFLoader(
    "math_paper.pdf",
    method="mathpix",
    mathpix_app_id="your_id",
    mathpix_app_key="your_key",
)
documents = loader.load()
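Hard-coding credentials as in the Mathpix example is fine for a quick test, but in shared code it is usually safer to read them from the environment. The helper below is a minimal sketch of that pattern. The function name `mathpix_credentials` and the environment variable names `MATHPIX_APP_ID` and `MATHPIX_APP_KEY` are assumptions chosen for illustration; only the keyword names `mathpix_app_id` and `mathpix_app_key` come from the example above.

```python
import os


def mathpix_credentials(env=None) -> dict:
    """Build the Mathpix keyword arguments from environment variables.

    The variable names below are hypothetical; use whatever names your
    deployment defines.
    """
    env = os.environ if env is None else env
    required = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "mathpix_app_id": env["MATHPIX_APP_ID"],
        "mathpix_app_key": env["MATHPIX_APP_KEY"],
    }


# Usage with the loader shown above (not executed in this sketch):
# loader = PDFLoader("math_paper.pdf", method="mathpix", **mathpix_credentials())
```

Failing early with a clear message avoids sending a request with empty credentials.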
Direct Loader Usage
from automated_document_parser.loaders.file_load import (
    TextFileLoader,
    CSVFileLoader,
    JSONFileLoader,
)
# Load text file with specific encoding
loader = TextFileLoader("file.txt", encoding="utf-8")
docs = loader.load()
# Load CSV file
csv_loader = CSVFileLoader("data.csv")
csv_docs = csv_loader.load()
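To give a feel for what row-based CSV loading tends to produce, the sketch below turns each row into one document with "column: value" lines plus source and row metadata. This mirrors the convention of LangChain's CSVLoader; whether CSVFileLoader returns exactly this shape is an assumption. Plain dictionaries stand in for LangChain Document objects, and `rows_to_documents` is a hypothetical helper.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Convert CSV text into one document-like dict per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for index, row in enumerate(reader):
        # One "column: value" line per cell, in header order.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": index},
            }
        )
    return documents


if __name__ == "__main__":
    sample = "name,age\nAda,36\nAlan,41\n"
    for doc in rows_to_documents(sample, "data.csv"):
        print(doc)
```

Keeping the row index in the metadata makes it possible to trace a retrieved chunk back to its line in the original file.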
Contents
API Reference
Additional Resources