API Reference
This page contains the complete API reference for the Automated Document Parser package.
Core Module
Main DocumentParser class for automated document loading.
- class automated_document_parser.core.DocumentParser
Bases: object
Main class for automated document parsing.
Automatically detects file type and loads documents using appropriate loaders. Designed for seamless integration with LangChain RAG pipelines.
- parse(file_path: str | Path, pdf_loader_method: str = 'pypdf', **kwargs) -> List[Document]
Parse a document from file path.
- Parameters:
file_path – Path to the document file
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for the loader (e.g., encoding, api_key)
- Returns:
List of LangChain Document objects
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported
RuntimeError – If parsing fails
Example
>>> from automated_document_parser.core import DocumentParser
>>> parser = DocumentParser()
>>> # Basic usage with auto-detection
>>> docs = parser.parse("document.pdf")
>>> # Specify PDF loading method
>>> docs = parser.parse("document.pdf", pdf_loader_method="pdfplumber")
>>> # Pass additional parameters
>>> docs = parser.parse("math.pdf", pdf_loader_method="mathpix",
...                     mathpix_app_id="id", mathpix_app_key="key")
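Callers can handle each exception documented for `parse` separately. The following is a minimal sketch, not part of the package: `safe_parse` is a hypothetical helper that wraps any callable with the same contract as `parse`, and `fake_parse` is a stand-in so the block runs without the package installed.

```python
from pathlib import Path
from typing import Callable, List, Optional, Union


def safe_parse(
    parse_fn: Callable[..., List],
    file_path: Union[str, Path],
    **kwargs,
) -> Optional[List]:
    """Call parse_fn and return None instead of raising the documented errors.

    Hypothetical helper: parse_fn is expected to behave like
    DocumentParser.parse (FileNotFoundError, ValueError, RuntimeError).
    """
    try:
        return parse_fn(file_path, **kwargs)
    except FileNotFoundError:
        print(f"missing file: {file_path}")
    except ValueError:
        print(f"unsupported file type: {file_path}")
    except RuntimeError as exc:
        print(f"parsing failed for {file_path}: {exc}")
    return None


def fake_parse(file_path, **kwargs):
    """Stand-in for DocumentParser().parse so the sketch is self-contained."""
    if str(file_path).endswith(".xyz"):
        raise ValueError("unsupported")
    return [f"document from {file_path}"]


docs = safe_parse(fake_parse, "report.pdf")
skipped = safe_parse(fake_parse, "archive.xyz")
```

In real use, `parse_fn` would be the bound method `DocumentParser().parse`.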
- parse_multiple(file_paths: List[str | Path], pdf_loader_method: str = 'pypdf', **kwargs) -> dict[str, List[Document]]
Parse multiple documents with automatic file type detection.
- Parameters:
file_paths – List of file paths
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for loaders (e.g., encoding, api_key)
- Returns:
Dictionary mapping file paths to their loaded documents
Example
>>> from automated_document_parser.core import DocumentParser
>>> parser = DocumentParser()
>>> # Auto-detect all file types with default settings
>>> results = parser.parse_multiple(["doc1.pdf", "doc2.txt", "data.csv"])
>>> # Specify PDF method for all PDFs
>>> results = parser.parse_multiple(
...     ["doc1.pdf", "doc2.pdf", "data.csv"],
...     pdf_loader_method="pdfplumber"
... )
>>> for file, docs in results.items():
...     print(f"{file}: {len(docs)} documents")
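Because `parse_multiple` returns a dictionary keyed by file path, a pipeline that wants one flat list has to merge the per-file lists. A minimal sketch, assuming a hypothetical helper `flatten_results`; plain strings stand in for LangChain Document objects so the block runs without dependencies.

```python
from typing import Dict, List


def flatten_results(results: Dict[str, List]) -> List:
    """Concatenate per-file document lists, keeping the dictionary's order."""
    combined: List = []
    for docs in results.values():
        combined.extend(docs)
    return combined


# Stand-in for the value returned by parse_multiple (hypothetical data).
results = {"doc1.pdf": ["page 1", "page 2"], "data.csv": ["row 1"]}
all_docs = flatten_results(results)
```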
Configuration
Configuration and mappings for document loaders.
Utilities
Helper functions for document parsing.
- automated_document_parser.utils.detect_file_type(file_path: str | Path) -> str | None
Detect file type based on extension.
- Parameters:
file_path – Path to the file
- Returns:
Loader type string or None if unsupported
- Raises:
FileNotFoundError – If file doesn’t exist
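The contract described above (existence check, then extension lookup, `None` for unknown extensions) can be sketched as follows. This is an illustration, not the package's code: the mapping `EXTENSION_TO_LOADER`, its loader-type strings, and the case-insensitive match are all assumptions.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union

# Hypothetical mapping; the package keeps its real table in its
# configuration module and may use different keys or loader names.
EXTENSION_TO_LOADER = {
    ".pdf": "pdf",
    ".txt": "text",
    ".md": "text",
    ".csv": "csv",
    ".json": "json",
    ".docx": "docx",
    ".html": "html",
}


def detect_file_type_sketch(file_path: Union[str, Path]) -> Optional[str]:
    """Return a loader type for the file's extension, or None if unknown."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    # Lower-casing the suffix is an assumption made for this sketch.
    return EXTENSION_TO_LOADER.get(path.suffix.lower())


with tempfile.TemporaryDirectory() as tmp:
    pdf_file = Path(tmp) / "Report.PDF"
    pdf_file.write_bytes(b"%PDF-1.4")
    unknown_file = Path(tmp) / "notes.xyz"
    unknown_file.write_text("x")
    pdf_type = detect_file_type_sketch(pdf_file)
    unknown_type = detect_file_type_sketch(unknown_file)
```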
- automated_document_parser.utils.is_supported_file(file_path: str | Path) -> bool
Check if file type is supported.
- Parameters:
file_path – Path to the file
- Returns:
True if supported, False otherwise
- automated_document_parser.utils.validate_file_path(file_path: str | Path) -> Path
Validate and normalize file path.
- Parameters:
file_path – Path to validate
- Returns:
Normalized Path object
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If path is not a file
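The two failure modes listed above can be told apart with standard `pathlib` checks. A minimal sketch under the assumption that "normalized" means an absolute, user-expanded path; `validate_file_path_sketch` is a hypothetical name, and the package's actual normalization may differ.

```python
import tempfile
from pathlib import Path
from typing import Union


def validate_file_path_sketch(file_path: Union[str, Path]) -> Path:
    """Return an absolute Path, or raise if it is missing or not a file."""
    path = Path(file_path).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if not path.is_file():
        raise ValueError(f"Path is not a file: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "a.txt"
    target.write_text("hello")
    normalized = validate_file_path_sketch(str(target))
    try:
        # A directory exists but is not a regular file.
        validate_file_path_sketch(tmp)
        directory_error = None
    except ValueError as exc:
        directory_error = str(exc)
```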
Loaders
Main File Loaders
Local file system loaders (pdf, txt, csv, etc.).
- class automated_document_parser.loaders.file_loaders.FileLoader(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)
Bases: object
Automated file loader that detects file type and loads documents.
- __init__(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)
Initialize the FileLoader.
- Parameters:
file_path – Path to the file to load
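pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’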
**pdf_loader_kwargs – Additional keyword arguments for PDF loader (e.g., client, api_key)
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported
- load() -> List[Document]
Load documents from the file.
- Returns:
List of LangChain Document objects
- Raises:
RuntimeError – If loading fails
- automated_document_parser.loaders.file_loaders.load_document(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs) -> List[Document]
Convenience function to load a document from a file.
- Parameters:
file_path – Path to the file
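pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’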
**pdf_loader_kwargs – Additional keyword arguments for PDF loader
- Returns:
List of LangChain Document objects
Examples
>>> from automated_document_parser.loaders.file_loaders import load_document
>>> # Load a text file
>>> documents = load_document("path/to/file.txt")
>>> # Load a PDF with default PyPDF
>>> documents = load_document("path/to/file.pdf")
>>> # Load a PDF with Unstructured
>>> documents = load_document("path/to/file.pdf", pdf_loader_method="unstructured")
>>> # Load a PDF with Amazon Textract
>>> import boto3
>>> client = boto3.client("textract", region_name="us-east-2")
>>> documents = load_document(
...     "s3://bucket/file.pdf",
...     pdf_loader_method="amazon_textract",
...     client=client
... )
File Load Module
Base File Loader
Base file loader interface.
- class automated_document_parser.loaders.file_load.base.BaseFileLoader(file_path: str | Path, **kwargs)
Bases: ABC
Abstract base class for file loaders.
All file loader implementations should inherit from this class.
- __init__(file_path: str | Path, **kwargs)
Initialize the file loader.
- Parameters:
file_path – Path to the file
**kwargs – Additional loader-specific arguments
- abstractmethod load() -> List[Document]
Load documents from the file.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If required dependencies are not installed
RuntimeError – If loading fails
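Subclassing the base loader means implementing `load()` and wrapping low-level failures in `RuntimeError`, as documented above. The sketch below uses stand-ins for both `Document` and `BaseFileLoader` so it runs without the package; the attribute names `file_path` and `kwargs` and the `LogFileLoader` class are assumptions made for illustration.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class Document:
    """Stand-in for LangChain's Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseFileLoader(ABC):
    """Stand-in mirroring the documented interface; internals are assumed."""

    def __init__(self, file_path, **kwargs):
        self.file_path = Path(file_path)
        self.kwargs = kwargs

    @abstractmethod
    def load(self) -> List[Document]:
        ...


class LogFileLoader(BaseFileLoader):
    """Hypothetical loader that emits one Document per non-empty line."""

    def load(self) -> List[Document]:
        encoding = self.kwargs.get("encoding", "utf-8")
        try:
            text = self.file_path.read_text(encoding=encoding)
        except OSError as exc:
            raise RuntimeError(f"Failed to load {self.file_path}") from exc
        return [
            Document(line, {"source": str(self.file_path), "line": number})
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip()
        ]


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "app.log"
    log_file.write_text("started\n\nstopped\n", encoding="utf-8")
    log_docs = LogFileLoader(log_file, encoding="utf-8").load()
```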
Text File Loader
Text file loader implementation.
Supports .txt and .md (markdown) files.
- class automated_document_parser.loaders.file_load.text_loader.TextFileLoader(file_path: str | Path, **kwargs)
Bases: BaseFileLoader
Text file loader for .txt and .md files.
Uses LangChain’s TextLoader with configurable encoding.
- load() -> List[Document]
Load text file.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community is not installed
CSV File Loader
CSV file loader implementation.
- class automated_document_parser.loaders.file_load.csv_loader.CSVFileLoader(file_path: str | Path, **kwargs)
Bases: BaseFileLoader
CSV file loader.
Uses LangChain’s CSVLoader with configurable encoding.
- load() -> List[Document]
Load CSV file.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community is not installed
JSON File Loader
JSON file loader implementation.
- class automated_document_parser.loaders.file_load.json_loader.JSONFileLoader(file_path: str | Path, **kwargs)
Bases: BaseFileLoader
JSON file loader.
Uses LangChain’s JSONLoader with jq schema support.
- load() -> List[Document]
Load JSON file.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community or jq is not installed
DOCX File Loader
DOCX file loader implementation.
- class automated_document_parser.loaders.file_load.docx_loader.DOCXFileLoader(file_path: str | Path, **kwargs)
Bases: BaseFileLoader
DOCX file loader for Microsoft Word documents.
Uses LangChain’s Docx2txtLoader.
- load() -> List[Document]
Load DOCX file.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community or docx2txt is not installed
HTML File Loader
HTML file loader implementation.
- class automated_document_parser.loaders.file_load.html_loader.HTMLFileLoader(file_path: str | Path, **kwargs)
Bases: BaseFileLoader
HTML file loader.
Uses LangChain’s UnstructuredHTMLLoader.
- load() -> List[Document]
Load HTML file.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community or unstructured is not installed
PDF Load Module
PDF Loader Classes
- class automated_document_parser.loaders.pdf_load.PDFLoader(file_path: str | Path, method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] | str = 'pypdf', loader_class: type[BasePDFLoader] | None = None, **kwargs)
Bases: object
Flexible PDF loader supporting multiple parsing backends.
By default, uses PyPDF for standard PDF parsing. Users can specify alternative methods like ‘unstructured’ for advanced parsing or ‘amazon_textract’ for OCR capabilities.
Users can also provide custom loader classes that inherit from BasePDFLoader.
- __init__(file_path: str | Path, method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] | str = 'pypdf', loader_class: type[BasePDFLoader] | None = None, **kwargs)
Initialize PDF loader with specified method or custom loader.
- Parameters:
file_path – Path to PDF file or URL (for amazon_textract)
method – Loading method (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’. Can also be a custom string if loader_class is provided.
loader_class – Optional custom loader class inheriting from BasePDFLoader. If provided, this takes precedence over the method parameter.
**kwargs – Additional arguments passed to the specific loader.
- For amazon_textract:
client: boto3 Textract client (optional)
region_name: AWS region (default: ‘us-east-2’)
- For unstructured:
api_key: Unstructured API key (or set UNSTRUCTURED_API_KEY env var)
- For mathpix:
mathpix_api_key: Mathpix API key (or set MATHPIX_API_KEY env var)
- Raises:
ValueError – If method is not supported and no loader_class is provided
TypeError – If loader_class doesn’t inherit from BasePDFLoader
Examples
>>> from automated_document_parser.loaders.pdf_load import PDFLoader
>>> # Default PyPDF loader
>>> loader = PDFLoader("document.pdf")
>>> docs = loader.load()
>>> # Use Unstructured
>>> loader = PDFLoader("document.pdf", method="unstructured")
>>> docs = loader.load()
>>> # Use custom loader class
>>> from my_loaders import CustomPDFLoader
>>> loader = PDFLoader("document.pdf", loader_class=CustomPDFLoader)
>>> docs = loader.load()
- load() -> List[Document]
Load PDF documents using the specified method or loader.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If required dependencies are not installed
RuntimeError – If loading fails
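The selection rules documented for `PDFLoader` (a custom `loader_class` wins over `method`, `TypeError` for a class that does not inherit from the base, `ValueError` for an unknown method) can be sketched as below. Everything here is a stand-in: `resolve_loader`, `REGISTRY`, and the two loader classes are hypothetical and only illustrate the documented behaviour, not the package's internals.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BasePDFLoader(ABC):
    """Stand-in for the package's base class; internals are assumed."""

    def __init__(self, file_path, **kwargs):
        self.file_path = file_path
        self.kwargs = kwargs

    @abstractmethod
    def load(self) -> List[str]:
        ...


class PyPDFStandIn(BasePDFLoader):
    def load(self) -> List[str]:
        return [f"pypdf:{self.file_path}"]


class CustomLoader(BasePDFLoader):
    def load(self) -> List[str]:
        return [f"custom:{self.file_path}"]


# Hypothetical registry with one entry; the package documents nine methods.
REGISTRY = {"pypdf": PyPDFStandIn}


def resolve_loader(
    file_path,
    method: str = "pypdf",
    loader_class: Optional[type] = None,
    **kwargs,
) -> BasePDFLoader:
    """Pick a loader: loader_class takes precedence over method."""
    if loader_class is not None:
        if not (isinstance(loader_class, type)
                and issubclass(loader_class, BasePDFLoader)):
            raise TypeError("loader_class must inherit from BasePDFLoader")
        return loader_class(file_path, **kwargs)
    if method not in REGISTRY:
        raise ValueError(f"Unsupported method: {method!r}")
    return REGISTRY[method](file_path, **kwargs)


default_docs = resolve_loader("a.pdf").load()
custom_docs = resolve_loader(
    "a.pdf", method="anything", loader_class=CustomLoader
).load()
```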
- automated_document_parser.loaders.pdf_load.load_pdf(file_path: str | Path, method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] | str = 'pypdf', loader_class: type[BasePDFLoader] | None = None, **kwargs) -> List[Document]
Convenience function to load a PDF document.
- Parameters:
file_path – Path to PDF file or URL
method – Loading method (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
loader_class – Optional custom loader class inheriting from BasePDFLoader
**kwargs – Additional arguments for the loader
- Returns:
List of LangChain Document objects
Examples
>>> from automated_document_parser.loaders.pdf_load import load_pdf
>>> # Basic usage with PyPDF (default)
>>> docs = load_pdf("paper.pdf")
>>> # Use Unstructured API
>>> docs = load_pdf("paper.pdf", method="unstructured")
>>> # Use Amazon Textract with URL
>>> docs = load_pdf(
...     "https://example.com/document.pdf",
...     method="amazon_textract"
... )
>>> # Use Amazon Textract with S3
>>> import boto3
>>> client = boto3.client("textract", region_name="us-east-2")
>>> docs = load_pdf(
...     "s3://bucket/document.pdf",
...     method="amazon_textract",
...     client=client
... )
>>> # Use custom loader
>>> from my_loaders import MyCustomLoader
>>> docs = load_pdf("paper.pdf", loader_class=MyCustomLoader)
Base PDF Loader
Base PDF loader class and type definitions.
- class automated_document_parser.loaders.pdf_load.base.BasePDFLoader(file_path: str | Path, **kwargs)
Bases: ABC
Abstract base class for PDF loaders.
All PDF loader implementations should inherit from this class.
- __init__(file_path: str | Path, **kwargs)
Initialize the PDF loader.
- Parameters:
file_path – Path to PDF file or URL
**kwargs – Additional loader-specific arguments
- abstractmethod load() -> List[Document]
Load PDF documents.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If required dependencies are not installed
RuntimeError – If loading fails
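The `ImportError` listed above suggests that backends are imported when `load()` runs rather than when the package is imported. One common way to get that behaviour is a lazy import with an install hint. This is a sketch of that general pattern, not the package's code; `require` is a hypothetical helper and the module name used to trigger the error is made up.

```python
import importlib
from types import ModuleType


def require(module_name: str, install_hint: str) -> ModuleType:
    """Import module_name on demand, raising ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required. Install it with: {install_hint}"
        ) from exc


# A module that exists in the standard library imports normally.
json_module = require("json", "bundled with Python")

# A made-up module name shows the error path.
try:
    require("definitely_not_installed_backend", "pip install <backend-package>")
    message = None
except ImportError as exc:
    message = str(exc)
```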
PyPDF Loader
PyPDF loader implementation.
Reference: https://docs.langchain.com/oss/python/integrations/document_loaders/pypdfloader
- class automated_document_parser.loaders.pdf_load.pypdf_loader.PyPDFLoaderImpl(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using PyPDF backend.
Fast and simple PDF parsing suitable for most standard PDFs. This is the default loader method.
Unstructured PDF Loader
Unstructured API loader implementation.
Reference: https://docs.langchain.com/oss/python/integrations/document_loaders/unstructured_file
- class automated_document_parser.loaders.pdf_load.unstructured_loader.UnstructuredPDFLoader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using Unstructured API backend.
Advanced parsing with support for complex document layouts. Requires UNSTRUCTURED_API_KEY environment variable or api_key in kwargs.
- load() -> List[Document]
Load PDF using Unstructured API.
- Returns:
List of LangChain Document objects
- Raises:
ValueError – If API key is not provided
ImportError – If langchain-unstructured is not installed
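The key lookup described for this loader (keyword argument or environment variable, `ValueError` when neither is set) can be sketched as follows. `resolve_api_key` is a hypothetical helper, and checking the keyword argument before the environment variable is an assumption; the documentation does not state the precedence.

```python
import os
from typing import Mapping


def resolve_api_key(
    kwargs: Mapping[str, str],
    kwarg_name: str,
    env_var: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the key from kwargs, else from env, else raise ValueError."""
    # Assumed precedence: an explicit keyword argument beats the environment.
    key = kwargs.get(kwarg_name) or env.get(env_var)
    if not key:
        raise ValueError(
            f"API key not provided: pass {kwarg_name}=... or set {env_var}"
        )
    return key


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"UNSTRUCTURED_API_KEY": "from-env"}
explicit = resolve_api_key(
    {"api_key": "from-kwargs"}, "api_key", "UNSTRUCTURED_API_KEY", env=fake_env
)
from_env = resolve_api_key({}, "api_key", "UNSTRUCTURED_API_KEY", env=fake_env)
```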
Amazon Textract Loader
Amazon Textract loader implementation.
Reference: https://docs.langchain.com/oss/python/integrations/document_loaders/amazon_textract
- class automated_document_parser.loaders.pdf_load.textract_loader.AmazonTextractPDFLoader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using Amazon Textract backend.
OCR service for extracting text from scanned documents and images. Supports local files, HTTP/HTTPS URLs, and S3 URIs.
Requires AWS credentials to be configured.
- load() -> List[Document]
Load PDF using Amazon Textract.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If boto3 or amazon-textract-caller is not installed
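Since this loader accepts local paths, HTTP/HTTPS URLs, and S3 URIs, a caller may want to know which kind of source a string refers to before passing it on. A minimal sketch using only the standard library; `classify_source` is a hypothetical helper and says nothing about how the loader itself distinguishes sources.

```python
from urllib.parse import urlparse


def classify_source(file_path: str) -> str:
    """Return 's3', 'url', or 'local' based on the URI scheme."""
    scheme = urlparse(str(file_path)).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "url"
    # No scheme, or an unrelated one such as a drive letter, is treated as local.
    return "local"


kinds = [
    classify_source("s3://bucket/document.pdf"),
    classify_source("https://example.com/document.pdf"),
    classify_source("scans/document.pdf"),
]
```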
Mathpix Loader
Mathpix API loader implementation.
Reference: https://python.langchain.com/docs/integrations/document_loaders/mathpix/
- class automated_document_parser.loaders.pdf_load.mathpix_loader.MathpixPDFLoader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using Mathpix API backend.
Specialized for converting PDFs with mathematical formulas, tables, and diagrams. Requires MATHPIX_API_KEY environment variable or mathpix_api_key in kwargs.
- load() -> List[Document]
Load PDF using Mathpix API.
- Returns:
List of LangChain Document objects
- Raises:
ValueError – If API key is not provided
ImportError – If langchain-community is not installed
PDFPlumber Loader
PDFPlumber loader implementation.
Reference: https://python.langchain.com/docs/integrations/document_loaders/pdfplumber/
- class automated_document_parser.loaders.pdf_load.pdfplumber_loader.PDFPlumberLoader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using pdfplumber backend.
Extracts text and table data from PDFs with high accuracy. No API key required - works locally.
- load() -> List[Document]
Load PDF using pdfplumber.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community or pdfplumber is not installed
PyPDFium2 Loader
PyPDFium2 loader implementation.
Reference: https://python.langchain.com/docs/integrations/document_loaders/pypdfium2/
- class automated_document_parser.loaders.pdf_load.pypdfium2_loader.PyPDFium2Loader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using pypdfium2 backend.
Fast and accurate PDF parsing using Google’s PDFium library. No API key required - works locally.
- load() -> List[Document]
Load PDF using pypdfium2.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community or pypdfium2 is not installed
PyMuPDF Loader
PyMuPDF loader implementation.
Reference: https://python.langchain.com/docs/integrations/document_loaders/pymupdf/
- class automated_document_parser.loaders.pdf_load.pymupdf_loader.PyMuPDFLoader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using PyMuPDF (fitz) backend.
Fast PDF parsing with support for text, images, and metadata extraction. No API key required - works locally.
- load() -> List[Document]
Load PDF using PyMuPDF.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-community or pymupdf is not installed
PyMuPDF4LLM Loader
PyMuPDF4LLM loader implementation.
Reference: https://python.langchain.com/docs/integrations/document_loaders/pymupdf4llm/
- class automated_document_parser.loaders.pdf_load.pymupdf4llm_loader.PyMuPDF4LLMLoader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using PyMuPDF4LLM backend.
Optimized for LLM processing with enhanced text extraction and formatting. No API key required - works locally.
- load() -> List[Document]
Load PDF using PyMuPDF4LLM.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-pymupdf4llm is not installed
OpenDataLoader PDF Loader
OpenDataLoader PDF loader implementation.
Reference: https://python.langchain.com/docs/integrations/document_loaders/opendataloader/
- class automated_document_parser.loaders.pdf_load.opendataloader_loader.OpenDataLoaderPDFLoader(file_path: str | Path, **kwargs)
Bases: BasePDFLoader
PDF loader using OpenDataLoader backend.
Advanced PDF parsing with support for multiple formats and configurations. No API key required - works locally.
- load() -> List[Document]
Load PDF using OpenDataLoader.
- Returns:
List of LangChain Document objects
- Raises:
ImportError – If langchain-opendataloader-pdf is not installed