API Reference

This page contains the complete API reference for the Automated Document Parser package.

Core Module

Main DocumentParser class for automated document loading.

class automated_document_parser.core.DocumentParser

Bases: object

Main class for automated document parsing.

Automatically detects file type and loads documents using appropriate loaders. Designed for seamless integration with LangChain RAG pipelines.

__init__()[source]: Initialize the DocumentParser.

parse(file_path: str | Path, pdf_loader_method: str = 'pypdf', **kwargs) → List[Document]

Parse a document from file path.

Parameters:

file_path – Path to the document file
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for the loader (e.g., encoding, api_key)

Returns:

List of LangChain Document objects

Raises:

FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported
RuntimeError – If parsing fails

Example

>>> parser = DocumentParser()
>>> # Basic usage with auto-detection
>>> docs = parser.parse("document.pdf")
>>> # Specify PDF loading method
>>> docs = parser.parse("document.pdf", pdf_loader_method="pdfplumber")
>>> # Pass additional parameters
>>> docs = parser.parse("math.pdf", pdf_loader_method="mathpix",
...                     mathpix_app_id="id", mathpix_app_key="key")
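
The three exceptions listed under Raises can be handled separately by the caller. The following is a minimal sketch rather than part of the package: safe_parse and _StubParser are hypothetical names, and the stub stands in for DocumentParser so the snippet runs without the package installed.

```python
from pathlib import Path
from typing import Any, List, Optional, Union


def safe_parse(parser: Any, file_path: Union[str, Path], **kwargs: Any) -> Optional[List[Any]]:
    """Call parser.parse() and map each documented exception to None."""
    try:
        return parser.parse(file_path, **kwargs)
    except FileNotFoundError:
        print(f"missing file: {file_path}")
    except ValueError as exc:
        print(f"unsupported file type: {exc}")
    except RuntimeError as exc:
        print(f"parsing failed: {exc}")
    return None


class _StubParser:
    """Hypothetical stand-in for DocumentParser (assumption, not the real class)."""

    def parse(self, file_path: Union[str, Path], **kwargs: Any) -> List[str]:
        if str(file_path).endswith(".xyz"):
            raise ValueError(f"Unsupported file type: {file_path}")
        return [f"document from {file_path}"]


ok = safe_parse(_StubParser(), "a.txt")
failed = safe_parse(_StubParser(), "a.xyz")
```

With the real class, the same wrapper would presumably be called as safe_parse(DocumentParser(), "document.pdf").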

parse_multiple(file_paths: List[str | Path], pdf_loader_method: str = 'pypdf', **kwargs) → dict[str, List[Document]]

Parse multiple documents with automatic file type detection.

Parameters:

file_paths – List of file paths
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for loaders (e.g., encoding, api_key)

Returns:

Dictionary mapping file paths to their loaded documents

Example

>>> parser = DocumentParser()
>>> # Auto-detect all file types with default settings
>>> results = parser.parse_multiple(["doc1.pdf", "doc2.txt", "data.csv"])
>>> # Specify PDF method for all PDFs
>>> results = parser.parse_multiple(
...     ["doc1.pdf", "doc2.pdf", "data.csv"],
...     pdf_loader_method="pdfplumber"
... )
>>> for file, docs in results.items():
...     print(f"{file}: {len(docs)} documents")
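
Because parse_multiple returns a dictionary keyed by file path, a downstream step such as a text splitter or vector store usually needs the per-file lists merged into one. A minimal sketch of that step follows; flatten_results is a hypothetical helper, and plain strings stand in for LangChain Document objects.

```python
from typing import Any, Dict, List


def flatten_results(results: Dict[str, List[Any]]) -> List[Any]:
    """Concatenate the per-file document lists, keeping dictionary order."""
    flat: List[Any] = []
    for docs in results.values():
        flat.extend(docs)
    return flat


# Stand-in data shaped like the return value of parse_multiple.
example = {"doc1.pdf": ["page 1", "page 2"], "data.csv": ["row 1"]}
all_docs = flatten_results(example)
```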

get_loaded_files() → List[str]

Get list of successfully loaded files.

Returns:

List of file paths that were successfully loaded

Configuration

Configuration and mappings for document loaders.

Utilities

Helper functions for document parsing.

automated_document_parser.utils.detect_file_type(file_path: str | Path) → str | None

Detect file type based on extension.

Parameters:

file_path – Path to the file

Returns:

Loader type string or None if unsupported

Raises:

FileNotFoundError – If file doesn’t exist

automated_document_parser.utils.is_supported_file(file_path: str | Path) → bool

Check if file type is supported.

Parameters:

file_path – Path to the file

Returns:

True if supported, False otherwise
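
Extension-based detection, as described for detect_file_type and is_supported_file, can be sketched with a lookup table. This is an illustration under assumptions, not the package's code: the function names carry a _sketch suffix, the loader type strings in EXTENSION_TO_LOADER are invented, and the choice to skip the existence check in is_supported_file_sketch is a guess based on its entry listing no exceptions.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union

# Hypothetical extension table, for illustration only.
EXTENSION_TO_LOADER = {
    ".pdf": "pdf",
    ".txt": "text",
    ".md": "text",
    ".csv": "csv",
    ".json": "json",
    ".docx": "docx",
    ".html": "html",
}


def detect_file_type_sketch(file_path: Union[str, Path]) -> Optional[str]:
    """Return a loader type for an existing file, or None if unsupported."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    # Compare case-insensitively so "REPORT.PDF" matches ".pdf".
    return EXTENSION_TO_LOADER.get(path.suffix.lower())


def is_supported_file_sketch(file_path: Union[str, Path]) -> bool:
    """Check the extension only; the file itself is not opened."""
    return Path(file_path).suffix.lower() in EXTENSION_TO_LOADER


# Small demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.MD"
    notes.write_text("hello", encoding="utf-8")
    blob = Path(tmp) / "blob.xyz"
    blob.write_text("?", encoding="utf-8")
    detected = (detect_file_type_sketch(notes), detect_file_type_sketch(blob))
```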

automated_document_parser.utils.validate_file_path(file_path: str | Path) → Path

Validate and normalize file path.

Parameters:

file_path – Path to validate

Returns:

Normalized Path object

Raises:

FileNotFoundError – If file doesn’t exist
ValueError – If path is not a file
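
The contract above (a Path comes back, FileNotFoundError for a missing path, ValueError for a non-file) can be sketched with pathlib alone. The function name below is hypothetical, and treating "normalize" as expanduser() followed by resolve() is an assumption; the package may normalize differently.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


def validate_file_path_sketch(file_path: Union[str, Path]) -> Path:
    """Return an absolute Path, or raise if it is missing or not a file."""
    # Assumed meaning of "normalize": expand "~" and make the path absolute.
    path = Path(file_path).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if not path.is_file():
        raise ValueError(f"Not a file: {path}")
    return path


# A directory exists but is not a file, so it triggers the ValueError branch.
directory_error: Optional[str] = None
with tempfile.TemporaryDirectory() as tmp:
    try:
        validate_file_path_sketch(tmp)
    except ValueError as exc:
        directory_error = str(exc)
```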

automated_document_parser.utils.get_file_info(file_path: str | Path) → dict

Get basic file information.

Parameters:

file_path – Path to the file

Returns:

Dictionary with file metadata

Loaders

Main File Loaders

Local file system loaders (pdf, txt, csv, etc.).

class automated_document_parser.loaders.file_loaders.FileLoader(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)

Bases: object

Automated file loader that detects file type and loads documents.

__init__(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)

Initialize the FileLoader.

Parameters:

file_path – Path to the file to load
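pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’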
**pdf_loader_kwargs – Additional keyword arguments for PDF loader (e.g., client, api_key)

Raises:

FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported

load() → List[Document]

Load documents from the file.

Returns:

List of LangChain Document objects

Raises:

RuntimeError – If loading fails

automated_document_parser.loaders.file_loaders.load_document(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs) → List[Document]

Convenience function to load a document from a file.

Parameters:

file_path – Path to the file
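pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’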
**pdf_loader_kwargs – Additional keyword arguments for PDF loader

Returns:

List of LangChain Document objects

Examples

>>> # Load a text file
>>> documents = load_document("path/to/file.txt")

>>> # Load a PDF with default PyPDF
>>> documents = load_document("path/to/file.pdf")

>>> # Load a PDF with Unstructured
>>> documents = load_document("path/to/file.pdf", pdf_loader_method="unstructured")

>>> # Load a PDF with Amazon Textract
>>> import boto3
>>> client = boto3.client("textract", region_name="us-east-2")
>>> documents = load_document(
...     "s3://bucket/file.pdf",
...     pdf_loader_method="amazon_textract",
...     client=client
... )

File Load Module

Base File Loader

Base file loader interface.

class automated_document_parser.loaders.file_load.base.BaseFileLoader(file_path: str | Path, **kwargs)

Bases: ABC

Abstract base class for file loaders.

All file loader implementations should inherit from this class.

__init__(file_path: str | Path, **kwargs)

Initialize the file loader.

Parameters:

file_path – Path to the file
**kwargs – Additional loader-specific arguments

abstractmethod load() → List[Document]

Load documents from the file.

Returns:

List of LangChain Document objects

Raises:

ImportError – If required dependencies are not installed
RuntimeError – If loading fails

abstractmethod static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string
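
A subclass has to supply both abstract members before it can be instantiated. The sketch below mirrors that contract with stand-in classes so it runs without the package or LangChain installed. BaseFileLoaderSketch, UppercaseTextLoader, and the local Document dataclass are hypothetical; storing file_path and kwargs on the instance is an assumption about the base class.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Union


@dataclass
class Document:
    """Stand-in for LangChain's Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseFileLoaderSketch(ABC):
    """Stand-in for BaseFileLoader with the same two abstract members."""

    def __init__(self, file_path: Union[str, Path], **kwargs: Any) -> None:
        self.file_path = Path(file_path)
        self.kwargs = kwargs

    @abstractmethod
    def load(self) -> List[Document]:
        """Return the loaded documents."""

    @staticmethod
    @abstractmethod
    def get_install_command() -> str:
        """Return a hint for installing missing dependencies."""


class UppercaseTextLoader(BaseFileLoaderSketch):
    """Toy loader: reads a text file and upper-cases its content."""

    def load(self) -> List[Document]:
        encoding = self.kwargs.get("encoding", "utf-8")
        text = self.file_path.read_text(encoding=encoding)
        return [Document(text.upper(), {"source": str(self.file_path)})]

    @staticmethod
    def get_install_command() -> str:
        return "no extra dependencies required"


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("hello", encoding="utf-8")
    docs = UppercaseTextLoader(sample).load()
```

Because get_install_command is static, it can be called on the class itself, for example to build an error message after an ImportError.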

Text File Loader

Text file loader implementation.

Supports .txt and .md (markdown) files.

class automated_document_parser.loaders.file_load.text_loader.TextFileLoader(file_path: str | Path, **kwargs)

Bases: BaseFileLoader

Text file loader for .txt and .md files.

Uses LangChain’s TextLoader with configurable encoding.

load() → List[Document]

Load text file.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

CSV File Loader

CSV file loader implementation.

class automated_document_parser.loaders.file_load.csv_loader.CSVFileLoader(file_path: str | Path, **kwargs)

Bases: BaseFileLoader

CSV file loader.

Uses LangChain’s CSVLoader with configurable encoding.

load() → List[Document]

Load CSV file.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

JSON File Loader

JSON file loader implementation.

class automated_document_parser.loaders.file_load.json_loader.JSONFileLoader(file_path: str | Path, **kwargs)

Bases: BaseFileLoader

JSON file loader.

Uses LangChain’s JSONLoader with jq schema support.

load() → List[Document]

Load JSON file.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community or jq is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

DOCX File Loader

DOCX file loader implementation.

class automated_document_parser.loaders.file_load.docx_loader.DOCXFileLoader(file_path: str | Path, **kwargs)

Bases: BaseFileLoader

DOCX file loader for Microsoft Word documents.

Uses LangChain’s Docx2txtLoader.

load() → List[Document]

Load DOCX file.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community or docx2txt is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

HTML File Loader

HTML file loader implementation.

class automated_document_parser.loaders.file_load.html_loader.HTMLFileLoader(file_path: str | Path, **kwargs)

Bases: BaseFileLoader

HTML file loader.

Uses LangChain’s UnstructuredHTMLLoader.

load() → List[Document]

Load HTML file.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community or unstructured is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

PDF Load Module

PDF Loader Classes

class automated_document_parser.loaders.pdf_load.PDFLoader(file_path: str | Path, method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] | str = 'pypdf', loader_class: type[BasePDFLoader] | None = None, **kwargs)

Bases: object

Flexible PDF loader supporting multiple parsing backends.

By default, uses PyPDF for standard PDF parsing. Users can specify alternative methods like ‘unstructured’ for advanced parsing or ‘amazon_textract’ for OCR capabilities.

Users can also provide custom loader classes that inherit from BasePDFLoader.

__init__(file_path: str | Path, method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] | str = 'pypdf', loader_class: type[BasePDFLoader] | None = None, **kwargs)

Initialize PDF loader with specified method or custom loader.

Parameters:

file_path – Path to PDF file or URL (for amazon_textract)
method – Loading method: ‘pypdf’ (default), ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, or ‘opendataloader’. Can also be a custom string if loader_class is provided.
loader_class – Optional custom loader class inheriting from BasePDFLoader. If provided, this takes precedence over the method parameter.
**kwargs – Additional arguments passed to the specific loader.
For amazon_textract:
- client: boto3 Textract client (optional)
- region_name: AWS region (default: ‘us-east-2’)
For unstructured:
- api_key: Unstructured API key (or set UNSTRUCTURED_API_KEY env var)
For mathpix:
- mathpix_api_key: Mathpix API key (or set MATHPIX_API_KEY env var)

Raises:

ValueError – If method is not supported and no loader_class is provided
TypeError – If loader_class doesn’t inherit from BasePDFLoader

Examples

>>> # Default PyPDF loader
>>> loader = PDFLoader("document.pdf")
>>> docs = loader.load()

>>> # Use Unstructured
>>> loader = PDFLoader("document.pdf", method="unstructured")
>>> docs = loader.load()

>>> # Use custom loader class
>>> from my_loaders import CustomPDFLoader
>>> loader = PDFLoader("document.pdf", loader_class=CustomPDFLoader)
>>> docs = loader.load()
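
The precedence rule documented above (loader_class wins over method, with TypeError for a class that does not inherit from the base and ValueError for an unknown method) can be sketched as a small resolver. Everything here is a stand-in: BasePDFLoaderSketch, PyPDFSketch, BUILTIN_LOADERS, and resolve_loader_class are hypothetical names, and only one built-in method is modelled.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Type


class BasePDFLoaderSketch(ABC):
    """Stand-in for BasePDFLoader so the sketch runs on its own."""

    def __init__(self, file_path: str, **kwargs: Any) -> None:
        self.file_path = file_path
        self.kwargs = kwargs

    @abstractmethod
    def load(self) -> List[str]:
        """Return the loaded documents (plain strings in this sketch)."""


class PyPDFSketch(BasePDFLoaderSketch):
    def load(self) -> List[str]:
        return [f"pypdf:{self.file_path}"]


class MyCustomLoader(BasePDFLoaderSketch):
    def load(self) -> List[str]:
        return [f"custom:{self.file_path}"]


# Hypothetical registry of built-in methods.
BUILTIN_LOADERS: Dict[str, Type[BasePDFLoaderSketch]] = {"pypdf": PyPDFSketch}


def resolve_loader_class(
    method: str = "pypdf",
    loader_class: Optional[type] = None,
) -> Type[BasePDFLoaderSketch]:
    """Pick the loader class; an explicit loader_class overrides method."""
    if loader_class is not None:
        is_valid = isinstance(loader_class, type) and issubclass(
            loader_class, BasePDFLoaderSketch
        )
        if not is_valid:
            raise TypeError("loader_class must inherit from BasePDFLoaderSketch")
        return loader_class
    if method not in BUILTIN_LOADERS:
        raise ValueError(f"Unsupported method: {method!r}")
    return BUILTIN_LOADERS[method]


custom_docs = resolve_loader_class("pypdf", MyCustomLoader)("a.pdf").load()
```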

load() → List[Document]

Load PDF documents using the specified method or loader.

Returns:

List of LangChain Document objects

Raises:

ImportError – If required dependencies are not installed
RuntimeError – If loading fails

get_install_command() → str[source]: Get pip install command for the current loader’s dependencies.

automated_document_parser.loaders.pdf_load.load_pdf(file_path: str | Path, method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] | str = 'pypdf', loader_class: type[BasePDFLoader] | None = None, **kwargs) → List[Document]

Convenience function to load a PDF document.

Parameters:

file_path – Path to PDF file or URL
method – Loading method: ‘pypdf’ (default), ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, or ‘opendataloader’
loader_class – Optional custom loader class inheriting from BasePDFLoader
**kwargs – Additional arguments for the loader

Returns:

List of LangChain Document objects

Examples

>>> # Basic usage with PyPDF (default)
>>> docs = load_pdf("paper.pdf")

>>> # Use Unstructured API
>>> docs = load_pdf("paper.pdf", method="unstructured")

>>> # Use Amazon Textract with URL
>>> docs = load_pdf(
...     "https://example.com/document.pdf",
...     method="amazon_textract"
... )

>>> # Use Amazon Textract with S3
>>> import boto3
>>> client = boto3.client("textract", region_name="us-east-2")
>>> docs = load_pdf(
...     "s3://bucket/document.pdf",
...     method="amazon_textract",
...     client=client
... )

>>> # Use custom loader
>>> from my_loaders import MyCustomLoader
>>> docs = load_pdf("paper.pdf", loader_class=MyCustomLoader)

automated_document_parser.loaders.pdf_load.PDFLoaderMethod: alias of Literal[‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’]
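
Since the alias is a typing.Literal, its allowed values can be read back at runtime with typing.get_args, which is handy for validating user input before a loader is constructed. The sketch below redefines the alias locally so it runs without the package; check_method is a hypothetical helper.

```python
from typing import Literal, get_args

# Local copy of the alias documented above.
PDFLoaderMethod = Literal[
    "pypdf", "unstructured", "amazon_textract", "mathpix", "pdfplumber",
    "pypdfium2", "pymupdf", "pymupdf4llm", "opendataloader",
]

# get_args returns the literal values as a tuple of strings.
VALID_METHODS = get_args(PDFLoaderMethod)


def check_method(name: str) -> str:
    """Return name unchanged if it is a known method, else raise ValueError."""
    if name not in VALID_METHODS:
        raise ValueError(
            f"Unknown PDF loader method {name!r}; expected one of {VALID_METHODS}"
        )
    return name
```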

Base PDF Loader

Base PDF loader class and type definitions.

class automated_document_parser.loaders.pdf_load.base.BasePDFLoader(file_path: str | Path, **kwargs)

Bases: ABC

Abstract base class for PDF loaders.

All PDF loader implementations should inherit from this class.

__init__(file_path: str | Path, **kwargs)

Initialize the PDF loader.

Parameters:

file_path – Path to PDF file or URL
**kwargs – Additional loader-specific arguments

abstractmethod load() → List[Document]

Load PDF documents.

Returns:

List of LangChain Document objects

Raises:

ImportError – If required dependencies are not installed
RuntimeError – If loading fails

abstractmethod get_install_command() → str

Get the pip install command for required dependencies.

Returns:

Install command string

PyPDF Loader

PyPDF loader implementation.

Reference: https://docs.langchain.com/oss/python/integrations/document_loaders/pypdfloader

class automated_document_parser.loaders.pdf_load.pypdf_loader.PyPDFLoaderImpl(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using PyPDF backend.

Fast and simple PDF parsing suitable for most standard PDFs. This is the default loader method.

load() → List[Document]

Load PDF using PyPDF.

Returns:

List of LangChain Document objects

get_install_command() → str

Get pip install command for PyPDF dependencies.

Unstructured PDF Loader

Unstructured API loader implementation.

Reference: https://docs.langchain.com/oss/python/integrations/document_loaders/unstructured_file

class automated_document_parser.loaders.pdf_load.unstructured_loader.UnstructuredPDFLoader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using Unstructured API backend.

Advanced parsing with support for complex document layouts. Requires UNSTRUCTURED_API_KEY environment variable or api_key in kwargs.

load() → List[Document]

Load PDF using Unstructured API.

Returns:

List of LangChain Document objects

Raises:

ValueError – If API key is not provided
ImportError – If langchain-unstructured is not installed

get_install_command() → str[source]: Get pip install command for Unstructured dependencies.

Amazon Textract Loader

Amazon Textract loader implementation.

Reference: https://docs.langchain.com/oss/python/integrations/document_loaders/amazon_textract

class automated_document_parser.loaders.pdf_load.textract_loader.AmazonTextractPDFLoader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using Amazon Textract backend.

OCR service for extracting text from scanned documents and images. Supports local files, HTTP/HTTPS URLs, and S3 URIs.

Requires AWS credentials to be configured.

load() → List[Document]

Load PDF using Amazon Textract.

Returns:

List of LangChain Document objects

Raises:

ImportError – If boto3 or amazon-textract-caller is not installed

get_install_command() → str

Get pip install command for Amazon Textract dependencies.

Mathpix Loader

Mathpix API loader implementation.

Reference: https://python.langchain.com/docs/integrations/document_loaders/mathpix/

class automated_document_parser.loaders.pdf_load.mathpix_loader.MathpixPDFLoader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using Mathpix API backend.

Specialized for converting PDFs with mathematical formulas, tables, and diagrams. Requires MATHPIX_API_KEY environment variable or mathpix_api_key in kwargs.

load() → List[Document]

Load PDF using Mathpix API.

Returns:

List of LangChain Document objects

Raises:

ValueError – If API key is not provided
ImportError – If langchain-community is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

PDFPlumber Loader

PDFPlumber loader implementation.

Reference: https://python.langchain.com/docs/integrations/document_loaders/pdfplumber/

class automated_document_parser.loaders.pdf_load.pdfplumber_loader.PDFPlumberLoader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using pdfplumber backend.

Extracts text and table data from PDFs with high accuracy. No API key required - works locally.

load() → List[Document]

Load PDF using pdfplumber.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community or pdfplumber is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

PyPDFium2 Loader

PyPDFium2 loader implementation.

Reference: https://python.langchain.com/docs/integrations/document_loaders/pypdfium2/

class automated_document_parser.loaders.pdf_load.pypdfium2_loader.PyPDFium2Loader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using pypdfium2 backend.

Fast and accurate PDF parsing using Google’s PDFium library. No API key required - works locally.

load() → List[Document]

Load PDF using pypdfium2.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community or pypdfium2 is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

PyMuPDF Loader

PyMuPDF loader implementation.

Reference: https://python.langchain.com/docs/integrations/document_loaders/pymupdf/

class automated_document_parser.loaders.pdf_load.pymupdf_loader.PyMuPDFLoader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using PyMuPDF (fitz) backend.

Fast PDF parsing with support for text, images, and metadata extraction. No API key required - works locally.

load() → List[Document]

Load PDF using PyMuPDF.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-community or pymupdf is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

PyMuPDF4LLM Loader

PyMuPDF4LLM loader implementation.

Reference: https://python.langchain.com/docs/integrations/document_loaders/pymupdf4llm/

class automated_document_parser.loaders.pdf_load.pymupdf4llm_loader.PyMuPDF4LLMLoader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using PyMuPDF4LLM backend.

Optimized for LLM processing with enhanced text extraction and formatting. No API key required - works locally.

load() → List[Document]

Load PDF using PyMuPDF4LLM.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-pymupdf4llm is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string

OpenDataLoader PDF Loader

OpenDataLoader PDF loader implementation.

Reference: https://python.langchain.com/docs/integrations/document_loaders/opendataloader/

class automated_document_parser.loaders.pdf_load.opendataloader_loader.OpenDataLoaderPDFLoader(file_path: str | Path, **kwargs)

Bases: BasePDFLoader

PDF loader using OpenDataLoader backend.

Advanced PDF parsing with support for multiple formats and configurations. No API key required - works locally.

load() → List[Document]

Load PDF using OpenDataLoader.

Returns:

List of LangChain Document objects

Raises:

ImportError – If langchain-opendataloader-pdf is not installed

static get_install_command() → str

Return the command to install required dependencies.

Returns:

Installation command string