automated_document_parser

Automated Document Parser - Intelligent document loading for LangChain.

class automated_document_parser.DocumentParser[source]

Main class for automated document parsing.

Automatically detects file type and loads documents using appropriate loaders. Designed for seamless integration with LangChain RAG pipelines.

__init__()[source]: Initialize the DocumentParser.

get_loaded_files() → List[str][source]

Get list of successfully loaded files.

Returns: List of file paths that were successfully loaded
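
One use for this list is to check which of the requested files did not load. The sketch below is a hypothetical helper, not part of the package. It assumes get_loaded_files() reports paths in the same form in which they were passed to the parser, which this page does not confirm.

```python
from pathlib import Path
from typing import Iterable, List, Union

PathLike = Union[str, Path]


def missing_files(requested: Iterable[PathLike], loaded: Iterable[str]) -> List[str]:
    """Return the requested paths that do not appear in `loaded`.

    Hypothetical helper. Paths are compared as normalized strings, on the
    assumption that the parser records them the way they were passed in.
    """
    loaded_set = {str(Path(p)) for p in loaded}
    return [str(Path(p)) for p in requested if str(Path(p)) not in loaded_set]
```

After parsing, it could be called as missing_files(paths, parser.get_loaded_files()), where paths is the list of files that was requested.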

parse(file_path: str | Path, pdf_loader_method: str = 'pypdf', **kwargs) → List[Document][source]

Parse a document from file path.

Parameters:

file_path – Path to the document file
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for the loader (e.g., encoding, api_key)

Returns:

List of LangChain Document objects

Raises:

FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported
RuntimeError – If parsing fails

Example

>>> parser = DocumentParser()
>>> # Basic usage with auto-detection
>>> docs = parser.parse("document.pdf")
>>> # Specify PDF loading method
>>> docs = parser.parse("document.pdf", pdf_loader_method="pdfplumber")
>>> # Pass additional parameters
>>> docs = parser.parse("math.pdf", pdf_loader_method="mathpix",
...                     mathpix_app_id="id", mathpix_app_key="key")
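
The three exceptions listed under Raises can be handled separately when one bad file should not stop a pipeline. The following is a minimal sketch under stated assumptions: parse_or_none is a hypothetical helper, and `parser` is assumed to be a DocumentParser (any object with a compatible parse method works).

```python
from pathlib import Path
from typing import Any, List, Optional, Union


def parse_or_none(
    parser: Any, file_path: Union[str, Path], **kwargs: Any
) -> Optional[List[Any]]:
    """Return the parsed documents, or None if a documented error is raised.

    Hypothetical helper; `parser` is assumed to expose
    parse(file_path, **kwargs) as described above.
    """
    try:
        return parser.parse(file_path, **kwargs)
    except FileNotFoundError:
        print(f"skipped {file_path}: file does not exist")
    except ValueError as exc:
        print(f"skipped {file_path}: unsupported file type ({exc})")
    except RuntimeError as exc:
        print(f"skipped {file_path}: parsing failed ({exc})")
    return None
```

Returning None rather than an empty list keeps "failed to parse" distinguishable from "parsed, but produced no documents".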

parse_multiple(file_paths: List[str | Path], pdf_loader_method: str = 'pypdf', **kwargs) → dict[str, List[Document]][source]

Parse multiple documents with automatic file type detection.

Parameters:

file_paths – List of file paths
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for loaders (e.g., encoding, api_key)

Returns:

Dictionary mapping file paths to their loaded documents

Example

>>> parser = DocumentParser()
>>> # Auto-detect all file types with default settings
>>> results = parser.parse_multiple(["doc1.pdf", "doc2.txt", "data.csv"])
>>> # Specify PDF method for all PDFs
>>> results = parser.parse_multiple(
...     ["doc1.pdf", "doc2.pdf", "data.csv"],
...     pdf_loader_method="pdfplumber"
... )
>>> for file, docs in results.items():
...     print(f"{file}: {len(docs)} documents")
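
Because parse_multiple returns a dictionary keyed by file path, downstream steps that expect one flat list of documents (for example a text splitter or a vector store) need the per-file lists concatenated. A minimal sketch of that step, using a hypothetical helper name:

```python
from typing import Any, Dict, List


def flatten_results(results: Dict[str, List[Any]]) -> List[Any]:
    """Concatenate the per-file document lists into one list.

    Hypothetical helper. Documents keep the order of the dictionary,
    and files that produced no documents contribute nothing.
    """
    all_docs: List[Any] = []
    for docs in results.values():
        all_docs.extend(docs)
    return all_docs
```

With the example above, flatten_results(results) would give a single list covering every parsed file.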

class automated_document_parser.FileLoader(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)[source]

Automated file loader that detects file type and loads documents.

__init__(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)[source]

Initialize the FileLoader.

Parameters:

file_path – Path to the file to load
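pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’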
**pdf_loader_kwargs – Additional keyword arguments for PDF loader (e.g., client, api_key)

Raises:

FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported
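
The signature restricts pdf_loader_method to nine literal values, but this page does not say what happens at runtime when another string is passed. A local check such as the sketch below can catch typos before a loader is constructed. PDF_LOADER_METHODS and check_pdf_loader_method are hypothetical names, not part of the package.

```python
from typing import Tuple

# Values copied from the Literal[...] annotation in the signature above.
PDF_LOADER_METHODS: Tuple[str, ...] = (
    "pypdf",
    "unstructured",
    "amazon_textract",
    "mathpix",
    "pdfplumber",
    "pypdfium2",
    "pymupdf",
    "pymupdf4llm",
    "opendataloader",
)


def check_pdf_loader_method(name: str) -> str:
    """Return `name` unchanged if it is a documented method, else raise ValueError."""
    if name not in PDF_LOADER_METHODS:
        options = ", ".join(PDF_LOADER_METHODS)
        raise ValueError(f"unknown pdf_loader_method {name!r}; expected one of: {options}")
    return name
```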

load() → List[Document][source]

Load documents from the file.

Returns: List of LangChain Document objects
Raises: RuntimeError – If loading fails
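
Since load() raises RuntimeError on failure, one option is to retry the same file with a different PDF method. The sketch below is a hypothetical helper built on assumptions. The loader class is passed in as `loader_factory` (in practice this would be FileLoader), and the default method order is only an illustration.

```python
from typing import Any, Callable, List, Optional, Sequence


def load_with_fallback(
    loader_factory: Callable[..., Any],
    file_path: str,
    methods: Sequence[str] = ("pypdf", "pdfplumber"),
) -> List[Any]:
    """Try each PDF method in order and return the first successful result.

    Hypothetical helper. `loader_factory` is assumed to accept
    (file_path, pdf_loader_method=...) and return an object with load().
    """
    last_error: Optional[RuntimeError] = None
    for method in methods:
        try:
            return loader_factory(file_path, pdf_loader_method=method).load()
        except RuntimeError as exc:
            # Remember the failure and move on to the next method.
            last_error = exc
    raise RuntimeError(f"all PDF methods failed for {file_path}") from last_error
```

FileNotFoundError and ValueError are deliberately not caught here, since a missing or unsupported file will not be fixed by switching PDF methods.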

automated_document_parser.load_document(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs) → List[Document][source]

Convenience function to load a document from a file.

Parameters:

file_path – Path to the file
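pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’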
**pdf_loader_kwargs – Additional keyword arguments for PDF loader

Returns:

List of LangChain Document objects

Examples

>>> # Load a text file
>>> documents = load_document("path/to/file.txt")

>>> # Load a PDF with default PyPDF
>>> documents = load_document("path/to/file.pdf")

>>> # Load a PDF with Unstructured
>>> documents = load_document("path/to/file.pdf", pdf_loader_method="unstructured")

>>> # Load a PDF with Amazon Textract
>>> import boto3
>>> client = boto3.client("textract", region_name="us-east-2")
>>> documents = load_document(
...     "s3://bucket/file.pdf",
...     pdf_loader_method="amazon_textract",
...     client=client
... )