automated_document_parser
Automated Document Parser - Intelligent document loading for LangChain.
- class automated_document_parser.DocumentParser
Main class for automated document parsing.
Automatically detects file type and loads documents using appropriate loaders. Designed for seamless integration with LangChain RAG pipelines.
- get_loaded_files() -> List[str]
Get list of successfully loaded files.
- Returns:
List of file paths that were successfully loaded
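A minimal sketch of how the returned list might be used to find inputs that did not load. The helper `not_loaded` is hypothetical and not part of the package, and the comparison assumes the returned paths have the same form (relative or absolute) as the paths that were passed in, which the reference does not state.

```python
from pathlib import Path
from typing import Iterable, List, Union


def not_loaded(
    requested: Iterable[Union[str, Path]],
    loaded: Iterable[str],
) -> List[str]:
    """Return the requested paths that are absent from the loaded list.

    Hypothetical helper: paths are compared as normalized strings, so this
    only works if both sides use the same relative/absolute form.
    """
    loaded_set = {str(Path(p)) for p in loaded}
    return [str(Path(p)) for p in requested if str(Path(p)) not in loaded_set]
```

With a parser instance this would be called as `not_loaded(paths, parser.get_loaded_files())`.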
- parse(file_path: str | Path, pdf_loader_method: str = 'pypdf', **kwargs) -> List[Document]
Parse a document from file path.
- Parameters:
file_path – Path to the document file
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for the loader (e.g., encoding, api_key)
- Returns:
List of LangChain Document objects
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported
RuntimeError – If parsing fails
Example
>>> parser = DocumentParser()
>>> # Basic usage with auto-detection
>>> docs = parser.parse("document.pdf")
>>> # Specify PDF loading method
>>> docs = parser.parse("document.pdf", pdf_loader_method="pdfplumber")
>>> # Pass additional parameters
>>> docs = parser.parse("math.pdf", pdf_loader_method="mathpix",
...                     mathpix_app_id="id", mathpix_app_key="key")
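The three exceptions listed under Raises can be handled separately by the caller. The sketch below shows one way to do that. `safe_parse` is a hypothetical wrapper, not part of the package, and it accepts any callable with the same shape as `parse` so that it can be tried without the library installed.

```python
from pathlib import Path
from typing import Any, Callable, List, Optional, Union


def safe_parse(
    parse_fn: Callable[..., List[Any]],
    file_path: Union[str, Path],
    **kwargs: Any,
) -> Optional[List[Any]]:
    """Call parse_fn(file_path, **kwargs); return None on a documented error.

    Hypothetical helper: it only catches the exception types the reference
    lists for parse() and lets anything else propagate.
    """
    try:
        return parse_fn(file_path, **kwargs)
    except FileNotFoundError:
        print(f"missing file: {file_path}")
    except ValueError as exc:
        # Documented as "file type is unsupported".
        print(f"rejected: {file_path} ({exc})")
    except RuntimeError as exc:
        # Documented as "parsing fails".
        print(f"parsing failed: {file_path} ({exc})")
    return None
```

With a parser instance, a call might look like `safe_parse(parser.parse, "document.pdf", pdf_loader_method="pdfplumber")`.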
- parse_multiple(file_paths: List[str | Path], pdf_loader_method: str = 'pypdf', **kwargs) -> dict[str, List[Document]]
Parse multiple documents with automatic file type detection.
- Parameters:
file_paths – List of file paths
pdf_loader_method – Method to use for PDF files (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’
**kwargs – Additional keyword arguments for loaders (e.g., encoding, api_key)
- Returns:
Dictionary mapping file paths to their loaded documents
Example
>>> parser = DocumentParser()
>>> # Auto-detect all file types with default settings
>>> results = parser.parse_multiple(["doc1.pdf", "doc2.txt", "data.csv"])
>>> # Specify PDF method for all PDFs
>>> results = parser.parse_multiple(
...     ["doc1.pdf", "doc2.pdf", "data.csv"],
...     pdf_loader_method="pdfplumber"
... )
>>> for file, docs in results.items():
...     print(f"{file}: {len(docs)} documents")
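Because the result maps each file path to its own list, downstream steps that expect one flat sequence need a small reshaping step. The following is a sketch of that step. `flatten_results` and `empty_files` are hypothetical helpers, and they treat the documents as opaque values, so they work on any dictionary of lists.

```python
from typing import Any, Dict, List, Tuple


def flatten_results(results: Dict[str, List[Any]]) -> List[Tuple[str, Any]]:
    """Turn {path: [doc, ...]} into [(path, doc), ...], keeping dict order."""
    return [(path, doc) for path, docs in results.items() for doc in docs]


def empty_files(results: Dict[str, List[Any]]) -> List[str]:
    """Return the paths whose document list is empty."""
    return [path for path, docs in results.items() if not docs]
```

Keeping the path next to each document preserves the link back to the source file after flattening.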
- class automated_document_parser.FileLoader(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)
Automated file loader that detects file type and loads documents.
- __init__(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs)
Initialize the FileLoader.
- Parameters:
file_path – Path to the file to load
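pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’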
**pdf_loader_kwargs – Additional keyword arguments for PDF loader (e.g., client, api_key)
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file type is unsupported
- load() -> List[Document]
Load documents from the file.
- Returns:
List of LangChain Document objects
- Raises:
RuntimeError – If loading fails
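The `pdf_loader_method` parameter is annotated as a `Literal`, which static type checkers enforce but Python itself does not check at call time. Whether `FileLoader` validates the value at runtime is not stated here, so the sketch below shows how a caller could check it up front. `PdfLoaderMethod` and `check_pdf_loader_method` are hypothetical names, and the option list is copied from the signature above.

```python
from typing import Literal, get_args

# Hypothetical alias mirroring the Literal in the FileLoader signature.
PdfLoaderMethod = Literal[
    "pypdf", "unstructured", "amazon_textract", "mathpix", "pdfplumber",
    "pypdfium2", "pymupdf", "pymupdf4llm", "opendataloader",
]


def check_pdf_loader_method(name: str) -> str:
    """Return name unchanged if it is a listed option, else raise ValueError."""
    allowed = get_args(PdfLoaderMethod)  # tuple of the literal strings
    if name not in allowed:
        raise ValueError(
            f"unknown pdf_loader_method {name!r}; expected one of {allowed}"
        )
    return name
```

Checking the name before constructing a loader gives an early, readable error for a misspelled method.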
- automated_document_parser.load_document(file_path: str | Path, pdf_loader_method: Literal['pypdf', 'unstructured', 'amazon_textract', 'mathpix', 'pdfplumber', 'pypdfium2', 'pymupdf', 'pymupdf4llm', 'opendataloader'] = 'pypdf', **pdf_loader_kwargs) -> List[Document]
Convenience function to load a document from a file.
- Parameters:
file_path – Path to the file
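pdf_loader_method – Method to use for PDF loading (default: ‘pypdf’). Options: ‘pypdf’, ‘unstructured’, ‘amazon_textract’, ‘mathpix’, ‘pdfplumber’, ‘pypdfium2’, ‘pymupdf’, ‘pymupdf4llm’, ‘opendataloader’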
**pdf_loader_kwargs – Additional keyword arguments for PDF loader
- Returns:
List of LangChain Document objects
Examples
>>> # Load a text file
>>> documents = load_document("path/to/file.txt")

>>> # Load a PDF with default PyPDF
>>> documents = load_document("path/to/file.pdf")

>>> # Load a PDF with Unstructured
>>> documents = load_document("path/to/file.pdf", pdf_loader_method="unstructured")

>>> # Load a PDF with Amazon Textract
>>> import boto3
>>> client = boto3.client("textract", region_name="us-east-2")
>>> documents = load_document(
...     "s3://bucket/file.pdf",
...     pdf_loader_method="amazon_textract",
...     client=client
... )