Automated Document Parser Documentation ======================================== Welcome to the Automated Document Parser documentation! **Automated Document Parser** is a powerful and intelligent document processing library built on top of LangChain. It provides automatic file type detection and loading for various document formats, making it easy to build RAG (Retrieval-Augmented Generation) applications. Features -------- * **Automatic File Type Detection** - Intelligently detects and processes 10+ file formats * **Multiple PDF Loaders** - 9 different PDF loading methods for various use cases * **Modular Architecture** - Clean separation with ``file_load/`` and ``pdf_load/`` modules * **Easy Integration** - Simple API that works seamlessly with LangChain * **Structured Data Support** - Handle CSV, JSON, and other structured formats * **Extensible Design** - Easy to add custom loaders Supported File Types --------------------- **Text Files:** - ``.txt`` - Plain text files - ``.md`` - Markdown files **Structured Data:** - ``.csv`` - CSV files with encoding support - ``.json`` - JSON files with jq schema filtering **Documents:** - ``.docx`` - Microsoft Word documents - ``.html`` - HTML files **PDF Files (9 methods):** - ``pypdf`` - Basic PDF text extraction (default) - ``unstructured`` - Advanced OCR and layout detection - ``amazon_textract`` - AWS Textract for high-accuracy OCR - ``mathpix`` - Specialized for mathematical formulas - ``pdfplumber`` - High accuracy text and table extraction - ``pypdfium2`` - Google PDFium library - ``pymupdf`` - PyMuPDF (fitz) backend - ``pymupdf4llm`` - LLM-optimized extraction - ``opendataloader`` - Advanced multi-format parsing Quick Start ----------- Installation ~~~~~~~~~~~~ .. code-block:: bash pip install automated-document-parser Basic Usage ~~~~~~~~~~~ **Step 1: Automatic File Type Detection** The parser automatically detects file types and loads them with sensible defaults: .. code-block:: python from automated_document_parser import DocumentParser # Initialize the parser parser = DocumentParser() # Parse a single document (auto-detects file type) documents = parser.parse("path/to/document.pdf") # Parse multiple documents (auto-detects each file type) files = ["doc1.txt", "data.csv", "report.pdf"] results = parser.parse_multiple(files) **Step 2: Specify Loading Methods** You can specify loading methods and parameters that apply to all appropriate files: .. code-block:: python # Specify PDF loading method for all PDFs files = ["doc1.txt", "data.csv", "report.pdf", "paper.pdf"] results = parser.parse_multiple( files, pdf_loader_method="pdfplumber" # Applies to all PDF files ) # Specify encoding for text files results = parser.parse_multiple( files, encoding="utf-8" # Applies to all text files ) # Combine multiple parameters results = parser.parse_multiple( files, pdf_loader_method="pymupdf", encoding="utf-8" ) Single File with Parameters ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Parse single PDF with specific method documents = parser.parse( "document.pdf", pdf_loader_method="pdfplumber" ) # Parse text file with encoding documents = parser.parse( "file.txt", encoding="utf-8" ) Advanced PDF Loading ~~~~~~~~~~~~~~~~~~~~ For more control, use PDFLoader directly: .. code-block:: python from automated_document_parser.loaders import PDFLoader # Use specific PDF loading method loader = PDFLoader("document.pdf", method="pdfplumber") documents = loader.load() # Use Mathpix for mathematical content loader = PDFLoader( "math_paper.pdf", method="mathpix", mathpix_app_id="your_id", mathpix_app_key="your_key" ) documents = loader.load() Direct Loader Usage ~~~~~~~~~~~~~~~~~~~ .. code-block:: python from automated_document_parser.loaders.file_load import ( TextFileLoader, CSVFileLoader, JSONFileLoader ) # Load text file with specific encoding loader = TextFileLoader("file.txt", encoding="utf-8") docs = loader.load() # Load CSV file csv_loader = CSVFileLoader("data.csv") csv_docs = csv_loader.load() Contents -------- .. toctree:: :maxdepth: 2 :caption: API Reference api modules .. toctree:: :maxdepth: 2 :caption: Additional Resources GitHub Repository Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search`