Document Ingestion

This guide explains how to ingest documents into DataBridge using the Python SDK. DataBridge supports ingesting both raw text content and files such as PDFs and Word documents.

Installation

First, install the DataBridge Python client:

pip install databridge-client

Basic Setup

Initialize the DataBridge client:

from databridge import DataBridge, AsyncDataBridge

# Synchronous client
db = DataBridge("your-uri")

# Asynchronous client
async_db = AsyncDataBridge("your-uri")
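
Both clients accept the same connection URI. If you prefer not to hard-code it, here is a minimal sketch that reads it from the environment (the variable name DATABRIDGE_URI is an assumption of this guide, not something the SDK defines):

import os

from databridge import DataBridge

# DATABRIDGE_URI is an illustrative variable name; use whatever your
# deployment provides.
db = DataBridge(os.environ["DATABRIDGE_URI"])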

Text Ingestion

You can ingest text content directly:

# Synchronous ingestion
doc = db.ingest_text(
    content="Machine learning is fascinating...",
    metadata={
        "title": "ML Introduction",
        "category": "tech",
        "author": "John Doe"
    }
)

print(f"Document ID: {doc.external_id}")

# Asynchronous ingestion
async with AsyncDataBridge("your-uri") as db:
    doc = await db.ingest_text(
        content="Machine learning is fascinating...",
        metadata={
            "title": "ML Introduction",
            "category": "tech",
            "author": "John Doe"
        }
    )
    print(f"Document ID: {doc.external_id}")

What Happens During Text Ingestion?

  1. The text content is processed and split into semantic chunks
  2. Each chunk is embedded using state-of-the-art language models
  3. The embeddings are stored in a vector database for efficient semantic search
  4. Document metadata and content are stored for retrieval
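
These steps run inside DataBridge; the SDK never asks you to perform them yourself. Purely to make the shape of the pipeline concrete, here is a conceptual sketch in which every function is a stand-in written for this guide, not part of the SDK:

def split_into_semantic_chunks(content: str) -> list[str]:
    # Stand-in: the real service splits on semantic boundaries,
    # not fixed-width windows.
    return [content[i:i + 500] for i in range(0, len(content), 500)]

def embed(chunk: str) -> list[float]:
    # Stand-in: the real service embeds chunks with a language model.
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 1000)]

def ingest_pipeline(content: str, metadata: dict) -> dict:
    chunks = split_into_semantic_chunks(content)  # 1. chunk the text
    vectors = [embed(c) for c in chunks]          # 2. embed each chunk
    # 3. vectors would be written to a vector database here
    # 4. content and metadata would be stored for retrieval
    return {"metadata": metadata, "chunk_count": len(chunks), "vectors": vectors}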

File Ingestion

For files such as PDFs, Word documents, and other supported formats:

# Synchronous file ingestion
doc = db.ingest_file(
    file="research_paper.pdf",  # Can be path string, bytes, file object, or Path
    filename="research_paper.pdf",
    content_type="application/pdf",  # Optional, will be guessed if not provided
    metadata={
        "title": "Research Paper",
        "department": "research",
        "year": 2024
    }
)

print(f"Document ID: {doc.external_id}")
print(f"Storage location: {doc.storage_info['bucket']}/{doc.storage_info['key']}")

# Asynchronous file ingestion
async with AsyncDataBridge("your-uri") as db:
    # From file path
    doc = await db.ingest_file(
        "document.pdf",
        filename="document.pdf",
        content_type="application/pdf",
        metadata={"department": "research"}
    )

    # From file object
    with open("document.pdf", "rb") as f:
        doc = await db.ingest_file(f, "document.pdf")
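
As noted in the first example, the file argument also accepts raw bytes or a pathlib.Path. A sketch of those two variants with the synchronous client; when passing bytes there is no file name to guess from, so filename and content_type are given explicitly:

from pathlib import Path

# From a pathlib.Path
doc = db.ingest_file(
    Path("document.pdf"),
    filename="document.pdf",
)

# From raw bytes
data = Path("document.pdf").read_bytes()
doc = db.ingest_file(
    data,
    filename="document.pdf",
    content_type="application/pdf",
)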

Document Processing Pipeline

When you ingest a document:

  1. The file is uploaded and processed based on its content type
  2. For PDFs and other text-based documents:
    • Text is extracted while preserving structure
    • Content is split into meaningful chunks
    • Each chunk is embedded for semantic search
  3. Metadata and content are stored for retrieval
  4. Document chunks are indexed for efficient searching
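
If processing completes asynchronously on your deployment, the processing_status field in system_metadata (described under Document Model below) offers a way to wait for a document to finish. A sketch with the synchronous client; the dict-style access and the specific status strings are assumptions, not documented values:

import time

doc = db.ingest_file("document.pdf", filename="document.pdf")

# Poll until processing leaves its in-flight states. The strings
# "processing" and "pending" are illustrative assumptions; check your
# deployment for the actual values.
status = None
for _ in range(30):
    doc = db.get_document(doc.external_id)
    status = doc.system_metadata.get("processing_status")
    if status not in (None, "processing", "pending"):
        break
    time.sleep(2)

print(f"Processing status: {status}")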

Verifying Ingestion

You can verify your ingested documents:

# Synchronous verification
docs = db.list_documents(limit=10)
for doc in docs:
    print(f"ID: {doc.external_id}")
    print(f"Type: {doc.content_type}")
    print(f"Metadata: {doc.metadata}")
    print("---")

doc = db.get_document("doc_123")
print(f"Title: {doc.metadata.get('title')}")

# Asynchronous verification
async with AsyncDataBridge("your-uri") as db:
    docs = await db.list_documents(limit=10)
    for doc in docs:
        print(f"ID: {doc.external_id}")
        print(f"Type: {doc.content_type}")
        print(f"Metadata: {doc.metadata}")
        print("---")

    doc = await db.get_document("doc_123")
    print(f"Title: {doc.metadata.get('title')}")

Document Model

When you ingest a document, the response includes several important fields:

doc = db.ingest_file("document.pdf", "document.pdf")

# Document identifier
print(f"ID: {doc.external_id}")

# Storage information
print(f"Storage Info: {doc.storage_info}")
# Contains:
# - storage_type: Where the document is stored (e.g., "s3", "local")
# - bucket: Storage bucket name
# - key: Object key (path) within storage
# - size: Document size in bytes

# System metadata
print(f"System Metadata: {doc.system_metadata}")
# Contains:
# - created_at: Document creation timestamp
# - updated_at: Last modification timestamp
# - chunk_count: Number of chunks generated
# - embedding_model: Model used for embeddings
# - processing_status: Current status

# Access control
print(f"Access Control: {doc.access_control}")
# Contains:
# - readers: List of entities that can read the document
# - writers: List of entities that can modify the document
# - admins: List of entities that can manage the document
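
A small sketch that pulls these fields together into a one-line ingestion summary (the dict-style access to system_metadata and access_control mirrors the comments above but is an assumption of this guide):

def summarize(doc) -> str:
    # Compose a one-line summary from the fields documented above.
    chunks = doc.system_metadata.get("chunk_count")
    status = doc.system_metadata.get("processing_status")
    readers = len(doc.access_control.get("readers", []))
    return f"{doc.external_id}: {chunks} chunks, status={status}, {readers} readers"

print(summarize(doc))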

Next Steps

After ingesting documents, you can:

  • Verify and inspect them with list_documents and get_document, as shown above
  • Search them semantically via the embedded chunks
  • Retrieve their stored content and metadata
