Document Ingestion

This guide explains how to ingest documents into DataBridge using the Python SDK. DataBridge can ingest both raw text and files such as PDFs and Word documents.

Installation

First, install the DataBridge Python client:

pip install databridge-client

Basic Setup

Initialize the DataBridge client:

from databridge import DataBridge, AsyncDataBridge

# Synchronous client
db = DataBridge("your-uri")

# Asynchronous client
async_db = AsyncDataBridge("your-uri")

Text Ingestion

You can ingest text content directly:

# Synchronous ingestion
doc = db.ingest_text(
    content="Machine learning is fascinating...",
    metadata={
        "title": "ML Introduction",
        "category": "tech",
        "author": "John Doe"
    }
)

print(f"Document ID: {doc.external_id}")

# Asynchronous ingestion ("async with" and "await" must run inside a coroutine)
import asyncio

async def main():
    async with AsyncDataBridge("your-uri") as db:
        doc = await db.ingest_text(
            content="Machine learning is fascinating...",
            metadata={
                "title": "ML Introduction",
                "category": "tech",
                "author": "John Doe"
            }
        )
        print(f"Document ID: {doc.external_id}")

asyncio.run(main())

What Happens During Text Ingestion?

  1. The text content is processed and split into semantic chunks

  2. Each chunk is converted into an embedding vector by a language model

  3. The embeddings are stored in a vector database for efficient semantic search

  4. Document metadata and content are stored for retrieval
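
To make the four steps above concrete, here is a minimal, self-contained sketch of a chunk, embed, and store pipeline. It is not DataBridge's implementation: `split_into_chunks`, `toy_embed`, and `InMemoryStore` are hypothetical names, the "embedding" is a hashed bag of words standing in for a real language model, and a Python list stands in for the vector database.

```python
import hashlib
import math
import re


def split_into_chunks(text: str, max_chars: int = 80) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text: str, dims: int = 256) -> list[float]:
    """Stand-in embedding (assumption): hashed bag of words, unit-normalized."""
    vec = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryStore:
    """Hypothetical store: keeps documents plus one vector per chunk."""

    def __init__(self) -> None:
        self.documents: dict[str, dict] = {}
        self.rows: list[tuple[str, str, list[float]]] = []

    def ingest_text(self, doc_id: str, content: str, metadata: dict) -> None:
        # Step 4: keep the original content and metadata for retrieval.
        self.documents[doc_id] = {"content": content, "metadata": metadata}
        # Steps 1-3: chunk, embed, and store each chunk's vector.
        for chunk in split_into_chunks(content):
            self.rows.append((doc_id, chunk, toy_embed(chunk)))

    def search(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        # Vectors are unit length, so the dot product is the cosine similarity.
        q = toy_embed(query)
        ranked = sorted(
            self.rows,
            key=lambda row: sum(a * b for a, b in zip(q, row[2])),
            reverse=True,
        )
        return [(doc_id, chunk) for doc_id, chunk, _ in ranked[:k]]


store = InMemoryStore()
store.ingest_text(
    "doc-1",
    "Machine learning is fascinating. It powers search engines. "
    "Bananas are yellow fruit.",
    {"title": "ML Introduction"},
)
print(store.search("yellow bananas"))
```

Swapping `toy_embed` for a real embedding model and the list for a vector index changes the quality and speed of search, but not the overall shape of the pipeline.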

Document Ingestion

For files like PDFs, Word documents, or other supported formats:

Document Processing Pipeline

When you ingest a document:

  1. The file is uploaded and processed based on its content type

  2. For PDFs and other text-based documents:

    • Text is extracted while preserving structure

    • Content is split into meaningful chunks

    • Each chunk is embedded for semantic search

  3. Metadata and content are stored for retrieval

  4. Document chunks are indexed for efficient searching
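
The content-type dispatch in step 1 can be sketched with the standard library alone. This is an illustration under stated assumptions, not DataBridge's pipeline: `EXTRACTORS`, `extract_plain_text`, and `process_file` are hypothetical names, only plain text has an extractor here (PDF or Word extraction would need a third-party parser), and chunking is simplified to splitting on blank lines.

```python
import mimetypes


def extract_plain_text(data: bytes) -> str:
    """Decode raw bytes as UTF-8, replacing undecodable bytes."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry mapping a content type to its text extractor.
EXTRACTORS = {
    "text/plain": extract_plain_text,
}


def process_file(filename: str, data: bytes) -> dict:
    """Pick an extractor from the guessed content type, then chunk the text."""
    content_type, _ = mimetypes.guess_type(filename)
    extractor = EXTRACTORS.get(content_type)
    if extractor is None:
        raise ValueError(f"no extractor registered for {content_type!r}")
    text = extractor(data)
    # Simplified chunking (assumption): one chunk per blank-line-separated paragraph.
    chunks = [part.strip() for part in text.split("\n\n") if part.strip()]
    return {"filename": filename, "content_type": content_type, "chunks": chunks}


result = process_file("notes.txt", b"First paragraph.\n\nSecond paragraph.")
print(result["content_type"], len(result["chunks"]))
```

Keeping extractors in a registry means supporting a new format is a matter of adding one entry, while the chunking and storage steps stay unchanged.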

Verifying Ingestion

You can verify your ingested documents:

Document Model

When you ingest a document, the response includes several important fields:

Processing Rules

You can apply processing rules during ingestion to transform content or extract metadata:

For detailed information about rules, see the Rules Guide.
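
Conceptually, a processing rule is a step that takes the content and metadata and returns updated versions of both. The sketch below shows that idea in plain Python only; it does not use DataBridge's rule syntax (see the Rules Guide for that), and `Draft`, `redact_emails`, `extract_title`, and `apply_rules` are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Draft:
    """Content and metadata as they move through the rules."""
    content: str
    metadata: dict = field(default_factory=dict)


Rule = Callable[[Draft], Draft]


def redact_emails(draft: Draft) -> Draft:
    """Content transformation: replace email addresses with a placeholder."""
    cleaned = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[redacted]", draft.content)
    return Draft(cleaned, dict(draft.metadata))


def extract_title(draft: Draft) -> Draft:
    """Metadata extraction: use the first non-empty line as the title."""
    stripped = draft.content.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return Draft(draft.content, {**draft.metadata, "title": first_line})


def apply_rules(draft: Draft, rules: list[Rule]) -> Draft:
    """Run each rule in order, feeding the output of one into the next."""
    for rule in rules:
        draft = rule(draft)
    return draft


draft = Draft("Quarterly Report\nContact: jane@example.com for details.")
processed = apply_rules(draft, [redact_emails, extract_title])
print(processed.content)
print(processed.metadata)
```

Because each rule returns a new `Draft`, rules can be reordered or tested in isolation without side effects on the original input.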

Next Steps

After ingesting documents, you can:
