Document Ingestion

This guide explains how to ingest documents into DataBridge using the Python SDK. DataBridge supports ingesting both raw text content and files such as PDFs and Word documents.

Installation

First, install the DataBridge Python client:

pip install databridge-client

Basic Setup

Initialize the DataBridge client:

from databridge import DataBridge, AsyncDataBridge

# Synchronous client
db = DataBridge("your-uri")

# Asynchronous client
async_db = AsyncDataBridge("your-uri")
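
Both clients accept the same connection URI. If you prefer not to hard-code it, here is a minimal sketch that reads it from the environment (the variable name DATABRIDGE_URI is an assumption of this guide, not something the SDK defines):

import os

from databridge import DataBridge

# DATABRIDGE_URI is an illustrative variable name; use whatever your
# deployment provides.
db = DataBridge(os.environ["DATABRIDGE_URI"])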

Text Ingestion

You can ingest text content directly:

# Synchronous ingestion
doc = db.ingest_text(
    content="Machine learning is fascinating...",
    metadata={
        "title": "ML Introduction",
        "category": "tech",
        "author": "John Doe"
    }
)

print(f"Document ID: {doc.external_id}")

# Asynchronous ingestion
async with AsyncDataBridge("your-uri") as db:
    doc = await db.ingest_text(
        content="Machine learning is fascinating...",
        metadata={
            "title": "ML Introduction",
            "category": "tech",
            "author": "John Doe"
        }
    )
    print(f"Document ID: {doc.external_id}")

What Happens During Text Ingestion?

  1. The text content is processed and split into semantic chunks
  2. Each chunk is embedded using state-of-the-art language models
  3. The embeddings are stored in a vector database for efficient semantic search
  4. Document metadata and content are stored for retrieval
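
These steps run inside DataBridge; the SDK never asks you to perform them yourself. Purely to make the shape of the pipeline concrete, here is a conceptual sketch in which every function is a stand-in written for this guide, not part of the SDK:

def split_into_semantic_chunks(content: str) -> list[str]:
    # Stand-in: the real service splits on semantic boundaries,
    # not fixed-width windows.
    return [content[i:i + 500] for i in range(0, len(content), 500)]

def embed(chunk: str) -> list[float]:
    # Stand-in: the real service embeds chunks with a language model.
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 1000)]

def ingest_pipeline(content: str, metadata: dict) -> dict:
    chunks = split_into_semantic_chunks(content)  # 1. chunk the text
    vectors = [embed(c) for c in chunks]          # 2. embed each chunk
    # 3. vectors would be written to a vector database here
    # 4. content and metadata would be stored for retrieval
    return {"metadata": metadata, "chunk_count": len(chunks), "vectors": vectors}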

File Ingestion

For files such as PDFs, Word documents, and other supported formats:

# Synchronous file ingestion
doc = db.ingest_file(
    file="research_paper.pdf",  # Can be path string, bytes, file object, or Path
    filename="research_paper.pdf",
    content_type="application/pdf",  # Optional, will be guessed if not provided
    metadata={
        "title": "Research Paper",
        "department": "research",
        "year": 2024
    }
)

print(f"Document ID: {doc.external_id}")
print(f"Storage location: {doc.storage_info['bucket']}/{doc.storage_info['key']}")

# Asynchronous file ingestion
async with AsyncDataBridge("your-uri") as db:
    # From file path
    doc = await db.ingest_file(
        "document.pdf",
        filename="document.pdf",
        content_type="application/pdf",
        metadata={"department": "research"}
    )

    # From file object
    with open("document.pdf", "rb") as f:
        doc = await db.ingest_file(f, "document.pdf")
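
As noted in the first example, the file argument also accepts raw bytes or a pathlib.Path. A sketch of those two variants with the synchronous client; when passing bytes there is no file name to guess from, so filename and content_type are given explicitly:

from pathlib import Path

# From a pathlib.Path
doc = db.ingest_file(
    Path("document.pdf"),
    filename="document.pdf",
)

# From raw bytes
data = Path("document.pdf").read_bytes()
doc = db.ingest_file(
    data,
    filename="document.pdf",
    content_type="application/pdf",
)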

Document Processing Pipeline

When you ingest a document:

  1. The file is uploaded and processed based on its content type
  2. For PDFs and other text-based documents:
    • Text is extracted while preserving structure
    • Content is split into meaningful chunks
    • Each chunk is embedded for semantic search
  3. Metadata and content are stored for retrieval
  4. Document chunks are indexed for efficient searching
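
If processing completes asynchronously on your deployment, the processing_status field in system_metadata (described under Document Model below) offers a way to wait for a document to finish. A sketch with the synchronous client; the dict-style access and the specific status strings are assumptions, not documented values:

import time

doc = db.ingest_file("document.pdf", filename="document.pdf")

# Poll until processing leaves its in-flight states. The strings
# "processing" and "pending" are illustrative assumptions; check your
# deployment for the actual values.
status = None
for _ in range(30):
    doc = db.get_document(doc.external_id)
    status = doc.system_metadata.get("processing_status")
    if status not in (None, "processing", "pending"):
        break
    time.sleep(2)

print(f"Processing status: {status}")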

Verifying Ingestion

You can verify your ingested documents:

# Synchronous verification
docs = db.list_documents(limit=10)
for doc in docs:
    print(f"ID: {doc.external_id}")
    print(f"Type: {doc.content_type}")
    print(f"Metadata: {doc.metadata}")
    print("---")

doc = db.get_document("doc_123")
print(f"Title: {doc.metadata.get('title')}")

# Asynchronous verification
async with AsyncDataBridge("your-uri") as db:
    docs = await db.list_documents(limit=10)
    for doc in docs:
        print(f"ID: {doc.external_id}")
        print(f"Type: {doc.content_type}")
        print(f"Metadata: {doc.metadata}")
        print("---")

    doc = await db.get_document("doc_123")
    print(f"Title: {doc.metadata.get('title')}")

Document Model

When you ingest a document, the response includes several important fields:

doc = db.ingest_file("document.pdf", "document.pdf")

# Document identifier
print(f"ID: {doc.external_id}")

# Storage information
print(f"Storage Info: {doc.storage_info}")
# Contains:
# - storage_type: Where the document is stored (e.g., "s3", "local")
# - bucket: Storage bucket name
# - key: Object key (path) within storage
# - size: Document size in bytes

# System metadata
print(f"System Metadata: {doc.system_metadata}")
# Contains:
# - created_at: Document creation timestamp
# - updated_at: Last modification timestamp
# - chunk_count: Number of chunks generated
# - embedding_model: Model used for embeddings
# - processing_status: Current status

# Access control
print(f"Access Control: {doc.access_control}")
# Contains:
# - readers: List of entities that can read the document
# - writers: List of entities that can modify the document
# - admins: List of entities that can manage the document
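
A small sketch that pulls these fields together into a one-line ingestion summary (the dict-style access to system_metadata and access_control mirrors the comments above but is an assumption of this guide):

def summarize(doc) -> str:
    # Compose a one-line summary from the fields documented above.
    chunks = doc.system_metadata.get("chunk_count")
    status = doc.system_metadata.get("processing_status")
    readers = len(doc.access_control.get("readers", []))
    return f"{doc.external_id}: {chunks} chunks, status={status}, {readers} readers"

print(summarize(doc))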

Next Steps

After ingesting documents, you can:

  • Verify and inspect them with list_documents and get_document, as shown above
  • Search them semantically via the embedded chunks
  • Retrieve their stored content and metadata
