This guide explains how to ingest documents into DataBridge using the Python SDK. DataBridge supports ingesting both text content and files (like PDFs, Word documents, etc.).
Installation
First, install the DataBridge Python client:
pip install databridge-client
Basic Setup
Initialize the DataBridge client:
from databridge import DataBridge, AsyncDataBridge
# Synchronous client
db = DataBridge("your-uri")
# Asynchronous client
async_db = AsyncDataBridge("your-uri")
DataBridge can ingest raw text directly as well as files. When you ingest text content:
1. The text content is processed and split into semantic chunks
2. Each chunk is embedded using state-of-the-art language models
3. The embeddings are stored in a vector database for efficient semantic search
4. Document metadata and content are stored for retrieval
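Text is ingested with ingest_text, the same method used with rules later in this guide. A minimal sketch is shown below; the metadata argument and the optionality of rules are assumptions here, mirroring ingest_file:
# Minimal text ingestion sketch
# (metadata support and optional rules are assumptions, mirroring ingest_file)
doc = db.ingest_text(
    content="Machine learning is transforming how we process documents...",
    metadata={"title": "ML Overview", "category": "technical"}
)
print(f"Document ID: {doc.external_id}")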
Document Ingestion
For files such as PDFs, Word documents, and other supported formats:
# Synchronous file ingestion
doc = db.ingest_file(
    file="research_paper.pdf",  # Can be a path string, bytes, a file object, or a Path
    filename="research_paper.pdf",
    content_type="application/pdf",  # Optional; guessed if not provided
    metadata={
        "title": "Research Paper",
        "department": "research",
        "year": 2024
    }
)

print(f"Document ID: {doc.external_id}")
print(f"Storage location: {doc.storage_info['bucket']}/{doc.storage_info['key']}")
# Asynchronous file ingestion
async with AsyncDataBridge("your-uri") as db:
    # From a file path
    doc = await db.ingest_file(
        "document.pdf",
        filename="document.pdf",
        content_type="application/pdf",
        metadata={"department": "research"}
    )

    # From a file object
    with open("document.pdf", "rb") as f:
        doc = await db.ingest_file(f, "document.pdf")
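Because the file argument also accepts raw bytes or a pathlib.Path (see the comment in the synchronous example), the same call can be sketched for those inputs; the keyword arguments are assumed to behave identically:
from pathlib import Path

# From raw bytes already in memory (content_type passed explicitly for clarity)
pdf_bytes = Path("document.pdf").read_bytes()
doc = db.ingest_file(
    file=pdf_bytes,
    filename="document.pdf",
    content_type="application/pdf"
)

# From a pathlib.Path object
doc = db.ingest_file(Path("document.pdf"), filename="document.pdf")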
Document Processing Pipeline
When you ingest a document:
1. The file is uploaded and processed based on its content type
2. For PDFs and other text-based documents:
   - Text is extracted while preserving structure
   - Content is split into meaningful chunks
   - Each chunk is embedded for semantic search
3. Metadata and content are stored for retrieval
4. Document chunks are indexed for efficient searching
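If ingestion continues after the call returns, the processing_status field in system_metadata (described in the Document Model section below) can be polled. The sketch below assumes status strings such as "completed" and "failed"; verify the actual values against your deployment:
import time

# Illustrative polling loop; the exact status strings are assumptions
doc = db.ingest_file("document.pdf", filename="document.pdf")
status = doc.system_metadata.get("processing_status")
while status not in ("completed", "failed"):
    time.sleep(2)
    doc = db.get_document(doc.external_id)
    status = doc.system_metadata.get("processing_status")

print(f"Status: {status}, chunks: {doc.system_metadata.get('chunk_count')}")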
Verifying Ingestion
You can verify your ingested documents:
# Synchronous verification
docs = db.list_documents(limit=10)
for doc in docs:
    print(f"ID: {doc.external_id}")
    print(f"Type: {doc.content_type}")
    print(f"Metadata: {doc.metadata}")
    print("---")

doc = db.get_document("doc_123")
print(f"Title: {doc.metadata.get('title')}")
# Asynchronous verification
async with AsyncDataBridge("your-uri") as db:
    docs = await db.list_documents(limit=10)
    for doc in docs:
        print(f"ID: {doc.external_id}")
        print(f"Type: {doc.content_type}")
        print(f"Metadata: {doc.metadata}")
        print("---")

    doc = await db.get_document("doc_123")
    print(f"Title: {doc.metadata.get('title')}")
Document Model
When you ingest a document, the response includes several important fields:
doc = db.ingest_file("document.pdf", "document.pdf")
# Document identifier
print(f"ID: {doc.external_id}")
# Storage information
print(f"Storage Info: {doc.storage_info}")
# Contains:
# - storage_type: Where the document is stored (e.g., "s3", "local")
# - bucket: Storage bucket name
# - key: Object key (path) within the bucket
# - size: Document size in bytes
# System metadata
print(f"System Metadata: {doc.system_metadata}")
# Contains:
# - created_at: Document creation timestamp
# - updated_at: Last modification timestamp
# - chunk_count: Number of chunks generated
# - embedding_model: Model used for embeddings
# - processing_status: Current status
# Access control
print(f"Access Control: {doc.access_control}")
# Contains:
# - readers: List of entities that can read the document
# - writers: List of entities that can modify the document
# - admins: List of entities that can manage the document
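The access_control lists can drive simple client-side checks. The sketch below assumes access_control behaves like a dictionary whose values are lists of identifier strings:
# Illustrative permission check (dictionary-of-lists shape is an assumption)
doc = db.get_document("doc_123")
user = "researcher@example.com"
can_read = user in doc.access_control.get("readers", [])
can_write = user in doc.access_control.get("writers", [])
print(f"{user}: read={can_read}, write={can_write}")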
Processing Rules
You can apply processing rules during ingestion to transform content or extract metadata:
from databridge import MetadataExtractionRule, NaturalLanguageRule

# Create rules
metadata_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}}
    }
})

format_rule = NaturalLanguageRule(
    prompt="Convert the text into a professional format with clear paragraphs"
)

# Apply rules during text ingestion
doc = db.ingest_text(
    content="Your content...",
    rules=[metadata_rule, format_rule]
)

# Apply rules during file ingestion
doc = db.ingest_file(
    "document.pdf",
    rules=[metadata_rule, format_rule]
)
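Assuming the fields captured by MetadataExtractionRule are written into the document's metadata (this guide does not state where extracted values land, so treat this as a sketch), they could be read back as follows:
# Illustrative: where extracted fields end up is an assumption
print(f"Extracted title: {doc.metadata.get('title')}")
print(f"Extracted topics: {doc.metadata.get('topics', [])}")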
For detailed information about rules, see the Rules Guide.