Ingest
POST
Ingest Text Document
Ingest a text document with metadata. The document will be chunked and indexed for semantic search.
Parameters:
content: Text content to ingest
metadata: (Optional) Dictionary of metadata
rules: (Optional) List of processing rules to apply. Each rule can be:
  - Metadata extraction rule with a JSON schema
  - Natural language rule with a transformation prompt
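To make the "JSON schema" of a metadata extraction rule more concrete, the sketch below checks by hand whether a metadata dictionary has the shape a schema describes. It uses only the standard library and does not call DataBridge. The helper `matches_schema` is hypothetical, covers only the `object`, `string`, and `array` types, and is not how DataBridge itself validates or extracts metadata.

```python
# Hypothetical helper, not part of the DataBridge client: a hand-rolled check
# covering only the "object", "string", and "array" JSON Schema types.
def matches_schema(value, schema):
    kind = schema.get("type")
    if kind == "string":
        return isinstance(value, str)
    if kind == "array":
        item_schema = schema.get("items", {})
        return isinstance(value, list) and all(
            matches_schema(item, item_schema) for item in value
        )
    if kind == "object":
        if not isinstance(value, dict):
            return False
        properties = schema.get("properties", {})
        # Only keys named in "properties" are checked; extra keys are allowed.
        return all(
            matches_schema(value[name], sub_schema)
            for name, sub_schema in properties.items()
            if name in value
        )
    # Types outside this sketch are accepted without checking.
    return True


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
}

good = {"title": "ML Overview", "topics": ["ml", "ai"]}
bad = {"title": "ML Overview", "topics": "ml, ai"}  # topics should be a list

print(matches_schema(good, schema))  # True
print(matches_schema(bad, schema))   # False
```

A full validator (for example one implementing the complete JSON Schema specification) would be the better choice outside of a quick illustration like this.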
Returns: Document object with the following fields:
external_id: Unique document identifier
content_type: Content type (always "text/plain" for text)
filename: Always None for text documents
metadata: Combined user-provided and rule-extracted metadata
storage_info: Empty for text documents
system_metadata: System-managed metadata (created_at, updated_at, version)
access_control: Access control lists (readers, writers, admins)
chunk_ids: List of chunk identifiers
from databridge import DataBridge, MetadataExtractionRule, NaturalLanguageRule

# Create client instance
db = DataBridge(uri="your-databridge-uri")

# Create processing rules (optional)
metadata_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}}
    }
})
format_rule = NaturalLanguageRule(
    prompt="Convert the text into a professional format with clear paragraphs"
)

# Ingest text document with rules
doc = db.ingest_text(
    content="Machine learning is transforming industries...",
    metadata={
        "title": "ML Overview",
        "category": "tech",
        "tags": ["ml", "ai"]
    },
    rules=[metadata_rule, format_rule]  # Optional processing rules
)
print(f"Document ID: {doc.external_id}")
Response:
{
  "external_id": "doc_abc123",
  "content_type": "text/plain",
  "filename": null,
  "metadata": {
    "title": "ML Overview",
    "category": "tech",
    "tags": ["ml", "ai"]
  },
  "storage_info": {},
  "system_metadata": {
    "created_at": "2024-03-20T10:30:00Z",
    "updated_at": "2024-03-20T10:30:00Z",
    "version": 1
  },
  "access_control": {
    "readers": ["user_123"],
    "writers": ["user_123"],
    "admins": ["user_123"]
  },
  "chunk_ids": ["chunk_1", "chunk_2"]
}
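If you work with the raw JSON rather than the client's Document object, the fields can be read with the standard library alone. The following is a minimal sketch that assumes the timestamps are ISO 8601 strings with a trailing "Z", as in the sample response; `parse_timestamp` and `summary` are illustrative names, not part of DataBridge.

```python
import json
from datetime import datetime

# Sample payload copied from the response above, trimmed to the fields used here.
raw = """
{
  "external_id": "doc_abc123",
  "system_metadata": {
    "created_at": "2024-03-20T10:30:00Z",
    "updated_at": "2024-03-20T10:30:00Z",
    "version": 1
  },
  "chunk_ids": ["chunk_1", "chunk_2"]
}
"""


def parse_timestamp(value):
    # Older Python versions reject a trailing "Z" in fromisoformat,
    # so map it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


doc = json.loads(raw)
created = parse_timestamp(doc["system_metadata"]["created_at"])
updated = parse_timestamp(doc["system_metadata"]["updated_at"])

summary = {
    "id": doc["external_id"],
    "chunks": len(doc["chunk_ids"]),
    "modified_since_creation": updated > created,
}
print(summary)
# {'id': 'doc_abc123', 'chunks': 2, 'modified_since_creation': False}
```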
POST
Ingest File Document
Upload and ingest a file document. Supports various file types including PDFs, Word documents, presentations, and more. The file will be processed, chunked, and indexed for semantic search.
Parameters:
file: File to ingest (path string, bytes, file object, or Path)
filename: Name of the file
content_type: (Optional) MIME type; guessed if not provided
metadata: (Optional) Dictionary of metadata
rules: (Optional) List of processing rules to apply to extracted text
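How DataBridge guesses a missing `content_type` is not described here. If you would rather pass an explicit value, one option is to derive it from the filename with Python's standard `mimetypes` module, as in the sketch below. The helper `guess_content_type` and its `application/octet-stream` fallback are assumptions for illustration, not DataBridge behavior.

```python
import mimetypes


def guess_content_type(filename, default="application/octet-stream"):
    # guess_type returns (type, encoding); type is None for unknown extensions.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


print(guess_content_type("Q4_Presentation.pdf"))  # application/pdf
print(guess_content_type("notes.unknownext"))     # application/octet-stream
```

The result can then be passed as the `content_type` argument instead of relying on the guess made during ingestion.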
Returns: Document object with the same fields as for text documents. The following fields carry file-specific values:
storage_info: Contains bucket and key information for file storage
filename: Original filename
content_type: MIME type of the file
from databridge import DataBridge, MetadataExtractionRule, NaturalLanguageRule

# Create client instance
db = DataBridge(uri="your-databridge-uri")

# Create processing rules (optional)
pii_rule = NaturalLanguageRule(
    prompt="Remove all PII. Replace names with [NAME], emails with [EMAIL]"
)
classify_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "document_type": {"type": "string"},
        "confidentiality": {"type": "string"}
    }
})

# From file path with rules
doc = db.ingest_file(
    file="presentation.pdf",
    filename="Q4_Presentation.pdf",
    content_type="application/pdf",
    metadata={
        "department": "Finance",
        "year": 2024,
        "quarter": 4
    },
    rules=[pii_rule, classify_rule]  # Optional processing rules
)
print(f"Document ID: {doc.external_id}")
print(f"Storage location: {doc.storage_info['bucket']}/{doc.storage_info['key']}")

# From file object
with open("presentation.pptx", "rb") as f:
    doc = db.ingest_file(
        file=f,
        filename="presentation.pptx",
        rules=[pii_rule]  # Rules work with file objects too
    )
Response:
{
  "external_id": "doc_xyz789",
  "content_type": "application/pdf",
  "filename": "Q4_Presentation.pdf",
  "metadata": {
    "department": "Finance",
    "year": 2024,
    "quarter": 4
  },
  "storage_info": {
    "bucket": "your-bucket-name",
    "key": "doc_xyz789/Q4_Presentation.pdf"
  },
  "system_metadata": {
    "created_at": "2024-03-20T10:30:00Z",
    "updated_at": "2024-03-20T10:30:00Z",
    "version": 1
  },
  "access_control": {
    "readers": ["user_123"],
    "writers": ["user_123"],
    "admins": ["user_123"]
  },
  "chunk_ids": ["chunk_1", "chunk_2", "chunk_3"]
}
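Because `storage_info` is empty for text documents and holds `bucket` and `key` for file documents, code that handles both kinds may want to guard the lookup. A minimal sketch over parsed JSON responses follows; `storage_location` is a hypothetical helper, not part of the DataBridge client.

```python
def storage_location(document):
    """Return "bucket/key" for a file document, or None when storage_info is empty."""
    info = document.get("storage_info") or {}
    bucket, key = info.get("bucket"), info.get("key")
    if bucket is None or key is None:
        return None
    return f"{bucket}/{key}"


# Values taken from the sample responses for file and text documents.
file_doc = {
    "storage_info": {
        "bucket": "your-bucket-name",
        "key": "doc_xyz789/Q4_Presentation.pdf",
    }
}
text_doc = {"storage_info": {}}

print(storage_location(file_doc))  # your-bucket-name/doc_xyz789/Q4_Presentation.pdf
print(storage_location(text_doc))  # None
```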