Ingest

POST Ingest Text Document

Ingest a text document with optional metadata. The document is chunked and indexed for semantic search.

Parameters:

  • content: Text content to ingest

  • metadata: (Optional) Dictionary of metadata

  • rules: (Optional) List of processing rules to apply. Each rule can be:

    • Metadata extraction rule with a JSON schema

    • Natural language rule with a transformation prompt

Returns: Document object with the following fields:

  • external_id: Unique document identifier

  • content_type: Content type (always "text/plain" for text)

  • filename: Always None for text documents (null in the JSON response)

  • metadata: Combined user-provided and rule-extracted metadata

  • storage_info: Empty for text documents

  • system_metadata: System-managed metadata (created_at, updated_at, version)

  • access_control: Access control lists (readers, writers, admins)

  • chunk_ids: List of chunk identifiers
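The field list above can be mirrored locally when you want typed access to a response payload. The following is a minimal sketch, not the library's own Document class: `DocumentRecord`, `from_dict`, and `is_text_document` are hypothetical names introduced here for illustration, and the defaults are assumptions based only on the fields listed above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DocumentRecord:
    """Hypothetical local mirror of the documented fields (not the library's class)."""

    external_id: str
    content_type: str
    filename: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    storage_info: Dict[str, Any] = field(default_factory=dict)
    system_metadata: Dict[str, Any] = field(default_factory=dict)
    access_control: Dict[str, List[str]] = field(default_factory=dict)
    chunk_ids: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "DocumentRecord":
        # Keep only the documented keys; ignore anything else in the payload.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in data.items() if k in known})

    def is_text_document(self) -> bool:
        # Per the field list: text documents have no filename and empty storage_info.
        return self.filename is None and not self.storage_info
```

Building the record from a parsed JSON payload with `DocumentRecord.from_dict(payload)` keeps field access explicit and makes a missing `external_id` or `content_type` fail early.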

Example:

from databridge import DataBridge, MetadataExtractionRule, NaturalLanguageRule

# Create client instance
db = DataBridge(uri="your-databridge-uri")

# Create processing rules (optional)
metadata_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}}
    }
})

format_rule = NaturalLanguageRule(
    prompt="Convert the text into a professional format with clear paragraphs"
)

# Ingest text document with rules
doc = db.ingest_text(
    content="Machine learning is transforming industries...",
    metadata={
        "title": "ML Overview",
        "category": "tech",
        "tags": ["ml", "ai"]
    },
    rules=[metadata_rule, format_rule]  # Optional processing rules
)
print(f"Document ID: {doc.external_id}")

Response:

{
    "external_id": "doc_abc123",
    "content_type": "text/plain",
    "filename": null,
    "metadata": {
        "title": "ML Overview",
        "category": "tech",
        "tags": ["ml", "ai"]
    },
    "storage_info": {},
    "system_metadata": {
        "created_at": "2024-03-20T10:30:00Z",
        "updated_at": "2024-03-20T10:30:00Z",
        "version": 1
    },
    "access_control": {
        "readers": ["user_123"],
        "writers": ["user_123"],
        "admins": ["user_123"]
    },
    "chunk_ids": ["chunk_1", "chunk_2"]
}

POST Ingest File Document

Upload and ingest a file document. Supports various file types including PDFs, Word documents, presentations, and more. The file will be processed, chunked, and indexed for semantic search.

Parameters:

  • file: File to ingest (path string, bytes, file object, or Path)

  • filename: Name of the file

  • content_type: (Optional) MIME type; guessed if not provided

  • metadata: (Optional) Dictionary of metadata

  • rules: (Optional) List of processing rules to apply to extracted text
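To make the accepted `file` inputs and the content-type guessing concrete, here is a rough sketch of how a caller might normalize these parameters before an upload. It is not how the client does it internally: `prepare_upload` is a hypothetical helper, falling back to the path's own name for `filename` is an assumption, and the guess relies on Python's standard `mimetypes` module, which may differ from the service's behavior.

```python
import mimetypes
from pathlib import Path
from typing import BinaryIO, Optional, Tuple, Union

FileInput = Union[str, bytes, Path, BinaryIO]


def prepare_upload(
    file: FileInput,
    filename: Optional[str] = None,
    content_type: Optional[str] = None,
) -> Tuple[bytes, str, str]:
    """Hypothetical helper: normalize `file` to bytes and fill in a MIME type."""
    if isinstance(file, (str, Path)):
        # A string is treated as a path, matching the "path string" input form.
        path = Path(file)
        data = path.read_bytes()
        filename = filename or path.name  # assumption: fall back to the path's name
    elif isinstance(file, (bytes, bytearray)):
        data = bytes(file)
    else:
        data = file.read()  # any binary file object opened with "rb"

    if not filename:
        raise ValueError("filename is required when it cannot be taken from a path")

    if content_type is None:
        # Guess from the file extension; use a generic type when unknown.
        guessed, _encoding = mimetypes.guess_type(filename)
        content_type = guessed or "application/octet-stream"

    return data, filename, content_type
```

Passing an explicit `content_type` skips the guess entirely, which is the safer choice when the file extension is missing or misleading.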

Returns: Document object with the same fields as for text documents, except:

  • storage_info: Contains the bucket and key where the file is stored

  • filename: Original filename

  • content_type: MIME type of the file

Example:

from databridge import DataBridge, MetadataExtractionRule, NaturalLanguageRule

# Create client instance
db = DataBridge(uri="your-databridge-uri")

# Create processing rules (optional)
pii_rule = NaturalLanguageRule(
    prompt="Remove all PII. Replace names with [NAME], emails with [EMAIL]"
)

classify_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "document_type": {"type": "string"},
        "confidentiality": {"type": "string"}
    }
})

# From file path with rules
doc = db.ingest_file(
    file="presentation.pdf",
    filename="Q4_Presentation.pdf",
    content_type="application/pdf",
    metadata={
        "department": "Finance",
        "year": 2024,
        "quarter": 4
    },
    rules=[pii_rule, classify_rule]  # Optional processing rules
)
print(f"Document ID: {doc.external_id}")
print(f"Storage location: {doc.storage_info['bucket']}/{doc.storage_info['key']}")

# From file object
with open("presentation.pptx", "rb") as f:
    doc = db.ingest_file(
        file=f,
        filename="presentation.pptx",
        rules=[pii_rule]  # Rules work with file objects too
    )

Response:

{
    "external_id": "doc_xyz789",
    "content_type": "application/pdf",
    "filename": "Q4_Presentation.pdf",
    "metadata": {
        "department": "Finance",
        "year": 2024,
        "quarter": 4
    },
    "storage_info": {
        "bucket": "your-bucket-name",
        "key": "doc_xyz789/Q4_Presentation.pdf"
    },
    "system_metadata": {
        "created_at": "2024-03-20T10:30:00Z",
        "updated_at": "2024-03-20T10:30:00Z",
        "version": 1
    },
    "access_control": {
        "readers": ["user_123"],
        "writers": ["user_123"],
        "admins": ["user_123"]
    },
    "chunk_ids": ["chunk_1", "chunk_2", "chunk_3"]
}
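A response like the one above can be inspected with the standard library alone. The sketch below assumes the payload arrives as a JSON string in the shape shown; `summarize_response` is a hypothetical helper, not part of the client.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def summarize_response(response_text: str) -> Dict[str, Any]:
    """Hypothetical helper: pull a few commonly used values out of a response."""
    doc = json.loads(response_text)

    # storage_info is empty for text documents, so the location may be absent.
    storage = doc.get("storage_info") or {}
    location = None
    if "bucket" in storage and "key" in storage:
        location = f"{storage['bucket']}/{storage['key']}"

    # Timestamps end in "Z"; swap it for an explicit offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    created_raw = doc["system_metadata"]["created_at"]
    created_at = datetime.fromisoformat(created_raw.replace("Z", "+00:00"))

    return {
        "id": doc["external_id"],
        "created_at": created_at.astimezone(timezone.utc),
        "location": location,
        "chunk_count": len(doc.get("chunk_ids", [])),
    }
```

For a text document the returned `location` is `None`, because `storage_info` is empty in that case.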
