Ingest

`POST` Ingest Text Document

Ingest a text document with metadata. The document will be chunked and indexed for semantic search.

Parameters:

content: Text content to ingest
metadata: (Optional) Dictionary of metadata
rules: (Optional) List of processing rules to apply. Each rule can be:
- Metadata extraction rule with a JSON schema
- Natural language rule with a transformation prompt
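
For callers who use the HTTP endpoint directly, the two rule kinds can be written as plain dictionaries. The following is a minimal sketch that assumes the wire format matches the curl example on this page; the helper names (`metadata_extraction_rule`, `natural_language_rule`, `build_ingest_text_payload`) are hypothetical and not part of the DataBridge client.

```python
import json
from typing import Any, Optional


def metadata_extraction_rule(schema: dict) -> dict:
    """Hypothetical helper: a rule that extracts metadata using a JSON schema."""
    return {"type": "metadata_extraction", "schema": schema}


def natural_language_rule(prompt: str) -> dict:
    """Hypothetical helper: a rule that transforms text using a prompt."""
    return {"type": "natural_language", "prompt": prompt}


def build_ingest_text_payload(
    content: str,
    metadata: Optional[dict] = None,
    rules: Optional[list] = None,
) -> str:
    """Serialize a request body shaped like the curl example's JSON."""
    payload: dict[str, Any] = {
        "content": content,
        "metadata": metadata or {},
        "rules": rules or [],
    }
    return json.dumps(payload)
```

The resulting string could be passed as the request body of `POST /ingest/text`; whether the server accepts an empty `metadata` object or an empty `rules` list is an assumption here.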

Returns: Document object with the following fields:

external_id: Unique document identifier
content_type: Content type (always "text/plain" for text)
filename: Always None for text documents
metadata: Combined user-provided and rule-extracted metadata
storage_info: Empty for text documents
system_metadata: System-managed metadata (created_at, updated_at, version)
access_control: Access control lists (readers, writers, admins)
chunk_ids: List of chunk identifiers
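
When consuming the raw HTTP response rather than the Python client, the fields listed above could be modeled roughly as follows. This is an illustrative sketch only: `DocumentSketch` and `from_response` are hypothetical names, not the client's actual `Document` class, and the field types are inferred from the example response on this page.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class DocumentSketch:
    """Hypothetical model of the documented response fields."""
    external_id: str
    content_type: str
    filename: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)
    storage_info: dict[str, Any] = field(default_factory=dict)
    system_metadata: dict[str, Any] = field(default_factory=dict)
    access_control: dict[str, list[str]] = field(default_factory=dict)
    chunk_ids: list[str] = field(default_factory=list)

    @classmethod
    def from_response(cls, data: dict[str, Any]) -> "DocumentSketch":
        # Ignore keys this sketch does not know about, so that extra
        # fields in a response do not raise a TypeError.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```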

from databridge import DataBridge, MetadataExtractionRule, NaturalLanguageRule

# Create client instance
db = DataBridge(uri="your-databridge-uri")

# Create processing rules (optional)
metadata_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}}
    }
})

format_rule = NaturalLanguageRule(
    prompt="Convert the text into a professional format with clear paragraphs"
)

# Ingest text document with rules
doc = db.ingest_text(
    content="Machine learning is transforming industries...",
    metadata={
        "title": "ML Overview",
        "category": "tech",
        "tags": ["ml", "ai"]
    },
    rules=[metadata_rule, format_rule]  # Optional processing rules
)
print(f"Document ID: {doc.external_id}")

curl -X POST "http://localhost:8000/ingest/text" \
  -H "Authorization: Bearer your_token" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Machine learning is transforming industries...",
    "metadata": {
        "title": "ML Overview",
        "category": "tech",
        "tags": ["ml", "ai"]
    },
    "rules": [
        {
            "type": "metadata_extraction",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "topics": {"type": "array", "items": {"type": "string"}}
                }
            }
        },
        {
            "type": "natural_language",
            "prompt": "Convert the text into a professional format with clear paragraphs"
        }
    ]
  }'

Response:

{
    "external_id": "doc_abc123",
    "content_type": "text/plain",
    "filename": null,
    "metadata": {
        "title": "ML Overview",
        "category": "tech",
        "tags": ["ml", "ai"]
    },
    "storage_info": {},
    "system_metadata": {
        "created_at": "2024-03-20T10:30:00Z",
        "updated_at": "2024-03-20T10:30:00Z",
        "version": 1
    },
    "access_control": {
        "readers": ["user_123"],
        "writers": ["user_123"],
        "admins": ["user_123"]
    },
    "chunk_ids": ["chunk_1", "chunk_2"]
}

Error response (invalid authentication):

{
    "detail": "Invalid authentication credentials"
}

Error response (validation failure, missing `content`):

{
    "detail": [
        {
            "loc": ["body", "content"],
            "msg": "field required",
            "type": "value_error.missing"
        }
    ]
}
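
The two error bodies above differ in shape: `detail` is a string in one and a list of objects in the other. A small sketch of flattening both into readable messages, assuming only those two shapes occur; `describe_error` is a hypothetical helper, not part of the DataBridge client.

```python
def describe_error(body: dict) -> list[str]:
    """Turn an error body into a list of human-readable messages."""
    detail = body.get("detail")
    if isinstance(detail, str):
        # Simple form, e.g. an authentication failure.
        return [detail]
    messages = []
    for item in detail or []:
        # Validation form: join the location path, e.g. ["body", "content"].
        location = ".".join(str(part) for part in item.get("loc", []))
        messages.append(f"{location}: {item.get('msg', 'unknown error')}")
    return messages
```

For the validation example above, this yields a single message naming `body.content` as the missing field.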

`POST` Ingest File Document

Upload and ingest a file document. Supports various file types including PDFs, Word documents, presentations, and more. The file will be processed, chunked, and indexed for semantic search.

Parameters:

file: File to ingest (path string, bytes, file object, or Path)
filename: Name of the file
content_type: (Optional) MIME type; guessed if not provided
metadata: (Optional) Dictionary of metadata
rules: (Optional) List of processing rules to apply to extracted text
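
The page does not say how a missing `content_type` is guessed. One plausible approach, shown here purely as an assumption, is to derive it from the filename extension with Python's standard `mimetypes` module; `resolve_content_type` and the `application/octet-stream` fallback are hypothetical choices, not documented DataBridge behavior.

```python
import mimetypes
from typing import Optional


def resolve_content_type(filename: str, content_type: Optional[str] = None) -> str:
    """Return the explicit content type, or guess one from the filename."""
    if content_type:
        return content_type
    guessed, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    return guessed or "application/octet-stream"
```

Passing `content_type` explicitly, as the `ingest_file` example does for the PDF, avoids depending on whatever guessing logic the server or client applies.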

Returns: Document object with the same fields as for text documents, except:

storage_info: Contains bucket and key information for file storage
filename: Original filename
content_type: MIME type of the file

from databridge import DataBridge, MetadataExtractionRule, NaturalLanguageRule

# Create client instance
db = DataBridge(uri="your-databridge-uri")

# Create processing rules (optional)
pii_rule = NaturalLanguageRule(
    prompt="Remove all PII. Replace names with [NAME], emails with [EMAIL]"
)

classify_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "document_type": {"type": "string"},
        "confidentiality": {"type": "string"}
    }
})

# From file path with rules
doc = db.ingest_file(
    file="presentation.pdf",
    filename="Q4_Presentation.pdf",
    content_type="application/pdf",
    metadata={
        "department": "Finance",
        "year": 2024,
        "quarter": 4
    },
    rules=[pii_rule, classify_rule]  # Optional processing rules
)
print(f"Document ID: {doc.external_id}")
print(f"Storage location: {doc.storage_info['bucket']}/{doc.storage_info['key']}")

# From file object
with open("presentation.pptx", "rb") as f:
    doc = db.ingest_file(
        file=f,
        filename="presentation.pptx",
        rules=[pii_rule]  # Rules work with file objects too
    )

curl -X POST "http://localhost:8000/ingest/file" \
  -H "Authorization: Bearer your_token" \
  -F "file=@presentation.pdf" \
  -F 'metadata={"department":"Finance","year":2024,"quarter":4}' \
  -F 'rules=[
    {
        "type": "natural_language",
        "prompt": "Remove all PII. Replace names with [NAME], emails with [EMAIL]"
    },
    {
        "type": "metadata_extraction",
        "schema": {
            "type": "object",
            "properties": {
                "document_type": {"type": "string"},
                "confidentiality": {"type": "string"}
            }
        }
    }
  ]'

Response:

{
    "external_id": "doc_xyz789",
    "content_type": "application/pdf",
    "filename": "Q4_Presentation.pdf",
    "metadata": {
        "department": "Finance",
        "year": 2024,
        "quarter": 4
    },
    "storage_info": {
        "bucket": "your-bucket-name",
        "key": "doc_xyz789/Q4_Presentation.pdf"
    },
    "system_metadata": {
        "created_at": "2024-03-20T10:30:00Z",
        "updated_at": "2024-03-20T10:30:00Z",
        "version": 1
    },
    "access_control": {
        "readers": ["user_123"],
        "writers": ["user_123"],
        "admins": ["user_123"]
    },
    "chunk_ids": ["chunk_1", "chunk_2", "chunk_3"]
}

Error response (invalid authentication):

{
    "detail": "Invalid authentication credentials"
}

Error response (file too large):

{
    "detail": "File size exceeds maximum allowed size of 100MB"
}
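
To avoid uploading a file only to have it rejected, the size can be checked on the client first. This sketch takes the 100MB figure from the example error above and assumes it means 100 * 1024 * 1024 bytes; the actual limit may be configured differently per deployment. `MAX_FILE_BYTES`, `file_size`, and `check_size` are hypothetical names, and open file objects are not handled.

```python
import os
from pathlib import Path
from typing import Union

# Assumption: limit taken from the example error message on this page.
MAX_FILE_BYTES = 100 * 1024 * 1024


def file_size(file: Union[str, Path, bytes]) -> int:
    """Size in bytes of in-memory bytes or of a file on disk."""
    if isinstance(file, bytes):
        return len(file)
    return os.path.getsize(file)


def check_size(file: Union[str, Path, bytes], limit: int = MAX_FILE_BYTES) -> None:
    """Raise ValueError if the file is larger than the limit."""
    size = file_size(file)
    if size > limit:
        raise ValueError(f"File is {size} bytes, which exceeds the limit of {limit} bytes")
```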
