Processing Rules

Document processing rules allow you to transform content and extract structured metadata during document ingestion. Rules are processed sequentially and can be used to standardize content, extract information, or enhance documents with additional metadata.

Overview

Rules are applied during document ingestion, before the content is chunked and embedded. Each rule can either:

  • Extract structured metadata from the content

  • Transform the content using natural language instructions

Types of Rules

Metadata Extraction Rules

Metadata extraction rules use a schema to extract structured information from document content. This is useful for automatically categorizing documents, extracting key information, or standardizing metadata.

from databridge import DataBridge, MetadataExtractionRule
from pydantic import BaseModel

# Define your metadata schema
class ArticleMetadata(BaseModel):
    title: str
    author: str
    publication_date: str
    topics: list[str]

# Create the rule
metadata_rule = MetadataExtractionRule(schema=ArticleMetadata)

# Use during ingestion
db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your article content...",
    rules=[metadata_rule]
)

# The extracted metadata will be merged with any provided metadata
print(f"Extracted metadata: {doc.metadata}")

You can also specify the schema as a dictionary:
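
The dictionary's exact shape is not shown in this section, so the following is only a sketch. It assumes field names map to Python types, mirroring `ArticleMetadata`. The call to `MetadataExtractionRule` is left as a comment because the accepted dictionary format is not confirmed here.

```python
# Assumed dictionary form of the ArticleMetadata schema above:
# field names mapped to Python types. The exact format accepted by
# MetadataExtractionRule is an assumption, not something this page confirms.
article_schema = {
    "title": str,
    "author": str,
    "publication_date": str,
    "topics": list[str],
}

# Under that assumption, the rule would be created the same way as before:
# metadata_rule = MetadataExtractionRule(schema=article_schema)

print(sorted(article_schema))
```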

Natural Language Rules

Natural language rules allow you to transform content using plain English instructions. This is useful for content standardization, PII removal, or any other text transformation.
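
As a rough sketch of the idea, a natural language rule can be thought of as a prompt attached to a rule type. The `NaturalLanguageRuleSketch` class and its `to_dict()` output below are hypothetical stand-ins so the example runs without the SDK. The SDK's actual class name, constructor, and serialized format may differ.

```python
from dataclasses import dataclass


@dataclass
class NaturalLanguageRuleSketch:
    """Hypothetical stand-in for an SDK natural language rule."""

    prompt: str  # plain-English instruction applied to the content

    def to_dict(self) -> dict:
        # Assumed serialized form; the real format may differ.
        return {"type": "natural_language", "prompt": self.prompt}


pii_rule = NaturalLanguageRuleSketch(
    prompt="Remove all email addresses and phone numbers from the text."
)
print(pii_rule.to_dict())
```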

Rule Configuration

Rules use language models for processing. You can configure which provider and model to use through environment variables:
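
The variable names are not listed in this section. The sketch below uses the hypothetical names `RULES_PROVIDER` and `RULES_MODEL`, with placeholder defaults, only to illustrate how such settings are usually read. Consult your deployment's configuration reference for the real names.

```python
import os


def load_rule_model_config(env=None) -> dict:
    """Read provider and model settings, falling back to placeholders.

    RULES_PROVIDER and RULES_MODEL are hypothetical variable names.
    """
    env = os.environ if env is None else env
    return {
        "provider": env.get("RULES_PROVIDER", "example-provider"),
        "model": env.get("RULES_MODEL", "example-model"),
    }


# Passing a plain dict keeps the example independent of the real environment.
config = load_rule_model_config({"RULES_PROVIDER": "my-provider"})
print(config)
```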

Common Use Cases

PII Detection and Removal
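
PII removal fits a natural language rule with a precise instruction. The sketch below is only a possible configuration: the `type` and `prompt` keys are assumptions about the serialized form, not a documented format.

```python
# Hypothetical rule configuration for PII removal.
pii_rule_config = {
    "type": "natural_language",
    "prompt": (
        "Remove all personally identifiable information, including names, "
        "email addresses, phone numbers, and postal addresses. "
        "Replace each removed item with [REDACTED]."
    ),
}

print(pii_rule_config["prompt"])
```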

Content Classification
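
Classification maps naturally onto a metadata extraction rule whose schema holds a category label. The sketch below assumes a dictionary schema and a hypothetical set of categories. Because a model may return a label outside that set, it also shows one way to normalize the extracted value.

```python
# Hypothetical category set and schema; adjust both to your documents.
ALLOWED_CATEGORIES = {"technical", "legal", "marketing", "other"}
classification_schema = {"category": str, "confidence": float}


def normalize_category(extracted: dict) -> dict:
    """Lower-case the label and map anything outside the allowed set to "other"."""
    label = str(extracted.get("category", "")).strip().lower()
    if label not in ALLOWED_CATEGORIES:
        label = "other"
    return {**extracted, "category": label}


print(normalize_category({"category": "Legal", "confidence": 0.9}))
```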

Text Summarization
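
Summarization can be expressed as a natural language rule whose prompt states the target length explicitly. The helper and configuration keys below are hypothetical and only illustrate how a specific prompt might be built.

```python
def summary_prompt(max_sentences: int) -> str:
    """Build a specific summarization instruction for a natural language rule."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    return (
        f"Summarize the document in at most {max_sentences} sentences. "
        "Preserve names, dates, and figures exactly as written."
    )


# Assumed serialized form of the rule.
summary_rule_config = {"type": "natural_language", "prompt": summary_prompt(3)}
print(summary_rule_config["prompt"])
```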

Format Standardization
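
Format standardization usually works best when each convention is spelled out. The sketch below joins a hypothetical list of conventions into a single prompt. The configuration keys are assumptions rather than a documented format.

```python
# Hypothetical formatting conventions; replace with your own house style.
format_instructions = [
    "Write all dates in ISO 8601 format (YYYY-MM-DD).",
    "Use sentence case for headings.",
    "Replace tabs with single spaces.",
]

format_rule_config = {
    "type": "natural_language",
    "prompt": "Standardize the text as follows: " + " ".join(format_instructions),
}
print(format_rule_config["prompt"])
```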

Defining Custom Rules

DataBridge allows you to define your own custom rules for specialized document processing needs. You can either implement rules at the core level or create simplified SDK wrappers.

Core Implementation

To implement a custom rule in the core system, create a new class that inherits from BaseRule and implements the apply method:
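
The original listing is not reproduced here, so the following is a minimal sketch of the pattern. `BaseRule` is redefined as a stand-in so the example runs on its own. The real base class is described elsewhere in this page as using Pydantic, and its `apply` signature (including whether it is asynchronous) may differ. `WordCountRule` is a hypothetical rule.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseRule(ABC):
    """Stand-in for the core BaseRule; the real class may differ."""

    @abstractmethod
    def apply(self, content: str) -> tuple[dict[str, Any], str]:
        """Return (extracted metadata, possibly modified content)."""


class WordCountRule(BaseRule):
    """Hypothetical custom rule: records a word count, leaves content unchanged."""

    def apply(self, content: str) -> tuple[dict[str, Any], str]:
        if not isinstance(content, str):
            raise TypeError("content must be a string")
        return {"word_count": len(content.split())}, content


metadata, new_content = WordCountRule().apply("Rules run before chunking.")
print(metadata, new_content)
```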

SDK Implementation

For the Python SDK, you can create a simplified wrapper that generates the rule configuration:
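
A wrapper of this kind might look like the sketch below. `KeywordTagRule`, its `keyword_tag` type string, and the shape of the returned dictionary are all hypothetical. The only detail taken from this page is that the wrapper exposes `to_dict()` and that its output should match the core rule format.

```python
class KeywordTagRule:
    """Hypothetical SDK-side wrapper: it describes the rule, it does not run it."""

    def __init__(self, keywords: list[str]):
        if not keywords:
            raise ValueError("keywords must not be empty")
        self.keywords = list(keywords)

    def to_dict(self) -> dict:
        # "type" is assumed to identify the matching core rule.
        return {"type": "keyword_tag", "keywords": self.keywords}


# Rules are serialized before being sent along with the ingestion request.
rules_payload = [rule.to_dict() for rule in [KeywordTagRule(["invoice", "receipt"])]]
print(rules_payload)
```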

Example: Custom Metadata Extraction

Here's a complete example of implementing a custom metadata extraction rule for research papers:
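
The original example is not reproduced here. The sketch below shows one way such a rule could be structured, using only the standard library. `ResearchPaperMetadata`, `ResearchPaperRule`, and the injected `extract` callable (which stands in for the language model call) are hypothetical names, and the canned response exists only to make the example runnable.

```python
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class ResearchPaperMetadata:
    """Hypothetical schema for research papers."""

    title: str
    authors: list[str]
    keywords: list[str] = field(default_factory=list)


class ResearchPaperRule:
    """Hypothetical rule; `extract` stands in for the language model call."""

    def __init__(self, extract: Callable[[str], dict]):
        self.extract = extract

    def apply(self, content: str) -> tuple[dict, str]:
        raw = self.extract(content)
        missing = [key for key in ("title", "authors") if not raw.get(key)]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        paper = ResearchPaperMetadata(
            title=str(raw["title"]).strip(),
            authors=[str(a).strip() for a in raw["authors"]],
            # Lower-case and de-duplicate keywords.
            keywords=sorted({str(k).lower() for k in raw.get("keywords", [])}),
        )
        return asdict(paper), content  # content is returned unchanged


def fake_extract(content: str) -> dict:
    # Canned response used only to make the sketch runnable.
    return {"title": " Rule Pipelines ", "authors": ["A. Author"], "keywords": ["NLP", "nlp"]}


paper_metadata, paper_content = ResearchPaperRule(fake_extract).apply("paper text")
print(paper_metadata)
```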

Best Practices for Custom Rules

  1. Core Implementation:

    • Inherit from BaseRule and use Pydantic for validation

    • Implement clear error handling in the apply method

    • Return both metadata and modified content

    • Use type hints and docstrings

    • Keep rules focused and single-purpose

  2. SDK Implementation:

    • Keep the interface simple and intuitive

    • Use descriptive class names

    • Implement to_dict() to match core rule format

    • Provide clear documentation and examples

  3. Testing:

    • Test both success and failure cases

    • Verify metadata extraction accuracy

    • Check content transformation correctness

    • Ensure proper error handling
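
The testing points above can be sketched with plain assertions. `uppercase_rule` is a hypothetical rule used only as a test subject. In a real project these checks would normally live in a test framework such as `unittest` or `pytest`.

```python
def uppercase_rule(content: str) -> tuple[dict, str]:
    """Hypothetical rule under test: upper-cases content, rejects empty input."""
    if not content.strip():
        raise ValueError("content is empty")
    return {"original_length": len(content)}, content.upper()


def test_success():
    # Success case: check both the metadata and the transformed content.
    metadata, content = uppercase_rule("abc")
    assert metadata == {"original_length": 3}
    assert content == "ABC"


def test_failure():
    # Failure case: empty input must raise instead of returning bad data.
    try:
        uppercase_rule("   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty content")


test_success()
test_failure()
```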

Best Practices

  1. Rule Ordering:

    • Apply metadata extraction rules first to capture information from the original content

    • Apply transformation rules afterward in order of importance

    • Put format standardization rules last

  2. Performance:

    • Use specific, clear prompts for natural language rules

    • Keep metadata schemas focused on essential information

    • Account for processing time when chaining multiple rules; each rule invokes a language model, so latency adds up

  3. Error Handling:

    • Rules that fail will be skipped, allowing the ingestion to continue

    • Check the document's system metadata for processing status

    • Monitor rule performance in production
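
The ordering and error-handling behavior described above can be sketched as a small pipeline. This is an illustration, not the library's implementation. The `run_rules` function, the shape of the status records, and the choice to let extracted metadata override provided metadata on key conflicts are all assumptions.

```python
from typing import Callable, Optional

Rule = Callable[[str], tuple[dict, str]]


def run_rules(content: str, rules: list[Rule], metadata: Optional[dict] = None):
    """Apply rules in order; skip any rule that raises and record its status."""
    merged = dict(metadata or {})
    status = []
    for rule in rules:
        try:
            extracted, content = rule(content)
        except Exception as exc:  # a failed rule must not stop ingestion
            status.append({"rule": rule.__name__, "ok": False, "error": str(exc)})
            continue
        merged.update(extracted)
        status.append({"rule": rule.__name__, "ok": True})
    return merged, content, status


def count_words(text: str) -> tuple[dict, str]:
    return {"word_count": len(text.split())}, text  # extraction runs first


def broken(text: str) -> tuple[dict, str]:
    raise RuntimeError("model unavailable")  # simulated failure


def collapse_spaces(text: str) -> tuple[dict, str]:
    return {}, " ".join(text.split())  # format standardization runs last


result_metadata, result_content, result_status = run_rules(
    "a  b   c", [count_words, broken, collapse_spaces], {"source": "demo"}
)
print(result_metadata, repr(result_content), result_status)
```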

Next Steps

After setting up rules:

  • Monitor rule effectiveness through the extracted metadata

  • Refine rule prompts and schemas based on results

  • Consider implementing custom rules for specific use cases
