Processing Rules
Document processing rules allow you to transform content and extract structured metadata during document ingestion. Rules are processed sequentially and can be used to standardize content, extract information, or enhance documents with additional metadata.
Overview
Rules are applied during document ingestion, before the content is chunked and embedded. Each rule can either:
Extract structured metadata from the content
Transform the content using natural language instructions
Types of Rules
Metadata Extraction Rules
Metadata extraction rules use a schema to extract structured information from document content. This is useful for automatically categorizing documents, extracting key information, or standardizing metadata.
from databridge import DataBridge, MetadataExtractionRule
from pydantic import BaseModel

# Define your metadata schema
class ArticleMetadata(BaseModel):
    title: str
    author: str
    publication_date: str
    topics: list[str]

# Create the rule
metadata_rule = MetadataExtractionRule(schema=ArticleMetadata)

# Use during ingestion
db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your article content...",
    rules=[metadata_rule]
)

# The extracted metadata will be merged with any provided metadata
print(f"Extracted metadata: {doc.metadata}")

You can also specify the schema as a dictionary:
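A minimal sketch of the dictionary form, assuming the rule accepts a JSON-Schema-style dictionary (the exact format is not specified on this page, so check the SDK reference). The `matches_schema` helper is hypothetical and is included only so the sketch can be checked without the SDK.

```python
# Assumption: a JSON-Schema-style dictionary mirroring ArticleMetadata.
# The format MetadataExtractionRule actually accepts may differ.
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "publication_date": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "author", "publication_date", "topics"],
}

# Hypothetical helper, not part of DataBridge: a shallow type check.
_PY_TYPES = {"string": str, "array": list}


def matches_schema(metadata: dict, schema: dict) -> bool:
    """Return True if every required field is present with the expected type."""
    for name in schema["required"]:
        if name not in metadata:
            return False
        expected = _PY_TYPES[schema["properties"][name]["type"]]
        if not isinstance(metadata[name], expected):
            return False
    return True


sample = {
    "title": "Rules in practice",
    "author": "A. Writer",
    "publication_date": "2024-01-01",
    "topics": ["ingestion"],
}
print(matches_schema(sample, article_schema))
```

Under this assumption the dictionary would be passed in the same way as the Pydantic model: `MetadataExtractionRule(schema=article_schema)`.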
Natural Language Rules
Natural language rules allow you to transform content using plain English instructions. This is useful for content standardization, PII removal, or any other text transformation need.
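As a rough illustration, a natural language rule can be thought of as little more than a prompt. The class below is a local stand-in written for this sketch; the SDK's actual class name and constructor arguments are not shown on this page and may differ.

```python
from dataclasses import dataclass


# Stand-in for illustration only; not the SDK's real class.
@dataclass(frozen=True)
class NaturalLanguageRuleSketch:
    prompt: str  # plain-English instruction applied to the document text

    def __post_init__(self) -> None:
        # An empty instruction gives the model nothing to act on.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")


pii_rule = NaturalLanguageRuleSketch(
    prompt=(
        "Remove all names, email addresses, and phone numbers. "
        "Keep the rest of the text unchanged."
    )
)

# With the SDK, a rule like this would go in the `rules` list at ingestion
# time, the same way the metadata rule is passed to ingest_text.
print(pii_rule.prompt)
```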
Rule Configuration
Rules use language models for processing. You can configure which provider and model to use through environment variables:
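The variable names are not listed on this page. The sketch below assumes two hypothetical names, `RULES_PROVIDER` and `RULES_MODEL`, and placeholder values, purely to show how such settings are typically read.

```python
import os


# Assumption: RULES_PROVIDER and RULES_MODEL are hypothetical names, and
# the fallback values are placeholders rather than real defaults.
def load_rules_config(env=None) -> dict:
    """Read the rule-processing provider and model from the environment."""
    env = os.environ if env is None else env
    return {
        "provider": env.get("RULES_PROVIDER", "placeholder-provider"),
        "model": env.get("RULES_MODEL", "placeholder-model"),
    }


# Passing a plain dict keeps the example independent of the real environment.
config = load_rules_config(
    {"RULES_PROVIDER": "example-provider", "RULES_MODEL": "example-model"}
)
print(config)
```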
Common Use Cases
PII Detection and Removal
Content Classification
Text Summarization
Format Standardization
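One way to picture the four use cases above is as a small table of rule specifications. Everything in this sketch is illustrative: the `kind` labels, the prompt wording, and the field names are assumptions, not values defined by DataBridge.

```python
# Illustrative only: which kind of rule suits each use case, and what
# its prompt or schema fields might look like.
USE_CASES = [
    {
        "use_case": "PII detection and removal",
        "kind": "natural_language",
        "spec": "Remove names, email addresses, and phone numbers; "
                "leave all other text unchanged.",
    },
    {
        "use_case": "Content classification",
        "kind": "metadata_extraction",
        "spec": {"category": "string", "tags": ["string"]},
    },
    {
        "use_case": "Text summarization",
        "kind": "natural_language",
        "spec": "Summarize the document in three sentences.",
    },
    {
        "use_case": "Format standardization",
        "kind": "natural_language",
        "spec": "Rewrite all dates as YYYY-MM-DD and use consistent headings.",
    },
]

for entry in USE_CASES:
    print(f"{entry['use_case']}: {entry['kind']}")
```

Classification is shown as metadata extraction because its output is structured fields rather than rewritten text; the other three change the content itself.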
Defining Custom Rules
DataBridge allows you to define your own custom rules for specialized document processing needs. You can implement rules at the core level or create simplified SDK wrappers for them.
Core Implementation
To implement a custom rule in the core system, create a new class that inherits from BaseRule and implements the apply method:
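A minimal sketch of that pattern. The `BaseRule` defined here is a local stand-in so the example runs on its own; the real base class may use Pydantic, may declare `apply` as asynchronous, and may have a different signature. The return shape (metadata plus content) follows the best practices listed later in this page.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class BaseRule(ABC):
    """Stand-in for the core BaseRule; the real class may differ."""

    @abstractmethod
    def apply(self, content: str) -> Tuple[Dict[str, Any], str]:
        """Return (metadata, content) after processing."""


class EmailRedactionRule(BaseRule):
    """Hypothetical rule: replace email addresses and count them."""

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def apply(self, content: str) -> Tuple[Dict[str, Any], str]:
        if not isinstance(content, str):
            raise TypeError("content must be a string")
        # subn returns the new text and the number of replacements made.
        redacted, count = self._EMAIL.subn("[EMAIL]", content)
        return {"emails_redacted": count}, redacted


metadata, text = EmailRedactionRule().apply(
    "Contact jane.doe@example.com or bob@example.org."
)
print(metadata, text)
```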
SDK Implementation
For the Python SDK, you can create a simplified wrapper that generates the rule configuration:
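A sketch of such a wrapper, assuming the core identifies rules by a `type` key. The class name, its fields, and the dictionary layout are hypothetical; only the `to_dict()` method name comes from this page.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class KeywordTaggingRule:
    """Hypothetical SDK-side wrapper: holds settings and serializes them."""

    keywords: List[str] = field(default_factory=list)
    case_sensitive: bool = False

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: the core selects the rule implementation by "type".
        return {
            "type": "keyword_tagging",
            "keywords": list(self.keywords),
            "case_sensitive": self.case_sensitive,
        }


rule = KeywordTaggingRule(keywords=["invoice", "receipt"])
print(rule.to_dict())
```

Keeping the wrapper free of processing logic means the SDK only describes the rule, and the core decides how to run it.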
Example: Custom Metadata Extraction
Here's a complete example of implementing a custom metadata extraction rule for research papers:
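The sketch below is a simplified stand-in for such a rule. It uses regular expressions where a real rule would more likely call a language model, it does not inherit from the actual `BaseRule`, and the sample paper and DOI are made up for illustration.

```python
import re
from typing import Any, Dict, Tuple


class ResearchPaperMetadataRule:
    """Hypothetical rule; in the core it would inherit from BaseRule."""

    _DOI = re.compile(r"\b10\.\d{4,9}/\S+")
    _ABSTRACT = re.compile(r"Abstract[:.\s]+(.+?)(?:\n\s*\n|\Z)", re.DOTALL)

    def apply(self, content: str) -> Tuple[Dict[str, Any], str]:
        lines = [ln.strip() for ln in content.splitlines() if ln.strip()]
        doi = self._DOI.search(content)
        abstract = self._ABSTRACT.search(content)
        metadata = {
            # Heuristic: treat the first non-empty line as the title.
            "title": lines[0] if lines else None,
            "doi": doi.group(0).rstrip(".,;") if doi else None,
            # Collapse line breaks inside the abstract into single spaces.
            "abstract": " ".join(abstract.group(1).split()) if abstract else None,
        }
        # Extraction rules leave the content unchanged.
        return metadata, content


paper = """Sparse Retrieval at Scale

Abstract: We study sparse retrieval
for large corpora.

DOI: 10.1234/example.5678
"""

metadata, unchanged = ResearchPaperMetadataRule().apply(paper)
print(metadata)
```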
Best Practices for Custom Rules
Core Implementation:
Inherit from BaseRule and use Pydantic for validation
Implement clear error handling in the apply method
Return both metadata and modified content
Use type hints and docstrings
Keep rules focused and single-purpose
SDK Implementation:
Keep the interface simple and intuitive
Use descriptive class names
Implement to_dict() to match core rule format
Provide clear documentation and examples
Testing:
Test both success and failure cases
Verify metadata extraction accuracy
Check content transformation correctness
Ensure proper error handling
Best Practices
Rule Ordering:
Apply metadata extraction rules first to capture information from the original content
Apply transformation rules afterward in order of importance
Put format standardization rules last
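The ordering advice above can be sketched as a stable sort over rule categories. The category names and the dictionary-based rule records are assumptions made for this example.

```python
from typing import Dict, List

# Lower number runs earlier. Category names are illustrative.
_PRIORITY = {
    "metadata_extraction": 0,
    "transformation": 1,
    "format_standardization": 2,
}


def order_rules(rules: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return rules in the recommended order.

    sorted() is stable, so rules in the same category keep the order
    the caller gave them (for example, by importance).
    """
    return sorted(rules, key=lambda rule: _PRIORITY[rule["category"]])


rules = [
    {"name": "normalize_dates", "category": "format_standardization"},
    {"name": "remove_pii", "category": "transformation"},
    {"name": "extract_article_fields", "category": "metadata_extraction"},
    {"name": "summarize", "category": "transformation"},
]

print([rule["name"] for rule in order_rules(rules)])
```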
Performance:
Use specific, clear prompts for natural language rules
Keep metadata schemas focused on essential information
Consider the processing time when chaining multiple rules
Error Handling:
Rules that fail will be skipped, allowing the ingestion to continue
Check the document's system metadata for processing status
Monitor rule performance in production
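The skip-on-failure behavior described above can be sketched as a loop that catches each rule's errors and records a status. This is an illustration of the idea, not DataBridge's implementation; the `status` dictionary stands in for whatever the system metadata actually records.

```python
from typing import Any, Callable, Dict, List, Tuple

Rule = Callable[[str], Tuple[Dict[str, Any], str]]


def run_rules(content: str, rules: List[Tuple[str, Rule]]):
    """Apply rules in order; skip any rule that raises and record why."""
    metadata: Dict[str, Any] = {}
    status: Dict[str, str] = {}
    for name, rule in rules:
        try:
            extracted, new_content = rule(content)
        except Exception as exc:  # a failing rule must not abort ingestion
            status[name] = f"skipped: {exc}"
            continue
        metadata.update(extracted)  # merge metadata from successful rules
        content = new_content       # later rules see the transformed text
        status[name] = "ok"
    return metadata, content, status


def word_count(text: str):
    return {"words": len(text.split())}, text


def broken(text: str):
    raise RuntimeError("model unavailable")


def shout(text: str):
    return {}, text.upper()


metadata, content, status = run_rules(
    "hello rules",
    [("count", word_count), ("broken", broken), ("shout", shout)],
)
print(metadata, content, status)
```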
Next Steps
After setting up rules:
Monitor rule effectiveness through the extracted metadata
Refine rule prompts and schemas based on results
Consider implementing custom rules for specific use cases