Document processing rules allow you to transform content and extract structured metadata during document ingestion. Rules are processed sequentially and can be used to standardize content, extract information, or enhance documents with additional metadata.
Overview
Rules are applied during document ingestion, before the content is chunked and embedded. Each rule can either:
- Extract structured metadata from the content
- Transform the content using natural language instructions
Types of Rules
Metadata Extraction Rules
Metadata extraction rules use a schema to extract structured information from document content. This is useful for automatically categorizing documents, extracting key information, or standardizing metadata.
from databridge import DataBridge, MetadataExtractionRule
from pydantic import BaseModel

# Define your metadata schema
class ArticleMetadata(BaseModel):
    title: str
    author: str
    publication_date: str
    topics: list[str]

# Create the rule
metadata_rule = MetadataExtractionRule(schema=ArticleMetadata)

# Use during ingestion
db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your article content...",
    rules=[metadata_rule]
)

# The extracted metadata will be merged with any provided metadata
print(f"Extracted metadata: {doc.metadata}")
Natural Language Rules
Natural language rules allow you to transform content using plain English instructions. This is useful for content standardization, PII removal, or any other text transformation need.
from databridge import DataBridge, NaturalLanguageRule

# Create a rule to remove PII
pii_rule = NaturalLanguageRule(
    prompt="Remove all personally identifiable information including names, "
           "email addresses, phone numbers, and addresses. Replace each "
           "with an appropriate placeholder in square brackets."
)

# Create a rule to standardize format
format_rule = NaturalLanguageRule(
    prompt="Convert the text into a clear, professional tone. "
           "Break it into short paragraphs and add section headings "
           "where appropriate."
)

# Apply multiple rules during ingestion
db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your content with PII...",
    rules=[pii_rule, format_rule]  # Rules are applied in order
)
Rule Configuration
Rules use language models for processing. You can configure which provider and model to use through environment variables:
# Use OpenAI (default)
RULES_PROVIDER=openai
RULES_MODEL=gpt-3.5-turbo # or another OpenAI model
# Use Ollama
RULES_PROVIDER=ollama
RULES_MODEL=llama2 # or another Ollama model
COMPLETION_OLLAMA_BASE_URL=http://localhost:11434 # Ollama API URL
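For reference, here is a minimal sketch of how such variables might be consumed on the Python side; the os.getenv calls and fallback defaults below are illustrative assumptions, not DataBridge internals:

import os

# Illustrative only: read the rule-processing settings with assumed defaults
provider = os.getenv("RULES_PROVIDER", "openai")
model = os.getenv("RULES_MODEL", "gpt-3.5-turbo")
ollama_url = os.getenv("COMPLETION_OLLAMA_BASE_URL", "http://localhost:11434")

print(f"Rules will be processed by {provider}/{model}")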
Common Use Cases
PII Detection and Removal
pii_rule = NaturalLanguageRule(
    prompt="Remove all PII. Replace names with [NAME], emails with [EMAIL], "
           "phone numbers with [PHONE], and addresses with [ADDRESS]."
)
Content Summarization
summary_rule = NaturalLanguageRule(
    prompt="Create a concise summary of the main points. "
           "The summary should be 2-3 sentences long."
)
Format Standardization
standardize_rule = NaturalLanguageRule(
    prompt="Standardize the text format: "
           "1. Use consistent capitalization "
           "2. Break into clear paragraphs "
           "3. Use bullet points for lists "
           "4. Add appropriate section headings"
)
Defining Custom Rules
DataBridge allows you to define your own custom rules for specialized document processing needs. You can implement rules at the core level or as simplified SDK wrappers.
Core Implementation
To implement a custom rule in the core system, create a new class that inherits from BaseRule and implements the apply method:
from typing import Any, Dict, Literal

from core.models.rules import BaseRule

class KeywordExtractionRule(BaseRule):
    """Rule for extracting keywords using custom logic"""
    type: Literal["keyword_extraction"]
    min_length: int = 4     # Minimum word length
    max_keywords: int = 10  # Maximum keywords to extract

    async def apply(self, content: str) -> tuple[Dict[str, Any], str]:
        """Extract keywords from content"""
        # Your keyword extraction logic here
        words = content.split()
        keywords = [
            word for word in words
            if len(word) >= self.min_length
        ][:self.max_keywords]
        # Return a tuple of (metadata, modified_content)
        return {"keywords": keywords}, content

# Example usage in core (apply is async, so call it from an async context):
rule = KeywordExtractionRule(
    type="keyword_extraction",
    min_length=5,
    max_keywords=15
)
metadata, modified_text = await rule.apply("Your document content...")
SDK Implementation
For the Python SDK, you can create a simplified wrapper that generates the rule configuration:
from typing import Any, Dict

from databridge.rules import Rule

class KeywordExtractor(Rule):
    """SDK wrapper for keyword extraction rule"""

    def __init__(self, min_length: int = 4, max_keywords: int = 10):
        self.min_length = min_length
        self.max_keywords = max_keywords

    def to_dict(self) -> Dict[str, Any]:
        """Convert rule to dictionary format for API"""
        return {
            "type": "keyword_extraction",
            "min_length": self.min_length,
            "max_keywords": self.max_keywords
        }

# Example usage in SDK:
from databridge import DataBridge

db = DataBridge("your-uri")
rule = KeywordExtractor(min_length=5, max_keywords=15)
doc = db.ingest_text(
    content="Your content...",
    rules=[rule]
)
Example: Custom Metadata Extraction
Here's a complete example of implementing a custom metadata extraction rule for research papers:
# Core Implementation
from typing import Any, Dict, Literal

from pydantic import BaseModel

from core.models.rules import BaseRule

class ResearchPaperMetadata(BaseModel):
    """Schema for research paper metadata"""
    title: str
    authors: list[str]
    abstract: str
    keywords: list[str]
    citations: list[str]

class ResearchPaperRule(BaseRule):
    """Rule for extracting research paper metadata"""
    type: Literal["research_paper"]

    async def apply(self, content: str) -> tuple[Dict[str, Any], str]:
        # Example implementation - you'd want more robust parsing
        sections = content.split("\n\n")

        # Extract basic metadata
        metadata = {
            "title": sections[0].strip(),
            "authors": self._extract_authors(content),
            "abstract": self._find_section(content, "Abstract"),
            "keywords": self._extract_keywords(content),
            "citations": self._extract_citations(content)
        }
        return metadata, content

    def _extract_authors(self, content: str) -> list[str]:
        # Implementation details...
        return []

    def _find_section(self, content: str, section: str) -> str:
        # Implementation details...
        return ""

    def _extract_keywords(self, content: str) -> list[str]:
        # Implementation details...
        return []

    def _extract_citations(self, content: str) -> list[str]:
        # Implementation details...
        return []

# SDK Implementation
from databridge.rules import Rule

class ResearchPaperExtractor(Rule):
    """SDK wrapper for research paper metadata extraction"""

    def to_dict(self) -> Dict[str, Any]:
        return {"type": "research_paper"}

# Usage Example
from databridge import DataBridge

# Using the SDK
db = DataBridge("your-uri")
paper_rule = ResearchPaperExtractor()
doc = db.ingest_text(
    content="Your research paper...",
    rules=[paper_rule]
)

# The extracted metadata will be available in doc.metadata
print(f"Title: {doc.metadata.get('title')}")
print(f"Authors: {doc.metadata.get('authors')}")
print(f"Keywords: {doc.metadata.get('keywords')}")
Best Practices for Custom Rules
Core Implementation:
- Inherit from BaseRule and use Pydantic for validation
- Implement clear error handling in the apply method (see the sketch after this list)
- Return both metadata and modified content
- Use type hints and docstrings
- Keep rules focused and single-purpose
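A minimal sketch of graceful error handling in apply, assuming the rule should fall back to returning the content unchanged; the fallback policy shown here is a suggestion, not a DataBridge requirement:

from typing import Any, Dict, Literal

from core.models.rules import BaseRule

class SafeKeywordRule(BaseRule):
    """Hypothetical rule that degrades gracefully on failure"""
    type: Literal["safe_keyword"]

    async def apply(self, content: str) -> tuple[Dict[str, Any], str]:
        try:
            keywords = [w for w in content.split() if len(w) >= 4][:10]
            return {"keywords": keywords}, content
        except Exception as exc:
            # On failure, surface the error in metadata and leave the
            # content untouched so ingestion can continue
            return {"rule_error": str(exc)}, content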
SDK Implementation:
- Keep the interface simple and intuitive
- Use descriptive class names
- Implement to_dict() to match the core rule format
- Provide clear documentation and examples
Testing:
- Test both success and failure cases (see the example below)
- Verify metadata extraction accuracy
- Check content transformation correctness
- Ensure proper error handling
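For example, a short pytest sketch for the KeywordExtractionRule shown earlier; it assumes pytest-asyncio is installed, and the import path is a placeholder for wherever you define the rule:

import pytest

from myproject.rules import KeywordExtractionRule  # hypothetical module path

@pytest.mark.asyncio
async def test_keyword_extraction_filters_short_words():
    rule = KeywordExtractionRule(type="keyword_extraction", min_length=5, max_keywords=2)
    metadata, modified = await rule.apply("tiny words dominate lengthy documents")
    assert metadata["keywords"] == ["words", "dominate"]
    assert modified == "tiny words dominate lengthy documents"  # content unchanged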
Best Practices
Rule Ordering:
- Apply metadata extraction rules first to capture information from the original content
- Apply transformation rules afterward in order of importance
- Put format standardization rules last (see the combined example below)
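Putting the three guidelines together, reusing the rules defined earlier in this guide:

from databridge import DataBridge

db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your article content...",
    rules=[
        metadata_rule,     # 1. Extract metadata from the original content
        pii_rule,          # 2. Transform: remove PII
        standardize_rule,  # 3. Standardize the format last
    ]
)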
Performance:
- Use specific, clear prompts for natural language rules
- Keep metadata schemas focused on essential information
- Consider the processing time when chaining multiple rules (a simple way to measure this is shown below)
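A simple way to measure the cost of a rule chain is to time the ingestion call itself, using the rules defined earlier:

import time

start = time.perf_counter()
doc = db.ingest_text(
    content="Your content...",
    rules=[pii_rule, format_rule]
)
print(f"Ingestion with 2 rules took {time.perf_counter() - start:.2f}s")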
Error Handling:
- Rules that fail will be skipped, allowing the ingestion to continue
- Check the document's system metadata for processing status (see the sketch below)
- Monitor rule performance in production
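As a rough sketch of a post-ingestion status check; the system_metadata attribute and its field names are assumptions to verify against your DataBridge version:

doc = db.ingest_text(content="Your content...", rules=[pii_rule])

# Assumption: the returned document carries system metadata with a status field
status = getattr(doc, "system_metadata", {}) or {}
if status.get("status") == "failed":
    print(f"Rule processing issue: {status.get('error')}")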
Next Steps
After setting up rules:
- Monitor rule effectiveness through the extracted metadata
- Refine rule prompts and schemas based on results
- Consider implementing custom rules for specific use cases