Processing Rules

Document processing rules allow you to transform content and extract structured metadata during document ingestion. Rules are processed sequentially and can be used to standardize content, extract information, or enhance documents with additional metadata.

Overview

Rules are applied during document ingestion, before the content is chunked and embedded. Each rule can either:

  • Extract structured metadata from the content

  • Transform the content using natural language instructions

Types of Rules

Metadata Extraction Rules

Metadata extraction rules use a schema to extract structured information from document content. This is useful for automatically categorizing documents, extracting key information, or standardizing metadata.

from databridge import DataBridge, MetadataExtractionRule
from pydantic import BaseModel

# Define your metadata schema
class ArticleMetadata(BaseModel):
    title: str
    author: str
    publication_date: str
    topics: list[str]

# Create the rule
metadata_rule = MetadataExtractionRule(schema=ArticleMetadata)

# Use during ingestion
db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your article content...",
    rules=[metadata_rule]
)

# The extracted metadata will be merged with any provided metadata
print(f"Extracted metadata: {doc.metadata}")

You can also specify the schema as a dictionary:

metadata_rule = MetadataExtractionRule(schema={
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "publication_date": {"type": "string"},
        "topics": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
})

Natural Language Rules

Natural language rules allow you to transform content using plain English instructions. This is useful for content standardization, PII removal, or any other text transformation need.

from databridge import DataBridge, NaturalLanguageRule

# Create a rule to remove PII
pii_rule = NaturalLanguageRule(
    prompt="Remove all personal identifiable information including names, "
           "email addresses, phone numbers, and addresses. Replace each "
           "with an appropriate placeholder in square brackets."
)

# Create a rule to standardize format
format_rule = NaturalLanguageRule(
    prompt="Convert the text into a clear, professional tone. "
           "Break it into short paragraphs and add section headings "
           "where appropriate."
)

# Apply multiple rules during ingestion
db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your content with PII...",
    rules=[pii_rule, format_rule]  # Rules are applied in order
)
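Because rules run in sequence, you can also mix both rule types in a single call. A sketch reusing the ArticleMetadata rule from above, so metadata is extracted from the original text before any transformation is applied:

db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your article content with PII...",
    rules=[
        MetadataExtractionRule(schema=ArticleMetadata),  # runs on the original text
        pii_rule,     # then removes PII
        format_rule   # finally standardizes the cleaned text
    ]
)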

Rule Configuration

Rules use language models for processing. You can configure which provider and model to use through environment variables:

# Use OpenAI (default)
RULES_PROVIDER=openai
RULES_MODEL=gpt-3.5-turbo  # or another OpenAI model

# Use Ollama
RULES_PROVIDER=ollama
RULES_MODEL=llama2  # or another Ollama model
COMPLETION_OLLAMA_BASE_URL=http://localhost:11434  # Ollama API URL

Common Use Cases

PII Detection and Removal

pii_rule = NaturalLanguageRule(
    prompt="Remove all PII. Replace names with [NAME], emails with [EMAIL], "
           "phone numbers with [PHONE], and addresses with [ADDRESS]."
)

Content Classification

class ContentClassification(BaseModel):
    category: str
    subcategory: str
    complexity_level: str  # "basic", "intermediate", "advanced"
    target_audience: list[str]
    key_concepts: list[str]

classify_rule = MetadataExtractionRule(schema=ContentClassification)
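As with the earlier examples, the extracted classification fields end up in the document's metadata:

db = DataBridge("your-uri")
doc = db.ingest_text(
    content="Your tutorial content...",
    rules=[classify_rule]
)

print(doc.metadata.get("category"))
print(doc.metadata.get("complexity_level"))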

Text Summarization

summary_rule = NaturalLanguageRule(
    prompt="Create a concise summary of the main points. "
           "The summary should be 2-3 sentences long."
)

Format Standardization

standardize_rule = NaturalLanguageRule(
    prompt="Standardize the text format: "
           "1. Use consistent capitalization "
           "2. Break into clear paragraphs "
           "3. Use bullet points for lists "
           "4. Add appropriate section headings"
)

Defining Custom Rules

DataBridge allows you to define your own custom rules for specialized document processing needs. You can implement rules either at the core level or as simplified SDK wrappers.

Core Implementation

To implement a custom rule in the core system, create a new class that inherits from BaseRule and implements the apply method:

from typing import Dict, Any, Literal
from core.models.rules import BaseRule

class KeywordExtractionRule(BaseRule):
    """Rule for extracting keywords using custom logic"""
    
    type: Literal["keyword_extraction"]
    min_length: int = 4  # Minimum word length
    max_keywords: int = 10  # Maximum keywords to extract
    
    async def apply(self, content: str) -> tuple[Dict[str, Any], str]:
        """Extract keywords from content"""
        # Your keyword extraction logic here
        words = content.split()
        keywords = [
            word for word in words 
            if len(word) >= self.min_length
        ][:self.max_keywords]
        
        # Return tuple of (metadata, modified_content)
        return {"keywords": keywords}, content

# Example usage in core:
rule = KeywordExtractionRule(
    type="keyword_extraction",
    min_length=5,
    max_keywords=15
)
metadata, modified_text = await rule.apply("Your document content...")
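If you want to exercise a custom rule outside of an async context, for example while developing it, you can drive apply with asyncio.run (a quick-check sketch, not part of the core API):

import asyncio

# Run the rule's apply coroutine directly and inspect the result
metadata, modified_text = asyncio.run(
    rule.apply("Your document content about keyword extraction...")
)
print(metadata["keywords"])  # at most 15 words of length >= 5
print(modified_text)         # this rule returns the content unchanged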

SDK Implementation

For the Python SDK, you can create a simplified wrapper that generates the rule configuration; the wrapper only emits the configuration dictionary, and the matching core rule (identified by its type) performs the actual processing:

from databridge.rules import Rule
from typing import Dict, Any

class KeywordExtractor(Rule):
    """SDK wrapper for keyword extraction rule"""
    
    def __init__(self, min_length: int = 4, max_keywords: int = 10):
        self.min_length = min_length
        self.max_keywords = max_keywords
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert rule to dictionary format for API"""
        return {
            "type": "keyword_extraction",
            "min_length": self.min_length,
            "max_keywords": self.max_keywords
        }

# Example usage in SDK:
from databridge import DataBridge

db = DataBridge("your-uri")
rule = KeywordExtractor(min_length=5, max_keywords=15)

doc = db.ingest_text(
    content="Your content...",
    rules=[rule]
)

Example: Custom Metadata Extraction

Here's a complete example of implementing a custom metadata extraction rule for research papers:

# Core Implementation
from typing import Dict, Any, Literal
from pydantic import BaseModel
from core.models.rules import BaseRule
import re

class ResearchPaperMetadata(BaseModel):
    """Schema for research paper metadata"""
    title: str
    authors: list[str]
    abstract: str
    keywords: list[str]
    citations: list[str]

class ResearchPaperRule(BaseRule):
    """Rule for extracting research paper metadata"""
    
    type: Literal["research_paper"]
    
    async def apply(self, content: str) -> tuple[Dict[str, Any], str]:
        # Example implementation - you'd want more robust parsing
        sections = content.split("\n\n")
        
        # Extract basic metadata
        metadata = {
            "title": sections[0].strip(),
            "authors": self._extract_authors(content),
            "abstract": self._find_section(content, "Abstract"),
            "keywords": self._extract_keywords(content),
            "citations": self._extract_citations(content)
        }
        
        return metadata, content
    
    def _extract_authors(self, content: str) -> list[str]:
        # Implementation details...
        return []
    
    def _find_section(self, content: str, section: str) -> str:
        # Implementation details...
        return ""
    
    def _extract_keywords(self, content: str) -> list[str]:
        # Implementation details...
        return []
    
    def _extract_citations(self, content: str) -> list[str]:
        # Implementation details...
        return []

# SDK Implementation
from databridge.rules import Rule

class ResearchPaperExtractor(Rule):
    """SDK wrapper for research paper metadata extraction"""
    
    def to_dict(self) -> Dict[str, Any]:
        return {"type": "research_paper"}

# Usage Example
from databridge import DataBridge

# Using the SDK
db = DataBridge("your-uri")
paper_rule = ResearchPaperExtractor()

doc = db.ingest_text(
    content="Your research paper...",
    rules=[paper_rule]
)

# The extracted metadata will be available in doc.metadata
print(f"Title: {doc.metadata.get('title')}")
print(f"Authors: {doc.metadata.get('authors')}")
print(f"Keywords: {doc.metadata.get('keywords')}")

Best Practices for Custom Rules

  1. Core Implementation:

    • Inherit from BaseRule and use Pydantic for validation

    • Implement clear error handling in the apply method

    • Return both metadata and modified content

    • Use type hints and docstrings

    • Keep rules focused and single-purpose

  2. SDK Implementation:

    • Keep the interface simple and intuitive

    • Use descriptive class names

    • Implement to_dict() to match core rule format

    • Provide clear documentation and examples

  3. Testing (see the test sketch after this list):

    • Test both success and failure cases

    • Verify metadata extraction accuracy

    • Check content transformation correctness

    • Ensure proper error handling
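A test for the KeywordExtractionRule defined earlier might look like the following sketch (the import path is illustrative; point it at wherever your rule module lives):

import asyncio

from my_rules import KeywordExtractionRule  # hypothetical import path


def test_keyword_extraction_success():
    rule = KeywordExtractionRule(type="keyword_extraction", min_length=5, max_keywords=3)
    metadata, modified = asyncio.run(
        rule.apply("tiny huge gigantic enormous words here")
    )

    # Only words of length >= 5 are kept, capped at max_keywords
    assert metadata["keywords"] == ["gigantic", "enormous", "words"]
    # This rule leaves the content itself unchanged
    assert modified == "tiny huge gigantic enormous words here"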

Best Practices

  1. Rule Ordering:

    • Apply metadata extraction rules first to capture information from the original content

    • Apply transformation rules afterward in order of importance

    • Put format standardization rules last

  2. Performance:

    • Use specific, clear prompts for natural language rules

    • Keep metadata schemas focused on essential information

    • Consider the processing time when chaining multiple rules

  3. Error Handling:

    • Rules that fail will be skipped, allowing the ingestion to continue

    • Check the document's system metadata for processing status (see the sketch after this list)

    • Monitor rule performance in production
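For example, after ingestion you might inspect the document's system metadata to confirm that processing completed. A sketch, assuming the returned document exposes a system_metadata mapping (exact field names may differ across versions):

doc = db.ingest_text(
    content="Your content...",
    rules=[pii_rule, format_rule]
)

# Hypothetical status check; inspect the mapping to see what your version records
print(doc.system_metadata)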

Next Steps

After setting up rules:

  • Monitor rule effectiveness through the extracted metadata

  • Refine rule prompts and schemas based on results

  • Consider implementing custom rules for specific use cases
