Semantic Search

This guide explains how to perform semantic search operations using DataBridge.

Basic Setup

First, install the DataBridge client:

pip install databridge-client

Initialize the client:

from databridge import DataBridge, AsyncDataBridge

# Synchronous client
db = DataBridge("your-uri")

# Asynchronous client
async_db = AsyncDataBridge("your-uri")

Search Operations

DataBridge provides two main search operations:

  1. Chunk-level Search (retrieve_chunks): Returns individual text chunks with their relevance scores

  2. Document-level Search (retrieve_docs): Returns complete documents with aggregated relevance scores

Search for specific chunks of text:

# Synchronous search
chunks = db.retrieve_chunks(
    query="What are the key findings about customer satisfaction?",
    filters={"department": "research"},
    k=4,  # Number of results
    min_score=0.7,  # Minimum similarity threshold
    use_reranking=True  # Enable reranking for better relevance
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content}")
    print(f"Document ID: {chunk.document_id}")
    print(f"Chunk Number: {chunk.chunk_number}")
    print(f"Metadata: {chunk.metadata}")
    print("---")

# Asynchronous search
async with AsyncDataBridge("your-uri") as db:
    chunks = await db.retrieve_chunks(
        query="What are the key findings?",
        filters={"department": "research"},
        k=4,
        min_score=0.7,
        use_reranking=True
    )
    
    for chunk in chunks:
        print(f"Score: {chunk.score}")
        print(f"Content: {chunk.content}")

Search for complete documents:

# Synchronous search
docs = db.retrieve_docs(
    query="machine learning applications",
    filters={"year": 2024},
    k=4,  # Number of results
    min_score=0.7,  # Minimum similarity threshold
    use_reranking=True  # Enable reranking for better relevance
)

for doc in docs:
    print(f"Score: {doc.score}")
    print(f"Document ID: {doc.document_id}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content: {doc.content}")
    print("---")

# Asynchronous search
async with AsyncDataBridge("your-uri") as db:
    docs = await db.retrieve_docs(
        query="machine learning",
        filters={"year": 2024},
        k=4,
        min_score=0.7,
        use_reranking=True
    )
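    # Iterate results just as in the synchronous example above
    for doc in docs:
        print(f"Score: {doc.score}")
        print(f"Document ID: {doc.document_id}")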

Understanding Search Results

Two-Stage Ranking

DataBridge uses a two-stage ranking process for optimal search results (a conceptual sketch follows this list):

  1. Vector Similarity (First Stage)

    • Query is converted to an embedding vector

    • Cosine similarity is computed with document chunks

    • Initial ranking based on vector similarity

    • Fast but may miss some semantic nuances

  2. Neural Reranking (Second Stage, Optional)

    • Top results from vector search are reranked

    • Uses a specialized neural model for scoring

    • More accurate but computationally intensive

    • Can be enabled with use_reranking=True
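To make the pipeline concrete, here is a minimal sketch of the two-stage idea. This is not the DataBridge implementation: the rerank_score callable and the candidate-pool size are hypothetical stand-ins, and the point is only the shape of the computation.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_stage_search(query_vec, chunk_vecs, chunks, rerank_score, k=4):
    # Stage 1: rank every chunk by vector similarity (fast, approximate)
    scored = sorted(
        zip(chunks, (cosine_similarity(query_vec, v) for v in chunk_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    candidates = scored[: k * 5]  # keep a wider pool for the reranker

    # Stage 2: rescore only the top candidates with a neural reranker
    # (slower but more accurate, so it runs on a small pool)
    reranked = sorted(
        ((chunk, rerank_score(chunk)) for chunk, _ in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:k]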

Similarity Scores

Similarity scores indicate how well a chunk or document matches your query:

  • Score Range: 0.0 to 1.0

    • 1.0: Perfect match

    • 0.0: No similarity

  • Typical Score Ranges:

    • 0.9 - 1.0: Near-exact semantic match

    • 0.8 - 0.9: Very strong semantic similarity

    • 0.7 - 0.8: Strong semantic similarity

    • 0.6 - 0.7: Moderate semantic similarity

    • < 0.6: Weak semantic similarity

Example of using similarity scores:

# High-confidence results only
chunks = db.retrieve_chunks(
    query="quantum computing applications",
    min_score=0.8,  # Only very strong matches
    use_reranking=True  # Enable reranking for accuracy
)

# Group results by confidence
def group_by_confidence(chunks):
    groups = {
        "very_high": [],  # 0.9 - 1.0
        "high": [],       # 0.8 - 0.9
        "medium": [],     # 0.7 - 0.8
        "low": []         # < 0.7
    }
    
    for chunk in chunks:
        if chunk.score >= 0.9:
            groups["very_high"].append(chunk)
        elif chunk.score >= 0.8:
            groups["high"].append(chunk)
        elif chunk.score >= 0.7:
            groups["medium"].append(chunk)
        else:
            groups["low"].append(chunk)
    
    return groups

results = group_by_confidence(chunks)

Document Content Types

The DocumentContent type represents either a URL or direct content string:

from databridge.models import DocumentContent

# URL content type (for large documents)
url_content = DocumentContent(
    type="url",
    value="https://example.com/document.pdf",
    filename="document.pdf"  # Required for URLs
)

# String content type (for small documents)
text_content = DocumentContent(
    type="string",
    value="Document text content..."
    # filename not allowed for string type
)

When retrieving documents, the content field will be one of these types:

docs = db.retrieve_docs("machine learning")
for doc in docs:
    if doc.content.type == "url":
        print(f"Download URL: {doc.content.value}")
        print(f"Filename: {doc.content.filename}")
    else:
        print(f"Content: {doc.content.value}")

Array Operations

Metadata filters support MongoDB-style operators. Array operators match against list-valued fields:

# Array contains
filters = {"tags": {"$in": ["ml", "ai"]}}

# Array match all
filters = {"required_skills": {"$all": ["python", "ml"]}}

# Array size
filters = {"authors": {"$size": 2}}

Existence Checks

# Field exists
filters = {"review_date": {"$exists": True}}

# Field is null
filters = {"review_date": None}

Next Steps

After finding relevant documents or chunks:
