This guide explains how to perform semantic search operations using DataBridge.
## Basic Setup

First, install the DataBridge client:

```bash
pip install databridge-client
```

Initialize the client:

```python
from databridge import DataBridge, AsyncDataBridge

# Synchronous client
db = DataBridge("your-uri")

# Asynchronous client
async_db = AsyncDataBridge("your-uri")
```
## Search Operations

DataBridge provides two main search operations:

- **Chunk-level search** (`retrieve_chunks`): returns individual text chunks with their relevance scores
- **Document-level search** (`retrieve_docs`): returns complete documents with aggregated relevance scores
### Chunk-level Search

Search for specific chunks of text:

```python
# Synchronous search
chunks = db.retrieve_chunks(
    query="What are the key findings about customer satisfaction?",
    filters={"department": "research"},
    k=4,                 # Number of results
    min_score=0.7,       # Minimum similarity threshold
    use_reranking=True   # Enable reranking for better relevance
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content}")
    print(f"Document ID: {chunk.document_id}")
    print(f"Chunk Number: {chunk.chunk_number}")
    print(f"Metadata: {chunk.metadata}")
    print("---")
```

```python
# Asynchronous search
async with AsyncDataBridge("your-uri") as db:
    chunks = await db.retrieve_chunks(
        query="What are the key findings?",
        filters={"department": "research"},
        k=4,
        min_score=0.7,
        use_reranking=True
    )
    for chunk in chunks:
        print(f"Score: {chunk.score}")
        print(f"Content: {chunk.content}")
```
### Document-level Search

Search for complete documents:

```python
# Synchronous search
docs = db.retrieve_docs(
    query="machine learning applications",
    filters={"year": 2024},
    k=4,                 # Number of results
    min_score=0.7,       # Minimum similarity threshold
    use_reranking=True   # Enable reranking for better relevance
)

for doc in docs:
    print(f"Score: {doc.score}")
    print(f"Document ID: {doc.document_id}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content: {doc.content}")
    print("---")
```

```python
# Asynchronous search
async with AsyncDataBridge("your-uri") as db:
    docs = await db.retrieve_docs(
        query="machine learning",
        filters={"year": 2024},
        k=4,
        min_score=0.7,
        use_reranking=True
    )
```
## Understanding Search Results

### Two-Stage Ranking

DataBridge uses a two-stage ranking process for optimal search results:

1. **Vector similarity (first stage)**
   - The query is converted to an embedding vector
   - Cosine similarity is computed against document chunks
   - Initial ranking is based on vector similarity
   - Fast, but may miss some semantic nuances
2. **Neural reranking (second stage, optional)**
   - Top results from vector search are reranked
   - Uses a specialized neural model for scoring
   - More accurate but computationally intensive
   - Enabled with `use_reranking=True`
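The two stages above can be sketched in plain Python. This is an illustrative sketch only, not DataBridge internals: `two_stage_rank` and `rerank_fn` are hypothetical names, and the embeddings are toy vectors standing in for real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def two_stage_rank(query_vec, chunks, rerank_fn=None, k=4, min_score=0.0):
    """Stage 1: rank chunks by cosine similarity to the query vector.
    Stage 2 (optional): rescore only the surviving top-k with a reranker.

    `chunks` is a list of (text, embedding_vector) pairs.
    """
    # Stage 1: score everything by vector similarity, filter, and sort
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored = [(score, text) for score, text in scored if score >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]

    # Stage 2: the (expensive) reranker sees only the top-k candidates
    if rerank_fn is not None:
        top = sorted(
            ((rerank_fn(text), text) for _, text in top),
            key=lambda pair: pair[0],
            reverse=True,
        )
    return top
```

Because the reranker only scores the `k` candidates that survive stage 1, the expensive model runs a constant number of times per query regardless of corpus size.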
### Similarity Scores

Similarity scores indicate how well a chunk or document matches your query:

- **Score range:** 0.0 to 1.0
  - 1.0: perfect match
  - 0.0: no similarity
- **Typical score ranges:**
  - 0.9 - 1.0: near-exact semantic match
  - 0.8 - 0.9: very strong semantic similarity
  - 0.7 - 0.8: strong semantic similarity
  - 0.6 - 0.7: moderate semantic similarity
  - < 0.6: weak semantic similarity
Example of using similarity scores:

```python
# High-confidence results only
chunks = db.retrieve_chunks(
    query="quantum computing applications",
    min_score=0.8,      # Only very strong matches
    use_reranking=True  # Enable reranking for accuracy
)

# Group results by confidence
def group_by_confidence(chunks):
    groups = {
        "very_high": [],  # 0.9 - 1.0
        "high": [],       # 0.8 - 0.9
        "medium": [],     # 0.7 - 0.8
        "low": []         # < 0.7
    }
    for chunk in chunks:
        if chunk.score >= 0.9:
            groups["very_high"].append(chunk)
        elif chunk.score >= 0.8:
            groups["high"].append(chunk)
        elif chunk.score >= 0.7:
            groups["medium"].append(chunk)
        else:
            groups["low"].append(chunk)
    return groups

results = group_by_confidence(chunks)
```
## Document Content Types

The `DocumentContent` type represents either a URL or a direct content string:

```python
from databridge.models import DocumentContent

# URL content type (for large documents)
url_content = DocumentContent(
    type="url",
    value="https://example.com/document.pdf",
    filename="document.pdf"  # Required for URLs
)

# String content type (for small documents)
text_content = DocumentContent(
    type="string",
    value="Document text content..."
    # filename not allowed for string type
)
```
When retrieving documents, the `content` field will be one of these types:

```python
docs = db.retrieve_docs("machine learning")

for doc in docs:
    if doc.content.type == "url":
        print(f"Download URL: {doc.content.value}")
        print(f"Filename: {doc.content.filename}")
    else:
        print(f"Content: {doc.content.value}")
```