# Semantic Search

This guide shows how to run semantic searches with DataBridge.

## Basic Setup

First, install the DataBridge client:

```bash
pip install databridge-client
```

Initialize the client:

```python
from databridge import DataBridge, AsyncDataBridge

# Synchronous client
db = DataBridge("your-uri")

# Asynchronous client
async_db = AsyncDataBridge("your-uri")
```

## Search Operations

DataBridge provides two main search operations:

1. **Chunk-level Search** (`retrieve_chunks`): Returns individual text chunks with their relevance scores
2. **Document-level Search** (`retrieve_docs`): Returns complete documents with aggregated relevance scores
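
This guide does not say how chunk scores are aggregated into a document score. Purely as an illustration, the sketch below assumes the simplest possible rule: a document inherits the score of its best chunk. `ChunkHit` and `aggregate_by_document` are hypothetical names, not part of the DataBridge client, and the real aggregation may differ.

```python
from dataclasses import dataclass


@dataclass
class ChunkHit:
    """Hypothetical stand-in for a chunk result (not a DataBridge class)."""
    document_id: str
    score: float


def aggregate_by_document(hits):
    """Assumed rule: a document's score is the score of its best chunk."""
    best = {}
    for hit in hits:
        if hit.score > best.get(hit.document_id, float("-inf")):
            best[hit.document_id] = hit.score
    # Highest-scoring documents first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


hits = [
    ChunkHit("doc-a", 0.72),
    ChunkHit("doc-b", 0.91),
    ChunkHit("doc-a", 0.85),
]
ranked_docs = aggregate_by_document(hits)
print(ranked_docs)  # [('doc-b', 0.91), ('doc-a', 0.85)]
```

Other rules (mean, sum of the top few chunks) would rank long documents differently, so treat document scores as comparable only within a single query.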

### Chunk-level Search

Search for specific chunks of text:

```python
# Synchronous search
chunks = db.retrieve_chunks(
    query="What are the key findings about customer satisfaction?",
    filters={"department": "research"},
    k=4,  # Number of results
    min_score=0.7,  # Minimum similarity threshold
    use_reranking=True  # Enable reranking for better relevance
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content}")
    print(f"Document ID: {chunk.document_id}")
    print(f"Chunk Number: {chunk.chunk_number}")
    print(f"Metadata: {chunk.metadata}")
    print("---")

# Asynchronous search: `async with` and `await` must run inside a coroutine
import asyncio

async def search_chunks_async():
    async with AsyncDataBridge("your-uri") as async_db:
        chunks = await async_db.retrieve_chunks(
            query="What are the key findings?",
            filters={"department": "research"},
            k=4,
            min_score=0.7,
            use_reranking=True
        )

        for chunk in chunks:
            print(f"Score: {chunk.score}")
            print(f"Content: {chunk.content}")

asyncio.run(search_chunks_async())
```

### Document-level Search

Search for complete documents:

```python
# Synchronous search
docs = db.retrieve_docs(
    query="machine learning applications",
    filters={"year": 2024},
    k=4,  # Number of results
    min_score=0.7,  # Minimum similarity threshold
    use_reranking=True  # Enable reranking for better relevance
)

for doc in docs:
    print(f"Score: {doc.score}")
    print(f"Document ID: {doc.document_id}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content: {doc.content}")
    print("---")

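# Asynchronous search: `async with` and `await` must run inside a coroutine
import asyncio

async def search_docs_async():
    async with AsyncDataBridge("your-uri") as async_db:
        docs = await async_db.retrieve_docs(
            query="machine learning",
            filters={"year": 2024},
            k=4,
            min_score=0.7,
            use_reranking=True
        )
        return docs

docs = asyncio.run(search_docs_async())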
```

## Understanding Search Results

### Two-Stage Ranking

DataBridge ranks search results in two stages:

1. **Vector Similarity** (First Stage)
   * Query is converted to an embedding vector
   * Cosine similarity is computed with document chunks
   * Initial ranking based on vector similarity
   * Fast but may miss some semantic nuances
2. **Neural Reranking** (Second Stage, Optional)
   * Top results from vector search are reranked
   * Uses a specialized neural model for scoring
   * More accurate but computationally intensive
   * Can be enabled with `use_reranking=True`
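
The two stages above can be sketched in plain Python. This is a toy model, not DataBridge's implementation: the embeddings are hand-picked 2-dimensional vectors, `two_stage_search` is a hypothetical helper, and `toy_reranker` merely counts shared words where a real system would call a neural model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def two_stage_search(query_vec, chunks, rerank_score, k=2, candidates=3):
    """Stage 1: rank by cosine similarity. Stage 2: rerank the shortlist."""
    # Stage 1: cheap vector similarity over every chunk
    by_similarity = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    shortlist = by_similarity[:candidates]
    # Stage 2: the more expensive scorer only sees the shortlist
    reranked = sorted(shortlist, key=rerank_score, reverse=True)
    return reranked[:k]


# Toy data: embeddings chosen by hand for illustration
chunks = [
    {"id": "a", "embedding": [1.0, 0.0], "text": "customer satisfaction survey"},
    {"id": "b", "embedding": [0.9, 0.1], "text": "key findings on satisfaction"},
    {"id": "c", "embedding": [0.0, 1.0], "text": "office seating plan"},
    {"id": "d", "embedding": [0.7, 0.7], "text": "findings summary"},
]


def toy_reranker(chunk):
    """Stand-in for a neural reranker: counts query words in the text."""
    query_words = {"key", "findings", "satisfaction"}
    return len(query_words & set(chunk["text"].split()))


top = two_stage_search([1.0, 0.0], chunks, toy_reranker)
top_ids = [c["id"] for c in top]
print(top_ids)  # ['b', 'a']
```

Chunk `a` wins the first stage on vector similarity, but the reranker promotes `b` because it inspects the text itself. Chunk `c` never reaches the second stage, which is why reranking cannot recover results that the vector search ranked too low.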

### Similarity Scores

Similarity scores indicate how well a chunk or document matches your query:

* **Score Range**: 0.0 to 1.0
  * 1.0: Perfect match
  * 0.0: No similarity
* **Typical Score Ranges**:
  * 0.9 - 1.0: Near-exact semantic match
  * 0.8 - 0.9: Very strong semantic similarity
  * 0.7 - 0.8: Strong semantic similarity
  * 0.6 - 0.7: Moderate semantic similarity
  * < 0.6: Weak semantic similarity

Example of using similarity scores:

```python
# High-confidence results only
chunks = db.retrieve_chunks(
    query="quantum computing applications",
    min_score=0.8,  # Only very strong matches
    use_reranking=True  # Enable reranking for accuracy
)

# Group results by confidence
def group_by_confidence(chunks):
    groups = {
        "very_high": [],  # 0.9 - 1.0
        "high": [],       # 0.8 - 0.9
        "medium": [],     # 0.7 - 0.8
        "low": []         # < 0.7
    }
    
    for chunk in chunks:
        if chunk.score >= 0.9:
            groups["very_high"].append(chunk)
        elif chunk.score >= 0.8:
            groups["high"].append(chunk)
        elif chunk.score >= 0.7:
            groups["medium"].append(chunk)
        else:
            groups["low"].append(chunk)
    
    return groups

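# With min_score=0.8 above, only "very_high" and "high" can be non-empty;
# lower min_score if you also want the "medium" and "low" groups populated.
results = group_by_confidence(chunks)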
```

## Document Content Types

The `DocumentContent` type represents either a URL or direct content string:

```python
from databridge.models import DocumentContent

# URL content type (for large documents)
url_content = DocumentContent(
    type="url",
    value="https://example.com/document.pdf",
    filename="document.pdf"  # Required for URLs
)

# String content type (for small documents)
text_content = DocumentContent(
    type="string",
    value="Document text content..."
    # filename not allowed for string type
)
```

When retrieving documents, the content field will be one of these types:

```python
docs = db.retrieve_docs("machine learning")
for doc in docs:
    if doc.content.type == "url":
        print(f"Download URL: {doc.content.value}")
        print(f"Filename: {doc.content.filename}")
    else:
        print(f"Content: {doc.content.value}")
```
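
The `filename` rules stated in the code comments earlier in this section (required for URLs, not allowed for strings) can be expressed as a small validation sketch. `ContentSketch` is a hypothetical stand-in written for this guide; it is not the library's `DocumentContent` class, which may validate differently or raise other exception types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentSketch:
    """Hypothetical stand-in mirroring the documented filename rules."""
    type: str
    value: str
    filename: Optional[str] = None

    def __post_init__(self):
        if self.type not in ("url", "string"):
            raise ValueError(f"unknown content type: {self.type!r}")
        if self.type == "url" and not self.filename:
            raise ValueError("filename is required for URL content")
        if self.type == "string" and self.filename is not None:
            raise ValueError("filename is not allowed for string content")


def is_valid(**fields):
    """Return True if the fields pass the validation rules above."""
    try:
        ContentSketch(**fields)
        return True
    except ValueError:
        return False


url = "https://example.com/document.pdf"
print(is_valid(type="url", value=url, filename="document.pdf"))   # True
print(is_valid(type="url", value=url))                            # False
print(is_valid(type="string", value="Document text content..."))  # True
print(is_valid(type="string", value="text", filename="a.txt"))    # False
```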

### Array Operations

The `filters` argument also accepts operator expressions. For array-valued fields:

```python
# Array contains any of the listed values
filters = {"tags": {"$in": ["ml", "ai"]}}

# Array contains all of the listed values
filters = {"required_skills": {"$all": ["python", "ml"]}}

# Array has exactly this many elements
filters = {"authors": {"$size": 2}}
```
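
To make the intended matching behavior concrete, here is a local evaluator for the three operators. It assumes MongoDB-style semantics, which the operator names suggest but this guide does not confirm; `matches_array_filter` is a hypothetical helper, and DataBridge evaluates filters on the server rather than with code like this.

```python
def matches_array_filter(metadata, filters):
    """Return True if metadata satisfies every array condition in filters.

    Assumed semantics (MongoDB-style, not confirmed for DataBridge):
      $in   - the field shares at least one element with the given list
      $all  - the field contains every element of the given list
      $size - the field has exactly the given number of elements
    """
    for field, condition in filters.items():
        values = metadata.get(field)
        if not isinstance(values, list):
            return False  # this sketch only handles array-valued fields
        for op, operand in condition.items():
            if op == "$in":
                ok = any(v in operand for v in values)
            elif op == "$all":
                ok = all(item in values for item in operand)
            elif op == "$size":
                ok = len(values) == operand
            else:
                raise ValueError(f"unsupported operator: {op}")
            if not ok:
                return False
    return True


doc_metadata = {
    "tags": ["ml", "nlp"],
    "required_skills": ["python", "ml", "sql"],
    "authors": ["Ada", "Grace"],
}

print(matches_array_filter(doc_metadata, {"tags": {"$in": ["ml", "ai"]}}))   # True
print(matches_array_filter(doc_metadata, {"tags": {"$all": ["ml", "ai"]}}))  # False
print(matches_array_filter(doc_metadata, {"authors": {"$size": 2}}))         # True
```

The first two calls show the difference between `$in` and `$all`: the document has the `ml` tag but not `ai`, so it matches "any of" and fails "all of".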

### Existence Checks

Filters can also test whether a field is present or null:

```python
# Field exists
filters = {"review_date": {"$exists": True}}

# Field is null
filters = {"review_date": None}
```
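
A field that is missing and a field that is explicitly null are easy to confuse. The sketch below assumes that `$exists` tests only key presence and that a `None` filter matches only an explicit null. That second assumption is worth verifying: in MongoDB, whose operator syntax these filters resemble, a null query also matches documents where the field is missing. `matches_existence_filter` is a hypothetical local helper, not a DataBridge function.

```python
def matches_existence_filter(metadata, filters):
    """Evaluate $exists and None conditions under the assumptions stated above."""
    for field, condition in filters.items():
        if isinstance(condition, dict) and "$exists" in condition:
            # Key presence only; the stored value may still be None
            if (field in metadata) != condition["$exists"]:
                return False
        elif condition is None:
            # Assumed: matches only a field explicitly set to null
            if field not in metadata or metadata[field] is not None:
                return False
        # Other condition types are out of scope for this sketch
    return True


reviewed = {"review_date": "2024-03-01"}
pending = {"review_date": None}
unreviewed = {}

exists_filter = {"review_date": {"$exists": True}}
null_filter = {"review_date": None}

for name, meta in [("reviewed", reviewed), ("pending", pending), ("unreviewed", unreviewed)]:
    print(
        name,
        matches_existence_filter(meta, exists_filter),
        matches_existence_filter(meta, null_filter),
    )
```

Under these assumptions, `pending` is the only record that satisfies both filters, because its key is present and its value is null.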

## Next Steps

After finding relevant documents or chunks:

* [Generate completions using the context](/databridge-docs/user-guides/03_completions.md)
* Use the document IDs to retrieve full documents
* Implement advanced filtering and sorting logic
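
As a starting point for these steps, the sketch below post-processes chunk results locally: it collects the distinct document IDs and joins chunk contents into one context string. `ChunkResult` is a hypothetical stand-in exposing the chunk fields printed earlier in this guide, and both helpers are illustrative names rather than client methods.

```python
from dataclasses import dataclass


@dataclass
class ChunkResult:
    """Hypothetical stand-in with the chunk fields used in this guide."""
    document_id: str
    chunk_number: int
    score: float
    content: str


def unique_document_ids(chunks):
    """Document IDs in order of first appearance, without duplicates."""
    return list(dict.fromkeys(chunk.document_id for chunk in chunks))


def build_context(chunks, separator="\n\n"):
    """Join chunk contents, best score first, into one context string."""
    ordered = sorted(chunks, key=lambda c: c.score, reverse=True)
    return separator.join(c.content for c in ordered)


sample_chunks = [
    ChunkResult("doc-1", 0, 0.82, "alpha"),
    ChunkResult("doc-2", 3, 0.91, "beta"),
    ChunkResult("doc-1", 1, 0.75, "gamma"),
]

doc_ids = unique_document_ids(sample_chunks)
context = build_context(sample_chunks)
print(doc_ids)  # ['doc-1', 'doc-2']
print(context)
```

The ID list can then drive full-document retrieval, and the context string can be passed to a completion call.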


