# Semantic Search

This guide explains how to perform semantic search operations using DataBridge.

## Basic Setup

First, install the DataBridge client:

```bash
pip install databridge-client
```

Initialize the client:

```python
from databridge import DataBridge, AsyncDataBridge

# Synchronous client
db = DataBridge("your-uri")

# Asynchronous client
async_db = AsyncDataBridge("your-uri")
```

## Search Operations

DataBridge provides two main search operations:

1. **Chunk-level Search** (`retrieve_chunks`): Returns individual text chunks with their relevance scores
2. **Document-level Search** (`retrieve_docs`): Returns complete documents with aggregated relevance scores

### Chunk-level Search

Search for specific chunks of text:

```python
# Synchronous search
chunks = db.retrieve_chunks(
    query="What are the key findings about customer satisfaction?",
    filters={"department": "research"},
    k=4,  # Number of results
    min_score=0.7,  # Minimum similarity threshold
    use_reranking=True  # Enable reranking for better relevance
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content}")
    print(f"Document ID: {chunk.document_id}")
    print(f"Chunk Number: {chunk.chunk_number}")
    print(f"Metadata: {chunk.metadata}")
    print("---")

# Asynchronous search
async with AsyncDataBridge("your-uri") as db:
    chunks = await db.retrieve_chunks(
        query="What are the key findings?",
        filters={"department": "research"},
        k=4,
        min_score=0.7,
        use_reranking=True
    )
    
    for chunk in chunks:
        print(f"Score: {chunk.score}")
        print(f"Content: {chunk.content}")
```
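Retrieved chunks are typically assembled into a context string for downstream use (for example, prompting a completion model). A minimal sketch of that step, using a hypothetical `Chunk` dataclass as a stand-in for the objects returned by `retrieve_chunks`:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Stand-in for the chunk objects returned by retrieve_chunks
    score: float
    content: str
    document_id: str


def build_context(chunks, max_chars=2000):
    """Join chunk contents, highest score first, up to a character budget."""
    parts, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + len(chunk.content) > max_chars:
            break
        parts.append(chunk.content)
        used += len(chunk.content)
    return "\n---\n".join(parts)


chunks = [
    Chunk(0.72, "Satisfaction rose 12% year over year.", "doc-1"),
    Chunk(0.91, "Key finding: response time drives satisfaction.", "doc-2"),
]
print(build_context(chunks))
```

The character budget keeps the assembled context within a model's input limit; tune `max_chars` to your downstream model.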

### Document-level Search

Search for complete documents:

```python
# Synchronous search
docs = db.retrieve_docs(
    query="machine learning applications",
    filters={"year": 2024},
    k=4,  # Number of results
    min_score=0.7,  # Minimum similarity threshold
    use_reranking=True  # Enable reranking for better relevance
)

for doc in docs:
    print(f"Score: {doc.score}")
    print(f"Document ID: {doc.document_id}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content: {doc.content}")
    print("---")

# Asynchronous search
async with AsyncDataBridge("your-uri") as db:
    docs = await db.retrieve_docs(
        query="machine learning",
        filters={"year": 2024},
        k=4,
        min_score=0.7,
        use_reranking=True
    )
```
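Document-level results are often reduced to a ranked list of IDs for follow-up retrieval. A small sketch, with a hypothetical `Doc` dataclass standing in for the objects returned by `retrieve_docs`:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Stand-in for the document objects returned by retrieve_docs
    score: float
    document_id: str
    metadata: dict


def top_doc_ids(docs, min_score=0.0):
    """Return document IDs ordered by score, filtered by a threshold."""
    kept = [d for d in docs if d.score >= min_score]
    return [d.document_id for d in sorted(kept, key=lambda d: d.score, reverse=True)]


docs = [Doc(0.81, "doc-a", {"year": 2024}), Doc(0.65, "doc-b", {"year": 2024})]
print(top_doc_ids(docs, min_score=0.7))
```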

## Understanding Search Results

### Two-Stage Ranking

DataBridge ranks search results in two stages, trading off speed and accuracy:

1. **Vector Similarity** (First Stage)
   * Query is converted to an embedding vector
   * Cosine similarity is computed with document chunks
   * Initial ranking based on vector similarity
   * Fast but may miss some semantic nuances
2. **Neural Reranking** (Second Stage, Optional)
   * Top results from vector search are reranked
   * Uses a specialized neural model for scoring
   * More accurate but computationally intensive
   * Can be enabled with `use_reranking=True`
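The two stages can be illustrated with a toy sketch: cosine similarity over precomputed vectors for the first pass, and a stub `rerank_score` function standing in for the neural reranker (all names and data here are illustrative, not DataBridge internals):

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Toy corpus: (chunk_text, embedding) pairs; real embeddings are model-generated
corpus = [
    ("intro to quantum computing", [0.9, 0.1, 0.0]),
    ("customer satisfaction survey", [0.1, 0.9, 0.2]),
    ("quantum error correction", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]

# Stage 1: rank all chunks by vector similarity, keep the top k
k = 2
candidates = sorted(corpus, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:k]


# Stage 2: rerank the shortlist with a more expensive scorer (stubbed here)
def rerank_score(query_text, chunk_text):
    # Placeholder for a cross-encoder; real scoring is model-based
    overlap = set(query_text.split()) & set(chunk_text.split())
    return len(overlap)


reranked = sorted(
    candidates,
    key=lambda c: rerank_score("quantum computing", c[0]),
    reverse=True,
)
print([text for text, _ in reranked])
```

Because the reranker only scores the stage-1 shortlist, its higher cost is paid on `k` items rather than the whole corpus.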

### Similarity Scores

Similarity scores indicate how well a chunk or document matches your query:

* **Score Range**: 0.0 to 1.0
  * 1.0: Perfect match
  * 0.0: No similarity
* **Typical Score Ranges**:
  * 0.9 - 1.0: Near-exact semantic match
  * 0.8 - 0.9: Very strong semantic similarity
  * 0.7 - 0.8: Strong semantic similarity
  * 0.6 - 0.7: Moderate semantic similarity
  * < 0.6: Weak semantic similarity

Example of using similarity scores:

```python
# High-confidence results only
chunks = db.retrieve_chunks(
    query="quantum computing applications",
    min_score=0.8,  # Only very strong matches
    use_reranking=True  # Enable reranking for accuracy
)

# Group results by confidence
def group_by_confidence(chunks):
    groups = {
        "very_high": [],  # 0.9 - 1.0
        "high": [],       # 0.8 - 0.9
        "medium": [],     # 0.7 - 0.8
        "low": []         # < 0.7
    }
    
    for chunk in chunks:
        if chunk.score >= 0.9:
            groups["very_high"].append(chunk)
        elif chunk.score >= 0.8:
            groups["high"].append(chunk)
        elif chunk.score >= 0.7:
            groups["medium"].append(chunk)
        else:
            groups["low"].append(chunk)
    
    return groups

results = group_by_confidence(chunks)
```

## Document Content Types

The `DocumentContent` type represents either a URL or direct content string:

```python
from databridge.models import DocumentContent

# URL content type (for large documents)
url_content = DocumentContent(
    type="url",
    value="https://example.com/document.pdf",
    filename="document.pdf"  # Required for URLs
)

# String content type (for small documents)
text_content = DocumentContent(
    type="string",
    value="Document text content..."
    # filename not allowed for string type
)
```

When retrieving documents, the content field will be one of these types:

```python
docs = db.retrieve_docs("machine learning")
for doc in docs:
    if doc.content.type == "url":
        print(f"Download URL: {doc.content.value}")
        print(f"Filename: {doc.content.filename}")
    else:
        print(f"Content: {doc.content.value}")
```

## Metadata Filtering

Filter dictionaries use MongoDB-style operators to match against document metadata.

### Array Operations

```python
# Array contains
filters = {"tags": {"$in": ["ml", "ai"]}}

# Array match all
filters = {"required_skills": {"$all": ["python", "ml"]}}

# Array size
filters = {"authors": {"$size": 2}}
```
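The behavior of these operators can be illustrated with a small local matcher. This is a sketch of the semantics only, not DataBridge's implementation:

```python
def matches(metadata, filters):
    """Check a metadata dict against a filters dict using the operators above."""
    for field, cond in filters.items():
        value = metadata.get(field)
        if isinstance(cond, dict):
            for op, arg in cond.items():
                # $in: the array contains at least one of the given values
                if op == "$in" and not (isinstance(value, list) and set(arg) & set(value)):
                    return False
                # $all: the array contains every one of the given values
                if op == "$all" and not (isinstance(value, list) and set(arg) <= set(value)):
                    return False
                # $size: the array has exactly the given length
                if op == "$size" and not (isinstance(value, list) and len(value) == arg):
                    return False
        elif value != cond:
            return False
    return True


doc = {"tags": ["ml", "nlp"], "authors": ["a", "b"]}
print(matches(doc, {"tags": {"$in": ["ml", "ai"]}}))  # tags contains "ml"
print(matches(doc, {"authors": {"$size": 2}}))        # exactly two authors
```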

### Existence Checks

```python
# Field exists
filters = {"review_date": {"$exists": True}}

# Field is null
filters = {"review_date": None}
```

## Next Steps

After finding relevant documents or chunks:

* [Generate completions using the context](https://databridge.gitbook.io/databridge-docs/user-guides/03_completions)
* Use the document IDs to retrieve full documents
* Implement advanced filtering and sorting logic
