Vector-Based Similarity Search¶

YokedCache 0.2.0 introduces vector-based similarity search across your cached data. Instead of comparing raw strings, it compares TF-IDF vectors built from cache keys and values, so entries that share vocabulary with a query can match even when their wording or word order differs.

Table of Contents¶

Overview
How It Works
Configuration
Usage Examples
Similarity Methods
Performance Optimization
Integration with Cache
Best Practices

Overview¶

Traditional fuzzy search relies on string similarity metrics such as Levenshtein distance. Vector-based similarity search instead converts each cache entry to a TF-IDF vector and compares vectors, which surfaces entries related by shared terms rather than by character-level overlap.
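To make the contrast concrete, the sketch below scores two strings that contain the same words in a different order. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a character-level metric (it is not Levenshtein distance), and `bag_of_words_cosine` is a hypothetical helper written for this illustration, not part of YokedCache.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words_cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = "redis caching guide"
second = "guide caching redis"

# Character-level comparison is sensitive to word order
char_ratio = SequenceMatcher(None, first, second).ratio()
# Word-vector comparison ignores word order entirely
word_cosine = bag_of_words_cosine(first, second)

print(f"character-level ratio: {char_ratio:.2f}")
print(f"bag-of-words cosine:   {word_cosine:.2f}")
```

The character-level ratio drops because the characters no longer line up, while the word-vector score stays at its maximum because both strings contain the same terms.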

Key Features¶

Semantic Understanding: Finds conceptually related content, not just string matches
Multiple Similarity Methods: Cosine similarity, plus Euclidean and Manhattan distance
TF-IDF Vectorization: Converts text to numerical vectors for comparison
Configurable Parameters: Fine-tune search behavior for your specific use case
Real-time Updates: Automatically updates search index when cache data changes
Redis Integration: Optional Redis-backed vector storage for distributed systems

Use Cases¶

Content Discovery: Find related articles, products, or documents
Recommendation Systems: Suggest similar items based on user behavior
Data Deduplication: Identify duplicate or near-duplicate content
Smart Search: Provide search results that understand user intent
Data Analysis: Group and analyze similar data patterns

How It Works¶

1. Text Extraction¶

The system extracts searchable text from cache keys and values:

# For key-value pairs
key = "user:123"
value = {"name": "Alice Smith", "role": "engineer", "skills": ["python", "redis"]}

# Extracted text
searchable_text = "user:123 name:Alice Smith role:engineer skills:python,redis"
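The flattening step can be sketched as follows. `extract_searchable_text` is a hypothetical helper that reproduces the string shown above; the library's actual extraction logic is not documented here and may differ, for example in how it treats nested values.

```python
def extract_searchable_text(key, value):
    """Flatten a cache key and value into one space-separated string."""
    parts = [str(key)]
    if isinstance(value, dict):
        for field, item in value.items():
            # Lists are joined with commas so each element stays searchable
            if isinstance(item, (list, tuple)):
                item = ",".join(str(element) for element in item)
            parts.append(f"{field}:{item}")
    else:
        # Non-dict values are indexed by their string form
        parts.append(str(value))
    return " ".join(parts)


text = extract_searchable_text(
    "user:123",
    {"name": "Alice Smith", "role": "engineer", "skills": ["python", "redis"]},
)
print(text)
```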

2. Vectorization¶

Text is converted to numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency):

from yokedcache.vector_search import VectorSimilaritySearch

search = VectorSimilaritySearch(
    max_features=1000,      # Maximum vocabulary size
    min_df=1,               # Minimum document frequency
    max_df=0.95,            # Maximum document frequency
    ngram_range=(1, 2)      # Use unigrams and bigrams
)
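For intuition about what the vectorizer produces, here is a minimal pure-Python TF-IDF sketch. It is an illustration under stated assumptions, not the library's implementation: `tfidf_vectors` is a hypothetical name, tokenization is a plain whitespace split, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1` that scikit-learn applies by default.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["python redis", "python ml"])
for term, weight in zip(vocab, vectors[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "redis" receives a higher weight than "python" because "python" appears in every document and therefore distinguishes them less.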

3. Similarity Calculation¶

Vector similarities are calculated using mathematical distance metrics:

Cosine Similarity: Measures angle between vectors (best for text)
Euclidean Distance: Measures straight-line distance between points
Manhattan Distance: Measures city-block distance between points
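The three metrics above can be written in a few lines of plain Python. The function names are illustrative only; the `1 / (1 + distance)` conversion is the one this guide describes for turning Euclidean and Manhattan distances into similarity scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention for all-zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a score in (0, 1]."""
    return 1.0 / (1.0 + distance)


origin, point = [0.0, 0.0], [3.0, 4.0]
print(euclidean_distance(origin, point))
print(manhattan_distance(origin, point))
print(distance_to_similarity(euclidean_distance(origin, point)))
```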

4. Result Ranking¶

Results are ranked by similarity score and filtered by threshold:

results = search.search(
    query="python developer",
    cache_data=cache_data,
    threshold=0.1,          # Minimum similarity score
    max_results=10          # Maximum number of results
)
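Conceptually, ranking is a filter followed by a sort and a cut. The sketch below assumes scores have already been computed per key; `rank_results` is a hypothetical helper, and the real `search()` may order ties differently.

```python
def rank_results(scores, threshold=0.1, max_results=10):
    """Keep scores at or above the threshold, best first, capped at max_results."""
    kept = [(key, score) for key, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_results]


scores = {"user:1": 0.62, "user:2": 0.31, "user:3": 0.04, "post:1": 0.48}
ranked = rank_results(scores, threshold=0.1, max_results=2)
print(ranked)
```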

Configuration¶

Basic Setup¶

from yokedcache.vector_search import VectorSimilaritySearch

# Default configuration
search = VectorSimilaritySearch()

# Custom configuration
search = VectorSimilaritySearch(
    similarity_method="cosine",     # "cosine", "euclidean", "manhattan"
    max_features=2000,              # Vocabulary size
    min_df=2,                       # Ignore rare terms
    max_df=0.8,                     # Ignore common terms
    ngram_range=(1, 3)              # Use 1-3 word phrases
)

Advanced Configuration¶

search = VectorSimilaritySearch(
    similarity_method="cosine",
    max_features=5000,
    min_df=0.01,                    # Float: proportion of documents (0.01 = 1%)
    max_df=0.95,
    ngram_range=(1, 2),
    stop_words="english",           # Remove common English words
    lowercase=True,                 # Convert to lowercase
    strip_accents="unicode"         # Remove accents
)
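The effect of `min_df` and `max_df` is easiest to see on a tiny corpus. This sketch assumes the scikit-learn convention, where an integer bound is an absolute document count and a float bound is a proportion of the corpus; `prune_vocabulary` and `resolve_bound` are hypothetical names used only for illustration.

```python
from collections import Counter


def resolve_bound(bound, n_docs):
    """Integers are document counts; floats are proportions of the corpus."""
    return bound if isinstance(bound, int) else bound * n_docs


def prune_vocabulary(docs, min_df=1, max_df=1.0):
    """Return the terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    low = resolve_bound(min_df, n_docs)
    high = resolve_bound(max_df, n_docs)
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = ["python redis cache", "python ml", "python react"]

# max_df=0.8 drops "python", which appears in all three documents
print(prune_vocabulary(docs, min_df=1, max_df=0.8))
# min_df=2 drops every term that appears in only one document
print(prune_vocabulary(docs, min_df=2, max_df=1.0))
```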

Installation Requirements¶

Vector search requires additional dependencies:

# Install vector search dependencies
pip install yokedcache[vector]

# Or install manually
pip install numpy scipy scikit-learn
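If you want to verify at startup that the optional dependencies are importable, a check along these lines works with the standard library alone. `missing_vector_dependencies` is a hypothetical helper, not a YokedCache function; note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util


def missing_vector_dependencies(modules=("numpy", "scipy", "sklearn")):
    """Return the names of required modules that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_vector_dependencies()
if missing:
    print(f"Vector search unavailable, missing: {', '.join(missing)}")
else:
    print("Vector search dependencies are installed")
```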

Usage Examples¶

Basic Search¶

from yokedcache.vector_search import VectorSimilaritySearch

# Sample cache data
cache_data = {
    "user:1": {"name": "Alice", "role": "Python Developer", "skills": ["FastAPI", "Redis"]},
    "user:2": {"name": "Bob", "role": "Data Scientist", "skills": ["Python", "ML"]},
    "user:3": {"name": "Charlie", "role": "Frontend Developer", "skills": ["React", "TypeScript"]},
    "post:1": {"title": "Python Best Practices", "content": "Tips for writing clean Python code"},
    "post:2": {"title": "Redis Caching Guide", "content": "How to implement efficient caching"}
}

# Initialize and fit the search engine
search = VectorSimilaritySearch()
search.fit(cache_data)

# Search for Python-related content
results = search.search("Python programming", cache_data, threshold=0.1)

for result in results:
    print(f"Key: {result.key}")
    print(f"Score: {result.score:.3f}")
    print(f"Value: {result.value}")
    print("---")

Integration with YokedCache¶

from yokedcache import YokedCache, CacheConfig
from yokedcache.backends import MemoryBackend
from yokedcache.vector_search import VectorSimilaritySearch

# Setup cache with vector search
backend = MemoryBackend()
config = CacheConfig(
    backend=backend,
    enable_fuzzy=True,
    fuzzy_threshold=70
)

cache = YokedCache(config)

# Add vector search capability
vector_search = VectorSimilaritySearch(similarity_method="cosine")

async def enhanced_search(query: str, use_vector: bool = True):
    """Search using both traditional fuzzy and vector search."""

    # Get all cache data
    all_keys = await cache.get_all_keys("*")
    cache_data = {}

    for key in all_keys:
        value = await cache.get(key)
        if value is not None:  # Keep falsy values such as 0 or []
            cache_data[key] = value

    if use_vector and cache_data:
        # Fit and search using vector similarity
        vector_search.fit(cache_data)
        vector_results = vector_search.search(query, cache_data, threshold=0.1)

        # Convert to consistent format
        results = [
            {
                "key": r.key,
                "value": r.value,
                "score": r.score,
                "method": "vector"
            }
            for r in vector_results
        ]
    else:
        # Fall back to traditional fuzzy search
        fuzzy_results = await cache.fuzzy_search(query)
        results = [
            {
                "key": r.key,
                "value": r.value,
                "score": r.score,
                "method": "fuzzy"
            }
            for r in fuzzy_results
        ]

    return results

Real-time Index Updates¶

from yokedcache import YokedCache
from yokedcache.vector_search import VectorSimilaritySearch

class CacheWithVectorSearch:
    """Cache wrapper with automatic vector search updates."""

    def __init__(self, cache: YokedCache):
        self.cache = cache
        self.vector_search = VectorSimilaritySearch()
        self._cache_data = {}
        self._fitted = False

    async def set(self, key: str, value, **kwargs):
        """Set cache value and update search index."""
        result = await self.cache.set(key, value, **kwargs)

        # Update search index
        self._cache_data[key] = value
        self.vector_search.update_cache_entry(key, value)

        return result

    async def delete(self, key: str):
        """Delete cache value and update search index."""
        result = await self.cache.delete(key)

        # Update search index
        if key in self._cache_data:
            del self._cache_data[key]
            self.vector_search.remove_cache_entry(key)

        return result

    async def search(self, query: str, **kwargs):
        """Search using vector similarity."""
        if not self._fitted:
            # Initial fit
            all_keys = await self.cache.get_all_keys("*")
            for key in all_keys:
                value = await self.cache.get(key)
                if value is not None:  # Keep falsy values such as 0 or []
                    self._cache_data[key] = value

            self.vector_search.fit(self._cache_data)
            self._fitted = True

        return self.vector_search.search(query, self._cache_data, **kwargs)

Similarity Methods¶

Cosine Similarity¶

Best for text-based content and high-dimensional sparse data.

search = VectorSimilaritySearch(similarity_method="cosine")

Characteristics:

- Range: 0.0 to 1.0 (higher is more similar)
- Normalized by vector magnitude
- Excellent for text similarity
- Handles different document lengths well

Use Cases:

- Document similarity
- Content recommendation
- Semantic search

Euclidean Distance¶

Measures straight-line distance between vectors in n-dimensional space.

search = VectorSimilaritySearch(similarity_method="euclidean")

Characteristics:

- Range: distance from 0.0 to ∞ (lower is more similar); reported as a score of 1/(1+distance), so a higher score means closer vectors
- Sensitive to magnitude differences
- Good for numerical data
- Intuitive geometric interpretation

Use Cases:

- Numerical data comparison
- Feature similarity
- Spatial data analysis

Manhattan Distance¶

Measures city-block distance between vectors.

search = VectorSimilaritySearch(similarity_method="manhattan")

Characteristics:

- Range: distance from 0.0 to ∞ (lower is more similar); reported as a score of 1/(1+distance), so a higher score means closer vectors
- Less sensitive to outliers than Euclidean
- Good for sparse data
- Computationally efficient

Use Cases:

- Categorical data
- Sparse feature vectors
- Robust similarity measurement

Comparison Example¶

# Test different similarity methods
methods = ["cosine", "euclidean", "manhattan"]
query = "machine learning engineer"

for method in methods:
    search = VectorSimilaritySearch(similarity_method=method)
    search.fit(cache_data)
    results = search.search(query, cache_data, threshold=0.1, max_results=3)

    print(f"\n{method.title()} Similarity Results:")
    for result in results:
        print(f"  {result.key}: {result.score:.3f}")
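The practical difference between the methods shows up when two documents have the same mix of terms but different lengths. The library-free sketch below uses raw term-count vectors chosen for illustration. If the vectorizer L2-normalizes its output, as scikit-learn's `TfidfVectorizer` does by default, this length effect largely disappears.

```python
import math

# Term counts over the vocabulary ["python", "redis", "ml"]
short_doc = [1.0, 1.0, 0.0]   # "python redis"
long_doc = [2.0, 2.0, 0.0]    # "python redis python redis"

dot = sum(x * y for x, y in zip(short_doc, long_doc))
norm_product = math.sqrt(sum(x * x for x in short_doc)) * math.sqrt(
    sum(y * y for y in long_doc)
)
cosine = dot / norm_product

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(short_doc, long_doc)))
manhattan = sum(abs(x - y) for x, y in zip(short_doc, long_doc))

# Distances converted to scores with 1 / (1 + distance)
euclidean_score = 1.0 / (1.0 + euclidean)
manhattan_score = 1.0 / (1.0 + manhattan)

print(f"cosine:    {cosine:.3f}")
print(f"euclidean: {euclidean_score:.3f}")
print(f"manhattan: {manhattan_score:.3f}")
```

Cosine treats the two documents as equivalent because they point in the same direction, while both distance-based scores penalize the difference in length.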

Performance Optimization¶

Vectorizer Optimization¶

# For large datasets
search = VectorSimilaritySearch(
    max_features=10000,         # Larger vocabulary
    min_df=5,                   # Ignore very rare terms
    max_df=0.7,                 # Ignore very common terms
    ngram_range=(1, 2)          # Limit n-gram range
)

# For small datasets
search = VectorSimilaritySearch(
    max_features=1000,          # Smaller vocabulary
    min_df=1,                   # Include rare terms
    max_df=0.95,                # Keep most terms
    ngram_range=(1, 3)          # Include trigrams
)

Memory Management¶

# Monitor memory usage
stats = search.get_stats()
print(f"Vector density: {stats['vector_density']:.3f}")
print(f"Number of features: {stats['num_features']}")
print(f"Density as a percentage: {stats['vector_density'] * 100:.1f}%")

# For memory-constrained environments
search = VectorSimilaritySearch(
    max_features=500,           # Reduce vocabulary
    min_df=3,                   # Filter rare terms
    ngram_range=(1, 1)          # Only unigrams
)
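The guide does not define `vector_density`. Assuming it means the fraction of non-zero entries in the document-term matrix, it could be computed as in this sketch, where `vector_density` is a hypothetical stand-alone function rather than the library's implementation.

```python
def vector_density(matrix):
    """Fraction of entries in a row-major matrix that are non-zero."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


# Three documents over a four-term vocabulary
matrix = [
    [0.7, 0.0, 0.7, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(f"Vector density: {vector_density(matrix):.3f}")
```

Under that assumption, a low density means most entries are zero, which is the case where sparse storage saves the most memory.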

Batch Operations¶

# Batch index updates for better performance
from yokedcache.vector_search import VectorSimilaritySearch

class BatchVectorSearch:
    def __init__(self, batch_size=100):
        self.search = VectorSimilaritySearch()
        self.data = {}              # Everything indexed so far
        self.pending_updates = {}   # Entries waiting for the next refit
        self.batch_size = batch_size

    def update_entry(self, key: str, value):
        """Queue an entry and refit once the batch is full."""
        self.pending_updates[key] = value

        if len(self.pending_updates) >= self.batch_size:
            self.flush_updates()

    def flush_updates(self):
        """Merge pending entries into the indexed data and refit once."""
        if self.pending_updates:
            # Fitting on pending_updates alone would drop earlier entries
            self.data.update(self.pending_updates)
            self.search.fit(self.data)
            self.pending_updates.clear()

Integration with Cache¶

Automatic Vector Search¶

from yokedcache import YokedCache, CacheConfig
from yokedcache.backends import MemoryBackend

# Enable vector search in cache configuration
config = CacheConfig(
    backend=MemoryBackend(),
    enable_fuzzy=True,
    fuzzy_threshold=70,
    vector_search=True,          # Enable vector search
    vector_similarity="cosine"   # Choose similarity method
)

cache = YokedCache(config)

# Search automatically uses vector similarity (run inside an async function)
results = await cache.fuzzy_search("python developer")

Redis Vector Storage¶

For distributed systems, store vectors in Redis:

from yokedcache.vector_search import RedisVectorSearch
import redis.asyncio as redis

# Setup Redis vector storage
redis_client = redis.Redis.from_url("redis://localhost:6379/1")
vector_store = RedisVectorSearch(
    redis_client,
    vector_key_prefix="vectors:",
    similarity_method="cosine"
)

# Store and retrieve vectors (run inside an async function)
import numpy as np

vector = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
await vector_store.store_vector("doc:123", vector)

retrieved = await vector_store.get_vector("doc:123")
print(f"Vectors match: {np.array_equal(vector, retrieved)}")
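How `RedisVectorSearch` serializes vectors is not described here. As a general illustration of storing a float vector as a binary value, the sketch below packs it with the standard `struct` module; `pack_vector` and `unpack_vector` are hypothetical helpers and the little-endian float64 layout is an assumption, not the library's wire format.

```python
import struct


def pack_vector(vector):
    """Encode a sequence of floats as little-endian 64-bit doubles."""
    return struct.pack(f"<{len(vector)}d", *vector)


def unpack_vector(blob):
    """Decode bytes produced by pack_vector back into a list of floats."""
    count = len(blob) // 8  # Each double occupies 8 bytes
    return list(struct.unpack(f"<{count}d", blob))


vector = [0.1, 0.2, 0.3, 0.4, 0.5]
blob = pack_vector(vector)
restored = unpack_vector(blob)

print(f"Stored {len(blob)} bytes")
print(f"Vectors match: {restored == vector}")
```

A fixed-width binary layout keeps the round trip exact, which a text encoding with limited decimal places would not guarantee.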

Hybrid Search Strategy¶

from yokedcache.vector_search import VectorSimilaritySearch

async def hybrid_search(cache, query: str, threshold: float = 0.1):
    """Combine traditional fuzzy search with vector search."""

    # Get traditional fuzzy results
    fuzzy_results = await cache.fuzzy_search(query, threshold=threshold*100)

    # Get vector search results
    vector_search = VectorSimilaritySearch()
    all_data = await cache.get_all_data()  # Not provided here: implement it to return {key: value}
    vector_search.fit(all_data)
    vector_results = vector_search.search(query, all_data, threshold=threshold)

    # Combine and rank results
    combined_results = {}

    # Add fuzzy results
    for result in fuzzy_results:
        combined_results[result.key] = {
            "value": result.value,
            "fuzzy_score": result.score / 100.0,
            "vector_score": 0.0
        }

    # Add vector results
    for result in vector_results:
        if result.key in combined_results:
            combined_results[result.key]["vector_score"] = result.score
        else:
            combined_results[result.key] = {
                "value": result.value,
                "fuzzy_score": 0.0,
                "vector_score": result.score
            }

    # Calculate combined score (weighted average)
    final_results = []
    for key, data in combined_results.items():
        combined_score = (data["fuzzy_score"] * 0.3 + data["vector_score"] * 0.7)
        if combined_score >= threshold:
            final_results.append({
                "key": key,
                "value": data["value"],
                "score": combined_score,
                "fuzzy_score": data["fuzzy_score"],
                "vector_score": data["vector_score"]
            })

    # Sort by combined score
    return sorted(final_results, key=lambda x: x["score"], reverse=True)
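The weighting in `hybrid_search` is worth checking by hand, because it determines which single-method hits survive the threshold. The sketch below isolates that arithmetic; `combine_scores` is a hypothetical helper that mirrors the 0.3 and 0.7 weights used above, with the fuzzy score given on a 0-100 scale.

```python
def combine_scores(fuzzy_score, vector_score, fuzzy_weight=0.3, vector_weight=0.7):
    """Weighted average of a 0-100 fuzzy score and a 0-1 vector score."""
    return (fuzzy_score / 100.0) * fuzzy_weight + vector_score * vector_weight


threshold = 0.1

both = combine_scores(fuzzy_score=80, vector_score=0.5)
fuzzy_only = combine_scores(fuzzy_score=30, vector_score=0.0)
vector_only = combine_scores(fuzzy_score=0, vector_score=0.2)

for label, score in [("both", both), ("fuzzy only", fuzzy_only), ("vector only", vector_only)]:
    status = "kept" if score >= threshold else "dropped"
    print(f"{label}: {score:.2f} ({status})")
```

With these weights, a result found only by fuzzy search needs a fuzzy score above roughly 33 to reach a combined threshold of 0.1, while a vector-only result needs a vector score of about 0.14. Adjust the weights if that balance does not suit your data.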

Best Practices¶

1. Data Preparation¶

def prepare_search_data(data):
    """Prepare data for optimal vector search."""
    if isinstance(data, dict):
        # Extract meaningful text fields
        text_fields = []
        for key, value in data.items():
            if isinstance(value, str):
                text_fields.append(f"{key}:{value}")
            elif isinstance(value, list):
                text_fields.append(f"{key}:{','.join(map(str, value))}")
        return " ".join(text_fields)
    return str(data)

2. Index Management¶

class ManagedVectorSearch:
    """Vector search with automatic index management."""

    def __init__(self, rebuild_threshold=1000):
        self.search = VectorSimilaritySearch()
        self.rebuild_threshold = rebuild_threshold
        self.updates_since_rebuild = 0

    def should_rebuild_index(self):
        """Check if index should be rebuilt."""
        return self.updates_since_rebuild >= self.rebuild_threshold

    async def update_and_maybe_rebuild(self, cache_data):
        """Update index and rebuild if necessary."""
        self.updates_since_rebuild += 1

        if self.should_rebuild_index():
            self.search.fit(cache_data)
            self.updates_since_rebuild = 0

3. Error Handling¶

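import logging

logger = logging.getLogger(__name__)

async def safe_vector_search(query: str, cache_data: dict):
    """Vector search with graceful error handling."""
    try:
        # Import here so a missing optional dependency is caught below
        from yokedcache.vector_search import VectorSimilaritySearch

        search = VectorSimilaritySearch()
        search.fit(cache_data)
        return search.search(query, cache_data)
    except ImportError:
        logger.warning("Vector search dependencies not available, skipping")
        return []
    except Exception:
        # logger.exception records the traceback along with the message
        logger.exception("Vector search failed")
        return []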

Vector-based similarity search lets you query cached data by content rather than by exact key or string match. By comparing entries on the terms they share, you can build search, recommendation, and deduplication features directly on top of your cache.