Vector-Based Similarity Search¶
YokedCache 0.2.0 introduces advanced vector-based similarity search capabilities, enabling semantic search across your cached data. This feature goes beyond traditional string matching to provide intelligent, context-aware search results.
Table of Contents¶
- Overview
- How It Works
- Configuration
- Usage Examples
- Similarity Methods
- Performance Optimization
- Integration with Cache
- Best Practices
Overview¶
Traditional fuzzy search relies on string similarity metrics like Levenshtein distance. Vector-based similarity search instead converts each cache entry to a TF-IDF vector and compares vectors numerically, so it can match entries that share important terms and phrases even when the strings as a whole look quite different.
Key Features¶
- Semantic Understanding: Finds conceptually related content, not just string matches
- Multiple Similarity Methods: Cosine, Euclidean, and Manhattan distance calculations
- TF-IDF Vectorization: Converts text to numerical vectors for comparison
- Configurable Parameters: Fine-tune search behavior for your specific use case
- Real-time Updates: Automatically updates search index when cache data changes
- Redis Integration: Optional Redis-backed vector storage for distributed systems
Use Cases¶
- Content Discovery: Find related articles, products, or documents
- Recommendation Systems: Suggest similar items based on user behavior
- Data Deduplication: Identify duplicate or near-duplicate content
- Smart Search: Provide search results that understand user intent
- Data Analysis: Group and analyze similar data patterns
How It Works¶
1. Text Extraction¶
The system extracts searchable text from cache keys and values:
# For key-value pairs
key = "user:123"
value = {"name": "Alice Smith", "role": "engineer", "skills": ["python", "redis"]}
# Extracted text
searchable_text = "user:123 name:Alice Smith role:engineer skills:python,redis"
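The extraction step could look roughly like the sketch below. `extract_searchable_text` is a hypothetical helper written for illustration, not a YokedCache function; it assumes a flat dictionary whose values are strings or lists, and reproduces the format shown above.

```python
def extract_searchable_text(key, value):
    """Flatten a cache key and a flat dict value into one searchable string.

    Hypothetical helper for illustration; not part of the YokedCache API.
    """
    parts = [key]
    if isinstance(value, dict):
        for field, field_value in value.items():
            if isinstance(field_value, (list, tuple)):
                # Join list items with commas, e.g. skills:python,redis
                field_value = ",".join(str(item) for item in field_value)
            parts.append(f"{field}:{field_value}")
    else:
        # Non-dict values are simply stringified
        parts.append(str(value))
    return " ".join(parts)


text = extract_searchable_text(
    "user:123",
    {"name": "Alice Smith", "role": "engineer", "skills": ["python", "redis"]},
)
print(text)
# user:123 name:Alice Smith role:engineer skills:python,redis
```

Keeping the field name next to each value (`role:engineer`) lets a query match on either the field or its content.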
2. Vectorization¶
Text is converted to numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency):
from yokedcache.vector_search import VectorSimilaritySearch
search = VectorSimilaritySearch(
    max_features=1000,   # Maximum vocabulary size
    min_df=1,            # Minimum document frequency
    max_df=0.95,         # Maximum document frequency
    ngram_range=(1, 2)   # Use unigrams and bigrams
)
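To make the TF-IDF idea concrete, here is a dependency-free sketch. It is not how YokedCache computes its vectors; it assumes whitespace tokenization, raw term counts, and a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and it skips n-grams and normalization. The function name `tfidf_vectors` is made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Illustrative only: whitespace tokens, raw counts, smoothed IDF.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight each term count by how rare the term is across documents
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


vocab, vecs = tfidf_vectors(["python redis cache", "python developer"])
print(vocab)  # ['cache', 'developer', 'python', 'redis']
```

In this toy corpus, `python` occurs in both documents and so receives the lowest weight, while `cache` and `redis` occur in only one and are weighted higher.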
3. Similarity Calculation¶
Vector similarities are calculated using mathematical distance metrics:
- Cosine Similarity: Measures angle between vectors (best for text)
- Euclidean Distance: Measures straight-line distance between points
- Manhattan Distance: Measures city-block distance between points
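The three metrics can be sketched in plain Python as follows. These are the textbook definitions, not YokedCache's internal code; the `distance_to_similarity` helper applies the 1/(1+distance) conversion described in the Similarity Methods section, and its name is an assumption made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)


print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(manhattan_distance([0, 0], [3, 4]))   # 7
print(distance_to_similarity(1.0))          # 0.5
```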
4. Result Ranking¶
Results are ranked by similarity score and filtered by threshold:
results = search.search(
    query="python developer",
    cache_data=cache_data,
    threshold=0.1,    # Minimum similarity score
    max_results=10    # Maximum number of results
)
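The filtering and ranking step amounts to a threshold check followed by a descending sort. The sketch below shows that logic on made-up scores; `rank_results` is a hypothetical function, and treating the threshold as inclusive is an assumption rather than documented behavior.

```python
def rank_results(scores, threshold=0.1, max_results=10):
    """Filter {key: score} by threshold and return the best pairs first."""
    kept = [(key, score) for key, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_results]


# Illustrative scores only, not real search output
scores = {"user:1": 0.62, "user:2": 0.35, "user:3": 0.04, "post:1": 0.48}
top = rank_results(scores, threshold=0.1, max_results=2)
print(top)  # [('user:1', 0.62), ('post:1', 0.48)]
```

Raising `threshold` trades recall for precision, while `max_results` only caps how many of the surviving matches are returned.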
Configuration¶
Basic Setup¶
from yokedcache.vector_search import VectorSimilaritySearch
# Default configuration
search = VectorSimilaritySearch()
# Custom configuration
search = VectorSimilaritySearch(
    similarity_method="cosine",   # "cosine", "euclidean", "manhattan"
    max_features=2000,            # Vocabulary size
    min_df=2,                     # Ignore rare terms
    max_df=0.8,                   # Ignore common terms
    ngram_range=(1, 3)            # Use 1-3 word phrases
)
Advanced Configuration¶
search = VectorSimilaritySearch(
    similarity_method="cosine",
    max_features=5000,
    min_df=0.01,              # Float = proportion: ignore terms in under 1% of entries
    max_df=0.95,              # Ignore terms in more than 95% of entries
    ngram_range=(1, 2),
    stop_words="english",     # Remove common English words
    lowercase=True,           # Convert to lowercase
    strip_accents="unicode"   # Remove accents
)
Installation Requirements¶
Vector search requires additional dependencies:
# Install vector search dependencies
pip install "yokedcache[vector]"  # Quotes keep shells such as zsh from expanding the brackets
# Or install manually
pip install numpy scipy scikit-learn
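Before enabling vector search, it can be useful to check that the optional packages are importable. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(["numpy", "scipy", "sklearn"])
if missing:
    print("Vector search dependencies missing:", ", ".join(missing))
else:
    print("All vector search dependencies are installed")
```

Because `find_spec` only locates the packages without importing them, the check stays cheap even when they are installed.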
Usage Examples¶
Basic Search¶
from yokedcache.vector_search import VectorSimilaritySearch
# Sample cache data
cache_data = {
    "user:1": {"name": "Alice", "role": "Python Developer", "skills": ["FastAPI", "Redis"]},
    "user:2": {"name": "Bob", "role": "Data Scientist", "skills": ["Python", "ML"]},
    "user:3": {"name": "Charlie", "role": "Frontend Developer", "skills": ["React", "TypeScript"]},
    "post:1": {"title": "Python Best Practices", "content": "Tips for writing clean Python code"},
    "post:2": {"title": "Redis Caching Guide", "content": "How to implement efficient caching"}
}
# Initialize and fit the search engine
search = VectorSimilaritySearch()
search.fit(cache_data)
# Search for Python-related content
results = search.search("Python programming", cache_data, threshold=0.1)
for result in results:
    print(f"Key: {result.key}")
    print(f"Score: {result.score:.3f}")
    print(f"Value: {result.value}")
    print("---")
Integration with YokedCache¶
from yokedcache import YokedCache, CacheConfig
from yokedcache.backends import MemoryBackend
from yokedcache.vector_search import VectorSimilaritySearch
# Setup cache with vector search
backend = MemoryBackend()
config = CacheConfig(
    backend=backend,
    enable_fuzzy=True,
    fuzzy_threshold=70
)
cache = YokedCache(config)
# Add vector search capability
vector_search = VectorSimilaritySearch(similarity_method="cosine")
async def enhanced_search(query: str, use_vector: bool = True):
    """Search using both traditional fuzzy and vector search."""
    # Get all cache data
    all_keys = await cache.get_all_keys("*")
    cache_data = {}
    for key in all_keys:
        value = await cache.get(key)
        # Compare against None so falsy values (0, "", []) are still indexed
        if value is not None:
            cache_data[key] = value

    if use_vector and cache_data:
        # Fit and search using vector similarity
        vector_search.fit(cache_data)
        vector_results = vector_search.search(query, cache_data, threshold=0.1)
        # Convert to consistent format
        results = [
            {
                "key": r.key,
                "value": r.value,
                "score": r.score,
                "method": "vector"
            }
            for r in vector_results
        ]
    else:
        # Fall back to traditional fuzzy search
        fuzzy_results = await cache.fuzzy_search(query)
        results = [
            {
                "key": r.key,
                "value": r.value,
                "score": r.score,
                "method": "fuzzy"
            }
            for r in fuzzy_results
        ]
    return results
Real-time Index Updates¶
from yokedcache import YokedCache
from yokedcache.vector_search import VectorSimilaritySearch

class CacheWithVectorSearch:
    """Cache wrapper with automatic vector search updates."""

    def __init__(self, cache: YokedCache):
        self.cache = cache
        self.vector_search = VectorSimilaritySearch()
        self._cache_data = {}
        self._fitted = False

    async def set(self, key: str, value, **kwargs):
        """Set cache value and update search index."""
        result = await self.cache.set(key, value, **kwargs)
        # Update search index
        self._cache_data[key] = value
        self.vector_search.update_cache_entry(key, value)
        return result

    async def delete(self, key: str):
        """Delete cache value and update search index."""
        result = await self.cache.delete(key)
        # Update search index
        if key in self._cache_data:
            del self._cache_data[key]
            self.vector_search.remove_cache_entry(key)
        return result

    async def search(self, query: str, **kwargs):
        """Search using vector similarity."""
        if not self._fitted:
            # Initial fit: load everything currently in the cache
            all_keys = await self.cache.get_all_keys("*")
            for key in all_keys:
                value = await self.cache.get(key)
                # Compare against None so falsy values are still indexed
                if value is not None:
                    self._cache_data[key] = value
            self.vector_search.fit(self._cache_data)
            self._fitted = True
        return self.vector_search.search(query, self._cache_data, **kwargs)
Similarity Methods¶
Cosine Similarity¶
Best for text-based content and high-dimensional sparse data.
search = VectorSimilaritySearch(similarity_method="cosine")
Characteristics:
- Range: 0.0 to 1.0 (higher is more similar)
- Normalized by vector magnitude
- Excellent for text similarity
- Handles different document lengths well
Use Cases:
- Document similarity
- Content recommendation
- Semantic search
Euclidean Distance¶
Measures straight-line distance between vectors in n-dimensional space.
search = VectorSimilaritySearch(similarity_method="euclidean")
Characteristics:
- Range: 0.0 to ∞ (lower is more similar, converted to 1/(1+distance))
- Sensitive to magnitude differences
- Good for numerical data
- Intuitive geometric interpretation
Use Cases:
- Numerical data comparison
- Feature similarity
- Spatial data analysis
Manhattan Distance¶
Measures city-block distance between vectors.
search = VectorSimilaritySearch(similarity_method="manhattan")
Characteristics:
- Range: 0.0 to ∞ (lower is more similar, converted to 1/(1+distance))
- Less sensitive to outliers than Euclidean
- Good for sparse data
- Computationally efficient
Use Cases:
- Categorical data
- Sparse feature vectors
- Robust similarity measurement
Comparison Example¶
# Test different similarity methods
methods = ["cosine", "euclidean", "manhattan"]
query = "machine learning engineer"
for method in methods:
    search = VectorSimilaritySearch(similarity_method=method)
    search.fit(cache_data)
    results = search.search(query, cache_data, threshold=0.1, max_results=3)
    print(f"\n{method.title()} Similarity Results:")
    for result in results:
        print(f"  {result.key}: {result.score:.3f}")
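A small library-independent calculation shows why the methods can disagree. The sketch below compares a vector with a copy that has the same term proportions but twice the counts, as if a document repeated its own text; the vectors are invented for illustration.

```python
import math

short_doc = [1.0, 2.0, 0.0]
long_doc = [2.0, 4.0, 0.0]  # Same term proportions, twice the counts

# Cosine similarity depends only on direction, not on length
dot = sum(x * y for x, y in zip(short_doc, long_doc))
norm_short = math.sqrt(sum(x * x for x in short_doc))
norm_long = math.sqrt(sum(x * x for x in long_doc))
cosine = dot / (norm_short * norm_long)

# Distance-based scores shrink as the magnitudes diverge
euclid_sim = 1.0 / (1.0 + math.dist(short_doc, long_doc))
manhattan_sim = 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(short_doc, long_doc)))

print(f"cosine:    {cosine:.3f}")         # cosine:    1.000
print(f"euclidean: {euclid_sim:.3f}")     # euclidean: 0.309
print(f"manhattan: {manhattan_sim:.3f}")  # manhattan: 0.250
```

Cosine treats the two vectors as identical, while both distance-based scores penalize the difference in magnitude. This is one reason a single threshold such as 0.1 may not be equally meaningful across methods.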
Performance Optimization¶
Vectorizer Optimization¶
# For large datasets
search = VectorSimilaritySearch(
    max_features=10000,   # Larger vocabulary
    min_df=5,             # Ignore very rare terms
    max_df=0.7,           # Ignore very common terms
    ngram_range=(1, 2)    # Limit n-gram range
)
# For small datasets
search = VectorSimilaritySearch(
    max_features=1000,    # Smaller vocabulary
    min_df=1,             # Include rare terms
    max_df=0.95,          # Keep most terms
    ngram_range=(1, 3)    # Include trigrams
)
Memory Management¶
# Monitor index size and sparsity
stats = search.get_stats()
print(f"Vector density: {stats['vector_density']:.3f}")
print(f"Number of features: {stats['num_features']}")
# Density as a percentage; lower values mean sparser vectors
print(f"Non-zero entries: {stats['vector_density'] * 100:.1f}%")
# For memory-constrained environments
search = VectorSimilaritySearch(
    max_features=500,     # Reduce vocabulary
    min_df=3,             # Filter rare terms
    ngram_range=(1, 1)    # Only unigrams
)
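The `vector_density` statistic is presumably the fraction of non-zero entries in the TF-IDF matrix. The source does not define it, so treat that reading as an assumption. Under that assumption, the calculation can be sketched with plain lists; `vector_density` below is a stand-alone illustrative function, not the library's implementation.

```python
def vector_density(matrix):
    """Fraction of non-zero entries in a list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


# Two documents, four features, three non-zero weights (made-up values)
matrix = [
    [0.0, 1.2, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.3],
]
print(f"{vector_density(matrix):.3f}")  # 0.375
```

A low density suggests the vectors are sparse, which is typical for TF-IDF and is what makes sparse storage worthwhile.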
Batch Operations¶
# Batch index updates for better performance
class BatchVectorSearch:
    def __init__(self, batch_size=100):
        self.search = VectorSimilaritySearch()
        self.indexed_data = {}      # Everything fitted so far
        self.pending_updates = {}
        self.batch_size = batch_size

    def update_entry(self, key: str, value):
        """Add entry to batch update."""
        self.pending_updates[key] = value
        if len(self.pending_updates) >= self.batch_size:
            self.flush_updates()

    def flush_updates(self):
        """Apply all pending updates."""
        if self.pending_updates:
            # Merge with existing data and refit on the full set;
            # fitting on the pending batch alone would drop earlier entries
            self.indexed_data.update(self.pending_updates)
            self.search.fit(self.indexed_data)
            self.pending_updates.clear()
Integration with Cache¶
Automatic Vector Search¶
import asyncio
from yokedcache import YokedCache, CacheConfig
from yokedcache.backends import MemoryBackend

async def main():
    # Enable vector search in cache configuration
    config = CacheConfig(
        backend=MemoryBackend(),
        enable_fuzzy=True,
        fuzzy_threshold=70,
        vector_search=True,          # Enable vector search
        vector_similarity="cosine"   # Choose similarity method
    )
    cache = YokedCache(config)
    # Search automatically uses vector similarity
    results = await cache.fuzzy_search("python developer")
    return results

# await is only valid inside a coroutine, so run the example via asyncio
asyncio.run(main())
Redis Vector Storage¶
For distributed systems, store vectors in Redis:
import asyncio
import numpy as np
import redis.asyncio as redis
from yokedcache.vector_search import RedisVectorSearch

async def main():
    # Setup Redis vector storage
    redis_client = redis.Redis.from_url("redis://localhost:6379/1")
    vector_store = RedisVectorSearch(
        redis_client,
        vector_key_prefix="vectors:",
        similarity_method="cosine"
    )
    # Store and retrieve vectors
    vector = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
    await vector_store.store_vector("doc:123", vector)
    retrieved = await vector_store.get_vector("doc:123")
    print(f"Vectors match: {np.array_equal(vector, retrieved)}")

# await is only valid inside a coroutine, so run the example via asyncio
asyncio.run(main())
Hybrid Search Strategy¶
from yokedcache.vector_search import VectorSimilaritySearch

async def hybrid_search(cache, query: str, threshold: float = 0.1):
    """Combine traditional fuzzy search with vector search."""
    # Get traditional fuzzy results (fuzzy thresholds use a 0-100 scale)
    fuzzy_results = await cache.fuzzy_search(query, threshold=threshold * 100)

    # Get vector search results
    vector_search = VectorSimilaritySearch()
    all_data = await cache.get_all_data()  # Implement this method
    vector_search.fit(all_data)
    vector_results = vector_search.search(query, all_data, threshold=threshold)

    # Combine and rank results
    combined_results = {}
    # Add fuzzy results, rescaled from 0-100 to 0-1
    for result in fuzzy_results:
        combined_results[result.key] = {
            "value": result.value,
            "fuzzy_score": result.score / 100.0,
            "vector_score": 0.0
        }
    # Add vector results
    for result in vector_results:
        if result.key in combined_results:
            combined_results[result.key]["vector_score"] = result.score
        else:
            combined_results[result.key] = {
                "value": result.value,
                "fuzzy_score": 0.0,
                "vector_score": result.score
            }

    # Calculate combined score (weighted average)
    final_results = []
    for key, data in combined_results.items():
        combined_score = data["fuzzy_score"] * 0.3 + data["vector_score"] * 0.7
        if combined_score >= threshold:
            final_results.append({
                "key": key,
                "value": data["value"],
                "score": combined_score,
                "fuzzy_score": data["fuzzy_score"],
                "vector_score": data["vector_score"]
            })
    # Sort by combined score
    return sorted(final_results, key=lambda x: x["score"], reverse=True)
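The weighted average used above is easy to check by hand. The sketch below isolates that arithmetic in a hypothetical `combined_score` function, using the same 0.3 and 0.7 weights; the input scores are invented.

```python
def combined_score(fuzzy_score, vector_score, fuzzy_weight=0.3, vector_weight=0.7):
    """Weighted average of a 0-100 fuzzy score and a 0-1 vector score."""
    return (fuzzy_score / 100.0) * fuzzy_weight + vector_score * vector_weight


# Found by both methods: 0.8 * 0.3 + 0.5 * 0.7
print(f"{combined_score(80, 0.5):.2f}")  # 0.59

# Found by vector search only, right at a 0.1 threshold: 0.1 * 0.7
print(f"{combined_score(0, 0.1):.2f}")   # 0.07
```

The second case shows a side effect of this design: a result returned by only one method has its score scaled down by that method's weight, so it can pass the per-method threshold and still fall below the same threshold after combination. Using a lower threshold for the combined score is one way to compensate.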
Best Practices¶
1. Data Preparation¶
def prepare_search_data(data):
    """Prepare data for optimal vector search."""
    if isinstance(data, dict):
        # Extract meaningful text fields
        text_fields = []
        for key, value in data.items():
            if isinstance(value, str):
                text_fields.append(f"{key}:{value}")
            elif isinstance(value, list):
                text_fields.append(f"{key}:{','.join(map(str, value))}")
        return " ".join(text_fields)
    return str(data)
2. Index Management¶
from yokedcache.vector_search import VectorSimilaritySearch

class ManagedVectorSearch:
    """Vector search with automatic index management."""

    def __init__(self, rebuild_threshold=1000):
        self.search = VectorSimilaritySearch()
        self.rebuild_threshold = rebuild_threshold
        self.updates_since_rebuild = 0

    def should_rebuild_index(self):
        """Check if index should be rebuilt."""
        return self.updates_since_rebuild >= self.rebuild_threshold

    async def update_and_maybe_rebuild(self, cache_data):
        """Update index and rebuild if necessary."""
        self.updates_since_rebuild += 1
        if self.should_rebuild_index():
            # Refit on the full data set, then reset the counter
            self.search.fit(cache_data)
            self.updates_since_rebuild = 0
3. Error Handling¶
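import logging

logger = logging.getLogger(__name__)

async def safe_vector_search(query: str, cache_data: dict):
    """Vector search with graceful error handling."""
    try:
        # Import inside the try block so a missing optional dependency
        # is caught by the ImportError handler below
        from yokedcache.vector_search import VectorSimilaritySearch
        search = VectorSimilaritySearch()
        search.fit(cache_data)
        return search.search(query, cache_data)
    except ImportError:
        logger.warning("Vector search dependencies not available, skipping")
        return []
    except Exception as e:
        logger.error(f"Vector search failed: {e}")
        return []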
Vector-based similarity search opens up powerful possibilities for intelligent caching and data discovery. By understanding semantic relationships in your data, you can build more intuitive and effective applications.