Can you recommend me a movie? Crafting Recommendations with Vector Databases
By Parminder Singh
Photo by Mollie Sivaram on Unsplash
Imagine you are a software engineer on the Netflix team and need to build a recommendation system.
There are many possible approaches, but let's walk through one of them.
What if we assign a score to each user and each movie, and then recommend the movies whose scores are closest to the user's score?
How do we come up with a score for each user and each movie?
For a movie, we can first define different attributes of the movie and then give a score to each attribute. Consider the following attributes:
- Genre score for Action
- Genre score for Romance
- Genre score for Comedy (and so on)
- Release Year
- Length
- Average Rating
- Popularity Score
- Awards
Let's give each attribute a score between 0 and 1. For example, a score of 0.9 for the second attribute could indicate that the movie is mostly romance, a score of 0.1 for the release year attribute could indicate that the movie is very old, and so on.
With attribute scores in place, we can represent each movie's score with some numbers. For example, a highly popular, recent romantic-comedy movie that's well-liked, has a moderate length, and has received several awards might look something like this: [0, 0.8, 0.9, 0.95, 0.6, 0.85, 0.85, 0.8].
On to the user now. We can represent the user's preferences in a similar way. For example, a user's score might include attributes for their preference for action movies, preference for romantic movies, their interest in recent releases, their ratings for different genres and so on. So something like [0.9, 0.8, 0.2, 0.95, 0.6, 0.85, 0.8, 0.75].
Now to recommend movies to a user, we need to calculate "similarity" between the user's preferences and the attributes of the movies. We'll come to this in a bit.
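Before getting to the details, here is a minimal sketch of the idea so far: score each movie and each user on the same attributes, then rank movies by how close they are to the user. The catalog, titles, and vectors below are made-up illustrations, and cosine similarity (covered later in this post) stands in for "closeness":

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical catalog: title -> attribute vector
# [action, romance, comedy, recency, length, rating, popularity, awards]
movies = {
    "RomCom A": [0.0, 0.8, 0.9, 0.95, 0.6, 0.85, 0.85, 0.8],
    "Action B": [0.9, 0.1, 0.2, 0.7, 0.5, 0.8, 0.9, 0.3],
    "Drama C":  [0.2, 0.6, 0.1, 0.3, 0.7, 0.9, 0.4, 0.9],
}

# The user's preference vector over the same attributes
user = [0.9, 0.8, 0.2, 0.95, 0.6, 0.85, 0.8, 0.75]

# Recommend: rank movies by similarity to the user's preferences
ranked = sorted(movies, key=lambda m: cosine_similarity(user, movies[m]),
                reverse=True)
```

For this made-up user, the action-heavy title ends up first because the user's action preference (0.9) dominates the comparison.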
Vectors
The attributes are referred to as dimensions and the scores are referred to as values in the dimensions. Together, they form a vector. This vector succinctly encapsulates a wealth of information about the movie, enabling a recommendation system to compare it with other movies and with users' preferences to make personalized recommendations. In this example we just took 8 dimensions but in reality, the vectors can have hundreds or thousands of dimensions. Imagine image recognition where each pixel is a dimension and the value is the intensity of the pixel. Or natural language processing where each word is a dimension and the value is the frequency of the word in the document. These high-dimensional vectors are the foundation of many recommendation systems.
Embeddings
The above system has some limitations.
- Sparsity: If we try to represent all possible genres (and sub-genres) as separate dimensions, the vector can become very sparse, meaning a lot of scores in the vector will be simply 0. It can also become high dimensional, making it computationally expensive to work with.
- Correlation: The attributes are not independent. For example, a movie that's a romantic comedy is likely to have high scores for both the romance and comedy attributes. This correlation is not captured in the vector representation.
- Static: The vectors do not evolve based on user interaction or feedback.
Embeddings offer a more nuanced approach by representing movies in a dense, lower-dimensional space learned from data. Instead of manually assigning scores to predefined categories, an embedding model learns to represent movies based on user interactions, such as ratings, reviews and viewing history. This process involves analyzing which movies are watched or liked by similar users, extracting patterns from rating behaviors, and even incorporating textual reviews using NLP techniques.
With this change, the dimensions of this vector now encode patterns and relationships discovered through data. So the above representation will change to:
[dimension 1, dimension 2, ...] = [0.45, -1.2, ...]
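One classic way such embeddings can be learned is matrix factorization: represent each user and each movie as a small vector, and adjust both so their dot product approximates observed ratings. This is a toy sketch with made-up ratings, not how Netflix actually does it:

```python
import random

random.seed(0)

# Toy observed ratings: (user_index, movie_index, rating scaled to 0-1)
ratings = [
    (0, 0, 1.0), (0, 1, 0.2), (1, 0, 0.9),
    (1, 2, 0.3), (2, 1, 0.8), (2, 2, 1.0),
]
n_users, n_movies, k = 3, 3, 2  # k = number of embedding dimensions

# Start every user and movie embedding at small random values
user_vecs = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
movie_vecs = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_movies)]

def predict(u, m):
    """Predicted rating = dot product of user and movie embeddings."""
    return sum(a * b for a, b in zip(user_vecs[u], movie_vecs[m]))

def mse():
    return sum((r - predict(u, m)) ** 2 for u, m, r in ratings) / len(ratings)

before = mse()
lr = 0.1
for _ in range(2000):
    # Stochastic gradient descent: pick one rating and nudge both
    # embeddings so the dot product moves toward the observed value
    u, m, r = random.choice(ratings)
    err = r - predict(u, m)
    for d in range(k):
        user_vecs[u][d], movie_vecs[m][d] = (
            user_vecs[u][d] + lr * err * movie_vecs[m][d],
            movie_vecs[m][d] + lr * err * user_vecs[u][d],
        )
after = mse()
```

Notice that nothing tells the model what "romance" means; the k dimensions come out as whatever patterns best explain who rated what, which is exactly the shift from hand-assigned attributes to learned embeddings.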
Advantages
- Relationships: Movies that are similar in terms of user preferences and behaviors will be closer in the embedding space, even if they differ on surface-level features. This allows the system to recommend movies that are more likely to match the user's tastes.
- Efficiency and Dimensionality: Embeddings are dense, meaning they pack a lot of information into a relatively low number of dimensions, making the system more efficient and scalable.
- Dynamic and Adaptive: Embeddings can be continuously updated based on new user data, making the recommendation system more responsive to changing trends and preferences.
Converting data to embeddings
Converting data to embeddings depends on the type of data, but the general approach is as follows:
- Preprocessing: Tailor preprocessing steps to the data type. E.g. text extraction for PDFs/Word documents by using tools like Apache Tika, image resizing and normalization for images, etc.
- Feature Extraction: Use appropriate models or techniques to extract features from the preprocessed data. E.g. CNNs for images.
- Embedding Generation: Convert the extracted features into dense, lower-dimensional vectors. E.g. for text data, vectorization can be done using TF-IDF, BERT embeddings, etc.
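To make the text path concrete, here is a toy TF-IDF vectorizer written from scratch on three invented movie blurbs. A real pipeline would use a library such as scikit-learn or a pretrained embedding model instead:

```python
import math
from collections import Counter

# Hypothetical mini-corpus of movie descriptions
docs = [
    "space battle with laser swords",
    "romantic comedy in paris",
    "comedy about a space crew",
]

# Preprocessing + vocabulary: one dimension per distinct word
vocab = sorted({w for d in docs for w in d.split()})

def tfidf(doc):
    """Map a document to a vector of TF-IDF weights over the vocabulary."""
    tf = Counter(doc.split())
    vec = []
    for w in vocab:
        df = sum(1 for d in docs if w in d.split())  # document frequency
        idf = math.log(len(docs) / df) if df else 0.0
        vec.append(tf[w] * idf)
    return vec

vectors = [tfidf(d) for d in docs]
```

Each description is now a vector in the same space, so the similarity measures described below apply directly. Note these TF-IDF vectors are sparse by construction; learned embeddings would compress them into far fewer, denser dimensions.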
Vector databases
A vector database is a specialized database optimized for storing and querying high-dimensional vectors. Its architecture is designed to handle such data efficiently, supporting storage, search, and retrieval at scale.
The following architectural components are typically found in a vector database:
- Storage Engine: Persists vectors and their associated metadata in a layout optimized for high-dimensional data.
- Indexing Mechanism: Organizes vectors so similarity searches don't have to scan the entire dataset. Tree-based indexes such as KD-trees and R-trees work well at low dimensionality, while approximate nearest-neighbor structures such as HNSW graphs, inverted file (IVF) indexes, and locality-sensitive hashing are the usual choice at high dimensionality; all of them partition the vector space to reduce the search space for queries.
- Query Processor & Similarity Search: Handles parsing and execution of queries against the vector database. It interprets the query parameters, executes the search using the indexing mechanism, and retrieves the relevant vectors based on the specified similarity metrics. Cosine similarity, Euclidean distance, and other similarity measures are commonly used to compare vectors.
- API Layer: Exposes operations such as inserting vectors, updating metadata, and executing similarity searches.
- Scalability and Distribution Mechanisms: Partitioning datasets across multiple nodes, replicating data for fault tolerance, and balancing query loads across the cluster.
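To make these components concrete, here is a deliberately tiny in-memory vector store. It has a storage layer, metadata, and a query path, but uses a brute-force scan where a real database would use an index like HNSW or IVF; the class name, IDs, and vectors are all invented for illustration:

```python
from math import sqrt

class TinyVectorStore:
    """Toy vector store: insert, metadata, brute-force top-k search."""

    def __init__(self):
        self._vectors = {}   # id -> vector (the "storage engine")
        self._metadata = {}  # id -> metadata dict

    def insert(self, vec_id, vector, metadata=None):
        self._vectors[vec_id] = vector
        self._metadata[vec_id] = metadata or {}

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

    def search(self, query, k=3):
        # Brute-force scan of every vector; an ANN index would
        # narrow this to a small candidate set instead
        scored = sorted(
            ((self._cosine(query, v), vid) for vid, v in self._vectors.items()),
            reverse=True,
        )
        return [(vid, score, self._metadata[vid]) for score, vid in scored[:k]]

store = TinyVectorStore()
store.insert("m1", [0.9, 0.1], {"title": "Action B"})
store.insert("m2", [0.1, 0.9], {"title": "RomCom A"})
results = store.search([0.8, 0.2], k=1)
```

A query vector leaning toward the first dimension retrieves `m1` here; swapping the brute-force scan for a proper index is what separates this sketch from a production system.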
Calculating similarity
Some common similarity measures used in recommendation systems are:
- Cosine Similarity: Measures the cosine of the angle between two vectors, with values closer to 1 indicating greater similarity. This method is particularly effective in high-dimensional spaces and is widely used in recommendation systems and information retrieval.
- Euclidean Distance: The geometric distance between two points in multidimensional space. In the context of recommendation systems, smaller distances indicate greater similarity.
- Pearson Correlation: Measures the linear correlation between two vectors, with values close to 1 indicating a strong positive correlation.
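All three measures are a few lines each; here they are side by side on a pair of made-up four-dimensional vectors:

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between vectors; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def euclidean(a, b):
    """Straight-line distance; 0 means identical vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Linear correlation; 1 means a perfect positive linear relationship."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Hypothetical user and movie vectors over the same four attributes
user = [0.9, 0.8, 0.2, 0.95]
movie = [0.8, 0.9, 0.1, 1.0]
```

Note the practical difference: cosine ignores vector magnitude (only direction matters), Euclidean distance is sensitive to it, and Pearson additionally centers each vector, making it robust to one user consistently rating everything higher than another.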
Vendors
Some of the popular vector databases are Pinecone, Milvus, Weaviate, Qdrant, and Chroma; general-purpose databases such as PostgreSQL (via the pgvector extension), Redis, and Elasticsearch also offer vector search.
Conclusion
I know we started with building a recommendation system, but the real goal was to explore embeddings, vector databases, and their use cases. These concepts are not limited to recommendation systems; they are widely used in domains such as image recognition, natural language processing, semantic search, and anomaly detection. The ability to represent complex, high-dimensional data in a dense, lower-dimensional space is a fundamental technique in machine learning. Let me know your thoughts.