Content Based Filtering

Aatish Sai
Nov 10, 2016

A content-based recommendation system makes recommendations from the description of each item and a profile of the user's interests. In other words, content-based filtering relies on the correlation between an item's content and the user's preferences. The basic assumption is that a user will rate items with similar features similarly. Unlike the collaborative filtering approach, content-based filtering requires additional information about each product. The user profile can be created manually or generated automatically from the user's activity. As the user provides more input (ratings), the engine becomes more accurate.

To understand how the user profile is generated from user activity, some movie titles and their genres were taken from IMDb, as shown in the table below.

Movie Sample with genres

Here we use a binary representation. A 1 or 0 under a genre in the table indicates whether or not the movie has that feature (action, drama, etc.). For example, The Shawshank Redemption is tagged as drama and crime.
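The binary encoding can be sketched in a few lines of Python. This is only an illustration: the genre list and the two "Example Movie" rows are made up for the sketch, and only the genres of The Shawshank Redemption come from the text above.

```python
# Assumed genre order for the sketch; the article's table may use more genres.
GENRES = ["action", "crime", "drama"]

movie_genres = {
    "The Shawshank Redemption": {"drama", "crime"},  # from the article
    "Example Movie A": {"action"},                   # hypothetical entry
    "Example Movie B": {"action", "crime"},          # hypothetical entry
}

def binary_vector(genres_of_movie, all_genres=GENRES):
    """Return a list of 1/0 flags, one per genre, in the order of all_genres."""
    return [1 if genre in genres_of_movie else 0 for genre in all_genres]

# One binary row per movie, mirroring the layout of the genre table.
binary_table = {title: binary_vector(g) for title, g in movie_genres.items()}

for title, row in binary_table.items():
    print(title, row)
```

Keeping the genre order fixed in one list means every movie vector lines up column by column, which the later steps rely on.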

Attribute Count for Movies

Attribute Count is the total number of attributes (genres) a movie has. There are two users, and for simplicity each movie is rated 1 (like), -1 (dislike), or 0 (not rated). The concepts of term frequency (TF) and inverse document frequency (IDF) are used in content-based filtering mechanisms.

In the second step, we normalize. For the binary representation, each term occurrence (1/0) is divided by the square root of the number of attributes the movie has. For example, The Shawshank Redemption has two attributes, so each attribute it has gets the normalized value 1/sqrt(2) = 0.7071.
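As a rough sketch of this normalization step, the function below divides each 1/0 entry by the square root of the movie's attribute count. The genre order `[action, crime, drama]` and the zero-attribute guard are assumptions added for the example.

```python
import math

def normalize(binary_vec):
    """Divide each 1/0 entry by sqrt(number of attributes the movie has)."""
    attribute_count = sum(binary_vec)
    if attribute_count == 0:
        # Assumed handling: a movie with no attributes stays all zeros.
        return [0.0] * len(binary_vec)
    return [x / math.sqrt(attribute_count) for x in binary_vec]

# The Shawshank Redemption in the assumed order [action, crime, drama].
shawshank = [0, 1, 1]
print([round(x, 4) for x in normalize(shawshank)])  # [0.0, 0.7071, 0.7071]
```

After this step every movie vector has unit length, so a movie with many genres does not outweigh a movie with only one.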

Normalized Table

In the third step, we generate the user profile: for each attribute, we sum the products of each movie's normalized value and the user's rating for that movie. The result is a user profile that shows the user's level of preference for each attribute.
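The profile calculation might look like the following minimal sketch. The ratings and the "Example Movie A" row are hypothetical, chosen only so that the sums are easy to follow by hand.

```python
import math

GENRES = ["action", "crime", "drama"]  # assumed genre order

# Normalized movie vectors; only the Shawshank row follows the article.
s = 1 / math.sqrt(2)
normalized_movies = {
    "The Shawshank Redemption": [0.0, s, s],
    "Example Movie A": [1.0, 0.0, 0.0],  # hypothetical entry
}

# Hypothetical ratings for one user: 1 = like, -1 = dislike, 0 = not rated.
ratings = {"The Shawshank Redemption": 1, "Example Movie A": -1}

def user_profile(movies, user_ratings):
    """For each genre, sum rating * normalized value over all movies."""
    n_genres = len(next(iter(movies.values())))
    profile = [0.0] * n_genres
    for title, vec in movies.items():
        rating = user_ratings.get(title, 0)  # unrated movies contribute nothing
        for i, value in enumerate(vec):
            profile[i] += rating * value
    return profile

profile = user_profile(normalized_movies, ratings)
print(dict(zip(GENRES, (round(p, 4) for p in profile))))
```

With these made-up ratings the profile is negative for action and positive for crime and drama, the same kind of pattern the article later describes for user 1.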

User Profile Score

Data Frequency and Inverse Document Frequency

Finally, the prediction score for each movie is generated by summing, over all attributes, the product of the movie vector, the user profile vector, and the inverse document frequency.
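A hedged sketch of this last step follows. The article does not state which IDF formula it uses, so the common form log10(N / DF) is assumed here; the movies and the user profile are the same kind of hypothetical values as before, not the article's data.

```python
import math

GENRES = ["action", "crime", "drama"]  # assumed genre order

binary_table = {
    "The Shawshank Redemption": [0, 1, 1],  # genres from the article
    "Example Movie A": [1, 0, 0],           # hypothetical entry
    "Example Movie B": [1, 1, 0],           # hypothetical entry
}

def idf_weights(table):
    """Assumed formula: idf = log10(N / DF), with DF = movies having the genre."""
    n_movies = len(table)
    n_genres = len(next(iter(table.values())))
    weights = []
    for i in range(n_genres):
        df = sum(vec[i] for vec in table.values())
        weights.append(math.log10(n_movies / df) if df else 0.0)
    return weights

def normalize(vec):
    """Divide each 1/0 entry by sqrt(attribute count), as in the second step."""
    count = sum(vec)
    return [x / math.sqrt(count) if count else 0.0 for x in vec]

def prediction_score(movie_vec, profile, idf):
    """Sum over genres of movie value * user profile value * idf weight."""
    return sum(m * p * w for m, p, w in zip(movie_vec, profile, idf))

# Hypothetical profile: dislikes action, likes crime and drama.
profile = [-1.0, 0.7071, 0.7071]
idf = idf_weights(binary_table)
scores = {
    title: prediction_score(normalize(vec), profile, idf)
    for title, vec in binary_table.items()
}
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score:.4f}")
```

The IDF weight makes genres that appear in fewer movies count for more, so a match on a rare genre moves the score further than a match on a common one.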

Prediction Score

Here we can see that user 1 is more interested in watching drama and crime movies and doesn't like action movies. The prediction scores obtained from the experiment thus match user 1's preferences. Similarly, the calculations hold up reasonably well for user 2.

Thus, using this method, the system maintains a user profile whose attribute values adjust as the user rates more movies. The same procedure can be applied to any number of movies to obtain prediction scores and recommend movies to users.

Useful Resources

Van Meteren, R., & Van Someren, M. (2000, May). Using content-based filtering for recommendation. In Proceedings of the Machine Learning in the New Information Age: MLnet/ECML2000 Workshop (pp. 47–56).

recommeder-systems.org/content-based-filtering

Mooney, R. J., & Roy, L. (2000, June). Content-based book recommending using learning for text categorization. In Proceedings of the fifth ACM conference on Digital libraries (pp. 195–204). ACM.
