In this post I will introduce three metrics widely used for evaluating the utility of the recommendations produced by a recommender system: Precision, Recall and the F1 Score. The F1 Score is slightly different from the other two, since it is a measure of a test's accuracy and considers both the precision and the recall of the test to compute the final score.
Introduction to Recommender Systems Evaluation
Recommender systems are a sub-domain of information filtering techniques that attempt to recommend items such as movies, TV programs, music, books and web pages that are likely to be of interest to the user. The final product of a recommender system is a top list of items recommended for the user, ordered by an estimated score that represents the user's predicted preference for each item: the higher the value, the more interested the user is expected to be. But producing the right recommendations is not trivial, and there is considerable research on evaluating the recommendations of such engines.
Therefore, when you develop a recommender engine for your problem domain, the main question after all the work is done is: "What are the best recommendations for the user?" Before looking for answers, we should examine the question itself. What exactly do we mean by a good recommendation? And how will we know when the recommender system is producing them? The remainder of this post explains how we can evaluate such engines so that we provide the best possible recommender, one that recommends items the user hasn't yet seen or expressed any preference for. An optimal recommender system would be one that could predict all your preferences exactly, present a set of items ranked by your future preference, and be done.
For that reason, most recommender engines operate by trying to do just that: estimating ratings for some or all unseen items. One possible way of evaluating a recommender's suggestions is to evaluate the quality of its estimated preference values, that is, to evaluate how closely the estimated preferences match the actual preferences of the user.
Introducing Precision and Recall Metrics
In recommender systems, the most important result for the end user is an ordered list of recommendations, from best to worst. In fact, in some cases the user doesn't care much about the exact ordering of the list: a set of a few good recommendations is fine. Taking this into account, we can apply classic information retrieval metrics to evaluate these engines: Precision and Recall. These metrics are widely used in information retrieval and applied to domains such as search engines, which return some set of best results for a query out of many possible results.
For a search engine, for example, it should not return irrelevant results in the top results, although it should return as many relevant results as possible. We could say that 'Precision' is the proportion of top results that are relevant, given some definition of relevance for your problem domain. So 'Precision at 10' would be this proportion judged from the top 10 results. 'Recall' measures the proportion of all relevant results included in the top results. See Figure 1 for an illustration of these metrics.
|Precision and Recall in the context of Search Engines|
More formally, we can consider documents as instances and the task as returning a set of relevant documents given a search term; that is, the task is to assign each document to one of two categories: relevant and not relevant. Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by the search.
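To make these definitions concrete, here is a minimal Python sketch of both measures over sets of document IDs (the IDs and the notion of 'relevant' are made up for illustration):

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Hypothetical search: 4 documents returned, 6 are truly relevant.
retrieved = ['d1', 'd2', 'd3', 'd4']
relevant = ['d1', 'd3', 'd5', 'd6', 'd7', 'd8']
print(precision(retrieved, relevant))  # 0.5  (2 of the 4 retrieved are relevant)
print(recall(retrieved, relevant))     # ~0.333  (2 of the 6 relevant were retrieved)
```

Note that the two measures pull in different directions: returning fewer, safer results tends to raise precision at the cost of recall, and vice versa.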
In recommender systems those metrics can be adapted as follows:
"The precision is the proportion of recommendations that are good recommendations, and recall is the proportion of good recommendations that appear in top recommendations."
In the recommendations domain, a perfect precision score of 1.0 means that every item recommended in the list was good (although it says nothing about whether all good recommendations were suggested), whereas a perfect recall score of 1.0 means that all good items were suggested in the list (although it says nothing about how many bad recommendations were also in the list).
Running an Example
To show an example of evaluating a recommender system, let's test-drive a 'slope-one' recommender on a simple data set fetched from Apontador, a Brazilian location-based social networking website and mobile application. Its main goal is to help people interact with their friends and update their location using GPS-enabled mobile devices, such as iPhones and BlackBerries. I fetched the sample data set using the Apontador API by writing a simple Python crawler for accessing their data (the code I used is based on the Python library developed by Apontador). The output is a comma-delimited file with 3463 samples composed of user IDs, location IDs (places) and ratings (preference values in the range 1 to 5) for those locations. In the figure below you can see the structure of this file. Unfortunately, for privacy reasons I can't show the respective usernames and place names for each rating.
|234 UserIDs, 1599 PlaceIDs, 3463 Ratings for the Apontador Data Set|
After some pre-processing for the experiments, I derived the rating distribution below. This plot is a heat map illustrating the distribution of the ratings given by users (rows) to places (columns). The black areas represent the absence of a rating for that particular place from that user.
|Ratings Matrix - Users x Items|
To evaluate the accuracy of the recommender engine we need training data and test data. Generally this must be simulated, since for new items we cannot know for sure how the user will like them in the future. For this, the data mining researcher sets aside a small part of the real data set as test data. These test preferences are not present in the training data fed into the recommender under evaluation, which is all the data except the test data. The recommender is then asked to estimate preferences for the missing test data, and the estimated scores are compared to the actual values.
Based on this data set split, it is simple to produce a kind of 'score' for the recommender. For instance, we could compute the average difference between the estimated and actual preferences. This relates to another popular metric used to evaluate classifiers, the root-mean-square error (RMSE): the square root of the average of the squared differences between actual and estimated preference values. The better recommender is the one with the lower RMSE, because that means its estimates differed less from the actual preference values; 0.0 would mean perfect estimation, with no difference at all between the estimates and the actual values.
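The metric itself is a one-liner; here is a minimal sketch in Python (the three actual/estimated rating pairs are made-up values on the post's 1-to-5 scale):

```python
import math

def rmse(actual, estimated):
    """Root-mean-square error between actual and estimated preference values."""
    diffs = [(a - e) ** 2 for a, e in zip(actual, estimated)]
    return math.sqrt(sum(diffs) / len(diffs))

# Hypothetical ratings for three places; the second estimate misses badly.
actual = [3.0, 5.0, 4.0]
estimated = [3.5, 2.0, 5.0]
print(rmse(actual, estimated))  # ≈ 1.85: the large miss on the second place dominates
```

Because the differences are squared before averaging, a single estimate that is off by 3 stars contributes far more to the score than three estimates off by 1 star each.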
In the figure above, the table illustrates the difference between a set of actual and estimated preferences, and how they are translated into scores. RMSE heavily penalizes estimates that are far off, such as place 2 there, and this is considered desirable by some: when you estimate an item at 2 stars whose actual preference is 5 stars, the error is probably more than twice as 'bad' as an estimate that is off by just 1 star.
We then run our recommender system using the training set as input and compare its estimated preferences against the actual test data. The main code for this task can be found at my personal repository at GitHub (Crab Recommender System). We will use 70% of the data to train and test with the other 30%, with the sets chosen randomly.
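The random 70/30 split could be sketched as below. This is not the actual Crab code; the `split_dataset` helper and the (user, place, rating) triples are hypothetical stand-ins for the Apontador file:

```python
import random

def split_dataset(ratings, train_fraction=0.7, seed=42):
    """Randomly split (user, item, rating) triples into training and test sets."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample in the same shape as the comma-delimited file.
ratings = [('u1', 'p1', 4), ('u1', 'p2', 3), ('u2', 'p1', 5),
           ('u2', 'p3', 2), ('u3', 'p2', 1), ('u3', 'p3', 4),
           ('u4', 'p1', 3), ('u4', 'p4', 5), ('u5', 'p2', 2), ('u5', 'p4', 4)]
train, test = split_dataset(ratings)
print(len(train), len(test))  # 7 3
```

Without a fixed seed, each run produces a different split, which is exactly why the scores reported below may vary from run to run.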
The run produces three scores. The first one indicates how well the recommender (slope-one) performed; in this case we use the Root Mean Square Error (RMSE). The value 1.04 is not great, but there is little data here to begin with. Your results may differ, since the data set is split randomly, and hence the training and test sets may differ with each run.
The second and the third ones are, respectively, the precision and the recall. Precision at 20 is 0.1722: on average, about 3 of the top 20 recommendations were 'good'. Recall at 20 is also 0.1722: on average, only about 17% of the good recommendations appeared among those top 20. But we still haven't decided what exactly a 'good' recommendation is. Intuitively, the most highly preferred items in the test set are the good recommendations, and the rest aren't.
Let's take a look at the preferences of user 2452498672 in our simple data set. Suppose we hold out as test data the preferences for items A, B and C, with preference values 4, 3 and 2 respectively. With these values missing from the training data, we expect the recommender engine to recommend A before B, and B before C, because we know the order in which user 2452498672 prefers those items. But how will we know which items are a good idea to recommend? Consider item C, which the user doesn't seem to like much, or item B, which is just average. We could say that A is a good recommendation whereas B and C are valid, but not good, recommendations. So we conclude that it is important to define a threshold that divides good recommendations from bad ones. If we choose not to pick one explicitly, we could use a statistic such as the user's average rating plus one standard deviation, for example.
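That statistical threshold is easy to compute; here is a minimal sketch, using the hypothetical ratings 4, 3, 2 for items A, B and C from the example above:

```python
import statistics

def relevance_threshold(user_ratings):
    """A rating counts as 'good' if it is at least mean + one standard deviation."""
    mean = statistics.mean(user_ratings)
    std = statistics.pstdev(user_ratings)  # population standard deviation
    return mean + std

# User 2452498672's held-out ratings for items A, B, C.
ratings = [4, 3, 2]
threshold = relevance_threshold(ratings)
print(threshold)  # ≈ 3.82, so only item A (rated 4) counts as a good recommendation
```

With this rule the threshold adapts to each user: a user who rates everything 5 stars needs a higher bar for 'good' than a user whose ratings are spread out.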
If we vary the number of recommended items, we can plot these two measures together (precision x recall) to assess to what extent the good results blend with bad results. The goal is to get the precision and recall values of our recommendation lists both close to one. The figure below shows this plot for the slope-one recommender: for each list of N recommendations, we enter a point that corresponds to the precision and recall values of that list. Good precision-recall points are located in the upper-right corner of the graph, because we want both high precision and high recall. These plots are extremely useful for comparing different approaches on a particular data set.
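Each point of such a plot can be computed as sketched below; the ranked list and the set of 'good' items are hypothetical, standing in for one user's recommendations and the items above that user's relevance threshold:

```python
def precision_recall_at_n(ranked_items, good_items, n):
    """Precision and recall for the top-n positions of a ranked recommendation list."""
    top_n = ranked_items[:n]
    hits = len(set(top_n) & set(good_items))
    return hits / n, hits / len(good_items)

# Hypothetical ranked recommendations and 'good' items for one user.
ranked = ['p7', 'p2', 'p9', 'p1', 'p5', 'p3']
good = {'p2', 'p1', 'p3'}
for n in (1, 3, 6):
    p, r = precision_recall_at_n(ranked, good, n)
    print(n, round(p, 2), round(r, 2))
# 1 0.0 0.0
# 3 0.33 0.33
# 6 0.5 1.0
```

Note the typical trade-off as N grows: recall can only increase (more good items get a chance to appear), while precision tends to fall as the list fills with weaker suggestions.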
Looking at the figure we can conclude that the slope-one approach is not particularly effective for this data set, maybe because of the small number of samples available to train and test the recommender, or because of the characteristics of the data set itself.
Explaining the F1 Score
The F-Score or F-measure is a measure of a statistical test's accuracy. It considers both the precision and the recall of the test to compute the score. We can interpret it as a weighted average (the harmonic mean) of precision and recall, where the best F1 score is 1 and the worst is 0.
|F-Score Formula (Image from Wikipedia)|
In the recommendations domain, it is a single value obtained by combining the precision and recall measures, indicating the overall utility of the recommendation list.
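The formula in the figure above translates directly into code; a minimal sketch:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 1.0 is best, 0.0 is worst."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# When precision and recall are equal, F1 equals them both.
print(round(f1_score(0.1722, 0.1722), 4))  # 0.1722
```

This also explains the result reported below: since precision at 20 and recall at 20 were both 0.1722 in our run, the F1 score comes out at exactly the same value.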
Running an Example
Running our slope-one recommender again on our simple data set, we get an F1 score of 0.1722.
As we can see from the result, the recommender did not perform well on this data set. This may be due to the small amount of data or to particular characteristics of the data set that are not appropriate for this type of algorithm. More robust tests would be necessary to assess its performance and compare it with other approaches. One of the particular features of the slope-one recommender is that it can produce quick recommendations at runtime with a simple algorithm, but it takes significant time to pre-compute its internal structures before it can start.
The figure below presents our plot combining the length of the recommendation lists (represented by 'at'), using our training and test data sets, and the F-scores estimated for each length. The best approaches are expected to have F-scores near 1.0.
|F1 Score Graph, the best are with score f near to 1.0 (top right of the graph)|
Conclusions and Contribution
For the slope-one recommender, the values produced for precision and recall are interesting and illustrate the power of a simple collaborative filtering technique that involves just a little bit of linear algebra. You can learn more about the slope-one recommender system here. Soon I will also talk more about this technique in a dedicated post. What is important to take from these evaluations is that you have to choose carefully which recommendation algorithm to use. Each algorithm has its own characteristics and properties that can interact in subtle ways with a given data set. So it is essential to evaluate as many variations of collaborative, content-based and hybrid algorithms as possible to find out which one is faster or more accurate for the data set we want to work with.
Another important observation related to precision and recall tests is how we define what a 'good' recommendation is. As we have seen, we must define a threshold to represent the usefulness of a recommended item, and a poor choice could really hurt the evaluation of the recommender algorithm. Furthermore, those tests can be problematic if we consider recommendations that aren't necessarily among the items the user already knows about. Imagine running such a test for a user who would love the restaurant 'Recanto Paraibano'. Of course it would be a great recommendation for this user, but the user has never heard of this restaurant before. If a recommender actually returned this restaurant when recommending new places to go, it would be penalized, since the test framework can only pick good recommendations from among those already in the user's set of preferences. This is a discovery problem, a harder task for recommender engines, and a special topic approached by several researchers in the literature.
The problem is further complicated if the preferences are boolean, such as 'liked' or 'disliked', and don't contain preference values. This means the test has no notion of relative preference with which to select a subset of good items; the best it can do is randomly select some 'liked' items as the good ones. Despite all these problems the test is of great use, but it is not perfect, so you must understand its limitations in the context of the features of the data set available to you.
In this post I presented some metrics used to evaluate the quality of the recommendations produced by a recommender engine, and illustrated them through examples using a real data set from the Apontador social network. All the code used here is provided at my personal repository at GitHub and is written in Python.
Finally, I conclude that evaluation is really important in the recommendation engine building process, and can be used to empirically discover improvements to a recommendation algorithm.
I hope you have enjoyed this post. Below are the references used to write it.
Herlocker et al., Evaluating Collaborative Filtering Recommender Systems, 2004.
Wikipedia, Precision and Recall.
Bryan Sullivan's Blog, Collaborative Filtering made easy, 2006.
Apontador, Apontador API.