FAQ¶

Here you will find the frequently asked questions, along with some use-case examples that are not part of the User Guide.

How to get the top-N recommendations for each user¶

Here is an example where we retrieve the top-10 items with the highest rating prediction for each user in the MovieLens-100k dataset. We first train an SVD algorithm on the whole dataset, and then predict the ratings for all the (user, item) pairs that are not in the training set. We then retrieve the top-10 predictions for each user.

From file examples/top_n_recommendations.py¶

from collections import defaultdict

from surprise import Dataset, SVD


def get_top_n(predictions, n=10):
    """Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size at most n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
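
The sort-then-slice step can also be written with heapq.nlargest, which avoids fully sorting each user's list when only a few items are needed. The following is a minimal sketch on hand-made tuples that stand in for Prediction objects, so it runs without training a model. The Pred namedtuple, the get_top_n_heap name, and the sample values are illustrative assumptions, not part of Surprise.

```python
import heapq
from collections import defaultdict, namedtuple

# Stand-in for Surprise's Prediction tuples (same field order); hypothetical data.
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est", "details"])

predictions = [
    Pred("alice", "m1", None, 4.2, {}),
    Pred("alice", "m2", None, 3.1, {}),
    Pred("alice", "m3", None, 4.8, {}),
    Pred("bob", "m1", None, 2.5, {}),
    Pred("bob", "m4", None, 3.9, {}),
]


def get_top_n_heap(predictions, n=10):
    """Return {uid: [(iid, est), ...]} with the n highest estimates per user."""
    by_user = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        by_user[uid].append((iid, est))
    # nlargest returns the items in descending order of the key.
    return {
        uid: heapq.nlargest(n, ratings, key=lambda x: x[1])
        for uid, ratings in by_user.items()
    }


top_2 = get_top_n_heap(predictions, n=2)
print(top_2["alice"])  # [('m3', 4.8), ('m1', 4.2)]
print(top_2["bob"])  # [('m4', 3.9), ('m1', 2.5)]
```

The result has the same shape as the dict returned by get_top_n above, so the printing loop works unchanged.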

How to compute precision@k and recall@k¶

Here is an example where we compute Precision@k and Recall@k for each user:

\(\text{Precision@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Recommended items} \} | }\)

\(\text{Recall@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Relevant items} \} | }\)

An item is considered relevant if its true rating \(r_{ui}\) is greater than or equal to a given threshold. An item is considered recommended if its estimated rating \(\hat{r}_{ui}\) is greater than or equal to the threshold, and if it is among the k highest estimated ratings.

Note that in the edge cases where division by zero occurs, Precision@k and Recall@k values are undefined. As a convention, we set their values to 0 in such cases.
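
As a small worked example, the two quantities can be computed by hand for a single user. This sketch uses the same comparisons as the code in examples/precision_recall_at_k.py (ratings at or above the threshold count); the rating pairs, k and threshold are made-up values chosen for illustration.

```python
# (estimated rating, true rating) pairs for one user; hypothetical values.
user_ratings = [
    (3.8, 4.0),
    (4.5, 5.0),
    (2.0, 4.0),
    (4.0, 3.0),
    (1.5, 2.0),
    (3.6, 4.5),
]
k, threshold = 3, 3.5

# Rank by estimated rating and keep the k best.
user_ratings.sort(key=lambda x: x[0], reverse=True)
top_k = user_ratings[:k]

# Relevant items are counted over all of the user's ratings.
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
# Recommended items are counted within the top k only.
n_rec_k = sum(est >= threshold for est, _ in top_k)
n_rel_and_rec_k = sum(
    est >= threshold and true_r >= threshold for est, true_r in top_k
)

precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall = n_rel_and_rec_k / n_rel if n_rel else 0

print(n_rel, n_rec_k, n_rel_and_rec_k)  # 4 3 2
print(round(precision, 3), recall)  # 0.667 0.5
```

Here three items are recommended, two of which are relevant, out of four relevant items overall, giving Precision@3 = 2/3 and Recall@3 = 2/4.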

From file examples/precision_recall_at_k.py¶

from collections import defaultdict

from surprise import Dataset, SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant.
        # When n_rec_k is 0, precision is undefined; we set it to 0 here.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended.
        # When n_rel is 0, recall is undefined; we set it to 0 here.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls


data = Dataset.load_builtin("ml-100k")
kf = KFold(n_splits=5)
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print(sum(precisions.values()) / len(precisions))
    print(sum(recalls.values()) / len(recalls))

How to get the k nearest neighbors of a user (or item)¶

You can use the get_neighbors() method of the algorithm object. This is only relevant for algorithms that use a similarity measure, such as the k-NN algorithms.

Here is an example where we retrieve the 10 nearest neighbors of the movie Toy Story from the MovieLens-100k dataset. The output is:

The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)

There’s a lot of boilerplate because of the conversions between movie names and their raw/inner ids (see this note), but it all boils down to the use of get_neighbors():

From file examples/k_nearest_neighbors.py¶

from surprise import Dataset, get_dataset_dir, KNNBaseline


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + "/ml-100k/ml-100k/u.item"
    rid_to_name = {}
    name_to_rid = {}
    with open(file_name, encoding="ISO-8859-1") as f:
        for line in f:
            line = line.split("|")
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()
sim_options = {"name": "pearson_baseline", "user_based": False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid["Toy Story (1995)"]
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (
    algo.trainset.to_raw_iid(inner_id) for inner_id in toy_story_neighbors
)
toy_story_neighbors = (rid_to_name[rid] for rid in toy_story_neighbors)

print()
print("The 10 nearest neighbors of Toy Story are:")
for movie in toy_story_neighbors:
    print(movie)

Naturally, the same can be done for users with minor modifications.

How to serialize an algorithm¶

Prediction algorithms can be serialized and loaded back using the dump() and load() functions. Here is a small example where the SVD algorithm is trained on a dataset and serialized. It is then reloaded and can be used again for making predictions:

From file examples/serialize_algorithm.py¶

import os

from surprise import Dataset, dump, SVD


data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser("~/dump_file")
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print("Predictions are the same")

Algorithms can be serialized along with their predictions, so that they can be further analyzed or compared with other algorithms, using pandas dataframes. Some examples are given in the two following notebooks:

Dumping and analysis of the KNNBasic algorithm.

Comparison of two algorithms.

How to build my own prediction algorithm¶

There’s a whole guide here.

What are raw and inner ids¶

Users and items have a raw id and an inner id. Some methods use or return a raw id (e.g. the predict() method), while others use or return an inner id.

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can be strings or numbers. Note, though, that if the ratings were read from a file, which is the standard scenario, they are represented as strings. This is important to know if you’re using e.g. predict() or other methods that accept raw ids as parameters.

On trainset creation, each raw id is mapped to a unique integer called inner id, which is a lot more suitable for Surprise to manipulate. Conversions between raw and inner ids can be done using the to_inner_uid(), to_inner_iid(), to_raw_uid(), and to_raw_iid() methods of the trainset.
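
The idea behind the mapping, and the string-versus-number pitfall, can be illustrated in plain Python. This is a conceptual sketch only, not Surprise's actual implementation: the raw_to_inner and inner_to_raw dicts are hypothetical names, and the order in which inner ids are assigned here is an assumption.

```python
# Raw user ids as they would look after being read from a ratings file:
# strings, possibly repeated (one entry per rating). Hypothetical values.
raw_user_ids = ["196", "186", "22", "196", "244"]

# Give each new raw id the next free integer, in order of first appearance.
raw_to_inner = {}
for raw_id in raw_user_ids:
    if raw_id not in raw_to_inner:
        raw_to_inner[raw_id] = len(raw_to_inner)

# Reverse mapping, to go back from inner ids to raw ids.
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}

print(raw_to_inner)  # {'196': 0, '186': 1, '22': 2, '244': 3}
print(inner_to_raw[2])  # 22

# The integer 196 is not the string "196": a lookup with the wrong type
# does not find the user.
print(196 in raw_to_inner)  # False
print("196" in raw_to_inner)  # True
```

The last two lines show why passing a number where the raw id is a string refers to an unknown user or item.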

Can I use my own dataset with Surprise, and can it be a pandas dataframe¶

Yes, and yes. See the user guide.

How to tune an algorithm's parameters¶

You can tune the parameters of an algorithm with the GridSearchCV class as described here. After the tuning, you may want to have an unbiased estimate of your algorithm's performance.
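
A grid search evaluates every combination of the values listed in the parameter grid, so the number of candidates is the product of the list lengths. The plain-Python sketch below only enumerates those combinations to show how quickly a grid grows. It is not how GridSearchCV is implemented, and the grid values are illustrative.

```python
from itertools import product

# Illustrative grid: two values for each of two SVD parameters.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}

# Build one dict per combination of parameter values.
names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

for params in combinations:
    print(params)
print(len(combinations))  # 4
```

Each combination is then cross-validated, so adding a third parameter with three values would triple the amount of work.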

How to get accuracy measures on the training set¶

You can use the build_testset() method of the Trainset object to build a testset that can then be used with the test() method:

From file examples/evaluate_on_trainset.py¶

from surprise import accuracy, Dataset, SVD


data = Dataset.load_builtin("ml-100k")

algo = SVD()

trainset = data.build_full_trainset()
algo.fit(trainset)

testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE should be low as we are biased
accuracy.rmse(predictions, verbose=True)  # ~ 0.68 (which is low)

Check out the example file for more usage examples.

How to save some data for unbiased accuracy estimation¶

If your goal is to tune the parameters of an algorithm, you may want to set aside some data to get an unbiased estimate of its performance. For instance, you may want to split your data into two sets A and B. A is used for parameter tuning using grid search, and B is used for unbiased estimation. This can be done as follows:

From file examples/split_data_for_unbiased_estimation.py¶

import random

from surprise import accuracy, Dataset, SVD
from surprise.model_selection import GridSearchCV


# Load the full dataset.
data = Dataset.load_builtin("ml-100k")
raw_ratings = data.raw_ratings

# shuffle ratings if you want
random.shuffle(raw_ratings)

# A = 90% of the data, B = 10% of the data
threshold = int(0.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

data.raw_ratings = A_raw_ratings  # data is now the set A

# Select your best algo with grid search.
print("Grid Search...")
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
grid_search.fit(data)

algo = grid_search.best_estimator["rmse"]

# retrain on the whole set A
trainset = data.build_full_trainset()
algo.fit(trainset)

# Compute biased accuracy on A
predictions = algo.test(trainset.build_testset())
print("Biased accuracy on A,", end="   ")
accuracy.rmse(predictions)

# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
predictions = algo.test(testset)
print("Unbiased accuracy on B,", end=" ")
accuracy.rmse(predictions)

How to have reproducible experiments¶

Some algorithms randomly initialize their parameters (sometimes with numpy), and the cross-validation folds are also randomly generated. To make your experiments reproducible from one run to the next, set the seeds of both Python's random module and numpy at the beginning of your program:

import random
import numpy as np

my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)
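
A quick way to check that seeding has the intended effect is to run the same random step twice with the same seed and compare the results. The sketch below uses only the standard library; shuffled_ids is a hypothetical helper standing in for any step that relies on the random module, such as shuffling ratings.

```python
import random


def shuffled_ids(seed):
    """Shuffle a fixed list after seeding the RNG, and return the result."""
    random.seed(seed)
    ids = list(range(10))
    random.shuffle(ids)
    return ids


first = shuffled_ids(0)
second = shuffled_ids(0)

# Same seed, same sequence of random draws, same shuffle.
print(first == second)  # True
```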

Where are datasets stored and how to change it?¶

By default, datasets downloaded by Surprise will be saved in the '~/.surprise_data' directory. This is also where dump files will be stored. You can change the default directory by setting the 'SURPRISE_DATA_FOLDER' environment variable.
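
The variable can be exported in your shell, or set from Python before Surprise looks up or downloads a dataset. The following is a minimal sketch of the second option; the directory name is a hypothetical choice and any writable location would do.

```python
import os
import tempfile

# Hypothetical location for the data; replace with a directory of your choice.
data_dir = os.path.join(tempfile.gettempdir(), "surprise_data")
os.makedirs(data_dir, exist_ok=True)

# Set the variable before loading or downloading any dataset.
os.environ["SURPRISE_DATA_FOLDER"] = data_dir

print(os.environ["SURPRISE_DATA_FOLDER"])
```

After this, get_dataset_dir() should point to the new directory, and built-in datasets should be downloaded there.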

Can Surprise support content-based data or implicit ratings?¶

No: this is out of scope for Surprise, which was designed for explicit ratings.