Getting Started¶
Basic usage¶
Automatic cross-validation¶
Surprise has a set of built-in algorithms and datasets for you to play with. In its simplest form, it only takes a few lines of code to run a cross-validation procedure:
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin("ml-100k")
# We'll use the famous SVD algorithm.
algo = SVD()
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
The result should be as follows (actual values may vary due to randomization):
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

            Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE        0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE         0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time    6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time   0.26    0.26    0.25    0.15    0.13    0.21    0.06
The load_builtin() method will offer to download the movielens-100k dataset if it has not already been downloaded, and will save it in the .surprise_data folder in your home directory (you can also choose to save it somewhere else).
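For instance, the data folder can be changed through the SURPRISE_DATA_FOLDER environment variable, which Surprise consults when deciding where to store downloaded datasets. A minimal sketch (the target path is just an example):

import os

# Hypothetical path: store downloaded datasets here instead of the
# default ~/.surprise_data. Must be set before the dataset is loaded.
os.environ["SURPRISE_DATA_FOLDER"] = os.path.expanduser("~/my_surprise_data")

from surprise import Dataset

data = Dataset.load_builtin("ml-100k")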
Here we use the well-known SVD algorithm, but many other algorithms are available. See Using prediction algorithms for more details.
The cross_validate() function runs a cross-validation procedure according to the cv argument, and computes some accuracy measures. Here we use a classical 5-fold cross-validation, but fancier iterators can be used (see here).
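For example, an iterator instance can be passed directly as the cv argument. A minimal sketch using a ShuffleSplit iterator (the parameter values are arbitrary):

from surprise import Dataset, SVD
from surprise.model_selection import ShuffleSplit, cross_validate

data = Dataset.load_builtin("ml-100k")
algo = SVD()

# Three random train/test splits, each holding out 20% of the ratings.
shuffle_cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
cross_validate(algo, data, measures=["RMSE"], cv=shuffle_cv, verbose=True)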
Train-test split and the fit() method¶
If you don't want to run a full cross-validation procedure, you can use the train_test_split() function to sample a trainset and a testset with given sizes, and use the accuracy metric of your choosing. You'll need to use the fit() method, which will train the algorithm on the trainset, and the test() method, which will return the predictions made on the testset:
from surprise import accuracy, Dataset, SVD
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin("ml-100k")
# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)
# We'll use the famous SVD algorithm.
algo = SVD()
# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
Result:
RMSE: 0.9411
Note that you can train and test an algorithm in a single line:
predictions = algo.fit(trainset).test(testset)
In some cases, your trainset and testset are already defined by some files. Please refer to this section to handle such cases.
Train on a whole trainset and the predict() method¶
Obviously, we could also simply fit our algorithm to the whole dataset, rather than running cross-validation. This can be done with the build_full_trainset() method, which builds a trainset object:
from surprise import Dataset, KNNBasic
# Load the movielens-100k dataset
data = Dataset.load_builtin("ml-100k")
# Retrieve the trainset.
trainset = data.build_full_trainset()
# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)
We can now predict ratings by directly calling the predict() method. Let's say you're interested in user 196 and item 302 (make sure they're in the trainset!), and you know that the true rating \(r_{ui} = 4\):
uid = str(196) # raw user id (as in the ratings file). They are **strings**!
iid = str(302) # raw item id (as in the ratings file). They are **strings**!
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
The result should be:
user: 196 item: 302 r_ui = 4.00 est = 4.06 {'actual_k': 40, 'was_impossible': False}
Note
The predict() method uses raw ids (please read this about raw and inner ids). As the dataset we have used has been read from a file, the raw ids are strings (even if they represent numbers).
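The returned Prediction object carries the estimate and some algorithm-dependent details; a minimal sketch of inspecting it (field names as defined by the Prediction namedtuple):

# predict() returns a Prediction namedtuple.
pred = algo.predict(uid, iid, r_ui=4)
print(pred.est)      # the estimated rating
print(pred.details)  # e.g. {'actual_k': 40, 'was_impossible': False}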
We have so far used a built-in dataset, but you can of course use your own. This is explained in the next section.
Use a custom dataset¶
Surprise has a set of built-in datasets, but you can of course use a custom dataset. Loading a rating dataset can be done either from a file (e.g. a csv file) or from a pandas dataframe. Either way, you will need to define a Reader object for Surprise to be able to parse the file or the dataframe.
To load a dataset from a file (e.g. a csv file), you will need the load_from_file() method:

import os

from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import cross_validate

# path to dataset file
file_path = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/u.data")

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format="user item rating timestamp", sep="\t")

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(BaselineOnly(), data, verbose=True)
For more details about readers and how to use them, see the Reader class documentation.

Note

As you already know from the previous section, the Movielens-100k dataset is built-in, so a much quicker way to load the dataset is to do data = Dataset.load_builtin('ml-100k'). We will of course ignore this here.
To load a dataset from a pandas dataframe, you will need the load_from_df() method. You will also need a Reader object, but only the rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order. Each row thus corresponds to a given rating. This is not restrictive as you can reorder the columns of your dataframe easily.

import pandas as pd

from surprise import Dataset, NormalPredictor, Reader
from surprise.model_selection import cross_validate

# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {
    "itemID": [1, 1, 1, 2, 2],
    "userID": [9, 32, 2, 45, "user_foo"],
    "rating": [3, 2, 4, 3, 1],
}
df = pd.DataFrame(ratings_dict)

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(NormalPredictor(), data, cv=2)
The dataframe initially looks like this:
   itemID  rating    userID
0       1       3         9
1       1       2        32
2       1       4         2
3       2       3        45
4       2       1  user_foo
Use cross-validation iterators¶
For cross-validation, we can use the cross_validate() function that does all the hard work for us. But for better control, we can also instantiate a cross-validation iterator, and make predictions over each split using the split() method of the iterator and the test() method of the algorithm. Here is an example where we use a classical K-fold cross-validation procedure with 3 splits:
from surprise import accuracy, Dataset, SVD
from surprise.model_selection import KFold
# Load the movielens-100k dataset
data = Dataset.load_builtin("ml-100k")
# define a cross-validation iterator
kf = KFold(n_splits=3)
algo = SVD()
for trainset, testset in kf.split(data):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
Result could be, e.g.:
RMSE: 0.9374
RMSE: 0.9476
RMSE: 0.9478
Other cross-validation iterators can be used, like LeaveOneOut or ShuffleSplit. See all the available iterators here. The design of Surprise's cross-validation tools is heavily inspired by the excellent scikit-learn API.
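As an illustration, here is a minimal sketch using LeaveOneOut, where each testset contains exactly one rating per user (the parameter values are arbitrary):

from surprise import accuracy, Dataset, SVD
from surprise.model_selection import LeaveOneOut

data = Dataset.load_builtin("ml-100k")

# Each testset contains exactly one rating for every user.
loo = LeaveOneOut(n_splits=3, random_state=0)

algo = SVD()
for trainset, testset in loo.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)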
A special case of cross-validation is when the folds are already predefined by
some files. For instance, the movielens-100K dataset already provides 5 train
and test files (u1.base, u1.test … u5.base, u5.test). Surprise can handle
this case by using a surprise.model_selection.split.PredefinedKFold
object:
import os
from surprise import accuracy, Dataset, Reader, SVD
from surprise.model_selection import PredefinedKFold
# path to dataset folder
files_dir = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/")
# This time, we'll use the built-in reader.
reader = Reader("ml-100k")
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + "u%d.base"
test_file = files_dir + "u%d.test"
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()
algo = SVD()
for trainset, testset in pkf.split(data):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
Of course, nothing prevents you from only loading a single file for training and a single file for testing. However, the folds_files parameter still needs to be a list.
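For instance, reusing the files_dir and reader defined above, a single train/test pair is simply a one-element list; a minimal sketch:

# A single predefined fold: one (train, test) pair, still wrapped in a list.
folds_files = [(files_dir + "u1.base", files_dir + "u1.test")]
data = Dataset.load_from_folds(folds_files, reader=reader)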
Tune algorithm parameters with GridSearchCV¶
The cross_validate() function reports accuracy metrics over a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the GridSearchCV class comes to the rescue. Given a dict of parameters, this class exhaustively tries all the combinations of parameters and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn's GridSearchCV.
Here is an example where we try different values for the parameters n_epochs, lr_all and reg_all of the SVD algorithm.
from surprise import Dataset, SVD
from surprise.model_selection import GridSearchCV
# Use movielens-100K
data = Dataset.load_builtin("ml-100k")
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)
# best RMSE score
print(gs.best_score["rmse"])
# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])
Result:
0.961300130118
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
Here we evaluate the average RMSE and MAE over a 3-fold cross-validation procedure, but any cross-validation iterator can be used.
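For example, a cross-validation iterator instance can be passed as the cv argument; a minimal sketch reusing the grid defined above:

from surprise.model_selection import KFold

# A repeatable 3-fold iterator instead of the plain integer cv=3.
kf = KFold(n_splits=3, random_state=0)
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=kf)
gs.fit(data)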
Once fit() has been called, the best_estimator attribute gives us an algorithm instance with the optimal set of parameters, which can be used however we please:
# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator["rmse"]
algo.fit(data.build_full_trainset())
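The refit algorithm can then be used like any other; for instance, a minimal sketch estimating a rating (with the same hypothetical raw ids as in the earlier example):

# Estimate the rating of user 196 for item 302 with the tuned algorithm.
pred = algo.predict(str(196), str(302))
print(pred.est)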
Note
Dictionary parameters such as bsl_options and sim_options require particular treatment. See the usage example below:
param_grid = {
    'k': [10, 20],
    'sim_options': {
        'name': ['msd', 'cosine'],
        'min_support': [1, 5],
        'user_based': [False],
    },
}
Naturally, both can be combined, for example for the KNNBaseline algorithm:
param_grid = {
    'bsl_options': {
        'method': ['als', 'sgd'],
        'reg': [1, 2],
    },
    'k': [2, 3],
    'sim_options': {
        'name': ['msd', 'cosine'],
        'min_support': [1, 5],
        'user_based': [False],
    },
}
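Such a grid is then passed to GridSearchCV like any other; a minimal sketch:

from surprise import Dataset, KNNBaseline
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")

# The nested dicts are expanded into all combinations of their values.
gs = GridSearchCV(KNNBaseline, param_grid, measures=["rmse"], cv=3)
gs.fit(data)
print(gs.best_params["rmse"])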
For further analysis, the cv_results attribute has all the needed information and can be imported into a pandas dataframe:
results_df = pd.DataFrame.from_dict(gs.cv_results)
In our example, the cv_results attribute looks like this (floats are formatted):
'split0_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split1_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split2_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'mean_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'std_test_rmse': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_rmse': [7 8 3 5 4 6 1 2]
'split0_test_mae': [0.81, 0.82, 0.78, 0.79, 0.79, 0.8, 0.77, 0.79]
'split1_test_mae': [0.8, 0.81, 0.78, 0.79, 0.78, 0.79, 0.77, 0.78]
'split2_test_mae': [0.81, 0.81, 0.78, 0.79, 0.78, 0.8, 0.77, 0.78]
'mean_test_mae': [0.81, 0.81, 0.78, 0.79, 0.79, 0.8, 0.77, 0.78]
'std_test_mae': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_mae': [7 8 2 5 4 6 1 3]
'mean_fit_time': [1.53, 1.52, 1.53, 1.53, 3.04, 3.05, 3.06, 3.02]
'std_fit_time': [0.03, 0.04, 0.0, 0.01, 0.04, 0.01, 0.06, 0.01]
'mean_test_time': [0.46, 0.45, 0.44, 0.44, 0.47, 0.49, 0.46, 0.34]
'std_test_time': [0.0, 0.01, 0.01, 0.0, 0.03, 0.06, 0.01, 0.08]
'params': [{'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}]
'param_n_epochs': [5, 5, 5, 5, 10, 10, 10, 10]
'param_lr_all': [0.0, 0.0, 0.01, 0.01, 0.0, 0.0, 0.01, 0.01]
'param_reg_all': [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]
As you can see, each list has the same size as the number of parameter combinations. It corresponds to the following table:
| split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_n_epochs | param_lr_all | param_reg_all |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.99775 | 0.997744 | 0.996378 | 0.997291 | 0.000645508 | 7 | 0.807862 | 0.804626 | 0.805282 | 0.805923 | 0.00139657 | 7 | 1.53341 | 0.0305216 | 0.455831 | 0.000922113 | {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4} | 5 | 0.002 | 0.4 |
| 1.00381 | 1.00304 | 1.00257 | 1.00314 | 0.000508358 | 8 | 0.816559 | 0.812905 | 0.813772 | 0.814412 | 0.00155866 | 8 | 1.5199 | 0.0367117 | 0.451068 | 0.00938646 | {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6} | 5 | 0.002 | 0.6 |
| 0.973524 | 0.973595 | 0.972495 | 0.973205 | 0.000502609 | 3 | 0.783361 | 0.780242 | 0.78067 | 0.781424 | 0.00138049 | 2 | 1.53449 | 0.00496203 | 0.441558 | 0.00529696 | {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4} | 5 | 0.005 | 0.4 |
| 0.98229 | 0.982059 | 0.981486 | 0.981945 | 0.000338056 | 5 | 0.794481 | 0.790781 | 0.79186 | 0.792374 | 0.00155377 | 5 | 1.52739 | 0.00859185 | 0.44463 | 0.000888907 | {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6} | 5 | 0.005 | 0.6 |
| 0.978034 | 0.978407 | 0.976919 | 0.977787 | 0.000632049 | 4 | 0.787643 | 0.784723 | 0.784957 | 0.785774 | 0.00132486 | 4 | 3.03572 | 0.0431101 | 0.466606 | 0.0254965 | {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4} | 10 | 0.002 | 0.4 |
| 0.986263 | 0.985817 | 0.985004 | 0.985695 | 0.000520899 | 6 | 0.798218 | 0.794457 | 0.795373 | 0.796016 | 0.00160135 | 6 | 3.0544 | 0.00636185 | 0.488357 | 0.0576194 | {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6} | 10 | 0.002 | 0.6 |
| 0.963751 | 0.963463 | 0.962676 | 0.963297 | 0.000454661 | 1 | 0.774036 | 0.770548 | 0.771588 | 0.772057 | 0.00146201 | 1 | 3.0636 | 0.0597982 | 0.456484 | 0.00510321 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4} | 10 | 0.005 | 0.4 |
| 0.973605 | 0.972868 | 0.972765 | 0.973079 | 0.000374222 | 2 | 0.78607 | 0.781918 | 0.783537 | 0.783842 | 0.00170855 | 3 | 3.01907 | 0.011834 | 0.338839 | 0.075346 | {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6} | 10 | 0.005 | 0.6 |
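Since cv_results is a plain dict of lists, the resulting dataframe can be inspected with the usual pandas tools; for instance, a minimal sketch ranking the combinations by RMSE:

import pandas as pd

results_df = pd.DataFrame.from_dict(gs.cv_results)

# Rank the parameter combinations by mean test RMSE, best first.
print(
    results_df.sort_values("rank_test_rmse")[
        ["params", "mean_test_rmse", "rank_test_rmse"]
    ]
)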
Command line usage¶
Surprise can also be used from the command line, for example:
surprise -algo SVD -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3
See detailed usage by running:
surprise -h