Getting Started¶
Basic usage¶
Surprise has a set of built-in algorithms and datasets for you to play with. In its simplest form, it takes about four lines of code to evaluate the performance of an algorithm:
examples/basic_usage.py
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
# Load the movielens-100k dataset (download it if needed),
# and split it into 3 folds for cross-validation.
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
# We'll use the famous SVD algorithm.
algo = SVD()
# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)
If Surprise cannot find the movielens-100k dataset, it will offer to download it and will store it under the .surprise_data folder in your home directory. The split() method automatically splits the dataset into 3 folds, and the evaluate() function runs the cross-validation procedure and computes some accuracy measures.
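Since evaluate() accepts any algorithm instance, the same splits can be reused to compare algorithms. As a minimal sketch (not part of the example above, reusing the data object already defined), we could evaluate the built-in KNNBasic algorithm with the same measures:

from surprise import KNNBasic

# Evaluate a second algorithm on the same 3-fold split for comparison.
knn_perf = evaluate(KNNBasic(), data, measures=['RMSE', 'MAE'])
print_perf(knn_perf)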
Load a custom dataset¶
You can of course use a custom dataset. Surprise offers two ways of loading a custom dataset:
- you can either specify a single file (e.g. a csv file) or a pandas dataframe with all the ratings and use the split() method to perform cross-validation, or train on the whole dataset;
- or, if your dataset is already split into predefined folds, you can specify a list of files for training and testing.
Either way, you will need to define a Reader object for Surprise to be able to parse the file(s) or the dataframe. We will now see how to handle both cases.
Load an entire dataset from a file or a dataframe¶
To load a dataset from a file (e.g. a csv file), you will need the load_from_file() method:
examples/load_custom_dataset.py
# path to dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)
data.split(n_folds=5)  # data can now be used normally
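A Reader can describe other layouts as well. As an illustrative sketch (the file path and column layout below are hypothetical, not part of the example above), a comma-separated file with a header line and ratings on a 1-10 scale could be parsed like this:

# Hypothetical CSV with a header line and columns: user,item,rating (scale 1-10).
reader = Reader(line_format='user item rating', sep=',',
                rating_scale=(1, 10), skip_lines=1)
data = Dataset.load_from_file('/path/to/my_ratings.csv', reader=reader)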
For more details about readers and how to use them, see the Reader class documentation.

Note

As you already know from the previous section, the Movielens-100k dataset is built-in, so a much quicker way to load the dataset is to do data = Dataset.load_builtin('ml-100k'). We will of course ignore this here.
To load a dataset from a pandas dataframe, you will need the load_from_df() method. You will also need a Reader object, but only the rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order. Each row thus corresponds to a given rating. This is not restrictive as you can reorder the columns of your dataframe easily.
examples/load_from_dataframe.py
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)
data.split(2)  # data can now be used normally
The dataframe initially looks like this:
   itemID  rating    userID
0       1       3         9
1       1       2        32
2       1       4         2
3       2       3        45
4       2       1  user_foo
Load a dataset with predefined folds¶
examples/load_custom_dataset_predefined_folds.py
# path to dataset folder
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')
# This time, we'll use the built-in reader.
reader = Reader('ml-100k')
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
Of course, nothing prevents you from only loading a single file for training and a single file for testing. However, the folds_files parameter still needs to be a list.
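In that case, folds_files simply contains a single tuple. As a small sketch, reusing the files_dir and reader defined in the example above with only the u1 pair:

# A single (train, test) pair still goes inside a list.
folds_files = [(files_dir + 'u1.base', files_dir + 'u1.test')]
data = Dataset.load_from_folds(folds_files, reader=reader)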
Advanced usage¶
Here we will dig a little deeper into what Surprise can do for you.
Tune algorithm parameters with GridSearch¶
The evaluate() function gives us the results for one set of parameters given to the algorithm. If you want to try the algorithm on different sets of parameters, the GridSearch class comes to the rescue. Given a dict of parameters, this class exhaustively tries all the combinations of parameters and helps get the best combination for an accuracy measure. It is analogous to GridSearchCV from scikit-learn.
For instance, suppose that we want to tune the parameters of the SVD algorithm. Some of the parameters of this algorithm are n_epochs, lr_all and reg_all. Thus we define a parameter grid as follows:
examples/grid_search_usage.py
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
'reg_all': [0.4, 0.6]}
Next we define a GridSearch instance and give it the class SVD as an algorithm, along with param_grid. We will compute both the RMSE and FCP values for all the combinations. Thus the following definition:
examples/grid_search_usage.py
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
Now that the GridSearch instance is ready, we can evaluate the algorithm on any data with the GridSearch.evaluate() method, exactly like with the regular evaluate() function:
examples/grid_search_usage.py
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
grid_search.evaluate(data)
Everything is ready now to read the results. For example, we get the best RMSE and FCP scores and parameters as follows:
examples/grid_search_usage.py
# best RMSE score
print(grid_search.best_score['RMSE'])
# >>> 0.96117566386
# combination of parameters that gave the best RMSE score
print(grid_search.best_params['RMSE'])
# >>> {'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}
# best FCP score
print(grid_search.best_score['FCP'])
# >>> 0.702279736531
# combination of parameters that gave the best FCP score
print(grid_search.best_params['FCP'])
# >>> {'reg_all': 0.6, 'lr_all': 0.005, 'n_epochs': 10}
For further analysis, we can easily read all the results in a pandas DataFrame as follows:
examples/grid_search_usage.py
import pandas as pd  # noqa
results_df = pd.DataFrame.from_dict(grid_search.cv_results)
print(results_df)
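Once the search is done, you will typically want to retrain the best configuration. A possible sketch, assuming your version of GridSearch exposes the best_estimator attribute (keyed by measure, like best_score and best_params):

# Retrieve the algorithm instance that gave the best RMSE,
# then retrain it on the whole dataset.
algo = grid_search.best_estimator['RMSE']
algo.train(data.build_full_trainset())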
Note
Dictionary parameters such as bsl_options and sim_options require particular treatment. See usage example below:
param_grid = {'k': [10, 20],
'sim_options': {'name': ['msd', 'cosine'],
'min_support': [1, 5],
'user_based': [False]}
}
Naturally, both can be combined, for example for the KNNBaseline algorithm:
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
'reg': [1, 2]},
'k': [2, 3],
'sim_options': {'name': ['msd', 'cosine'],
'min_support': [1, 5],
'user_based': [False]}
}
Manually iterate over folds¶
We have so far used the evaluate() function that does all the hard work for us. If you want to have better control over your experiments, you can use the folds() generator of your dataset, and then call the train() and test() methods of your algorithm on each of the folds:
examples/iterate_over_folds.py
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
algo = BaselineOnly()
for trainset, testset in data.folds():

    # train and test algorithm.
    algo.train(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    rmse = accuracy.rmse(predictions, verbose=True)
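If you need an aggregate figure, nothing more than standard Python is required. For instance, a small sketch that collects the per-fold scores and averages them:

# Collect the RMSE of each fold and report the average.
rmses = []
for trainset, testset in data.folds():
    algo.train(trainset)
    predictions = algo.test(testset)
    rmses.append(accuracy.rmse(predictions, verbose=False))

print('Mean RMSE over folds: {:.4f}'.format(sum(rmses) / len(rmses)))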
Train on a whole trainset and specifically query for predictions¶
Here we will review how to get a prediction for specified users and items. Along the way, we will also review how to train on a whole dataset, without performing cross-validation (i.e. there is no test set).
The latter is pretty straightforward: all you need is to load a dataset and use the build_full_trainset() method to build the trainset and train your algorithm:
examples/query_for_predictions.py
data = Dataset.load_builtin('ml-100k')
# Retrieve the trainset.
trainset = data.build_full_trainset()
# Build an algorithm, and train it.
algo = KNNBasic()
algo.train(trainset)
Now, there’s no way we could call the test() method, because we have no testset. But you can still get predictions for the users and items you want.
Let’s say you’re interested in user 196 and item 302 (make sure they’re in the trainset!), and you know that the true rating \(r_{ui} = 4\). All you need to do is call the predict() method:
examples/query_for_predictions.py
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302) # raw item id (as in the ratings file). They are **strings**!
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
The predict() method uses raw ids (read this). As the dataset we have used has been read from a file, the raw ids are strings (even if they represent numbers).

If the predict() method is called with user or item ids that were not part of the trainset, it’s up to the algorithm to decide whether it can still make a prediction or not. If it can’t, predict() will still predict the mean of all ratings \(\mu\).
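To see this behaviour, here is a small sketch (the raw id 'unknown_user' is made up for illustration, and we assume the Prediction object exposes a was_impossible flag in its details dictionary): KNNBasic cannot build a neighbourhood for an unseen user, so the returned prediction falls back to the global mean.

# Query for a user that is not in the trainset.
pred = algo.predict('unknown_user', iid, r_ui=4, verbose=True)
print(pred.est)                        # the mean of all ratings
print(pred.details['was_impossible'])  # True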
Obviously, it is perfectly fine to use the predict() method directly during a cross-validation process. It’s then up to you to ensure that the user and item ids are present in the trainset, though.
Command line usage¶
Surprise can also be used from the command line, for example:
surprise -algo SVD -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3
See detailed usage by running:
surprise -h