Prediction algorithms¶

Surprise provides with a bunch of built-in algorithms. You can find the details of each of these in the surprise.prediction_algorithms package documentation.

Every algorithm is part of the global Surprise namespace, so you only need to import their names from the Surprise package, for example:

from surprise import KNNBasic
algo = KNNBasic()

Some of these algorithms may use baseline estimates, some may use a similarity measure. We will here review how to configure the way baselines and similarities are computed.

Baselines estimates configuration¶

Note

This section only applies to algorithms (or similarity measures) that try to minimize the following regularized squared error (or equivalent):

\[\sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 + \lambda \left(b_u^2 + b_i^2 \right).\]

For algorithms using baselines in another objective function (e.g. the SVD algorithm), the baseline configuration is done differently and is specific to each algorithm. Please refer to their own documentation.

First of all, if you do not want to configure the way baselines are computed, you don’t have to: the default parameters will do just fine. If you do want to well... This is for you.

You may want to read section 2.1 of Factor in the Neighbors: Scalable and Accurate Collaborative Filtering by Yehuda Koren to get a good idea of what are baseline estimates.

Baselines can be estimated in two different ways:

Using Stochastic Gradient Descent (SGD).
Using Alternating Least Squares (ALS).

You can configure the way baselines are computed using the bsl_options parameter passed at the creation of an algorithm. This parameter is a dictionary for which the key 'method' indicates the method to use. Accepted values are 'als' (default) and 'sgd'. Depending on its value, other options may be set. For ALS:

'reg_i': The regularization parameter for items. Corresponding to \(\lambda_2\) in the paper. Default is 10.
'reg_u': The regularization parameter for users, orresponding to \(\lambda_3\) in the paper. Default is 15.
'n_epochs': The number of iteration of the ALS procedure. Default is 10. Note that in the paper, what is described is a single iteration ALS process.

And for SGD:

'reg': The regularization parameter of the cost function that is optimized, corresponding to \(\lambda_1\) and then \(\lambda_5\) in the paper. Default is 0.02.
'learning_rate': The learning rate of SGD, corresponding to \(\gamma\) in the paper. Default is 0.005.
'n_epochs': The number of iteration of the SGD procedure. Default is 20.

Note

For both procedures (ALS and SGD), user and item biases (\(b_u\) and \(b_i\)) are initialized to zero.

Usage examples:

From file examples/baselines_conf.py¶

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)

From file examples/baselines_conf.py¶

print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }
algo = BaselineOnly(bsl_options=bsl_options)

Note that some similarity measures may use baselines, such as the pearson_baseline similarity. Configuration works just the same, whether the baselines are used in the actual prediction \(\hat{r}_{ui}\) or not:

From file examples/baselines_conf.py¶

bsl_options = {'method': 'als',
               'n_epochs': 20,
               }
sim_options = {'name': 'pearson_baseline'}
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

This leads us to similarity measure configuration, which we will review right now.

Similarity measure configuration¶

Many algorithms use a similarity measure to estimate a rating. The way they can be configured is done in a similar fashion as for baseline ratings: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:

'name': The name of the similarity to use, as defined in the similarities module. Default is 'MSD'.
'user_based': Whether similarities will be computed between users or between items. This has a huge impact on the performance of a prediction algorithm. Default is True.
'min_support': The minimum number of common items (when 'user_based' is 'True') or minimum number of common users (when 'user_based' is 'False') for the similarity not to be zero. Simply put, if \(|I_{uv}| < \text{min_support}\) then \(\text{sim}(u, v) = 0\). The same goes for items.
'shrinkage': Shrinkage parameter to apply (only relevent for pearson_baseline similarity). Default is 100.

Usage examples:

From file examples/similarity_conf.py¶

sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo = KNNBasic(sim_options=sim_options)

From file examples/similarity_conf.py¶

sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo = KNNBasic(sim_options=sim_options)