The model_selection package

Surprise provides various tools to run cross-validation procedures and to search for the best parameters of a prediction algorithm. The tools presented here are all heavily inspired by the excellent scikit-learn library.

Cross validation iterators

The model_selection.split module contains various cross-validation iterators. Their design and tools are inspired by the mighty scikit-learn.

The available iterators are:

KFold A basic cross-validation iterator.
RepeatedKFold Repeated KFold cross validator.
ShuffleSplit A basic cross-validation iterator with random trainsets and testsets.
LeaveOneOut Cross-validation iterator where each user has exactly one rating in the testset.
PredefinedKFold A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.

This module also contains a function for splitting a dataset into a trainset and a testset:

train_test_split Split a dataset into a trainset and a testset.
class surprise.model_selection.split.KFold(n_splits=5, random_state=None, shuffle=True)

A basic cross-validation iterator.

Each fold is used once as a testset while the k - 1 remaining folds are used for training.

See an example in the User Guide.

Parameters:
  • n_splits (int) – The number of folds.
  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
  • shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Default is True.
split(data)

Generator function to iterate over trainsets and testsets.

Parameters:data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.
Yields:tuple of (trainset, testset)
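
Below is a minimal, illustrative sketch of the iterator loop, using the built-in ml-100k dataset and the SVD algorithm (both arbitrary choices); the same loop works for any of the iterators above, since they all share the split() interface:

    from surprise import SVD, Dataset, accuracy
    from surprise.model_selection import KFold

    # Load the built-in MovieLens-100k dataset (downloaded on first use).
    data = Dataset.load_builtin('ml-100k')

    kf = KFold(n_splits=3)
    algo = SVD()

    for trainset, testset in kf.split(data):
        # Train on the current trainset, then evaluate on the matching testset.
        algo.fit(trainset)
        predictions = algo.test(testset)
        accuracy.rmse(predictions, verbose=True)
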
class surprise.model_selection.split.LeaveOneOut(n_splits=5, random_state=None)

Cross-validation iterator where each user has exactly one rating in the testset.

Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

See an example in the User Guide.

Parameters:
  • n_splits (int) – The number of folds.
  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
split(data)

Generator function to iterate over trainsets and testsets.

Parameters:data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.
Yields:tuple of (trainset, testset)
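
A short sketch of LeaveOneOut usage, again with the (arbitrary) ml-100k dataset and SVD; each yielded testset contains exactly one rating per user:

    from surprise import SVD, Dataset, accuracy
    from surprise.model_selection import LeaveOneOut

    data = Dataset.load_builtin('ml-100k')

    loo = LeaveOneOut(n_splits=3, random_state=1)
    algo = SVD()

    for trainset, testset in loo.split(data):
        algo.fit(trainset)
        predictions = algo.test(testset)
        accuracy.rmse(predictions, verbose=True)
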
class surprise.model_selection.split.PredefinedKFold

A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.

See an example in the User Guide.

split(data)

Generator function to iterate over trainsets and testsets.

Parameters:data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.
Yields:tuple of (trainset, testset)
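
A sketch of the predefined-folds workflow. The file paths below assume the ml-100k dataset with its u1.base/u1.test to u5.base/u5.test pairs in the default Surprise download location; adjust them to your setup:

    import os

    from surprise import SVD, Dataset, Reader, accuracy
    from surprise.model_selection import PredefinedKFold

    # Assumed location of the ml-100k files; adjust as needed.
    files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')
    reader = Reader('ml-100k')

    folds_files = [(files_dir + 'u%d.base' % i, files_dir + 'u%d.test' % i)
                   for i in (1, 2, 3, 4, 5)]
    data = Dataset.load_from_folds(folds_files, reader=reader)

    pkf = PredefinedKFold()
    algo = SVD()

    for trainset, testset in pkf.split(data):
        algo.fit(trainset)
        predictions = algo.test(testset)
        accuracy.rmse(predictions, verbose=True)
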
class surprise.model_selection.split.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)

Repeated KFold cross validator.

Repeats KFold n times with different randomization in each repetition.

See an example in the User Guide.

Parameters:
  • n_splits (int) – The number of folds.
  • n_repeats (int) – The number of repetitions.
  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
split(data)

Generator function to iterate over trainsets and testsets.

Parameters:data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.
Yields:tuple of (trainset, testset)
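
Since the evaluation loop is identical to the KFold sketch above, only the construction is shown here; the parameter values are arbitrary illustrations:

    from surprise import Dataset
    from surprise.model_selection import RepeatedKFold

    data = Dataset.load_builtin('ml-100k')

    # 2 folds repeated twice: split() yields 2 * 2 = 4 (trainset, testset) pairs.
    rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
    print(sum(1 for _ in rkf.split(data)))  # 4
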
class surprise.model_selection.split.ShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=None, shuffle=True)

A basic cross-validation iterator with random trainsets and testsets.

Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

See an example in the User Guide.

Parameters:
  • n_splits (int) – The number of folds.
  • test_size (float, int, or None) – If float, it represents the proportion of ratings to include in the testset. If int, it represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is .2.
  • train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
  • shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Setting this to False defeats the purpose of this iterator, but it’s useful for the implementation of train_test_split(). Default is True.
split(data)

Generator function to iterate over trainsets and testsets.

Parameters:data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.
Yields:tuple of (trainset, testset)
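
A brief sketch illustrating the test_size parameter (values are arbitrary); the evaluation loop would be the same as in the KFold sketch above:

    from surprise import Dataset
    from surprise.model_selection import ShuffleSplit

    data = Dataset.load_builtin('ml-100k')

    # Three independent random splits, each holding out 20% of the ratings.
    ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
    for trainset, testset in ss.split(data):
        print(trainset.n_ratings, len(testset))
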
surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)

Split a dataset into a trainset and a testset.

See an example in the User Guide.

Note: this function cannot be used as a cross-validation iterator.

Parameters:
  • data (Dataset) – The dataset to split into trainset and testset.
  • test_size (float, int, or None) – If float, it represents the proportion of ratings to include in the testset. If int, it represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is .2.
  • train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
  • shuffle (bool) – Whether to shuffle the ratings in the data parameter. Shuffling is not done in-place. Default is True.
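
A minimal sketch of a single train/test split (the dataset, algorithm, and test_size value are arbitrary choices):

    from surprise import SVD, Dataset, accuracy
    from surprise.model_selection import train_test_split

    data = Dataset.load_builtin('ml-100k')

    # Hold out 25% of the ratings for testing.
    trainset, testset = train_test_split(data, test_size=0.25)

    algo = SVD()
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions)
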

Cross validation

surprise.model_selection.validation.cross_validate(algo, data, measures=['rmse', 'mae'], cv=None, return_train_measures=False, n_jobs=-1, pre_dispatch='2*n_jobs', verbose=False)

Run a cross validation procedure for a given algorithm, reporting accuracy measures and computation times.

See an example in the User Guide.

Parameters:
  • algo (AlgoBase) – The algorithm to evaluate.
  • data (Dataset) – The dataset on which to evaluate the algorithm.
  • measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
  • cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
  • return_train_measures (bool) – Whether to compute performance measures on the trainsets. Default is False.
  • n_jobs (int) –

    The maximum number of folds evaluated in parallel.

    • If -1, all CPUs are used.
    • If 1 is given, no parallel computing code is used at all, which is useful for debugging.
    • For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.

    Default is -1.

  • pre_dispatch (int or string) –

    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

    • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
    • An int, giving the exact number of total jobs that are spawned.
    • A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.

    Default is '2*n_jobs'.

  • verbose (bool) – If True, accuracy measures for each split are printed, as well as train and test times. Averages and standard deviations over all splits are also reported. Default is False: nothing is printed.
Returns:

A dict with the following keys:

  • 'test_*' where * corresponds to a lower-case accuracy measure, e.g. 'test_rmse': numpy array with accuracy values for each testset.
  • 'train_*' where * corresponds to a lower-case accuracy measure, e.g. 'train_rmse': numpy array with accuracy values for each trainset. Only available if return_train_measures is True.
  • 'fit_time': numpy array with the training time in seconds for each split.
  • 'test_time': numpy array with the testing time in seconds for each split.

Return type:

dict
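
A minimal usage sketch (the measure names, fold count, and SVD are arbitrary choices):

    from surprise import SVD, Dataset
    from surprise.model_selection import cross_validate

    data = Dataset.load_builtin('ml-100k')

    # 5-fold cross-validation of SVD; verbose=True prints per-fold and averaged measures.
    results = cross_validate(SVD(), data, measures=['rmse', 'mae'], cv=5, verbose=True)

    # results['test_rmse'] is a numpy array with one value per fold.
    print(results['test_rmse'].mean())
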

Parameter search

class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=['rmse', 'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=-1, pre_dispatch='2*n_jobs', joblib_verbose=0)

The GridSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.

See an example in the User Guide.

Parameters:
  • algo_class (AlgoBase) – The class of the algorithm to evaluate.
  • param_grid (dict) – Dictionary with algorithm parameters as keys and lists of parameter values as values. All combinations will be evaluated with the desired algorithm. Dict parameters such as sim_options require special treatment, see this note.
  • measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
  • cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
  • refit (bool or str) – If True, refit the algorithm on the whole dataset using the set of parameters that gave the best average performance for the first measure of measures. Other measures can be used by passing a string (corresponding to the measure name). Then, you can use the test() and predict() methods. refit can only be used if the data parameter given to fit() hasn’t been loaded with load_from_folds(). Default is False.
  • return_train_measures (bool) – Whether to compute performance measures on the trainsets. If True, the cv_results attribute will also contain measures for trainsets. Default is False.
  • n_jobs (int) –

    The maximum number of parallel training procedures.

    • If -1, all CPUs are used.
    • If 1 is given, no parallel computing code is used at all, which is useful for debugging.
    • For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.

    Default is -1.

  • pre_dispatch (int or string) –

    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

    • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
    • An int, giving the exact number of total jobs that are spawned.
    • A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.

    Default is '2*n_jobs'.

  • joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages. Default is 0.
best_estimator

dict of AlgoBase – Using an accuracy measure as key, get the algorithm that gave the best accuracy results for the chosen measure, averaged over all splits.

best_score

dict of floats – Using an accuracy measure as key, get the best average score achieved for that measure.

best_params

dict of dicts – Using an accuracy measure as key, get the parameters combination that gave the best accuracy results for the chosen measure (on average).

best_index

dict of ints – Using an accuracy measure as key, get the index (for use with cv_results) of the parameter combination that achieved the highest average accuracy for that measure.

cv_results

dict of arrays – A dict that contains accuracy measures over all splits, as well as train and test time for each parameter combination. Can be imported into a pandas DataFrame (see example).

fit(data)

Runs the fit() method of the algorithm for each parameter combination, over the different splits given by the cv parameter.

Parameters:data (Dataset) – The dataset on which to evaluate the algorithm; parameter combinations are evaluated in parallel.
predict(*args)

Call predict() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.predict().

Only available if refit is not False.

test(testset, verbose=False)

Call test() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.test().

Only available if refit is not False.
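
A minimal usage sketch; the grid values below are arbitrary illustrations, not tuning recommendations:

    from surprise import SVD, Dataset
    from surprise.model_selection import GridSearchCV

    data = Dataset.load_builtin('ml-100k')

    # n_epochs and lr_all are SVD parameters; all 4 combinations are evaluated.
    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
    gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
    gs.fit(data)

    print(gs.best_score['rmse'])   # best average RMSE over the 3 folds
    print(gs.best_params['rmse'])  # parameter combination that achieved it

    # Retrain the best algorithm on the whole dataset before making predictions.
    algo = gs.best_estimator['rmse']
    algo.fit(data.build_full_trainset())

Note that GridSearchCV takes the algorithm class (SVD), not an instance (SVD()).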