The model_selection package

Surprise provides various tools to run cross-validation procedures and to search for the best parameters of a prediction algorithm. The tools presented here are all heavily inspired by the excellent scikit-learn library.

Cross validation iterators

The model_selection.split module contains various cross-validation iterators. Their design and tools are inspired by scikit-learn.

The available iterators are:

KFold

A basic cross-validation iterator.

RepeatedKFold

Repeated KFold cross validator.

ShuffleSplit

A basic cross-validation iterator with random trainsets and testsets.

LeaveOneOut

Cross-validation iterator where each user has exactly one rating in the testset.

PredefinedKFold

A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.

This module also contains a function for splitting datasets into trainset and testset:

train_test_split

Split a dataset into trainset and testset.

class surprise.model_selection.split.KFold(n_splits=5, random_state=None, shuffle=True)[source]

A basic cross-validation iterator.

Each fold is used once as a testset while the k - 1 remaining folds are used for training.

See an example in the User Guide.

Parameters
  • n_splits (int) – The number of folds.

  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.

  • shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Default is True.

split(data)[source]

Generator function to iterate over trainsets and testsets.

Parameters

data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.

Yields

tuple of (trainset, testset)

class surprise.model_selection.split.LeaveOneOut(n_splits=5, random_state=None, min_n_ratings=0)[source]

Cross-validation iterator where each user has exactly one rating in the testset.

Contrary to other cross-validation strategies, LeaveOneOut does not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

See an example in the User Guide.

Parameters
  • n_splits (int) – The number of folds.

  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.

  • min_n_ratings (int) – Minimum number of ratings for each user in the trainset. E.g. if min_n_ratings is 2, we are sure each user has at least 2 ratings in the trainset (and 1 in the testset). Other users are discarded. Default is 0, so some users (having only one rating) may be in the testset and not in the trainset.

split(data)[source]

Generator function to iterate over trainsets and testsets.

Parameters

data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.

Yields

tuple of (trainset, testset)

class surprise.model_selection.split.PredefinedKFold[source]

A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.

See an example in the User Guide.

split(data)[source]

Generator function to iterate over trainsets and testsets.

Parameters

data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.

Yields

tuple of (trainset, testset)

class surprise.model_selection.split.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)[source]

Repeated KFold cross validator.

Repeats KFold n times with different randomization in each repetition.

See an example in the User Guide.

Parameters
  • n_splits (int) – The number of folds.

  • n_repeats (int) – The number of repetitions.

  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.

split(data)[source]

Generator function to iterate over trainsets and testsets.

Parameters

data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.

Yields

tuple of (trainset, testset)

class surprise.model_selection.split.ShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=None, shuffle=True)[source]

A basic cross-validation iterator with random trainsets and testsets.

Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

See an example in the User Guide.

Parameters
  • n_splits (int) – The number of folds.

  • test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is 0.2.

  • train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.

  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.

  • shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Setting this to False defeats the purpose of this iterator, but it’s useful for the implementation of train_test_split(). Default is True.

split(data)[source]

Generator function to iterate over trainsets and testsets.

Parameters

data (Dataset) – The data containing ratings that will be divided into trainsets and testsets.

Yields

tuple of (trainset, testset)

surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)[source]

Split a dataset into trainset and testset.

See an example in the User Guide.

Note: this function cannot be used as a cross-validation iterator.

Parameters
  • data (Dataset) – The dataset to split into trainset and testset.

  • test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is 0.2.

  • train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.

  • random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.

  • shuffle (bool) – Whether to shuffle the ratings in the data parameter. Shuffling is not done in-place. Default is True.

Cross validation

surprise.model_selection.validation.cross_validate(algo, data, measures=['rmse', 'mae'], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch='2*n_jobs', verbose=False)[source]

Run a cross validation procedure for a given algorithm, reporting accuracy measures and computation times.

See an example in the User Guide.

Parameters
  • algo (AlgoBase) – The algorithm to evaluate.

  • data (Dataset) – The dataset on which to evaluate the algorithm.

  • measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].

  • cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.

  • return_train_measures (bool) – Whether to compute performance measures on the trainsets. Default is False.

  • n_jobs (int) –

    The maximum number of folds evaluated in parallel.

    • If -1, all CPUs are used.

    • If 1 is given, no parallel computing code is used at all, which is useful for debugging.

    • For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.

    Default is 1.

  • pre_dispatch (int or string) –

    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

    • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.

    • An int, giving the exact number of total jobs that are spawned.

    • A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.

    Default is '2*n_jobs'.

  • verbose (bool) – If True, accuracy measures for each split are printed, as well as train and test times. Averages and standard deviations over all splits are also reported. Default is False: nothing is printed.

Returns

A dict with the following keys:

  • 'test_*' where * corresponds to a lower-case accuracy measure, e.g. 'test_rmse': numpy array with accuracy values for each testset.

  • 'train_*' where * corresponds to a lower-case accuracy measure, e.g. 'train_rmse': numpy array with accuracy values for each trainset. Only available if return_train_measures is True.

  • 'fit_time': numpy array with the training time in seconds for each split.

  • 'test_time': numpy array with the testing time in seconds for each split.

Return type

dict