The model_selection package¶
Surprise provides various tools to run cross-validation procedures and search for the best parameters of a prediction algorithm. The tools presented here are all heavily inspired by the excellent scikit learn library.
Cross validation iterators¶
The model_selection.split module contains various cross-validation iterators. Design and tools are inspired by the mighty scikit learn.
The available iterators are:
- KFold: A basic cross-validation iterator.
- RepeatedKFold: Repeated KFold cross validator.
- ShuffleSplit: A basic cross-validation iterator with random trainsets and testsets.
- LeaveOneOut: Cross-validation iterator where each user has exactly one rating in the testset.
- PredefinedKFold: A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.
This module also contains a function for splitting datasets into trainset and testset:
- train_test_split: Split a dataset into trainset and testset.
- class surprise.model_selection.split.KFold(n_splits=5, random_state=None, shuffle=True)[source]¶
A basic cross-validation iterator.
Each fold is used once as a testset while the k - 1 remaining folds are used for training.
See an example in the User Guide.
- Parameters:
n_splits (int) – The number of folds.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Default is True.
- class surprise.model_selection.split.LeaveOneOut(n_splits=5, random_state=None, min_n_ratings=0)[source]¶
Cross-validation iterator where each user has exactly one rating in the testset.
Contrary to other cross-validation strategies, LeaveOneOut does not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
See an example in the User Guide.
- Parameters:
n_splits (int) – The number of folds.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
min_n_ratings (int) – Minimum number of ratings for each user in the trainset. E.g. if min_n_ratings is 2, we are sure each user has at least 2 ratings in the trainset (and 1 in the testset). Other users are discarded. Default is 0, so some users (having only one rating) may be in the testset and not in the trainset.
- class surprise.model_selection.split.PredefinedKFold[source]¶
A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.
See an example in the User Guide.
- class surprise.model_selection.split.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)[source]¶
Repeated KFold cross validator.
Repeats KFold n times with different randomization in each repetition.
See an example in the User Guide.
- Parameters:
n_splits (int) – The number of folds.
n_repeats (int) – The number of repetitions.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
- class surprise.model_selection.split.ShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=None, shuffle=True)[source]¶
A basic cross-validation iterator with random trainsets and testsets.
Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
See an example in the User Guide.
- Parameters:
n_splits (int) – The number of folds.
test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, it represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is 0.2.
train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, it represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Setting this to False defeats the purpose of this iterator, but it’s useful for the implementation of train_test_split(). Default is True.
- surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)[source]¶
Split a dataset into trainset and testset.
See an example in the User Guide.
Note: this function cannot be used as a cross-validation iterator.
- Parameters:
data (Dataset) – The dataset to split into trainset and testset.
test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, it represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is 0.2.
train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, it represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
shuffle (bool) – Whether to shuffle the ratings in the data parameter. Shuffling is not done in-place. Default is True.
Cross validation¶
- surprise.model_selection.validation.cross_validate(algo, data, measures=['rmse', 'mae'], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch='2*n_jobs', verbose=False)[source]¶
Run a cross validation procedure for a given algorithm, reporting accuracy measures and computation times.
See an example in the User Guide.
- Parameters:
algo (AlgoBase) – The algorithm to evaluate.
data (Dataset) – The dataset on which to evaluate the algorithm.
measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
return_train_measures (bool) – Whether to compute performance measures on the trainsets. Default is False.
n_jobs (int) – The maximum number of folds evaluated in parallel. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + n_jobs + 1) CPUs are used. For example, with n_jobs = -2, all CPUs but one are used. Default is 1.
pre_dispatch (int or string) – Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned (use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs); an int, giving the exact number of total jobs that are spawned; or a string, giving an expression as a function of n_jobs, as in '2*n_jobs'. Default is '2*n_jobs'.
verbose (int) – If True, accuracy measures for each split are printed, as well as train and test times. Averages and standard deviations over all splits are also reported. Default is False: nothing is printed.
- Returns:
A dict with the following keys:
'test_*', where * corresponds to a lower-case accuracy measure, e.g. 'test_rmse': numpy array with accuracy values for each testset.
'train_*', where * corresponds to a lower-case accuracy measure, e.g. 'train_rmse': numpy array with accuracy values for each trainset. Only available if return_train_measures is True.
'fit_time': numpy array with the training time in seconds for each split.
'test_time': numpy array with the testing time in seconds for each split.
- Return type:
dict
Parameter search¶
- class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=['rmse', 'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch='2*n_jobs', joblib_verbose=0)[source]¶
The GridSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.
See an example in the User Guide.
- Parameters:
algo_class (AlgoBase) – The class of the algorithm to evaluate.
param_grid (dict) – Dictionary with algorithm parameters as keys and lists of values as values. All combinations will be evaluated with the desired algorithm. Dict parameters such as sim_options require special treatment, see this note.
measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
refit (bool or str) – If True, refit the algorithm on the whole dataset using the set of parameters that gave the best average performance for the first measure of measures. Other measures can be used by passing a string (corresponding to the measure name). Then, you can use the test() and predict() methods. refit can only be used if the data parameter given to fit() hasn’t been loaded with load_from_folds(). Default is False.
return_train_measures (bool) – Whether to compute performance measures on the trainsets. If True, the cv_results attribute will also contain measures for trainsets. Default is False.
n_jobs (int) – The maximum number of parallel training procedures. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + n_jobs + 1) CPUs are used. For example, with n_jobs = -2, all CPUs but one are used. Default is 1.
pre_dispatch (int or string) – Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned (use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs); an int, giving the exact number of total jobs that are spawned; or a string, giving an expression as a function of n_jobs, as in '2*n_jobs'. Default is '2*n_jobs'.
joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages.
- best_estimator¶
Using an accuracy measure as key, get the algorithm that gave the best accuracy results for the chosen measure, averaged over all splits.
- Type:
dict of AlgoBase
- best_score¶
Using an accuracy measure as key, get the best average score achieved for that measure.
- Type:
dict of floats
- best_params¶
Using an accuracy measure as key, get the parameters combination that gave the best accuracy results for the chosen measure (on average).
- Type:
dict of dicts
- best_index¶
Using an accuracy measure as key, get the index (usable with cv_results) of the parameter combination that achieved the highest accuracy for that measure (on average).
- Type:
dict of ints
- cv_results¶
A dict that contains accuracy measures over all splits, as well as train and test time for each parameter combination. Can be imported into a pandas DataFrame (see example).
- Type:
dict of arrays
- fit(data)¶
Runs the fit() method of the algorithm for all parameter combinations, over different splits given by the cv parameter.
- Parameters:
data (Dataset) – The dataset on which to evaluate the algorithm, in parallel.
- predict(*args)¶
Call predict() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.predict().
Only available if refit is not False.
- test(testset, verbose=False)¶
Call test() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.test().
Only available if refit is not False.
- class surprise.model_selection.search.RandomizedSearchCV(algo_class, param_distributions, n_iter=10, measures=['rmse', 'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch='2*n_jobs', random_state=None, joblib_verbose=0)[source]¶
The RandomizedSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. As opposed to GridSearchCV, which uses an exhaustive combinatorial approach, RandomizedSearchCV samples randomly from the parameter space. This is useful for finding the best set of parameters for a prediction algorithm, especially using a coarse-to-fine approach. It is analogous to RandomizedSearchCV from scikit-learn.
See an example in the User Guide.
- Parameters:
algo_class (AlgoBase) – The class of the algorithm to evaluate.
param_distributions (dict) – Dictionary with algorithm parameters as keys and distributions or lists of parameters to try as values. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. Parameters will be sampled n_iter times.
n_iter (int) – Number of times parameter settings are sampled. Default is 10.
measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
refit (bool or str) – If True, refit the algorithm on the whole dataset using the set of parameters that gave the best average performance for the first measure of measures. Other measures can be used by passing a string (corresponding to the measure name). Then, you can use the test() and predict() methods. refit can only be used if the data parameter given to fit() hasn’t been loaded with load_from_folds(). Default is False.
return_train_measures (bool) – Whether to compute performance measures on the trainsets. If True, the cv_results attribute will also contain measures for trainsets. Default is False.
n_jobs (int) – The maximum number of parallel training procedures. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + n_jobs + 1) CPUs are used. For example, with n_jobs = -2, all CPUs but one are used. Default is 1.
pre_dispatch (int or string) – Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned (use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs); an int, giving the exact number of total jobs that are spawned; or a string, giving an expression as a function of n_jobs, as in '2*n_jobs'. Default is '2*n_jobs'.
random_state (int, RandomState or None) – Pseudo random number generator seed used for random uniform sampling from lists of possible values instead of scipy.stats distributions. If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default is None.
joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages.
- best_estimator¶
Using an accuracy measure as key, get the algorithm that gave the best accuracy results for the chosen measure, averaged over all splits.
- Type:
dict of AlgoBase
- best_score¶
Using an accuracy measure as key, get the best average score achieved for that measure.
- Type:
dict of floats
- best_params¶
Using an accuracy measure as key, get the parameters combination that gave the best accuracy results for the chosen measure (on average).
- Type:
dict of dicts
- best_index¶
Using an accuracy measure as key, get the index (usable with cv_results) of the parameter combination that achieved the highest accuracy for that measure (on average).
- Type:
dict of ints
- cv_results¶
A dict that contains accuracy measures over all splits, as well as train and test time for each parameter combination. Can be imported into a pandas DataFrame (see example).
- Type:
dict of arrays
- fit(data)¶
Runs the fit() method of the algorithm for all parameter combinations, over different splits given by the cv parameter.
- Parameters:
data (Dataset) – The dataset on which to evaluate the algorithm, in parallel.
- predict(*args)¶
Call predict() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.predict().
Only available if refit is not False.
- test(testset, verbose=False)¶
Call test() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.test().
Only available if refit is not False.