dataset module

The dataset module defines the Dataset class and its subclasses, which are used for managing datasets.

Users may use both built-in and user-defined datasets (see the Getting Started page for examples). Right now, three built-in datasets are available:

  • MovieLens 100k (ml-100k)
  • MovieLens 1M (ml-1m)
  • Jester (jester)

Built-in datasets can all be loaded (or downloaded if you haven’t already) using the Dataset.load_builtin() method.

Summary:

  • Dataset.load_builtin – Load a built-in dataset.
  • Dataset.load_from_file – Load a dataset from a (custom) file.
  • Dataset.load_from_folds – Load a dataset where folds (for cross-validation) are predefined by some files.
  • Dataset.folds – Generator function to iterate over the folds of the Dataset.
  • DatasetAutoFolds.split – Split the dataset into folds for future cross-validation.

class surprise.dataset.Dataset(reader=None, rating_scale=None)

Base class for loading datasets.

Note that you should never instantiate the Dataset class (or any of its derived classes) directly; instead, use one of the available class methods for loading datasets: load_builtin, load_from_df, load_from_file, or load_from_folds.

folds()

Generator function to iterate over the folds of the Dataset.

Warning

Deprecated since version 1.05. Use cross-validation iterators instead. This method will be removed in later versions.

Yields: tuple – Trainset and testset of the current fold.
classmethod load_builtin(name=u'ml-100k', prompt=True)

Load a built-in dataset.

If the dataset is not already on disk, it will be downloaded and saved. You will have to split your dataset using the split method. See an example in the User Guide.

Parameters:
  • name (string) – The name of the built-in dataset to load. Accepted values are ‘ml-100k’, ‘ml-1m’, and ‘jester’. Default is ‘ml-100k’.
  • prompt (bool) – Prompt before downloading if dataset is not already on disk. Default is True.
Returns:

A Dataset object.

Raises:

ValueError – If the name parameter is incorrect.
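
As an illustration of the parameters above, here is a minimal sketch that loads a built-in dataset and cross-validates a random baseline on it. It assumes the surprise package is installed and that NormalPredictor and surprise.model_selection.cross_validate are available in your release; neither is described on this page. The helper name evaluate_builtin and the early name check are hypothetical and only there for illustration.

```python
# Names accepted by load_builtin, per the parameter list above.
BUILTIN_NAMES = ("ml-100k", "ml-1m", "jester")


def evaluate_builtin(name="ml-100k", prompt=True):
    """Load a built-in dataset and cross-validate a random baseline on it."""
    if name not in BUILTIN_NAMES:
        # Fail fast, mirroring the ValueError that load_builtin documents.
        raise ValueError(f"unknown built-in dataset: {name!r}")

    # Imported lazily so the name check above works without surprise installed.
    # NormalPredictor and cross_validate are assumed to exist in your release.
    from surprise import Dataset, NormalPredictor
    from surprise.model_selection import cross_validate

    # May ask for confirmation and download the data on first use.
    data = Dataset.load_builtin(name, prompt=prompt)
    return cross_validate(NormalPredictor(), data, measures=["rmse"], cv=5)
```

Calling evaluate_builtin() should return a dict of per-fold results; passing a name outside the three listed values raises ValueError before anything is downloaded.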

classmethod load_from_df(df, reader=None, rating_scale=None)

Load a dataset from a pandas dataframe.

Use this if you want to use a custom dataset that is stored in a pandas dataframe. See the User Guide for an example.

Parameters:
  • df (DataFrame) – The dataframe containing the ratings. It must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.
  • reader (Reader) –

    A reader to read the ratings. Only the rating_scale field needs to be specified.

    Warning

    Using the reader parameter here is deprecated and will not be supported in future versions. Use the rating_scale parameter directly instead.

  • rating_scale (tuple) – The rating scale used for every rating, e.g. (1, 5).
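
The column-order requirement is easy to get wrong, so here is a minimal sketch of one way to satisfy it. It assumes pandas and surprise are installed; the column names, the toy ratings, and the helper dataset_from_dict are hypothetical. The rating_scale keyword is used as this page recommends; other releases of the library may expect a Reader instead.

```python
# Required column order: user (raw) ids, item (raw) ids, ratings.
COLUMNS = ["user_id", "item_id", "rating"]

# Toy ratings; the keys are deliberately not in the required order.
RAW_RATINGS = {
    "rating": [4, 2, 5, 3],
    "item_id": ["i1", "i2", "i1", "i3"],
    "user_id": ["u1", "u1", "u2", "u3"],
}


def dataset_from_dict(raw, rating_scale=(1, 5)):
    """Build a Dataset from a dict of equal-length columns."""
    # Lazy imports: pandas and surprise are only needed when this is called.
    import pandas as pd
    from surprise import Dataset

    # Selecting with COLUMNS reorders the frame to user, item, rating.
    df = pd.DataFrame(raw)[COLUMNS]
    # rating_scale is passed directly, as documented on this page; other
    # releases may require reader=Reader(rating_scale=...) instead.
    return Dataset.load_from_df(df, rating_scale=rating_scale)
```

Selecting the columns explicitly keeps the call correct even when the dataframe carries extra columns or stores them in a different order.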
classmethod load_from_file(file_path, reader, rating_scale=None)

Load a dataset from a (custom) file.

Use this if you want to use a custom dataset and all of the ratings are stored in one file. You will have to split your dataset using the split method. See an example in the User Guide.

Parameters:
  • file_path (string) – The path to the file containing ratings.
  • reader (Reader) – A reader to read the file.
  • rating_scale (tuple) – The rating scale used for every rating, e.g. (1, 5).
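
A minimal sketch of loading a single ratings file follows. It assumes surprise is installed and that the Reader class accepts the line_format and sep keywords; the file layout, the toy rows, and both helper names are hypothetical. The ratings are assumed to fit the Reader's default scale.

```python
import csv
import os
import tempfile


def write_ratings_csv(path, rows):
    """Write (user, item, rating) rows as comma-separated lines, no header."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)


def load_ratings_csv(path):
    """Load a headerless 'user,item,rating' file as a Dataset."""
    from surprise import Dataset, Reader  # lazy import, needs surprise

    # line_format names the fields in file order; sep is the field separator.
    # No rating scale is given, so the Reader's default is assumed to fit.
    reader = Reader(line_format="user item rating", sep=",")
    return Dataset.load_from_file(path, reader=reader)


# Toy data written to a temporary directory for the example.
rows = [("u1", "i1", 4), ("u1", "i2", 2), ("u2", "i1", 5)]
ratings_path = os.path.join(tempfile.mkdtemp(), "ratings.csv")
write_ratings_csv(ratings_path, rows)
```

With surprise installed, load_ratings_csv(ratings_path) would then return the Dataset for this file.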
classmethod load_from_folds(folds_files, reader, rating_scale=None)

Load a dataset where folds (for cross-validation) are predefined by some files.

The purpose of this method is to cover a common use case where a dataset is already split into predefined folds, such as the movielens-100k dataset, which defines files u1.base, u1.test, u2.base, u2.test, etc. It can also be used when you don’t want to perform cross-validation but still want to specify your training and testing data (which comes down to 1-fold cross-validation anyway). See an example in the User Guide.

Parameters:
  • folds_files (iterable of tuples) – The list of the folds. A fold is a tuple of the form (path_to_train_file, path_to_test_file).
  • reader (Reader) – A reader to read the files.
  • rating_scale (tuple) – The rating scale used for every rating, e.g. (1, 5).
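
The folds_files argument can be sketched as follows for files named like the movielens-100k ones. This assumes surprise is installed and that Reader('ml-100k'), NormalPredictor, accuracy.rmse, and surprise.model_selection.PredefinedKFold exist in your release; none of them is described on this page. Both helper names are hypothetical, and the number of folds is left to the caller.

```python
import os


def movielens_fold_files(data_dir, n_folds):
    """Return [(train_path, test_path), ...] for u1.base/u1.test, u2.base/..."""
    return [
        (os.path.join(data_dir, f"u{i}.base"), os.path.join(data_dir, f"u{i}.test"))
        for i in range(1, n_folds + 1)
    ]


def evaluate_predefined_folds(data_dir, n_folds):
    """Compute one RMSE per predefined fold with a random baseline."""
    # Lazy imports; these names are assumptions about your surprise release.
    from surprise import Dataset, NormalPredictor, Reader, accuracy
    from surprise.model_selection import PredefinedKFold

    reader = Reader("ml-100k")  # built-in reader for this file layout
    folds_files = movielens_fold_files(data_dir, n_folds)
    data = Dataset.load_from_folds(folds_files, reader=reader)

    algo = NormalPredictor()
    scores = []
    for trainset, testset in PredefinedKFold().split(data):
        algo.fit(trainset)
        scores.append(accuracy.rmse(algo.test(testset), verbose=False))
    return scores
```

Passing a single (train, test) pair gives the "1-fold" case described above, where you only want to fix the training and testing data.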
class surprise.dataset.DatasetAutoFolds(ratings_file=None, reader=None, df=None, rating_scale=None)

A class derived from Dataset for which folds (for cross-validation) are not predefined (or for when there are no folds at all).

build_full_trainset()

Do not split the dataset into folds and just return a trainset as is, built from the whole dataset.

Users can then query for predictions, as shown in the User Guide.

Returns: The Trainset.
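
A minimal sketch of training on the whole dataset and then querying a prediction. The helper names are hypothetical; the second helper assumes surprise is installed and that KNNBasic, fit(), and predict() are available in your release, which this page does not state.

```python
def fit_on_full_data(data, algo):
    """Train algo on every rating in data and return the fitted algo."""
    trainset = data.build_full_trainset()  # no folds, nothing held out
    algo.fit(trainset)
    return algo


def predict_one(data, uid, iid):
    """Estimate the rating of raw user id uid for raw item id iid."""
    from surprise import KNNBasic  # lazy import; assumed to be available

    algo = fit_on_full_data(data, KNNBasic())
    # predict() is given raw ids, i.e. the ids as they appear in the source.
    return algo.predict(uid, iid).est
```

Because no ratings are held out, this pattern suits producing predictions rather than measuring accuracy.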
split(n_folds=5, shuffle=True)

Split the dataset into folds for future cross-validation.

Warning

Deprecated since version 1.05. Use cross-validation iterators instead. This method will be removed in later versions.

If you forget to call split(), the dataset will be automatically shuffled and split for 5-fold cross-validation.

You can obtain repeatable splits across all your experiments by seeding the RNG:

import random

my_seed = 0  # any fixed integer
random.seed(my_seed)  # call this before you call split!
Parameters:
  • n_folds (int) – The number of folds.
  • shuffle (bool) – Whether to shuffle ratings before splitting. If False, folds will always be the same each time the experiment is run. Default is True.
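
To see why seeding the RNG and the shuffle flag affect repeatability, consider this pure-Python sketch. It is an illustration only, not Surprise's implementation: the function shuffled_folds and the toy ratings are hypothetical, and the only assumption carried over from the text is that shuffling draws from Python's global random module.

```python
import random


def shuffled_folds(ratings, n_folds=5, shuffle=True):
    """Partition ratings into n_folds lists, optionally shuffling first."""
    ratings = list(ratings)
    if shuffle:
        # Uses the global RNG, which is why random.seed() controls the result.
        random.shuffle(ratings)
    # Deal the ratings out round-robin into n_folds groups.
    return [ratings[i::n_folds] for i in range(n_folds)]


ratings = list(range(20))  # stand-ins for rating records

random.seed(0)
first = shuffled_folds(ratings)
random.seed(0)
second = shuffled_folds(ratings)  # same seed, so the same folds as `first`

unshuffled = shuffled_folds(ratings, shuffle=False)  # identical on every run
```

Here first and second are equal because the RNG was reset to the same seed before each call, while unshuffled never depends on the RNG at all.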