dataset module

The dataset module defines tools for managing datasets.

Users may use both built-in and user-defined datasets (see the Getting Started page for examples). Right now, three built-in datasets are available: ml-100k, ml-1m, and jester.

Built-in datasets can all be loaded (or downloaded if you haven’t already) using the Dataset.load_builtin() method. For each built-in dataset, Surprise also provides a predefined reader, which is useful if you want to use a custom dataset that has the same format as a built-in one.

Summary:

Dataset.load_builtin Load a built-in dataset.
Dataset.load_from_file Load a dataset from a (custom) file.
Dataset.load_from_folds Load a dataset where folds (for cross-validation) are predefined by some files.
Dataset.folds Generator function to iterate over the folds of the Dataset.
DatasetAutoFolds.split Split the dataset into folds for future cross-validation.
Reader The Reader class is used to parse a file containing ratings.
Trainset A trainset contains all useful data that constitutes a training set.
class surprise.dataset.Dataset(reader)

Base class for loading datasets.

Note that you should never instantiate the Dataset class directly (same goes for its derived classes), but instead use one of the three available methods for loading datasets.

folds()

Generator function to iterate over the folds of the Dataset.

See User Guide for usage.

Yields:tuple – Trainset and testset of current fold.
classmethod load_builtin(name=u'ml-100k')

Load a built-in dataset.

If the dataset has not already been loaded, it will be downloaded and saved. You will have to split your dataset using the split method. See an example in the User Guide.

Parameters:name (string) – The name of the built-in dataset to load. Accepted values are ‘ml-100k’, ‘ml-1m’, and ‘jester’. Default is ‘ml-100k’.
Returns:A Dataset object.
Raises:ValueError – If the name parameter is incorrect.
classmethod load_from_file(file_path, reader)

Load a dataset from a (custom) file.

Use this if you want to use a custom dataset and all of the ratings are stored in one file. You will have to split your dataset using the split method. See an example in the User Guide.

Parameters:
  • file_path (string) – The path to the file containing ratings.
  • reader (Reader) – A reader to read the file.
classmethod load_from_folds(folds_files, reader)

Load a dataset where folds (for cross-validation) are predefined by some files.

The purpose of this method is to cover a common use case where a dataset is already split into predefined folds, such as the movielens-100k dataset, which defines files u1.base, u1.test, u2.base, u2.test, and so on. It can also be used when you don’t want to perform cross-validation but still want to specify your training and testing data (which comes down to 1-fold cross-validation anyway). See an example in the User Guide.

Parameters:
  • folds_files (iterable of tuples) – The list of the folds. A fold is a tuple of the form (path_to_train_file, path_to_test_file).
  • reader (Reader) – A reader to read the files.
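As a minimal sketch of what the folds_files argument looks like, the list of (train, test) path pairs for a movielens-100k style layout could be built as shown here. The helper name movielens_fold_files, the folder name, and the number of folds are assumptions made for illustration; only the uK.base / uK.test naming pattern comes from the description above.

```python
import os


def movielens_fold_files(folder, n_folds):
    """Return [(train_path, test_path), ...] for files named uK.base / uK.test.

    Hypothetical helper: it only builds path strings and does not check
    that the files exist on disk.
    """
    return [
        (os.path.join(folder, "u%d.base" % k), os.path.join(folder, "u%d.test" % k))
        for k in range(1, n_folds + 1)
    ]


# "ml-100k" is a placeholder folder name, not a path taken from the library.
folds_files = movielens_fold_files("ml-100k", n_folds=2)
for train_path, test_path in folds_files:
    print(train_path, test_path)
```

A list of this shape is what load_from_folds expects as folds_files, together with a Reader that matches the files' format.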
class surprise.dataset.DatasetAutoFolds(ratings_file=None, reader=None)

A class derived from Dataset for which folds (for cross-validation) are not predefined, or for when there are no folds at all.

build_full_trainset()

Do not split the dataset into folds and just return a trainset as is, built from the whole dataset.

Users can then query for predictions, as shown in the User Guide.

Returns:The Trainset.
split(n_folds=5, shuffle=True)

Split the dataset into folds for future cross-validation.

If you forget to call split(), the dataset will be automatically shuffled and split for 5-fold cross-validation.

You can obtain repeatable splits over all your experiments by seeding the RNG:

import random
my_seed = 0  # any fixed integer works
random.seed(my_seed)  # call this before you call split!
Parameters:
  • n_folds (int) – The number of folds.
  • shuffle (bool) – Whether to shuffle ratings before splitting. If False, folds will always be the same each time the experiment is run. Default is True.
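To make the interaction between shuffling, seeding, and n_folds concrete, here is a rough pure-Python sketch of a shuffled k-fold split. It is not Surprise's actual implementation: the function name split_into_folds and the toy ratings are assumptions, and the folds are plain lists rather than Trainset objects.

```python
import random


def split_into_folds(ratings, n_folds=5, shuffle=True):
    """Yield (train, test) pairs of plain lists, one pair per fold.

    Illustrative only: every rating lands in exactly one test fold.
    """
    ratings = list(ratings)  # work on a copy, leave the input untouched
    if shuffle:
        random.shuffle(ratings)  # module-level RNG, so random.seed() controls it
    for k in range(n_folds):
        start = k * len(ratings) // n_folds
        stop = (k + 1) * len(ratings) // n_folds
        yield ratings[:start] + ratings[stop:], ratings[start:stop]


# Toy (user, item, rating) tuples, made up for this example.
ratings = [("u%d" % i, "i%d" % (i % 3), float(1 + i % 5)) for i in range(10)]

random.seed(0)
first_run = list(split_into_folds(ratings, n_folds=5))
random.seed(0)
second_run = list(split_into_folds(ratings, n_folds=5))
print(first_run == second_run)  # same seed before each split, same folds
```

With shuffle=False the seed is irrelevant, since the ratings are split in their original order.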
class surprise.dataset.Reader(name=None, line_format=None, sep=None, rating_scale=(1, 5), skip_lines=0)

The Reader class is used to parse a file containing ratings.

Such a file is assumed to specify only one rating per line, and each line needs to respect the following structure:

user ; item ; rating ; [timestamp]

where the order of the fields and the separator (here ‘;’) may be arbitrarily defined (see below). Brackets indicate that the timestamp field is optional.

Parameters:
  • name (string, optional) – If specified, a Reader for one of the built-in datasets is returned and any other parameter is ignored. Accepted values are ‘ml-100k’, ‘ml-1m’, and ‘jester’. Default is None.
  • line_format (string) – The field names, in the order in which they are encountered on a line. Example: 'item user rating'.
  • sep (char) – The separator between fields. Example: ';'.
  • rating_scale (tuple, optional) – The rating scale used for every rating. Default is (1, 5).
  • skip_lines (int, optional) – Number of lines to skip at the beginning of the file. Default is 0.
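The role of line_format and sep can be illustrated with a small stand-alone parser. This is a sketch under stated assumptions, not the Reader class itself: the function parse_line and its default arguments are hypothetical, and the sample lines are made up.

```python
def parse_line(line, line_format="user item rating", sep=";"):
    """Parse one ratings line into (user, item, rating, timestamp).

    Hypothetical helper, not Surprise's Reader. The timestamp is None
    when line_format does not mention it.
    """
    names = line_format.split()
    values = [value.strip() for value in line.split(sep)]
    if len(values) < len(names):
        raise ValueError("expected %d fields, got %d" % (len(names), len(values)))
    fields = dict(zip(names, values))  # map each field name to its value
    return (
        fields["user"],
        fields["item"],
        float(fields["rating"]),
        fields.get("timestamp"),
    )


# Default layout: user ; item ; rating
print(parse_line("42 ; 7 ; 3.5"))
# Reordered fields, a different separator, and an optional timestamp
print(parse_line("7,42,4,881250949", line_format="item user rating timestamp", sep=","))
```

The ids are kept as strings here, which matches the raw ids documented as str for to_inner_uid and to_inner_iid.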
class surprise.dataset.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)

A trainset contains all useful data that constitutes a training set.

It is used by the train() method of every prediction algorithm. You should not try to build such an object on your own but rather use the Dataset.folds() method or the DatasetAutoFolds.build_full_trainset() method.

ur

defaultdict of list – The users' ratings. This is a dictionary containing lists of tuples of the form (item_inner_id, rating). The keys are user inner ids.

ir

defaultdict of list – The items' ratings. This is a dictionary containing lists of tuples of the form (user_inner_id, rating). The keys are item inner ids.

n_users

Total number of users \(|U|\).

n_items

Total number of items \(|I|\).

n_ratings

Total number of ratings \(|R_{train}|\).

rating_scale

tuple – The minimum and maximum ratings of the rating scale.

global_mean

The mean of all ratings \(\mu\).
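The attributes above can be pictured with a short sketch that derives them from a list of raw ratings. It is illustrative only: the function build_trainset_fields is hypothetical, and assigning inner ids in order of first appearance is an assumption of this sketch rather than documented behaviour.

```python
from collections import defaultdict


def build_trainset_fields(raw_ratings):
    """Build ur, ir and the summary values from (raw_uid, raw_iid, rating) tuples.

    Illustrative only; inner ids are handed out in order of first appearance.
    """
    raw2inner_id_users, raw2inner_id_items = {}, {}
    ur, ir = defaultdict(list), defaultdict(list)
    for raw_uid, raw_iid, rating in raw_ratings:
        # setdefault returns the existing inner id, or stores the next free one
        uid = raw2inner_id_users.setdefault(raw_uid, len(raw2inner_id_users))
        iid = raw2inner_id_items.setdefault(raw_iid, len(raw2inner_id_items))
        ur[uid].append((iid, rating))
        ir[iid].append((uid, rating))
    n_ratings = len(raw_ratings)
    global_mean = sum(rating for _, _, rating in raw_ratings) / n_ratings
    return ur, ir, len(ur), len(ir), n_ratings, global_mean


# Made-up ratings: two users, two items, three ratings.
raw_ratings = [("alice", "A", 4.0), ("alice", "B", 2.0), ("bob", "A", 3.0)]
ur, ir, n_users, n_items, n_ratings, global_mean = build_trainset_fields(raw_ratings)
print(dict(ur))
print(dict(ir))
print(n_users, n_items, n_ratings, global_mean)
```

Keeping both ur and ir means the same rating is stored twice, once indexed by user and once by item, so either view can be read without scanning the other.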

all_items()

Generator function to iterate over all items.

Yields:Inner id of items.
all_ratings()

Generator function to iterate over all ratings.

Yields:A tuple (uid, iid, rating) where ids are inner ids.
all_users()

Generator function to iterate over all users.

Yields:Inner id of users.
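As a rough illustration of these three generators, the sketch below iterates over plain dictionaries shaped like ur and ir. The free functions and the toy data are assumptions for the example; in the library these are methods of Trainset.

```python
def all_users(ur):
    """Yield every user inner id (the keys of ur)."""
    for uid in ur:
        yield uid


def all_items(ir):
    """Yield every item inner id (the keys of ir)."""
    for iid in ir:
        yield iid


def all_ratings(ur):
    """Yield (uid, iid, rating) tuples, where both ids are inner ids."""
    for uid, user_ratings in ur.items():
        for iid, rating in user_ratings:
            yield uid, iid, rating


# Toy data: inner user id -> [(inner item id, rating)], and the item-side view.
ur = {0: [(0, 4.0), (1, 2.0)], 1: [(0, 3.0)]}
ir = {0: [(0, 4.0), (1, 3.0)], 1: [(0, 2.0)]}

print(list(all_users(ur)))
print(list(all_items(ir)))
print(list(all_ratings(ur)))
```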
global_mean

Return the mean of all ratings.

It’s only computed once.
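"Computed once" suggests a lazily evaluated, cached value. One common way to get that behaviour in Python is sketched below; the class MeanCache and its n_computations counter are hypothetical and only serve to show the caching, they are not part of Surprise.

```python
class MeanCache:
    """Hypothetical holder that computes the mean of its ratings lazily."""

    def __init__(self, ratings):
        self._ratings = list(ratings)
        self._global_mean = None   # not computed yet
        self.n_computations = 0    # counter added only for this demonstration

    @property
    def global_mean(self):
        if self._global_mean is None:
            self.n_computations += 1
            self._global_mean = sum(self._ratings) / len(self._ratings)
        return self._global_mean


cache = MeanCache([4.0, 2.0, 3.0])
print(cache.global_mean, cache.global_mean)  # second access reuses the cached value
print(cache.n_computations)
```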

knows_item(iid)

Indicate if the item is part of the trainset.

An item is part of the trainset if the item was rated at least once.

Parameters:iid (int) – The (inner) item id. See this note.
Returns:True if item is part of the trainset, else False.
knows_user(uid)

Indicate if the user is part of the trainset.

A user is part of the trainset if the user has at least one rating.

Parameters:uid (int) – The (inner) user id. See this note.
Returns:True if user is part of the trainset, else False.
to_inner_iid(riid)

Convert a raw item id to an inner id.

See this note.

Parameters:riid (str) – The item raw id.
Returns:The item inner id.
Return type:int
Raises:ValueError – When item is not part of the trainset.
to_inner_uid(ruid)

Convert a raw user id to an inner id.

See this note.

Parameters:ruid (str) – The user raw id.
Returns:The user inner id.
Return type:int
Raises:ValueError – When user is not part of the trainset.
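The raw-to-inner conversion and the knows_user check can be sketched with plain dictionaries. This is a simplified stand-in under stated assumptions: the free functions below take the mappings as arguments (in the library they are Trainset methods), and the error message text is made up.

```python
def to_inner_uid(raw2inner_id_users, ruid):
    """Return the inner id for a raw user id, or raise ValueError."""
    try:
        return raw2inner_id_users[ruid]
    except KeyError:
        raise ValueError("User %s is not part of the trainset." % ruid)


def knows_user(ur, uid):
    """A user is known if at least one rating is stored under its inner id."""
    return len(ur.get(uid, [])) > 0


raw2inner_id_users = {"alice": 0, "bob": 1}  # made-up raw ids
ur = {0: [(0, 4.0)], 1: [(0, 3.0)]}          # inner uid -> [(inner iid, rating)]

print(to_inner_uid(raw2inner_id_users, "alice"))
print(knows_user(ur, 1), knows_user(ur, 5))

try:
    to_inner_uid(raw2inner_id_users, "carol")
except ValueError as err:
    print("ValueError:", err)
```

The item-side functions to_inner_iid and knows_item follow the same pattern with the item mapping and ir.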