Getting Started¶
Basic usage¶
Surprise has a set of built-in algorithms and datasets for you to play with. In its simplest form, it takes about four lines of code to evaluate the performance of an algorithm:
examples/basic_usage.py
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
# Load the movielens-100k dataset (download it if needed),
# and split it into 3 folds for cross-validation.
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
# We'll use the famous SVD algorithm.
algo = SVD()
# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)
If Surprise cannot find the movielens-100k dataset, it will offer to download it and will store it under the .surprise_data folder in your home directory. The split() method automatically splits the dataset into 3 folds, and the evaluate() function runs the cross-validation procedure and computes some accuracy measures.
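Since evaluate() accepts any algorithm instance, the same splits can be reused to compare algorithms. As a minimal sketch (not part of the example above, reusing the data object already defined), we could evaluate the built-in KNNBasic algorithm with the same measures:

from surprise import KNNBasic

# Evaluate a second algorithm on the same 3-fold split for comparison.
knn_perf = evaluate(KNNBasic(), data, measures=['RMSE', 'MAE'])
print_perf(knn_perf)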
Load a custom dataset¶
You can of course use a custom dataset. Surprise offers two ways of loading a custom dataset:
- you can either specify a single file (e.g. a csv file) or a pandas dataframe with all the ratings and use the split() method to perform cross-validation, or train on the whole dataset;
- or, if your dataset is already split into predefined folds, you can specify a list of files for training and testing.
Either way, you will need to define a Reader object for Surprise to be able to parse the file(s) or the dataframe. We will now see how to handle both cases.
Load an entire dataset from a file or a dataframe¶
To load a dataset from a file (e.g. a csv file), you will need the load_from_file() method:
examples/load_custom_dataset.py
# path to dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)
data.split(n_folds=5)  # data can now be used normally
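A Reader can describe other layouts as well. As an illustrative sketch (the file path and column layout below are hypothetical, not part of the example above), a comma-separated file with a header line and ratings on a 1-10 scale could be parsed like this:

# Hypothetical CSV with a header line and columns: user,item,rating (scale 1-10).
reader = Reader(line_format='user item rating', sep=',',
                rating_scale=(1, 10), skip_lines=1)
data = Dataset.load_from_file('/path/to/my_ratings.csv', reader=reader)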
For more details about readers and how to use them, see the Reader class documentation.

Note

As you already know from the previous section, the Movielens-100k dataset is built-in, so a much quicker way to load the dataset is to do data = Dataset.load_builtin('ml-100k'). We will of course ignore this here.
To load a dataset from a pandas dataframe, you will need the load_from_df() method. You will also need a Reader object, but only the rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order. Each row thus corresponds to a given rating. This is not restrictive as you can reorder the columns of your dataframe easily.
examples/load_from_dataframe.py
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)
data.split(2)  # data can now be used normally
The dataframe initially looks like this:
   itemID  rating    userID
0       1       3         9
1       1       2        32
2       1       4         2
3       2       3        45
4       2       1  user_foo
Load a dataset with predefined folds¶
examples/load_custom_dataset_predefined_folds.py
# path to dataset folder
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')
# This time, we'll use the built-in reader.
reader = Reader('ml-100k')
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
Of course, nothing prevents you from only loading a single file for training and a single file for testing. However, the folds_files parameter still needs to be a list.
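In that case, folds_files simply contains a single tuple. As a small sketch, reusing the files_dir and reader defined in the example above with only the u1 pair:

# A single (train, test) pair still goes inside a list.
folds_files = [(files_dir + 'u1.base', files_dir + 'u1.test')]
data = Dataset.load_from_folds(folds_files, reader=reader)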
Advanced usage¶
Here we will dig a little deeper into what Surprise can do for you.
Tune algorithm parameters with GridSearch¶
The evaluate() function gives us the results for one set of parameters given to the algorithm. If you want to try the algorithm on different sets of parameters, the GridSearch class comes to the rescue. Given a dict of parameters, this class exhaustively tries all the combinations of parameters and helps get the best combination for an accuracy measure. It is analogous to GridSearchCV from scikit-learn.
For instance, suppose that we want to tune the parameters of the SVD algorithm. Some of the parameters of this algorithm are n_epochs, lr_all and reg_all. Thus we define a parameter grid as follows:
examples/grid_search_usage.py
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
'reg_all': [0.4, 0.6]}
Next we define a GridSearch instance and give it the class SVD as an algorithm, along with param_grid. We will compute both the RMSE and FCP values for all the combinations. Thus the following definition:
examples/grid_search_usage.py
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
Now that the GridSearch instance is ready, we can evaluate the algorithm on any data with the GridSearch.evaluate() method, exactly like with the regular evaluate() function:
examples/grid_search_usage.py
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
grid_search.evaluate(data)
Everything is ready now to read the results. For example, we get the best RMSE and FCP scores and parameters as follows:
examples/grid_search_usage.py
# best RMSE score
print(grid_search.best_score['RMSE'])
# >>> 0.96117566386
# combination of parameters that gave the best RMSE score
print(grid_search.best_params['RMSE'])
# >>> {'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}
# best FCP score
print(grid_search.best_score['FCP'])
# >>> 0.702279736531
# combination of parameters that gave the best FCP score
print(grid_search.best_params['FCP'])
# >>> {'reg_all': 0.6, 'lr_all': 0.005, 'n_epochs': 10}
For further analysis, we can easily read all the results in a pandas DataFrame as follows:
examples/grid_search_usage.py
import pandas as pd  # noqa
results_df = pd.DataFrame.from_dict(grid_search.cv_results)
print(results_df)
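Once the search is done, you will typically want to retrain the best configuration. A possible sketch, assuming your version of GridSearch exposes the best_estimator attribute (keyed by measure, like best_score and best_params):

# Retrieve the algorithm instance that gave the best RMSE,
# then retrain it on the whole dataset.
algo = grid_search.best_estimator['RMSE']
algo.train(data.build_full_trainset())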
Note
Dictionary parameters such as bsl_options and sim_options require particular treatment. See usage example below:
param_grid = {'k': [10, 20],
'sim_options': {'name': ['msd', 'cosine'],
'min_support': [1, 5],
'user_based': [False]}
}
Naturally, both can be combined, for example for the KNNBaseline algorithm:
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
'reg': [1, 2]},
'k': [2, 3],
'sim_options': {'name': ['msd', 'cosine'],
'min_support': [1, 5],
'user_based': [False]}
}
Manually iterate over folds¶
We have so far used the evaluate() function that does all the hard work for us. If you want to have better control over your experiments, you can use the folds() generator of your dataset, and then call the train() and test() methods of your algorithm on each of the folds:
examples/iterate_over_folds.py
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
algo = BaselineOnly()
for trainset, testset in data.folds():

    # train and test algorithm.
    algo.train(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    rmse = accuracy.rmse(predictions, verbose=True)
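If you need an aggregate figure, nothing more than standard Python is required. For instance, a small sketch that collects the per-fold scores and averages them:

# Collect the RMSE of each fold and report the average.
rmses = []
for trainset, testset in data.folds():
    algo.train(trainset)
    predictions = algo.test(testset)
    rmses.append(accuracy.rmse(predictions, verbose=False))

print('Mean RMSE over folds: {:.4f}'.format(sum(rmses) / len(rmses)))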
Train on a whole trainset and specifically query for predictions¶
Here we will review how to get a prediction for specified users and items. Along the way, we will also review how to train on a whole dataset, without performing cross-validation (i.e. there is no test set).
The latter is pretty straightforward: all you need is to load a dataset and use the build_full_trainset() method to build the trainset and train your algorithm:
examples/query_for_predictions.py
data = Dataset.load_builtin('ml-100k')
# Retrieve the trainset.
trainset = data.build_full_trainset()
# Build an algorithm, and train it.
algo = KNNBasic()
algo.train(trainset)
Now, there’s no way we could call the test() method, because we have no testset. But you can still get predictions for the users and items you want.
Let’s say you’re interested in user 196 and item 302 (make sure they’re in the trainset!), and you know that the true rating \(r_{ui} = 4\). All you need to do is call the predict() method:
examples/query_for_predictions.py
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302) # raw item id (as in the ratings file). They are **strings**!
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
The predict() method uses raw ids (read this). As the dataset we have used has been read from a file, the raw ids are strings (even if they represent numbers).

If the predict() method is called with user or item ids that were not part of the trainset, it’s up to the algorithm to decide whether it can still make a prediction or not. If it can’t, predict() will still predict the mean of all ratings \(\mu\).
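To see this behaviour, here is a small sketch (the raw id 'unknown_user' is made up for illustration, and we assume the Prediction object exposes a was_impossible flag in its details dictionary): KNNBasic cannot build a neighbourhood for an unseen user, so the returned prediction falls back to the global mean.

# Query for a user that is not in the trainset.
pred = algo.predict('unknown_user', iid, r_ui=4, verbose=True)
print(pred.est)                        # the mean of all ratings
print(pred.details['was_impossible'])  # True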
Obviously, it is perfectly fine to use the predict() method directly during a cross-validation process. It’s then up to you to ensure that the user and item ids are present in the trainset, though.
Command line usage¶
Surprise can also be used from the command line, for example:
surprise -algo SVD -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3
See detailed usage by running:
surprise -h