Getting Started

Basic usage

Surprise has a set of built-in algorithms and datasets for you to play with. In its simplest form, it takes about four lines of code to evaluate the performance of an algorithm:

From file examples/basic_usage.py

from surprise import SVD
from surprise import Dataset
from surprise import evaluate


# Load the movielens-100k dataset (download it if needed),
# and split it into 3 folds for cross-validation.
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate the performance of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print(perf)

If Surprise cannot find the movielens-100k dataset, it will offer to download it and will store it under the .surprise_data folder in your home directory. The split() method splits the dataset into 3 folds, and the evaluate() function runs the cross-validation procedure and computes the requested accuracy measures (here RMSE and MAE).
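To make the idea of splitting into folds concrete, here is a minimal pure-Python sketch of k-fold splitting. It is not Surprise's implementation: the helper k_fold_splits, the seed parameter and the toy ratings are all hypothetical names introduced for illustration only.

```python
import random


def k_fold_splits(ratings, n_folds=3, seed=0):
    """Yield one (trainset, testset) pair per fold (illustrative helper)."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    for fold in range(n_folds):
        # Every n_folds-th rating, starting at index `fold`, is held out.
        testset = shuffled[fold::n_folds]
        trainset = [r for i, r in enumerate(shuffled) if i % n_folds != fold]
        yield trainset, testset


# Hypothetical (user, item, rating) triples, for illustration only.
ratings = [('u%d' % (i % 4), 'i%d' % i, float(1 + i % 5)) for i in range(12)]

folds = list(k_fold_splits(ratings, n_folds=3))
for trainset, testset in folds:
    print(len(trainset), len(testset))
```

Each rating ends up in exactly one test set, so every rating is used once for testing and n_folds - 1 times for training.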

Load a custom dataset

You can of course use a custom dataset. Surprise offers two ways of loading a custom dataset:

you can either specify a single file with all the ratings and use the split() method to perform cross-validation;
or, if your dataset is already split into predefined folds, you can specify a list of files for training and testing.

Either way, you will need to define a Reader object for Surprise to be able to parse the file(s).

We’ll see how to handle both cases with the movielens-100k dataset. Of course this is a built-in dataset, but we will act as if it were not.

Load an entire dataset

From file examples/load_custom_dataset.py

from surprise import Dataset
from surprise import Reader

# path to dataset file
file_path = '/home/nico/.surprise_data/ml-100k/ml-100k/u.data'  # change this

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)
data.split(n_folds=5)

Note

Actually, as the movielens-100k dataset is built-in, Surprise already provides a suitable reader, so in this case we could have just created the reader like this:

reader = Reader('ml-100k')

For more details about readers and how to use them, see the Reader class documentation.
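To illustrate what a line format such as 'user item rating timestamp' with a tab separator describes, here is a small pure-Python sketch of parsing one line of a ratings file. The function parse_line is hypothetical and is not how the Reader class is implemented; the sample line simply follows the format described above.

```python
def parse_line(line, line_format='user item rating timestamp', sep='\t'):
    """Split one line of a ratings file according to line_format (sketch)."""
    fields = line_format.split()
    values = line.strip().split(sep)
    if len(values) != len(fields):
        raise ValueError('expected %d fields, got %d'
                         % (len(fields), len(values)))
    record = dict(zip(fields, values))
    # Raw user and item ids stay strings; only the rating becomes a number.
    return (record['user'], record['item'],
            float(record['rating']), record.get('timestamp'))


# A sample line in the 'user item rating timestamp' format.
parsed = parse_line('196\t242\t3\t881250949\n')
print(parsed)
```

Note that the ids are kept as strings here, which is consistent with raw ids read from a file being strings.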

Load a dataset with predefined folds

From file examples/load_custom_dataset_predefined_folds.py

import os

from surprise import Dataset
from surprise import Reader

# path to dataset folder
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

# This time, we'll use the built-in reader.
reader = Reader('ml-100k')

# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

data = Dataset.load_from_folds(folds_files, reader=reader)

Of course, nothing prevents you from only loading a single file for training and a single file for testing. However, the folds_files parameter still needs to be a list.
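For the single train/test case, a minimal sketch of what folds_files could look like is given below. It reuses the u1.base and u1.test file names from the example above and only builds the list; the directory is assumed to be the default download location.

```python
import os

# Assumed location of the dataset files; change this to suit your setup.
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

# A single (train, test) pair, still wrapped in a list.
folds_files = [(os.path.join(files_dir, 'u1.base'),
                os.path.join(files_dir, 'u1.test'))]

print(folds_files)
```

The resulting list can then be passed to Dataset.load_from_folds() together with a reader, exactly as in the five-fold example.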

Advanced usage

We will now go a little deeper into what Surprise can do for you.

Manually iterate over folds

So far we have used the evaluate() function, which does all the hard work for us. If you want better control over your experiments, you can use the folds() generator of your dataset, and then the train() and test() methods of your algorithm on each of the folds:

From file examples/iterate_over_folds.py

from surprise import BaselineOnly
from surprise import Dataset
from surprise import accuracy

data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)

algo = BaselineOnly()

for trainset, testset in data.folds():

    # train and test algorithm.
    algo.train(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    rmse = accuracy.rmse(predictions, verbose=True)
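For reference, the two accuracy measures used in this guide can be sketched in a few lines of plain Python. This is only an illustration of the standard definitions of RMSE and MAE, not the code of the accuracy module; the functions and the (true rating, estimated rating) pairs below are made up for the example.

```python
import math


def rmse(pairs):
    """Root Mean Squared Error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean Absolute Error over (true_rating, estimate) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]

print('RMSE: %.4f' % rmse(pairs))  # RMSE: 0.7500
print('MAE:  %.4f' % mae(pairs))   # MAE:  0.6250
```

RMSE penalizes large errors more heavily than MAE, which is why the two values differ even on the same predictions.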

Train on a whole trainset and specifically query for predictions

We will now see how to get a prediction for specified users and items. Along the way, we will also see how to train on a whole dataset, without performing cross-validation (i.e. there is no test set).

The latter is pretty straightforward: all you need to do is load a dataset, call the build_full_trainset() method to build the trainset, and train your algorithm:

From file examples/query_for_predictions.py

from surprise import KNNBasic
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.train(trainset)

Now, there’s no way we could call the test() method, because we have no testset. But you can still get predictions for the users and items you want.

Let’s say you’re interested in user 196 and item 302 (make sure they’re in the trainset!), and you know that the true rating is \(r_{ui} = 4\). All you need to do is call the predict() method:

From file examples/query_for_predictions.py

uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r=4, verbose=True)

If the predict() method is called with user or item ids that were not part of the trainset, it’s up to the algorithm to decide whether it can still make a prediction. If it can’t, predict() will fall back to predicting the mean of all ratings \(\mu\).
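The fallback behaviour can be illustrated with a toy predictor. The sketch below is an assumption-laden stand-in, not one of Surprise's algorithms: the hypothetical predict_user_mean function estimates a rating as the user's mean rating, and returns the global mean \(\mu\) when the user is unknown.

```python
# Hypothetical (user, item, rating) triples standing in for a trainset.
ratings = [('196', '302', 4.0), ('196', '242', 3.0), ('186', '302', 3.0)]

# Mean of all ratings, used as the fallback estimate.
global_mean = sum(r for _, _, r in ratings) / len(ratings)


def predict_user_mean(uid, ratings, global_mean):
    """Return (estimate, used_fallback) for user uid (toy predictor)."""
    user_ratings = [r for u, _, r in ratings if u == uid]
    if not user_ratings:
        # Unknown user: no prediction is possible, fall back to the mean.
        return global_mean, True
    return sum(user_ratings) / len(user_ratings), False


print(predict_user_mean('196', ratings, global_mean))  # known user
print(predict_user_mean('999', ratings, global_mean))  # unknown user
```

Returning a flag alongside the estimate makes it easy to tell real predictions apart from fallback values when analysing results.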

Note

Raw ids are ids as defined in a rating file. They can be strings or whatever. On trainset creation, each raw id is mapped to a (unique) integer called inner id, which is a lot more suitable for Surprise to manipulate. To convert a raw id to an inner id, you can use the to_inner_uid() and to_inner_iid() methods of the trainset.
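The mapping from raw ids to inner ids can be pictured as a simple dictionary. The sketch below is illustrative only: build_inner_ids is a hypothetical helper, and assigning inner ids in order of first appearance is an assumption made for the example, not a documented guarantee.

```python
def build_inner_ids(raw_ids):
    """Map each distinct raw id to a unique integer (illustrative helper)."""
    raw_to_inner = {}
    for raw_id in raw_ids:
        if raw_id not in raw_to_inner:
            # Next free integer, in order of first appearance (assumption).
            raw_to_inner[raw_id] = len(raw_to_inner)
    return raw_to_inner


# Raw user ids as they might appear in a ratings file (strings).
raw_user_ids = ['196', '186', '22', '196', '244']
raw_to_inner_uid = build_inner_ids(raw_user_ids)

print(raw_to_inner_uid['196'])  # 0
print(raw_to_inner_uid['244'])  # 3
```

Contiguous integer ids are convenient because they can be used directly as indices into arrays of per-user or per-item parameters.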

Obviously, it is perfectly fine to use the predict() method directly during a cross-validation process. It’s then up to you to ensure that the user and item ids are present in the trainset though.

Dump the predictions for later analysis

You may want to save your algorithm's predictions along with all the useful information about the algorithm. This way, you can run your algorithm once, save the results, and go back to them whenever you want to inspect each prediction in greater detail and get good insight into why your algorithm performs well (or badly!). Surprise provides some tools to do that.

You can dump your algorithm's predictions either with the evaluate() function, or manually with the dump function. Either way, an example is worth a thousand words, so here are a few Jupyter notebooks:

Dumping and analysis of the KNNBasic algorithm.

Comparison of two algorithms.