similarities module

The similarities module includes tools to compute similarity metrics between users or items. You may need to refer to the Notation standards, References page. See also the Similarity measure configuration section of the User Guide.

Available similarity measures:

cosine: Compute the cosine similarity between all pairs of users (or items).
msd: Compute the Mean Squared Difference similarity between all pairs of users (or items).
pearson: Compute the Pearson correlation coefficient between all pairs of users (or items).
pearson_baseline: Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.
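
As a rough illustration of how one of these measures is selected, the fragment below builds a sim_options dictionary. It is a sketch based on the fields this page mentions (user_based, shrinkage) plus the name key, which is assumed here to hold one of the measure names listed above; the authoritative list of keys is in the Similarity measure configuration section of the User Guide.

```python
# Sketch of a similarity configuration (assumed key names, see lead-in).
sim_options = {
    "name": "pearson_baseline",  # one of: cosine, msd, pearson, pearson_baseline
    "user_based": False,         # False: similarities between items; True: between users
    "shrinkage": 0,              # only meaningful for pearson_baseline; 0 means no shrinkage
}

# The dictionary would then typically be handed to a neighborhood-based
# algorithm, e.g. KNNBasic(sim_options=sim_options) (not executed here).
print(sim_options["name"])
```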
surprise.similarities.cosine()

Compute the cosine similarity between all pairs of users (or items).

Only common users (or items) are taken into account. The cosine similarity is defined as:

\[\text{cosine_sim}(u, v) = \frac{ \sum\limits_{i \in I_{uv}} r_{ui} \cdot r_{vi}} {\sqrt{\sum\limits_{i \in I_{uv}} r_{ui}^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} r_{vi}^2} }\]

or

\[\text{cosine_sim}(i, j) = \frac{ \sum\limits_{u \in U_{ij}} r_{ui} \cdot r_{uj}} {\sqrt{\sum\limits_{u \in U_{ij}} r_{ui}^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} r_{uj}^2} }\]

depending on the user_based field of sim_options (see Similarity measure configuration).

For details on cosine similarity, see the Wikipedia article on cosine similarity.
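
The user-based formula above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the library's code: the dict-of-ratings layout, the function name cosine_sim and the choice to return 0 when there is no overlap or a zero norm are all assumptions made for the example.

```python
import math

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity restricted to items rated by both users.

    ratings_u, ratings_v: dicts mapping item id -> rating (hypothetical layout).
    """
    common = ratings_u.keys() & ratings_v.keys()  # the set I_uv
    if not common:
        return 0.0  # assumption: no common items gives zero similarity
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: guard against all-zero ratings
    return dot / (norm_u * norm_v)

u = {"a": 4, "b": 3, "c": 5}
v = {"a": 4, "b": 3, "d": 1}
print(cosine_sim(u, v))
```

In this example the two users agree exactly on the items they share ("a" and "b"), so the similarity is 1 even though their ratings for "c" and "d" differ: only common items enter the sums.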

surprise.similarities.msd()

Compute the Mean Squared Difference similarity between all pairs of users (or items).

Only common users (or items) are taken into account. The Mean Squared Difference is defined as:

\[\text{msd}(u, v) = \frac{1}{|I_{uv}|} \cdot \sum\limits_{i \in I_{uv}} (r_{ui} - r_{vi})^2\]

or

\[\text{msd}(i, j) = \frac{1}{|U_{ij}|} \cdot \sum\limits_{u \in U_{ij}} (r_{ui} - r_{uj})^2\]

depending on the user_based field of sim_options (see Similarity measure configuration).

The MSD-similarity is then defined as:

\[\begin{split}\text{msd_sim}(u, v) &= \frac{1}{\text{msd}(u, v) + 1}\\ \text{msd_sim}(i, j) &= \frac{1}{\text{msd}(i, j) + 1}\end{split}\]

The \(+ 1\) term is there only to avoid division by zero when the MSD is 0.

For details on MSD, see the third definition on Wikipedia.
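
A minimal sketch of the two-step computation (MSD first, then MSD-similarity) for the user-based case is given below. The function name msd_sim, the dict layout and the value returned when no items are shared are assumptions for illustration only.

```python
def msd_sim(ratings_u, ratings_v):
    """MSD-similarity over the items rated by both users.

    ratings_u, ratings_v: dicts mapping item id -> rating (hypothetical layout).
    """
    common = ratings_u.keys() & ratings_v.keys()  # the set I_uv
    if not common:
        return 0.0  # assumption: no common items gives zero similarity
    # Mean of the squared rating differences over the common items.
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # The +1 keeps the denominator positive when msd == 0.
    return 1.0 / (msd + 1.0)

u = {"a": 5, "b": 3}
v = {"a": 3, "b": 3, "c": 1}
print(msd_sim(u, v))  # msd = (4 + 0) / 2 = 2, so the similarity is 1 / 3
print(msd_sim(u, u))  # identical ratings: msd = 0, so the similarity is 1
```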

surprise.similarities.pearson()

Compute the Pearson correlation coefficient between all pairs of users (or items).

Only common users (or items) are taken into account. The Pearson correlation coefficient can be seen as a mean-centered cosine similarity, and is defined as:

\[\text{pearson_sim}(u, v) = \frac{ \sum\limits_{i \in I_{uv}} (r_{ui} - \mu_u) \cdot (r_{vi} - \mu_{v})} {\sqrt{\sum\limits_{i \in I_{uv}} (r_{ui} - \mu_u)^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} (r_{vi} - \mu_{v})^2} }\]

or

\[\text{pearson_sim}(i, j) = \frac{ \sum\limits_{u \in U_{ij}} (r_{ui} - \mu_i) \cdot (r_{uj} - \mu_{j})} {\sqrt{\sum\limits_{u \in U_{ij}} (r_{ui} - \mu_i)^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} (r_{uj} - \mu_{j})^2} }\]

depending on the user_based field of sim_options (see Similarity measure configuration).

Note: if there are no common users or items, similarity will be 0 (and not -1).

For details on the Pearson coefficient, see Wikipedia.
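
The user-based formula can be sketched as below. This is not the library's implementation: the function name pearson_sim and the dict layout are hypothetical, and the means \(\mu_u\) and \(\mu_v\) are assumed here to be taken over the common items only. Returning 0 for a zero denominator is likewise an assumption.

```python
import math

def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation over the items rated by both users.

    ratings_u, ratings_v: dicts mapping item id -> rating (hypothetical layout).
    """
    common = sorted(ratings_u.keys() & ratings_v.keys())  # the set I_uv
    if not common:
        return 0.0  # as noted above: 0 (and not -1) without common items
    # Assumption: means are computed over the common items only.
    mu_u = sum(ratings_u[i] for i in common) / len(common)
    mu_v = sum(ratings_v[i] for i in common) / len(common)
    dev_u = [ratings_u[i] - mu_u for i in common]
    dev_v = [ratings_v[i] - mu_v for i in common]
    num = sum(a * b for a, b in zip(dev_u, dev_v))
    den = math.sqrt(sum(a * a for a in dev_u)) * math.sqrt(sum(b * b for b in dev_v))
    if den == 0:
        return 0.0  # assumption: constant ratings give zero similarity
    return num / den

u = {"a": 1, "b": 2, "c": 3}
v = {"a": 2, "b": 4, "c": 6}  # same trend as u, different scale
w = {"a": 3, "b": 2, "c": 1}  # opposite trend
print(pearson_sim(u, v), pearson_sim(u, w))
```

Because the ratings are mean-centered, u and v come out as (almost exactly) perfectly correlated despite their different scales, while u and w come out as perfectly anti-correlated.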

surprise.similarities.pearson_baseline()

Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.

The shrinkage parameter helps to avoid overfitting when only a few ratings are available (see Similarity measure configuration).

The Pearson-baseline correlation coefficient is defined as:

\[\text{pearson_baseline_sim}(u, v) = \hat{\rho}_{uv} = \frac{ \sum\limits_{i \in I_{uv}} (r_{ui} - b_{ui}) \cdot (r_{vi} - b_{vi})} {\sqrt{\sum\limits_{i \in I_{uv}} (r_{ui} - b_{ui})^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} (r_{vi} - b_{vi})^2}}\]

or

\[\text{pearson_baseline_sim}(i, j) = \hat{\rho}_{ij} = \frac{ \sum\limits_{u \in U_{ij}} (r_{ui} - b_{ui}) \cdot (r_{uj} - b_{uj})} {\sqrt{\sum\limits_{u \in U_{ij}} (r_{ui} - b_{ui})^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} (r_{uj} - b_{uj})^2}}\]

The shrunk Pearson-baseline correlation coefficient is then defined as:

\[\begin{split}\text{pearson_baseline_shrunk_sim}(u, v) &= \frac{|I_{uv}| - 1} {|I_{uv}| - 1 + \text{shrinkage}} \cdot \hat{\rho}_{uv}\\ \text{pearson_baseline_shrunk_sim}(i, j) &= \frac{|U_{ij}| - 1} {|U_{ij}| - 1 + \text{shrinkage}} \cdot \hat{\rho}_{ij}\end{split}\]

Obviously, a shrinkage parameter of 0 amounts to no shrinkage at all.

Note: here again, if there are no common users/items, similarity will be 0 (and not -1).
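
The effect of the shrinkage factor can be seen with a small sketch. The helper below, shrunk_sim, is hypothetical and only applies the scaling from the formula above to an already computed coefficient \(\hat{\rho}\); the handling of a zero denominator (one common rating with no shrinkage) is an assumption.

```python
def shrunk_sim(rho, n_common, shrinkage):
    """Scale a Pearson-baseline coefficient by (n - 1) / (n - 1 + shrinkage).

    rho: unshrunk coefficient; n_common: |I_uv| or |U_ij| (hypothetical helper).
    """
    if n_common == 0:
        return 0.0  # as noted above: 0 without common users/items
    denom = n_common - 1 + shrinkage
    if denom == 0:
        return 0.0  # assumption: single common rating with shrinkage 0
    return (n_common - 1) / denom * rho

# With 11 common ratings, a shrinkage of 10 halves the coefficient ...
print(shrunk_sim(0.8, n_common=11, shrinkage=10))
# ... while a shrinkage of 0 leaves it unchanged.
print(shrunk_sim(0.8, n_common=11, shrinkage=0))
# With many common ratings the factor approaches 1.
print(shrunk_sim(0.8, n_common=1001, shrinkage=10))
```

The fewer ratings two users (or items) share, the more strongly the coefficient is pulled toward 0, which is how the shrinkage limits the influence of similarities estimated from little data.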

Motivations for such a similarity measure can be found in the Recommender Systems Handbook, section 5.4.1.