Here we briefly describe the problem being solved. Suppose \(X\) is a large \(m\times n\) matrix with many missing entries. Let \(\Omega\) denote the set of index pairs \((i,j)\) where \(X\) is observed, and let \(P_\Omega(X)\) denote the matrix with the entries in \(\Omega\) left alone and all other entries set to zero. So when \(X\) has missing entries in \(\Omega^\perp\), \(P_\Omega(X)\) sets those missing values to zero.
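As a small illustration of \(P_\Omega\), here is a minimal sketch in base R, with missing entries coded as NAs as elsewhere in this package (the toy matrix and object names are ours):

```r
# P_Omega(X): keep the observed entries, set the missing ones to zero
X <- matrix(rnorm(12), 4, 3)
X[c(2, 5, 9)] <- NA        # a few missing entries (the Omega-complement)
POmegaX <- X
POmegaX[is.na(POmegaX)] <- 0
```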
Consider the criterion \[\min_M\frac12\|P_\Omega(X)-P_\Omega(M)\|^2_F+\lambda\|M\|_*,\] where \(\|M\|_*\) is the nuclear norm of \(M\) (sum of singular values).
If \(\widehat M\) solves this convex-cone problem, then it satisfies the following stationarity condition: \[ {\widehat M}=S_\lambda(Z)\] where \[Z=P_\Omega(X)+P_{\Omega^\perp}(\widehat M).\] Hence \(Z\) is the “filled in” version of \(X\). The operator \(S_\lambda(Z)\) applied to a matrix \(Z\) does the following (a small code sketch follows this list):
- Compute the SVD of \(Z=UDV^T\), and let \(d_i\) be the singular values in \(D\).
- Soft-threshold the singular values: \(d_i^*= (d_i-\lambda)_+\).
- Reconstruct: \(S_\lambda(Z)=UD^*V^T\), where \(D^*\) is the diagonal matrix with the \(d_i^*\) on its diagonal. We call this operation the “soft-thresholded SVD”. Notice that for sufficiently large \(\lambda\), \(D^*\) will be rank-reduced, and hence so will be \(UD^*V^T\).
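A minimal sketch of \(S_\lambda\) in base R (the function name soft_svd is ours; the package itself uses much more efficient low-rank methods):

```r
# Soft-thresholded SVD: compute the SVD, shrink the singular values, reconstruct
soft_svd <- function(Z, lambda) {
  s <- svd(Z)                      # Z = U D V^T
  dstar <- pmax(s$d - lambda, 0)   # d_i^* = (d_i - lambda)_+
  s$u %*% (dstar * t(s$v))         # U D^* V^T
}
```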
This suggests the obvious iterative algorithm: using the current estimate for \(M\), create \(Z\), and update \(M\) by the soft-thresholded SVD of \(Z\).
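As a minimal sketch of this iteration, assuming soft_svd() from above, missing values coded as NAs, and a hypothetical helper name:

```r
# Naive soft-impute iteration: fill in Z, soft-threshold its SVD, repeat
soft_impute_naive <- function(X, lambda, maxit = 100, tol = 1e-5) {
  miss <- is.na(X)
  M <- matrix(0, nrow(X), ncol(X))       # start from the zero matrix
  for (it in seq_len(maxit)) {
    Z <- X
    Z[miss] <- M[miss]                   # Z = P_Omega(X) + P_Omega-perp(M)
    Mnew <- soft_svd(Z, lambda)          # soft-thresholded SVD of Z
    converged <- sum((Mnew - M)^2) < tol^2 * (1 + sum(M^2))
    M <- Mnew
    if (converged) break
  }
  M
}
```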
This is exactly what softImpute does on (small) matrices with missing values stored as NAs. By small we mean small enough that the SVD can be computed by R in a small amount of time.
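For example, something along these lines (a hedged sketch: the exact argument names, such as rank.max and type, are from our recollection of the package interface and may differ):

```r
library(softImpute)
set.seed(1)
X <- matrix(rnorm(30), 6, 5)
X[sample(30, 10)] <- NA                        # a small matrix with missing values
fit <- softImpute(X, rank.max = 3, lambda = 1, type = "svd")
Xhat <- complete(X, fit)                       # X with the missing entries filled in
```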
This is not tenable for very large matrices, like those stored as class "Incomplete". Here we make two very important changes to the recipe:
- Re-express \(Z\) at each iteration as \[Z=P_\Omega(X)-P_\Omega(\widehat M) + \widehat M.\] This is of the form "SparseplusLowRank" (assuming \(\widehat M\) is low rank), and hence can be stored efficiently. Left and right matrix multiplies by skinny matrices can also be performed efficiently; see the sketch after this list.
- Anticipating that the solution \(\widehat M\) will have low rank, compute only a low-rank SVD of \(Z\), using alternating subspace methods.
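A minimal sketch of the sparse-plus-low-rank idea, using the Matrix package and plain factor matrices rather than the package's own "SparseplusLowRank" class (the names S, U, V are ours):

```r
library(Matrix)
m <- 1000; n <- 500; r <- 5
S <- rsparsematrix(m, n, density = 0.01)   # sparse part: P_Omega(X) - P_Omega(M)
U <- matrix(rnorm(m * r), m, r)            # skinny factors of the low-rank part,
V <- matrix(rnorm(n * r), n, r)            # so the low-rank part is U V^T
# Right-multiply Z = S + U V^T by a vector without ever forming Z densely:
w <- rnorm(n)
Zw <- as.vector(S %*% w) + U %*% (t(V) %*% w)   # cost O(|Omega| + (m + n) r)
```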
Indeed, softImpute has a rank argument that allows one to limit the rank of the solution; if the algorithm returns a solution with rank lower than the specified rank \(r\), you know it has solved the unrestricted problem.
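A hedged sketch of this on an "Incomplete" matrix (again, the exact argument names such as rank.max and type may differ from what is shown here):

```r
library(softImpute)
set.seed(2)
X <- matrix(rnorm(200 * 50), 200, 50)
X[sample(length(X), 5000)] <- NA          # half of the entries missing
Xinc <- as(X, "Incomplete")               # sparse storage of the observed entries only
fit <- softImpute(Xinc, rank.max = 10, lambda = 5, type = "als")
sum(fit$d > 1e-6)                         # if below 10, the unrestricted problem was solved
```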
Consider the alternative criterion \[\min_{A,B}\frac12\|P_\Omega(X)-P_\Omega(AB^T)\|^2_F+\frac{\lambda}2(\|A\|_F^2 +\|B\|_F^2),\] where \(A\) and \(B\) each have \(r\) columns, and let us suppose that \(r\) is greater than or equal to the rank of the solution to the earlier criterion. This problem is not convex, but remarkably, it has a solution that satisfies \({\widehat A}{\widehat B}^T=\widehat M\)!
We can once again characterize the stationarity conditions, now in terms of \(\widehat A\) and \(\widehat B\). Let \[Z=P_\Omega(X)+P_{\Omega^\perp}({\widehat A}{\widehat B}^T),\] the filled-in version of \(X\). Then \[{\widehat B}^T= ({\widehat A}^T{\widehat A}+\lambda I)^{-1}{\widehat A}^T Z.\] In other words, we get \(\widehat B\) by ridge regressions of the columns of \(Z\) on \(\widehat A\). For \(\widehat A\) it is the same, with the roles of \(A\) and \(B\) reversed. This again suggests an obvious alternation, now via ridge regressions. After each regression, we update the component \(A\) or \(B\), and the filled-in \(Z\). If \(r\) is sufficiently large, this again solves the same problem as before.
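A minimal dense-matrix sketch of this alternation (the helper name is ours, and missing values are again coded as NAs; the package's implementation instead exploits the sparse-plus-low-rank structure):

```r
softimpute_als_naive <- function(X, r, lambda, maxit = 100) {
  m <- nrow(X); n <- ncol(X); miss <- is.na(X)
  A <- matrix(rnorm(m * r), m, r)            # random start for A
  B <- matrix(0, n, r)
  for (it in seq_len(maxit)) {
    Z <- X
    Z[miss] <- (A %*% t(B))[miss]            # fill in with the current A B^T
    B <- t(solve(t(A) %*% A + lambda * diag(r), t(A) %*% Z))     # ridge update for B
    Z[miss] <- (A %*% t(B))[miss]            # refill with the updated B
    A <- t(solve(t(B) %*% B + lambda * diag(r), t(B) %*% t(Z)))  # ridge update for A
  }
  list(A = A, B = B)
}
```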
This last algorithm (softImpute ALS) can be seen as interweaving the alternating-subspace method for computing a low-rank SVD with the iterative filling in of the missing entries. It turns out that this interweaving leads to computational savings, and allows for a very efficient distributed implementation (not covered here).