andrejmuhic.github.io - Most independent subset selection

Subset selection using SVD

Do you want to find the most independent subset of columns of a matrix? This can be useful for example for determining independent subset of exogenous variables, or catching an error when adding a column that is a linear combination of remaining ones. The most general formulation of this would involve checking all size \(k\) subsets of columns. Luckily there is a simple heuristics, which works well for badly conditioned matrices, see Golub et al, Matrix Computations. Surprisingly enough the very good approximation can be obtained with following pretty straightforward procedure:

Compute svd decomposition, \(A= U S V^H\)
Determine rank estimate \(r \leq \texttt{rank}(A)\)
Set \(V_r = V[:, 0:r]^H\) and compute QR with column pivoting \(Q^H V_r P = \begin{bmatrix}R_{11} & R_{12} \end{bmatrix}\)
Set \(A P = \begin{bmatrix} B_1 & B_2 \end{bmatrix}\) and partition \(P = \begin{bmatrix} P_1 & P_2\end{bmatrix},\) where \(P_1\) represents “the most independent” columns in sense that this corresponds to heuristic that QR with pivoting produces well conditioned \(R_{11}.\)

Python implementation

from numpy import asarray, count_nonzero
from typing import Iterable
import numpy as np
import scipy

def determine_independent_subset(A: np.ndarray) -> Iterable:

    U, S, VH = np.linalg.svd(A, compute_uv=True)

    def compute_rank(S: np.ndarray, tol: float | None = None):
        # Provide tol if you have some prior on the problem
        # Singular values <= tol are considered 0
        if tol is None:
            tol = S.max(axis=-1, keepdims=True) * (
                    max(A.shape[-2:]) * np.finfo(S.dtype).eps)
        else:
            tol = asarray(tol)[..., np.newaxis]
        return count_nonzero(S > tol, axis=-1)

    r = compute_rank(S)
    Vr = VH[0:r, :]
    Q, R, P = scipy.linalg.qr(Vr, pivoting=True, mode='full')
    return P[0:r]

A = np.array([[1,2, 3], [1, 2, 3],[1, 2, 4]])
s = determine_independent_subset(A)
print(s)

[2 1]