Most independent subset selection

Linear Algebra
Numerical Linear Algebra
Author

Andrej Muhic

Published

January 19, 2025

Subset selection using SVD

Do you want to find the most independent subset of columns of a matrix? This can be useful for example for determining independent subset of exogenous variables, or catching an error when adding a column that is a linear combination of remaining ones. The most general formulation of this would involve checking all size \(k\) subsets of columns. Luckily there is a simple heuristics, which works well for badly conditioned matrices, see Golub et al, Matrix Computations. Surprisingly enough the very good approximation can be obtained with following pretty straightforward procedure:

  1. Compute svd decomposition, \(A= U S V^H\)
  2. Determine rank estimate \(r \leq \texttt{rank}(A)\)
  3. Set \(V_r = V[:, 0:r]^H\) and compute QR with column pivoting \(Q^H V_r P = \begin{bmatrix}R_{11} & R_{12} \end{bmatrix}\)
  4. Set \(A P = \begin{bmatrix} B_1 & B_2 \end{bmatrix}\) and partition \(P = \begin{bmatrix} P_1 & P_2\end{bmatrix},\) where \(P_1\) represents “the most independent” columns in sense that this corresponds to heuristic that QR with pivoting produces well conditioned \(R_{11}.\)

Python implementation

from numpy import asarray, count_nonzero
from typing import Iterable
import numpy as np
import scipy

def determine_independent_subset(A: np.ndarray) -> Iterable:

    U, S, VH = np.linalg.svd(A, compute_uv=True)

    def compute_rank(S: np.ndarray, tol: float | None = None):
        # Provide tol if you have some prior on the problem
        # Singular values <= tol are considered 0
        if tol is None:
            tol = S.max(axis=-1, keepdims=True) * (
                    max(A.shape[-2:]) * np.finfo(S.dtype).eps)
        else:
            tol = asarray(tol)[..., np.newaxis]
        return count_nonzero(S > tol, axis=-1)

    r = compute_rank(S)
    Vr = VH[0:r, :]
    Q, R, P = scipy.linalg.qr(Vr, pivoting=True, mode='full')
    return P[0:r]
A = np.array([[1,2, 3], [1, 2, 3],[1, 2, 4]])
s = determine_independent_subset(A)
print(s)
[2 1]