Implement Hardware-Efficient Jacobi SVD for Two-by-Two Matrices

Open Live Script

This example shows how to implement a hardware-efficient singular value decomposition (SVD) for real and complex two-by-two matrices, optimized for HDL applications.

The Square Jacobi SVD HDL Optimized block uses the two-sided Jacobi algorithm to compute the SVD for square matrices. This algorithm has inherent parallelism and performs well for FPGA and ASIC applications. The algorithm is iterative for a given n-by-n matrix. For the case of a real two-by-two matrix, this algorithm has a closed-form solution that can be implemented without iteration [1].

Real Two-by-Two Jacobi SVD Algorithm

If the input $A$ is a real two-by-two matrix

$A = (\begin{array}{c} A (1, 1) & A (1, 2) \\ A (2, 1) & A (2, 2) \end{array})$ ,

then the SVD of this matrix can be described as

$U^{'} A V = S = (\begin{array}{c} s_{1} & 0 \\ 0 & s_{2} \end{array})$ where $U = (\begin{array}{c} c o s (θ_{L}) & s i n (θ_{L}) \\ - s i n (θ_{L}) & c o s (θ_{L}) \end{array})$ and $V = (\begin{array}{c} c o s (θ_{R}) & s i n (θ_{R}) \\ - s i n (θ_{R}) & c o s (θ_{R}) \end{array})$ .

The rotation angle $θ_{L}$ and $θ_{R}$ can be computed as follows:

$θ_{S U M} = θ_{L} + θ_{R} = t a n^{- 1} (\frac{A (2, 1) + A (1, 2)}{A (2, 2) - A (1, 1)})$

$θ_{D I F F} = - θ_{L} + θ_{R} = t a n^{- 1} (\frac{A (2, 1) - A (1, 2)}{A (2, 2) + A (1, 1)})$

$θ_{L} = (\frac{θ_{S U M} - θ_{D I F F}}{2})$

$θ_{R} = (\frac{θ_{S U M} + θ_{D I F F}}{2})$

Note that both $t a n^{- 1}$ and the matrix rotation $U^{'} A V$ can be efficiently computed by the CORDIC algorithm on hardware.

Complex Two-by-Two Jacobi SVD Algorithm

If the input matrix $A$ is a complex two-by-two matrix, then you can perform bidiagonalization to transform it into a real matrix first by following these CORDIC rotation steps [2]. In these equations, $c$ represents complex numbers, and $r$ represents real numbers such that the initial value of the $A$ , $U$ , and $V$ matrices can be represented as

$A_{0} = (\begin{array}{c} c & c \\ c & c \end{array})$ , $U_{0} = (\begin{array}{c} 1 & 0 \\ 0 & 1 \end{array})$ , and $V_{0} = (\begin{array}{c} 1 & 0 \\ 0 & 1 \end{array})$ .

1. Rotate $A (1, 1)$ to real and apply the rotation to row 1 and $U_{0}$ .

$A_{0}$ is transformed into $A_{1} = J_{1}^{'} A_{0} = (\begin{array}{c} r & c \\ c & c \end{array})$ , where $J_{1} = (\begin{array}{c} e^{i θ_{1}} & 0 \\ 0 & 1 \end{array})$ , $θ_{1} = t a n^{- 1} (\frac{ℑ (A_{0} (1, 1)}{ℜ (A_{0} (1, 1)})$ .

$U_{1} = U_{0} J_{1}$ .

2. Rotate $A (2, 1)$ to real and apply the rotation to row 2 and $U_{1}$ such that

$A_{2} = J_{2}^{'} A_{1} = (\begin{array}{c} r & c \\ r & c \end{array})$ , where $J_{2} = (\begin{array}{c} 1 & 0 \\ 0 & e^{i θ_{2}} \end{array})$ and $θ_{2} = t a n^{- 1} (\frac{ℑ (A_{1} (2, 1)}{ℜ (A_{1} (2, 1)})$ .

$U_{2} = U_{1} J_{2}$ .

3. Rotate rows 1 and 2 together to rotate $A (2, 1)$ to 0. Apply the same rotation to $U_{2}$ .

$A_{2} = J_{3}^{'} A_{2} = (\begin{array}{c} r & c \\ 0 & c \end{array})$ , where $J_{3} = (\begin{array}{c} c o s (θ_{3}) & - s i n (θ_{3}) \\ s i n (θ_{3}) & c o s (θ_{3}) \end{array})$ and $θ_{3} = t a n^{- 1} (\frac{A_{2} (2, 1)}{A_{2} (1, 1)})$ .

$U_{3} = U_{2} J_{3}$ .

4. Rotate $A (1, 2)$ to real and apply the rotation to column 2 and $V_{0}$ .

$A_{4} = A_{3} K^{'} = (\begin{array}{c} r & r \\ 0 & c \end{array})$ , where $K = (\begin{array}{c} 1 & 0 \\ 0 & e^{i ψ} \end{array})$ and $ψ = t a n^{- 1} (\frac{ℑ (A_{3} (1, 2)}{ℜ (A_{3} (1, 2)})$ .

$V_{1} = V_{0} K^{'}$ .

5. Rotate $A (2, 2)$ to real and apply the rotation to $U_{3}$ .

$A_{5} = J_{4}^{'} A_{4} = (\begin{array}{c} r & r \\ 0 & r \end{array})$ , where $J_{4} = (\begin{array}{c} 1 & 0 \\ 0 & e^{i θ_{4}} \end{array})$ and $θ_{4} = t a n^{- 1} (\frac{ℑ (A_{4} (2, 2)}{ℜ (A_{4} (2, 2)})$ .

$U_{4} = U_{3} J_{4}$ .

$A_{5}$ is now a real matrix. Continue with the real two-by-two SVD such that

$S = (\begin{array}{c} c o s (θ_{L}) & - s i n (θ_{L}) \\ s i n (θ_{L}) & c o s (θ_{L}) \end{array}) A_{5} (\begin{array}{c} c o s (θ_{R}) & s i n (θ_{R}) \\ - s i n (θ_{R}) & c o s (θ_{R}) \end{array})$ ,

$U = U_{4} (\begin{array}{c} c o s (θ_{L}) & s i n (θ_{L}) \\ - s i n (θ_{L}) & c o s (θ_{L}) \end{array})$ , and

$V = V_{1} (\begin{array}{c} c o s (θ_{R}) & s i n (θ_{R}) \\ - s i n (θ_{R}) & c o s (θ_{R}) \end{array})$ .

Implement Two-by-Two Jacobi SVD in MATLAB

The twoByTwoJacobiSVD function implements the algorithm described in the previous section for both two-by-two real and complex inputs. The twoByTwoJacobiSVD function is a special case of the fixed.jacobiSVD function.

Specify Simulation Parameters

Specify the dimension of the inputs, the number of input sample matrices, the input complexity, and word length.

n = 2;
numSamples = 3;
complexity = "complex";
wordLength = 16;

Generate Input `A` Matrices

Use the specified simulation parameters to generate the input matrix A.

rng('default');
switch complexity
    case 'real'
        A = randn(n,n,numSamples);        
    case 'complex'
        A = complex(randn(n,n,numSamples),randn(n,n,numSamples));
    otherwise
        A = randn(n,n,numSamples);
end

Compute Fixed-Point Data Types

Use the upper bound on the singular values to define fixed-point types that will never overflow. First, use the fixed.singularValueUpperBound function to determine the upper bound on the singular values.

svdUpperBound = fixed.singularValueUpperBound(n,n,max(abs(A(:))));

Define the integer length based on the value of the upper bound, with one additional bit for the sign, one additional bit for intermediate CORDIC growth, and one more bit for intermediate growth to compute the Jacobi rotations.

additionalBitGrowth = 3;
integerLength = nextpow2(svdUpperBound) + additionalBitGrowth;

Compute the fraction length based on the integer length and the desired word length.

fractionLength = wordLength - integerLength;

Construct a prototype numeric type.

T = fi([],1,wordLength,fractionLength)

T = 

[]

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 16
        FractionLength: 10

Cast the matrix A to the signed fixed-point type.

A = cast(A,'like',T);

Compute SVD for Input A

Compute the singular value decomposition of A.

Tsvd = fixed.internal.svd.usv.cordicBidiagonalizationTypes(A);
U_fixed = removefimath(zeros(n,n,numSamples,'like',Tsvd.UV));
S_fixed = removefimath(zeros(n,1,numSamples,'like',Tsvd.B));
V_fixed = removefimath(zeros(n,n,numSamples,'like',Tsvd.UV));

for i = 1:numSamples
    [U_fixed(:,:,i),STmp,V_fixed(:,:,i)] = twoByTwoJacobiSVD(A(:,:,i));
    S_fixed(:,:,i) = diag(STmp);
end

Verify Output Solutions

Verify the output solutions. In these steps, identical means within roundoff error.

Verify that U*diag(s)*V' is identical to A. relativeErrorUSV represents the relative error between U*diag(s)*V' and A.
Verify that the singular values s are identical to the floating-point SVD solution. relativeErrorS represents the relative error between s and the singular values calculated by the MATLAB® svd function.
Verify that U and V are unitary matrices. relativeErrorUU represents the relative error between U'*U and the identity matrix. relativeErrorVV represents the relative error between V'*V and the identity matrix.

for i = 1:numSamples
    disp(['Sample #',num2str(i),':']);
    a = double(A(:,:,i));
    U = double(U_fixed(:,:,i));
    V = double(V_fixed(:,:,i));
    s = double(diag(S_fixed(:,:,i)));

    % Verify U*diag(s)*V'
    if norm(double(a)) > 1
        relativeErrorUSV = norm(U*s*V'-a)/norm(a);
    else
        relativeErrorUSV = norm(U*s*V'-a); 
    end    
    relativeErrorUSV

    % Verify s
    s_expected = svd(a);
    normS = norm(s_expected);
    relativeErrorS = norm(diag(s) - s_expected);
    if normS > 1
        relativeErrorS = relativeErrorS/normS;
    end
    relativeErrorS

    % Verify U'*U
    UU = U'*U;
    relativeErrorUU = norm(UU - eye(size(UU)))  

    % Verify V'*V
    VV = V'*V;
    relativeErrorVV = norm(VV - eye(size(VV)))
    disp('----------------');
end

Sample #1:

relativeErrorUSV = 
0.0020

relativeErrorS = 
0.0015

relativeErrorUU = 
6.7440e-04

relativeErrorVV = 
1.7814e-04

----------------

Sample #2:

relativeErrorUSV = 
0.0032

relativeErrorS = 
0.0014

relativeErrorUU = 
4.0594e-04

relativeErrorVV = 
1.3325e-04

----------------

Sample #3:

relativeErrorUSV = 
7.5346e-04

relativeErrorS = 
6.7534e-04

relativeErrorUU = 
4.3971e-04

relativeErrorVV = 
1.0590e-04

----------------

Implement HDL Optimized Two-by-Two Jacobi SVD in Simulink Blocks

The RealTwoByTwoSVDModel and ComplexTwoByTwoSVDModel models contain blocks that implement the real and complex two-by-two SVD algorithms, respectively, optimized for HDL applications. These blocks use the AMBA AXI handshake protocol. The valid/ready handshake process is used to transfer data and control information. This two-way control mechanism allows both the manager and subordinate to control the rate at which information moves between manager and subordinate. A valid signal indicates when data is available. The ready signal indicates that the block can accept the data. Transfer of data occurs only when both the valid and ready signals are high.

Open Model

Open the model corresponding to the specified complexity value.

switch complexity
    case 'real'
        model = 'RealTwoByTwoSVDModel';
    case 'complex'
        model = 'ComplexTwoByTwoSVDModel';
    otherwise
        model = 'RealTwoByTwoSVDModel';
end

open_system(model);

Configure Model Workspace and Run Simulation

Use the fixed.example.setModelWorkspace helper function to configure the model workspace. Run the simulation.

fixed.example.setModelWorkspace(model,'A',A,'numSamples',numSamples);
out = sim(model);

Verify Block Output

Verify that the output of the block is bit-exact with the output of the MATLAB implementation.

UBitExact = ispropequal(out.U,U_fixed)

UBitExact = logical
   1

sBitExact = ispropequal(out.s,S_fixed)

sBitExact = logical
   1

VBitExact = ispropequal(out.V,V_fixed)

VBitExact = logical
   1

Benchmark Latency

This block uses pipelined architecture.

For real 16-bit signed input, the input and output rate is 19 clocks per matrix, and the latency is 44 clocks.

For complex 16-bit signed input, the input and output rate is 50 clocks per matrix, and the latency is 93 clocks.

In this example plot:

Cursor 1 points to the input time of first matrix.
Cursor 2 points to the output time of the SVD corresponding to the first matrix.
Cursor 3 points to the output time of the SVD corresponding to the second matrix.
After initialization of 93 clocks, this block can provide SVD at every 50 clocks.

Measure Simulated Latency

Measure the simulated latency.

InputHandshakeHistory = out.logsout.get('Input handshake').Values.Data;
OutputHandshakeHistory = out.logsout.get('Output handshake').Values.Data;
tDataIn = find(InputHandshakeHistory == 1);
tDataOut = find(OutputHandshakeHistory == 1);
simulatedInputRate = diff(tDataIn)

simulatedInputRate = 3×1

    50
    50
    50

simulatedOutputRate = diff(tDataIn)

simulatedOutputRate = 3×1

    50
    50
    50

simulatedLatency = tDataOut - tDataIn(1:numSamples)

simulatedLatency = 3×1

    93
    93
    93

Hardware Resource Utilization

Both blocks in this example support HDL code generation using the Simulink® HDL Workflow Advisor. For an example, see HDL Code Generation and FPGA Synthesis from Simulink Model (HDL Coder) and Implement Digital Downconverter for FPGA (DSP HDL Toolbox).

This example data was generated by synthesizing the block on a AMD® Zynq®-UltraScale+ RFSoC ZCU111. The synthesis tool was Vivado® v2024.1.

These parameters were used for synthesis:

Input data type: sfix16_En10
Target frequency: 350 MHz

These tables show the synthesis resource utilization results and timing summary for the Real Two-by-Two Jacobi SVD HDL Optimized block.

Resource	Usage	Available	Utilization (%)
Slice LUTs	1416	425280	0.33
Slice Registers	1107	850560	0.13
DSPs	20	4272	0.51
Block RAM Tile	0	0	0.00
URAM	0	80	0.00

	Value
Requirement	2.5 ns (350 MHz)
Data Path Delay	2.355 ns
Slack	0.484 ns
Clock Frequency	421.38 MHz

These tables show the synthesis resource utilization results and timing summary for the Complex Two-by-Two Jacobi SVD HDL Optimized block.

Resource	Usage	Available	Utilization (%)
Slice LUTs	6717	425280	1.58
Slice Registers	3475	850560	0.41
DSPs	32	4272	0.75
Block RAM Tile	0	0	0.00
URAM	0	80	0.00

	Value
Requirement	2.8571 ns (350 MHz)
Data Path Delay	2.683 ns
Slack	0.156 ns
Clock Frequency	370.21 MHz

References

[1] Cavallaro, Joseph R., and Franklin T. Luk, "CORDIC Arithmetic for an SVD Processor." Journal of Parallel and Distributed Computing 5, no. 3 (June 1988): 271-90. https://doi.org/10.1016/0743-7315(88)90021-4.

[2] Hager, William W. "Bidiagonalization and Diagonalization." Computers & Mathematics with Applications 14, no. 7 (1987): 561–72. https://doi.org/10.1016/0898-1221(87)90051-4.

Implement Hardware-Efficient Jacobi SVD for Two-by-Two Matrices

Real Two-by-Two Jacobi SVD Algorithm

Complex Two-by-Two Jacobi SVD Algorithm

Implement Two-by-Two Jacobi SVD in MATLAB

Specify Simulation Parameters

Generate Input `A` Matrices

Compute Fixed-Point Data Types

Compute SVD for Input A

Verify Output Solutions

Implement HDL Optimized Two-by-Two Jacobi SVD in Simulink Blocks

Open Model

Configure Model Workspace and Run Simulation

Verify Block Output

Benchmark Latency

Measure Simulated Latency

Hardware Resource Utilization

References

See Also

Topics

Implement Hardware-Efficient Jacobi SVD for Two-by-Two Matrices

Real Two-by-Two Jacobi SVD Algorithm

Complex Two-by-Two Jacobi SVD Algorithm

Implement Two-by-Two Jacobi SVD in MATLAB

Specify Simulation Parameters

Generate Input A Matrices

Compute Fixed-Point Data Types

Compute SVD for Input A

Verify Output Solutions

Implement HDL Optimized Two-by-Two Jacobi SVD in Simulink Blocks

Open Model

Configure Model Workspace and Run Simulation

Verify Block Output

Benchmark Latency

Measure Simulated Latency

Hardware Resource Utilization

References

See Also

Topics

Generate Input `A` Matrices