Neural Network Training Implementation
Hi, I have been trying to implement my own version of gradient-descent training. The cost function I minimize is the negative log likelihood. My datasets vary between ~1000 and ~5000 samples (for both training and unseen test sets), and I have previously trained on them with NNToolbox. My implementation of the neural network does perform well: I have been able to attain accuracy close to 99%. However, when I compare my backpropagated partial derivatives against gradients from numerical (finite-difference) gradient checking, the difference is too large not to be suspicious of my implementation. I believe the problem lies in how I update the parameters. I have tried updating the weights after scanning each individual sample (on-line), in mini-batches, and over the whole batch. Also, I believe the final parameter values are too large.
Below is a piece of code that performs backprop and the parameter updates for one epoch, updating after scanning each individual example.
if true
    lambda = 0;  % Regularisation parameter
    numbatches = 1;
    multiPlier = (1 - (learnRate * lambda / size(ipFeatures, 2)));
    for l = 1 : numbatches
        % Index of the current batch; would be 1 for this case
        currentBatch = trainP(:, batchInd((l-1)*batchSize+1 : l*batchSize));
        batchTargets = trainT(:, batchInd((l-1)*batchSize+1 : l*batchSize));
        activations  = forwardPropagation(currentBatch, model);
        deltaErrors  = computeDeltaError(activations, batchTargets, model);
        for t = 1 : numHiddenLayers
            % Compute partial derivatives
            dW{t} = deltaErrors{t+1} * activations{t}';
            db{t} = sum(deltaErrors{t+1}, 2);
            % Update parameters
            model.weights{t} = multiPlier * model.weights{t} ...
                - (learnRate / size(currentBatch, 2)) .* dW{t};
            model.bias{t} = model.bias{t} ...
                - (learnRate / size(currentBatch, 2)) .* db{t};
        end
    end
end
The value of numbatches determines whether the network operates in batch, on-line, or mini-batch mode. I use dW and db for the comparison with the numerical gradients.
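One frequent cause of a large gradient-check discrepancy is comparing dW against a numerical gradient evaluated at different parameter values, i.e. after some updates have already been applied. As a minimal sketch (Python/NumPy softmax-plus-NLL toy model, not the poster's MATLAB code; `nll_cost_and_grad` and `numerical_grad` are hypothetical helpers), both gradients below are evaluated at the same frozen W:

```python
import numpy as np

def nll_cost_and_grad(W, X, T):
    """Mean negative log likelihood of a softmax model.
    X: (features, samples), T: one-hot targets (classes, samples),
    W: (classes, features). Returns cost and the analytic gradient."""
    Z = W @ X
    Z = Z - Z.max(axis=0, keepdims=True)  # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)
    m = X.shape[1]
    cost = -np.sum(T * np.log(P)) / m
    dW = (P - T) @ X.T / m                # backprop gradient
    return cost, dW

def numerical_grad(W, X, T, eps=1e-5):
    """Central finite differences, one parameter at a time,
    at the SAME frozen W that the analytic gradient uses."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp = W.copy(); Wp[idx] += eps
        Wm = W.copy(); Wm[idx] -= eps
        g[idx] = (nll_cost_and_grad(Wp, X, T)[0]
                  - nll_cost_and_grad(Wm, X, T)[0]) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 20))
labels = rng.integers(0, 3, 20)
T = np.eye(3)[labels].T
W = rng.standard_normal((3, 4)) * 0.1

_, dW = nll_cost_and_grad(W, X, T)
dW_num = numerical_grad(W, X, T)
rel_err = np.linalg.norm(dW - dW_num) / (np.linalg.norm(dW) + np.linalg.norm(dW_num))
print(rel_err)  # a correct implementation is typically well below 1e-6
```

The point is that no update is applied between computing dW and computing dW_num; if the check is run mid-training, after some weights have already changed, the two gradients belong to different points on the cost surface and will disagree by large margins, as in the numbers posted below.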
Also, I believe the final parameter values are too large. For example, below are two rows of one of the weight matrices obtained for a dataset trained with 816 samples and tested on a dataset with 725 samples, with a classification accuracy of 98.79%, which is good as the test dataset has some noisy labels:
-0.2853 -1.3728 -0.6968 0.4703 -1.2471 2.0104 0.2644 -0.6097 0.7695 0.3747
1.4270 1.2017 0.6934 0.8725 0.4917 -1.0928 -0.3810 0.9145 -1.2533 -0.3824
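For what it's worth, the multiPlier form of the update above is algebraically the standard L2 weight-decay step, and with lambda = 0 it reduces to plain gradient descent, so nothing in the update constrains the weight magnitudes. A small sketch (hypothetical values, not taken from the post) confirming that equivalence:

```python
import numpy as np

# L2 weight decay:  W <- (1 - lr*lambda/m)*W - (lr/m)*dW
# is the same step as gradient descent on  cost + (lambda/(2m))*||W||^2.
rng = np.random.default_rng(1)
W  = rng.standard_normal((2, 3))   # hypothetical weights
dW = rng.standard_normal((2, 3))   # hypothetical unregularised gradient
lr, lam, m = 0.1, 0.5, 100

decay_form    = (1 - lr * lam / m) * W - (lr / m) * dW
gradient_form = W - lr * ((dW / m) + (lam / m) * W)

print(np.allclose(decay_form, gradient_form))  # True
```

Note also that the posted code divides the decay term by size(ipFeatures, 2) (total samples) but the gradient term by size(currentBatch, 2); in mini-batch mode these differ, which silently rescales the effective lambda.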
A few sample values of the backpropagated partial derivatives, the numerically computed gradients, and their difference:
backPropGrads = 0.000863502714559410 0.0112093229963550 9.74490423775809e-05 0.000175776868318497 0.00845120635863130 -0.00189301667233442 -0.00653141680913231 0.00496566802896389 -0.0205541611216203 -0.000101576463654545
numericalGrads = -0.00246672065599973 0.00105451893203656 -0.000341400989006813 -0.000228545330785424 0.000629285591066675 0.00526790995436510 0.00255049267060270 -0.00283454504222680 0.00643259144054997 -2.32609314448906e-05
GradientDifference = 0.00333022337055914 0.0101548040643184 0.000438850031384393 0.000404322199103921 0.00782192076756462 0.00716092662669952 0.00908190947973501 0.00780021307119069 0.0269867525621703 7.83155322096546e-05
Can anyone suggest what I am doing wrong here? Am I performing the weight updates properly?
- Nilay