
unet

Create U-Net convolutional neural network for semantic segmentation

Since R2024a

Description

unetNetwork = unet(imageSize,numClasses) returns a U-Net network.

Use unet to create the U-Net network architecture. You must train the network using the Deep Learning Toolbox™ function trainnet (Deep Learning Toolbox).


[unetNetwork,outputSize] = unet(imageSize,numClasses) also returns the size of the output from the U-Net network.

___ = unet(imageSize,numClasses,Name=Value) specifies options using one or more name-value arguments. For example, unet(imageSize,numClasses,NumFirstEncoderFilters=64) specifies the number of output channels as 64 for the first encoder stage.
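As a quick sketch of these syntaxes (the image size and class count below are illustrative values, not defaults):

imageSize = [256 256 3];   % illustrative RGB input size
numClasses = 4;            % illustrative number of classes

net1 = unet(imageSize,numClasses);                             % default architecture
[net2,outputSize] = unet(imageSize,numClasses);                % also return the output size
net3 = unet(imageSize,numClasses,NumFirstEncoderFilters=64);   % name-value option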

Examples


Create a U-Net network with an encoder-decoder depth of 3.

imageSize = [480 640 3];
numClasses = 5;
encoderDepth = 3;
unetNetwork = unet(imageSize,numClasses,EncoderDepth=encoderDepth)
unetNetwork = 
  dlnetwork with properties:

         Layers: [48×1 nnet.cnn.layer.Layer]
    Connections: [53×2 table]
     Learnables: [36×3 table]
          State: [0×3 table]
     InputNames: {'encoderImageInputLayer'}
    OutputNames: {'FinalNetworkSoftmax-Layer'}
    Initialized: 1

  View summary with summary.

Display the network.

plot(unetNetwork)

The figure shows a graph plot of the U-Net network layers.
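You could also inspect the layer connectivity and activation sizes interactively with the Deep Learning Toolbox Network Analyzer; this optional step is not part of the original example.

analyzeNetwork(unetNetwork)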

Load training images and pixel labels into the workspace.

dataSetDir = fullfile(toolboxdir("vision"),"visiondata","triangleImages");
imageDir = fullfile(dataSetDir,"trainingImages");
labelDir = fullfile(dataSetDir,"trainingLabels");

Create an imageDatastore object to store the training images.

imds = imageDatastore(imageDir);

Define the class names and their associated label IDs.

classNames = ["triangle","background"];
labelIDs   = [255 0];

Create a pixelLabelDatastore object to store the ground truth pixel labels for the training images.

pxds = pixelLabelDatastore(labelDir,classNames,labelIDs);

Create the U-Net network.

imageSize = [32 32];
numClasses = 2;
unetNetwork = unet(imageSize, numClasses)
unetNetwork = 
  dlnetwork with properties:

         Layers: [61×1 nnet.cnn.layer.Layer]
    Connections: [68×2 table]
     Learnables: [46×3 table]
          State: [0×3 table]
     InputNames: {'encoderImageInputLayer'}
    OutputNames: {'FinalNetworkSoftmax-Layer'}
    Initialized: 1

  View summary with summary.

Create a datastore for training the network.

ds = combine(imds,pxds);

Set training options.

options = trainingOptions("sgdm", ...
    InitialLearnRate=1e-3, ...
    MaxEpochs=20, ...
    VerboseFrequency=10);

Train the network.

net = trainnet(ds,unetNetwork,"crossentropy",options)
    Iteration    Epoch    TimeElapsed    LearnRate    TrainingLoss
    _________    _____    ___________    _________    ____________
            1        1       00:00:05        0.001          3.2975
           10       10       00:00:48        0.001          0.6778
           20       20       00:01:36        0.001         0.27066
Training stopped: Max epochs completed
net = 
  dlnetwork with properties:

         Layers: [61×1 nnet.cnn.layer.Layer]
    Connections: [68×2 table]
     Learnables: [46×3 table]
          State: [0×3 table]
     InputNames: {'encoderImageInputLayer'}
    OutputNames: {'FinalNetworkSoftmax-Layer'}
    Initialized: 1

  View summary with summary.
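As an optional follow-up (not part of the original example), one way to segment an image with the trained network is the semanticseg function. Because a dlnetwork object does not store class names, this sketch assumes semanticseg accepts them through its Classes name-value argument, and it reuses an image from the training datastore for illustration.

I = readimage(imds,1);                        % one of the training images, for illustration
C = semanticseg(I,net,Classes=classNames);    % categorical label image
B = labeloverlay(I,C);                        % overlay the labels on the image
imshow(B)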

Input Arguments


imageSize

Network input image size, specified as one of these options:

  • 2-element vector in the form [height, width].

  • 3-element vector in the form [height, width, depth]. depth is the number of image channels. Set depth to 3 for RGB images, to 1 for grayscale images, or to the number of channels for multispectral and hyperspectral images.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
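For example, a hypothetical four-channel multispectral input with three classes could be specified like this (the sizes are placeholders):

net = unet([256 256 4],3);   % 4-channel input, 3 segmentation classes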

numClasses

Number of classes in the semantic segmentation, specified as an integer greater than 1.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: EncoderDepth=3 specifies the encoder depth as 3.

EncoderDepth

Encoder depth, specified as a positive integer. U-Net is composed of an encoder subnetwork and a corresponding decoder subnetwork. The depth of these networks determines the number of times the input image is downsampled or upsampled during processing. The encoder network downsamples the input image by a factor of 2^D, where D is the value of EncoderDepth. The decoder network upsamples the encoder network output by a factor of 2^D.

Note

If you also specify EncoderNetwork, specify the value of EncoderDepth using the depth of the EncoderNetwork input.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
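Because the encoder halves the spatial dimensions D times, a quick compatibility check for a candidate image size and depth (with the default 'same' padding) is divisibility by 2^D. This is an illustrative sketch, not part of the unet interface:

imageSize = [480 640 3];
D = 3;                                             % candidate EncoderDepth
isCompatible = all(mod(imageSize(1:2),2^D) == 0)   % true when height and width are multiples of 2^D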

EncoderNetwork

Network that unet uses as the encoder, specified as a dlnetwork (Deep Learning Toolbox) object. You can specify a pretrained or custom encoder network. To use a pretrained encoder network, create the network using the pretrainedEncoderNetwork function.
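A sketch of the pretrained-encoder workflow might look like the following; it assumes the relevant pretrained-model support package (for example, ResNet-18) is installed, and the pretrainedEncoderNetwork call signature shown here (network name plus depth) should be checked against its own reference page.

depth = 4;
encoder = pretrainedEncoderNetwork("resnet18",depth);                  % pretrained encoder (assumed signature)
net = unet([224 224 3],5,EncoderNetwork=encoder,EncoderDepth=depth);   % image size and class count are illustrative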

NumFirstEncoderFilters

Number of output channels for the first encoder stage, specified as a positive integer or vector of positive integers. In each subsequent encoder stage, the number of output channels doubles. The unet function sets the number of output channels in each decoder stage to match the number in the corresponding encoder stage.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

FilterSize

Convolutional layer filter size, specified as a positive odd integer or a 2-element row vector of positive odd integers. Typical values are in the range [3, 7].

FilterSize                Description
scalar                    The filter is square.
2-element row vector      The filter has the size [height width].

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

ConvolutionPadding

Type of padding, specified as 'same' or 'valid'. The type of padding specifies the padding style for the convolution2dLayer (Deep Learning Toolbox) layers in the encoder and decoder subnetworks. The spatial size of the output feature map depends on the type of padding. If you specify the type of padding as:

  • 'same' — Zero padding is applied to the inputs to convolution layers such that the output and input feature maps are the same size.

  • 'valid' — Zero padding is not applied to the inputs to convolution layers. The convolution layer returns only values of the convolution that are computed without zero padding. The output feature map is smaller than the input feature map.

Note

To ensure that the height and width of the inputs to the max-pooling layers are even, choose the network input image size to conform to one of these criteria:

  • If you specify 'ConvolutionPadding' as 'same', then the height and width of the input image must be multiples of 2^D.

  • If you specify 'ConvolutionPadding' as 'valid', then choose the height and width of the input image such that height − Σ_{i=1}^{D} 2^i (f_h − 1) and width − Σ_{i=1}^{D} 2^i (f_w − 1) are multiples of 2^D,

    where f_h and f_w are the height and width of the two-dimensional convolution kernel, respectively, and D is the encoder depth.

Data Types: char | string
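The 'valid' criterion above can be checked numerically. This sketch (not part of the unet interface) assumes a square filter of size f and encoder depth D:

D = 4;                                   % encoder depth
f = 3;                                   % filter height and width
sz = [572 572];                          % candidate input height and width
shrink = sum(2.^(1:D))*(f-1);            % total size lost to the 'valid' convolutions
isCompatible = all(mod(sz - shrink,2^D) == 0)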

Output Arguments


unetNetwork

U-Net network architecture, returned as a dlnetwork (Deep Learning Toolbox) object.

outputSize

Network output image size, returned as a 3-element vector of the form [height, width, channels]. channels is the number of output channels, which is equal to the number of classes specified at the input. The height and width of the output image from the network depend on the type of convolution padding.

  • If you specify 'ConvolutionPadding' as 'same', then the height and width of the network output image are the same as those of the network input image.

  • If you specify 'ConvolutionPadding' as 'valid', then the height and width of the network output image are less than those of the network input image.

Data Types: double
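For example, requesting the second output with 'valid' padding shows how much smaller the output is than the input. The values below match the classic U-Net configuration (depth 4, 3-by-3 filters, 572-by-572 input) and are illustrative:

[net,outputSize] = unet([572 572 3],2,EncoderDepth=4,FilterSize=3, ...
    ConvolutionPadding="valid");
outputSize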

More About


U-Net Architecture

  • The U-Net architecture consists of an encoder subnetwork and decoder subnetwork that are connected by a bridge section.

  • The encoder and decoder subnetworks in the U-Net architecture consist of multiple stages. EncoderDepth, which specifies the depth of the encoder and decoder subnetworks, sets the number of stages.

  • Each stage within the U-Net encoder subnetwork consists of two sets of convolutional and ReLU layers, followed by a 2-by-2 max-pooling layer. Each stage within the decoder subnetwork consists of a transposed convolution layer for upsampling, followed by two sets of convolutional and ReLU layers.

  • The bridge section consists of two sets of convolution and ReLU layers.

  • The bias term of all convolutional layers is initialized to zero.

  • Convolution layer weights in the encoder and decoder subnetworks are initialized using the 'He' weight initialization method [2].
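One illustrative way to confirm this stage structure in a created network (not part of the unet interface) is to count the max-pooling layers, which should equal the encoder depth:

net = unet([64 64 3],2,EncoderDepth=3);
numPoolStages = nnz(arrayfun(@(l) isa(l,'nnet.cnn.layer.MaxPooling2DLayer'),net.Layers))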

Tips

  • Use 'same' padding in convolution layers to maintain the same data size from input to output and enable the use of a broad set of input image sizes.

  • Use patch-based approaches for seamless segmentation of large images. You can extract image patches by using the randomPatchExtractionDatastore function, as sketched after this list.

  • Use 'valid' padding to prevent border artifacts while you use patch-based approaches for segmentation.

  • You can use a network created by the unet function for GPU code generation after training it with trainnet (Deep Learning Toolbox). For details and examples, see Code Generation (Deep Learning Toolbox).
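As mentioned in the patch-based tip above, a sketch of creating a patch datastore from the image and pixel label datastores in the earlier example might look like this. The patch size and PatchesPerImage value are illustrative, and depending on your release you may need to adapt the datastore output format before passing it to trainnet.

patchSize = [32 32];
patchds = randomPatchExtractionDatastore(imds,pxds,patchSize, ...
    PatchesPerImage=16);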

References

[1] Ronneberger, O., P. Fischer, and T. Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." Medical Image Computing and Computer-Assisted Intervention (MICCAI). Vol. 9351, 2015, pp. 234–241.

[2] He, K., X. Zhang, S. Ren, and J. Sun. "Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." Proceedings of the IEEE International Conference on Computer Vision. 2015, 1026–1034.


Version History

Introduced in R2024a