
Chapter 3

Perception


Perceiving the surrounding environment from fused sensor data involves recognizing the shape and size of objects and, possibly, classifying them (e.g., distinguishing a door from a chair). Autonomous mobile robot (AMR) operations use this information in later action planning and decision-making. Image recognition and point cloud processing help with object detection.


Object Detection with Image Processing and Deep Learning

In recent years, deep learning has received much attention in the field of artificial intelligence (AI), and deep learning-based image processing methods have advanced with remarkable results for robotics applications. Using Deep Learning Toolbox™, you can perform classification and regression on images and time-series data using reference examples for convolutional neural networks (ConvNets, CNNs) and long short-term memory (LSTM) networks.

In addition to image classification, deep learning also provides algorithms for object detection and segmentation. Object detection algorithms with high processing speed and detection accuracy are often used in AMRs. For example, YOLO (You Only Look Once) is one algorithm that delivers both speed and accuracy.

Object detection by YOLO v2.

YOLO is a real-time object detection algorithm that is classified as a single-stage object detector. YOLO is faster than a two-stage object detector, such as Faster R-CNN, because YOLO extracts and classifies candidate regions simultaneously. Deep Learning Toolbox and Computer Vision Toolbox support real-time object detection using YOLO v2 and YOLO v3. You can also use the single shot multibox detector (SSD) provided in these toolboxes. Object detectors usually require training data, which you can create yourself or obtain from publicly available datasets. Computer Vision Toolbox also provides the Image Labeler app for labeling (also known as tagging or annotating) the training data.
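
As a minimal sketch of how such a detector is applied, the code below runs a previously trained YOLO v2 detector on a single image. The file names trainedYolov2.mat and testImage.png, and the variable name detector, are hypothetical placeholders for your own trained model and data.

% Minimal sketch: run a trained YOLO v2 detector on one image.
% "trainedYolov2.mat" is a hypothetical file assumed to contain a
% yolov2ObjectDetector trained with trainYOLOv2ObjectDetector.
data = load("trainedYolov2.mat");
detector = data.detector;

I = imread("testImage.png");                     % hypothetical test image
[bboxes, scores, labels] = detect(detector, I, "Threshold", 0.5);

% Overlay the detections for visual inspection.
annotated = insertObjectAnnotation(I, "rectangle", bboxes, ...
    string(labels) + ": " + string(round(scores, 2)));
imshow(annotated)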

If you want to build a more sophisticated and flexible deep neural network, you can access the low-level interface in Deep Learning Toolbox. This interface provides a dedicated data class called dlarray for automatic differentiation, which lets you build complex networks without cumbersome steps such as implementing a backpropagation function. The interface also allows you to create advanced networks such as a generative adversarial network (GAN) or a variational autoencoder (VAE).
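
The following minimal sketch illustrates the idea behind dlarray-based automatic differentiation: the gradient of a simple scalar function is computed without writing any backpropagation code. The function modelLoss here is purely illustrative, not part of any toolbox.

% Minimal sketch of automatic differentiation with dlarray.
x = dlarray([1 2 3]);                 % wrap data to enable gradient tracing
[y, grad] = dlfeval(@modelLoss, x);   % evaluate the function with tracing on

function [y, grad] = modelLoss(x)
    y = sum(x.^2);                    % example scalar "loss": sum of squares
    grad = dlgradient(y, x);          % dy/dx, computed automatically
end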


Object Detection by Point Cloud Processing

Light Detection and Ranging (lidar) is one of the essential sensor types used in AMRs. Lidar measures distance by emitting laser light toward an object and recording the time that elapses while the light reflects off the target and returns to the sensor. Because this is an active measurement method, lidar provides highly accurate distance measurements between the moving robot and objects in its environment. In addition, 360-degree laser rotation allows distance measurements to objects anywhere around the robot, as long as the objects lie within the plane of the laser's rotation. A 2D lidar uses a single laser element and generates 2D laser scans of an environment, while a 3D lidar uses multiple laser elements and generates 3D point clouds. In addition to distance estimation, you can process lidar point clouds to determine an object's orientation and type.

Lidar Toolbox and Computer Vision Toolbox provide various methods for point cloud processing, such as downsampling, point cloud matching and registration, and shape fitting. You can also train custom object detection and semantic segmentation models using deep learning and machine learning algorithms such as PointSeg, PointPillars, and SqueezeSegV2.
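
As a minimal sketch of a typical point cloud workflow, the code below downsamples two lidar scans and registers them with ICP. The file names scan1.pcd and scan2.pcd are hypothetical placeholders for your own recorded scans, and the voxel size is an illustrative value.

% Minimal sketch: downsample and register two lidar point clouds.
ptCloudRef = pcread("scan1.pcd");     % hypothetical recorded scans
ptCloudNew = pcread("scan2.pcd");

% Downsample on a voxel grid to reduce the matching workload.
fixed  = pcdownsample(ptCloudRef, "gridAverage", 0.1);   % 0.1 m voxels
moving = pcdownsample(ptCloudNew, "gridAverage", 0.1);

% Estimate the rigid transformation that aligns the new scan to the reference.
tform = pcregistericp(moving, fixed);
aligned = pctransform(ptCloudNew, tform);

% Visualize how well the two scans line up after registration.
pcshowpair(aligned, ptCloudRef)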

Object detection by point cloud processing.


Tracking Detected Objects

Objects in a robot’s environment are not always stationary. When moving objects are present, such as humans or pets, fusing detection results from multiple sensors, such as a camera and lidar, can provide better results. Such sensor fusion requires state estimation filters, such as a Kalman filter, and algorithms for managing the allocation of multiple tracks.

With Sensor Fusion and Tracking Toolbox™, you can explore various state estimation filters and object tracking algorithms. This toolbox provides algorithms to fuse data from real-world sensors, including active and passive radar, sonar, lidar, EO/IR, IMU, and GPS. This helps to compensate for the strengths and weaknesses of each sensor and improves position prediction accuracy, which is essential for an AMR’s safe navigation.
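
As a minimal sketch of multi-object tracking, the code below feeds a short sequence of made-up position measurements, standing in for fused camera/lidar detections, to a global nearest neighbor tracker that initializes an extended Kalman filter per track.

% Minimal sketch: track one moving object with a GNN multi-object tracker.
% The positions below are made-up stand-ins for fused camera/lidar detections.
tracker = trackerGNN("FilterInitializationFcn", @initcvekf, ...
                     "ConfirmationThreshold", [2 3]);

positions = [1.0 2.0 0; 1.5 2.0 0; 2.0 2.0 0];   % object moving along x (m)
dt = 0.1;                                        % time between updates (s)
for k = 1:size(positions, 1)
    det = {objectDetection(k*dt, positions(k, :)')};
    confirmedTracks = tracker(det, k*dt);        % update tracks with new detection
end
disp([confirmedTracks.TrackID])   % the track is confirmed after repeated detections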

Object tracking with sensor fusion.


Simultaneous Localization and Mapping (SLAM)

Consider exploring an unknown space using a 2D lidar. By aligning the point clouds obtained from the lidar, you can iteratively generate a map. Once you have the map, you also need information about the robot's pose (position and orientation) relative to the map. However, sequential estimation of position leads to error accumulation, and the estimated position can deviate significantly from the actual position. Such misalignment is called drift, and safe AMR operations require map generation with drift correction. A SLAM algorithm estimates the AMR's pose while building the map of the environment and simultaneously works to compensate for the position drift. There are several methods of performing SLAM, including lidar SLAM using a lidar sensor and visual SLAM using a camera.

Lidar provides more accurate distance measurements than cameras and other time-of-flight (ToF) sensors, so lidar SLAM is widely used on fast-moving platforms such as self-driving cars and drones. In lidar SLAM, the robot's motion is estimated sequentially by matching consecutive point clouds, and its position is estimated by integrating those motions. Iterative closest point (ICP) and normal distributions transform (NDT) algorithms are used for point cloud matching. Map representations managed by lidar SLAM are shown in the figure below, where the first row shows 2D occupancy grid maps and the second row shows 3D point clouds. The point cloud maps are raw output from the lidar sensor. Although the point clouds include detailed environmental information, they are not directly suitable for collision detection. A grid map samples the point clouds on an evenly spaced grid and can be used for obstacle detection and motion planning.

2D and 3D lidar SLAM.
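
A minimal sketch of 2D lidar SLAM with the lidarSLAM object follows. It assumes scans is a cell array of lidarScan objects that you have already collected; the range, resolution, and loop closure parameters are illustrative values, not recommendations.

% Minimal sketch: 2D lidar SLAM with scan matching and loop closure.
% "scans" is assumed to be a cell array of lidarScan objects from the robot.
maxLidarRange = 8;                    % assumed sensor range, in meters
mapResolution = 20;                   % grid cells per meter
slamAlg = lidarSLAM(mapResolution, maxLidarRange);
slamAlg.LoopClosureThreshold = 200;   % illustrative loop closure settings
slamAlg.LoopClosureSearchRadius = 8;

for i = 1:numel(scans)
    addScan(slamAlg, scans{i});       % scan matching plus pose graph update
end

% Rebuild the occupancy grid map from the optimized scan poses.
[slamScans, slamPoses] = scansAndPoses(slamAlg);
map = buildMap(slamScans, slamPoses, mapResolution, maxLidarRange);
show(map)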

One challenge for lidar SLAM is that when the point cloud is sparse, obtaining enough features for point cloud matching can be difficult. For example, in a place with few obstacles, the alignment between point clouds may fail and the robot's position may be lost. In addition, the processing load for point cloud matching is generally high, so you may need to optimize the processing to keep up with the robot's motion.

As an alternative, visual SLAM uses images from cameras and image sensors. Camera types used for visual SLAM include monocular cameras (wide-angle cameras, fish-eye cameras, and all-sky cameras), compound-eye cameras (stereo cameras and multi-cameras), and RGB-D cameras (depth cameras and ToF cameras). Visual SLAM algorithms can be broadly classified into two types. A sparse method matches image feature points using algorithms such as PTAM and ORB-SLAM. A dense method leverages the brightness of the entire image and uses DTAM, LSD-SLAM, DSO, and SVO algorithms.
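
The following minimal sketch shows the feature matching step at the core of a sparse method, using ORB features from two consecutive frames. The image file names frame1.png and frame2.png are hypothetical, and the frames are assumed to be RGB.

% Minimal sketch: ORB feature matching between two consecutive camera frames,
% the building block of sparse visual SLAM front-ends.
I1 = rgb2gray(imread("frame1.png"));      % hypothetical RGB frames
I2 = rgb2gray(imread("frame2.png"));

pts1 = detectORBFeatures(I1);
pts2 = detectORBFeatures(I2);
[f1, vpts1] = extractFeatures(I1, pts1);
[f2, vpts2] = extractFeatures(I2, pts2);

idxPairs = matchFeatures(f1, f2, "Unique", true);
matched1 = vpts1(idxPairs(:, 1));
matched2 = vpts2(idxPairs(:, 2));

% The matched points feed camera pose estimation and map point triangulation.
showMatchedFeatures(I1, I2, matched1, matched2, "montage")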

Visual SLAM (ORB-SLAM) with a monocular camera.

Visual SLAM has strong ties to technologies such as structure from motion (SfM), visual odometry, and bundle adjustment. The main advantage of visual SLAM is that it can be implemented at much lower cost than lidar SLAM because cameras are relatively inexpensive. Because cameras capture a large amount of information, they can also recognize previously observed locations as landmarks for later detection. However, with a monocular camera, calculating the distance to objects from the camera geometry alone can be difficult.

Both lidar SLAM and visual SLAM accumulate drift errors that cause inaccuracies in localization. As a countermeasure, pose graphs are used to help correct the errors. Pose graphs are graph structures that contain nodes connected by edges. Each node estimate is connected to the graph by edge constraints that define the relative pose between nodes and the uncertainty on that measurement. Pose graphs can include known objects such as augmented reality markers and checkerboards. The graphs can also fuse the output from sensors such as inertial measurement units (IMUs) that can measure the pose directly.

Pose graphs provided in Navigation Toolbox™ can be extended for any sensor data. The SLAM front-end estimates the AMR’s movement and obstacle locations via sensor-dependent processes, perhaps using a Kalman filter to estimate the position by integrating multiple sensors. In contrast, building and optimizing the back-end pose graph is a sensor-independent process. The back-end component can be applied to various SLAM algorithms. Pose graph optimization is sometimes referred to as bundle adjustment.
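
As a minimal sketch of the back-end, the code below builds a small 2D pose graph with a loop closure and optimizes it; all relative poses and the information matrix are made-up illustrative numbers.

% Minimal sketch: build and optimize a small 2D pose graph.
% All relative poses [x y theta] and the information matrix are made up.
pg = poseGraph;

% Front-end odometry edges: the robot drives a rough square.
addRelativePose(pg, [1 0 pi/2]);      % node 1 -> 2
addRelativePose(pg, [1 0 pi/2]);      % node 2 -> 3
addRelativePose(pg, [1 0 pi/2]);      % node 3 -> 4
addRelativePose(pg, [1.05 0 pi/2]);   % node 4 -> 5, with a little drift

% Loop closure constraint: node 5 is observed to coincide with node 1.
addRelativePose(pg, [0 0 0], [1 1 1 1 1 1], 5, 1);

% Back-end: redistribute the accumulated drift across the graph.
pgOptimized = optimizePoseGraph(pg);
show(pgOptimized);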

Front-end and back-end of SLAM.

Localization in a Known Map

With SLAM, you can build the map of the environment while simultaneously localizing the robot. In some cases, you may already have a map built using a mapping algorithm, and you need only to estimate the robot's pose within that map. State estimation methods such as Monte Carlo localization use a particle filter to estimate the AMR's pose within the known map based on the AMR's motion and sensing. The Monte Carlo localization algorithm uses distributed particles to represent different possible robot states. As the AMR moves and its distance sensors observe different parts of the environment, the particles converge around a single pose. AMR movement can also be detected using odometry sensors. By expressing the AMR's position as a probability distribution, you can perform robust pose estimation in a dynamic environment, accounting for sensor measurement errors. Navigation Toolbox includes a variant of this algorithm called adaptive Monte Carlo localization (AMCL).
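
A minimal sketch of one AMCL update step follows. The map, obstacle, odometry pose, and lidar scan are made-up placeholder data, and the sensor and motion models use default settings.

% Minimal sketch: one update step of adaptive Monte Carlo localization
% in a known map. The map, odometry pose, and scan are placeholder data.
map = occupancyMap(10, 10, 20);            % 10 m x 10 m map, 20 cells per meter
setOccupancy(map, [5 5], 1);               % a single occupied cell as an obstacle

sensorModel = likelihoodFieldSensorModel;
sensorModel.Map = map;

mcl = monteCarloLocalization;
mcl.UseLidarScan = true;
mcl.SensorModel = sensorModel;
mcl.MotionModel = odometryMotionModel;
mcl.GlobalLocalization = true;             % spread particles over the whole map

% One update: odometry pose [x y theta] plus a lidar scan.
odomPose = [0 0 0];
scan = lidarScan(5*ones(1, 180), linspace(-pi/2, pi/2, 180));
[isUpdated, estPose, estCov] = mcl(odomPose, scan);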

Vehicle position estimation (localization).

Learn More About Perception