Dense 3-D Reconstruction from Multiple Views Using RAFT Optical Flow

Since R2026a

This example demonstrates multi-view dense 3-D reconstruction [1] by combining structure-from-motion (SfM) results with dense pixel-wise correspondences from RAFT optical flow [2].

The example follows these steps:

  1. Select optimal view pairs for dense depth estimation by traversing the view graph obtained from SfM.

  2. Compute optical flow on the selected view pairs.

  3. Enforce geometric consistency checks on optical flow estimates.

  4. Estimate dense depth by triangulation using relative camera pose and optical flow between view pairs.

  5. Reject regions of uncertain depth estimation.

  6. Fuse all individual depth estimates and corresponding RGB values into a single colored point cloud.

This process results in a fast and accurate pipeline for dense reconstruction that leverages the compute capabilities of a GPU when one is available. Techniques such as Neural Radiance Fields (NeRF) produce denser point clouds, often at higher quality, at the cost of significantly longer processing times and steeper hardware requirements, such as a GPU with at least 16 GB of memory. For more information on using NeRF for 3-D reconstruction, see Reconstruct 3-D Scenes and Synthesize Novel Views Using Neural Radiance Field Model.

Download and Unzip Training Data

This example uses a subset of the TUM RGB-D Benchmark [3] containing 104 images of an indoor office scene captured with a handheld camera. You can download the data to a temporary directory using a web browser or by running this code:

if ~exist("tum_rgbd_data.zip", "file")
    websave("tum_rgbd_data.zip", "https://ssd.mathworks.com/supportfiles/3DReconstruction/tum_rgbd_data.zip");
    unzip("tum_rgbd_data.zip", pwd);
end

if ~exist(fullfile("sfmTrainingDataTUMRGBD", "finalReconstructionSfM.mat"), "file")
    websave(fullfile("sfmTrainingDataTUMRGBD", "finalReconstructionSfM.mat"),...
        "https://ssd.mathworks.com/supportfiles/supportfiles/3DReconstruction/finalReconstructionSfM.mat");
end

Load Data and Structure-from-Motion Results

This example builds on the view graph and sparse point clouds created in the Structure from Motion from Multiple Views example. Ensure that the images used in the SfM example are available on the path.

Load the camera intrinsics, image view graph, and reconstructed sparse world points from the saved SfM results.

load(fullfile("sfmTrainingDataTUMRGBD", "cameraInfo.mat"), "intrinsics");
load(fullfile("sfmTrainingDataTUMRGBD", "finalReconstructionSfM.mat"), "viewGraph", "wpSet");

Create an image datastore to load the images of the scene.

imds = imageDatastore(fullfile("sfmTrainingDataTUMRGBD","images"));
numImages = length(imds.Files);
numConn = viewGraph.NumConnections;
numViews = viewGraph.NumViews;

Create opticalFlowRAFT Object

Create an opticalFlowRAFT model for computing dense pixel correspondences between image pairs.

raft = opticalFlowRAFT;

Select View Pairs for Dense Depth Estimation Using the SfM View Graph

For each image in the image collection, select its neighboring view most suitable for dense reconstruction. Dense reconstruction depends on finding pixel-wise matches between two images and then triangulating to obtain 3-D points in world coordinates for each of the matched pixels. A numerically stable triangulation computation requires the two views to have sufficient viewpoint variation between the two camera positions, along with a minimum number of mutually visible image features.

The minOverlap threshold rejects view pairs that are too far apart and do not share enough common features. The ideal value is data-set dependent and must be tuned to obtain a reasonable number of pairs covering multiple viewpoints for subsequent dense reconstruction. Reduce this threshold to consider more image pairs as valid for dense depth estimation. For more control over the view selection process, tune the parameters in the helperSelectDepthViewPairs helper function, attached to this example as a supporting file.

minOverlap = 0.2;
validViewPairs = helperSelectDepthViewPairs(viewGraph, wpSet, minOverlap);

numValidViewPairs = size(validViewPairs,1);
disp(numValidViewPairs + " valid views out of " + numViews + ".");
78 valid views out of 104.
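
As a rough illustration of the selection criterion, the overlap between two connected views can be measured as the fraction of world points visible in both views. This is a hedged sketch only, using hypothetical view IDs 1 and 2, and it omits any additional baseline or feature-count checks that helperSelectDepthViewPairs may apply:

% Hypothetical overlap measure between two connected views: the fraction
% of world points seen in view 1 that are also seen in view 2.
pointsInView1 = findWorldPointsInView(wpSet, 1);
pointsInView2 = findWorldPointsInView(wpSet, 2);
sharedPoints = intersect(pointsInView1, pointsInView2);
overlapRatio = numel(sharedPoints) / numel(pointsInView1);
isCandidatePair = overlapRatio >= minOverlap;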

Visualize Selected View Pairs and Sparse Reconstruction

Visualize the selected views, along with the sparse 3-D points from SfM, to ensure that the object of interest has been captured from multiple viewpoints.

figure
pcshow(wpSet.WorldPoints, VerticalAxis="y", VerticalAxisDir="down");
xlabel("X")
ylabel("Y")
zlabel("Z")
hold on
camColors = autumn(numValidViewPairs);
for i = 1:numValidViewPairs
    viewId = validViewPairs(i,1);
    plotCamera(AbsolutePose=viewGraph.Views.AbsolutePose(viewId), Size=0.1, Color=camColors(i,:));
end
hold off
title("Selected Views and Sparse Point Cloud")

Estimate Depth Map from Image Pair

To compute a dense depth map between two views in the multi-view reconstruction sequence, first extract the camera poses and keypoint correspondences obtained from the earlier SfM stage. The relative pose between the selected image pair defines the geometric relationship needed to triangulate dense pixel correspondences estimated using optical flow from the RAFT model. Because depths derived from optical flow are determined only up to an unknown scale, you must then align and scale the resulting depth map to match the sparse 3-D points reconstructed by SfM. Finally, apply spatial filtering and superpixel-based region rejection to reduce noise, remove low-texture regions, and preserve depth boundaries.
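
The core triangulation step, implemented in the helperComputeDepthMap supporting function, can be sketched as follows. This is an illustrative outline, not the helper itself: it assumes that I1, I2, relPose, and intrinsics are defined as in the code that follows, that estimateFlow is called sequentially on the two frames as with other optical flow objects, and it skips the validity and threshold checks the helper performs.

% Illustrative sketch of dense triangulation from optical flow.
flowField = estimateFlow(raft, I1);  % first call sets the reference frame
flowField = estimateFlow(raft, I2);  % flow from view 1 to view 2
[cols, rows] = meshgrid(1:size(I1,2), 1:size(I1,1));
pts1 = [cols(:), rows(:)];
pts2 = pts1 + [flowField.Vx(:), flowField.Vy(:)];
% With view 1 placed at the origin, the Z coordinate of each triangulated
% point is its depth in view 1.
camMatrix1 = cameraProjection(intrinsics, rigidtform3d);
camMatrix2 = cameraProjection(intrinsics, pose2extr(relPose));
points3D = triangulate(pts1, pts2, camMatrix1, camMatrix2);
depthSketch = reshape(points3D(:,3), size(I1,1), size(I1,2));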

Use the slider to select a particular image pair and compute a depth map of the first image using dense optical flow and triangulation from the relative camera pose between the two images.

pairIndex = 29;

viewPair = validViewPairs(pairIndex,:);

% Thresholds tuned according to dataset
maxDepthThreshold = 2;     % reject depth at points that are too far away to be estimated correctly
medianFilterSize = [5 5];  % size of the median filter kernel used to smooth the raw depth map
numSuperPixels = 50;       % number of super-pixels computed in an image to reject regions of low texture

% Extract 3D World Points seen in view1 from the SfM results
[pointIndices, featureIndices] = findWorldPointsInView(wpSet, viewPair(1));
xyzPoints = wpSet.WorldPoints(pointIndices,:);

% Extract 2D keypoints detected in view1 from the SfM results
view1 = findView(viewGraph, viewPair(1));
view2 = findView(viewGraph, viewPair(2));
featurePoints = view1.Points{:}(featureIndices);

I1 = readimage(imds, viewPair(1));
I2 = readimage(imds, viewPair(2));

% Compute relative pose between camera positions of the two views
pose1 = view1.AbsolutePose;
pose2 = view2.AbsolutePose;
p1Inv = invert(pose1);
relPose = p1Inv.A * pose2.A;
relPose = rigidtform3d(relPose);

% Compute depth map using dense pixel correspondences from optical flow and
% relative camera pose between two views.
[depthMap, flow, isValid] = helperComputeDepthMap(...
    raft, I1, I2, relPose, intrinsics, featurePoints, maxDepthThreshold);

% Scale the estimated depth map values to the same coordinates as the
% sparse SfM keypoints.
scaleFactor = helperComputeDepthMapScale(featurePoints, pose1, xyzPoints, depthMap);
depthMap = scaleFactor * depthMap;

% Median filtering to remove noise in local neighborhoods
depthMap = medfilt2(depthMap, medianFilterSize);

% Use SLIC superpixels to reject regions that have low texture and are
% likely to have incorrect depth estimates.
if numSuperPixels > 0
    [L,N] = superpixels(I1, numSuperPixels);
    BW = boundarymask(L);
    depthMap = helperFilterSuperpixels(L, depthMap, featurePoints.Location);
end
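
The scale alignment performed by helperComputeDepthMapScale can be sketched as a robust ratio between the sparse SfM depths and the unscaled dense depths sampled at the same keypoint locations. This is an illustrative outline under that assumption, not the helper itself, and it applies to the depth map before the scaling step above:

% Illustrative sketch of the scale estimation in helperComputeDepthMapScale.
% Express the sparse SfM points in the camera frame of view 1; their Z
% coordinates are depths at SfM scale.
pointsInCam = transformPointsInverse(pose1, xyzPoints);
sparseDepths = pointsInCam(:,3);
% Sample the unscaled depth map at the corresponding keypoint locations.
locs = round(featurePoints.Location);
denseDepths = depthMap(sub2ind(size(depthMap), locs(:,2), locs(:,1)));
% A robust ratio, such as the median, gives the scale factor.
scaleFactor = median(sparseDepths ./ denseDepths, "omitnan");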

Visualize the two images and the estimated depth map. Invalid depth map pixels are marked using NaN values. The default thresholds in this example are set to very conservative values that reject a large number of regions in an image where depth estimates may be uncertain. This is done to avoid a noisy point cloud in the final result. In less challenging scenarios, these thresholds can be relaxed and stages such as the superpixel rejection may be omitted altogether without impacting the final reconstruction.

figure
subplot(1,3,1)
imshow(imoverlay(I1,BW,"yellow"))
hold on
plot(featurePoints,ShowScale=false)
hold off
title("Image " + viewPair(1))
subplot(1,3,2)
imshow(I2)
title("Image " + viewPair(2))
subplot(1,3,3)
imagesc(depthMap)
axis image
colorbar
xticks([])
yticks([])
title("Depth Image " + viewPair(1))
truesize;

Dense Reconstruction on Entire Image Sequence

Run the reconstruction algorithm on all the valid image pairs in the image sequence and reconstruct a dense point cloud. Two-view reconstructions from all the selected views, along with their color or intensity information, are concatenated into a single point cloud.

Executing this section of the code takes about 7 minutes on a Linux® machine with an NVIDIA® GeForce RTX 3090 GPU.

% Specify parameters
offset = 50;                         % Ignore image points at the boundary
depthRange = [0 maxDepthThreshold];  % Determined by visually inspecting the earlier individual depth maps

ptCloudsFull = cell(numValidViewPairs,1);
imageSize = size(I1);
[X, Y] = meshgrid(offset:2:imageSize(2)-offset, offset:2:imageSize(1)-offset);

for i = 1:numValidViewPairs

    viewPair = validViewPairs(i,:);

    fprintf(i + "/" + numValidViewPairs + " ");
    if mod(i,20) == 0
        fprintf("\n");
    end

    I1 = readimage(imds, viewPair(1));
    I2 = readimage(imds, viewPair(2));

    view1 = findView(viewGraph, viewPair(1));
    view2 = findView(viewGraph, viewPair(2));

    % 3D world points visible from view1
    [pointIndices, featureIndices] = findWorldPointsInView(wpSet, viewPair(1));
    xyzPoints = wpSet.WorldPoints(pointIndices,:);

    % 2D keypoints for visible world points in view1
    featurePoints = view1.Points{:}(featureIndices);

    % Camera poses of the two views
    pose1 = view1.AbsolutePose;
    pose2 = view2.AbsolutePose;

    % Compute relative pose between cameras
    p1Inv = invert(pose1);
    relPose = p1Inv.A * pose2.A;
    relPose = rigidtform3d(relPose);

    % Compute depth map using dense pixel correspondences from optical flow
    % and relative camera pose between two views from SfM.
    [depthMap, flow, isValid] = helperComputeDepthMap(...
        raft, I1, I2, relPose, intrinsics, featurePoints, depthRange(2));

    if ~isValid
        % Numerically unstable depth estimation
        continue;
    end

    % Scale the estimated depth map values to the same coordinates as the
    % sparse SfM keypoints.
    scaleFactor = helperComputeDepthMapScale(featurePoints,pose1,xyzPoints,depthMap);
    depthMap = scaleFactor * depthMap;

    % Median filtering to remove noise in local neighborhoods
    depthMap = medfilt2(depthMap, medianFilterSize);

    % Use SLIC superpixels to reject regions that have low texture and are
    % likely to have incorrect depth estimates.
    if numSuperPixels > 0
        L = superpixels(I1, numSuperPixels);
        depthMap = helperFilterSuperpixels(L, depthMap, featurePoints.Location);
    end

    % Create a colored point cloud from the depth map of a single view
    [xyzPoints, validIndex] = helperReconstructFromRGBD([X(:), Y(:)], ...
        depthMap, intrinsics, pose1, 1, depthRange);
    colors = zeros(numel(X), 3, 'like', I1);
    for j = 1:numel(X)
        colors(j, :) = I1(Y(j), X(j), :);
    end
    ptCloudOneView = pointCloud(xyzPoints, Color=colors(validIndex, :));

    % If the point cloud is not empty, add to the list of point clouds
    if ptCloudOneView.Count > 0
        ptCloudOneView = pcdenoise(ptCloudOneView); % Denoise the point cloud
        ptCloudsFull{i} = ptCloudOneView;
    end

end
1/78 2/78 3/78 4/78 5/78 6/78 7/78 8/78 9/78 10/78 11/78 12/78 13/78 14/78 15/78 16/78 17/78 18/78 19/78 20/78 21/78 22/78 23/78 24/78 25/78 26/78 27/78 28/78 29/78 30/78 31/78 32/78 33/78 34/78 35/78 36/78 37/78 38/78 39/78 40/78 41/78 42/78 43/78 44/78 45/78 46/78 47/78 48/78 49/78 50/78 51/78 52/78 53/78 54/78 55/78 56/78 57/78 58/78 59/78 60/78 61/78 62/78 63/78 64/78 65/78 66/78 67/78 68/78 69/78 70/78 71/78 72/78 73/78 74/78 75/78 76/78 77/78 78/78 

Merge the point clouds into a single point cloud.

ptCloudsFull = ptCloudsFull(~cellfun(@isempty, ptCloudsFull));
ptCloudsMerged = pccat([ptCloudsFull{:}]);

Visualize Reconstructed Point Cloud

Visualize the point cloud and the camera positions of the views used in the dense reconstruction. Despite the challenging camera motion and casual capture conditions of this image sequence, the details of the desk and the objects on it are reconstructed accurately. The pipeline discards a large number of uncertain or inconsistent depth estimates, which results in gaps in the final point cloud. These gaps are often textureless, planar surfaces such as the floor, or distant structures such as the walls.

figure
pcshow(ptCloudsMerged, VerticalAxis="y", VerticalAxisDir="down");
xlabel("X")
ylabel("Y")
zlabel("Z")
title("Dense Reconstructed Point Cloud")

This example demonstrates a simplified version of dense reconstruction that relies on robust optical flow estimates and on the camera poses and sparse reconstruction results from structure from motion. See Schönberger et al. [1] for more details on rigorous multi-view stereo methods for dense reconstruction. For higher-fidelity reconstruction, consider using additional sensors such as RGB-D or stereo cameras, capturing the data at high resolution while avoiding motion blur, or using deep learning techniques such as the nerfacto Neural Radiance Field (NeRF) model. For more information, see Reconstruct 3-D Scenes and Synthesize Novel Views Using Neural Radiance Field Model.

References

[1] Schönberger, Johannes L., Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. "Pixelwise View Selection for Unstructured Multi-View Stereo." In Proceedings of the European Conference on Computer Vision (ECCV), 2016.

[2] Teed, Zachary, and Jia Deng. "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow." In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

[3] Sturm, Jürgen, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. "A Benchmark for the Evaluation of RGB-D SLAM Systems." In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573-580, 2012.