We extensively evaluate current
state-of-the-art 3D reconstruction and camera pose estimation methods, including Structure from Motion [1], Multi-View Stereo [2],
Visual Geometry Grounded Transformer (VGGT) [3], π³ [4], and 2D Gaussian Splatting [5], on MVM-IOD and report our findings to establish a baseline
for future research. We find that our upside-down capture setup produces images that are out-of-distribution for both VGGT
and π³, leading to suboptimal point clouds and camera poses. We show that, in our case, out-of-distribution images can be shifted
closer to the training distribution by applying preprocessing steps such as rotating the images by 180 degrees and changing
their aspect ratio. These results suggest that, in certain industrial applications, VGGT and π³ should
be used with caution.
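The two preprocessing steps named above can be illustrated with a minimal sketch. It is not the pipeline used for the reported results: the text does not state the target aspect ratio or whether the ratio is changed by cropping or resizing, so the center crop and the 4:3 default below are assumptions, and the function names are hypothetical.

```python
import numpy as np


def rotate_180(image: np.ndarray) -> np.ndarray:
    """Rotate an H x W (x C) image by 180 degrees in the image plane."""
    return np.rot90(image, k=2, axes=(0, 1))


def center_crop_to_aspect(image: np.ndarray, aspect: float) -> np.ndarray:
    """Center-crop an image so that width / height is approximately `aspect`.

    Cropping is an assumption; resizing or padding would be alternatives.
    """
    h, w = image.shape[:2]
    if w / h > aspect:
        # Image is too wide: keep the full height and trim the width.
        new_w = max(1, round(h * aspect))
        x0 = (w - new_w) // 2
        return image[:, x0:x0 + new_w]
    # Image is too tall (or already matches): keep the width, trim the height.
    new_h = max(1, round(w / aspect))
    y0 = (h - new_h) // 2
    return image[y0:y0 + new_h, :]


def preprocess(image: np.ndarray, aspect: float = 4 / 3) -> np.ndarray:
    """Undo the upside-down capture, then adjust the aspect ratio.

    The default of 4:3 is a placeholder, not a value taken from the text.
    """
    return center_crop_to_aspect(rotate_180(image), aspect)
```

A call such as `preprocess(image)` would be applied to every view before the images are passed to the reconstruction model. If camera poses estimated on the preprocessed images are compared against poses for the original images, the rotation and crop have to be accounted for in that comparison.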
In the example below, we show a qualitative comparison of VGGT and π³ on the original MVM-IOD images and on the preprocessed version of MVM-IOD. For VGGT, we show VGGT depth, which refers to the point cloud obtained from the depth map branch, as well as VGGT point, which refers to the point cloud obtained from the point map branch.