Generic Objects as Pose Probes for Few-Shot View Synthesis (2024)

Zhirui Gao, Renjiao Yi, Chenyang Zhu, Ke Zhuang, Wei Chen, Kai Xu

Abstract

Radiance fields, including NeRFs and 3D Gaussians, demonstrate great potential in high-fidelity rendering and scene reconstruction, but they require a substantial number of posed images as input. COLMAP is frequently employed for preprocessing to estimate poses, yet it requires a large number of feature matches to operate effectively and struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often use calibration boards, but these are not common in everyday images. We propose the novel idea of utilizing everyday objects, commonly found in both images and real life, as “pose probes”. The probe object is automatically segmented by SAM, and its shape is initialized from a cube. We apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, object poses of two views are first estimated by PnP matching in an SDF representation, which serves as the initial poses. PnP matching, requiring only a few features, is suitable for feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance. Our project page is available at: this https URL

[Figure 1]

1 Introduction

As a milestone in the realm of computer vision and graphics, neural radiance fields (NeRFs) offer an unprecedented capability for photorealistic rendering of scenes from multi-view posed images. The accuracy of novel-view renderings depends heavily on the precision of the input camera poses and the number of input images, limiting its applicability in real-world scenarios. Camera poses of the input views are typically recovered with COLMAP (Schönberger and Frahm 2016) in most works. However, with limited and sparse input views, COLMAP may fail to obtain accurate poses due to wide baselines and insufficient feature matches.

To relax the requirement of accurate input poses, many works estimate or refine poses based on various assumptions. For example, NeRFmm (Wang et al. 2021b) focuses on forward-facing scenes where the baseline is relatively small. BARF (Lin et al. 2022) and SPARF (Truong et al. 2023) assume imperfect initial poses instead of fully accurate ones. GNeRF (Meng et al. 2021) assumes a distribution of camera poses, and the same goes for NeRS (Zhang et al. 2021). Furthermore, recent works (Wang et al. 2021b; Lin et al. 2022; Meng et al. 2021) primarily rely on photometric losses to optimize NeRFs and camera poses. In sparse-view cases, the photometric loss becomes insufficient since the 3D reconstruction is under-constrained. To obtain more constraints, Nope-NeRF (Bian et al. 2023) incorporates monocular depth estimation from dense video frames as additional input. Still, the requirement of dense input frames does not work for few-view cases. SPARF (Truong et al. 2023) is proposed for few-view inputs but still requires reasonable initial poses. Therefore, reconstructing NeRFs without any pose priors in the few-view setting remains challenging.

A traditional way is placing a calibration board in the scene to calibrate accurate poses. However, calibration boards are not easily accessible in everyday scenes. This limitation inspires us to explore the potential of utilizing ubiquitous everyday objects, such as Coke cans or boxes, as calibration probes. Such objects are easily found in photos and offer a practical, low-burden alternative. We adopt SAM to automatically segment the probe object via prompts, and simply use a cube as the shape initialization. We find that most objects with simple shapes can be efficiently employed as pose probes; as shown in Tab. 6, using different objects in a scene leads to only a slight performance change (within 7% in PSNR).

[Figure 2]

In this paper, we introduce a pipeline for NeRF reconstruction from few-view (3 to 6) unposed images. The main idea is leveraging everyday objects as pose probes. As shown in Fig. 2, a dual-branch volume rendering optimization workflow is adopted, targeting the probe object and the entire scene respectively. The object branch uses hybrid volume rendering with a signed distance field (SDF) representation to jointly optimize camera poses and object geometry, where the SDF (the geometry of the PoseProbe) is initialized from a cube and deformed by a DeformNet. Multi-view geometric consistency and multi-layer feature consistency are introduced as training constraints. Similarly, the scene branch learns the scene neural representation and refines the camera poses, also in a self-supervised manner.

Specifically, initial camera poses of two views are first obtained using Perspective-n-Point (PnP) matching. Additional views are then added incrementally using PnP as well. Note that PnP matching requires only a few feature matches and thus works for feature-sparse scenarios, where COLMAP often fails due to insufficient feature matches. As tested in Tab. 7, even when using an identity matrix or very noisy poses (adding 30% noise to PnP poses), the method still achieves comparable performance with only a slight drop in metrics. Once all views are acquired, we enable DeformNet to deform the cube shape into an accurate object shape. Poses are further optimized jointly with DeformNet by both branches to obtain the final results. In this way, we obtain high-quality novel view synthesis and poses, without any pose priors, even for large-baseline and few-view images. As shown in Fig. 1, aided by the proposed pose probe (the Coke can), our method produces realistic novel-view renderings and accurate poses using only three input images, without relying on pose initialization, outperforming both COLMAP-based and COLMAP-free state-of-the-art methods.

The main contributions include:

  • We utilize generic objects as pose calibration probes to tackle challenging feature-sparse scenes using only 3 to 6 images, where COLMAP is inapplicable.

  • We propose an explicit-implicit SDF representation to efficiently bridge CAD initialization and implicit deformations. The whole pipeline is end-to-end differentiable and fully self-supervised.

  • We generate and capture a synthetic and a real dataset, respectively, and compare the proposed method with state-of-the-art methods across three benchmarks, where our method achieves PSNR improvements of 31.9%, 28.0%, and 42.1% in novel view synthesis, along with significant enhancements in pose metrics. The proposed method successfully handles sparse-view scenes where COLMAP experiences a 67% initialization failure rate.

2 Related Works

Radiance fields with pose optimization. The reliance on high-precision camera poses as input restricts the applicability of NeRFs and 3D Gaussian Splatting (Kerbl et al. 2023) (3DGS). Several studies have sought to alleviate this dependency. NeRF-based techniques utilize neural networks to represent the radiance fields and jointly optimize camera parameters, as demonstrated by early approaches (Wang et al. 2021b; Jeong et al. 2021; Lin et al. 2022; Chng et al. 2021). L2G-NeRF (Chen et al. 2023) and LU-NeRF (Cheng et al. 2023) incorporate local-to-global registration and boost noise resilience. Additionally, CamP (Park et al. 2023) proposes using a proxy problem to compute a whitening transform, which helps refine the initial camera poses. Furthermore, NoPe-NeRF (Bian et al. 2023) adopts monocular depth estimation to learn scene representations independent of pose priors, but faces challenges with sparse inputs. Similar to our method, NeRS (Zhang et al. 2021) employs a category-level shape template to effectively model object shapes and textures. However, it requires pose initialization and cannot render entire scenes. SPARF (Truong et al. 2023) addresses the challenge of NeRFs with sparse-view, wide-baseline input images but requires initial camera poses to be close to the ground truth, which limits its applicability in real-world scenarios. 3DGS-based approaches utilize explicit 3D Gaussians rather than neural networks and have been explored in various studies. CF-3DGS (Fu et al. 2024) and COGS (Jiang et al. 2024) leverage monocular depth estimators to assist in registering camera poses. Recent work (Fan et al. 2024) proposes using an off-the-shelf model (Wang et al. 2024) to compute initial camera poses and achieve sparse-view, SfM-free optimization. However, 3DGS requires an initialized point cloud, which is often difficult to obtain in unconstrained scenes with sparse viewpoints and unknown poses. Consequently, 3DGS-based methods generally rely on pretrained vision models (Wang et al. 2024; Ranftl et al. 2020), significantly increasing complexity.

Novel-view synthesis from few views. To address the challenge of requiring dense input views, various regularization techniques shine in few-view learning. DS-NeRF (Deng et al. 2022) utilizes depth supervision to avoid overfitting. Additionally, appearance regularization (Niemeyer et al. 2022), geometry regularization (Song, Kwak, and Kim 2022; Niemeyer et al. 2022), and frequency regularization (Yang, Pavone, and Wang 2023) are introduced to optimize the radiance fields. FSGS (Zhu et al. 2023) and SparseGS (Xiong et al. 2023) utilize monocular depth estimators or diffusion models to enhance Gaussian Splatting in sparse-view scenarios. However, these methods assume the availability of ground-truth camera poses, while Structure-from-Motion algorithms often fail with sparse or few inputs, limiting their practical application. Recent studies (Liu et al. 2023a; Shi et al. 2023; Liu et al. 2024) leverage 2D diffusion models to generate 3D models from a single image, but they still face challenges in scene reconstruction. In our approach, we employ geometry regularization to facilitate scene learning with fewer views.

3 Method

Given sparse (as few as 3) unposed images of a scene, we tackle the challenge of photorealistic novel view synthesis and pose estimation with the novel idea of using common objects as pose probes. Our method does not require any pose initialization, since obtaining initial poses is not always convenient and COLMAP may be inapplicable for few-view and feature-sparse scenes. We propose a dual-branch pipeline, illustrated in Fig. 2, which integrates both neural explicit and implicit volume rendering. In the object branch (Sec. 3.1), we utilize neural volume rendering with a hybrid signed distance field (SDF) to efficiently optimize both camera poses and the object representation. In the scene branch (Sec. 3.2), the scene representation is learned with an implicit NeRF, while the camera poses are optimized simultaneously. The joint training is introduced in Sec. 3.3.

3.1 Object NeRF with pose estimation

Inspired by the fast convergence of explicit representations (Sun, Sun, and Chen 2022; Wu et al. 2022) while preserving high-quality rendering, we design a neural volume rendering framework similar to DVGO (Sun, Sun, and Chen 2022) for the object branch. To recover high-fidelity shapes and precise camera poses, we discard the density voxel grid $\bm{V}^{\sigma} \in \mathbb{R}^{1 \times N_{x} \times N_{y} \times N_{z}}$ and adopt an SDF (Wang et al. 2021a; Fu et al. 2022) as the rendering field. In particular, we design a hybrid explicit-implicit representation of the SDF that assigns any point $\bm{p}$ a scalar $s$:

$\textit{SDF}: \bm{p} \in \mathbb{R}^{3} \longrightarrow s \in \mathbb{R}$    (1)

To better utilize the geometry of the object, the gradient at each point is embedded into the color rendering process:

$\bm{c} = \operatorname{MLP}_{\Theta}\left(\operatorname{interp}(\bm{p}, \bm{V}^{\text{(feat)}}), \bm{n}, \bm{p}, \bm{d}\right).$    (2)

Here, the normal $\bm{n}$ is computed as the normalized gradient of the SDF, and $\bm{d}$ represents the viewing direction. Next, we introduce how the pose probe is used to obtain the initial camera pose of each frame. Following this, we delve into the hybrid explicit-implicit representation of the SDF and discuss strategies for optimizing the neural fields jointly with the camera poses.
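
To make the color branch concrete, the following is a minimal PyTorch sketch of Eqn. 2, assuming the feature grid $\bm{V}^{\text{(feat)}}$ is queried by trilinear interpolation (grid_sample) and that the normal $\bm{n}$ is computed elsewhere from the SDF gradient; the feature dimension and layer sizes are illustrative, not necessarily the exact architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorMLP(nn.Module):
    """Color branch of Eqn. 2: RGB from interpolated voxel features, SDF normal, position, view dir."""
    def __init__(self, feat_dim=12, hidden=128):
        super().__init__()
        in_dim = feat_dim + 3 + 3 + 3  # interp(p, V_feat), n, p, d
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, feat_grid, p, n, d):
        # feat_grid: (1, C, Nx, Ny, Nz); p, n, d: (B, 3), with p normalized to [-1, 1].
        # Note grid_sample's (x, y, z) coordinates index the (W, H, D) axes.
        grid = p.view(1, -1, 1, 1, 3)
        feat = F.grid_sample(feat_grid, grid, align_corners=True)  # (1, C, B, 1, 1)
        feat = feat.squeeze(-1).squeeze(-1).squeeze(0).t()         # (B, C)
        return self.mlp(torch.cat([feat, n, p, d], dim=-1))
```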

Hybrid SDF representation. In the design of the hybrid explicit-implicit SDF generation network, the explicit template field $T$ is a non-learnable voxel grid $\bm{V}^{\text{(sdf)}}$ initialized using the template object, while the implicit deform field $D$ is implemented as MLPs to predict a deformation field and a correction field on top of $T$. The voxel grid $\bm{V}^{\text{(sdf)}}$ is initialized with a similar template, and we find that a coarse mesh (e.g., a cube) is sufficient to learn detailed geometry and appearance. We obtain the SDF values in $\bm{V}^{\text{(sdf)}}$ by calculating the closest distance from each voxel center to the surface and determining whether the point lies inside or outside the object. This process is efficient and takes only a few seconds. The template field $T$ provides a strong prior, reducing the search space from a known baseline and enabling detailed geometry representation with fewer parameters.
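
As an illustration of the template-field initialization, the sketch below fills a voxel grid with the analytic signed distance of an axis-aligned cube (negative inside, positive outside); the resolution, cube size, and normalization bounds are placeholders rather than the paper's settings.

```python
import torch

def cube_sdf_grid(res=128, half_extent=0.3, bound=1.0):
    """Analytic signed distance of an axis-aligned cube sampled on a regular voxel grid.

    A simple stand-in for initializing V_sdf from a cube template: negative inside,
    positive outside. Resolution and sizes here are illustrative.
    """
    xs = torch.linspace(-bound, bound, res)
    pts = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)  # (res, res, res, 3)
    q = pts.abs() - half_extent                   # per-axis distance to the cube's slabs
    outside = q.clamp(min=0.0).norm(dim=-1)       # positive only outside the cube
    inside = q.max(dim=-1).values.clamp(max=0.0)  # negative only inside the cube
    return outside + inside                       # (res, res, res) signed distances
```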

For finer shape details, we use an implicit deformation field $D$ to refine the coarse SDF. While optimizing an explicit SDF correction voxel grid on top of the template field $T$ is a straightforward choice, it limits information sharing and tends toward degenerate solutions, especially with sparse views. In contrast, the implicit field inherently provides a smooth and continuous representation beneficial for capturing fine details and complex deformations. Inspired by (Deng, Yang, and Tong 2021), our deform field $D$ predicts a deformation vector $v$ and a scalar correction value $\Delta s$ for each point $\bm{p}$:

$D: \bm{p} \in \mathbb{R}^{3} \longrightarrow (v, \Delta s) \in \mathbb{R}^{4}$    (3)

The ultimate SDF value of any point is determined by interpolating at its deformed location within the template field $T$, further refined by a correction scalar. Therefore, the SDF value of a point $\bm{p}$ is represented as:

$\text{SDF}(\bm{p}) = T(\bm{p} + D_{v}(\bm{p})) + D_{\Delta s}(\bm{p}) = \operatorname{interp}\left(\bm{p} + v, \bm{V}^{\text{(sdf)}}\right) + \Delta s.$    (4)

The predicted SDF value in our hybrid representation is used to estimate volume opacity. However, directly using the SDF values from Eqn. 4 is not ideal for volume rendering, since their scale is predefined manually. To this end, we propose a mapping function $sdf$ with two learnable parameters that scales the original SDF to a scene-customized scale:

$sdf(\bm{p}) = \beta \left(1/(1 + e^{-\gamma \cdot SDF(\bm{p})}) - 0.5\right),$    (5)

where $\beta$ and $\gamma$ are trainable parameters that control the scale of the SDF voxel grid $\bm{V}^{\text{(sdf)}}$. To ensure that $\beta$ and $\gamma$ remain positive, so the original SDF sign is preserved, we apply the Softplus activation function to them. The parameters $\beta$ and $\gamma$ vary from scene to scene, as illustrated in the supplementary materials. Our hybrid SDF representation merges the advantages of explicit and implicit representations, balancing rapid convergence with detailed modeling.
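
A minimal sketch of the full hybrid SDF query (Eqns. 4-5) might look as follows, assuming the template grid from the previous sketch and a small MLP standing in for DeformNet; the hidden sizes and the initial values of $\beta$ and $\gamma$ are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridSDF(nn.Module):
    """Template voxel SDF plus implicit deformation/correction (Eqn. 4), rescaled by Eqn. 5."""
    def __init__(self, sdf_grid, hidden=128):
        super().__init__()
        # Non-learnable template field T, e.g. the cube SDF grid above; shape (1, 1, Nx, Ny, Nz).
        self.register_buffer("V_sdf", sdf_grid[None, None])
        # DeformNet D: p -> (v, delta_s); a plain MLP here, the real network may differ.
        self.deform = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 4),
        )
        # Raw values for beta/gamma; Softplus keeps them positive, preserving the SDF sign.
        self.beta_raw = nn.Parameter(torch.tensor(0.5))
        self.gamma_raw = nn.Parameter(torch.tensor(0.5))

    def template(self, p):
        grid = p.view(1, -1, 1, 1, 3)  # points assumed normalized to [-1, 1]
        return F.grid_sample(self.V_sdf, grid, align_corners=True).reshape(-1)

    def forward(self, p):
        out = self.deform(p)
        v, delta_s = out[:, :3], out[:, 3]
        raw_sdf = self.template(p + v) + delta_s                   # Eqn. 4
        beta, gamma = F.softplus(self.beta_raw), F.softplus(self.gamma_raw)
        return beta * (torch.sigmoid(gamma * raw_sdf) - 0.5)       # Eqn. 5
```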

Incremental pose optimization. We employ an incremental pose optimization approach, introducing a new image into the training loop at fixed intervals. Given the input images and the corresponding masks of the calibration object, the first image is designated as the reference image $I$. Multiple projection views around the object are sampled to acquire mask images, and the view with the best-matching mask is selected as the initial pose of the first frame. For each newly added frame $I_{i+1}$, we first compute 2D correspondences with the previous image $I_{i}$ using SuperPoint (DeTone, Malisiewicz, and Rabinovich 2018) and SuperGlue (Sarlin et al. 2020). The matching pixels in image $I_{i}$ cast rays to locate the corresponding 3D points on the object, leveraging the optimized pose $P_{i}$ for precise surface positioning; we explain this process in the supplementary. This forms 2D-3D correspondences between the newly added image and the object, allowing PnP with RANSAC to compute the initial pose of image $I_{i+1}$. Finally, the newly added views and the radiance field are jointly optimized.
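
The PnP initialization step can be sketched with OpenCV's solvePnPRansac as below, assuming the 2D-3D correspondences (matched pixels in the new image and their ray-cast surface points from the previous, already-posed view) are computed elsewhere; the RANSAC thresholds are illustrative.

```python
import cv2
import numpy as np

def init_pose_pnp(pts3d, pts2d, K):
    """Initial pose of a newly added frame from 2D-3D correspondences via PnP + RANSAC.

    pts3d: (N, 3) object-surface points hit by rays cast from the previous, already-posed view.
    pts2d: (N, 2) matched pixels in the new image.
    K:     (3, 3) camera intrinsics.
    Returns a 4x4 world-to-camera matrix, or None if PnP fails.
    """
    if len(pts3d) < 4:  # PnP needs only a handful of matches
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        iterationsCount=1000, reprojectionError=4.0, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # axis-angle -> rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```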

Multi-view geometric consistency. Recently, SCNeRF (Jeong et al. 2021) and SPARF (Truong et al. 2023) proposed using reprojection error to enforce consistency between geometry and camera poses. We adopt a more direct multi-view projection distance to constrain the camera poses. Formally, given an image pair $(I_{i}, I_{j})$ and matching pixel pairs $(\mathbf{x}, \mathbf{y})$, we first locate the surface points $(\mathbf{S_{x}}, \mathbf{S_{y}})$ using ray casting. The 3D surface points are then projected back to image coordinates to minimize the distance between correspondences. The geometric circle projection distance of the pair $(\mathbf{x}, \mathbf{y})$ is defined as:

$\mathcal{D}(\mathbf{x}, \mathbf{y}) = \rho\left(\pi(\mathbf{S_{x}}, \hat{P}_{j}) - \mathbf{y}\right) + \rho\left(\pi(\mathbf{S_{y}}, \hat{P}_{i}) - \mathbf{x}\right),$    (6)

where $\pi$ denotes the camera projection function, and $\rho$ denotes the Huber loss function (Hastie et al. 2009). Additionally, based on the prior that rays emitted from feature points should intersect the object, we introduce a regularization term that minimizes the distance between these rays and the surface of the pose probe to refine the camera poses:

$\mathcal{L}_{\text{dist}}(\bm{r}, \bm{o}) = \max(\text{dis}(\bm{r}, \bm{o}) - L, 0),$    (7)

where $\text{dis}(\bm{r}, \bm{o})$ denotes the shortest distance from the object center $\bm{o}$ to the ray $\bm{r}$, and $L$ represents the maximum radius of the object. Finally, our multi-view geometric consistency objective is formulated as:

$\mathcal{L}_{\text{geo}}(\hat{\mathcal{P}}) = \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{V}} w_{\mathbf{x}} \, \mathcal{D}(\mathbf{x}, \mathbf{y}) + \lambda \, \mathcal{L}_{\text{dist}}(\bm{r_{x,y}}, \bm{c}).$    (8)

Here, $w_{\mathbf{x}}$ represents the matching confidence associated with the pair $(\mathbf{x}, \mathbf{y})$, and $\lambda$ is set to 10.
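
A possible PyTorch implementation of the geometric consistency terms (Eqns. 6-8) is sketched below, assuming world-to-camera pose matrices, unit-norm ray directions, and surface points already obtained by ray casting; the Huber threshold (PyTorch's default) is an assumption.

```python
import torch
import torch.nn.functional as F

def project(X, pose_w2c, K):
    """Project world points X (B, 3) with a 4x4 world-to-camera pose and 3x3 intrinsics K."""
    Xc = (pose_w2c[:3, :3] @ X.t() + pose_w2c[:3, 3:4]).t()
    uv = (K @ Xc.t()).t()
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

def geo_consistency_loss(Sx, Sy, x_px, y_px, pose_i, pose_j, K, w,
                         rays_o, rays_d, obj_center, obj_radius, lam=10.0):
    # Circle projection distance (Eqn. 6), weighted by matching confidence w (Eqn. 8).
    d_xy = F.huber_loss(project(Sx, pose_j, K), y_px, reduction="none").sum(-1) \
         + F.huber_loss(project(Sy, pose_i, K), x_px, reduction="none").sum(-1)
    loss_geo = (w * d_xy).sum()
    # Ray-to-probe regularizer (Eqn. 7): rays from feature points should pass near the object.
    to_center = obj_center - rays_o                              # rays_d assumed unit-norm
    closest = rays_o + (to_center * rays_d).sum(-1, keepdim=True) * rays_d
    dist = (closest - obj_center).norm(dim=-1)
    return loss_geo + lam * (dist - obj_radius).clamp(min=0.0).sum()
```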

Multi-layer feature-metric consistency. Geometric consistency facilitates rapid convergence in camera pose optimization; however, mismatches can produce misleading supervisory signals, potentially trapping the optimization in local optima. Inspired by dense bundle adjustment (Tang and Tan 2018), we introduce a multi-layer feature-metric consistency. This constraint minimizes the feature difference of aligned pixels using dot-product similarity. The multi-layer feature-metric error associated with pixel $\mathbf{x}$ is formulated as:

$e_{\mathbf{x}} = \sum_{k=1}^{M} 1 - \cos\left(F_{j,k}(\pi(\mathbf{S_{x}}, \hat{P}_{j})), \, F_{i,k}(\mathbf{x})\right),$    (9)

where $\mathbb{F} = \{F_{i,k} \,|\, i = 1 \ldots N,\ k = 1 \ldots M\}$ are the multi-layer image features extracted by a pretrained VGG (Simonyan and Zisserman 2015). Here, $N$ denotes the number of images, and $M$ is the number of layers. Our feature-metric loss is defined as $\mathcal{L}_{\text{fea}}(\hat{\mathcal{P}}) = \sum_{\mathbf{x} \in \mathcal{V}} \gamma_{\mathbf{x}} e_{\mathbf{x}}$. We incorporate a visibility mask $\gamma \in [0, 1]$ to remove points that are out of view or occluded from the other perspective. Points whose projected pixels fall outside the object masks are considered out of view, while points with invalid depth values are treated as occluded. This constraint considers more image pixels rather than focusing solely on the keypoints used in the geometric consistency. In contrast to photometric error, which is sensitive to initialization and increases non-convexity (Engel, Koltun, and Cremers 2017), our feature-based consistency loss provides smoother optimization.
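
The feature-metric term (Eqn. 9) could be implemented roughly as follows, assuming VGG-16 features at a few early layers and bilinear sampling of feature maps at the projected locations; the chosen layers and the (x, y) pixel convention are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

# A few VGG-16 feature maps serve as the multi-layer features F_{i,k}; layer picks are illustrative.
vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).features.eval()
LAYERS = [3, 8, 15]  # relu1_2, relu2_2, relu3_3

def vgg_features(img):
    """img: (1, 3, H, W) in [0, 1]; returns feature maps at the chosen layers."""
    feats, x = [], img
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in LAYERS:
            feats.append(x)
    return feats

def sample_feat(feat, px, hw):
    """Bilinearly sample a (1, C, h, w) feature map at pixel coords px (B, 2), ordered (x, y)."""
    H, W = hw
    grid = torch.stack([2 * px[:, 0] / (W - 1) - 1, 2 * px[:, 1] / (H - 1) - 1], dim=-1)
    out = F.grid_sample(feat, grid.view(1, -1, 1, 2), align_corners=True)  # (1, C, B, 1)
    return out.squeeze(-1).squeeze(0).t()                                   # (B, C)

def feature_metric_loss(feats_i, feats_j, x_px, x_proj, hw, visible):
    """Eqn. 9: 1 - cosine similarity between features of x in view i and its projection in view j."""
    loss = 0.0
    for Fi, Fj in zip(feats_i, feats_j):
        fi, fj = sample_feat(Fi, x_px, hw), sample_feat(Fj, x_proj, hw)
        loss = loss + (visible * (1 - F.cosine_similarity(fi, fj, dim=-1))).sum()
    return loss
```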

3.2 Scene NeRF with pose refinement

While training the object NeRF, we simultaneously train a scene NeRF branch. The aim is to learn a neural scene representation while fine-tuning the camera poses. To validate the effectiveness of our proposed modules, we employ a baseline NeRF model with coarse-to-fine positional encoding (Lin et al. 2022). We also use the projection distance loss (Eqn. 6) as an additional constraint in the scene branch. Furthermore, we observe that adding a depth smoothness prior enhances the geometric perception of the scene. Analogous to RegNeRF (Niemeyer et al. 2022), a depth total variation loss over small patches is introduced:

$\mathcal{L}_{\text{Ds}}(\theta, \mathcal{R}) = \sum_{\bm{r} \in \mathcal{R}} \sum_{i,j=1}^{K-1} \left(\hat{d}_{\theta}(\bm{r}_{i,j}) - \hat{d}_{\theta}(\bm{r}_{i+1,j})\right)^{2} + \left(\hat{d}_{\theta}(\bm{r}_{i,j}) - \hat{d}_{\theta}(\bm{r}_{i,j+1})\right)^{2},$    (10)

where $\mathcal{R}$ is the set of sampled rays, $\hat{d}_{\theta}(\bm{r}_{i,j})$ is the predicted depth of the ray through pixel $(i, j)$, and $K$ is the patch size.
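
A compact sketch of this depth total-variation prior (Eqn. 10), assuming depths have been rendered for a batch of sampled K×K patches:

```python
import torch

def depth_smoothness_loss(depth_patches):
    """Depth total-variation prior of Eqn. 10; depth_patches: (R, K, K) depths of sampled patches."""
    dy = depth_patches[:, 1:, :] - depth_patches[:, :-1, :]  # vertical neighbours
    dx = depth_patches[:, :, 1:] - depth_patches[:, :, :-1]  # horizontal neighbours
    return (dy ** 2).sum() + (dx ** 2).sum()
```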

[Figure 3]

3.3 Joint training

The final training objective consists of all losses for the object NeRF and the scene NeRF: $\mathcal{L} = \lambda \mathcal{L}_{\text{Obj}} + \mathcal{L}_{\text{Sce}}$.

Object NeRF. To encourage smoother deformation and prevent large shape distortion, we incorporate a smoothness loss on the deformation field and a minimal-correction prior (Deng, Yang, and Tong 2021) for the correction field:

$\mathcal{L}_{\text{d}} = \sum_{\bm{p} \in \Omega} \sum_{d \in \{X, Y, Z\}} \left\|\nabla D_{v}|_{d}(\bm{p})\right\|_{2} + \sum_{\bm{p} \in \Omega} \left|D_{\Delta s}(\bm{p})\right|.$    (11)

Besides, we add an Eikonal term (Gropp et al. 2020) to regularize the SDF:

$\mathcal{L}_{\text{r}} = \sum_{\bm{p} \in \Omega} \left|\,\left\|\nabla sdf(\bm{p})\right\|_{2} - 1\right|.$    (12)
$\mathcal{L}_{\text{Obj}} = \mathcal{L}_{\text{rgb}} + \lambda_{1}\mathcal{L}_{\text{m}} + \lambda_{2}\mathcal{L}_{\text{geo}} + \lambda_{3}\mathcal{L}_{\text{fea}} + \lambda_{4}\mathcal{L}_{\text{d}} + \lambda_{5}\mathcal{L}_{\text{r}},$    (13)

where $\mathcal{L}_{\text{rgb}}$ and $\mathcal{L}_{\text{m}}$ represent the photometric $l_{2}$ loss and the mask $l_{1}$ loss, respectively.
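
The two object-branch regularizers can be sketched as follows, with the Eikonal term (Eqn. 12) computed via autograd and the deformation smoothness of Eqn. 11 approximated by finite differences; the step size and the exact form of the gradient penalty are assumptions.

```python
import torch

def eikonal_loss(sdf_fn, points):
    """Eqn. 12: encourage |grad sdf| = 1 at sampled points, with the gradient from autograd."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_fn(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return (grad.norm(dim=-1) - 1).abs().sum()

def deform_regularizer(deform_net, points, eps=1e-3):
    """Eqn. 11: smooth deformations (finite-difference gradient) plus minimal corrections."""
    out = deform_net(points)
    v, delta_s = out[:, :3], out[:, 3]
    loss = delta_s.abs().sum()                  # minimal-correction prior
    for axis in range(3):                       # finite differences along X, Y, Z
        offset = torch.zeros_like(points)
        offset[:, axis] = eps
        v_shift = deform_net(points + offset)[:, :3]
        loss = loss + ((v_shift - v) / eps).norm(dim=-1).sum()
    return loss
```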

Scene NeRF. In the scene training stage, the total loss is:

$\mathcal{L}_{\text{Sce}} = \mathcal{L}_{\text{rgb}} + \lambda_{6}\mathcal{L}_{\text{Ds}}.$    (14)

All $\lambda$s denote the balancing weights for the corresponding loss terms; their values can be found in the supplementary.

4 Experiments

In this section, we compare against state-of-the-art baselines for camera pose estimation and novel view synthesis in few-view (3~6) settings on multiple benchmarks. Furthermore, we conduct a series of ablations to assess the effectiveness and robustness of key components. Please refer to the supplementary PDF and videos for more results and details.

4.1 Experimental settings

Datasets. We propose a synthetic dataset (ShapenetScene) and a real-life dataset (CokeBox). The former provides a benchmark with precise poses for quantitative evaluation of our method, while the latter demonstrates its practical applicability. Additionally, we conduct experiments on the ToyDesk (Yang et al. 2021) and DTU (Jensen et al. 2014) benchmarks. ShapenetScene is generated using BlenderProc (Denninger et al. 2023) and comprises six scenes rendered jointly from SceneNet (Handa et al. 2016) and ShapeNet (Chang et al. 2015). Each scene includes 100 RGB images and corresponding mask images captured 360° around the object. CokeBox contains four sets of densely posed images with 2D instance segmentation of the calibration objects. We use COLMAP (Schönberger and Frahm 2016) and Grounded-SAM (Kirillov et al. 2023; Liu et al. 2023b) to recover pseudo ground truth camera poses and mask images. ToyDesk, introduced by Object-NeRF (Yang et al. 2021), contains posed images and 2D instance segmentation masks. The images are partitioned into training and testing sets for training and evaluation. For DTU, we follow the dataset splitting protocol of SPARF (Truong et al. 2023) to separate the training and testing sets.
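
For reference, probe masks in the spirit of Grounded-SAM can be obtained with the segment-anything API roughly as follows, assuming a bounding-box prompt from an open-set detector (such as Grounding DINO) or a user; the checkpoint path and model variant are placeholders.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def probe_mask(image_rgb, box_xyxy, checkpoint="sam_vit_h.pth"):
    """Segment the pose-probe object given an image and a bounding-box prompt.

    image_rgb: (H, W, 3) uint8 RGB array; box_xyxy: [x0, y0, x1, y1] from a detector or a user.
    The checkpoint path is a placeholder.
    """
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(box=np.array(box_xyxy), multimask_output=False)
    return masks[0]  # (H, W) boolean mask of the probe object
```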

Metrics. For camera pose evaluation, we report the average rotation and translation errors after aligning the optimized poses with the ground truth. For novel view synthesis, we report PSNR, SSIM (Wang et al. 2004), and LPIPS (Zhang et al. 2018) (with AlexNet (Krizhevsky, Sutskever, and Hinton 2012)). We also report the Average metric (the geometric mean of $10^{-\mathrm{PSNR}/10}$, $\sqrt{1-\mathrm{SSIM}}$, and LPIPS) following (Yang, Pavone, and Wang 2023).
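
The Average metric reduces to a one-liner; the sketch below assumes PSNR in dB and SSIM/LPIPS in [0, 1].

```python
import math

def average_metric(psnr, ssim, lpips):
    """Geometric mean of 10^(-PSNR/10), sqrt(1 - SSIM) and LPIPS (lower is better)."""
    return (10 ** (-psnr / 10.0) * math.sqrt(1.0 - ssim) * lpips) ** (1.0 / 3.0)

# e.g. average_metric(23.11, 0.68, 0.48) ~= 0.11, matching the scale of the Average columns in the tables.
```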

4.2 Comparison with State-of-the-arts

Table 1. Comparison on ShapenetScene with 3 and 6 input views (each cell: 3-view / 6-view).

Method | Rot. ↓ | Trans. ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Average ↓
Nope-NeRF | 14.39 / 16.18 | 8.61 / 19.90 | 12.90 / 14.40 | 0.46 / 0.54 | 0.68 / 0.68 | 0.30 / 0.26
SCNeRF | 10.95 / 9.88 | 7.72 / 14.65 | 16.39 / 16.74 | 0.51 / 0.53 | 0.58 / 0.55 | 0.21 / 0.20
BARF | 8.25 / 13.15 | 10.53 / 10.02 | 17.95 / 18.97 | 0.56 / 0.58 | 0.65 / 0.64 | 0.18 / 0.17
SPARF | 8.41 / 14.48 | 16.27 / 21.45 | 18.29 / 16.57 | 0.65 / 0.56 | 0.55 / 0.58 | 0.18 / 0.20
CF-3DGS | 56.10 / 35.69 | 27.32 / 20.81 | 16.74 / 18.31 | 0.49 / 0.65 | 0.52 / 0.47 | 0.20 / 0.16
Ours | 0.72 / 0.70 | 1.89 / 1.06 | 23.11 / 26.08 | 0.68 / 0.79 | 0.48 / 0.35 | 0.11 / 0.07

We compare our method against state-of-the-art pose-free methods, including BARF (Lin et al. 2022), SCNeRF (Jeong et al. 2021), Nope-NeRF (Bian et al. 2023), SPARF (Truong et al. 2023), as well as CF-3DGS (Fu et al. 2024).

Table 2. Comparison on ShapenetScene with noisy initial poses (15% Gaussian noise added to the ground-truth poses).

Method | Rot. ↓ | Trans. ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Average ↓
Nope-NeRF | 15.67 | 16.03 | 14.46 | 0.55 | 0.68 | 0.25
SCNeRF | 3.59 | 6.49 | 19.65 | 0.61 | 0.41 | 0.14
BARF | 10.66 | 27.43 | 16.41 | 0.52 | 0.66 | 0.22
SPARF | 6.04 | 10.65 | 21.21 | 0.64 | 0.51 | 0.16
Ours | 1.31 | 2.73 | 25.07 | 0.73 | 0.38 | 0.09

[Figure 4]
[Figure 5]

Results on ShapenetScene. We evaluate our method and the baselines with 3 and 6 input views. For a fair comparison, the camera poses derived via PnP in our method serve as the initial poses for all NeRF baselines; they exhibit average rotation and translation errors of approximately 35° and 70, respectively. As shown in Tab. 1 and Fig. 3, we observe that most baselines fail to register poses accurately and produce poor novel views, as they rely on good initial poses or dense input views. In Fig. 4, we display the optimized poses of one scene. To further validate the robustness of our method, we conduct an experiment akin to SPARF by adding 15% additive Gaussian noise to the ground truth poses as initial estimates, and compare with state-of-the-art methods including BARF, Nope-NeRF, SCNeRF, and SPARF. The perturbed camera poses have average rotation and translation errors of around 15° and 45, respectively. Quantitative results are presented in Tab. 2. BARF and Nope-NeRF continue to struggle with optimizing camera poses, resulting in poor rendering quality. The geometric losses utilized by SCNeRF and SPARF facilitate improved learning of camera poses. However, SCNeRF faces challenges when rendering with few views, and SPARF similarly struggles with sparser input images. In contrast, our method achieves more accurate pose estimation and more realistic renderings both from scratch and from noisy poses, resulting in higher-quality novel views.
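
For context, the noisy-pose experiment can be reproduced in spirit with a perturbation like the sketch below, which applies a random axis-angle rotation and a translation offset to each ground-truth camera-to-world pose; the exact noise model (how "15%" is defined) follows SPARF and is an assumption here.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_pose(pose_c2w, rot_deg=15.0, trans_frac=0.15, rng=None):
    """Apply synthetic rotation/translation noise to a 4x4 camera-to-world pose."""
    rng = np.random.default_rng() if rng is None else rng
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rot_deg * rng.uniform(-1.0, 1.0))          # random axis-angle rotation
    noisy = pose_c2w.copy()
    noisy[:3, :3] = R.from_rotvec(angle * axis).as_matrix() @ pose_c2w[:3, :3]
    # Translation noise scaled relative to the camera's distance from the scene origin.
    noisy[:3, 3] += trans_frac * np.linalg.norm(pose_c2w[:3, 3]) * rng.normal(size=3)
    return noisy
```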

Table 3. Comparison on DTU with 6 input views.

Method | Rot. ↓ | Trans. ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Average ↓
BARF | 15.37 | 45.13 | 10.08 | 0.29 | 0.70 | 0.39
SCNeRF | 11.63 | 19.13 | 12.62 | 0.45 | 0.57 | 0.28
SPARF | 9.66 | 22.89 | 14.34 | 0.50 | 0.51 | 0.25
CF-3DGS | 90.68 | 30.41 | 10.84 | 0.39 | 0.53 | 0.32
Ours | 1.27 | 3.82 | 18.32 | 0.61 | 0.38 | 0.15

Results on DTU. We test on the DTU dataset with 6 input views. The PnP camera poses are used as the initial poses for the NeRF baselines to ensure a fair comparison. As shown in Tab. 3 and Fig. 5, our method performs better in both pose estimation and novel-view synthesis. All baselines suffer from blurriness and inaccurate scene geometry, while our approach produces results closer to the ground truth thanks to the pose probe constraint.

Results on real-life datasets. We conduct qualitative and quantitative evaluations in Fig. 6 and Tab. 4, comparing with state-of-the-art methods (BARF, SCNeRF, and SPARF) on the CokeBox and ToyDesk datasets, using only 3 input views. Pseudo ground truth camera poses are recovered from dense image sequences via COLMAP to facilitate training and evaluation. These pseudo poses are used as initial poses for all baselines, while our approach operates independently of initial poses. Notably, BARF and SCNeRF struggle with view synthesis despite having COLMAP poses. In comparison to SPARF, which is also initialized with COLMAP poses, our method demonstrates superior performance. For a more intuitive comparison, we also present the results of the baselines initialized with identical poses in Tab. 4.

Table 4. Comparison on the real-life datasets (CokeBox and ToyDesk) with 3 input views. Values in parentheses are obtained when the baselines are initialized with identical poses instead of COLMAP poses.

Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Average ↓
SCNeRF | 17.88 (12.08) | 0.69 (0.49) | 0.34 (0.56) | 0.15 (0.29)
BARF | 19.85 (14.08) | 0.53 (0.56) | 0.33 (0.41) | 0.14 (0.22)
SPARF | 24.10 (18.25) | 0.66 (0.69) | 0.27 (0.35) | 0.08 (0.15)
Ours | 25.95 | 0.76 | 0.23 | 0.07

Table 5. Ablation of key components with 6 input views on ShapenetScene.

Variant | Rot. ↓ | Trans. ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Average ↓
w/o Incre. | 11.83 | 10.53 | 17.54 | 0.62 | 0.64 | 0.191
w/o $\mathcal{L}_{\text{Geo}}$ | 12.85 | 12.37 | 17.02 | 0.61 | 0.66 | 0.200
w/o $\mathcal{L}_{\text{Fea}}$ | 2.13 | 3.22 | 25.31 | 0.78 | 0.36 | 0.079
w/o $\mathcal{L}_{\text{Ds}}$ | 0.72 | 1.77 | 25.57 | 0.78 | 0.35 | 0.077
w/o DeformNet | 3.14 | 8.56 | 23.74 | 0.76 | 0.39 | 0.093
Full Model | 0.70 | 1.06 | 26.08 | 0.79 | 0.35 | 0.073

4.3 Ablations and analysis

Effectiveness of proposed components. As shown in Tab. 5, we ablate the key modules using 6 input views on ShapenetScene. Incremental pose optimization improves the initial poses of new frames by using the optimized poses from previous frames, making overall pose alignment easier; removing this strategy results in a significant drop in performance. The geometric consistency loss (Eqn. 8) is crucial for guiding camera pose optimization, while the feature consistency loss (Eqn. 9) further refines the precision of pose estimation. Omitting these two constraints causes a noticeable decline in performance, as inaccurate poses result in poor novel view synthesis. Depth smoothness regularization (Eqn. 10) enhances image quality with minimal impact on pose optimization. Furthermore, DeformNet is integral to our framework, demonstrating that more accurate geometric constraints yield more precise camera poses and thereby higher-quality novel view synthesis.

[Figure 6]

Table 6. Impact of different pose probes.

Probe | Rot. ↓ | Trans. ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Average ↓
Candy | 1.91 | 1.36 | 18.11 | 0.65 | 0.41 | 0.156
Face | 1.57 | 0.87 | 19.17 | 0.68 | 0.39 | 0.139
Dragon | 0.65 | 0.83 | 19.52 | 0.69 | 0.38 | 0.135

Impact of different pose probes. To investigate the impact of different pose probes, we use the scene shown in the last row of Fig. 6, which contains multiple partially observed objects. We alternately use the toys (Candy, Face, and Dragon) in the scene as pose probes, with all shapes initialized as cubes. As shown in Tab. 6, all pose probes work effectively, with Dragon achieving the lowest pose errors owing to its richer features.

Robustness to initial poses and matching pairs. Our method utilizes PnP to compute the initial poses of new frames but does not rely on it. We conduct experiments with 3 input views using various pose initialization strategies, as detailed in Tab. 7. Our method maintains comparable performance when using the previous frame's pose as the initialization (identical poses) and remains effective even with large Gaussian noise added to the ground truth poses. PnP initialization accelerates pose convergence, reducing the number of required optimization iterations.

Additionally, we compare the robustness of COLMAP and our PnP scheme by categorizing the data into sparse (3 views) and dense (6 views) splits. As illustrated in Tab. 8, the state-of-the-art COLMAP with SuperPoint and SuperGlue (COLMAP-SP-SG) often fails on the sparse split due to an insufficient number of feature pairs for pose initialization. Moreover, we verify that our method remains effective even when using only half of the matching pairs, demonstrating that our approach is less dependent on feature matching than COLMAP. PnP operates reliably with significantly fewer feature pairs, making it effective for both sparse and dense views. It is worth noting that the COLMAP poses are further refined using SPARF (Truong et al. 2023).

Table 7. Robustness to pose initialization with 3 input views.

Pose init. | Iterations | Rot. ↓ | Trans. ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Average ↓
Identical | 5k | 1.11 | 3.15 | 22.82 | 0.67 | 0.48 | 0.114
30% noise | 5k | 2.84 | 7.16 | 22.25 | 0.63 | 0.56 | 0.131
25% noise | 5k | 0.81 | 1.82 | 22.91 | 0.67 | 0.50 | 0.114
15% noise | 5k | 0.80 | 2.30 | 22.59 | 0.66 | 0.49 | 0.117
PnP | 3k | 0.72 | 1.89 | 23.11 | 0.68 | 0.48 | 0.111

Table 8. Robustness of pose initialization on the sparse (3-view) and dense (6-view) splits (each cell: sparse / dense). SR denotes the success rate.

Pose init. | Rot. ↓ | Trans. ↓ | SR ↑ | Matches
COLMAP | - / 3.38 | - / 8.82 | 0.0% / 83% | 202 / 2271
COLMAP-SP-SG | 10.24 / 13.58 | 11.61 / 2.32 | 33% / 100% | 499 / 3208
Ours-50% | 1.97 / 1.48 | 2.72 / 2.91 | 100% / 100% | 137 / 392
Ours | 0.72 / 0.70 | 1.89 / 1.06 | 100% / 100% | 274 / 783

5 Conclusion

We propose PoseProbe, a novel pipeline that uses common objects as calibration probes for joint pose-NeRF training, tailored to challenging few-view, large-baseline scenarios where COLMAP is infeasible. A main limitation is that our method only applies to scenarios where a calibration object is present in all input images. We will explore utilizing multiple pose probes to address this limitation.

References

  • Bian etal. (2023)Bian, W.; Wang, Z.; Li, K.; Bian, J.-W.; and Prisacariu, V.A. 2023.Nope-nerf: Optimising neural radiance field with no pose prior.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4160–4169.
  • Chang etal. (2015)Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; etal. 2015.Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012.
  • Chen etal. (2023)Chen, Y.; Chen, X.; Wang, X.; Zhang, Q.; Guo, Y.; Shan, Y.; and Wang, F. 2023.Local-to-global registration for bundle-adjusting neural radiance fields.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8264–8273.
  • Cheng etal. (2023)Cheng, Z.; Esteves, C.; Jampani, V.; Kar, A.; Maji, S.; and Makadia, A. 2023.LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs.arXiv preprint arXiv:2306.05410.
  • Chng etal. (2021)Chng, S.-F.; Ramasinghe, S.; Sherrah, J.; and Lucey, S. 2021.GARF: Gaussian Activated Radiance Fields for High Fidelity Reconstruction and Pose Estimation.In ICCV.
  • Deng etal. (2022)Deng, K.; Liu, A.; Zhu, J.-Y.; and Ramanan, D. 2022.Depth-supervised NeRF: Fewer Views and Faster Training for Free.In CVPR.
  • Deng, Yang, and Tong (2021)Deng, Y.; Yang, J.; and Tong, X. 2021.Deformed implicit field: Modeling 3d shapes with learned dense correspondence.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10286–10296.
  • Denninger etal. (2023)Denninger, M.; Winkelbauer, D.; Sundermeyer, M.; Boerdijk, W.; Knauer, M.; Strobl, K.H.; Humt, M.; and Triebel, R. 2023.BlenderProc2: A Procedural Pipeline for Photorealistic Rendering.Journal of Open Source Software, 8(82): 4901.
  • DeTone, Malisiewicz, and Rabinovich (2018)DeTone, D.; Malisiewicz, T.; and Rabinovich, A. 2018.SuperPoint: Self-Supervised Interest Point Detection and Description.
  • Engel, Koltun, and Cremers (2017)Engel, J.; Koltun, V.; and Cremers, D. 2017.Direct sparse odometry.IEEE transactions on pattern analysis and machine intelligence, 40(3): 611–625.
  • Fan etal. (2024)Fan, Z.; Cong, W.; Wen, K.; Wang, K.; Zhang, J.; Ding, X.; Xu, D.; Ivanovic, B.; Pavone, M.; Pavlakos, G.; etal. 2024.Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds.arXiv preprint arXiv:2403.20309.
  • Fu etal. (2022)Fu, Q.; Xu, Q.; Ong, Y.S.; and Tao, W. 2022.Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction.Advances in Neural Information Processing Systems, 35: 3403–3416.
  • Fu etal. (2024)Fu, Y.; Liu, S.; Kulkarni, A.; Kautz, J.; Efros, A.A.; and Wang, X. 2024.COLMAP-Free 3D Gaussian Splatting.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20796–20805.
  • Gropp etal. (2020)Gropp, A.; Yariv, L.; Haim, N.; Atzmon, M.; and Lipman, Y. 2020.Implicit geometric regularization for learning shapes.arXiv preprint arXiv:2002.10099.
  • Handa etal. (2016)Handa, A.; Pătrăucean, V.; Stent, S.; and Cipolla, R. 2016.Scenenet: An annotated model generator for indoor scene understanding.In 2016 IEEE International Conference on Robotics and Automation (ICRA), 5737–5743. IEEE.
  • Hastie etal. (2009)Hastie, T.; Tibshirani, R.; Friedman, J.H.; and Friedman, J.H. 2009.The elements of statistical learning: data mining, inference, and prediction, volume2.Springer.
  • Jensen etal. (2014)Jensen, R.; Dahl, A.; Vogiatzis, G.; Tola, E.; and Aanæs, H. 2014.Large scale multi-view stereopsis evaluation.In Proceedings of the IEEE conference on computer vision and pattern recognition, 406–413.
  • Jeong etal. (2021)Jeong, Y.; Ahn, S.; Choy, C.; Anandkumar, A.; Cho, M.; and Park, J. 2021.Self-calibrating neural radiance fields.In Proceedings of the IEEE/CVF International Conference on Computer Vision, 5846–5854.
  • Jiang etal. (2024)Jiang, K.; Fu, Y.; VarmaT, M.; Belhe, Y.; Wang, X.; Su, H.; and Ramamoorthi, R. 2024.A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose.SIGGRAPH.
  • Kerbl etal. (2023)Kerbl, B.; Kopanas, G.; Leimkühler, T.; and Drettakis, G. 2023.3D Gaussian Splatting for Real-Time Radiance Field Rendering.ACM Trans. Graph., 42(4): 139–1.
  • Kirillov etal. (2023)Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; Dollár, P.; and Girshick, R. 2023.Segment Anything.arXiv:2304.02643.
  • Krizhevsky, Sutskever, and Hinton (2012)Krizhevsky, A.; Sutskever, I.; and Hinton, G.E. 2012.Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25.
  • Lin etal. (2022)Lin, C.-H.; Ma, W.-C.; Torralba, A.; and Lucey, S. 2022.BARF: Bundle-Adjusting Neural Radiance Fields.In ECCV.
  • Liu etal. (2024)Liu, M.; Xu, C.; Jin, H.; Chen, L.; VarmaT, M.; Xu, Z.; and Su, H. 2024.One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization.Advances in Neural Information Processing Systems, 36.
  • Liu etal. (2023a)Liu, R.; Wu, R.; VanHoorick, B.; Tokmakov, P.; Zakharov, S.; and Vondrick, C. 2023a.Zero-1-to-3: Zero-shot one image to 3d object.In Proceedings of the IEEE/CVF international conference on computer vision, 9298–9309.
  • Liu etal. (2023b)Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; etal. 2023b.Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499.
  • Meng etal. (2021)Meng, Q.; Chen, A.; Luo, H.; Wu, M.; Su, H.; Xu, L.; He, X.; and Yu, J. 2021.GNeRF: GAN-based Neural Radiance Field without Posed Camera.In ICCV.
  • Niemeyer etal. (2022)Niemeyer, M.; Barron, J.T.; Mildenhall, B.; Sajjadi, M. S.M.; Geiger, A.; and Radwan, N. 2022.RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs.In CVPR.
  • Park etal. (2023)Park, K.; Henzler, P.; Mildenhall, B.; Barron, J.T.; and Martin-Brualla, R. 2023.CamP: Camera Preconditioning for Neural Radiance Fields.ACM Trans. Graph., 42(6).
  • Ranftl etal. (2020)Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; and Koltun, V. 2020.Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • Sarlin etal. (2020)Sarlin, P.-E.; DeTone, D.; Malisiewicz, T.; and Rabinovich, A. 2020.Superglue: Learning feature matching with graph neural networks.In CVPR.
  • Schönberger and Frahm (2016)Schönberger, J.L.; and Frahm, J.-M. 2016.Structure-from-Motion Revisited.In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4104–4113.
  • Shi etal. (2023)Shi, R.; Chen, H.; Zhang, Z.; Liu, M.; Xu, C.; Wei, X.; Chen, L.; Zeng, C.; and Su, H. 2023.Zero123++: a single image to consistent multi-view diffusion base model.arXiv preprint arXiv:2310.15110.
  • Simonyan and Zisserman (2015)Simonyan, K.; and Zisserman, A. 2015.Very deep convolutional networks for large-scale image recognition.In 3rd International Conference on Learning Representations (ICLR 2015). Computational and Biological Learning Society.
  • Song, Kwak, and Kim (2022)Song, J.; Kwak, M.-S.; and Kim, S. 2022.Neural Radiance Fields with Geometric Consistency for Few-Shot Novel View Synthesis.
  • Sun, Sun, and Chen (2022)Sun, C.; Sun, M.; and Chen, H.-T. 2022.Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5459–5469.
  • Tang and Tan (2018)Tang, C.; and Tan, P. 2018.Ba-net: Dense bundle adjustment network.arXiv preprint arXiv:1806.04807.
  • Truong etal. (2023)Truong, P.; Rakotosaona, M.-J.; Manhardt, F.; and Tombari, F. 2023.Sparf: Neural radiance fields from sparse and noisy poses.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4190–4200.
  • Wang etal. (2021a)Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura, T.; and Wang, W. 2021a.Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction.arXiv preprint arXiv:2106.10689.
  • Wang etal. (2024)Wang, S.; Leroy, V.; Cabon, Y.; Chidlovskii, B.; and Revaud, J. 2024.DUSt3R: Geometric 3D Vision Made Easy.In CVPR.
  • Wang etal. (2004)Wang, Z.; Bovik, A.C.; Sheikh, H.R.; and Simoncelli, E.P. 2004.Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4): 600–612.
  • Wang etal. (2021b)Wang, Z.; Wu, S.; Xie, W.; Chen, M.; and Prisacariu, V.A. 2021b.NeRF--: Neural Radiance Fields Without Known Camera Parameters.arXiv preprint arXiv:2102.07064.
  • Wu etal. (2022)Wu, T.; Wang, J.; Pan, X.; Xu, X.; Theobalt, C.; Liu, Z.; and Lin, D. 2022.Voxurf: Voxel-based efficient and accurate neural surface reconstruction.arXiv preprint arXiv:2208.12697.
  • Xiong etal. (2023)Xiong, H.; Muttukuru, S.; Upadhyay, R.; Chari, P.; and Kadambi, A. 2023.SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting.
  • Yang etal. (2021)Yang, B.; Zhang, Y.; Xu, Y.; Li, Y.; Zhou, H.; Bao, H.; Zhang, G.; and Cui, Z. 2021.Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering.In International Conference on Computer Vision (ICCV).
  • Yang, Pavone, and Wang (2023)Yang, J.; Pavone, M.; and Wang, Y. 2023.FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8254–8263.
  • Zhang etal. (2021)Zhang, J.; Yang, G.; Tulsiani, S.; and Ramanan, D. 2021.Ners: Neural reflectance surfaces for sparse-view 3d reconstruction in the wild.Advances in Neural Information Processing Systems, 34: 29835–29847.
  • Zhang etal. (2018)Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; and Wang, O. 2018.The Unreasonable Effectiveness of Deep Features as a Perceptual Metric.In CVPR.
  • Zhu etal. (2023)Zhu, Z.; Fan, Z.; Jiang, Y.; and Wang, Z. 2023.FSGS: Real-Time Few-Shot View Synthesis using Gaussian Splatting.arXiv:2312.00451.