Hand-drawn scenes are not 3D consistent, so we create Toon3D to recover camera poses and dense geometry! We do this with a piecewise-rigid deformation optimization at hand-labeled keypoints and guide the process with structure inferred from monocular depth predictions. Now we can interpolate novel views never before seen! Press the button to move the cameras between two viewpoints! Note that we reconstruct the scenes with more than two hand-drawn images, but this demo shows a smooth transition between just two of the input views.
We recover the underlying 3D structure from images of cartoons and anime depicting the same scene. This is an interesting problem domain because images in creative media are often drawn without explicit geometric consistency for the sake of storytelling and creative expression; they are only 3D in a qualitative sense. While humans can easily perceive the underlying 3D scene from these images, existing Structure-from-Motion (SfM) methods that assume 3D consistency fail catastrophically. We present Toon3D for reconstructing geometrically inconsistent images. Our key insight is to deform the input images while recovering camera poses and scene geometry, effectively explaining away the geometric inconsistencies. This process is guided by structure inferred from monocular depth predictions. We curate a dataset of multi-view imagery from cartoons and anime, annotated with reliable sparse correspondences using our user-friendly annotation tool. Our recovered point clouds can be plugged into novel-view synthesis methods to experience cartoons from viewpoints never drawn before. We evaluate against classical and recent learning-based SfM methods and find that Toon3D obtains more reliable camera poses and scene geometry.
(Left) We first recover camera poses and aligned point clouds. (Right) We then initialize Gaussians from our dense point cloud and optimize Gaussian Splatting with the recovered cameras. Our method uses depth regularization and is built on Nerfstudio. Here we show fly-through renders of our scenes.
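To make the depth regularization concrete, here is a minimal PyTorch sketch of a depth-regularized photometric loss. The function name, the L1 terms, and the weight are illustrative assumptions, not the exact Toon3D objective.

import torch

def splat_loss(rendered_rgb, gt_rgb, rendered_depth, prior_depth, depth_weight=0.1):
    # Photometric term: rendered Gaussian Splatting image vs. the drawing.
    rgb_loss = torch.abs(rendered_rgb - gt_rgb).mean()
    # Depth regularizer: keep rendered depth close to the monocular depth prior.
    depth_loss = torch.abs(rendered_depth - prior_depth).mean()
    return rgb_loss + depth_weight * depth_loss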
Here is the gallery of all our scenes. Can you guess which is which? Click to reveal names.
We first predict the depth of each image with Marigold and obtain candidate transient masks with SAM. We then label images with the Toon3D Labeler to obtain correspondences and mark transient regions. We optimize camera poses and warp images to obtain calibrated perspective cameras. You can use our labeler tool here. Finally, we initialize Gaussians with the aligned dense point cloud and run refinement, as sketched below.
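To illustrate how a monocular depth map becomes a dense point cloud for initialization, here is a minimal NumPy sketch of pinhole unprojection. The function name and the fixed intrinsics are illustrative assumptions; in Toon3D the camera parameters are recovered by optimization.

import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    # Lift an (H, W) depth map to an (H*W, 3) camera-space point cloud.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Example with a dummy depth map, focal length 500, and a centered principal point.
points = unproject_depth(np.ones((480, 640)), fx=500, fy=500, cx=320, cy=240)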
Here you can see the two major objectives of our method. The camera alignment objective recovers camera parameters. The deformation alignment objective warps the meshes for closer alignment. In practice, we optimize both objectives simultaneously. In the Deformation Alignment video, we show the various layers used in the method (e.g., cameras, sparse correspondences, warping meshes, point clouds, and Gaussians) and observe their alignment in 3D.
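Below is a minimal PyTorch sketch of how the two objectives can be combined into a single loss. The pinhole projection, the simple squared penalty on warp offsets, and the weight are assumptions for illustration; the actual method optimizes a piecewise-rigid warping mesh.

import torch

def project(points_cam, fx, fy, cx, cy):
    # Pinhole projection of (N, 3) camera-space points to (N, 2) pixels.
    x, y, z = points_cam.unbind(-1)
    return torch.stack([fx * x / z + cx, fy * y / z + cy], dim=-1)

def joint_loss(points_cam, keypoints_2d, warp_offsets, fx, fy, cx, cy, w_deform=0.01):
    # Camera alignment: reprojected 3D points should match the hand-labeled
    # keypoints after the warping mesh has displaced them.
    cam_loss = torch.abs(project(points_cam, fx, fy, cx, cy) - (keypoints_2d + warp_offsets)).mean()
    # Deformation alignment regularizer: keep warps small so drawings stay recognizable.
    deform_loss = (warp_offsets ** 2).mean()
    return cam_loss + w_deform * deform_loss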
We reconstruct inside the Rick and Morty house by labeling between walls and ceilings to connect the rooms. In the first video, we show the point cloud & cameras and our custom labeling interface. In the second video, you can scrub the slider to see a walkthrough inside the house! The closest camera's image is shown in the bottom right corner.
Here we show point clouds and recovered cameras for the 12 cartoon scenes in the Toon3D Dataset. Click the icons to explore our scenes!
We can reconstruct scenes from only a few images with large viewpoint changes. Where COLMAP may fail, we can intervene with the Toon3D Labeler to obtain human-labeled correspondences. Here we show a fly-through rendering for two rooms ("Living room" and "Bedroom 2") of this Airbnb listing.
Cartoons are hand-drawn, so we need to warp the images to make them 3D consistent. The first item is a video that shows the warp taking place during alignment optimization. The next two items are images showing the original and warped drawings, as well as the overlap between the two. Blurry regions indicate where significant warping occurred.
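For a concrete sense of the warp, here is a minimal sketch using scikit-image's piecewise-affine transform as a stand-in for our piecewise-rigid mesh warp; the function name is illustrative.

import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def warp_drawing(image, src_pts, dst_pts):
    # Warp a drawing so that keypoints at src_pts land on dst_pts.
    # src_pts / dst_pts: (N, 2) arrays of (x, y) pixel coordinates.
    tform = PiecewiseAffineTransform()
    # skimage's warp() expects a map from output coordinates back to input
    # coordinates, so we estimate the inverse (dst -> src) mapping.
    tform.estimate(dst_pts, src_pts)
    return warp(image, tform)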
We can also reconstruct paintings with Toon3D, since paintings, like cartoons, are hand-drawn. We predict the depth of each image, then align and warp the point clouds. Finally, we use Gaussian refinement to create the video shown below.
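As a simplified illustration of aligning two point clouds, here is an Umeyama-style rigid fit in NumPy over corresponding points. This similarity transform is a stand-in for our full optimization, which also recovers cameras and a non-rigid warp.

import numpy as np

def umeyama_align(src, dst):
    # Fit scale c, rotation R, translation t so that c * R @ src_i + t ~ dst_i.
    # src, dst: (N, 3) arrays of corresponding 3D points.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(sc.T @ dc)        # SVD of the 3x3 covariance
    sign = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    c = (S * np.array([1.0, 1.0, sign])).sum() / (sc ** 2).sum()
    t = mu_d - c * R @ mu_s
    return c, R, t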
Please consider citing our work if you find it useful.
@article{weber2024toon3d,
  title   = {Toon3D: Seeing Cartoons from New Perspectives},
  author  = {Ethan Weber* and Riley Peterlinz* and Rohan Mathur and
             Frederik Warburg and Alexei A. Efros and Angjoo Kanazawa},
  journal = {arXiv},
  year    = {2024},
}
We would like to thank Qianqian Wang, Justin Kerr, Brent Yi, David McAllister, Matthew Tancik, Evonne Ng, Anjali Thakrar, Christian Foley, Abhishek Kar, Georgios Pavlakos, the Nerfstudio team, and the KAIR lab for discussions, feedback, and technical support. We also thank Ian Mitchell and Roland Jose for helping to label points.