Toon3D: Seeing Cartoons from a New Perspective

* Equal contribution. ¹Teton.ai, ²UC Berkeley

TLDR
Humans can perceive a 3D world from images that aren't 3D consistent, so why can't machines?
COLMAP cannot reconstruct geometrically inconsistent hand-drawn images, even with perfect correspondences!
Toon3D recovers camera poses and dense geometry with a piecewise-rigid deformation optimization.

Hand-drawn scenes are not 3D consistent, so we created Toon3D to recover camera poses and dense geometry! We do this with a piecewise-rigid deformation optimization over hand-labeled keypoints, using monocular depth as a prior. Now we can interpolate novel views never seen before! Note that we reconstruct the scenes from more than two hand-drawn images, but this demo shows a smooth transition between just two of the input views.
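Moving the camera smoothly between two recovered viewpoints can be sketched as interpolating the two poses: spherical linear interpolation (slerp) for the rotations and linear interpolation for the camera centers. This is a minimal sketch assuming each pose is given as a unit quaternion (w, x, y, z) plus a camera center; `slerp` and `interpolate_pose` are hypothetical helpers, not functions from the Toon3D code.

```python
import numpy as np


def slerp(q0: np.ndarray, q1: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    # q and -q encode the same rotation; flip one to take the shorter arc.
    if dot < 0.0:
        q1 = -q1
        dot = -dot
    # Nearly parallel quaternions: fall back to normalized linear interpolation.
    if dot > 0.9995:
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)  # angle between the two quaternions
    s0 = np.sin((1.0 - t) * theta) / np.sin(theta)
    s1 = np.sin(t * theta) / np.sin(theta)
    return s0 * q0 + s1 * q1


def interpolate_pose(q0, c0, q1, c1, t):
    """Blend two camera poses: slerp the rotation, lerp the camera center."""
    q = slerp(np.asarray(q0, float), np.asarray(q1, float), t)
    c = (1.0 - t) * np.asarray(c0, float) + t * np.asarray(c1, float)
    return q, c


# Example: sweep 5 cameras from one input view to another.
q_a, c_a = np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0])
# 90-degree rotation about the y axis.
q_b = np.array([np.cos(np.pi / 4), 0.0, np.sin(np.pi / 4), 0.0])
c_b = np.array([2.0, 0.0, 0.0])
path = [interpolate_pose(q_a, c_a, q_b, c_b, t) for t in np.linspace(0.0, 1.0, 5)]
```

Each intermediate pose can then be handed to a renderer; slerp keeps the rotation speed constant, which avoids the uneven motion of naively averaging rotation matrices.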

Abstract

We propose Toon3D. In this work, we recover the underlying 3D structure of scenes that are not geometrically consistent. We focus our analysis on hand-drawn images from cartoons and anime. Many cartoons are created by artists without a 3D rendering engine, which means that any new image of a scene is hand-drawn. The hand-drawn images are usually faithful representations of the world, but only in a qualitative sense, since it is difficult for humans to draw multiple perspectives of an object or scene in a 3D-consistent way. Nevertheless, people can easily perceive 3D scenes from inconsistent inputs! We correct for 2D drawing inconsistencies to recover a plausible 3D structure such that the newly warped drawings are consistent with each other. Our pipeline consists of a user-friendly annotation tool, camera pose estimation, and image deformation to recover a dense structure. Our method warps images to obey a perspective camera model, enabling our aligned results to be plugged into novel-view synthesis reconstruction methods to experience cartoons from viewpoints never drawn before.

Cartoon Reconstruction

(Left) We first recover camera poses and aligned point clouds. (Right) We then initialize Gaussians from our dense point cloud and optimize Gaussian Splatting with the recovered cameras. Our method uses depth regularization and is built on Nerfstudio. Here we show fly-through renders of our scenes.
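Initializing Gaussians from a dense point cloud is commonly done with the heuristic from the original 3D Gaussian Splatting work: one Gaussian per point, colored by that point, with an isotropic scale set to the mean distance to its nearest neighbors. The sketch below assumes that heuristic plus an assumed initial opacity of 0.1; the exact initialization in Toon3D and in Nerfstudio may differ, and `init_gaussians` is a hypothetical name.

```python
import numpy as np


def init_gaussians(points: np.ndarray, colors: np.ndarray, k: int = 3) -> dict:
    """Create per-point Gaussian parameters from an (N, 3) point cloud.

    Scale follows the common 3DGS heuristic: the mean distance from each point
    to its k nearest neighbors, used for all three axes (isotropic).
    """
    n = points.shape[0]
    # Brute-force pairwise distances; fine for a sketch, use a KD-tree at scale.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    knn = np.sort(dist, axis=1)[:, :k]
    mean_dist = knn.mean(axis=1)
    return {
        "means": points.copy(),
        # Log-space scales keep scales positive during optimization.
        "log_scales": np.log(np.repeat(mean_dist[:, None], 3, axis=1)),
        # Identity rotation as a unit quaternion (w, x, y, z).
        "quats": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),
        # Assumed initial opacity of 0.1, stored as a logit.
        "opacity_logits": np.full(n, np.log(0.1 / 0.9)),
        "colors": colors.copy(),
    }
```

Tying the scale to local point spacing makes sparse regions start with larger Gaussians, so the initial render has fewer holes before refinement begins.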

Here is the gallery of all our scenes. Can you guess which is which?

Method

We first predict the depth of each image with Marigold and obtain candidate transient masks with SAM. We then label images with the Toon3D Labeler to obtain correspondences and mark transient regions. We optimize camera poses and warp images to obtain calibrated perspective cameras. Finally, we initialize Gaussians with the aligned dense point cloud and run Gaussian Splatting refinement.
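To see how depth and correspondences combine, each labeled keypoint can be lifted into 3D with the pinhole model, X = d · K⁻¹ [u, v, 1]ᵀ, and matching keypoints from two images can then be aligned with a similarity transform (scale, rotation, translation). The sketch below uses the closed-form Umeyama alignment as a stand-in for rough pose estimation under assumed known intrinsics K; Toon3D optimizes cameras and warps jointly, so this is an illustration rather than the method's solver, and `unproject` and `umeyama` are hypothetical helpers.

```python
import numpy as np


def unproject(uv: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift (N, 2) pixel keypoints with per-point depth to camera-space 3D points."""
    ones = np.ones((uv.shape[0], 1))
    rays = np.hstack([uv, ones]) @ np.linalg.inv(K).T  # K^-1 [u, v, 1]^T per row
    return rays * depth[:, None]  # z of each ray is 1, so z becomes the depth


def umeyama(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity (s, R, t) such that dst ≈ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / src.shape[0]  # cross-covariance of the centered sets
    U, S, Vt = np.linalg.svd(cov)
    # Reflection guard: force det(R) = +1.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

With noise-free correspondences the recovered transform is exact; with hand-drawn inputs the leftover residuals are what the image-warping step has to absorb.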

Overview

Toon3D Labeler

Here you can see the two major steps of our method. The sparse alignment video shows rough camera parameter estimation. The dense alignment video shows the various layers used in the method (e.g., cameras, sparse correspondences, and warping meshes) and how they align in 3D.
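A warping mesh can be thought of as a triangle mesh over an image whose vertices may move, with a penalty that keeps each triangle close to its original shape. The sketch below uses a simple edge-length preservation term as a stand-in for a piecewise-rigid regularizer, combined with a data term pulling keypoint vertices toward target positions; the exact energy Toon3D optimizes may differ, and `rigidity_energy`, `total_energy`, and the mesh layout are hypothetical.

```python
import numpy as np


def edges_from_triangles(tris: np.ndarray) -> np.ndarray:
    """Unique undirected edges (E, 2) from (T, 3) triangle vertex indices."""
    e = np.concatenate([tris[:, [0, 1]], tris[:, [1, 2]], tris[:, [2, 0]]])
    e = np.sort(e, axis=1)  # make (i, j) and (j, i) identical
    return np.unique(e, axis=0)


def rigidity_energy(rest: np.ndarray, warped: np.ndarray, tris: np.ndarray) -> float:
    """Sum of squared edge-length changes between rest and warped 2D vertices.

    Zero for any rotation + translation of the whole mesh; grows when triangles
    are stretched, squashed, or sheared.
    """
    edges = edges_from_triangles(tris)
    rest_len = np.linalg.norm(rest[edges[:, 0]] - rest[edges[:, 1]], axis=1)
    new_len = np.linalg.norm(warped[edges[:, 0]] - warped[edges[:, 1]], axis=1)
    return float(((new_len - rest_len) ** 2).sum())


def total_energy(rest, warped, tris, target_idx, targets, weight=1.0):
    """Hypothetical objective: move keypoint vertices toward targets, stay rigid."""
    data = ((warped[target_idx] - targets) ** 2).sum()
    return float(data + weight * rigidity_energy(rest, warped, tris))
```

Minimizing such an objective with any gradient-based optimizer trades off matching the correspondences against distorting the drawing; the `weight` controls how much warping is tolerated.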

Sparse Alignment

Dense Alignment

Explore Inside Rick and Morty's House

We reconstruct the inside of Rick and Morty's house by labeling correspondences along walls and ceilings to connect the rooms. The first video shows the point cloud and cameras, along with our custom labeling interface. In the second video, scrub the slider to see a walkthrough inside the house! The closest camera's image is shown in the bottom-right corner.
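Showing the closest input image during a walkthrough amounts to a nearest-neighbor lookup over camera centers. A minimal sketch, assuming 4×4 camera-to-world matrices whose translation column is the camera center; `closest_camera` is a hypothetical helper, and a real viewer might also weigh viewing direction.

```python
import numpy as np


def closest_camera(render_c2w: np.ndarray, input_c2ws: np.ndarray) -> int:
    """Index of the input camera whose center is nearest the render camera.

    render_c2w: (4, 4) camera-to-world matrix of the walkthrough camera.
    input_c2ws: (N, 4, 4) camera-to-world matrices of the recovered cameras.
    """
    render_center = render_c2w[:3, 3]
    centers = input_c2ws[:, :3, 3]
    dists = np.linalg.norm(centers - render_center, axis=1)
    return int(np.argmin(dists))
```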

Point Clouds and Cameras

Here we show point clouds and recovered cameras for the 12 cartoon scenes in the Toon3D Dataset.

Sparse-View Reconstruction

We can reconstruct scenes from only a few images with large viewpoint changes. Where COLMAP may fail, we can intervene with the Toon3D Labeler to obtain human-labeled correspondences. Here we show a fly-through rendering of two rooms ("Living room" and "Bedroom 2") from an Airbnb listing.

Visualizing Inconsistencies

Because cartoons are hand-drawn, we need to warp the images to make them 3D consistent. The first item is a video showing the warp as it happens during alignment optimization. The next two items are images showing the original and warped drawings, as well as the overlap between the two. Blurry regions indicate where the warp was largest.
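An overlap image of this kind can be produced by alpha-blending the original and warped drawings: where the warp barely moved the pixels the blend stays sharp, and where it moved them a lot the two copies disagree and the result looks blurry or doubled. The sketch below assumes both images are same-size float arrays in [0, 1] and that mesh vertex positions before and after warping are available; `overlay` and `warp_magnitude` are hypothetical helpers.

```python
import numpy as np


def overlay(original: np.ndarray, warped: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend two same-size images; misaligned content appears doubled."""
    return (1.0 - alpha) * original + alpha * warped


def warp_magnitude(rest_vertices: np.ndarray, warped_vertices: np.ndarray) -> np.ndarray:
    """Per-vertex displacement in pixels; large values flag heavily warped areas."""
    return np.linalg.norm(warped_vertices - rest_vertices, axis=1)
```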

Reconstructing Paintings

Toon3D can also reconstruct paintings, which, like cartoons, are hand-drawn. We predict the depth of each image, then align and warp the point clouds. Finally, we use Gaussian Splatting refinement to create the rendered video.
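Monocular depth predictors such as Marigold produce affine-invariant depth, i.e. depth known only up to an unknown per-image scale and shift, so per-image point clouds need a common scale before they can be aligned. One standard way to do this, sketched here as an assumption rather than Toon3D's exact procedure, is a least-squares fit of a scale s and shift b so that s·d + b matches reference depths at sparse keypoints (for example, depths from a view that is already aligned). `fit_scale_shift` is a hypothetical name.

```python
import numpy as np


def fit_scale_shift(pred: np.ndarray, target: np.ndarray):
    """Least-squares (s, b) minimizing sum((s * pred + b - target) ** 2)."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # design matrix [d, 1]
    (s, b), *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(s), float(b)


# Example: predicted relative depth at keypoints vs. reference depths.
pred = np.array([0.1, 0.4, 0.5, 0.9])
target = 3.0 * pred + 1.2
s, b = fit_scale_shift(pred, target)
aligned = s * pred + b  # in practice, apply the same (s, b) to the full depth map
```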

BibTeX

Please consider citing our work if you find it useful.

@inproceedings{weber2023toon3d,
  title = {Toon3D: Seeing Cartoons from a New Perspective},
  author = {Ethan Weber* and Riley Peterlinz* and Rohan Mathur and
    Frederik Warburg and Alexei A. Efros and Angjoo Kanazawa},
  booktitle = {arXiv},
  year = {2024},
}

We would like to thank Qianqian Wang, Justin Kerr, Brent Yi, David McAllister, Matthew Tancik, Evonne Ng, Anjali Thakrar, Christian Foley, Abhishek Kar, Georgios Pavlakos, the Nerfstudio team, and the KAIR lab for discussions, feedback, and technical support. We also thank Ian Mitchell and Roland Jose for helping to label points.