Time-Archival Camera Virtualization for Sports and Visual Performance

Yunxiao Zhang1   William Stone1   Suryansh Kumar1*
1Texas A&M University, College Station, TX, USA
* corresponding author: [email protected]
Manuscript under revision at Computer Vision and Image Understanding (CVIU)  •  Preprint not public
What this page is for. This project page focuses on TACV’s domain-informed insight for sports & performance capture: a pre-calibrated, synchronized multi-view rig is already a strong geometric constraint. TACV uses that constraint to enable time-indexed archival (“rewind any moment”) + high-quality novel views, without relying on fragile multi-body tracking assumptions.
TACV graphical presentation (application scenario)
Application scenario. Figure 1: (a)-(d) Camera virtualization for football, showing images rendered from virtual cameras placed at different distances from the subject(s), e.g., (a) far-distance viewpoint, (c) near-distance viewpoint, (d) top viewpoint.

Key Insight: Pre-calibrated Multi-view is Already a Strong Constraint

Assumption (by design): A fixed multi-camera rig with known intrinsics/extrinsics, synchronized across views (typical in stadium broadcast and stage capture).

In sports broadcasting, multiple static cameras capture exactly the same moment. At time t, all views of a subject differ only by rigid transformations. This eliminates the need for deformation graphs, warping fields, 4D canonicalization, or Gaussian tracking across time. This is a domain-informed insight that fundamentally changes how dynamic novel view synthesis should be modeled.
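To spell out that claim, here is a short worked relation (our illustration, using the pinhole notation K_i, (R_i, t_i) introduced in the Method section and assuming the world-to-camera convention x_cam = R_i X + t_i; the depth d_i is an auxiliary symbol added here):

% Projection of a world point X into synchronized views i and j at the same time t:
%   x_i ~ K_i (R_i X + t_i),   x_j ~ K_j (R_j X + t_j).
% Eliminating X, with d_i the depth of the point in view i:
x_j \;\sim\; K_j \left( R_{ij}\, d_i\, K_i^{-1} x_i + t_{ij} \right),
\qquad R_{ij} = R_j R_i^{\top},
\qquad t_{ij} = t_j - R_{ij}\, t_i .

The mapping (R_ij, t_ij) is fixed by calibration; it does not depend on time or on how the subjects move, which is why no temporal correspondence is needed within a single time step.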

Many dynamic view synthesis methods implicitly rely on some form of temporal correspondence—tracking points, tracking Gaussians, or tracking deformation fields. In fast-paced sports and multi-person stage performances, these assumptions frequently break due to:
  • large, non-rigid, rapid motions (e.g., flips, jumps, articulations),
  • sudden player-to-player transitions and heavy inter-person occlusions,
  • independent motion of multiple subjects, which violates the tracking assumptions of 4DGS, ST-GS, and other dynamic splatting variants.

TACV takes a different stance. With a synchronized, calibrated multi-view rig, the scene at each time step is already strongly constrained by geometry. We model the sequence as a time-indexed set of archival multi-view snapshots: optimize each time step independently, store it, and later render it from any virtual camera pose—enabling “rewind” + novel-view replay.

What you get:
  • Time-archival: exact “rewind to time t” + render novel views.
  • Stability: avoids drift from long-range temporal coupling.
  • Parallelizable: per-time-step training is easy to distribute across GPUs.
What you avoid (in this setting):
  • per-time-step SfM point cloud initialization,
  • fragile multi-body tracking assumptions,
  • complex temporal deformation modeling for every sequence.
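To make the archival workflow concrete, here is a minimal sketch (ours, not the authors' released code) of a time-indexed checkpoint store: each independently optimized per-time-step field is saved as a self-contained checkpoint, and "rewind to time t" simply reloads that checkpoint for novel-view rendering. The names TimeArchive, store, rewind, and model_factory are illustrative assumptions.

# Illustrative sketch (not the authors' released code) of a time-indexed checkpoint archive.
from pathlib import Path

import torch


class TimeArchive:
    """Time-indexed store of independently optimized per-time-step fields."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def path(self, t: int) -> Path:
        return self.root / f"scene_t{t:05d}.pt"

    def store(self, t: int, model: torch.nn.Module) -> None:
        # Each time step is self-contained: no cross-time state is saved.
        torch.save(model.state_dict(), self.path(t))

    def rewind(self, t: int, model_factory) -> torch.nn.Module:
        # Rebuild the per-time-step field and load its weights for replay.
        model = model_factory()
        model.load_state_dict(torch.load(self.path(t), map_location="cpu"))
        return model.eval()

Because no state is shared across time indices, a re-trained or missing checkpoint never affects any other archived moment.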

GUI Demo Videos

Below is a demo of the TACV GUI and interactive playback.

Demo: TACV GUI for interactive novel view synthesis and time-archival playback.

Results

Below are selected quantitative results and qualitative figures. In many challenging sports/performance sequences, several baselines fail to produce usable reconstructions; therefore, this page emphasizes what TACV enables in the pre-calibrated multi-view setting, with comparisons shown where available.

Selected quantitative results

Quantitative results table from the TACV manuscript (PSNR/LPIPS/Memory/Train Time/Iterations).
Table. Comparison of implicit and explicit dynamic-scene representations on three synthetic datasets. Our per-time-step implicit radiance fields achieve the best PSNR and LPIPS while maintaining a compact memory footprint, outperforming 3DGS- and NeRF-based methods that rely on explicit geometry or deformation tracking. This makes our approach substantially more scalable and robust for long-horizon time-archival in dynamic sports and performance scenes. Note: the reported memory does not include the initial 3D point cloud that 3DGS-based approaches require as input. DWS, S-PK, and S-MP stand for the Dancing-Walking-Standing, Soccer Penalty Kick, and Soccer Multi-Player datasets.

Qualitative comparisons (where available)

Selected comparisons against implicit baselines (from manuscript figure)
Qualitative results on Dancing-Walking-Standing (DWS). Left: the multi-view acquisition setup (top and side views); novel viewpoints are highlighted in [red]. Rows (top → bottom): image-based rendering results from state-of-the-art neural implicit methods for dynamic scenes, followed by our method. The red camera ID in each row indicates the source view used for rendering. Overall, our approach produces renderings that are visually closer to the ground truth than prior methods.
Selected comparisons against implicit baselines (from manuscript figure)
Comparison (from manuscript). TACV vs. representative implicit baselines.
Selected comparison against 4DGS (from manuscript figure)
Comparison (from manuscript). TACV vs. 4DGS.

Real multi-camera example (CMU Panoptic)

CMU Panoptic example results figure
Real-world multi-camera capture. Example results on CMU Panoptic Studio sequences.

Method at a Glance

TACV targets time-archival camera virtualization under a synchronized, calibrated multi-view capture setup commonly used in sports broadcasting and stage performances. At each discrete time instance t, N static cameras capture synchronized RGB images I_t = {I_t^(1), …, I_t^(N)}. Camera intrinsics K_i and extrinsics (R_i, t_i) are known (or estimated), providing strong geometric constraints at every time instance.
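To illustrate how that calibration is used, the sketch below (an assumption-laden illustration, not the released code) turns every pixel of a view into a world-space ray, assuming the world-to-camera convention x_cam = R_i x_world + t_i; the helper name camera_rays is hypothetical.

# Illustrative sketch (not the released code); assumes x_cam = R @ x_world + t.
import numpy as np


def camera_rays(K: np.ndarray, R: np.ndarray, t: np.ndarray, H: int, W: int):
    """Per-pixel ray origins and unit directions in world coordinates."""
    # Pixel grid in homogeneous image coordinates (u, v, 1), sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)              # (H, W, 3)
    # Back-project through the intrinsics, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T                           # rows are K^{-1} [u, v, 1]^T
    dirs_world = dirs_cam @ R                                     # rows are R^T d_cam
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origin = -R.T @ t                                             # camera center in world frame
    origins = np.broadcast_to(origin, dirs_world.shape)
    return origins, dirs_world

At a fixed time t, the rays from all N views constrain the same momentarily static scene, which is the geometric constraint TACV relies on instead of temporal tracking.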

For each time t, TACV learns a temporally indexed functional scene representation F_t (a time-specific implicit radiance field) to model RGB appearance at that moment, enabling novel-view synthesis for any past or current time. We implement F_t as a compact neural implicit model: a small MLP augmented with a multi-resolution hash-grid encoding for efficient, high-detail rendering. Each time step is optimized independently and stored as a self-contained checkpoint for archival access; training can be parallelized across time steps when multiple GPUs are available.
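A minimal per-time-step training sketch is shown below, under the setup stated above. It is not the released implementation: the actual F_t uses a multi-resolution hash-grid encoding (instant-ngp style), for which a plain MLP stands in here, and the ray batches plus the volume-rendering helper render_fn are assumed inputs.

# Illustrative sketch (not the released implementation); render_fn is an assumed volume-rendering helper.
import torch
import torch.nn as nn


class TinyField(nn.Module):
    """Stand-in for the hash-grid + small-MLP radiance field F_t."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # RGB + density
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        return self.net(xyz)


def fit_time_step(rays_o, rays_d, target_rgb, render_fn, iters=2000, lr=1e-2):
    """Optimize a fresh field for a single time instance t."""
    field = TinyField()
    opt = torch.optim.Adam(field.parameters(), lr=lr)
    for _ in range(iters):
        # Random ray batch drawn from all N synchronized views at time t.
        idx = torch.randint(0, rays_o.shape[0], (4096,))
        pred = render_fn(field, rays_o[idx], rays_d[idx])    # volume rendering (assumed helper)
        loss = torch.mean((pred - target_rgb[idx]) ** 2)     # photometric L2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return field  # hand off to the archive; no state flows to time t+1

Since fit_time_step shares nothing across time indices, separate time steps can be dispatched to different GPUs or workers, which is where the parallelism mentioned above comes from.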

TACV overall pipeline diagram
Pipeline. Per-time-step training from synchronized multi-view images; store checkpoints for time-indexed replay.

Datasets

We evaluate TACV on synthetic sports/performance datasets and on real multi-camera captures. For research and evaluation, example datasets and optional pretrained checkpoints are provided:

Download (Google Drive)
The shared datasets and checkpoints are provided for research and evaluation purposes only. Please ensure you have the rights to use any third-party assets contained in the data (e.g., captured footage / game renders). Please do not redistribute third-party content without permission from the original rights holders.
Dataset                           Type         #Views   #Time instances   Notes
Dancing-Walking-Standing (DWS)    Synthetic    100      65                Multi-person motion
Soccer Penalty Kick (S-PK)        Synthetic    60       109               Soccer action
Soccer Multi-Player (S-MP)        Synthetic    60       83                Multiple players; occlusions
Baseball Bat                      Real world   31       100               Fast motion
Hand Gesture                      Real world   31       201               Non-rigid articulation

Code & Ecosystem

Abstract

Camera virtualization enables photorealistic novel-view synthesis for live performances and sports broadcasting using a limited set of synchronized, calibrated static cameras, but existing dynamic-scene methods still struggle to deliver spatially/temporally coherent rendering with practical time-archival capability in fast, multi-person motions. Dynamic 3D Gaussian Splatting variants can be real-time, yet often depend on accurate SfM point clouds and fragile temporal tracking assumptions that break under large non-rigid motion and independent multi-body interactions. TACV advocates revisiting a neural volume rendering formulation: it treats each time instant as a geometry-constrained multi-view snapshot (views differ only by rigid transforms at that time), learns a compact per-time-step implicit representation, and stores it for true time-archival, letting users “rewind” to any past moment and render novel viewpoints for replay, analysis, and long-horizon archival without requiring per-time-step point-cloud initialization or heavy temporal coupling.

Full abstract (from manuscript)

Camera virtualization—an emerging solution to novel view synthesis—holds transformative potential for visual entertainment, live performances, and sports broadcasting by enabling the generation of photorealistic images from novel viewpoints using images from a limited set of calibrated static physical cameras. Despite recent advances, achieving spatially and temporally coherent, photorealistic rendering of dynamic scenes with efficient time-archival capabilities, particularly in fast-paced sports and stage performances, remains challenging for existing approaches. Recent methods based on 3D Gaussian Splatting (3DGS) for dynamic scenes can offer real-time view-synthesis results. Yet, they are hindered by their dependence on accurate 3D point clouds from structure-from-motion and by their inability to handle large, non-rigid, rapid motions of different subjects (e.g., flips, jumps, articulations, sudden player-to-player transitions). Moreover, independent motions of multiple subjects can break the Gaussian-tracking assumptions commonly used in 4DGS, ST-GS, and other dynamic splatting variants. This paper advocates reconsidering a neural volume rendering formulation for camera virtualization with efficient time-archival capabilities, making it useful for sports broadcasting and related applications. By modeling a dynamic scene as rigid transformations across multiple synchronized camera views at a given time, our method performs neural representation learning, providing enhanced visual rendering quality at test time. A key contribution of our approach is its support for time-archival, i.e., users can revisit any past temporal instance of a dynamic scene and perform novel view synthesis, enabling retrospective rendering for replay, analysis, and archival of live events—a functionality absent in existing neural rendering and novel view synthesis methods for dynamic scenes. While, in principle, dynamic 3DGS approaches can also perform time-archival, doing so would require either a multi-view structure-from-motion (SfM) point cloud stored at every time step or some form of additional multi-body temporal modeling constraint, both of which are complex, computationally expensive, and potentially memory-intensive. We argue that a dynamic scene observed under a well-constrained synchronized multi-view setup, typical in sports and visual performance scenarios, is already strongly constrained by geometry, and that a temporally coupled constraint or 3D point cloud initialization may not be needed. Extensive experiments and ablations on established benchmarks and our newly proposed dynamic scene datasets demonstrate that our method surpasses 4DGS-based baselines in rendered image quality and other performance metrics for time-archival view synthesis of a dynamic scene, thus setting a new standard for virtual camera systems in dynamic visual media. Furthermore, our approach could be an encouraging step towards compactly modeling the plenoptic function, allowing for time-archival of a long video sequence.

Citation

Please use the repository citation below for now; it will be updated once the paper is public (DOI/arXiv).

@misc{zhang_tacv_code,
  title        = {Time-Archival Camera Virtualization (TACV) -- Code},
  author       = {Zhang, Yunxiao (Jack)},
  year         = {2026},
  howpublished = {GitHub repository},
  note         = {Manuscript under revision at CVIU}
}

Acknowledgements

I am grateful to Prof. Suryansh Kumar for advising this project, for drafting the manuscript, and for insightful discussions that shaped the problem framing and evaluation. Parts of the implementation are adapted from NVIDIA instant-ngp; I thank the authors for releasing their code and follow the original license and attribution requirements in this repository.