amar jay

Neural Implicit Scene Reconstruction from WiFi CSI Observations

Reading Time: 30 min read | Author: Hermes

A comprehensive feasibility study on using WiFi Channel State Information for dense 3D geometry reconstruction via neural implicit representations — covering the RF perception lineage, neural rendering, non-visual NeRF analogues, dataset gaps, proposed architecture, and a research roadmap.

Introduction

Can we reconstruct the 3D geometry of a room using nothing but the WiFi signals already bouncing through it?

This question sits at the intersection of three rapidly advancing fields: RF perception (extracting scene information from radio signals), neural implicit representations (representing 3D geometry as continuous functions learned by neural networks), and multi-modal sensor fusion (combining complementary sensing modalities). It is a question that, remarkably, almost no one has tried to answer.

The intuition is compelling. WiFi signals fill our homes and offices. They reflect off walls, scatter from furniture, and diffract around corners. Every surface in a room leaves its imprint on the Channel State Information (CSI) — the fine-grained channel response that WiFi receivers already compute for communication purposes. If a camera can reconstruct a scene from the light bouncing off it, why can’t we do the same with the radio waves already passing through it?

This article is a deep-dive feasibility study. We survey the full literature — over 90 papers spanning RF perception, neural rendering, and non-visual 3D reconstruction — identify the critical gaps, propose a concrete architecture, and assess when (and whether) WiFi CSI → neural implicit geometry is achievable.


The Core Question in One Diagram

Camera-Based 3D Reconstruction (mature, proven):
Camera(s) → RGB images → COLMAP (SfM) → camera poses + sparse point cloud → NeRF / 3DGS → dense 3D mesh

WiFi CSI 3D Reconstruction (proposed, no prior art):
WiFi APs → CSI streams → ??? → NeRF / 3DGS

??? No existing bridge converts multipath CSI into the geometric constraints needed for 3D reconstruction.

The infrastructure already exists in billions of homes. The neural rendering machinery is mature. The missing piece is the bridge between them.


What WiFi CSI Actually Measures

Let’s start with the physics, because it determines everything.

Channel State Information

CSI is not a single number like RSSI. It is a rich, complex-valued matrix captured for every WiFi packet:

$$H(f, t) = \lvert H(f, t) \rvert \cdot e^{j \cdot \phi(f, t)}$$

where:

  • $f$ = OFDM subcarrier index (56 subcarriers at 20 MHz, 234 at 80 MHz)
  • $t$ = time (CSI sampled at 30–300 Hz, per-packet)
  • $\lvert H \rvert$ = amplitude (path loss, scattering, absorption)
  • $\phi$ = phase (path length, Doppler shift, hardware offsets)

A single CSI frame from a 2×2 MIMO AP at 80 MHz is a tensor:

$$\mathrm{CSI} \in \mathbb{C}^{2 \times 2 \times 234} \longrightarrow 936\text{ complex values per packet} \longrightarrow 1{,}872\text{ real values (amplitude + phase)}$$

This is three orders of magnitude richer than the single RSSI scalars that older WiFi localization systems used. It’s what made modern WiFi sensing possible.
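As a quick sanity check on those counts, here is a minimal NumPy sketch. The `(rx, tx, subcarrier)` axis order and the random placeholder values are assumptions for illustration only; each real CSI tool uses its own layout.

```python
import numpy as np

# Hypothetical CSI frame: 2 RX antennas x 2 TX antennas x 234 subcarriers.
# Random values stand in for a real capture.
rng = np.random.default_rng(0)
shape = (2, 2, 234)
csi = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

amplitude = np.abs(csi)    # |H(f, t)|
phase = np.angle(csi)      # phi(f, t), wrapped to (-pi, pi]

n_complex = csi.size                    # number of complex entries per packet
n_real = amplitude.size + phase.size    # one amplitude and one phase per entry

# The polar form round-trips back to the complex CSI.
reconstructed = amplitude * np.exp(1j * phase)
print(n_complex, n_real)
```

Whether a model consumes the complex values directly or the amplitude/phase split is a design choice; the information content is the same.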

The Multipath Principle

Unlike a camera, which sees each scene point from a single viewpoint, a WiFi receiver sees the superposition of every propagation path:

[Figure: a TX source and an RX collector connected by a direct (line-of-sight) path, a path reflected off a boundary wall, and a path scattered (diffracted) by an object.]

The Phase Space of CSI: Every object in the scene leaves a unique "fingerprint" on the multipath superposition. Reconstruction requires disentangling path lengths (ToF) and arrival angles (AoA).

Each path encodes geometric information:

  • Time of Flight (ToF) → path length → distance
  • Angle of Arrival (AoA) → direction of arrival
  • Doppler shift → motion
  • Amplitude → material properties, path loss

This is simultaneously the greatest strength (rich geometric encoding) and the fundamental challenge of WiFi sensing: the inverse problem — reconstructing the scene that produced a given multipath pattern — is severely ill-posed. Infinitely many scene configurations produce identical CSI.
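A common textbook way to write this superposition is $H(f) = \sum_k a_k \, e^{-j 2\pi f d_k / c}$, where $d_k$ is the length of path $k$ and $a_k$ its complex gain. The sketch below is a simplified illustration of that model, not the output of any real CSI tool; the path lengths, gains, and the 5.18 GHz center frequency are made-up values.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def multipath_csi(freqs_hz, path_lengths_m, gains):
    """Coherent sum of paths: H(f) = sum_k a_k * exp(-j 2 pi f d_k / c)."""
    delays = np.asarray(path_lengths_m, dtype=float) / C   # time of flight per path
    gains = np.asarray(gains, dtype=complex)
    phase = -2j * np.pi * np.outer(freqs_hz, delays)       # shape (n_freq, n_paths)
    return (gains * np.exp(phase)).sum(axis=1)

# Assumed toy setup: 234 subcarriers spanning 80 MHz around 5.18 GHz.
freqs = 5.18e9 + np.linspace(-40e6, 40e6, 234)

h_single = multipath_csi(freqs, [4.0], [1.0])           # line-of-sight only
h_two = multipath_csi(freqs, [4.0, 7.5], [1.0, 0.4])    # plus one reflection

# One path gives a flat amplitude; adding a second path makes the
# amplitude vary across subcarriers (frequency-selective fading).
amp_two = np.abs(h_two)
print(np.abs(h_single).std(), amp_two.min(), amp_two.max())
```

The receiver only ever observes the sum. Recovering the individual $(a_k, d_k)$ pairs, and from them the surfaces that produced the paths, is the inverse problem described above.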

The Physics Wall

Here is the uncomfortable truth, quantified:

| Parameter | WiFi (5 GHz) | Camera (RGB-D) | Ratio |
|---|---|---|---|
| Wavelength | 6 cm | 500 nm | 120,000∶1 |
| Bandwidth (typical) | 20–80 MHz | N/A | |
| Range resolution Δr = c/(2B) | 1.875–7.5 m | Sub-mm (stereo) | ~2,000∶1+ |
| Angular resolution (4 antennas) | ~30° | ~0.01° (1080p) | 3,000∶1 |
| Sample rate per stream | 30–300 Hz | 30–60 Hz | 1–10∶1 |
| Through-wall capable | Yes | No | |
| Works in darkness | Yes | No | |
| Privacy-preserving | Yes | No | |

The range resolution alone tells the story: with 80 MHz of bandwidth, Δr = c/(2B) ≈ 1.9 m, so two objects closer than that in depth produce nearly indistinguishable CSI. A 4-antenna AP cannot resolve directions finer than ~30°. These are physical resolution limits set by signal bandwidth and antenna aperture, not engineering problems.

But — and this is the key insight — neural networks are exceptionally good at extracting signal from noise. They learn priors from data that regularize ill-posed inverse problems. The same principle that lets a CNN produce a sharp 1080p image from a noisy 240p input could, in principle, let a neural network produce centimeter-accurate geometry from meter-resolution CSI.

The question is: how far can the neural prior push beyond the physics?
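For reference, the range-resolution figures in this section follow directly from Δr = c/(2B). A minimal sketch (the list of bandwidths is just the set of common WiFi channel widths, chosen for illustration):

```python
C = 299_792_458.0  # speed of light, m/s

def range_resolution_m(bandwidth_hz):
    """Classical two-way range resolution: delta_r = c / (2 B)."""
    return C / (2.0 * bandwidth_hz)

for b_mhz in (20, 40, 80, 160):
    print(f"{b_mhz:>3} MHz -> {range_resolution_m(b_mhz * 1e6):.2f} m")
```

Doubling the bandwidth halves Δr, which is why wideband extrapolation and channel stitching matter so much for RF geometry.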


The RF Perception Literature: Building Blocks

Before asking whether CSI can drive 3D reconstruction, we need to establish what CSI can do. The answer, thanks to two decades of wireless sensing research, is: quite a lot.

The Katabi Group / MIT CSAIL (2013–2022)

The foundational work was done by Dina Katabi’s group at MIT. They systematically demonstrated that RF signals can replace cameras for a growing list of computer vision tasks.

The physical foundation. Sub-Nanosecond Time of Flight on Commercial Wi-Fi Cards (Vasisht et al., 2015): showed that commodity WiFi hardware can measure time of flight with sub-nanosecond accuracy, enabling decimeter-level distance measurement (1 ns of flight time corresponds to about 30 cm). Without this, none of what follows would be possible.

Action recognition through walls. Making the Invisible Visible (Li, Fan, Zhao, Liu, Katabi, 2019): a CNN trained on RF heatmaps classifies human actions (walking, sitting, waving) through walls, without any visual input. Trained with synchronized RF + video, tested on RF alone. Strong evidence that through-wall action recognition is viable.

Person identification from RF. Learning Longterm Representations for Person Re-Identification Using Radio Signals (Fan, Li, Fang, Hristov, Yuan, Katabi, 2020): RF-based person ReID. Individuals are recognized by their body shape and gait patterns encoded in RF reflections, which are geometric features rather than visual appearance.

RF-to-text scene description. In-Home Daily-Life Captioning Using Radio Signals (Fan, Li, Yuan, Katabi, 2020): an encoder-decoder maps RF heatmaps to natural language captions like “person sitting on couch reading.” Demonstrates that RF carries scene-level semantic information, not just low-level motion.

Self-supervised RF learning. Unsupervised Learning for Human Sensing Using Radio Signals (Li, Fan, Yuan, Katabi, 2022, IEEE WACV, 45 citations): the closest existing work to a “foundation model” for RF. A single RF encoder backbone, pretrained via contrastive learning on unlabeled RF data, is finetuned for pose estimation, activity recognition, and person ReID. The core insight: you don’t need to reconstruct images from RF to solve vision tasks; you can learn directly from the RF signal. This is the paradigm that any CSI → geometry system should follow.

CSI → Image: The New Frontier

Two papers, published within a year of each other, have demonstrated that you can go further — actually synthesizing recognizable images from WiFi CSI:

Through-Wall Imaging based on WiFi Channel State Information (Strohmayer, Sterzinger, Stippel, Kampel, 2024, updated 2025) — the first paper to show CSI → synthesized RGB image in through-wall scenarios. Architecture: a CNN encoder-decoder (U-Net style) mapping CSI tensors to 128×128 RGB images. Results show recognizable room layouts and human silhouettes in constrained single-person scenes. Groundbreaking proof-of-concept, but quality is low and generalization appears limited.

High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model (Ramesh, Nishio, June 2025) — second-generation approach under the name LatentCSI. Instead of training a CNN from scratch, they use a pretrained Stable Diffusion model as the decoder and train only a lightweight CSI encoder that maps CSI into the diffusion model’s latent (CLIP-aligned) space. The result is dramatically higher quality images — the Stable Diffusion prior provides the missing high-frequency detail, while the CSI encoder provides the structural constraints. This is conceptually elegant: it acknowledges that CSI alone cannot resolve fine textures, and offloads that to a pretrained image prior.

Both papers demonstrate that CSI → structured visual output is possible. But both are limited to 2D — they produce images, not geometry.

mmWave Radar → 3D Point Cloud

While WiFi operates at 2.4/5 GHz, mmWave radar (60–77 GHz) provides an order of magnitude better spatial resolution. Three key papers from Sun et al. demonstrate the mmWave → 3D pipeline:

  • DeepPoint (2021): mmWave radar heatmaps → dense 3D point cloud via deep CNN
  • 3DRIMR (2021): improved reconstruction of object shapes at centimeter resolution
  • R2P (2022): generates smooth, dense point clouds from sparse radar measurements

And critically, ImmFusion (Chen et al., 2022) fuses mmWave + RGB for 3D human mesh (SMPL model) reconstruction, showing RF provides complementary geometry that helps in bad weather and low light.

Analogy: If mmWave → point cloud is possible, WiFi CSI → point cloud should be possible with sufficient neural capacity — just at lower resolution, because the physics is worse by a factor of 10–100.

WiFi Pose Estimation

VST-Pose (Zhang et al., 2025) is the state-of-the-art: a velocity-integrated spatiotemporal attention network for human pose estimation from WiFi CSI. It’s the latest in a growing subfield of CSI → pose, which now has multiple working systems and established benchmarks.

What All These Systems Share

None of them attempt dense scene geometry. They all operate on human subjects in controlled environments. CSI → activity, CSI → pose, CSI → person ID, CSI → image of a person — every existing CSI sensing system is about humans, not about the room they’re in.

This is the gap: no one has attempted to reconstruct general scene geometry — walls, furniture, objects — from WiFi CSI.


CSI Infrastructure: Datasets, Hardware, and Preprocessing

Benchmarks and Datasets

| Dataset | Year | Users | Tasks | Has 3D GT? |
|---|---|---|---|---|
| SenseFi (Yang et al., 2022) | 2022 | 1 | Activity, Gesture | No |
| WiMANS (Huang et al., ECCV 2024) | 2024 | 1–6 | Multi-user Activity | No |
| CSI-Bench (Zhu et al., May 2025) | 2025 | 1 | Multi-task (in-wild) | No |
| ESPARGOS (Euchner, Brink, 2024) | 2024 | 1 | Motion, Localization | No |

The critical gap: no existing CSI dataset includes 3D geometry ground truth — no depth maps, no point clouds, no meshes, no multi-view CSI from known AP poses. Every sensing benchmark is about what the human is doing, not where the walls are.

What a “CSI + Geometry” Dataset Would Need

For even a proof-of-concept, you need:

  • K ≥ 4 WiFi APs at precisely known 3D positions, transmitting and receiving
  • Phase-coherent CSI capture using ESPARGOS or FPGA-based SDR hardware
  • Synchronized RGB-D camera (Azure Kinect, RealSense L515) or LiDAR for ground truth geometry
  • Multi-view RGB for SfM/NeRF baseline comparison
  • Scene diversity: empty rooms, furnished rooms, multiple people, static and dynamic
  • Through-wall scenarios: APs in adjacent rooms to exploit RF’s unique advantage

CSI Capture Hardware

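| Tool | Chipset | Bandwidth | Antennas | Phase-Coherent | Approx. Cost |
|---|---|---|---|---|---|
| Nexmon | Broadcom | 20/40/80 MHz | Up to 4 | Partial | Free (software mod) |
| ESP32-CSI | ESP32 | 20/40 MHz | 1 (ext.) | No | $5 |
| Atheros CSI | ath9k/ath10k | 20/40 MHz | Up to 3 | Partial | Free |
| Intel 5300 | Intel WiFi | 20/40 MHz | 3 | Limited | ~$50 |
| WARP v3 | FPGA SDR | Custom | Up to 4 | Yes | ~$8,000 |
| USRP | NI SDR | Custom | Up to 2 | Yes | ~$2,000+ |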

For geometry reconstruction, phase coherence is essential. ESPARGOS provides phase-coherent capture using arrays of commodity ESP32 chips. FPGA-based SDRs (WARP, USRP) remain the gold standard for research-grade measurements.

CSI Preprocessing

Optimal preprocessing of WiFi CSI for sensing applications (Ratnam et al., 2023) is essential reading. Key steps:

  1. Phase sanitization: remove carrier frequency offset (CFO), sampling frequency offset (SFO), and packet detection delay — random per-packet phase offsets that destroy coherence
  2. Amplitude normalization: automatic gain control (AGC) introduces packet-dependent gain
  3. Conjugate multiplication across antennas: cancels common-mode phase noise while preserving relative phase (critical for AoA estimation)
  4. Outlier removal: CSI is bursty; occasional frames are corrupted
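Steps 1 and 3 can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the procedure from Ratnam et al.: it models the per-packet corruption as a phase that is linear in subcarrier index (slope from timing offsets, intercept from a common phase offset), and the function names and toy data are hypothetical. Note that removing the linear trend also removes the true mean delay, so absolute time of flight is lost.

```python
import numpy as np

def sanitize_phase(csi):
    """Remove a per-packet linear phase trend and common phase offset.

    csi: complex array of shape (n_antennas, n_subcarriers).
    The same correction is applied to every antenna, so phase
    differences between antennas are preserved.
    """
    n_sub = csi.shape[-1]
    k = np.arange(n_sub) - (n_sub - 1) / 2.0            # centered subcarrier index
    phase = np.unwrap(np.angle(csi), axis=-1)           # unwrap along subcarriers
    slope = np.polyfit(k, phase.mean(axis=0), deg=1)[0]
    detrended = csi * np.exp(-1j * slope * k)
    common = np.angle(detrended.sum())                  # circular mean phase
    return detrended * np.exp(-1j * common)

def conjugate_multiply(csi, ref=0):
    """Multiply each antenna by the conjugate of a reference antenna."""
    return csi * np.conj(csi[ref])

# Toy packet: 3 antennas with fixed inter-antenna phase offsets.
n_ant, n_sub = 3, 56
idx = np.arange(n_sub)
antenna_offsets = np.array([0.0, 0.7, 1.4])[:, None]
clean = np.exp(1j * (antenna_offsets + 0.05 * idx))

# The same packet with an assumed random timing + phase offset applied.
corrupted = clean * np.exp(1j * (0.3 * idx + 1.1))

same_after_sanitize = np.allclose(sanitize_phase(corrupted), sanitize_phase(clean))
same_after_conj = np.allclose(conjugate_multiply(corrupted), conjugate_multiply(clean))
print(same_after_sanitize, same_after_conj)
```

Both operations map the corrupted and clean packets to the same result in this toy case, because the corruption is common to all antennas. Offsets that differ per antenna (or per AP) are not removed by either step, which is why hardware-level phase coherence still matters.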

Neural Implicit 3D Reconstruction: The Vision Playbook

If we want CSI to drive 3D reconstruction, we need to understand how cameras do it, and what needs to change.

From Pixels to Geometry

The standard neural reconstruction pipeline:

  1. Multi-view images → COLMAP (SfM) → camera poses + sparse points
  2. For each ray through each camera pixel, query a scene representation, render, and compare against the observation:
    • NeRF: fθ(x, d) → (σ, c), density + color
    • NeuS: fθ(x) → s(x), signed distance
    • 3DGS: explicit 3D Gaussians
    • Rendering: volumetric rendering or rasterization
    • Loss: rendered RGB vs. observed RGB
  3. Extract surface: level set of density / SDF, or directly from Gaussians

The key papers:

  • [NeRF][18] (Mildenhall et al., ECCV 2020): maps 3D position + view direction → volume density + color. Trained with volumetric rendering and photometric loss. Requires known camera poses.
  • [NeuS][19] (Wang et al., NeurIPS 2021): replaces density with signed distance field (SDF). Better surface quality — important when geometry (not appearance) is the goal.
  • [Instant-NGP][20] (Müller et al., SIGGRAPH 2022): multi-resolution hash encoding enables 100–1000× training speedup. Makes NeRF practical on consumer hardware.
  • [3D Gaussian Splatting (3DGS)][21] (Kerbl et al., SIGGRAPH 2023): explicit representation using millions of 3D Gaussians. Real-time rendering, explicit geometry, faster training than NeRF. The current state-of-the-art for most applications.
  • [DUSt3R][22] (Wang et al., CVPR 2024): feed-forward stereo that maps an image pair → 3D pointmap in a single forward pass. Eliminates the separate SfM + MVS pipeline.
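To make "volumetric rendering" concrete, here is a minimal sketch of the quadrature NeRF uses along a single ray: each sample contributes its color weighted by its own opacity and by the transmittance of everything in front of it. The densities and colors are hand-written toy values standing in for network outputs.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """NeRF-style quadrature along one ray.

    sigmas: (n,) densities, colors: (n, 3) RGB, deltas: (n,) segment lengths.
    Returns the rendered color and the per-sample weights.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance T_i
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Toy ray: empty space, then an opaque red surface, then something blue behind it.
t = np.array([0.5, 1.0, 1.5, 2.0])              # sample distances along the ray
sigmas = np.array([0.0, 0.0, 1e3, 1e3])
colors = np.array([[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
deltas = np.full(4, 0.5)

rgb, weights = render_ray(sigmas, colors, deltas)
depth = float((weights * t).sum())              # expected ray termination distance
print(rgb, depth)
```

The blue sample is fully occluded by the red one, so it receives zero weight. The same weights give a depth estimate, which is the quantity a geometry-focused pipeline cares about most.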

What CSI Would Need to Replace

| Vision Component | CSI Analogue | Status |
|---|---|---|
| Pixel colors | CSI complex values (per subcarrier, per antenna pair) | Known |
| Camera ray | TX → RX propagation path(s) | Requires multipath model |
| Photometric loss | “CSI-metric” loss (rendered vs. measured CSI) | Requires differentiable RF model |
| COLMAP | AP self-localization or known positions | APs are usually fixed/known |
| Feature matching (SIFT/SuperPoint) | CSI feature correspondence across APs | Does not exist |

The last row is the killer. In vision, you can match SIFT features between images to establish 2D–2D correspondences, then triangulate 3D points. No equivalent exists for CSI. Two CSI frames from different APs don’t have “features” you can match. This is perhaps the deepest conceptual challenge.

Self-Supervised Pretraining: The Vision Blueprint for CSI

Three models from vision provide the template for a “CSI foundation model”:

  • [DINOv2][23] (Oquab et al., 2023): self-supervised ViT pretraining on massive uncurated image data. Single backbone → excellent features for depth, segmentation, retrieval — zero-shot.
  • [MAE][24] (He et al., 2022): masked autoencoding — mask 75% of image patches, reconstruct. Forces the model to learn scene structure.
  • [SAM][25] (Kirillov et al., 2023): promptable segmentation model that generalizes to unseen domains.

The Katabi group’s 2022 paper already showed that self-supervised contrastive learning works for RF. A CSI-MAE — masking subcarriers or antenna pairs and reconstructing them — could learn useful geometric priors from unlabeled CSI streams. A CSI-DINO — contrastive learning on temporally-adjacent CSI frames — could learn viewpoint-invariant representations across APs.


Non-Visual Neural Implicit Reconstruction: The Templates

This section is the most encouraging. It demonstrates that neural implicit reconstruction from non-optical sensors is not hypothetical — it’s been done.

Acoustic / Sonar → NeRF

Neural Implicit Surface Reconstruction using Imaging Sonar (Qadri, Kaess, Gkioulekas, 2022) is the most directly relevant prior work. Forward-looking sonar (FLS) produces 2D intensity images where each pixel encodes range + azimuth + intensity. Like WiFi CSI, sonar is a non-optical, active sensing modality. Like CSI, the raw measurements are not images — they must be interpreted through a physical model.

The paper represents the scene as a neural SDF and renders sonar images via a physically-based differentiable sonar formation model. The loss compares rendered sonar images against measured ones.

This is the exact template for WiFi CSI → SDF. Replace the sonar formation model with a differentiable RF propagation model, and the pipeline is identical.

Acoustic Neural 3D Reconstruction Under Pose Drift (Lin, Qadri, Zhang, Pediredla, March 2025) extends this to handle unknown sensor poses — joint optimization of surface + pose. This is the “SfM for sonar” equivalent, and the direct template for “SfM for WiFi CSI.”

X-ray / CT → NeRF / 3DGS

X-ray is a transmission-based modality, like WiFi in through-wall mode. Several papers have adapted neural rendering for X-ray by replacing the emission-absorption rendering equation with a pure attenuation model along each ray.

MIMO Radar → Neural Reconstruction

Efficient Physics-Based Learned Reconstruction for Real-Time 3D Near-Field MIMO Radar Imaging (Manisali et al., 2023) is directly relevant because MIMO radar and MIMO WiFi are conceptually identical — multiple transmitters, multiple receivers, exploiting spatial diversity for imaging. The paper uses a physics-based forward model combined with learned reconstruction.

The Unifying Pattern

Every non-visual → neural implicit work follows the same template:

1. Domain-specific differentiable forward model
(physics-based ray-tracing, learned surrogate, or hybrid)
2. Neural scene representation
(NeRF, SDF, or 3DGS — same as vision)
3. Loss: rendered sensor output vs. measured sensor data
(sonar intensity, X-ray attenuation, radar heatmap)
4. Optional: joint pose optimization
(for unknown sensor positions)

WiFi CSI would follow this template exactly. The only domain-specific component is the forward model.


The Bridge That Doesn’t Exist Yet

After an extensive search across arXiv, Semantic Scholar, IEEE Xplore, and Google Scholar, we found no published papers on:

  • WiFi CSI + Neural Radiance Fields (NeRF)
  • WiFi CSI + 3D Gaussian Splatting (3DGS)
  • WiFi CSI + Signed Distance Functions (SDF)
  • WiFi CSI + any neural implicit scene representation
  • WiFi-based Structure from Motion (SfM)
  • WiFi-based dense general scene reconstruction
  • RF-based multi-view geometric correspondence
  • RF-based differentiable rendering for inverse problems

All existing work is either:

  1. CSI → single 2D image (2 papers: 2024, 2025)
  2. CSI → human pose, activity, identity (dozens of papers)
  3. mmWave → 3D point cloud (several papers, different frequency band)
  4. Sonar/X-ray → NeRF (analogous but different physics)

This is a genuinely open research direction.

Why Doesn’t It Exist Yet?

  1. Bandwidth gap: 20–80 MHz WiFi vs. 4 GHz mmWave radar vs. optical cameras. Most researchers who want 3D use mmWave.

  2. Antenna sparsity: 2–4 antennas per AP limit angular resolution to ~30°. A multi-AP distributed aperture could overcome this, but it requires synchronized, phase-coherent capture from multiple APs, which is non-trivial.

  3. Phase coherence challenge: WiFi CSI phase is notoriously noisy and hardware-dependent. Geometry requires coherent phase across antennas. Most commodity chipsets don’t guarantee this. ESPARGOS is the first practical solution.

  4. No dataset: No paired CSI + 3D geometry dataset exists. Building one requires synchronized CSI capture + LiDAR or depth cameras in controlled environments — expensive and time-consuming.

  5. The chicken-and-egg problem: research funding goes to what’s proven feasible; feasibility is proven by research.

  6. Computer vision dominates: most researchers who care about 3D reconstruction work in vision and don’t think about RF. Most wireless researchers who work on CSI sensing don’t know about NeRF/3DGS. The intersection of these communities is nearly empty.


Proposed Architecture: A Unified CSI-to-Geometry Pipeline

If you were to build this system today, here is the architecture.

Component 1: CSI Backbone Encoder

Three architecture options, ordered by complexity:

  1. 3D CNN: treat the CSI tensor as a spatiotemporal volume — (APs × antenna pairs × subcarriers × time). Proven in WiFi sensing benchmarks (SenseFi).

  2. Transformer: CSI tokens per (AP, antenna pair, subcarrier) with learned positional encodings. Better for capturing long-range dependencies across APs and subcarriers. The Katabi group’s self-supervised encoder uses a Transformer variant.

  3. Graph Neural Network: nodes = antennas, edges = CSI measurements between antenna pairs. Naturally captures the MIMO topology. More principled, less explored.

Pretraining strategy — inspired by MAE for vision:

  • Mask random subsets of subcarriers or antenna pairs (50–75%)
  • Train the backbone to reconstruct the masked values
  • This forces the model to learn the correlations that encode scene geometry
  • Unlabeled CSI streams from any deployment provide effectively unlimited training data
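The masking step and the masked-only loss can be sketched as follows. This is a minimal illustration of the data side of such a pretraining loop; the backbone itself is omitted, and `mask_subcarriers` and `masked_mse` are hypothetical helper names, not part of any existing CSI library.

```python
import numpy as np

def mask_subcarriers(csi, mask_ratio=0.75, rng=None):
    """Zero out a random subset of subcarriers; return (masked_csi, mask)."""
    rng = np.random.default_rng() if rng is None else rng
    n_sub = csi.shape[-1]
    n_masked = int(round(mask_ratio * n_sub))
    masked_idx = rng.choice(n_sub, size=n_masked, replace=False)
    mask = np.zeros(n_sub, dtype=bool)
    mask[masked_idx] = True
    masked = csi.copy()
    masked[..., mask] = 0          # hide the selected subcarriers on every antenna pair
    return masked, mask

def masked_mse(pred, target, mask):
    """Reconstruction loss computed only on the masked subcarriers, as in MAE."""
    err = pred[..., mask] - target[..., mask]
    return float(np.mean(np.abs(err) ** 2))

# Placeholder CSI frame (random values, 2x2 MIMO, 234 subcarriers).
rng = np.random.default_rng(0)
csi = rng.standard_normal((2, 2, 234)) + 1j * rng.standard_normal((2, 2, 234))

masked, mask = mask_subcarriers(csi, mask_ratio=0.75, rng=rng)
baseline_loss = masked_mse(masked, csi, mask)   # loss of a "model" that predicts zeros
print(int(mask.sum()), baseline_loss)
```

A backbone would take `masked` as input and be trained to drive `masked_mse(prediction, csi, mask)` below the zero-prediction baseline. Masking whole antenna pairs instead of subcarriers only changes which axis the mask indexes.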

Component 2: Multi-Task Prediction Heads

| Head | Output | Loss | Rationale |
|---|---|---|---|
| Depth Head | Per-pixel depth map aligned to scene coordinate frame | L1/L2 against RGB-D ground truth | Direct geometry supervision |
| Pose Head | 6-DoF pose of each AP | L2 against known AP positions | Enables joint optimization when AP positions are uncertain |
| Feature Head | Dense per-pixel feature descriptors | Contrastive loss across AP views | The “SuperPoint for CSI”: enables multi-view correspondence |
| Image Head | Synthesized RGB/depth image | Reconstruction + perceptual + adversarial | LatentCSI approach: pretrained diffusion decoder |
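For the Feature Head, one plausible choice of contrastive loss is InfoNCE, sketched below. This is an assumption about how such a head might be trained, not a method taken from the cited papers; the feature vectors are random placeholders for descriptors of the same scene region seen from two APs.

```python
import numpy as np

def info_nce(feat_a, feat_b, temperature=0.1):
    """InfoNCE loss: row i of feat_a and row i of feat_b form a positive pair."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                        # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
view_1 = rng.standard_normal((16, 32))                      # 16 descriptors, 32-dim
view_2 = view_1 + 0.05 * rng.standard_normal((16, 32))      # same regions, slightly perturbed
shuffled = np.roll(view_1, 1, axis=0)                       # deliberately wrong pairing

loss_matched = info_nce(view_1, view_2)
loss_mismatched = info_nce(view_1, shuffled)
print(loss_matched, loss_mismatched)
```

The loss is low when corresponding descriptors agree across views and high when the pairing is wrong, which is the training signal a "SuperPoint for CSI" would need. The hard part, as noted above, is obtaining the positive pairs in the first place.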

Component 3: Neural Implicit Scene Representation

The choice of representation depends on the goal:

  • SDF (NeuS-style): best surface quality. Important when the output is a clean mesh, not photorealistic rendering.
  • 3DGS: explicit geometry, real-time rendering, faster training. Best if the output needs to be interactive.
  • Instant-NGP: fastest training, good for rapid prototyping and ablation studies.

For a proof-of-concept, start with Instant-NGP for speed, then upgrade to SDF for quality.

Component 4: Differentiable RF Propagation Model

This is the hardest and most novel component. Unlike cameras where the rendering equation (emission-absorption along a ray) is well-established, RF propagation involves:

  • Reflection: signals bounce off surfaces at specular angles
  • Diffraction: signals bend around edges (important for through-wall)
  • Scattering: signals scatter from rough surfaces (diffuse component)
  • Multipath: the receiver sees the superposition of all paths

Three approaches:

  1. Ray-tracing (most principled): assume geometric optics. Trace rays from each TX antenna, reflect off SDF surfaces, accumulate at RX. At each intersection, compute path length, attenuation, and phase shift. Sum all paths coherently. Differentiable because the SDF is differentiable — you can backpropagate through the ray-surface intersection.

    Pros: physically interpretable, proven in sonar NeRF. Cons: only models specular paths, misses diffuse scattering; computationally expensive.

  2. Learned forward model: train a neural network CSI_render(scene_params, TX_pos, RX_pos) → CSI_predicted on paired (geometry, CSI) data. The network learns the complex propagation physics implicitly. Freeze the model and use it for inverse rendering.

    Pros: captures all propagation effects, fast inference. Cons: requires large paired dataset for training; may not generalize to unseen geometries.

  3. Hybrid: ray-tracing for specular paths + learned residual for diffuse scattering. Balances interpretability and completeness.

The ray-tracing approach is the most practical starting point, mirroring how Qadri et al. (2022) handled sonar propagation.
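A toy version of the ray-tracing idea, heavily simplified: a 2D scene with one infinite planar wall, a direct path plus a single specular bounce computed with the image method, and a brute-force search over the wall position in place of gradient-based optimization through an SDF. The geometry, the reflectivity of 0.5, and the 1/d amplitude model are all assumptions made for illustration.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def render_csi(wall_y, tx, rx, freqs, reflectivity=0.5):
    """Direct path + one specular bounce off an infinite wall at y = wall_y.

    Image method: mirror the TX across the wall; the reflected path length
    equals the distance from the mirrored TX to the RX.
    """
    tx, rx = np.asarray(tx, float), np.asarray(rx, float)
    tx_image = np.array([tx[0], 2.0 * wall_y - tx[1]])
    d_direct = np.linalg.norm(rx - tx)
    d_reflect = np.linalg.norm(rx - tx_image)
    h = np.zeros(len(freqs), dtype=complex)
    for d, gain in ((d_direct, 1.0), (d_reflect, reflectivity)):
        h += (gain / d) * np.exp(-2j * np.pi * freqs * d / C)   # spreading loss + phase
    return h

def csi_loss(wall_y, measured, tx, rx, freqs):
    """Mean squared error between rendered and measured CSI."""
    return float(np.mean(np.abs(render_csi(wall_y, tx, rx, freqs) - measured) ** 2))

freqs = 5.18e9 + np.linspace(-40e6, 40e6, 234)
tx, rx = (0.0, 0.0), (3.0, 0.0)
true_wall = 2.5
measured = render_csi(true_wall, tx, rx, freqs)      # noiseless synthetic "measurement"

# Brute-force inverse rendering: scan candidate wall positions at 1 mm steps.
candidates = np.arange(1.0, 4.0, 0.001)
losses = np.array([csi_loss(y, measured, tx, rx, freqs) for y in candidates])
best = float(candidates[np.argmin(losses)])
print(best)
```

The search recovers the wall position here only because the measurement is noiseless and the scene has a single unknown. The loss curve oscillates at centimeter scale because phase wraps every wavelength of path-length change, so a gradient-based version of this fit would need a good initialization or a coarse-to-fine schedule.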

Component 5: Loss Functions

$$
\begin{aligned}
L_{\text{total}} &= \lambda_1 L_{\text{csi}} + \lambda_2 L_{\text{depth}} + \lambda_3 L_{\text{feat}} + \lambda_4 L_{\text{reg}} \\[4pt]
L_{\text{csi}} &= \lVert \mathrm{CSI}_{\text{rendered}} - \mathrm{CSI}_{\text{measured}} \rVert^2 && \text{(per subcarrier, antenna pair, AP)} \\[4pt]
L_{\text{depth}} &= \lVert D_{\text{pred}} - D_{\text{gt}} \rVert_1 && \text{(if depth supervision available)} \\[4pt]
L_{\text{feat}} &= \text{contrastive}(\text{feat}_{\text{view1}},\ \text{feat}_{\text{view2}}) && \text{(self-supervised)} \\[4pt]
L_{\text{img}} &= \text{reconstruction} + \text{perceptual (LPIPS)} && \text{(auxiliary, via LatentCSI head)} \\[4pt]
L_{\text{reg}} &= \big(\lVert \nabla \text{SDF} \rVert - 1\big)^2 + \text{TV}(\text{density}) && \text{(regularization)}
\end{aligned}
$$
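The SDF part of the regularizer is the Eikonal condition: a valid signed distance function has a gradient of unit norm everywhere. A minimal sketch, assuming the common squared form of the penalty and using finite differences on a grid instead of autodiff through a network:

```python
import numpy as np

def eikonal_loss(sdf_grid, spacing):
    """Mean squared deviation of |grad SDF| from 1, via finite differences."""
    grads = np.gradient(sdf_grid, spacing)            # d/dx, d/dy, d/dz
    grad_norm = np.sqrt(sum(g ** 2 for g in grads))
    return float(np.mean((grad_norm - 1.0) ** 2))

# Sphere of radius 0.5 sampled on a regular grid.
axis = np.linspace(-1.0, 1.0, 41)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
spacing = axis[1] - axis[0]

sphere_sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5        # a true signed distance field
scaled = 2.0 * sphere_sdf                             # same zero level set, but |grad| = 2

loss_valid = eikonal_loss(sphere_sdf, spacing)
loss_invalid = eikonal_loss(scaled, spacing)
print(loss_valid, loss_invalid)
```

Both fields describe the same surface, but only the first is a distance field. The Eikonal term removes that scale ambiguity, which matters when the data term (CSI here, images in NeuS) constrains the surface only weakly.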

Multi-AP Distributed Aperture

The single biggest lever for improving angular resolution is using multiple APs as a distributed aperture:

$$\text{Angular resolution} \approx \frac{\lambda}{N_{\text{effective}} \cdot d}$$

With $K$ APs, each with $N$ antennas at spacing $d$:

$$N_{\text{effective}} \approx K \cdot N \quad \text{(if phase-coherent across APs)}$$

For $K = 6$ APs, $N = 2$ antennas each, $d = \lambda/2$:

$$\text{Angular resolution} \approx \frac{\lambda}{12 \cdot (\lambda/2)} = \frac{1}{6}\ \text{rad} \approx 9.5^\circ$$

This is still worse than cameras (~0.01°), but:

  • Combined with temporal integration (CSI at 100+ Hz), synthetic aperture processing can push further
  • Neural super-resolution can push beyond the Rayleigh limit
  • WiFi 7 (802.11be) with 16 antennas per AP will improve this dramatically
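The distributed-aperture arithmetic can be reproduced in a few lines. This uses the same rule-of-thumb formula as above, which treats all K·N antennas as one filled uniform array; that is an idealization for APs placed meters apart.

```python
import math

def angular_resolution_deg(n_aps, antennas_per_ap, spacing_wavelengths=0.5):
    """theta ~ lambda / (N_eff * d), with N_eff = K * N and d given in wavelengths."""
    n_eff = n_aps * antennas_per_ap
    return math.degrees(1.0 / (n_eff * spacing_wavelengths))

for k, n in ((1, 2), (1, 4), (6, 2)):
    print(f"K={k}, N={n}: {angular_resolution_deg(k, n):.1f} deg")
```

Resolution improves only linearly with the number of coherent antennas, so closing the gap to camera-level angular resolution by adding APs alone is not realistic; the other levers in the list above have to carry most of the load.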

Benefits: Why WiFi CSI for 3D Reconstruction?

This isn’t just academic curiosity. There are compelling practical motivations:

1. Privacy by Design

WiFi signals don’t capture recognizable faces, text, or fine visual details. A CSI-based reconstruction system yields only coarse geometric silhouettes, not photographic images of people (although, as the RF re-identification work above shows, body shape and gait can still distinguish individuals). This makes it suitable for deployment in sensitive environments:

  • Elder care monitoring: detect falls and unusual movement patterns without cameras in bedrooms or bathrooms
  • Smart homes: presence detection and activity monitoring without visual surveillance
  • Healthcare facilities: patient monitoring where cameras would violate dignity

2. Through-Wall Sensing

RF signals penetrate drywall, wood, glass, and most common building materials. This enables:

  • Multi-room monitoring from a single sensor deployment
  • Search and rescue: locate people behind walls in emergency scenarios
  • Security: detect intruders in adjacent rooms before entry

No camera-based system can do this.

3. Lighting Independence

WiFi works in complete darkness, smoke, fog, and direct sunlight — all conditions where cameras fail. This makes CSI sensing uniquely suitable for:

  • 24/7 operation without lighting infrastructure
  • Industrial environments with heavy particulates
  • Outdoor night operation without IR illuminators

4. Ubiquity and Cost

WiFi infrastructure already exists in virtually every building. A CSI-based sensing system can leverage existing routers — no additional hardware deployment.

| Sensor | Approx. Cost (per room) |
|---|---|
| WiFi AP (commodity) | $30–100 |
| RGB camera | $20–200 |
| RGB-D camera (Kinect/RealSense) | $200–500 |
| LiDAR (Ouster/Velodyne) | $500–10,000 |
| mmWave radar | $100–500 |

A room-scale WiFi deployment costs an order of magnitude less than LiDAR.

5. The Unified Inference Advantage

A single CSI backbone handling multiple geometric tasks — depth, pose, correspondence, imaging — offers:

  • Data efficiency: representations learned for one task transfer to others
  • Compute efficiency: one backbone, many heads (vs. N separate models)
  • Self-supervision: unlabeled CSI streams from any deployment provide free pretraining data
  • Natural fusion interface: a unified CSI encoder is a natural counterpart to a visual encoder for multi-modal systems

Drawbacks and Fundamental Limitations

We must be honest about what stands in the way.

Physics Constraints (Hard Limits)

Range Resolution: Δr = c/(2B) = 1.875 meters at 80 MHz. Two objects 1 meter apart in depth produce indistinguishable CSI without additional geometric constraints.

Mitigation: Multi-AP diversity provides angular disambiguation. Neuro-Wideband CSI extrapolation (Ji et al., Jan 2026) — a neural network that extrapolates narrowband CSI to wideband-equivalent CSI — is the most promising approach, effectively creating virtual bandwidth.

Angular Resolution: with N=2 antennas at λ/2 spacing, angular resolution ≈ 1 radian (57°). Even with N=4: ≈ 0.5 rad (29°). This is the single biggest limitation.

Mitigation: Distributed aperture with K receiving APs provides effective N >> 4 through phase-coherent combining. Neural super-resolution can push beyond the classical Rayleigh limit, as demonstrated in radar processing.

Phase Noise: WiFi chipsets introduce random per-packet phase offsets from oscillator drift, AGC changes, and carrier frequency offset. These destroy the phase relationships essential for geometry.

Mitigation: Phase sanitization algorithms (Ratnam et al., 2023). ESPARGOS (2024) provides phase-coherent capture on commodity hardware. Conjugate multiplication across antennas cancels common-mode noise while preserving relative phase.

The Inverse Problem is Pathological: for any given CSI measurement, infinitely many scene configurations produce identical multipath. The mapping from scene geometry to CSI is many-to-one.

Mitigation: This is where neural priors shine. A network trained on thousands of (scene, CSI) pairs learns the manifold of plausible scenes and can resolve ambiguities that classical inverse methods cannot. This is the same principle that enables single-image depth estimation — another severely ill-posed problem that neural networks solve surprisingly well.

Practical Limitations

No Dataset: training needs paired CSI + 3D geometry. Zero such datasets exist. Building even a small one (100 scenes) requires months of controlled capture with synchronized hardware.

Scene Specificity: CSI is highly sensitive to the specific furniture, wall materials, and geometry of each room. A model trained in one room may not transfer to another without extensive domain adaptation or retraining.

Dynamic Scenes: moving people change CSI constantly. Separating static scene geometry from dynamic occluders is an open problem even in camera-based reconstruction — and CSI has much lower resolution to work with.

No Standardized Benchmark: unlike vision (ScanNet, Matterport3D, Replica), WiFi sensing has no standardized geometry benchmark. Comparing methods is currently impossible.

Cross-Modality Comparison

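| Modality | 3D Quality | Privacy | Through-Wall | Cost | Maturity |
|---|---|---|---|---|---|
| RGB Camera | ★★★★★ | ★☆☆☆☆ | ★☆☆☆☆ | ★★★★★ | ★★★★★ |
| RGB-D (Kinect) | ★★★★★ | ★☆☆☆☆ | ★☆☆☆☆ | ★★★★☆ | ★★★★★ |
| LiDAR | ★★★★★ | ★☆☆☆☆ | ★☆☆☆☆ | ★★☆☆☆ | ★★★★★ |
| mmWave Radar (60 GHz) | ★★★☆☆ | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★★☆ |
| Thermal IR | ★★★☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★☆☆ | ★★★☆☆ |
| WiFi CSI | ★★☆☆☆ | ★★★★★ | ★★★★★ | ★★★★★ | ★★☆☆☆ |
| Sonar | ★★☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ |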

WiFi CSI scores perfectly on privacy, through-wall capability, and cost. It scores poorly on 3D quality. The question is whether neural methods can elevate that ★★ to ★★★ or ★★★★.


Multi-Modal Fusion: The Realistic Near-Term Path

Pure WiFi CSI → 3D geometry is likely 5–10 years out. But CSI as a complementary modality in a multi-sensor pipeline is feasible now.

CSI-Camera Fusion

flowchart TD
    Camera["Camera (RGB/RGB-D)"]
    WiFi["WiFi APs (1..K)"]
    VisualEncoder["Visual Encoder (DINOv2)"]
    CSIEncoder["CSI Encoder"]
    Fusion["Cross-Modal Fusion<br><small>(CLIP-style alignment)</small>"]
    Geometry["Neural Scene Geometry<br><small>(NeRF / 3DGS / SDF)</small>"]
    Camera --> VisualEncoder
    WiFi --> CSIEncoder
    VisualEncoder --> Fusion
    CSIEncoder --> Fusion
    Fusion --> Geometry
    classDef default fill:#f9f9f9,stroke:#333,stroke-width:1px
    classDef node fill:#d1d1d1,stroke:#2c3e50,stroke-width:2px
    class Camera,WiFi,VisualEncoder,CSIEncoder,Fusion,Geometry node

Where CSI provides what cameras cannot:

  1. Through-wall geometry: when the camera is occluded, CSI provides coarse structure
  2. Darkness/smoke operation: CSI works when the camera is blind
  3. Metric scale: CSI time-of-flight provides absolute distances, whereas monocular camera reconstructions are scale-ambiguous
  4. Optimization guidance: CSI features guide NeRF away from local minima, especially in textureless regions
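
The metric-scale point in item 3 reduces to simple arithmetic, sketched below. The sketch assumes a one-way, line-of-sight time of flight between two devices whose positions also appear in the camera reconstruction; the function names and the example numbers are hypothetical.

```python
C = 299_792_458.0  # speed of light in m/s

def tof_to_distance(tof_seconds):
    """Convert a one-way time of flight into a distance in metres."""
    return C * tof_seconds

def metric_scale(tof_seconds, unitless_distance):
    """Scale factor that maps reconstruction units onto metres.

    unitless_distance is the distance between the same two devices as
    measured inside the scale-ambiguous camera reconstruction.
    """
    return tof_to_distance(tof_seconds) / unitless_distance

def rescale(points, scale):
    """Apply a uniform scale to a list of (x, y, z) points."""
    return [(x * scale, y * scale, z * scale) for x, y, z in points]

# Hypothetical example: a 20 ns flight time, and the same pair of devices
# sitting 2.0 units apart in the camera reconstruction.
scale = metric_scale(20e-9, 2.0)
metric_points = rescale([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], scale)
```

One reliable range is enough to fix the global scale of a rigid reconstruction. In practice several AP pairs would be averaged, since commodity time-of-flight estimates are noisy.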

Use Case: “WiFi as Coarse Prior”

The most practical architecture:

  1. WiFi CSI produces a low-resolution (~10 cm) SDF of the scene
  2. This SDF is used as initialization for a camera-based NeRF or 3DGS
  3. The camera fills in fine detail; WiFi provides global structure and scale

This is analogous to how COLMAP provides sparse points for NeRF initialization. WiFi could serve the same role, including in scenarios where COLMAP fails (textureless walls, occluded regions, darkness).
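
As a rough illustration of the "coarse prior" idea, the sketch below samples a 10 cm SDF grid and keeps the voxel centres near the zero level set as seed points, the kind of output that could initialize 3DGS Gaussians or bias NeRF sampling. An analytic box room stands in for the network-predicted SDF, and `box_sdf`, `coarse_sdf_grid`, and `surface_seeds` are hypothetical names, not part of any existing pipeline.

```python
import math

def box_sdf(point, half_extents):
    """Signed distance to an axis-aligned box centred at the origin (negative inside)."""
    q = [abs(c) - h for c, h in zip(point, half_extents)]
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    inside = min(max(q), 0.0)
    return outside + inside

def coarse_sdf_grid(sdf, half_extents, voxel=0.10, margin=0.20):
    """Sample sdf at the voxel centres of a grid covering at least the room plus a margin."""
    counts = [math.ceil(2 * (h + margin) / voxel) for h in half_extents]
    samples = []
    for i in range(counts[0]):
        for j in range(counts[1]):
            for k in range(counts[2]):
                centre = tuple(
                    -(h + margin) + (idx + 0.5) * voxel
                    for idx, h in zip((i, j, k), half_extents)
                )
                samples.append((centre, sdf(centre, half_extents)))
    return samples

def surface_seeds(samples, voxel=0.10):
    """Keep voxel centres that lie within one voxel of the zero level set."""
    return [p for p, d in samples if abs(d) <= voxel]

# Stand-in for a CSI-predicted SDF: a 4.0 m x 3.0 m x 2.5 m room.
room = (2.0, 1.5, 1.25)
samples = coarse_sdf_grid(box_sdf, room)
seeds = surface_seeds(samples)
```

The seeds play the role that COLMAP's sparse points play in a camera-only pipeline: they say roughly where the surfaces are, and the camera-based optimization refines from there.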


Research Roadmap

Phase 1: Foundation (Year 1–2) — Feasible Now

Goal: prove CSI → coarse geometry is possible.

Tasks:

  1. Build a small CSI + RGB-D dataset: ~20 scenes, 4 APs, static environments, using ESPARGOS for phase-coherent capture. Each scene has ground truth geometry from an RGB-D camera.

  2. CSI → depth network: adapt existing CSI-to-image architectures (Through-Wall Imaging 2024, LatentCSI 2025) to output depth maps instead of RGB. Train supervised on the CSI-depth pairs.

  3. Differentiable RF ray-tracing renderer: implement on top of a neural SDF. Validate on simulated data first: generate CSI from known geometry, then reconstruct the geometry from that CSI.

  4. Reproduce the sonar → NeRF pipeline (Qadri et al., 2022) as a baseline, then adapt for CSI.

  5. Demonstrate CSI → SDF reconstruction of simple geometric shapes (box, cylinder, sphere, wall, table) in a controlled indoor environment.
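
The simulate-then-reconstruct validation loop in task 3 can be sketched in a toy 2D setting. This is a minimal illustration under strong assumptions, not the proposed renderer: the scene is one flat wall at `x = wall_x`, the channel has only a line-of-sight path and one mirror reflection, the CSI is noise-free, and a grid search stands in for gradient descent through a differentiable renderer. All names, positions, and the reflectivity value are hypothetical.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s

def subcarrier_freqs(center_hz, bandwidth_hz, n):
    """Evenly spaced subcarrier frequencies across the band."""
    step = bandwidth_hz / n
    return [center_hz + (k - n / 2) * step for k in range(n)]

def simulate_csi(tx, rx, wall_x, freqs, reflectivity=0.6):
    """Two-path channel: line of sight plus one mirror reflection off x = wall_x."""
    image_tx = (2.0 * wall_x - tx[0], tx[1])  # image-source method
    paths = [(math.dist(tx, rx), 1.0), (math.dist(image_tx, rx), reflectivity)]
    csi = []
    for f in freqs:
        h = 0j
        for length, gain in paths:
            # Amplitude falls off with path length; phase advances with delay.
            h += (gain / length) * cmath.exp(-2j * math.pi * f * length / C)
        csi.append(h)
    return csi

def residual(a, b):
    """Sum of squared complex differences between two CSI vectors."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b))

def fit_wall(measured, tx, rx, freqs, candidates):
    """Analysis by synthesis: pick the wall position whose simulated CSI fits best."""
    return min(candidates, key=lambda w: residual(measured, simulate_csi(tx, rx, w, freqs)))

tx, rx = (1.0, 1.0), (3.0, 2.0)
freqs = subcarrier_freqs(5.18e9, 80e6, 64)
measured = simulate_csi(tx, rx, 5.0, freqs)          # "ground truth" wall at x = 5.0 m
candidates = [c / 100 for c in range(400, 601, 5)]   # 4.00 m to 6.00 m in 5 cm steps
estimate = fit_wall(measured, tx, rx, freqs, candidates)
```

Because the measurement is produced by the same forward model, the search recovers the wall position exactly. With real CSI, the per-packet phase offsets and unmodelled paths discussed earlier make the residual surface far less forgiving, which is why the simulated check is only a first sanity test.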

Deliverable: proof-of-concept paper: “CSI-NeRF: Coarse Neural Implicit Scene Reconstruction from WiFi CSI.”

Phase 2: Scaling (Year 2–4) — Ambitious

Goal: generalize to unknown AP poses and complex scenes.

Tasks:

  1. Scale dataset: 500+ scenes with diverse geometry, furniture, materials.

  2. Joint optimization of SDF + AP poses: the “CSI-SfM” problem. Use the template from Acoustic Neural 3D Reconstruction Under Pose Drift (Lin et al., 2025).

  3. Integrate Neuro-Wideband CSI extrapolation (Ji et al., 2026) to improve effective resolution.

  4. Self-supervised CSI-MAE pretraining on unlabeled CSI streams from diverse environments. Release as a pretrained backbone for the community.

  5. Unified encoder with multiple heads: depth, pose, features, images, all from one backbone.

  6. Benchmark against camera-based SfM/NeRF on shared scenes to quantify the gap.
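
For task 4, the masking step of MAE-style pretraining might look like the sketch below when applied to a CSI frame. It assumes the frame is a packets × subcarriers grid of amplitudes, borrows the 75% mask ratio from the vision MAE paper [24], and uses hypothetical helper names; the encoder, decoder, and reconstruction loss are omitted.

```python
import random

def patchify(frame, patch_t, patch_f):
    """Split a (time x subcarrier) grid into non-overlapping patches, row-major."""
    num_t, num_f = len(frame), len(frame[0])
    patches = []
    for t0 in range(0, num_t - patch_t + 1, patch_t):
        for f0 in range(0, num_f - patch_f + 1, patch_f):
            patches.append([frame[t][f0:f0 + patch_f] for t in range(t0, t0 + patch_t)])
    return patches

def random_mask(num_patches, mask_ratio, rng):
    """Return (visible, masked) patch indices for one training sample."""
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])

rng = random.Random(0)

# Hypothetical frame: 16 packets x 64 subcarriers of CSI amplitudes.
frame = [[rng.random() for _ in range(64)] for _ in range(16)]

patches = patchify(frame, patch_t=4, patch_f=8)           # 4 x 8 = 32 patches
visible, masked = random_mask(len(patches), 0.75, rng)    # encoder sees only `visible`
```

The encoder would process only the visible patches and a light decoder would be trained to regress the masked ones. Whether 75% is the right ratio for CSI, which is far more redundant along the subcarrier axis than images are spatially, is an open question.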

Deliverable: CSI provides geometry within 10–20 cm accuracy in known environments. CSI-MAE encoder enables transfer learning across scenes and tasks.

Phase 3: Production (Year 5+) — Speculative

Goal: WiFi CSI as a practical 3D sensing modality.

  1. WiFi 7 hardware (802.11be): 16 antennas per AP, 320 MHz bandwidth. That is 4× the antennas and 4× the bandwidth of a typical 4-antenna, 80 MHz 802.11ac AP (and twice the 160 MHz maximum of 802.11ac), which substantially improves spatial resolution.

  2. CSI Foundation Model: pretrained on millions of CSI-hours from diverse deployments. Finetunable for any geometric task.

  3. Real-time CSI-NeRF: running on-device on WiFi router hardware.

  4. Multi-modal foundation model: single architecture handling CSI + camera + IMU + audio.
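
To put the hardware numbers in item 1 in perspective, the sketch below applies two textbook rules of thumb: path-length resolution of roughly c / B for a one-way link, and a broadside beamwidth of roughly λ / (N·d) for a uniform linear array with half-wavelength spacing. Both are approximations, the 80 MHz, 4-antenna baseline is an assumed comparison point, and the function names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def path_resolution_m(bandwidth_hz):
    """Smallest path-length difference separable by delay alone, about c / B."""
    return C / bandwidth_hz

def beamwidth_deg(num_antennas, spacing_wavelengths=0.5):
    """Rule-of-thumb broadside beamwidth of a uniform linear array, lambda / (N * d)."""
    return math.degrees(1.0 / (num_antennas * spacing_wavelengths))

for label, bandwidth_hz, antennas in [
    ("80 MHz, 4 antennas", 80e6, 4),
    ("320 MHz, 16 antennas", 320e6, 16),
]:
    print(
        f"{label}: {path_resolution_m(bandwidth_hz):.2f} m path resolution, "
        f"{beamwidth_deg(antennas):.1f} deg beamwidth"
    )
```

Under these approximations the assumed baseline resolves paths about 3.75 m apart with a beam about 28.6° wide, while the WiFi 7 figures give about 0.94 m and 7.2°. That is a large relative gain, but still well short of the centimeter-level detail cameras deliver.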


Summary Table: What Exists and What Doesn’t

CapabilityStatusKey Papers
CSI → human activity recognitionMatureSenseFi 13, WiMANS 14
CSI → 2D human poseMatureVST-Pose 12, Katabi 5
CSI → person identificationEmergingKatabi 3
CSI → through-wall synthesized imageEarlyStrohmayer 6, LatentCSI 7
CSI → 3D human mesh (mmWave)EarlyImmFusion 11
mmWave → dense point cloudEmergingDeepPoint 8, 3DRIMR 9, R2P 10
Sonar → neural implicit surfaceProvenQadri et al. 26
X-ray → NeRF/3DGSProvenCai et al. 28
CSI → scene depth mapNone
CSI → multi-view correspondenceNone
CSI → SfM camera/AP poseNone
CSI → dense point cloudNone
CSI → NeRF/3DGS/SDFNone
WiFi sensing foundation modelNone

Conclusion

“Neural Implicit Scene Reconstruction from WiFi CSI Observations” sits at a genuinely novel intersection of wireless sensing and neural rendering. The building blocks are real — CSI sensing, neural implicit representations, and non-visual NeRF analogues — but no one has connected them.

The honest assessment:

Pure WiFi CSI directly to dense 3D geometry is not feasible today. Current commodity hardware does not meet the physical constraints: the bandwidth limits range resolution to meters, the antenna count limits angular resolution to tens of degrees, and no paired dataset exists for training. The first two are physics constraints, not algorithmic ones: you cannot reconstruct what you cannot measure. However, a unified neural inference architecture analogous to DINO, MAE, and SAM, one that uses the latent geometry of visual foundation models as a regularization manifold for RF scene inference, is possible, especially when used for VGGT-style inference.

LatentCSI (2025) shows that pretrained diffusion priors can compensate for CSI’s information poverty.

The first focus should be a controlled environment, simple geometry, and known AP positions. Use the sonar→NeRF template and the LatentCSI paradigm as proven building blocks. Position the work as “WiFi CSI as a coarse geometric prior for neural scene reconstruction” rather than “WiFi replaces cameras for 3D.”


References

RF Perception (Katabi Group, MIT CSAIL)

[1] D. Vasisht, S. Kumar, and D. Katabi, “Sub-Nanosecond Time of Flight on Commercial Wi-Fi Cards,” 2015.

[2] T. Li, L. Fan, M. Zhao, Y. Liu, and D. Katabi, “Making the Invisible Visible: Action Recognition Through Walls and Occlusions,” 2019.

[3] L. Fan, T. Li, R. Fang, R. Hristov, Y. Yuan, and D. Katabi, “Learning Longterm Representations for Person Re-Identification Using Radio Signals,” 2020.

[4] L. Fan, T. Li, Y. Yuan, and D. Katabi, “In-Home Daily-Life Captioning Using Radio Signals,” 2020.

[5] T. Li, L. Fan, Y. Yuan, and D. Katabi, “Unsupervised Learning for Human Sensing Using Radio Signals,” IEEE WACV, 2022.

WiFi CSI → Image / Depth

[6] J. Strohmayer, R. Sterzinger, C. Stippel, and M. Kampel, “Through-Wall Imaging based on WiFi Channel State Information,” 2024, updated 2025.

[7] E. Ramesh and T. Nishio, “High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model,” June 2025. (LatentCSI)

Bandwidth Enhancement

S. Ji, W. Hou, and C. Wu, “Neuro-Wideband WiFi Sensing via Self-Conditioned CSI Extrapolation,” January 2026.

mmWave → 3D

[8] Y. Sun, H. Zhang, Z. Huang, and B. Liu, “DeepPoint: A Deep Learning Model for 3D Reconstruction in Point Clouds via mmWave Radar,” 2021.

[9] Y. Sun, Z. Huang, H. Zhang, Z. Cao, and D. Xu, “3DRIMR: 3D Reconstruction and Imaging via mmWave Radar based on Deep Learning,” 2021.

[10] Y. Sun, H. Zhang, Z. Huang, and B. Liu, “R2P: A Deep Learning Model from mmWave Radar to Point Cloud,” 2022.

[11] A. Chen et al., “ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather Conditions,” 2022.

WiFi Pose Estimation

[12] X. Zhang, Z. Ye, J. Zhang, X. Tian, Z. Liang, and S. Yu, “VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation,” 2025.

WiFi CSI Hardware, Datasets, Preprocessing

[13] J. Yang, X. Chen, D. Wang, H. Zou, C. X. Lu, S. Sun, and L. Xie, “SenseFi: A Library and Benchmark on Deep-Learning-Empowered WiFi Human Sensing,” 2022.

[14] S. Huang, K. Li, D. You, Y. Chen, A. Lin, S. Liu, X. Li, and J. A. McCann, “WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing,” ECCV, 2024.

[15] G. Zhu, Y. Hu, W. Gao, W.-H. Wang, B. Wang, and K. J. R. Liu, “CSI-Bench: A Large-Scale In-the-Wild Dataset for Multi-task WiFi Sensing,” May 2025.

[16] F. Euchner and S. ten Brink, “ESPARGOS: Phase-Coherent WiFi CSI Datasets for Wireless Sensing Research,” 2024.

[17] V. V. Ratnam, H. Chen, H. H. Chang, A. Sehgal, and J. Zhang, “Optimal preprocessing of WiFi CSI for sensing applications,” 2023.

CSI Compression

J. Yang, X. Chen, H. Zou, D. Wang, and Q. Xu, “EfficientFi: Towards Large-Scale Lightweight WiFi Sensing via CSI Compression,” 2022.

B. Barahimi, H. Singh, H. Tabassum, O. Waqar, and M. Omer, “RSCNet: Dynamic CSI Compression for Cloud-based WiFi Sensing,” 2024.

Neural Implicit Representations (Vision)

[18] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields,” ECCV, 2020.

[19] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction,” NeurIPS, 2021.

[20] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant Neural Graphics Primitives with a Multiresolution Hash Encoding,” SIGGRAPH, 2022.

[21] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian Splatting for Real-Time Radiance Field Rendering,” SIGGRAPH, 2023.

[22] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “DUSt3R: Geometric 3D Vision Made Easy,” CVPR, 2024.

Self-Supervised Vision (Template for CSI)

[23] M. Oquab et al., “DINOv2: Learning Robust Visual Features without Supervision,” 2023.

[24] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” CVPR, 2022.

[25] A. Kirillov et al., “Segment Anything,” ICCV, 2023.

Non-Visual Neural Implicit Reconstruction

[26] M. Qadri, M. Kaess, and I. Gkioulekas, “Neural Implicit Surface Reconstruction using Imaging Sonar,” 2022.

[27] T. Lin, M. Qadri, K. Zhang, and A. Pediredla, “Acoustic Neural 3D Reconstruction Under Pose Drift,” March 2025.

[28] Y. Cai, J. Wang, A. Yuille, Z. Zhou, and A. Wang, “Structure-Aware Sparse-View X-ray 3D Reconstruction,” 2023.

[29] Y. Cai, Y. Liang, J. Wang, A. Wang, Y. Zhang, X. Yang, Z. Zhou, and A. Yuille, “Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis,” 2024.

[30] G. Zhang, R. Zha, H. He, Y. Liang, et al., “X-LRM: X-ray Large Reconstruction Model for Extremely Sparse-View Computed Tomography Recovery in One Second,” March 2025.

[31] I. Manisali, O. Oral, and F. S. Oktem, “Efficient Physics-Based Learned Reconstruction Methods for Real-Time 3D Near-Field MIMO Radar Imaging,” 2023.


This article represents a comprehensive snapshot of the literature as of May 2026. The field is moving fast — especially LatentCSI (June 2025), Neuro-Wideband CSI (January 2026), and the acoustic NeRF follow-up (March 2025). If you’re reading this more than six months later, check for updates.