Figure 1. How the Interactive Avatar Model works
Figure 2. Traditional approach vs. end-to-end streaming approach
Figure 3. Pickle’s technical architecture
Data collection pipeline
① Video for Personal Model Training: 5 hours of human talking videos are collected daily.
② Conversational Data for Foundation Model Training: 60 hours of conversational data are collected daily.
Training pipeline
① LoRA Training [10] for Personal Models: Three foundation models are personalized by LoRA training on the user's talking video (sketched below).
② Large-Scale Training for Foundation Models: The DiT foundation model is trained on large-scale conversational data, with the Context Projector and the cross-attention layers of the DiT blocks unfrozen (sketched below).
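A minimal sketch of how step ① (LoRA training) can personalize a frozen foundation model, assuming a PyTorch model whose attention projections are `nn.Linear` layers named `q_proj`/`k_proj`/`v_proj`; the layer names, rank, and scaling are illustrative assumptions, not Pickle's actual configuration.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pre-trained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def add_lora_adapters(module: nn.Module, targets=("q_proj", "k_proj", "v_proj")):
    """Recursively replace the named linear layers with LoRA-wrapped versions."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in targets:
            setattr(module, name, LoRALinear(child))
        else:
            add_lora_adapters(child, targets)
    return module
```

Only the small `lora_a`/`lora_b` matrices are optimized on the user's talking video, so each personal model is a lightweight adapter on top of a shared foundation model.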
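Step ② (foundation training) can be set up by freezing the backbone and unfreezing only the conditioning pathway. A sketch under the assumption that the model exposes `context_projector` and per-block `cross_attn` modules; the attribute names and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

def configure_foundation_training(model: nn.Module, lr: float = 1e-5):
    """Freeze the whole model, then unfreeze the conditioning pathway only."""
    for p in model.parameters():
        p.requires_grad = False

    # Hypothetical attribute names; the real module layout may differ.
    for p in model.context_projector.parameters():
        p.requires_grad = True
    for block in model.dit_blocks:
        for p in block.cross_attn.parameters():
            p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```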
Inference pipeline
Renderer type
●: NeRF
●: DiT (in development)
①
② Extracting Audio Features: The Wav2Vec [13] model extracts audio features from a fixed-size chunk at the end of the most recent audio buffer (sketched below).
③ Generating Landmark Sequences: The Motion Renderer takes the audio features as input and generates a landmark sequence; past landmarks and audio are used to ensure continuity (sketched below).
④ Rendering Frames with NeRF [14]: NeRF renders video frames conditioned on the landmark sequence. The frames are then sent to the Pickle Camera, which can be selected as a camera source in video call apps (sketched below).
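A minimal sketch of step ② of the NeRF path: extracting features from the newest chunk of a rolling 16 kHz audio buffer with the Hugging Face `Wav2Vec2Model`. The chunk length, history size, and checkpoint are illustrative assumptions.

```python
import collections
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 0.64           # illustrative size of the "most recent" chunk

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

# Rolling buffer of the latest call audio; the capture step (①) appends samples here.
audio_buffer = collections.deque(maxlen=SAMPLE_RATE * 5)

@torch.no_grad()
def extract_audio_features() -> torch.Tensor:
    """Run Wav2Vec 2.0 on the newest CHUNK_SECONDS of audio in the buffer."""
    chunk = np.asarray(audio_buffer, dtype=np.float32)[-int(SAMPLE_RATE * CHUNK_SECONDS):]
    inputs = extractor(chunk, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    return wav2vec(inputs.input_values).last_hidden_state    # shape (1, T, 768)
```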
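Step ③ hinges on conditioning each new landmark window on recent history so consecutive windows join smoothly. A sketch of that autoregressive loop; `motion_renderer` and its call signature are assumptions for illustration, not the actual model interface.

```python
import torch

@torch.no_grad()
def generate_landmarks(motion_renderer, audio_feats, past_landmarks, past_audio,
                       num_frames=5, lm_history=25, audio_history=50):
    """Predict the next landmark frames, conditioned on recent landmarks and audio."""
    new_landmarks = motion_renderer(
        audio=torch.cat([past_audio, audio_feats], dim=1),   # history + current chunk
        landmarks=past_landmarks,                             # tail of the last output
        num_frames=num_frames,
    )
    # Keep only short history windows so latency and memory stay bounded.
    past_landmarks = torch.cat([past_landmarks, new_landmarks], dim=1)[:, -lm_history:]
    past_audio = torch.cat([past_audio, audio_feats], dim=1)[:, -audio_history:]
    return new_landmarks, past_landmarks, past_audio
```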
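Step ④ ends with frames being handed to the Pickle Camera. One common way to expose generated frames as a selectable webcam is a virtual camera device, for example via the `pyvirtualcam` package; this illustrates only the hand-off and is not necessarily how Pickle implements its camera.

```python
import numpy as np
import pyvirtualcam

def stream_frames(frame_iter, width=512, height=512, fps=25):
    """Push rendered RGB frames to a virtual camera that video-call apps can select."""
    with pyvirtualcam.Camera(width=width, height=height, fps=fps) as cam:
        for frame in frame_iter:                  # frame: (H, W, 3) uint8 RGB from the renderer
            cam.send(np.ascontiguousarray(frame))
            cam.sleep_until_next_frame()          # pace output to the target frame rate
```

Video-call apps then list the virtual device alongside physical webcams.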
①
② Extracting Context Features: The Context Projector (pre-trained on Wav2Vec and multi-modal LLM layers) extracts context features from the video and audio of the current call (sketched below).
③ Rendering Frames with DiT: The DiT blocks denoise a latent frame buffer using the audio and context features. To generate video frames quickly while preserving continuity, the middle portion of the frame buffer, which has already been sufficiently denoised, is decoded by a 3D VAE and sent to the Pickle Camera (sketched below).
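A rough sketch of step ② of the DiT path, assuming the Context Projector maps per-frame audio and video embeddings into a shared token space and fuses them with a small transformer standing in for the pre-trained multi-modal LLM layers; all dimensions and the fusion scheme are illustrative only.

```python
import torch
import torch.nn as nn

class ContextProjector(nn.Module):
    """Fuses per-frame audio and video embeddings into context tokens for cross-attention."""
    def __init__(self, audio_dim=768, video_dim=1024, ctx_dim=1024, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, ctx_dim)
        self.video_proj = nn.Linear(video_dim, ctx_dim)
        # Stand-in for the pre-trained multi-modal LLM layers mentioned above.
        layer = nn.TransformerEncoderLayer(d_model=ctx_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, audio_feats, video_feats):
        tokens = torch.cat([self.audio_proj(audio_feats),
                            self.video_proj(video_feats)], dim=1)
        return self.fusion(tokens)                # (B, T_audio + T_video, ctx_dim)
```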
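And a sketch of the buffer scheduling in step ③: frames in the rolling buffer carry different noise levels, so the middle, already sufficiently denoised frames can be decoded by the 3D VAE and streamed while the newest frames are still noisy. The `dit` and `vae` call signatures here are hypothetical.

```python
import torch

@torch.no_grad()
def stream_step(dit, vae, buffer, noise, sent, audio_ctx, video_ctx, send_to_camera,
                step=0.1, decode_at=0.2):
    """One streaming iteration over a rolling latent frame buffer.

    buffer: (1, C, T, H, W) latent frames, oldest first, newest last.
    noise:  (T,) per-frame noise level; older frames are further along in denoising.
    sent:   (T,) bool mask marking frames already decoded and streamed.
    """
    # One denoising step over the whole buffer, conditioned on audio and context tokens.
    buffer = dit(buffer, noise, audio_ctx, video_ctx)
    noise = (noise - step).clamp(min=0.0)

    # Frames in the middle of the buffer that are clean enough are decoded now,
    # instead of waiting for the newest (still noisy) frames at the end.
    ready = (noise <= decode_at) & ~sent
    if ready.any():
        send_to_camera(vae.decode(buffer[:, :, ready]))   # 3D VAE -> pixel frames
        sent = sent | ready

    # Slide the window: drop the oldest frame, append a fresh fully-noisy latent.
    buffer = torch.cat([buffer[:, :, 1:], torch.randn_like(buffer[:, :, :1])], dim=2)
    noise = torch.cat([noise[1:], torch.ones(1)], dim=0)
    sent = torch.cat([sent[1:], torch.zeros(1, dtype=torch.bool)], dim=0)
    return buffer, noise, sent
```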
Figure 4. DiT-based foundation model training strategy with conversational data
Figure 5. Latency requirements
[1] Sora: Creating Video from Text. 2025. https://openai.com/index/sora/
[2] Movie Gen: A Cast of Media Foundation Models. arXiv:2410.13720, 2024. https://arxiv.org/abs/2410.13720
[3] OmniHuman-1: Rethinking the Scaling-Up of One-Stage Human Animation. arXiv:2502.01061, 2025. https://arxiv.org/abs/2502.01061
[4] MoCha: Towards Movie-Grade Talking Character Synthesis. arXiv:2503.23307, 2025. https://arxiv.org/abs/2503.23307
[5] https://en.wikipedia.org/wiki/Albert_Mehrabian
[6] Video Increases the Perception of Naturalness During Remote Interactions with Latency. 2012. https://dl.acm.org/doi/10.1145/2212776.2223750
[8] AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents. arXiv:2311.17465, 2023. https://arxiv.org/abs/2311.17465
[9] https://www.zoom.com/en/blog/how-you-used-zoom-2022/
[10] LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021. https://arxiv.org/abs/2106.09685
[11] https://kubernetes.io/docs/concepts/overview/
[12] https://webrtc.org/
[13] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv:2006.11477, 2020. https://arxiv.org/abs/2006.11477
[14] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv:2003.08934, 2020. https://arxiv.org/abs/2003.08934
[15] https://developers.zoom.us/blog/realtime-media-streams/
[16] https://developer.apple.com/documentation/coreaudio/capturing-system-audio-with-core-audio-taps
[17] https://developer.nvidia.com/tensorrt
[18] https://pytorch.org/tutorials/intermediate/pruning_tutorial.html
[19] Efficient Geometry-aware 3D Generative Adversarial Networks. arXiv:2112.07945, 2021. https://arxiv.org/abs/2112.07945
[20] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv:2407.08608, 2024. https://arxiv.org/abs/2407.08608
[21] Diffusion Adversarial Post-Training for One-Step Video Generation. arXiv:2501.08316, 2025. https://arxiv.org/abs/2501.08316
Table 1. Comparison of DiT, NeRF, and 3DGS
Figure 6. Architecture and output of DiT, NeRF, and 3DGS
Figure 7. Personalization via LoRA training with physical contexts