Efficient Generative Models

I aim to push the computational- and data-efficiency of generative models such as Diffusion Models, Large-language Models and Vision-Language Models, which apply to image generation, video generation, 3D generation and textual generation (e.g., QA and caption etc).

StreamDiffusion: A Pipeline-Level Solution for Real-time Interactive Generation

Chenfeng Xu*, Akio Kodaira*, Toshiki Hazama*, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Masayoshi Tomizuka, Kurt Keutzer. 🔥 10K stars [github][paper]. Check the TouchDesigner tutorial for designing your cool demo 😎 [Tutorial][Youtube]

We make the diffusion process achieve extremely high throughputs and very low power usage 😊. We design the strategies like Stream Batch, Residual CFG, and Stochastic Similarity Filtering. Our StreamDiffusion pipeline can integrate existing efficient diffusion models. Feel free to check it in our page!

Dobi-SVD: Differential SVD for LLM Compression and Some New Perspectives

Wang Qinsi*, Jinghan Ke*, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu (Advise this project, corresponding author). [ICLR 2025]

We are the first to make SVD-based compression method rival quantization and outperform pruning-based compression. 😎 Unlike quantization and pruning—which often require specialized support for low-precision or sparse computing—Dobi-SVD uses SVD sparsification to compress model weights and achieve acceleration on general GPU types. For example, Dobi-SVD speeds up LLM and VLM by over 5.5× on low-cost GPUs (e.g., Titan XP), but quantization and pruning can hardly work well.

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Haocheng Xi*, Shuo Yang*, Yilong Zhao, Chenfeng Xu (Advise this project, corresponding author), Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han. [Arxiv]

We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency.😲 We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, 😎 SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.

7713_1741551017-ezgif.com-video-to-gif-converter.gif

Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment

Yiheng Li, Heyang Jiang, Akio Kodaira, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu (Advise this project, corresponding author) [NeurIPS 2024]

It is an interesting 1-2-3 idea. With just one line of code, we can achieve 3 times diffusion training acceleration. 😄 [paper]

Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu. [Project page] [ICLR 2025]

We extend our StreamDiffusion into Video-to-video setting. It ensures better temporal consistency compared to our previous StreamDiffusion.

streamv2v_video-ezgif.com-video-to-gif-converter.gif

Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

Ido Sobol, Chenfeng Xu, Or Litany. [Project page] [NeurIPS 2024]

You will get better understanding on diffusion models 😜 we propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps. By drawing an analogy between the denoising process and stochastic gradient descent (SGD), we implement a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity.

HallE-Control: Controlling Object Hallucination in Large Multimodal Models

Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, Manling Li. [Project page]

We introduce CCEval, a GPT-4 assisted evaluation method for detailed captioning. In this paper, we make the first attempt to investigate such hallucination from different aspects, including image resolution, the language decoder size, and instruction data amount, quality, granularity. Our findings underscore the unwarranted inference when the language description includes details at a finer object granularity than what the vision module can ground or verify, thus inducing hallucination. To control such hallucinations, we further attribute the reliability of captioning to contextual knowledge (involving only contextually grounded objects) and parametric knowledge (containing inferred objects by the model).

Efficient Embodied AI

Embodiments such as robots and autonomous driving vehicles require efficient perception and planning. In light of the popular trend towards using large-scale models and extensive datasets, I am advocating for the development of efficient models and low-data regimes to enable versatile, generalist embodiments.

RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning

Chenfeng Xu*, Lawrence Yunliang Chen*, Karthik Dharmarajan, Kurt Keutzer, Masayoshi Tomizuka, Quan Vuong, Ken Goldberg (CoRL 2024 Oral Paper 🎉) [Website]

We are interviewed by TechXplore and It is featured here! 😎

😼 This work is a synergy of our efforts of generation model and robotics. 😍 We propose RoVi-Aug, which leverages state-of-the-art image-to-image generative models to augment robot data by synthesizing demonstrations with different robots and camera views. By training on robot- and viewpoint-augmented data, RoVi-Aug can zero-shot deploy on a different robot with significantly different camera angles.

Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting

Lawrence Yunliang Chen*, Kush Hari*, Karthik Dharmarajan*, Chenfeng Xu, Quan Vuong, Ken Goldberg (RSS 2024) [Project page]

This is a surprisingly simple idea😄! X-paint the robot (or gripper) with the source robot (gripper) in images, then the visual policy can directly transfer really well!

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Google Team, Chenfeng Xu, et al. [Project page] (ICRA 2024 Best Paper)

I am proud to be part of this project! 🎉 Learning generalizable representations is not only important for vision tasks, but also for robot learning.
Check RT-X! It is a joint work from worldwide collaborators!

What Matters to You? Towards Visual Representation Alignment for Robot Learning

Ran Tian, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy. ICLR 2024. [Paper] and IJRR submission for the extension.

How can we align visual representations to human preferences? 🤔️
🙋‍♂️In this work, we propose that robots should leverage human feedback to align their visual representations with the end-user and disentangle what matters for the task. We propose Representation-Aligned Preference-based Learning (RAPL), a method for solving the visual representation alignment problem and visual reward learning problem through the lens of preference-based learning and optimal transport.

Human-oriented Representation Learning for Robotic Manipulation

Mingxiao Huo, Mingyu Ding, Chenfeng Xu, Thomas Tian, Xinghao Zhu, Yao Mu, Lingfeng Sun, Masayoshi Tomizuka, Wei Zhan
[Paper][Website] (RSS2024)

How can we train a vision model more suitable for robotic learning? 🤔️
🙋‍♂️ Train it like training a human! We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce Task Fusion Decoder as a plug-and-play embedding translator that utilizes the underlying relationships among these perceptual skills to guide the representation learning towards encoding meaningful structure for what’s important for all perceptual skills, ultimately empowering learning of downstream robotic manipulation tasks.

AutoBox: A Visual-based Auto-labeling Tool for 3D Detection in Autonomous Driving

Chenfeng Xu*, Jiachen Lu*, Huachao Zhu, Mingyu Ding, Thomas Hannagan, Frederic Large, Yongchao Xu, Masayoshi Tomizuka, Kurt Keutzer, Qianqian Wang†, Wei Zhan

😎 This is the first 3D bounding box auto-labeling tool for pure video data in autonomous driving. We present AutoBox, it efficiently handles scenarios with few or no ground truth 3D bounding boxes through three key modules: state interpolation, state filter, and state correction.

Efficient Representation Learning

Representation learning is a fundamental problem for both generative model and the robotic learning. I aim at building efficient representations from raw sensors to make the representation models run faster and learn from less data.

3D Object Detection with Geometry-aware Diffusion Features

Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany. [Project page] CVPR 2024
Can StableDiffusion model work for 3D detection? 🤔️
🙋‍♂️ Hmm, maybe yes? But it is hard because it lacks 3D awareness. We incorporate 3D awareness into 2D stablediffusion model via a geometric controlnet.

NeRF-Det: Learning Geometry-Aware Volumetric Representations for Multi-View Indoor 3D Object Detection

Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda , Kurt Keutzer, Masayoshi Tomizuka. [ICCV 2023]. [official Code] Appreciate MM-Detection3D integrate NeRF-Det!
Oral presentation at CV4Metaverse workshop at ICCV 2023.
Does NeRF only work for 3D reconstruction? 🤔️
🙋‍♂️NeRF-Det makes a novel use of NeRF to build geometry-aware volumetric representations for 3D detection, with large improvement while eliminating the heavy overhead of per-scene optimization.

Quadric Representations for LiDAR Odometry, Mapping, and Localization

Chenfeng Xu*, Chao Xia*, Patrick Rim, Mingyu Ding, Nanning Zheng, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. [RA-Letter 2023]

How to represent a point-cloud scene with thousands of points in an efficient manner? 🤔️
🙋‍♂️ You only need several quadrics. We propose quadric representations to describe the complex point-cloud scenes in LiDAR odometry, mapping and localization. Such a sparse representation enables better odometry accuracy, 3x faster and using 1500x less footprint for localization!

Also see our extension to monocular SLAM Q-SLAM (CoRL 2024)

Time will tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection

Chenfeng Xu*, Jinhyung Park*, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, Wei Zhan [ICLR 2023 (Notable 5%)][Code]
Is temporal multi-view 3D detection able to run in an efficient way? 🤔️
🙋‍♂️ We theoretically analyze the effect brought by time frames, image resolutions, camera rotations and translations etc. We find that long-term frames can compensate for the lack of resolutions. We propose to generate a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup.

PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map

Chenfeng Xu*, Tian Li*, Chen Tang, Lingfeng Sun, Kurt Keutzer, Masayoshi Tomizuka, Alireza Fathi, Wei Zhan
[ECCV 2022] [Code]

Why is this work the first to pre-train for trajectory forecasting? 😯
Trajectory data is too scarce to lift trajectory forecasting model data-efficient from pre-training. We open up a new path by leveraging hundreds of map data and connecting the trajectory representations to strong map representations. We associate geometric representations of maps and shapes of trajectories, which boosts the performance of trajectory forecasting. We then extend this into synthetic data. See Pre-Training on Synthetic Driving Data For Trajectory Prediction [IROS 2024]

Screen Shot 2023-04-28 at 6.35.54 PM.png

Open-Vocabulary Point-Cloud Object Detection without 3D Annotation

Chenfeng Xu*, Yuheng Lu*, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang. [CVPR 2023][Code]
Can point-cloud detectors be trained without 3D labels? 🤔️
🙋‍♂️ Image domain has shown great generalizablities in 2D foundation models. We address open-vocabulary 3D point-cloud detection by leveraging the 2D foundation models such as CLIP.

Screen Shot 2023-04-28 at 6.13.52 PM.png

Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained Models

Chenfeng Xu∗, Shijia Yang∗, Tomer Galanti, Bichen Wu, Xiangyu Yue, Bohan Zhai, Wei Zhan, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka
[ECCV 2022] [Code]

This is a surprising work 😯！
Image and point-cloud have huge domain gap given that images are dense RGB arrays while point-clouds are sparse xyz points. We surprisingly found that image-pretrained model can be efficiently (300x fewer tuned parameters) tuned for point-cloud tasks. We also shed light on why it works through neural collapse, i.e., image-pretrained models present neural collapse in point-cloud.