Efficient Generative Models
I aim to push the computational and data efficiency of generative models such as Diffusion Models and Vision-Language Models, which are applicable to image generation, video generation, 3D generation, and text generation (e.g., QA and captioning).
StreamDiffusion: A Pipeline-Level Solution for Real-time Interactive Generation
Akio Kodaira*, Chenfeng Xu*, Toshiki Hazama*, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Masayoshi Tomizuka, Kurt Keutzer. 🔥 9K stars [github][paper]
We make the diffusion process achieve extremely high throughput and very low power usage 😊. We design strategies such as Stream Batch, Residual CFG, and the Stochastic Similarity Filter. Our StreamDiffusion pipeline can integrate existing efficient diffusion models. Feel free to check it out on our project page!
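To give a flavor of one of these strategies, here is a minimal, illustrative sketch of a stochastic similarity filter (a toy version under my own assumptions, not the actual StreamDiffusion code): the more similar an incoming frame is to the last processed one, the more likely we are to skip the diffusion call and reuse the previous output, which keeps compute low when the scene is nearly static.

```python
import torch
import torch.nn.functional as F

class StochasticSimilarityFilter:
    """Toy sketch: probabilistically skip the diffusion call when the incoming
    frame is very similar to the last processed frame."""

    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.prev_frame = None
        self.prev_output = None

    def __call__(self, frame: torch.Tensor, run_diffusion):
        if self.prev_frame is not None:
            sim = F.cosine_similarity(frame.flatten(), self.prev_frame.flatten(), dim=0).item()
            # Map similarity in [threshold, 1] to a skip probability in [0, 1].
            skip_prob = max(0.0, (sim - self.threshold) / (1.0 - self.threshold))
            if torch.rand(()).item() < skip_prob:
                return self.prev_output  # reuse the previous result, save compute
        self.prev_frame = frame
        self.prev_output = run_diffusion(frame)
        return self.prev_output

# Illustrative usage: `pipeline` is any callable mapping a frame to an image.
# filt = StochasticSimilarityFilter(threshold=0.98)
# for frame in video_stream:
#     output = filt(frame, run_diffusion=pipeline)
```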
Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment
Yiheng Li, Heyang Jiang, Akio Kodaira, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu (project advisor) [NeurIPS 2024]
It is an interesting 1-2-3 idea: with just one line of code, we achieve 3x diffusion training acceleration. 😄 [paper]
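Roughly, that one line reassigns the sampled noise within each batch so every image is paired with a nearby noise sample. A minimal sketch under my own naming (not the exact released code):

```python
import torch
from scipy.optimize import linear_sum_assignment

def assign_noise(images: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Reorder the noise within a batch via linear assignment on pairwise L2
    distances, so each image is denoised from a nearby noise sample."""
    cost = torch.cdist(images.flatten(1), noise.flatten(1))  # (B, B) cost matrix
    _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return noise[torch.as_tensor(cols, device=noise.device)]

# Inside a standard diffusion training step (illustrative):
#   noise = assign_noise(x0, torch.randn_like(x0))
#   x_t   = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise
```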
Looking Backward: Streaming Video-to-Video Translation with Feature Banks
Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu. [Project page]
We extend StreamDiffusion to the video-to-video setting. The feature banks ensure better temporal consistency than our previous StreamDiffusion.
Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering
Ido Sobol, Chenfeng Xu, Or Litany. [Project page] [NeurIPS 2024]
You will get a better understanding of diffusion models 😜. We propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps. By drawing an analogy between the denoising process and stochastic gradient descent (SGD), we implement a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity.
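Conceptually (a toy sketch under my own assumptions, not the paper's exact filter): just as SGD smooths noisy gradients by averaging, one can aggregate the attention maps produced by several resampled denoising passes at the same timestep before reusing them.

```python
import torch

def filter_attention_maps(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    """Aggregate attention maps from multiple resampled denoising passes at the
    same timestep; the mean acts as a low-pass filter on the noisy maps."""
    return torch.stack(attn_maps, dim=0).mean(dim=0)
```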
HallE-Control: Controlling Object Hallucination in Large Multimodal Models
Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, Manling Li. [Project page]
We introduce CCEval, a GPT-4-assisted evaluation method for detailed captioning. In this paper, we make the first attempt to investigate such hallucination from different aspects, including image resolution, language decoder size, and the amount, quality, and granularity of instruction data. Our findings underscore the unwarranted inference that arises when the language description includes details at a finer object granularity than what the vision module can ground or verify, thus inducing hallucination. To control such hallucination, we further attribute the reliability of captioning to contextual knowledge (involving only contextually grounded objects) and parametric knowledge (containing objects inferred by the model).
Efficient Embodied AI
Embodiments such as robots and autonomous vehicles require efficient perception and planning. In light of the popular trend towards large-scale models and extensive datasets, I advocate for the development of efficient models and low-data regimes to enable versatile, generalist embodiments.
RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning
Chenfeng Xu*, Lawrence Yunliang Chen*, Karthik Dharmarajan, Kurt Keutzer, Masayoshi Tomizuka, Quan Vuong, Ken Goldberg (CoRL 2024 Oral Paper 🎉)
😼 This work is a synergy of our efforts in 3D vision and generative models. 😍 We propose RoVi-Aug, which leverages state-of-the-art image-to-image generative models to augment robot data by synthesizing demonstrations with different robots and camera views. By training on robot- and viewpoint-augmented data, RoVi-Aug can be deployed zero-shot on a different robot with significantly different camera angles.
Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting
Lawrence Yunliang Chen*, Kush Hari*, Karthik Dharmarajan*, Chenfeng Xu, Quan Vuong, Ken Goldberg (RSS 2024) [Project page]
This is a surprisingly simple idea 😄! Cross-paint the robot (or gripper) in the images with the source robot (or gripper), and the visual policy transfers directly and remarkably well!
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Google Team, Chenfeng Xu, et al. [Project page] (ICRA 2024 Best Paper)
I am proud to be part of this project! 🎉 Learning generalizable representations is not only important for vision tasks, but also for robot learning.
Check out RT-X! It is joint work by collaborators worldwide!
What Matters to You? Towards Visual Representation Alignment for Robot Learning
Ran Tian, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy. ICLR 2024. [Paper]
How can we align visual representations to human preferences? 🤔️
🙋♂️ In this work, we propose that robots should leverage human feedback to align their visual representations with the end user and disentangle what matters for the task. We propose Representation-Aligned Preference-based Learning (RAPL), a method that solves the visual representation alignment and visual reward learning problems through the lens of preference-based learning and optimal transport.
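For intuition, here is a generic preference-based reward learning sketch (a plain Bradley-Terry objective over visual features; RAPL's actual formulation additionally uses optimal transport, which is omitted here, and the module names are my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceRewardLearner(nn.Module):
    """Generic Bradley-Terry preference loss: push the cumulative reward of the
    human-preferred trajectory above that of the other trajectory."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.reward_head = nn.Linear(feat_dim, 1)

    def loss(self, traj_a: torch.Tensor, traj_b: torch.Tensor, prefer_a: torch.Tensor):
        # traj_*: (B, T, feat_dim) visual feature sequences; prefer_a: (B,) in {0, 1}.
        r_a = self.reward_head(traj_a).sum(dim=1).squeeze(-1)  # (B,)
        r_b = self.reward_head(traj_b).sum(dim=1).squeeze(-1)  # (B,)
        return F.binary_cross_entropy_with_logits(r_a - r_b, prefer_a.float())
```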
Human-oriented Representation Learning for Robotic Manipulation
Mingxiao Huo, Mingyu Ding, Chenfeng Xu, Thomas Tian, Xinghao Zhu, Yao Mu, Lingfeng Sun, Masayoshi Tomizuka, Wei Zhan
[Paper][Website] (RSS 2024)
How can we train a vision model more suitable for robotic learning? 🤔️
🙋♂️ Train it like training a human! We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce the Task Fusion Decoder, a plug-and-play embedding translator that exploits the underlying relationships among these perceptual skills to guide representation learning toward encoding the structure that matters for all of them, ultimately empowering downstream robotic manipulation tasks.
AutoBox: A Visual-based Auto-labeling Tool for 3D Detection in Autonomous Driving
Chenfeng Xu*, Jiachen Lu*, Huachao Zhu, Mingyu Ding, Thomas Hannagan, Frederic Large, Yongchao Xu, Masayoshi Tomizuka, Kurt Keutzer, Qianqian Wang†, Wei Zhan
😎 This is the first 3D bounding-box auto-labeling tool for pure video data in autonomous driving. We present AutoBox, which efficiently handles scenarios with few or no ground-truth 3D bounding boxes through three key modules: state interpolation, state filter, and state correction.
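As a toy illustration of what a state interpolation module might do (a hypothetical helper, not the AutoBox code): propagate a 3D box state between two annotated keyframes.

```python
import numpy as np

def interpolate_box_states(t0, box0, t1, box1, query_times):
    """Linearly interpolate box center, size, and yaw (x, y, z, l, w, h, yaw)
    between two keyframes at times t0 < t1; yaw is interpolated on the circle."""
    box0, box1 = np.asarray(box0, float), np.asarray(box1, float)
    dyaw = np.arctan2(np.sin(box1[6] - box0[6]), np.cos(box1[6] - box0[6]))
    states = []
    for t in query_times:
        a = (t - t0) / (t1 - t0)
        state = (1.0 - a) * box0 + a * box1
        state[6] = box0[6] + a * dyaw  # avoid yaw wrap-around artifacts
        states.append(state)
    return np.stack(states)
```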
Efficient Representation Learning
Representation learning is a fundamental problem for both generative models and robotic learning. I aim to build efficient representations from raw sensor data so that representation models run faster and learn from less data.
3D Object Detection with Geometry-aware Diffusion Features
Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany. [Project page] CVPR 2024
Can a Stable Diffusion model work for 3D detection? 🤔️
🙋♂️ Hmm, maybe yes? But it is hard because it lacks 3D awareness. We incorporate 3D awareness into the 2D Stable Diffusion model via a geometric ControlNet.
NeRF-Det: Learning Geometry-Aware Volumetric Representations for Multi-View Indoor 3D Object Detection
Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka. [ICCV 2023]. [Code]
The paper is featured among the top 5 ICCV papers by Meta AI! [Link] 🎉
CV4Metaverse workshop (oral🎉) at ICCV 2023.
Does NeRF only work for 3D reconstruction? 🤔️
🙋♂️ NeRF-Det makes novel use of NeRF to build geometry-aware volumetric representations for 3D detection, achieving large improvements while eliminating the heavy overhead of per-scene optimization.
Quadric Representations for LiDAR Odometry, Mapping, and Localization
Chenfeng Xu*, Chao Xia*, Patrick Rim, Mingyu Ding, Nanning Zheng, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. [RA-Letter 2023]
How to represent a point-cloud scene with thousands of points in an efficient manner? 🤔️
🙋♂️ You only need a handful of quadrics. We propose quadric representations to describe complex point-cloud scenes for LiDAR odometry, mapping, and localization. Such a sparse representation yields better odometry accuracy while running 3x faster.
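For intuition, a minimal sketch of how a cluster of points can be compressed into a handful of coefficients: an algebraic least-squares fit of an implicit quadric surface (illustrative only; the paper's full odometry pipeline does much more).

```python
import numpy as np

def fit_quadric(points: np.ndarray) -> np.ndarray:
    """Fit an implicit quadric q(x, y, z) = 0 to an (N, 3) point cluster.
    The 10 coefficients weight the monomials
    [x^2, y^2, z^2, xy, xz, yz, x, y, z, 1]; we take the right singular
    vector with the smallest singular value (algebraic least squares)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    D = np.stack([x*x, y*y, z*z, x*y, x*z, y*z, x, y, z, np.ones_like(x)], axis=1)
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    return vt[-1]  # quadric coefficients, defined up to scale
```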
Time will tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection
Chenfeng Xu*, Jinhyung Park*, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, Wei Zhan [ICLR 2023 (Notable 5%)][Code]
Is temporal multi-view 3D detection able to run in an efficient way? 🤔️
🙋♂️ We theoretically analyze the effects of temporal frames, image resolution, and camera rotations and translations. We find that long-term frames can compensate for a lack of resolution. We propose generating a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup.
PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map
Chenfeng Xu*, Tian Li*, Chen Tang, Lingfeng Sun, Kurt Keutzer, Masayoshi Tomizuka, Alireza Fathi, Wei Zhan
[ECCV 2022] [Code]
Why is this work the first to pre-train for trajectory forecasting? 😯
Trajectory data is too scarce for pre-training alone to make trajectory forecasting models data-efficient. We open up a new path by leveraging the far more abundant map data and connecting trajectory representations to strong map representations. We associate the geometric representations of maps with the shapes of trajectories, which boosts trajectory forecasting performance. We then extend this to synthetic data; see Pre-Training on Synthetic Driving Data For Trajectory Prediction.
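The trajectory-map connection can be sketched as a CLIP-style contrastive objective between the two encoders (an illustrative sketch; other components such as map-only contrastive learning are omitted):

```python
import torch
import torch.nn.functional as F

def trajectory_map_contrastive_loss(traj_emb: torch.Tensor,
                                    map_emb: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE loss between paired trajectory and map embeddings of
    shape (B, D); matched pairs lie on the diagonal of the similarity matrix."""
    traj_emb = F.normalize(traj_emb, dim=-1)
    map_emb = F.normalize(map_emb, dim=-1)
    logits = traj_emb @ map_emb.t() / temperature  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```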
Open-Vocabulary Point-Cloud Object Detection without 3D Annotation
Chenfeng Xu*, Yuheng Lu*, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang. [CVPR 2023][Code]
Can point-cloud detectors be trained without 3D labels? 🤔️
🙋♂️ The image domain has shown great generalizability through 2D foundation models. We address open-vocabulary 3D point-cloud detection by leveraging 2D foundation models such as CLIP.
Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained Models
Chenfeng Xu*, Shijia Yang*, Tomer Galanti, Bichen Wu, Xiangyu Yue, Bohan Zhai, Wei Zhan, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka
[ECCV 2022] [Code]
This is a surprising work 😯!
Images and point clouds have a huge domain gap, given that images are dense RGB arrays while point clouds are sparse xyz points. We surprisingly found that image-pretrained models can be efficiently tuned (with 300x fewer tuned parameters) for point-cloud tasks. We also shed light on why this works through neural collapse, i.e., image-pretrained models exhibit neural collapse on point-cloud data.
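A rough sketch of the parameter-efficient tuning recipe (illustrative and under my own module naming; the actual pipeline also transfers the pretrained 2D convolution kernels into point-cloud convolutions):

```python
import torch.nn as nn

NORM_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.LayerNorm)

def tune_only_io_and_norm(model: nn.Module, io_keywords=("input", "head")) -> None:
    """Freeze the image-pretrained backbone; leave only normalization layers and
    the input/output modules (matched by name keyword) trainable."""
    for name, module in model.named_modules():
        trainable = isinstance(module, NORM_TYPES) or any(k in name for k in io_keywords)
        for p in module.parameters(recurse=False):
            p.requires_grad = trainable
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {n_train} / {n_total}")
```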