Yuang Peng (彭雨昂)

Master's Student at ITML Group, Tsinghua University.

Research Intern at Foundation Model Group, StepFun.

Research: I am a researcher and engineer specializing in large language models. My primary focus is the development of efficient and scalable methods for multimodal data modeling, with particular emphasis on text, images, and videos. My interests span multiple areas, including generative modeling, representation learning, reinforcement learning, and embodied AI. My ultimate ambition is to cultivate multimodal perception, reasoning, and generation capabilities for Artificial General Intelligence (AGI), with the goal of creating fully intelligent systems and robots that can enhance human lives.

Experience: I am currently pursuing my Master’s degree in Computer Science at Tsinghua University, advised by Shu-Tao Xia. I was a research intern at StepFun, the Foundation Model Group at Megvii Research, and Shanghai Artificial Intelligence Laboratory. I was a short-term visiting scholar at the Artificial Intelligence Group, University of Cambridge, advised by Pietro Liò. I obtained my Bachelor’s degree in Computer Science from Wuhan University, where I was recognized as a distinguished graduate and graduated summa cum laude.

News

Jun 25, 2024	Introducing DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Sep 20, 2023	Introducing DreamLLM, a multimodal LLM

Selected Publications

arXiv

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Yuang Peng*, Yuxin Cui*, Haomiao Tang*, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia

arXiv preprint arXiv:2406.16855, 2024

Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings.

ICLR

DreamLLM: Synergistic Multimodal Comprehension and Creation

Runpei Dong*, Chunrui Han*, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Yi Li

ICLR Spotlight Presentation (4.96%), 2023

This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM’s superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy.

IJCAI

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

Liang Zhao*, En Yu*, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, and Xiangyu Zhang

IJCAI Long Oral Presentation (4%), 2023

Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, leading to the limitation of the interactive accuracy and efficiency. In this study, we present precise referring instructions that utilize diverse reference representations such as points and boxes as referring prompts to refer to the special region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instruction, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4 generating. Furthermore, we design a series of evaluation tasks to assess the effectiveness of region recognition and interaction. Experimental results showcase ChatSpot’s promising performance.

IEEE RA-L

Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Chunrui Han, Jianjian Sun, Zheng Ge, Jinrong Yang, Runpei Dong, Hongyu Zhou, Weixin Mao, Yuang Peng, and Xiangyu Zhang

IEEE Robotics and Automation Letters Oral Presentation, 2023

Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird’s-Eye-View (BEV) 3D perception. Existing methods are mostly in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find it already able to enjoy the merits from both sides, i.e., rich long-term information and efficient fusion pipeline. A temporal embedding module is further proposed to improve the model’s robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusing pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains leading performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80m minADE and 0.463 EPA). Code will be available.