AnyEdit

Unified High-Quality Image Editing for Any Idea

1Zhejiang University, 2Nanyang Technological University, 3Alibaba Group
*Equal Contribution

Introduction

Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction-based editing dataset comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains.

We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, an adaptive editing process, and automated selection of editing results.

Using the dataset, we further train a novel AnyEdit Stable Diffusion model with task-aware routing and learnable task embeddings for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models.

This opens up promising prospects for developing instruction-driven image editing models that support human creativity.

AnyEdit Dataset

Overview

    We comprehensively categorize image editing tasks into five groups based on their editing capabilities:
  • (a) Local Editing, which focuses on region-based editing (green area);
  • (b) Global Editing, which focuses on full-image rendering (yellow area);
  • (c) Camera Move Editing, which focuses on changing viewpoints rather than scene content (gray area);
  • (d) Implicit Editing, which requires commonsense knowledge to complete complex edits (orange area);
  • (e) Visual Editing, which incorporates additional visual inputs, addressing the requirements of multi-modal editing (blue area).


Examples of AnyEdit at scale.


In AnyEdit, we combine five distinct groups of data covering 25 editing types, which will be released to help the community. It is worth noting that AnyEdit is the only dataset that considers data bias and introduces counterfactual synthetic scenes to balance the dataset's distribution.

Comparison of existing image editing datasets. “Real Image” means the original images come from the real world, “Synthetic Image” means they come from T2I models, and “Synthetic Scene” indicates that both images and captions are generated to address the inherent data bias.

Subsequently, we invoke off-the-shelf T2I models to produce the initial images. In this manner, we enrich the original dataset with rare concept combinations, resulting in ∼700K high-quality and diverse image-caption pairs for the AnyEdit dataset collection.
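As a rough illustration of this step, the sketch below generates initial images from (counterfactual) captions with an off-the-shelf Stable Diffusion checkpoint via the diffusers library; the checkpoint name, example captions, and sampling parameters are assumptions for illustration, not the exact configuration used for AnyEdit.

# Minimal sketch: synthesize initial images from counterfactual captions with an
# off-the-shelf T2I model. Checkpoint, captions, and sampling settings are
# illustrative assumptions only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

captions = [
    "a polar bear surfing on a tropical beach",    # rare concept combination
    "a vintage car parked inside a greenhouse",
]

for i, caption in enumerate(captions):
    image = pipe(caption, num_inference_steps=30, guidance_scale=7.5).images[0]
    image.save(f"initial_{i:05d}.png")             # becomes an image-caption pair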


Data preparation details for the AnyEdit dataset collection.

Pipeline

We summarize the general pipeline into five steps (a schematic code sketch follows the list):
(1) General data preparation from real-world image-text pairs and synthetic scenes.
(2) Diverse instruction generation using an LLM to produce high-quality editing instructions.
(3) Pre-filtering for instruction validation.
(4) An adaptive editing pipeline that tailors specific editing methods to each edit type to generate high-quality edited images.
(5) Image quality assessment to ensure high-quality editing pairs for the AnyEdit dataset.
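The following Python sketch shows how these five steps could be orchestrated. Every helper here is a hypothetical placeholder standing in for the actual tools (LLM prompting, type-specific editors, quality scorers) described above; none of them are real AnyEdit APIs.

# Hypothetical sketch of the five-step AnyEdit construction pipeline.
# All helpers are placeholder stubs, not actual AnyEdit code.

def generate_instruction(caption, edit_type):        # (2) LLM produces an editing instruction
    return f"{edit_type}: edit the scene described by '{caption}'"

def passes_prefilter(image, caption, instruction):   # (3) validate the instruction against the image
    return bool(instruction)

def adaptive_edit(image, instruction, edit_type):    # (4) dispatch to a type-specific editing tool
    return image  # placeholder: a real pipeline returns an edited image

def assess_quality(original, edited, instruction):   # (5) keep only high-quality editing pairs
    return edited is not None

def build_anyedit_pairs(image_caption_pairs, edit_types):  # (1) start from prepared image-caption pairs
    dataset = []
    for image, caption in image_caption_pairs:
        for edit_type in edit_types:
            instruction = generate_instruction(caption, edit_type)
            if not passes_prefilter(image, caption, instruction):
                continue
            edited = adaptive_edit(image, instruction, edit_type)
            if assess_quality(image, edited, instruction):
                dataset.append({"input": image, "output": edited,
                                "instruction": instruction, "edit_type": edit_type})
    return dataset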


The comprehensive construction pipeline of AnyEdit.

Cases in AnyEdit

Cases in AnyEdit-Test

🎨AnySD Model

Overview

Since AnyEdit contains a wide range of editing instructions across various domains, it holds promising potential for developing a powerful editing model that addresses high-quality editing tasks. However, training such a model poses three additional challenges: (a) aligning the semantics of various multi-modal inputs; (b) identifying the semantic edits within each domain to control the granularity and scope of the edits; and (c) coordinating the complexity of various editing tasks to prevent catastrophic forgetting. To this end, we propose a novel AnyEdit Stable Diffusion approach (🎨AnySD) to cope with various editing tasks in the real world.


Architecture of 🎨AnySD. 🎨AnySD is a novel architecture that supports three conditions (original image, editing instruction, visual prompt) for various editing tasks.
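To make the task-aware routing and learnable task embeddings mentioned above more concrete, the PyTorch sketch below shows one plausible way to wire such a mechanism: a learned embedding per editing task softly routes features through a small set of experts. Module names, dimensions, and the number of experts are assumptions and do not reflect the actual 🎨AnySD implementation.

# Illustrative sketch of task-aware routing with learnable task embeddings.
# Dimensions, expert count, and module layout are assumptions, not AnySD code.
import torch
import torch.nn as nn

class TaskAwareRouter(nn.Module):
    def __init__(self, num_tasks=25, dim=768, num_experts=4):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, dim)        # learnable task embedding
        self.gate = nn.Linear(dim, num_experts)               # routing weights per task
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)] # task-specific adapters
        )

    def forward(self, features, task_id):
        # features: (batch, tokens, dim); task_id: (batch,)
        task = self.task_embed(task_id)                        # (batch, dim)
        weights = torch.softmax(self.gate(task), dim=-1)       # (batch, num_experts)
        expert_out = torch.stack([e(features) for e in self.experts], dim=1)
        routed = (weights[:, :, None, None] * expert_out).sum(dim=1)
        return features + routed                               # residual, task-conditioned update

router = TaskAwareRouter()
feats = torch.randn(2, 77, 768)
out = router(feats, torch.tensor([3, 17]))   # two samples with different edit types
print(out.shape)                             # torch.Size([2, 77, 768])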

Quantitative Results

We report standard image editing results of AnyEdit and other baselines on the EMU-Edit Test and MagicBrush benchmarks in the table below. Based on the experimental results, we draw the following conclusions:
(i) Our SD-1.5 with AnyEdit, which changes only the training data to AnyEdit, consistently demonstrates superior semantic performance in both edit alignment and content preservation compared to SOTA methods, even without additional mask supervision (0.872 CLIPim and 0.285 CLIPout on the EMU-Edit Test). This highlights AnyEdit's effectiveness for high-quality image editing, validating that its editing data provide strong semantic alignment and a clear underlying editing structure.
(ii) Our 🎨AnySD model, trained on AnyEdit with the 🎨AnySD architecture, further surpasses SOTA methods in both semantic and visual similarity (0.872 CLIPim on the EMU-Edit Test and 0.881 DINO on the MagicBrush Test), setting new records on the MagicBrush and EMU-Edit benchmarks.
This demonstrates the superiority of 🎨AnySD in following editing instructions while preserving unchanged image elements, thanks to its task-aware architecture, which learns task-specific knowledge from the diverse editing types in AnyEdit and enhances the model's cross-task editing capabilities.


Comparison of methods on the EMU-Edit and MagicBrush benchmarks. We show performance improvements over SOTA models of the same architecture, with only training data differences.
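For reference, the CLIPim and CLIPout metrics cited above and below can be computed roughly as in the following sketch using Hugging Face transformers: CLIPim is the cosine similarity between the edited image and a reference image in CLIP space (the source image for content preservation, or the ground-truth edit, depending on the benchmark), and CLIPout is the cosine similarity between the edited image and the target caption. The checkpoint name is an assumption, and the official evaluation code may differ in details.

# Rough sketch of CLIP-space similarity metrics; not the exact evaluation code
# behind the reported numbers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_im(edited: Image.Image, reference: Image.Image) -> float:
    # CLIPim: image-to-image similarity in CLIP feature space
    inputs = processor(images=[edited, reference], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

def clip_out(edited: Image.Image, target_caption: str) -> float:
    # CLIPout: image-to-text similarity between the edit and the target caption
    img = processor(images=[edited], return_tensors="pt")
    txt = processor(text=[target_caption], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(**img)
        txt_feat = model.get_text_features(**txt)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return float(img_feat[0] @ txt_feat[0])

# DINO similarity is computed analogously with DINO image features, and the L1
# metric is the mean absolute pixel difference between edited and ground-truth images.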

The table below presents the results on the AnyEdit-Test benchmark, where each instruction is designed to rigorously evaluate a model's adaptability across a wider range of challenging editing scenarios. We provide further results for each editing category in Appendix F. It can be observed that:
(i) Most baselines struggle to handle more complex editing tasks that rarely appear in standard benchmarks (0.190 vs. 0.121 average L1), especially implicit editing, which requires reasoning abilities. This illustrates the importance of AnyEdit-Test for evaluating the performance of editing models on complex tasks.
(ii) Even for common editing tasks, state-of-the-art models show a significant decline in consistency on AnyEdit-Test (-3.5% CLIPim and -19.2% DINO for UltraEdit). This underscores the limitations of existing benchmarks in evaluating multi-scene editing.
(iii) In contrast, AnyEdit significantly outperforms SOTA methods across all editing categories, demonstrating its scalability and robustness in handling complex tasks across diverse scenarios.
(iv) Traditional methods often struggle to handle visual editing effectively because of the additional visual inputs. In such cases, even when compared to Uni-ControlNet, which is pre-trained with diverse visual conditions, AnyEdit consistently performs better on visual editing tasks, showing its efficacy in handling vision-conditioned editing instructions.


Comparison of methods on the AnyEdit-Test benchmark

More Qualitative Cases

Diversified Editing

Edit Cases in AnyEdit-Test

Multi-Turn Edit Cases

Comparison with More Models

BibTeX


@article{yu2024anyedit,
  title={AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea},
  author={Yu, Qifan and Chow, Wei and Yue, Zhongqi and Pan, Kaihang and Wu, Yang and Wan, Xiaoyang and Li, Juncheng and Tang, Siliang and Zhang, Hanwang and Zhuang, Yueting},
  journal={arXiv preprint arXiv:2411.15738},
  year={2024}
}