Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models (2024)


Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang

Abstract

We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024×1024 images without cascades, is preferred by human evaluators over SDXL (44.0% vs. 21.4%).

1 Introduction

Training large-scale Pixel-Space text-to-image Diffusion Models (PSDMs) to generate high-resolution images has been challenging due to optimization instabilities that arise when growing model size and/or target image resolution, and due to the increasing demand for computational resources and high-resolution training corpora. The predominant alternatives include cascaded models, comprising a sequence of diffusion models each targeting a progressively higher resolution and trained independently (Ho et al., 2022a; Saharia et al., 2022a; Nichol et al., 2022), and latent diffusion models (LDMs), where generation is performed in a low-dimensional latent representation, from which high-resolution images are generated via a pre-trained latent decoder (Rombach et al.).

In the development of cascaded models, it is challenging to identify sources of quality degradation and distortion resulting from design decisions at specific stages of the model. One well-known issue with cascades is the distribution shift between training and inference: inputs to super-resolution or decoder models during training are obtained by down-sampling or encoding training images, but at inference time they are generated by other models, and hence may deviate from the training distribution. This can amplify unnatural distortions produced by models early in the cascade. The generation of realistic small objects such as faces or hands is one such challenge that has been difficult to diagnose in these models.

Beyond image generation per se, diffusion models serve as image priors for myriad downstream tasks, including inverse problems (Jalal et al., 2021; Kadkhodaie and Simoncelli, 2021; Kawar et al., 2022; Song et al., 2023; Chung et al., 2023; Graikos et al., 2022; Tang et al., 2023; Jaini et al., 2024; Zhan et al., 2023; Song et al., 2024) and other generative tasks (Ho et al., 2022b; Levy et al., 2023; Poole et al., 2023; Tan et al., 2023; Bar-Tal et al., 2024; Chen et al., 2023; Tewari et al., 2023). Cascaded diffusion models are not readily applicable to such tasks, and as a consequence, many such applications rely solely on the score function of the base model of a cascade, often at a relatively low resolution. A high-resolution end-to-end model would alleviate these issues, but model development and effective training procedures have been elusive.

Key barriers to training high-resolution models include prohibitive resource requirements in both memory and computation. Existing recipes require large batch sizes during training to avoid instabilities, and as a consequence, intractably large amounts of memory for high-resolution images. Another issue concerns the need for high-quality, high-resolution training data. Existing training methods require large, diverse corpora of text-image pairs at the target resolution, while in practice such data are not readily available at high resolution.

This paper introduces a framework for training high-resolution, large-scale text-to-image diffusion models without the use of cascades. To that end, we explore the extent to which one can decouple the training of 'visual concepts' associated with textual prompts from the resolution at which one aims to render the image. Such disentanglement has two goals. First, it aims at a better understanding of alignment, composition and image fidelity (especially for well-known hard cases like generating consistent hands, text rendering, scene composition, etc.) as a function of model scaling (e.g., see Figure 2). Second, and of equal importance, our framework yields a robust and stable recipe for training large-scale, non-cascaded pixel-based models targeting high-resolution generation. A bonus is that our recipe allows us to jointly train a single model with data comprising multiple resolutions, even if high-resolution text-image pairs are relatively scarce.

The contributions of this paper can be summarized as follows:

  • We introduce a novel architecture, Shallow-UViT, which allows one to pretrain the PSDM's core layers on huge datasets of text-image data (subsection 3.2), eliminating the need to train the entire model with high-resolution images. This also allows us to investigate the emergent properties of PSDM representation scaling in isolation from layers targeting generation at the final resolution.

  • We present a greedy algorithm for training the Shallow-UViT architecture that allows us to successfully train a high-resolution text-to-image model with small batch sizes (256 versus the typical 2k used in end-to-end solutions) (section 3).

  • We show that one can significantly improve different image quality metrics by leveraging the representation pretrained at low resolution, while growing model resolution in a greedy fashion. Scaling the core components of the Shallow-UViT architecture alone leads to significant improvements in image distribution, quality and text alignment (section 5).

  • We demonstrate that these principles work at scale by presenting Vermeer, a model trained with our greedy algorithm on large-scale corpora, in conjunction with other well-known methods like asymmetric aspect ratio finetuning, prompt preemption and style tuning (section 6). Vermeer is shown to surpass previous cascaded and auto-regressive models across different metrics. In a human evaluation study with 500 challenging prompts and 25 annotators per image, Vermeer is preferred over SDXL (Podell et al., 2024) by a 2-to-1 margin.

2 Related work

Current high-resolution image generation with diffusion models presents a trade-off between architectural complexity and efficiency. Cascaded diffusion models (Nichol et al., 2022; Dhariwal and Nichol, 2021; Saharia et al., 2022b; Ramesh et al., 2022; Balaji et al., 2022) were originally introduced to circumvent the difficulty of training a single-stage, end-to-end model. Cascaded models employ a multi-stage architecture that progressively up-scales lower-resolution images to address the computational challenges of generating high-resolution images directly. Nevertheless, they entail significant complexity and training overhead, as the stages of the cascade are trained independently.

Simple Diffusion (Hoogeboom et al., 2023b) sought to simplify the process by targeting high-resolution generation with a single-stage model, introducing a novel UViT architecture and several useful modifications to training methods that improve stability. While this approach is shown to be effective, stability issues remain when targeting large-scale models and high-resolution images, due in part to their dependence on large batch sizes. In this work we adopt a similar UViT architecture and some of their techniques for scaling, extending the model to much higher resolutions through greedy training. By scaling the core backbone of the model, together with our greedy training procedure, we find we can scale to much higher resolution models (2× to 8× higher than Simple Diffusion), with excellent alignment and much smaller batches when training the high-resolution layers of the model.

Another line of work proposed Matryoshka Diffusion Models (MDM) (Gu et al., 2023), which denoise multiple resolutions using a proposed Nested UNet architecture, progressively training the network to preserve the representation at higher resolutions. We show in this work an alternative and simpler approach in which denoising multiple resolutions is not required; instead, it is crucial to preserve the representation by freezing the pretrained weights as we grow the architecture up to its final design.

On another front, latent diffusion models (LDMs) (Rombach et al.; Jabri et al., 2022; Betker et al., 2023) reduce computational costs by operating within a compressed latent representation. However, LDMs still require separate super-resolution or latent decoder networks to produce final high-resolution images.

The model we introduce also resembles progressive GAN training (Karras et al., 2018), in which layers of increasing resolution are added at each stage. Our work can be thought of as an extension of progressive growing to diffusion models, where we evaluate different growing configurations and arrive at a two-step recipe with a good trade-off of training efficiency, robustness, and generation quality. Specifically, while all layers remain trainable in progressive GANs, and a sequence of growing operations is performed before reaching the final architecture, we pretrain a core representation that remains frozen when training all grown layers at once up to the target resolution. We find that this is crucial to preserve the quality of the representation learned at lower resolutions.

3 Method

Our goal is to create a straightforward, stable methodology for training large-scale pixel-space diffusion models that operate as a single-stage model, i.e., non-cascaded, at inference time. To this end, we first revisit the UNet architecture, aiming to decouple the layers that have a major impact on text-to-image alignment (core components) from those responsible for rendering at the target image resolution (encoder-decoder or super-resolution components). Next, we focus on pre-training the core components and on representation scaling (subsection 3.2). Finally, we present a greedy algorithm to grow the initial architecture core by adding encoder-decoder layers while protecting the core layers' representation. This yields a single-stage model at inference time (subsection 3.3).

3.1 Text-to-image core components

The UNet is the architecture of choice for diffusion models. Two architecture families are common. In one, convolutional networks comprise a stack of convolutional blocks alternated with pooling or downsampling layers in the encoder, and upsampling layers in the decoder. More recently, the UViT family emerged (Hoogeboom et al., 2023a), in which convolutional blocks are used at the higher layers of the encoder and decoder, augmented with transformer layers at the bottom of the UNet. In both architectural families, text conditioning is accomplished via cross-attention layers, also at the bottom, low-resolution layers of the UNet. In doing so, these layers are responsible for conditioning the model's deepest representation on the textual and/or multi-modal inputs. At these low-resolution layers, the text conditioning signal is able to influence the global image composition while the computational cost of attention is kept relatively low.

Our search for a methodology that allows stable training of large models starts by identifying and isolating core layers responsible for text-to-image alignment.Our main conjecture is that it is possible to reduce the instability typically observed during training large-scale PSDMs by warming up layers responsible for text-to-image alignment in isolation from layers responsible for target resolution encoding/decoding.

Specifically, we define the core components as those that directly interface with text conditioning signals and those that are crucial to the diffusion process. They can be described as follows (a minimal grouping sketch is given after the list):

  • Text encoding layers combine one or more textual, character, and/or multimodal pretrained representations (such as those from Raffel et al. (2020b); Xue et al. (2022a); Liu et al. (2023); Radford et al. (2021a)), and project them into the embedding space of the UNet. They are typically composed of an MLP on top of pooling layers.

  • Core representation layers comprise hidden layers in the main backbone interfacing with cross-attention layers. They include the bottom layers of the UNet architecture, whose features are directly combined with the embedded text by the cross-attention operation, and the layers between them.

  • Time encoding layers map the diffusion time step into the model's embedding space. They are typically designed as a sinusoidal positional encoder followed by a shallow MLP. Despite not participating directly in the cross-attention operation, they are a core component of the diffusion process.
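To make this grouping concrete, the following minimal Python sketch partitions a model's named parameters into core vs. grown sets. The module-name prefixes are hypothetical placeholders, not the paper's actual code; the same split is reused later when freezing the core during greedy growing.

```python
# Hypothetical module-name prefixes for the three core-component families;
# the actual names depend on the UViT implementation in use.
CORE_PREFIXES = (
    "text_proj",         # text encoding layers: MLP/pooling over pretrained text embeddings
    "core_transformer",  # core representation layers: bottom UViT blocks + cross-attention
    "time_embed",        # time encoding layers: sinusoidal encoding + shallow MLP
)


def is_core_parameter(name: str) -> bool:
    """Return True if a named parameter belongs to a core component."""
    return name.startswith(CORE_PREFIXES)


def split_parameters(named_params):
    """Split (name, param) pairs into core vs. encoder/decoder (grown) groups."""
    core, grown = [], []
    for name, param in named_params:
        (core if is_core_parameter(name) else grown).append(param)
    return core, grown
```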

We isolate these core components of a PSDM text-to-image model in order to study their effect on the final model’s properties. Next, we propose an architecture that enables the pretraining of these layers, and also supports the study of the properties emerging from scaling them.

3.2 Shallow-UViT

[Figure 1: the Shallow-UViT architecture.]

To assist the pretraining of the core components and, at the same time, investigate the properties emerging from their scaling, we isolate core-component training and scaling from confounding factors in the specification of the UNet's encoder-decoder layers. To that end, we simplify the UNet's conventional hierarchical structure, which operates on multiple resolutions, and define the Shallow-UViT (SU), a simplified architecture comprising a shallow encoder and decoder operating on a fixed spatial grid (Figure 1). Its encoder and decoder have a single residual block each, containing two layers of 3×3 convolutions with swish activations (Ramachandran et al., 2017), and no upsampling or downsampling layers. As a result, they share the same spatial grid as the core representation layers at the bottom. The first convolutional layer at the entry of the architecture projects the input image onto the fixed-size grid used by its core layers. A corresponding upsampling head at the model's output reverses this operation. These input/output layers make it cheap to project input images of larger resolution onto the lower, fixed-resolution core representation.
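The following PyTorch sketch illustrates this layout for a hypothetical Small-sized configuration: a strided input convolution projects a 64×64 image onto a fixed 16×16 grid, a single residual block wraps the transformer core on each side, and a sub-pixel head reverses the input projection. Text and time conditioning (cross-attention, time embeddings) are omitted for brevity, and a generic transformer encoder stands in for the UViT core blocks; this is an illustrative approximation, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Single residual block: two 3x3 convolutions with swish (SiLU) activations."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.SiLU(), nn.Conv2d(ch, ch, 3, padding=1),
            nn.SiLU(), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class ShallowUViT(nn.Module):
    """Minimal Shallow-UViT sketch (conditioning pathways omitted)."""
    def __init__(self, img_size=64, grid=16, ch=256, hidden=1536, blocks=6, heads=12):
        super().__init__()
        stride = img_size // grid                      # e.g. 64 -> 16 uses stride 4
        self.inp = nn.Conv2d(3, ch, kernel_size=stride, stride=stride)
        self.enc = ResBlock(ch)                        # shallow encoder: one residual block
        self.to_tokens = nn.Conv2d(ch, hidden, 1)
        self.core = nn.TransformerEncoder(             # stand-in for the UViT core blocks
            nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True),
            num_layers=blocks,
        )
        self.from_tokens = nn.Conv2d(hidden, ch, 1)
        self.dec = ResBlock(ch)                        # shallow decoder: one residual block
        self.out = nn.Conv2d(ch, 3 * stride * stride, 3, padding=1)
        self.up = nn.PixelShuffle(stride)              # upsampling head back to image size

    def forward(self, x):                              # x: (B, 3, 64, 64) noisy image
        h = self.enc(self.inp(x))                      # (B, ch, 16, 16), fixed grid
        b, _, gh, gw = h.shape
        tokens = self.to_tokens(h).flatten(2).transpose(1, 2)   # (B, gh*gw, hidden)
        tokens = self.core(tokens)                     # core representation on the 16x16 grid
        h = self.from_tokens(tokens.transpose(1, 2).reshape(b, -1, gh, gw))
        return self.up(self.out(self.dec(h)))          # prediction at input resolution
```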

As a second simplification, we restrict our investigation to the core components from the UViT model family owing to the uniform structure of its core representation layers. In contrast, the corresponding layers of convolutional UNets present a broader spectrum of design and hyperparameter choices, owing to their non-uniform yet hierarchical structure, rendering their analysis more complex.

An alternative to the proposed Shallow-UViT architecture might be to train the core components directly as an augmented ViT, as previously explored in latent diffusion models (Peebles and Xie, 2023). Our attempts to explore this approach proved not to be straightforward. A crucial difference between PSDMs and LDMs becomes highly relevant here. In the case of LDMs, the transformer operates on latent tokens, and the diffusion model captures the latent token distribution. Our task, on the other hand, is to pretrain a rich representation directly from raw pixels, for subsequent reuse as deep features within a higher-resolution pixel-space model. We conjecture that in such approaches the initial layers that are closer to the raw data do not transfer as well when reused within the final model.

Instead, our Shallow-UViT includes additional proxy layers that help close the gap between pretraining the core components' features and their later use. That is, the auxiliary, yet shallow, input (output) and encoding (decoding) layers add expressiveness to the transformations between the input (output) and the model's hidden representation. Across the variations explored, the input convolution expands the number of input channels up to 256 (we observed no improvement with more channels).

Beyond ablations on scaling (see section 5), we also found that certain variations of the Shallow-UViT composition tend to degrade performance in comparison to our best architecture. In particular, these include the removal of the shallow encoder/decoder blocks; the use of smaller/larger filters (4×4, 5×5, ..., 9×9) and strides (from 1 up to 8) at the entry convolution; and the use of a single output head with a subpixel convolution upsampling by a factor of 4. We also experimented with convolutional core representation layers but, like Dosovitskiy et al. (2021), we find they under-perform their transformer-based counterparts.

3.3 Greedy growing

Here we describe a greedy approach to learning PSDMs for high-resolution images. Our process consists of two distinct stages: we first pretrain the core representation layers at low resolution using a Shallow-UViT architecture; then, in the second phase, we replace the encoder/decoder layers with a more expressive set of UNet layers and train at the target resolution. This two-stage process is in contrast to progressive growing, which adds one layer at a time. With this approach, we aim to mitigate the well-known instabilities observed during training of large models (Saharia et al., 2022b; Hoogeboom et al., 2023b), while making the best use of the available training corpora.

The greedy growing algorithm can be described as follows.

Phase 1

In this phase, the core components of the chosen architecture are identified (see subsection 3.1), and a Shallow-UViT model is built on top of them. The Shallow-UViT is trained on the entire collection of text-image pairs, as it is not limited to high-resolution training images.

Phase 2

The second phase greedily grows the Shallow-UViT's encoder/decoder (namely, throwing away the lower-resolution blocks and adding higher-resolution blocks) to obtain the final model. More specifically, this phase adds encoder and decoder layers at different resolutions, while preserving the core representation layers at the spatial resolution used during the first phase. In other words, the core components continue operating on a 16×16 grid. The added layers are randomly initialized, while the core components are initialized with the weights obtained in the first phase. The remaining components of the Shallow-UViT model are discarded.

Next, the grown model is trained. As is common practice for the generation of high-fidelity images, at this point we filter the training data to remove text-image pairs with either image dimension lower than the final model's target resolution. The text encoding layers and the core representation layers are kept frozen, to preserve the richness of the pretrained representation. The time encoding layers, on the other hand, are tuned further, jointly with the new encoder and decoder layers introduced in the second phase, which allows them to adapt to changes in the diffusion noise schedule. We adjust the diffusion logSNR shift for high-resolution images as suggested by Hoogeboom et al. (2023b), by a factor of 2 log(64/d). An optional third, defrosting phase may be applied, in which all layers are jointly tuned to benefit from the full capacity of the end-to-end architecture, but in practice we find that the first two phases are sufficient to obtain a good PSDM.
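A minimal sketch of this second phase, assuming the parameter grouping introduced in subsection 3.1 and PyTorch-style modules (module names, optimizer and learning rate are placeholders rather than the paper's settings):

```python
import math
import torch


def grow_and_freeze(shallow_uvit, grown_uvit, target_resolution=512):
    """Phase-2 sketch: initialize the grown UViT's core from the pretrained
    Shallow-UViT, freeze text-encoding and core-representation layers, and
    leave the time encoding plus the newly added encoder/decoder trainable."""
    # 1) Copy pretrained core weights; the Shallow-UViT's own shallow
    #    encoder/decoder and input/output heads are simply discarded.
    core_state = {
        k: v for k, v in shallow_uvit.state_dict().items()
        if k.startswith(("text_proj", "core_transformer", "time_embed"))
    }
    grown_uvit.load_state_dict(core_state, strict=False)

    # 2) Freeze text-encoding and core-representation layers; keep the time
    #    encoding trainable so it can adapt to the shifted noise schedule.
    for name, p in grown_uvit.named_parameters():
        p.requires_grad = not name.startswith(("text_proj", "core_transformer"))

    # 3) Shift the diffusion logSNR schedule for the higher resolution d,
    #    following Hoogeboom et al. (2023b): shift = 2 * log(64 / d).
    logsnr_shift = 2.0 * math.log(64.0 / target_resolution)

    # 4) Optimize only the trainable (grown + time-encoding) parameters;
    #    a batch size of 256 was sufficient in our ablations.
    trainable = [p for p in grown_uvit.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)   # placeholder optimizer settings
    return optimizer, logsnr_shift
```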

We empirically investigate the behaviour of the proposed algorithm in models of increasing size in subsection 5.2. We investigate the effects of splitting the training of the two tasks across phase one and phase two (i.e., text alignment and high-resolution generation), and we compare with models jointly trained from scratch, end-to-end. During these ablations, we constrain the greedy growing phase to use considerably smaller batch sizes than previous work, with no further regularization, to demonstrate the optimization stability.

4 Experimental settings

Shallow-UViT:

The proposed Shallow-UViT provides a proxy architecture for pre-training the core components of a larger PSDM. The ablation studies below use a specific instantiation of the model, but we expect Shallow-UViT to be flexible enough to be used with other component parts. In particular, we adopt a combination of two pretrained text encoders for text conditioning: T5-XXL (Raffel et al., 2020a) with sequence length 128 and CLIP (ViT-H/14) (Radford et al., 2021b) with sequence length 77. Given a text prompt, we first tokenize and encode the text using the two encoders independently, and then concatenate the embeddings, yielding a final embedding with a sequence length of 205. These are projected to the model's hidden size by the text encoding layers. We keep the Shallow-UViT design fixed, except for changing the capacity by increasing its width (hidden size) and depth (number of transformer blocks), as detailed in Table 1. That produces a set of models varying from 672M up to 7.7B trainable parameters, mostly dedicated to the core components.
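A sketch of these text-encoding layers is shown below, assuming frozen T5-XXL and CLIP ViT-H/14 encoders have already produced per-token embeddings. The projection design and embedding dimensions are illustrative assumptions; the paper only specifies the sequence lengths and the concatenation.

```python
import torch
import torch.nn as nn


class TextConditioning(nn.Module):
    """Concatenate T5-XXL (128 tokens) and CLIP ViT-H/14 (77 tokens) embeddings
    along the sequence axis and project them to the UViT hidden size."""
    def __init__(self, t5_dim=4096, clip_dim=1024, hidden=1536):
        super().__init__()
        self.proj_t5 = nn.Linear(t5_dim, hidden)
        self.proj_clip = nn.Linear(clip_dim, hidden)

    def forward(self, t5_tokens: torch.Tensor, clip_tokens: torch.Tensor) -> torch.Tensor:
        # t5_tokens:   (B, 128, t5_dim)   from the frozen T5-XXL encoder
        # clip_tokens: (B,  77, clip_dim) from the frozen CLIP text encoder
        emb = torch.cat([self.proj_t5(t5_tokens), self.proj_clip(clip_tokens)], dim=1)
        return emb   # (B, 205, hidden), consumed by the cross-attention layers
```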

Table 1: Shallow-UViT variants.

model               | transf. blocks | hidden size | MLP channels | heads | params*
Shallow-UViT Small  | 6              | 1536        | 6144         | 12    | 672M
Shallow-UViT Large  | 8              | 2048        | 8182         | 16    | 1.3B
Shallow-UViT Huge   | 12             | 3072        | 12288        | 24    | 3.5B
Shallow-UViT XHuge  | 16             | 4096        | 16384        | 32    | 7.7B

* Number of trainable parameters, excluding text encoders.

Table 2: End-to-end UViT encoder/decoder parameterizations.

End-to-end model | channels per layer | residual blocks | params
UViT Small       | 256-384-768-1536   | 1-1-1-1         | 707M
UViT Large       | 256-512-1024-2048  | 1-1-1-1         | 1.4B
UViT Huge        | 384-768-1536-3072  | 1-1-1-1         | 3.6B
UViT XHuge       | 512-1024-2048-4096 | 1-1-1-1         | 7.9B

We stress that we do not claim that these specific core components are optimal. For instance, it is widely recognized that larger pretrained text encoders and longer token sequence lengths increase image quality (Saharia et al., 2022b; Balaji et al., 2022; Podell et al., 2024). Investigating the optimal design of each core component is beyond the scope of this work. Instead, the variations of the Shallow-UViT were intentionally designed to explore the performance benefits gained by increasing the core components' capacity independently of the remaining model components.

Greedy growing:

In the experiments that follow we consider several different model sizes. Table 1 specifies the Shallow-UViT variants, while Table 2 specifies encoder/decoder parameterizations.

To test our hypothesis that greedy growing helps the model learn strong representations from larger, diverse corpora, we also train the full model on the high-resolution subset of the data used to train the Shallow-UViT; i.e., we simply remove all samples with resolution lower than the target model resolution. To that end, beyond greedy growing, we explore three training baselines: 1) a baseline with all layers trained from scratch on this subset; 2) as an alternative to the frozen phase of greedy growing, fine-tuning the core components on this smaller high-resolution subset jointly with the grown components (randomly initialized); and 3) a third baseline that adds the optional phase of unfreezing the core components after warming up the random weights for 500k steps. Models are trained for 2M steps in total.

The greedy growing algorithm aims to make training large-scale PSDMs at high resolutions more stable. In the case of Simple Diffusion (Hoogeboom et al., 2023b), large batch sizes and regularizers like dropout and multi-scale losses enable end-to-end training from scratch. To stress-test the stability and convergence of our greedy growing algorithm, we restrict the batch size to 256 instead of the standard 2k, and we use no other explicit form of regularization. Under that restriction, our largest model (UViT-XHuge) exhibited repeated numerical instabilities when trained from scratch or fine-tuned. Thus, the results of this large model are presented only for the frozen and freeze-unfreeze methods. This behaviour confirms observations in previous work and their need for large batch sizes.

Dataset:

Rigorous evaluation of generative image models is challenging when models are trained on proprietary datasets. To avoid this issue, we first demonstrate our key findings through extensive empirical evaluations on a publicly available dataset, namely Conceptual 12M (CC12M) (Changpinyo et al., 2021).

To evaluate the hypothesis that the greedy algorithm allows one to make good use of available corpora, we trained Shallow-UViT on the entire CC12M training set, while corresponding end-to-end models were trained on CC12M's subset of 8.7M images whose dimensions are equal to or larger than 512 pixels. Those end-to-end models were therefore trained on 27.5% less data than the corresponding Shallow-UViT model. We do not explore more aggressive reductions of the corpora, as CC12M is already a relatively small dataset for the models tested, and the variations tested already show overfitting characteristics under this setting, as discussed below. Thus, in what follows, the Shallow-UViT models were trained on 64×64 images, obtained by resizing the smallest dimension of the images to 64 and random cropping along the remaining dimension as needed. The end-to-end models are trained at a target resolution of 512×512, as CC12M does not contain images at resolutions above 1024 pixels.
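A minimal preprocessing sketch for these two training regimes, using torchvision-style transforms (the exact augmentation pipeline is an assumption; the paper only specifies the resize-then-crop behaviour and the resolution filter):

```python
from torchvision import transforms

# Low-resolution preprocessing for Shallow-UViT training: resize the shorter
# image side to 64 pixels (preserving aspect ratio), then random-crop the
# remaining dimension to obtain a 64x64 training image.
shallow_uvit_transform = transforms.Compose([
    transforms.Resize(64),
    transforms.RandomCrop(64),
    transforms.ToTensor(),
])


def keep_for_high_res(width: int, height: int, target: int = 512) -> bool:
    """Resolution filter applied to text-image pairs before end-to-end / phase-2
    training: keep only images with both dimensions at or above the target."""
    return min(width, height) >= target
```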

Full pipeline model:

With those findings in place, we then explore the generation of larger images and train on a much larger curated dataset in order to show that the approach scales to state-of-the-art models (section 6). The resulting model, named Vermeer, is used to generate 1024×1024 images, well beyond the scale for which quantitative metrics are readily available. As such, with Vermeer we rely on human evaluation, in comparison to other recent models like SDXL.

Sampling:

Unless mentioned otherwise, the images and metrics were produced using 256 steps of a DDPM sampler (Ho et al.) with classifier-free guidance (Ho and Salimans, 2021). We tune the guidance hyper-parameter via an FD-Dino/CLIP (ViT-L/14) trade-off, as described in subsection 5.3.
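For reference, the classifier-free guidance step used inside the sampler combines conditional and unconditional noise predictions as sketched below. We use the convention in which a weight of 1 recovers the purely conditional prediction; the model call signature is a placeholder.

```python
import torch


def cfg_noise_prediction(model, x_t, t, text_emb, null_emb, guidance_weight=4.0):
    """Classifier-free guidance (Ho and Salimans, 2021) at one DDPM step.
    `model` is assumed to predict noise given (x_t, t, conditioning); the
    combined prediction is then used by the 256-step DDPM update."""
    eps_cond = model(x_t, t, text_emb)      # text-conditioned prediction
    eps_uncond = model(x_t, t, null_emb)    # prediction with a null/empty prompt
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```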

4.1 Metrics

The evaluation of generative models poses considerable difficulties and constitutes an active research area (Kirstain et al., 2024; Xu et al., 2024; Hessel et al., 2021; Serra et al., 2023; Kim et al., 2024; Lee et al., 2023). In light of its inherent complexity, we utilize a multi-faceted evaluation strategy that combines image distribution metrics, text-alignment metrics and semantic question-answering metrics to validate our intermediate results, while the overall evaluation of our final model, Vermeer, is delegated to human evaluators (subsection 6.2). The following criteria are considered:

Image distribution metrics:

We evaluate models on three key metrics, namely the Fréchet Inception Distance (FID) (Heusel et al., 2017), the Fréchet Distance on the DINOv2 feature space (FD-Dino) (Stein et al., 2023; Oquab et al., 2023), and the CLIP Maximum Mean Discrepancy (CMMD) distance (Jayasumana et al., 2023). FID is widely used to assess generative image models and select model hyper-parameters, but our findings corroborate its known limitations: it fails to reflect model improvements through training, it does not capture readily apparent distortions in individual images, and it does not correlate well with human perception (Stein et al., 2023; Otani et al., 2023; Jayasumana et al., 2023). Thus, in our study, we do not select training or sampling hyper-parameters solely on the basis of FID but, as described in subsection 5.3, we review the trade-offs between the observed set of metrics.
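Both FID and FD-Dino are instances of the same Fréchet distance between Gaussians fitted to feature statistics, differing only in the feature extractor (Inception-v3 vs. DINOv2). A minimal sketch of that computation, assuming the feature means and covariances have already been estimated:

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real   # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```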

We also note that metrics derived from image features vary considerably with image resolution. In what follows we compute metrics using the same resolution as the reference papers. The exception is CMMD on Shallow-UViT outputs; the original metric, computed at 336×336 pixels, is dominated by up-sampling effects, obscuring differences between models. Thus, we replaced the original ViT-L/14 operating at 336×336 with its version at 224×224 pixels.

Multimodal metrics:

We adopt the CLIP score as a metric for text-image alignment, as it is widely used and complements the image distribution metrics above, reflecting the consistency of the generated image with the given prompt. Unlike the original formulation based on ViT-B with patch size 32 (Hessel et al., 2021) and previous papers in the area (Saharia et al., 2022a; Hoogeboom et al., 2023b), we adopt the ViT-L (patch 14) embedding due to its improved representation. This choice results in lower absolute values of our CLIP scores compared to previous results; however, we noticed that these scores better correlate with the presence or absence of observed distortions.
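A sketch of the underlying computation with a ViT-L/14 CLIP model, here via the open_clip package as one possible implementation; the paper does not specify the library, and since the reported values are consistent with raw cosine similarities, no rescaling is applied in this sketch.

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")


@torch.no_grad()
def clip_alignment(pil_image, prompt: str) -> float:
    """Cosine similarity between ViT-L/14 CLIP image and text embeddings."""
    image = preprocess(pil_image).unsqueeze(0)          # (1, 3, 224, 224)
    text = tokenizer([prompt])                          # (1, 77) token ids
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f * txt_f).sum(dim=-1).item()
```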

Semantic QG/A frameworks:

One can also automatically generate question-answer pairs with a language model, and then compute image faithfulness by checking whether existing VQA models can answer the questions from the generated image (Hu et al., 2023; Cho et al., 2024). These frameworks were intended to address the shortcomings of existing metrics. Despite their effectiveness in evaluating color and material aspects, they often struggle to assess counting, spatial relationships, and compositions with multiple objects. Such evaluation measures are naturally dependent on the quality of the underlying question generation (QG) and question answering (QA) models. Here we adopt DSG (an image-text alignment metric) and its set of 1k prompts (Cho et al., 2024). The DSG-1k test prompts cover different challenges (e.g., counting correctly, correct color/shape/text rendering, etc.), semantic categories, and writing styles. A description of the QG and QA models used, with qualitative and detailed results, is included in the Appendix (Shallow-UViT: VqVa detailed categories).

5 Experiments

5.1 Pretraining and scaling the core components

[Figure 2: Qualitative comparison of samples from Shallow-UViT Base, Large, Huge, and XHuge.]

Table 3: Image distribution and CLIP metrics for Shallow-UViT variants at 64×64.

models @ 64×64      | FID_30k ↓ | FD-Dino_30k ↓ | CMMD_30k ↓ | CLIP score ↑
Shallow-UViT Base   | 16.97     | 356.25        | 0.197      | 0.234
Shallow-UViT Large  | 14.80     | 236.24        | 0.156      | 0.240
Shallow-UViT Huge   |  8.81     | 133.51        | 0.139      | 0.244
Shallow-UViT XHuge  |  8.41     | 116.83        | 0.136      | 0.246

Table 4: DSG VqVa results by question type for Shallow-UViT variants.

model               | Entities | Relations | Attributes | Global | DSG ↑
#questions          | 3378     | 1485      | 1722       | 649    |
Shallow-UViT Small  | 54.38    | 33.32     | 43.70      | 39.98  | 48.08
Shallow-UViT Large  | 59.93    | 39.36     | 48.75      | 43.68  | 52.54
Shallow-UViT Huge   | 69.18    | 48.52     | 54.36      | 43.30  | 60.25
Shallow-UViT XHuge  | 70.66    | 51.61     | 57.38      | 44.14  | 61.91

We next use Shallow-UViT as a proxy architecture to investigate the effect of scaling the PSDM's core components. We train Shallow-UViT variants on 64×64 images from the CC12M dataset for 2k steps. Image distribution metrics and the CLIP score are obtained using 30k prompts from the MSCOCO-captions validation set (Chen et al., 2015), while the semantic metrics are computed on the 1k prompts from DSG-1k (Cho et al., 2024). A summary of the impact of scaling the Shallow-UViT model is given in Tables 3 and 4, while fine-grained results on semantic categories are reported in the Appendix (Shallow-UViT: VqVa detailed categories). All performance measures indicate significant improvements due to model scaling. A smaller numerical gain is observed in the comparison of the two largest models, but the difference is reflected in the qualitative comparisons of the models below.

Figure 2 presents a qualitative comparison of results from the Shallow-UViT variants on challenging prompts. They illustrate the impact of scaling on object structure, composition and alignment (e.g., with the number of objects depicted). Despite the small training dataset, the larger models show significant improvement in generating intricate shapes like hands, body parts and text.

We observed further quantitative improvements across the metrics when training our larger models (Shallow-UViT Huge and Shallow-UViT XHuge) for longer, but longer training also exhibits overfitting to the CC12M training samples. Figure 5 illustrates images generated using the Shallow-UViT XHuge model with increasing numbers of training steps. As training progresses, the model diverges from the original prompt to produce images that are closer to training samples from the CC12M dataset, and/or that represent only parts of the prompt. This hidden phenomenon was not associated with changes in the adopted metrics. We conjecture that this effect is largely aggravated by the small size of the training dataset.


Considering the complexity associated with evaluating improvements in representation, and the limitations of automatic performance measures, we also ablate the effect of scaling the core components on a semantic task that is evaluated by human annotators. In this experiment we consider a simple counting task, defined here as the task of generating images of up to 5 objects based on a subset of text prompts from the numerical split of the Gecko benchmark (Wiles et al., 2024). We explore this task as a proxy for gauging both prompt consistency and the model's understanding of object composition and shapes. It allows less subjective interpretation and noise in human judgments of the model's performance than other image qualities that are influenced by individual preferences. The task of counting under an open set would ultimately imply the ability to keep track of objects; this ablation thus emulates a much simpler version of the problem. Figure 6 shows the accuracy improvement associated with scaling, observed over 59 prompts. The random condition uses a random number between 1 and 5. A detailed description of this experiment is presented in the Appendix (On validating the representation quality improvements from scale by counting).

Given the shallow encoder-decoder structure of the Shallow-UViT architecture, we conjecture that the performance improvements observed here, on multiple metrics, are a direct consequence of scaling the core components. This hypothesis is further investigated via the reuse of their representation in the next section.

5.2 Experiments on Greedy growing

Table 5: Image distribution and CLIP metrics for greedy growing and training baselines.

model      | variant         | tr. params | steps | FID_30k ↓ | FD-Dino_30k ↓ | CMMD_30k ↓ | CLIP score ↑
UViT-Base  | scratch         | 707M       | 2M    | 27.90 | 624.34 | 1.355 | 0.241
UViT-Base  | finetuning      |            | 2M    | 23.67 | 554.99 | 1.450 | 0.241
UViT-Base  | frozen core     | 217M       | 2M    | 24.68 | 563.35 | 1.614 | 0.235
UViT-Base  | freeze-unfreeze | 217M/707M  | 2M    | 21.13 | 503.16 | 1.196 | 0.247
UViT-Large | scratch         | 1.4B       | 2M    | 21.73 | 498.82 | 1.156 | 0.247
UViT-Large | finetuning      |            | 2M    | 21.89 | 414.42 | 1.160 | 0.253
UViT-Large | frozen core     | 351M       | 2M    | 17.68 | 195.80 | 0.752 | 0.264
UViT-Large | freeze-unfreeze | 351M/1.4B  | 2M    | 18.37 | 362.58 | 0.952 | 0.256
UViT-Huge  | scratch         | 3.6B       | 2M    | 18.58 | 382.17 | 1.053 | 0.256
UViT-Huge  | finetuning      |            | 2M    | 17.52 | 302.28 | 0.988 | 0.264
UViT-Huge  | frozen core     | 723M       | 2M    | 15.21 | 156.24 | 0.663 | 0.268
UViT-Huge  | freeze-unfreeze | 723M/3.6B  | 2M    | 16.17 | 231.94 | 0.683 | 0.262
UViT-XHuge | frozen core     | 1.2B       | 2M    | 15.32 | 152.12 | 0.571 | 0.269
UViT-XHuge | freeze-unfreeze | 1.2B/7.9B  | 2M    | 16.58 | 222.38 | 0.620 | 0.267

Table 6: DSG VqVa results by question type for greedy growing and training baselines.

model      | variant         | steps | Entities | Relations | Attributes | Global | DSG
UViT-Base  | scratch         | 2M    | 73.16 | 53.91 | 62.31 | 55.55 | 64.83
UViT-Base  | finetuning      | 2M    | 70.23 | 49.90 | 58.89 | 53.24 | 62.75
UViT-Base  | frozen          | 2M    | 69.57 | 49.36 | 58.22 | 53.39 | 61.16
UViT-Base  | freeze-unfreeze | 2M    | 73.40 | 53.54 | 62.83 | 56.86 | 66.13
UViT-Large | scratch         | 2M    | 73.31 | 52.02 | 62.95 | 58.01 | 66.02
UViT-Large | finetuning      | 2M    | 75.01 | 54.11 | 65.82 | 57.86 | 67.39
UViT-Large | frozen          | 2M    | 78.97 | 61.55 | 67.19 | 61.40 | 72.13
UViT-Large | freeze-unfreeze | 2M    | 74.67 | 55.45 | 64.08 | 58.78 | 67.79
UViT-Huge  | scratch         | 2M    | 74.33 | 55.02 | 62.98 | 58.63 | 66.90
UViT-Huge  | finetuning      | 2M    | 77.29 | 56.40 | 67.13 | 62.56 | 69.67
UViT-Huge  | frozen          | 2M    | 82.59 | 64.65 | 70.35 | 61.86 | 75.15
UViT-Huge  | freeze-unfreeze | 2M    | 79.04 | 58.11 | 65.97 | 60.86 | 71.50
UViT-XHuge | frozen          | 2M    | 83.70 | 66.77 | 70.01 | 62.94 | 75.70
UViT-XHuge | freeze-unfreeze | 2M    | 81.14 | 60.44 | 69.40 | 60.25 | 73.53

We next explore greedy growing of Shallow-UViT models into high-resolution, non-cascaded models. We compare training models from scratch on the subset of the CC12M dataset filtered by the target resolution (512 pixels) with alternatives for reusing the core components pretrained on the full dataset. These experiments validate the main intuitions behind the greedy growing algorithm, namely that the introduction of new, untrained layers, as well as shifts in the distribution of the training data, are known causes of catastrophic forgetting (Vasconcelos et al., 2022; Kuo et al., 2023; Yu et al., 2023), possibly damaging the pre-trained representation.

Tables 5 and 6 summarize performance as a function of model scale for greedy growing, along with various ablations of the training procedure. Our greedy growing recipe with frozen core components and its optional defrosting phase leads to the best results across the metrics. The optional defrosting phase is required for improving the performance of the smallest model ablated (UViT-Base). Its frozen counterpart showed signs of underfitting during training, as it has a small number of trainable parameters (217M) in the added layers. Under this low-capacity scenario, the defrosting phase offers a balance between protecting the core components' representation and using the model's full capacity, as it reduces the degradation of the pretrained representation by warming up the grown layers. Other than this special case, the defrosting phase did not appear to benefit larger models. These quantitative results agree with our hypothesis that the final model benefits from protecting the pretrained representation in our greedy growing algorithm.

Figure 8 qualitatively compares generations obtained by finetuning and by freezing the core components. Additional qualitative comparisons are shown in the Appendix (Qualitative comparison of finetuning and frozen e2e models). They illustrate the benefits of protecting the core components from the noise introduced when back-propagating through the randomly initialized grown layers. We observe that the low-resolution images produced using the same representation under the original Shallow-UViT models contain objects whose shapes and parts are correctly defined.

The high-resolution images generated from early steps (20k) of finetuning the core components under the UViT architecture present objects with correct shapes superimposed with the diffusion noise. Soon after (around 50k-100k steps), the quality of object shapes and structure decays as training backpropagates the noise introduced by the grown layers through the pretrained representation.

Under the greedy growing regime, and for the same number of training steps (20k), the frozen model is able to produce objects with correct shapes and parts, and maintains their composition as training progresses. Another direct side effect of maintaining the core components' representation is the fast reduction of diffusion noise early in training.

5.3 Guidance tuning

[Figure 9: FID, FD-Dino, CMMD and CLIP score curves as a function of guidance weight, with accompanying sample grids.]

Diffusion model hyper-parameters affect both training and sampling quality. It is common practice to tune the sampler guidance weights using FID-CLIP score trade-off curves (Saharia et al., 2022a; Hoogeboom et al., 2023b; Podell et al., 2024). In doing so, one aims to strike a balance between image quality (by minimizing FID) and alignment with the text prompt (by maximizing the CLIP score). That said, it is well known that FID does not correlate particularly well with human perception (Stein et al., 2023; Otani et al., 2023; Jayasumana et al., 2023), and large guidance weights are known to increase the CLIP score but tend to produce over-sharpened, high-contrast images and unrealistic objects (Ho and Salimans, 2021; Saharia et al., 2022b). Due to such limitations, despite the widespread use of FID-CLIP scores for performance comparisons, in practice they are adopted as loose measures of performance, and guidance weights are typically set through qualitative inspection.

Here we explore alternative metrics for hyper-parameter tuning, aiming to better reflect deployment use and, ultimately, human perception. These include recent measures with alternative feature spaces that exhibit better robustness in classification tasks and align somewhat better with human judgements of image quality and alignment. More specifically, we investigate the use of FD-Dino and CMMD as alternatives to FID in the calibration of the guidance hyper-parameter. Figure 9 plots the response curves of different metrics as a function of guidance weight, measured using our UViT-XHuge frozen model over 30k samples from the MSCOCO-captions validation set. It illustrates that the three image distribution metrics are minimized at very different guidance values. Similar curves are observed for the other models and training modalities, in which the best guidance values for minimizing FID, FD-Dino and CMMD are in increasing order. Figure 11 further illustrates samples obtained at the optimal values for each metric, and also when using the maximum guidance tested (16) to increase the CLIP score even further.

A qualitative analysis shows that minimizing FID favors the generation of natural colors and textures but, under closer inspection, fails to produce realistic object shapes and parts. We conjecture that this matches prior observations on the texture vs. shape bias of image classifiers (Geirhos et al., 2019). Guidance values minimizing the distance on DINOv2 features, on the other hand, appear to produce natural color distributions and objects with natural shapes and composition. We adopt the value at this minimum as our new lower bound. Increasing guidance from that value tends to increase color contrast and sharpening.

Guidance weights minimizing CMMD tend to produce images with initial signs of saturated colors and over-sharpening. Given its use of CLIP features for image distribution comparison, this agrees with previous observations on the CLIP score. But unlike CLIP score curves, CMMD curves present an inflection point within the range investigated. We use this inflection point to define a closed range for our search over reasonable guidance weights. That is, the range of guidance weights between the FD-Dino and CMMD minima was observed to strike a balance between producing correct shapes and aesthetically pleasing images characterized by enhanced color contrast and sharp edges.

All results presented in this section were generated using guidance weights within the FD-Dino/CMMD trade-off range. The specific value selected was taken at the intersection of the optimal ranges of the models under the same comparison. Following this approach, our Shallow-UViT results were obtained with the guidance weight fixed at 1.75, and their corresponding UViT models with guidance 4.0.
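The resulting selection heuristic can be summarized in a few lines; this is a sketch, and the final value inside the interval was chosen by comparing models under the same setup and by visual inspection.

```python
import numpy as np


def guidance_search_range(weights, fd_dino, cmmd):
    """Given metric curves measured over a sweep of guidance weights, return the
    [FD-Dino minimum, CMMD minimum] interval used to bound the guidance search."""
    w = np.asarray(weights, dtype=float)
    lower = w[int(np.argmin(fd_dino))]   # natural shapes/colors: lower bound
    upper = w[int(np.argmin(cmmd))]      # saturation/over-sharpening onset: upper bound
    return min(lower, upper), max(lower, upper)


# Example with a hypothetical sweep from 1 to 16:
# lo, hi = guidance_search_range([1, 2, 4, 8, 16], fd_dino_curve, cmmd_curve)
```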

6 A full diffusion pipeline: Vermeer


Vermeer is an 8B parameter model grown from 256 to 1024 pixel resolution. The UViT architecture is similar to our UViT-Huge model (Table 2), except that its bottom layers operate on a 32×32 grid, with 32 transformer blocks in total. We found that allocating transformer blocks at the 32×32 scale improves details (like small faces). For Vermeer's text encoding, in addition to the T5-XXL (Raffel et al., 2020a) and CLIP (Radford et al., 2021b) embeddings previously mentioned, we also include a ByT5 (Xue et al., 2022b) encoder with sequence length 256, resulting in a final embedding with a sequence length of 461.

The baseline version (the Vermeer raw model) is trained with a batch size of 2k at 256 resolution for 2M iterations, then grown to 1k resolution and finetuned for an additional 1M steps. As illustrated in Figure 13, it supports 3 aspect ratios, i.e., 1024×1024, 768×1376, and 1376×768, through aspect ratio bucketing (Anlatan). Once the raw model is trained, we apply the following extra steps to improve the aesthetics of the generated images:

  • Style finetuning. We train an image classifier based on images that conform to aesthetic and compositional attributes like those described in (Dai et al., 2023), and use it to select 3k images from our training data as a fine-tuning set. We then fine-tune for 8k steps on a mixture of the original data and the aesthetic subset. We condition the model on the aesthetic subset by adding a token to the text prompt (a data-mixing sketch follows this list). We found that finetuning the pixel model on a mixture of pretraining and finetuning data is needed to avoid catastrophic forgetting and the introduction of additional artifacts.

  • Distillation. The vanilla Vermeer model uses a 256-step sampling process, making it computationally expensive for real-world use. We employ multistep consistency model (MCM) distillation (Heek et al., 2024) to distill the style-tuned Vermeer to 16 steps, achieving a substantial 16× speedup while maintaining high visual quality.
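The data-mixing step referenced in the style-finetuning item above can be sketched as follows. The conditioning token and the mixing fraction are hypothetical; the paper states only that a token is added to the prompts of the aesthetic subset and that pretraining data is mixed in.

```python
import random

AESTHETIC_TOKEN = "<aesthetic>"   # hypothetical conditioning token


def sample_style_finetuning_example(pretrain_set, aesthetic_set, aesthetic_fraction=0.5):
    """Draw one (image, prompt) pair for style finetuning: mix the original
    training data with the curated aesthetic subset, and mark aesthetic samples
    with an extra prompt token so the style can be requested at inference."""
    if random.random() < aesthetic_fraction:              # fraction is an assumption
        image, prompt = random.choice(aesthetic_set)
        return image, f"{AESTHETIC_TOKEN} {prompt}"
    return random.choice(pretrain_set)
```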

6.1 Vermeer results

Table 7: Image distribution and CLIP metrics for Vermeer variants vs. SDXL.

model                  | FID_30k ↓ | FD-Dino_30k ↓ | CMMD_30k ↓ | CLIP score ↑
SDXL v1.0              | 13.19     | 185.57        | 0.898      | 0.279
Vermeer raw model      | 16.26     | 185.25        | 0.631      | 0.270
  + prompt engineering | 17.33     | 216.01        | 0.867      | 0.269
  + style tuning       | 24.51     | 336.25        | 1.167      | 0.262
  distilled            | 25.97     | 347.19        | 0.885      | 0.261

Table 8: DSG VqVa results by question type (↑).

model              | Entities | Relations | Attributes | Global | DSG
SD 2.1             | 75.44    | 53.06     | 69.66      | 68.49  | 71.23
Muse               | 77.65    | 60.64     | 75.61      | 67.18  | 73.09
Imagen Cascade     | 79.94    | 62.73     | 75.73      | 69.34  | 75.93
SDXL v1.0          | 88.04    | 73.00     | 78.48      | 75.19  | 81.47
Vermeer raw model  | 86.92    | 76.36     | 76.48      | 68.49  | 80.77
  + prompt eng.    | 87.94    | 74.92     | 76.31      | 67.41  | 80.99
  + style tuning   | 88.04    | 74.21     | 77.38      | 69.57  | 81.16
  + distillation   | 84.71    | 69.23     | 72.68      | 65.49  | 76.88

We ablated four steps of Vermeer's development: (i) the raw model resulting from training on a large dataset; (ii) the result of applying prompt engineering at inference to the same model, adding words to improve aesthetic image quality, but with no further training; (iii) the final model, after style finetuning on a curated subset of 3k aesthetically pleasing images; and finally, (iv) its distilled, fast-inference variant. Table 7 reports key performance metrics for all four variants, along with Stable Diffusion XL v1.0 (SDXL) (Podell et al., 2024). One can see that the raw model minimizes the image distribution metrics that use state-of-the-art feature spaces, i.e., FD-Dino and CMMD, while the CLIP score suggests a minor drop compared to SDXL. These metrics also highlight a significant shift away from the distribution of MSCOCO-captions (Chen et al., 2015) after augmenting the prompts (+prompt engineering), which increases further when combined with finetuning the model for aesthetically pleasing images (+style tuning).

The MSCOCO-captions dataset comprises reference image-caption pairs covering a diverse set of object categories and scenes. Thus, it offers an interesting distribution for measuring image quality and text alignment due to the complexity and diversity of the compositions. At the same time, its use for visual quality preference assessment is questionable, as its images were not curated for human aesthetic preferences; on the contrary, many of the images have relatively poor aesthetic appeal. Thus, aiming to improve image aesthetics and composition, during Vermeer's prompt engineering and style tuning phases we intentionally move the distribution of images generated by Vermeer away from the MSCOCO-captions distribution. To validate this we rely on human evaluation (in the next section).

The effect of these changes on the raw model's CLIP score and on the semantic metrics, on the other hand, is minimal, aligned with our observation that the consistency of the model is not much affected by these two procedures. Semantic VqVa results are presented in Table 8. The Imagen (Saharia et al., 2022b) and Muse (Chang et al., 2023) models referenced in this table are versions trained on internal data sources, and thus on resources and training pipelines similar to Vermeer's. The table shows that Vermeer is competitive with SDXL and surpasses the other models, including auto-regressive and cascaded models.

Finally, we also develop a distilled version of our model to offer faster inference; like the other models presented in this paper, it operates as a single, non-cascaded, end-to-end model at inference time. Figure 13 illustrates Vermeer outputs, and additional qualitative results, including a comparison of samples from the full and distilled versions, are presented in Appendix Vermeer distillation: qualitative results.

6.2 Human evaluation

[Figure 14: Side-by-side human evaluation of Vermeer vs. SDXL: aesthetics preference per Vermeer variant (left), consistency preference (middle), and the Likert breakdown for the final model with aggregated responses (right).]

Assessing the performance of text-to-image models ideally depends on human evaluation, as this complex cognitive task requires a deep understanding of the relationship between text and image. Many recent works nevertheless rely exclusively on automated metrics, such as the Fréchet Inception Distance (FID), yet current automated measures have been shown to be inconsistent with human perception when assessing the quality of text-to-image samples (Otani et al., 2023). Thus, to objectively assess the quality of images generated by Vermeer, we conduct a side-by-side human evaluation comparing our model with SDXL (Podell et al., 2024).

Setup. In this human evaluation, we ask annotators to evaluate images generated by Vermeer and SDXL from the same prompt. For this, we collected 495 prompts (we first sampled 510 prompts; 495 were usable after filtering incomplete samples) covering a range of skills: 160 are from TIFA v1.0, which measures the faithfulness of a generated image to its text input across 12 categories (object, attributes, counting, etc.) (Hu et al., 2023); 200 are sampled from the 1600 Parti Prompts (Yu et al., 2022), selected for both complexity and diversity of challenges; and the remaining 150 are either created fresh or sourced from more recent prompting strategies targeting challenging cases.

We create two tasks in which we instruct annotators to consider either image quality (aesthetics) or fit to the prompt (consistency), and to indicate their preference on a 3-point Likert scale: Vermeer is preferred, Unsure, or SDXL is preferred (model names are anonymized). The neutral response covers cases where both images are equally good or equally bad. In the annotation UI, annotators are shown a prompt along with the two images in random order. We collected 13 human ratings per prompt for both aesthetics and consistency (26 ratings per image).

Results. Prompt engineering and style tuning are confirmed to have a positive effect on human aesthetic preference (Figure 14, left) and a small impact on text consistency (Figure 14, middle). This confirms our conjecture that the drop in Vermeer's scores on metrics grounded in the appearance of the MSCOCO-captions dataset, induced by these two steps, is consistent with the ultimate goal of human preference (Table 7).

Figure 14 (right) plots the Likert responses for our final model in each task (aesthetics and consistency), as well as the aggregated responses (bottom bar). Overall, annotators prefer Vermeer 44.0% of the time and SDXL 21.4% of the time, with relatively fewer Neutral responses (34.7%). Vermeer is clearly preferred for its aesthetics, with a win rate of 61.4%, while the gap in consistency between the two models is small, with a difference in win rate of just 1.7%. Krippendorff's α values for aesthetics and consistency are 0.27 and 0.41, respectively, indicating moderate agreement among annotators.
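For concreteness, the win rates and agreement statistic above can be aggregated roughly as follows. This is a minimal sketch assuming ratings coded as 'vermeer', 'unsure', or 'sdxl', and the third-party krippendorff package as a stand-in for the exact agreement computation used here.

```python
import math
from collections import Counter

import krippendorff  # third-party package (pip install krippendorff); an assumption

def win_rates(ratings):
    """ratings: flat list of 'vermeer' / 'unsure' / 'sdxl' labels over all prompts."""
    counts = Counter(ratings)
    n = sum(counts.values())
    return {k: counts[k] / n for k in ("vermeer", "unsure", "sdxl")}

def rater_agreement(ratings_by_annotator):
    """ratings_by_annotator: one row per annotator, one entry per item
    (None where an annotator did not rate that item). Nominal-level alpha."""
    code = {"vermeer": 0.0, "unsure": 1.0, "sdxl": 2.0, None: math.nan}
    data = [[code[r] for r in row] for row in ratings_by_annotator]
    return krippendorff.alpha(reliability_data=data,
                              level_of_measurement="nominal")
```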

7 Conclusion

We propose a novel recipe for training non-cascaded, large-scale, pixel-space text-to-image diffusion models. It benefits from splitting training into two phases representing different tasks: learning image-text alignment and learning to generate images at high resolution.

We identified the model's core components as those responsible for the first task and propose a proxy architecture (Shallow-UViT) to support their pretraining. The second task is learned with a greedy growing algorithm that stacks encoder-decoder layers of the final architecture on top of the pretrained core components. When learning the second task, our training recipe shields the core components' representation from the noise introduced by the newly grown, randomly initialized layers.
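A minimal PyTorch-style sketch of this growing step is given below. It assumes the pretrained core and the new stages are generic nn.Modules with hypothetical signatures, and illustrates only the freeze-then-grow idea rather than the exact UViT architecture or training schedule.

```python
import torch.nn as nn

class GrownModel(nn.Module):
    """Sketch of one growing step: wrap a pretrained core with freshly
    initialized outer encoder/decoder stages and freeze the core."""

    def __init__(self, core: nn.Module, new_encoder: nn.Module,
                 new_decoder: nn.Module, freeze_core: bool = True):
        super().__init__()
        self.new_encoder, self.core, self.new_decoder = new_encoder, core, new_decoder
        if freeze_core:
            # Shield the pretrained representation while the new, randomly
            # initialized layers warm up ("frozen" setting); it can be
            # released later ("freeze-unfreeze" setting).
            for p in self.core.parameters():
                p.requires_grad_(False)

    def forward(self, x, t, text_emb):
        # Hypothetical signatures: the real UViT stages also route skip
        # connections and conditioning in an architecture-specific way.
        h, skips = self.new_encoder(x, t, text_emb)
        h = self.core(h, t, text_emb)
        return self.new_decoder(h, skips, t, text_emb)

def trainable_params(model: nn.Module):
    """Only the grown layers are optimized while the core stays frozen."""
    return [p for p in model.parameters() if p.requires_grad]
```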

Existing training recipes for non-cascaded models struggle at scale unless supported by large batch sizes and additional regularization such as dropout or multi-scale losses. Our approach trains models of up to 8B parameters with a small batch size (256) and no further regularization, by pretraining the core components and preserving them during the second training phase, which targets high-resolution generation.

Compared with training from scratch or finetuning, the greedy growing procedure is more stable and improves performance on a range of metrics. Qualitative analysis shows that keeping the core components' representation stable helps preserve object shape and overall structure, improving the definition of body parts. Our method allows the use of data at different resolutions: the first phase benefits from larger corpora with minimal requirements on image resolution, while the second phase learns to produce sharp images from the subset filtered by target resolution, reusing the representation learned from the larger set. We also explore models of increasing size and show the benefits of scaling across different aspects and metrics.

In practice, the non-cascaded solution removes the distribution shift between training and deployment of super-resolution stages. Building on this, we present Vermeer, an 8B-parameter pixel-based text-to-image diffusion model that produces high-resolution, high-quality images with a single non-cascaded model. By training on a larger dataset and incorporating a final style-tuning phase, Vermeer surpasses SDXL v1.0 in a human preference study.


References

  • Anlatan. NovelAI improvements on Stable Diffusion. URL https://blog.novelai.net/.
  • Balaji etal. (2022)Y.Balaji, S.Nah, X.Huang, A.Vahdat, J.Song, K.Kreis, M.Aittala, T.Aila,S.Laine, B.Catanzaro, etal.ediffi: Text-to-image diffusion models with an ensemble of expertdenoisers.arXiv preprint arXiv:2211.01324, 2022.
  • Bar-Tal etal. (2024)O.Bar-Tal, H.Chefer, O.Tov, C.Herrmann, R.Paiss, S.Zada, A.Ephrat,J.Hur, G.Liu, A.Raj, Y.Li, M.Rubinstein, T.Michaeli, O.Wang, D.Sun,T.Dekel, and I.Mosseri.Lumiere: A space-time diffusion model for video generation, 2024.
  • Betker et al. (2023) J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
  • Chang etal. (2023)H.Chang, H.Zhang, J.Barber, A.Maschinot, J.Lezama, L.Jiang, M.-H. Yang,K.P. Murphy, W.T. Freeman, M.Rubinstein, Y.Li, and D.Krishnan.Muse: Text-to-image generation via masked generative transformers.In A.Krause, E.Brunskill, K.Cho, B.Engelhardt, S.Sabato, andJ.Scarlett, editors, Proceedings of the 40th International Conferenceon Machine Learning, volume 202 of Proceedings of Machine LearningResearch, pages 4055–4075. PMLR, 23–29 Jul 2023.
  • Changpinyo etal. (2021)S.Changpinyo, P.Sharma, N.Ding, and R.Soricut.Conceptual 12M: Pushing web-scale image-text pre-training torecognize long-tail visual concepts.In CVPR, 2021.
  • Chen etal. (2023)H.Chen, J.Gu, A.Chen, W.Tian, Z.Tu, L.Liu, and H.Su.Single-stage diffusion nerf: A unified approach to 3d generation andreconstruction.In ICCV, 2023.
  • Chen etal. (2015)X.Chen, H.Fang, T.-Y. Lin, R.Vedantam, S.Gupta, P.Dollár, and C.L.Zitnick.Microsoft coco captions: Data collection and evaluation server.CoRR, abs/1504.00325, 2015.
  • Cho etal. (2024)J.Cho, Y.Hu, R.Garg, P.Anderson, R.Krishna, J.Baldridge, M.Bansal,J.Pont-Tuset, and S.Wang.Davidsonian Scene Graph: Improving Reliability in Fine-GrainedEvaluation for Text-to-Image Generation.In ICLR, 2024.
  • Chung etal. (2023)H.Chung, J.Kim, M.T. Mccann, M.L. Klasky, and J.C. Ye.Diffusion posterior sampling for general noisy inverse problems.In The Eleventh International Conference on LearningRepresentations, 2023.
  • Dai etal. (2023)X.Dai, J.Hou, C.-Y. Ma, S.Tsai, J.Wang, R.Wang, P.Zhang, S.Vandenhende,X.Wang, A.Dubey, etal.Emu: Enhancing image generation models using photogenic needles in ahaystack.arXiv preprint arXiv:2309.15807, 2023.
  • Dhariwal and Nichol (2021)P.Dhariwal and A.Nichol.Diffusion models beat gans on image synthesis.NeurIPS, pages 8780–8794, 2021.
  • Dosovitskiy etal. (2021)A.Dosovitskiy, L.Beyer, A.Kolesnikov, D.Weissenborn, X.Zhai,T.Unterthiner, M.Dehghani, M.Minderer, G.Heigold, S.Gelly, J.Uszkoreit,and N.Houlsby.An image is worth 16x16 words: Transformers for image recognition atscale.In International Conference on Learning Representations, 2021.
  • Geirhos etal. (2019)R.Geirhos, P.Rubisch, C.Michaelis, M.Bethge, F.A. Wichmann, andW.Brendel.Imagenet-trained CNNs are biased towards texture; increasing shapebias improves accuracy and robustness.In International Conference on Learning Representations, 2019.
  • Graikos etal. (2022)A.Graikos, N.Malkin, N.Jojic, and D.Samaras.Diffusion models as plug-and-play priors.In A.H. Oh, A.Agarwal, D.Belgrave, and K.Cho, editors,Advances in Neural Information Processing Systems, 2022.
  • Gu etal. (2023)J.Gu, S.Zhai, Y.Zhang, J.M. Susskind, and N.Jaitly.Matryoshka diffusion models.In The Twelfth International Conference on LearningRepresentations, 2023.
  • Heek etal. (2024)J.Heek, E.Hoogeboom, and T.Salimans.Multistep consistency models, 2024.
  • Hessel etal. (2021)J.Hessel, A.Holtzman, M.Forbes, R.L. Bras, and Y.Choi.Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021.
  • Heusel etal. (2017)M.Heusel, H.Ramsauer, T.Unterthiner, B.Nessler, and S.Hochreiter.Gans trained by a two time-scale update rule converge to a local nashequilibrium.In Proceedings of the 31st International Conference on NeuralInformation Processing Systems, NIPS’17, page 6629–6640, Red Hook, NY,USA, 2017. Curran Associates Inc.ISBN 9781510860964.
  • Ho and Salimans (2021)J.Ho and T.Salimans.Classifier-free diffusion guidance.In NeurIPS 2021 Workshop on Deep Generative Models andDownstream Applications, 2021.
  • Ho et al. (2020) J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020.
  • Ho etal. (2022a)J.Ho, C.Saharia, W.Chan, D.J. Fleet, M.Norouzi, and T.Salimans.Cascaded diffusion models for high fidelity image generation.J. Mach. Learn. Res., 23(1), jan 2022a.ISSN 1532-4435.
  • Ho etal. (2022b)J.Ho, T.Salimans, A.A. Gritsenko, W.Chan, M.Norouzi, and D.J. Fleet.Video diffusion models.In ICLR Workshop on Deep Generative Models for HighlyStructured Data, 2022b.
  • Hoogeboom etal. (2023a)E.Hoogeboom, J.Heek, and T.Salimans.Simple diffusion: End-to-end diffusion for high resolution images.In ICML, 2023a.
  • Hoogeboom etal. (2023b)E.Hoogeboom, J.Heek, and T.Salimans.Simple diffusion: End-to-end diffusion for high resolution images.In Proceedings of the 40th International Conference on MachineLearning, ICML’23. JMLR.org, 2023b.
  • Hu etal. (2023)Y.Hu, B.Liu, J.Kasai, Y.Wang, M.Ostendorf, R.Krishna, and N.A. Smith.Tifa: Accurate and interpretable text-to-image faithfulnessevaluation with question answering.In Proceedings of the IEEE/CVF International Conference onComputer Vision (ICCV), pages 20406–20417, October 2023.
  • Jabri etal. (2022)A.Jabri, D.Fleet, and T.Chen.Scalable adaptive computation for iterative generation.arXiv preprint arXiv:2212.11972, 2022.
  • Jaini etal. (2024)P.Jaini, K.Clark, and R.Geirhos.Intriguing properties of generative classifiers.In The Twelfth International Conference on LearningRepresentations, 2024.
  • Jalal etal. (2021)A.Jalal, M.Arvinte, G.Daras, E.Price, A.G. Dimakis, and J.Tamir.Robust compressed sensing mri with deep generative priors.In M.Ranzato, A.Beygelzimer, Y.Dauphin, P.Liang, and J.W.Vaughan, editors, Advances in Neural Information Processing Systems,volume34, pages 14938–14954. Curran Associates, Inc., 2021.
  • Jayasumana etal. (2023)S.Jayasumana, S.Ramalingam, A.Veit, D.Glasner, A.Chakrabarti, andS.Kumar.Rethinking fid: Towards a better evaluation metric for imagegeneration.arXiv preprint arXiv:2401.09603, 2023.
  • Kadkhodaie and Simoncelli (2021)Z.Kadkhodaie and E.Simoncelli.Stochastic solutions for linear inverse problems using the priorimplicit in a denoiser.In M.Ranzato, A.Beygelzimer, Y.Dauphin, P.Liang, and J.W.Vaughan, editors, Advances in Neural Information Processing Systems,volume34, pages 13242–13254. Curran Associates, Inc., 2021.
  • Karras etal. (2018)T.Karras, T.Aila, S.Laine, and J.Lehtinen.Progressive growing of GANs for improved quality, stability, andvariation.In International Conference on Learning Representations, 2018.
  • Kawar etal. (2022)B.Kawar, M.Elad, S.Ermon, and J.Song.Denoising diffusion restoration models.In A.H. Oh, A.Agarwal, D.Belgrave, and K.Cho, editors,Advances in Neural Information Processing Systems, 2022.
  • Kim etal. (2024)K.Kim, J.Jeong, M.An, M.Ghavamzadeh, K.D. Dvijotham, J.Shin, and K.Lee.Confidence-aware reward optimization for fine-tuning text-to-imagemodels.In The Twelfth International Conference on LearningRepresentations, 2024.
  • Kirstain etal. (2024)Y.Kirstain, A.Polyak, U.Singer, S.Matiana, J.Penna, and O.Levy.Pick-a-pic: An open dataset of user preferences for text-to-imagegeneration.Advances in Neural Information Processing Systems, 36, 2024.
  • Kuo etal. (2023)W.Kuo, Y.Cui, X.Gu, A.Piergiovanni, and A.Angelova.Open-vocabulary object detection upon frozen vision and languagemodels.In The Eleventh International Conference on LearningRepresentations, 2023.
  • Lee etal. (2023)T.Lee, M.Yasunaga, C.Meng, Y.Mai, J.S. Park, A.Gupta, Y.Zhang,D.Narayanan, H.B. Teufel, M.Bellagente, M.Kang, T.Park, J.Leskovec,J.-Y. Zhu, L.Fei-Fei, J.Wu, S.Ermon, and P.Liang.Holistic evaluation of text-to-image models.In Thirty-seventh Conference on Neural Information ProcessingSystems Datasets and Benchmarks Track, 2023.
  • Levy etal. (2023)M.Levy, B.D. Giorgi, F.Weers, A.Katharopoulos, and T.Nickson.Controllable music production with diffusion models and guidancegradients.In NeurIPS, 2023.
  • Liu etal. (2023)R.Liu, D.Garrette, C.Saharia, W.Chan, A.Roberts, S.Narang, I.Blok,R.Mical, M.Norouzi, and N.Constant.Character-aware models improve visual text rendering.In A.Rogers, J.Boyd-Graber, and N.Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers), Toronto, Canada, July2023. Association for Computational Linguistics.
  • Nichol etal. (2022)A.Nichol, P.Dhariwal, A.Ramesh, P.Shyam, P.Mishkin, B.McGrew,I.Sutskever, and M.Chen.GLIDE: Towards photorealistic image generation and editing withtext-guided diffusion models.arXiv preprint arXiv:2112.10741, 2022.
  • Nieder and Dehaene (2009)A.Nieder and S.Dehaene.Representation of number in the brain.Annual review of neuroscience, 32:185–208, 2009.
  • Oquab etal. (2023)M.Oquab, T.Darcet, T.Moutakanni, H.V. Vo, M.Szafraniec, V.Khalidov,P.Fernandez, D.Haziza, F.Massa, A.El-Nouby, R.Howes, P.-Y. Huang, H.Xu,V.Sharma, S.-W. Li, W.Galuba, M.Rabbat, M.Assran, N.Ballas, G.Synnaeve,I.Misra, H.Jegou, J.Mairal, P.Labatut, A.Joulin, and P.Bojanowski.Dinov2: Learning robust visual features without supervision, 2023.
  • Otani et al. (2023) M. Otani, R. Togashi, Y. Sawai, R. Ishigami, Y. Nakashima, E. Rahtu, J. Heikkilä, and S. Satoh. Toward verifiable and reproducible human evaluation for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14277–14286. IEEE, 2023. 10.1109/CVPR52729.2023.01372.
  • Peebles and Xie (2023)W.Peebles and S.Xie.Scalable diffusion models with transformers.In 2023 IEEE/CVF International Conference on Computer Vision(ICCV), pages 4172–4182, 2023.10.1109/ICCV51070.2023.00387.
  • Podell etal. (2024)D.Podell, Z.English, K.Lacey, A.Blattmann, T.Dockhorn, J.Müller,J.Penna, and R.Rombach.SDXL: Improving Latent Diffusion Models for High-Resolution ImageSynthesis.In ICLR, 2024.
  • Poole etal. (2023)B.Poole, A.Jain, J.T. Barron, and B.Mildenhall.Dreamfusion: Text-to-3d using 2d diffusion.In The Eleventh International Conference on LearningRepresentations, 2023.
  • Radford etal. (2021a)A.Radford, J.W. Kim, C.Hallacy, A.Ramesh, G.Goh, S.Agarwal, G.Sastry,A.Askell, P.Mishkin, J.Clark, G.Krueger, and I.Sutskever.Learning transferable visual models from natural languagesupervision.In M.Meila and T.Zhang, editors, Proceedings of the 38thInternational Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 8748–8763. PMLR,18–24 Jul 2021a.
  • Radford etal. (2021b)A.Radford, J.W. Kim, C.Hallacy, A.Ramesh, G.Goh, S.Agarwal, G.Sastry,A.Askell, P.Mishkin, J.Clark, G.Krueger, and I.Sutskever.Learning transferable visual models from natural languagesupervision.In M.Meila and T.Zhang, editors, Proceedings of the 38thInternational Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 8748–8763. PMLR,18–24 Jul 2021b.
  • Raffel etal. (2020a)C.Raffel, N.Shazeer, A.Roberts, K.Lee, S.Narang, M.Matena, Y.Zhou,W.Li, and P.J. Liu.Exploring the limits of transfer learning with a unified text-to-texttransformer.Journal of Machine Learning Research, 21(140):1–67, 2020a.
  • Raffel etal. (2020b)C.Raffel, N.Shazeer, A.Roberts, K.Lee, S.Narang, M.Matena, Y.Zhou,W.Li, and P.J. Liu.Exploring the limits of transfer learning with a unified text-to-texttransformer.Journal of Machine Learning Research, 21(140):1–67, 2020b.
  • Ramachandran etal. (2017)P.Ramachandran, B.Zoph, and Q.V. Le.Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017.
  • Ramesh etal. (2022)A.Ramesh, P.Dhariwal, A.Nichol, C.Chu, and M.Chen.Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3,2022.
  • Rombach et al. (2022) R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Saharia etal. (2022a)C.Saharia, W.Chan, S.Saxena, L.Li, J.Whang, E.Denton, S.K.S.Ghasemipour, R.Gontijo-Lopes, B.K. Ayan, T.Salimans, J.Ho, D.J. Fleet,and M.Norouzi.Photorealistic text-to-image diffusion models with deep languageunderstanding.In A.H. Oh, A.Agarwal, D.Belgrave, and K.Cho, editors,Advances in Neural Information Processing Systems, 2022a.
  • Saharia etal. (2022b)C.Saharia, W.Chan, S.Saxena, L.Li, J.Whang, E.L. Denton, K.Ghasemipour,R.GontijoLopes, B.KaragolAyan, T.Salimans, J.Ho, D.J. Fleet, andM.Norouzi.Photorealistic text-to-image diffusion models with deep languageunderstanding.In S.Koyejo, S.Mohamed, A.Agarwal, D.Belgrave, K.Cho, and A.Oh,editors, Advances in Neural Information Processing Systems, volume35,pages 36479–36494. Curran Associates, Inc., 2022b.
  • Serra etal. (2023)A.Serra, F.Carrara, M.Tesconi, and F.Falchi.The emotions of the crowd: Learning image sentiment from tweets viacross-modal distillation.arXiv preprint arXiv:2304.14942, 2023.
  • Song etal. (2024)B.Song, S.M. Kwon, Z.Zhang, X.Hu, Q.Qu, and L.Shen.Solving inverse problems with latent diffusion models via hard dataconsistency.In The Twelfth International Conference on LearningRepresentations, 2024.
  • Song etal. (2023)J.Song, A.Vahdat, M.Mardani, and J.Kautz.Pseudoinverse-guided diffusion models for inverse problems.In International Conference on Learning Representations, 2023.
  • Stein etal. (2023)G.Stein, J.Cresswell, R.Hosseinzadeh, Y.Sui, B.Ross, V.Villecroze,Z.Liu, A.L. Caterini, E.Taylor, and G.Loaiza-Ganem.Exposing flaws of generative model evaluation metrics and theirunfair treatment of diffusion models.In Advances in Neural Information Processing Systems,volume36, 2023.
  • Tan etal. (2023)V.Tan, J.Nam, J.Nam, and J.Noh.Motion to dance music generation using latent diffusion model.In SIGGRAPH Asia 2023 Technical Communications, SA ’23, NewYork, NY, USA, 2023. Association for Computing Machinery.ISBN 9798400703140.10.1145/3610543.3626164.
  • Tang etal. (2023)L.Tang, M.Jia, Q.Wang, C.P. Phoo, and B.Hariharan.Emergent correspondence from image diffusion.In Thirty-seventh Conference on Neural Information ProcessingSystems, 2023.
  • Tewari etal. (2023)A.Tewari, T.Yin, G.Cazenavette, S.Rezchikov, J.B. Tenenbaum, F.Durand,W.T. Freeman, and V.Sitzmann.Diffusion with forward models: Solving stochastic inverse problemswithout direct supervision.In Thirty-seventh Conference on Neural Information ProcessingSystems, 2023.
  • Vasconcelos etal. (2022)C.Vasconcelos, V.N. Birodkar, and V.Dumoulin.Proper reuse of image classification features improves objectdetection.2022.
  • Wiles etal. (2024)O.Wiles, C.Zhang, I.Albuquerque, I.Kajic, S.Wang, E.Bugliarello, Y.Onoe,C.Knutsen, C.Rashtchian, J.Pont-Tuset, and A.Nematzadeh.Revisiting text-to-image evaluation with gecko: On metrics, prompts,and human ratings.Under review (ECCV), 2024.
  • Xu etal. (2024)J.Xu, X.Liu, Y.Wu, Y.Tong, Q.Li, M.Ding, J.Tang, and Y.Dong.Imagereward: Learning and evaluating human preferences fortext-to-image generation.Advances in Neural Information Processing Systems, 36, 2024.
  • Xue etal. (2022a)L.Xue, A.Barua, N.Constant, R.Al-Rfou, S.Narang, M.Kale, A.Roberts, andC.Raffel.ByT5: Towards a Token-Free Future with Pre-trained Byte-to-ByteModels.Transactions of the Association for Computational Linguistics,10:291–306, 03 2022a.
  • Xue etal. (2022b)L.Xue, A.Barua, N.Constant, R.Al-Rfou, S.Narang, M.Kale, A.Roberts, andC.Raffel.Byt5: Towards a token-free future with pre-trained byte-to-bytemodels.Transactions of the Association for Computational Linguistics,10:291–306, 2022b.
  • Yu etal. (2022)J.Yu, Y.Xu, J.Y. Koh, T.Luong, G.Baid, Z.Wang, V.Vasudevan, A.Ku,Y.Yang, B.K. Ayan, B.Hutchinson, W.Han, Z.Parekh, X.Li, H.Zhang,J.Baldridge, and Y.Wu.Scaling Autoregressive Models for Content-Rich Text-to-ImageGeneration, 2022.
  • Yu etal. (2023)Q.Yu, J.He, X.Deng, X.Shen, and L.Chen.Convolutions die hard: Open-vocabulary segmentation with singlefrozen convolutional CLIP.In A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, andS.Levine, editors, Advances in Neural Information Processing Systems36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • Zhan etal. (2023)G.Zhan, C.Zheng, W.Xie, and A.Zisserman.What does stable diffusion know about the 3d scene?, 2023.

Teaser image prompts

[Table 9: positions of prompts 1–22 within the Figure 13 teaser grid.]

Next, we list the prompts used for generating the images in Figure 13 with Vermeer. Their corresponding locations are shown in Table 9.

  1. the word ’START’ written in chalk on a sidewalk
  2. a basketball to the left of two soccer balls on a gravel driveway
  3. An Egyptian tablet shows an automobile.
  4. Macro photography of rose, centered, mini, dark tones, drops of water, cannon
  5. photo of a woman’s face floating in the water with her eyes closed, you can only see top part of her face above water, reflections, abstract conceptual, realistic reflection, pale sky, scientific photo, high quality fantasy stock photo
  6. cyberpunk starship troopers cinematic 4d
  7. 3-d Letter "O" made from orange fruit, studio shot, pastel orange background, centered
  8. 3-d Letter "W" made from transparent water, studio shot, pastel light blue background, centered
  9. 3-d Letter "T" made from tiger fur, studio shot, pastel orange background, centered.
  10. Many people carry sacks along a trail through a bright field with long grass and flowers and muted tones. Two small cottages. Dark row of trees. Green hills, blue sky, clouds. Pastoral landscape. Ein plein air. Vibrant, saturation, free brush strokes. Impressionism. Oil on canvas by Auguste Renoir.
  11. a photograph of a blue porsche 356 coming around a bend in the road
  12. photography of a cat sitting at a sushi restaurant, wearing a blue coat and taking sushi from the boat. Neon bright light, high contrast, low vibrance
  13. turtle with German Shepherd dog’s head growing from it, DSLR
  14. A futuristic street train a rainy street at night in an old European city. Painting by David Friedrich, Claude Monet and John Tenniel.
  15. building behind train
  16. Realistic photograph of a cute otter zebra mouse in a field at sunset, tall grass, macro 35mm film
  17. A 1920’s race car with number 7 parked near a fountain in a modern city. Painting by David Friedrich, Claude Monet and John Tenniel.
  18. The clock on the bricked building is green. The numbers are in roman numerals. The details have gold accents. The bricked building has a window beside the clock.
  19. duck with rabbit’s head growing from it, DSLR
  20. cauliflower with sheep’s head growing from it, DSLR
  21. Silver 1963 Ferrari 250 GTO in profile racing along a beach front road. Bokeh, high-quality 4k photograph.
  22. a photograph of a knight in shining armor holding a basketball

Shallow-UViT: VqVa detailed categories

VqVa results by question type:

Model              | whole | part  | spatial | shape | color | state | type  | count | text rendering | texture | global | material | scale | size  | DSG
Shallow-UViT Base  | 0.567 | 0.412 | 0.333   | 0.417 | 0.550 | 0.409 | 0.402 | 0.523 | 0.487          | 0.450   | 0.400  | 0.272    | 0.500 | 0.318 | 48.08
Shallow-UViT Large | 0.626 | 0.451 | 0.395   | 0.446 | 0.579 | 0.478 | 0.454 | 0.554 | 0.552          | 0.475   | 0.437  | 0.353    | 0.600 | 0.364 | 52.54
Shallow-UViT Huge  | 0.706 | 0.617 | 0.488   | 0.548 | 0.624 | 0.530 | 0.509 | 0.587 | 0.552          | 0.562   | 0.433  | 0.424    | 0.740 | 0.409 | 60.25
Shallow-UViT XHuge | 0.724 | 0.617 | 0.518   | 0.577 | 0.646 | 0.568 | 0.540 | 0.582 | 0.591          | 0.562   | 0.441  | 0.424    | 0.820 | 0.636 | 61.91

VqVa results by question type for end-to-end models (training strategy in the second column):

Model       | strategy        | whole | part  | spatial | shape | color | state | type  | count | text rendering | texture | global | material | scale | size  | DSG
UViT-Base   | scratch         | 0.743 | 0.671 | 0.540   | 0.655 | 0.635 | 0.639 | 0.552 | 0.564 | 0.690          | 0.575   | 0.555  | 0.489    | 0.780 | 0.545 | 64.83
UViT-Base   | finetuning      | 0.723 | 0.587 | 0.500   | 0.560 | 0.597 | 0.596 | 0.590 | 0.628 | 0.625          | 0.525   | 0.532  | 0.478    | 0.700 | 0.500 | 62.75
UViT-Base   | frozen          | 0.702 | 0.666 | 0.495   | 0.560 | 0.544 | 0.647 | 0.566 | 0.584 | 0.591          | 0.613   | 0.534  | 0.332    | 0.640 | 0.409 | 61.16
UViT-Base   | freeze-unfreeze | 0.748 | 0.657 | 0.536   | 0.649 | 0.650 | 0.645 | 0.587 | 0.640 | 0.694          | 0.625   | 0.569  | 0.418    | 0.620 | 0.455 | 66.13
UViT-Large  | scratch         | 0.750 | 0.642 | 0.521   | 0.631 | 0.642 | 0.640 | 0.618 | 0.643 | 0.547          | 0.562   | 0.580  | 0.511    | 0.640 | 0.591 | 66.02
UViT-Large  | finetuning      | 0.761 | 0.688 | 0.542   | 0.601 | 0.681 | 0.678 | 0.604 | 0.666 | 0.638          | 0.650   | 0.579  | 0.473    | 0.780 | 0.636 | 67.39
UViT-Large  | frozen          | 0.800 | 0.730 | 0.616   | 0.643 | 0.738 | 0.684 | 0.618 | 0.648 | 0.728          | 0.700   | 0.614  | 0.418    | 0.760 | 0.500 | 72.13
UViT-Large  | freeze-unfreeze | 0.761 | 0.668 | 0.556   | 0.571 | 0.646 | 0.655 | 0.665 | 0.717 | 0.616          | 0.625   | 0.588  | 0.473    | 0.840 | 0.545 | 67.79
UViT-Huge   | scratch         | 0.758 | 0.662 | 0.551   | 0.595 | 0.634 | 0.645 | 0.627 | 0.607 | 0.698          | 0.600   | 0.586  | 0.505    | 0.840 | 0.591 | 66.90
UViT-Huge   | finetuning      | 0.773 | 0.775 | 0.565   | 0.548 | 0.684 | 0.716 | 0.705 | 0.648 | 0.659          | 0.650   | 0.626  | 0.484    | 0.640 | 0.591 | 69.67
UViT-Huge   | frozen          | 0.827 | 0.814 | 0.648   | 0.649 | 0.749 | 0.722 | 0.685 | 0.635 | 0.797          | 0.688   | 0.619  | 0.500    | 0.820 | 0.500 | 75.15
UViT-Huge   | freeze-unfreeze | 0.798 | 0.748 | 0.582   | 0.583 | 0.675 | 0.684 | 0.653 | 0.666 | 0.711          | 0.575   | 0.609  | 0.467    | 0.780 | 0.500 | 71.50
UViT-XHuge  | freeze          | 0.840 | 0.821 | 0.668   | 0.637 | 0.744 | 0.720 | 0.720 | 0.671 | 0.668          | 0.688   | 0.629  | 0.473    | 0.860 | 0.455 | 75.70
UViT-XHuge  | freeze-unfreeze | 0.817 | 0.783 | 0.607   | 0.631 | 0.722 | 0.705 | 0.679 | 0.681 | 0.681          | 0.675   | 0.602  | 0.533    | 0.880 | 0.727 | 73.53
SD2.1       | –               | 0.760 | 0.730 | 0.530   | 0.679 | 0.707 | 0.729 | 0.665 | 0.571 | 0.655          | 0.637   | 0.685  | 0.495    | 0.780 | 0.455 | 71.23

VqVa results by question type for the models compared in Table 8:

Model             | whole | part  | spatial | shape | color | state | type  | count | text rendering | texture | global | material | scale | size  | DSG
Muse              | 0.780 | 0.761 | 0.605   | 0.714 | 0.814 | 0.766 | 0.668 | 0.610 | 0.651          | 0.838   | 0.672  | 0.647    | 0.780 | 0.773 | 73.09
SD2.1             | 0.760 | 0.730 | 0.530   | 0.679 | 0.707 | 0.729 | 0.665 | 0.571 | 0.655          | 0.637   | 0.685  | 0.495    | 0.780 | 0.455 | 71.23
Imagen Cascade    | 0.799 | 0.806 | 0.626   | 0.714 | 0.806 | 0.772 | 0.723 | 0.673 | 0.750          | 0.738   | 0.693  | 0.641    | 0.820 | 0.636 | 75.93
Vermeer raw model | 0.884 | 0.787 | 0.765   | 0.690 | 0.810 | 0.737 | 0.798 | 0.689 | 0.789          | 0.787   | 0.685  | 0.701    | 0.940 | 0.773 | 80.77
+ prompt eng.     | 0.892 | 0.810 | 0.751   | 0.750 | 0.840 | 0.732 | 0.809 | 0.679 | 0.784          | 0.825   | 0.674  | 0.625    | 0.880 | 0.591 | 80.99
+ style tuning    | 0.889 | 0.833 | 0.744   | 0.696 | 0.836 | 0.747 | 0.818 | 0.714 | 0.716          | 0.838   | 0.696  | 0.707    | 0.840 | 0.591 | 81.16

This appendix complements the broad-category results presented in the main text by providing the corresponding fine-grained results.

On validating the representation quality improvements from scale by counting

[Figure 15: Counting accuracy broken down by model size and by the ground-truth number in the prompt.]

Given the importance of counting and other basic numerical skills in biological intelligence (Nieder and Dehaene, 2009), we expect competitive T2I models to show similar behaviour when evaluated on such skills. Counting requires the manipulation of abstract concepts (numbers), and evaluating this ability provides an objective measure of a well-defined skill. As such, model performance on the counting task is easier to evaluate and interpret than image characteristics such as aesthetics, which may depend on an individual's preferences.

To evaluate the models' ability to correctly generate an image with an exact number of objects, we use 59 prompts from the att/count category of the Gecko benchmark (Wiles et al., 2024). The Gecko benchmark aims to comprehensively and systematically probe T2I model alignment along different skills, such as numerical and spatial reasoning, text rendering, and the depiction of colors and shapes, among others.

Specifically, our analyses include 48 simple modifier prompts and 11 additive prompts with numbers between 1 and 5. Simple modifier prompts have the form "num noun" (e.g., "1 cat"), where num is a number represented either by a single digit (i.e., 1, 2, 3) or by a numeral (i.e., "one", "two", or "three"), and the noun comes from common natural semantic categories such as foods, animals, and everyday objects. Additive prompts are compositions of individual simple modifier prompts, combining two nouns and two numbers, such as "1 cat and 3 dogs". By using such systematically curated prompts, we implicitly test whether models can count, as correctly generating a given number of objects requires keeping track of the objects already generated.
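A minimal sketch of this prompt construction, using a hypothetical noun pool rather than the actual Gecko vocabulary, might look as follows.

```python
import itertools
import random

# Hypothetical noun pool; the Gecko att/count prompts use their own vocabulary.
NOUNS = ["cat", "dog", "apple", "cup", "bird"]
NUMERALS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

def simple_modifier_prompts(use_digits: bool = True):
    """Prompts of the form "num noun", e.g. "1 cat" or "three apples"."""
    for n, noun in itertools.product(NUMERALS, NOUNS):
        num = str(n) if use_digits else NUMERALS[n]
        yield f"{num} {noun}{'s' if n > 1 else ''}"

def additive_prompt(rng: random.Random):
    """Composition of two simple modifiers, e.g. "1 cat and 3 dogs"."""
    n1, n2 = rng.sample(sorted(NUMERALS), 2)
    a, b = rng.sample(NOUNS, 2)
    return (f"{n1} {a}{'s' if n1 > 1 else ''} and "
            f"{n2} {b}{'s' if n2 > 1 else ''}")
```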

To evaluate the correctness of T2I generation of numbers, we recruit human raters through a crowd-sourcing platform to provide the count of objects in every generated image. The study design, including remuneration for the work, was reviewed and approved by our institution's independent ethical review committee. We collect 5 annotations per generated image by asking "How many X are there in the image?", where X is the object mentioned in the original prompt used to generate that image. We generate three images for each prompt and each model using different seeds.

Figure 15 shows the breakdown of accuracy per model size as well as per ground-truth number, where the ground-truth number is the number in the original prompt used to generate the image. Accuracy is the fraction of annotations that match the ground-truth label for a question and a given model. We observe that all models (with the exception of Base) perform comparably well when generating images with a single object, but performance deteriorates for higher numbers, and only XHuge generates the number 3 correctly above chance level. While exact number generation appears to improve with scale, it is unclear whether this pattern saturates for higher numbers.
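The accuracy computation itself is straightforward; the sketch below assumes the annotations are available as (model, ground-truth number, annotated count) triples.

```python
from collections import defaultdict

def counting_accuracy(annotations):
    """annotations: iterable of (model, ground_truth, annotated_count) triples,
    one per human rating (5 raters x 3 seeds per prompt in our setup).
    Returns accuracy[(model, ground_truth)], the fraction of annotations that
    match the number in the prompt, mirroring the breakdown in Figure 15."""
    hits, totals = defaultdict(int), defaultdict(int)
    for model, gt, counted in annotations:
        key = (model, gt)
        totals[key] += 1
        hits[key] += int(counted == gt)
    return {key: hits[key] / totals[key] for key in totals}
```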

Qualitative comparison of finetuning and frozen e2e models

[Figure 18: Side-by-side samples for the 50 animal prompts listed below, comparing finetuned vs. frozen core components (UViT-Huge, 50k training steps).]

Our qualitative comparison between finetuned and frozen core components is based on 50 prompts covering different animal species, chosen to cover a diverse set of shapes, textures, and structures. Figure 18 presents a side-by-side comparison at 50k training steps using the UViT-Huge model. Structural elements such as legs, wings, and trunks are better formed when freezing the pretrained core components' representation. The images were produced with the following list of prompts.

  1. "A majestic lion with a flowing mane, basking in the golden African sunset."
  2. "A playful dolphin leaping out of the water, glistening with droplets."
  3. "A wise old owl perched on a moonlit branch, gazing with piercing yellow eyes."
  4. "A colorful macaw soaring through a lush, vibrant rainforest."
  5. "A mischievous raccoon rummaging through a trash can in a suburban backyard."
  6. "A close-up portrait of a fluffy panda munching on bamboo."
  7. "A graceful hummingbird hovering near a bright pink flower."
  8. "A herd of elephants silhouetted against a fiery orange sky."
  9. "A group of meerkats standing alert in the desert, looking out for danger."
  10. "A photorealistic image of a chameleon blending seamlessly with its surroundings."
  11. "A Van Gogh-inspired painting of sunflowers with butterflies flitting around them."
  12. "A pixel art rendition of a pixelated cat chasing a pixelated mouse."
  13. "A watercolor painting of a majestic tiger stalking through a bamboo forest."
  14. "A surreal landscape with a melting elephant in the style of Salvador Dalí."
  15. "A vibrant pop art image of a zebra with bold stripes and contrasting colors."
  16. "A cubist artwork depicting a fragmented and reassembled bear."
  17. "A pointillist painting of a turtle, created with tiny dots of color."
  18. "A minimalist line drawing of a graceful swan."
  19. "A whimsical cartoon illustration of a group of singing frogs in a pond."
  20. "A dark and gothic illustration of a raven perched on a skull."
  21. "A penguin riding a surfboard on a giant tropical wave."
  22. "A giraffe wearing a top hat and monocle, enjoying a cup of tea in a fancy cafe."
  23. "A zebra crossing a busy city street at a crosswalk."
  24. "A cat wearing a space suit, exploring the surface of Mars."
  25. "A monkey DJ mixing beats at a neon-lit dance club."
  26. "An octopus painting a self-portrait with its many arms."
  27. "A sloth running a marathon, surprisingly outrunning all competitors."
  28. "A polar bear relaxing in a hot tub in the middle of the Arctic."
  29. "A group of rabbits building a snowman in a winter wonderland."
  30. "A dog astronaut floating in space, gazing at the Earth."
  31. "A grumpy bulldog wearing a birthday hat and refusing to smile."
  32. "A joyful rabbit hopping through a field of wildflowers."
  33. "A curious chimpanzee looking intently through a magnifying glass."
  34. "A proud peacock displaying its magnificent tail feathers."
  35. "A loving mother kangaroo carrying her joey in her pouch."
  36. "A mischievous squirrel hiding nuts in a tree trunk."
  37. "A sleepy koala clinging to a tree branch, taking a nap."
  38. "A determined sea turtle swimming against the ocean current."
  39. "A playful wolf pup chasing its own tail."
  40. "A group of penguins waddling together in a comical huddle."
  41. "A chameleon painted with the vibrant colors of a bustling city skyline." (Imagine a chameleon camouflaged with neon signs and skyscraper patterns.)
  42. "A flock of birds forming the shape of a musical note in flight." (Visualize a dynamic dance of birds creating a melody in the sky.)
  43. "A fishbowl on the moon, with an astronaut goldfish gazing at Earth." (A whimsical and thought-provoking perspective shift.)
  44. "A microscopic landscape teeming with life, where insects are giants and blades of grass are towering trees."
  45. "A cat wearing a crown and royal robe, sitting regally on a throne made of yarn balls." (A playful portrait with a touch of humor.)
  46. "A photorealistic image of extinct animals roaming in a modern city landscape." (Blend the past and present for a surreal scene.)
  47. "An underwater ballet performed by graceful sea creatures." (Capture the beauty and movement of marine life in an artistic way.)
  48. "A hedgehog painted as a starry night sky, with its spines representing twinkling stars." (A dreamy fusion of nature and the cosmos.)
  49. "Animals playing musical instruments together in a harmonious orchestra." (Imagine the symphony created by a unique animal band.)
  50. "A close-up portrait of a butterfly, revealing the intricate patterns and textures on its wings in exquisite detail." (Appreciate the delicate beauty of nature.)

Vermeer distillation: qualitative results

Figure 20 presents additional qualitative results produced using the Vermeer model and its distilled version.


The images were produced with the following list of prompts.

  1. "Ruined circular stone tower on a cliff next to the ocean. Shepherd and sheep on green hillock. Sunrise, big puffy clouds. Naturalistic landscape. Romanticism. Hudson River School. Oil on canvas by Thomas Cole."
  2. "Photo of a cute raccoon lizard at sunset, 35mm"
  3. "Wallpaper of minimal origami corgi made of multi colored paper, abstract, clean, minimalist, 4K, 8K, soft colors, high definition."
  4. "A cat lying a top on the desk on a laptop."
  5. "A green stop sign on a pole."
  6. "A grey motorcycle on dirt road next to a building."
  7. "’Fall is here’ written in autumn leaves floating on a lake."
  8. "A cake topped with whole bulbs of garlic"
  9. "A red plate topped with broccoli, meat and veggies."
  10. "A photorealistic image of a chameleon blending seamlessly with its surroundings."
  11. "A cat wearing a cowboy hat and sunglasses and standing in front of a rusty old white spaceship at sunrise. Pixar cute. Detailed anime illustration."
  12. "A pizza with cherry toppings"
