Running Stable Diffusion without GPU
Learn more about Stable Diffusion, CPU vs. GPU, and three ways of running diffusion models on a CPU machine.
Stable Diffusion is a generative model, best known for image synthesis; the same diffusion approach also powers audio generation. It is based on the diffusion process and can model complex, high-dimensional distributions. During training, noise is gradually added to the data and the network learns to reverse that corruption; at inference time, the model starts from pure random noise and iteratively denoises it, step by step, until a new sample emerges. Stable Diffusion has shown impressive results in image generation tasks and is a popular model in the machine learning research community.
Neural networks that use diffusion models heavily rely on matrix and vector operations during both training and inference. This is where modern graphical processing units, or GPUs, demonstrate their capabilities.
CPUs and GPUs differ in their architectures and purposes. CPUs are general-purpose processors that are designed to handle a wide range of tasks, including running operating systems, running applications, and handling input/output operations. They typically have a few cores, each capable of executing multiple threads in parallel.
GPUs, on the other hand, are specialized processors designed for graphics rendering and parallel processing. They have many more cores than CPUs, each optimized for executing a single instruction on multiple data points in parallel. This makes them very efficient at performing matrix and vector operations, which are common in neural network training and inference.
However, recent advancements in CPUs have made them more capable of performing operations with vectors and matrices. For example, Intel's Advanced Vector Extensions (AVX) provides a set of instructions that allow CPUs to perform multiple arithmetic operations on vectors and matrices simultaneously. However, GPUs still have an advantage over CPUs in terms of parallel processing power and are often used in deep learning applications.
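As an aside, a quick way to see which of these vector extensions a Linux machine advertises is to inspect the flags line of /proc/cpuinfo. The helper below is a small illustrative sketch, not something from the original setup:

```python
def vector_extensions(cpuinfo_text):
    """Return the SIMD-related flags advertised in a /proc/cpuinfo dump."""
    simd_prefixes = ("sse", "avx", "amx")
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            return sorted(f for f in flags if f.startswith(simd_prefixes))
    return []

# Example with a shortened flags line from an AVX-512-capable machine:
sample = "flags\t\t: fpu sse sse2 avx avx2 avx512f avx512_bf16"
print(vector_extensions(sample))
# ['avx', 'avx2', 'avx512_bf16', 'avx512f', 'sse', 'sse2']
```

On a real machine you would pass `open("/proc/cpuinfo").read()` instead of the sample string; seeing `avx512_bf16` (or `amx_bf16`) is a good sign for the IPEX path discussed below.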
At Realm, we use CPU machines to serve customer requests for art generation. Executing the request queue is slower than on GPU machines, but our product delivers its goodie bags asynchronously, which lets us benefit from a simpler CPU setup. Crucially, the slower speed remains acceptable, as we demonstrate later in this article.
Here are three ways of running diffusion models on a CPU machine:

- the baseline method: running the original PyTorch pipeline directly on the CPU;
- OpenVINO conversion: exporting the model to OpenVINO's intermediate representation and running it with the OpenVINO runtime;
- Intel PyTorch Extension (IPEX): optimizing the PyTorch model for execution on Intel CPUs.
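As an illustrative sketch of the baseline method, the snippet below runs the stock PyTorch pipeline on the CPU via Hugging Face diffusers. The model id, prompt, and step count are assumptions for the example, not details from our production setup:

```python
def generate_baseline_cpu(prompt, model_id="runwayml/stable-diffusion-v1-5"):
    """Baseline: run the original PyTorch pipeline directly on the CPU.

    Sketch using Hugging Face `diffusers`; the model id is an assumption.
    """
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float32,  # float32 is the safe default on CPUs
    )
    pipe = pipe.to("cpu")
    # Fewer steps trade quality for speed, which matters on CPU.
    return pipe(prompt, num_inference_steps=20).images[0]
```

Nothing here is CPU-specific beyond `.to("cpu")` and sticking with float32; that simplicity is exactly why this variant serves as the baseline.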
Let’s have a look at the execution time comparison of these three approaches:
To get the most out of Intel PyTorch Extension (IPEX), one needs an Intel CPU with AVX-512 and native bfloat16 support, which ships with the recent Sapphire Rapids chips. These chips are significantly more expensive than the previous generation (Ice Lake) when purchasing hardware resources on demand.
For example, on Google Cloud, a c3-standard-22 (22 vCPUs, 88 GB memory) instance with Sapphire Rapids costs around $1.14884 per hour, while a c2-standard-16 (16 vCPUs, 64 GB memory) instance with Ice Lake costs around $0.8352 per hour. However, it is worth noting that the performance improvements with Sapphire Rapids can be significant, especially when using IPEX. If cost is a concern, it may be more economical to use the baseline method or OpenVINO conversion, depending on the specific use case and budget constraints.
OpenVINO is a great tool for optimizing deep learning models on the CPU, especially on older hardware. One of the advantages of OpenVINO is the ability to use both static and dynamic model shapes. The static shape means that the dimensions of the model's input and output tensors are fixed at compile time, while the dynamic shape means that the dimensions can vary at runtime.
In general, static shape models can be faster and more efficient than dynamic shape models. However, static shape models can also be more memory-intensive. For example, when using OpenVINO with a static shape model for image generation, the memory usage can be up to four times higher than with a dynamic shape model that outputs images one at a time. It's important to consider the trade-offs between speed, efficiency, and memory usage when choosing between static and dynamic shape models in OpenVINO.
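To show how the static/dynamic choice looks in practice, here is a sketch using the optimum-intel wrapper, whose `reshape()` call fixes the pipeline's input dimensions before compilation. The model id and image size are assumptions:

```python
def build_openvino_pipeline(model_id="runwayml/stable-diffusion-v1-5",
                            static=True):
    """Export a Stable Diffusion checkpoint to OpenVINO and optionally
    pin the input shapes before compiling.

    Sketch using `optimum-intel`; model id and sizes are assumptions.
    """
    from optimum.intel import OVStableDiffusionPipeline

    pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
    if static:
        # Static shapes: dimensions fixed ahead of time. Usually faster,
        # but the compiled graphs can use noticeably more memory.
        pipe.reshape(batch_size=1, height=512, width=512,
                     num_images_per_prompt=1)
    pipe.compile()
    return pipe
```

With `static=False` the pipeline keeps dynamic shapes and accepts varying resolutions at runtime, at some cost in throughput.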
In recent years, float16 (also known as half-precision) has become a popular format for representing numerical values in deep-learning models. Float16 uses half the number of bits as float32 (single precision), which makes it more memory-efficient and faster to compute on GPUs.
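The memory saving and the precision loss are both easy to see with Python's struct module, which supports the IEEE 754 half-precision format (`"e"`):

```python
import struct

# A half-precision value packs into 2 bytes, single precision into 4.
print(len(struct.pack("e", 3.14)), len(struct.pack("f", 3.14)))  # 2 4

# The price is precision: float16 keeps only about 3 decimal digits.
third_f16 = struct.unpack("e", struct.pack("e", 1 / 3))[0]
third_f32 = struct.unpack("f", struct.pack("f", 1 / 3))[0]
print(third_f16)  # 0.333251953125
print(abs(third_f32 - 1 / 3) < abs(third_f16 - 1 / 3))  # True: float32 is closer
```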
Although float16 has a lower precision than float32, recent research has shown that it can be used without significant loss of accuracy in many deep-learning tasks. In fact, some studies have even shown that using float16 can improve the generalization of the model by acting as a regularizer and reducing overfitting.
Although float16 can improve the efficiency of deep learning models on GPUs, it is not always efficient on CPUs. There is an alternative, bfloat16, which serves the same purpose but maps better onto CPUs that support it. Bfloat16 also uses 16 bits, but with a different layout: it keeps float32's eight exponent bits, and therefore its dynamic range, and shortens the mantissa instead. This format is supported by recent Intel instruction set extensions such as AVX-512 BF16 and AMX, which Intel PyTorch Extension (IPEX) relies on to optimize PyTorch models for CPU execution.
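As a hedged sketch, IPEX optimization of a pipeline's UNet for bfloat16 typically looks like the following. The function name is hypothetical, and it assumes `intel_extension_for_pytorch` is installed on a CPU with native bfloat16 support:

```python
def optimize_unet_with_ipex(pipe):
    """Optimize the UNet of a diffusers-style pipeline for bfloat16
    CPU inference with IPEX.

    Sketch; assumes an IPEX install and a bfloat16-capable CPU
    (e.g. Sapphire Rapids).
    """
    import torch
    import intel_extension_for_pytorch as ipex

    pipe.unet.eval()
    pipe.unet = ipex.optimize(pipe.unet, dtype=torch.bfloat16, inplace=True)
    return pipe

# At inference time, wrap calls in CPU autocast so ops run in bfloat16:
# with torch.cpu.amp.autocast(dtype=torch.bfloat16):
#     image = pipe(prompt).images[0]
```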
An interesting issue arises when you mix GPU and CPU jobs for the same neural network: maintaining consistency with the same random seed.

If you use different GPUs with the same initial random seed, the results will not be bit-identical. However, as long as both GPUs produce the same pseudo-random number sequence, the difference between the two images in terms of MSE is typically very small.
A different story unfolds when we exclude the GPU from the setup and still want to generate consistent images across runs. One can achieve consistency in two ways: by seeding the CPU random generator identically before every run, or by generating the initial latent noise once and reusing it in subsequent runs.
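One common way to get consistent CPU generations is to seed a torch.Generator and hand it to the pipeline, sketched below for a diffusers-style pipeline (the function name and step count are assumptions):

```python
def generate_reproducibly(pipe, prompt, seed=42):
    """Generate the same image across runs on the CPU by seeding a
    torch.Generator and passing it to the pipeline.

    Sketch; assumes a diffusers-style pipeline object.
    """
    import torch

    # Every run that uses the same seed draws the same initial noise,
    # so the denoising trajectory, and the image, are reproduced.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    return pipe(prompt, generator=generator,
                num_inference_steps=20).images[0]
```

The alternative is to sample the initial latent tensor once, save it, and pass it to the pipeline via its `latents` argument on every run; both approaches pin down the only source of randomness in the process.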
In this article, we explored different ways of running the Stable Diffusion model on CPUs, including the baseline method, OpenVINO conversion, and Intel PyTorch Extension (IPEX). We also discussed the differences between CPUs and GPUs, the trade-offs between static and dynamic model shapes in OpenVINO, and the advantages of using float16 and bfloat16 over float32 in deep learning models.
While GPUs are often more efficient for running deep learning models, CPUs can still be a viable option, especially for smaller-scale projects or when the cost is a concern. With recent advancements in CPU architecture and optimization tools like OpenVINO and IPEX, CPUs can offer a reasonable alternative to GPUs for running deep learning models like Stable Diffusion.
At Realm, we utilize CPU machines for deep learning art generation, leveraging our unique delivery system to make up for any potential performance drawbacks. We hope that this article has provided some insights into running the Stable Diffusion model on CPUs and the factors to consider when choosing between different methods and hardware configurations.