Why isn't the 4090 used for training large models?

Why is the A100 used for training large models instead of the 4090? This is a good question. NADDOD believes that using the 4090 for training large models is not feasible, but using it for inference/serving is not only feasible but also cost-effective compared to the A100.

In fact, the biggest differences between the H100/A100 and the 4090 lie in communication and memory. In raw compute, the 4090 is roughly on par with the A100, although the H100 is well ahead of both.

Specification	H100	A100	4090
Tensor FP16 Performance	1979 TFLOPS	312 TFLOPS	330 TFLOPS
Tensor FP32 Performance	989 TFLOPS	156 TFLOPS	83 TFLOPS
Memory Capacity	80 GB	80 GB	24 GB
Memory Bandwidth	3.35 TB/s	2 TB/s	1 TB/s
Communication Bandwidth	900 GB/s	900 GB/s	64 GB/s
Communication Latency	~1 µs	~1 µs	~10 µs
Price	$30,000 to $40,000	$15,000	$1,600
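
To put the memory capacity row in perspective, here is a minimal sketch of how many parameters fit on each card if only FP16 weights (2 bytes each) are stored. The function name is made up for this illustration and GB is taken as 10^9 bytes. Training also needs room for gradients, optimizer states, and activations, so the practical limit is considerably lower than this upper bound.

```python
# Memory capacities copied from the table above (GB, taken here as 1e9 bytes).
MEMORY_GB = {"H100": 80, "A100": 80, "4090": 24}

BYTES_PER_FP16_PARAM = 2  # one FP16 value occupies 2 bytes


def max_fp16_params_billion(memory_gb):
    """Upper bound, in billions, on the parameters whose FP16 weights alone fit."""
    return memory_gb * 1e9 / BYTES_PER_FP16_PARAM / 1e9


for gpu, memory_gb in MEMORY_GB.items():
    limit = max_fp16_params_billion(memory_gb)
    print(f"{gpu}: at most {limit:.0f}B FP16 parameters (weights only)")
```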

GPU Training Performance and Cost Comparison

NADDOD has been dedicated to innovative computing and networking solutions. After some research, its team found a useful single-machine GPU training performance and cost comparison published by LambdaLabs. Excerpts follow:

First, let's look at throughput. Nothing here looks surprising at first glance. When the model fits on a single card, the H100 has the highest throughput, about twice that of the 4090. The specifications point the same way: the H100's FP16 compute is approximately six times that of the 4090, and its memory bandwidth is 3.35 times as high. With a relatively large batch size, most training operators are compute-bound (computation-intensive) and only a few are memory-bound, so it is expected that the H100 comes out ahead.
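
The two ratios quoted above can be checked directly against the specification table. This is a small sketch using only the table's FP16 and memory bandwidth figures; the dictionary keys and the helper name are chosen here for illustration.

```python
# FP16 tensor throughput (TFLOPS) and memory bandwidth (TB/s) from the table.
SPECS = {
    "H100": {"fp16_tflops": 1979, "memory_bandwidth_tb_s": 3.35},
    "4090": {"fp16_tflops": 330, "memory_bandwidth_tb_s": 1.0},
}


def h100_to_4090_ratio(metric):
    """How many times larger the H100 figure is than the 4090 figure."""
    return SPECS["H100"][metric] / SPECS["4090"][metric]


for metric in ("fp16_tflops", "memory_bandwidth_tb_s"):
    print(f"{metric}: H100 is {h100_to_4090_ratio(metric):.2f}x the 4090")
```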

LambdaLabs PyTorch Single-GPU Training Throughput Comparison Chart

LambdaLabs PyTorch Single-GPU Training Throughput Comparison Table

Next, let's consider cost-effectiveness. The H100, which topped the throughput ranking, now sits almost at the bottom, and the 4090 delivers nearly ten times as much throughput per dollar as the H100. This is because the H100 is much more expensive than the 4090.
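
The roughly tenfold gap follows from two figures already given in this article: the H100's single-card throughput is about twice that of the 4090, and the prices come from the specification table. The sketch below is a back-of-the-envelope check under those assumptions, not a reproduction of the LambdaLabs measurements; the function name is hypothetical.

```python
# Relative single-card training throughput, using "twice that of the 4090".
RELATIVE_THROUGHPUT = {"H100": 2.0, "4090": 1.0}

PRICE_4090_USD = 1_600
H100_PRICE_RANGE_USD = (30_000, 40_000)  # low and high end from the table


def advantage_of_4090(h100_price_usd):
    """Throughput per dollar of the 4090 divided by that of the H100."""
    per_dollar_4090 = RELATIVE_THROUGHPUT["4090"] / PRICE_4090_USD
    per_dollar_h100 = RELATIVE_THROUGHPUT["H100"] / h100_price_usd
    return per_dollar_4090 / per_dollar_h100


for price in H100_PRICE_RANGE_USD:
    print(f"H100 at ${price:,}: 4090 is {advantage_of_4090(price):.1f}x more cost-effective")
```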

LambdaLabs PyTorch Single-GPU Training Throughput per Unit Cost Comparison Chart

LambdaLabs PyTorch Single-GPU Training Throughput per Unit Cost Comparison Table

Compute Power Requirements for Training Large Models

Since the 4090 single-GPU training offers such high cost-effectiveness, why can't it be used for training large models? NADDOD's professional technical team, with extensive experience in implementing and servicing various application scenarios, believes that aside from the licensing constraints that prohibit gaming GPUs from being used in data centers, the fundamental reason is that large model training requires high-performance communication, which the 4090 lacks.
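
To make the communication gap concrete, here is a rough sketch using the communication bandwidth and latency figures from the table at the top of the article. It assumes a hypothetical 7-billion-parameter model with FP16 gradients (2 bytes each) and a single point-to-point transfer of the full gradient set. Real training uses collective operations such as all-reduce and overlaps communication with computation, so the results are order-of-magnitude only.

```python
# Communication figures copied from the comparison table
# (bandwidth in GB/s, latency in microseconds).
LINKS = {
    "H100": {"bandwidth_gb_s": 900, "latency_us": 1},
    "A100": {"bandwidth_gb_s": 900, "latency_us": 1},
    "4090": {"bandwidth_gb_s": 64, "latency_us": 10},
}

HYPOTHETICAL_PARAMS = 7e9    # assumption for illustration, not from the article
BYTES_PER_FP16_GRADIENT = 2  # one FP16 gradient value per parameter


def transfer_seconds(num_bytes, bandwidth_gb_s, latency_us):
    """Naive time to send num_bytes once: fixed latency plus bytes / bandwidth."""
    return latency_us * 1e-6 + num_bytes / (bandwidth_gb_s * 1e9)


gradient_bytes = HYPOTHETICAL_PARAMS * BYTES_PER_FP16_GRADIENT
for gpu, link in LINKS.items():
    seconds = transfer_seconds(gradient_bytes, **link)
    print(f"{gpu}: {seconds * 1e3:.1f} ms to move one full set of gradients")
```

Under these assumptions the 4090 needs more than ten times as long per transfer, and that cost is paid again on every training step.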

How much compute does training a large model require? Total training compute (FLOPs) = 6 * Model Parameters * Number of Tokens in Training Data.

In essence, 6 * Model Parameters * Number of Tokens in Training Data is the compute needed to process all of the training data once. The factor 6 is the number of multiplications and additions performed per parameter for each token across the forward and backward propagation of the model.
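
As a worked example of the formula, the sketch below plugs in a hypothetical model of 7 billion parameters trained on 1 trillion tokens (both numbers are assumptions chosen for illustration) and converts the result to GPU-days using the peak FP16 figures from the specification table. Real training runs reach only a fraction of peak throughput, so actual times would be longer.

```python
def training_flops(n_params, n_tokens):
    """Total training compute in FLOPs: 6 * parameters * training tokens."""
    return 6 * n_params * n_tokens


HYPOTHETICAL_PARAMS = 7e9   # assumption for illustration
HYPOTHETICAL_TOKENS = 1e12  # assumption for illustration

# Peak tensor FP16 throughput from the table, converted to FLOPs per second.
PEAK_FP16_FLOPS = {"H100": 1979e12, "A100": 312e12, "4090": 330e12}

SECONDS_PER_DAY = 86_400

total = training_flops(HYPOTHETICAL_PARAMS, HYPOTHETICAL_TOKENS)
print(f"Total training compute: {total:.2e} FLOPs")
for gpu, peak in PEAK_FP16_FLOPS.items():
    gpu_days = total / peak / SECONDS_PER_DAY
    print(f"{gpu}: {gpu_days:,.0f} GPU-days at peak FP16 throughput")
```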

A stack of matrix multiplications can be pictured as a complete bipartite graph, with some neurons on the left side and some on the right. Take one left neuron "l" and one right neuron "r."

During Forward Propagation:

"l" multiplies its output by the weight "w" between "l" and "r" and sends it to "r."

"r" is connected to more than one left neuron, so it sums the values it receives from multiple "l" neurons, which requires one addition.

During Backward Propagation:

"r" multiplies the received gradient by the weight "w" between "l" and "r" and sends it back to "l."

"l" is also connected to multiple "r" neurons, so it needs to sum up the gradients through addition.

Don't forget that the weight "w" needs to be updated, which requires calculating the gradient of "w" by multiplying the received gradient by the forward propagation output (activation) of "l."

In a batch, there are usually multiple samples, and updating the weight "w" requires summing up the gradients of these samples.

In total, there are 3 multiplications and 3 additions. Regardless of how complex the Transformer model is, matrix computations are as simple as that. Other vector computations, such as softmax, are not the main factors affecting compute power and can be ignored when estimating.
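
The count above can be reproduced by walking over every (sample, "l", "r") triple of a single fully connected layer and tallying the operations described. This is a sketch for illustration only: the function name is hypothetical, and each term added to a running sum is counted as one addition, which is the usual convention for multiply-accumulate.

```python
def count_linear_layer_ops(batch, n_in, n_out):
    """Tally multiplications and additions for one dense layer, forward plus backward."""
    mults = 0
    adds = 0
    for _sample in range(batch):
        for _l in range(n_in):
            for _r in range(n_out):
                # Forward: l's output times w, accumulated into r.
                mults += 1
                adds += 1
                # Backward, activation gradient: r's gradient times w, accumulated into l.
                mults += 1
                adds += 1
                # Backward, weight gradient: r's gradient times l's activation,
                # accumulated over the samples in the batch.
                mults += 1
                adds += 1
    return mults, adds


batch, n_in, n_out = 4, 3, 5
mults, adds = count_linear_layer_ops(batch, n_in, n_out)
n_params = n_in * n_out
print(f"operations per parameter per sample: {(mults + adds) / (n_params * batch):.0f}")
```

With one sample standing in for one token, the tally comes to 6 operations per weight per token, which is where the factor 6 in the training compute formula comes from.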

There is a proportional relationship between the number of parameters in a model and the number of tokens in the training data. This can be easily understood by imagining the model as a compressed version of the data, and there are always limits to compression ratios. If the number of parameters in the model is too small, it cannot capture all the knowledge present in the training data. On the other hand, if the number of parameters exceeds the number of tokens in the training data, it becomes wasteful and can lead to overfitting.

Training large models involves dealing with massive datasets and complex computational tasks, requiring high bandwidth and low-latency data transmission to support efficient data exchange and parallel computing. Optical modules provide high-speed fiber optic communication capabilities, meeting the demands of large-scale data transmission by rapidly transferring data to GPUs for computation, thereby accelerating the training process and significantly reducing data transmission latency. Therefore, optical modules play a crucial role in large model training, contributing to high-performance deep learning training.

NADDOD continues to provide users with innovative, efficient, and reliable products, solutions, and services, offering optimal switch+AOC/DAC/optical module+smart network card+DPU+GPU integrated solutions for data centers, high-performance computing, edge computing, artificial intelligence, and other application scenarios. These solutions enhance customers' business acceleration capabilities with low cost and outstanding performance.

Conclusion

In conclusion, why can't the 4090 be used for large model training? The main reason lies in its limitations in communication and memory. Although the 4090 is comparable to the A100 in computational power, its memory capacity, memory bandwidth, and communication bandwidth are all significantly lower than those of the H100/A100. These factors are crucial for large model training, which requires high-performance communication to handle a large number of parameters and a large volume of training data. The 4090's relatively slow communication cannot meet those demands.

NADDOD believes that while the 4090 offers good performance and cost-effectiveness for inference (serving) tasks, large model training requires immense compute, and choosing the appropriate hardware improves both training efficiency and cost-effectiveness. For training, therefore, the H100/A100 is the wiser choice.

Additionally, as a leading provider of optical networking equipment, NADDOD can offer optimal switch+optical module+smart NIC+DPU+GPU integrated solutions for large model training. This improves training efficiency, supports large-scale parallel computing, and enables faster and more effective completion of training tasks.