Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But which features are important if you want to buy a new GPU? GPU RAM, cores, Tensor Cores? How do you make a cost-efficient choice? This blog post delves into these questions and offers advice that will help you make a choice that is right for you.
TL;DR
Having a fast GPU is a very important aspect when one begins to learn deep learning, as it allows you to quickly gain the practical experience that is key to building the expertise with which you can apply deep learning to new problems. Without this rapid feedback, it simply takes too much time to learn from one's mistakes, and continuing with deep learning can be discouraging and frustrating. With GPUs, I quickly learned how to apply deep learning to a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition using a deep learning approach, where the task was to predict weather ratings for a given tweet. In the competition, I used a rather large two-layered deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory. The GTX Titan GPUs that powered me in the competition were a main factor in my reaching 2nd place.
Overview
This blog post is structured in the following way. First I discuss how useful it is to have multiple GPUs, then I discuss all relevant hardware options such as NVIDIA and AMD GPUs, Intel Xeon Phis, Google TPUs, and new startup hardware. Then I discuss what GPU specs are good indicators for deep learning performance. The main part discusses a performance and cost-efficiency analysis. I conclude with general and more specific GPU recommendations.
Do Multiple GPUs Make My Training Faster?
When I started using multiple GPUs I was excited about using data parallelism to improve runtime performance for a Kaggle competition. However, I found that it was very difficult to get a straightforward speedup by using multiple GPUs. I was curious about this problem, and thus I started to do research in parallelism in deep learning. I analyzed parallelization in deep learning architectures, developed an 8-bit quantization technique to increase the speedups in GPU clusters from 23x to 50x for a system of 96 GPUs and published my research at ICLR 2016.
The main insight was that convolution and recurrent networks are rather easy to parallelize, especially if you use only one computer or 4 GPUs. However, fully connected networks including transformers are not straightforward to parallelize and need specialized algorithms to perform well.
Modern libraries like TensorFlow and PyTorch are great for parallelizing recurrent and convolutional networks, and for convolution, you can expect a speedup of about 1.9x/2.8x/3.5x for 2/3/4 GPUs. For recurrent networks, the sequence length is the most important parameter, and for common NLP problems one can expect similar or slightly worse speedups compared to convolutional networks. Fully connected networks, including transformers, however, usually have poor performance under data parallelism, and more advanced algorithms are necessary to accelerate these parts of the network. If you run transformers on multiple GPUs, you should also try running them on a single GPU and check which is faster.
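To make the idea of data parallelism concrete, here is a minimal PyTorch sketch; it is only an illustration, not the exact setup behind the numbers above, and the ResNet-50 model, batch size, and learning rate are placeholders you would replace with your own.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Minimal data-parallelism sketch: nn.DataParallel splits each batch across
# all visible GPUs, runs the forward/backward passes in parallel, and gathers
# the outputs on the default GPU. Model, batch size, and lr are placeholders.
model = models.resnet50()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # e.g. 2-4 GPUs
model = model.cuda()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Dummy batch; in practice a larger batch per step helps keep all GPUs busy.
inputs = torch.randn(128, 3, 224, 224).cuda()
targets = torch.randint(0, 1000, (128,)).cuda()

outputs = model(inputs)              # forward pass is split across GPUs
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()                      # gradients end up on the default GPU
optimizer.step()
```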
Using Multiple GPUs Without Parallelism
Another advantage of using multiple GPUs, even if you do not parallelize algorithms, is that you can run multiple algorithms or experiments separately, one on each GPU. Efficient hyperparameter search is the most common use of multiple GPUs. You gain no speedups, but you get information about the performance of different hyperparameter settings or different network architectures more quickly. This is also very useful for novices, as you can rapidly gain insight and experience into how to train an unfamiliar deep learning architecture.
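As a rough sketch of this workflow, one hyperparameter setting can be pinned to each GPU via CUDA_VISIBLE_DEVICES so the runs stay completely independent; the script name "train.py" and the "--lr" flag below are hypothetical placeholders for your own training script.

```python
import os
import subprocess

# Hypothetical launcher: one hyperparameter setting per GPU. Each child process
# only sees a single device through CUDA_VISIBLE_DEVICES, so the experiments
# run independently of each other.
learning_rates = [0.1, 0.01, 0.001, 0.0001]

procs = []
for gpu_id, lr in enumerate(learning_rates):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(["python", "train.py", f"--lr={lr}"], env=env))

for p in procs:
    p.wait()  # wait for all four runs to finish
```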
Using multiple GPUs in this way is usually more useful than running a single network on multiple GPUs via data parallelism. Keep this in mind when you buy multiple GPUs: features that matter for better parallelism, such as the number of PCIe lanes, are not that important.
Additionally, note that a single GPU should be sufficient for almost any task. Thus the range of experiences that you can have with 1 GPU will not differ from when you have 4 GPUs. The only difference is that you can run more experiments in a given time with multiple GPUs.
Your Options: NVIDIA vs AMD vs Intel vs Google vs Amazon vs Microsoft vs Fancy Startup
NVIDIA: The Leader
NVIDIA's standard libraries made it easy to build the first deep learning libraries in CUDA, while there were no equally powerful standard libraries for AMD's OpenCL. This early advantage, combined with strong community support from NVIDIA, grew the CUDA community quickly. It means that if you use NVIDIA GPUs you will easily find support when something goes wrong, you will find support and advice if you program CUDA yourself, and you will find that most deep learning libraries have their best support for NVIDIA GPUs. In recent months, NVIDIA has poured still more resources into software. For example, the Apex library offers support to stabilize 16-bit gradients in PyTorch and also includes fused fast optimizers like FusedAdam. Overall, software is a very strong point for NVIDIA GPUs.
On the other hand, NVIDIA now has a policy that allows the use of CUDA in data centers only for Tesla GPUs, not for GTX or RTX cards. It is unclear what "data centers" means, but it implies that organizations and universities are often forced to buy the expensive and cost-inefficient Tesla GPUs out of fear of legal problems. However, Tesla cards have no real advantage over GTX and RTX cards and cost up to 10 times as much.
That NVIDIA can do this without any major pushback shows the power of their monopoly: they can do as they please, and we have to accept the terms. If you opt for the major advantages that NVIDIA GPUs have in terms of community and support, you also need to accept that you may be pushed around at will.
AMD: Powerful But Lacking Support
HIP via ROCm unifies NVIDIA and AMD GPUs under a common programming language which is compiled into the respective GPU language before it is compiled to GPU assembly. If all our GPU code were in HIP, this would be a major milestone, but this is rather difficult because the TensorFlow and PyTorch code bases are hard to port. TensorFlow and PyTorch have some support for AMD GPUs, and all major networks can be run on AMD GPUs, but if you want to develop new networks, some details might be missing which could prevent you from implementing what you need. The ROCm community is also not very large, and thus it is not straightforward to get issues fixed quickly. AMD invests little into its deep learning software, and as such one cannot expect the software gap between NVIDIA and AMD to close.
Currently, the performance of AMD GPUs is okay. They now have 16-bit compute capability, which is an important milestone, but the Tensor Cores of NVIDIA GPUs provide far superior compute performance for transformers and convolutional networks (though not so much for word-level recurrent networks).
Overall, I still cannot give a clear recommendation for AMD GPUs for ordinary users who just want their GPUs to work smoothly. More experienced users should have fewer problems, and by supporting AMD GPUs and ROCm/HIP developers they contribute to the fight against NVIDIA's monopoly position, which will greatly benefit everyone in the long term. If you are a GPU developer and want to make important contributions to GPU computing, then an AMD GPU might be the best way to have a real impact over the long term. For everyone else, NVIDIA GPUs might be the safer choice.
Intel: Trying Hard
My personal experience with Intel Xeon Phis was very disappointing, and I do not see them as a real competitor to NVIDIA or AMD cards, so I will keep it short: if you decide to use a Xeon Phi, be aware that you may encounter poor support, computational issues that make sections of code slower than on CPUs, difficulties writing optimized code, no full support for C++11 features, no support for some important GPU design patterns, poor compatibility with other libraries that rely on BLAS routines (NumPy and SciPy), and probably many other frustrations that I did not run into.
Beyond the Xeon Phi, I was really looking forward to the Intel Nervana Neural Network Processor (NNP), because its specs would be extremely powerful in the hands of a GPU developer and it would allow for new algorithms that might redefine how neural networks are used, but it has been delayed endlessly and there are rumors that large parts of the team have jumped ship. The NNP is planned for Q3/Q4 2019. If you want to wait that long, keep in mind that good hardware is not everything, as we can see from AMD and from Intel's own Xeon Phi. It might well be 2020 or 2021 before the NNP is competitive with GPUs or TPUs.
Google: Powerful, Cheap On-Demand Processing
The Google TPU has developed into a very mature cloud-based product that is cost-efficient. The easiest way to make sense of the TPU is to see it as multiple specialized GPUs packaged together that have only one purpose: doing fast matrix multiplications. If we look at performance measures of the Tensor-Core-enabled V100 versus the TPUv2, we find that both systems have nearly the same performance for ResNet50 [source is lost, not on the Wayback Machine]. However, the Google TPU is more cost-efficient. Since the TPUs have a sophisticated parallelization infrastructure, TPUs will have a major speed benefit over GPUs if you use more than 1 cloud TPU (equivalent to 4 GPUs).
PyTorch now also supports TPUs, although this support is still experimental; it will help strengthen the TPU community and ecosystem.
TPUs still have some problems here and there; for example, a report from February 2018 said that the TPUv2 did not converge when LSTMs were used. I could not find a source confirming whether the problem has been fixed yet.
On the other hand, there is a big success story for training big transformers on TPUs. GPT-2, BERT, and machine translation models can be trained very efficiently on TPUs. According to my estimates from my TPU vs GPU blog post, TPUs are about 56% faster than GPUs and thanks to their lower price compared to cloud GPUs they are an excellent choice for big transformer projects.
One issue with training large models on TPUs, however, can be the cumulative cost. TPUs have high performance which is best used in the training phase. In the prototyping and inference phases, you should rely on non-cloud options to reduce costs. Thus training on TPUs while prototyping and doing inference on your personal GPU is the best choice.
To conclude, TPUs currently seem best used for training convolutional networks or large transformers, and they should be seen as a supplement to other compute resources rather than as your main deep learning resource.
Amazon AWS and Microsoft Azure: Reliable but Expensive
GPU instances from Amazon AWS and Microsoft Azure are very attractive because one can easily scale up and down based on need. This is very useful for paper deadlines or for larger one-off projects. However, as with TPUs, the raw costs add up quickly. Currently, GPU cloud instances are too expensive to be used in isolation, and I recommend having some dedicated cheap GPUs for prototyping before launching the final training jobs in the cloud.
Fancy Startup: Revolutionary Hardware Concept with No Software
There are a number of startups that aim to produce the next generation of deep learning hardware. These companies usually have an excellent theoretical design and are then bought by Google/Intel or others to get the funding they need to fully design and produce a chip. For the next generation of chips (3 nm), this costs roughly $1 billion before the chip can be produced. Once this stage is complete, the main hurdle is software. No company has managed to produce software that works within the current deep learning stack. A full software suite needs to be developed to be competitive, which is clear from the AMD vs NVIDIA example: AMD has great hardware but only 90% of the software, and that is not enough to compete with NVIDIA.
Currently, no company is anywhere close to completing both hardware and software steps. The Intel NNP might be the closest, but from all of this one cannot expect a competitive product before 2020 or 2021. So currently we will need to stick to GPUs and TPUs.
Thus, fancy new hardware from your favorite startup can be safely disregarded for now.
What Makes One GPU Faster Than Another?
Your first question might be what is the most important feature for fast GPU performance for deep learning: Is it CUDA cores? Clock speed? RAM size?
In 2019, the choice of a GPU is more confusing than ever: 16-bit compute, Tensor Cores, 16-bit GPUs without Tensor Cores, and multiple generations of GPUs which are still viable (Turing, Volta, Maxwell). But there are still some reliable performance indicators which you can use as a rule of thumb. Here are some prioritization guidelines for different deep learning architectures:
Convolutional networks and Transformers: Tensor Cores > FLOPs > Memory Bandwidth > 16-bit capability
Recurrent networks: Memory Bandwidth > 16-bit capability > Tensor Cores > FLOPs
This reads as follows: If I want to use, for example, convolutional networks, I should first prioritize a GPU that has tensor cores, then a high FLOPs number, then a high memory bandwidth, and then a GPU which has 16-bit capability. While prioritizing, it is important to pick a GPU which has enough GPU memory to run the models one is interested in.
Why These Priorities?
One way to deepen your understanding and make an informed choice is to learn a bit about which parts of the hardware make GPUs fast for the two most important tensor operations: matrix multiplication and convolution.
A simple and effective way to think about matrix multiplication A*B=C is that it is memory-bandwidth bound: copying the memory of A and B onto the chip is more costly than doing the computation of A*B. This means memory bandwidth is the most important feature of a GPU if you want to use LSTMs and other recurrent networks that do lots of small matrix multiplications. The smaller the matrix multiplications, the more important memory bandwidth becomes.
On the contrary, convolution is bound by computation speed. Thus TFLOPs on a GPU is the best indicator for the performance of ResNets and other convolutional architectures. Tensor Cores can increase FLOPs dramatically.
Large matrix multiplications, as used in transformers, lie in between convolution and the small matrix multiplications of RNNs. Big matrix multiplications benefit a lot from 16-bit storage, Tensor Cores, and FLOPs, but they still need high memory bandwidth.
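A rough roofline-style calculation illustrates these priorities. The peak numbers below are approximate public 32-bit figures for an RTX 2080 Ti and serve only to illustrate the reasoning, not as exact benchmarks.

```python
# Rough estimate of whether a matmul (m x k) @ (k x n) is limited by memory
# bandwidth or by compute. The peak figures are approximate values for an
# RTX 2080 Ti in 32-bit and are used purely as an illustration.
PEAK_FLOPS = 13.4e12      # ~13.4 TFLOPS FP32
PEAK_BANDWIDTH = 616e9    # ~616 GB/s memory bandwidth

def matmul_bound(m, k, n, bytes_per_element=4):
    flops = 2 * m * k * n                                      # multiply-adds
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element  # read A, B; write C
    intensity = flops / bytes_moved                            # FLOPs per byte moved
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH                        # ~22 FLOPs per byte
    return "memory-bound" if intensity < ridge else "compute-bound"

# One LSTM gate matmul per time step (batch 32, hidden 1024, 4 gates):
print(matmul_bound(32, 1024, 4096))    # -> memory-bound
# A large transformer-style matmul:
print(matmul_bound(4096, 4096, 4096))  # -> compute-bound
```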
Note that to make use of the benefits of Tensor Cores you should use 16-bit data and weights; avoid using 32-bit with RTX cards! If you run into problems with 16-bit training using PyTorch, you should use dynamic loss scaling as provided by the Apex library. If you use TensorFlow you can implement loss scaling yourself: (1) multiply your loss by a big number, (2) calculate the gradient, (3) divide by the big number, (4) update your weights. Usually, 16-bit training should be just fine, but if you are having trouble replicating results with 16-bit, loss scaling will usually solve the issue.
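For concreteness, here is a minimal sketch of the manual loss-scaling recipe above, written in PyTorch for brevity (the Apex library automates this with dynamic scaling); `model`, `optimizer`, and `loss_fn` are assumed to exist, and the model is assumed to have already been converted to half precision.

```python
SCALE = 2 ** 10  # the "big number"; too large overflows, too small underflows

def fp16_training_step(model, optimizer, loss_fn, inputs, targets):
    # Assumes `model` has already been converted with model.half().
    optimizer.zero_grad()
    loss = loss_fn(model(inputs.half()), targets)
    (loss * SCALE).backward()          # (1) scale the loss, (2) compute gradients
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(SCALE)         # (3) divide the gradients by the big number
    optimizer.step()                   # (4) update the weights
    return loss.item()
```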
Cost Efficiency Analysis
The cost-efficiency of a GPU is probably the most important criterion for selecting a GPU. The performance analysis for this blog post update was done as follows:
(1) For transformers, I benchmarked Transformer-XL and BERT.
(2) For word and char RNNs I benchmarked state-of-the-art biLSTM models.
(3) The benchmarking in (1) and (2) was done for Titan Xp, Titan RTX, and RTX 2080 Ti. For other cards, I scaled the performance differences linearly.
(4) I used the existing benchmarks for CNNs (1, 2, 3, 4, 5, 6, 7).
(5) I used the mean cost of Amazon and eBay as the reference cost for a GPU.
From this data, we see that the RTX 2060 is more cost-efficient than the RTX 2070, RTX 2080, or RTX 2080 Ti. Why is this so? The ability to do 16-bit computation with Tensor Cores is much more valuable than just having a bigger chip with more Tensor Cores. With the RTX 2060, you get these features for the lowest price.
However, this analysis has certain biases which should be taken into account:
(1) This analysis is strongly biased in favor of smaller cards. Smaller, cost-efficient GPUs might not have enough memory to run the models that you care about!
(2) Overpriced GTX 10xx cards: Currently, GTX 10xx cards seem to be overpriced since gamers do not like RTX cards.
(3) Single-GPU bias: One computer with 4 cost-inefficient cards (4x RTX 2080 Ti) is much more cost-efficient than 2 computers with the most cost-efficient cards (8x RTX 2060).
Warning: Multi-GPU RTX Heat Problems
There are problems with the RTX 2080 Ti and other RTX GPUs with the standard dual-fan design if you run multiple GPUs next to each other. This is especially true for multiple RTX 2080 Ti cards in one computer, but multiple RTX 2080 and RTX 2070 cards can also be affected. The fan on some RTX cards is a new design developed by NVIDIA to improve the experience for gamers who run a single GPU (quiet, with less heat for one GPU). However, the design is terrible if you use multiple GPUs that have this open dual-fan design. If you want to use multiple RTX cards that run next to each other (directly in the next PCIe slot), you should get the single-fan "blower-style" version. This is especially true for RTX 2080 Ti cards. ASUS and PNY currently have RTX 2080 Ti models with a blower fan on the market. If you use two RTX 2070s next to each other, any fan design should be fine.
Required Memory Size and 16-bit Training
The memory on a GPU can be critical for some applications like computer vision, machine translation, and certain other NLP applications, and you might think that the RTX 2070 is cost-efficient but that its 8 GB of memory is too small. However, note that through 16-bit training you virtually have 16 GB of memory, and any standard model should fit into your RTX 2070 easily if you use 16 bits. The same is true for the RTX 2080 and RTX 2080 Ti. Note, though, that in most software frameworks you will not automatically save half of the memory by using 16-bit, since some frameworks store weights in 32 bits to do more precise gradient updates and so forth. A good rule of thumb is to assume 50% more memory with 16-bit compute. So 8 GB of 16-bit memory is about equivalent in size to 12 GB of 32-bit memory.
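As a hypothetical back-of-the-envelope illustration of this rule of thumb (the parameter count and activation sizes below are made up, and the byte accounting is simplified):

```python
# Rough memory estimate showing why 16-bit gives about "50% more" memory
# rather than a clean 2x: working weights, gradients, and activations shrink
# to 2 bytes each, but the framework may keep a 32-bit master copy of the
# weights for precise updates. Numbers are illustrative only.
def rough_training_memory_gb(n_params, activation_bytes, use_16bit):
    if use_16bit:
        weights = 2 * n_params   # 16-bit working copy
        master = 4 * n_params    # 32-bit master weights kept by the framework
        grads = 2 * n_params
    else:
        weights = 4 * n_params
        master = 0
        grads = 4 * n_params
    return (weights + master + grads + activation_bytes) / 1e9

# A made-up 100M-parameter network with ~4 GB of 32-bit activations:
print(rough_training_memory_gb(100e6, 4e9, use_16bit=False))  # ~4.8 GB
print(rough_training_memory_gb(100e6, 2e9, use_16bit=True))   # ~2.8 GB (activations halve too)
```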
General GPU Recommendations
Currently, my main recommendation is to get an RTX 2070 GPU and use 16-bit training. I would never recommend buying a Titan Xp, Titan V, any Quadro cards, or any Founders Edition GPUs. However, there are some specific GPUs which also have their place:
(1) For extra memory, I would recommend an RTX 2080 Ti. If you really need a lot of extra memory, the RTX Titan is the best option, but make sure you really do need that memory!
(2) For extra performance, I would recommend an RTX 2080 Ti.
(3) If you are short on money I would recommend any cheap GTX 10XX card from eBay (depending on how much memory you need) or an RTX 2060. If that is too expensive have a look at Colab.
(4) If you just want to get started with deep learning a GTX 1060 (6GB) is a great option.
(5) If you already have a GTX 1070 or better: Wait it out. An upgrade is not worth it unless you work with large transformers.
(6) If you want to learn quickly how to do deep learning: multiple GTX 1060 (6GB) cards.
Deep Learning in the Cloud
Both GPU instances on AWS/Azure and TPUs in the Google Cloud are viable options for deep learning. While the TPU is a little cheaper, it lacks the versatility and flexibility of cloud GPUs. TPUs might be the weapon of choice for training object recognition or transformer models. For other workloads, cloud GPUs are the safer bet. The nice thing about cloud instances is that you can switch between GPUs and TPUs at any time, or even use both at the same time.
However, mind the opportunity cost here: if you learn the skills to have a smooth workflow with AWS/Azure, you lose time that could have been spent working on a personal GPU, and you will also not acquire the skills to use TPUs. If you use a personal GPU, you will not have the skills to expand into more GPUs/TPUs via the cloud. If you use TPUs, you might be stuck with TensorFlow for a while if you want full features, and it will not be straightforward to switch your code base to PyTorch. Learning a smooth cloud GPU/TPU workflow is an expensive opportunity cost, and you should weigh this cost when you choose between TPUs, cloud GPUs, or personal GPUs.
Another question is when to use cloud services. If you are trying to learn deep learning or you need to prototype, then a personal GPU might be the best option since cloud instances can be pricey. However, once you have found a good deep network configuration and you just want to train a model using data parallelism, then using cloud instances is a solid approach. This means that a small GPU will be sufficient for prototyping, and one can rely on the power of cloud computing to scale up to larger experiments.
If you are short on money the cloud computing instances might also be a good solution: Prototype on a CPU and then roll out on GPU/TPU instances for a quick training run. This is not the best work-flow since prototyping on a CPU can be a big pain, but it can be a cost-efficient alternative.
Conclusion
With the information in this blog post, you should be able to reason about which GPU is suitable for you. In general, I see three main strategies: (1) stick with your GTX 1070 or better GPU, (2) buy an RTX GPU, (3) use some kind of GPU for prototyping and then train your model on TPUs or cloud GPUs in parallel.
TL;DR advice
Best GPU overall: RTX 2070
GPUs to avoid: Any Tesla card; any Quadro card; any Founders Edition card; Titan RTX, Titan V, Titan XP
Cost-efficient but expensive: RTX 2070
Cost-efficient and cheap: RTX 2060, GTX 1060 (6GB).
I have little money: GTX 1060 (6GB)
I have almost no money: GTX 1050 Ti (4GB). Alternatively: CPU (prototyping) + AWS/TPU (training); or Colab.
I do Kaggle: RTX 2070.
If you do not have enough money, go for a GTX 1060 (6GB) or a GTX Titan (Pascal) from eBay for prototyping and AWS for final training. Use the fastai library.
I am a competitive computer vision or machine translation researcher: an RTX 2080 Ti with the blower fan design. If you train very large networks, get RTX Titans.
I am an NLP researcher: RTX 2080 Ti with 16-bit training.
I want to build a GPU cluster: This is really complicated; you can get some ideas from my multi-GPU blog post.
I started deep learning and I am serious about it: Start with an RTX 2070. Buy more RTX 2070s after 6-9 months if you still want to invest more time into deep learning. Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GPUs and buy something more appropriate after about two years.
I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4 or 2GB). This often fits into your standard desktop and does not require a new PSU. If it fits, do not buy a new computer!
Update 2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
Update 2018-11-26: Added discussion of overheating issues of RTX cards.
Update 2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
Update 2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
Update 2017-04-09: Added cost efficiency analysis; updated recommendation with NVIDIA Titan Xp
Update 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs
Acknowledgments
I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes.