How to Optimize Price, Speed, and Performance for Large-Scale Deep Learning Models on AWS
Training large-scale deep learning models requires significant computing power, which makes it a challenging task for many data scientists. Fortunately, Amazon Web Services (AWS) provides scalable infrastructure that enables data scientists to optimize price, speed, and performance. However, working with AWS can be frustrating at first, and it can get expensive if you are not sure what you are doing.
In this blog post, Vision Elements shares some basic principles and best practices for data scientists who handle out-of-the-ordinary deep learning models on AWS. We hope you can learn from our experience and walk away with tips that streamline your deep learning projects and reduce your costs, saving both time and money.
1. Choose between SageMaker for straightforward modeling and a "clean" instance
SageMaker is a powerful Amazon tool that lets you build new deep learning projects with minimal effort using pre-built models and packages. However, if your project requires anything beyond the built-in packages and models, SageMaker can feel restrictive. In practice, almost every project we have handled had unique challenges, so we prefer to launch "clean" instances with the appropriate hardware and set up the models and the environment ourselves, as sketched below. This approach gives us full control over the code and the training process, as well as better debugging options than SageMaker provides.
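If you go the "clean"-instance route, launching the hardware you need is a few API calls. Below is a minimal boto3 sketch; the AMI ID, instance type, key pair, and security group are placeholders to replace with values from your own account and region (a Deep Learning AMI is a convenient starting image).

```python
import boto3

# Minimal sketch: launch a "clean" GPU instance with boto3.
# All IDs below are placeholders -- substitute your own AMI, key pair,
# and security group, and pick the region/instance type you actually need.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # e.g. an AWS Deep Learning AMI in your region
    InstanceType="g4dn.xlarge",        # GPU hardware sized for your model
    KeyName="my-key-pair",             # existing EC2 key pair for SSH access
    SecurityGroupIds=["sg-0123456789abcdef0"],
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",     # root device name depends on the AMI
        "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},  # room for frameworks + data
    }],
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "dl-training"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```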
2. Choose the storage that fits your speed and budget
S3 is the most common storage service on AWS. Still, we recommend EBS for storing the training data. Unlike S3, which is object storage accessed over the network, an EBS volume behaves like a local drive attached to your specific instance. During training we repeatedly load large amounts of data, and EBS makes this much faster. Although EBS costs more than S3 per gigabyte, training runs finish sooner, and because the hourly instance price usually dwarfs both EBS and S3 costs, you save not only time but also money.
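As a rough illustration, here is a minimal boto3 sketch of creating a dedicated EBS data volume and attaching it to a running training instance. The instance ID, size, and availability zone are placeholders, and the volume must be created in the same availability zone as the instance.

```python
import boto3

# Minimal sketch: add a dedicated EBS data volume to a training instance.
ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # must match the instance's AZ
    Size=500,                        # GiB -- size it for your training set
    VolumeType="gp3",
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "Name", "Value": "training-data"}],
    }],
)

# Wait until the volume is ready, then attach it to the instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",          # placeholder instance ID
    Device="/dev/sdf",                         # shows up on the instance as e.g. /dev/xvdf or /dev/nvme1n1
)
# On the instance, format and mount the volume, then copy the training data
# from S3 onto it once, before training starts.
```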
3. AWS allows you to start small and save money in the early stages
Training deep learning models usually requires large, expensive instances. A good practice, however, is to start with a small instance, set up the environment and all the code on it, and train on a small subset of the data. Once you see that everything works, you can change the instance type while keeping the environment intact, as sketched below. Doing this can save many expensive hours.
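Switching instance types without losing your setup is a stop–modify–start cycle. The sketch below assumes an EBS-backed instance, so the root volume, and with it your environment and code, survives the resize; the instance ID and target type are placeholders.

```python
import boto3

# Minimal sketch: scale up an instance once the code works on a small one.
# The instance must be EBS-backed and stopped before its type can be changed.
ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"   # placeholder

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Switch from the small development instance to a large training instance;
# the EBS root volume (environment, code, data) is preserved.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "p3.8xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```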
4. Trade off price and performance when selecting a region
When choosing a region in which to launch an instance, we usually prefer a location close to us so that working on the instance is smoother, with lower latency. However, the same instance type can have different prices in different regions, and those differences can be significant. When the training time is expected to be long, the instance price may become the main consideration when choosing the region.
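One way to compare prices before committing to a region is the AWS Price List API. The sketch below is only an example under assumptions: the instance type and the human-readable "location" values are illustrative and may need adjusting to the exact names the Pricing service uses, and the Pricing endpoint itself is only served from a few regions, us-east-1 among them.

```python
import json
import boto3

# Minimal sketch: compare the on-demand Linux price of one instance type
# across regions using the AWS Price List API.
pricing = boto3.client("pricing", region_name="us-east-1")

def on_demand_price_usd(instance_type, location):
    resp = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "location", "Value": location},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
        ],
        MaxResults=1,
    )
    product = json.loads(resp["PriceList"][0])            # each entry is a JSON string
    term = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(term["priceDimensions"].values()))
    return float(dimension["pricePerUnit"]["USD"])

# Example region names -- adjust to the names your account/API version reports.
for location in ["US East (N. Virginia)", "Europe (Ireland)", "Asia Pacific (Tokyo)"]:
    print(f"{location}: ${on_demand_price_usd('p3.2xlarge', location):.3f}/hour")
```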
5. Use a Remote Desktop connection
For coding, debugging, and conveniently viewing and analyzing results, you need a proper GUI that supports your preferred environments and software packages. Installing XRDP together with a desktop environment such as XFCE is a convenient way to use the IDE of your choice – be it Visual Studio, PyCharm, or MATLAB® – and work just as you are used to working on your personal laptop.
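For the RDP connection to go through, port 3389 must be open on the instance's security group. Here is a minimal boto3 sketch; the security group ID and the CIDR are placeholders, and you should restrict the CIDR to your own IP rather than opening the port to the world.

```python
import boto3

# Minimal sketch: allow inbound RDP (port 3389) so the XRDP session is reachable.
ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",    # placeholder security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3389,
        "ToPort": 3389,
        "IpRanges": [{"CidrIp": "203.0.113.7/32",   # your own IP, not 0.0.0.0/0
                      "Description": "RDP from my workstation"}],
    }],
)
```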
Conclusion
In conclusion, AWS provides scalable infrastructure that enables data scientists to optimize price, speed, and performance for large-scale deep learning models. By following the tips and best practices outlined in this blog post, you can streamline your deep learning projects and optimize your costs, saving you time and money.
Vision Elements enables companies to expedite and de-risk product development by strengthening an existing team with experienced data scientists. Let us help you ramp up your next project.