What is our primary use case?
In many ways, it's like using an HPC environment, but a lot more flexible. In theory, you can have many different kinds of computing systems, ranging from machines geared toward computational speed to large-memory machines or GPU machines.
The idea is to break your computational work into smaller jobs that can be run on multiple machines. One of these virtual machines orchestrates all of the computation, sending jobs out to different machines, waiting for them to be done, and then running the next process in the sequence. It's simply a way to run multiple processes on multiple machines in the cloud.
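As a rough illustration of that pattern, here is a minimal sketch using boto3, AWS's Python SDK. The queue name, job definition name, and command are hypothetical placeholders, not anything from a real setup; you would substitute your own.

```python
import time

import boto3

# AWS Batch client; assumes credentials and a default region are configured.
batch = boto3.client("batch")

# Fan out one job per input chunk. "my-queue" and "my-job-def" are
# hypothetical names for an existing job queue and job definition.
job_ids = []
for chunk in ["chunk-001", "chunk-002", "chunk-003"]:
    response = batch.submit_job(
        jobName=f"process-{chunk}",
        jobQueue="my-queue",
        jobDefinition="my-job-def",
        containerOverrides={"command": ["process.sh", chunk]},
    )
    job_ids.append(response["jobId"])

# Wait until every job reaches a terminal state, then start the next stage.
while True:
    jobs = batch.describe_jobs(jobs=job_ids)["jobs"]
    if all(j["status"] in ("SUCCEEDED", "FAILED") for j in jobs):
        break
    time.sleep(30)
```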
What is most valuable?
Scalability is the most valuable feature for me. I could run anywhere from 32 cores to over 2,000 cores, so it scales very well. It's really good for situations where jobs are very heterogeneous, meaning you have a long-running pipeline where one stage needs a lot of small, compute-intensive machines, for example.
But then, in the second stage, you may need a few very high-memory machines. It's really good for those kinds of HPC-style situations, where you can really customize and tailor the compute, memory, and GPU requirements for each job.
You can run as many jobs as you want, provided you pay the cost. So, it's about the scalability to run really large jobs in a really short amount of time with a very minimal setup.
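To make that heterogeneity concrete, here is a hedged sketch of how resources can be overridden per job at submission time with boto3. The queue and job definition names are placeholders, and the core counts just echo the figures above.

```python
import boto3

batch = boto3.client("batch")

# A compute-heavy first stage: many vCPUs, modest memory.
# "analysis-queue" and "stage1-def" are hypothetical names.
batch.submit_job(
    jobName="stage1-compute",
    jobQueue="analysis-queue",
    jobDefinition="stage1-def",
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "32"},
            {"type": "MEMORY", "value": "65536"},  # in MiB
        ]
    },
)

# A high-memory second stage, optionally with a GPU attached.
batch.submit_job(
    jobName="stage2-highmem",
    jobQueue="analysis-queue",
    jobDefinition="stage2-def",
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "524288"},  # 512 GiB, in MiB
            {"type": "GPU", "value": "1"},
        ]
    },
)
```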
One person can set up a compute cluster on AWS Batch. I don't need all the hardware resources, people to maintain those resources, or software installations.
Moreover, there is one other feature, CloudFormation, where you can have templates of what you want to do and just modify them to customize them to your needs.
And these templates make it a lot easier to get started. If you've been doing this for a while, you probably already have a template in your toolbox, and you can use one of those to get started and just customize it. So, these templates help a lot.
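Whether you start from a CloudFormation template or the API directly, the reusable unit inside Batch itself is the job definition. As a hedged sketch, here is a hypothetical job definition registered with boto3; the image URI and role ARN are placeholders.

```python
import boto3

batch = boto3.client("batch")

# Register a reusable job definition; later submissions can override
# the command and resources per job. Image and role ARN are placeholders.
batch.register_job_definition(
    jobDefinitionName="pipeline-step",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/pipeline:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},
        ],
        "command": ["run_step.sh"],
        "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
    },
)
```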
What needs improvement?
The main drawback to using AWS Batch is the cost. In some cases, it will be more expensive than using an HPC. It's more amenable to cases where your compute requirements are sporadic.
For instance, you don't know exactly how much compute you'll need or when you'll need it, so it's much better for that flexibility. But if you're going to be running jobs consistently, using the compute cluster heavily over a long period without much downtime, then an HPC system might be a better alternative. Really, it boils down to cost-versus-usage trade-offs. It's going to be more expensive for a lot of people.
In future releases, I would like to see anything that could make it easier to set up your initial system. Beyond that, improving the GUI a little would help: making the interface more descriptive and putting more information at your fingertips, so that you can quickly pull up help on what the different features do.
With most of the AWS services, the difficulty really is getting information and knowledge about the system and seeing examples. So, seeing examples of how it's being used under multiple use cases would be the best way to become familiar with it.
And some of that would just come with experience. You have to just use it and play with it. But in terms of the system itself, it's not that difficult to set up or use.
For how long have I used the solution?
I have been working with AWS Batch for six years now.
What do I think about the stability of the solution?
It's a stable product. There are times when your compute access may be down due to AWS having an outage, but there's a lot of redundancy in the system.
So, again, it's one of those things where you get what you pay for. You can have your system up 24/7 if you implement some redundancy.
So if something goes down, it fails over automatically to another system, another region, or another compute resource, which, in my experience, has been extremely rare.
The trade-off, again, is that you're going to pay for it. Because of that redundancy, you have double the amount of resources needed, or maybe not double, but more resources allocated than you would otherwise need.
So, if you're willing to pay for that, you have that flexibility. If you're willing to tolerate some downtime, then you'll end up spending less. If it's crucial for you to have your system operational all the time with no downtime, then it will cost you more, but you can achieve it.
What do I think about the scalability of the solution?
Currently, I am consulting. But in my previous organization, all of our compute was done in the cloud. That was the full informatics team of about five people, plus the DevOps team of about three or four people. So, around ten people were on the systems, and all of our compute and all of our storage was done in the Amazon cloud. Everything.
How are customer service and support?
It's one of those things where, with AWS, you'll quickly realize that you can get things faster and better, but you're going to pay for it. It's the same with support.
With the cheaper support, you'll get a response within 48 hours. If you pay extra, then you can get support within an hour. Again, it really depends on you and how quickly you need the support and the service, and that's all customizable.
And there are gradients in between. If it's a mission-critical problem where your production system needs to be up right now, you can get support right away.
If your problem is something that you don't need right away, and maybe you can figure it out in a few days, or you can wait for a response after a few days, then that's going to cost you less. So you have that sort of flexibility.
So it basically comes down to this: how much you pay is what you receive at the end of the day.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
In my experience, I've only used Amazon and Google Cloud significantly, and Amazon is, by far, the number one compute service I use.
They just have a lot of maturity; they've been doing this for a long time, so they've worked out a lot of the problems that other cloud providers may still be affected by. But, again, there is a trade-off: they are typically more expensive than Google or Microsoft.
There are a few parameters here. In the beginning, one of the decisions I needed to make was whether we were going to go with Amazon or Microsoft.
Microsoft Azure was significantly cheaper, and they were even giving us the equivalent of about a million dollars of credit to use Azure. So, it was going to cost a lot less than AWS.
But there were several reasons why we decided to go with Amazon. One advantage is that Amazon is just a much more mature system, so we could find people familiar with it, solutions architects and third-party companies, to help build our infrastructure.
It is also easier to give people access. There were a lot more services in Amazon, so there was a lot less need for us to build things ourselves, and there was a lot more knowledge about Amazon available. I was pretty familiar with Amazon and had little experience with Microsoft, so that was another factor.
Given all of these things, we were thinking long term, not just over the next couple of years but maybe five years out. Once we had dedicated our systems to one provider, we didn't really want to have to change them.
So even though we would have saved a lot upfront, in the long term it would probably have ended up being about the same or a little more expensive. That was one of the reasons for choosing AWS Batch.
In terms of using AWS Batch specifically, our computational tasks were very heterogeneous: sometimes we needed maybe 32 cores on a very large-memory machine; sometimes we needed 2,000 cores across a lot of different machines running in parallel. And our computation was very bursty.
So sometimes we would need 2,000 cores running at the same time, and then, sometimes for weeks, we wouldn't have the system running at all. We didn't want to pay for services we weren't going to be using.
So, the on-demand cost structure ended up working out better in the long run. But there was also flexibility: for some machines that we were going to use consistently over a long time, we could pay more upfront to reserve them, and that ended up costing us a lot less over the long run.
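As a hedged sketch of what that bursty, pay-for-what-you-use pattern can look like, here is a hypothetical Spot-backed managed compute environment created with boto3. Every name, subnet, security group, and role ARN below is a placeholder. With minvCpus set to zero, no instances run, and no compute cost accrues, while the queue is idle.

```python
import boto3

batch = boto3.client("batch")

# A managed, Spot-backed compute environment that scales to zero when
# idle and bursts up to 2,000 vCPUs under load. All identifiers here
# (subnets, security groups, roles) are hypothetical placeholders.
batch.create_compute_environment(
    computeEnvironmentName="bursty-spot-env",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "minvCpus": 0,
        "maxvCpus": 2000,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```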
AWS has integrated well with a lot of different software; we're using things like Nextflow and other systems. So, in many ways, it's a natural fit. These were the reasons we opted for AWS Batch.
How was the initial setup?
It's one of those things where, if you have the appropriate background, experience, and skill set, it's not very difficult to set up.
As with many other AWS systems and services, the main factor is the experience and skill of the solutions architect, administrator, or DevOps person setting up the system.
If it's someone with experience who has been doing this for some time, it's not very difficult to set up. If somebody is very new at it, it's going to take a long time. For me, for example, I can set up a Batch system in an hour or less.
What about the implementation team?
Nearly all AWS systems can be set up through the console and your account. There's a fair amount of setup in terms of allocating the resources you want, especially around security: making sure the appropriate people have the minimal access they need.
You also need to look at what kind of compute resources you want and how many, because with the power you get from these kinds of systems comes a lot of customization.
You can add a lot of different features, and you can also remove a lot of them. Security is very important because you don't want to give people access to things they shouldn't have; at the same time, you want to make sure people have access to the things they need to do their work.
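As a hedged illustration of that minimal-access principle, here is a hypothetical IAM policy, created via boto3, that lets users submit and monitor Batch jobs but not alter queues or compute environments. The policy name is a placeholder, and in practice you would scope the resource to specific queue and job definition ARNs rather than "*".

```python
import json

import boto3

iam = boto3.client("iam")

# Least-privilege policy for job submitters: they can submit, list,
# and describe jobs, but cannot modify Batch infrastructure.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "batch:SubmitJob",
                "batch:DescribeJobs",
                "batch:ListJobs",
            ],
            # Placeholder scope; narrow to specific ARNs in a real setup.
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="BatchSubmitOnly",
    PolicyDocument=json.dumps(policy_document),
)
```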
What was our ROI?
AWS Batch can end up saving you money; it really depends on your long-term situation. If you want to get started very quickly, nothing is going to be faster than doing it in the cloud. And it doesn't have to be AWS; Microsoft Azure or Google Cloud could work, etcetera.
So if you want to be set up very quickly and do things like prototyping and development, it will be more cost-effective to use AWS Batch because you don't have to make that upfront investment in very expensive computers. You can just get started and figure out exactly what your compute requirements are going to be, how much power you need, how long you need it, etcetera.
It's especially good for situations where the compute requirements vary over time. If you were running an e-commerce site, maybe during Black Friday you need much more compute, but in the summer you don't sell as many items, so you don't need as much. You wouldn't need to invest in a lot of equipment that you're probably not going to be using most of the time. In those kinds of circumstances, in my opinion, it's the best way to go.
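As a hedged sketch of that elasticity, the vCPU ceiling on a managed compute environment can be raised ahead of a peak period and lowered afterward with a single call; Batch only launches instances when there are runnable jobs, so a high ceiling by itself costs nothing. The environment name below is the hypothetical one from the earlier sketch.

```python
import boto3

batch = boto3.client("batch")

# Raise the vCPU ceiling ahead of a peak period (e.g., Black Friday);
# run the same call with a smaller value to lower it again afterward.
batch.update_compute_environment(
    computeEnvironment="bursty-spot-env",
    computeResources={"maxvCpus": 4000},
)
```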
Where it may not be cost-effective is a production environment where you expect to run a very consistent workload over time that isn't going to change much. If you know exactly what you need and what you want, have the resources to dedicate people to maintaining the system (administrators, people looking after the hardware), and expect to run the system for years rather than a short amount of time, then an HPC environment would be better.
Honestly, that's going to change in a few years. As time goes on, people are moving more to the cloud compute environment, and it works better in most circumstances.
The overall long-term cost is not as high as it would be in an HPC environment. So it really depends on the situation and the circumstances. Having used HPCs for almost 20 years, I would say there are fewer and fewer circumstances where an HPC is going to be the more cost-effective approach.
What other advice do I have?
If you're very new to Amazon and AWS, I would recommend spending six months to a year in training. This advice is particularly relevant if you're coming from an administrative standpoint, where you're either individually responsible or part of a team setting up systems for others. Gaining experience and understanding the nuances of these systems can be the difference between a system that's consistently operational and one that's frequently down or slow.
Moreover, this experience can significantly reduce costs. An experienced professional could slash your costs by 50% to 60%. Hence, it's crucial to amass as much knowledge and experience with the system as possible, particularly if Amazon is a new environment for you.
If you're already familiar with Amazon, then getting up to speed with AWS Batch should be relatively straightforward. It follows a similar implementation procedure to other AWS services. You might need just a few weeks to get accustomed to it, gaining exposure and experience in running different systems.
For end-users, those who need to run computational jobs but aren't involved in the setup, extensive exposure isn't necessary. The systems are already in place, and you'll likely use AWS Batch indirectly through third-party software. It's a simpler process for end-users; you submit your job, and the system, like a "magic black box," delivers the results without you needing to understand the backend processes.
Personally, I would rate the system a nine out of ten. My experience with it has been mostly positive, albeit with a steep learning curve, especially if you're coming from an HPC environment, which I was. It's a different way of thinking about how you're going to do your computation. You have a lot more flexibility. It's one of those things like the old adage: the more power you get, the more responsibility you have. And that's the case with AWS services, or cloud services in general. You get a lot more power and a lot more flexibility. At the same time, you need more knowledge to be able to use it, and you need to understand what all of these different features are if you want to really maximize the system.
The only reason I shy away from a perfect score is the documentation, use cases, and examples; there could be more information on those kinds of things, and it would definitely help somebody get used to the system. You really need somebody with a significant amount of experience to make the most of the system. Otherwise, somebody who is new to it doesn't fully appreciate all of the features and gets frustrated very quickly because they can't get it to do what they want.
Overall, AWS offers immense capabilities. Though it's not a standalone solution — often requiring third-party software for specific projects — Amazon continually integrates more of these tools. For instance, AWS Parallel Cluster, akin to Batch, can simulate an HPC environment, providing additional flexibility.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
*Disclosure: I am a real user, and this review is based on my own experience and opinions.