What is our primary use case?
As an example of a use case, when I was a contractor for Cisco, we were processing mobile network data and the volume was too big for our RDBMS to handle. We started using the Hadoop framework to improve the process and get the results faster.
What is most valuable?
The data is split into partitioned blocks and distributed across the nodes, which makes processing very fast compared to traditional RDBMS systems. Apache Spark processes data in memory, and it's much faster than MapReduce.
The partitioned storage and HDFS are both excellent features.
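To give a rough idea of why this storage layout helps, here is a minimal Python sketch of how a file could be cut into fixed-size blocks and spread over nodes. It is only an illustration, not Hadoop's actual code. It assumes the usual HDFS defaults of a 128 MB block size (dfs.blocksize) and a replication factor of 3 (dfs.replication). The function names and the node names are hypothetical, and the round-robin placement is a simplification, since real HDFS placement is rack-aware.

```python
# Toy model of HDFS-style block storage. Not Hadoop code.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # assumed default for dfs.blocksize
REPLICATION = 3                       # assumed default for dfs.replication


def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE_BYTES):
    """Return the sizes of the blocks a file of this size would be cut into."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only takes up the space it actually needs.
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical simplification: real HDFS uses rack-aware placement.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(300 * mb)
    nodes = ["node%d" % i for i in range(10)]  # hypothetical 10-node cluster
    for block_id, holders in place_replicas(len(blocks), nodes).items():
        print(block_id, blocks[block_id] // mb, "MB", holders)
```

Because each block lives on several nodes, many workers can read different parts of the same file at once, which is where the speed over a single-server RDBMS comes from.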
What needs improvement?
I'm not sure if I have any ideas as to how to improve the product.
Every year, the solution comes out with new features. Spark is one new feature, for example. If they continue to release helpful new features, it will continue to increase the value of the solution.
The solution could always improve performance. This is a consistent requirement. Whenever you run it, there is always room for improvement in terms of performance.
The solution needs a better tutorial. There are only documents available currently. There are a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning.
We would prefer it if users didn't just get pushed through to certification-based learning, as certifications are expensive. Maybe they could arrange it so that the certification is offered at a lower cost. The certification currently costs around $2,500.
For how long have I used the solution?
I've been using the solution for four years.
What do I think about the stability of the solution?
We haven't had too many problems with stability. For the POC we used a small amount of data and we started with 10 nodes. We're gradually increasing it, now to 40 nodes. We haven't seen any issues after the small teething period in the beginning. The configuration issues and the performance issues have subsided. Once we learned how to stack everything, it has been much better.
What do I think about the scalability of the solution?
The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. Because we started out so small, we've been expanding ever since we began using it. Companies that need to scale shouldn't have a problem doing so.
We support a multitenancy model, so we have data on the users we support. I would say, per organization, we have eight to 10 users and probably have a total of around 40 users across the board.
How are customer service and technical support?
We started on the solution as a POC. Once we got into production, we had some minor issues. We got great support. They shared advice and helped us tweak some things in terms of the configurations. We've been satisfied with the level of service we've been provided.
Which solution did I use previously and why did I switch?
We have only ever used Apache Hadoop, or a version of it. When we looked at the commercial tier, the options were Cloudera and Hortonworks. We started with Hortonworks because at that time we felt it was cost-effective. However, Cloudera has since merged with Hortonworks, so now it's all basically the same solution.
How was the initial setup?
The initial setup was a little complex the first time around. We were new to the system, and we didn't have any expertise at that time. Once we got some support and insights into how to work everything properly, it went more smoothly.
First, we started with a POC (proof of concept). It took a couple of days in terms of understanding and configuring everything. When we went to production, deployment took a couple of hours, and we put into practice everything we learned from the POC.
There's definitely a learning curve. It's stable for us now.
We have a team of developers doing multiple tasks on the solution, and a few of them are taking care of Hadoop, so we do have a few people handling maintenance.
What about the implementation team?
As we were new to the solution, we found we needed some outside assistance to guide us. However, that was for the POC. In the end, I did it myself.
What other advice do I have?
We're just a customer. We don't have a business relationship with Hadoop.
My day-to-day job is data modeling and architecting.
Originally we used it as an open-source solution. We downloaded it, then we went for a commercial version of it.
In terms of advice, I'd tell other potential users that whether the solution is right for them depends on a few items. If the data volume is too big for a traditional system, if it's IoT data, or if the stream of incoming data is too much, this solution can handle it and I would definitely recommend Apache Hadoop.
Recently, in the last 18 months, I've been working with Snowflake on a Data Lake project, and I am really impressed with that one. I got a certification, and we started using Snowflake for our Data Lake environment.
I'd rate the solution eight out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
We have since partnered with Hortonworks and are now researching the Cloudera and MapR spaces as well. Though our strong suit is Hortonworks, we do have a good implementation team for any of the distributions.