What is our primary use case?
Big Data analytics, customer incubation.
We host our Big Data analytics "lab" on Amazon EC2. Customers are new to Big Data analytics so we do proofs of concept for them in this lab. Customers bring historical, structured data, or IoT data, or a blend of both. We ingest data from these sources into the Hadoop environment, build the analytics solution on top, and prove the value and define the roadmap for customers.
How has it helped my organization?
Initially, with RDBMS alone, we had a lot of work and few servers running on-premise and on cloud for the PoC and incubation. With the use of Hadoop and ecosystem components and tools, and managing it in Amazon EC2, we have created a Big Data "lab" which helps us to centralize all our work and solutions into a single repository. This has cut down the time in terms of maintenance, development and, especially, data processing challenges.
We were using MySQL and PostgreSQL for these engagements, and scaling and processing were not as easy when compared to Hadoop. Also, customers who are embarking on a big journey with semi-structured information prefer to use Hadoop rather than a RDBMS stack. This gives them clarity on the requirements.
In addition, since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done.
Flexibility, ease of data processing, reduced cost and efforts are the three key improvements for us.
What is most valuable?
HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect.
What needs improvement?
Based on our needs, we would like to see a tool for data visualization and enhanced Ambari for management, plus a pre-built IoT hub/model. These would reduce our efforts and the time needed to prove to a customer that this will help them.
For how long have I used the solution?
Less than one year.
What do I think about the stability of the solution?
We have a three-node cluster running on cloud by default, and it has been stable so far without any stoppages due to Hadoop or other ecosystem components.
What do I think about the scalability of the solution?
Since this is primarily for customer incubation, there is a need to process huge volumes of data, based on the proof of value engagement. During these processes, we scale the number of instances on demand (using Amazon spot instances), use them for a defined period, and scale down when the PoC is done. This gives us good flexibility and we pay only for usage.
How are customer service and support?
Since this is mostly community driven, we get a lot of input from the forums and our in-house staff who are skilled in doing the job. So far, most of the issues we have had during setup or scaling have primarily been on the infrastructure side and not on the stack. For most of the problems we get answers from the community forums.
How was the initial setup?
We didn't have any major issues except for knowledge, so we hired the right person who had hands-on experience with this stack, and worked with the cloud provider to get the right mechanism for handling the stack.
General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things are a little challenging, but we were able to get through that with support from forums and a little trial and error. In addition, the old PoCs which were migrated had issues in directly connecting to Hive. We had to build some user functions to handle that.
What's my experience with pricing, setup cost, and licensing?
We normally do not suggest any specific distributions. When it comes to cloud, our suggestion would be to choose different types of instances offered by Amazon cloud, as we are technology partners of Amazon for cost savings. For all our PoCs, we stick to the default distribution.
Which other solutions did I evaluate?
None, as this stack is familiar to us and we were sure it could be used for such engagements without much hassle. Our primary criteria were the ability to migrate our existing RDBMS-based PoC and connectivity via our ETL and visualization tool. On top of that, support for semi-structured data for ELT. All three of these criteria were a fit with this stack.
What other advice do I have?
Our general suggestion to any customer is not to blindly look and compare different options. Rather, list the exact business needs - current and future - and then prepare a matrix to see product capabilities and evaluate costs and other compliance factors for that specific enterprise.
*Disclosure: I am a real user, and this review is based on my own experience and opinions.