What is our primary use case?
We have different use cases for different clients. For example, a banking sector client wants to use Apache Kafka for fraud detection and for messaging back to the customer. For instance, if there is a fraudulent swipe of a credit or debit card, we stream near real-time data (usually three to five minutes old) into the platform. Then, we use a look-up model to relay the data through the messaging queue to the customers. This is just one use case.
We also have data science engineers who use Kafka to consume the data (usually the last five to seven minutes of transactions) to detect fraudulent transactions for the bank's internal consumption. This is not relayed back to the customer. This client is based in the Middle East, and they have some interesting use cases.
How has it helped my organization?
We are still in the cluster build phase. Based on the use cases captured during the advisory phase, the user base will split roughly 40/60: 40% internal data science and IT teams and 60% end users. In total, we expect about 25 users. Of these, around 15 business users will make decisions based on reports generated from Kafka analytics data. The remaining users are internal; they analyze this data daily to identify further predictive and AI use cases for the bank.
Moreover, our current client is an enterprise business: a globally renowned bank that has entered Saudi Arabia.
What is most valuable?
One of the major features that we are currently exploring, which draws on my previous experience as well, is a publish-subscribe (PubSub) architecture with one publisher and multiple subscribers. We have one data hub, an IBM DB2 system that the bank uses to track day-to-day transactions from its OLTP systems, and we want to use that data on different platforms. So, we are trying to use Kafka in a model where it publishes onto multiple messaging queues. These different messaging queues belong to different business units, because we are segregating the data lake we are building into different domains. For example, HR data is highly sensitive, so that unit does not want to share it with other businesses. We are working on a common-publisher, multiple-subscriber model, which I feel is much more easily implementable using Kafka.
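The one-publisher, multiple-subscriber routing described above can be sketched as follows. This is a minimal illustration, not the bank's actual implementation: the topic names, the domain list, and the `FakeProducer` stub (standing in for a real Kafka client such as confluent-kafka) are all assumptions.

```python
# Minimal sketch of one publisher fanning records out to per-domain topics.
# Topic names and domains are hypothetical; FakeProducer stands in for a
# real Kafka producer client.

DOMAIN_TOPICS = {
    "hr": "lake.hr",            # sensitive: only the HR unit subscribes
    "finance": "lake.finance",
    "retail": "lake.retail",
}

def topic_for(domain):
    """Map a business domain to its dedicated topic."""
    if domain not in DOMAIN_TOPICS:
        raise ValueError(f"unknown business domain: {domain}")
    return DOMAIN_TOPICS[domain]

class FakeProducer:
    """Stand-in for a real Kafka producer (e.g. confluent_kafka.Producer)."""
    def __init__(self):
        self.sent = []
    def send(self, topic, value):
        self.sent.append((topic, value))

def publish(producer, record):
    """Single publisher routes each record to its business unit's topic."""
    producer.send(topic_for(record["domain"]), record)

producer = FakeProducer()
publish(producer, {"domain": "hr", "emp_id": 101})
print(producer.sent[0][0])  # the HR record lands only on the HR topic
```

Each business unit then subscribes only to its own topic, so sensitive domains such as HR stay isolated at the topic level.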
The other part that we are trying to implement, which is in its very nascent stages, is to see if we can make it future-ready. Right now, in the Middle East, there are not many cloud providers like GCP, AWS, and Azure; everything is on-premise. But they will arrive within the next two or three years. So, we are trying to see if we can have these Kafka models working from a future perspective, wherein instead of dumping some of the data into a data lake, we can stream it directly into solutions like GCP BigQuery for real-time analytics. This applies only to the real-time analytics use cases; this data will definitely also land in the data lake, as that is the intention of keeping it.
Using Kafka, we are trying to see if we can make these subscribers ready to use platforms like GCP BigQuery for real-time analytics. It is still in the nascent stages, but those are the use cases we are targeting.
What needs improvement?
One of the major areas for improvement is the polling mechanism. Sometimes, when the data volume is very high and I have a polling period of, let's say, one minute, there can be issues due to technical glitches, data anomalies, or platform-related issues such as cluster restarts. These tend to stall the message queues, and the restartability needs to be improved, especially when data volumes are very high.
If there are obstructions due to technical glitches or platform issues, sometimes we have to manually clean up or clear the queue before it eventually recovers. It does restart on its own, but it takes too much time to catch up. A year ago, I could not find a solution to make it more agile in terms of catching up quickly and staying real-time after any downtime.
This was one area where I couldn't find a solution when I connected with Cloudera and Apache. One of our messaging tools was sending a couple of million records. We found it tough when there were any cluster downtimes or issues with the subscribers consuming data.
For future releases, one feature I would like to see is a more robust solution in terms of restartability. It should be able to handle platform issues and data issues and restart seamlessly, without causing a cascading effect if there is any downtime.
Another feature that would be helpful is monitoring, as they have for their other services: a UI where I can monitor the capacity of the managed queues and the resources I need to add to be ready for future data volumes. It would be great to have analytics on the overall performance of Kafka to plan for data volumes and message throughput. Currently, we plan the cluster resources for Kafka manually based on data volumes. A UI for resource planning based on data volume would be a great addition.
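Until such a planning UI exists, the manual estimation can at least be scripted. The sketch below captures typical back-of-the-envelope Kafka sizing arithmetic; the per-partition throughput default is an assumption and should be replaced with a benchmark of your own hardware.

```python
import math

def partitions_needed(target_mb_s, per_partition_mb_s=10.0):
    """Partitions required to absorb a target throughput, given what a
    single partition sustains (10 MB/s is an assumed figure)."""
    return math.ceil(target_mb_s / per_partition_mb_s)

def retention_storage_gb(ingest_mb_s, retention_hours, replication_factor=3):
    """Cluster-wide disk needed to hold `retention_hours` of data at the
    given ingest rate, including replicas."""
    return ingest_mb_s * 3600 * retention_hours * replication_factor / 1024

# Example: 55 MB/s of card-swipe events, retained for 24 hours
print(partitions_needed(55))           # 6 partitions
print(retention_storage_gb(55, 24))    # ~13,922 GB across the cluster
```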
For how long have I used the solution?
I have been using Apache Kafka for five years. In the current project, we're setting up a cluster. We'll be doing the service installations next week. It's a private cloud-based implementation, and I'm leading the end-to-end implementation. In my previous project, we mainly used Kafka for streaming real-time SAP data into the analytics platform for a technology client.
But for the current banking sector client, we're setting up a 58-node cluster and reserving six nodes for Kafka because we have a lot of streaming use cases.
What do I think about the stability of the solution?
I would rate the stability of Apache Kafka a six out of ten. Due to the polling period and high data volumes, there is a catch-up problem: if there is a five-minute downtime, it can have a cascading effect.
When it comes to data availability, that is, how available the data is on the messaging queue, I would rate it a little lower because of the polling mechanism and data getting stuck when volumes are high. From the data availability perspective, I would rate it between six and seven. The major reason is the sheer volume of data that comes in day in and day out. If resources are not allocated correctly to the Kafka messaging queue, it sometimes gets stuck, and once it is stuck, it has a cascading effect on catching up to the real-time data. Only because of this issue do I rate it between six and seven.
However, from an overall data security perspective and ensuring that the data is consistent across the system, I would rate it around nine. If your PubSub model is written correctly, you can be assured that the data will not be lost. It will either be in the messaging queues or your landing tables or staging tables, but it will not be lost, at least if you have written it correctly.
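On the producer side, that "data is not lost" property rests on standard Kafka durability settings. The fragment below shows the usual knobs (librdkafka/Java-style property names); treat it as a starting point, not our exact production configuration.

```python
# Producer settings commonly used to back the no-data-loss guarantee:
# a record is either acknowledged by all in-sync replicas or the producer
# surfaces an error you can handle (e.g. park the record in a staging table).
DURABLE_PRODUCER_CONFIG = {
    "acks": "all",                   # wait for all in-sync replicas
    "enable.idempotence": True,      # retries cannot create duplicates
    "retries": 2147483647,           # retry transient broker errors
    "delivery.timeout.ms": 120000,   # fail loudly after two minutes
}
```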
What do I think about the scalability of the solution?
I would rate the scalability of Apache Kafka somewhere around seven out of ten. I am not going higher because a lot of manual work is involved in scaling Kafka. You have to estimate the overall capacity, accounting not just for the data but also for the other use cases running on the same cluster. In the cloud, it is easier, as you don't have to worry about the turnaround time of your cluster setup. But in an on-premise setup, you need to add more nodes, RAM, and storage based on the increasing data volume.
You also have to ensure that the streaming use cases running on Kafka do not impact other use cases, like batch processing or archival. Because of this manual estimation work, I will still keep it somewhere around six to seven. But regarding horizontal or vertical scalability from the data perspective, I feel more comfortable with Kafka than with other available streaming solutions.
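A rough version of that manual node estimation, with explicit headroom reserved so streaming cannot starve co-located batch or archival jobs, might look like this. The per-broker throughput figure is an assumption to be replaced by a benchmark:

```python
import math

def kafka_broker_count(peak_mb_s, broker_mb_s=50.0, headroom=0.30, min_brokers=3):
    """Brokers needed for a peak streaming load, keeping `headroom` of each
    broker's capacity free for other workloads on the shared cluster.
    broker_mb_s (sustained throughput per broker) is an assumed figure;
    min_brokers=3 allows a replication factor of 3."""
    usable_per_broker = broker_mb_s * (1 - headroom)
    return max(min_brokers, math.ceil(peak_mb_s / usable_per_broker))

print(kafka_broker_count(200))   # 6 brokers at the assumed 50 MB/s each
```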
How are customer service and support?
I have noticed a drastic improvement in the last five to seven months. Last year, the turnaround time for certain cases was around 14 days unless you escalated, and issues used to take two, three, or even four follow-ups. Even then, the solutions provided were often just resource-upgrade recommendations, which I felt were not always the best.
However, in the last month, I have seen Cloudera coming through with two-to-three-day turnaround times, even for low-severity issues. I am unsure whether it is region-specific, but I assume it is, as they have region-specific teams. Sometimes, however, you cannot depend on them: the solutions provided can amount to trial and error. When you work for bigger enterprises, you cannot always go with those, because there is a cost associated.
For example, in one of my recent cases, we implemented Ranger policies as a security setup on our cluster. We were stuck at some point and asked for technical support. They provided a solution that was just patchwork. When we did our own analysis and took it to the bank's security team for review, we found that their solution was inadequate. They told us to set up role-based access control through Ranger, where AD users should be synced with Ranger and access control policies should be set up. However, their solution covered only local Ranger-level policies for local Linux users. That is not a solution because, at the enterprise level, everything is integrated with and authenticated by AD.
I would rate Cloudera's support a five to six out of ten, although the turnaround time has improved since last year; we used to wait two to three days for critical solutions, but now it is much better. I worked for the US region in my previous project, and the turnaround time was not as expected there either. It all depends on the licensing: if you have a premium license from Cloudera, they assign a professional services resource to your project, and you get better support. If you do not, you have to go through the process of raising cases and waiting for the support team. Overall, it is not on par with the support for solutions like Azure, GCP, and others.
How would you rate customer service and support?
How was the initial setup?
The initial two months were for capacity estimation, where we worked with the client's different business teams to understand the data volumes and use cases. Then, the next four to five months went into procurement, where we had to work with infrastructure teams and vendors to understand the servers and networks required for the cluster.
The actual cluster setup took us two months, and it was a little longer due to a shortage of expertise on the client's networking team. We had to handle everything ourselves since it was an on-premise setup with physical servers and network connections. Currently, we are in the security review phase, and once it completes, we will start implementing various use cases like task and batch processing, archival, etc.
From my experience with Apache Kafka implementations and clusters, I would rate the setup somewhere between seven and eight out of ten.
What about the implementation team?
We deployed the solution on-premises because it is a Middle East client, which is where most of my admin experience over the last two years has been. Even if I move to the cloud, it will be much easier because I have seen cluster implementation from scratch and how it is done. I was involved from the very first stage: working with Cloudera on the sizing, then with the infrastructure and networking teams on the implementation and network structuring, and then setting up the cluster ourselves. I have a team of around seventeen people here. What we are implementing is Cloudera's private cloud-based solution, which is future-ready for the cloud.
Once cloud service providers such as GCP, AWS, and Azure enter the region, especially Saudi Arabia, our cluster will always be ready to be upgraded to the cloud, because it sits on a private cloud-based solution; we can add cloud-native hosts and nodes to our cluster at any time. At the same time, because it is a banking client, there are restrictions on the geographies where the data can reside. The physical cluster provides a future-ready solution: once GCP, Azure, and AWS set up their data centers in Saudi Arabia, we can keep some of our data nodes in the bank's data center while running other nodes or VMs in the cloud service provider's data center.
What was our ROI?
I have seen that the ROI is very good when implemented correctly and used for a period of time. I have seen, from a POC perspective, data getting churned in a couple of months, and the amount of insights generated was overwhelming.
I have also seen some critical decisions taken based on that data at an enterprise level, decisions which earlier used to take years. Because of the time it took, the intention behind those analytics used to lose its flavor. Now, people can make decisions in a few months based on these streaming analytics use cases through Kafka. And they can see that if those decisions had been taken earlier, they could have added four to five percent to their year-on-year profits.
However, too many things are involved because of the overall use case perspective, data perspective, underlying cluster sizing, and sourcing. One has to think from a holistic point of view, not just from a business point of view.
What's my experience with pricing, setup cost, and licensing?
I have experience in private cluster implementation. When you use Apache Kafka with Cloudera, the pricing is included in your Cloudera license, which is based on the number of nodes, storage cost, and other components; Kafka is one of the solutions offered under that license. Compared with the cloud, if you do not have a good volume of data and use cases, you will not realize the benefits, as the initial cost of setting up the cluster and bringing up the license can be as much as $760k for a small cluster of ten to twenty nodes. You need a sufficient volume of data and use cases, at least 20-30 GBs, before you can utilize and profit from the license. Kafka is just one piece of it.
When it comes to the cloud, the pricing is also at the solution level, so you can compare it at the Kafka level, but I don't have much information on that from where I am currently implementing the solution. We opted for this solution only after we did a cost-benefit analysis. We realized that by bringing in Cloudera along with Kafka, we would be able to replace two or three existing systems, including Teradata, Oracle, Informatica, and IBM DataStage. Only then were we able to realize the benefit for the bank; otherwise, Cloudera would be much more expensive, especially in the short term. With distributed computing, the concept of the Delta Lake is coming in, and RDBMS systems like Teradata will coexist with distributed systems like data lakes. Not all use cases will be solved, but cloud solutions like Azure come as a package, so you need not worry about maintaining different physical systems in your enterprise. That is where the cost-benefit analysis from a data perspective becomes very important.
At the end of the day, we bring in big data systems only when the data volumes are high. When the data volumes are low, the cost-benefit analysis can easily show that systems like Oracle or Teradata can run it just fine.
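That cost-benefit reasoning boils down to a payback-period calculation. The sketch below uses purely hypothetical figures, not actual Cloudera or Teradata pricing:

```python
def payback_years(upfront_cost, annual_license, annual_savings):
    """Years until cumulative savings (e.g. from retiring legacy licenses)
    cover the upfront cluster investment, net of the ongoing license cost."""
    net_annual = annual_savings - annual_license
    if net_annual <= 0:
        return float("inf")          # the platform never pays for itself
    return upfront_cost / net_annual

# Hypothetical: $760k cluster, $100k/yr license, $480k/yr of retired licenses
print(payback_years(760_000, 100_000, 480_000))  # 2.0 years
```

If the net annual savings are small or negative, as in the low-data-volume case above, the payback period goes to infinity and a traditional RDBMS remains the cheaper choice.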
What other advice do I have?
From an architecture and solution design perspective, before going for streaming solutions, you should analyze the data, including how fresh it actually is, and decide whether it is truly a streaming use case. Often, people assume something is a streaming use case, but when they perform analytics on top of it, they realize they cannot do a month-to-date or year-to-date analysis. So, it is essential to think again from the data-basics perspective before going to Kafka.
Overall, from the product and solution perspective, I would rate it a nine out of ten based on my personal experience with it.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer. Implementor