The only issue I faced with the tool was that I had to choose the compute device to support parallel processing, and it should behave more like horizontal scaling. The tool should be more scalable, not in terms of adding CPU to a single machine, but in terms of adding units: if two units are not enough, a third or fourth unit should be able to come into the picture.
From my perspective, the only thing that needs improvement is the interface, as it is not easily understandable. Sometimes I get an RDD-related error, and it becomes difficult to understand where things went wrong. When I work with datasets in Python using the Pandas library, I can apply a function to each column and get a transformed column back. When I try to do the same thing with Apache Spark, it works, but it is not as straightforward; I have to approach it a little differently, and even then, it sometimes throws an error saying the operation loops back on itself, which is the kind of error I never got in Pandas.
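To illustrate the kind of difference I am describing, here is a minimal sketch; the DataFrame and column names are made up for the example. The Pandas version applies a function straight to a column, while the Spark version goes through column expressions and only executes once an action is called.

import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Pandas: apply a transformation directly to a column.
pdf = pd.DataFrame({"price": [100.0, 250.0, 80.0]})
pdf["price_with_tax"] = pdf["price"].apply(lambda p: p * 1.18)

# PySpark: the same idea, but expressed through column expressions,
# and nothing runs until an action such as show() is called.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf[["price"]])
sdf = sdf.withColumn("price_with_tax", F.col("price") * 1.18)
sdf.show()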
In future updates, the tool should be made more user-friendly. I want to run fifty parallel processes rather than one, and I want to pick particular columns for the data to be partitioned by, so if the tool offers that kind of clarity and flexibility, that will be good.
I have been using Apache Spark for four years.
Stability-wise, I rate the solution a nine out of ten. The only issues with the tool revolve around user interaction and user flexibility.
It is a scalable solution. Scalability-wise, I rate the solution an eight out of ten.
Around five people in my company use the tool.
The solution's technical support is helpful. Some of the problems I faced were generic issues, and for those I get answers mainly from forums where the problem has already been resolved. For non-generic issues, I get help from the tool's team. When it comes to an unknown problem or something specific to my work, the support takes time. I rate the technical support a seven out of ten.
I only work with Apache Spark.
The product's initial setup phase was easy.
I managed the product's installation phase, both locally and on the cloud.
The solution is deployed on-premises.
The solution can be deployed in two to three hours.
Apache Spark has helped save 50 percent of our operational costs. Processing time was reduced with the tool, although the compute costs increased. Overall, I can see that the tool's use has led to a 50 percent reduction in costs.
I did not pay anything for the tool itself when using it on cloud services, but I had to pay for the compute. The tool is not expensive compared with the benefits it offers. I rate the price an eight out of ten.
Previously, I was more of a Python full-stack developer, and I was happy dealing with PySpark libraries, which gave me an edge in continuing the work with Apache Spark.
Speaking about Apache Spark's use in our company's data processing workflows: when we dealt with large datasets without Spark, working on a data frame with one year of data used to take me 45 minutes to an hour, and sometimes I would get out-of-memory errors. Those issues were avoided the moment I started using Apache Spark, as I was able to get the whole processing done in less than five minutes, with no memory issues.
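As a rough sketch of what that looks like on the Spark side (the paths and column names here are made up for the example), the year of data is read and aggregated across the cluster instead of being pulled into a single machine's memory:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("yearly-processing").getOrCreate()

# Read a year of data; Spark splits the files across executors
# instead of loading everything into one machine's memory.
df = spark.read.parquet("/data/events/year=2023/")

# Transformations are lazy; the aggregation only runs when an action
# (write, show, count) is triggered, and it runs in parallel.
daily = df.groupBy("event_date").agg(F.count("*").alias("events"))
daily.write.mode("overwrite").parquet("/data/summaries/2023/")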
For big data processing, the tool's parallel processing and the time savings are the areas that have been most helpful. When I apply a function, I can write the code once and have it run across the data. Basically, I use Apache Spark to forecast multiple units at the same time; without it, I would be doing that one by one as serial processing, which used to take me around five hours. With Apache Spark, the computation happens in parallel, and the overall time is cut down by at least 90 percent. It helps me significantly reduce the time needed for operations.
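This is roughly the pattern I mean, sketched with made-up data and a placeholder in place of the real forecasting model: grouping by unit and letting Spark run each unit's forecast as a separate parallel task rather than looping over units one by one.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per unit per period with a sales figure.
sdf = spark.createDataFrame(
    [("unit_a", 10.0), ("unit_a", 12.0), ("unit_b", 7.0), ("unit_b", 9.0)],
    ["unit_id", "sales"],
)

def forecast_unit(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a real forecasting model: here the mean of the
    # unit's history is returned as the "forecast".
    return pd.DataFrame(
        {"unit_id": [pdf["unit_id"].iloc[0]], "forecast": [pdf["sales"].mean()]}
    )

# Each unit's group is handed to forecast_unit as its own task,
# so all units are forecast in parallel instead of serially.
forecasts = sdf.groupBy("unit_id").applyInPandas(
    forecast_unit, schema="unit_id string, forecast double"
)
forecasts.show()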
The tool's real-time processing is an area that I have not tried to use much. When it comes to real-time processing of my data, I use Kafka.
I am handling data governance using Databricks Unity Catalog.
When I apply an ML model, I am unable to run it directly on a table partitioned by a particular column, so it forces me to get the job done with a reduced number of partitions. If I go down to five partitions, I get at least three to four times the benefit in much less time.
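For context, this is the kind of repartitioning step I am describing, with hypothetical table and column names: the data is brought down to a handful of partitions, keyed by the column I care about, before the model step runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical feature table partitioned by a business column.
features = spark.read.parquet("/data/features/")

# Before the model step, reduce the number of partitions so each
# task works on a reasonably sized chunk of data.
features_small = features.repartition(5, "unit_id")

print(features_small.rdd.getNumPartitions())  # 5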
Regular maintenance is needed, but it is not as if I have to sit down week by week to apply a patch or upgrade. Maintenance is done roughly every six months to a year.
I take care of the tool's maintenance.
I recommend the tool to others.
I rate the tool an eight out of ten.