

Apache Spark and Apache NiFi compete in the big data processing and integration category. Apache Spark may have the upper hand due to its extensive in-memory computing and scalability capabilities, despite NiFi’s intuitive data flow management interface.
Features: Apache Spark is known for in-memory computing, enabling high-speed data processing and real-time analytics with Spark Streaming. It also provides extensive capabilities for machine learning with MLlib and efficient large-scale data analysis using Spark SQL. Apache NiFi is recognized for its user-friendly visual tools for designing data pipelines, real-time data integration, and comprehensive connectors that simplify diverse data flow management.
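As a rough illustration of the Spark side of that feature set, the sketch below caches a small dataset in memory and queries it with Spark SQL; the file name and column names are hypothetical, not taken from the reviews.

```python
# Minimal PySpark sketch: in-memory caching plus a Spark SQL query.
# "events.csv" and the "event_type" column are assumed for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-demo").getOrCreate()

# Read a CSV of events and cache it so repeated queries stay in memory.
events = spark.read.option("header", True).csv("events.csv").cache()

# Register the DataFrame as a view and query it with Spark SQL.
events.createOrReplaceTempView("events")
spark.sql(
    "SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type"
).show()

spark.stop()
```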
Room for Improvement: Apache Spark users desire enhancements in scalability and stability, improved documentation, and advanced monitoring tools. Additional stream processing capabilities and machine learning algorithms are also suggested. Apache NiFi users call for better stability, reduced operational complexity, enhanced integration features, and better JSON processing. Both could benefit from improved user interfaces and advanced alert systems.
Ease of Deployment and Customer Service: Apache Spark offers flexible deployment options in On-premises, Hybrid, and Public Cloud environments. Community support is vibrant but experiences vary, with better results seen using commercial support. Apache NiFi is praised for its visual pipeline management, with similarly flexible deployment options. Customer service is primarily community-driven, with some positive experiences from commercial support.
Pricing and ROI: Both Apache Spark and Apache NiFi are open-source, thus available without licensing fees, allowing cost-effective deployment. Apache Spark costs can rise with infrastructure needs, yet it promises high ROI through enhanced processing capacity. Apache NiFi, while free at its core, may incur costs in complex integration setups. Both provide substantial efficiency and cost savings over time.
Thanks to improvements both on our side in how we run processes and in Apache NiFi itself, we have reduced the time commitment to the point that we barely need to interact with Apache NiFi beyond minor queue-clearance tasks, and it runs smoothly.
It supports not just ETL but also ELT, allowing us to save significant time.
There may be a return on investment from the technology itself and from how easily we were able to move our workloads onto Apache NiFi from our previous system.
The customer support is really good, and they are helpful whenever concerns are posted, responding immediately.
Customer support for Apache NiFi has been excellent, with minimal response times whenever we raise cases that cannot be directly addressed by logs.
I would rate the customer support of Apache NiFi a 10 on a scale of 1 to 10.
I have received support via newsgroups or guidance on specific discussions, which is what I would expect in an open-source situation.
Depending on the workload we process, it remains stable since at the end of the day, it is just used as an orchestration tool that triggers the job while the heavy lifting is done on Spark servers.
Scaling up is fairly straightforward, provided you manage configurations effectively.
Based on the workload, more nodes can be added to grow the cluster whenever needed.
I have seen Apache NiFi crashing at times, which is one of the issues we have faced in production.
Apache NiFi is stable in most cases.
Apache Spark resolves many problems in the MapReduce solution and Hadoop, such as the inability to run effective Python or machine learning algorithms.
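To make that point concrete, here is a hedged sketch of the kind of Python machine-learning workload that is awkward to express in classic MapReduce but straightforward in Spark via MLlib; the tiny dataset and column names are made up for illustration.

```python
# Illustrative MLlib (pyspark.ml) linear regression on a toy in-memory dataset.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy rows of (f1, f2, label); in MapReduce this kind of iterative fitting
# would require chaining jobs and writing intermediate results to disk.
df = spark.createDataFrame(
    [(1.0, 2.0, 3.1), (2.0, 1.0, 2.9), (3.0, 4.0, 7.2), (4.0, 3.0, 6.8)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="label").fit(
    assembler.transform(df)
)
print(model.coefficients, model.intercept)

spark.stop()
```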
Without a doubt, we have had some crashes because each situation is different, and while the prototype in my environment is stable, we do not know everything at other customer sites.
Apache NiFi should have APIs or connectors that can connect seamlessly to other external entities, whether in the cloud or on-premises, creating a plug-and-play mechanism.
The history of processed files should be more readable so that not only the centralized teams managing Apache NiFi but also application people who are new to the platform can see how a specific document traverses Apache NiFi.
The initial error did not indicate it was related to memory or size limitations but appeared as a parsing error or something similar.
Various tools like Informatica, TIBCO, or Talend offer specific capabilities, but their licensing can be costly.
The pricing in Italy is considered a little bit high, but the product is worth it.
Apache NiFi has positively impacted my organization by bridging the gap between on-premises and cloud interaction until we find a solution to open the firewall so that cloud components can interact directly with on-premises services.
Development has improved, with reduced time spent being the main benefit: creating the ingestion flows used to take a matter of days, but now it only takes a couple of hours to configure.
The ease of use in Apache NiFi has helped my team because anyone can learn how to use it in a short amount of time, so we were able to get a lot of work done.
Not all solutions can process this data fast enough for it to be usable; solutions such as Apache Spark Structured Streaming can.
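For context on that remark, below is a minimal Structured Streaming sketch that reads lines from a local socket, aggregates them continuously, and prints the running counts; the host and port are placeholders, not values from the reviews.

```python
# Minimal Spark Structured Streaming example: continuous line counts from a
# socket source (e.g. started with `nc -lk 9999`). Host/port are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Count occurrences of each distinct line, updated as new data arrives.
counts = lines.groupBy("value").count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```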
The solution is beneficial in that it provides a stable, long-standing understanding of the framework that does not vary day by day, which is very helpful in my prototyping activity as an architect assessing Apache Spark, Great Expectations, and Vault-based solutions against those proposed by clients, such as TIBCO or Informatica.
| Product | Market Share (%) |
|---|---|
| Apache Spark | 11.2% |
| Apache NiFi | 9.5% |
| Other | 79.3% |

| Company Size | Count |
|---|---|
| Small Business | 5 |
| Midsize Enterprise | 1 |
| Large Enterprise | 18 |

| Company Size | Count |
|---|---|
| Small Business | 28 |
| Midsize Enterprise | 15 |
| Large Enterprise | 32 |
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
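As a small sketch of the working-set idea described above, the following illustrative PySpark snippet builds an RDD once, keeps it in memory, and runs two different map/reduce-style actions over it without any intermediate disk writes.

```python
# Illustrative RDD example: one cached dataset, multiple in-memory actions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD and cache it so later actions reuse the in-memory copy.
numbers = sc.parallelize(range(1, 101)).cache()

# Two separate actions over the same cached working set.
sum_of_squares = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
even_count = numbers.filter(lambda x: x % 2 == 0).count()
print(sum_of_squares, even_count)

spark.stop()
```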