What is our primary use case?
I have been using Apache NiFi virtually daily, as it is part of my main responsibility in my current role.
My main use case for Apache NiFi is integrating various data sources and transforming the data before loading it, mostly into Elasticsearch, our NoSQL database, but sometimes into other databases as well.
We receive a large volume of logs generated by our AWS services. The company wants to collect them, particularly so that our security team can review them and confirm there is no abnormal behavior. We use Apache NiFi to pick up the logs delivered to many S3 buckets, decompress them, perform any necessary transformations, and send them to Elasticsearch, where end users, often from the network or security team, can analyze the data with Elasticsearch and Kibana.
My advice for others considering Apache NiFi is that you can run it on-premises if you are willing to, and it offers great customizability. While it is designed primarily for streaming data, it can also handle batch data. It is also useful for less conventional solutions, such as email notifications, which shows its flexibility beyond data orchestration and ETL.
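The decompress, transform, and load steps described above can be sketched outside NiFi in plain Python. This is only an illustration under stated assumptions: the logs are taken to be gzip-compressed, newline-delimited JSON, and the function name, the `eventTime` field, and the index name are hypothetical. The output follows the newline-delimited format that the Elasticsearch bulk API expects.

```python
import gzip
import json


def logs_to_bulk_payload(compressed: bytes, index: str) -> str:
    """Decompress gzip'd JSON-lines logs and build an Elasticsearch bulk body."""
    lines = []
    for raw in gzip.decompress(compressed).decode("utf-8").splitlines():
        if not raw.strip():
            continue  # skip blank lines between events
        event = json.loads(raw)
        # Hypothetical transformation: rename a timestamp field for Kibana.
        if "eventTime" in event:
            event["@timestamp"] = event.pop("eventTime")
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    # A bulk body is newline-delimited JSON and must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

In a real flow the returned string would be the request body sent to the cluster; the network call is left out here so the sketch stays self-contained.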
What is most valuable?
Apache NiFi offers great flexibility whether you want to be a low-code or a high-code user. There is a great selection of out-of-the-box processors that can do almost anything, and if you are willing to put together a complex web of them, you can do almost any data transformation you want. If you are a Python or Java developer, the latest versions of Apache NiFi also let you write your own custom processors in either language. That customizability has been a huge benefit, both for what Apache NiFi is specifically made to do and for less conventional solutions, such as an email notification system.
This kind of use was possible even before custom Python processors: you could run scripts, including Python, through the ExecuteScript processor. The latest version works even better because it runs native Python rather than a Jython-style hybrid. Those are some of the best things it offers.
The flexibility of Apache NiFi has helped me in my daily work. We still use standard Apache NiFi processors for most of our processes, but it is often much easier to combine transformation logic in a Python processor than to chain many standard processors together, since most of our team prefers Python. This lets us keep the logic in one place. We host our Python processors in a Git repository, which integrates very well with Apache NiFi, so we can manage the scripts and make changes efficiently. It suits the Python developer mindset shared across the team.
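To illustrate what "transformation logic in one place" can look like, here is a minimal sketch of a single Python function that does work which might otherwise take several chained processors. The function name, the field names, and the `log.source` attribute are all hypothetical, and the wrapper that would call this from a NiFi Python processor is deliberately not shown.

```python
import json


def transform_record(content: bytes) -> tuple[bytes, dict[str, str]]:
    """Apply several small transformation steps and return new content plus attributes."""
    record = json.loads(content)
    # Step 1: drop fields that carry no value.
    cleaned = {k: v for k, v in record.items() if v not in (None, "")}
    # Step 2: normalise key casing so downstream queries are consistent.
    cleaned = {k.lower(): v for k, v in cleaned.items()}
    # Step 3: derive an attribute that later routing could use.
    attributes = {"log.source": str(cleaned.get("source", "unknown"))}
    return json.dumps(cleaned, sort_keys=True).encode("utf-8"), attributes
```

Keeping the logic in a plain function like this also makes it easy to unit test from the Git repository without a running NiFi instance.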
Apache NiFi has positively impacted my organization, as its functionality and throughput have improved with each release over the past three years. One of the big tradeoffs with open source is that how well it functions depends largely on the user, but that also means you can adapt it to whatever custom use case you have. We have been able to consolidate several different authentication methods through Microsoft alone, and Apache NiFi has been helpful in facilitating that. Because it offers many ways of extracting data from different sources, we can develop specific solutions ourselves and integrate various data sources. Thanks to the open-source customizability, we can adapt Apache NiFi to the cluster we built and manage ourselves. This approach saves us significant costs compared to moving to a managed or cloud-hosted service.
Regarding cost savings, I do not have a precise figure, since the company was already using Apache NiFi when I joined, but I am certain comparisons were made against the ETL and data orchestration tools offered by cloud providers such as AWS or Azure. The savings must be significant, given that we handle terabytes to petabytes of data daily and affordable software at that scale is hard to find. The savings hold as long as we manage our own clusters and bugs effectively. The tradeoff is that managed services handle much of this for you and ensure uninterrupted service, but at a cost. With self-managed open-source software, we pay no licensing or service fees because we handle everything ourselves.
What needs improvement?
I believe Apache NiFi could be improved with easier monitoring solutions provided out of the box. While Apache NiFi has an API that exposes this information, it would be beneficial to have simpler access to that data saved historically. That would make it easy to retrieve data for historical analysis and store it elsewhere without the hassle of setting up API calls and delving into documentation. A more streamlined way of collecting this data would be greatly advantageous.
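As a rough idea of the do-it-yourself history keeping this implies, the sketch below appends timestamped status snapshots to a JSON-lines file and reads them back. It assumes the snapshot dictionary has already been retrieved from NiFi by some other means; the function names and the `queued_count` field are hypothetical and do not reflect the actual API response.

```python
import json
import time
from pathlib import Path
from typing import Optional


def append_snapshot(history_file: Path, snapshot: dict, now: Optional[float] = None) -> None:
    """Append one status snapshot, stamped with the time it was recorded."""
    entry = {"recorded_at": time.time() if now is None else now, **snapshot}
    with history_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def load_history(history_file: Path) -> list:
    """Read every recorded snapshot back, oldest first."""
    with history_file.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Run on a schedule, something like this gives a simple history to query later, which is the kind of plumbing a built-in option would make unnecessary.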
I would suggest continued improvement to the custom developer-built processors, as the errors that arise are often not useful. We struggle with a mix of implementing our own error handling and analyzing logs, because the information from the two does not always align or is unhelpful. Improvement here would be wonderful, so that we do not need to work out which error is more accurate or which report gets us nearer to the actual problem. For instance, I encountered a situation where flow files would not process: they were retried but kept returning to the queue in front of the Python processor with ambiguous errors. The cause turned out to be that the flow files were too large for the Python processor, which we only discovered by splitting them, at which point the issue resolved. The initial error gave no indication of a memory or size limitation; it appeared as a parsing error or something similar.
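The splitting workaround, plus the kind of explicit size error that would have saved time, can be sketched as follows. This is an illustration rather than the actual fix: it assumes line-delimited content, and the function name, the exception class, and the byte limit are hypothetical.

```python
class RecordTooLargeError(ValueError):
    """Raised when a single record alone exceeds the size limit."""


def split_by_size(content: bytes, max_bytes: int) -> list[bytes]:
    """Split line-delimited content into chunks of at most max_bytes each."""
    chunks, current, current_size = [], [], 0
    for line in content.splitlines(keepends=True):
        if len(line) > max_bytes:
            # Fail with a message that names the real cause: size.
            raise RecordTooLargeError(
                f"record of {len(line)} bytes exceeds limit of {max_bytes} bytes"
            )
        if current and current_size + len(line) > max_bytes:
            # The current chunk is full, so close it and start a new one.
            chunks.append(b"".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        chunks.append(b"".join(current))
    return chunks
```

Raising a dedicated, clearly worded exception is the point of the sketch: a size problem is reported as a size problem instead of surfacing later as a parsing failure.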
For how long have I used the solution?
I have been working in my current field for about three years.
What do I think about the stability of the solution?
Apache NiFi is now more stable than before.
What do I think about the scalability of the solution?
Apache NiFi's scalability is good. You can scale it up as long as you have the machines and servers available. If you have room for more instances, scaling up is fairly straightforward, provided you manage configurations effectively.
How are customer service and support?
Apache NiFi's customer support is good.
What was our ROI?
I have definitely seen a return on investment through time savings. We have gone from spending hours or days resolving issues to needing much less intervention. Thanks to improvements in how we run our processes and enhancements to Apache NiFi itself, we now barely need to interact with it except for minor queue-clearing tasks, and it runs smoothly. At this point, we have certainly saved hundreds of hours.
What other advice do I have?
As I mentioned before, the customizability of Apache NiFi helps even with unique use cases. There may be better applications for these jobs, but when you want to keep things simple and solve a one-off problem with a couple of processors, you can do that in Apache NiFi. For example, we have built several notification-type pipelines, such as one that reads from a SQL database to identify users who have not completed training and sends them an email reminder. That runs every week. Another flow scans logs for specific data and emails the relevant team when a particular unique identifier appears, so they can handle the situation.
I have encountered some odd behavior. For example, increasing the concurrent threads on a processor should work much like running several copies of that processor, yet the actual throughput differs. Spreading the work across several copies of a processor seems to yield better throughput than raising the concurrent threads on one processor, which is odd but is a workaround we had to adopt. Resolving such quirks could raise my rating further.
I rate Apache NiFi an eight out of ten. As open-source software, it always has room for improvement, but the savings and customizability it provides outweigh the effort of learning it, which ranks it pretty high. It is effective at what it does and continues to improve, and it could score higher with significant enhancements to custom-built processors and continued improvements in functionality.
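The training reminder flow can be approximated in a few lines of plain Python. This is a sketch under assumptions, not the NiFi flow itself: the `users` table, its column names, and the message wording are hypothetical, and the messages are only built here, not sent.

```python
import sqlite3
from email.message import EmailMessage


def build_training_reminders(conn: sqlite3.Connection, sender: str) -> list[EmailMessage]:
    """Build one reminder email per user who has not completed training."""
    rows = conn.execute(
        "SELECT name, email FROM users WHERE training_completed = 0 ORDER BY email"
    ).fetchall()
    messages = []
    for name, email in rows:
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = email
        msg["Subject"] = "Reminder: please complete your training"
        msg.set_content(
            f"Hi {name},\n\nOur records show your training is not yet complete.\n"
        )
        messages.append(msg)
    return messages
```

Separating the query and message building from delivery keeps the logic easy to test; the weekly schedule and the actual sending would be handled by whatever runs the flow.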
Which deployment model are you using for this solution?
On-premises
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)