What is our primary use case?
I've used it mostly for ETL. It's useful for creating data pipelines, streaming datasets, generating synthetic data, synchronizing data, and building data lakes, and loading and unloading data is fast and easy.
In my ETL work, I often move data from multiple sources into a data lake. Apache Spark is very helpful for picking up the latest data deliveries and automatically streaming them to the target database.
How has it helped my organization?
Apache Spark is a versatile technology, useful not only for building data solutions but also for generating data. This is especially valuable given GDPR regulations and limited access to production data, which make tasks like testing quite difficult. It helps with generating and aligning test data for both consumers and developers.
What is most valuable?
Apache Spark Streaming is particularly good at handling real-time data. It has built-in data streaming integration, which allows it to stream data from any source as soon as it becomes available.
What needs improvement?
The scalability features are already good, but they could be further enhanced. Additionally, the debugging aspect could use some improvement.
What do I think about the stability of the solution?
The stability is very good. Since everything runs as code, it's easy to understand what's happening under the hood. It's not a black box, which makes it quite transparent.
What do I think about the scalability of the solution?
On my team, there are about six or seven people using it. However, on the analytics side, where users view the reports, there are many more, perhaps over a hundred.
How was the initial setup?
The deployment process is quite easy and not very complicated.
Since it's an open-source technology, it can be deployed in various environments, including local machines and all kinds of clouds. If you're using the cloud, scaling is quite easy.
What about the implementation team?
If there are knowledgeable, experienced team members, it doesn't require a large team. One or two developers are enough.
What was our ROI?
It can handle large datasets and is relatively easy to manage, especially with cloud technologies. This means you can process a lot of data even with a low-configuration environment, which helps with cost savings.
What other advice do I have?
I would rate it a seven out of ten. Apache Spark's machine-learning capabilities are quite extensive and can be used in a low-code way, which can be much more efficient than stitching together several separate technologies. You can also combine its batch-processing capabilities with newer technologies and machine learning.
It's quite useful for AI because of its machine-learning capabilities, which allow for model training and output generation.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.