What is our primary use case?
We used Spark, Spark Streaming, and Spark ML for multiple use cases, particularly streaming IoT-related data. We also applied various Spark ML machine learning algorithms to the streaming data. Most of this work was in the healthcare domain.
What is most valuable?
With Spark Streaming, there was native Python support, which was beneficial for us. It was easy to deploy as a cluster, and the website was user-friendly. The documentation was also pretty good, and there was strong community support. Overall, it was considered an industry standard at the time.
What needs improvement?
In terms of disadvantages, it was a bit cumbersome due to its size. It wasn't quite cloud-native back then, meaning it wasn't easy to deploy in a Kubernetes cluster or similar environments. I found that a bit challenging, but I'm not sure if that's still the case. It probably has better support now.
Our deployment was on-premises. When we wanted to migrate it to the cloud, especially to Kubernetes, I remember facing some difficulties in migrating the system successfully.
For how long have I used the solution?
I explored it as part of a pilot project some time ago. We were using Spark Streaming, and I explored Pulsar as a replacement for Spark Streaming for that use case. Overall, I've used Spark Streaming for around five years.
What do I think about the stability of the solution?
What do I think about the scalability of the solution?
Scalability is pretty good. However, I must mention that I haven't tested it extensively with large-scale production scenarios. The testing I conducted was more of a pilot nature, and the scale was not very high. But based on what I've read, scalability shouldn't be an issue.
In the pilot project, there were around a thousand users. I didn't encounter any issues while scaling to that level.
How are customer service and support?
I mainly relied on the documentation and community support, which were sufficient whenever I needed help. I never contacted Apache directly for support.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
I previously used Apache Storm. Spark Streaming was more widely used and had better documentation. It had frequent releases and active development, whereas Storm had limited language support and stopped active development at some point. Spark Streaming also had top-level consultant support, which was beneficial for the team I was working with. That's why I made the switch.
How was the initial setup?
It was easy to install. I didn't find any difficulties while installing and trying it out, at least on a smaller scale.
Apache Spark Streaming was straightforward in terms of maintenance. It was actively developed, and migrating from an older to a newer version was quite simple. That was the main aspect of maintenance, and overall, it was a straightforward process. The documentation was good, and there was good community support. So I didn't face any problems while deploying and maintaining the solution.
What's my experience with pricing, setup cost, and licensing?
I was using the open-source community version, which was self-hosted. I'm not familiar with the pricing of the commercial version.
Which other solutions did I evaluate?
I had previously used Apache Storm, which is an open-source solution. I later switched to Spark Streaming and also tried Pulsar for similar use cases in the healthcare domain.
What other advice do I have?
I would highly recommend Spark Streaming for standard streaming or IoT use cases. The entire Spark ecosystem, including Spark Core, streaming, ML, and other components, can be highly beneficial. It's better to stick with the Spark ecosystem rather than use other platforms and frameworks. For streaming and IoT, Spark Streaming is a great choice.
Overall, I would rate the solution an eight out of ten. The only issue I found, at least during the time I actively worked with it, was that it was resource-intensive, even for small-scale applications. In comparison, some other platforms, like Pulsar, had lighter resource consumption and, at least to begin with, performed better in terms of resource usage and the associated cost. Spark Streaming is a bit heavy and resource-intensive to start with, which is why I rate it an eight.
Which deployment model are you using for this solution?
On-premises
*Disclosure: I am a real user, and this review is based on my own experience and opinions.