What is our primary use case?
I am not just an end customer of Apache Spark; I build solutions on it, primarily data integration and data access solutions. Depending on the situation, these may need to align with other tools already present at customer sites, such as Informatica.
My main reason for using Apache Spark is data integration, and the two major use cases are data integration and data access.
I also use Apache Spark for event analysis.
Apache Spark, specifically PySpark and the tools available there, has been quite helpful in that work.
What is most valuable?
The in-memory computation feature is certainly helpful for my processing tasks.
When the structures involved can be held in memory for the duration of the computation rather than written to storage, I choose the in-memory option. There are limits on what can be held in memory, and those need to be addressed, but in-memory computation remains my preference.
The framework is also stable and well understood: its fundamentals do not change from day to day. That is very helpful in my prototyping work as an architect, where I assess solutions based on Apache Spark, Great Expectations, and Vault against those proposed by clients, such as TIBCO or Informatica.
What needs improvement?
Ease of use is the obvious area for improvement, though there are limits to how far that can go. Tools like Informatica, TIBCO, or Talend offer specific capabilities in this respect, but their licensing can be costly. I prefer to work with Apache Spark and the tools around it. That does not mean I am anti-tooling, but these will continue to be my technologies.
For how long have I used the solution?
I have been using Apache Spark for approximately four years.
What do I think about the stability of the solution?
We have certainly had some crashes, because each situation is different. The prototype in my own environment is stable, but we do not know everything about other customer sites, and we have not found a way to assess the health of the whole environment.
What do I think about the scalability of the solution?
As long as one knows how to scale the environment, scalability is not a problem at all and is not difficult to manage.
How are customer service and support?
I have had experience with technical support from Apache, mainly through newsgroups, which have been reasonably forthcoming when required.
It would be farcical to compare technical support for Apache Spark with what I would receive under an Informatica license. Still, I have received support through newsgroups and guidance in specific discussions, which is what I would expect in an open-source situation.
How was the initial setup?
Installation and deployment are becoming easier, and I have cleaned up my own process over a number of years to minimize pain, although installing a commercial tool might still be easier.
What other advice do I have?
API management is not an area of interest for me at the moment. I do not recall reading about API management or Enterprise Service Bus (ESB) products, and I am not using any solutions like that.
I am working much more in data science and data engineering.
I can discuss my experience with data engineering tools, and I am using some now.
I work primarily with open-source tools rather than commercial big data solutions like Informatica.
The open-source big data stack I use includes Apache Spark and Hive, with Vault for protection and Great Expectations for data quality. These are my primary tools, along with PostgreSQL for database support.
I have been using Apache Spark for quite a long time.
Real-time data analytics is an area of interest for me, but most of my work so far has not required it. It is increasingly something I am looking at seriously: rather than going down the Kafka path, I am evaluating whether the Apache Spark streaming API fits my use cases, though I do not know it very well yet.
In many of my experiments, the data set has had to be partitioned, and there have been issues handling very large data sets. Most of my work is done using Python machine learning libraries, which requires chunking. Speed of prediction has been a concern in some experiments: we have had to shut down processes because of CPU requirements and then restart with different Apache Spark configurations. If I had to name one constraint on running machine learning experiments, it would be resourcing.
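To illustrate the kind of chunking described above, here is a minimal sketch in plain Python. It is not the code from those experiments. The names iter_chunks, predict_in_chunks, and toy_model are hypothetical, and toy_model merely stands in for a real machine learning library's batch prediction call.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

Row = TypeVar("Row")
Pred = TypeVar("Pred")


def iter_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most chunk_size rows from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def predict_in_chunks(
    rows: Iterable[Row],
    predict_batch: Callable[[List[Row]], List[Pred]],
    chunk_size: int = 10_000,
) -> Iterator[Pred]:
    """Run predict_batch one chunk at a time, so only one chunk of input
    rows is materialized in memory at any moment."""
    for chunk in iter_chunks(rows, chunk_size):
        yield from predict_batch(chunk)


def toy_model(batch: List[int]) -> List[int]:
    """Hypothetical stand-in for a real model's batch predict call."""
    return [value * 2 for value in batch]


if __name__ == "__main__":
    predictions = list(predict_in_chunks(range(10), toy_model, chunk_size=4))
    print(predictions)
```

The chunk size is the main tuning knob in a sketch like this: smaller chunks lower peak memory use at the cost of more calls into the model.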
Apache Spark is basically free compared to other similar products like Informatica and Talend.
Sometimes, we have to use cloud resources due to insufficient on-premise resources for certain types of computing, so it is hybrid.
Our builds have all been based on Apache offerings in the Azure Marketplace.
I have worked with Hive but not really with Talend; I work mainly with Apache Spark and Hive.
In the past, Hive was more or less a default part of the process, but that is no longer the case.
I have given this review a rating of 9 out of 10.
Which deployment model are you using for this solution?
Hybrid Cloud