Mahdi Sharifmousavi - PeerSpot reviewer
Lecturer at Amirkabir University of Technology
Real User
Top 10
A scalable solution that can grow with the needs of a business and provides excellent functionality for analytical tasks
Pros and Cons
  • "This solution provides a clear and convenient syntax for our analytical tasks."
  • "This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed."

What is our primary use case?

We use this solution for its anti-money laundering and direct marketing features within a banking environment.

What is most valuable?

This solution provides a clear and convenient syntax for our analytical tasks.
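As a rough illustration of that syntax, here is a minimal PySpark sketch of the kind of analytical aggregation such a banking use case might involve; the path, table, and column names are hypothetical rather than taken from this deployment.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aml-analytics").getOrCreate()

# Hypothetical transactions table; path and column names are illustrative only.
transactions = spark.read.parquet("/data/transactions")

# Flag accounts whose daily transfer volume exceeds a simple threshold.
daily_totals = (
    transactions
    .groupBy("account_id", F.to_date("tx_time").alias("tx_date"))
    .agg(F.sum("amount").alias("daily_amount"))
    .filter(F.col("daily_amount") > 1000000)
)

daily_totals.show()
```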

What needs improvement?

This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed.

There is also limited Python compatibility, which should be improved.

For how long have I used the solution?

We have used this solution for around seven years, through several versions.


What do I think about the stability of the solution?

We have found this solution to be stable during our time using it.

What do I think about the scalability of the solution?

This is a very scalable solution from our experience.

What about the implementation team?

We implemented the solution using our in-house team, but the UI was developed by a third-party contractor.

What's my experience with pricing, setup cost, and licensing?

The deployment time of this solution is dependent on the requirements of an organization and the compatibility of the systems they will be using alongside this solution. We would recommend that these are clearly defined when designing the product for the business's needs.

What other advice do I have?

I would rate this solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Handles large data volumes and cloud and on-premises deployments, but difficult to use
Pros and Cons
  • "Apache Spark can do large volume interactive data analysis."
  • "Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn."

What is our primary use case?

The solution can be deployed on the cloud or on-premises.

How has it helped my organization?

We are using Apache Spark for large volume interactive data analysis.

MechBot is an enterprise-grade, one-click-installation, trusted data excellence platform. Underneath, I am using Apache Spark, Kafka, Hadoop HDFS, and Elasticsearch.

What is most valuable?

Apache Spark can do large volume interactive data analysis.
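A minimal sketch of what large-volume interactive analysis can look like, assuming the data already sits in HDFS as in the stack described above; the path, view name, and columns are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-analysis").getOrCreate()

# Placeholder HDFS path; adjust to the real dataset and schema.
events = spark.read.parquet("hdfs:///data/events")
events.createOrReplaceTempView("events")

# Interactive SQL over a large dataset, re-run and refined as the questions change.
spark.sql("""
    SELECT country, COUNT(*) AS purchases, AVG(amount) AS avg_amount
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY country
    ORDER BY purchases DESC
""").show(20)
```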

What needs improvement?

Apache Spark is very difficult to use and requires a data engineer. It is not accessible to every engineer today, because they need to understand the different concepts of Spark, which are very difficult and not easy to learn.

For how long have I used the solution?

I have been using Apache Spark for approximately 11 years.

What do I think about the stability of the solution?

The solution is stable.

What do I think about the scalability of the solution?

Apache Spark is scalable. However, it takes enormous technical skill to make it scale; it is not a simple task.

We have approximately 20 people using this solution.

How was the initial setup?

If you want to distribute Apache Spark in a certain way, it can be simple, but not every engineer can do it. Specialized DevOps skills for Spark are what is required.

Deploying the solution as a single-node laptop installation is very straightforward, but that is not what anyone would deploy at a production site.
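To illustrate the gap being described, here is a hedged sketch of the difference between a single-machine session and a production submission to a cluster manager; the resource figures are purely illustrative.

```python
from pyspark.sql import SparkSession

# Single-machine "laptop" mode: driver and executors share one local JVM.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("laptop-prototype")
    .getOrCreate()
)

# In production the same job is normally submitted to a cluster manager
# (YARN, Kubernetes, or standalone) and resources are sized deliberately, e.g.:
#   spark-submit --master yarn --deploy-mode cluster \
#       --num-executors 20 --executor-memory 8g --executor-cores 4 job.py
```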

What's my experience with pricing, setup cost, and licensing?

Since we are using the Apache Spark version, not the Databricks version, it is under the Apache license, and support and bug resolution are late or delayed. The Apache license is free.

What other advice do I have?

We are well versed in Spark, its versions, and its internal structure, and we know exactly what Spark is doing.

The solution cannot be made easier. Not everything can be simplified, because it involves core data and computer science engineering, and not many people are actually aware of it.

I rate Apache Spark a six out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Offers good machine learning, data learning, and Spark Analytics features
Pros and Cons
  • "The features we find most valuable are the machine learning, data learning, and Spark Analytics."
  • "We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data."

What is our primary use case?

We have built a product called "NetBot." We take any form of data, such as large email data, images, videos, or transactional data, transform unstructured textual and video data into a structured, transactional form, and create an enterprise-wide smart data grid. That smart data grid is used by downstream analytics tools. We also provide model building so people can get faster insight into their data.

What is most valuable?

We use all the features. We use it for end-to-end. All of our data analysis and execution happens through Spark.

The features we find most valuable are:

  • Machine learning
  • Data learning
  • Spark Analytics.
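As a concrete illustration of the machine-learning side, here is a minimal sketch using Spark's spark.ml pipeline API with an invented feature set; it shows the shape of a typical workflow rather than any model actually in use here.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# Hypothetical training data with numeric feature columns and a binary label.
train = spark.read.parquet("/data/train")

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
predictions = model.transform(train)
predictions.select("label", "prediction", "probability").show(5)
```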

What needs improvement?

We've had problems using a Python process to access a large volume of data. It crashes if somebody provides the wrong code, because it cannot handle a large volume of data.
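One common cause of exactly this kind of crash, sketched below with assumed names, is pulling a large distributed result back into the single Python driver process; keeping the aggregation on the executors and collecting only a bounded result avoids it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-volume").getOrCreate()
df = spark.read.parquet("/data/big_table")  # placeholder path

# Risky: materializes the whole distributed dataset in the driver's memory,
# a typical way a Python process dies on large volumes.
# rows = df.toPandas()

# Safer: keep the aggregation distributed and only bring back a bounded result.
small_result = df.groupBy("key").count().limit(1000).toPandas()
print(small_result.head())
```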

For how long have I used the solution?

I have been using Apache Spark for more than five years. 

What do I think about the stability of the solution?

We haven't had any issues with stability so far. 

What do I think about the scalability of the solution?

As long as you do it correctly, it is scalable.

Our users mostly consist of data analysts, engineers, data scientists, and DB admins.

Which solution did I use previously and why did I switch?

Before using this solution, we used Apache Storm.

How was the initial setup?

The initial setup is complex. 

What about the implementation team?

We installed it ourselves. 

What other advice do I have?

I would rate it a nine out of ten. 

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
PeerSpot user
CEO International Business at a tech services company with 1,001-5,000 employees
MSP
Top 5
A powerful open-source framework for fast, flexible, and versatile big data processing, with a steep learning curve and resource demands
Pros and Cons
  • "The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations."
  • "It requires overcoming a significant learning curve due to its robust and feature-rich nature."

What is our primary use case?

In AI deployment, a key step is aggregating data from various sources, such as customer websites, debt records, and asset information. Apache Spark plays a vital role in this process, efficiently handling continuous streams of data. Its capability enables seamless gathering and feeding of diverse data into the system, facilitating effective processing and analysis for generating alerts and insights, particularly in scenarios like banking. 

What is most valuable?

The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations.
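As an illustration of that streaming capability, here is a hedged Structured Streaming sketch that reads a transaction stream from Kafka and emits simple threshold alerts; the broker, topic, payload fields, and threshold are all assumptions, and the Kafka connector package has to be on the classpath.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-alerts").getOrCreate()

# Hypothetical Kafka source; broker address and topic are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)

# Parse the JSON payload and raise simple threshold "alerts" to the console.
parsed = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.account").alias("account"),
    F.get_json_object(F.col("value").cast("string"), "$.amount").cast("double").alias("amount"),
)

alerts = parsed.filter(F.col("amount") > 100000)

query = alerts.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```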

What needs improvement?

It requires overcoming a significant learning curve due to its robust and feature-rich nature.

For how long have I used the solution?

We have been using it for two years now.

What do I think about the stability of the solution?

It provides excellent stability. We never faced any issues with it.

What do I think about the scalability of the solution?

It ensures outstanding scalability capabilities.

Which solution did I use previously and why did I switch?

Opting for Apache Spark, an open-source solution, provides a distinct advantage by offering control over the code. This means you can identify issues, make necessary fixes, and determine what aspects to accept as they are. In contrast, dealing with a vendor may limit control, requiring you to submit requests and advocate for changes based on your business volume with them. This dependency on volume can potentially compromise control. To safeguard both your customers and your business, the choice of an open-source solution like Apache Spark allows for more autonomy and control over the technology stack.

What about the implementation team?

The system's smooth operation relies on deploying a comprehensive container with Kubernetes clusters, configured with essential toolsets. Instrumentation data from the backend is fed back to a central framework equipped with specific tools for driving various processes. In a case involving a customer with Red Hat and Postini clusters, the OpenShift Container Platform, comprising Kubernetes clusters, is used. The tools manage onboarding, infrastructure provisioning, certificate management, authorization control, etc. The deployment spans multiple independent data centers, like telecom circles in India, requiring unique approaches for various tasks, including disaster recovery planning and central alerting, facilitated through SaaS. The deployment process typically takes approximately forty to forty-five days for six thousand servers.

What was our ROI?

It provides a dual advantage by saving both time and money while enhancing performance, particularly by leveraging my skill sets. 

What's my experience with pricing, setup cost, and licensing?

It is an open-source solution, so it is free of charge.

What other advice do I have?

I would give it a rating of seven out of ten, which, by my standards, is quite high.

Which deployment model are you using for this solution?

Private Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Director - Data Management, Governance and Quality at Hilton Worldwide
Real User
Powerful language but complicated coding

What is our primary use case?

Ingesting billions of rows of data all day.

How has it helped my organization?

Spark on AWS is not that cost-effective, as memory is expensive and you cannot customize hardware in AWS. If you want more memory, you have to pay for more CPUs too in AWS.

What is most valuable?

Powerful language.

What needs improvement?

It is like going back to the '80s, given the complicated coding that is required to write efficient programs.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user365304 - PeerSpot reviewer
Software Consultant at a tech services company with 10,001+ employees
Real User
It provides large-scale data processing with negligible latency at the cost of commodity hardware.

Valuable Features:

The most important feature of Apache Spark is that it provides large-scale data processing with negligible latency at the cost of commodity hardware. The Spark framework is a blessing compared to Hadoop, as the latter does not allow fast processing of data, which Spark accomplishes through in-memory data processing.
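A minimal sketch of the in-memory behaviour being described: caching a dataset so that repeated passes reuse memory instead of re-reading from disk, which is where Spark gains over disk-bound MapReduce. The path and columns are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Illustrative dataset; the first action reads it from storage.
df = spark.read.parquet("/data/large_dataset").cache()

# Several passes over the same data; after the first action the cached
# in-memory copy is reused instead of re-reading from disk each time.
total_rows = df.count()
by_region = df.groupBy("region").count().collect()
avg_value = df.agg(F.avg("value")).first()[0]
```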

Improvements to My Organization:

Apache Spark is a framework that allows an organization to perform business and data analytics at a very low cost compared to Ab-Initio or Informatica. By using Apache Spark in place of those tools, an organization can achieve a huge reduction in cost without compromising on data security or other data-related issues, provided it is controlled by an expert Scala programmer, and Apache Spark does not bear Hadoop's overhead of high latency. My organization has benefited from all of these points as well.

Room for Improvement:

The question of improvement always comes to developers' minds. As with the most common developer request, if a user-friendly GUI with a 'drag & drop' feature could be added to this framework, it would be easier to use.

Another thing to mention: there is always room for improvement in terms of memory usage. If, in the future, it becomes possible to use less memory for processing, that would obviously be better.

Deployment Issues:

We've had no issues with deployment.

Stability Issues:

See above regarding memory usage.

Scalability Issues:

We've had no issues with scalability.

Other Advice:

My advice to others is to use Apache Spark for large-scale data processing, as it provides good performance at low cost, unlike Ab-Initio or Informatica. The main problem is that there are not many people in the market certified in Apache Spark.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user371334 - PeerSpot reviewer
CEO at a tech consulting company with 51-200 employees
Consultant

The drag and drop GUI comment is very true. We developed such a GUI for spatial and time series data in Spark. But there are other tools out there. Maybe you should do a review of such tools.

Salvatore Campana - PeerSpot reviewer
CEO & Founder at XAUTOMATA TECHNOLOGY GmbH
Real User
Top 10
Reduces startup time and gives excellent ROI
Pros and Cons
  • "Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term."
  • "The initial setup was not easy."

What is our primary use case?

I use Spark to run automation processes driven by data.

How has it helped my organization?

Apache Spark helped us with horizontal scalability and cost optimizations.

What is most valuable?

The most valuable feature is the grid computing.

What needs improvement?

An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to be able to change this parameter in real-time rather than having to stop the solution and restart with a higher number of nodes.
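A partial workaround, sketched here with illustrative values, is Spark's dynamic allocation, which grows and shrinks the number of executors between configured bounds while the application runs; it does not, however, let you raise the configured maximum without a restart, which is the limitation described above.

```python
from pyspark.sql import SparkSession

# With dynamic allocation, executors are requested and released at runtime
# between the configured minimum and maximum, so sizing is less static.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```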

For how long have I used the solution?

I've been using Spark for around four years.

How was the initial setup?

The initial setup was not easy, but we created a means of asking the user about their needs, making the setup much easier. We can now deploy the platform in thirty minutes using the public cloud or Kubernetes space.

What was our ROI?

Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term.

What's my experience with pricing, setup cost, and licensing?

Spark is an open-source solution, so there are no licensing costs. 

What other advice do I have?

I would rate Apache Spark eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user371832 - PeerSpot reviewer
Chief System Architect at a marketing services firm with 501-1,000 employees
Vendor
Spark gives us the ability to run queries on a MySQL database without putting pressure on our database

What is most valuable?

With Spark SQL, we now have the capability to analyse very large quantities of data located in S3 on Amazon at a very low cost compared to the other solutions we checked.

We also use our own Spark cluster to aggregate data in near real time and save the results to a MySQL database.
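A compact sketch of that flow under stated assumptions: read from S3, aggregate, and write the result to MySQL over JDBC. The bucket, columns, connection details, and credentials are placeholders, and the s3a connector and MySQL JDBC driver need to be available on the cluster.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-to-mysql").getOrCreate()

# Placeholder S3 location; requires the hadoop-aws (s3a) connector.
raw = spark.read.parquet("s3a://example-bucket/events/")

hourly = (
    raw.groupBy(F.date_trunc("hour", "event_time").alias("hour"), "campaign")
    .agg(F.count("*").alias("events"))
)

# Placeholder JDBC connection details; the MySQL driver jar must be on the classpath.
(hourly.write
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/analytics")
    .option("dbtable", "hourly_events")
    .option("user", "report_user")
    .option("password", "***")
    .mode("append")
    .save())
```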

We've started new projects using the machine learning library ML.

How has it helped my organization?

Until Spark, we didn't have the ability to analyse this quantity of data; we're talking about two TB/hour. We're now able to produce a lot of reports, and are also able to develop machine-learning-based analysis to optimize our business.

We have central access to every piece of data in the company, including finance, business, debug, etc., and the ability to join all this data together.

What needs improvement?

Spark is actually very good for batch analysis, much better than Hadoop: it's much simpler, much quicker, etc. However, it lacks the ability to perform real-time querying like Vertica or Redshift.

Also, it is more difficult for an end user to work with Spark than with a normal database, even compared with analytic databases like Vertica or Redshift.

For how long have I used the solution?

We have now been using Spark Streaming and Spark SQL for almost two years.

What was my experience with deployment of the solution?

We're working on AWS, so we need to have a managed environment. We chose to go with a solution based on Chef to deploy and configure the Spark clusters. Tip: if you don't have any DevOps, you can use the EC2 script (provided by the Spark distribution) to deploy a cluster on Amazon. We've tested it and it works perfectly.

What do I think about the stability of the solution?

Spark Streaming is difficult to stabilize, as you are always dependent on your stream flow. If you start to fall behind on the consumer side, you have a serious problem. We encountered a lot of stability issues configuring it as expected.
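Two settings commonly used to tame this in the DStream-based API are backpressure and a per-partition rate cap; this is a sketch with illustrative values that would need tuning against the real stream flow, not the configuration actually used here.

```python
from pyspark import SparkConf

# Illustrative settings for the DStream-based Spark Streaming API: enable
# backpressure so the ingestion rate adapts when batches start falling behind,
# and cap the per-partition read rate from Kafka as a hard safety limit.
conf = (
    SparkConf()
    .setAppName("streaming-stability-sketch")
    .set("spark.streaming.backpressure.enabled", "true")
    .set("spark.streaming.kafka.maxRatePerPartition", "10000")
)
```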

What do I think about the scalability of the solution?

Scalability is linked to stability; in our case, it takes time to evaluate the correct size of the cluster you need. It's very important to always add monitoring to your jobs to be able to understand what the problem is. We use Datadog as our monitoring platform.

Which solution did I use previously and why did I switch?

Yes, to do this job we used a MySQL database. We switched because MySQL is not a scalable solution and we had reached its limits.

How was the initial setup?

Setting up a Spark cluster can be difficult; it depends on your clustering strategy. There are at least four solutions (how each one is selected in code is sketched below):

EC2 script: works only on Amazon AWS

Standalone: manual configuration (hard)

YARN: to leverage your already existing Hadoop environment

Mesos: to use with your other Mesos-ready applications
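A hedged sketch of how those four options surface in code: each cluster manager is selected through the master URL (or the equivalent spark-submit flag). Hostnames and ports are placeholders.

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-strategy-sketch")

# Standalone cluster: point at the Spark master you configured manually.
# spark = builder.master("spark://master-host:7077").getOrCreate()

# YARN: reuse an existing Hadoop cluster (often via spark-submit --master yarn).
# spark = builder.master("yarn").getOrCreate()

# Mesos: join an existing Mesos deployment.
# spark = builder.master("mesos://mesos-master:5050").getOrCreate()

# The EC2 script provisions a standalone cluster on AWS, so the first form applies.
spark = builder.master("spark://master-host:7077").getOrCreate()
```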

What about the implementation team?

We use Databricks for online ad hoc DB queries. It works on AWS as a managed service, and it manages cluster creation, configuration, and monitoring for you.

It gives a notebook-oriented user interface to query any data source using Spark: DB, Parquet, CSV, Avro, etc.

Which other solutions did I evaluate?

Yes, we started evaluating analytics databases: Vertica, Exasol, and others. For all of them, the price was an issue given the quantity of data we want to manipulate.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user