Try our new research platform with insights from 80,000+ expert users
it_user371832 - PeerSpot reviewer
Chief System Architect at a marketing services firm with 501-1,000 employees
Vendor
Spark gives us the ability to run queries on MySQL database without pressurising our database

What is most valuable?

With spark SQL we've now the capabilities to analyse very large quantities of data located in S3 on Amazon at very low cost comparing other solution we checked. 

We also use our own Spark cluster to aggregate data on near real time and save the result on MySQL database. 

We've started new projects using the machine learning library ML.

How has it helped my organization?

Until Spark we didn't have the ability to analyse this quantity of data we're talking about two TB/hour. So we're now able to produce a lot of reports, and are also able to develop machine learning based analysis to optimize our business. 

We've central access to every piece of data in the company including finance, business, debug etc. and the ability to join all this data together.

What needs improvement?

Spark is actually very good for batch analysis much more good than Hadoop, it's much simple, much more quicker etc., but it actually lacks the ability to perform real-time querying like Vertica or Redshift.  

Also, it is more difficult for an end user to work with Spark than normal database. even comparing with analytic database like Vertica or Redshift.

For how long have I used the solution?

We're now using Spark-Streaming and Spark-SQL for almost 2 years. 

Buyer's Guide
Apache Spark
June 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: June 2025.
856,873 professionals have used our research since 2012.

What was my experience with deployment of the solution?

We're working on AWS so we need to have a managed environment. We've choose to go with a solution based on Chef to deploy and configure the spark clusters. Tip : if you don't have any devops you can use the ec2 script (provided by spark distro) to deploy cluster on amazon. We've tested it and work perfectly.  

What do I think about the stability of the solution?

Spark Streaming is difficult to stabilize as you're always dependant to your stream flow. If you start to be late on the consumer you've a serious problem. We've encountered a lot of stability issue to configure it as expected

What do I think about the scalability of the solution?

It's linked to stability in our case it's takes time to evaluate what is the correct size of the cluster you need. It's very important to always add to you jobs monitoring to be able to understand what's the problem. We use datadog as monitoring platform

Which solution did I use previously and why did I switch?

Yes to make this job we've used a MySQL database. We switch because MySQL is not a scalable solution and we've reach it's limits.

How was the initial setup?

Setup a spark cluster can be difficult. it's related to your clustering strategy. There is 4 solution at least. 

ec2 script : work only on Amazon AWS

Standalone : manually configuration (hard)

Yarn : to leverage your already existing Hadoop environment.

Mesos : to use with your other Mesos ready application

What about the implementation team?

We use Databricks as online DB ad hoc query. It's work on AWS as managed service, it manage for you the cluster creation, configuration and monitoring.

Give a notebook oriented user interface to query any data source using Spark: DB, Parquet, CSV, Avro etc...

Which other solutions did I evaluate?

Yes we've started to evaluate analytics databases : vertica, exasol, and other for all the them the price was an issue regarding the quantity of data we want to manipulate.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user365304 - PeerSpot reviewer
Software Consultant at a tech services company with 10,001+ employees
Real User
It provides large scale data processing with negligible latency at the cost of commodity hardwares.

Valuable Features:

The most important feature of Apache Spark is that it provides large scale data processing with negligible latency at the cost of commodity hardwares. Spark framework is just a blessings over Hadoop, as the later does not allow fast processing of data, which is accomplished by the in-memory data processing of Spark.

Improvements to My Organization:

Apache Spark is a framework, which allows one organization to perform business & data analytics, at a very low cost, as compared to Ab-Initio or Informatica. Thus, by using Apache Spark in place of those tools, one organization can achieve huge reduction in cost, & without compromising with any data security & other data related issues, if controlled by an expert Scala programmer  & Apache Spark does not bear the overheads of Hadoop of having high latency. All these points, by which my organization is being benefitted as well.

Room for Improvement:

Question of improvement always comes to mind of the developers. Just like the most common need of the developers, if a user-friendly GUI along with 'drag & drop' feature can be attached to this framework, then it would be easier to access it.

Another thing to mention, there always is a place for improvement in terms of the memory usage. If in future, it is achievable to use less memory for processing, it would obviously be better.

Deployment Issues:

We've had no issues with deployment.

Stability Issues:

See above regarding memory usage.

Scalability Issues:

We've had no issues with scalability.

Other Advice:

My advice to others would be just to use Apache Spark for large scale data processing, as it provides good performance at low cost, unlike Ab-Initio or Informatica. But the main problem is, now in the market, there are not many people certified in Apache Spark.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user371334 - PeerSpot reviewer
it_user371334CEO at a tech consulting company with 51-200 employees
Consultant

The drag and drop GUI comment is very true. We developed such a GUI for spatial and time series data in Spark. But there are other tools out there. Maybe you should do a review of such tools.

Buyer's Guide
Apache Spark
June 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: June 2025.
856,873 professionals have used our research since 2012.
it_user374028 - PeerSpot reviewer
Core Engine Engineer at a computer software company with 51-200 employees
Real User
It makes web-based queries for plotting data easier. It needs to be simpler to use the machine learning algorithms supported by Octave.

Valuable Features

  • RDDs
  • DataFrames
  • Machine learning libraries

Improvements to My Organization

Faster time to parse and compute data. It makes web-based queries for plotting data easier.

Room for Improvement

It needs to be simpler to use the machine learning algorithms supported by Octave (example polynomial regressions, polynomial interpolation).

Use of Solution

I've been using it for one year.

Deployment Issues

There have been no issues with the deployment.

Stability Issues

There have been no issues with the stability.

Scalability Issues

There have been no issues with the scalability.

Customer Service and Technical Support

We still rely on user forums for my answers. We do not use a commercial product yet.

Initial Setup

The initial set-up was easy. I have not explored using this on AWS clusters.

Implementation Team

We did an in-house implementation and development for our regression tool.

ROI

The ROI will be an in-house product to do machine learning analytics on data obtained from customer.

Other Solutions Considered

We did not evaluate any other products.

Other Advice

It's easy to use and has a learning curve.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user374040 - PeerSpot reviewer
Systems Engineering Lead, Mid-Atlantic at a tech company with 10,001+ employees
Vendor
It allows you to construct event-driven information systems.

Valuable Features

Spark Streaming, which allows you to construct event-driven information systems and respond to the events in near-real time.

Improvements to My Organization

Apache Spark’s ability to perform batch processing at one second or less intervals is the most transformative and less pervasive for any data processing application. The ingested data can also be validated and verified for quality early in the data pipeline.

Room for Improvement

Apache Spark as a data processing engine has come a long way since its inception. Although you are able to perform complex transformations using Spark libraries, the support for SQL to perform transformations is still limited. You can alleviate some of these limitations by running Spark within Hadoop ecosystem and by leveraging the fairly evolved HiveQL.

Use of Solution

I've used it for 16 months.

Deployment Issues

The enterprise scale deployment of Apache Spark is slightly involved to derive its full potential of stability, scalability and security. However, some Hadoop vendors like Cloudera have integrated Spark data processing engine into their Hadoop platforms and have made it easier to deploy, scale and secure.

Customer Service and Technical Support

This is an open source technology and is dependent on community support. The Apache Spark community is vibrant and it is easy to find answers to questions. The enterprises can also get commercial support from Hadoop vendors such as Cloudera. I recommend enterprises to inspect Hadoop vendors’ commitment to open source as well as their ability to curate Apache Spark technology into the rest of the ecosystem before signing up for a commercial support or subscription.

Initial Setup

The initial set-up is straightforward as long as you have picked a right Hadoop distribution.

Implementation Team

I recommend engaging an experienced Hadoop vendor during the planning and initial implementation phases of the project. You will be able to avoid any potential pitfalls or reduce overall project time by having a Hadoop expert guiding you during the initial stages of the project.

Other Solutions Considered

I evaluated some other technologies such as Samza but community backing for Apache Spark stood out.

Other Advice

I also suggest having a Chief Technologist who has extensive experience in architecting several Big Data solutions. They should be able to communicate in business as well as technology language. Their expertise should range from infrastructure to application development and have command of Hadoop technologies.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user373173 - PeerSpot reviewer
Lead Big Data Engineer at a non-profit with 51-200 employees
Vendor
​I use it to process large amount of data in the energy industry.

What is most valuable?

Spark is relatively easy to deploy, with rich features in handling big data. Spark Core, Spark SQL, Spark MLlib are used mostly in our applications.

How has it helped my organization?

I use Spark to process large amount of data in the energy industry.

What needs improvement?

Good tool to analyse Spark application performance. Right now there are still many parameters to tune in order to get good performance of Spark application, I would like to see the auto tuning of parameters.

For how long have I used the solution?

I've been using Spark for seven months.

What was my experience with deployment of the solution?

There were no issues with the deployment.

What do I think about the stability of the solution?

I ran into Spark application performance issues. For instance, Spark JDBC write performance needs to be improved.

What do I think about the scalability of the solution?

There were no issues with the scalability.

How are customer service and technical support?

Customer Service:

I use Apache open source. Everything is on our own.

Technical Support:

I use Apache open source. Everything is on our own.

Which solution did I use previously and why did I switch?

I evaluated Hadoop-based solution, and chose Spark due to the fast processing and ease of use.

How was the initial setup?

The initial setup is not complex. The online documents are pretty good.

What about the implementation team?

I implemented it in-house.

What other advice do I have?

Get to know how Spark works, what are job, stage, task, DAG, etc., and it will help you to write Spark application.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
PeerSpot user
Engineer at a tech vendor with 10,001+ employees
Real User
Spark provides lots of high-level APIs, which reduces duplication of work.

Valuable Features

Streaming data processing

Improvements to My Organization

In the previous version, we use Storm to handle real-time data, however its performance doesn't meet the requirement. Spark Streaming's micro-batch mode helps improving performance. Also, Spark provides lots of high-level APIs, which reduces duplication of work.

Room for Improvement

Better monitoring ability. Especially monitoring integration with customer codes.

Use of Solution

I've used it for one year.

Stability Issues

We met some standalone deployment issues, which showed that its stability is not that good. So we plan to switch to Yarn or Mesos mode

Customer Service and Technical Support

I have to say it is bad. I can only ask for help in the Google group. However, it is run in the developer-for-developer style. There are almost no people from databricks. I also use a Cassandra-Spark-connector, and Datastax has at least one dedicated person to help the community.

Initial Setup

Not that straightforward in terms of standalone deployment, there are some tricks which are not mentioned in the docs.

Implementation Team

We did it in-house.

Pricing, Setup Cost and Licensing

So far we have no plan to switch to commercial license.

Other Advice

I love Spark over other solutions.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user371334 - PeerSpot reviewer
CEO at a tech consulting company with 51-200 employees
Consultant
It's enabled interactive self-service access to data​.

What is most valuable?

There are several valuable features.

  • Interactive data access (low latency)
  • Batch ETL-style processing
  • Schema-free data models
  • Algorithms

How has it helped my organization?

We have 1000x improvement in performance over other techniques. It's enabled interactive self-service access to data.

What needs improvement?

Better integration of BI tools wold be a much appreciated improvement.

For how long have I used the solution?

I've used it for about 14 months.

What was my experience with deployment of the solution?

I haven't had any issues with deployment.

What do I think about the stability of the solution?

It's been stable for us.

What do I think about the scalability of the solution?

It's scaled without issue.

How are customer service and technical support?

Customer Service:

Customer service is excellent.

Technical Support:

Technical support is excellent.

Which solution did I use previously and why did I switch?

Yes, we previously used Oracle, from which we ported our data.

How was the initial setup?

The initial setup was simple.

What about the implementation team?

We implemented it with our in-house team.

What other advice do I have?

Be sure to Uuse the Apache versions and avoid vendor-specific extensions.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user371325 - PeerSpot reviewer
Data Scientist at a tech vendor with 10,001+ employees
Vendor
It allows the loading and investigation of very lard data sets, has MLlib for machine learning, Spark streaming, and both the new and old dataframe API.

What is most valuable?

It allows the loading and investigation of very lard data sets, has MLlib for machine learning, Spark streaming, and both the new and old dataframe API.

How has it helped my organization?

We're able to perform data discovery on large datasets without too much difficulty.

What needs improvement?

It needs better documentation as well as examples for all the Spark libraries. That would be very helpful in maximizing its capabilities and results.

For how long have I used the solution?

I've used it for over nine months now.

What was my experience with deployment of the solution?

I haven't encountered any issues with deployment.

What do I think about the stability of the solution?

There have been no stability issues.

What do I think about the scalability of the solution?

I haven't had any scalability issues. It scales better than Python and R.

How are customer service and technical support?

Customer Service:

I haven't had to use customer service.

Technical Support:

I haven't had to use technical support.

Which solution did I use previously and why did I switch?

I previously used Python and R, but neither of these scaled particularly well.

How was the initial setup?

The initial setup was complex. It was not easy getting the correct version and dependencies set up.

What about the implementation team?

I implemented it in-house on my own!

What was our ROI?

It's open-source, so ROI is inapplicable.

What other advice do I have?

Learn Scala as this will greatly reduce the pain in starting off with Spark.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user