Managing Consultant at a computer software company with 501-1,000 employees
Real User
Good performance and resource management for hosting our data science platform
Pros and Cons
  • "The processing time is very much improved over the data warehouse solution that we were using."
  • "I would like to see integration with data science platforms to optimize the processing capability for these tasks."

What is our primary use case?

Our use case for Apache Spark was a retail price prediction project. We were using retail pricing data to build predictive models. To start, the prices were analyzed and we created the dataset to be visualized using Tableau. We then used a visualization tool to create dashboards and graphical reports to showcase the predictive modeling data.

Apache Spark was used to host this entire project.

How has it helped my organization?

The processing time is very much improved over the data warehouse solution that we were using.

What is most valuable?

The most valuable features are the storage engine, the memory engine, and the processing engine.

What needs improvement?

I would like to see integration with data science platforms to optimize the processing capability for these tasks.

Buyer's Guide
Apache Spark
June 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: June 2025.
856,873 professionals have used our research since 2012.

For how long have I used the solution?

I have been using Apache Spark for the past year.

How are customer service and support?

We have not been in contact with technical support.

What's my experience with pricing, setup cost, and licensing?

The initial setup is straightforward. It took us around one week to set it up, and then the requirements and creation of the project flow and design needed to be done. The design stage took three to four weeks, so in total, it required between four and five weeks to set up.

What other advice do I have?

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user1223676 - PeerSpot reviewer
Lead Consultant at a tech services company with 51-200 employees
Consultant
The data storage capacity means we can inject somewhere in the user database in more efficient ways
Pros and Cons
  • "The main feature that we find valuable is that it is very fast."
  • "We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."

What is most valuable?

The main feature that we find valuable is that it is very fast. In terms of big data, the main feature is that the data is in so many different nodes. It goes through many data nodes so whenever we use the data, it enables us to parse the data from different data nodes. 

What needs improvement?

We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time. There is some latency in the system and latency in the data caching. The main issue is that we need to design it in a way that data will be available to us very quickly. It takes a long time and the latest data should be available to us much quicker.

What do I think about the stability of the solution?

We don't have any problems with stability. 

How are customer service and technical support?

I'm not the one who would contact their support if we needed it. 

How was the initial setup?

The initial setup is straightforward. 

What other advice do I have?

The advice that I would give to someone considering this solution is that the quality of data has key streaming capabilities like velocity. This means how quickly you are going to refer to the data. These things matter by designing the solution. We need to take these things out. 

I would rate Apache Spark an eight out of ten. 

To make it a ten they should improve the speed. The data storage capacity means we can inject somewhere in the user database in more efficient ways.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Offers good machine learning, data learning, and Spark Analytics features
Pros and Cons
  • "The features we find most valuable are the machine learning, data learning, and Spark Analytics."
  • "We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data."

What is our primary use case?

We have built a product called "NetBot." We take any form of data, such as large email data, images, videos, or transactional data, transform the unstructured textual data and videos into a structured, transactional form, and create an enterprise-wide smart data grid. That smart data grid is used by downstream analytics tools. We also provide machine-building for people to get faster insight into their data.

What is most valuable?

We use all the features. We use it for end-to-end. All of our data analysis and execution happens through Spark.

The features we find most valuable are the: 

  • Machine learning
  • Data learning
  • Spark Analytics.

What needs improvement?

We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data.

For how long have I used the solution?

I have been using Apache Spark for more than five years. 

What do I think about the stability of the solution?

We haven't had any issues with stability so far. 

What do I think about the scalability of the solution?

As long as you do it correctly, it is scalable.

Our users mostly consist of data analysts, engineers, data scientists, and DB admins.

Which solution did I use previously and why did I switch?

Before using this solution we used Apache Storm.

How was the initial setup?

The initial setup is complex. 

What about the implementation team?

We installed it ourselves. 

What other advice do I have?

I would rate it a nine out of ten. 

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
PeerSpot user
reviewer879201 - PeerSpot reviewer
Technical Consultant at a tech services company with 1-10 employees
Consultant
Good Streaming features enable to enter data and analysis within Spark Stream
Pros and Cons
  • "I feel the streaming is its best feature."
  • "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."

What is our primary use case?

We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.

What is most valuable?

I have worked with Hadoop a lot in my career and you need to do a lot of things to get it to Hello World. But in Spark it is easy. You could say it's an umbrella to do everything under the one shelf. It also has Spark Streaming. I feel the streaming is its best feature because I have extracted to enter data and analysis within Spark Stream.

What needs improvement?

I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like contributions and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as if you want to use Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that does all these configurations, like in Windows where you have the whole solution and it does the back-end. So I think that kind of solution would help. But still, it can do everything for a data scientist.

Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.

Overall, it offers everything that I can imagine right now. 

For how long have I used the solution?

I have been using Apache Spark for a couple of months.

What do I think about the stability of the solution?

In terms of stability, I have not seen any bugs, glitches or crashes. Even if there are, that's fine, because I would probably take care of it and then I'd have progressed further in the process.

What do I think about the scalability of the solution?

I have not tested the scalability yet.

In my company, there are two or three people that are using it for different products. But right now, the client I'm engaged with doesn't know anything about Spark or Hadoop. They are a typical financial company so they do what they do, and they ask us to do everything. They have pretty much outsourced their whole big data initiative to us.

Which solution did I use previously and why did I switch?

I have used MapReduce from Hadoop previously. Otherwise, I haven't used any other big data infrastructure.

In my work previously, not in this company, I was working with some big data, but I was extracting using a single core of my PC. I realized over time that my system had eight cores. So instead, I used all of those cores for multi-core programming. Then I realized that Hadoop and Spark do the same thing but with different PCs. That was when I used multi-core programming and that's the point - Spark needs to go and search Hadoop and other things.
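
The single-core versus multi-core point can be sketched in plain Python. This is an illustration only, not the reviewer's code: `split_into_chunks`, `count_words`, `parallel_word_count`, and the sample data are hypothetical names. The pattern is to split the work, process the pieces independently, and combine the partial results, which is broadly what Spark does across machines rather than across the cores of one PC.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def split_into_chunks(records, n_chunks):
    """Split a list into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Work done independently on each chunk (the 'map' step)."""
    return sum(len(line.split()) for line in chunk)


def parallel_word_count(records, workers=None):
    """Process chunks concurrently, then combine the partial results."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(records, workers)
    # Threads keep this sketch portable; for CPU-bound work on several
    # cores, ProcessPoolExecutor offers the same map() interface.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)  # the 'reduce' step


# Hypothetical sample data: 3,000 short lines of three words each.
lines = ["spark reads data", "hadoop stores data", "both scale out"] * 1000
total = parallel_word_count(lines, workers=8)
```

The combine step is deliberately cheap: each worker returns a single number, so almost nothing has to be passed back to the caller.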

How was the initial setup?

The initial setup to get it to Hello World is pretty easy, you just have to install it. But when you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources. But you can get a lot of help from different sources on the internet. So it's great. A lot of people are doing it.

I work with a startup company. You know that in startups you do not have the luxury of different people doing different things, you have to do everything on your own, and it's an opportunity to learn everything. In a typical corporate or big organization you only have restricted SOPs, you have to work within the boundaries. In my organization, I have to set up all the things, configure it, and work on it myself.

What's my experience with pricing, setup cost, and licensing?

I would suggest not to try to do everything at once. Identify the area where you want to solve the problem, start small and expand it incrementally, slowly expand your vision. For example, if I have a problem where I need to do streaming, just focus on the streaming and not on the machine learning that Spark offers. It offers a lot of things but you need to focus on one thing so that you can learn. That is what I have learned from the little experience I have with Spark. You need to focus on your objective and let the tools help you rather than the tools drive the work. That is my advice.

What other advice do I have?

On a scale of 1 to 10, I'd put it at an eight.

To make it a perfect 10 I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened on the configuration and back-end. So I think installation and configuration with some other tools could be improved. We are technical people, we could figure it out, but if aspects like that were improved then other people who are less technical would use it and it would be more adaptable to the end-user.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Director of BigData Offer at IVIDATA
Real User
Stable, fast, and easy to use
Pros and Cons
  • "The solution is very stable."
  • "The solution needs to optimize shuffling between workers."

What is our primary use case?

We primarily use the solution to integrate very large data sets from another environment, such as our SQL environment, and draw purposeful data before checking it. We also use the solution for streaming very very large servers. 

What is most valuable?

It is a very fast solution. It's very easy to use. There are many APIs for many languages like Scala, Java, R, and Python. The greatest advantage of Spark is that we can initiate many kinds of analytics including SQL analytics, graph analytics, etc.

What needs improvement?

The solution needs to optimize shuffling between workers.
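
For readers unfamiliar with the term, a shuffle is the step where records with the same key are moved to the same worker. The plain-Python simulation below is a sketch under stated assumptions, not Spark's implementation: it counts how many records would cross worker boundaries with and without combining values locally first. All function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
import zlib


def target_worker(key, n_workers):
    """Deterministic hash partitioning: a key always maps to the same worker."""
    return zlib.crc32(key.encode("utf-8")) % n_workers


def shuffle(partitions, n_workers):
    """Route every (key, value) record to the worker that owns its key.

    Returns the regrouped data and the number of records that had to
    move to a different worker.
    """
    received = defaultdict(list)
    moved = 0
    for source, records in enumerate(partitions):
        for key, value in records:
            dest = target_worker(key, n_workers)
            received[dest].append((key, value))
            if dest != source:
                moved += 1
    return received, moved


def combine_locally(records):
    """Pre-aggregate values per key inside one partition before shuffling."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return list(totals.items())


def reduce_by_key(received):
    """Final per-key sums after the shuffle."""
    totals = Counter()
    for records in received.values():
        for key, value in records:
            totals[key] += value
    return dict(totals)


# Hypothetical data: three workers, each holding many repeated keys.
partitions = [[("a", 1), ("b", 1)] * 50 for _ in range(3)]

naive, moved_naive = shuffle(partitions, 3)
combined, moved_combined = shuffle([combine_locally(p) for p in partitions], 3)
```

Both paths produce the same per-key totals, but in this toy setup the pre-combined path moves 4 records instead of 200. In Spark terms this is the distinction usually drawn between `groupByKey` and `reduceByKey`.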

For how long have I used the solution?

I've been using the solution for four or five years.

What do I think about the stability of the solution?

The solution is very stable.

What do I think about the scalability of the solution?

The solution is scalable. My understanding is version 3.0 has renewed scaling capabilities and will be able to do so automatically.

How are customer service and technical support?

Apache is an open-source platform so there is no technical support.

What other advice do I have?

We use both on-premises and public and private cloud deployment models. We're partners with Databricks.

I'm a consultant. Our company works for large enterprises such as banks and energy companies. 17 of our workers use Apache Spark.

With the cloud, there are many companies that integrate Spark. Most projects in big data around the world use Spark, indirectly or directly. 

I'd rate the solution eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
reviewer1046250 - PeerSpot reviewer
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Easy to use and is capable of processing large amounts of data
Pros and Cons
  • "The most valuable feature of this solution is its capacity for processing large amounts of data."
  • "When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."

What is our primary use case?

We use this solution for information gathering and processing. 

I use it myself when I am developing on my laptop.

I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.

What is most valuable?

The most valuable feature of this solution is its capacity for processing large amounts of data.

This solution makes it easy to do a lot of things. It's easy to read data, process it, save it, etc.

What needs improvement?

When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable.
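
One frequent cause of such errors, in any data tool, is pulling an entire data set into a single process's memory at once. The sketch below uses plain Python rather than Spark to show the general remedy: stream the records and keep only a running aggregate. `read_prices`, `average_price`, and the CSV layout are hypothetical and chosen only for illustration.

```python
import csv
import io


def read_prices(handle):
    """Yield one price at a time instead of building a full list in memory."""
    for row in csv.DictReader(handle):
        yield float(row["price"])


def average_price(prices):
    """Consume any iterable of numbers while holding only two running values."""
    total = 0.0
    count = 0
    for price in prices:
        total += price
        count += 1
    return total / count if count else 0.0


# A small in-memory file stands in for a large one on disk.
sample = io.StringIO("item,price\napple,1.50\npear,2.50\nplum,2.00\n")
mean = average_price(read_prices(sample))
```

Because `read_prices` is a generator, memory use stays flat no matter how many rows the file contains. The analogous habit in Spark is to aggregate on the cluster and bring back only the small result.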

When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.

For how long have I used the solution?

I have been using this solution for between two and three years.

What do I think about the stability of the solution?

This solution is difficult for users who are just beginning, and they experience out-of-memory errors when dealing with large amounts of data.

How are customer service and technical support?

I have not been in contact with technical support. I find all of the answers that I need in the forums.

What other advice do I have?

The work that we are doing with this solution is quite common and is very easy to do.

My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things.

I would rate this solution a nine out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
it_user946074 - PeerSpot reviewer
Principal Architect at a financial services firm with 1,001-5,000 employees
Real User
Fast performance and has an easy initial setup
Pros and Cons
  • "I found the solution stable. We haven't had any problems with it."
  • "It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."

What is our primary use case?

We use the solution for analytics.

How has it helped my organization?

I'm not sure how it has improved my organization but I believe that it's a good product.

What is most valuable?

The fast performance is the most valuable aspect of the solution.

What needs improvement?

The search could be improved. Usually, we are using other tools to search for specific stuff. We'll be using it how I use other tools - to get the details, but if there is any way to search for little things, that would be better.

It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster.

In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script. 

For how long have I used the solution?

I've been using the solution for two years.

What do I think about the stability of the solution?

I found the solution stable. We haven't had any problems with it.

How are customer service and technical support?

Usually, we can fix any issues. If we have problems, we google a little bit to find the issue. 

Which solution did I use previously and why did I switch?

I was using some other systems and we moved to Spark later. We faced performance and other issues with the other solution.

How was the initial setup?

The initial setup was easy. We keep on getting data from different sources so we will keep on porting in little bits. It's not done in a single sitting, so I can't really say how long it takes.

What other advice do I have?

I would recommend the solution. I would rate it an eight or nine out of 10.

For some areas, I would give it ten but I cannot use some parts. If you are going to use it for a consumer then I would be able to recommend it and you should go ahead. It doesn't work for me as I have different clients and different engagements.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Snr Security Engineer at a tech vendor with 201-500 employees
Real User
Provides security analytics and has good scalability
Pros and Cons
  • "The scalability has been the most valuable aspect of the solution."
  • "The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive."

What is our primary use case?

We primarily use the solution for security analytics.

What is most valuable?

The scalability has been the most valuable aspect of the solution.

What needs improvement?

The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive. 

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The 2.3 version is quite stable. All of our customers use it, there are around 100,000+ users, and it runs 24/7.

What do I think about the scalability of the solution?

The scalability is very good.

How are customer service and technical support?

You actually buy Cloudera along with it. You don't really get any support, except you need support.

Which solution did I use previously and why did I switch?

In previous companies, we used MySQL platform and solutions like ArcSight and Splunk. We switched for scalability. MySQL wasn't going to scale, and we don't use Splunk at this company.

How was the initial setup?

The initial setup was complex. It is a complex tool. It's a lot to do with how you will use it. There is a lot to set up. They need to put a lot of scripts to it. There's nearly 60 to set up. When you set up the cloud, it takes about a day to set up. If you set it up on-premise, you know, on hardware, it only takes about a week.

What other advice do I have?

I would rate this solution eight out of 10. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: June 2025