it_user371325 - PeerSpot reviewer
Data Scientist at a tech vendor with 10,001+ employees
Vendor
It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame APIs.

What is most valuable?

It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame APIs.

How has it helped my organization?

We're able to perform data discovery on large datasets without too much difficulty.

What needs improvement?

It needs better documentation as well as examples for all the Spark libraries. That would be very helpful in maximizing its capabilities and results.

For how long have I used the solution?

I've used it for over nine months now.

Buyer's Guide
Apache Spark
April 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
769,065 professionals have used our research since 2012.

What was my experience with deployment of the solution?

I haven't encountered any issues with deployment.

What do I think about the stability of the solution?

There have been no stability issues.

What do I think about the scalability of the solution?

I haven't had any scalability issues. It scales better than Python and R.

How are customer service and support?

Customer Service:

I haven't had to use customer service.

Technical Support:

I haven't had to use technical support.

Which solution did I use previously and why did I switch?

I previously used Python and R, but neither of these scaled particularly well.

How was the initial setup?

The initial setup was complex. It was not easy getting the correct version and dependencies set up.

What about the implementation team?

I implemented it in-house on my own!

What was our ROI?

It's open-source, so ROI is inapplicable.

What other advice do I have?

Learn Scala as this will greatly reduce the pain in starting off with Spark.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user365301 - PeerSpot reviewer
Software Developer (Product Engineering) at a computer software company with 501-1,000 employees
Vendor
We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.

Valuable Features:

Spark Streaming, Spark SQL and MLlib, in that order.

Improvements to My Organization:

We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.

Room for Improvement:

As I said, scalability is still an issue, and so is stability. Spark on YARN still doesn't seem to have a programmatic submission API, so we have to rely on the spark-submit script to run jobs on YARN. The Scala and Java APIs have performance differences, which sometimes requires coding in Scala.
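
As a rough illustration of relying on the spark-submit script to run jobs on YARN, the sketch below assembles a spark-submit command line from Python. It is only a minimal example under stated assumptions: the helper name `build_spark_submit`, the jar name `ingest.jar`, the class `com.example.KafkaIngest`, and the default resource values are all hypothetical and do not come from the review. The flags themselves (`--master`, `--deploy-mode`, `--class`, `--num-executors`, `--executor-memory`) are standard spark-submit options.

```python
import shlex
from typing import List, Optional


def build_spark_submit(
    app_jar: str,
    main_class: str,
    app_args: Optional[List[str]] = None,
    deploy_mode: str = "cluster",
    num_executors: int = 2,
    executor_memory: str = "2g",
) -> List[str]:
    """Assemble a spark-submit argument list for a job that runs on YARN.

    Hypothetical helper for illustration; it only builds the command,
    it does not launch anything.
    """
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        app_jar,  # application jar comes after the options
    ]
    cmd.extend(app_args or [])  # arguments passed through to the job itself
    return cmd


if __name__ == "__main__":
    # Placeholder jar, class, and argument names (assumptions, not from the review).
    cmd = build_spark_submit("ingest.jar", "com.example.KafkaIngest", ["events"])
    print(shlex.join(cmd))
    # To actually launch the job, one could pass the list to
    # subprocess.run(cmd, check=True) on a machine with Spark installed.
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later handed to `subprocess.run`.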

Other Advice:

Have Scala developers at hand. Base Java competency will not be enough during optimization rounds.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Snr Security Engineer at a tech vendor with 201-500 employees
Real User
Provides security analytics and has good scalability
Pros and Cons
  • "The scalability has been the most valuable aspect of the solution."
  • "The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive."

What is our primary use case?

We primarily use the solution for security analytics.

What is most valuable?

The scalability has been the most valuable aspect of the solution.

What needs improvement?

The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive. 

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The 2.3 version is quite stable. All of our customers use it; there are more than 100,000 users, and it runs 24/7.

What do I think about the scalability of the solution?

The scalability is very good.

How are customer service and technical support?

You actually buy Cloudera along with it. You don't really get any support otherwise, even though you need support.

Which solution did I use previously and why did I switch?

In previous companies, we used the MySQL platform and solutions like ArcSight and Splunk. We switched for scalability. MySQL wasn't going to scale, and we don't use Splunk at this company.

How was the initial setup?

The initial setup was complex. It is a complex tool, and a lot depends on how you will use it. There is a lot to set up, including nearly 60 scripts. In the cloud, setup takes about a day. On-premises, on your own hardware, it takes about a week.

What other advice do I have?

I would rate this solution eight out of 10. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user1223676 - PeerSpot reviewer
Lead Consultant at a tech services company with 51-200 employees
Consultant
The data storage capacity means we can inject somewhere in the user database in more efficient ways
Pros and Cons
  • "The main feature that we find valuable is that it is very fast."
  • "We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."

What is most valuable?

The main feature that we find valuable is that it is very fast. In terms of big data, the main feature is that the data is in so many different nodes. It goes through many data nodes so whenever we use the data, it enables us to parse the data from different data nodes. 

What needs improvement?

We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time. There is some latency in the system and latency in the data caching. The main issue is that we need to design it in a way that makes data available to us very quickly. It takes a long time, and the latest data should be available to us much more quickly.

What do I think about the stability of the solution?

We don't have any problems with stability. 

How are customer service and technical support?

I'm not the one who would contact their support if we needed it. 

How was the initial setup?

The initial setup is straightforward. 

What other advice do I have?

The advice that I would give to someone considering this solution is that data has key streaming characteristics, like velocity, meaning how quickly you are going to access the data. These things matter when designing the solution, so you need to work them out.

I would rate Apache Spark an eight out of ten. 

To make it a ten they should improve the speed. The data storage capacity means we can inject somewhere in the user database in more efficient ways.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user