Manager - Data Science Competency at a tech services company with 201-500 employees
Consultant
Fast-performance, cost-effective, and runs in a cloud-agnostic environment
Pros and Cons
  • "One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
  • "When you are working with large, complex tasks, the garbage collection process is slow and affects performance."

What is our primary use case?

My main task is working on predictive analytics, and Apache Spark is one of the tools that I utilize in this role. Primarily, we work with the predictive analysis of very large amounts of data.

Apache Spark is also helpful for data pre-processing, including data cleaning.

This solution is cloud-agnostic. You can use it with an EC2 instance and you can even install it on-premises. Some environments have it installed in VMs.

What is most valuable?

One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them.

Another feature is memory-based computing. This is unlike Hadoop, which relies on storage. As it uses in-memory data processing, Spark is very fast.

What needs improvement?

When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve because your job may fail if it is stuck for a long time while memory garbage collection is happening. This is the main problem that we have.
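One common mitigation for long garbage collection pauses is to pass JVM options to the executors. The following is a minimal sketch, not something the reviewer describes: it assumes the G1 collector suits the workload and uses the real Spark setting `spark.executor.extraJavaOptions`. The helper names `gc_tuning_conf` and `to_submit_args` are hypothetical.

```python
def gc_tuning_conf(use_g1=True, log_gc=True):
    """Build a Spark config dict that passes GC-related JVM flags to executors."""
    java_opts = []
    if use_g1:
        # Ask the executor JVMs to use the G1 garbage collector.
        java_opts.append("-XX:+UseG1GC")
    if log_gc:
        # Print a line per collection so long pauses show up in executor logs.
        java_opts.append("-verbose:gc")
    return {"spark.executor.extraJavaOptions": " ".join(java_opts)}


def to_submit_args(conf):
    """Flatten a config dict into spark-submit style '--conf key=value' pairs."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = gc_tuning_conf()
submit_args = to_submit_args(conf)
print(submit_args)
```

The resulting list can be appended to a `spark-submit` command line. Whether G1 actually helps depends on heap size and workload, so the GC log output should be checked before and after the change.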

For how long have I used the solution?

I have been working with Apache Spark for the past four years.

Buyer's Guide
Apache Spark
April 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
769,065 professionals have used our research since 2012.

What do I think about the stability of the solution?

This product is pretty stable. Companies like Facebook, Uber, and Netflix are all using Apache Spark. It's stable enough to be used all over the world.

What do I think about the scalability of the solution?

In our team that works on this, we have approximately 10 people.

How are customer service and support?

There is no official support for this solution. Because it's open-source and there is no cost involved, there is nobody to contact for support. Our own internal team of experts, who work on different problems, both supports and contributes to the platform.

Which solution did I use previously and why did I switch?

I work on several open-source frameworks including Python, Scikit-learn, TensorFlow, PyTorch, H2O.ai, and R. We don't endorse proprietary tools so we aren't working with them.

How was the initial setup?

With respect to the initial setup, it's neither easy nor very difficult. Our team has experience so it is not difficult for them. However, for a person that is new to using it, the setup might be very difficult.

What about the implementation team?

We have a team of experts in my company, and they handle it very well.

What's my experience with pricing, setup cost, and licensing?

This is an open-source tool, so it can be used free of charge. There is no cost involved.

What other advice do I have?

We are not using the current version of this platform, Spark 3. However, we do know that it is used in the market and it has new features. We will eventually move to it.

My advice for anybody who wants to use Apache Spark is that they have two options. The first is to use the proprietary version from Databricks, the company founded by the creators of Apache Spark. If you choose this option, you will have to pay for the product.

If you instead use open-source Apache Spark, you can rely on your own expert in-house team for support, maintenance, and deployment. With this option, you don't have to pay anything to anybody outside of your company.

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user372393 - PeerSpot reviewer
Big Data Consultant at a tech services company with 501-1,000 employees
Consultant
We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.
Pros and Cons
  • "The good performance. The nice graphical management console. The long list of ML algorithms."
  • "Apache Spark provides very good performance. The tuning phase is still tricky."

What is most valuable?

The good performance. The nice graphical management console. The long list of ML algorithms.

How has it helped my organization?

We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.

What needs improvement?

Apache Spark provides very good performance. The tuning phase is still tricky.

For how long have I used the solution?

I've used it for 2 years.

What was my experience with deployment of the solution?

We didn't have an issue with the deployment.

What do I think about the stability of the solution?

In the past we deployed Spark 1.3 to use Spark SQL, but unfortunately one of our queries failed because of a bug that was fixed in later releases. Then we moved to Spark 1.6, but some queries were still failing when run against huge datasets. Now we are using version 2.1: it is more stable, it delivers better performance, and the SQL/ML parts are richer than before.

What do I think about the scalability of the solution?

I've had no issues with the scalability.

How is customer service and technical support?

Customer Service:

I've never had to use customer service.

Technical Support:

I've never had to use technical support.

How was the initial setup?

The initial setup is quite complex because you have to set up many different configuration parameters that are deployment-specific. It is not trivial to arrive at the correct configuration with so many variables involved.

What about the implementation team?

In-house team. The setup itself is not a problem when you just have to test the system. The challenging part is discovering the optimal configuration needed to obtain a production system providing good performance.
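As an illustration of the kind of deployment-specific arithmetic involved, here is a rough sketch of a widely used rule of thumb for sizing executors. It is not the reviewer's method. The assumptions are: about five cores per executor, one core and 1 GB per node reserved for the operating system, one executor slot reserved for the driver or application master, and memory overhead of the larger of 384 MiB or 10% of the executor heap. The function name `size_executors` is hypothetical.

```python
import math


def size_executors(nodes, cores_per_node, mem_gb_per_node, cores_per_executor=5):
    """Suggest starting values for executor count, cores and heap size."""
    # Reserve one core and 1 GB per node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = max(1, usable_cores // cores_per_executor)
    # Keep one executor slot free for the driver / application master.
    total_executors = max(1, nodes * executors_per_node - 1)

    # Each executor container must hold the heap plus the memory overhead,
    # assumed here to be max(384 MiB, 10% of the heap).
    container_gb = usable_mem_gb / executors_per_node
    heap_gb = math.floor(min(container_gb / 1.10, container_gb - 0.375))

    return {
        "spark.executor.instances": str(total_executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_gb}g",
    }


# Example: a hypothetical cluster of 6 nodes with 16 cores and 64 GB each.
sizing = size_executors(nodes=6, cores_per_node=16, mem_gb_per_node=64)
print(sizing)
```

These numbers are only a starting point. The optimal values still have to be found by measuring the real workload, which is the hard part the reviewer refers to.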

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Engineer at a tech vendor with 10,001+ employees
Real User
Spark provides lots of high-level APIs, which reduces duplication of work.

Valuable Features

Streaming data processing

Improvements to My Organization

In the previous version, we used Storm to handle real-time data; however, its performance didn't meet our requirements. Spark Streaming's micro-batch mode helps improve performance. Also, Spark provides lots of high-level APIs, which reduces duplication of work.
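The micro-batch idea can be sketched in plain Python: instead of handling each event individually, events are grouped into fixed time intervals and each group is processed as one small batch. This is a toy model under stated assumptions, not Spark Streaming code. The function name `micro_batches` and the sample events are made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event whose timestamp falls in the same interval
        # lands in the same batch.
        batch_id = int(timestamp // batch_interval)
        batches[batch_id].append(value)
    # Return the non-empty batches in time order.
    return [batches[batch_id] for batch_id in sorted(batches)]


# Five events arriving over roughly three seconds, with a 1-second interval.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)
counts = [len(batch) for batch in batches]
print(batches, counts)
```

Processing a whole batch at once amortizes per-record overhead, which is one plausible reason for the throughput gain over a record-at-a-time engine, at the cost of latency of at least one batch interval.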

Room for Improvement

Better monitoring ability. Especially monitoring integration with customer codes.

Use of Solution

I've used it for one year.

Stability Issues

We ran into some issues with standalone deployment, which showed that its stability is not that good. So we plan to switch to YARN or Mesos mode.

Customer Service and Technical Support

I have to say it is bad. I can only ask for help in the Google group. However, it is run in a developer-for-developer style. There are almost no people from Databricks. I also use a Cassandra-Spark connector, and Datastax has at least one dedicated person to help the community.

Initial Setup

Not that straightforward in terms of standalone deployment, there are some tricks which are not mentioned in the docs.

Implementation Team

We did it in-house.

Pricing, Setup Cost and Licensing

So far we have no plan to switch to a commercial license.

Other Advice

I love Spark over other solutions.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user746943 - PeerSpot reviewer
Big Data and Cloud Solution Consultant at a financial services firm with 10,001+ employees
Real User
Provides flexibility for application creation with less coding effort
Pros and Cons
  • "DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort."
  • "Dynamic DataFrame options are not yet available."

What is most valuable?

DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort.

How has it helped my organization?

We developed a tool for data ingestion from HDFS->Raw->L1 layer, with data quality checks, loading data into Elasticsearch, and performing CDC.

What needs improvement?

Dynamic DataFrame options are not yet available.

For how long have I used the solution?

One and a half years.

What do I think about the stability of the solution?

No.

What do I think about the scalability of the solution?

No.

What other advice do I have?

Spark gives the flexibility for developing custom applications.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user373173 - PeerSpot reviewer
Lead Big Data Engineer at a non-profit with 51-200 employees
Vendor
I use it to process large amounts of data in the energy industry.

What is most valuable?

Spark is relatively easy to deploy, with rich features in handling big data. Spark Core, Spark SQL, Spark MLlib are used mostly in our applications.

How has it helped my organization?

I use Spark to process large amounts of data in the energy industry.

What needs improvement?

A good tool for analysing Spark application performance is needed. Right now there are still many parameters to tune in order to get good performance from a Spark application; I would like to see automatic tuning of parameters.

For how long have I used the solution?

I've been using Spark for seven months.

What was my experience with deployment of the solution?

There were no issues with the deployment.

What do I think about the stability of the solution?

I ran into Spark application performance issues. For instance, Spark JDBC write performance needs to be improved.

What do I think about the scalability of the solution?

There were no issues with the scalability.

How are customer service and technical support?

Customer Service:

I use Apache open source. Everything is on our own.

Technical Support:

I use Apache open source. Everything is on our own.

Which solution did I use previously and why did I switch?

I evaluated a Hadoop-based solution, and chose Spark due to the fast processing and ease of use.

How was the initial setup?

The initial setup is not complex. The online documents are pretty good.

What about the implementation team?

I implemented it in-house.

What other advice do I have?

Get to know how Spark works (what jobs, stages, tasks, the DAG, etc. are), and it will help you write Spark applications.
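To make these terms concrete, here is a simplified toy model of how a chain of transformations is cut into stages. It rests on one assumption: a new stage begins at each shuffle ("wide") operation. The set `WIDE_OPS` and the function `split_into_stages` are illustrative only. They are not part of Spark, and the real scheduler is considerably more involved.

```python
# Operations that require a shuffle; everything else is treated as narrow.
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "repartition", "sortByKey", "distinct"}


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in operations:
        # A wide operation needs data from all partitions, so it cannot be
        # pipelined with the preceding narrow operations.
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


# One job: the chain of operations triggered by a single action.
job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey", "map"]
stages = split_into_stages(job)

# Each stage runs one task per partition; assume 4 partitions throughout.
num_partitions = 4
total_tasks = len(stages) * num_partitions
print(stages, total_tasks)
```

In this model the job has three stages and twelve tasks. Reading the Spark UI with this picture in mind makes it easier to see where shuffles occur and which stage is slow.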

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user371334 - PeerSpot reviewer
CEO at a tech consulting company with 51-200 employees
Consultant
It's enabled interactive self-service access to data.

What is most valuable?

There are several valuable features.

  • Interactive data access (low latency)
  • Batch ETL-style processing
  • Schema-free data models
  • Algorithms

How has it helped my organization?

We have 1000x improvement in performance over other techniques. It's enabled interactive self-service access to data.

What needs improvement?

Better integration of BI tools would be a much appreciated improvement.

For how long have I used the solution?

I've used it for about 14 months.

What was my experience with deployment of the solution?

I haven't had any issues with deployment.

What do I think about the stability of the solution?

It's been stable for us.

What do I think about the scalability of the solution?

It's scaled without issue.

How are customer service and technical support?

Customer Service:

Customer service is excellent.

Technical Support:

Technical support is excellent.

Which solution did I use previously and why did I switch?

Yes, we previously used Oracle, from which we ported our data.

How was the initial setup?

The initial setup was simple.

What about the implementation team?

We implemented it with our in-house team.

What other advice do I have?

Be sure to use the Apache versions and avoid vendor-specific extensions.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Easy to use and is capable of processing large amounts of data
Pros and Cons
  • "The most valuable feature of this solution is its capacity for processing large amounts of data."
  • "When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."

What is our primary use case?

We use this solution for information gathering and processing. 

I use it myself when I am developing on my laptop.

I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.

What is most valuable?

The most valuable feature of this solution is its capacity for processing large amounts of data.

This solution makes it easy to do a lot of things. It's easy to read data, process it, save it, etc.

What needs improvement?

When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable.

When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.

For how long have I used the solution?

I have been using this solution for between two and three years.

What do I think about the stability of the solution?

This solution is difficult for users who are just beginning, and they experience out-of-memory errors when dealing with large amounts of data.

How are customer service and technical support?

I have not been in contact with technical support. I find all of the answers that I need in the forums.

What other advice do I have?

The work that we are doing with this solution is quite common and is very easy to do.

My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things.

I would rate this solution a nine out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user326142 - PeerSpot reviewer
Architect at a healthcare company with 51-200 employees
Real User
Having everything in the same framework has helped us out a lot
Pros and Cons
  • "ETL and streaming capabilities."
  • "Stability in terms of API (things were difficult when transitioning from RDD to DataFrames, then to Dataset)."

What is most valuable?

ETL and streaming capabilities.

How has it helped my organization?

It made big data processing more convenient, and a uniform framework adds to efficiency of use since the same framework can be used for batch and stream processing.

What needs improvement?

Stability in terms of API (things were difficult when transitioning from RDD to DataFrames, then to Dataset).

For how long have I used the solution?

I have used Spark since its inception in March 2015, from Spark 1.1 onwards.

Currently, I use 2.2 extensively.

What do I think about the stability of the solution?

Yes, occasionally with different APIs.

What do I think about the scalability of the solution?

No.

How are customer service and technical support?

Since we were using the open-source version of Apache Spark, without the Databricks support, we never used technical support from Databricks.

Which solution did I use previously and why did I switch?

Yes, we used Hive, Pig, and Storm. Having everything in the same framework has helped us out a lot.

Which other solutions did I evaluate?

Yes, we considered other big data products in the Big Data Ecosystem.

What other advice do I have?

Go for it.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user