it_user374040 - PeerSpot reviewer
Systems Engineering Lead, Mid-Atlantic at a tech company with 10,001+ employees
Vendor
It allows you to construct event-driven information systems.

What is most valuable?

Spark Streaming, which allows you to construct event-driven information systems and respond to the events in near-real time.

How has it helped my organization?

Apache Spark’s ability to perform batch processing at intervals of one second or less is its most transformative, if less pervasive, capability for any data processing application. The ingested data can also be validated and verified for quality early in the data pipeline.
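
The idea of short-interval batches with early validation can be sketched in plain Python. This is only an illustration of the concept, not Spark's API: the function names, the one-second window, and the quality rule in `is_valid` are all assumptions made for the example.

```python
from collections import defaultdict


def micro_batches(events, interval_s=1.0):
    """Group (timestamp, record) events into fixed-length time windows."""
    batches = defaultdict(list)
    for ts, record in events:
        batches[int(ts // interval_s)].append(record)
    # Return the batches in time order.
    return [batches[key] for key in sorted(batches)]


def is_valid(record):
    # Hypothetical quality rule: a non-empty id and a numeric value.
    return bool(record.get("id")) and isinstance(record.get("value"), (int, float))


def process(events, interval_s=1.0):
    """Validate each batch as soon as it is formed, before later stages run."""
    results = []
    for batch in micro_batches(events, interval_s):
        accepted = [r for r in batch if is_valid(r)]
        results.append({"accepted": accepted, "rejected": len(batch) - len(accepted)})
    return results


# Illustrative events: (arrival time in seconds, record).
events = [
    (0.10, {"id": "a", "value": 1.5}),
    (0.70, {"id": "", "value": 2.0}),
    (1.20, {"id": "b", "value": "n/a"}),
    (1.90, {"id": "c", "value": 3}),
]
summary = process(events)
```

Each one-second window is checked on its own, so bad records are counted and dropped before any downstream step sees them.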

What needs improvement?

Apache Spark as a data processing engine has come a long way since its inception. Although you can perform complex transformations using Spark libraries, the support for SQL-based transformations is still limited. You can alleviate some of these limitations by running Spark within the Hadoop ecosystem and leveraging the fairly mature HiveQL.

For how long have I used the solution?

I've used it for 16 months.

Buyer's Guide
Apache Spark
April 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
769,065 professionals have used our research since 2012.

What was my experience with deployment of the solution?

Deploying Apache Spark at enterprise scale is somewhat involved if you want to realize its full potential for stability, scalability, and security. However, some Hadoop vendors, such as Cloudera, have integrated the Spark data processing engine into their Hadoop platforms and made it easier to deploy, scale, and secure.

How are customer service and support?

This is an open-source technology and is dependent on community support. The Apache Spark community is vibrant and it is easy to find answers to questions. Enterprises can also get commercial support from Hadoop vendors such as Cloudera. I recommend that enterprises examine a Hadoop vendor's commitment to open source, as well as its ability to integrate Apache Spark with the rest of the ecosystem, before signing up for commercial support or a subscription.

How was the initial setup?

The initial set-up is straightforward as long as you have picked the right Hadoop distribution.

What about the implementation team?

I recommend engaging an experienced Hadoop vendor during the planning and initial implementation phases of the project. Having a Hadoop expert guide you during the initial stages helps you avoid potential pitfalls and reduce overall project time.

Which other solutions did I evaluate?

I evaluated some other technologies, such as Samza, but the community backing for Apache Spark stood out.

What other advice do I have?

I also suggest having a Chief Technologist who has extensive experience in architecting several Big Data solutions. They should be able to communicate in both business and technology language. Their expertise should range from infrastructure to application development, and they should have a command of Hadoop technologies.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Chief Technology Officer at a tech services company with 11-50 employees
Real User
Top 20
Helpful support, easy to use, and high availability
Pros and Cons
  • "The most valuable feature of Apache Spark is its ease of use."
  • "Apache Spark could improve the use case scenarios on its website. There is no information on how to use the solution across relational databases and with multiple databases."

What is our primary use case?

I am using Apache Spark for the data transition from databases. We have customers who have one database as a data lake.

What is most valuable?

The most valuable feature of Apache Spark is its ease of use.

What needs improvement?

Apache Spark could improve the use case scenarios on its website. There is no information on how to use the solution across relational databases and with multiple databases.

For how long have I used the solution?

I have been using Apache Spark for approximately 18 months.

What do I think about the stability of the solution?

Apache Spark is stable.

What do I think about the scalability of the solution?

We are using Apache Spark across multiple nodes and it is scalable.

We have approximately five people using this solution.

How are customer service and support?

The technical support from Apache Spark is very good.

What other advice do I have?

I rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user1059558 - PeerSpot reviewer
Portfolio Manager, Enterprise Solutions Architect at Capgemini
Real User
Supports streaming and micro-batch

What is our primary use case?

Streaming telematics data.

How has it helped my organization?

It's a better alternative to MapReduce (MR), supports streaming and micro-batch, and supports Spark ML and Spark SQL.

What is most valuable?

It supports streaming and micro-batch.

What needs improvement?

Better data lineage support.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user374028 - PeerSpot reviewer
Core Engine Engineer at a computer software company with 51-200 employees
Real User
It makes web-based queries for plotting data easier. It needs to be simpler to use the machine learning algorithms supported by Octave.

Valuable Features

  • RDDs
  • DataFrames
  • Machine learning libraries

Improvements to My Organization

Faster time to parse and compute data. It makes web-based queries for plotting data easier.

Room for Improvement

It needs to be simpler to use the machine learning algorithms supported by Octave (for example, polynomial regression and polynomial interpolation).
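
For readers unfamiliar with the algorithms mentioned, polynomial interpolation can be sketched in a few lines of plain Python using the Lagrange form. This is an illustrative example only; it is not taken from Spark or Octave, and the function name and sample points are assumptions.

```python
def lagrange_interpolate(points, x):
    """Evaluate at x the unique polynomial passing through all (xi, yi) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                # Basis factor: equals 1 at xi and 0 at every other xj.
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Sample points chosen to lie on y = 3x^2 - x + 1 (an assumption for the demo).
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 11.0)]
value = lagrange_interpolate(points, 3.0)
```

With three points the interpolating polynomial is the quadratic itself, so evaluating at x = 3 recovers the value of that quadratic there.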

Use of Solution

I've been using it for one year.

Deployment Issues

There have been no issues with the deployment.

Stability Issues

There have been no issues with the stability.

Scalability Issues

There have been no issues with the scalability.

Customer Service and Technical Support

We still rely on user forums for our answers. We do not use a commercial product yet.

Initial Setup

The initial set-up was easy. I have not explored using this on AWS clusters.

Implementation Team

We did an in-house implementation and development for our regression tool.

ROI

The ROI will be an in-house product to do machine learning analytics on data obtained from customers.

Other Solutions Considered

We did not evaluate any other products.

Other Advice

It's easy to use, but it does have a learning curve.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Managing Consultant at a computer software company with 501-1,000 employees
Real User
Top 20
Good performance and resource management for hosting our data science platform
Pros and Cons
  • "The processing time is very much improved over the data warehouse solution that we were using."
  • "I would like to see integration with data science platforms to optimize the processing capability for these tasks."

What is our primary use case?

Our use case for Apache Spark was a retail price prediction project. We were using retail pricing data to build predictive models. To start, the prices were analyzed and we created the dataset to be visualized using Tableau. We then used a visualization tool to create dashboards and graphical reports to showcase the predictive modeling data.

Apache Spark was used to host this entire project.

How has it helped my organization?

The processing time is very much improved over the data warehouse solution that we were using.

What is most valuable?

The most valuable features are the storage engine, the memory engine, and the processing engine.

What needs improvement?

I would like to see integration with data science platforms to optimize the processing capability for these tasks.

For how long have I used the solution?

I have been using Apache Spark for the past year.

How are customer service and technical support?

We have not been in contact with technical support.

What's my experience with pricing, setup cost, and licensing?

The initial setup is straightforward. It took us around one week to set it up, and then the requirements and creation of the project flow and design needed to be done. The design stage took three to four weeks, so in total, it required between four and five weeks to set up.

What other advice do I have?

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Director of BigData Offer at IVIDATA
Real User
Stable, fast, and easy to use
Pros and Cons
  • "The solution is very stable."
  • "The solution needs to optimize shuffling between workers."

What is our primary use case?

We primarily use the solution to integrate very large data sets from another environment, such as our SQL environment, and draw purposeful data before checking it. We also use the solution for streaming from very large servers.

What is most valuable?

It is a very fast solution. It's very easy to use. There are APIs for many languages, such as Scala, Java, R, and Python. The greatest advantage of Spark is that we can run many kinds of analytics, including SQL analytics, graph analytics, etc.

What needs improvement?

The solution needs to optimize shuffling between workers.

For how long have I used the solution?

I've been using the solution for four or five years.

What do I think about the stability of the solution?

The solution is very stable.

What do I think about the scalability of the solution?

The solution is scalable. My understanding is that version 3.0 has renewed scaling capabilities and will be able to scale automatically.

How are customer service and technical support?

Apache Spark is an open-source platform, so there is no technical support.

What other advice do I have?

We use both on-premises and public and private cloud deployment models. We're partners with Databricks.

I'm a consultant. Our company works for large enterprises such as banks and energy companies. 17 of our workers use Apache Spark.

With the cloud, there are many companies that integrate Spark. Most projects in big data around the world use Spark, indirectly or directly. 

I'd rate the solution eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer894894 - PeerSpot reviewer
Works at a computer software company with 51-200 employees
User
Features include machine learning, real-time streaming, and data processing. It doesn't provide Spark job scheduling with monitoring capability.
Pros and Cons
  • "Features include machine learning, real time streaming, and data processing."
  • "The fault tolerant feature is provided."
  • "It provides a scalable machine learning library."
  • "It should support more programming languages."
  • "Needs to provide an internal scheduler to schedule Spark jobs with monitoring capability."

What is our primary use case?

Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.

How has it helped my organization?

It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.

What is most valuable?

Machine learning, real time streaming, and data processing are fantastic, as well as the resilient or fault tolerant feature.

What needs improvement?

I would suggest that it support more programming languages and also provide an internal scheduler to schedule Spark jobs with monitoring capability.

For how long have I used the solution?

Trial/evaluations only.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user746673 - PeerSpot reviewer
Sr. Software Engineer at a tech vendor with 1-10 employees
Real User
Helped us reduce 3TB Google Ngrams in hours instead of days
Pros and Cons
  • "The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics."
  • "More ML based algorithms should be added to it, to make it algorithmic-rich for developers."

What is most valuable?

The most valuable features are fault tolerance and easy binding with other processes such as machine learning and graph analytics. The community is growing, and executing ML in a distributed fashion works quite well.

How has it helped my organization?

Previously we were using Hadoop MapReduce to reduce the Google Ngrams (3TB), which took us approximately five days on our cluster. After using Spark, we were able to accomplish this task within hours.
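
The kind of reduction described here, summing counts per n-gram, can be sketched in plain Python as a map step followed by a reduce-by-key step. This is a conceptual illustration rather than the reviewer's job or Spark's API; the tab-separated record layout (n-gram, year, count), the function names, and the sample lines are assumptions.

```python
from collections import Counter


def parse(line):
    """Map step: turn one record into an (ngram, count) pair."""
    # Assumed layout: ngram <TAB> year <TAB> count
    ngram, _year, count = line.rstrip("\n").split("\t")[:3]
    return ngram, int(count)


def reduce_by_key(pairs):
    """Reduce step: sum the counts for each distinct n-gram."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up sample records, not real Google Ngrams data.
lines = [
    "big data\t2007\t12",
    "big data\t2008\t30",
    "data lake\t2008\t4",
]
totals = reduce_by_key(parse(line) for line in lines)
```

The generator feeds mapped pairs straight into the reduction without writing intermediate results out, which loosely mirrors why an in-memory engine can finish such a job faster than a disk-bound one.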

What needs improvement?

This product is already improving as the community is developing it rapidly. More ML based algorithms should be added to it, to make it algorithmic-rich for developers.

For how long have I used the solution?

Two and a half years.

What do I think about the stability of the solution?

No, I did not encounter any problems with the stability. It is also quite backwards compatible.

What do I think about the scalability of the solution?

No, I have not encountered any issues so far; it is quite scalable. Using simple scripts, you can add as many workers as you want.

What other advice do I have?

This is a very good product for big data analytics, and it integrates well with other components such as machine learning and graph analytics.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: April 2024