Apache Spark Overview

Apache Spark is the #1 ranked solution in top Hadoop tools, #2 ranked solution in top Compute Service tools, and #2 ranked solution in top Java Frameworks. PeerSpot users give Apache Spark an average rating of 8.0 out of 10. Apache Spark is most commonly compared to Spring Boot: Apache Spark vs Spring Boot. Apache Spark is popular among the large enterprise segment, which accounts for 73% of users researching this solution on PeerSpot. The top industry researching this solution is financial services, accounting for 19% of all views.
Apache Spark Buyer's Guide

Download the Apache Spark Buyer's Guide including reviews and more. Updated: November 2022

What is Apache Spark?

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
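
The linear dataflow described above can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce stages on an in-memory word count; it does not use the Spark or Hadoop APIs, and the function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's values into a single result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    # The stages run strictly one after another; a second job over the same
    # input would have to start again from the first stage.
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["spark reads data", "spark keeps data in memory"])
print(counts["spark"], counts["data"])  # prints "2 2"
```

In a real MapReduce job the input and the reduced output live on disk, which is the round trip that a reusable in-memory working set is meant to avoid.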

Apache Spark Customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

Apache Spark Pricing Advice

What users are saying about Apache Spark pricing:
  • "Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."
  • "Since we are using the Apache Spark version, not the Databricks version, it is an Apache license version, and support and bug resolution are late or delayed. The Apache license is free."

Apache Spark Reviews

    Ilya Afanasyev - PeerSpot reviewer
    Senior Software Development Engineer at Yahoo!
    Real User
    Top 5 Leaderboard
    Reliable, able to expand, and handle large amounts of data well
    Pros and Cons
    • "There's a lot of functionality."
    • "I know there is always discussion about which language to write applications in and some people do love Scala. However, I don't like it."

    What is our primary use case?

    It's a root product that we use in our pipeline.

    We have some input data. For example, one system supplies data to MongoDB; we pull this data from MongoDB, enrich it with additional fields from other systems, and write it to S3 for other systems. Since we have a lot of data, we need a parallel process that runs hourly.

    What is most valuable?

    We use batch processing. It works well with our formats and file versions. There's a lot of functionality. 

    Each hour, our pipeline copies data from MongoDB to a specific file. Previously, every run copied all of the data from all of the tables, whether or not anything had changed. The tables hold a lot of data, and the latest MongoDB version makes it possible to read only the changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000.
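
The difference between copying everything and copying only changed data can be sketched in plain Python. This is a minimal illustration that assumes each record carries an `updated_at` timestamp; it is not the MongoDB mechanism the reviewer used, and the field and function names are hypothetical.

```python
from datetime import datetime


def full_copy(records):
    """Copy every record on every run, changed or not."""
    return list(records)


def incremental_copy(records, last_run):
    """Copy only records modified after the previous run's watermark."""
    return [r for r in records if r["updated_at"] > last_run]


# Hypothetical records; the `updated_at` field is an assumption for this sketch.
records = [
    {"id": 1, "updated_at": datetime(2022, 11, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2022, 11, 1, 10, 30)},
    {"id": 3, "updated_at": datetime(2022, 11, 1, 8, 15)},
]
last_run = datetime(2022, 11, 1, 10, 0)

changed = incremental_copy(records, last_run)
print(len(full_copy(records)), len(changed))  # prints "3 1"
```

With an hourly schedule, the amount of data moved then depends on how much changed in the last hour rather than on the total size of the tables.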

    The solution is scalable.

    It's a stable product.

    What needs improvement?

    The primary language for developers on Spark is Scala, although Java is now supported as well. I prefer Java to Scala, so it is good that both are supported. I know there is always discussion about which language to write applications in, and some people do love Scala. However, I don't like it.

    They currently support a JDK version that is a little bit old. Not all features are available on it. Maybe they should update the JDK version they support.

    For how long have I used the solution?

    I've used the solution for a year and a half. 

    Buyer's Guide
    Apache Spark
    November 2022
    Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: November 2022.
    653,757 professionals have used our research since 2012.

    What do I think about the stability of the solution?

    The solution is stable. There are no bugs or glitches. It doesn't crash or freeze. 

    What do I think about the scalability of the solution?

    The product scales well. It's fine to expand if needed. 

    Many teams use Spark. For example, we have a few kinds of pipelines, huge pipelines. One of them processes 300 billion events each day. It's our core technology currently.

    We do not plan to increase usage. We keep our legacy system on Spark, and we are now discussing Flink and Spark and what we would prefer. However, most of the people are already migrating new systems to Flink. We will keep Spark for a few more years still. 

    How are customer service and support?

    We have an internal team that participates in the development of Spark. They are Spark contributors, and if we have problems, we turn to them. They are our own people, yet they work on Spark. Generally, if the problem is more minor, we look at websites or discussions about Spark, or ask internal colleagues who have experience with Spark. 

    Which solution did I use previously and why did I switch?

    We also use Flink.

    Before Spark, I worked at another company where we used different technologies, including Kafka, Radius, PostgreSQL, S3, and Spring. 

    How was the initial setup?

    I didn't handle the initial setup. We were using this pipeline and clusters already. I just installed it on my local server. However, in terms of difficulty, I didn't see any problem. The deployment might only take a few hours. 

    I found some documentation. I got the documentation from the site and downloaded the archive and unzipped it, and installed it. I can't say that I installed something from a special configuration. I just installed a few nodes for debugging and for running locally, and that's all. Also, in one case I used, for example, a Docker configuration with Spark. It all worked fine.

    What's my experience with pricing, setup cost, and licensing?

    It's an open-source product. I don't know much about the licensing aspect. 

    Which other solutions did I evaluate?

    We have compared Flink and Spark as two possible options. 

    What other advice do I have?

    I can recommend the product. It's a nice system for batch processing huge data.

    I'd rate the solution eight out of ten. 

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    NitinKumar - PeerSpot reviewer
    Director of Engineering at Sigmoid
    Real User
    Top 5 Leaderboard
    Easy to code, fast, open-source, very scalable, and great for big data
    Pros and Cons
    • "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
    • "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."

    What is our primary use case?

    I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.

    How has it helped my organization?

    Spark has been at the forefront of data processing engines. I have used Apache Spark for multiple projects for different clients. It is an excellent tool for processing massive amounts of data. 

    What is most valuable?

    Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica.

    Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark.

    What needs improvement?

    Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.
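
As a rough illustration of the separate setup described above, the sketch below collects the settings involved. `spark.eventLog.enabled`, `spark.eventLog.dir`, and `spark.history.fs.logDirectory` are Spark configuration properties; the helper names and the log directory are hypothetical, and `build_session` is defined but not called because it needs a pyspark installation.

```python
def event_log_conf(log_dir):
    """Settings an application needs so that it writes event logs."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
    }


def history_server_conf(log_dir):
    """Setting that tells the separately started history server where to read logs."""
    return {"spark.history.fs.logDirectory": log_dir}


def build_session(app_name, log_dir):
    """Create a SparkSession with event logging switched on (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in event_log_conf(log_dir).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical shared directory; both sides must point at the same location.
LOG_DIR = "hdfs:///spark-logs"
app_side = event_log_conf(LOG_DIR)["spark.eventLog.dir"]
server_side = history_server_conf(LOG_DIR)["spark.history.fs.logDirectory"]
print(app_side == server_side)  # prints "True"
```

The history server itself is a separate process, started with the `sbin/start-history-server.sh` script that ships with the Spark distribution, which is the extra step the reviewer would like folded into the standard setup.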

    For how long have I used the solution?

    I have been using this solution for around 7 years.

    What do I think about the stability of the solution?

    There were bugs three to four years ago, which have been resolved. There were a couple of issues related to slowness when we did a lot of transformations using withColumn. I was writing a POC on ETL for moving from Informatica to Spark SQL for the ETL pipeline. It required hundreds of withColumn calls to change column names or add transformations, which made it slow. It happened in versions prior to version 1.6, and it seems that this issue was fixed later on.
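
Assuming the operation described above is the DataFrame `withColumn` method, the sketch below contrasts adding columns one call at a time with adding them in a single `select`. The Spark documentation notes that each `withColumn` call introduces a projection internally and that calling it many times, for example in a loop, can generate large plans, and it suggests one `select` with multiple columns instead. The helper names and the `_clean` suffix are hypothetical, and the two Spark functions are defined but not called here because they need pyspark and an existing DataFrame.

```python
def derived_names(columns, suffix="_clean"):
    """Pair each source column with the name of its derived column."""
    return [(name, f"{name}{suffix}") for name in columns]


def add_with_loop(df, pairs):
    """Add one derived column per withColumn call; each call adds a projection."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    for source, target in pairs:
        df = df.withColumn(target, F.trim(F.col(source)))
    return df


def add_with_select(df, pairs):
    """Add all derived columns in one select, which is a single projection."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    derived = [F.trim(F.col(source)).alias(target) for source, target in pairs]
    return df.select("*", *derived)


print(derived_names(["name", "city"]))
# prints "[('name', 'name_clean'), ('city', 'city_clean')]"
```

Both functions are intended to return the same columns; with hundreds of derived columns, the single `select` keeps the query plan from growing with every added column.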

    What do I think about the scalability of the solution?

    It is very scalable. You can scale it a lot.

    How are customer service and support?

    I haven't contacted them.

    How was the initial setup?

    The initial setup was a little complex when I was using open-source Spark. I was doing a POC in the on-premise environment, and the initial setup was a little cumbersome. It required a lot of setup on Unix systems. We also had to do a lot of configuration and install a lot of things. 

    After I moved to the Cloudera CDH version, it was a little easier. It is a bundled product, so you just install whatever you want and use it.

    What's my experience with pricing, setup cost, and licensing?

    Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera.

    What other advice do I have?

    I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy.

    I would rate Apache Spark an eight out of ten.

    Which deployment model are you using for this solution?

    Public Cloud

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    Manager - Data Science Competency at a tech services company with 201-500 employees
    Consultant
    Fast-performance, cost-effective, and runs in a cloud-agnostic environment
    Pros and Cons
    • "One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
    • "When you are working with large, complex tasks, the garbage collection process is slow and affects performance."

    What is our primary use case?

    My main task is working on predictive analytics, and Apache Spark is one of the tools that I utilize in this role. Primarily, we work with the predictive analysis of very large amounts of data.

    Apache Spark is also helpful for data pre-processing, including data cleaning.

    This solution is cloud-agnostic. You can use it with an EC2 instance and you can even install it on-premises. Some environments have it installed in VMs.

    What is most valuable?

    One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them.

    Another feature is memory-based computing. This is unlike Hadoop, which relies on storage. As it uses in-memory data processing, Spark is very fast.

    What needs improvement?

    When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve because your job may fail if it is stuck for a long time while memory garbage collection is happening. This is the main problem that we have.

    For how long have I used the solution?

    I have been working with Apache Spark for the past four years.

    What do I think about the stability of the solution?

    This product is pretty stable. Companies like Facebook, Uber, and Netflix are all using Apache Spark. It's stable enough to be used all over the world.

    What do I think about the scalability of the solution?

    In our team that works on this, we have approximately 10 people.

    How are customer service and support?

    There is no official support for this solution. Because it's open-source and there is no cost involved, there is nobody to contact for support. Our own internal team of experts, which work on different problems, both support and contribute to the platform.

    Which solution did I use previously and why did I switch?

    I work on several open-source frameworks including Python, Scikit-learn, TensorFlow, PyTorch, H2O.ai, and R. We don't endorse proprietary tools so we aren't working with them.

    How was the initial setup?

    With respect to the initial setup, it's neither easy nor very difficult. Our team has experience so it is not difficult for them. However, for a person that is new to using it, the setup might be very difficult.

    What about the implementation team?

    We have a team of experts in my company, and they handle it very well.

    What's my experience with pricing, setup cost, and licensing?

    This is an open-source tool, so it can be used free of charge. There is no cost involved.

    What other advice do I have?

    We are not using the current version of this platform, Spark 3. However, we do know that it is used in the market and it has new features. We will eventually move to it.

    My advice for anybody who wants to use Apache Spark is that they have two options. The first is to go with Databricks, the company founded by the creators of Apache Spark, and use their proprietary version. If you choose this option then you will have to pay for the product.

    If instead, you use Apache Spark, then you can rely on your own expert in-house team for support, maintenance, and deployment. In this option, you don't have to pay anything to anybody outside of your company.

    I would rate this solution an eight out of ten.

    Which deployment model are you using for this solution?

    Hybrid Cloud
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    Oscar Estorach - PeerSpot reviewer
    Chief Data-strategist and Director at theworkshop.es
    Real User
    Top 5 Leaderboard
    Scalable, open-source, and great for transforming data
    Pros and Cons
    • "The solution has been very stable."
    • "It's not easy to install."

    What is our primary use case?

    You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.

    What is most valuable?

    Overall, it's a very nice tool.

    It is great for transforming data and doing micro-streaming or micro-batching.

    The product offers an open-source version.

    The solution has been very stable.

    The scalability is good.

    Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms. 

    Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.

    What needs improvement?

    If you are developing projects and need to put them into a production scenario, you will need a cluster of servers, as it requires distributed computing.

    It's not easy to install. You are typically dealing with a big data system.

    It's not a simple, straightforward architecture. 

    For how long have I used the solution?

    I've been using the solution for three years.

    What do I think about the stability of the solution?

    The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution. 

    What do I think about the scalability of the solution?

    We have found the scalability to be good. If your company needs to expand it, it can do so.

    We have five people working on the solution currently.

    How are customer service and technical support?

    There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.

    Which solution did I use previously and why did I switch?

    I also use Databricks, which I use in the cloud.

    How was the initial setup?

    When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.

    I am not a professional admin. I am a developer and I design architecture.

    You can use it in your standalone system, however, it's not the best way. It would be okay for little branch codes, not for production.

    What's my experience with pricing, setup cost, and licensing?

    We use the open-source version. It is free to use. However, you do need to have servers. We have three or four. They can be on-premises or in the cloud. 

    What other advice do I have?

    I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.

    I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities. 

    I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.

    Which deployment model are you using for this solution?

    On-premises
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    Co-Founder at a tech vendor with 11-50 employees
    Real User
    Top 5
    Handles large volume data, cloud and on-premise deployments, but difficult to use
    Pros and Cons
    • "Apache Spark can do large volume interactive data analysis."
    • "Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn."

    What is our primary use case?

    The solution can be deployed on the cloud or on-premise.

    How has it helped my organization?

    We are using Apache Spark for large-volume interactive data analysis.

    MechBot is an enterprise, one-click installation, trusted data excellence platform. Underneath, I am using Apache Spark, Kafka, Hadoop HDFS, and Elasticsearch.

    What is most valuable?

    Apache Spark can do large volume interactive data analysis.

    What needs improvement?

    Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn.

    For how long have I used the solution?

    I have been using Apache Spark for approximately 11 years.

    What do I think about the stability of the solution?

    The solution is stable.

    What do I think about the scalability of the solution?

    Apache Spark is scalable. However, it needs enormous technical skills to make it scalable. It is not a simple task.

    We have approximately 20 people using this solution.

    How was the initial setup?

    If you want to distribute Apache Spark in a certain way, it is simple. Not every engineer can do it. What is required is specialized DevOps skills on Spark.

    If we are going to deploy the solution in a one-layer laptop installation, it is very straightforward, but this is not what someone is going to deploy in the production site.

    What's my experience with pricing, setup cost, and licensing?

    Since we are using the Apache Spark version, not the Databricks version, it is an Apache license version, and support and bug resolution are late or delayed. The Apache license is free.

    What other advice do I have?

    We are well versed in Spark, the version, the internal structure of Spark, and we know what exactly Spark is doing. 

    The solution cannot be easier. Everything cannot be made simpler because it involves core data, computer science, pro-engineering, and not many people are actually aware of it.

    I rate Apache Spark a six out of ten.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    AmitMataghare - PeerSpot reviewer
    Associate Director at PwC
    Real User
    Top 20
    High performance, beneficial in-memory support, and useful online community support
    Pros and Cons
    • "One of Apache Spark's most valuable features is that it supports in-memory processing, so jobs execute very fast compared to traditional tools."
    • "Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors."

    What is our primary use case?

    Apache Spark is a data processing framework that you program in languages such as Java or Python. In my most recent deployment, we used Apache Spark to build engineering pipelines to move data from sources into the data lake.

    What is most valuable?

    One of Apache Spark's most valuable features is that it supports in-memory processing, so jobs execute very fast compared to traditional tools.

    What needs improvement?

    Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors.

    For how long have I used the solution?

    I have been using Apache Spark for approximately five years.

    What do I think about the stability of the solution?

    Apache Spark is stable.

    What do I think about the scalability of the solution?

    I have found Apache Spark to be scalable.

    How are customer service and support?

    Apache Spark is open-source, so there is no team that will give you dedicated support, but you can post your queries on the community forums, and usually you will receive a good response. Since it's open-source, you depend on freelance developers to respond to you, and you cannot put a time limit on that, but the response, on average, is pretty good.

    How was the initial setup?

    If Apache Spark is in the cloud, setting it up will require only minutes. If it's on Amazon, GCP, or Microsoft cloud, it'll take minutes to set everything up. However, if you are using the on-premise version, then it might take some time to set up the environment.

    What other advice do I have?

    I rate Apache Spark an eight out of ten.

    Which deployment model are you using for this solution?

    Public Cloud

    If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

    Amazon Web Services (AWS)
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    Senior Test Automation Consultant / Architect at a tech services company with 11-50 employees
    Consultant
    Useful for big data and scientific purposes, but needs better query handling, stability, and scalability
    Pros and Cons
    • "It is useful for handling large amounts of data. It is very useful for scientific purposes."
    • "We are building our own queries on Spark, and it can be improved in terms of query handling."

    What is our primary use case?

    We are using it for big data. We are using a small part of it, which is related to using data.

    What is most valuable?

    It is useful for handling large amounts of data. It is very useful for scientific purposes.

    What needs improvement?

    There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.

    They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.

    We are building our own queries on Spark, and it can be improved in terms of query handling.

    For how long have I used the solution?

    In my company, it has been used for several years, but I have been using it for seven months.

    What do I think about the scalability of the solution?

    It is not scalable. Scalability is one of the issues.

    How are customer service and support?

    It is open source from my point of view. So, there is no support.

    What other advice do I have?

    I would advise not using it if you don't have experienced users inside your organization. If you have to figure it all out on your own, then you shouldn't start with it.

    Overall, I would rate it a six out of 10. For a commercial use case, it is a six out of 10. For scientific purposes, it is an eight out of 10.

    Which deployment model are you using for this solution?

    On-premises
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    Onur Tokat - PeerSpot reviewer
    Big Data Engineer Consultant at Collective[i]
    Consultant
    Top 20
    Scala-based solution with good data evaluation functions and distribution

    What is our primary use case?

    I mainly use Spark to prepare data for processing because it has APIs for data evaluation. 

    What is most valuable?

    The most valuable feature is that Spark uses Scala, which has good data evaluation functions. Spark also supports good distribution on the clusters and provides optimization on the APIs.

    What needs improvement?

    Spark could be improved by adding support for other open-source storage layers than Delta Lake. The UI could also be enhanced to give more data on resource management.

    For how long have I used the solution?

    I've been using Spark for six years.

    What do I think about the stability of the solution?

    Generally, Spark works correctly without any errors. It may give out some errors if your data changes, but in that case, it's a problem with the configuration, not with Spark.

    What do I think about the scalability of the solution?

    The cloud version of Spark is very easy to scale.

    How was the initial setup?

    The initial setup is not complex, but it depends on the components in the architecture. For example, if you use Hadoop, setup may not be easy. Deployment takes about a week, but the Spark cluster can be installed in the virtual architecture in a day.

    What other advice do I have?

    Spark can handle small to huge data and is suitable for any size of company. I would rate Spark as eight out of ten. 

    Which deployment model are you using for this solution?

    On-premises
    Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
    Buyer's Guide
    Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
    Updated: November 2022