Spark SQL Reviews

Name: Spark SQL
Brand: Apache
Rating: 3.9 (15 reviews)

86% willing to recommend

What is Spark SQL?

Spark SQL leverages SQL capabilities to process large datasets, offering high performance, seamless integration with Spark programs, and the ability to run parallel queries. It supports Hive interoperability and facilitates data transformation with DataFrames and Datasets.

Spark SQL is ranked #5 among Hadoop solutions. PeerSpot users give Spark SQL an average rating of 7.8 out of 10. Spark SQL is most commonly compared to Apache Spark. Spark SQL is popular among the large enterprise segment, which accounts for 53% of users researching this solution on PeerSpot. The top industry researching this solution is financial services, accounting for 21% of all views.

Featured Spark SQL reviews

SurjitChoudhury

Data engineer at Cocos pt

My experience with the initial setup of Spark SQL was relatively smooth. Understanding the system wasn't overly difficult because the data was structured in databases, and we could use notebooks for coding in Python or Java. Configuring networks and running scripts to load data into the database were routine tasks that didn't pose significant challenges. The flexibility to use different languages for coding and the ability to process data using key-value pairs in Python made the setup adaptable. Once we received the source data, processing it in SparkSQL involved writing scripts to create dimension and fact tables, which became a standard part of our workflow. Setting up Spark SQL was reasonably quick, but sometimes we face performance issues, especially during data loading into the SQL Server data warehouse. Sequencing notebooks for efficient job runs is crucial, and managing complex tasks with multiple notebooks requires careful tracking. Exploring ways to optimize this process could be beneficial. However, once you are familiar with the database architecture and project tools, understanding and adapting to the system become more straightforward.

Spark SQL mindshare

As of June 2026, the mindshare of Spark SQL in the Hadoop category stands at 5.1%, down from 10.5% a year earlier, according to calculations based on PeerSpot user engagement data.

Hadoop Mindshare Distribution
Product	Mindshare (%)
Spark SQL	5.1%
Cloudera Distribution for Hadoop	14.7%
Apache Spark	13.9%
Other	66.3%

PeerResearch reports based on Spark SQL reviews

Type	Title	Date
Category	Hadoop	Jun 22, 2026
Product	Reviews, tips, and advice from real users	Jun 22, 2026
Comparison	Spark SQL vs Apache Spark	Jun 22, 2026
Comparison	Spark SQL vs Cloudera Distribution for Hadoop	Jun 22, 2026
Comparison	Spark SQL vs Amazon EMR	Jun 22, 2026

Valuable Features

"Speed is the major benefit of using Spark SQL."
"Spark SQL gives us a handful of methods to design queries based on its own syntax and also incorporates the regular SQL syntax within tasks."
"We use it to gather all the transaction data."

Room for Improvement

"Spark SQL consumes so many resources that we migrated our streaming job from Spark to Apache Flink."
"There are many inconsistencies in syntax for the different querying tasks like selecting columns and joining between two tables so I'd like to see a more consistent syntax."
"Being a new user, I am not able to find out how to partition it correctly."

Pricing

"We don't have to pay for licenses with this solution because we are working in a small market, and we rely on open-source because the budgets of projects are very small."
"We use the open-source version, so we do not have direct support from Apache."
"The on-premise solution is quite expensive in terms of hardware, setting up the cluster, memory, hardware and resources. It depends on the use case, but in our case with a shared cluster which is quite large, it is quite expensive."

These insights are based on the in-depth reviews provided by peers to help you make a better buying decision.

Review data by company size

By reviewers
Company Size	Count
Small Business	5
Midsize Enterprise	6
Large Enterprise	4

By visitors reading reviews
Company Size	Count
Small Business	29
Midsize Enterprise	8
Large Enterprise	41

Top industries

By visitors reading reviews

Financial Services Firm: 21%
University: 12%
Healthcare Company
Manufacturing Company
Comms Service Provider
Retailer
Computer Software Company
Construction Company
Insurance Company
Government
Marketing Services Firm
Outsourcing Company
Performing Arts
Real Estate/Law Firm
Hospitality Company
Import And Exporter
Logistics Company
Educational Organization
Media Company

Learn more about Spark SQL

Spark SQL enables efficient data engineering, transformation, and analytics for organizations dealing with large-scale data processing. It supports big data queries, builds data pipelines and warehouses, and interfaces with various databases, especially in distributed settings such as Hadoop and Azure. Users employ Spark SQL to establish business logic in Jupyter notebooks and facilitate data loading into SQL Server, enabling analytics with tools like Power BI. The documentation and flexibility to manage extensive data processing are valued by users, although a steep learning curve and documentation clarity are noted challenges. Enhancements for data visualization, GUI, and resource management alongside better integration with tools like Tableau are recommended.

What are the key features of Spark SQL?

Query Language: Supports complex SQL queries for effective large dataset processing.
Seamless Integration: Easily integrates into Spark programs, boosting performance and speed.
Parallel Queries: Capable of running numerous queries in parallel to enhance processing efficiency.
Interoperability with Hive: Facilitates interaction within distributed data ecosystems efficiently.
Data Transformation: Utilizes DataFrames and Datasets for flexible data manipulation.
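
The integration and DataFrame features listed above can be sketched in PySpark. This is a minimal illustration rather than anything taken from the reviews: the sample rows, the `orders` view, and the function name are hypothetical, and it assumes a local PySpark installation. It runs one aggregation twice, once as SQL and once through the DataFrame API.

```python
# Hypothetical sample rows; the table and column names are illustrative only.
ROWS = [("alice", "books", 12.5), ("bob", "books", 7.5), ("alice", "games", 30.0)]

TOTALS_SQL = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""


def totals_with_sql_and_dataframe():
    """Run the same aggregation through SQL and through the DataFrame API."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
    try:
        orders = spark.createDataFrame(ROWS, ["customer", "category", "amount"])
        # Registering a temporary view makes the DataFrame queryable by name in SQL.
        orders.createOrReplaceTempView("orders")

        via_sql = spark.sql(TOTALS_SQL).collect()
        via_api = (
            orders.groupBy("customer")
            .agg(F.sum("amount").alias("total"))
            .orderBy("customer")
            .collect()
        )
        # Row objects are tuples, so they convert directly for comparison.
        return [tuple(r) for r in via_sql], [tuple(r) for r in via_api]
    finally:
        spark.stop()
```

Calling `totals_with_sql_and_dataframe()` should return two lists with the same per-customer totals, which is the sense in which SQL and DataFrame code are interchangeable within one Spark program.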

What benefits or ROI should users consider in reviews?

Performance: Offers high-speed data processing for demanding workloads.
Scalability: Efficiently handles increasing data volumes and processing demands.
Ease of Use: Facilitates simpler SQL-based data management operations.
Flexibility: Adapts to various data sources and formats for broad usability.

In industries, Spark SQL is a critical part of data engineering, transformation, and analytics. It empowers organizations to manage big data processing and analytics in sectors like finance, healthcare, and telecommunications. By enabling seamless data pipeline creation, it supports real-time business decision-making processes and data-driven strategies across sectors.

Spark SQL customers

UC Berkeley AMPLab, Amazon, Alibaba Taobao, Kenshoo, Hitachi Solutions

Product Categories

Hadoop

Popular Comparisons

Apache Spark vs Spark SQL

Cloudera Distribution for Hadoop vs Spark SQL

IBM Netezza Performance Server vs Spark SQL

Amazon EMR vs Spark SQL

HPE Data Fabric vs Spark SQL

IBM Analytics Engine vs Spark SQL

IBM Db2 Big SQL vs Spark SQL

Spark SQL Reviews Summary
Author info	Rating	Review Summary
Team Lead, Data Engineering at Nesine.com	4.0	I use Spark SQL for batch processing and transformations, appreciating its speed and Hive interoperability. However, its high resource consumption led me to migrate streaming jobs to Apache Flink, despite its ease of development.
Data engineer at Cocos pt	3.5	I use Spark SQL for data processing from various sources, integrating efficiently with our CI/CD workflow via Azure DevOps. It offers flexible and scalable data handling, although stability could be improved. Transitioning from Apache Hive enhanced our performance significantly.
Principal Consultant/Manager at Tenzing	4.0	We use PySpark for big data processing with Spark SQL on Microsoft Azure, appreciating its SQL connectivity and ease of use while suggesting improvements in documentation and SparkUI for better performance insights. Spark SQL facilitates complex task implementation using SQL.
Data Engineer at Behsazan Mellat	4.5	We use Spark SQL for business analytics in our HDFS environment to handle large data volumes efficiently. Its capability to run parallel queries is a key advantage over Python. Integration with data visualization tools like Tableau would enhance its functionality.
Data Engineer at BBD	4.0	We use Spark SQL for data engineering, transformation, and querying with around 30–40 users. Its powerful query language benefits us, but it has a steep learning curve. Previously, we used pandas and Dask, which were less scalable than Spark SQL.
Senior Analyst/ Customer Business and Insights Specialist at a tech services company with 501-1,000 employees	4.0	Our company uses Spark SQL for creating pipelines and data sets, finding it easy to use with basic SQL knowledge, especially for analytics within specific use cases. However, it could improve by offering more in-solution guidance on aggregate functions.
Lecturer at Amirkabir University of Technology	4.0	I use Spark SQL for data preparation and querying, valuing its flexible methods and good documentation. While I recommend it for being stable and scalable, I wish for more consistent syntax across different tasks.
CTO at Dokument IT d.o.o.	5.0	I used Spark SQL for analytics and statistical reports from content management platforms. The Thrift connection is valuable, but I've faced on-premise Delta Lake compatibility issues. The documentation lacks detail, especially for Thrift server setup, and interactive queries need improvement.
Engineering Manager/Solution architect at a computer software company with 201-500 employees	4.0	I find this solution stable, scalable, and useful within a distributed ecosystem, with straightforward setup and no licensing costs. I recommend it at 8/10, though it needs better EMR monitoring and integration.
Associate Manager at a consultancy with 501-1,000 employees	5.0	I use Spark SQL for data validation and queries. Its ease of use and validation are valuable, despite needing better integration. It's stable, scalable, free, and I rate it ten out of ten.

Kemal Duman

Team Lead, Data Engineering at Nesine.com

Jan 21, 2026

Data pipelines have run faster and support flexible batch and streaming transformations

What is our primary use case?

Spark SQL has been in our stack for less than one year, though some of our colleagues are using it. It is a useful product for transformation jobs.

We generally use Spark SQL for batch processing. We use it for general batch operations and have used it for some streaming jobs.

How has it helped my organization?

We did not spend much additional effort on Spark jobs.

It is a useful product for transformation jobs.

What is most valuable?

Speed is the major benefit of using Spark SQL.

Spark SQL is interoperable with Hive. While migrating from HDFS to Iceberg, we did not need to change our Spark SQL job configurations, only the location or type of connection from Hive to Trino or S3. This interoperability proved to be very useful.

What needs improvement?

We do not have any performance problems, but we do have some resource problems. Spark SQL consumes so many resources that we migrated our streaming job from Spark to Apache Flink.

Resource management in Spark SQL should be better. It consumes more resources, which is normal. The main reason we switched from Spark is memory and CPU consumption. The major reason is the resource problem because the number of streaming jobs has been increasing in our company. That is why we considered resource management as a priority.

Apart from the resource consumption, I would say development with Spark SQL is better. For development purposes, it is a top product and not difficult to work with, but resources are the major problem. We changed to Flink regardless of development time. Development time is less in Spark compared with Flink.

Which solution did I use previously and why did I switch?

We first migrated from Spark to NiFi for some streaming tasks. It is a good alternative for us, but we changed our decision from NiFi to Flink because NiFi had some problems while ingesting data to Iceberg or S3.

Which other solutions did I evaluate?

We are not looking for any other options. We generally use open-source products and are supporters of the open-source movement. Our tendency is to use open-source products, so Apache products are at the top of our list.

We are not paying any money for Apache products. We always use the free versions, the open-source versions.

We are currently dealing with Iceberg, Flink, Airflow, and Trino.

What other advice do I have?

Regarding the Catalyst query optimizer, I think we are using it. We were using it in the past, but I am not certain if we use it now. We used it a long time ago.

I rate my experience with Spark SQL as an eight out of ten. I have been working in this field for eight years.

SurjitChoudhury

Data engineer at Cocos pt

Nov 23, 2023

Offers the flexibility to handle large-scale data processing

What is our primary use case?

I employ Spark SQL for various tasks. Initially, I gathered data from databases, SAP systems, and external sources via SFTP, storing it in blob storage. Using Spark SQL within Jupyter notebooks, I define and implement business logic for data processing. Our CI/CD process, managed with Azure DevOps, oversees the execution of Spark SQL scripts, facilitating data loading into SQL Server. This structured data is then used by analytics teams, particularly in tools like Power BI, for thorough analysis and reporting. The seamless integration of Spark SQL in this workflow ensures efficient data processing and analysis, contributing to the success of our data-driven initiatives.

What is most valuable?

I find Spark SQL's seamless integration of SQL queries with Spark programs and its use of DataFrames and Datasets particularly valuable. While we mostly stick to traditional T-SQL, Spark SQL brings flexibility to handle large-scale data processing. The ability to write SQL queries, even with minor adjustments for functions like LICA, simplifies our data transformation. Although the syntax differs from traditional SQL, Spark SQL's efficiency in managing distributed data and its simplicity in expressing complex operations make it an essential part of our data pipeline.

What needs improvement?

In terms of improvement, the only thing that could be enhanced is the stability aspect of Spark SQL. There could be additional features that I haven't explored but the current solution for working with databases seems effective. I haven't worked extensively with all components, so there might be untapped features that could enhance the solution's value.

For how long have I used the solution?

I have been working with Spark SQL for three years.

What do I think about the stability of the solution?

In terms of stability, I would rate Spark SQL a four out of ten. While it generally performs well, there are occasional challenges, especially when dealing with substantial amounts of data, like creating fact tables. Despite using Spark functions to optimize processes, it can still take a considerable amount of time, impacting overall stability.

Which solution did I use previously and why did I switch?

Before choosing to work with Spark SQL, we used Apache Hive. We switched to Spark from a previous solution to stay current with the latest technology trends. As big data became prominent, Spark emerged as a faster and more robust alternative with its in-memory processing capabilities. The transition was relatively straightforward, especially because SQL knowledge could be easily applied, with many functions having equivalent counterparts in Spark. The move was motivated by the improved performance and scalability that Spark offered for our data processing requirements.

How was the initial setup?

What other advice do I have?

Overall, I would rate Spark SQL as a seven out of ten.

Sahil Taneja

Principal Consultant/Manager at Tenzing

May 5, 2023

Easy to use and does not require a learning curve

What is our primary use case?

We are using PySpark for big data processing, like multiple competitors of stock. We process it in memory using DataFrames and Spark SQL. We are using it along with the database to process the big data, especially the special Azure data.

We are using PySpark. Databricks itself provides an environment that is pre-installed with Spark.

How has it helped my organization?

We are using Spark SQL in Databricks. There are three ways to code: Task Equals, PySpark, and Scala. Some team members are also using Spark SQL. They were new to Databricks and recent graduates, but they knew SQL and were able to code in Spark SQL. It's a great tool to use altogether because they were able to learn along with it. They were implementing some basic stuff.

What is most valuable?

With Spark SQL, I would say the best feature is the connectivity and the ease of use with SQL. The team members don't have to learn a new language and can implement complex tasks very easily using only SQL.

We don't need to use the system much and can use Spark features along with the SQL code.

What needs improvement?

Spark SQL can improve the documentation they have provided. It can be a bit unclear at times. They could improve the documentation a bit more so that we can understand it more easily.

Moreover, they could improve the Spark UI to offer more advanced views of performance and queries.

What do I think about the scalability of the solution?

There are around 30-40 users in our company using Spark SQL.

What other advice do I have?

It's pretty good to use in the initial phases. Overall, I would rate the solution an eight out of ten.

Which deployment model are you using for this solution?

On-premises

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Microsoft Azure

Aria Amini

Data Engineer at Behsazan Mellat

Jul 26, 2023

A great solution for handling large volumes of data by parallel queries

What is our primary use case?

We have an HDFS environment for archiving data when there is an enormous volume of data, and the solution helps retrieve data from our HDFS archive. Developers use the solution for business analytics.

What is most valuable?

One of Spark SQL's most beautiful features is running parallel queries to go through enormous data. We can run queries in parallel and retrieve more data and results in aggregation. Spark does this faster than Python.

What needs improvement?

It would be useful if Spark SQL integrated with some data visualization tools. For example, we could integrate Spark SQL with Tableau for data visualization.

For how long have I used the solution?

I've been using Spark SQL for one year.

What do I think about the stability of the solution?

I do not have any issues with stability in Spark SQL.

What do I think about the scalability of the solution?

There are challenges to scalability with Spark SQL, but it is not too complex. Four users use Spark SQL professionally, but other developers use the solution in some minor cases. We are in the data engineering department, but our PI department also uses Spark SQL. In total, ten developers use Spark SQL in our organization.

How are customer service and support?

Customer service is good in general.

How was the initial setup?

We used Ambari for Spark SQL's installation. We used Ambari to install some IDEs like Zeppelin and altering and Python. We used the tool and the Zeppelin IDE. Installation is easy, but it can get complex if you want to use Spark SQL's cluster feature as well. But overall, the installation is not complex. It takes two or three days to deploy the solution if you want to install it on the Zeppelin IDE and the Hadoop cluster. We needed one engineer to install and deploy the solution. Four engineers and some developers are working on the solution and doing development work in this environment.

The solution just requires one person for maintenance because of the Ambari framework.

What's my experience with pricing, setup cost, and licensing?

We use the open-source version, so we do not have direct support from Apache.

Which other solutions did I evaluate?

We use the Python directory, and in some cases, we use Apache Hive on Python. We use Hive and Spark SQL simultaneously. We switched to PySpark from Python.

Spark SQL is better than Python for running queries in parallel. The main difference between Spark, or PySpark, and Python pandas is that Spark SQL runs in parallel across a cluster. The main difference between Apache Hive and PySpark is that PySpark is more flexible than Apache Hive because Apache Hive arranges known scenarios, but PySpark queries on the fly.
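
The contrast drawn above between single-process Python and cluster execution comes down to splitting data into partitions, aggregating each partition independently, and merging the partial results. The plain-Python sketch below illustrates that pattern only. It is not Spark code, the function names are hypothetical, and threads merely stand in for the executors a cluster would provide.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_totals(partition):
    """Aggregate one partition on its own, as a single worker would."""
    totals = Counter()
    for key, amount in partition:
        totals[key] += amount
    return totals


def parallel_totals(rows, num_partitions=4):
    """Split rows into partitions, aggregate each concurrently, then merge."""
    # Round-robin split: partition i takes every num_partitions-th row.
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_totals, partitions))
    # Merge step: Counter.update adds values for keys seen in several partitions.
    merged = Counter()
    for part in partials:
        merged.update(part)
    return dict(merged)
```

For example, `parallel_totals([("a", 1), ("b", 2), ("a", 3)])` sums the amounts per key across all partitions. On a small input like this the partitioning only adds overhead, which matches the reviewer's point that the approach pays off for large volumes rather than medium ones.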

What other advice do I have?

If the user has a large volume of data, I think they should use PySpark, but for scenarios with a medium amount of data, they should not use PySpark because it adds some overhead. I rate Spark SQL a nine out of ten.

Lucas Dreyer

Data Engineer at BBD

Jan 4, 2023

Processing solution used for data engineering and transformation with the ability to process large datasets

What is our primary use case?

We use this solution for data engineering, data transformation, preparing data for machine learning and doing queries. We have between 30 and 40 users making use of this solution.

How has it helped my organization?

Certain data sets that are very large are very difficult to process with Pandas and Python libraries. Spark SQL has helped us a lot with that.

What is most valuable?

The query language and using it to process large datasets has been the biggest benefit for us.

What needs improvement?

It takes a bit of time to get used to using this solution versus pandas as it has a steep learning curve. You need quite a high level of skill with SQL in general to use this solution. If SQL is not someone's primary language, they might find it difficult to get used to.

This solution could be improved if there was a bridge between pandas and Spark SQL, such as translating from pandas operations to SQL and then working with the queries that are generated.

In a future release, it would be useful to have a real time dashboard versus batch updates to Power BI.

For how long have I used the solution?

I have been using this solution for four years.

What do I think about the stability of the solution?

I would rate it a nine out of ten for stability. We have not had any crashes or major bugs. It's just a case of behavior that is different than expected and then changing the queries, but I haven't had stability issues.

What do I think about the scalability of the solution?

I haven't had major issues with scaling very large datasets. I would rate it a nine out of ten for scalability.

How are customer service and support?

We have worked with Hortonworks for the management of this solution. In terms of online help, there is a lot of information.

Which solution did I use previously and why did I switch?

We have previously used pandas DataFrames, but they are typically not very scalable. I have also used Dask, the distributed version of pandas, but neither of these solutions works as well as Spark and Spark SQL.

How was the initial setup?

In terms of setting up the on-premise cluster, it can be quite complex. I would rate it a six or seven in terms of complexity. Using it on the cloud is very straightforward.

We're using it mostly on-premise, but we also have cloud instances where we use Spark, so we have a mix of use cases.

The deployment can be done by one person. Typically, we have bigger teams of two or three people at least, but one person can look after it and maintain it.

What about the implementation team?

The deployment was done between the in-house team and Hortonworks external team. Our internal team included developers and data engineers.

What's my experience with pricing, setup cost, and licensing?

The on-premise solution is quite expensive in terms of hardware, setting up the cluster, memory, hardware and resources. It depends on the use case, but in our case with a shared cluster which is quite large, it is quite expensive.

I would rate the pricing a seven or eight out of ten; however, it is easy to run into pricing issues with something like a Databricks cluster if you don't manage usage properly.

What other advice do I have?

Training is quite important to get users up to scratch with Spark SQL and Spark. Planning is needed in terms of training and skillsets. In terms of the typical DevOps and MLOps deployment with pipelines, this training is particularly important. Otherwise you may end up with lots of functionality and queries that are difficult to change, deploy or maintain.

I would rate this solution an eight out of ten. In terms of scalability, it is very useful.

Which deployment model are you using for this solution?

Hybrid Cloud

Keshav Mandal

Senior Analyst/ Customer Business and Insights Specialist at a tech services company with 501-1,000 employees

Nov 22, 2022

Analytics are easy because data is contained within each use case

What is our primary use case?

Our company uses the solution to create pipelines and data sets. The ETL process transforms the data and certain written aggregations convert the raw data to data sets. The data sets are then exported to tables for dashboards.

What is most valuable?

The solution is easy to understand if you have basic knowledge of SQL commands.

Projects sit within the Spark scope and there are multiple options for data sets such as closed, private, or public.

It is easy to perform analytics because data is contained within each use case. For example, you request data for a particular use case, receive the data link, and import it for analytics.

What needs improvement?

It would be beneficial for aggregate functions to include a code block or toolbox that explains calculations or supported conditional statements. Multiple functions come within an aggregate so it is important to understand them. When you are trying to do something new, it would be easier and quite unique to get information within the solution rather than having to search the web.

For example, once you select an aggregate it tells you what type of functions the solution can perform and includes a code block explaining its calculations. Or, a certain conditional statement gives you a second option or explains other types of statements the solution performs as part of a rule-level function.

For how long have I used the solution?

I have been using the solution for fourteen months.

What do I think about the stability of the solution?

The solution is stable.

What do I think about the scalability of the solution?

The scalability depends on administrative rights. Every use case has certain allocated resources so a use case demanding scalability or extensive use can have additional resources allocated to it.

How are customer service and support?

I have not experienced any issues with the solution so have not needed technical support.

Which solution did I use previously and why did I switch?

I have been using SQL to extract data throughout my five-year career.

How was the initial setup?

The setup is very straightforward so I rate it a ten out of ten.

What about the implementation team?

We implemented the solution in-house.

What's my experience with pricing, setup cost, and licensing?

The solution is bundled with Palantir Foundry at no extra charge.

Which other solutions did I evaluate?

Our company gives us the freedom to use Python, R, PySpark, or SQL, so we have many tools available. Our team includes 17 developers and 25% of them use the solution.

The solution is way better than Oracle SQL because Oracle takes a lot of effort to understand and use.

The solution is similar to the format of MS SQL. With MS, there are defined data sources that place restrictions on what you are supposed to use. Sometimes we had to make sure we had a way through the restrictions. For example, if we didn't have access to a physical table then we had to create a duplicate instance or view of it. We could see the values but couldn't manipulate them because we didn't have access to the physical table. The effect of MS restrictions is based on the complexity of a project and any privacy-related data constraints.

For the solution, use cases sit within the Spark scope so you get multiple options for creating them. You can individually set each use case as closed, private, or public. You can run analytics for each use case because the data is contained within it. This process is much easier when compared to Oracle SQL or MS SQL.

What other advice do I have?

The solution is very similar to the generic Spark and SQL language.

I rate the solution an eight out of ten.

Mahdi Sharifmousavi

Lecturer at Amirkabir University of Technology

Aug 10, 2022

Incorporates regular SQL syntax within tasks and very useful for querying and depicting data

How has it helped my organization?

Spark SQL has enabled us to perform our data preparation tasks within our analytical code.

What is most valuable?

Spark SQL gives us a handful of methods to design queries based on its own syntax and also incorporates the regular SQL syntax within tasks. It's also a good tool for querying and depicting data. It's very good for us and it's helpful that we have access to a lot of good documentation.

What needs improvement?

There are many inconsistencies in syntax for the different querying tasks like selecting columns and joining between two tables so I'd like to see a more consistent syntax. Notations should be unified for all tasks within Spark SQL.
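
To make the point about differing notations concrete, the sketch below spells one column selection four ways and one join three ways in PySpark. It is an illustration under assumptions, not the reviewer's code: the `people` and `cities` tables, their columns, and the function name are hypothetical, and a local PySpark installation is assumed.

```python
# Hypothetical tables and columns, used only to compare notations.
PEOPLE = [(1, "Ann"), (2, "Bo")]
CITIES = [(1, "Oslo"), (2, "Lima")]

JOIN_SQL = """
    SELECT p.id, p.name, c.city
    FROM people p
    JOIN cities c ON p.id = c.id
"""


def compare_notations():
    """Return the column lists produced by three spellings of one join."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("notations").getOrCreate()
    try:
        people = spark.createDataFrame(PEOPLE, ["id", "name"])
        cities = spark.createDataFrame(CITIES, ["id", "city"])

        # Four notations that all select the same column.
        selections = [
            people.select("name"),
            people.select(people.name),
            people.select(people["name"]),
            people.select(F.col("name")),
        ]
        assert all(s.columns == ["name"] for s in selections)

        # Joining on a column name keeps a single `id` column ...
        by_name = people.join(cities, on="id", how="inner")
        # ... while joining on an expression keeps `id` from both sides.
        by_expr = people.join(cities, people.id == cities.id, "inner")

        # The same join written in SQL against temporary views.
        people.createOrReplaceTempView("people")
        cities.createOrReplaceTempView("cities")
        by_sql = spark.sql(JOIN_SQL)

        return by_name.columns, by_expr.columns, by_sql.columns
    finally:
        spark.stop()
```

The three joins return the same rows, but the expression-based form carries a duplicate `id` column, which is one example of how equivalent-looking notations can behave differently.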

For how long have I used the solution?

I've been using this solution for seven years.

What do I think about the stability of the solution?

The solution is stable.

What do I think about the scalability of the solution?

We only have 15 users but the solution is scalable.

How are customer service and support?

I generally get assistance from Stack Overflow and other websites as well as from Udemy and Lynda. I haven't contacted technical support.

Which solution did I use previously and why did I switch?

Up until a year or so ago I was using RapidMiner for data preparation for educational purposes, but it's not as scalable as Spark SQL. If you connect RapidMiner to Apache Hadoop and use its interface within a big data environment, it's good, but I use RapidMiner locally. It's very slow and scalability is very poor for data preparation. We also used SQL Server but found that its scalability and speed are not as good as those of Spark SQL.

How was the initial setup?

Deployment was carried out by our infrastructure department.

What's my experience with pricing, setup cost, and licensing?

The solution is open source but you pay for any extra features.

What other advice do I have?

I recommend this solution. Spark provides good, clear documentation that is well organized.

Which deployment model are you using for this solution?

On-premises

Slaven Batnozic

CTO at Dokument IT d.o.o.

Aug 18, 2023

If implemented well, the solution is highly compatible and great for data analysis

What is our primary use case?

We used the solution for analytics of data and statistical reports from content management platforms.

What is most valuable?

I find the Thrift connection valuable.

What needs improvement?

I'm using DBeaver to connect Spark with external tools. I've experienced some incompatibilities when using the Delta Lake format. It is compatible when you're using Databricks on the cloud, but when I'm using Spark on-premises, there are some incompatibility issues. We expect interactive queries with Dremio to provide better results. With Spark SQL, we issue a query but see that it runs as a batch process in the background. The documentation is also limited, especially for setting up Thrift servers.

For how long have I used the solution?

I have been using the solution for a year.

What do I think about the stability of the solution?

I rate Spark SQL's stability between nine and ten out of ten because I didn't have any problems with it.

What do I think about the scalability of the solution?

Spark SQL's scalability is excellent. In production, there will only be a few users who are analysts using statistical reports. Queries have many joins, and one query, on average, has seven to 12 joins. There are no plans to increase usage because I'm working on creating markets. There are higher management staff analysts for the data platform, and we have plans to expand the business with data platforms.

How was the initial setup?

The setup process for Spark is not well documented, but that's expected because the solution is open source. You must work around various obstacles, but this is usual for an open-source solution. You could hire people from Databricks, and they can fix nearly anything.

Once you learn all the tricks, you can deploy the solution very fast, in about one hour. But that applies just to the development environment. We are not in production right now. I tested it on Windows and on Ubuntu, and everything works well. But you have to reinvent the wheel because the documentation is incomplete.

The deployment process is based on bash scripts. I was considering making Ansible playbooks and custom roles in Ansible, but I didn't have the time, though this is the plan. I want to move from the bash scripts to Ansible because I prefer a declarative approach in software engineering. I have plans to totally automate the deployment, so that one experienced engineer would be enough. The solution's final deployment would be on a Kubernetes cluster, and the infrastructure would be set up with Terraform and Ansible. Everything will be heavily optimized.

What's my experience with pricing, setup cost, and licensing?

We don't have to pay for licenses with this solution because we are working in a small market, and we rely on open-source because the budgets of projects are very small.

What other advice do I have?

I recommend Spark SQL, but I will need to see the results of our evaluation of Dremio. I'm especially expecting good performance because of its reflection mechanisms, which are effectively materialized views. The open question is the refresh rate. I don't know how bad or good that is.

I rate Spark SQL a ten out of ten with the correct implementation.

reviewer1724670

Engineering Manager/Solution architect at a computer software company with 201-500 employees

Dec 2, 2021

Useful tool within a distributed ecosystem

What is our primary use case?

The primary use case of this solution is to function within a distributed ecosystem. Spark is part of EMR, a Hadoop distribution, and is one of the tools in the ecosystem. You are not working with Hadoop in a vacuum—you leverage Spark, Hive, HBase—because it is just a distributed ecosystem. It has no value within itself.

This solution can be deployed both on the cloud and on Cloudera distributions.

What is most valuable?

This solution is useful to leverage within a distributed ecosystem.

What needs improvement?

This solution could be improved by adding monitoring and integration for EMR.

For how long have I used the solution?

We have been working with Spark SQL for a few years. We are an outsourcing and consulting company, so it's not for our use—we mostly work with clients.

What do I think about the stability of the solution?

This solution is stable.

What do I think about the scalability of the solution?

This solution is scalable.

How was the initial setup?

The installation is straightforward because it's a cloud-based solution.

What about the implementation team?

We implement this solution for customers ourselves.

What's my experience with pricing, setup cost, and licensing?

There is no license or subscription for this solution.

What other advice do I have?

I rate this solution an eight out of ten and would recommend it to others.

reviewer1488372

Associate Manager at a consultancy with 501-1,000 employees

May 29, 2021

Easy to use, reliable, and useful data validation

What is our primary use case?

I am using this solution for data validation and writing queries.

What is most valuable?

Data validation and ease of use are the most valuable features.

What needs improvement?

There should be better integration with other solutions.

For how long have I used the solution?

I have been using the solution for approximately two years.

What do I think about the stability of the solution?

The solution has been stable.

What do I think about the scalability of the solution?

I have found the solution to be scalable. We have 20 people using the solution in my organization and we plan to increase usage.

What's my experience with pricing, setup cost, and licensing?

The solution is open source and free.

What other advice do I have?

I rate Spark SQL a ten out of ten.

Which deployment model are you using for this solution?

Public Cloud

Title	Rating	Mindshare	Recommending	Interviews
Apache Spark	4.2	13.9%	90%	69
Cloudera Distribution for Hadoop	4.0	14.7%	92%	51
