We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.
The tool's most valuable feature is its speed and efficiency. It is much faster than other tools and excels at parallel data processing. Unlike plain Python or JavaScript programs, which can struggle with parallel processing, Spark lets us handle large volumes of data easily and with more power.
Apache Spark could improve in terms of user-friendliness, particularly for individuals with a SQL background. While it suits those with programming knowledge, making it more accessible to people without extensive programming skills would be beneficial.
I have been using the product for six years.
Apache Spark is generally considered a stable product that rarely breaks down. Issues may arise with sudden increases in data volume, which can lead to memory errors, but these can typically be managed with autoscaling clusters. Schema changes or irregularities in streaming data may also pose challenges, though these could be addressed in future versions.
About 70-80 percent of employees in my company use the product.
We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.
The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.
The setup and installation of Apache Spark vary in complexity depending on whether it is deployed standalone or on a cluster. A standalone setup is generally straightforward, especially if you are familiar with the concepts involved. A cluster deployment requires more knowledge of clusters and networking, making it potentially more complex.
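As a hedged sketch of that difference, the commands below contrast a single-machine standalone launch with a submit to a managed cluster; the hostnames, ports, and job file are placeholders, not taken from the review.

```shell
# Standalone mode: start a master and a worker on one machine
# (start-worker.sh in Spark 3.1+; older releases used start-slave.sh)
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077

# Submit a job to the standalone master
$SPARK_HOME/bin/spark-submit --master spark://localhost:7077 job.py

# Cluster environment (e.g. YARN): requires Hadoop and cluster/networking
# configuration already in place, which is where the extra complexity lies
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster job.py
```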
The tool is an open-source product. If you're using open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it through other services, such as Databricks.
If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.
We use the product for computing. We mainly use Spark, Hive, HDFS, and Impala.
The product has been instrumental for all computing needs. We have a data warehouse and a data lake. We read from S3 and load it into different databases. We compute all the transformations, logic, and code we write in PySpark or Spark Scala. Spark is very valuable for data processing.
The product is completely secure and meets our protection needs. We have a dedicated on-premise cluster. Every year, the vendor introduces new versions and supports many of the available tools. They offer different hosting options, including a private cloud and a public cloud.
Competitors provide better functionality.
I have been using the solution for six years.
The tool’s stability is good. I rate the stability an eight or nine out of ten.
We have 2000 to 3000 users in our organization.
I rate the ease of setup a seven out of ten. The deployment takes 48 hours. We need six Hadoop administrators for the deployment.
The tool is not expensive. However, it has a cost to it. I rate the pricing a seven out of ten.
Databricks has a Runtime version. It works well with the cloud.
We have an analytics data mart built on top of SQL Server. We use Spark for computing and SSIS and SSRS with SQL Server. There is a roadmap to migrate analytics to Azure. I would recommend the solution to others. Cloudera is the best option if you need an on-premise implementation of Hadoop; if an organization wants a cloud version, Databricks might be a good option. Overall, I rate the solution a seven out of ten.