We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.
The easiest route - we'll conduct a 15 minute phone interview and write up the review for you.
Use our online form to submit your review. It's quick and you can post anonymously.
We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.
The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily.
Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.
I have been using the product for six years.
Apache Spark is generally considered a stable product, with rare instances of breaking down. Issues may arise in sudden increases in data volume, leading to memory errors, but these can typically be managed with autoscaling clusters. Additionally, schema changes or irregularities in streaming data may pose challenges, but these could be addressed in future software versions.
About 70-80 percent of employees in my company use the product.
We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.
The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.
The solution's setup and installation of Apache Spark can vary in complexity depending on whether it's done in a standalone or cluster environment. The process is generally more straightforward in a standalone setup, especially if you're familiar with the concepts involved. However, setting up in a cluster environment may require more knowledge about clusters and networking, making it potentially more complex.
The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks.
If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.
We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.
The product’s most valuable features are lazy evaluation and workload distribution.
They could improve the issues related to programming language for the platform.
We have been using Apache Spark for around two and a half years.
The platform’s stability depends on how effectively we write the code. We encountered a few issues related to programming languages.
We have more than 100 Apache Spark users in our organization.
Before choosing Apache Spark for processing big data, we evaluated another option, Hadoop. However, Spark emerged as a superior choice comparatively.
The initial setup complexity depends on whether it's on the cloud or on-premise. For cloud deployments, especially using platforms like Databricks, the process is straightforward and can be configured with ease. However, if the deployment is on-premise, the setup tends to be more time-consuming, although not overly complex.
They provide an open-source license for the on-premise version. However, we have to pay for the cloud version including data centers and virtual machines.
Apache Spark is a good product for processing large volumes of data compared to other distributed systems. It provides efficient integration with Hadoop and other platforms.
I rate it a ten out of ten.