Data Engineer at a tech company with 10,001+ employees
Real User
Top 20
Aug 11, 2025
I don't use a big data solution such as a Data Lake. We don't have a Data Lake or these newer technologies, but I have used Apache Spark code on a lot of big data, pulling data from CRM and Siebel. Siebel is the application that runs in telecom branches in Egypt, and customers interact with it. We have a lot of data from the finance team, CRM, Siebel, and other sources. We consolidate all of this data and perform transformations. With Apache Spark, we can perform various transformations. For instance, when a customer calls their mother and consumes many minutes, we should consolidate all of those calls and calculate the net minutes. For the monthly invoice, we should determine how much the customer should pay for ADSL, phones, and family phones. We take all of this information, transform it, and use it to generate the invoice.

We enhance the data processing by using Apache Spark SQL. I haven't used Apache Spark machine learning, but I've used Apache Spark SQL because we have data in HDFS tables. We take the Apache Spark code, get the data, and then compute the aggregate. For example, we can request aggregated data about a customer's consumption since last week. You can use Apache Spark code or Python code itself, and if you know SQL, you can type SQL within the same script and output the result to any table or Excel file. It's durable and easy.

When a customer has an issue with their phone number and can't call or access the internet, they visit a branch and speak with an agent. We need to take action based on the data, so we need real-time data processing to get aggregated data for this customer from the last week or month. All data solutions serve the customer and business needs, which is what I appreciate about data solutions.
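As a rough illustration of the Spark SQL pattern described here, the following minimal sketch aggregates a customer's consumption for the last week from an HDFS-backed table and writes the result out; the table and column names (usage_records, customer_id, minutes, call_date) are hypothetical.

    # Minimal PySpark sketch; table and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("customer-consumption")
             .enableHiveSupport()
             .getOrCreate())

    # Expose the HDFS-backed table to Spark SQL.
    spark.read.table("usage_records").createOrReplaceTempView("usage")

    weekly = spark.sql("""
        SELECT customer_id,
               SUM(minutes) AS net_minutes
        FROM usage
        WHERE call_date >= date_sub(current_date(), 7)
        GROUP BY customer_id
    """)

    # Write the aggregate to a table; it could also be exported to CSV/Excel downstream.
    weekly.write.mode("overwrite").saveAsTable("weekly_consumption")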
I use Apache Spark for data engineering work. I handle computation processes where it is necessary to process big data.
I have hands-on experience with Spark, roughly six months to one year of working with it. We use it for faster processing, especially compute. Spark is used for transformations on large volumes of data, and it distributes the work usefully. We receive data from various sources and need to transform it. The data is enormous, in terabytes, and often comes from specific databases. We perform transformations, aggregations, and deduplication. We meet business requirements by computing the data, minimizing it, aggregating it, or performing other operations. We typically write to Hive downstream.
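A minimal sketch of that transform/aggregate/deduplicate flow with a Hive target might look like the following; the source path and column names (record_id, account_id, amount) are assumptions.

    # Deduplicate, aggregate, and write to Hive; paths and columns are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    raw = spark.read.parquet("/data/incoming/")          # hypothetical source

    cleaned = (
        raw.dropDuplicates(["record_id"])                # deduplication
           .groupBy("account_id")                        # aggregation
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("txn_count"))
    )

    # Write the minimized, aggregated result to Hive for downstream consumers.
    cleaned.write.mode("overwrite").saveAsTable("analytics.account_summary")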
The primary use case for Apache Spark is to process big data in memory, distributing the engine to process that data. It is used for various tasks such as running the association rules algorithm in Spark ML, running XGBoost in parallel on the Spark engine, and preparing data for online machine learning using Spark Streaming.
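For the association-rules part, Spark ML ships an FPGrowth implementation; the tiny transactions dataset below is invented purely to show the shape of the API.

    # Association rules with Spark ML's FPGrowth; the data is made up.
    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    spark = SparkSession.builder.appName("assoc-rules").getOrCreate()

    transactions = spark.createDataFrame(
        [(0, ["bread", "milk"]),
         (1, ["bread", "butter", "milk"]),
         (2, ["butter", "jam"])],
        ["id", "items"],
    )

    model = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.5).fit(transactions)
    model.associationRules.show()   # antecedent, consequent, confidence, lift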
Data Scientist at a financial services firm with 10,001+ employees
Real User
Top 5
Jul 10, 2024
Most of my use cases involve data processing. For example, someone tried to run sentiment analysis on Databricks using Apache Spark. They had to handle data from many countries and languages, which presented some challenges. Besides that, I primarily use Apache Spark for data processing tasks. I work with mobile phone datasets, around one terabyte in size. This involves extracting and analyzing data before building any models.
CEO International Business at a tech services company with 1,001-5,000 employees
MSP
Top 5
Nov 10, 2023
In AI deployment, a key step is aggregating data from various sources, such as customer websites, debt records, and asset information. Apache Spark plays a vital role in this process, efficiently handling continuous streams of data. Its capability enables seamless gathering and feeding of diverse data into the system, facilitating effective processing and analysis for generating alerts and insights, particularly in scenarios like banking.
Apache Spark can be used for multiple use cases in big data and data engineering tasks. We are using Apache Spark for ETL, integration with streaming data, and real-time predictions such as anomaly detection and price prediction, as well as data exploration on large volumes of data.
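For the streaming-integration side of this, a minimal Structured Streaming sketch could look like the following; the Kafka brokers, topic, and output paths are placeholders.

    # Structured Streaming sketch: read events from Kafka, cast, and persist.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-etl").getOrCreate()

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
              .option("subscribe", "events")                      # placeholder topic
              .load())

    parsed = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "/data/stream-output/")
             .option("checkpointLocation", "/data/stream-checkpoint/")
             .start())

    query.awaitTermination()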
It's a root product that we use in our pipeline. We have some input data; for example, one system supplies data to MongoDB, and we pull this data from MongoDB, enrich it with additional fields from other systems, and write it to S3 for other systems. Since we have a lot of data, we need a parallel process that runs hourly.
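A hedged sketch of that hourly job, assuming the MongoDB Spark connector (v10+) and made-up database, collection, bucket, and join-key names:

    # Read from MongoDB, enrich with reference data, write to S3.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hourly-enrichment").getOrCreate()

    orders = (spark.read.format("mongodb")                  # MongoDB Spark connector
              .option("connection.uri", "mongodb://mongo-host:27017")
              .option("database", "shop")
              .option("collection", "orders")
              .load())

    reference = spark.read.parquet("s3a://reference-bucket/customers/")

    # Enrich the MongoDB records with additional fields from the other system.
    enriched = orders.join(reference, on="customer_id", how="left")

    enriched.write.mode("append").parquet("s3a://output-bucket/enriched-orders/")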
Apache Spark is a processing framework that you program with languages such as Java or Python. In my most recent deployment, we used Apache Spark to build engineering pipelines to move data from sources into the data lake.
Chief Data-strategist and Director at Theworkshop.es
Real User
Top 10
Aug 18, 2021
You can do a lot of things in terms of data transformation. You can store, transform, and stream data. It's very useful and has many use cases.
Senior Solutions Architect at a retailer with 10,001+ employees
Real User
Mar 27, 2021
We use Apache Spark to prepare data for transformation and encryption, depending on the columns. We use AES-256 encryption. We're building a proof of concept at the moment. We prepare patches on Spark for Kubernetes on-premise and Google Cloud Platform.
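As a sketch of column-level encryption in Spark (not necessarily the exact approach used in this project), Spark 3.3+ exposes an aes_encrypt SQL function; the input path, column name, and key handling below are placeholders, and a 32-byte key gives AES-256.

    # Encrypt a single column with AES-256 (GCM); paths, column, and key are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("/data/customers/")      # hypothetical input

    key = "0" * 32   # 32-byte key => AES-256; in practice, fetch the secret from a KMS

    encrypted = df.withColumn(
        "email_enc",
        F.expr(f"base64(aes_encrypt(email, '{key}', 'GCM'))")
    ).drop("email")

    encrypted.write.mode("overwrite").parquet("/data/customers_encrypted/")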
We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera, and in front of Spark, we are using Couchbase. Spark is mainly used for aggregations and AI (for future usage). It gathers data from Couchbase and does the calculations. We are not actively using the Spark AI libraries at this time, but we are going to. This project is for classifying transactions and finding suspicious activities, especially those that come from internet channels such as internet banking and mobile banking. It executes rules developed and written by our business team to detect suspicious activity. An example of a rule is: if the transaction count or transaction amount is greater than 10 million Turkish lira and the user's device is new, then raise an exception. The system sends an SMS to the user, and the user can choose whether or not to continue with the transaction.
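The quoted rule translates naturally into a Spark filter; the table and column names below are assumptions, and the SMS challenge would be handled downstream.

    # Flag transactions over 10M TRY (by count or amount) on a new device.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    txns = spark.read.table("fraud.transactions")        # hypothetical table

    suspicious = txns.filter(
        ((F.col("txn_count") > 10_000_000) | (F.col("amount_try") > 10_000_000))
        & F.col("is_new_device")
    )

    # Each flagged row would trigger an SMS challenge to the user downstream.
    suspicious.write.mode("append").saveAsTable("fraud.alerts")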
When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and sends everything back to files in the data lake. The machine learning program then does some analysis using an ML prediction algorithm.
Managing Consultant at a computer software company with 501-1,000 employees
Real User
Feb 2, 2020
Our use case for Apache Spark was a retail price prediction project. We were using retail pricing data to build predictive models. To start, we analyzed the prices and created the dataset to be visualized in Tableau. We then used that visualization tool to create dashboards and graphical reports showcasing the predictive modeling results. Apache Spark was used to host this entire project.
We have built a product called "NetBot." We take any form of data, such as large email datasets, images, videos, or transactional data, transform the unstructured textual and video data into structured form, combine it with the transactional data, and create an enterprise-wide smart data grid. That smart data grid is then used by downstream analytics tools. We also provide machine learning model building so people can get faster insight into their data.
Technical Consultant at a tech services company with 1-10 employees
Consultant
Dec 23, 2019
We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.
We primarily use the solution to integrate very large data sets from other environments, such as our SQL environment, and to extract purposeful data before checking it. We also use the solution for streaming from very large servers.
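Pulling a large table out of a SQL environment is typically done over JDBC with partitioned reads; the connection URL, table, credentials, and bounds below are placeholders.

    # Partitioned JDBC read of a large SQL table, landed as Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=sales")
          .option("dbtable", "dbo.transactions")
          .option("user", "reader")
          .option("password", "***")
          .option("numPartitions", 8)            # parallel reads for large tables
          .option("partitionColumn", "id")
          .option("lowerBound", 1)
          .option("upperBound", 10_000_000)
          .load())

    df.write.mode("overwrite").parquet("/data/landing/transactions/")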
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Oct 13, 2019
We use this solution for information gathering and processing. I use it myself when I am developing on my laptop. I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function...
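A tiny PySpark example of the RDD API described here: parallelize a dataset across the cluster, map a function over it, and reduce the results.

    # Minimal RDD example: distribute data, map, and reduce.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1, 1001))        # distributed, read-only dataset
    squares_sum = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
    print(squares_sum)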
We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.
In my company, the solution is used for batch processing or real-time processing.
Our primary use case is for interactively processing large volumes of data.
We use it for real-time and near-real-time data processing. We use it for ETL purposes as well as for implementing the full transformation pipelines.
We use Apache Spark for storage and processing.
Our customers configure their software applications, and I use Apache to check them. We use it for data processing.
Predominantly, I use Spark for data analysis on top of datasets containing tens of millions of records.
We use Spark for machine learning applications, clustering, and segmentation of customers.
I am using Apache Spark for the data transition from databases. We have customers who have one database as a data lake.
I use Spark to run automation processes driven by data.
I mainly use Spark to prepare data for processing because it has APIs for data evaluation.
The solution can be deployed on the cloud or on-premise.
I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.
We primarily use the solution for security analytics.
We use the solution for analytics.
Streaming telematics data.
Ingesting billions of rows of data all day.
Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.