Our main use cases for Spark are Spark SQL and, occasionally, Spark Streaming for processing streaming data.
In most of our solutions, the data came from SAP or an Azure data warehouse, assuming the client was on Azure Cloud. From there we received relational data, or sometimes semi-structured data such as JSON files.
We process that data with Spark, writing the code in PySpark (Spark's Python API) to build DataFrames and shape the data into a Tableau-ready format.
We then load it into a database such as SQL Server, and from there the business data scientists and analysts pick it up. The sources themselves vary widely, for example e-commerce sites.
Previously, we worked mostly with structured data, stored in SAP, mainframes, Oracle, or other systems, and delivered in structured formats like CSV.
Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data: customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark.
Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It has become a crucial tool, widely adopted across the industry over the past decade.
Before Spark, there was MapReduce, but it was much slower: even running the same query a second time was time-consuming, because every stage read from and wrote back to disk. Spark was introduced to address these issues, offering processing speeds up to a hundred times faster than MapReduce; it originated at UC Berkeley's AMPLab and is now developed as an Apache project.
So, in response to the evolving needs of the industry, Spark has proven to be the solution, efficiently handling the processing requirements we face today.
Spark supports both batch and real-time data processing. If you have immediate data, like chat messages that need to be processed as they arrive, Spark Streaming is used; data that can be evaluated later is handled with batch processing.
In our organization, batch processing is used most of the time, and for streaming workloads we typically integrate tools like Kafka.
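A sketch of what that Kafka integration looks like with Structured Streaming (the broker address and topic name are placeholders; running this also requires a Kafka cluster and the spark-sql-kafka connector on the classpath, so treat it as an illustration rather than a runnable job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Subscribe to a Kafka topic; "broker:9092" and "chat-events"
# are hypothetical names.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "chat-events")
    .load()
)

# Kafka delivers the payload as bytes; cast it to a string before
# any parsing or sentiment scoring.
messages = events.select(F.col("value").cast("string").alias("text"))

# Continuously write results out; to the console here for illustration.
query = (
    messages.writeStream
    .outputMode("append")
    .format("console")
    .start()
)
# query.awaitTermination()  # block until the stream is stopped
```

The same DataFrame operations used in batch jobs apply to the streaming DataFrame, which is what makes mixing the two modes practical.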
In-memory processing is what gives Spark its large performance advantage over the disk-based MapReduce approach. On top of that, optimization techniques like caching, broadcasting, and partitioning help tune queries for faster processing.