

Apache Spark and Amazon EC2 are leading solutions in data processing and cloud computing, respectively. Apache Spark, known for its in-memory data processing, appears to have a competitive edge thanks to its speed and scalability on large datasets, while Amazon EC2 excels in flexible scalability and integration with AWS services despite its complex pricing structure.
Features: Apache Spark is designed for in-memory data processing, enabling efficient handling of large datasets. It includes Spark Streaming for real-time data processing, Spark SQL for querying large data volumes economically, and MLlib for machine learning. Amazon EC2 provides flexible and scalable cloud computing services with the capability to quickly launch and manage server instances. It integrates well with other AWS services, offering cost-effective scalability and versatility.
Room for Improvement: Apache Spark could improve stability, scalability, and integration with BI tools; real-time querying limitations and complex APIs pose challenges. It also lacks user-friendly interfaces and comprehensive documentation. Amazon EC2 is criticized for its intricate pricing structure, which can lead to high costs, and for occasional connectivity issues during AMI upgrades. Reviewers call for better integration and cost management.
Ease of Deployment and Customer Service: Apache Spark can be deployed in both on-premises and cloud environments, offering flexibility. As an open-source solution, its support is community-driven, which can sometimes lack depth and immediacy. Amazon EC2 operates within the public cloud, well-regarded for ease of use and deployment. Its customer service is more structured, providing comprehensive support options for commercial users.
Pricing and ROI: Apache Spark, being open-source, saves on software costs, though it requires significant resources for optimal performance, which raises operational costs. It delivers substantial ROI through reduced operational expenses and its mature ecosystem. Amazon EC2 follows a pay-as-you-go model that is perceived as expensive due to its complex billing structure. Even so, its range of instance types offers flexibility and improves ROI when usage is managed strategically.
I would say I have saved more than a week with Amazon EC2 compared to my previous on-premises setup.
I have received support via newsgroups or guidance on specific discussions, which is what I would expect in an open-source situation.
Apache Spark resolves many problems in the MapReduce solution and Hadoop, such as the inability to run effective Python or machine learning algorithms.
Without a doubt, we have had some crashes because each situation is different, and while the prototype in my environment is stable, we do not know everything at other customer sites.
I have heard from multiple people that if you have an Amazon EC2 instance running and you stop it, the billing continues unless you terminate the Amazon EC2 instance.
I think improvements can be made to Amazon EC2 by increasing the memory, offering more instance types, and including GPUs as mentioned in the keynote.
Various tools like Informatica, TIBCO, or Talend offer specific aspects, but licensing can be costly.
With the cloud, deployment is easy, and within a minute, we can deploy the server and give it to the developers so they can work on it right away after deployment.
Amazon EC2 offers flexibility.
Not all solutions can make this data fast enough to be used, except for solutions such as Apache Spark Structured Streaming.
The solution is beneficial in that it provides a base-level long-held understanding of the framework that is not variant day by day, which is very helpful in my prototyping activity as an architect trying to assess Apache Spark, Great Expectations, and Vault-based solutions versus those proposed by clients like TIBCO or Informatica.

| Product | Market Share |
|---|---|
| Apache Spark | 11.4% |
| Amazon EC2 | 10.8% |
| Other | 77.8% |

| Company Size | Count |
|---|---|
| Small Business | 30 |
| Midsize Enterprise | 13 |
| Large Enterprise | 28 |

| Company Size | Count |
|---|---|
| Small Business | 28 |
| Midsize Enterprise | 15 |
| Large Enterprise | 32 |

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.
Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate them from common failure scenarios.
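To make the "pay only for capacity that you actually use" point concrete, here is a minimal arithmetic sketch. The hourly rate, the instance counts, and the schedule are all hypothetical values chosen for illustration; real EC2 prices depend on instance type, region, and purchase option, and this sketch models compute hours only.

```python
# Hypothetical hourly rate in USD, NOT a real EC2 price.
HOURLY_RATE_USD = 0.10
HOURS_IN_MONTH = 30 * 24  # 720 hours in a 30-day month


def compute_cost(instance_hours: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Return the compute charge for a number of instance-hours."""
    return round(instance_hours * hourly_rate, 2)


# Fixed capacity: 4 instances running for the whole month.
always_on_hours = 4 * HOURS_IN_MONTH
always_on_cost = compute_cost(always_on_hours)

# Scaled capacity (assumed schedule): 4 instances for 8 hours on 22 workdays,
# and a single instance for the remaining hours.
peak_hours = 8 * 22
scaled_hours = 4 * peak_hours + 1 * (HOURS_IN_MONTH - peak_hours)
scaled_cost = compute_cost(scaled_hours)

print(f"always on: {always_on_hours} instance-hours, ${always_on_cost:.2f}")
print(f"scaled:    {scaled_hours} instance-hours, ${scaled_cost:.2f}")
```

Under these assumed numbers, scaling down outside peak hours cuts the billed instance-hours from 2880 to 1248, which is the mechanism behind the ROI claims when usage is managed deliberately.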
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
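The RDD idea described above can be illustrated with a small single-process sketch in plain Python. This is not Spark's API: `ToyRDD`, `parallelize`, and `compute_partition` are hypothetical names invented for this example. It only shows the shape of the concept: a read-only, partitioned collection that records how it was derived, so a partition can be recomputed from its parent, and that can be reused across several reductions without being written out between steps.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical stand-in for an RDD: a read-only, partitioned multiset."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # tuple of tuples for a source dataset
        self._parent = parent          # lineage: dataset this one derives from
        self._fn = fn                  # lineage: function applied to each item

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        """Split items round-robin into immutable partitions."""
        items = tuple(items)
        parts = tuple(items[i::num_partitions] for i in range(num_partitions))
        return cls(partitions=parts)

    def num_partitions(self):
        if self._parent is None:
            return len(self._partitions)
        return self._parent.num_partitions()

    def compute_partition(self, index):
        """Rebuild one partition from lineage (how a lost one could be recovered)."""
        if self._parent is None:
            return self._partitions[index]
        return tuple(self._fn(x) for x in self._parent.compute_partition(index))

    def map(self, fn):
        """Return a new dataset; the original is never modified."""
        return ToyRDD(parent=self, fn=fn)

    def reduce(self, op):
        """Reduce each partition, then combine the partial results."""
        parts = (self.compute_partition(i) for i in range(self.num_partitions()))
        partials = [reduce(op, part) for part in parts if part]
        return reduce(op, partials)


numbers = ToyRDD.parallelize(range(1, 6))
squares = numbers.map(lambda x: x * x)

# The same working set feeds two different reductions.
total = squares.reduce(lambda a, b: a + b)
largest = squares.reduce(max)
print(total, largest)
```

In a MapReduce-style pipeline, each of the two reductions would be a separate job that reads its input from disk and writes its result back; here both reuse the same derived dataset, which is the "working set" behavior the paragraph describes.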