Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
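The linear MapReduce dataflow described above (read from disk, map, reduce, write to disk) can be sketched on a single machine in plain Python. This is only an illustration of the dataflow shape, not Spark or Hadoop code: the names `map_reduce_job` and `word_count_demo`, the word-count task, and the JSON output format are all assumptions made for the example.

```python
import json
import os
import tempfile
from collections import defaultdict
from functools import reduce


def map_reduce_job(input_path, output_path, map_fn, reduce_fn):
    """Hypothetical single-machine sketch of one MapReduce-style pass."""
    # 1. Read the input records from disk.
    with open(input_path) as f:
        records = [line.rstrip("\n") for line in f]

    # 2. Map: each record yields (key, value) pairs, grouped here by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)

    # 3. Reduce: fold the values collected for each key into one result.
    results = {key: reduce(reduce_fn, values) for key, values in groups.items()}

    # 4. Store the reduction results back on disk.
    with open(output_path, "w") as f:
        json.dump(results, f)
    return results


def word_count_demo():
    """Run a word count through the disk -> map -> reduce -> disk pipeline."""
    with tempfile.TemporaryDirectory() as tmp:
        input_path = os.path.join(tmp, "input.txt")
        output_path = os.path.join(tmp, "output.json")
        with open(input_path, "w") as f:
            f.write("spark rdd spark\nrdd cluster\n")

        map_reduce_job(
            input_path,
            output_path,
            map_fn=lambda line: [(word, 1) for word in line.split()],
            reduce_fn=lambda a, b: a + b,
        )

        # Read the stored result back, as a follow-up job would have to.
        with open(output_path) as f:
            return json.load(f)
```

Chaining several such jobs means a full disk round trip between each pair of steps; keeping intermediate data available as a working set is the alternative that RDDs are described as providing.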
Cask Data Application Platform (CDAP) is the first Unified Platform for Big Data. It provides standardization and deep integrations with diverse Hadoop technologies allowing companies to focus on application logic and insights, rather than infrastructure and integration. The platform is 100% open-source, highly extensible, and delivers enterprise-class features to help accelerate time to build, deploy, and manage data-centric applications & data lakes on Hadoop and Spark.
There are three extensions packaged with CDAP: Cask Hydrator, Cask Wrangler, and Cask Tracker. CDAP Extensions are self-service, purpose-built applications on CDAP designed to solve common and critical big data challenges: Cask Hydrator handles data pipelines, Cask Wrangler handles data wrangling, and Cask Tracker handles data discovery and metadata.
CDAP removes barriers to innovation as an extensible and future-proof platform that provides consistency across environments and easily integrates with existing MDM, BI, and security solutions.
Cloudera Distribution for Hadoop (CDH) is the world's most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified batch processing, interactive SQL, interactive search, and role-based access controls. More enterprises have downloaded CDH than all other such distributions combined.
NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
AT&T, Salesforce, Cloudera, Hortonworks, Lotame, MapR, Pet360, Ignition, Safeguard, Cloudwick, Kogentix
37signals, Adconion, adgooroo, Aggregate Knowledge, AMD, Apollo Group, Blackberry, Box, BT, CSC