

I have received support via newsgroups or guidance on specific discussions, which is what I would expect in an open-source situation.
Apache Spark resolves many problems in the MapReduce solution and Hadoop, such as the inability to run effective Python or machine learning algorithms.
Without a doubt, we have had some crashes because each situation is different, and while the prototype in my environment is stable, we do not know everything at other customer sites.
It can handle large datasets.
Various tools like Informatica, TIBCO, or Talend each offer specific capabilities, but licensing can be costly.
Pentaho Business Analytics is hard to learn and not suited for initial users as it requires knowledge of operating systems, Java, and other technical skills.
Pentaho Business Analytics is priced similarly to other competitors such as QlikView and Tableau.
Not all solutions can make this data fast enough to be used, except for solutions such as Apache Spark Structured Streaming.
The solution is beneficial in that it provides a baseline, long-held understanding of the framework that does not change from day to day. That is very helpful in my prototyping work as an architect assessing Apache Spark, Great Expectations, and Vault-based solutions against those proposed by clients, such as TIBCO or Informatica.
It is a stable product, and it can handle large datasets.
| Product | Market Share (%) |
|---|---|
| Apache Spark | 13.9% |
| Cloudera Distribution for Hadoop | 15.1% |
| HPE Data Fabric | 14.9% |
| Other | 56.1% |

| Product | Market Share (%) |
|---|---|
| Pentaho Business Analytics | 0.6% |
| Microsoft Power BI | 9.4% |
| Tableau Enterprise | 6.7% |
| Other | 83.3% |

| Company Size | Count |
|---|---|
| Small Business | 28 |
| Midsize Enterprise | 15 |
| Large Enterprise | 32 |

| Company Size | Count |
|---|---|
| Small Business | 22 |
| Midsize Enterprise | 7 |
| Large Enterprise | 15 |
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
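The contrast between the two dataflows can be sketched in plain Python. This is only an illustration, assuming a single machine and a small JSON file; it does not use the Spark or Hadoop APIs, and the names `mapreduce_pass` and `working_set_passes` are hypothetical helpers invented for this example.

```python
import json
import os
import tempfile
from functools import reduce


def mapreduce_pass(in_path, out_path, map_fn, reduce_fn, initial):
    """One MapReduce-style pass: read from disk, map, reduce, write to disk."""
    with open(in_path) as f:
        records = json.load(f)                   # read input data from disk
    mapped = map(map_fn, records)                # map a function across the data
    result = reduce(reduce_fn, mapped, initial)  # reduce the results of the map
    with open(out_path, "w") as f:
        json.dump(result, f)                     # store the reduction result on disk
    return result


def working_set_passes(records, jobs):
    """Load the data once, then run several (map, reduce, initial) jobs over it."""
    cached = list(records)                       # in-memory working set, reused by every job
    return [reduce(red, map(mp, cached), init) for mp, red, init in jobs]


numbers = [1, 2, 3, 4]

with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.json")
    out_path = os.path.join(tmp, "output.json")
    with open(in_path, "w") as f:
        json.dump(numbers, f)
    # Sum of squares, with the input and the output both going through the disk.
    sum_of_squares = mapreduce_pass(
        in_path, out_path, lambda x: x * x, lambda a, b: a + b, 0
    )

# Two jobs over the same in-memory working set: sum of squares, then maximum.
results = working_set_passes(
    numbers,
    [
        (lambda x: x * x, lambda a, b: a + b, 0),
        (lambda x: x, max, numbers[0]),
    ],
)
```

Here `sum_of_squares` is 30 and `results` is `[30, 4]`. In the first function every additional job would repeat the disk read and write, while the second reuses one cached copy of the data. That reuse is the idea behind treating an RDD as a working set, although a real RDD is also partitioned across a cluster and is fault-tolerant, which this sketch does not attempt to model.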
Pentaho is an open source business intelligence company that provides a wide range of tools to help their customers better manage their businesses. These tools include data integration software, mining tools, dashboard applications, online analytical processing options, and more.
Pentaho has two product categories. The first is the standard enterprise version. This is the product that comes directly from Pentaho itself, with all of the benefits, features, and programs that come with a paid application, such as analysis services, dashboard design, and interactive reporting.
The alternative is an open source version, which the public is permitted to add to and tweak. Aside from being free, this version has the advantage that many more people are working on the project to improve its quality and breadth of functionality.