PeerSpot user
Senior Data Architect at a pharma/biotech company with 1,001-5,000 employees
Vendor
It's a scale-out, MPP, shared-nothing architecture with multiple physical nodes.

Originally published at https://www.linkedin.com/pulse/microsoft-parallel-data-warehouse-pdw-stephen-c-folkerts

What’s the Difference Between Microsoft’s Parallel Data Warehouse (PDW) and SQL Server?

In this post, I’ll provide an in-depth view of Microsoft SQL Server Parallel Data Warehouse (PDW) and differentiate PDW from SQL Server SMP.

SQL Server is a scale-up, Symmetric Multi-Processing (SMP) architecture. It doesn’t scale out to a Massively Parallel Processing (MPP) design. SQL Server SMP runs queries sequentially on a shared-everything architecture. This means everything is processed on a single server that shares CPU, memory, and disk. To get more horsepower out of your SMP box as your data grows, you need to buy a brand-new, larger, more expensive server with faster processors, more memory, and more storage, and find some other use for the old machine.

SQL Server PDW is designed for parallel processing. PDW is a scale-out, MPP, shared-nothing architecture with multiple physical nodes. MPP architectures allow multiple servers to cooperate as one, enabling distributed and parallel processing of large scan-based queries against large data volume tables. Most data-warehouse-centric appliances leverage an MPP architecture in some form or fashion. Each PDW node runs its own instance of SQL Server with dedicated CPU, memory, and storage. As queries go through the system, they are broken up to run simultaneously across the physical nodes. The primary benefits are the dramatic query performance gains MPP provides and the ability to add hardware to your deployment to scale out linearly to petabytes of data, without the diminishing returns of an SMP architecture.


The Grey Zone: When to Use SQL Server PDW, and What About Netezza, Teradata, Exasol, and a Hundred Others?

Often we’re not dealing with petabyte-scale data. Not even close; we’re just in the terabytes, in a ‘grey zone’ where SQL Server SMP overlaps with PDW. Or the data is all relational and well structured, and there’s no Big Data business need such as social media analysis or fraud detection, and no need to combine structured and unstructured data from internal and external sources.

The capabilities of xVelocity columnstore indexes and the other in-memory capabilities and performance enhancements in SQL Server SMP should be explored first before recommending a PDW appliance. A combination of technologies, all native to SQL Server SMP, may be the answer if you’re dealing with purely relational data problems. The distinction between these sister products will always be blurry, and the underlying question of when to use SQL Server SMP versus PDW will persist, especially since SQL Server capabilities keep clawing up into PDW territory while PDW keeps growing. It is wise to understand the important differences when making a decision.

Organizations demand results in near real time, and they expect their internal systems to match the speed of an Internet search engine and analyze virtually all data, regardless of its size or type. Once non-relational, unstructured, or semi-structured data is thrown into the mix, you suddenly have a very different story. Business analysts struggle to figure out how to add the value of non-relational Hadoop data into their analysis. As a result, they’re held back from making the faster, more accurate data-driven decisions needed to compete in the marketplace. This is the current data challenge.

If you want to get into Netezza and Teradata, see my article Should I Choose Netezza, Teradata or Something Else?

Microsoft Parallel Data Warehouse (PDW)

Microsoft PDW is the rebuilt DATAllegro appliance with hardware and software designed together to achieve maximum performance and scalability. If you’d like to know more, see my article Microsoft PDW History, DATAllegro.

Query processing in PDW is highly parallelized. Data is distributed across processing and storage units called Compute Nodes. Each Compute Node has its own direct attached storage, processors, and memory that run as an independent processing unit. The Control Node is the brains of PDW and figures out how to run each T-SQL query in parallel across all of the Compute Nodes. As a result, queries run fast!

Microsoft SQL Server is foundational to PDW and runs on each Compute Node. PDW uses updateable in-memory clustered columnstore indexes for high compression rates and fast performance on the individual Compute Nodes.

The first rack of the PDW appliance is called the base rack. Every appliance has at least one base rack with 2 or 3 SQL Server Compute Nodes, depending on vendor hardware. As your business requirements change, you can expand PDW by adding scale units to the base rack. When the base rack is full, PDW expands by adding additional racks, called expansion racks, and adding scale units to them.

The base rack has two InfiniBand and two Ethernet switches for redundant network connectivity. A dedicated server runs the Control Node and the Management Node. A spare server ships in the rack for failover clustering. Optionally, you can add a second spare server.

With PDW’s MPP design, you don’t need to buy a new system to add capacity. Instead, PDW grows by adding to the existing system. PDW is designed to expand processing, memory, and storage by adding scale units consisting of SQL Server Compute nodes. By scaling out, you can easily expand capacity to handle a few terabytes to over 6 petabytes in a single appliance.

You don’t need to over-buy and waste storage that you don’t need, and if you under-buy you can quickly add more capacity if your data growth is faster than projected. When one rack is full, you can purchase another rack and start filling it with Compute nodes.

You also don’t need to migrate your data to a new system in order to add capacity. You can scale out without having to redesign your application or re-engineer the distribution mechanism. There is no need to migrate your data to a new system, and no need to restructure your database files to accommodate more nodes. PDW takes care of redistributing your data across more Compute nodes.

In-Memory xVelocity Clustered Columnstore Indexes Improve Query Performance

If MPP provides the computing power for high-end data warehousing, columnar has emerged as one of the most powerful storage architectures. For certain kinds of applications, columnar provides both accelerated performance and much better compressibility. Teradata picked up columnar capabilities with its acquisition of Aster Data, and HP acquired Vertica, which gave it a columnar MPP database.

PDW uses in-memory clustered columnstore indexes to improve query performance and to store data more efficiently. These indexes are updateable, and are applied to the data after it is distributed. A clustered columnstore index stores, retrieves and manages data by using a columnar data format, called a columnstore. The data is compressed, stored, and managed as a collection of partial columns, called column segments. Columns often have similar data, which results in high compression rates. In turn, higher compression rates further improve query performance because SQL Server PDW can perform more query and data operations in-memory. Most queries select only a few columns from a table, which reduces total I/O to and from the physical media. The I/O is reduced because columnstore tables are stored and retrieved by column segments, and not by B-tree pages.
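
To make this concrete, here is a minimal sketch of the PDW T-SQL for creating a table stored as a clustered columnstore from the start. The table and column names are hypothetical, and the DISTRIBUTION clause is covered in the discussion of distributions further below:

    CREATE TABLE dbo.FactSales
    (
        SaleKey     BIGINT        NOT NULL,
        ProductKey  INT           NOT NULL,
        SaleDate    DATE          NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(ProductKey),  -- spread rows across distributions
        CLUSTERED COLUMNSTORE INDEX       -- store the table as a columnstore
    );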

SQL Server PDW provides xVelocity columnstores that are both clustered and updateable, which saves roughly 70% on overall storage use by eliminating the row-store copy of the data entirely. The hundreds or thousands of terabytes of information in your EDW can be built entirely on xVelocity columnstores. Updates and direct bulk load are fully supported on xVelocity columnstores, simplifying and speeding up data loading and enabling real-time data warehousing and trickle loading, all while maintaining interactive query responsiveness.

Combining xVelocity and PDW integrates fast, in-memory technology on top of a massively parallel processing (MPP) architecture. xVelocity technology was originally introduced with PowerPivot for Excel. The core technology provides an in-memory columnar storage engine designed for analytics. Storing data in xVelocity provides extremely high compression ratios and enables in-memory query processing. The combination yields query performance orders of magnitude faster than conventional database engines. Both SQL Server SMP and PDW provide xVelocity columnstores.

Fast Parallel Data Loads

Loads are significantly faster with PDW than SMP because the data is loaded, in parallel, into multiple instances of SQL Server. For example, if you have 10 Compute Nodes and you load 1 terabyte of data, you will have 10 independent SQL Server databases each compressing and bulk inserting 100 GB of data at the same time. This is roughly 10 times faster than loading 1 TB into a single instance of SQL Server.

PDW uses a data-loading tool called dwloader, which is the fastest way to load data into PDW and supports in-database, set-based transformation of data. You can also use SQL Server Integration Services (SSIS) to bulk load data into PDW. The data is loaded from an off-appliance loading (staging) server into the PDW Compute Nodes. Informatica also works with PDW.
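
As an illustration, a dwloader invocation might look like the sketch below. The appliance name, credentials, file path, and table are hypothetical placeholders, and the exact flags should be verified against the dwloader documentation for your appliance version:

    dwloader.exe -S MyAppliance -U LoaderLogin -P MyPassword
        -i \\stagingserver\loads\FactSales.txt -T SalesDB.dbo.FactSales
        -t "|" -r \r\n -M append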

Scalable, Fast, and Reliable

With PDW’s Massively Parallel Processing (MPP) design, queries run in minutes instead of hours, and in seconds instead of minutes, compared to Symmetric Multi-Processing (SMP) databases. PDW is not only fast and scalable; it is designed with high redundancy and high availability, making it a reliable platform you can trust with your most business-critical data. PDW is designed for simplicity, which makes it easy to learn and manage. PDW’s PolyBase technology for analyzing Hadoop HDInsight data and its deep integration with Business Intelligence tools make it a comprehensive platform for building end-to-end solutions.

Fast & Expected Query Performance Gains

With PDW, complex queries can complete 5 to 100 times faster than on data warehouses built on symmetric multi-processing (SMP) systems. 50 times faster means queries complete in minutes instead of hours, or seconds instead of minutes. With this performance, your business analysts can perform ad-hoc queries or drill down into the details faster. As a result, your business can make better decisions, faster.

Why Queries Run Fast in PDW

Queries run in PDW against distributed, highly parallelized data. To support parallel query processing, PDW distributes fact table rows across the Compute Nodes and stores the table as many smaller physical tables. Within each SQL Server Compute Node, the distributed data is stored in 8 physical tables, each stored on an independent disk pair. Each independent storage location is called a distribution. PDW runs queries in parallel on each distribution. Since every Compute Node has 8 distributions, the degree of parallelism for a query is determined by the number of Compute Nodes. For example, if your appliance has 8 Compute Nodes, your queries will run in parallel on 64 distributions across the appliance.

When PDW distributes a fact table, it uses one of the columns as the key for determining the distribution to which each row belongs. A hash function assigns each row to a distribution according to the value in the distribution column. Every row in a table belongs to one and only one distribution. If you don’t choose the best distribution column, it’s easy to re-create the table using a different distribution column, as shown below.
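
In PDW T-SQL, this re-creation is typically done with CREATE TABLE AS SELECT (CTAS). The sketch below is a minimal, hypothetical example that rebuilds a fact table hash-distributed on a different column and then swaps the table names:

    -- Rebuild the table, hash-distributed on CustomerKey instead of ProductKey
    CREATE TABLE dbo.FactSales_New
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT * FROM dbo.FactSales;

    -- Swap the tables once the copy is complete
    RENAME OBJECT dbo.FactSales TO FactSales_Old;
    RENAME OBJECT dbo.FactSales_New TO FactSales;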

PDW doesn’t require that all tables get distributed. Small dimension tables are usually replicated to each SQL Server Compute Node. Replicating small tables speeds query processing since the data is always available on each Compute Node and there is no need to waste time transferring the data among the SQL Server Compute Nodes in order to satisfy a query.
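
A replicated dimension table is declared with DISTRIBUTION = REPLICATE; again, a minimal, hypothetical sketch:

    CREATE TABLE dbo.DimProduct
    (
        ProductKey  INT           NOT NULL,
        ProductName NVARCHAR(100) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,    -- a full copy stored on every Compute Node
        CLUSTERED COLUMNSTORE INDEX
    );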

PDW’s Cost-Based Query Optimizer

PDW’s cost-based query optimizer is the brain that makes parallel queries run fast and return accurate results. A result of Microsoft’s extensive research and development efforts, the query optimizer uses proprietary algorithms to choose a high-performing parallel query plan. The parallel query plan contains all of the operations necessary to run the query in parallel. As a result, PDW handles all the complexity that comes with parallel processing and processes the query seamlessly in parallel behind the scenes. The results are streamed back to the client as though only one instance of SQL Server ran the query.

PDW Query Processing

Here’s a look into how PDW query processing works ‘under the covers’. First, a query client submits a T-SQL query to the Control Node, which coordinates the parallel query process. After receiving a query, PDW’s cost-based parallel query optimizer uses statistics to choose a query plan, from many options, for running the user query in parallel across the Compute Nodes. The Control Node then sends the parallel query plan, called the DSQL plan, to the Compute Nodes, and the Compute Nodes run the plan on their portions of the data.
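
PDW exposes this plan through the T-SQL EXPLAIN statement, which returns the DSQL plan (as XML) without executing the query. A hypothetical example, reusing the table names from the sketches above:

    -- Show the DSQL plan, including any data movement steps,
    -- without actually running the query
    EXPLAIN
    SELECT p.ProductName, SUM(f.Amount) AS TotalSales
    FROM dbo.FactSales AS f
    JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
    GROUP BY p.ProductName;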

The Compute Nodes each use SQL Server to run their portion of the query. When the Compute Nodes finish, the results are quickly streamed back to the client through the Control Node. All of this occurs without landing data on the Control Node, so the data does not bottleneck there.

PDW relies on co-located data, which means the right data must be on the right Compute Node at the right time before running a query. When two tables are distributed on the same column, they can be joined without moving data. Data movement is necessary, though, when two distributed tables are joined and they are not distributed on the same column.
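
To make the idea concrete, consider two hypothetical fact tables that are both hash-distributed on ProductKey. The join below is then distribution-compatible and needs no data movement; if one table were distributed on a different column, the DSQL plan would instead contain a data movement step (such as a shuffle) to co-locate the rows first:

    -- Both tables hash-distributed on ProductKey: rows with the same
    -- ProductKey live in the same distribution, so the join runs locally
    SELECT f.ProductKey, COUNT(*) AS MatchingRows
    FROM dbo.FactSales AS f
    JOIN dbo.FactInventory AS i ON f.ProductKey = i.ProductKey
    GROUP BY f.ProductKey;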

PDW Data Movement Service (DMS) Transfers Data Fast

PDW uses the Data Movement Service (DMS) to efficiently move and transfer data among the SQL Server Compute Nodes, as necessary, for parallel query processing. Data movement is needed when tables whose data isn’t co-located are joined. DMS moves only the minimum amount of data necessary to satisfy the query. Since data movement takes time, the query optimizer considers the cost of moving the data when it chooses a query plan.

Microsoft Analytics Platform System (APS)

See my article Microsoft Analytics Platform System (APS) for a more in-depth look at Microsoft APS.

These views are my own and may not necessarily reflect those of my current or previous employers.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Developer at a tech consulting company with 51-200 employees
Consultant
It offers high-performance, low-cost data warehousing on the latest industry hardware

Valuable Features:

Microsoft SQL Server Parallel Data Warehouse provides a significant performance boost, from 10 to 100 times faster, for data loading operations. It minimizes the cost of design with lower infrastructure requirements. It is designed to provide optimized performance on the latest industry hardware.

Room for Improvement:

It requires significant infrastructure expertise to implement. There is a possibility of over-specified storage or under-specified CPU (proper planning is needed before starting implementation). It also requires significant SQL Server expertise. Microsoft SQL Server Parallel Data Warehouse is purpose-built to provide high-performance data warehousing. By adopting industry-standard hardware, it avoids vendor lock-in. It is best suited for mid-level to large-scale organizations.

Other Advice:

Both hardware and software support is provided by Microsoft.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
it_user7554 - PeerSpot reviewer
Consultant at a tech consulting company with 51-200 employees
Consultant
Parallel Data Warehouse (PDW) POC – lessons learned

The first version of Microsoft’s Parallel Data Warehouse has been out for a while, and now I had the chance to get my hands on it during a customer POC. Because PDW is an appliance solution, the software is hardware-bound. You can’t download and install PDW on a normal server; you need the right hardware, which needs to be MPP capable. Currently there are only 2 vendors providing PDW hardware: HP and Dell.

For all of you who need a further introduction to PDW, I can recommend my past blog posts:

Customer Requirements

So let’s go back to the customer POC. My customer currently has a Data Warehouse solution built on SQL Server. The following list provides some high-level information about the environment:

  • Data Volume: 15 TB
  • Number of users: 200
  • Biggest dimension: 20 million records
  • Biggest fact table: 750 million records
  • Number of relational Data Warehouse layers: 3
  • Number of data sources: 15
  • Number of SSIS packages: 400
  • Daily processed data: 50 million records

Since his current hardware is not able to scale beyond this size, he is looking for a new solution. Together with HP & Microsoft, we started a 3-week POC to check whether PDW can meet the customer’s needs. Without going too much into detail, let me quickly highlight the current customer pain points.

ETL Performance: Currently all data loads are implemented with SQL Server Integration Services. Altogether there are about 400 SSIS packages that are responsible for source system connectivity and for data loading across the 3 data warehouse layers. In the current environment, data loads run daily within a predefined execution window. The customer requirement was to get at least the same SSIS performance as on the current production system and to be able to scale SSIS when adding additional load processes to the daily load, without breaking the current loading window.

Relational Query Performance: Today nearly 80% of all customer analysis is done on an Analysis Services cube. The cube gives us the ability to present the complex data model and business logic in a simplified form that end users understand and can work with. But there is also a small group of technically savvy power users who want to analyze data directly on the relational data model, so the demand for a scalable relational reporting solution is growing.

Scalability: As mentioned before, my customer is looking for a scalable Data Warehouse solution that can handle up to 50 TB of relational data (the estimated growth over the next 2 years).

Administration: This area is not a critical success criterion, but the customer is also interested in administration effort and IT operations processes like backup/restore, database administration, database reorganization, etc.

Test cases

As you can imagine, we can’t test all of this within 3 weeks, and there are also points that don’t make sense to test on your own, like scalability. I think that, given the MPP architecture, it’s totally clear the system can scale for data volume. If you need more space, you buy an additional rack; that’s how it works.

So clearly we had to focus on what we could really test within 3 weeks and on the most important things the customer wanted to see working in order to decide whether to go ahead with PDW or not. Together with the team, we agreed to take a deeper look at ETL performance and relational query performance.

ETL Performance

We decided to take a representative ETL process and run it against PDW. This process includes:

  • 5 SSIS packages
  • 33 tables
  • 3 Stored Procedures (SPs)
  • a User Defined Function (UDF)

The first question that came up was: “Where do we execute the SSIS packages?”

As we all know from best practices, it is recommended to have a dedicated environment for ETL and one for the relational engine. But what about PDW?

PDW is a relational SQL Server engine solution based on an MPP architecture, so the best practice of a dedicated ETL server is still valid. The good news is that a small ETL server is included within the PDW architecture.

Why small? The server itself is not a high-end, scalable ETL server (it only has 6 cores and 24 GB RAM, connected via InfiniBand to the other servers). And because PDW is an appliance, there is no option to put more cores or memory in. So for the POC this server was good enough, but for a real-life project scenario an additional ETL server with sufficient hardware resources would be needed.

Relational Query Performance

To measure relational query performance, the customer provided us with 3 representative relational SQL queries along with their runtimes on the current environment. We took these and executed them against PDW.

Next Steps

In Part 2 of this PDW POC series I will dig into more detail about our test cases, the migration, the issues we faced, and of course the results and lessons learned. So stay tuned…

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
it_user6414 - PeerSpot reviewer
CEO at a tech company with 51-200 employees
Real User
We examined SAP Data Warehouse and IBM Data Warehouse but SQL Data Warehouse provided a more extensive way of integrating data.

Valuable Features:

  • SQL Data Warehouse provides our organization with a way to access all corporate data in a uniform way.
  • SQL Data Warehouse allows us to manage our database using network or hierarchical database systems rather than solely relying on relational DBMSs.
  • It supports the bottom-up process of data warehouse creation, where several data marts can be joined to form one big warehouse. I found this very handy, especially where a few teams are involved.

Room for Improvement:

  • Data cleaning involves the use of special software to help the cleaning team uniformly match attributes from the source data to its destination.
  • It is expensive. SQL Data Warehouse is not supported by all versions of SQL Server, such as the Express version, and therefore becomes unavailable to many, limiting its acquisition and use.
  • It uses a different approach to databases than that used in relational DBMSs, posing a challenge to novices.

Other Advice:

  • It comes with a special tool for cleaning data through the data transformation process, where data is extracted from many sources before it is assembled and consolidated to form a single database. This is the main factor that led my company to choose SQL Data Warehouse as the tool to consolidate relational databases. We closely examined SAP Data Warehouse and IBM Data Warehouse, but SQL Data Warehouse provided a more extensive way of integrating data. I personally value the role that SQL Data Warehouse has played in achieving our business intelligence goals. Users who need to generate timely reports from huge databases, as well as those doing database cleaning, should embrace this product.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
reviewer2255670 - PeerSpot reviewer
Team Leader at a tech services company with 201-500 employees
Real User
Top 20
A stable and easy-to-deploy solution that enables users to manage large volumes of data
Pros and Cons
  • "We can store the data in a data lake for a very low cost."
  • "The product does not have all of the features that the native products have."

What is most valuable?

It's a very good solution because we can manage large volumes of data. We can store the data in a data lake for a very low cost, and we can move the data into Parallel Data Warehouse when we need more performance or decide to leave the data in the data lake. It's a very good product at the moment.

What needs improvement?

The product does not have all of the features that the native products have. If we want to do something advanced, we must use Data Factory and Databricks.

For how long have I used the solution?

I have been working with the product for about three to four years. I am leading a team. I don't work in depth with the solution.

What do I think about the stability of the solution?

The solution is pretty stable. Sometimes we have problems, but Microsoft manages and solves them quickly. We had some problems with Python and ETL processes inside Synapse. However, Microsoft solved the problem in less than an hour.

What do I think about the scalability of the solution?

The tool is scalable by design. Our customers are medium and enterprise-level businesses.

How are customer service and support?

The support is pretty good. Support could be faster. The team must be more knowledgeable about Azure.

How would you rate customer service and support?

Positive

How was the initial setup?

The initial setup is really easy. We don't need to do anything major. The deployment takes a few minutes.

The number of people required to deploy the tool depends on the number of environments we have. Usually, one architect is enough to deploy the product.

What about the implementation team?

Maintaining the product is easy because Microsoft does it for us. We don't need to do anything. It is upgraded by default.

What's my experience with pricing, setup cost, and licensing?

The pricing depends on the solution. If we decide to use a low-cost configuration, it is not expensive. The tool might be expensive if we need high performance and want to manage a lot of data; it depends on what we need to do.

What other advice do I have?

The pipeline-related components are included in Synapse: there are ETL tools based on Data Factory and other advanced functions based on Databricks. However, not all of their functionality is included in Synapse, and it makes no sense to have the product without all of the functionality.

It is tricky to manage the storage mode of the data: if we don't read the documentation and don't create a good distribution of the data, we will have performance problems. If we do, we will have no problems.

Overall, I rate the solution an eight out of ten.

Disclosure: My company has a business relationship with this vendor other than being a customer: Partner