Try our new research platform with insights from 80,000+ expert users
it_user7554 - PeerSpot reviewer
Consultant at a tech consulting company with 51-200 employees
Consultant
Parallel Data Warehouse (PDW) POC – lessons learned

The first version of Microsoft’s Parallel Data Warehouse is out for a while, now I had the chance to get my hands on it during a customer POC. Because PDW is an appliance solution the software is hardware bounded. You can’t download and install PDW on a normal server, you need the right hardware which needs to be MPP capable. Currently there are only 2 vendors providing PDW hardware, HP and DELL.

For all of you who need a further introduction into PDW I can recommend my past blog posts:

Customer Requirements

So let’s go back to the customer POC. My customer currently has a Data Warehouse solution build on SQL Server. The following list provides some high level information about the environment:

  • Data Volume: 15 TB
  • Number of users: 200
  • Biggest Dimension: 20 mill. records
  • Biggest Fact Table: 750 mill. records
  • Number of relational Data Warehouse layers: 3
  • Number of data sources: 15
  • Number of SSIS packages: 400
  • Daily processed data: 50 mill. records

Since his current hardware is not able to scale beyond this size he is looking for a new solution. Together with HP & Microsoft we started a 3 weeks POC to check if PDW can meet customer needs. Without going to much into detail let me quickly highlight the current customer pain points.

ETL Performance: Currently all data loads are implemented with SQL Server Integration Services. Altogether there are about 400 SSIS package that are responsible for source system connectivity and to support data loading over 3 data warehouse layers. In the current environment data loads are done daily and have a predefined time frame for execution. The customer requirement was to get a least the same SSIS-Performance as on the current production system and to be able to scale SSIS in case of adding additional load processes to the daily load without breaking the current loading window.

Relational Query Performance: Today nearly 80% of all customer analysis is done based on an Analysis Services cube. The cube gives us the ability to present the complex data model and business logic in a simplified form that end users understand and are able to work with. But there is also a small amount of technical affine power users which want to be able to analyze data directly on the relational data model. So the demand for a scalable relational reporting solution is growing.

Scalability: As mentioned before my customer is looking for a scalable Data Warehouse solution that is able to handle up to 50TB of relational data (estimated growth in the next 2 years).

Administration: This area is not a real critical success criteria but the customer is also interested in administration efforts and IT operation processes like backup / restore, database administration, database reorg, etc.

Test cases

As you can imagine we can’t test all this within 3 weeks and there are also points that doesn’t make sense to test on your own, like scalability. I think based on the given MPP architecture it’s totally clear, that the system is able to scale for data volume. If you need more space, you need to buy an additional rack, this is how it works.

So clearly we had to focus on what we can really test within 3 weeks and what are the most important things the customer wants to see that are working so that he is able to make a decision to go ahead with PDW or not. Together with the team we agreed that we have a deeper look at ETL Performance and relational query performance.

ETL Performance

We decided to take a representative ETL process and run it against PDW. This process includes:

  • 5 SSIS packages
  • 33 tables
  • 3 Stored Procedures (SP)
  • a User Defined Function (UDF)
    The first question that came up was “where do we executed the SSIS packages”?

As we all know from Best Practices it is recommended to have a dedicated environment for ETL and one for the relational engine. But what about PDW?

PDW is a solution for a relational SQL Server engine based on a MPP architecture. So the Best Practice about a dedicated ETL server is still valid. The good news is, that within the PDW architecture there is also a small ETL server included.

Why small? The server itself is not a high end scalable ETL server (it only has 6 Cores, 24 GB RAM, connected via Infiniband to the other servers). And because PDW is an appliance there is no option to put in more cores or memory in. So for the POC this server was good enough but for a real life project scenario an additional ETL server with enough hardware resources would be needed.

Relational Query Performance

To measure relational query performance the customer provided us 3 representative relational SQL queries with corresponding runtimes on the current environment. So we took these and executed them against PDW.

Next Steps

Within Part 2 of this PDW POC series I will dig into more details about our test cases, about the migration, issues we faced with and of course the results and lessons learned we got. So stay tuned…

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
EDRMS Practice Lead at a tech services company with 51-200 employees
Real User
Collects data through SSIS packages from different sources and puts them all in one data repository
Pros and Cons
  • "Collecting the data through SSIS packages from different sources and putting them all in one data repository is the most powerful thing. While others have this feature, they don't have the simplicity or ease of use when getting a resource and knowing everything about it."
  • "I would like the tool to support different operating systems."

What is our primary use case?

We provide this solution as a service to customers. Sometimes, it is just used as a data warehouse. However, a couple of our customers use the data warehouse to collect information from everywhere, e.g., from their subsidiaries, such as electric or petrol companies.

We are also developing some dashboards using the data warehouse. Sometimes, you are using Microsoft BI, but most of the time, you are using Qlik. It depends on the customer needs at the end of the day. 

Most of our limitations until this moment have been on-premise. There is some sensitive information, like the billing system for an entire country's electricity. However, because of the current circumstances with COVID-19, we are discussing with them how to move it to the cloud. Mostly because of the regulation that any government data cannot be hosted on a cloud, we need to host it on a local cloud at the moment. We cannot use AWS or similar things as of now. 

While we do have local cloud vendors that we are working with, this is something in progress.

What is most valuable?

  • Availability of resources everywhere
  • The cost of resources
  • The information knowledge is there.
  • It is easy to use. You can do what you want with the tool.
  • It integrates well with SQL Server.

Collecting the data through SSIS packages from different sources and putting them all in one data repository is the most powerful thing. While others have this feature, they don't have the simplicity or ease of use when getting a resource and knowing everything about it. You can collect data from Oracle Databases, SQL databases, or file systems where it is then saved in one location, building cubes in an easy for SSIS packages. 

What needs improvement?

The setup has room for improvement. Some customers want to go to non-Windows Servers. Have the entire SQL Server done there though is an obstacle for us, as the configuration of Microsoft BI with cubes is a bit complicated.

I would like the integration to be offered in a simpler way. I would like improvement in the integration between Microsoft SQL cubes and Parallel Data Warehouse with other members of the Microsoft family, such as Microsoft Power BI and SharePoint.

This tool needs a lot of memory for processing. As it does a lot of jobs, it may need this memory while other tools don't need that much. It would help if they enhanced the memory and hardware usage of the tool in order to support the performance of doing the queries

I would like the tool to support different operating systems.

I want a dashboard.

For how long have I used the solution?

We have been using it for five to six years.

What do I think about the stability of the solution?

It is stable. We have migrated millions to billions of data points in Cubes from system to system, while running data cleansing conditions. Putting restraints on this can be time consuming which is why people may want to go to another server. When you inject equation and conditions, then it consumes a lot of time. Though, I have never compared this time consumption to other tools.

What do I think about the scalability of the solution?

I have no concern about the scalability. We have never faced any problems regarding the scalability. If we wanted to extend this, it is extendable.

How are customer service and technical support?

The online help is good enough along with our technical expertise. 

How was the initial setup?

If everything is available, such as information knowledge sources and connections, I can have the solution deployed in five days.

What about the implementation team?

I recommend getting assistance from an integrator or consultant with the integration.

What's my experience with pricing, setup cost, and licensing?

Microsoft has an agreement with the government in our country, so our customers get their licensing costs from the Ministry. Whenever we work with any government, company, or government institute, which is mainly what we are doing, that license comes directly from the Ministry of Technology and Information. 

We actually do not know that much about the price of Microsoft tools. Even if I ask for the price, I'm asking it as a partner.

What other advice do I have?

We have had a good experience working with this tool.

If you don't have a problem with memory and have good processors, then you won't have an issue. However, depending on your needs, you may want to go check out other tools.

I would rate the tool as an eight out of 10.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner.
PeerSpot user
Buyer's Guide
Microsoft Parallel Data Warehouse
November 2024
Learn what your peers think about Microsoft Parallel Data Warehouse. Get advice and tips from experienced pros sharing their opinions. Updated: November 2024.
816,406 professionals have used our research since 2012.
reviewer1038354 - PeerSpot reviewer
Team Lead at a computer software company with 10,001+ employees
Real User
Good reporting functionality for historical and real-time data, but the initial setup is complex
Pros and Cons
  • "The most valuable feature of this solution is performance."
  • "The reporting for certain types of data needs to be improved."

What is our primary use case?

I am using the Microsoft Parallel Data Warehouse for working with information services and datasets, creating reports. It is used primarily for our historical data, but we do some in real-time also. We can receive data weekly, nightly, or even hourly.

What is most valuable?

The most valuable feature of this solution is performance. It reduces the time it takes to process data.

What needs improvement?

The initial setup of this solution is complex and can be improved.

Sometimes you need to re-run the reports because they don't work.

The reporting for certain types of data needs to be improved.

For how long have I used the solution?

I have been working with these kinds of services for the past five to ten years.

What do I think about the stability of the solution?

The stability depends on your data and whether it is null. If the data is set up properly then it will be more stable.

What do I think about the scalability of the solution?

It is definitely scalable. We have more than 220 users in multiple cities. Most of the people, perhaps 70%, are information technology users. The remaining 30% are business analysts who are concerned only with reports.

We do plan on increasing our usage. We recently converted all of our Tableau data, and now we are working on permissions and security.

How are customer service and technical support?

The technical support from Microsoft is good.

How was the initial setup?

The initial setup is definitely complex.

We are not always aware of where the data comes from because it is distributed. Sometimes the data is fine and sometimes it needs to be set up. Once it is configured then it runs fine in parallel.

The length of time for deployment depends on the size of the data and whether we were also implementing a data warehouse or a data mart. When I was implementing this solution and copying data, it took between six and eight months. That is end-to-end with everything finished.

If it is a small data mart then it may only take a month.

What about the implementation team?

Our in-house team, including our DBA, handled the deployment.

What other advice do I have?

My advice for anybody who is implementing Microsoft Parallel Data Warehouse is to be aware of how big the data is. You have to be sure of how to secure and take care of the data. You have to think about scalability.

I would rate this solution a seven out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Software Engineer at Tech Mahindra Limited
Real User
Easy to set up and deploy, but it can be difficult to clean large amounts of data
Pros and Cons
  • "We are able to monitor daily jobs, so if there is anything that needs to be done then we can do it."
  • "If the database is large with a lot of columns then it is difficult to clean the data."

What is our primary use case?

This is a solution that we implement for our clients. We provide end-to-end development, maintenance, enhancement, testing, and support. Everything is done by us.

How has it helped my organization?

We are able to monitor daily jobs, so if there is anything that needs to be done then we can do it. If there are discussions with our clients about changes that need to be made then it is us who implements them.

What is most valuable?

The most valuable feature is that we can deploy directly to the production environment, rather than having to go through a development environment first.

What needs improvement?

If the database is large with a lot of columns then it is difficult to clean the data.

For how long have I used the solution?

I have been using the Microsoft Parallel Data Warehouse for two years.

What do I think about the stability of the solution?

The Microsoft Parallel Data Warehouse is a stable solution.

What do I think about the scalability of the solution?

We have not had problems with scalability.

How was the initial setup?

The initial setup is not difficult. It is easy and we are fine with that.

What about the implementation team?

We deploy this solution for our clients.

What other advice do I have?

I am not comfortable with the Microsoft Parallel Data Warehouse at the moment.

I would rate this solution a six out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
PeerSpot user
Teradata DBA / Parallel datawarehouse DBA at a tech services company with 10,001+ employees
Real User
Concurrency issues forced the customer to use the raw DB as a secondary solution
Pros and Cons
    • "​Concurrent queries are limited to 32, making it more of a data storage mechanism instead of an active DWH solution."

    What is our primary use case?

    We are using PDW as an EDW solution.

    How has it helped my organization?

    It helped, initially, as a replacement for our DW DB, but later on faced issues due to concurrency, which forced the customer to use the DB as a secondary solution.

    What is most valuable?

    Nothing specific, comparable to other solutions.

    What needs improvement?

    Concurrent queries are limited to 32, making it more of a data storage mechanism instead of an active DWH solution.

    They need to improve the metadata being captured to a greater duration.

    For how long have I used the solution?

    One to three years.
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    PeerSpot user
    it_user694689 - PeerSpot reviewer
    Business Intelligence evangelist at a hospitality company with 10,001+ employees
    Vendor
    Gives us the ability to distribute large data sets across nodes.

    What is most valuable?

    MPP processing gives us the ability to distribute large data sets across nodes.

    How has it helped my organization?

    We delivered a data warehouse for Contactless and Oyster at TFL.

    What needs improvement?

    Improve the speed of processing replicated tables.

    For how long have I used the solution?

    We have been using this solution for years.

    What do I think about the stability of the solution?

    There were stability issues when the product was in beta.

    What do I think about the scalability of the solution?

    There were scalability issues in that there is a limit to 32 concurrent queries.

    How are customer service and technical support?

    Technical support is good.

    Which solution did I use previously and why did I switch?

    We switched from the standard SQL server 2014. We use PDW to improve overall ETL and report performance.

    How was the initial setup?

    The initial setup was complex. It was fairly challenging to migrate from SQL server to PDW.

    What other advice do I have?

    Make sure your data volumes are very large: At least 60 million rows per table.

    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    PeerSpot user
    PeerSpot user
    Developer at a tech consulting company with 51-200 employees
    Consultant
    It offers high performance & low cost Data Warehousing over industry latest hardware

    Valuable Features:

    Microsoft SQL Server Parallel Data Warehouse provides significant performance boost from 10-100 X times faster for data loading operation.It minimized the cost of design with lower infrastructure requirements.It is designed to provide optimized performance on latest industry hardware.

    Room for Improvement:

    It requires significant infrastructure expertise to implement. Possibility of over specified storage or under specified CPU (need proper planning before starting implementation)It requires significant SQL Server expertise. Microsoft SQL Server Parallel Data Warehouse is purposely built to provide high performance data warehousing. By adopting industry standard hardware it avoids vendor lock-in issue. It is best suited for mid-level to large scale organization.

    Other Advice:

    Both hardware and software support is provided by Microsoft.
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    PeerSpot user
    PeerSpot user
    BI Business Analyst at a transportation company with 1,001-5,000 employees
    Vendor
    It handles high volumes of data very well. Though, it needs more compatibility with common BI tools.
    Pros and Cons
    • "It handles high volumes of data very well."
    • "​It has allowed fast daily loads and analysis of millions of rows of data, which eventually moved to near real-time.​"
    • "It needs more compatibility with common BI tools."

    What is our primary use case?

    Analysing large volumes of data collected from auto ticket barriers at railway stations.

    How has it helped my organization?

    It has allowed fast daily loads and analysis of millions of rows of data, which eventually moved to near real-time.

    What is most valuable?

    It handles high volumes of data very well.

    What needs improvement?

    It needs more compatibility with common BI tools. 

    It does not work well with normal ETL tools. Some functions do not work.

    For how long have I used the solution?

    One to three years.
    Disclosure: I am a real user, and this review is based on my own experience and opinions.
    PeerSpot user
    Buyer's Guide
    Download our free Microsoft Parallel Data Warehouse Report and get advice and tips from experienced pros sharing their opinions.
    Updated: November 2024
    Product Categories
    Data Warehouse
    Buyer's Guide
    Download our free Microsoft Parallel Data Warehouse Report and get advice and tips from experienced pros sharing their opinions.