Originally published at https://www.linkedin.com/pulse/microsoft-analytics...
In April 2014, Microsoft announced the future vision for their data
and analytics platforms. Microsoft Parallel Data Warehouse (PDW) was
rebranded as Microsoft Analytics Platform System (APS) with additional
appliance component offerings. APS combines MPP SQL Server data
warehouse with HDInsight, Microsoft’s 100% Apache Hadoop component
directly into the appliance. APS is a big data analytics appliance
capable of analyzing data of any type, structured or unstructured, and
of any size.
Microsoft ASP integrates data from SQL Server PDW with unstructured
big data from Hadoop through the PolyBase data querying technology.
Polybase gives APS a huge advantage over the competition because it you
to talk with big data in the regular T-SQL language you already use and
understand.
With APS, NoSQL doesn’t replace relational databases. Structured and
unstructured data technologies complement one another, and queries can
be executed across both universes.
By integrating Hadoop into the same rack as the relational data
warehouse, organizations can save on consulting, development and
configuration costs for Hadoop with an integrated appliance.
Why Microsoft APS is an EDW Game Changer
While ASP provides many innovations and improvements, four stand out
as strategic game changes CIOs should consider when evaluating data
warehouse and analytics strategies. SQL Server PDW, HDInsight, Polybase
and xVelocity Columnstores as a Platform for Data Mining and Analysis.
By combining both Microsoft Parallel Data Warehouse (PDW) and
unstructured Hadoop analytical capabilities in a single, easy-to-manage
EDW appliance, Microsoft APS is well positioned to help organizations
use information to enhance their competitive position.
Microsoft Parallel Data Warehouse (PDW)
Microsoft SQL Server Parallel Data Warehouse (PDW) and xVelocity Columnstores are covered in my article Microsoft Parallel Data Warehouse (PDW).
HDInsight & Hortonworks
HDInsight
is Microsoft’s 100% Apache Hadoop distribution based on Hortonworks
Data Platform. HDInsight is the phoenix that emerged from the ashes of
Dryad. Dryad was Microsoft’s own proprietary and competing version of
Hadoop that Microsoft tinkered with for 5 years before abandoning it.
PolyBase & Big Data Hadoop Integration
It’s not enough to store data in Hadoop. Businesses today need to
figure out how they can analyze Hadoop data fast and seamlessly in order
to make more informed business decisions. Unstructured and high volume
data are the two fastest growing types of enterprise data.
Organizations are using Apache Hadoop to store process non-relational
data from sources like blogs, clickstream data that is generated at a
rapid rate, social sentiment data with different schemas customer
feedback, sensor data, or telemetry data feeds. Most of this data is
not suitable for relational database management systems and often ends
up isolated from business users because it is not integrated with data
in the traditional data warehouse.
Technologies, like Hadoop are generally used, but implementing Hadoop
with traditional data warehouse and business intelligence
platforms pose new challenges. Hadoop is both open source, Java-based
and manages non-relational data across many nodes. It’s easy to add data
to Hadoop, but not so quick to extract and analyze it. The idea is
that, if the data is there, it may take a while to retrieve it, but at
least the data is stored somewhere in the system. MapReduce doesn't
have to be implemented in Java, however.
Big Data is not only about figuring out how to store, manage, and
analyze data from non-relational sources, but also about mashing
together various non-relational data with an organization’s relational
data to gain business insight. See my articles on Big Data
Data Lakes & Don’t Drown in the Data and 360 Degree View & Unifying Enterprise Data in a Sea.
PolyBase is the Microsoft APS query tool that enables you to easily
query PDW and HDInsight data using T-SQL, without investing in
Hadoop-based skills or training.
Microsoft PolyBase is a fundamental breakthrough on the data
processing engine which enables integrated query across Hadoop and
relational data. PolyBase opens up a whole new world of data analysis
and integration possibilities. This integration allows organizations to
merge large volumes of non-relational data stored within Hadoop with
their traditional enterprise data. Customers can continue to use their
existing analytics tool set to analyze their organization’s big data.
Without manual intervention, PolyBase Query Processor can accept a
standard SQL query and join tables from a relational source with tables
from a Hadoop source to return a combined result seamlessly to the user.
Queries that run too slow in Hadoop can now run quickly in PDW, data
mining queries can combine Hadoop and PDW data, Hadoop data can be
stored as relational data in PDW, and query results can be stored back
to Hadoop.
By using the power of Microsoft APS to run queries on Hadoop data in
HDInsight, it is now possible to do more in-depth data mining,
reporting, and analysis without acquiring the skills to run MapReduce
queries in Hadoop. PolyBase gives you the flexibility to structure the
Hadoop data you need, when you need it, as it’s brought into PDW for
fast analysis. You can seamlessly select from both Hadoop data in
HDInsight and PDW data in the same query, and join data from both data
sources. To satisfy a query, PolyBase transfers data quickly and
directly between PDW’s Compute Nodes and Hadoop’s Data Nodes.
APS uses external tables to point to data stored in text files on a
Hadoop HDFS cluster. Once an external table is created, the table can be
used in a select statement in the same manner as a PDW table. PolyBase
uses a single Transact-SQL query interface to leverage PDW and Hadoop,
so you don’t need to learn a host of new skills to run MapReduce queries
in Hadoop. PolyBase hides all the complexity of using Hadoop so most
business users do not need to know anything about Hadoop.
PolyBase uses ‘predicate pushdown’ to Hadoop that generates
map-reduce jobs behind the scenes to do the work on the Hadoop side
instead of distributed query data movement when necessary.
With PolyBase, organizations can take advantage of flexible hybrid
Hadoop solutions and query across Hortonworks, Cloudera, and even into
the cloud with Microsoft Azure HDInsight. PolyBase is only available in
Microsoft APS and is not available in SQL Server SMP at this time.
Integration with Business Intelligence Tools
APS has deep integration with Microsoft’s BI tools and other leading
non-Microsoft tools, making it simple to use the BI tools you are
familiar with to perform analysis. APS’s deep integration with Business
Intelligence (BI) tools makes APS a comprehensive platform for building
end-to-end data mining and analysis solutions. APS integrates with the
Microsoft BI Stack including Reporting Services, Analysis Services,
PowerPivot for Excel, and PowerView. But, APS also integrates with a
growing list of leading non-Microsoft BI platforms, such as Business
Objects, Cognos, SAP Data Integrator, Tableau, MicroStrategy, QlikView,
Oracle Business Intelligence, and TIBCO Spotfire.
Easy to Use & Manage
APS is designed for simplicity. The complexity is already engineered
into the appliance so that you don’t have to handle the details. The
appliance arrives with the hardware and software already configured and
installed. PDW handles all the plug and play details of distributing
the data across the appliance nodes, performs all the extra steps
required to process queries in parallel, and manages the low-level
hardware and software configuration settings. No tuning is required
because the appliance is already built and tuned to balance CPU, memory,
I/O, storage, network, and other resources.
Minimal Learning Curve
EPS has a minimal learning curve. There’s no need to
hire new talent in order to move from SQL Server SMP to SQL Server PDW
and EPS. DBAs who already know T-SQL can easily transfer their SQL
Server SMP knowledge to PDW. Some T-SQL query statements are added or
extended to accommodate the MPP architecture. There’s less DBA
maintenance. You don’t need to create indexes besides a clustered
columnstore index. DBAs can spend more of their time as architects and
not baby sitters. In my opinion, the alignment of APS with existing IT
skills may be its
biggest competitive advantage.
The appliance model is key to getting great performance. Tuning a
large database using traditional approaches is extremely difficult and
requires highly skilled DBAs. One of the main problems with the SMP
model, is the difficulty of understanding and tuning the interface
between the DBMS software and the underlying OS and hardware platform.
With SMP, there are a plethora of tuning parameters and options for the
DBA and OS administrator to setup. In the appliance model, the entire
software and hardware stack from SQL to storage is automatically
controlled. As a result, virtually all the complexity is removed.
Manageable Costs
Microsoft APS has manageable costs. APS has lower price/terabyte over
other companies by a significant margin. About 2x lower than Teradata,
Oracle, Greenplum and others. It’s worth noting that Microsoft’s
offering is cheaper than the competition not because of lower quality or
missing capabilities, but because of a different business strategy.
The strategy of commoditizing markets and then selling higher volumes to
make up for lower margins. Given that SQL Server is one of the most
popular enterprise databases on the planet, and APS falls under the SQL
umbrella, it has enough of a relative advantage that it could easily
become the biggest Big Data appliance player of all.
Microsoft APS & Hub and Spoke Architecture
See my article Microsoft APS & Hub and Spoke Architecture about using Microsoft APS to Build a Hub and Spoke EDW Architecture.
These views are my own and may not necessarily reflect those of my current or previous employers.
*Disclosure: I am a real user, and this review is based on my own experience and opinions.