Originally published at https://www.linkedin.com/pulse/should-i-choose-net...
Two leading Massively Parallel Processing (MPP) architectures for Data Warehousing (DW) are IBM PureData System for Analytics (formerly Netezza) and Teradata. I thought talking about the similarities and differences would be useful to decision makers who may need to choose or recommend one technology over the other.
A few years ago, I evaluated the viability of Netezza and Teradata (shortly before the Aster Data acquisition) to meet my company’s DW needs. Both Netezza and Teradata follow the relational database paradigm and use table joins. My recommendation was Netezza, based on the particular DW use case and Netezza’s MPP architecture, in-database analytics, low DBA maintenance and price.
Bottom line, Netezza costs less to own and operate, and is easier to manage. If your DW environment doesn’t need to support many thousands of users or ever-increasing volumes of data, then Netezza is an excellent choice. Teradata scales better, but it is also more expensive and requires more skilled DBA labor.
I recently spoke with a VP who migrated off Teradata to Netezza because Teradata was too expensive and demanded far more DBA effort than his company needed. In this post, I’ll contrast the two technologies at a Database Architect or Designer level of understanding. I’ll first point out some similarities, then highlight a few important differences.
Architectural Similarities
The idea behind parallel processing is that “many hands make light work”. In other words, large tasks become small when divided among several people. In this context, the “people” are processing units of memory, CPU and disk. MPP is when you get 64, 128, 256, or more synchronized processing units executing a database query simultaneously.
The primary design philosophy of both Netezza and Teradata is simple: each processor has total control of, and responsibility for, a disk. The rows of every table are spread, or distributed, as evenly as possible across all the disks in the system so that data can be retrieved in parallel.
A processing unit in Teradata is called an Access Module Processor (AMP). In Netezza it’s called a Snippet Processing Unit (SPU). In Netezza, a database query is first compiled into C and divided into units of work called snippets.
Each processing unit (Teradata AMP or Netezza SPU) has its own memory, CPU and disk. This is also called a “shared nothing” architecture. Data retrieval and manipulation operations proceed in parallel and are N times faster than they otherwise would be (where N is the degree of parallelism).
All the processing units (worker bees) are guided by the query optimizer and query coordinator. In Teradata this is the parsing engine.
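To make the divide-and-merge idea concrete, consider a simple aggregate query (the sales table and its columns below are made up purely for illustration):

-- Each processing unit (AMP or SPU) scans only its own slice of the sales rows
-- and produces partial sums per region; the coordinator merges the partial
-- results into the final answer.
SELECT region, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY region;

Whether there are 64, 128 or 256 processing units, each one does roughly 1/Nth of the work, which is where the N-times speedup comes from.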
Architectural Differences
Concurrency
Concurrency is a by-product of performance: it is the number of database queries running simultaneously at any given time. In this context, the word “query” includes searching, adding, updating or deleting data. Netezza will never physically run more than 48 queries at a time. It can hold up to 2,000 active read-only queries at once, but at most 48 will be executing and the rest will be queued. Running queries are periodically interrupted so waiting queries can use the CPU; this context switching prevents any one query from monopolizing the CPU and ensures every query gets a fair share of CPU time, while the 48-query limit keeps Netezza from wasting time switching between sessions.
While there can be up to 2,000 concurrent read-only queries, Netezza has a limit of 64 active add, update or delete queries (anything that might change data). This usually isn’t an issue in a typical analytics environment where the work of getting data in and out of Netezza is done as quickly as possible and the writers are typically ETL processes.
In contrast, Teradata can support millions of concurrent queries, allowing greater flexibility. Teradata’s own benchmarks show the system in a favorable light on concurrency, but those benchmarks are not real-world workloads.
Enforcing Referential Integrity (RI)
Netezza does not enforce RI, but depending on your DW use case, that may be perfectly fine. Netezza defines primary and foreign key constraints as metadata but doesn’t enforce them. In general, even DWs like Oracle Exadata and Teradata that do enforce RI will disable the constraints when loading data. Otherwise, the load process would be radically slower, since the referenced keys would need to be validated one row at a time.
For a DW, RI is often enforced in the data integration framework, where incoming data is cross-checked against the available keys. Once the ETL has a handle on data quality and is preventing data errors from entering the DW, the constraints can be left disabled for good, because the ETL becomes the de facto gatekeeper of data quality.
With IBM PureData System for Analytics, data load and bulk-data comparisons are incredibly fast. And because the primary and foreign key pairs are known metadata, some repeatable, metadata-driven patterns can be built that allow the referential checks to be parameterized in the ETL.
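As a rough sketch (the table and column names are illustrative, and the exact syntax should be checked against the Netezza documentation), the constraint below is recorded as metadata only, and the bulk orphan check is the kind of set-based referential test an ETL process can run instead of row-by-row validation:

-- Accepted by Netezza as metadata; the constraint is not enforced.
ALTER TABLE order_detail
ADD CONSTRAINT fk_order_detail_order
FOREIGN KEY (order_id) REFERENCES orders (order_id);

-- ETL-side gatekeeper: find detail rows whose parent order is missing.
SELECT d.order_id
FROM order_detail d
LEFT JOIN orders o ON o.order_id = d.order_id
WHERE o.order_id IS NULL;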
Primary and foreign key constraints are usually enforced with indexes, which Teradata provides to improve query performance. Teradata distributes data based on the Primary Index (PI). Choosing a PI is based on the column’s data distribution and how frequently it is used in joins. Secondary indexes provide another path to access the data. Both primary and secondary indexes can be unique or non-unique.
In contrast, Netezza’s approach to indexes is simple. Netezza doesn’t offer them. Instead, Netezza uses a distribution key, performs massively parallel table scans, and relies on zone maps for performance.
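To make the contrast concrete, here is a hedged sketch of the same table defined on each platform (the orders table and its columns are illustrative; verify the exact syntax against each vendor’s documentation):

-- Teradata: the Primary Index determines how rows are hashed across AMPs;
-- a secondary index provides an alternate access path.
CREATE TABLE orders (
    order_id INTEGER NOT NULL,
    customer_id INTEGER,
    order_date DATE
)
UNIQUE PRIMARY INDEX (order_id);

CREATE INDEX (customer_id) ON orders; -- non-unique secondary index

-- Netezza: no indexes; a distribution key spreads rows across the SPUs,
-- and zone maps plus parallel table scans take care of the rest.
CREATE TABLE orders (
    order_id INTEGER NOT NULL,
    customer_id INTEGER,
    order_date DATE
)
DISTRIBUTE ON (order_id);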
Minimizing I/O
Both Netezza and Teradata compress data and use cost-based optimizers to calculate the most efficient query plan. A query plan is like the “GPS” in your car. In a DW, it’s used to find and retrieve data.
Netezza minimizes I/O by applying restriction and projection conditions to data in a Field Programmable Gate Array (FPGA) as the data comes off the disk. The FPGA architectural firmware is the “secret rocket” that gives Netezza such incredible speed. Each Netezza processing unit includes memory, CPU, a disk drive and an FPGA.
Applying restriction and projection conditions to data in Netezza as it comes off the disks and before it reaches the rest of the I/O sub-system makes the most sense where only a relatively small fraction of the data in a particular table is required to support the rest of the query. It is still the case that all of the data has to first get off the drive before the unwanted data can be discarded in the FPGA. Although incredibly fast, Netezza’s use of parallelized full table scans can limit query concurrency.
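In SQL terms, the restriction is the WHERE clause and the projection is the column list. In a hypothetical query like the one below, the FPGA discards the unwanted rows and columns as the blocks stream off disk, before they reach the CPU and the rest of the I/O subsystem:

-- Projection: only order_id and amount survive past the FPGA.
-- Restriction: rows outside the date range are discarded as they stream off disk.
SELECT order_id, amount
FROM orders
WHERE order_date >= '2015-01-01';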
Instead of an FPGA paired with each disk drive, Teradata minimizes I/O through more traditional methods, such as range-based partitioning, which supports partition elimination, and advanced indexing, which reduces the amount of data that would otherwise have to be scanned. These partitioning and indexing strategies require more DBA involvement.
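Here is a hedged sketch of Teradata range partitioning (illustrative names and dates; syntax from memory, so verify against the Teradata documentation). Queries that filter on the partitioning column only touch the partitions that qualify:

-- Rows are hashed to AMPs by sale_id and stored in monthly partitions by sale_date.
CREATE TABLE sales (
    sale_id INTEGER NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (
    sale_date BETWEEN DATE '2015-01-01' AND DATE '2015-12-31'
    EACH INTERVAL '1' MONTH
);

-- Partition elimination: only the March partition is scanned.
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN DATE '2015-03-01' AND DATE '2015-03-31';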
Netezza has a single active host node that can become a performance bottleneck. All sessions and data must flow through this single node for final sorting and merging of results. Netezza’s use of co-located joins helps minimize sorting. Teradata automatically spreads sessions across multiple servers, providing scalable bandwidth for data flow.
Mixed Workloads
Mixed workloads are different types of queries running in the Data Warehouse and are directly related to concurrency. A DW needs to optimize a mix of ad-hoc and tactical queries, reports, data mining, data loads and visualization queries.
Using a traffic analogy, think about driving through a big city at 6 AM. The roads are quiet and uncrowded and traffic is light. But by 9 AM, the crowds, taxis, trucks and ambulances turn the streets into an all-day traffic jam. Your vehicle only goes as fast as the car in front of you. Not until 8 PM does the congestion dissipate.
Now imagine no stop signs or traffic lights. Then, eliminate the "drive on the right side of the street" rule. There would be constant chaos and traffic would gridlock.
This is similar to the daily workloads that pass through a DW. Reports (cars), tactical queries (motorcycles and bikes), executive queries (ambulances), data mining (buses) and data loading (trucks) can simultaneously clog the system by taking up space (CPU) and producing congestion (blocking others). Like a city, query elapsed time is faster and the DW performs better if the traffic flow is organized.
Fair-share, priority and pre-emptive scheduling (slow lanes, fast lanes, cutting in front) ensure that no active query starves for lack of CPU time. Resource governors (stop signs and traffic lights) throttle out-of-control queries like billion-row table joins. Organizing and prioritizing workloads ensures that the executive query (ambulance) or tactical query (motorcycle or bike) is consistently fast and can zip through, regardless of concurrent traffic.
Teradata has what is generally acknowledged to be the best mixed-workload management capability in the industry.
Distribution, Skew and Co-Located Joins
Let’s use an example of distributing 128 million rows across 128 SPUs. Once loaded, using “random” distribution, each SPU will control 1 million rows. So the table exists logically once, and physically 128 times. A SQL query against the table will physically run on all 128 SPUs simultaneously. Each SPU will work on its portion of the data, and the partial results are merged into a single result set. So the total duration of the query is the time it takes one SPU to scan 1 million rows. Using “random” distribution, all the SPUs move at this speed, in parallel, and finish at the same time.
Let’s say the 128 million rows are in an Order table, and 256 million rows are in Order_Detail. Both tables are joined on Order_ID. Rather than using “random” distribution, it appears Order_ID is what we want for a distribution key, but this may skew the data. When a distribution key is assigned, Netezza will hash it into one of 128 hash values (the number of SPUs). Every time a particular key value appears, it will always be assigned the same hash value and land on the same SPU. So now we can distribute both tables on Order_ID and be absolutely certain that, for any given SPU, all of the matching IDs for both the Order and Order_Detail tables are physically co-located on the same disk.
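Here is a hedged sketch of the DDL (the column lists are illustrative). Because both tables hash on Order_ID, matching Order and Order_Detail rows land on the same SPU, so a join on Order_ID is co-located; DISTRIBUTE ON RANDOM is the round-robin alternative described above.

CREATE TABLE orders (
    order_id BIGINT NOT NULL,
    order_date DATE
)
DISTRIBUTE ON (order_id);

CREATE TABLE order_detail (
    order_id BIGINT NOT NULL,
    line_number INTEGER,
    amount DECIMAL(12,2)
)
DISTRIBUTE ON (order_id);

-- For a table with no natural join key: ... DISTRIBUTE ON RANDOM;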
If we choose Order_ID and it turns out that a large number of rows hash to one or more overloaded SPUs, then the data distribution is “skewed”, which is detrimental to performance. Skew makes queries run slowly, because the other SPUs finish sooner and then wait on the overworked SPUs holding the extra data.
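One simple way to check for skew on Netezza (a sketch from memory: DATASLICEID is a built-in per-row value identifying the data slice a row lives on) is to count rows per data slice; a lopsided result means the chosen key is overloading some SPUs.

-- Rows per data slice for the orders table; large gaps between the highest
-- and lowest counts indicate skew on the distribution key.
SELECT datasliceid, COUNT(*) AS row_count
FROM orders
GROUP BY datasliceid
ORDER BY row_count DESC;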
Each Netezza table has only one distribution key. If a table is distributed on another key, the data would have to physically leave the SPU as it finds a new SPU home to align with its distribution key. Because redistributing data is the single biggest performance hit, the columns selected as distribution keys cannot be updated. You would need to delete the row, and insert a new one.
The distribution key can consist of 1-4 table columns. If the chosen key provides good physical distribution, then there is no reason to use more columns. More granularity in a good distribution has no value. All columns in the distribution key must be used in a join in order to achieve co-location. If we use a compound (1-4 column) distribution key, we are committed to using all columns in all joins, and this is rarely the case. You would usually use additional columns only if a single column produces high physical skew. The distribution key is a hash value for SPU assignment, not an index. If all of the columns in the distribution key are not mentioned in the join, Netezza will not attempt co-location. So even if a particular distribution key does not "directly" participate in the functional answer, it must directly participate in the join to achieve co-location.
Co-located joins provide optimal performance because data movement is minimized. In a join that is not co-located, matching rows must first be redistributed (or one of the tables broadcast) across the SPUs before the join can proceed. In a co-located join, the two data sets already live together on the same SPU, so each SPU can operate independently, without network traffic or communication with the other SPUs.
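Continuing the sketch above: because both tables are distributed on order_id and the join condition covers that entire distribution key, each SPU joins its own slice locally and no rows are redistributed or broadcast.

-- Co-located join: the join uses the full distribution key (order_id).
SELECT o.order_id, o.order_date, SUM(d.amount) AS order_total
FROM orders o
JOIN order_detail d ON d.order_id = o.order_id
GROUP BY o.order_id, o.order_date;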
Scaling Up
IBM now offers a “Growth on Demand” model where IBM will bring in more capacity than initially required and start out licensing half of that environment. Customers can then “turn on” capacity when needed by licensing more of the environment. This is controlled by IBM’s Workload Management capabilities. An example would be to bring in a two rack system and license it as if it were a single rack. Customers can then add in 1/8 increments from there until the two rack system is fully utilized. If you grow past the two racks, then you’d consider upgrading to a 4, 6, 8 or 10 rack configuration.
Teradata, by contrast, allows systems to expand incrementally by adding server nodes as necessary to meet growth needs.
Conclusion
In conclusion, before choosing or recommending one DW technology over another, evaluate your particular DW needs and execute a well-constructed benchmark of the platforms you are considering, using your own workload and data.
These views are my own and may not necessarily reflect those of my current or previous employers.
*Disclosure: I am a real user, and this review is based on my own experience and opinions.