Redshift Vs Teradata - An In-Depth Comparison


Table of Contents

Redshift Vs Teradata
Redshift Architecture & Its Features
Teradata Architecture & Its Features
Redshift Data Model
Teradata Data Model
Redshift Pros and Cons
Teradata Pros and Cons
Features supported only by Redshift, not Teradata
Features supported only by Teradata, not Redshift
Redshift Vs Teradata In A Nutshell
Pricing and Effort Comparison
When and How to Migrate data from Teradata to Redshift
Summary
ETL Challenges While Working With Amazon Redshift


1<br />

<strong>Redshift</strong> <strong>Vs</strong> <strong>Teradata</strong><br />

<strong>Redshift</strong> versus <strong>Teradata</strong> has been one of the most debatable data<br />

warehouse comparisons. <strong>In</strong> this ebook, we will cover the detailed<br />

comparison between <strong>Redshift</strong> and <strong>Teradata</strong>.<br />

<strong>Redshift</strong> Architecture & Its Features<br />

<strong>Redshift</strong> is a fully managed petabyte scale data warehouse on the cloud.<br />

You can even start working from a few Gigabytes or Terabytes of data.<br />

Additionally, you can also scale it up to petabytes depending upon your<br />

business requirement. <strong>Redshift</strong> engine is also called a cluster and it is<br />

built up from one or more nodes. There are two types of nodes called<br />

Compute and Leader node. Compute node contains 2 or more slices<br />

depending upon node types. Leader node does multiple roles which<br />

include communicating with JDBC/ODBC client and creating the query<br />

execution plan to transfer it to compute node(s). Also, the cluster is<br />

incomplete without a Leader node.<br />

You can check out our blog for a detailed article on ​<strong>Redshift</strong> Architecture​.
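As a quick way to see how slices map to compute nodes, you can query the STV_SLICES system view (a minimal illustrative query; the node and slice counts you see will depend on your cluster's node type and size):

SELECT node, slice
FROM stv_slices
ORDER BY node, slice;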


2<br />

<strong>Teradata</strong> Architecture & Its Features<br />

<strong>Teradata</strong> is an RDBMS, meant for a data warehouse with an on-premise<br />

setup. It requires installation since it is unavailable on cloud platforms.<br />

Although <strong>Teradata</strong> is not over the cloud, you can spin up a <strong>Teradata</strong><br />

instance on a cloud VM. <strong>Teradata</strong> is designed on MPP shared nothing<br />

architecture.<br />

Here is a diagrammatic representation of <strong>Teradata</strong> Architecture.


3<br />

The four major components of <strong>Teradata</strong> are as follows:<br />

1. Node:​ The primary component of <strong>Teradata</strong> is called Node, which is a<br />

basic unit of <strong>Teradata</strong>. It has its own OS, CPU, RAM, disk space etc.<br />

2. Parsing Engine:​ Parsing Engine or PE is responsible for preparing the<br />

query execution plan.<br />

3.​ ​BYNET:​ BYNET receives query execution plan from PE and transfers it<br />

to AMPs aka Virtual Processor and vice versa. It is also called as Message<br />

Parsing layer.<br />

4. Access Module Processor (AMP):​ AMP is an important component of<br />

<strong>Teradata</strong>. AMP manages the processing of data by storing it in vDisks.<br />

Data can be stored in any AMP depending on the hash algorithm. <strong>In</strong> case<br />

the first BYNET fails there is an additional BYNET to take over. BYNET is<br />

responsible to communicate between the AMPs. <strong>In</strong> multi-node systems,<br />

<strong>Teradata</strong> will have at least two BYNETs to make the system fault tolerant.


4<br />

<strong>Redshift</strong> Data Model<br />

<strong>Redshift</strong> data model is designed for Data warehousing purposes. The<br />

unique features of <strong>Redshift</strong> make it a smart Data warehouse choice.<br />

1. <strong>Redshift</strong> is a fully managed data warehouse. You don't have to worry<br />

about setting up and installing the database. You just have to spin up<br />

your cluster and the database is ready.<br />

2. <strong>Redshift</strong>’s backup and restore are fully automatic. Through automatic<br />

snapshots, data in <strong>Redshift</strong> automatically gets backed up in S3 internally<br />

at regular intervals.<br />

3. Data is fully secured by inbound security rule and SSL connection. It<br />

has VPC for VPC mode and inbound security rule for classic mode cluster.<br />

4. <strong>Redshift</strong> stores data in the columnar format, unlike other data<br />

warehouses storage. For example, if you hit your query for a specific<br />

column, <strong>Redshift</strong> will exclusively search in that specific column instead of<br />

the entire row. This saves an enormous amount of time in query<br />

processing.<br />

5. Data is stored in blocks of 1 MB instead of typical blocks of 8 KB or 64<br />

KB which helps <strong>Redshift</strong> to store more data in a single block.<br />

6. <strong>Redshift</strong> does not have the concept of indexes. <strong>In</strong>stead, it has zone<br />

maps. With the help of zone map <strong>Redshift</strong> easily identifies which block<br />

has lowest and highest value for that column. Zone maps inform the<br />

cluster about all the blocks that are needed to be read.<br />

7. <strong>Redshift</strong> has column compression (encoding). ANALYZE<br />

COMPRESSION command automatically tells what compression strategy<br />

to apply for that table. <strong>Redshift</strong> provides various encoding techniques.<br />

Refer ​AWS documentation​ for more details on encoding.


5<br />

8. <strong>Redshift</strong> has a feature of caching the result of repeat queries for faster<br />

performance. To check whether your query has used cache, you can see<br />

the output of column source_query available in SVL_QLOG. If your query<br />

has used cache it will store the value of query id of which was run by the<br />

specific user id.<br />

Example:<br />

SELECT​ ​USERID, QUERY, ELAPSED, SOURCE_QUERY from SVL_QLOG ​WHERE<br />

USERID in (600, 601);<br />

<strong>In</strong> the below example, QUERY ID 853219 of USERID 601 has used the<br />

cache. (QUERY ID 123456 of USERID 600). Also, ​ ​QUERY ID 853219 ran<br />

by userid → 601 has utilized the cache and elapsed time in microseconds<br />

has reduced drastically.<br />

USERID | QUERY ID<br />

| ELAPSED | SOURCE_QUERY<br />

--------+-------------+----------+---------------<br />

600 | 123456 | 90000 | NULL<br />

600 | 567890 | 80000 | NULL<br />

601 | 853219 | 30 | 123456<br />

9. <strong>Redshift</strong> data model is similar to a typical data warehouse when it<br />

comes to analytical queries. You can create fact tables, dimension tables,<br />

and views. It supports all major query execution strategy i.e., <strong>In</strong>ner join,<br />

Outer join, Subquery, and Common Table Expressions (with clause).<br />

10. From a storage perspective <strong>Redshift</strong> cluster maintains multiple copies<br />

of your data as part of fault tolerance.
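To make point 9 concrete, here is a minimal sketch of a typical analytical query; the fact_sales and dim_store tables and their columns are hypothetical:

WITH store_sales AS (
    -- Aggregate the fact table by store
    SELECT f.store_id, SUM(f.amount) AS total_amount
    FROM fact_sales f
    GROUP BY f.store_id
)
-- Join the aggregate back to the dimension for reporting
SELECT d.store_name, ss.total_amount
FROM store_sales ss
JOIN dim_store d ON d.store_id = ss.store_id
ORDER BY ss.total_amount DESC;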


6<br />

<strong>Teradata</strong> Data Model<br />

1. <strong>Teradata</strong> is a massive parallel Data warehouse with shared-nothing<br />

architecture. However, unlike <strong>Redshift</strong>, the data is stored in a row-based<br />

format.<br />

2. <strong>Teradata</strong> uses a different kind of indexes for fast data retrieval. <strong>In</strong>dexes<br />

include Primary, Secondary, Join, and Hash <strong>In</strong>dexes, etc. Please note that<br />

Secondary <strong>In</strong>dex does not affect the distribution of rows across AMPs.<br />

Although, the secondary index takes extra processing overhead.<br />

3. <strong>Teradata</strong> supports and enforces Primary and Secondary index.<br />

4. <strong>Teradata</strong> has a hybrid storage concept where frequently used data is<br />

stored in SSD while the less accessed data is stored in HDD. <strong>Teradata</strong><br />

has a higher storage capacity than <strong>Redshift</strong>.<br />

5. <strong>Teradata</strong> does support Table partitioning feature, unlike <strong>Redshift</strong>.<br />

6. <strong>Teradata</strong> uses the Hash algorithm to distribute data into various disk<br />

storage units.<br />

7. <strong>Teradata</strong> can scale up to 2048 nodes. It has a storage capacity ranging<br />

from 10 TB to 94 petabytes thus providing higher storage capacity than<br />

<strong>Redshift</strong>.<br />

8. <strong>Teradata</strong> supports all kinds of major SQL related features (Primary<br />

<strong>In</strong>dex, Secondary <strong>In</strong>dex, Sequences, Stored Procedures, User Defined<br />

Functions, and Macros etc) which are compulsorily needed as part of Data<br />

Warehouse RDBMS.<br />

9. <strong>Teradata</strong>'s data model is designed to be fault tolerant. It is also<br />

designed to be scalable with redundant network connectivity to ensure<br />

throughout data connectivity and availability.
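As a sketch of points 2, 5, and 6, here is what a partitioned Teradata table could look like; the table, columns, and date range are illustrative, not from any particular schema:

CREATE SET TABLE Sales (
    SaleId    INTEGER NOT NULL,
    StoreId   INTEGER,
    SaleDate  DATE NOT NULL,
    Amount    DECIMAL(12,2)
)
PRIMARY INDEX (SaleId)   -- rows are hashed to AMPs by SaleId
PARTITION BY RANGE_N(
    SaleDate BETWEEN DATE '2018-01-01' AND DATE '2018-12-31'
    EACH INTERVAL '1' MONTH   -- one partition per month
);

Because it is a SET table, exact duplicate rows are rejected; the PPI (partitioned primary index) lets queries that filter on SaleDate skip whole partitions.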


7<br />

<strong>Redshift</strong> Pros and Cons<br />

Pros<br />

1. Loading and unloading of data is exceptionally fast. You can load data<br />

in parallel mode. <strong>Redshift</strong>, even for a high volume of data, supports data<br />

loading from the zipped file. <strong>Redshift</strong> recommends loading the data from<br />

the COPY command for faster performance.<br />

2. You can load data from NoSQL database service, AWS DynamoDB.<br />

Refer ​AWS documentation​ for more detailed information about<br />

DynamoDB.<br />

3. You have an option to choose the node type (Dense Storage or Dense<br />

Compute) of your cluster depending upon your data needs and business<br />

requirements.<br />

4. You can scale your cluster's storage and CPU for better performance at<br />

any instant without any impact to the cluster.<br />

5. You can migrate your data from various data warehouses into <strong>Redshift</strong><br />

without much hassle. AWS does provide a service for the same called<br />

Database Migration Service (DMS). Refer to ​AWS documentation​ for<br />

more detailed information.<br />

6. You do not have to worry about the security as you can build your<br />

cluster inside a VPC and also use SSL encryption for further protection.<br />

7. <strong>Redshift</strong> backup and restore feature is pretty simple. Through<br />

automatic snapshots, your data is automatically backed up regularly.<br />

Snapshots are incremental, so you do not have to worry about any<br />

misses. You can also copy data to another region in case of any business<br />

need. Kindly refer ​AWS documentation​ for more details on working with<br />

snapshots.


8<br />

8. <strong>Redshift</strong> has an advanced feature called <strong>Redshift</strong> Spectrum. Using<br />

<strong>Redshift</strong> Spectrum you can query huge amounts of data directly from S3.<br />

While doing so, you can skip the loading of data through COPY command<br />

or any other method. You can refer to the detailed guide on ​<strong>Redshift</strong><br />

Spectrum​ for more information.<br />

9. Using Sort Keys, data can be pre-sorted based on specific columns.<br />

Also, the query performance can be improved automatically.<br />

10. Using Distribution Keys, data can be easily distributed across nodes<br />

equally to increase the query performance.<br />

11. <strong>Redshift</strong> provides various pre-built system tables and views to help<br />

developers and designers to help out during ETL and other processes.<br />

12. Setup related commands can be run through various modes such as<br />

AWS console, Command Line <strong>In</strong>terface (CLI), API, etc.<br />

13. AWS <strong>Redshift</strong> applies some patches and upgrades to the cluster<br />

automatically through maintenance window (configurable value). ence<br />

you do not have to worry about applying patches.<br />
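To make pros 1, 9, and 10 concrete, here is a minimal sketch; the table definition, S3 path, and IAM role are hypothetical placeholders:

CREATE TABLE sales (
    sale_id   INTEGER,
    store_id  INTEGER,
    sale_date DATE,
    amount    DECIMAL(12,2)
)
DISTKEY (store_id)     -- rows are distributed across slices by store_id
SORTKEY (sale_date);   -- blocks are pre-sorted by date for range-restricted scans

COPY sales
FROM 's3://my-bucket/sales/part_'   -- all matching gzipped file parts load in parallel
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
DELIMITER '|'
GZIP;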

Cons<br />

1. <strong>In</strong> <strong>Redshift</strong>, there is no concept of function, triggers, and procedures.<br />

2. There is no concept of sequence column in <strong>Redshift</strong>. You need to<br />

handle it through your ETL logic in case you need to generate sequence<br />

number of your column.<br />

3. Unlike other common data warehouses, <strong>Redshift</strong> does not enforce<br />

Primary keys or Foreign keys which can create data integrity issues.


9<br />

4. Only S3, DynamoDB, and EMR support a parallel load in <strong>Redshift</strong>. <strong>In</strong><br />

case you want to load data from other services you need to write ETL<br />

scripts or use ETL solutions such as ​Hevo​.<br />

5. It requires a good understanding of Sort and Dist key. There are some<br />

basic ground rules to set for sort and dist keys. If set improperly then it<br />

could lead to hampering of performance.<br />

6. Distribution keys cannot be changed once it is created. You need to be<br />

extremely careful while designing your tables. Wrong distribution keys<br />

could hamper the overall performance.<br />

7. <strong>In</strong> <strong>Redshift</strong>, there is no concept of DBLink, you cannot directly connect<br />

to another database/data warehouse tables for your queries.<br />

8. <strong>In</strong> <strong>Redshift</strong>, VACUUM and ANALYZE are mandatory on key tables. It<br />

can hamper the performance badly if run during business hours. Hence it<br />

needs to be handled carefully.<br />

9. <strong>In</strong> <strong>Redshift</strong> cluster, there is a limit on the number of nodes, databases,<br />

tables, etc. Maximum storage limit is still lesser than data warehouses like<br />

<strong>Teradata</strong>. Here is the node limitation list:<br />

Node Type vCPU Storage per Node Node Range<br />

dc1.large 2 160 GB SSD 1-32<br />

dc1.8xlarge 32 2.56 TB SSD 2-128<br />

dc2.large 2 160 GB NVMe-SSD 1-32<br />

dc2.8xlarge 32 2.56 TB NVMe-SSD 2-128<br />

ds2.xlarge 4 2 TB HDD 1-32<br />

ds2.8xlarge 36 16 TB HDD 2-128<br />

You can refer to ​AWS documentation​ to know more about the limits in<br />

Amazon <strong>Redshift</strong>.


10<br />

10. Although <strong>Redshift</strong> in classic mode is still in use, its cluster<br />

performance is relatively modest.<br />

11. <strong>Redshift</strong> still supports only a single AZ environment and does not<br />

support multi-AZ environment.<br />

12. <strong>Redshift</strong> has a limit on query concurrency of 15. You can have a<br />

maximum of 8 queues in a cluster. If your queues are unmanaged, then it<br />

hinders the performance.<br />

13. Your design should make sure that the cluster is not in use during the<br />

maintenance window period, else job will fail.<br />

14. There is no concept of table partitioning in <strong>Redshift</strong>.<br />

15. <strong>In</strong> <strong>Redshift</strong>, you do not have a concept of SET and MULTISET tables<br />

(SET tables are the tables that do not allow duplicates). This needs to be<br />

handled programmatically else it could lead to reporting errors if handled<br />

inappropriately.<br />
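As one common workaround for cons 2 and 15 (a sketch, with hypothetical staging and target tables): the ROW_NUMBER() window function can generate sequence-like surrogate keys and weed out the duplicates that a SET table would have rejected:

-- Generate a sequence-like surrogate key while loading from staging
-- (assumes the target is empty; offset by MAX(order_seq) for incremental loads)
INSERT INTO orders (order_seq, order_ref, amount)
SELECT ROW_NUMBER() OVER (ORDER BY order_ref) AS order_seq,
       order_ref,
       amount
FROM stg_orders;

-- Keep one copy of each duplicate row (emulating SET-table behavior)
SELECT order_ref, amount
FROM (
    SELECT order_ref, amount,
           ROW_NUMBER() OVER (PARTITION BY order_ref, amount ORDER BY order_ref) AS rn
    FROM stg_orders
) t
WHERE rn = 1;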

You can refer to Hevo's blog, which discusses the Pros and Cons of Amazon Redshift in complete detail.


11<br />

<strong>Teradata</strong> Pros and Cons<br />

Pros<br />

1.​ ​<strong>Teradata</strong> is a massively parallel data warehouse with shared nothing<br />

architecture.<br />

2. <strong>Teradata</strong> has provided pre-built utilities i.e. Fastload, Multiload, TPT,<br />

BTEQ etc.<br />

3. <strong>Teradata</strong> is linearly scalable. If data volume rises, AMPs or Nodes can<br />

also be increased.<br />

4. <strong>Teradata</strong> also has fallback feature. <strong>In</strong> case one AMP is down, another<br />

AMP will take over for data retrieval.<br />

5. <strong>Teradata</strong> provides an impressive tool called <strong>Teradata</strong> Visual Explain. It<br />

visually shows the execution plan of queries in a graphical manner. This<br />

helps developers/designers to fine-tune their queries.<br />

6. <strong>Teradata</strong> provides Ferret utility to set and display storage space<br />

utilization.
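For a taste of point 2, a minimal BTEQ script could look like the following; the server name, credentials, and table are placeholders:

.LOGON tdserver/dbc,dbc;

SELECT StoreId, COUNT(*) AS sales_cnt
FROM Sales
GROUP BY StoreId;

.LOGOFF;
.QUIT;

BTEQ runs such scripts in batch, which makes it a common building block for scheduled ETL jobs.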


12<br />

Cons<br />

1. One of the biggest cons of <strong>Teradata</strong> is that it is not cloud-based unless<br />

scaled up to run over the cloud. It requires some initial setup or you need<br />

to integrate with other cloud service providers i.e, AWS or Azure.<br />

2. It is not a columnar data warehouse.<br />

3. Since <strong>Teradata</strong> is not a columnar DB, it runs entire row even if you<br />

search over a single column. You may end up with performance issues<br />

unless your data warehouse is properly designed.<br />

4. If a query runs on a set of different columns over the bigger dataset, it<br />

could lead to performance issues; unless query has been run on the<br />

indexed columns.<br />

5. <strong>Teradata</strong> only supports a maximum of 128 joins in a single query. If you<br />

want to perform more joins, you need to break them into chunks and<br />

handle it accordingly.<br />

6. <strong>Redshift</strong> outperforms <strong>Teradata</strong> in <strong>An</strong>alytical performance, Visualisation<br />

on storage, & CPU utilization visualization. Everything can be viewed in a<br />

single AWS console or through the Cloudwatch monitor in <strong>Redshift</strong>. On<br />

the other hand, <strong>Teradata</strong> provides separate visual tools while for few<br />

others checks and commands need to be hit in <strong>Teradata</strong> client.<br />

7. <strong>Teradata</strong> has no default column compression mechanism. Column<br />

compression needs to be done manually, and you can perform up to 256<br />

unique column value compression per column.<br />

8. There are a lot of limitations on the number of columns, table value, and<br />

table name length in <strong>Teradata</strong>. You can refer to ​<strong>Teradata</strong> documentation<br />

for more detailed information.


13<br />

Features supported only by <strong>Redshift</strong>, not<br />

<strong>Teradata</strong><br />

1. The most valuable feature of <strong>Redshift</strong> is that it is cloud-based and fully<br />

managed. Although, <strong>Teradata</strong> has a <strong>Teradata</strong> Database Developer (Single<br />

Node) a full-featured data warehouse software.<br />

2. No need to worry about backup and restore as manual snapshots and<br />

restore can also be done.<br />

3. Backed up data (snapshot) is automatically stored in S3. No need to<br />

worry about storing data in tape or any outside system.<br />

4. <strong>Redshift</strong> has an excellent feature of loading data through COPY<br />

command that too in the parallel mode where all nodes/slices can<br />

participate together to make the performance faster.<br />

5. <strong>Redshift</strong> performs automatic column level compression, and it suggests<br />

compression mechanisms on all table columns (command is ANALYZE<br />

COMPRESSION).<br />

6. Due to the VPC feature in AWS, <strong>Redshift</strong> security is too tight and well<br />

controlled.
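To illustrate point 5, you can ask Redshift to recommend encodings for an existing table, or declare encodings explicitly at table creation; the table and column names here are hypothetical:

-- Scan a sample of the table and report a suggested encoding
-- and estimated size reduction for each column
ANALYZE COMPRESSION sales;

-- Or set encodings explicitly when creating a table
CREATE TABLE sales_history (
    sale_id INTEGER      ENCODE delta,   -- good for steadily increasing ids
    payload VARCHAR(256) ENCODE lzo      -- general-purpose text compression
);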


14<br />

Features supported only by <strong>Teradata</strong>, not<br />

<strong>Redshift</strong><br />

1. <strong>Teradata</strong> supports various features including Procedures, Triggers, etc.<br />

2. <strong>Teradata</strong> has a column sequencing feature while <strong>Redshift</strong> doesn't.<br />

3. <strong>Teradata</strong> provides various load and unload utilities i.e. TPT, FastLoad,<br />

FastExport, Multiload, TPump, and BTEQ. You can use them depending<br />

upon data volume, business logic, and leverage it in your ETL logic.<br />

4. <strong>Teradata</strong> has a few visual utilities which <strong>Redshift</strong> should have such as<br />

<strong>Teradata</strong> Visual Explain. <strong>In</strong> <strong>Redshift</strong>, you need to hit query to view Explain<br />

plan.<br />

5. <strong>Teradata</strong> supports MULTISET and SET tables while <strong>Redshift</strong> doesn't.<br />

6. <strong>Teradata</strong> supports Macros but <strong>Redshift</strong> doesn't. Macros are a set of<br />

predefined SQL statements logically stored in Database. Macros also<br />

reduce LAN traffic.<br />

Example:<br />

CREATE​ ​MACRO​ ​Get_Sales​ ​AS​ ​(<br />

SELECT​ ​SalesId, StoreId, StoreName, StoreAddress​ ​FROM ​Stores ​ORDER BY<br />

StoreId;<br />

);<br />

Exec Get_Sales;<br />

→ ​This macro execute command will retrieve all rows from Stores table.


15<br />

<strong>Redshift</strong> <strong>Vs</strong> <strong>Teradata</strong> <strong>In</strong> A Nutshell<br />

Items <strong>Redshift</strong> <strong>Teradata</strong><br />

Cloud perspective<br />

Backup and restore<br />

strategy<br />

Data Load and<br />

Unload<br />

Table Storage<br />

Fully managed Data Warehouse<br />

over cloud.<br />

Backups are automatically taken<br />

care of through the snapshot<br />

feature. Snapshots are stored<br />

internally stored in S3, which is<br />

highly durable.<br />

<strong>Redshift</strong> leverages data load<br />

through COPY command and<br />

unload through UNLOAD<br />

command. Using COPY<br />

command, data is loaded<br />

automatically so that all nodes<br />

can participate equally for faster<br />

performance.<br />

<strong>Redshift</strong> follows columnar<br />

storage format. If the query is hit<br />

based on a specific set of the<br />

columns or only on specific<br />

column then it provides an<br />

impressive performance. Hence,<br />

aggregates are very fast in<br />

<strong>Redshift</strong> as it leverages column<br />

level hit.<br />

Core Data Warehouse is not<br />

over the cloud. <strong>In</strong>itial setup is<br />

required by DBAs/Export.<br />

<strong>Teradata</strong> can be scaled to run<br />

over the cloud (AWS/Azure)<br />

with pay-as-you-go model.<br />

<strong>Teradata</strong> backup and restore<br />

can be manual or automated<br />

(using BAR) but data is stored<br />

in an outside system.<br />

<strong>In</strong> <strong>Teradata</strong>, we have separate<br />

utilities to handle load/unload.<br />

<strong>Teradata</strong> provides TPT,<br />

FastExport, FastLoad, etc. They<br />

can be leveraged accordingly<br />

for your ETL/ELT.<br />

<strong>Teradata</strong> follows row level<br />

storage. <strong>Teradata</strong> requires a<br />

proper indexing on columns so<br />

that data can be stored<br />

properly in AMPs. If indexes<br />

are not proper or table hit is<br />

done on non-indexed column<br />

then it could cause<br />

performance issue.


16<br />

<strong>In</strong>ternal Storage<br />

<strong>In</strong> <strong>Redshift</strong>, data is stored over<br />

chunks of 1 MB blocks of each<br />

column. Each block follows zone<br />

mapping. Using zone mapping,<br />

blocks stores minimum and<br />

maximum value of that column.<br />

<strong>In</strong> <strong>Teradata</strong>, the data storage is<br />

managed by AMPs under<br />

vDisks and data is distributed<br />

based on hash algorithm (i.e.<br />

based on index defined etc)<br />

and data is retrieved<br />

accordingly.<br />

Referential<br />

<strong>In</strong>tegrity Model<br />

<strong>Redshift</strong> tables do have Primary<br />

Keys and Foreign Keys but it<br />

does not follow enforcement.<br />

You need to apply your logic<br />

such that referential integrity<br />

model is applied on <strong>Redshift</strong><br />

tables.<br />

<strong>Teradata</strong> tables have Primary<br />

Keys and Foreign Keys and it<br />

follows enforcement.<br />

Hence, it has an additional<br />

overhead of doing reference<br />

checks while processing.<br />

Sequence Support<br />

Triggers, Stored<br />

Procedures<br />

Visual Features<br />

There is no concept of column<br />

sequencing. If you want to create<br />

a sequence on any column you<br />

need to handle it<br />

programmatically.<br />

<strong>In</strong> <strong>Redshift</strong>, there is no concept<br />

of Triggers or Stored Procedures.<br />

<strong>Redshift</strong> is a part of AWS,<br />

an integrated service. Entire<br />

<strong>Redshift</strong> performance can be<br />

monitored through AWS<br />

console, Cloudwatch, and<br />

automatic alerts.<br />

You can define Sequence on a<br />

column.<br />

You can create Triggers or<br />

Stored Procedures in <strong>Teradata</strong>.<br />

It has few visual tools like<br />

<strong>Teradata</strong> Visual Explain but<br />

they are cluttered.<br />

Max Concurrency<br />

Maximum 15 concurrent queries. Runs more than 15 concurrent


17<br />

By default its concurrency is 5.<br />

queries.<br />

Macros Support No concept of Macros. Supports Macros.<br />

NoSQL to <strong>Redshift</strong><br />

Feature<br />

Although, <strong>Redshift</strong> cannot load<br />

NoSQL data from other vendors<br />

but it can load data from<br />

DynamoDB.<br />

No such feature supported yet.<br />

Maximum Storage<br />

Capacity<br />

2 PB<br />

(16*128 DS2.8xlarge ~ 2 PB)<br />

Storage capacity of much more<br />

than 2 PB of data.<br />

Column<br />

Compression<br />

<strong>In</strong> <strong>Redshift</strong>, when the table is<br />

created it automatically creates<br />

default compression on all<br />

columns. It also provides a<br />

command called ANALYSE<br />

COMPRESSION to help on<br />

column compression.<br />

<strong>In</strong> <strong>Teradata</strong>, you need to<br />

specify column compress on<br />

individual columns. You can<br />

compress<br />

up to 128 unique values per<br />

column in a table.<br />

Maximum Columns<br />

Per Table<br />

Maximum 1600 columns per<br />

table.<br />

Maximum 258 columns per<br />

row.<br />

Maximum Joins No limit as such. 64 joins per query block.<br />

Data Warehouse<br />

Maintenance/<br />

Updates<br />

Table <strong>In</strong>dexes<br />

<strong>Redshift</strong> applies regular patches<br />

and does automatic maintenance<br />

inside maintenance window.<br />

It does not have table index<br />

concept but its performance is<br />

<strong>In</strong> <strong>Teradata</strong>, DBAs need to take<br />

care of all these activities<br />

manually or through some tool.<br />

<strong>Teradata</strong> does provide various<br />

types of index i.e. Primary


18<br />

unaffected due to zone mapping<br />

and sort key features.<br />

<strong>In</strong>dex, Secondary <strong>In</strong>dex, etc.<br />

Table partitioning<br />

<strong>Redshift</strong> Spectrum has but<br />

<strong>Redshift</strong> doesn’t.<br />

Tables can be partitioned.<br />

Fault Tolerance<br />

<strong>Redshift</strong> is Fault Tolerant. <strong>In</strong><br />

case, there is any node failure,<br />

<strong>Redshift</strong> will automatically<br />

replace the failed node with the<br />

replacement node. Although,<br />

multi-AZ is not supported in<br />

<strong>Redshift</strong>.<br />

<strong>Teradata</strong> is also fault tolerant.<br />

<strong>In</strong> case, there is a failover in<br />

AMP, fallback AMP will take<br />

over automatically.
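Complementing the COPY sketch earlier, here is a minimal UNLOAD example for the Data Load and Unload row above; the bucket and IAM role are placeholders. By default, UNLOAD writes output in parallel, one file part per slice:

UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2018-01-01''')
TO 's3://my-bucket/exports/sales_part_'   -- prefix for the parallel file parts
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
DELIMITER '|'
GZIP;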


19<br />

Pricing and Effort <strong>Comparison</strong><br />

<strong>Redshift</strong> leads <strong>Teradata</strong> in effort and in-house pricing. <strong>Redshift</strong> is cheaper<br />

and easier than <strong>Teradata</strong>. For <strong>Redshift</strong>, you only need to turn on the<br />

cluster, set up security settings, few other options (maintenance window<br />

period, snapshot enabling option, etc), and you are ready to go. This way<br />

DBAs efforts get reduced.<br />

However, in terms of storage, <strong>Teradata</strong> has upper hand because <strong>Redshift</strong><br />

cluster has limitations. However, in <strong>Redshift</strong>, we can still handle that<br />

through S3 as it does not have any space limitation.<br />

Remember, both <strong>Teradata</strong> and <strong>Redshift</strong> Data Warehouses are designed<br />

to solve different purposes.<br />

You can refer to ​<strong>Redshift</strong>​ and ​<strong>Teradata</strong>​ to know about pricing.


20<br />

When and How to Migrate data from <strong>Teradata</strong> to<br />

<strong>Redshift</strong><br />

There are various considerations that need to be made on whether to<br />

migrate from <strong>Teradata</strong> to AWS/cloud.<br />

1) How stable is your <strong>Teradata</strong> Warehouse?<br />

2) How much is your <strong>Teradata</strong> data volume?<br />

3) How complex is your <strong>Teradata</strong> data model?<br />

4) How much is your current <strong>Teradata</strong> data latency?<br />

5) How good is your <strong>Teradata</strong> RDBMS performance?<br />

6) How many BI tools are you using on your <strong>Teradata</strong><br />

tables/views/cubes?<br />

7) Are you using plenty of unsupported features of <strong>Redshift</strong> in<br />

<strong>Teradata</strong>?<br />

8) Will migrating your data warehouse from <strong>Teradata</strong> to <strong>Redshift</strong><br />

break your system?<br />

9) Your budget of maintaining the <strong>Redshift</strong> and other key AWS<br />

services post-migration.<br />

If all conditions are satisfied, you easily migrate your data from <strong>Teradata</strong><br />

to <strong>Redshift</strong>. AWS provides a useful service called Data Migration Service<br />

(DMS) and Schema Conversion Tool (SCT). Although, this pretty handy<br />

service is not fully automated as some minor manual efforts are required.<br />

Please refer to AWS documentation for migrating data from ​<strong>Teradata</strong> to<br />

<strong>Redshift</strong>​.


21<br />

Summary<br />

Choosing between <strong>Redshift</strong> and <strong>Teradata</strong> is a tough question to answer<br />

as both are solving different purposes. <strong>Redshift</strong> performs analytics and<br />

reporting extremely well. Since <strong>Redshift</strong> is a columnar base data<br />

warehouse, its performance is really good when it comes to hitting the<br />

table/view based columns and aggregate functions (sum, avg, count(*),<br />

etc). As <strong>Redshift</strong> is a part of AWS service, it is integrated with all vital<br />

AWS services. Hence you don't need to store millions of data in <strong>Redshift</strong><br />

alone as you can archive old data in S3. If required, you can leverage<br />

<strong>Redshift</strong> Spectrum to build your analytics and reports on top of it. Stored<br />

procedures can be handled through AWS Lambda Service. <strong>In</strong> terms of<br />

age, <strong>Redshift</strong> is a comparatively newer data warehouse. <strong>Redshift</strong> is still<br />

developing features which other key data warehouses offer.<br />

On the other hand, <strong>Teradata</strong> is pretty matured and old. <strong>Teradata</strong> as an<br />

RDBMS may not provide similar performance as <strong>Redshift</strong> unless it has a<br />

properly designed data model, fully leveraged features (FastLoad,<br />

Multiload, TPT, BTEQ, etc), and table/views are properly tuned. Although,<br />

some established customers might be reluctant to migrate from <strong>Teradata</strong><br />

to <strong>Redshift</strong>. They can also look for the hybrid model option.<br />

<strong>In</strong> conclusion, it is still an ongoing debate, both <strong>Redshift</strong> and <strong>Teradata</strong><br />

have its pros and cons.


22<br />

ETL Challenges While Working With Amazon <strong>Redshift</strong><br />

Data loading is one of the biggest challenges of <strong>Redshift</strong>. To perform ETL<br />

to <strong>Redshift</strong>, you would need to invest precious engineering resources to<br />

extract, clean, enrich, and build data pipelines. However, writing complex<br />

scripts to automate all of this is not easy. It gets harder if you want to<br />

stream your data real-time. Data loss becomes an everyday phenomenon<br />

due to issues that crop up with changing sources, unstructured & unclean<br />

data, incorrect data mapping at the warehouse, and more.<br />

Using a data integration platform like Hevo can solve all your <strong>Redshift</strong><br />

ETL problems. With Hevo you can move any data into <strong>Redshift</strong> in minutes<br />

in a hassle-free fashion. Hevo integrates with a variety of data sources<br />

ranging from SQL, NoSQL, SaaS, File Storage Base, Webhooks, etc. with<br />

the click of a button.<br />

Sign up for a ​free trial​ here or view ​a quick video​ on how Hevo can help.<br />

About Author:<br />

<strong>An</strong>kur Shrivastava is a AWS Solution Designer with hands-on experience<br />

on Data Warehousing, ETL, and Data <strong>An</strong>alytics. He is an AWS Certified<br />

Solution Architect Associate. <strong>In</strong> his free time, he enjoys all outdoor sports<br />

and practices.


Looking for a simple and reliable way to bring data from any source to AWS Redshift?

TRY HEVO

SIGN UP FOR FREE TRIAL
