Redshift Vs Teradata - An In-Depth Comparison
Table of Contents

Redshift Vs Teradata
Redshift Architecture & Its Features
Teradata Architecture & Its Features
Redshift Data Model
Teradata Data Model
Redshift Pros and Cons
Teradata Pros and Cons
Features supported only by Redshift, not Teradata
Features supported only by Teradata, not Redshift
Redshift Vs Teradata In A Nutshell
Pricing and Effort Comparison
When and How to Migrate Data from Teradata to Redshift
Summary
ETL Challenges While Working With Amazon Redshift
Redshift Vs Teradata

Redshift versus Teradata is one of the most debated data warehouse comparisons. In this ebook, we cover a detailed comparison between Redshift and Teradata.

Redshift Architecture & Its Features

Redshift is a fully managed, petabyte-scale data warehouse on the cloud. You can start with just a few gigabytes or terabytes of data and scale up to petabytes as your business requires. The Redshift engine is called a cluster, and it is built from one or more nodes. There are two types of nodes: Compute and Leader. A compute node contains two or more slices, depending on the node type. The leader node plays multiple roles, which include communicating with JDBC/ODBC clients and creating the query execution plan that it distributes to the compute node(s). A cluster is incomplete without a leader node.

You can check out our blog for a detailed article on Redshift Architecture.
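As a quick illustration, the slice layout of a running cluster can be inspected from the STV_SLICES system view (a sketch; the actual output depends on your node type and count):

```sql
-- List how slices are laid out across the compute nodes of the cluster.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```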
Teradata Architecture & Its Features

Teradata is an RDBMS meant for a data warehouse with an on-premise setup, so it requires installation. Although Teradata is not a managed cloud service, you can spin up a Teradata instance on a cloud VM. Teradata is designed on an MPP shared-nothing architecture.

Here is a diagrammatic representation of Teradata Architecture.

The four major components of Teradata are as follows:

1. Node: The primary component of Teradata is the Node, its basic building block. Each node has its own OS, CPU, RAM, disk space, etc.
2. Parsing Engine: The Parsing Engine (PE) is responsible for preparing the query execution plan.
3. BYNET: BYNET receives the query execution plan from the PE and transfers it to the AMPs (also known as virtual processors), and vice versa. It is also called the Message Passing layer.
4. Access Module Processor (AMP): The AMP is a key component of Teradata. AMPs manage the processing of data by storing it in vDisks; a row can be stored on any AMP depending on the hash algorithm. BYNET is responsible for communication between the AMPs. In multi-node systems, Teradata has at least two BYNETs to make the system fault tolerant: if the first BYNET fails, the second takes over.
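The hash-based row placement described above can be sketched in Python. This is a simplified illustration, not Teradata's actual hashing algorithm: the MD5-based hash function and the AMP count are assumptions for demonstration only.

```python
import hashlib

def amp_for_row(primary_index_value: str, num_amps: int) -> int:
    """Map a primary-index value to an AMP number.

    Teradata hashes the primary index value and consults a hash map to
    pick an AMP; here a generic MD5-based hash stands in for that.
    """
    digest = hashlib.md5(primary_index_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_amps

# Rows with the same primary index value always land on the same AMP,
# while distinct values spread across the available AMPs.
rows = ["cust_1001", "cust_1002", "cust_1003", "cust_1001"]
placement = [amp_for_row(r, num_amps=4) for r in rows]
print(placement)
```

The key property this demonstrates is determinism: because placement depends only on the hashed value, the system can locate a row's AMP without any central lookup.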
Redshift Data Model

The Redshift data model is designed for data warehousing purposes. The following unique features make Redshift a smart data warehouse choice.

1. Redshift is a fully managed data warehouse. You don't have to worry about setting up and installing the database; you just spin up your cluster and the database is ready.
2. Redshift's backup and restore are fully automatic. Through automatic snapshots, data in Redshift is backed up internally to S3 at regular intervals.
3. Data is secured through inbound security rules and SSL connections. Clusters in VPC mode are protected by the VPC, while classic-mode clusters rely on inbound security rules.
4. Redshift stores data in a columnar format, unlike row-oriented data warehouse storage. For example, if your query touches a specific column, Redshift reads only that column instead of entire rows, which saves an enormous amount of query processing time.
5. Data is stored in 1 MB blocks instead of the typical 8 KB or 64 KB blocks, which lets Redshift store more data in a single block.
6. Redshift does not have the concept of indexes. Instead, it has zone maps. A zone map records the lowest and highest value of a column in each block, so the cluster can easily identify which blocks need to be read.
7. Redshift has column compression (encoding). The ANALYZE COMPRESSION command reports what compression strategy to apply to a table. Redshift provides various encoding techniques; refer to the AWS documentation for more details on encoding.
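For example, compression recommendations can be requested for a table like this (the table name is hypothetical):

```sql
-- Ask Redshift to sample the table and suggest an encoding per column.
-- "orders" is a hypothetical table name.
ANALYZE COMPRESSION orders;
```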
8. Redshift caches the results of repeated queries for faster performance. To check whether your query used the cache, look at the source_query column in SVL_QLOG: if a query used the cache, that column stores the query ID of the original query whose result was reused.

Example:

SELECT userid, query, elapsed, source_query
FROM svl_qlog
WHERE userid IN (600, 601);

In the output below, query 853219 run by userid 601 reused the cached result of query 123456 (originally run by userid 600), and its elapsed time in microseconds dropped drastically.

 USERID | QUERY  | ELAPSED | SOURCE_QUERY
--------+--------+---------+--------------
    600 | 123456 |   90000 | NULL
    600 | 567890 |   80000 | NULL
    601 | 853219 |      30 | 123456

9. The Redshift data model is similar to a typical data warehouse when it comes to analytical queries. You can create fact tables, dimension tables, and views. It supports all major query constructs: inner joins, outer joins, subqueries, and common table expressions (the WITH clause).
10. From a storage perspective, a Redshift cluster maintains multiple copies of your data as part of fault tolerance.
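A common table expression over fact and dimension tables (point 9 above) looks like this; the table and column names are illustrative:

```sql
-- Revenue per store: a CTE over a fact table, joined to a dimension table.
WITH store_sales AS (
    SELECT store_id, SUM(amount) AS revenue
    FROM fact_sales              -- hypothetical fact table
    GROUP BY store_id
)
SELECT d.store_name, s.revenue
FROM store_sales s
JOIN dim_store d ON d.store_id = s.store_id   -- hypothetical dimension table
ORDER BY s.revenue DESC;
```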
Teradata Data Model

1. Teradata is a massively parallel data warehouse with a shared-nothing architecture. However, unlike Redshift, data is stored in a row-based format.
2. Teradata uses different kinds of indexes for fast data retrieval, including Primary, Secondary, Join, and Hash indexes. Note that a Secondary Index does not affect the distribution of rows across AMPs, though it does add extra processing overhead.
3. Teradata supports and enforces Primary and Secondary indexes.
4. Teradata has a hybrid storage concept where frequently used data is stored on SSD while less-accessed data is stored on HDD.
5. Teradata supports table partitioning, unlike Redshift.
6. Teradata uses a hash algorithm to distribute data across its disk storage units.
7. Teradata can scale up to 2,048 nodes, with storage capacity ranging from 10 TB to 94 PB, thus providing higher storage capacity than Redshift.
8. Teradata supports all the major SQL features that a data warehouse RDBMS needs: Primary Index, Secondary Index, sequences, stored procedures, user-defined functions, macros, etc.
9. Teradata's data model is designed to be fault tolerant, and to be scalable with redundant network connectivity that ensures continuous data connectivity and availability.
Redshift Pros and Cons

Pros

1. Loading and unloading of data is exceptionally fast, and data can be loaded in parallel. Redshift supports loading from zipped files, even for high data volumes, and recommends the COPY command for the fastest load performance.
2. You can load data from AWS DynamoDB, the NoSQL database service. Refer to the AWS documentation for more detailed information about DynamoDB.
3. You can choose the node type (Dense Storage or Dense Compute) of your cluster depending on your data needs and business requirements.
4. You can scale your cluster's storage and CPU for better performance at any time without impacting the cluster.
5. You can migrate your data from various data warehouses into Redshift without much hassle. AWS provides a service for this called Database Migration Service (DMS); refer to the AWS documentation for more details.
6. You do not have to worry about security, as you can build your cluster inside a VPC and also use SSL encryption for further protection.
7. Redshift's backup and restore features are simple. Through automatic snapshots, your data is backed up regularly. Snapshots are incremental, so you do not have to worry about any misses. You can also copy snapshots to another region if the business requires it; refer to the AWS documentation for more details on working with snapshots.
8. Redshift has an advanced feature called Redshift Spectrum, which lets you query huge amounts of data directly in S3, skipping the load step (COPY or any other method). You can refer to the detailed guide on Redshift Spectrum for more information.
9. Using sort keys, data can be pre-sorted on specific columns, which automatically improves query performance.
10. Using distribution keys, data can be distributed evenly across nodes to increase query performance.
11. Redshift provides various pre-built system tables and views to help developers and designers during ETL and other processes.
12. Setup commands can be run through various interfaces: the AWS console, the Command Line Interface (CLI), the API, etc.
13. Redshift applies patches and upgrades to the cluster automatically during a configurable maintenance window, so you do not have to worry about applying patches.
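Sort keys, distribution keys, and the COPY command (points 1, 9, and 10 above) come together like this; the table, S3 path, and IAM role below are hypothetical:

```sql
-- Distribute rows on store_id and pre-sort them on sale_date.
CREATE TABLE fact_sales (
    sale_id   BIGINT,
    store_id  INT,
    sale_date DATE,
    amount    DECIMAL(12,2)
)
DISTKEY (store_id)
SORTKEY (sale_date);

-- Parallel load from gzipped CSV files in S3; all slices participate.
COPY fact_sales
FROM 's3://my-bucket/sales/'                                 -- hypothetical prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'   -- hypothetical role
CSV GZIP;
```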
Cons

1. In Redshift, there is no concept of functions, triggers, or procedures.
2. There is no sequence column in Redshift. If you need to generate sequence numbers for a column, you must handle it in your ETL logic.
3. Unlike other common data warehouses, Redshift does not enforce primary keys or foreign keys, which can create data integrity issues.
4. Only S3, DynamoDB, and EMR support parallel loads into Redshift. To load data from other services, you need to write ETL scripts or use an ETL solution such as Hevo.
5. Redshift requires a good understanding of sort and distribution keys. There are basic ground rules for setting them, and setting them improperly hampers performance.
6. Distribution keys cannot be changed once a table is created, so you need to be extremely careful while designing your tables. Wrong distribution keys can hamper overall performance.
7. Redshift has no DBLink concept; you cannot directly connect to another database or data warehouse's tables in your queries.
8. In Redshift, VACUUM and ANALYZE are mandatory on key tables. They can hurt performance badly if run during business hours, so they need to be scheduled carefully.
9. A Redshift cluster has limits on the number of nodes, databases, tables, etc. The maximum storage is still lower than data warehouses like Teradata. Here is the node limitation list:

Node Type     vCPU   Storage per Node    Node Range
dc1.large     2      160 GB SSD          1-32
dc1.8xlarge   32     2.56 TB SSD         2-128
dc2.large     2      160 GB NVMe-SSD     1-32
dc2.8xlarge   32     2.56 TB NVMe-SSD    2-128
ds2.xlarge    4      2 TB HDD            1-32
ds2.8xlarge   36     16 TB HDD           2-128

You can refer to the AWS documentation to learn more about the limits in Amazon Redshift.
10. Although Redshift in classic mode is still in use, its cluster performance is relatively modest.
11. Redshift supports only a single-AZ environment; multi-AZ is not supported.
12. Redshift has a query concurrency limit of 15, and you can have a maximum of 8 queues in a cluster. Unmanaged queues hinder performance.
13. Your design should ensure the cluster is not in use during the maintenance window, or jobs running at that time will fail.
14. There is no concept of table partitioning in Redshift.
15. Redshift has no concept of SET and MULTISET tables (SET tables are tables that do not allow duplicate rows). Deduplication must be handled programmatically; handled inappropriately, it can lead to reporting errors.
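The missing sequence column (con 2) and the missing SET-table semantics (con 15) are commonly worked around with standard SQL; the table and column names below are illustrative:

```sql
-- Generate a surrogate sequence number per row with a window function.
SELECT ROW_NUMBER() OVER (ORDER BY sale_date, sale_id) AS seq_no,
       sale_id, store_id, amount
FROM fact_sales;                      -- hypothetical table

-- Emulate SET-table behaviour: keep one copy of each duplicate row.
SELECT DISTINCT sale_id, store_id, sale_date, amount
FROM staging_sales;                   -- hypothetical staging table
```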
Teradata Pros and Cons

Pros

1. Teradata is a massively parallel data warehouse with a shared-nothing architecture.
2. Teradata provides pre-built utilities, i.e., FastLoad, MultiLoad, TPT, BTEQ, etc.
3. Teradata is linearly scalable: if data volume rises, AMPs or nodes can be added.
4. Teradata has a fallback feature. If one AMP is down, another AMP takes over for data retrieval.
5. Teradata provides an impressive tool called Teradata Visual Explain, which shows the execution plan of queries graphically. This helps developers and designers fine-tune their queries.
6. Teradata provides the Ferret utility to set and display storage space utilization.
Cons

1. One of the biggest cons of Teradata is that it is not cloud-based out of the box. It requires initial setup, or you need to deploy it with a cloud provider such as AWS or Azure.
2. It is not a columnar data warehouse.
3. Since Teradata is not columnar, it reads entire rows even when you query a single column. You may end up with performance issues unless your data warehouse is properly designed.
4. A query over a set of different columns on a large dataset can lead to performance issues, unless the query runs on indexed columns.
5. Teradata supports a maximum of 128 joins in a single query. If you need more joins, you must break the query into chunks and handle it accordingly.
6. Redshift outperforms Teradata in analytics performance and in visualization of storage and CPU utilization: everything can be viewed in a single AWS console or through CloudWatch. Teradata, on the other hand, provides separate visual tools for some checks, while others require commands run from a Teradata client.
7. Teradata has no default column compression mechanism. Compression must be configured manually, and you can compress up to 256 unique values per column.
8. Teradata has many limits on the number of columns, table values, and table-name length. You can refer to the Teradata documentation for more detailed information.
Features supported only by Redshift, not Teradata

1. The most valuable feature of Redshift is that it is cloud-based and fully managed. (Teradata does offer Teradata Database Developer, a full-featured single-node data warehouse software.)
2. No need to worry about backup and restore; manual snapshots and restores are also available.
3. Backed-up data (snapshots) is automatically stored in S3. There is no need to store data on tape or any outside system.
4. Redshift has an excellent data loading feature in the COPY command, which runs in parallel mode so that all nodes and slices participate together for faster performance.
5. Redshift performs automatic column-level compression, and it can suggest compression mechanisms for all table columns (via the ANALYZE COMPRESSION command).
6. Thanks to the VPC feature in AWS, Redshift security is tight and well controlled.
Features supported only by Teradata, not Redshift

1. Teradata supports various features including procedures, triggers, etc.
2. Teradata has a column sequencing feature, while Redshift doesn't.
3. Teradata provides various load and unload utilities, i.e., TPT, FastLoad, FastExport, MultiLoad, TPump, and BTEQ. You can choose among them depending on data volume and business logic, and leverage them in your ETL logic.
4. Teradata has visual utilities that Redshift lacks, such as Teradata Visual Explain. In Redshift, you need to run a query with EXPLAIN to view the plan.
5. Teradata supports MULTISET and SET tables, while Redshift doesn't.
6. Teradata supports macros, but Redshift doesn't. Macros are a set of predefined SQL statements logically stored in the database. Macros also reduce LAN traffic.

Example:

CREATE MACRO Get_Sales AS (
    SELECT SalesId, StoreId, StoreName, StoreAddress
    FROM Stores
    ORDER BY StoreId;
);

EXEC Get_Sales;

This EXEC command retrieves all rows from the Stores table.
Redshift Vs Teradata In A Nutshell

Cloud perspective
- Redshift: Fully managed data warehouse over the cloud.
- Teradata: The core data warehouse is not over the cloud; initial setup is required by DBAs/experts. Teradata can be scaled to run over the cloud (AWS/Azure) with a pay-as-you-go model.

Backup and restore strategy
- Redshift: Backups are taken care of automatically through the snapshot feature. Snapshots are stored internally in S3, which is highly durable.
- Teradata: Backup and restore can be manual or automated (using BAR), but data is stored in an outside system.

Data load and unload
- Redshift: Data is loaded through the COPY command and unloaded through the UNLOAD command. With COPY, all nodes participate equally in the load for faster performance.
- Teradata: Separate utilities handle load and unload: TPT, FastExport, FastLoad, etc. They can be leveraged accordingly in your ETL/ELT.

Table storage
- Redshift: Columnar storage format. Queries that touch a specific column or set of columns perform impressively; aggregates are very fast because Redshift reads only the columns involved.
- Teradata: Row-level storage. Teradata requires proper indexing on columns so that data is stored properly in the AMPs. If indexes are not proper, or a table is queried on a non-indexed column, performance can suffer.

Internal storage
- Redshift: Data is stored in 1 MB blocks per column. Each block has a zone map that stores the minimum and maximum value of that column in the block.
- Teradata: Data storage is managed by AMPs in vDisks; data is distributed by the hash algorithm (based on the defined index, etc.) and retrieved accordingly.

Referential integrity model
- Redshift: Tables have primary keys and foreign keys, but they are not enforced. You need to apply your own logic to maintain referential integrity.
- Teradata: Tables have primary keys and foreign keys, and they are enforced. Hence there is additional overhead for reference checks during processing.

Sequence support
- Redshift: No concept of column sequencing; if you want a sequence on a column, you must handle it programmatically.
- Teradata: You can define a sequence on a column.

Triggers, stored procedures
- Redshift: No concept of triggers or stored procedures.
- Teradata: You can create triggers and stored procedures.

Visual features
- Redshift: Part of AWS, an integrated service. The entire cluster's performance can be monitored through the AWS console, CloudWatch, and automatic alerts.
- Teradata: Has a few visual tools, such as Teradata Visual Explain, but they are cluttered.

Max concurrency
- Redshift: Maximum of 15 concurrent queries; the default concurrency is 5.
- Teradata: Runs more than 15 concurrent queries.

Macros support
- Redshift: No concept of macros.
- Teradata: Supports macros.

NoSQL load feature
- Redshift: Cannot load NoSQL data from other vendors, but can load data from DynamoDB.
- Teradata: No such feature supported yet.

Maximum storage capacity
- Redshift: 2 PB (128 ds2.8xlarge nodes x 16 TB ≈ 2 PB).
- Teradata: Much more than 2 PB.

Column compression
- Redshift: When a table is created, default compression is applied to all columns automatically. The ANALYZE COMPRESSION command also helps with column compression.
- Teradata: You need to specify compression on individual columns; you can compress up to 128 unique values per column in a table.

Maximum columns per table
- Redshift: Maximum 1600 columns per table.
- Teradata: Maximum 258 columns per row.

Maximum joins
- Redshift: No limit as such.
- Teradata: 64 joins per query block.

Data warehouse maintenance/updates
- Redshift: Applies regular patches and performs automatic maintenance inside the maintenance window.
- Teradata: DBAs need to take care of these activities manually or through a tool.

Table indexes
- Redshift: No table index concept, but performance is unaffected thanks to zone maps and sort key features.
- Teradata: Provides various types of indexes, i.e., Primary Index, Secondary Index, etc.

Table partitioning
- Redshift: Redshift Spectrum supports it, but Redshift itself doesn't.
- Teradata: Tables can be partitioned.

Fault tolerance
- Redshift: Fault tolerant. If a node fails, Redshift automatically replaces it with a replacement node. Multi-AZ, however, is not supported.
- Teradata: Also fault tolerant. If an AMP fails, the fallback AMP takes over automatically.
Pricing and Effort Comparison

Redshift leads Teradata in both effort and pricing: it is cheaper and easier to run than Teradata. For Redshift, you only need to spin up the cluster, configure security settings and a few other options (maintenance window, snapshot schedule, etc.), and you are ready to go, which reduces DBA effort.

In terms of storage, however, Teradata has the upper hand, because a Redshift cluster has size limitations. In Redshift, this can still be handled through S3, which has no practical space limit.

Remember, the Teradata and Redshift data warehouses are designed to solve different purposes.

You can refer to the Redshift and Teradata pricing pages to learn about pricing.
When and How to Migrate Data from Teradata to Redshift

Several considerations determine whether to migrate from Teradata to AWS/cloud:

1. How stable is your Teradata warehouse?
2. How much data does your Teradata warehouse hold?
3. How complex is your Teradata data model?
4. What is your current Teradata data latency?
5. How good is your Teradata RDBMS performance?
6. How many BI tools are you using on your Teradata tables/views/cubes?
7. Are you using many Teradata features that Redshift does not support?
8. Will migrating your data warehouse from Teradata to Redshift break your system?
9. What is your budget for maintaining Redshift and other key AWS services post-migration?

If these conditions are satisfied, you can easily migrate your data from Teradata to Redshift. AWS provides two useful services for this: the Database Migration Service (DMS) and the Schema Conversion Tool (SCT). Although handy, these services are not fully automated; some minor manual effort is required. Please refer to the AWS documentation on migrating data from Teradata to Redshift.
Summary

Choosing between Redshift and Teradata is a tough question to answer, as the two solve different purposes. Redshift performs analytics and reporting extremely well. Since Redshift is a columnar data warehouse, its performance is very good for column-focused queries on tables and views and for aggregate functions (SUM, AVG, COUNT(*), etc.). As part of AWS, it is integrated with all the vital AWS services, so you don't need to keep all your data in Redshift alone: you can archive old data in S3 and, if required, leverage Redshift Spectrum to build analytics and reports on top of it. Stored procedures can be handled through the AWS Lambda service. Redshift is a comparatively young data warehouse and is still developing features that other key data warehouses offer.

On the other hand, Teradata is mature and well established. As an RDBMS, Teradata may not match Redshift's performance unless it has a properly designed data model, fully leveraged utilities (FastLoad, MultiLoad, TPT, BTEQ, etc.), and properly tuned tables and views. Some established customers may be reluctant to migrate from Teradata to Redshift; they can also consider a hybrid model.

In conclusion, the debate is ongoing: both Redshift and Teradata have their pros and cons.
ETL Challenges While Working With Amazon Redshift

Data loading is one of the biggest challenges with Redshift. To perform ETL into Redshift, you need to invest precious engineering resources to extract, clean, enrich, and build data pipelines. Writing complex scripts to automate all of this is not easy, and it gets harder if you want to stream your data in real time. Data loss becomes an everyday phenomenon due to issues with changing sources, unstructured and unclean data, incorrect data mapping at the warehouse, and more.

Using a data integration platform like Hevo can solve your Redshift ETL problems. With Hevo, you can move any data into Redshift in minutes, hassle-free. Hevo integrates with a variety of data sources, from SQL and NoSQL databases to SaaS applications, file storage, webhooks, and more, with the click of a button.

Sign up for a free trial here, or view a quick video on how Hevo can help.
About the Author:

Ankur Shrivastava is an AWS Solution Designer with hands-on experience in data warehousing, ETL, and data analytics. He is an AWS Certified Solutions Architect Associate. In his free time, he enjoys outdoor sports.
Looking for a simple and reliable way to bring data from any source to AWS Redshift?

TRY HEVO
SIGN UP FOR FREE TRIAL