
Using TPump in an Active Warehouse Environment


Dr. Vincent Hager
Senior Consultant CRM Solutions
Teradata Division – Austria
December 2002

Agenda

• Architecture
• Script Variables
• Data-related Variables
• Challenges & Deployment
• Scenarios
• TPump in a Continuous Environment
• Best Practices

Dec 2002
Teradata / NCR Confidential


TPump Architecture

• MPP Teradata Loading
  – SQL DML: Active Streams or Batch
• TPump
  – Many client platforms
  – Many client sources
  – Row-level locking
• Performance
  – Multi-statement, Multi-session
  – Can saturate: Wire, Client, and/or RDBMS

TPump Architecture

[Diagram: a source is read through an AXSMod / Read OPR into TPump, which ships 64K blocks over parallel sessions to the PEs and AMPs.]

Read any Source
• Disk, Tape
• OLE-DB, ODBC
• Pipes, MQ
• 3rd Party
• Etc.

Multiple Physical Routes
• FICON, ESCON, GbE, etc.

Utility Processing
• Until EOF:
• Read input stream
• Build exec and rows in buffer
• Asynch send on any available session
• Optional checkpoint

PE Processing
• Reuse cached plan
• Apply 'pack' in parallel
• Checkpoint



Motivations for Using TPump in an Active Warehouse

• TPump is suitable when some of the data needs to be updated closer to the time the event or the transaction took place
• Avoids table-level locks (row-hash locks only)
• Concurrent table SQL access during updates
• Flexibility in when and how it is executed
• Queries can access a table concurrently with TPump
• Several TPump jobs can run against the same table at the same time
• Accepts sources as varied as MVS, NT or UNIX concurrently

TPump Script Variables

The key TPump variables, in order of greatest potential impact on performance, are the following (BEGIN LOAD):

• PACK
• SESSIONS (number)
• SERIALIZE (ON or OFF)
• CHECKPOINT (mins)
• ROBUST (ON or OFF)
• NOMONITOR



TPump PACK Factor

• The pack factor is the number of statements that are packed together into a TPump buffer and sent to the database as one multi-statement request. Higher is usually better.

[Diagram: on the CLIENT, one buffer with Pack = 3 holds three inserts, each with its USING clause columns; it is sent to the DATABASE as one multi-statement request with a USING clause.]
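The packing step can be pictured with a short Python sketch. It is only an illustration under stated assumptions: `pack_rows` and the sample rows are hypothetical names, and TPump builds the real multi-statement request internally.

```python
from typing import Iterable, Iterator, List, TypeVar

Row = TypeVar("Row")


def pack_rows(rows: Iterable[Row], pack: int) -> Iterator[List[Row]]:
    """Group input rows into buffers of at most `pack` rows each.

    Each yielded list stands for one multi-statement request.
    (Hypothetical helper, not part of TPump.)
    """
    if pack < 1:
        raise ValueError("pack factor must be at least 1")
    buffer: List[Row] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == pack:
            yield buffer
            buffer = []
    if buffer:  # final, partially filled buffer
        yield buffer


requests = list(pack_rows(range(1, 11), pack=3))
print(requests)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```

Ten rows with Pack = 3 become four requests instead of ten, which is where the throughput gain of a higher pack factor comes from.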

TPump Sessions

[Diagram: with Pack = 3 and Sessions = 4, the CLIENT reads the input file into Buffers 1 to 4. Rows 1-3, Rows 4-6 and Rows 7-9 are shown in groups of three, and AMP 14, AMP 9 and AMP 22 are shown in the DATABASE.]

There is no correlation between TPump sessions and AMPs.

Both the pack factor and the number of sessions contribute to the level of parallelism inside Teradata.



TPump Sessions

• The number of connections to the database that will be logged on for this TPump job. Each session has its own buffer on the client. The greater the number of sessions, the more work is required from the client and the database.
• When setting the number of sessions, adjust the pack factor first, then:
  – Start with very few sessions until the application is running as expected.
  – Increase the number of sessions considering the pack factor, server and client power, and how many other TPump jobs are likely to be running concurrently.
  – Increase sessions gradually.
  – How many TPumps will run concurrently?

TPump SERIALIZE

• Tells TPump to partition the input records across the sessions it is using, ensuring that all input records that touch a given target table row (or that contain the same non-unique primary index value) are handled by the same session.

[Diagram, STANDARD: one buffer fills at a time. The CLIENT reads the input file into Buffers 1 to 4, shown as Full & Sent, Full & Sent, Being Filled, Waiting.]

[Diagram, SERIALIZE: buffers fill irregularly. The CLIENT reads the input file into Buffers 1 to 4, shown as Partial, Partial, Full & Sent, Partial.]

• With SERIALIZE, all buffers may be partially filled at any point in time; without SERIALIZE, only one buffer is partially filled at any point in time.
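A minimal Python sketch of the partitioning idea follows. It assumes a hypothetical record layout of (primary index value, payload) and uses CRC32 purely as a stand-in for whatever hashing TPump applies internally; the function names are illustrative, not TPump's.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (primary index value, payload): hypothetical layout


def session_for(pi_value: str, sessions: int) -> int:
    """Map a primary index value to a session number in a stable way.

    CRC32 is an assumption made for this sketch, not TPump's own hash.
    """
    return zlib.crc32(pi_value.encode("utf-8")) % sessions


def serialize(records: Iterable[Record], sessions: int) -> Dict[int, List[Record]]:
    """Route records so that equal primary index values share one session.

    Input order is preserved within each session.
    """
    per_session: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        per_session[session_for(record[0], sessions)].append(record)
    return dict(per_session)


records = [("7", "a"), ("3", "b"), ("7", "c"), ("6", "d"), ("7", "e")]
routed = serialize(records, sessions=4)
for session, recs in sorted(routed.items()):
    print(session, recs)
```

Because the routing depends on the data rather than on arrival order, the per-session buffers fill at different speeds, which is the "buffers fill irregularly" behaviour described above.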



Considerations for SERIALIZE

• Consider SERIALIZE ON in the following situations:
  – Multiple updates with the same primary index value are possible in the input file
  – The order of applying updates is important
  – With UPSERTs, if any single row can be touched more than once (recommended)
  – To reduce deadlock potential
• Impacts of serialization:
  – Some additional work is done by the client when partitioning the input
  – Buffer replenishment is spread out, making stale data a greater possibility
  – Performance may be impacted by the number of partial buffers sent to the database each time a checkpoint is taken

TPump SERIALIZE

[Diagram, SERIALIZE ON: Buffer 1 holds rows with NUPI = 3, 7, 7; Buffer 2 holds NUPI = 6, 2, 6; Buffer 3 holds NUPI = 4, 4, 1. In the DATABASE, NUPI = 7 is shown on AMP 14, NUPI = 6 on AMP 22 and NUPI = 4 on AMP 5.]

• SERIALIZE ON removes deadlock potential between buffers within the same TPump job when rows with non-unique primary index values are being processed.
• Manual partitioning is required to do the same between multiple TPump jobs.



TPump Deadlock Potential

• Manually partitioning input data across multiple TPump jobs forces the same NUPI values to go to the same TPump job. This eliminates inter-TPump deadlock potential when the same table is updated from different jobs.

[Diagram: TPump Job 1, Job 2 and Job 3 each hold rows with assorted NUPI values (3, 7, 8, 7, 2, 6, 9, 4, 7). NUPI = 7 appears in every job, and all of those rows target AMP 14 in the DATABASE.]

TPump CHECKPOINT

• Interval (in minutes) between occurrences of checkpointing
• During a checkpoint, all the buffers on the client are flushed to the database, and the mini-checkpoints (if ROBUST is ON) written since the last checkpoint are deleted from the log table.



TPump ROBUST ON

• ROBUST ON avoids re-applying rows that have already been processed in the event of a restart (data integrity). It causes a row to be written to the log table each time a buffer has successfully completed its updates. These mini-checkpoints are deleted from the log when a checkpoint is taken.
• ROBUST OFF is specified primarily to increase throughput by bypassing the extra work involved in writing the mini-checkpoints to the log.
• When to use ROBUST ON:
  – INSERTs into multi-set tables, as such tables will accept re-inserted rows
  – Updates are based on calculations or percentage increases
  – If pack factors are large, and applying and rejecting duplicates after a restart would be unduly time-consuming
  – If data is time-stamped at the time it is inserted into the database
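The effect of the mini-checkpoints on a restart can be sketched in Python. This is a simplified model under stated assumptions: the job restarts from its beginning (as if no full checkpoint had been taken), buffers are identified by their position, and `run_job` is a hypothetical function, not a TPump interface.

```python
from typing import List, Optional, Set


def run_job(buffers: List[List[int]], applied: List[int],
            log: Optional[Set[int]], fail_before: Optional[int] = None) -> bool:
    """Apply buffers in order; return True if the job finished.

    `log` plays the role of the mini-checkpoint rows: when it is a set
    (ROBUST ON), a buffer number is recorded after the buffer completes
    and recorded buffers are skipped on a later run. With `log=None`
    (ROBUST OFF) every buffer is applied again after a restart.
    `fail_before` simulates a crash just before that buffer number.
    """
    for number, rows in enumerate(buffers):
        if fail_before is not None and number == fail_before:
            return False
        if log is not None and number in log:
            continue  # already applied before the restart
        applied.extend(rows)
        if log is not None:
            log.add(number)
    return True


buffers = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

robust_on: List[int] = []
log: Set[int] = set()
run_job(buffers, robust_on, log, fail_before=2)  # crash after two buffers
run_job(buffers, robust_on, log)                 # restart
print(robust_on)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]

robust_off: List[int] = []
run_job(buffers, robust_off, None, fail_before=2)
run_job(buffers, robust_off, None)
print(robust_off)  # [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In the second run, rows 1 to 6 are applied twice. That is harmless for some workloads but wrong for the cases listed above, such as inserts into multi-set tables or updates that apply a percentage increase.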

TPump IGNORE DUPLICATE ROWS

• IGNORE DUPLICATE ROWS means that duplicate inserts, if they are attempted (as they would be on a restart with ROBUST OFF), will not generate a write to the error table. This makes a restart more efficient, should one occur.



TPump Data Variables

• Variables related to the database design and the state of the data itself:
  – How clean the data is
  – Any sequence in the input stream
  – Degree and type of indexes defined on the table being updated
  – Use of Referential Integrity, fallback or permanent journaling
  – Number and complexity of triggers
  – Number of columns being passed into the database per update
  – Ratio of UPDATEs to INSERTs when UPSERT is used

TPump Data Variables

• Multi-statement SQL
• Efficiency through the USING clause (cached execution plans increase throughput)
• USING clause limit of 512 columns
• UPSERT processing: watch the ratio of UPDATEs to INSERTs (the more INSERTs, the lower the throughput)



Challenges Using TPump

• Careful design of the insert/update/delete strategy is required to use TPump effectively
• Avoid table locks and full-table scans (FTS); only single- or dual-AMP operations are desired
• Longer runtime (+40%) when 1% of the data has errors
• Longer runtime with a NUSI (+50%), worse with 2 NUSIs (+100%)
• Longer runtime (+45%) with fallback
• SERIALIZE adds 30%, ROBUST adds 15%

TPump Deployment

• Feeding clean data from your transformation processes into TPump is important for overall performance.
• Pay attention to pack rate and session optimization.
• NUSIs and database configuration can have a significant impact on data acquisition throughput.
• Look for EAI/ETL tools with continuous data acquisition capability for feeding into TPump.



TPump Scenarios

(I) Large, high volume
(II) Small, highly time-critical bursts
(III) Extract from OLTP database
(IV) Multi-source batch
(V) Variable arrival rates, low volume
(VI) TPump in a Continuous Environment

TPump Scenarios (I)

• Large, high volume
  – A high volume of transaction data is inserted daily
  – Data must be loaded and available in less than 8 hours from arrival
  – Batches arrive separately from each outlet and are loaded while queries run
  – Data is clean and a powerful mainframe client initiates TPump
  – Use of a batch window is no longer feasible



TPump Settings (I)

• Pack factor 31
• Sessions: same as the number of AMPs
• SERIALIZE OFF
• CHECKPOINT 0
• ROBUST OFF

[Diagram: Buffer 1 is sent from the CLIENT to the DATABASE three times. First pass, 5 rows sent: Insert R1, Insert R2, Error R3, Rollback, back to client. Second pass, 4 rows sent: Insert R1, Insert R2, Error R4, Rollback, back to client. Third pass, 3 rows sent: Insert R1, Insert R2, Error R5, Rollback, back to client.]

Upon errors the whole buffer is rolled back and resent.
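The cost of this rollback-and-resend behaviour can be estimated with a small Python simulation. It is a sketch under simplifying assumptions: each pass stops at the first failing row, the failing row is dropped from the buffer before the resend, and `send_with_retries` is a hypothetical name.

```python
from typing import List, Set, Tuple


def send_with_retries(rows: List[str], bad: Set[str]) -> Tuple[List[str], int, int]:
    """Resend a buffer until it contains no failing row.

    Returns (rows finally committed, statements sent in total, rollbacks).
    Each pass stops at the first failing row, rolls the request back and
    resends the buffer without that row.
    """
    pending = list(rows)
    statements_sent = 0
    rollbacks = 0
    while True:
        statements_sent += len(pending)
        failing = next((row for row in pending if row in bad), None)
        if failing is None:
            return pending, statements_sent, rollbacks
        rollbacks += 1
        pending.remove(failing)


committed, sent, rollbacks = send_with_retries(
    ["R1", "R2", "R3", "R4", "R5"], bad={"R3", "R4", "R5"}
)
print(committed, sent, rollbacks)  # ['R1', 'R2'] 14 3
```

In this run, two good rows cost 14 statements and three rollbacks, which illustrates why a pack factor as high as 31 is paired with clean data in this scenario.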

Conclusion (I)

• Large, high volume

Using TPump in this scenario provides an opportunity to move the load of a given row closer to the actual time the event happened. The site may currently be loading one batch of inserts a day for each outlet, but it has the flexibility to move to two or three TPump runs per outlet per day in the future, if it chooses. Using TPump offers a controllable transition to updating that is closer to real time.



TPump Scenarios (II)

• Small, highly time-critical bursts
  – Sporadic, small files, such as reference material or information refreshes
  – Need close to real-time availability of the data, 1 minute or less
  – Data arrives from 10-15 different sources throughout a 24-hour period
  – Some sources provide clean data; other sources have occasional errors in the data

TPump Settings (II)

• Pack factor: 35 (clean data source), 3 (unreliable data source)
• Sessions: 1/4 the number of AMPs (clean data), same as the number of AMPs (unreliable data)
• SERIALIZE OFF
• CHECKPOINT 0
• ROBUST OFF
• NOMONITOR (when jobs are very short, to shorten job initialization time)
• Use persistent macros for DML (NAME command)



TPump Scenarios (III)

• Extract from OLTP database
  – Event or transaction data is extracted from the transaction database into queues or pipes
  – 20-minute turnaround from transaction to data warehouse
  – An application reads the queue coming from the OLTP database and transforms the data
  – Multiple short, demand-driven TPump jobs run concurrently

TPump Settings (III)

• Pack factor: 12
• Sessions: between 10% and 50% of the number of AMPs
• SERIALIZE: OFF for INSERT jobs, ON for UPSERTs (to avoid deadlock potential)
• CHECKPOINT 0
• ROBUST OFF

[Diagram: the OLTP database writes to a pipe or queue; the APPLICATION reads segments, transforms data, starts TPump jobs and validates EOJ; the TPump jobs load Table 1, Table 2 and Table 3 in Teradata.]



TPump Scenarios (IV)

• Multi-source batch
  – Concurrent TPumps during a multi-hour batch window at night
  – Next-day turnaround is acceptable
  – Different sources that reside in different locations update the same table
  – Highly indexed target tables

TPump Settings (IV)

• 5 TPump jobs running concurrently:
  – Pack factor 5 (makes the TPump jobs less resource-intensive)
  – Sessions: ½ the number of AMPs
  – SERIALIZE OFF
  – CHECKPOINT 15
  – ROBUST OFF (reapplying inserted rows is not an issue, and with a low pack rate, recovery issues are smaller)



TPump Scenarios (V)

• Variable arrival rates, low volume
  – Captures the movement of an object or a transportation industry carrier
  – Real time means within one hour of the change in position
  – Duplicate reporting is possible; the order of applying updates is important
  – Data contains errors approaching 5%

TPump Settings (V)

• Pack factor 5 (because of the propensity for error conditions, and because slow-arriving data needs to fill the buffers)
• Sessions: same as the number of AMPs (more buffers to be sent into Teradata)
• SERIALIZE ON (the order in which the updates are applied is important in this application, and there is a possibility that rows with the same primary index value will be inserted through different buffers at the same time)
• CHECKPOINT 20 (large value to counteract the cost of flushing partial buffers)
• ROBUST ON



TPump Scenarios (VI)

• TPump in a Continuous Environment
  – Load data through a queue in an ongoing fashion
  – Automatic and tightly coordinated rotation of different TPump instances, all fed by the same queue
  – Automatic hand-off of control between TPumps, including handling of the end-of-job processing
  – Opportunity for tighter management in these areas:
    • Error handling
    • Monitoring statistics
    • Sending alerts
    • Detecting inconsistencies

MQSeries Feed into TPump

[Diagram with three tiers: Source Feeds, Load Server and RDBMS. Source feeders put input data onto the message queue (MQPut), followed by an EOF. On the load server, the MQ Access Module reads the queue (MQGet) and fills the buffers of the TPump job(s), which send SQL to the base table in TERADATA. The scheduler sends Start and EOJ signals to TPump. A TP_Stat table with a trigger, a Hold Stat table, a NOTIFY exit, and Summary & Validation Reports are also shown.]

SOURCE Feeder:
• Read input
• Vary arrival rate
• Write timestamp
• Write to queue

Scheduler:
• Starts TP & BTEQ jobs
• Forces EOJ to queue
• Initiates post-job processing



Role of Source Feeder

• Provide application data feeds.
• Read transaction messages from a file.
• Add a timestamp to the message.
• Put the message into the queue.
• Arrival rate is adjustable.

Role of Scheduler

• Stand-in for a commercial scheduler.
• Start the TPump job.
• End the TPump job by placing an EOF message on the queue.
• Launch post-job processing:
  – BTEQ script to consolidate error rows inside Teradata.
• Monitor load process status/results.



Role of MQSeries AXSMod

• Connect to MQSeries:
  – Local or remote.
• Get messages from MQSeries.
• Add timestamp to the messages.
• Stream messages to TPump.
• Guaranteed reliability:
  – No lost inserts.
  – No duplicate inserts.
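The division of labour between source feeder, scheduler and access module can be sketched in Python. Everything here is a stand-in: the standard library `queue.Queue` replaces MQSeries, no MQ API is used, and the function names and the `EOF` marker are hypothetical.

```python
import queue
import time
from typing import Iterable, Iterator, List, Tuple

EOF = object()  # hypothetical end-of-file marker placed on the queue


def feed(source: Iterable[str], q: "queue.Queue") -> None:
    """Source feeder: timestamp each message and put it on the queue."""
    for message in source:
        q.put((time.time(), message))


def end_job(q: "queue.Queue") -> None:
    """Scheduler: end the load by placing the EOF marker on the queue."""
    q.put(EOF)


def read_until_eof(q: "queue.Queue") -> Iterator[Tuple[float, str]]:
    """Access module: stream messages to the loader until EOF arrives."""
    while True:
        item = q.get()
        if item is EOF:
            return
        yield item


q: "queue.Queue" = queue.Queue()
feed(["txn-1", "txn-2", "txn-3"], q)
end_job(q)
loaded: List[str] = [message for _, message in read_until_eof(q)]
print(loaded)  # ['txn-1', 'txn-2', 'txn-3']
```

The sketch does not model the reliability guarantees listed above (no lost or duplicate inserts); those depend on the queue manager and on TPump's restart logic.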

Rotating Multiple TPumps

[Timeline diagram: the time axis is marked 8:00, 8:01, 9:00, 9:01, 10:00 and 10:01, with bars for TPump1, TPump2, BTEQ1 and BTEQ2. The events are shown in this order:]

• TPump1 initializes, starts up and reads from the queue
• TPump2 initializes & waits
• TPump1 ends, sets off BTEQ script
• TPump2 starts up and reads from queue
• TPump1 initializes & waits
• TPump2 ends, sets off BTEQ script
• TPump1 starts up and reads from queue
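The rotation can be expressed as a small scheduling rule. The Python sketch below assumes one hand-off per hour starting at 8:00, which is how the timeline reads; the function name and the fixed hourly interval are assumptions for illustration.

```python
from typing import List


def active_instance(hour: int, start_hour: int, instances: List[str]) -> str:
    """Return the instance that reads the queue during the given hour.

    Assumes one hand-off per hour, starting with instances[0] at start_hour.
    """
    if hour < start_hour:
        raise ValueError("hour is before the first start")
    return instances[(hour - start_hour) % len(instances)]


pumps = ["TPump1", "TPump2"]
schedule = {hour: active_instance(hour, 8, pumps) for hour in range(8, 12)}
print(schedule)  # {8: 'TPump1', 9: 'TPump2', 10: 'TPump1', 11: 'TPump2'}
```

While one instance reads the queue, the other has time to run its BTEQ post-processing and to initialize again before its next turn.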



TPump Settings (VI)

• All TPump instances use the same parameters:
  – Pack factor 10
  – Number of sessions 20
  – CHECKPOINT 30
  – ROBUST ON
  – SERIALIZE OFF

Real-Time Information about a TPump Instance

[Diagram: a TPump job surrounded by five sources of information.]

BTEQ Output
• Audit trail

TPump Output
• Audit trail

Monitor Table
• Periodic progress check
• Get number of errors per minute
• Enables trend analysis

Notify Exit
• Event-driven feedback
• Can drive alert to DBA
• Job termination code

Journal Table
• Backup (reapply)
• Source for PK deletes
• Reprocessing errors



Error Handling

Supplementing error row identification:
• Import# and Record# are already in the error file row.
• At time of consolidation, add the unique TPump name.
• The unique MQ message ID can also be preserved on each row as it is loaded (or during error processing).

This application will consolidate errors and detail about the row, but will not:
• Reconstruct the error row.
• Re-process the original row, if it has been preserved.

TPump Status and Statistics

TPump optionally maintains real-time statistics in a table:
• UserName, Import Start Date, Time, RestartCount
• Last Table Update Date, Time
• RecordsRead, RecordsOut
• RecordsSkipped, RecordsRejcted, RecordsErrored

The row exists for the life of the job:
• Inserted at job start.
• Updated each minute.
• Deleted at job completion.
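Because the status row carries cumulative counters and is refreshed each minute, an error rate has to be derived from successive readings. A minimal Python sketch, assuming the readings have already been captured as (minute, cumulative RecordsErrored) pairs while the row still exists; the function name and sample data are hypothetical.

```python
from typing import List, Tuple

Snapshot = Tuple[int, int]  # (minute of the reading, cumulative RecordsErrored)


def errors_per_minute(snapshots: List[Snapshot]) -> List[Tuple[int, float]]:
    """Turn cumulative error counts into a per-minute error rate.

    Each result pairs the later reading's minute with the rate since the
    previous reading.
    """
    rates = []
    for (t0, e0), (t1, e1) in zip(snapshots, snapshots[1:]):
        if t1 <= t0:
            raise ValueError("snapshots must be in increasing time order")
        rates.append((t1, (e1 - e0) / (t1 - t0)))
    return rates


readings = [(0, 0), (1, 2), (2, 2), (4, 10)]
print(errors_per_minute(readings))  # [(1, 2.0), (2, 0.0), (4, 4.0)]
```

Since the row is deleted at job completion, the readings need to be copied elsewhere during the run if a trend is to be analysed afterwards.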



Monitoring Techniques

• Status table
  – Triggers placed on TPumpStatusTbl.
  – Triggers insert status information into a permanent stats table.
• NOTIFY EXIT
  – Initialize, Open.
  – Errors.
  – Checkpoint.
  – Final stats.
  – Prototype writes a delimited final stats record to a flat file.

TPump Best Practices

• High pack factors can increase TPump throughput. When high pack factors cannot be used, more sessions are another way to increase TPump throughput if the client can support them.
• To reduce data load latency and improve real-time availability of single rows, reduce the pack factor.
• If input data contains errors, a low pack factor reduces the overhead of rolling back the request that contained the error and re-processing all error-free rows.
• Speed up TPump startup by using persistent macros and by specifying TPump's recommended pack factor from a previous, similar run.



TPump Best Practices (cont'd)

• When selecting the number of sessions, consider the total system load at the time the TPump job is run. When multiple TPump jobs are running, consider a number of sessions equal to the number of AMPs in the system, or fewer.
• If multiple TPump jobs may update rows from the same table with the same primary index value, manually partition the data on the primary index of the table, so that all rows with the same PI value are directed to the same TPump job. Then also specify SERIALIZE ON to force the rows with the same NUPI value to a single session within that TPump job, further reducing possible contention.

TPump Best Practices (cont'd)

• Beware of locking conflicts:
  – Locking conflicts on the target table row-hash (insert, update): SERIALIZE ON
  – Locking conflicts in the dictionary at SQL submission: housekeeping of DBC.AccessRights
  – Locking at TPump startup: pre-build and reuse (persistent) macros for ET-DDLs



TPump Best Practices (cont'd)

• Assign the TPump user to a higher-priority performance group when the TPump job runs at the same time as decision support queries, if the TPump completion time is more critical than the other work active in the system.
• If the target table of inserts is part of a join index, direct TPump into a non-indexed staging table. An insert/select from there into the base table at regular intervals is likely to be a better-performing approach to updating a table when a join index is involved. Prior to the insert/select, a UNION can be used to make sure data recently inserted into the staging table is included in query answer sets.
• To ensure that TPump is able to perform single-AMP operations on each input record, include the entire primary index value for the row being updated among the columns passed to Teradata.

Thank you!
