
Luncheon Webinar Series
May 13, 2013

InfoSphere DataStage is Big Data Integration

Sponsored by:
Presented by: Tony Curcio, InfoSphere Product Management


InfoSphere DataStage is Big Data Integration

• Questions and suggestions regarding presentation topics? Send them to editor@dsxchange.net
• Downloading the presentation
  – Click Presentation YES on the poll question
  – A replay will be available within one day; an email with details will follow
• Bonus offer – free premium membership for your DataStage management! Submit your management's email address and we will offer him/her access on your behalf.
  – Email Info@dsxchange.net with the subject line "Managers special".
  – Join us all at LinkedIn: http://tinyurl.com/DSXmembers
  – DSXchange will sponsor a trial membership for new requests at the LinkedIn DSX members site


InfoSphere DataStage is Big Data Integration
Tony Curcio, InfoSphere Product Management


Bigger Data Integration Challenges

New types of data stores
• Big Data introduces additional data stores that need to be integrated – both Hadoop-based and noSQL-based
• These data stores don't easily lend themselves to conventional methods for data movement

New data types and formats
• Unstructured data; poly-structured data stores; JSON, Avro, and more to come
• Video, documents, web logs, …

Larger volumes
• Solutions need to move, transform, cleanse, and otherwise prepare huge data volumes
• Big Data requires data scalability


Benefits of InfoSphere DataStage

• Speeds Productivity – Graphical design is easier to use than hand coding
• Simplifies Heterogeneity – A common method for diverse data sources
• Shortens Project Cycles – Pre-built components reduce cost and timelines
• Promotes Object Reuse – Build once, share, and run anywhere (ETL/ELT/real-time)
• Reduces Operational Cost – Provides a robust framework to manage data integration
• Protects from Changes – Isolation from underlying technologies as they continue to evolve


Big Data is part of the Information Supply Chain

[Diagram: transactional & collaborative applications, business analytics applications, external information sources, and streaming information connect through capabilities to manage, integrate, master, analyze, and govern data, content, Big Data, cubes, streams, and data warehouses – all underpinned by information governance (quality, lifecycle, security & privacy, standards).]

Gartner Magic Quadrant: "IBM is the only DBMS vendor that can offer an information architecture across the entire organization, covering information on all systems."


4 Key Analytical Use Cases for Big Data

• Big Data Exploration – Find, visualize, and understand all big data to improve decision making
• Data Warehouse Augmentation – Integrate big data and data warehouse capabilities to increase operational efficiency
• Enhanced 360° View of the Customer – Extend existing customer views by incorporating additional information sources
• Operations Analysis – Analyze a variety of machine data for improved business results


Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency

Challenges
• Leveraging structured, unstructured, and streaming data sources for deep analysis
• Low latency requirements
• Query access to data
• Optimizing the warehouse for big data volumes
• Metadata management to support impact analysis and data lineage

Required capabilities
• Data Integration Hub Processing – high-speed, massively scalable reads from and writes to big data sources and new data
• Big Data Expert – automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms


Data Integration Hub Processing


"Connectivity Hub"
InfoSphere DataStage

Effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, using a high-speed, scalable solution to speed the delivery of analytics.


InfoSphere DataStage is Big Data Integration

[Diagram: a Source → Transform → Cleanse → Enrich → EDW data flow shown running sequentially on a uniprocessor, 4-way parallel on an SMP system with shared memory, and 64-way parallel on an MPP clustered system.]

• Dynamic – Instantly get better performance as hardware resources are added to any topology.
• Extendable – Add a new server to scale out through a simple text file edit (see the configuration sketch after this list), or, in a grid configuration, automatically via integration with grid management software.
• Data Partitioned – In true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O.
• Hadoop Integrated – Push all or parts of the process out to Hadoop to take advantage of its scalability in ELT fashion.
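The "simple text file edit" mentioned under Extendable refers to the parallel engine's configuration file. Below is a minimal sketch, assuming a two-server topology; host names and paths are invented for illustration, and adding the second node entry is what brings the new server into the degree of parallelism:

    {
      node "node1"
      {
        fastname "etl-server-1"
        pools ""
        resource disk "/data/ds/node1" {pools ""}
        resource scratchdisk "/scratch/ds/node1" {pools ""}
      }
      node "node2"
      {
        fastname "etl-server-2"
        pools ""
        resource disk "/data/ds/node2" {pools ""}
        resource scratchdisk "/scratch/ds/node2" {pools ""}
      }
    }

Existing jobs pick up the wider topology the next time they run against this configuration; no job redesign is needed.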


Big Data Source Types

• Hadoop Distributed File System – massively scalable and resilient storage
• noSQL (not-only SQL) – record storage optimized for read (or write)
• InfoSphere Streams – massive real-time analytics


Blazing Fast HDFS

• Available since v8.7 in 2011
• Extends the simple flat file paradigm – just add your Hadoop server name and port number (see the sketch after this list)
• Parallelization techniques pipe data in and out at massive scale
• A performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster)
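The Big Data File stage itself is configured graphically, so the following is not the connector's API – just a minimal Python sketch of what "server name and port number" access to HDFS looks like over Hadoop's WebHDFS REST interface, with an assumed host, port, and file path:

    # Minimal WebHDFS read sketch; host, port, and path are illustrative.
    import requests  # assumes the 'requests' package is installed

    NAMENODE = "hadoop-namenode.example.com"
    PORT = 50070                                  # common WebHDFS port
    PATH = "/user/demo/clickstream/part-00000"    # an HDFS file

    # op=OPEN streams the file content; the namenode redirects to a datanode.
    url = f"http://{NAMENODE}:{PORT}/webhdfs/v1{PATH}?op=OPEN"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    print(resp.text[:200])  # first 200 characters of the file

The connector hides this kind of plumbing behind the flat-file paradigm and adds the parallel piping described above.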


Simple data flow design for HDFS

[Screenshot of a job design: read from an HDFS file in parallel, transform/restructure the data, join two HDFS files, and create a new HDFS file, fully parallelized.]
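Purely as an illustration of the logic those four stages express (read two inputs, join them, restructure, write a new output), here is a serial hand-coded equivalent in Python; DataStage runs the same flow fully parallelized against HDFS, and the file names and columns below are made up:

    import csv

    # Read the two inputs (in the real job these are HDFS files read in parallel).
    with open("orders.csv", newline="") as f:
        orders = list(csv.DictReader(f))
    with open("customers.csv", newline="") as f:
        customers = {row["customer_id"]: row for row in csv.DictReader(f)}

    # Join on customer_id and restructure the columns.
    enriched = []
    for o in orders:
        c = customers.get(o["customer_id"])
        if c:
            enriched.append({"order_id": o["order_id"],
                             "customer_name": c["name"],
                             "amount": o["amount"]})

    # Write the new output (in the real job, a new HDFS file, fully parallelized).
    with open("enriched_orders.csv", "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["order_id", "customer_name", "amount"])
        w.writeheader()
        w.writerows(enriched)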


Agile Connector Accelerators for noSQL

• New connectors available as open code on developerWorks
• Plug into InfoSphere DataStage and operate just like any other stage
• Include features to exploit specific data sources


Sample Job with MongoDB and Hive

[Screenshot of a sample job: one stage selects which HDFS data to send downstream, the MongoDB stage accepts MongoDB-specific directives, and the data is written to both MongoDB and Hive.]
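The sample job does these writes through connector stages; as a hedged point of comparison, the hand-coded equivalents might look roughly like this in Python (hosts, database, table, and columns are invented, and the pymongo and PyHive packages are assumed to be installed):

    from pymongo import MongoClient   # MongoDB client
    from pyhive import hive           # HiveServer2 client

    # Rows arriving from the upstream HDFS read (invented for the example).
    rows = [{"user_id": 1, "clicks": 42}, {"user_id": 2, "clicks": 7}]

    # Write to MongoDB; the connector stage exposes similar directives (write concern, etc.).
    client = MongoClient("mongodb://mongo-host.example.com:27017")
    client["demo"]["click_summary"].insert_many(rows)

    # Write the same rows to Hive via HiveServer2.
    conn = hive.Connection(host="hive-host.example.com", port=10000, database="default")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS click_summary (user_id INT, clicks INT)")
    for r in rows:
        cur.execute("INSERT INTO click_summary VALUES (%d, %d)" % (r["user_id"], r["clicks"]))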


Parse and Compose JSON (beta)

• Parsing and composing of the JSON data format (see the sketch after this list)
• Uses the advanced transformation framework already provided for XML capabilities
• Beta available on InfoSphere DataStage 9.1 FP1
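As a small illustration of what parsing and composing JSON amounts to (the record layout below is invented), the hierarchical-to-columnar round trip looks like this in plain Python; the DataStage stages express the same mapping through the transformation framework already provided for XML:

    import json

    incoming = ('{"order_id": 1001, "customer": {"id": 7, "name": "Acme"},'
                ' "items": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]}')

    # Parse: hierarchical JSON -> flat, columnar rows (one row per item).
    doc = json.loads(incoming)
    rows = [{"order_id": doc["order_id"],
             "customer_id": doc["customer"]["id"],
             "sku": item["sku"],
             "qty": item["qty"]} for item in doc["items"]]

    # Compose: flat rows -> hierarchical JSON again.
    composed = json.dumps({
        "order_id": rows[0]["order_id"],
        "customer": {"id": rows[0]["customer_id"], "name": doc["customer"]["name"]},
        "items": [{"sku": r["sku"], "qty": r["qty"]} for r in rows],
    }, indent=2)
    print(composed)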


Big Data Expert


"Big Data Expert"
InfoSphere DataStage

Automatically push transformational processing close to where the data resides – both SQL for the DBMS and MapReduce for Hadoop – leveraging the same simple data flow design process, and coordinate workflow across all platforms.


Automated MapReduce Job Generation

• New in 9.1: leverage the same UI and the same stages to build MapReduce.
• Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming (a sense of the hand coding this replaces follows below).
• Push the processing to Hadoop for patterns where you don't want to transport the data over the network.
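For a sense of the hand coding the canvas replaces, a classic Hadoop Streaming job is a pair of scripts like the hedged Python sketch below (an event count keyed on the first column; the field layout and count logic are illustrative, not something DataStage emits verbatim):

    # mapper.py - emits "key <tab> 1" for every input record on stdin.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

    # reducer.py - sums the counts per key (input arrives sorted by key).
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

With 9.1 the equivalent logic is produced from the dragged stages, so none of this has to be written or maintained by hand.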


Automated MapReduce Job Generation

[Screenshot: integration jobs are built with the same data flow tool and stages; the engine automatically creates the MapReduce code.]


Automated MapReduce Job Generation

[Screenshot: a job that also includes a database on a separate system; the engine recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data.]


Architecture for Warehouse Landing Zone

Use case requirements for a data warehouse landing zone:
• Large Scale – large data volumes; scale-out requires an open MPP platform
• Low Cost – low-cost storage, compute, and commodity hardware
• Many Data Types – coverage of unstructured, semi-structured, and social data types
• Many Access Patterns – exploratory, iterative, and discovery oriented

[Architecture diagram: all sources (clickstream, sensors, transactions, content) flow through Information Server (ETL, replication, lineage, quality, masking with Optim, activity monitoring with Guardium) into a BigInsights/Hadoop landing zone (JAQL, Hive, HBase, custom MapReduce), and on to the analytics and operational warehouse zones.]


Combined Workflows for Big Data

• Oozie Integration
  – Same design paradigm for workflows as for job design.
  – Directly call an Oozie activity that invokes custom MapReduce code (see the sketch after this list).
• End-to-end Workflows
  – Sequence right alongside other data integration and analytics activities.
  – Allows users to have data sourcing, ETL, analytics, and delivery of information all controlled through a single process.
  – Monitor all stages through the Operations Console's web-based interface.
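As a hedged sketch of the plumbing behind "directly call an Oozie activity", a workflow application (which can wrap custom MapReduce) is submitted and started through Oozie's REST API roughly as follows; the host, HDFS path, and user below are invented:

    import requests

    OOZIE = "http://oozie-host.example.com:11000/oozie"   # illustrative Oozie server

    # Job properties pointing at the workflow application in HDFS.
    config = """<configuration>
      <property><name>user.name</name><value>dsadm</value></property>
      <property><name>oozie.wf.application.path</name>
        <value>hdfs://namenode:8020/apps/custom-mr-workflow</value></property>
    </configuration>"""

    resp = requests.post(OOZIE + "/v1/jobs?action=start",
                         data=config,
                         headers={"Content-Type": "application/xml"},
                         timeout=30)
    resp.raise_for_status()
    print(resp.json())   # returns the new workflow job id

The value of the integration is that this kind of call, and the monitoring of the resulting job, is driven from the same sequence as the rest of the workflow.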


Cross Tool Impact Analysis and Traceability

• Understand how traditional and big data sources are being used
• Assess the impact of change and mitigate risks
• Show the impact on downstream applications and BI reports
• Navigate through impacted areas and drill down


Wrap-up


The IBM Big Data Platform

New analytic applications drive the requirements for a big data platform:
• Integrate and manage the full variety, velocity, and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and governance

[Platform diagram: Hadoop system, stream computing, data warehouse, discovery, application development, accelerators, and systems management, all built on Information Integration & Governance and spanning data, media, content, machine, and social sources.]


Information Integration & Governance for Big Data

Integrate & Link Big Data
• Big Data as a Source
• Big Data as a Target
• Data Transformations
• Data Movement
• Integrate with the existing Enterprise
• Data Lineage & Impact Analysis
• Metadata Integration with Analytics
• Real-time & Data Federation

Cleanse and Validate Big Data
• Accuracy and Entity Matching with Social Data
• De-duplication and Standardization of Machine Data
• In-line Cleansing with Integration
• Trusted Data Dashboard and Reporting on Data Quality

Protect Big Data
• Activity Monitoring
• Data Masking
• Data Encryption
• On-Demand / In-Place Protection
• In-Line Protection (with ETL, etc.)
• Active Detection & Alerting

Audit & Archive Big Data
• Queryable Archive
• Structured and Semi-Structured
• Optimized Connectors to Existing Apps
• Hot-Restorable On-the-Fly
• Immutable and Secure Access
• Automated Legal Hold Capability for Data Freeze

Master Big Data
• Big Data as a Supplier
• Big Data as a Consumer
• Links between Big Data and Trusted Golden Records
• Leverage Master Data in Big Data Analytics
• Entity Resolution at Extreme Scale-Out Levels
• Probabilistic Entity Matching


Where to go to learn more…

• If you'd like to explore this topic further:
  – Contact your IBM account team or your preferred IBM Partner.
• If you'd like to explore more about InfoSphere DataStage and the Information Server platform:
  – http://www-01.ibm.com/software/data/integration/info_server/
• If you're looking for an enterprise-level Hadoop distribution:
  – InfoSphere BigInsights: http://www-01.ibm.com/software/data/infosphere/biginsights/


Thanks
