Luncheon Webinar Series
May 13, 2013
InfoSphere DataStage is Big Data Integration
Sponsored By:
Presented by: Tony Curcio, InfoSphere Product Management
InfoSphere DataStage is Big Data Integration
• Questions and suggestions regarding presentation topics? Send them to editor@dsxchange.net
• Downloading the presentation
– Click YES on the Presentation poll question
– A replay will be available within one day; an email with details will follow
• Bonus offer – free premium membership for your DataStage management! Submit your manager's email address and we will offer him/her access on your behalf.
– Email Info@dsxchange.net with the subject line "Managers special".
– Join us all at LinkedIn: http://tinyurl.com/DSXmembers
– DSXchange will sponsor a trial membership for new requests at the LinkedIn DSX members site
InfoSphere DataStage is Big Data Integration
Tony Curcio
InfoSphere Product Management
Bigger Data Integration Challenges
New types of data stores
• Big Data introduces additional data stores that need to be integrated – both Hadoop-based and noSQL-based
• These data stores don't easily lend themselves to conventional methods of data movement
New data types and formats
• Unstructured data; poly-structured data stores; JSON, Avro, and more still to come
• Video, documents, web logs, …
Larger volumes
• Solutions need to move, transform, cleanse, and otherwise prepare huge data volumes
• Big Data requires data scalability
Benefits of InfoSphere DataStage
Speeds Productivity
Graphical design is easier to use than hand coding
Simplifies Heterogeneity
A common method for diverse data sources
Shortens Project Cycles
Pre-built components reduce cost and timelines
Promotes Object Reuse
Build once, share, and run anywhere (ETL/ELT/real-time)
Reduces Operational Cost
Provides a robust framework to manage data integration
Protects from Changes
Isolation from underlying technologies as they continue to evolve
Big Data is part of the Information Supply Chain
[Diagram: transactional & collaborative applications, external information sources, and streaming information feed an integrate / manage / analyze flow spanning master data, content, Big Data, data warehouses, cubes, and streams, delivering to business analytics applications, all under information governance (quality, lifecycle, security & privacy, standards).]
Gartner Magic Quadrant: "IBM is the only DBMS vendor that can offer an information architecture across the entire organization, covering information on all systems."
4 Key Analytical Use Cases for Big Data
• Big Data Exploration – find, visualize, and understand all big data to improve decision making
• Data Warehouse Augmentation – integrate big data and data warehouse capabilities to increase operational efficiency
• Enhanced 360° View of the Customer – extend existing customer views by incorporating additional information sources
• Operations Analysis – analyze a variety of machine data for improved business results
Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency
Challenges
• Leveraging structured, unstructured, and streaming data sources for deep analysis
• Low latency requirements
• Query access to data
• Optimizing the warehouse for big data volumes
• Metadata management to support impact analysis and data lineage
Required capabilities
• Data Integration Hub Processing – high-speed, massively scalable reads from and writes to big data sources and new data types
• Big Data Expert – automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms
Data Integration Hub Processing
"Connectivity Hub"
InfoSphere DataStage
Effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, using a high-speed, scalable solution to speed the delivery of analytics.
InfoSphere DataStage is Big Data Integration
[Diagram: the same Source → Transform → Cleanse → Enrich → EDW data flow running sequentially on a uniprocessor, 4-way parallel on an SMP system, and 64-way parallel on an MPP clustered system.]
• Dynamic – instantly get better performance as hardware resources are added to any topology
• Extendable – add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software)
• Data Partitioned – in true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O
• Hadoop Integrated – push all or parts of the process out to Hadoop, in ELT fashion, to take advantage of its scalability
Big Data Source Types
• Hadoop Distributed File System (HDFS) – massively scalable and resilient storage
• noSQL (not-only SQL) – record storage optimized for read (or write)
• InfoSphere Streams – massive real-time analytics
Blazing Fast HDFS
• Available since v8.7 in 2011
• Extends the simple flat file paradigm – just add your Hadoop server name and port number
• Parallelization techniques pipe data in and out at massive scale
• A performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster)
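Outside of the DataStage connector itself, the same "just a host name and port" idea can be tried from plain Python with the open-source pyarrow HDFS client; the sketch below is illustrative only, and the host, port, and file paths are made-up placeholders.

```python
# Minimal sketch (not the DataStage HDFS stage): read one HDFS file and write
# another using pyarrow. Requires the Hadoop client libraries (libhdfs) on the
# machine running the script. Host, port, and paths are placeholders.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="hadoop-master.example.com", port=8020)

# Read an existing HDFS file into memory
with hdfs.open_input_stream("/landing/clickstream.csv") as src:
    raw = src.read()

# Write a new HDFS file
with hdfs.open_output_stream("/curated/clickstream_copy.csv") as dst:
    dst.write(raw)
```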
Simple data flow design for HDFS
[Job canvas: read from an HDFS file in parallel, transform/restructure the data, join two HDFS files, and create a new HDFS file, fully parallelized.]
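The job above is assembled graphically in DataStage; purely as a point of comparison, a rough PySpark equivalent of the same read-transform-join-write flow is sketched below. The file paths, column names, and join key are hypothetical.

```python
# Rough PySpark equivalent of the slide's flow: read two HDFS files in
# parallel, restructure one of them, join them, and write a new HDFS output.
# Paths and column names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs_flow_sketch").getOrCreate()

orders = spark.read.option("header", True).csv("hdfs:///data/orders.csv")
customers = spark.read.option("header", True).csv("hdfs:///data/customers.csv")

# Transform/restructure one input before the join
orders = orders.withColumn("order_total", F.col("order_total").cast("double"))

# Join the two HDFS files and create the new output, parallelized by partition
joined = orders.join(customers, on="customer_id", how="inner")
joined.write.mode("overwrite").parquet("hdfs:///data/orders_enriched")
```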
Agile Connector Accelerators for noSQL
• New connectors available as open code on developerWorks
• Plug into InfoSphere DataStage and operate just like any other stage
• Include features to exploit specific data sources
Sample Job with MongoDB and Hive
[Job canvas: a stage selects which HDFS data to send downstream, accepts MongoDB-specific directives, and writes data to MongoDB and to Hive.]
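For orientation only, the snippet below shows roughly what writing the same kinds of records to MongoDB and Hive looks like from plain Python using pymongo and PyHive; the hosts, database, collection, and table names are invented, and row-at-a-time Hive inserts are shown purely for illustration.

```python
# Illustrative only, not the DataStage connectors: write records to MongoDB
# with pymongo and to Hive with PyHive. All connection details are placeholders.
from pymongo import MongoClient
from pyhive import hive

records = [{"customer_id": 1, "status": "active"},
           {"customer_id": 2, "status": "lapsed"}]

# MongoDB target
mongo = MongoClient("mongodb://mongo.example.com:27017")
mongo["crm"]["customers"].insert_many(records)

# Hive target (row-at-a-time inserts only for illustration; bulk loads are
# the realistic path for big data volumes)
conn = hive.Connection(host="hive.example.com", port=10000)
cur = conn.cursor()
for r in records:
    cur.execute("INSERT INTO TABLE customers VALUES (%d, '%s')"
                % (r["customer_id"], r["status"]))
```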
Parse and Compose JSON (beta)
• Parsing and composing of the JSON data format
• Builds on the advanced transformation framework already provided for XML capabilities
• Beta available on InfoSphere DataStage 9.1 FP1
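As background, the parse/compose round trip itself is simple; the sketch below shows it with Python's standard json module, using a made-up payload. The DataStage feature adds the same capability inside its graphical transformation framework.

```python
# Minimal parse-and-compose round trip with Python's standard json module.
# The payload is a made-up example record.
import json

raw = '{"customer_id": 42, "events": [{"type": "click", "ts": "2013-05-13T10:00:00Z"}]}'

# Parse: JSON text -> native Python structures
doc = json.loads(raw)
doc["events"].append({"type": "purchase", "ts": "2013-05-13T10:05:00Z"})

# Compose: native Python structures -> JSON text
print(json.dumps(doc, indent=2))
```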
Big Data Expert
"Big Data Expert"
InfoSphere DataStage
Automatically push transformational processing close to where the data resides (SQL for DBMSs, MapReduce for Hadoop), leveraging the same simple data flow design process and coordinating workflow across all platforms.
Automated MapReduce Job Generation
• New in 9.1: leverage the same UI and the same stages to build MapReduce.
• Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.
• Push the processing to Hadoop for patterns where you don't want to transport the data over the network.
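For contrast with the drag-and-drop approach, the sketch below shows the kind of hand-written logic the generator spares you from: a minimal Hadoop Streaming word-count mapper and reducer in Python (a standard example, not code produced by DataStage).

```python
# mapper.py - minimal Hadoop Streaming mapper: emit (word, 1) per input word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)
```

```python
# reducer.py - minimal Hadoop Streaming reducer: sum counts per word, relying
# on the framework sorting mapper output by key before it reaches the reducer
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```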
Automated MapReduce Job Generation
[Screenshot: integration jobs are built with the same data flow tool and stages, and the MapReduce code is created automatically.]
Automated MapReduce Job Generation
[Screenshot: a job that also includes a database on a separate system; the engine recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data.]
Architecture for Warehouse Landing Zone
Use case requirements for a data warehouse landing zone:
• Large scale – large data volumes; scale-out requires an open MPP platform
• Low cost – low-cost storage, compute, and commodity hardware
• Many data types – coverage of unstructured, semi-structured, and social data types
• Many access patterns – exploratory, iterative, and discovery oriented
[Architecture diagram: all sources (clickstream, sensors, transactions, content) flow through replication and ETL into a BigInsights/Hadoop landing zone (JAQL, Hive, HBase, custom MapReduce) that feeds the analytics and operational warehouse zones; Information Server provides ETL, lineage, and quality, Optim provides masking, and Guardium provides data protection.]
Combined Workflows for Big Data
• Oozie Integration
– Same design paradigm for workflows as for job design.
– Directly call an Oozie activity that is invoking custom MapReduce code.
• End-to-end Workflows
– Sequence right alongside other data integration and analytics activities.
– Allows users to have the data sourcing, ETL, analytics, and delivery of information all controlled through a single process.
– Monitor all stages through the Operations Console's web-based interface.
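As a rough illustration of what calling Oozie directly involves (the step the DataStage sequencer wraps for you), the sketch below starts a workflow job through Oozie's REST API; the Oozie host/port, user, and HDFS application path are placeholders, and the workflow is assumed to already exist.

```python
# Illustrative only: start an Oozie workflow (which could wrap custom
# MapReduce code) through Oozie's REST API. Host, port, user, and the HDFS
# application path are placeholders.
import requests

config = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>user.name</name>
    <value>etl_user</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs:///apps/workflows/custom-mapreduce</value>
  </property>
</configuration>
"""

resp = requests.post(
    "http://oozie.example.com:11000/oozie/v1/jobs?action=start",
    data=config,
    headers={"Content-Type": "application/xml"},
)
print(resp.json())  # typically returns the new workflow job id
```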
Cross-Tool Impact Analysis and Traceability
• Understand how traditional and big data sources are being used
• Assess the impact of change and mitigate risks
• Show impact on downstream applications and BI reports
• Navigate through impacted areas and drill down
Wrap-up
The IBM Big Data Platform
New analytic applications drive the requirements for a big data platform:
• Integrate and manage the full variety, velocity, and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and governance
[Platform diagram: the big data platform combines a Hadoop system, stream computing, and data warehouse engines with application development, systems management, accelerators, and discovery, all built on information integration & governance across data, media, content, machine, and social sources.]
Information Integration & Governance for Big Data
Integrate & Link Big Data
• Big Data as a Source
• Big Data as a Target
• Data Transformations
• Data Movement
• Integrate with the existing Enterprise
• Data Lineage & Impact Analysis
• Metadata Integration with Analytics
• Real-time & Data Federation
Cleanse and Validate Big Data
• Accuracy and Entity Matching with Social Data
• De-duplication and Standardization of Machine Data
• In-line Cleansing with Integration
• Trusted Data Dashboard and Reporting on Data Quality
Protect Big Data
• Activity Monitoring
• Data Masking
• Data Encryption
• On-Demand / In-Place Protection
• In-Line Protection (with ETL etc.)
• Active Detection & Alerting
Audit & Archive Big Data
• Queryable Archive
• Structured and Semi-Structured
• Optimized Connectors to Existing Apps
• Hot-Restorable On-the-Fly
• Immutable and Secure Access
• Automated Legal Hold Capability for Data Freeze
Master Big Data
• Big Data as a Supplier
• Big Data as a Consumer
• Links between Big Data and Trusted Golden Records
• Leverage Master Data in Big Data Analytics
• Entity Resolution at Extreme Scale-Out Levels
• Probabilistic Entity Matching
Where to go to learn more…
• If you'd like to explore this topic further…
– Contact your IBM account team or your preferred IBM Partner.
• If you'd like to explore more about InfoSphere DataStage and the Information Server platform…
– http://www-01.ibm.com/software/data/integration/info_server/
• If you're looking for an enterprise-level Hadoop distribution…
– InfoSphere BigInsights: http://www-01.ibm.com/software/data/infosphere/biginsights/
Thanks