
OFFICIAL MICROSOFT LEARNING PRODUCT
10777A
Implementing a Data Warehouse with Microsoft® SQL Server® 2012


Information in this document, including URL and other Internet Web site references, is subject to change without notice. Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place or event is intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

The names of manufacturers, products, or URLs are provided for informational purposes only and Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a manufacturer or product does not imply endorsement of Microsoft of the manufacturer or product. Links may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not responsible for the contents of any linked site or any link contained in a linked site, or any changes or updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission received from any linked site. Microsoft is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement of Microsoft of the site or the products contained therein.

© 2012 Microsoft Corporation. All rights reserved.

Microsoft and the trademarks listed at http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Trademarks/EN-US.aspx are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.

Product Number: 10777A
Part Number: X18-28026
Released: 05/2012


MICROSOFT LICENSE TERMSOFFICIAL MICROSOFT LEARNING PRODUCTSMICROSOFT OFFICIAL COURSE Pre-Release and Final Release VersionsThese license terms are an agreement between Microsoft Corporation and you. Please read them. They apply tothe Licensed Content named above, which includes the media on which you received it, if any. These licenseterms also apply to any updates, supplements, internet based services and support services for the LicensedContent, unless other terms accompany those items. If so, those terms apply.BY DOWNLOADING OR USING THE LICENSED CONTENT, YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPTTHEM, DO NOT DOWNLOAD OR USE THE LICENSED CONTENT.If you comply with these license terms, you have the rights below.1. DEFINITIONS.a. “Authorized Learning Center” means a Microsoft Learning Competency Member, Microsoft IT AcademyProgram Member, or such other entity as Microsoft may designate from time to time.b. “Authorized Training Session” means the Microsoft-authorized instructor-led training class using onlyMOC Courses that are conducted by a MCT at or through an Authorized Learning Center.c. “Classroom Device” means one (1) dedicated, secure computer that you own or control that meets orexceeds the hardware level specified for the particular MOC Course located at your training facilities orprimary business location.d. “End User” means an individual who is (i) duly enrolled for an Authorized Training Session or PrivateTraining Session, (ii) an employee of a MPN Member, or (iii) a Microsoft full-time employee.e. “Licensed Content” means the MOC Course and any other content accompanying this agreement.Licensed Content may include (i) Trainer Content, (ii) sample code, and (iii) associated media.MCT USE ONLY. STUDENT USE PROHIBITEDf. “Microsoft Certified Trainer” or “MCT” means an individual who is (i) engaged to teach a training sessionto End Users on behalf of an Authorized Learning Center or MPN Member, (ii) currently certified as aMicrosoft Certified Trainer under the Microsoft Certification Program, and (iii) holds a MicrosoftCertification in the technology that is the subject of the training session.g. “Microsoft IT Academy Member” means a current, active member of the Microsoft IT AcademyProgram.h. “Microsoft Learning Competency Member” means a Microsoft Partner Network Program Member ingood standing that currently holds the Learning Competency status.i. “Microsoft Official Course” or “MOC Course” means the Official Microsoft Learning Product instructorledcourseware that educates IT professionals or developers on Microsoft technologies.


j. “Microsoft Partner Network Member” or “MPN Member” means a silver or gold-level Microsoft PartnerNetwork program member in good standing.k. “Personal Device” means one (1) device, workstation or other digital electronic device that youpersonally own or control that meets or exceeds the hardware level specified for the particular MOCCourse.l. “Private Training Session” means the instructor-led training classes provided by MPN Members forcorporate customers to teach a predefined learning objective. These classes are not advertised orpromoted to the general public and class attendance is restricted to individuals employed by orcontracted by the corporate customer.m. “Trainer Content” means the trainer version of the MOC Course and additional content designatedsolely for trainers to use to teach a training session using a MOC Course. Trainer Content may includeMicrosoft PowerPoint presentations, instructor notes, lab setup guide, demonstration guides, betafeedback form and trainer preparation guide for the MOC Course. To clarify, Trainer Content does notinclude virtual hard disks or virtual machines.2. INSTALLATION AND USE RIGHTS. The Licensed Content is licensed not sold. The Licensed Content islicensed on a one copy per user basis, such that you must acquire a license for each individual thataccesses or uses the Licensed Content.2.1 Below are four separate sets of installation and use rights. Only one set of rights apply to you.MCT USE ONLY. STUDENT USE PROHIBITEDa. If you are a Authorized Learning Center:i. If the Licensed Content is in digital format for each license you acquire you may either:1. install one (1) copy of the Licensed Content in the form provided to you on a dedicated, secureserver located on your premises where the Authorized Training Session is held for access anduse by one (1) End User attending the Authorized Training Session, or by one (1) MCT teachingthe Authorized Training Session, or2. install one (1) copy of the Licensed Content in the form provided to you on one (1) ClassroomDevice for access and use by one (1) End User attending the Authorized Training Session, or byone (1) MCT teaching the Authorized Training Session.ii. You agree that:1. you will acquire a license for each End User and MCT that accesses the Licensed Content,2. each End User and MCT will be presented with a copy of this agreement and each individualwill agree that their use of the Licensed Content will be subject to these license terms prior totheir accessing the Licensed Content. Each individual will be required to denote theiracceptance of the EULA in a manner that is enforceable under local law prior to their accessingthe Licensed Content,3. for all Authorized Training Sessions, you will only use qualified MCTs who hold the applicablecompetency to teach the particular MOC Course that is the subject of the training session,4. you will not alter or remove any copyright or other protective notices contained in theLicensed Content,


MCT USE ONLY. STUDENT USE PROHIBITED5. you will remove and irretrievably delete all Licensed Content from all Classroom Devices andservers at the end of the Authorized Training Session,6. you will only provide access to the Licensed Content to End Users and MCTs,7. you will only provide access to the Trainer Content to MCTs, and8. any Licensed Content installed for use during a training session will be done in accordancewith the applicable classroom set-up guide.b. If you are a MPN Member.i. If the Licensed Content is in digital format for each license you acquire you may either:1. install one (1) copy of the Licensed Content in the form provided to you on (A) one (1)Classroom Device, or (B) one (1) dedicated, secure server located at your premises wherethe training session is held for use by one (1) of your employees attending a training sessionprovided by you, or by one (1) MCT that is teaching the training session, or2. install one (1) copy of the Licensed Content in the form provided to you on one (1)Classroom Device for use by one (1) End User attending a Private Training Session, or one (1)MCT that is teaching the Private Training Session.ii. You agree that:1. you will acquire a license for each End User and MCT that accesses the Licensed Content,2. each End User and MCT will be presented with a copy of this agreement and each individualwill agree that their use of the Licensed Content will be subject to these license terms priorto their accessing the Licensed Content. Each individual will be required to denote theiracceptance of the EULA in a manner that is enforceable under local law prior to theiraccessing the Licensed Content,3. for all training sessions, you will only use qualified MCTs who hold the applicablecompetency to teach the particular MOC Course that is the subject of the training session,4. you will not alter or remove any copyright or other protective notices contained in theLicensed Content,5. you will remove and irretrievably delete all Licensed Content from all Classroom Devices andservers at the end of each training session,6. you will only provide access to the Licensed Content to End Users and MCTs,7. you will only provide access to the Trainer Content to MCTs, and8. any Licensed Content installed for use during a training session will be done in accordancewith the applicable classroom set-up guide.c. If you are an End User:You may use the Licensed Content solely for your personal training use. If the Licensed Content is indigital format, for each license you acquire you may (i) install one (1) copy of the Licensed Content inthe form provided to you on one (1) Personal Device and install another copy on another PersonalDevice as a backup copy, which may be used only to reinstall the Licensed Content; or (ii) print one (1)copy of the Licensed Content. You may not install or use a copy of the Licensed Content on a deviceyou do not own or control.


MCT USE ONLY. STUDENT USE PROHIBITEDd. If you are a MCT.i. For each license you acquire, you may use the Licensed Content solely to prepare and deliver anAuthorized Training Session or Private Training Session. For each license you acquire, you mayinstall and use one (1) copy of the Licensed Content in the form provided to you on one (1) PersonalDevice and install one (1) additional copy on another Personal Device as a backup copy, which maybe used only to reinstall the Licensed Content. You may not install or use a copy of the LicensedContent on a device you do not own or control.ii.Use of Instructional Components in Trainer Content. You may customize, in accordance with themost recent version of the MCT Agreement, those portions of the Trainer Content that are logicallyassociated with instruction of a training session. If you elect to exercise the foregoing rights, youagree: (a) that any of these customizations will only be used for providing a training session, (b) anycustomizations will comply with the terms and conditions for Modified Training Sessions andSupplemental Materials in the most recent version of the MCT agreement and with this agreement.For clarity, any use of “customize” refers only to changing the order of slides and content, and/ornot using all the slides or content, it does not mean changing or modifying any slide or content.2.2 Separation of Components. The Licensed Content components are licensed as a single unit and youmay not separate the components and install them on different devices.2.3 Reproduction/Redistribution Licensed Content. Except as expressly provided in the applicableinstallation and use rights above, you may not reproduce or distribute the Licensed Content or any portionthereof (including any permitted modifications) to any third parties without the express written permissionof Microsoft.2.4 Third Party Programs. The Licensed Content may contain third party programs or services. Theselicense terms will apply to your use of those third party programs or services, unless other terms accompanythose programs and services.2.5 Additional Terms. Some Licensed Content may contain components with additional terms,conditions, and licenses regarding its use. Any non-conflicting terms in those conditions and licenses alsoapply to that respective component and supplements the terms described in this Agreement.3. PRE-RELEASE VERSIONS. If the Licensed Content is a pre-release (“beta”) version, in addition to the otherprovisions in this agreement, then these terms also apply:a. Pre-Release Licensed Content. This Licensed Content is a pre-release version. It may not contain thesame information and/or work the way a final version of the Licensed Content will. We may change itfor the final version. We also may not release a final version. Microsoft is under no obligation toprovide you with any further content, including the final release version of the Licensed Content.b. Feedback. If you agree to give feedback about the Licensed Content to Microsoft, either directly orthrough its third party designee, you give to Microsoft without charge, the right to use, share andcommercialize your feedback in any way and for any purpose. You also give to third parties, withoutcharge, any patent rights needed for their products, technologies and services to use or interface withany specific parts of a Microsoft software, Microsoft product, or service that includes the feedback. 
Youwill not give feedback that is subject to a license that requires Microsoft to license its software,technologies, or products to third parties because we include your feedback in them. These rights


survive this agreement.c. Term. If you are an Authorized Training Center, MCT or MPN, you agree to cease using all copies of thebeta version of the Licensed Content upon (i) the date which Microsoft informs you is the end date forusing the beta version, or (ii) sixty (60) days after the commercial release of the Licensed Content,whichever is earliest (“beta term”). Upon expiration or termination of the beta term, you willirretrievably delete and destroy all copies of same in the possession or under your control.4. INTERNET-BASED SERVICES. Classroom Devices located at Authorized Learning Center’s physical locationmay contain virtual machines and virtual hard disks for use while attending an Authorized TrainingSession. You may only use the software on the virtual machines and virtual hard disks on a ClassroomDevice solely to perform the virtual lab activities included in the MOC Course while attending theAuthorized Training Session. Microsoft may provide Internet-based services with the software includedwith the virtual machines and virtual hard disks. It may change or cancel them at any time. If thesoftware is pre-release versions of software, some of its Internet-based services may be turned on bydefault. The default setting in these versions of the software do not necessarily reflect how the featureswill be configured in the commercially released versions. If Internet-based services are included with thesoftware, they are typically simulated for demonstration purposes in the software and no transmissionover the Internet takes place. However, should the software be configured to transmit over the Internet,the following terms apply:a. Consent for Internet-Based Services. The software features described below connect to Microsoft orservice provider computer systems over the Internet. In some cases, you will not receive a separatenotice when they connect. You may switch off these features or not use them. By using these features,you consent to the transmission of this information. Microsoft does not use the information to identifyor contact you.b. Computer Information. The following features use Internet protocols, which send to the appropriatesystems computer information, such as your Internet protocol address, the type of operating system,browser and name and version of the software you are using, and the language code of the devicewhere you installed the software. Microsoft uses this information to make the Internet-based servicesavailable to you.• Accelerators. When you use click on or move your mouse over an Accelerator, the title and full webaddress or URL of the current webpage, as well as standard computer information, and any contentyou have selected, might be sent to the service provider. If you use an Accelerator provided byMicrosoft, the information sent is subject to the Microsoft Online Privacy Statement, which isavailable at go.microsoft.com/fwlink/?linkid=31493. If you use an Accelerator provided by a thirdparty, use of the information sent will be subject to the third party’s privacy practices.MCT USE ONLY. STUDENT USE PROHIBITED• Automatic Updates. This software contains an Automatic Update feature that is on by default. Formore information about this feature, including instructions for turning it off, seego.microsoft.com/fwlink/?LinkId=178857. You may turn off this feature while the software isrunning (“opt out”). 
Unless you expressly opt out of this feature, this feature will (a) connect toMicrosoft or service provider computer systems over the Internet, (b) use Internet protocols to sendto the appropriate systems standard computer information, such as your computer’s Internetprotocol address, the type of operating system, browser and name and version of the software youare using, and the language code of the device where you installed the software, and (c)automatically download and install, or prompt you to download and/or install, current Updates tothe software. In some cases, you will not receive a separate notice before this feature takes effect.


By installing the software, you consent to the transmission of standard computer information andthe automatic downloading and installation of updates.• Auto Root Update. The Auto Root Update feature updates the list of trusted certificate authorities.you can switch off the Auto Root Update feature.MCT USE ONLY. STUDENT USE PROHIBITED• Customer Experience Improvement Program (CEIP), Error and Usage Reporting; Error Reports. Thissoftware uses CEIP and Error and Usage Reporting components enabled by default thatautomatically send to Microsoft information about your hardware and how you use this software.This software also automatically sends error reports to Microsoft that describe which softwarecomponents had errors and may also include memory dumps. You may choose not to use thesesoftware components. For more information please go to.• Digital Certificates. The software uses digital certificates. These digital certificates confirm theidentity of Internet users sending X.509 standard encrypted information. They also can be used todigitally sign files and macros, to verify the integrity and origin of the file contents. The softwareretrieves certificates and updates certificate revocation lists. These security features operate onlywhen you use the Internet.• Extension Manager. The Extension Manager can retrieve other software through the internet fromthe Visual Studio Gallery website. To provide this other software, the Extension Manager sends toMicrosoft the name and version of the software you are using and language code of the devicewhere you installed the software. This other software is provided by third parties to Visual StudioGallery. It is licensed to users under terms provided by the third parties, not from Microsoft. Readthe Visual Studio Gallery terms of use for more information.• IPv6 Network Address Translation (NAT) Traversal service (Teredo). This feature helps existinghome Internet gateway devices transition to IPv6. IPv6 is a next generation Internet protocol. Ithelps enable end-to-end connectivity often needed by peer-to-peer applications. To do so, eachtime you start up the software the Teredo client service will attempt to locate a public TeredoInternet service. It does so by sending a query over the Internet. This query only transfers standardDomain Name Service information to determine if your computer is connected to the Internet andcan locate a public Teredo service. If you· use an application that needs IPv6 connectivity or· configure your firewall to always enable IPv6 connectivityby default standard Internet Protocol information will be sent to the Teredo service at Microsoft atregular intervals. No other information is sent to Microsoft. You can change this default to use non-Microsoft servers. You can also switch off this feature using a command line utility named “netsh”.• Malicious Software Removal. During setup, if you select “Get important updates for installation”,the software may check and remove certain malware from your device. “Malware” is malicioussoftware. If the software runs, it will remove the Malware listed and updated atwww.support.microsoft.com/?kbid=890830. During a Malware check, a report will be sent toMicrosoft with specific information about Malware detected, errors, and other information aboutyour device. This information is used to improve the software and other Microsoft products andservices. No information included in these reports will be used to identify or contact you. 
You maydisable the software’s reporting functionality by following the instructions found at


www.support.microsoft.com/?kbid=890830. For more information, read the Windows MaliciousSoftware Removal Tool privacy statement at go.microsoft.com/fwlink/?LinkId=113995.• Microsoft Digital Rights Management. If you use the software to access content that has beenprotected with Microsoft Digital Rights Management (DRM), then, in order to let you play thecontent, the software may automatically request media usage rights from a rights server on theInternet and download and install available DRM updates. For more information, seego.microsoft.com/fwlink/?LinkId=178857.• Microsoft Telemetry Reporting Participation. If you choose to participate in Microsoft TelemetryReporting through a “basic” or “advanced” membership, information regarding filtered URLs,malware and other attacks on your network is sent to Microsoft. This information helps Microsoftimprove the ability of Forefront Threat Management Gateway to identify attack patterns andmitigate threats. In some cases, personal information may be inadvertently sent, but Microsoft willnot use the information to identify or contact you. You can switch off Telemetry Reporting. Formore information on this feature, see http://go.microsoft.com/fwlink/?LinkId=130980.MCT USE ONLY. STUDENT USE PROHIBITED• Microsoft Update Feature. To help keep the software up-to-date, from time to time, the softwareconnects to Microsoft or service provider computer systems over the Internet. In some cases, youwill not receive a separate notice when they connect. When the software does so, we check yourversion of the software and recommend or download updates to your devices. You may not receivenotice when we download the update. You may switch off this feature.• Network Awareness. This feature determines whether a system is connected to a network by eitherpassive monitoring of network traffic or active DNS or HTTP queries. The query only transfersstandard TCP/IP or DNS information for routing purposes. You can switch off the active queryfeature through a registry setting.• Plug and Play and Plug and Play Extensions. You may connect new hardware to your device, eitherdirectly or over a network. Your device may not have the drivers needed to communicate with thathardware. If so, the update feature of the software can obtain the correct driver from Microsoft andinstall it on your device. An administrator can disable this update feature.• Real Simple Syndication (“RSS”) Feed. This software start page contains updated content that issupplied by means of an RSS feed online from Microsoft.• Search Suggestions Service. When you type a search query in Internet Explorer by using the InstantSearch box or by typing a question mark (?) before your search term in the Address bar, you will seesearch suggestions as you type (if supported by your search provider). Everything you type in theInstant Search box or in the Address bar when preceded by a question mark (?) is sent to yoursearch provider as you type it. In addition, when you press Enter or click the Search button, all thetext that is in the search box or Address bar is sent to the search provider. If you use a Microsoftsearch provider, the information you send is subject to the Microsoft Online Privacy Statement,which is available at go.microsoft.com/fwlink/?linkid=31493. If you use a third-party searchprovider, use of the information sent will be subject to the third party’s privacy practices. You canturn search suggestions off at any time in Internet Explorer by using Manage Add-ons under theTools button. 
For more information about the search suggestions service, seego.microsoft.com/fwlink/?linkid=128106.• SQL Server Reporting Services Map Report Item. The software may include features that retrievecontent such as maps, images and other data through the Bing Maps (or successor branded)


MCT USE ONLY. STUDENT USE PROHIBITEDapplication programming interface (the “Bing Maps APIs”). The purpose of these features is tocreate reports displaying data on top of maps, aerial and hybrid imagery. If these features areincluded, you may use them to create and view dynamic or static documents. This may be done onlyin conjunction with and through methods and means of access integrated in the software. You maynot otherwise copy, store, archive, or create a database of the content available through the BingMaps APIs. you may not use the following for any purpose even if they are available through theBing Maps APIs:• Bing Maps APIs to provide sensor based guidance/routing, or• Any Road Traffic Data or Bird’s Eye Imagery (or associated metadata).Your use of the Bing Maps APIs and associated content is also subject to the additional terms andconditions at http://www.microsoft.com/maps/product/terms.html.• URL Filtering. The URL Filtering feature identifies certain types of web sites based upon predefinedURL categories, and allows you to deny access to such web sites, such as known malicious sites andsites displaying inappropriate or pornographic materials. To apply URL filtering, Microsoft queriesthe online Microsoft Reputation Service for URL categorization. You can switch off URL filtering. Formore information on this feature, see http://go.microsoft.com/fwlink/?LinkId=130980• Web Content Features. Features in the software can retrieve related content from Microsoft andprovide it to you. To provide the content, these features send to Microsoft the type of operatingsystem, name and version of the software you are using, type of browser and language code of thedevice where you run the software. Examples of these features are clip art, templates, onlinetraining, online assistance and Appshelp. You may choose not to use these web content features.• Windows Media Digital Rights Management. Content owners use Windows Media digital rightsmanagement technology (WMDRM) to protect their intellectual property, including copyrights. Thissoftware and third party software use WMDRM to play and copy WMDRM-protected content. If thesoftware fails to protect the content, content owners may ask Microsoft to revoke the software’sability to use WMDRM to play or copy protected content. Revocation does not affect other content.When you download licenses for protected content, you agree that Microsoft may include arevocation list with the licenses. Content owners may require you to upgrade WMDRM to accesstheir content. Microsoft software that includes WMDRM will ask for your consent prior to theupgrade. If you decline an upgrade, you will not be able to access content that requires the upgrade.You may switch off WMDRM features that access the Internet. When these features are off, you canstill play content for which you have a valid license.• Windows Media Player. When you use Windows Media Player, it checks with Microsoft for· compatible online music services in your region;· new versions of the player; and· codecs if your device does not have the correct ones for playing content.You can switch off this last feature. For more information, go towww.microsoft.com/windows/windowsmedia/player/11/privacy.aspx.• Windows Rights Management Services. The software contains a feature that allows you to createcontent that cannot be printed, copied or sent to others without your permission. For moreinformation, go to www.microsoft.com/rms. you may choose not to use this feature


MCT USE ONLY. STUDENT USE PROHIBITED• Windows Time Service. This service synchronizes with time.windows.com once a week to provideyour computer with the correct time. You can turn this feature off or choose your preferred timesource within the Date and Time Control Panel applet. The connection uses standard NTP protocol.• Windows Update Feature. You may connect new hardware to the device where you run thesoftware. Your device may not have the drivers needed to communicate with that hardware. If so,the update feature of the software can obtain the correct driver from Microsoft and run it on yourdevice. You can switch off this update feature.c. Use of Information. Microsoft may use the device information, error reports, and malware reports toimprove our software and services. We may also share it with others, such as hardware and softwarevendors. They may use the information to improve how their products run with Microsoft software.d. Misuse of Internet-based Services. You may not use any Internet-based service in any way that couldharm it or impair anyone else’s use of it. You may not use the service to try to gain unauthorized accessto any service, data, account or network by any means.5. SCOPE OF LICENSE. The Licensed Content is licensed, not sold. This agreement only gives you some rightsto use the Licensed Content. Microsoft reserves all other rights. Unless applicable law gives you morerights despite this limitation, you may use the Licensed Content only as expressly permitted in thisagreement. In doing so, you must comply with any technical limitations in the Licensed Content that onlyallows you to use it in certain ways. Except as expressly permitted in this agreement, you may not:• install more copies of the Licensed Content on devices than the number of licenses you acquired;• allow more individuals to access the Licensed Content than the number of licenses you acquired;• publicly display, or make the Licensed Content available for others to access or use;• install, sell, publish, transmit, encumber, pledge, lend, copy, adapt, link to, post, rent, lease or lend,make available or distribute the Licensed Content to any third party, except as expressly permittedby this Agreement.• reverse engineer, decompile, remove or otherwise thwart any protections or disassemble theLicensed Content except and only to the extent that applicable law expressly permits, despite thislimitation;• access or use any Licensed Content for which you are not providing a training session to End Usersusing the Licensed Content;• access or use any Licensed Content that you have not been authorized by Microsoft to access anduse; or• transfer the Licensed Content, in whole or in part, or assign this agreement to any third party.6. RESERVATION OF RIGHTS AND OWNERSHIP. Microsoft reserves all rights not expressly granted to you inthis agreement. The Licensed Content is protected by copyright and other intellectual property laws andtreaties. Microsoft or its suppliers own the title, copyright, and other intellectual property rights in theLicensed Content. You may not remove or obscure any copyright, trademark or patent notices thatappear on the Licensed Content or any components thereof, as delivered to you.7. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regulations. Youmust comply with all domestic and international export laws and regulations that apply to the LicensedContent. These laws include restrictions on destinations, End Users and end use. 
For additionalinformation, see www.microsoft.com/exporting.


8. LIMITATIONS ON SALE, RENTAL, ETC. AND CERTAIN ASSIGNMENTS. You may not sell, rent, lease, lend orsublicense the Licensed Content or any portion thereof, or transfer or assign this agreement.9. SUPPORT SERVICES. Because the Licensed Content is “as is”, we may not provide support services for it.10. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you failto comply with the terms and conditions of this agreement. Upon any termination of this agreement, youagree to immediately stop all use of and to irretrievable delete and destroy all copies of the LicensedContent in your possession or under your control.MCT USE ONLY. STUDENT USE PROHIBITED11. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed Content.The third party sites are not under the control of Microsoft, and Microsoft is not responsible for thecontents of any third party sites, any links contained in third party sites, or any changes or updates to thirdparty sites. Microsoft is not responsible for webcasting or any other form of transmission received fromany third party sites. Microsoft is providing these links to third party sites to you only as a convenience,and the inclusion of any link does not imply an endorsement by Microsoft of the third party site.12. ENTIRE AGREEMENT. This agreement, and the terms for supplements, updates and support services arethe entire agreement for the Licensed Content.13. APPLICABLE LAW.a. United States. If you acquired the Licensed Content in the United States, Washington state law governsthe interpretation of this agreement and applies to claims for breach of it, regardless of conflict of lawsprinciples. The laws of the state where you live govern all other claims, including claims under stateconsumer protection laws, unfair competition laws, and in tort.b. Outside the United States. If you acquired the Licensed Content in any other country, the laws of thatcountry apply.14. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the laws ofyour country. You may also have rights with respect to the party from whom you acquired the LicensedContent. This agreement does not change your rights under the laws of your country if the laws of yourcountry do not permit it to do so.15. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS," "WITH ALL FAULTS," AND "ASAVAILABLE." YOU BEAR THE RISK OF USING IT. MICROSOFT CORPORATION AND ITS RESPECTIVEAFFILIATES GIVE NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS UNDER OR IN RELATION TOTHE LICENSED CONTENT. YOU MAY HAVE ADDITIONAL CONSUMER RIGHTS UNDER YOUR LOCAL LAWSWHICH THIS AGREEMENT CANNOT CHANGE. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS,MICROSOFT CORPORATION AND ITS RESPECTIVE AFFILIATES EXCLUDE ANY IMPLIED WARRANTIES ORCONDITIONS, INCLUDING THOSE OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE ANDNON-INFRINGEMENT.16. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. TO THE EXTENT NOT PROHIBITED BYLAW, YOU CAN RECOVER FROM MICROSOFT CORPORATION AND ITS SUPPLIERS ONLY DIRECTDAMAGES UP TO USD$5.00. YOU AGREE NOT TO SEEK TO RECOVER ANY OTHER DAMAGES, INCLUDINGCONSEQUENTIAL, LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES FROM MICROSOFTCORPORATION AND ITS RESPECTIVE SUPPLIERS.


MCT USE ONLY. STUDENT USE PROHIBITEDThis limitation applies too anything related to the Licensed Content, services made available through the Licensed Content, orcontent (including code) on third party Internet sites or third-party programs; ando claims for breach of contract, breach of warranty, guarantee or condition, strict liability, negligence,or other tort to the extent permitted by applicable law.It also applies even if Microsoft knew or should have known about the possibility of the damages. Theabove limitation or exclusion may not apply to you because your country may not allow the exclusion orlimitation of incidental, consequential or other damages.Please note: As this Licensed Content is distributed in Quebec, Canada, some of the clauses in this agreementare provided below in French.Remarque : Ce le contenu sous licence étant distribué au Québec, Canada, certaines des clauses dans cecontrat sont fournies ci-dessous en français.EXONÉRATION DE GARANTIE. Le contenu sous licence visé par une licence est offert « tel quel ». Touteutilisation de ce contenu sous licence est à votre seule risque et péril. Microsoft n’accorde aucune autre garantieexpresse. Vous pouvez bénéficier de droits additionnels en vertu du droit local sur la protection duesconsommateurs, que ce contrat ne peut modifier. La ou elles sont permises par le droit locale, les garantiesimplicites de qualité marchande, d’adéquation à un usage particulier et d’absence de contrefaçon sont exclues.LIMITATION DES DOMMAGES-INTÉRÊTS ET EXCLUSION DE RESPONSABILITÉ POUR LES DOMMAGES. Vouspouvez obtenir de Microsoft et de ses fournisseurs une indemnisation en cas de dommages directs uniquementà hauteur de 5,00 $ US. Vous ne pouvez prétendre à aucune indemnisation pour les autres dommages, ycompris les dommages spéciaux, indirects ou accessoires et pertes de bénéfices.Cette limitation concerne:• tout ce qui est relié au le contenu sous licence , aux services ou au contenu (y compris le code)figurant sur des sites Internet tiers ou dans des programmes tiers ; et• les réclamations au titre de violation de contrat ou de garantie, ou au titre de responsabilitéstricte, de négligence ou d’une autre faute dans la limite autorisée par la loi en vigueur.Elle s’applique également, même si Microsoft connaissait ou devrait connaître l’éventualité d’un tel dommage.Si votre pays n’autorise pas l’exclusion ou la limitation de responsabilité pour les dommages indirects,accessoires ou de quelque nature que ce soit, il se peut que la limitation ou l’exclusion ci-dessus ne s’appliquerapas à votre égard.EFFET JURIDIQUE. Le présent contrat décrit certains droits juridiques. Vous pourriez avoir d’autres droits prévuspar les lois de votre pays. Le présent contrat ne modifie pas les droits que vous confèrent les lois de votre payssi celles-ci ne le permettent pas.Revised March 2012




Acknowledgments

Microsoft Learning would like to acknowledge and thank the following for their contribution towards developing this title. Their effort at various stages in the development has ensured that you have a good classroom experience.

Graeme Malcolm – Lead Content Developer
Graeme Malcolm is a Microsoft SQL Server subject matter expert and professional content developer at Content Master—a division of CM Group Ltd. As a Microsoft Certified Trainer, Graeme has delivered training courses on SQL Server since version 4.2; as an author, Graeme has written numerous books, articles, and training courses on SQL Server; and as a consultant, Graeme has designed and implemented business solutions based on SQL Server for customers all over the world.

Geoff Allix – Content Developer
Geoff Allix is a Microsoft SQL Server subject matter expert and professional content developer at Content Master—a division of CM Group Ltd. Geoff is a Microsoft Certified IT Professional for SQL Server with extensive experience in designing and implementing database and BI solutions on SQL Server technologies, and has provided consultancy services to organizations seeking to implement and optimize data warehousing and OLAP solutions.

Martin Ellis – Content Developer
Martin Ellis is a Microsoft SQL Server subject matter expert and professional content developer at Content Master—a division of CM Group Ltd. Martin is a Microsoft Certified Technical Specialist on SQL Server and an MCSE. He has been working with SQL Server since version 7.0, as a DBA, consultant and Microsoft Certified Trainer, and has developed a wide range of technical collateral for Microsoft Corp. and other technology enterprises.

Chris Testa-O'Neill – Technical Reviewer
Chris Testa-O'Neill is a Senior Consultant at Coeo (www.coeo.com), a leading provider of SQL Server Managed Support and Consulting in the UK and Europe. He is also a Microsoft Certified Trainer, Microsoft Most Valuable Professional for SQL Server, and lead author of Microsoft E-Learning MCTS courses for SQL Server 2008. Chris has spoken at a range of SQL Server events in the UK, Europe, Australia and the United States. He is also one of the organizers of SQLBits, SQLServerFAQ and a UK Regional Mentor for SQLPASS. You can contact Chris at chris@coeo.com, @ctesta_oneill or through his blog at http://www.coeo.com/sql-server-events/sql-events-and-blogs.aspx.


Contents

Module 1: Introduction to Data Warehousing
  Lesson 1: Overview of Data Warehousing 1-3
  Lesson 2: Considerations for a Data Warehouse Solution 1-14
  Lab 1: Exploring a Data Warehousing Solution 1-28
Module 2: Data Warehouse Hardware
  Lesson 1: Considerations for Building a Data Warehouse 2-3
  Lesson 2: Data Warehouse Reference Architectures and Appliances 2-11
Module 3: Designing and Implementing a Data Warehouse
  Lesson 1: Logical Design for a Data Warehouse 3-3
  Lesson 2: Physical Design for a Data Warehouse 3-17
  Lab 3: Implementing a Data Warehouse Schema 3-27
Module 4: Creating an ETL Solution with SSIS
  Lesson 1: Introduction to ETL with SSIS 4-3
  Lesson 2: Exploring Source Data 4-10
  Lesson 3: Implementing Data Flow 4-21
  Lab 4: Implementing Data Flow in an SSIS Package 4-38
Module 5: Implementing Control Flow in an SSIS Package
  Lesson 1: Introduction to Control Flow 5-3
  Lesson 2: Creating Dynamic Packages 5-14
  Lesson 3: Using Containers 5-21
  Lab 5A: Implementing Control Flow in an SSIS Package 5-33
  Lesson 4: Managing Consistency 5-41
  Lab 5B: Using Transactions and Checkpoints 5-51
Module 6: Debugging and Troubleshooting SSIS Packages
  Lesson 1: Debugging an SSIS Package 6-3
  Lesson 2: Logging SSIS Package Events 6-12
  Lesson 3: Handling Errors in an SSIS Package 6-21
  Lab 6: Debugging and Troubleshooting an SSIS Package 6-30
Module 7: Implementing an Incremental ETL Process
  Lesson 1: Introduction to Incremental ETL 7-3
  Lesson 2: Extracting Modified Data 7-9
  Lab 7A: Extracting Modified Data 7-31
  Lesson 3: Loading Modified Data 7-54
  Lab 7B: Loading Incremental Changes 7-73


Module 8: Incorporating Data from the Cloud into a Data Warehouse
  Lesson 1: Overview of Cloud Data Sources 8-3
  Lesson 2: SQL Azure 8-9
  Lesson 3: The Windows Azure Marketplace DataMarket 8-19
  Lab: Using Cloud Data in a Data Warehouse Solution 8-26
Module 9: Enforcing Data Quality
  Lesson 1: Introduction to Data Quality 9-3
  Lesson 2: Using Data Quality Services to Cleanse Data 9-13
  Lab 9A: Cleansing Data 9-20
  Lesson 3: Using Data Quality Services to Match Data 9-29
  Lab 9B: Deduplicating Data 9-38
Module 10: Using Master Data Services
  Lesson 1: Introduction to Master Data Services 10-3
  Lesson 2: Implementing a Master Data Services Model 10-10
  Lesson 3: Managing Master Data 10-23
  Lesson 4: Creating a Master Data Hub 10-36
  Lab 10: Implementing Master Data Services 10-46
Module 11: Extending SQL Server Integration Services
  Lesson 1: Using Custom Components in SSIS 11-3
  Lesson 2: Using Scripts in SSIS 11-10
  Lab 11: Using Custom Components and Scripts 11-21
Module 12: Deploying and Configuring SSIS Packages
  Lesson 1: Overview of SSIS Deployment 12-3
  Lesson 2: Deploying SSIS Projects 12-9
  Lesson 3: Planning SSIS Package Execution 12-19
  Lab 12: Deploying and Configuring SSIS Packages 12-30
Module 13: Consuming Data in a Data Warehouse
  Lesson 1: Introduction to Business Intelligence 13-3
  Lesson 2: Introduction to Reporting 13-8
  Lesson 3: Introduction to Data Analysis 13-12
  Lab 13: Using Business Intelligence Tools 13-18


Appendix: Lab Answer Keys
  Module 1 Lab 1: Exploring a Data Warehousing Solution L1-1
  Module 3 Lab 3: Implementing a Data Warehouse Schema L3-7
  Module 4 Lab 4: Implementing Data Flow in an SSIS Package L4-13
  Module 5 Lab 5A: Implementing Control Flow in an SSIS Package L5-25
  Module 5 Lab 5B: Using Transactions and Checkpoints L5-33
  Module 6 Lab 6: Debugging and Troubleshooting an SSIS Package L6-37
  Module 7 Lab 7A: Extracting Modified Data L7-45
  Module 7 Lab 7B: Loading Incremental Changes L7-65
  Module 8 Lab 8: Using Cloud Data in a Data Warehouse Solution L8-81
  Module 9 Lab 9A: Cleansing Data L9-91
  Module 9 Lab 9B: Deduplicating Data L9-99
  Module 10 Lab 10: Implementing Master Data Services L10-105
  Module 11 Lab 11: Using Custom Components and Scripts L11-117
  Module 12 Lab 12: Deploying and Configuring SSIS Packages L12-123
  Module 13 Lab 13: Using Business Intelligence Tools L13-129


Module 6
Debugging and Troubleshooting SSIS Packages

Contents:
Lesson 1: Debugging an SSIS Package 6-3
Lesson 2: Logging SSIS Package Events 6-12
Lesson 3: Handling Errors in an SSIS Package 6-21
Lab 6: Debugging and Troubleshooting an SSIS Package 6-30


Module Overview

As you develop more complex SQL Server Integration Services (SSIS) packages, it is important to be familiar with the tools and techniques you can use to debug package execution and handle any errors that occur. This module describes how you can debug packages to find the cause of errors that occur during execution. It then discusses the logging functionality built into SSIS that you can use to log events for troubleshooting purposes. Finally, the module describes common approaches for handling errors in control flow and data flow.

After completing this module, you will be able to:
• Debug an SSIS package.
• Implement logging for an SSIS package.
• Handle errors in an SSIS package.


Lesson 1
Debugging an SSIS Package

When you are in the process of developing an application, misconfiguration of tasks or data flow components, or errors in variable definitions or expressions can lead to unexpected behavior. Even if you develop your package perfectly, there are many potential problems that might arise during execution, such as a missing or misnamed file, or an invalid data value. It is therefore important to be able to use debugging techniques to find the cause of these problems, and formulate a solution.

After completing this lesson, you will be able to:
• Describe the tools and techniques for debugging SSIS packages.
• View package execution events.
• Use breakpoints to pause package execution.
• View variable values and status while debugging.
• Use data viewers to view data flow values while debugging.


Overview of SSIS Debugging

Debugging is the process of finding the source of problems that occur during package execution, either during development or in a package that has been deployed to a production environment.

Debugging During Development
At design time, SSIS developers can use a variety of debugging techniques in SQL Server Data Tools to find problems in control flow and data flow processes. These techniques include:
• Observing row counts and task outcome indicators when running packages in the debugging environment.
• Viewing events that are recorded during package execution. These events are shown in the Progress tab during execution, and in the Execution Results tab after execution. Events are also shown in the Output window during and after each execution.
• Stepping through package execution by setting breakpoints that pause execution at specific points in the control flow.
• Viewing variable values while debugging.
• Viewing the rows that pass through the data flow pipeline by attaching data viewers to data flow paths.

Note: SQL Server Data Tools is based on Microsoft® Visual Studio®, and includes a number of debugging windows and tools that are primarily designed for debugging software solutions with programming languages such as Microsoft Visual C#®. This lesson focuses on the debugging tools in the Visual Studio environment that are most useful when debugging SSIS packages.


Debugging in the Production Environment
It is common for problems to occur during package execution after the package has been deployed to the production environment. In this scenario, if the package source project is available, you can use the techniques described previously. However, you can also debug the package by examining any log files it is configured to generate, or by using the dtexec utility or the dtutil utility to generate a dump file. Dump files contain information about system variable values and settings that you can use to diagnose a problem with a package.
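For example, when a deployed package fails and you cannot attach the designer to it, you can run it from the command line with dtexec and request a debug dump if an error occurs. The following command-line sketch is illustrative rather than part of the course files; it assumes a package stored in the file system as MyPackage.dtsx, and the path and options may need to be adjusted for your environment.

    REM Run the package with verbose console reporting (/Reporting V) and
    REM create debug dump files if any error occurs (/DumpOnError).
    dtexec /File "MyPackage.dtsx" /Reporting V /DumpOnError

If the package completes without error, no dump is written; the /Dump option can be used instead to trigger a dump only for specific error codes. The resulting dump files capture the kind of system and setting information described above and can be examined later, or passed to a support engineer, to diagnose the failure.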


Viewing Package Execution Events

You can think of a package execution as a sequence of events that are generated by the tasks and containers in the package control flow. When you run a package in debug mode in the development environment, these events are recorded and displayed in two locations. To run a package in debug mode, you can use any of the following techniques:
• On the Debug menu, click Start Debugging.
• Click the Start Debugging button on the toolbar.
• Press F5.

The Progress / Execution Results Tab
During execution, the Progress tab of the SSIS package designer shows a hierarchical view of the package and its containers and tasks, and displays information about events that occur during execution. When execution is complete, the tab is renamed Execution Results and shows the entire event tree for the completed execution.
You can enable or disable the display of messages on the Progress tab by toggling Debug Progress Reporting on the SSIS menu. Disabling progress reporting can help improve performance when debugging complex packages.

The Output Window
The Output window shows the list of events that occur during execution. After execution is complete, you can review the Output window to find details of the events that occurred.
The Output window and the Progress / Execution Results tab are useful resources for troubleshooting errors during package execution. As an SSIS developer, you should habitually review the events in these windows when debugging your packages.
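In addition to the events that tasks and containers raise automatically, you can raise your own events so that extra diagnostic detail appears in these windows. The following minimal sketch assumes a Script Task created with the default C# template (which generates the Dts object and the ScriptResults enumeration for you); the subcomponent name and message text are purely illustrative.

    public void Main()
    {
        bool fireAgain = true;

        // Raise an information event. It appears on the Progress tab and in the
        // Output window alongside the events raised by the package's other tasks.
        Dts.Events.FireInformation(0, "Debugging Demo",
            "Custom processing is about to start.", string.Empty, 0, ref fireAgain);

        Dts.TaskResult = (int)ScriptResults.Success;
    }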


Breakpoints

When you have discovered a problem with an event in your package, you can use breakpoints to pause execution at the point in the control flow where the error occurs in order to troubleshoot the problem.
You can create a breakpoint for events raised by any container or task in a control flow. The simplest way to create a breakpoint is to select the task or container where you want to pause execution, and on the Debug menu, click Toggle Breakpoint. This adds a breakpoint at the OnPreExecute event of the selected task or container, which is the first event a task or container raises during package execution.
For greater control of when a breakpoint pauses execution, you can right-click any task or container and click Edit Breakpoints to display the Set Breakpoints dialog box for the task or container. In this dialog box you can:
• Enable breakpoints for any event supported by the task or container.
• Specify a Hit Count Type and Hit Count value to control how often the event is ignored before the breakpoint pauses execution. You can set the Hit Count Type to one of the following settings:
  • Always – The Hit Count value is ignored and execution is always paused at this event.
  • Hit count equals – Execution is paused when the event has been raised the number of times specified in the Hit Count property.
  • Hit greater or equal – Execution is paused when the event has been raised the number of times specified in the Hit Count property or more.
  • Hit count multiple – Execution is paused each time the event has been raised a number of times that is a multiple of the Hit Count property.
You can view and manage all of the breakpoints that are set in a package in the Breakpoints window, which you can display by clicking the Debug menu, clicking Windows, and clicking Breakpoints.


Variable and Status Windows

When you have used a breakpoint to pause package execution, it can be useful to view the current values assigned to variables, parameters, and other system settings. SQL Server Data Tools provides two windows that you can use to observe these values while debugging.

The Locals Window
The Locals window is a pane in the SQL Server Data Tools environment that lists all of the system settings, variables, and parameters that are currently in scope. You can use this window to find current values for these settings, variables, and parameters in the execution context.
To view the Locals window when package execution is paused by a breakpoint, on the Debug menu click Windows, and then click Locals.

Watch Windows
If you want to track specific variable or parameter values while debugging, you can add a watch for each value you want to track. Watched values are shown in a watch window. You can use four watch windows, named Watch 1, Watch 2, Watch 3, and Watch 4. However, in most SSIS debugging scenarios, only Watch 1 is used.
To display a watch window while debugging, on the Debug menu, click Windows, click Watch, and then click the watch window you want to display.
To add a value to the Watch 1 window, right-click the variable or parameter you want to track in the Locals window, and click Add Watch.
To add a variable or parameter to another watch window, drag it from the Locals window to the watch window in which you want it to be displayed.


Data Viewers

Most SSIS packages are primarily designed to transfer data, and when debugging a package, it can be useful to examine the data as it passes through the data flow. Data viewers provide a way to view the data rows as they pass along data flow paths between sources, transformations, and destinations.

Enabling a Data Viewer
To enable a data viewer, right-click a data flow path on the Data Flow tab and click Enable Data Viewer. Alternatively, you can double-click a data flow path and enable a data viewer on the Data Viewer tab of the Data Flow Path Editor dialog box. Using the dialog box also enables you to select specific columns to be included in the data viewer.

Viewing Data in the Data Flow
A data viewer behaves like a breakpoint, and pauses execution at the data flow path on which it is defined. When a data viewer pauses execution, a window containing the data in the data flow path is displayed, enabling you to examine the data at various stages of the data flow. When you have finished examining the data, you can resume execution by clicking the green Continue arrow button in the data viewer window. If you no longer require the data viewer, you can remove it by clicking the Detach button in the data viewer window.

Copying Data from a Data Viewer
The data viewer window includes a Copy button, which you can use to copy the contents of the data viewer to the Windows® clipboard. When a data flow contains a large number of rows, it can be useful to copy the contents of a data viewer and paste the data into a tool such as Microsoft Excel® for further examination.


Demonstration: Debugging a Package

Task 1: Add a breakpoint
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. In the D:\10777A\Demofiles\Mod06 folder, run Setup.cmd as Administrator, and then double-click Debugging.sln to open the solution in SQL Server Data Tools.
3. In Solution Explorer, double-click Debugging Demo.dtsx. This package includes a control flow that performs the following tasks:
• Copies a text file, using a variable named User::sourceFile to determine the source path and a variable named User::copiedFile to determine the destination path.
• Uses a data flow to extract the data from the text file, convert columns to appropriate data types, and load the resulting data into a database table.
• Deletes the copied file.
4. On the Debug menu, click Start Debugging, note that the first task fails, and stop debugging.
5. Click the Copy Source File task and on the Debug menu, click Toggle Breakpoint. Then right-click the Copy Source File task and click Edit Breakpoints, and note that you can use this dialog box to control the events and conditions for breakpoints in your package – when you toggle a breakpoint, by default it is enabled for the OnPreExecute event with a Hit Count Type value of Always. Then click OK.
6. Start debugging and note that execution stops at the breakpoint.


Task 2: View variable values while debugging
1. With execution stopped at the breakpoint, on the Debug menu, click Windows, and click Locals.
2. In the Locals pane, expand Variables and find the User::copiedFile variable. When you find the User::copiedFile variable, right-click it and click Add Watch. The Watch 1 pane is then shown with the User::copiedFile variable displayed.
3. Click the Locals pane and find the User::sourceFile variable. When you find the User::sourceFile variable, right-click it and click Add Watch. The Watch 1 pane is then shown with the User::copiedFile and User::sourceFile variables displayed.
4. Note that the value of the User::sourceFile variable is D:\10777A\Demofiles\Mod06\Products.txt, and in the D:\10777A\Demofiles\Mod06 folder, note that the file is actually named Products.csv.
5. Stop debugging, and on the SSIS menu, click Variables. Then in the Variables pane, change the value for the sourceFile variable to D:\10777A\Demofiles\Mod06\Products.csv.
6. Start debugging and observe the variable values in the Watch 1 pane. Note that the sourceFile variable now refers to the correct file.
7. On the Debug menu, click Continue and note that the Load Data task fails. Then stop debugging.

Task 3: Enable a data viewer
1. Double-click the Load Data task to view the data flow design surface.
2. Right-click the data flow path between Products File and Data Conversion, and click Enable Data Viewer.
3. Double-click the data flow path between Products File and Data Conversion, and in the Data Flow Path Editor dialog box, click the Data Viewer tab. Note that you can use this tab to enable the data viewer and specify which columns should be included, and that by default, all columns are included. Then click OK.
4. Click the Control Flow tab and verify that a breakpoint is still enabled on the Copy Source File task. Then on the Debug menu, click Start Debugging.
5. When execution stops at the breakpoint, on the Debug menu, click Continue.
6. When the data viewer window is displayed, resize it so you can see the data it contains, and note that the Price column for the second row contains a “-” character instead of a number.
7. In the data viewer window, click Copy Data. Then click the green continue button in the data viewer window.
8. When execution stops because the data flow task has failed, on the Debug menu, click Stop Debugging.
9. Start Excel, and with cell A1 selected, on the Home tab of the ribbon, click Paste. Then view the data you have pasted from the data viewer.
10. Close Excel without saving the workbook, and close SQL Server Data Tools.


Lesson 2
Logging SSIS Package Events

The debugging tools in SQL Server Data Tools can be extremely useful when developing a package. However, after a package is in production, it can be easier to diagnose a problem with package execution if the package provides details of the events that occurred during execution in a log. In addition to using a log for troubleshooting, you might want to log details of package execution for auditing or performance benchmarking purposes. Planning and implementing a suitable logging solution is an important part of developing a package, and SSIS includes built-in functionality to help you accomplish this.

After completing this lesson, you will be able to:
• Describe the log providers available in SSIS.
• Describe the events that can be logged and the schema for logging information.
• Implement logging in an SSIS package.
• View logged events.


SSIS Log Providers

The logging architecture in SSIS supports the recording of event information to one or more logs. Each log is accessed through a log provider that determines the type of log and the connection details used to access it.

SSIS includes the following log providers:
• Windows Event Log – Logs event information in the Application Windows event log. No connection manager is required for this log provider.
• Text File – Logs event information to a text file specified in a file connection manager.
• XML File – Logs event information to an XML file specified in a file connection manager.
• SQL Server – Logs event information in the sysssislog system table in a Microsoft SQL Server® database, which is specified in an OLE DB connection manager.
• SQL Server Profiler – Logs event information in a .trc file that can be examined in SQL Server Profiler. The location of the .trc file is specified in a file connection manager. This log provider is only available in 32-bit execution environments.

Additionally, software developers can use the Microsoft .NET Framework to develop custom log providers.

When deciding which log providers to include in your logging solution, you should generally try to comply with standard logging procedures in the existing IT infrastructure environment. For example, if administrators in the organization typically use the Windows Event Log as the primary source of troubleshooting information, you should consider using the Windows Event Log provider for your SSIS packages. When using files or SQL Server tables for logging, you should also consider the security of the log, which may contain sensitive information.


Log Events and Schema

Having determined the log providers you want to use, you can select the events for which you want to create log entries and the details you want to include in the log entries.

Log Events
SSIS logging supports the following events:
• OnError – This event is raised when an error occurs.
• OnExecStatusChanged – This event is raised when a task is paused or resumed.
• OnInformation – This event is raised during validation and execution to report information.
• OnPostExecute – This event is raised when execution of an executable has completed.
• OnPreExecute – This event is raised before an executable starts running.
• OnPreValidate – This event is raised when validation of an executable begins.
• OnProgress – This event is raised to indicate execution progress for an executable.
• OnQueryCancelled – This event is raised when execution is cancelled.
• OnTaskFailed – This event is raised when a task fails.
• OnVariableValueChanged – This event is raised when a variable has its value changed.
• OnWarning – This event is raised when a warning occurs.


• PipelineComponentTime – This event is raised to indicate the processing time for each phase of a data flow component.
• Diagnostic – This event is raised to provide diagnostic information.
• Executable-specific events – Some containers and tasks provide events that are specific to the executable. For example, a Foreach Loop container provides an event that is raised at the start of each loop iteration.

While it may be tempting to log every event, you should consider the performance overhead incurred by the logging process. The choice of events to log depends on the purposes of the logging solution. For example, if your goal is primarily to provide troubleshooting information when exceptions occur, then you should consider logging the OnError, OnWarning, and OnTaskFailed events. If your log will be used for auditing purposes, then you might want to log the OnInformation event; and if you want to use your log to measure package performance, then you may consider logging the OnProgress and PipelineComponentTime events.

Log Schema
The specific details that can be logged for each event are defined in the SSIS log schema. This schema includes the following values:
• StartTime – When the executable started running.
• EndTime – When the executable finished.
• DataCode – An integer value indicating the execution result:
• 0: Success
• 1: Failure
• 2: Completed
• 3: Cancelled
• Computer – The name of the computer on which the package was executed.
• Operator – The Windows account that initiated package execution.
• MessageText – A message associated with the event.
• DataBytes – A byte array specific to the log entry.
• SourceName – The name of the executable.
• SourceID – The unique identifier for the executable.
• ExecutionID – A unique identifier for the running instance of the package.

You can choose to include all elements of the schema in your log, or select individual values to reduce log size and performance overhead. However, the StartTime, EndTime, and DataCode values are always included in the log.
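Because the SQL Server log provider writes these schema values to the sysssislog system table, you can examine a log with a simple Transact-SQL query. The following is a minimal sketch; it assumes logging was configured with the SQL Server log provider against a database named DemoDW (as in the demonstration later in this lesson), and it uses the lowercase column names of the dbo.sysssislog table, which correspond to the schema values described above.

USE DemoDW;
GO
-- Return the most recent error-related log entries, newest first.
SELECT TOP (100)
    starttime,      -- StartTime schema value
    endtime,        -- EndTime schema value
    datacode,       -- DataCode: 0 = Success, 1 = Failure, 2 = Completed, 3 = Cancelled
    [event],        -- the logged event, for example OnError or OnTaskFailed
    source,         -- SourceName: the executable that raised the event
    executionid,    -- ExecutionID of the package execution instance
    [message]       -- MessageText associated with the event
FROM dbo.sysssislog
WHERE [event] IN ('OnError', 'OnWarning', 'OnTaskFailed')
ORDER BY starttime DESC;

Filtering on the event column in this way keeps the result set focused on troubleshooting information, even when more verbose events such as OnInformation are also being logged.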


Implementing SSIS Logging

Packages can be thought of as a hierarchy of containers and tasks, with the package itself as the root of the hierarchy. You can configure logging at the package level in the hierarchy, and by default, all child containers and tasks inherit the same logging settings. If required, you can override inherited logging settings for any container or task in the hierarchy. For example, you might choose to log only OnError events to the Windows Event Log provider at the package level and inherit these settings for most child containers and tasks, but configure a data flow task within the package to log OnInformation and Diagnostic events to the SQL Server log provider.

To implement logging for an SSIS package, in SQL Server Data Tools, with the package open in the designer, on the SSIS menu, click Logging to display the Configure SSIS Logs dialog box. Then perform the following steps:

1. Add and configure log providers.
On the Providers and Logs tab of the dialog box, add the log providers you want to use. For providers other than the provider for Windows Event Log, you must also specify a connection manager that defines the file or SQL Server instance where you want to write the log information.

2. Select containers and tasks to include.
Select the package container, which by default selects all child containers and tasks with inherited log settings. You can then unselect individual containers and tasks that you do not want to include in the log.


3. Select events and details to log.
On the Details tab of the dialog box, select the events you want to include in the log. By default, all schema fields are logged for the selected events, but you can click the Advanced button to select individual fields to log.

4. Override log settings for child executables if required.
If you want to specify individual logging settings for a specific child executable, select the executable in the Containers tree and specify the log provider, events, and details you want to use for that executable.


Viewing Logged Events

You can view logged events in SQL Server Data Tools by displaying the Log Events window. When logging is configured for a package, the Log Events window shows the selected log event details, even when no log provider is specified.

The Log Events window is a useful tool for troubleshooting packages during development, and also for testing and debugging logging configuration.

To display the Log Events window, on the SSIS menu, click Log Events.


Demonstration: Logging Package Execution

Task 1: Configure SSIS Logging
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. In the D:\10777A\Demofiles\Mod06 folder, run Setup.cmd as Administrator, and then double-click Logging.sln to open the solution in SQL Server Data Tools.
3. In Solution Explorer, double-click Logging Demo.dtsx.
4. On the SSIS menu, click Logging.
5. In the Configure SSIS Logs: Logging Demo dialog box, in the Provider type list, select SSIS log provider for Windows Event Log and click Add. Then select SSIS log provider for SQL Server and click Add.
6. In the Configuration column for the SSIS log provider for SQL Server, select the (local).DemoDW OLE DB connection manager. Note that the Windows Event Log provider requires no configuration.
7. In the Containers tree, check the checkbox for Logging Demo, and then with Logging Demo selected, on the Providers and Logs tab, check the checkbox for the SSIS log provider for Windows Event Log.
8. With Logging Demo selected, on the Details tab, select the OnError and OnInformation events.
9. Click the Providers and Logs tab, and in the Containers tree, clear the checkbox for Load Data, and then click the checkbox again to check it. This enables you to override the inherited logging settings for the Load Data task.
10. With Load Data selected, on the Providers and Logs tab, check the checkbox for the SSIS log provider for SQL Server.
11. With Load Data selected, on the Details tab, select the OnError and OnInformation events, and then click Advanced and clear the Operator column for the two selected events. Then click OK.


Task 2: View logged events
1. On the Debug menu, click Start Debugging. Then, when the Load Data task fails, on the Debug menu click Stop Debugging.
2. On the SSIS menu, click Log Events. This shows the events that have been logged during the debugging session (if the log is empty, rerun the package and then view the Log Events window again).
3. Click Start, click Administrative Tools, and click Event Viewer. Then expand Windows Logs, and click Application. Note the log entries with a source of SQLISPackage110. These are the logged events for the package.
4. Start SQL Server Management Studio and connect to the localhost instance of the database engine by using Windows authentication.
5. In Object Explorer, expand Databases, expand DemoDW, expand Tables, and expand System Tables. Then right-click dbo.sysssislog and click Select Top 1000 Rows.
6. View the contents of the table, noting that the operator column is empty.
7. Close SQL Server Management Studio without saving any files, then close Event Viewer and SQL Server Data Tools.


Lesson 3
Handling Errors in an SSIS Package

No matter how much debugging you perform, or how much information you log during package execution, exceptions can occur in any ETL process and cause errors. For example, servers can become unavailable, files can be renamed or deleted, and data sources can include invalid entries. A good SSIS solution includes functionality to handle errors that occur by performing compensating tasks and continuing with execution wherever possible, or by ensuring that temporary resources are cleaned up and operators are notified where execution cannot be continued.

After completing this lesson, you will be able to:
• Describe approaches for handling errors in an SSIS package.
• Implement event handlers.
• Handle errors in data flows.


Introduction to Error Handling

Errors can occur at any stage in the execution of a package, and SSIS provides a number of ways to handle errors and take corrective action if possible.

Handling Errors in Control Flow
You can use the following techniques to handle errors in package control flow:
• Use Failure Precedence Constraints
You can use Failure precedence constraints to redirect control flow when a task fails. For example, if a task fails, you can use a Failure precedence constraint to execute another task that performs a compensating alternative action to allow the control flow to continue, or to delete any temporary files and send an email notification to an operator.
• Implement Event Handlers
You can create event handlers to execute a specific set of tasks when an event occurs in the control flow. For example, you could implement an event handler for the OnError event of the package, and include tasks to delete files and send email notifications in the OnError event handler.

Note: Precedence constraints are discussed in Module 5: Implementing Control Flow in an SSIS Package. The remainder of this lesson focuses on using event handlers to handle errors in control flow.


Handling Errors in Data Flow

Errors in data flow can often be caused by invalid or unexpected data values in rows being processed by the data flow pipeline. SSIS data flow components provide the following configuration options for handling rows that cause errors:
• Fail the task if any rows cause an error.
• Ignore errors and continue the data flow.
• Redirect rows that cause an error to the error output of the data flow component.


Implementing Event Handlers

You can add an event handler for the events supported by each executable in the package. To add an event handler to a package, click the Event Handlers tab, select the executable and event for which you want to implement a handler, and click the hyperlink on the design surface. Doing this creates a new control flow surface on which you can define the control flow for the event handler. Event handlers can be used for all kinds of control flow tasks, and are not specific to handling errors. However, the OnError and OnTaskFailed events are commonly used for handling error conditions.

The system variables and configuration values available to tasks in your event handler are specific to the context of the event. For example, the System::ErrorDescription variable is populated with an error message during the OnError event.

Because a package is a hierarchy of containers and tasks, each with their own events, you need to consider where best to handle each possible error condition. For example, you can handle a task-specific error in the task’s own OnError event; or, depending on the MaxErrors property of containers and the package itself, the error caused by the task could trigger OnError events further up the package hierarchy, where you could also handle the error condition. In general, if you anticipate specific errors that you can resolve or compensate for and continue execution, you should implement the OnError event handler for the task or container where the error is likely to occur. You should use the OnError event handler of the package itself to catch errors that cannot be resolved, and use it to perform clean-up tasks and notify operators.


Handling Data Flow Errors

Data flow components participate in a data flow pipeline through which rows of data are passed along data flow paths. Errors can occur in data flow for a number of reasons, including:
• Rows that contain data of a data type that is incompatible with a transformation or destination, such as a decimal field in a text file that is mapped to an integer column in a destination.
• Rows that contain invalid data values, such as a text file that contains a date field with an invalid date value.
• Rows that contain data that will be truncated by a transformation or destination, such as a text field with 50 characters that is loaded into a table where the mapped column has a maximum length of 40 characters.
• Rows that contain data values that will cause an exception during a transformation, such as a numerical field with a value of zero that is used as a divisor in a derived column transformation.

By default, rows that cause an error result in the failure of the data flow component. However, you can configure many data flow components to ignore rows that contain errors, or to redirect them to the error output data flow path of the component. Ignoring or redirecting failed rows enables the data flow to complete for all other rows, and if you have chosen to redirect failed rows, you can use transformations to attempt to correct the invalid data values, or save the failed rows to a file or table for later analysis.


When configuring error output for a data flow component, you can specify different actions for truncations and errors. For example, you could choose to ignore truncations, but redirect rows that contain other errors. Additionally, some components enable you to specify different actions for each column in the data flow, so you could ignore errors in one column while redirecting rows that have an invalid value in another column.

Redirected rows include all of the input columns for the data flow component, and two additional columns:
• ErrorCode – The numeric code for the error.
• ErrorColumn – The original number of the column that caused the error.


Demonstration: Handling Errors

Task 1: Implement an event handler
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. In the D:\10777A\Demofiles\Mod06 folder, run Setup.cmd as Administrator, and then double-click Error Handling.sln to open the solution in SQL Server Data Tools.
3. In Solution Explorer, double-click Error Handling Demo.dtsx.
4. On the Event Handlers pane, in the Executable list, ensure Error Handling Demo is selected; and in the Event handler list, ensure OnError is selected. Then click the hyperlink in the middle of the design surface.
5. In the SSIS Toolbox, double-click Send Mail Task, and then on the design surface, right-click Send Mail Task, click Rename, and change the name to Notify administrator.
6. Double-click Notify administrator and on the Mail tab of the Send Mail Task Editor dialog box, configure the following properties:
• SmtpConnection: A new connection to the localhost SMTP server with default settings.
• From: etl@adventureworks.msft
• To: administrator@adventureworks.msft
• Subject: An error has occurred
7. In the Send Mail Task Editor dialog box, on the Expressions tab, click the ellipsis (…) button in the Expressions box. Then in the Property Expressions Editor dialog box, in the Property list select MessageSource, and in the Expression box, click the ellipsis button.
8. In the Expression Builder dialog box, expand Variables and Parameters, expand System Variables, and drag System::ErrorDescription to the Expression box. Then click OK.


9. In the Property Expressions Editor dialog box, in the row under the MessageSource property, in the Property list select FileAttachments. Then in the Expression box, click the ellipsis button.
10. In the Expression Builder dialog box, expand Variables and Parameters, and drag User::sourceFile to the Expression box. Then click OK.
11. In the Property Expressions Editor dialog box, click OK. Then in the Send Mail Task Editor dialog box, click OK.
12. Click the Control Flow tab, and then on the Debug menu, click Start Debugging. When the Load Data task fails, click the Event Handlers tab to verify that the OnError event handler has been executed, and then on the Debug menu, click Stop Debugging.
13. In the C:\inetpub\mailroot\Drop folder, order the files by the date modified field so that the most recent files are at the top of the list. Then note that several email messages have been delivered at the same time. The event handler has been executed once for each error that occurred.
14. Double-click each of the most recent email messages to open them in Microsoft Outlook® and view their contents. Then close all Outlook windows.

Task 2: Redirect failed rows
1. Click the Data Flow tab, and in the Data Flow Task drop-down list, ensure Load Data is selected.
2. Double-click Data Conversion, and then in the Data Conversion Transformation Editor dialog box, click Configure Error Output.
3. In the Configure Error Output dialog box, click the Error cell for the Numeric Product column, and then hold the Ctrl key and click the Error cell for the Numeric Price column so that both cells are selected. Then in the Set this value to the selected cells list, select Redirect row and click Apply.
4. Click OK to close the Configure Error Output dialog box, and then click OK to close the Data Conversion Transformation Editor dialog box.
5. In the SSIS Toolbox, in the Other Destinations section, double-click Flat File Destination. Then on the design surface, right-click Flat File Destination, click Rename, and change the name to Invalid Rows.
6. Drag Invalid Rows to the right of Data Conversion, and drag the red data path from Data Conversion to Invalid Rows. In the Configure Error Output dialog box, verify that the Numeric Product and Numeric Price columns both have an Error value of Redirect row, and click OK.
7. Double-click Invalid Rows, and next to the Flat File connection manager drop-down list, click New.
8. In the Flat File Format dialog box, ensure Delimited is selected and click OK. Then in the Flat File Connection Manager Editor dialog box, in the Connection manager name box, type Invalid Rows CSV File, in the File name box type D:\10777A\Demofiles\Mod06\InvalidRows.csv, and click OK.
9. In the Flat File Destination Editor dialog box, click the Mappings tab and note that the input columns include the columns from the data flow, an ErrorCode column, and an ErrorColumn column. Then click OK.
10. Click the Control Flow tab and on the Debug menu, click Start Debugging. Note that all tasks succeed, and on the Data Flow tab note the row counts that pass through each data flow path. Then on the Debug menu, click Stop Debugging and close SQL Server Data Tools.
11. In the D:\10777A\Demofiles\Mod06 folder, double-click InvalidRows.csv to open it in Excel, and view the rows that were redirected. Then close Excel without saving the workbook.


Lab Scenario

In this lab, you will continue to develop the Adventure Works ETL solution.

The current solution includes a package that extracts reseller payments data from text files that are generated by the company’s financial accounts system. On some occasions during your testing, you have found that the package fails unexpectedly. You suspect that the failures may be caused by some invalid data in the text files, but you need to debug the package to step through the process and verify this assumption.

After identifying the source of the errors, you need to add logging functionality to the package so that operators can use the execution logs to troubleshoot any similar errors when the package is used in a production environment.

Finally, you want to minimize the risk of errors causing the package to fail in a production environment, so you want to add error handling to the package. If an unresolvable error occurs in the control flow, you want the package to handle the error by archiving the payments file that is currently being processed, and sending an e-mail notification to an operator so that they can examine the archived file and find the source of the error.

Additionally, you want to implement error handling in the data flow so that in cases where individual rows contain invalid data, the other, valid rows can be loaded into the staging database as usual, and the invalid rows are redirected to a separate file so they can be examined and corrected later.


Lab 6: Debugging and Troubleshooting an SSIS Package

Exercise 1: Debugging an SSIS Package
Scenario
You have developed an SSIS package to extract data from text files exported from a financial accounts system, and load the data into a staging database. However, while developing the package you have encountered some errors, and you need to debug the package to identify the cause of these errors.

The main tasks for this exercise are as follows:
1. Prepare the lab environment.
2. Run an SSIS package.
3. Add a breakpoint.
4. Add a data viewer.
5. View breakpoints.
6. Observe variable values while debugging.
7. View data copied from a data viewer.

Task 1: Prepare the lab environment
• Ensure the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
• Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab06\Starter folder as Administrator.


Task 2: Run an SSIS package
• Open the AdventureWorksETL.sln solution in the D:\10777A\Labfiles\Lab06\Starter\Ex1 folder.
• Open the Extract Payment Data.dtsx package and examine its control flow.
• Run the package, noting that the package fails. When execution is complete, stop debugging.

Task 3: Add a breakpoint
• Add a breakpoint to the Foreach Loop container, selecting the appropriate event to ensure that execution is always paused at the beginning of every iteration of the loop.

Task 4: Add a data viewer
• In the data flow for the Extract Payments task, add a data viewer to the data flow path between the Payments File source and the Staging DB destination.

Task 5: View breakpoints
• View the Breakpoints window, and note that it contains the breakpoint you defined on the Foreach Loop container and the data viewer you added to the data flow.

Task 6: Observe variable values while debugging
• Start debugging the package, and note that execution pauses at the first breakpoint.
• View the Locals window and view the configuration values and variables it contains.
• Add the following variables to the Watch 1 window:
• User::fName
• $Project::AccountsFolderPath
• Continue execution and note that execution pauses at the next breakpoint, which is the data viewer you defined in the data flow. Note also that the data viewer window contains the contents of the file indicated by the User::fName variable in the Watch 1 window.
• Continue execution, observing the value of the User::fName variable and the contents of the data viewer window during each iteration of the loop.
• When the Extract Payments task fails, note the value of the User::fName variable, and copy the contents of the data viewer window to the clipboard. Then stop debugging and close SQL Server Data Tools.

Task 7: View data copied from a data viewer
• Start Excel and paste the data you copied from the data viewer window into a worksheet.
• Examine the data and try to determine the errors it contains.

Results: After this exercise, you should have observed the variable values and data flows for each iteration of the loop in the Extract Payment Data.dtsx package. You should also have identified the file that caused the data flow to fail and examined its contents to find the data errors that caused the failure.


Exercise 2: Logging SSIS Package Execution
Scenario
You have debugged the Extract Payments Data.dtsx package and found some errors in the source data. Now you want to implement logging for the package to assist in diagnosing future errors when the package is deployed in a production environment.

The main tasks for this exercise are as follows:
1. Configure SSIS logs.
2. View logged events.

Task 1: Configure SSIS logs
• Open the AdventureWorksETL.sln solution in the D:\10777A\Labfiles\Lab06\Starter\Ex2 folder.
• Open the Extract Payment Data.dtsx package and configure logging with the following settings:
• Enable logging for the package, and inherit logging settings for all child executables.
• Log package execution events to the Windows Event Log.
• Log all available details for the OnError and OnTaskFailed events.

Task 2: View logged events
• Run the package in debug mode, and note that it fails. When execution is complete, stop debugging.
• In SQL Server Data Tools, view the Log Events window and note the logged events it contains. If no events are logged, re-run the package and look again. Then close SQL Server Data Tools.
• View the Application Windows event log in the Event Viewer administrative tool.

Results: After this exercise, you should have a package that logs event details to the Windows Event Log.


Exercise 3: Implementing an Event Handler
Scenario
You have debugged the Extract Payments Data.dtsx package and observed how errors in the source data can cause the package to fail. You now want to implement an error handler that copies the invalid source data file to a folder for later examination, and notifies an administrator of the failure.

The main tasks for this exercise are as follows:
1. Create an event handler for the OnError event.
2. Add a file system task to copy the current payments file.
3. Add a send mail task to send an email notification.
4. Test the error handler.

Task 1: Create an event handler for the OnError event
• Open the AdventureWorksETL.sln solution in the D:\10777A\Labfiles\Lab06\Starter\Ex3 folder.
• Open the Extract Payment Data.dtsx package and create an event handler for the OnError event of the Extract Payment package.

Task 2: Add a file system task to copy the current payments file
• Add a file system task to the control flow for the OnError event handler, and name it Copy Failed File.
• Configure the Copy Failed File task as follows:
• The task should perform a Copy File operation.
• Use the Payments File connection manager as the source connection.
• Create a new connection manager to create a file named D:\10777A\ETL\FailedPayments.csv for the destination.
• The task should overwrite the destination file if it already exists.
• Configure the connection manager you created for the destination file to use the following expression for its ConnectionString property, so that instead of the name FailedPayments.csv you configured in the connection manager, the copied file is named using a combination of the unique package execution ID and the name of the source file.

"D:\\10777A\\ETL\\" + @[System::ExecutionInstanceGUID] + @[User::fName]


Task 3: Add a send mail task to send an email notification
• Add a send mail task to the control flow for the OnError event handler, and name it Send Notification.
• Connect the Copy Failed File task to the Send Notification task with a Completion precedence constraint.
• Configure the Send Notification task as follows:
• Use the Local SMTP Server connection manager.
• Send a high-priority email message from etl@adventureworks.msft to student@adventureworks.msft with the subject “An error occurred”.
• Use the following expression to set the MessageSource property.

@[User::fName] + " failed to load. " + @[System::ErrorDescription]

Task 4: Test the error handler
• Run the package in debug mode and verify that the event handler is executed. Then close SQL Server Data Tools.
• Verify that the source file containing invalid data is copied to the D:\10777A\ETL folder with a name similar to {1234ABCD-1234-ABCD-1234-ABCD1234}Payments - EU.csv.
• View the contents of the C:\inetpub\mailroot\Drop folder, and verify that an email was sent for each error that occurred. You can view the email messages by double-clicking them to open them in Outlook.

Results: After this exercise, you should have a package that includes an event handler for the OnError event. The event handler should create a copy of files that contain invalid data and send an email message.


Exercise 4: Handling Errors in a Data Flow
Scenario
You have implemented an error handler that notifies an operator when a data flow fails. However, you would like to handle errors in the data flow so that only the rows containing invalid data are not loaded, and the rest of the data flow succeeds.

The main tasks for this exercise are as follows:
1. Redirect data flow errors.
2. View invalid data flow rows.

Task 1: Redirect data flow errors
• Open the AdventureWorksETL.sln solution in the D:\10777A\Labfiles\Lab06\Starter\Ex4 folder.
• Open the Extract Payments Data.dtsx package and view the data flow for the Extract Payments task.
• Configure the error output of the Staging DB destination to redirect rows that contain an error.
• Add a flat file destination to the data flow and name it Invalid Rows. Then configure the Invalid Rows destination as follows:
• Create a new connection manager named Invalid Payment Records for a delimited file named D:\10777A\ETL\InvalidPaymentsLog.csv.
• Do not overwrite data in the text file if it already exists.
• Map all columns, including ErrorCode and ErrorColumn, to fields in the text file.

Task 2: View invalid data flow rows
• Run the package in debug mode and note that it succeeds. When execution is complete, stop debugging and close SQL Server Data Tools.
• Use Excel to view the contents of the InvalidPaymentsLog.csv file in the D:\10777A\ETL folder and note the rows that contain invalid values.

Results: After this exercise, you should have a package that includes a data flow where rows containing errors are redirected to a text file.


Module Review and Takeaways

Review Questions
1. You have executed a package in SQL Server Data Tools, and a task failed unexpectedly. Where can you review information about the package execution to help determine the cause of the problem?
2. You have configured logging with the SSIS log provider for SQL Server. Where can you view the logged event information?
3. You suspect a data flow is failing because some values in a source text file are too long for the columns in the destination. How can you handle this problem?


Module 7
Implementing an Incremental ETL Process

Contents:
Lesson 1: Introduction to Incremental ETL 7-3
Lesson 2: Extracting Modified Data 7-9
Lab 7A: Extracting Modified Data 7-31
Lesson 3: Loading Modified Data 7-54
Lab 7B: Loading Incremental Changes 7-73


Module Overview

A data warehousing solution generally needs to refresh the data warehouse at regular intervals to reflect new and modified data in the source systems on which the data warehouse is based. It is important to implement a refresh process that has a minimal impact on network and processing resources, and which enables you to retain historical data in the data warehouse while reflecting changes and additions to business entities in transactional systems.

This module describes the techniques you can use to implement an incremental data warehouse refresh process. After completing this module, you will be able to:
• Describe the considerations for implementing an incremental extract, transform, and load (ETL) solution.
• Use multiple techniques to extract new and modified data from source systems.
• Use multiple techniques to insert new and modified data into a data warehouse.


Lesson 1
Introduction to Incremental ETL

Most data warehousing solutions use an incremental ETL process to refresh the data warehouse with new and modified data from source systems. Implementing an effective incremental ETL process presents a number of challenges, for which common solution designs have been identified. By understanding some of the key features of an incremental ETL process, you can design an effective data warehouse refresh solution that meets your analytical and reporting needs while maximizing performance and resource efficiency.

After completing this lesson, you will be able to:
• Describe a typical data warehouse refresh scenario.
• Describe considerations for implementing an incremental ETL process.
• Describe key features of slowly changing dimensions.


Overview of Data Warehouse Load Cycles

A typical data warehousing solution includes a regular refresh of the data in the data warehouse to reflect new and modified data in the source systems on which it is based. For each load cycle, data is extracted from the source systems, usually to a staging area, and then loaded into the data warehouse. The frequency of the refresh process depends on how up to date the analytical and reporting data in the data warehouse needs to be, and in some cases you might choose to implement a different refresh cycle for each group of related data sources.

In some rare cases, it can be appropriate to completely replace the data warehouse data with fresh data from the data sources during each load cycle. However, a more common approach is to use an incremental ETL process to extract only rows that have been inserted or modified in the source systems, and then insert or update rows in the data warehouse to reflect the extracted data. This reduces the volume of data being transferred, minimizing the effect of the ETL process on network bandwidth and processing resources.


Considerations for Incremental ETL

When planning an incremental ETL process, there are a number of factors that you should consider.

Data Modifications to Be Tracked
One of the primary considerations for planning an incremental ETL process is to identify the kinds of data modifications that you need to track in source systems. Specifically, you should consider the following kinds of modifications:
• Inserts – for example, new sales transactions or the registration of a new customer.
• Updates – for example, a change of telephone number or address for a customer.
• Deletes – for example, the removal of a discontinued product from a product catalog.

Most data warehousing solutions include inserted and updated records in refresh cycles. However, you must give special consideration to deleted records because propagating deletions to the data warehouse results in the loss of historical reporting data.


Load Order
A data warehouse can include dependencies between tables. For example, rows in a fact table generally include foreign key references to rows in dimension tables, and some dimension tables include foreign key references to subdimension tables. For this reason, you should generally design your incremental ETL process to load subdimension tables first, then dimension tables, and finally fact tables. If this is not possible, you can load inferred members as minimal placeholder records for dimension members that are referenced by other tables and which will be loaded at a later time.

Note: Inferred members are usually used to create a placeholder record for a missing dimension member referenced by a fact record. For example, the data to be loaded into a fact table for sales orders might include a reference to a product for which no dimension record has yet been loaded. In this case, you can create an inferred member for the product that contains the required key values but null columns for all other attributes, and then update the inferred member record on a subsequent load of product data. (A minimal Transact-SQL sketch of this technique is shown after this topic.)

Dimension Keys
The keys used to identify rows in dimension tables are usually independent from the business keys used in source systems, and are referred to as surrogate keys. When loading data into a data warehouse, you need to consider how you will identify the appropriate dimension key value to use in the following scenarios:
• Determining whether or not a staged record represents a new dimension member or an update to an existing dimension member, and if it is an update, applying the update to the appropriate dimension record.
• Determining the appropriate foreign key values to use in a fact table that references a dimension table, or in a dimension table that references a subdimension table.

In many data warehouse designs, the source business key for each dimension member is retained as an alternative key in the data warehouse, and can therefore be used to look up the corresponding dimension key. In other cases, dimension members must be found by matching a unique combination of multiple columns.

Updating Dimension Members
When refreshing dimension tables, you must consider whether changes to individual dimension attributes will have a material effect on historical reporting and analysis. Dimension attributes can be categorized as one of three kinds:
• Fixed – the attribute value cannot be changed. For example, you might enforce a rule that prevents changes to a product name after it has been loaded into the dimension table.
• Changing – the attribute value can change without affecting historical reporting and analytics. For example, a customer’s telephone number might change, but it is unlikely that any historical business reporting or analytics will aggregate measures by telephone number, so the change can be made without the need to retain the previous telephone number.
• Historical – the attribute value can change, but the previous value must be retained for historical reporting and analysis. For example, a customer might move from Edinburgh to New York, but reports and analysis must associate all sales to that customer that occurred before the move with Edinburgh, and all sales after the move with New York.
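The following Transact-SQL is a minimal sketch of creating and later completing an inferred member. The dw.DimProduct table, its ProductAltKey and IsInferred columns, and the variable values are illustrative assumptions rather than objects from the course lab environment.

-- Create an inferred member for a product key referenced by a fact row
-- when no product dimension record has been loaded yet.
DECLARE @ProductAltKey int = 1234;                 -- business key from the staged fact data (assumed)
DECLARE @ProductName nvarchar(50) = N'Road Frame'; -- becomes known only on a later dimension load (assumed)

IF NOT EXISTS (SELECT 1 FROM dw.DimProduct WHERE ProductAltKey = @ProductAltKey)
BEGIN
    INSERT INTO dw.DimProduct (ProductAltKey, ProductName, IsInferred)
    VALUES (@ProductAltKey, NULL, 1);              -- only the key is known; other attributes remain NULL
END;

-- On a subsequent dimension load, the placeholder is completed and the flag cleared.
UPDATE dw.DimProduct
SET ProductName = @ProductName,
    IsInferred = 0
WHERE ProductAltKey = @ProductAltKey
  AND IsInferred = 1;

Using an IsInferred flag in this way makes it easy to distinguish placeholder records from fully loaded dimension members during later refresh cycles.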


Updating Fact Records
When refreshing the data warehouse, you must consider whether you will allow updates to fact records. Often, your data warehouse design will only contain fact records that are complete, so no incomplete records will be loaded. However, in some cases you might want to include a fact record in the data warehouse that is incomplete and will be updated during a later refresh cycle.

For example, you might choose to include a fact record for a sales order where the sale has been completed, but the item has not yet been delivered. If the record includes a column for the delivery date, you might initially store a null value in this column, and then update the record with the delivery date during a later refresh after the order has been delivered.

While some data warehousing professionals allow updates to the existing record in the fact table, other practitioners prefer to support changes to fact records by deleting the existing fact record and inserting a new fact record. In most cases, the delete operation is actually a logical deletion that is achieved by setting a bit value on a column that indicates whether the record is active or not, rather than actually deleting the record from the table.
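As a concrete illustration, the following Transact-SQL sketches both approaches for a sales order fact table. The dw.FactSalesOrder table, its columns, and the variables holding staged values are illustrative assumptions, not objects from the course labs.

DECLARE @SalesOrderAltKey int = 5001,       -- business key of the order (assumed)
        @DeliveryDateKey int = 20120515,    -- surrogate key of the delivery date (assumed)
        @ProductKey int = 10, @CustomerKey int = 20,
        @OrderDateKey int = 20120501, @SalesAmount money = 149.99;

-- Approach 1: update the incomplete fact record in place when the delivery date becomes known.
UPDATE dw.FactSalesOrder
SET DeliveryDateKey = @DeliveryDateKey
WHERE SalesOrderAltKey = @SalesOrderAltKey
  AND DeliveryDateKey IS NULL;

-- Approach 2: logically delete the existing record and insert a replacement version.
UPDATE dw.FactSalesOrder
SET IsActive = 0
WHERE SalesOrderAltKey = @SalesOrderAltKey
  AND IsActive = 1;

INSERT INTO dw.FactSalesOrder
    (SalesOrderAltKey, ProductKey, CustomerKey, OrderDateKey, DeliveryDateKey, SalesAmount, IsActive)
VALUES
    (@SalesOrderAltKey, @ProductKey, @CustomerKey, @OrderDateKey, @DeliveryDateKey, @SalesAmount, 1);

With the second approach, reports and analyses typically filter on IsActive = 1 so that only the current version of each fact record is aggregated.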


Slowly Changing Dimensions

When updating dimension tables, you need to apply the appropriate logic depending on the kind of dimension attributes being modified. Changes to fixed attributes are not supported, and so any updates to columns containing fixed attributes must either be discarded or cause an error. However, modifications to changing and historical attributes must be supported, and a widely used set of techniques for handling these changes has been identified.

Slowly changing dimensions are dimensions that change over time, while retaining historical attributes for reporting and analysis. Changes to dimension members are usually categorized as the following types:
• Type 1 – Changing attributes are updated in the existing record and the previous value is lost.
• Type 2 – Historical attribute changes result in a new record in the dimension table, representing a new version of the dimension member. A column is used to indicate which version of the dimension member is the current one (either with a flag value to indicate the current record, or by storing the date and time when each version becomes effective). This technique enables you to store a complete history of all versions of the dimension member.
• Type 3 – Historical attribute changes are stored in the existing record, in which the original value is also retained and a column indicates the date on which the new value becomes effective. This technique enables you to store the original and latest versions of the dimension member.
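To make the Type 1 and Type 2 techniques concrete, the following Transact-SQL is a minimal sketch for a customer dimension. The dw.DimCustomer and stg.Customers tables, and columns such as CurrentFlag, StartDate, and EndDate, are illustrative assumptions rather than objects from the course labs.

-- Type 1: overwrite a changing attribute (Phone) in the current record; no history is kept.
UPDATE d
SET d.Phone = s.Phone
FROM dw.DimCustomer AS d
JOIN stg.Customers AS s ON d.CustomerAltKey = s.CustomerID
WHERE d.CurrentFlag = 1
  AND d.Phone <> s.Phone;

-- Type 2: when a historical attribute (City) changes, expire the current version
-- and insert a new version of the dimension member.
SELECT s.CustomerID, s.CustomerName, s.Phone, s.City
INTO #ChangedCustomers
FROM stg.Customers AS s
JOIN dw.DimCustomer AS d
  ON d.CustomerAltKey = s.CustomerID AND d.CurrentFlag = 1
WHERE d.City <> s.City;

UPDATE d
SET d.CurrentFlag = 0,
    d.EndDate = GETDATE()
FROM dw.DimCustomer AS d
JOIN #ChangedCustomers AS c ON d.CustomerAltKey = c.CustomerID
WHERE d.CurrentFlag = 1;

INSERT INTO dw.DimCustomer (CustomerAltKey, CustomerName, Phone, City, StartDate, EndDate, CurrentFlag)
SELECT CustomerID, CustomerName, Phone, City, GETDATE(), NULL, 1
FROM #ChangedCustomers;

In this sketch the surrogate key of dw.DimCustomer is assumed to be an identity column, so each Type 2 insert produces a new key while the business key (CustomerAltKey) is shared by all versions of the member.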


Lesson 2
Extracting Modified Data

An incremental ETL process starts by extracting data from source systems. To avoid including unnecessary rows of data in the extraction, the solution must be able to identify records that have been inserted or modified since the last refresh cycle, and limit the extraction to those records.

This lesson describes a number of techniques for identifying and extracting modified records. After completing this lesson, you will be able to:
• Describe the common options for extracting modified records.
• Implement an ETL solution that extracts modified rows based on a DateTime column.
• Configure the Change Data Capture feature in the SQL Server Enterprise Edition database engine.
• Implement an ETL solution that extracts modified rows by using the Change Data Capture feature.
• Use the CDC Control Task and data flow components to extract Change Data Capture records.
• Configure the Change Tracking feature in the Microsoft® SQL Server® database engine.
• Implement an ETL solution that extracts modified rows by using Change Tracking.


Options for Extracting Modified Data

There are a number of commonly used techniques to extract data as part of a data warehouse refresh cycle.

Extract All Records
The simplest solution is to extract all source records and load them to a staging area, before using them to refresh the data warehouse. This technique works with all data sources and ensures that the refresh cycle includes all inserted, updated, and deleted source records. However, this technique can require the transfer and storage of large volumes of data, making it inefficient and impractical for many enterprise data warehousing solutions.

Store a Primary Key and Checksum
Another solution is to store the primary key of all previously extracted rows in a staging table, along with a checksum value that is calculated from the source columns in which you want to detect changes. For each refresh cycle, your ETL process can extract source records for which the primary key is not recorded in the table of previous extractions, as well as rows where the checksum value calculated from the columns in the source record does not match the checksum recorded during the previous extraction. Additionally, any primary keys recorded in the staging table that no longer exist in the source represent deleted records. This technique limits the extracted records to those that have been inserted or modified since the previous refresh cycle, but for large numbers of rows the overhead of calculating a checksum to compare with each row can significantly increase processing requirements.
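The following Transact-SQL is a minimal sketch of the primary key and checksum technique. The src.Customers and stg.ExtractedCustomers tables, and the columns included in the checksum, are illustrative assumptions; SQL Server's built-in CHECKSUM function is used here, although a hash such as HASHBYTES can reduce the chance of undetected changes caused by checksum collisions.

-- Rows that are new, or whose tracked columns no longer match the checksum recorded
-- during the previous extraction, are candidates for staging.
SELECT s.CustomerID, s.CustomerName, s.Phone, s.City
FROM src.Customers AS s
LEFT JOIN stg.ExtractedCustomers AS e ON e.CustomerID = s.CustomerID
WHERE e.CustomerID IS NULL                                          -- inserted since the last extraction
   OR e.RowChecksum <> CHECKSUM(s.CustomerName, s.Phone, s.City);   -- modified since the last extraction

-- Primary keys recorded previously that no longer exist in the source represent deletions.
SELECT e.CustomerID
FROM stg.ExtractedCustomers AS e
LEFT JOIN src.Customers AS s ON s.CustomerID = e.CustomerID
WHERE s.CustomerID IS NULL;

After the extraction completes, the stg.ExtractedCustomers table would be refreshed with the current keys and checksum values so that the next refresh cycle compares against the latest extraction.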


Use a Datetime Column As a “High Water Mark”
Tables in data sources often include a column to record the date and time of the initial creation and last modification of each record. If your data source includes such a column, you can log the date and time of each refresh cycle and compare it with the last modified value in the source data records to identify records that have been inserted or modified. This technique is commonly referred to as using the date and time of each extraction as a “high water mark” because of its similarity to the way a tide or flood can leave an indication of the highest water level.

Use Change Data Capture
Change Data Capture (CDC) is a feature of SQL Server Enterprise Edition that uses transaction log sequence numbers (LSNs) to identify insert, update, and delete operations that have occurred within a specified time period. To use CDC, your ETL process must store the date and time or LSN of the last extraction as described for the high water mark technique; but it is not necessary for tables in the source database to include a column that indicates the date and time of the last modification.

CDC is an appropriate technique when:
• The data source is a database in the Enterprise edition of SQL Server 2008 or later.
• You need to extract a complete history that includes each version of a record that has been modified multiple times.

Use Change Tracking
Change Tracking is another SQL Server technology that you can use to record the primary key of records that have been modified, and extract records based on a version number that is incremented each time a row is inserted, updated, or deleted. To use Change Tracking, you must log the version that is extracted, and then compare the logged version number to the current version in order to identify modified records during the next extraction.

Change Tracking is an appropriate technique when:
• The data source is a SQL Server 2008 or later database.
• You need to extract the latest version of a row that has been modified since the previous extraction, but you do not need a full history of all interim versions of the record.

Note: Considerations for handling deleted records
If you need to propagate record deletions in source systems to the data warehouse, you should consider the following guidelines.
• You need to be able to identify which records have been deleted since the previous extraction. One way to accomplish this is to store the keys of all previously extracted records in the staging area and compare them to the values in the source database as part of the extraction process. Alternatively, Change Data Capture and Change Tracking both provide information about deletions, enabling you to identify deleted records without maintaining the keys of previously extracted records.
• If the source database supports logical deletes by updating a Boolean column to indicate that the record is deleted, then deletions are conceptually just a special form of update. You can implement custom logic in the extraction process to treat data updates and logical deletions separately if necessary.
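The later topics in this lesson cover configuring these features in detail. As a brief orientation, the following Transact-SQL is a minimal sketch of enabling each feature; the SourceDB database and dbo.Customers table are illustrative assumptions, and appropriate permissions (for example, sysadmin membership to enable CDC on a database) are required.

-- Enable Change Data Capture on a source database and table (SQL Server Enterprise Edition).
USE SourceDB;
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'Customers',
    @role_name = NULL;      -- NULL means no gating role is used to restrict access to change data

-- Enable Change Tracking on a source database and table.
ALTER DATABASE SourceDB
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Customers
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);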


Extracting Rows Based on a Datetime Column

If your data source includes a column to indicate the date and time each record was inserted or modified, you can use the high water mark technique to extract modified records. The high-level steps your ETL process must perform to use the high water mark technique are:
1. Note the current time.
2. Retrieve the date and time of the previous extraction from a log table.
3. Extract records where the modified date column is later than the last extraction time, but before or equal to the current time you noted in step 1 (this disregards any insert or update operations that have occurred since the start of the extraction process).
4. Update the last extraction date and time in the log with the time you noted in step 1.
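Expressed as Transact-SQL, the pattern looks like the following sketch. It assumes a stg.ExtractLog table and src.Products and stg.Products tables similar to those used in the demonstration that follows; the DataSource column and the product column names are assumptions, and in an SSIS package the individual statements are typically split across Execute SQL tasks and a data flow, with ? parameters mapped to package variables.

DECLARE @CurrentTime datetime = GETDATE();          -- step 1: note the current time
DECLARE @LastExtractTime datetime;

SELECT @LastExtractTime = MAX(LastExtractTime)      -- step 2: read the previous high water mark
FROM stg.ExtractLog
WHERE DataSource = 'Products';

INSERT INTO stg.Products (ProductID, ProductName, Price)   -- step 3: stage only modified rows
SELECT ProductID, ProductName, Price
FROM src.Products
WHERE LastModified > @LastExtractTime
  AND LastModified <= @CurrentTime;

UPDATE stg.ExtractLog                                -- step 4: advance the high water mark
SET LastExtractTime = @CurrentTime
WHERE DataSource = 'Products';

Capturing the current time before the extraction, and using it as the upper bound in step 3 and the new log value in step 4, ensures that rows modified while the extraction is running are picked up by the next refresh cycle rather than being skipped.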


Demonstration: Using a Datetime Column

Task 1: Use a Datetime column to extract modified data
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. In the D:\10777A\Demofiles\Mod07 folder, run Setup.cmd as Administrator, and then double-click Modify Products.sql to open the query file in SQL Server Management Studio. Each time you are prompted, connect to the localhost instance of the database engine by using Windows authentication. Do not execute the query yet.
3. In Object Explorer, expand Databases, expand DemoDW, and expand Tables. Then note that the database includes tables in three schemas (src, stg, and dw) to represent the data sources, staging database, and data warehouse in an ETL solution.
4. Right-click each of the following tables and click Select Top 1000 Rows:
   • stg.Products – this table is used for staging product records during the ETL process.
   • stg.ExtractLog – this table logs the last extraction date for each source system.
   • src.Products – this table contains the source data for products, including a LastModified column that records when each row was last modified.
5. In the D:\10777A\Demofiles\Mod07 folder, double-click IncrementalETL.sln to open the solution in SQL Server Data Tools. Then in Solution Explorer, double-click the Extract Products.dtsx SSIS package.
6. On the SSIS menu, click Variables, and note that the package contains two user variables named CurrentTime and LastExtractTime.
7. On the control flow surface, double-click Get Current Time, and note that the expression in this task sets the CurrentTime user variable to the current date and time. Then click Cancel.


8. Double-click Get Last Extract Time, and note the following configuration settings. Then click Cancel.
   • On the General tab, the ResultSet property is set to return a single row, and the SQLStatement property contains a query to retrieve the maximum LastExtractTime value for the products source in the stg.ExtractLog table.
   • On the Result Set tab, the LastExtractTime value in the query results row is mapped to the LastExtractTime user variable.
9. Double-click Stage Products to view the data flow surface, and then double-click Products Source and note that the SQL command used to extract products data includes a WHERE clause that filters the query results. Then click Parameters, and note that the parameters in the Transact-SQL query are mapped to the LastExtractTime and CurrentTime variables.
10. Click Cancel in all dialog boxes and then click the Control Flow tab.
11. On the control flow surface, double-click Update Last Extract Time, and note the following configuration settings. Then click Cancel.
   • On the General tab, the SQLStatement property contains a Transact-SQL UPDATE statement that updates the LastExtractTime in the stg.ExtractLog table.
   • On the Parameter Mapping tab, the CurrentTime user variable is mapped to the parameter in the Transact-SQL statement.
12. In SQL Server Management Studio, execute the Modify Products.sql script.
13. In SQL Server Data Tools, on the Debug menu, click Start Debugging. Then, when package execution is complete, on the Debug menu, click Stop Debugging.
14. In SQL Server Management Studio, right-click each of the following tables and click Select Top 1000 Rows:
   • stg.ExtractLog – Note that the LastExtractTime for the Products data source has been updated.
   • stg.Products – Note the rows that have been extracted from the src.Products table.
15. Close SQL Server Management Studio and SQL Server Data Tools without saving any changes.


Change Data Capture

The CDC feature in SQL Server Enterprise Edition provides a number of functions and stored procedures that you can use to identify modified rows. To use CDC, perform the following high-level steps:

1. Enable CDC in the data source. You must enable CDC for the database, and for each table in the database for which you want to monitor changes.

The following Transact-SQL code sample shows how to use the sp_cdc_enable_db and sp_cdc_enable_table system stored procedures to enable CDC in a database and monitor data modifications in the dbo.Customers table.

EXEC sys.sp_cdc_enable_db
EXEC sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Customers',
@role_name = NULL, @supports_net_changes = 1

2. In the ETL process used to extract the data, map start and end times (based on the logged date and time of the previous extraction and the current date and time) to log sequence numbers.

The following Transact-SQL code sample shows how to use the fn_cdc_map_time_to_lsn system function to map Transact-SQL variables named @StartDate and @EndDate to log sequence numbers.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than', @StartDate)
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @EndDate)

3. Include logic to handle errors if either of the log sequence numbers is null. This can occur if no changes have occurred in the database during the specified time period.

The following Transact-SQL code sample shows how to check for null log sequence numbers.

IF (@from_lsn IS NULL) OR (@to_lsn IS NULL)
-- There may have been no transactions in the timeframe


4. Extract records that have been modified between the log sequence numbers. When you enable CDC for a table, SQL Server generates table-specific system functions that you can use to extract data modifications to that table. The following Transact-SQL code sample shows how to use the fn_cdc_get_net_changes_dbo_Customers system function to retrieve rows that have been modified in the dbo.Customers table. (A sketch that combines steps 2 through 4 appears after the note below.)

SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Customers(@from_lsn, @to_lsn, 'all')

Note: For detailed information about the syntax of the CDC system functions and stored procedures used in these code samples, see SQL Server Books Online.
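Putting steps 2 through 4 together, the following Transact-SQL sketch shows one way the extraction could look end to end for the dbo.Customers example used above. How you obtain @StartDate and @EndDate (for example, from an extraction log table), and how you handle the "no activity" case, are design choices for your own ETL process; the values shown here are placeholders.

DECLARE @StartDate datetime = '20120101';   -- typically read from an extraction log
DECLARE @EndDate datetime = GETDATE();
DECLARE @from_lsn binary(10), @to_lsn binary(10);

SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than', @StartDate);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @EndDate);

IF (@from_lsn IS NULL) OR (@to_lsn IS NULL)
BEGIN
    -- There may have been no transactions in the timeframe; skip the extraction
    -- (or return an empty rowset, as the lab stored procedure does).
    PRINT 'No changes were logged in the specified time period.';
END
ELSE
BEGIN
    -- Retrieve the net changes for the monitored table.
    SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Customers(@from_lsn, @to_lsn, 'all');
END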


Demonstration: Using Change Data Capture

Task 1: Enable Change Data Capture
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. If you did not complete the previous demonstration, in the D:\10777A\Demofiles\Mod07 folder, run Setup.cmd as Administrator.
3. Double-click Using CDC.sql to open the query file in SQL Server Management Studio. Each time you are prompted, connect to the localhost instance of the database engine by using Windows authentication. Do not execute the query yet.
4. In Object Explorer, expand Databases, expand DemoDW, and expand Tables. Then right-click the src.Customers table and click Select Top 1000 Rows. This table contains source data for customers.
5. On the Using CDC.sql query tab, select the Transact-SQL code under the comment Enable CDC, and then click Execute. This enables CDC in the DemoDW database, and starts logging modifications to data in the src.Customers table.
6. Select the Transact-SQL code under the comment Select all changed customer records since the last extraction, and then click Execute. This code uses CDC functions to map dates to log sequence numbers, and retrieves records in the src.Customers table that have been modified between the last logged extraction in the stg.ExtractLog table and the current time. There are no changed records because no modifications have been made since CDC was enabled.


Task 2: Use Change Data Capture to extract modified data
1. Select the Transact-SQL code under the comment Insert a new customer, and then click Execute. This code inserts a new customer record.
2. Select the Transact-SQL code under the comment Make a change to a customer, and then click Execute. This code updates a customer record.
3. Select the Transact-SQL code under the comment Now see the net changes, and then click Execute. This code uses CDC functions to map dates to log sequence numbers, and retrieves records in the src.Customers table that have been modified between the last logged extraction in the stg.ExtractLog table and the current time. Two records are returned.
4. Wait ten seconds. Then select the Transact-SQL code under the comment Check for changes in an interval with no database activity, and then click Execute. Because there has been no activity in the database during the specified time interval, one of the log sequence numbers is null. This demonstrates the importance of checking for a null log sequence number value when using CDC.
5. Close SQL Server Management Studio without saving any changes.


Extracting Data with Change Data Capture

To extract data from a CDC-enabled table in an SSIS-based ETL solution, you can create a custom control flow that uses the same principles as the "high water mark" technique described earlier. The general approach is to establish the range of records to be extracted based on a minimum and maximum log sequence number (LSN), extract those records, and log the end point of the extracted range to be used as the starting point for the next extraction.

You can choose to log the high water mark as an LSN value or as a datetime value that can be mapped to an LSN by using the fn_cdc_map_time_to_lsn system function.

The following procedure describes one possible way to create an SSIS control flow for extracting CDC data:

1. Use an Expression task to assign the current time to a datetime variable.
2. Use a SQL Command task to retrieve the datetime value that was logged after the previous extraction.
3. Use a Data Flow task in which a source uses the fn_cdc_map_time_to_lsn system function to map the current time and previously extracted time to the corresponding LSNs, and then uses the cdc.fn_cdc_get_net_changes_<capture_instance> function to extract the data that was modified between those LSNs. Optionally, you can use the __$operation column in the resulting dataset to split the records into different data flow paths for inserts, updates, and deletes. (A sketch of a source query that follows this pattern appears after this list.)
4. Use a SQL Command task to update the logged datetime value to be used as the starting point for the next extraction.
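The following Transact-SQL sketch shows the shape of query that the data flow source in step 3 might use, with the two ? parameters mapped to the previously extracted time and the current time. It reuses the dbo_Customers capture instance from the earlier samples as an assumption. Depending on the connection type, parameter metadata for a multi-statement command like this may not be derived automatically; in that case it is often simpler to wrap the same logic in a stored procedure and call it from the source with EXEC ... ?, ?, which is the approach used by the lab later in this module.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than', ?);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', ?);

-- The __$operation column in the result (1 = delete, 2 = insert, 4 = update)
-- can be used downstream to split rows into separate data flow paths.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Customers(@from_lsn, @to_lsn, 'all');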


The CDC Control Task and Data Flow Components

To make it easier to implement packages that extract data from CDC-enabled sources, SSIS includes CDC components that abstract the underlying CDC functionality. The CDC components included in SSIS are:

• CDC Control Task. A control flow task that you can use to manage CDC state, which provides a straightforward way to track CDC data extraction status.
• CDC Source. A data flow source that uses the CDC state logged by the CDC Control task to extract a range of modified records from a CDC-enabled data source.
• CDC Splitter. A data flow transformation that splits output rows from a CDC Source into separate data flow paths for inserts, updates, and deletes.

All of the CDC components in SSIS require the use of ADO.NET connection managers to the CDC-enabled data source and to the database where CDC state is to be stored.

Performing an Initial Extraction with the CDC Control Task

When using the CDC Control Task to manage extractions from a CDC-enabled data source, it is recommended practice to create a package that will be executed once to perform the initial extraction. This package should contain the following control flow:

1. A CDC Control Task configured to perform the Mark initial load start operation. This writes an encoded value including the starting LSN to a package variable, and optionally persists it to a state tracking table in a database.
2. A data flow that extracts all rows from the source and loads them into a destination – typically a staging table. This data flow does not require CDC-specific components.
3. A second CDC Control Task configured to perform the Mark initial load end operation. This writes an encoded value including the ending LSN to a package variable, and optionally persists it to a state tracking table in a database.


Performing Incremental Extractions with the CDC Control Task

After the initial extraction has been performed, subsequent extractions should use an SSIS package with the following control flow:

1. A CDC Control Task configured to perform the Get processing range operation. This establishes the range of records to be extracted and writes an encoded representation to a package variable, which can also be persisted to a state tracking table in a database.
2. A data flow that uses a CDC Source, which uses the encoded value in the CDC state package variable to extract the modified rows from the data source.
3. Optionally, the data flow can include a CDC Splitter task, which uses the __$operation column in the extracted rowset to redirect inserts, updates, and deletes to separate data flow paths. These can then be connected to appropriate destinations for staging tables.
4. A second CDC Control Task configured to perform the Mark processed range operation. This writes an encoded value including the ending LSN to a package variable, and optionally persists it to a state tracking table in a database. This value is then used to establish the starting point for the next extraction.


Demonstration: Using CDC Components

Task 1: Use the CDC Control Task to perform an initial extraction
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. If you did not complete the previous demonstration, in the D:\10777A\Demofiles\Mod07 folder, run Setup.cmd as Administrator. Then double-click Using CDC.sql to open the query file in SQL Server Management Studio, connecting to the localhost instance of the database engine by using Windows authentication when prompted, and execute the script.
3. In SQL Server Management Studio, open the CDC Components.sql script file in the D:\10777A\Demofiles\Mod07 folder and note that it enables CDC for the src.Shippers table. Then execute the script.
4. In Object Explorer, right-click each of the following tables in the DemoDW database, and click Select Top 1000 Rows to view their contents. Then minimize SQL Server Management Studio.
   • src.Shippers – This table should contain four records.
   • stg.ShipperDeletes – This table should be empty.
   • stg.ShipperInserts – This table should be empty.
   • stg.ShipperUpdates – This table should be empty.
5. In the D:\10777A\Demofiles\Mod07 folder, double-click IncrementalETL.sln to open the solution in SQL Server Data Tools. Then in Solution Explorer, double-click the Extract Initial Shippers.dtsx SSIS package. Note that the CDC Control tasks in the control flow contain errors, which you will resolve.


6. Double-click the Mark Initial Load Start CDC Control task, and in its editor, set the following properties. Then click OK.
   • SQL Server CDC database ADO.NET connection manager: localhost DemoDW ADO NET
   • CDC control operation: Mark initial load start
   • Variable containing the CDC state: Click New and create a new variable named CDC_State
   • Automatically store state in a database table: Selected
   • Connection manager for the database where the state is stored: localhost DemoDW ADO NET
   • Table to use for storing state: Click New, and then click Run to create the cdc_states table
   • State name: CDC_State
7. Double-click the Extract All Shippers data flow task, and on the Data Flow surface, note that an ADO.NET source is used to extract all rows from the src.Shippers table, and an ADO.NET destination is used to load the extracted rows into the stg.ShipperInserts table.
8. On the Control Flow tab, double-click the Mark Initial Load End CDC Control task and set the following properties. Then click OK.
   • SQL Server CDC database ADO.NET connection manager: localhost DemoDW ADO NET
   • CDC control operation: Mark initial load end
   • Variable containing the CDC state: User::CDC_State (the variable you created earlier)
   • Automatically store state in a database table: Selected
   • Connection manager for the database where the state is stored: localhost DemoDW ADO NET
   • Table to use for storing state: [dbo].[cdc_states] (the table you created earlier)
   • State name: CDC_State
9. On the Debug menu, click Start Debugging and wait for the package execution to complete. Then on the Debug menu, click Stop Debugging.
10. In SQL Server Management Studio, right-click the Tables folder for the DemoDW database and click Refresh. Note that a table named dbo.cdc_states has been created.
11. Right-click dbo.cdc_states and click Select Top 1000 Rows to view the logged CDC_State value (which should begin "ILEND…").
12. Right-click stg.ShipperInserts and click Select Top 1000 Rows to verify that the initial set of shippers has been extracted.


Task 2: Use the CDC Control Task and data flow components to extract changes
1. In SQL Server Management Studio, open and execute Update Shippers.sql from the D:\10777A\Demofiles\Mod07 folder, noting that it truncates the stg.ShipperInserts table and then performs an INSERT, an UPDATE, and a DELETE operation on the src.Shippers table.
2. In SQL Server Data Tools, in Solution Explorer, double-click Extract Changed Shippers.dtsx.
3. On the Control Flow tab, double-click the Get Processing Range CDC Control task and note that it gets the processing range and stores it in the CDC_State variable and the cdc_states table. Then click Cancel.
4. Double-click the Extract Modified Shippers data flow task and on its Data Flow surface, view the properties of the Shipper CDC Records CDC Source component, noting that it extracts modified records based on the range stored in the CDC_State variable.
5. Note that the CDC Splitter transformation has three outputs, one for inserts, one for updates, and one for deletes. Each of these is connected to an ADO.NET destination that loads the records into the stg.ShipperInserts, stg.ShipperUpdates, and stg.ShipperDeletes tables respectively.
6. On the Control Flow tab, double-click the Mark Processed Range CDC Control task and note that it updates the CDC_State variable and the cdc_states table when the extraction is complete. Then click Cancel.
7. On the Debug menu, click Start Debugging. When execution is complete, double-click the Extract Modified Shippers data flow task and note the number of rows transferred. If no rows were transferred, stop debugging and then re-run the package. When three rows have been transferred (one to each output of the CDC Splitter transformation), stop debugging and close SQL Server Data Tools.
8. In SQL Server Management Studio, right-click each of the following tables and click Select Top 1000 Rows to view their contents. Each table should contain a single row.
   • stg.ShipperDeletes
   • stg.ShipperInserts
   • stg.ShipperUpdates
9. Close SQL Server Management Studio.


Change Tracking

The Change Tracking feature in SQL Server provides a number of functions and stored procedures that you can use to identify modified rows. To use Change Tracking, perform the following high-level steps:

1. Enable Change Tracking in the data source. You must enable Change Tracking for the database, and for each table in the database for which you want to monitor changes.

The following Transact-SQL code sample shows how to enable Change Tracking in a database named Sales and monitor data modifications in the Salespeople table. Note that you can choose to track which columns were modified, but the change table only contains the primary key of each row that was modified – not the modified column values.

ALTER DATABASE Sales
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE Salespeople
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF)

2. For the initial data extraction, record the current version (which by default is 0), and extract all rows in the source table. Then log the current version as the last extracted version.

The following Transact-SQL code sample shows how to use the CHANGE_TRACKING_CURRENT_VERSION system function to retrieve the current version, extract the initial data, and assign the current version to a variable so it can be stored as the last extracted version.

SET @CurrentVersion = CHANGE_TRACKING_CURRENT_VERSION();
SELECT * FROM Salespeople
SET @LastExtractedVersion = @CurrentVersion


3. For subsequent refresh cycles, extract changes that have occurred between the last extracted version and the current version. The following Transact-SQL code sample shows how to determine the current version, use the CHANGETABLE system function in a query that joins the primary key of records in the change table to records in the source table, and update the last extracted version.

SET @CurrentVersion = CHANGE_TRACKING_CURRENT_VERSION();
SELECT * FROM CHANGETABLE(CHANGES Salespeople, @LastExtractedVersion) CT
INNER JOIN Salespeople s ON CT.SalespersonID = s.SalespersonID
SET @LastExtractedVersion = @CurrentVersion

A best practice when using Change Tracking is to enable snapshot isolation in the source database and use it to ensure that any modifications that occur during the extraction do not affect records that were modified between the version numbers that define the lower and upper bounds of your extraction range. A minimal sketch of this approach follows the note below.

Note: For detailed information about the syntax of the Change Tracking system functions used in these code samples, see SQL Server Books Online.
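The following Transact-SQL sketch illustrates the snapshot isolation best practice, reusing the Sales database and Salespeople table from the samples above. It is an outline of the pattern, not production code; how the last extracted version is stored and retrieved is left to your ETL process.

-- One-time setup: allow snapshot isolation in the source database.
ALTER DATABASE Sales SET ALLOW_SNAPSHOT_ISOLATION ON;

-- In the extraction batch:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;

DECLARE @CurrentVersion bigint = CHANGE_TRACKING_CURRENT_VERSION();
DECLARE @LastExtractedVersion bigint = 0;   -- typically read from an extraction log

-- Extract the rows changed since the last extracted version.
SELECT s.*
FROM CHANGETABLE(CHANGES Salespeople, @LastExtractedVersion) AS CT
INNER JOIN Salespeople AS s ON CT.SalespersonID = s.SalespersonID;

COMMIT TRANSACTION;
-- Log @CurrentVersion as the last extracted version for the next refresh cycle.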


Demonstration: Using Change Tracking

Task 1: Enable Change Tracking
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. If you did not complete the previous demonstration, in the D:\10777A\Demofiles\Mod07 folder, run Setup.cmd as Administrator. Then double-click Using CT.sql to open the query file in SQL Server Management Studio. Each time you are prompted, connect to the localhost instance of the database engine by using Windows authentication. Do not execute the query yet.
3. In Object Explorer, expand Databases, expand DemoDW, and expand Tables. Then right-click the src.Salespeople table and click Select Top 1000 Rows. This table contains source data for salespeople.
4. On the Using CT.sql query tab, select the Transact-SQL code under the comment Enable Change Tracking, and then click Execute. This enables Change Tracking in the DemoDW database, and starts logging changes to data in the src.Salespeople table.
5. Select the Transact-SQL code under the comment Obtain the initial data and log the current version number, and then click Execute. This code uses the CHANGE_TRACKING_CURRENT_VERSION function to determine the current version, and retrieves all records in the src.Salespeople table.


Task 2: Use Change Tracking to extract modified data
1. Select the Transact-SQL code under the comment Insert a new salesperson, and then click Execute. This code inserts a new salesperson record.
2. Select the Transact-SQL code under the comment Update a salesperson, and then click Execute. This code updates a salesperson record.
3. Select the Transact-SQL code under the comment Retrieve the changes between the last extracted and current versions, and then click Execute. This code retrieves the previously extracted version from the stg.ExtractLog table, determines the current version, uses the CHANGETABLE function to find records in the src.Salespeople table that have been modified since the last extracted version, and then updates the last extracted version in the stg.ExtractLog table.
4. Close SQL Server Management Studio without saving any changes.


Extracting Data with Change Tracking

You can create an SSIS package that uses the Change Tracking feature in SQL Server in a similar way to the high water mark technique described earlier in this lesson. The key difference is that rather than storing the date and time of the previous extraction, you must store the Change Tracking version number that was extracted, and update this with the current version during each extract operation.

A typical control flow for extracting data from a Change Tracking-enabled data source includes the following elements:

1. A SQL Command that retrieves the previously extracted version from a log table and assigns it to a variable.
2. A data flow that contains a source to extract records that have been modified since the previously extracted version and return the current version (see the sketch after this list).
3. A SQL Command that updates the logged version number with the current version.
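The following Transact-SQL sketch shows the shape of query that the data flow source in element 2 might use, based on the Salespeople example from earlier in this lesson. The ? parameter would be mapped to the variable that holds the previously extracted version, and the key column name is an assumption. Returning the current version as a column is only one option; the lab later in this module instead wraps this logic in a stored procedure with an OUTPUT parameter.

SELECT s.*, CHANGE_TRACKING_CURRENT_VERSION() AS CurrentVersion
FROM CHANGETABLE(CHANGES Salespeople, ?) AS CT
INNER JOIN Salespeople AS s ON CT.SalespersonID = s.SalespersonID;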


Lab Scenario

In this lab, you will continue to develop the Adventure Works ETL solution.

You have developed SSIS packages that extract data from various data sources and load it into a staging database. However, the current solution extracts all source records each time the ETL process is run, which results in unnecessary processing of records that have already been extracted and consumes a large amount of network bandwidth to transfer a large volume of data. To resolve this problem, you must modify the SSIS packages to extract only data that has been added or modified since the previous extraction.

Sales order records have a datetime column that is used to record the date and time of any inserts or modifications to the data. A tracking table in the staging database stores the date and time of the most recent extraction of sales data (initialized with an arbitrary value that pre-dates all of the records in the source database to ensure that the first extraction captures all records), and another BI developer has already implemented a control flow that extracts reseller sales records that have been inserted or modified since the associated extraction date in the tracking table. You plan to use the same approach to implement an incremental extraction of Internet sales data.

Customer data is stored in a SQL Server 2012 Enterprise database, which supports Change Data Capture (CDC). You plan to enable CDC in the source database and implement a custom control flow that uses the logged extraction date in the tracking table to identify the range of modified customer data to be extracted.

The Human Resources database, in which employee records are stored, also supports CDC. You plan to use the CDC Control task and CDC data flow components to implement SSIS control flows for the initial and subsequent incremental extractions of employee data.

Reseller data is stored in a SQL Server database in which you plan to enable Change Tracking. You will then create a custom control flow that tracks the extracted version and then uses the tracked version to ensure only the most recent changes to the data are extracted on subsequent executions.


Lab 7A: Extracting Modified Data

Exercise 1: Using a Datetime Column to Incrementally Extract Data

Scenario
The InternetSales and ResellerSales databases contain source data for your data warehouse, and the sales order records in these databases include a LastModified date column that is updated with the current date and time when a row is inserted or updated. You have decided to use this column to implement an incremental extraction solution that compares record modification times to a logged extraction date and time in the staging database, and restricts data extractions to rows that have been modified since the previous refresh cycle.

The main tasks for this exercise are as follows:
1. Prepare the lab environment.
2. View extraction data.
3. Examine an existing package.
4. Define variables for data extraction boundaries.
5. Add tasks to set variable values.
6. Modify a data source to filter the data being extracted.
7. Add a task to update the extraction log.
8. Test the package.


Task 1: Prepare the lab environment
• Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
• Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab07A\Starter folder as Administrator.

Task 2: View extraction data
• Start SQL Server Management Studio and connect to the localhost database instance by using Windows authentication.
• In the Staging database, view the contents of the dbo.ExtractLog table, noting that it contains the date and time of previous extractions from the InternetSales and ResellerSales databases. This is initially set to January 1st 1900.
• In the InternetSales database, view the contents of the dbo.SalesOrderHeader table and note that the LastModified column contains the date and time that each record was inserted or modified.

Task 3: Examine an existing package
• In the D:\10777A\Labfiles\Lab07A\Starter\Ex1 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools and examine the Extract Reseller Data.dtsx SSIS package.
• View the variables defined in the package, and note that they include two DateTime variables named CurrentTime and ResellerSalesLastExtract.
• Examine the tasks in the package and note the following:
   • The Get Current Time task uses the GETDATE() function to assign the current date and time to the CurrentTime variable.
   • The Get Last Extract Time task uses a Transact-SQL command to return a single row containing the LastExtract value for the ResellerSales data source from the dbo.ExtractLog table in the Staging database, and assigns the LastExtract value to the ResellerSalesLastExtract variable.
   • The Extract Reseller Sales data flow task includes a data source named Reseller Sales that uses a WHERE clause to extract records with a LastModified value between two parameterized values. The parameters are mapped to the ResellerSalesLastExtract and CurrentTime variables.
   • The Update Last Extract Time task updates the LastExtract column for the ResellerSales data source in the dbo.ExtractLog table with the CurrentTime variable.
• Run the Extract Reseller Data.dtsx package. When it has completed, stop debugging and in SQL Server Management Studio, verify that the ExtractLog table in the Staging database has been updated to reflect the most recent extraction from the ResellerSales source.


Task 4: Define variables for data extraction boundaries
• In SQL Server Data Tools, open the Extract Internet Sales Data.dtsx SSIS package, and add DateTime variables named CurrentTime and InternetSalesLastExtract to the package.
• Verify that the Variables pane includes variable definitions similar to these:

Task 5: Add tasks to set variable values
• Add an Expression task named Get Current Time to the Extract Customer Sales Data sequence in the control flow of the Extract Internet Sales Data.dtsx package, and configure it to apply the following expression.

@[User::CurrentTime] = GETDATE()

• Add an Execute SQL task named Get Last Extract Time to the Extract Customer Sales Data sequence in the control flow of the Extract Internet Sales Data.dtsx package, and set the following configuration properties:
   • On the General tab, set the Connection property to localhost.Staging.
   • On the General tab, set the SQLStatement to the following Transact-SQL query.

SELECT MAX(LastExtract) LastExtract
FROM ExtractLog
WHERE DataSource = 'InternetSales'

   • On the General tab, set the ResultSet property to Single row.
   • On the Result Set tab, add a result that maps the LastExtract column in the result set to the User::InternetSalesLastExtract variable.


• Connect the precedence constraints of the new tasks so that the first part of the control flow looks like the following.

Task 6: Modify a data source to filter the data being extracted
• On the Data Flow tab for the Extract Internet Sales data flow task, make the following changes to the Internet Sales source:
   • Add the following WHERE clause to the query in the SQL Command property.

WHERE LastModified > ?
AND LastModified <= ?


Task 7: Add a task to update the extraction log
• On the Parameter Mapping tab, add the following parameter mapping.

Variable Name        Direction    Data Type    Parameter Name    Parameter Size
User::CurrentTime    Input        DATE         0                 -1

• Connect the precedence constraint of the Extract Internet Sales task to the Update Last Extract Time task so that the completed control flow looks like the following.

Task 8: Test the package
• View the Extract Internet Sales data flow and then start debugging the package and note the number of rows transferred. When package execution is complete, stop debugging.
• In SQL Server Management Studio, view the contents of the dbo.ExtractLog table in the Staging database and verify that the LastExtract column for the InternetSales data source has been updated.
• View the contents of the dbo.InternetSales table and note the rows that have been extracted.
• In SQL Server Data Tools, debug the package again and verify that this time no rows are transferred in the Extract Internet Sales data flow (because no data has been modified since the previous extraction). When package execution is complete, stop debugging.
• Close SQL Server Data Tools.

Results: After this exercise, you should have an SSIS package that uses the high water mark technique to extract only records that have been modified since the previous extraction.


Exercise 2: Using Change Data Capture

Scenario
The InternetSales database contains a Customers table that does not include a column to indicate when records were inserted or modified. You plan to use the Change Data Capture feature of SQL Server Enterprise Edition to identify records that have changed between data warehouse refresh cycles, and restrict data extractions to include only modified rows.

The main tasks for this exercise are as follows:
1. Enable Change Data Capture for customer data.
2. Create a stored procedure to retrieve changed customer records.
3. Reset the staging database.
4. Modify a data flow to use the stored procedure.
5. Test the package.

Task 1: Enable Change Data Capture for customer data
• In SQL Server Management Studio, execute Transact-SQL statements to enable Change Data Capture in the InternetSales database, and monitor net changes in the Customers table. You can use the Enable CDC.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex2 folder to accomplish this.
• Open the Test CDC.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex2 folder and examine it.
• Execute the code under the comment Select all changed customer records between 1/1/1900 and today and note that no rows are returned because no changes have occurred since Change Data Capture was enabled.
• Execute the code under the comment Make a change to all customers (to create CDC records) to modify data in the Customers table.
• Execute the code under the comment Now see the net changes and note that all customer records are returned because they have all been modified within the specified time period while Change Data Capture was enabled.

Task 2: Create a stored procedure to retrieve changed customer records
• In SQL Server Management Studio, execute a Transact-SQL statement that creates a stored procedure named GetChangedCustomers in the InternetSales database. The stored procedure should perform the following tasks. You can execute the Create SP.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex2 folder to accomplish this.
   • Retrieve the log sequence numbers for the dates specified in the StartDate and EndDate parameters.
   • If neither of the log sequence numbers is null, return all records that have changed in the Customers table.
   • If either of the log sequence numbers is null, return an empty rowset.


• Test the stored procedure by running the following query.

USE InternetSales
GO
EXEC GetChangedCustomers '1/1/1900', '1/1/2099'
GO

Task 3: Reset the staging database
• In SQL Server Management Studio open the Reset Staging.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex2 folder and execute it.

Task 4: Modify a data flow to use the stored procedure
• In the D:\10777A\Labfiles\Lab07A\Starter\Ex2 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools and open the Extract Internet Sales Data.dtsx SSIS package.
• On the Data Flow tab for the Extract Customers task, make the following changes to the Customers source:
   • In the Data access mode drop-down list, select SQL Command.
   • In the SQL command text box, type the following Transact-SQL statement:

EXEC GetChangedCustomers ?, ?

   • Click the Parameters button, and in the Set Query Parameters dialog box, create the following parameter mappings and then click OK.

Parameters    Variables                          Param direction
@StartDate    User::InternetSalesLastExtract     Input
@EndDate      User::CurrentTime                  Input

Task 5: Test the package
• View the Extract Customers data flow and then start debugging the package and note the number of rows transferred. When package execution is complete, stop debugging.
• In SQL Server Management Studio, view the contents of the dbo.ExtractLog table in the Staging database and verify that the LastExtract column for the InternetSales data source has been updated.
• View the contents of the dbo.Customers table and note the rows that have been extracted.
• In SQL Server Data Tools, debug the package again and verify that no rows are transferred in the Extract Customers data flow this time. When package execution is complete, stop debugging.
• Close SQL Server Data Tools.

Results: After this exercise, you should have a database in which Change Data Capture has been enabled, and an SSIS package that uses a stored procedure to extract modified rows based on changes monitored by Change Data Capture.


Exercise 3: Using the CDC Control Task

Scenario
The HumanResources database contains an Employee table in which employee data is stored. You plan to use the Change Data Capture feature of SQL Server Enterprise Edition to identify modified rows in this table. You also plan to use the CDC Control Task in SSIS to manage the extractions from this table by creating a package to perform the initial extraction of all rows, and a second package that uses the CDC data flow components to extract rows that have been modified since the previous extraction.

The main tasks for this exercise are as follows:
1. Enable Change Data Capture for employee data.
2. View staging tables for employee data.
3. Create ADO.NET Connection Managers for CDC Components.
4. Create a package for the initial employee data extraction.
5. Test the initial extraction package.
6. Create a package for incremental employee data extraction.
7. Test the incremental extraction package.

Task 1: Enable Change Data Capture for employee data
• In SQL Server Management Studio, execute Transact-SQL statements to enable Change Data Capture in the HumanResources database, and monitor net changes in the Employee table. You can use the Enable CDC.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex3 folder to accomplish this.

Task 2: View staging tables for employee data
• In SQL Server Management Studio, view the contents of the dbo.EmployeeDeletes, dbo.EmployeeInserts, and dbo.EmployeeUpdates tables in the Staging database to verify that they are empty.

Task 3: Create ADO.NET Connection Managers for CDC Components
• In the D:\10777A\Labfiles\Lab07A\Starter\Ex3 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools.

Note: The project already contains an OLE DB connection manager for the Staging database, but the CDC components in SSIS require an ADO.NET connection manager. You will therefore need to create an ADO.NET connection manager for the HumanResources database and a new ADO.NET connection manager for the Staging database.

• Create a new, project-level connection manager that creates an ADO.NET connection to the HumanResources database on the localhost server by using Windows authentication. After the connection manager has been created, rename it to localhost.HumanResources.ADO.NET.conmgr.
• Create another new, project-level connection manager that creates an ADO.NET connection to the Staging database on the localhost server by using Windows authentication. After the connection manager has been created, rename it to localhost.Staging.ADO.NET.conmgr.


Task 4: Create a package for the initial employee data extraction
• Add a new SSIS package named Extract Initial Employee Data.dtsx to the project.
• Add a CDC Control Task from the Other Tasks section of the SSIS Toolbox to the control flow, and rename the task to Mark Initial Load Start. Then, configure the Mark Initial Load Start task as follows.
   • SQL Server CDC database ADO.NET connection manager: localhost HumanResources ADO NET
   • CDC control operation: Mark initial load start
   • Variable containing the CDC state: Click New and then in the Add New Variable dialog box, click OK to create a variable named CDC_State in the Extract Initial Employee Data container.
   • Automatically store state in a database table: Selected
   • Connection manager for the database where the state is stored: localhost Staging ADO NET
   • Table to use for storing state: Click New, and in the Create New State Table dialog box, click Run to create a table named [dbo].[cdc_states] in the Staging database.
   • State name: CDC_State


When you have finished configuring the task, the task editor should look like the following.

• Add a Data Flow Task named Extract Initial Employee Data to the control flow, connecting the success precedence constraint from the Mark Initial Load Start task to the Extract Initial Employee Data task. Then configure the Extract Initial Employee Data task's data flow as follows:
   • Create an ADO.NET Source named Employees, with the following settings.
      • ADO.NET connection manager: localhost HumanResources ADO NET
      • Data access mode: Table or view
      • Name of the table or view: "dbo"."Employee"
   • Connect the data flow from the Employees source to an ADO.NET Destination named Employee Inserts with the following settings.
      • Connection manager: localhost Staging ADO NET
      • Use a table or view: "dbo"."EmployeeInserts"
      • Mappings: On the Mappings tab, verify that all available input columns are mapped to destination columns of the same name.


The completed data flow should look like the following.

• On the control flow, add a second CDC Control Task named Mark Initial Load End and connect the success precedence constraint from the Extract Initial Employee Data task to the Mark Initial Load End task. Then, configure the Mark Initial Load End task as follows.
   • SQL Server CDC database ADO.NET connection manager: localhost HumanResources ADO NET
   • CDC control operation: Mark initial load end
   • Variable containing the CDC state: User::CDC_State
   • Automatically store state in a database table: Selected
   • Connection manager for the database where the state is stored: localhost Staging ADO NET
   • Table to use for storing state: [dbo].[cdc_states]
   • State name: CDC_State


When you have finished configuring the task, the task editor should look like the following.

• Verify that the completed control flow for the Extract Initial Employee Data.dtsx package looks like this, and then save the package.


Task 5: Test the initial extraction package
• Start debugging the Extract Initial Employee Data.dtsx package. When package execution is complete, stop debugging.
• In SQL Server Management Studio, view the contents of the dbo.EmployeeInserts table in the Staging database to verify that the employee records have been transferred.
• Refresh the view of the tables in the Staging database, and verify that a new table named dbo.cdc_states has been created. This table should contain an encoded string that indicates the CDC state.

Task 6: Create a package for incremental employee data extraction
• Add a new SSIS package named Extract Changed Employee Data.dtsx to the project.
• Add a CDC Control Task to the control flow, and rename the task to Get Processing Range. Then, configure the Get Processing Range task as follows.
   • SQL Server CDC database ADO.NET connection manager: localhost HumanResources ADO NET
   • CDC control operation: Get processing range
   • Variable containing the CDC state: Click New and then in the Add New Variable dialog box, click OK to create a variable named CDC_State in the Extract Changed Employee Data container.
   • Automatically store state in a database table: Selected
   • Connection manager for the database where the state is stored: localhost Staging ADO NET
   • Table to use for storing state: [dbo].[cdc_states]
   • State name: CDC_State


When you have finished configuring the task, the task editor should look like the following.

• Add a Data Flow Task named Extract Changed Employee Data to the control flow, connecting the success precedence constraint from the Get Processing Range task to the Extract Changed Employee Data task. Then configure the Extract Changed Employee Data task's data flow as follows:
   • Add a CDC Source (in the Other Sources section of the SSIS Toolbox) named Employee Changes, with the following settings.
      • ADO.NET connection manager: localhost HumanResources ADO NET
      • CDC enabled table: [dbo].[Employee]
      • Capture instance: dbo_Employee
      • CDC processing mode: Net
      • Variable containing the CDC state: User::CDC_State


When you have finished configuring the source, the source configuration editor should look like the following.

   • Connect the data flow from the Employee Changes source to a CDC Splitter transformation (in the Other Transforms section of the SSIS Toolbox), as shown here.


   • Add an ADO.NET Destination named Employee Inserts below and to the left of the CDC Splitter transformation, and connect the InsertOutput data flow output from the CDC Splitter transformation to the Employee Inserts destination. Then configure the Employee Inserts destination as follows.
      • Connection manager: localhost Staging ADO NET
      • Use a table or view: "dbo"."EmployeeInserts"
      • Mappings: On the Mappings tab, verify that all available input columns other than __$start_lsn, __$operation, and __$update_mask are mapped to destination columns of the same name.
   • Add an ADO.NET Destination named Employee Updates directly below the CDC Splitter transformation, and connect the UpdateOutput data flow output from the CDC Splitter transformation to the Employee Updates destination. Then configure the Employee Updates destination as follows.
      • Connection manager: localhost Staging ADO NET
      • Use a table or view: "dbo"."EmployeeUpdates"
      • Mappings: On the Mappings tab, verify that all available input columns other than __$start_lsn, __$operation, and __$update_mask are mapped to destination columns of the same name.
   • Add an ADO.NET Destination named Employee Deletes below and to the right of the CDC Splitter transformation, and connect the DeleteOutput data flow output from the CDC Splitter transformation to the Employee Deletes destination. Then configure the Employee Deletes destination as follows.
      • Connection manager: localhost Staging ADO NET
      • Use a table or view: "dbo"."EmployeeDeletes"
      • Mappings: On the Mappings tab, verify that all available input columns other than __$start_lsn, __$operation, and __$update_mask are mapped to destination columns of the same name.


The completed data flow should look like the following.

• On the control flow, add a second CDC Control Task named Mark Processed Range, and connect the success precedence constraint from the Extract Changed Employee Data task to the Mark Processed Range task. Then, configure the Mark Processed Range task as follows.
   • SQL Server CDC database ADO.NET connection manager: localhost HumanResources ADO NET
   • CDC control operation: Mark processed range
   • Variable containing the CDC state: User::CDC_State
   • Automatically store state in a database table: Selected
   • Connection manager for the database where the state is stored: localhost Staging ADO NET
   • Table to use for storing state: [dbo].[cdc_states]
   • State name: CDC_State


When you have finished configuring the task, the task editor should look like the following image.

• Verify that the completed control flow for the Extract Changed Employee Data.dtsx package looks like this, and then save the package.


Task 7: Test the incremental extraction package
• In SQL Server Management Studio, open the Change Employees.sql Transact-SQL script file in the D:\10777A\Labfiles\Lab07A\Starter\Ex3 folder, and execute the script to:
   • Truncate the dbo.EmployeeInserts, dbo.EmployeeUpdates, and dbo.EmployeeDeletes tables in the Staging database.
   • Insert a new record for Jeff Price in the Employee table in the HumanResources database.
   • Update employee 281 to change the Title to 'Sales Manager'.
   • Delete employee 273.
• In SQL Server Data Tools, view the control flow for the Extract Changed Employee Data.dtsx package and start debugging.
• When execution has completed, view the Extract Changed Employee Data data flow to verify that three rows were extracted and split into one insert, one update, and one delete. Then, stop debugging. If no rows were transferred, wait for a few seconds, and then run the package again.
• In SQL Server Management Studio, view the contents of the dbo.EmployeeDeletes, dbo.EmployeeInserts, and dbo.EmployeeUpdates tables in the Staging database to verify that they contain the inserted, updated, and deleted rows respectively.
• Close SQL Server Data Tools and minimize SQL Server Management Studio when you are finished.

Results: After this exercise, you should have a HumanResources database in which Change Data Capture has been enabled, an SSIS package that uses the CDC Control task to extract the initial set of employee records, and an SSIS package that uses the CDC Control task and CDC data flow components to extract modified employee records based on changes recorded by Change Data Capture.


Exercise 4: Using Change Tracking

Scenario
The ResellerSales database contains a Resellers table that does not include a column to indicate when records were inserted or modified. You plan to use the Change Tracking feature of SQL Server to identify records that have changed between data warehouse refresh cycles, and restrict data extractions to include only modified rows.

The main tasks for this exercise are as follows:
1. Enable Change Tracking.
2. Create a stored procedure to retrieve changed reseller records.
3. Reset the staging database.
4. Modify a data flow to use the stored procedure.
5. Test the package.

Task 1: Enable Change Tracking
• In SQL Server Management Studio, execute Transact-SQL statements to enable Change Tracking in the ResellerSales database, and track changes in the Resellers table. You do not need to track which columns were modified. You can use the Enable CT.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex4 folder to accomplish this.
• Open the Test CT.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex4 folder and examine it, noting that it contains statements to perform the following tasks:
   • Get the current change tracking version number.
   • Retrieve all data from the Resellers table.
   • Store the current version number as the previously retrieved version.
   • Update a row in the Resellers table.
   • Get the new current version number.
   • Get all changes between the previous version and the current version.
   • Store the current version number as the previously retrieved version.
   • Update a row in the Resellers table.
   • Get the new current version number.
   • Get all changes between the previous version and the current version.
• Execute the script and verify that:
   • The first resultset shows all reseller records.
   • The second resultset indicates that the previously retrieved version was numbered 0, and the current version is numbered 1.
   • The third resultset indicates that the previously retrieved version was numbered 1, and the current version is numbered 2.


Task 2: Create a stored procedure to retrieve changed reseller records
• In SQL Server Management Studio, execute a Transact-SQL statement that enables snapshot isolation and creates a stored procedure named GetChangedResellers in the ResellerSales database. The stored procedure should perform the following tasks. You can use the Create SP.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex4 folder to accomplish this.
   • Set the isolation level to snapshot.
   • Retrieve the current change tracking version number.
   • If the LastVersion parameter is -1, assume that no previous versions have been retrieved, and return all records from the Resellers table.
   • If the LastVersion parameter is not -1, retrieve all changes between LastVersion and the current version.
   • Update the LastVersion parameter to the current version, so the calling application can store the last version retrieved for next time.
   • Set the isolation level back to read committed.
• Test the stored procedure by running the following query.

USE ResellerSales
GO
DECLARE @p BigInt = -1
EXEC GetChangedResellers @p OUTPUT
SELECT @p LastVersionRetrieved
EXEC GetChangedResellers @p OUTPUT

Task 3: Reset the staging database
• In SQL Server Management Studio open the Reset Staging.sql file in the D:\10777A\Labfiles\Lab07A\Starter\Ex4 folder and execute it.

Task 4: Modify a data flow to use the stored procedure
• In the D:\10777A\Labfiles\Lab07A\Starter\Ex4 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools and open the Extract Reseller Data.dtsx SSIS package.
• Add a decimal variable named PreviousVersion to the package.
• Add an Execute SQL Task named Get Previously Extracted Version to the control flow, and set the following configuration properties:
   • On the General tab, set the ResultSet property to Single row.
   • On the General tab, set the Connection property to localhost.Staging.
   • On the General tab, set the SQLStatement property to the following Transact-SQL query.

SELECT MAX(LastVersion) LastVersion
FROM ExtractLog
WHERE DataSource = 'ResellerSales'

   • On the Result Set tab, add a result that maps the LastVersion column in the result set to the User::PreviousVersion variable.


• Add an Execute SQL task named Update Previous Version to the control flow, and set the following configuration properties:
   • On the General tab, set the Connection property to localhost.Staging.
   • On the General tab, set the SQLStatement property to the following Transact-SQL query.

UPDATE ExtractLog
SET LastVersion = ?
WHERE DataSource = 'ResellerSales'

   • On the Parameter Mapping tab, add the following parameter mapping.

Variable Name            Direction    Data Type        Parameter Name    Parameter Size
User::PreviousVersion    Input        LARGE_INTEGER    0                 -1

• Make the necessary changes to the precedence constraint connections in the control flow so it matches the following image.


• On the Data Flow tab for the Extract Resellers task, make the following changes to the Resellers source:
  • In the Data access mode drop-down list, select SQL Command.
  • In the SQL command text box, type the following Transact-SQL statement.

EXEC GetChangedResellers ? OUTPUT

  • Click the Parameters button, and in the Set Query Parameters dialog box, create the following parameter mapping, and then click OK.

    Parameters   | Variables             | Param direction
    @LastVersion | User::PreviousVersion | InputOutput

Task 5: Test the package
• View the Extract Resellers data flow and then start debugging the package and note the number of rows transferred. When package execution is complete, stop debugging.
• In SQL Server Data Tools, view the contents of the dbo.ExtractLog table in the Staging database and verify that the LastVersion column for the ResellerSales data source has been updated.
• View the contents of the dbo.Resellers table and note the rows that have been extracted.
• In SQL Server Data Tools, debug the package again and verify that no rows are transferred in the Extract Resellers data flow. When package execution is complete, stop debugging.
• Close SQL Server Data Tools.

Results: After this exercise, you should have a database in which Change Tracking has been enabled, and an SSIS package that uses a stored procedure to extract modified rows based on changes recorded by Change Tracking.


Lesson 3
Loading Modified Data

After your incremental ETL process has transferred source data to the staging location, it must load the data into the data warehouse. One of the main challenges when refreshing data in a data warehouse is identifying which staged records are new dimension members or facts that need to be inserted into the data warehouse, and which records represent modifications that require rows in the data warehouse to be updated.

This lesson describes a number of common techniques for performing an incremental load of a data warehouse. After completing this lesson, you will be able to:

• Describe the common options for incrementally loading a data warehouse.
• Load data from staging tables created by using the CDC source and CDC Splitter data flow components.
• Use a Lookup transformation to differentiate between new records and updates of existing records.
• Use the Slowly Changing Dimension transformation to apply type 1 and type 2 changes to a dimension.
• Use the Transact-SQL MERGE statement to insert and update data in a single query.


Options for Incrementally Loading Data

There are a number of commonly used techniques for loading incremental changes to a data warehouse. The specific technique you should use to load a particular dimension or fact table depends on a number of factors, including performance, the need to update existing records as well as insert new records, the need to retain historical dimension attributes, and the location of the staging and data warehouse tables.

Insert, Update, or Delete Data Based on CDC Output Tables
If you used the CDC Splitter to stage modified source data into operation-specific tables, then you can create an SSIS package that uses the business keys or unique column combinations in the staging tables to apply the appropriate changes to the associated tables in the data warehouse. In most cases, delete operations in the source are applied as logical delete operations in the data warehouse, in which a deleted flag is set instead of actually deleting the matching record.

Use a Lookup Transformation
You can use a Lookup transformation to determine whether a matching record exists in the data warehouse for a record that has been extracted from the data sources. You can then use the no match output of the Lookup transformation to create a data flow for new records that need to be inserted into the data warehouse. Optionally, you can also use the match output of the Lookup transformation to create a data flow that updates existing records in the data warehouse with new values from the extracted records.

Use the Slowly Changing Dimension Transformation
The Slowly Changing Dimension transformation enables you to create a complex data flow that inserts new dimension members and applies type 1 or type 2 changes to existing dimension members depending on which attributes have been updated. In many data warehousing scenarios, the Slowly Changing Dimension transformation provides an easy-to-implement solution for refreshing dimension tables. However, its performance can be limited for ETL processes with extremely large numbers of rows, for which you might need to create a custom solution for slowly changing dimensions.


Use the MERGE Statement
The MERGE statement is a Transact-SQL construct that you can use to perform insert, update, and delete operations in the same statement. The statement works by matching rows in a source rowset with rows in a target table, and taking appropriate action to merge the source with the target.

The MERGE statement is appropriate when the source or staging tables and the data warehouse tables are implemented in SQL Server databases and the ETL process is able to execute the MERGE statement using a connection through which all source and target tables can be accessed. In practical terms, this requires that:

• The staging tables and data warehouse are co-located in the same SQL Server database.
• The staging tables and data warehouse tables are located in multiple databases in the same SQL Server instance, and the credentials used to execute the MERGE statement have appropriate user rights in both databases.
• The staging tables are located in a different SQL Server instance than the data warehouse, but a linked server has been defined that enables the MERGE statement to access both databases, and the performance of the MERGE statement over the linked server connection is acceptable.

Use a Checksum
You can use the columns in the staged dimension records to generate a checksum value, and then compare this with a checksum generated from the historic attributes in the corresponding dimension table to identify rows that require a type 2 or type 3 change. When combined with a Lookup transformation to identify new or modified rows, this technique can form the basis of a custom solution for slowly changing dimensions (a minimal sketch of this comparison appears at the end of this topic).

Considerations for Deleting Data Warehouse Records
If you need to propagate record deletions in source systems to the data warehouse, you should consider the following guidelines:

• In most cases, you should use a logical deletion technique in which you indicate that a record is no longer valid by setting a Boolean column value. It is not common practice to physically delete records from a data warehouse unless you have a compelling business reason to discard all historical information relating to that record.
• The techniques you can use to delete records (or mark them as logically deleted) when loading data depend on how you have staged deleted records.
  • If the staging tables for a dimension or fact table contain all valid records (not just records that have been modified since the previous refresh cycle), then you can delete any existing records in the data warehouse that do not exist in the staging tables.
  • If the staged data indicates logical deletions in the form of a Boolean column value, and you need to apply logical deletes in the data warehouse, then the logical deletions can be treated as updates.
  • If the keys of records to be deleted are stored separately from new and updated records in the staging database, then you may want to perform two distinct load operations for each dimension or fact table – one to load new and updated records, and another to delete records.
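The following Transact-SQL sketch illustrates the checksum comparison described above. The table and column names are borrowed from the illustrative examples in the next topic and are not part of the lab databases; they are assumptions for illustration only.

-- Identify staged rows whose attribute values differ from the current dimension row
SELECT stg.ProductBizKey
FROM stg.ProductUpdates AS stg
JOIN dw.DimProduct AS dw
  ON dw.ProductAltKey = stg.ProductBizKey
WHERE CHECKSUM(stg.ProductName, stg.Description, stg.Price)
   <> CHECKSUM(dw.ProductName, dw.Description, dw.Price);

Note that CHECKSUM can occasionally produce the same value for different inputs; if that risk is unacceptable, a hash computed with HASHBYTES over the concatenated attribute values is a more robust alternative.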


Using CDC Output Tables

As described earlier in this module, you can use the CDC Splitter or a custom data flow to stage CDC insert, update, or delete records in separate tables based on the __$operation value in the CDC output. If your data extraction process has taken this approach, then you can apply the appropriate changes to the corresponding tables in the data warehouse. The specific control flow tasks you can use to perform this type of data load depend on the relative locations of the staging tables and the data warehouse.

Loading a Data Warehouse from a Co-Located Staging Database
If the staging tables and the data warehouse are co-located in the same database or server instance, or can be connected by using a linked server, then you can use SQL Command control flow tasks to execute set-based Transact-SQL statements that load the data warehouse tables.

For example, you could use the following Transact-SQL statement to load data from an inserts staging table into a data warehouse table.

INSERT INTO dw.DimProduct (ProductAltKey, ProductName, Description, Price)
SELECT ProductBizKey, ProductName, Description, Price
FROM stg.ProductInserts

The following code example updates data in a data warehouse table based on the records in an updates staging table.

UPDATE dw.DimProduct
SET dw.DimProduct.ProductName = stg.ProductUpdates.ProductName,
    dw.DimProduct.Description = stg.ProductUpdates.Description,
    dw.DimProduct.Price = stg.ProductUpdates.Price
FROM dw.DimProduct JOIN stg.ProductUpdates
ON dw.DimProduct.ProductAltKey = stg.ProductUpdates.ProductBizKey


The following code example deletes records in the data warehouse based on a deletes staging table.

DELETE FROM dw.DimProduct
WHERE ProductAltKey IN (SELECT ProductBizKey FROM stg.ProductDeletes)

However, deleting records from the data warehouse tables can result in loss of historical analytical and reporting data, and can be further complicated by the presence of foreign key constraints between fact and dimension tables. A more common approach is to perform a logical delete by setting a column to a Boolean value to indicate that the record has been deleted from the source system, as shown in the following example.

UPDATE dw.DimProduct
SET dw.DimProduct.Deleted = 1
FROM dw.DimProduct JOIN stg.ProductDeletes
ON dw.DimProduct.ProductAltKey = stg.ProductDeletes.ProductBizKey

Loading a Remote Data Warehouse
If the data warehouse is stored on a different server from the staging database, and no linked server connection is available, you can apply the necessary updates in the staging tables to the data warehouse by creating a data flow for each operation.

To load records from an inserts staging table, create a data flow that includes a source component to extract records from the staging table and a destination that maps the extracted columns to the appropriate data warehouse table.

To apply updates in an updates staging table to a data warehouse table, create a data flow that includes a source component to extract records from the staging table and an OLE DB Command transformation to execute an UPDATE statement that sets the changeable data warehouse table columns to the corresponding values in the staging table based on a join between the business key in the staging table and the alternative key in the data warehouse table.

To apply deletes to data warehouse tables based on records in a deletes staging table, create a data flow that includes a source component to extract records from the staging table and an OLE DB Command transformation to execute a DELETE statement that matches the business key in the staging table to the alternative key in the data warehouse table. Alternatively, use the OLE DB Command transformation to perform a logical delete by executing an UPDATE statement that sets the deleted flag to 1 for records where the business key in the staging table matches the alternative key in the data warehouse table.
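The statements executed by an OLE DB Command transformation are parameterized, with each ? placeholder mapped to an input column from the data flow and executed once per row. The following sketches reuse the illustrative dw.DimProduct column names from the examples above to show the general shape of such statements; they are not taken from the lab files.

-- Update the changeable columns of the matching dimension row (executed once per staged update row)
UPDATE dw.DimProduct
SET ProductName = ?, Description = ?, Price = ?
WHERE ProductAltKey = ?

-- Logical delete: flag the matching row rather than removing it (executed once per staged delete row)
UPDATE dw.DimProduct
SET Deleted = 1
WHERE ProductAltKey = ?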


Demonstration: Using CDC Output Tables

Task 1: Load data from CDC output tables
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd. Then, in the D:\10777A\Demofiles\Mod07 folder, run StageData.cmd as Administrator.
2. Start SQL Server Management Studio and connect to the localhost instance of the SQL Server database engine by using Windows authentication.
3. In Object Explorer, view the top 1000 rows in the dw.DimShipper, stg.ShipperDeletes, stg.ShipperInserts, and stg.ShipperUpdates tables in the DemoDW database.
4. In the D:\10777A\Demofiles\Mod07 folder, double-click IncrementalETL.sln to open the solution in SQL Server Data Tools. Then, in Solution Explorer, double-click the Load Shippers.dtsx SSIS package.
5. On the control flow surface, double-click the Load Inserted Shippers Execute SQL task. Note that the SQL Statement inserts data into dw.DimShippers from the stg.ShipperInserts table. Then click Cancel.
6. On the control flow surface, double-click the Load Updated Shippers Execute SQL task. Note that the SQL Statement updates data in dw.DimShippers with new values from the stg.ShipperUpdates table. Then click Cancel.


7. On the control flow surface, double-click the Load Deleted Shippers data flow task. On the data flow surface, note that the task extracts data from the stg.ShipperDeletes table and then uses an OLE DB Command transformation to update the Deleted column in dw.DimShippers for the extracted rows.
8. On the Debug menu, click Start Debugging, and observe the control flow as it executes. When execution is complete, on the Debug menu, click Stop Debugging and minimize SQL Server Data Tools.
9. In SQL Server Management Studio, review the changes to the dw.DimShipper, stg.ShipperDeletes, stg.ShipperInserts, and stg.ShipperUpdates tables in the DemoDW database. Then close SQL Server Management Studio.


The Lookup Transformation

To use a Lookup transformation when loading a data warehouse table, connect the output of the source that extracts the staged data to the Lookup transformation and apply the following configuration settings:

• Redirect non-matched rows to the no match output.
• Look up the primary key column or columns in the dimension or fact table you want to refresh by matching it to one or more input columns from the staged data. If the staged data includes a business key column and the business key is stored as an alternative key in the data warehouse table, match the business key to the alternative key. Otherwise, match a combination of columns that uniquely identifies a fact or dimension member.
• Connect the no match output from the Lookup transformation to a data flow that ultimately inserts new records into the data warehouse.
• If you want to update existing data warehouse records with modified values in the staged data, connect the match output of the Lookup transformation to a data flow that uses an OLE DB Command transformation to update the records based on the primary key you retrieved in the Lookup transformation.

Note: The Lookup transformation uses an in-memory cache to optimize performance. If the same data set will be used in multiple lookup operations, you can persist the cache to a file and use a Cache Connection Manager to reference it. This further improves performance by decreasing the time it takes to load the cache, but results in lookup operations against a data set that might not be as up to date as the data in the database. For more information about configuring caching for the Lookup transformation, see "Lookup Transformation" in SQL Server Books Online.
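Because the reference data set is loaded into the cache, it is usually worth basing the Lookup on a query that returns only the columns needed for matching and retrieval, rather than on the whole table. The following query is an illustrative sketch using the dw.DimGeography column names from the demonstration that follows; the specific columns you include depend on your dimension design.

SELECT GeographyKey, PostalCode, City, Region, Country
FROM dw.DimGeography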


Demonstration: Using the Lookup Transformation

Task 1: Use a Lookup transformation to insert only new records
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. If you did not complete the previous demonstration, in the D:\10777A\Demofiles\Mod07 folder, run StageData.cmd as Administrator, and then double-click IncrementalETL.sln to open the solution in SQL Server Data Tools.
3. In SQL Server Data Tools, in Solution Explorer, double-click the Load Geography.dtsx SSIS package.
4. On the control flow surface, double-click Load Geography Dimension to view the data flow surface. Then on the data flow surface, double-click Staged Geography Data, note that the SQL command used by the OLE DB source extracts geography data from the stg.Customers and stg.Salespeople tables, and then click Cancel.
5. On the data flow surface, double-click Lookup Existing Geographies and note the following configuration settings of the Lookup transformation. Then click Cancel.
  • On the General tab, unmatched rows are redirected to the no match output.
  • On the Connection tab, the data to be matched is retrieved from the dw.DimGeography table.
  • On the Columns tab, the GeographyKey column is retrieved for rows with matching PostalCode, City, Region, and Country columns.
6. On the data flow surface, click the data flow arrow connecting Lookup Existing Geographies to New Geographies, and press F4. Then in the Properties pane, note that this arrow represents the no match data flow.


7. Double-click New Geographies, and note that the rows in the no match data flow are inserted into the dw.DimGeography table. Then click Cancel.
8. On the Debug menu, click Start Debugging, and observe the data flow as it executes. Note that while four rows are extracted from the staging tables, only one does not match an existing record. The new record is loaded into the data warehouse, and the rows that match existing records are discarded. When execution is complete, on the Debug menu, click Stop Debugging.

Task 2: Use a Lookup transformation to insert new records and update existing records
1. In Solution Explorer, double-click the Load Products.dtsx SSIS package. Then on the control flow surface, double-click Load Product Dimension to view the data flow surface.
2. On the data flow surface, double-click Staged Products, note that the SQL command used by the OLE DB source extracts product data from the stg.Products table, and then click Cancel.
3. On the data flow surface, double-click Lookup Existing Products and note the following configuration settings of the Lookup transformation. Then click Cancel.
  • On the General tab, unmatched rows are redirected to the no match output.
  • On the Connection tab, the data to be matched is retrieved from the dw.DimProduct table.
  • On the Columns tab, the ProductKey column is retrieved for rows where the ProductBusinessKey column in the staging table matches the ProductAltKey column in the data warehouse dimension table.
4. On the data flow surface, click the data flow arrow connecting Lookup Existing Products to Insert New Products, and press F4. Then in the Properties pane, note that this arrow represents the no match data flow.
5. Double-click Insert New Products, and note that the rows in the no match data flow are inserted into the dw.DimProduct table. Then click Cancel.
6. On the data flow surface, click the data flow arrow connecting Lookup Existing Products to Update Existing Products. Then in the Properties pane, note that this arrow represents the match data flow.
7. Double-click Update Existing Products, and note the following configuration settings. Then click Cancel.
  • On the Connection Managers tab, the OLE DB Command transformation connects to the DemoDW database.
  • On the Component Properties tab, the SQLCommand property contains a parameterized Transact-SQL statement that updates the ProductName, ProductDescription, and ProductCategoryName columns for a given ProductKey.
  • On the Column Mapping tab, the ProductName, ProductDescription, ProductCategoryName, and ProductKey input columns from the match data flow are mapped to the parameters in the SQL command.
8. On the Debug menu, click Start Debugging, and observe the data flow as it executes. Note the number of rows extracted from the staging tables, and how the Lookup transformation splits these rows to insert new records and update existing ones.
9. When execution is complete, on the Debug menu, click Stop Debugging. Then minimize SQL Server Data Tools.


The Slowly Changing Dimension Transformation

The Slowly Changing Dimension transformation provides a wizard that you can use to generate a complex data flow to handle inserts and updates for a dimension table. Using the Slowly Changing Dimension wizard, you can specify:

• Which columns contain keys that can be used to look up existing dimension records in the data warehouse.
• Which non-key columns are fixed attributes, changing attributes, or historic attributes.
• Whether changes to a fixed column should produce an error or be ignored.
• The column in the dimension table that should be used to indicate the current version of a dimension member for which historic attributes have changed over time.
• Whether the staged data includes inferred members for which a minimal record should be inserted. Inferred members are identified based on the value of a specified column in the source.

After completing the wizard, the Slowly Changing Dimension transformation generates a data flow that includes the following paths:

• A path to insert new dimension records.
• A path to update dimension records where a changing attribute has been modified. This is an implementation of a type 1 change.
• A path to update the current record indicator and insert a new record for dimension members where a historic attribute has been modified. This is an implementation of a type 2 change.
• If specified, a path to insert minimal records for inferred members in the source data.
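The transformation generates SSIS data flow components rather than executing Transact-SQL directly, but it can help to see what the two change types amount to in SQL terms. The following sketch uses the dw.DimCustomer and stg.Customers column names from the demonstrations in this lesson, and assumes the staged rows already include the looked-up GeographyKey; it is illustrative only, not the logic the transformation actually emits.

-- Type 1: overwrite a changing attribute in place on the current row
UPDATE dw.DimCustomer
SET CustomerEmail = stg.Customers.CustomerEmail
FROM dw.DimCustomer JOIN stg.Customers
ON dw.DimCustomer.CustomerAltKey = stg.Customers.CustomerID
WHERE dw.DimCustomer.CurrentRecord = 1
AND dw.DimCustomer.CustomerEmail <> stg.Customers.CustomerEmail

-- Type 2: expire the current row when a historic attribute has changed...
UPDATE dw.DimCustomer
SET CurrentRecord = 0
FROM dw.DimCustomer JOIN stg.Customers
ON dw.DimCustomer.CustomerAltKey = stg.Customers.CustomerID
WHERE dw.DimCustomer.CurrentRecord = 1
AND dw.DimCustomer.CustomerGeographyKey <> stg.Customers.GeographyKey

-- ...and insert a new row flagged as the current version of the member
INSERT INTO dw.DimCustomer (CustomerAltKey, CustomerName, CustomerEmail, CustomerGeographyKey, CurrentRecord)
SELECT CustomerID, CustomerName, CustomerEmail, GeographyKey, 1
FROM stg.Customers
WHERE NOT EXISTS
(SELECT * FROM dw.DimCustomer
WHERE dw.DimCustomer.CustomerAltKey = stg.Customers.CustomerID
AND dw.DimCustomer.CurrentRecord = 1)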


Demonstration: Implementing a Slowly Changing Dimension

Task 1: Examine a slowly changing dimension data flow
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. If you did not complete the previous demonstration, in the D:\10777A\Demofiles\Mod07 folder, run StageData.cmd as Administrator, and then double-click IncrementalETL.sln to open the solution in SQL Server Data Tools, open the Load Geography.dtsx package, and execute it to load the staged geography data.
3. In SQL Server Data Tools, in Solution Explorer, double-click the Load Salespeople.dtsx SSIS package.
4. On the control flow surface, double-click Load Salesperson Dimension to view the data flow surface. Note the following details about the data flow:
  • Staged salespeople records are extracted from the stg.Salesperson table.
  • The Lookup Geography Key transformation retrieves the GeographyKey value for the salesperson based on the PostalCode, City, Region, and Country column values.
  • Salesperson SCD is a slowly changing dimension transformation that generates multiple data flow paths for historical attribute updates that require the insertion of a new record, new dimension member records, and changing attribute updates.
5. On the Debug menu, click Start Debugging and observe the data flow as it executes. Note that the staged data includes one new salesperson, and a salesperson record with a modified historical attribute. This results in two new records in the data warehouse.
6. When execution is complete, on the Debug menu, click Stop Debugging.


Task 2: Use the Slowly Changing Dimension transformation wizard
1. Start SQL Server Management Studio and connect to the localhost instance of the database engine by using Windows authentication.
2. In Object Explorer, expand Databases, expand DemoDW, and expand Tables. Then right-click the stg.Customers table and click Select Top 1000 Rows. This table contains staged customer data.
3. Right-click the dw.DimCustomers table and click Select Top 1000 Rows. This table contains customer dimension data. Note that the staged data includes two new customers (Ben Miller and Marie Dubois), a customer with a changed email address (Holly Holt), and a customer that has moved from New York to Seattle (Walter Harp).
4. In SQL Server Data Tools, in Solution Explorer, double-click the Load Customers.dtsx SSIS package. Then on the control flow surface, double-click Load Customer Dimension and note the following details about the data flow:
  • Staged customer records are extracted from the stg.Customer table.
  • The Lookup Geography Key transformation retrieves the GeographyKey value for the customer based on the PostalCode, City, Region, and Country column values.
5. In the SSIS Toolbox, drag a Slowly Changing Dimension to the data flow surface, and drop it below the Lookup Geography Key transformation. Then right-click Slowly Changing Dimension, click Rename, and change the name to Customer SCD.
6. Click Lookup Geography Key, and then drag the blue data flow arrow to Customer SCD. In the Input Output Selection dialog box, in the Output drop-down list, select Lookup Match Output, and then click OK.
7. Double-click Customer SCD, and then in the Slowly Changing Dimension Wizard, specify the following configuration settings:
  • On the Select a Dimension Table and Keys page, in the Connection manager drop-down list, select localhost.DemoDW, and in the Table or view drop-down list, select [dw].[DimCustomer]. Then specify the following column mappings.

    Input Columns | Dimension Columns    | Key Type
                  | CurrentRecord        |
    CustomerID    | CustomerAltKey       | Business Key
    CustomerEmail | CustomerEmail        | Not a Key Column
    GeographyKey  | CustomerGeographyKey | Not a Key Column
    CustomerName  | CustomerName         | Not a Key Column

  • On the Slowly Changing Dimension Columns page, specify the following change types.

    Dimension Columns    | Change Type
    CustomerEmail        | Changing Attribute
    CustomerName         | Changing Attribute
    CustomerGeographyKey | Historical Attribute


  • On the Fixed and Changing Attribute Options page, leave both options unselected.
  • On the Historical Attribute Options page, select Use a single column to show current and expired records. Then, in the Column to indicate current record drop-down list, select CurrentRecord, in the Value when current drop-down list, select True, and in the Expiration value drop-down list, select False.
  • On the Inferred Dimension Members page, uncheck the Enable inferred member support option.
8. When you have completed the wizard, note the data flow that it has created.
9. On the Debug menu, click Start Debugging and observe the data flow as it executes. Note that the staged data includes two new customers, one customer record with a modified changing attribute (which is a customer that has changed their email address), and one customer record with a modified historical attribute (a customer that has moved to a new geographical location).
10. When execution is complete, on the Debug menu, click Stop Debugging. Then close SQL Server Data Tools and SQL Server Management Studio without saving any changes.


The MERGE Statement

The MERGE statement matches source and target rows based on specified criteria, and then performs insert, update, or delete operations based on the matching results.

When used to load a data warehouse table, a MERGE statement includes the following components:

• An INTO clause containing the name of the target table being loaded.
• A USING clause containing a query that defines the source data to be loaded into the target table – usually executed against tables in a staging database and often using JOIN clauses to look up dimension keys in the data warehouse.
• An ON clause specifying the criteria used to match rows in the source rowset with rows in the target table.
• A WHEN MATCHED clause specifying the action to be taken for rows in the target table that match rows in the source rowset – often an UPDATE statement to apply a type 1 change.
• A WHEN NOT MATCHED clause specifying the action to be taken when no matches are found in the target table for rows in the source rowset – usually an INSERT statement to add a new record to the data warehouse.

When you implement an incremental ETL process with SQL Server Integration Services, you can use a SQL Command task in the control flow to execute a MERGE statement. However, you must ensure that the connection manager assigned to the SQL Command task provides access to the source and target tables.
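The following sketch shows these components together. It uses the staging and data warehouse table names from the demonstration that follows (stg.SalesOrders and dw.FactSalesOrders, matched on OrderNo and ItemNo), but the ProductBizKey, ProductKey, and Cost columns are assumptions for illustration; the actual demonstration script may differ.

MERGE INTO dw.FactSalesOrders AS tgt
USING
(
  -- The source query joins to a dimension table to look up the surrogate key
  SELECT so.OrderNo, so.ItemNo, dp.ProductKey, so.Cost
  FROM stg.SalesOrders AS so
  JOIN dw.DimProduct AS dp ON so.ProductBizKey = dp.ProductAltKey
) AS src
ON tgt.OrderNo = src.OrderNo AND tgt.ItemNo = src.ItemNo
WHEN MATCHED THEN
  UPDATE SET ProductKey = src.ProductKey, Cost = src.Cost
WHEN NOT MATCHED THEN
  INSERT (OrderNo, ItemNo, ProductKey, Cost)
  VALUES (src.OrderNo, src.ItemNo, src.ProductKey, src.Cost);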


Demonstration: Using the MERGE Statement

Task 1: Use the MERGE statement to insert and update data
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log onto MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. If you did not complete the previous demonstration, in the D:\10777A\Demofiles\Mod07 folder, run StageData.cmd as Administrator.
3. In the D:\10777A\Demofiles\Mod07 folder, double-click Merge Sales Orders.sql to open the query file in SQL Server Management Studio. Each time you are prompted, connect to the localhost instance of the database engine by using Windows authentication. Do not execute the query yet.
4. In Object Explorer, expand Databases, expand DemoDW, and expand Tables. Then right-click the stg.SalesOrders table and click Select Top 1000 Rows. This table contains staged sales order data.
5. Right-click the dw.FactSalesOrders table and click Select Top 1000 Rows. This table contains sales order fact data. Note that the staged data includes three order records that do not exist in the data warehouse fact table (with OrderNo and ItemNo values of 1005 and 1, 1006 and 1, and 1006 and 2 respectively), and one record that does exist but for which the Cost value has been modified (OrderNo 1004, ItemNo 1).
6. On the Merge Sales Orders.sql query tab, view the Transact-SQL code and note the following details:
  • The MERGE statement specifies the DemoDW.dw.FactSalesOrders table as the target.
  • A query that returns staged sales orders and uses joins to look up dimension keys in the data warehouse is specified as the source.
  • The target and source tables are matched on the OrderNo and ItemNo columns.
  • Matched rows are updated in the target.
  • Unmatched rows are inserted into the target.


7. Click Execute and note the number of rows affected.
8. Right-click the dw.FactSalesOrders table and click Select Top 1000 Rows. Then compare the contents of the table with the results of the previous query you performed in step 4.
9. Close SQL Server Management Studio without saving any changes.


Lab Scenario

In this lab, you will continue to develop the Adventure Works ETL solution.

You are ready to start developing the SSIS packages that load data from the staging database into the data warehouse. You need to implement an incremental data load solution that includes the following dimension and fact tables:

• DimEmployee. Rows to be inserted, updated, and deleted in the DimEmployee dimension table have been extracted into operation-specific staging tables by a package that uses the CDC Control task and the CDC Source and CDC Splitter data flow components. You must use the data in these staging tables to insert new dimension member records, update existing dimension member records, and set the Deleted flag for dimension records that have been deleted in the source system.
• DimProduct. Product dimension records support type 1 changes. The data flow for loading product dimension records must look up the key for the product's subcategory (by matching the product subcategory business key in the staged product records to the alternative key in the DimProductSubcategory dimension table). Then it must match staged product records to existing records in the data warehouse based on the product business key, update existing records where a match is found, and insert new records where no match is found.


• DimCustomer. Customer dimension records support type 1 changes for updates to the BirthDate, EmailAddress, FirstName, LastName, MiddleName, Phone, Suffix, and Title columns. Updates to the AddressLine1, AddressLine2, CommuteDistance, Gender, GeographyKey, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, and Occupation columns should result in type 2 changes. You plan to use the Slowly Changing Dimension transformation to implement a data flow that loads this dimension table.
• FactInternetSales. Internet sales fact records can be updated in the data warehouse (for example, to update a shipped date after the goods have been shipped). You plan to use a Transact-SQL MERGE statement to match staged sales records with existing fact records in the data warehouse based on the SalesOrderNumber and SalesOrderLineNumber fields, and insert new records and update existing ones.


Lab 7B: Loading Incremental Changes

Exercise 1: Loading Data from CDC Output Tables

Scenario
The staging database in your ETL solution includes tables named EmployeeInserts, which contains employee records that have been inserted in the employee source system, EmployeeUpdates, which contains records that have been modified in the employee source system, and EmployeeDeletes, which contains records that have been deleted in the employee source system. You must use these tables to load and update the DimEmployee dimension table, which uses a Deleted flag to indicate records that have been deleted in the source system.

The main tasks for this exercise are as follows:
1. Prepare the lab environment.
2. Create a data flow for inserts.
3. Create a data flow for updates.
4. Create a data flow for deletes.
5. Test the package.

Task 1: Prepare the lab environment
• Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
• Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab07B\Starter folder as Administrator.


Task 2: Create a data flow for inserts
• In the D:\10777A\Labfiles\Lab07B\Starter\Ex1 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools.
• In Solution Explorer, note that a connection manager for the AWDataWarehouse database has been created.
• Add a new SSIS package named Load Employee Data.dtsx to the project.
• Add a Data Flow Task named Insert Employees to the control flow. Then configure the Insert Employees task's data flow as follows:
  • Use the Source Assistant to create an OLE DB source that uses the localhost.Staging connection manager. Then name the source Staged Employee Inserts, and configure it with the following settings.

    Property                  | Setting
    OLE DB connection manager | localhost.Staging
    Data access mode          | Table or view
    Name of the table or view | [dbo].[EmployeeInserts]

  • Use the Destination Assistant to create an OLE DB destination that uses the localhost.AWDataWarehouse connection manager. Then name the destination New Employees, connect the data flow from Staged Employee Inserts to it, and configure it with the following settings.

    Property                  | Setting
    OLE DB connection manager | localhost.AWDataWarehouse
    Data access mode          | Table or view – fast load
    Name of the table or view | [dbo].[DimEmployee]
    Mappings                  | On the Mappings tab, drag the EmployeeID input column to the EmployeeAlternateKey destination column, and verify that all other input columns are mapped to destination columns of the same name and that the EmployeeKey and Deleted destination columns are not mapped.

  The completed data flow should look like the following.


Task 3: Create a data flow for updates
• On the control flow surface of the Load Employee Data.dtsx package, connect the success precedence constraint of the Insert Employees data flow task to a new Data Flow Task named Update Employees. Then configure the Update Employees task's data flow as follows:
  • Use the Source Assistant to create an OLE DB source that uses the localhost.Staging connection manager. Then name the source Staged Employee Updates, and configure it with the following settings.

    Property                  | Setting
    OLE DB connection manager | localhost.Staging
    Data access mode          | Table or view
    Name of the table or view | [dbo].[EmployeeUpdates]

  • Connect the data flow from the Staged Employee Updates source to an OLE DB Command transformation named Update Existing Employees with the following settings configured in the Advanced Editor dialog box:

    Connection Manager: localhost.AWDataWarehouse

    SqlCommand (on the Component Properties tab):

UPDATE dbo.DimEmployee
SET FirstName = ?, LastName = ?, EmailAddress = ?, Title = ?, HireDate = ?
WHERE EmployeeAlternateKey = ?

    Column Mappings:

    Input Column | Destination Column
    FirstName    | Param_0
    LastName     | Param_1
    EmailAddress | Param_2
    Title        | Param_3
    HireDate     | Param_4
    EmployeeID   | Param_5

  The completed data flow should look like the following.


Task 4: Create a data flow for deletes
• On the control flow surface of the Load Employee Data.dtsx package, connect the success precedence constraint of the Update Employees data flow task to a new Data Flow Task named Delete Employees. Then configure the Delete Employees task's data flow as follows:
  • Use the Source Assistant to create an OLE DB source that uses the localhost.Staging connection manager. Then name the source Staged Employee Deletes, and configure it with the following settings.

    Property                  | Setting
    OLE DB connection manager | localhost.Staging
    Data access mode          | Table or view
    Name of the table or view | [dbo].[EmployeeDeletes]

  • Connect the data flow from the Staged Employee Deletes source to an OLE DB Command transformation named Delete Existing Employees with the following settings configured in the Advanced Editor dialog box:

    Connection Manager: localhost.AWDataWarehouse

    SqlCommand (on the Component Properties tab):

UPDATE dbo.DimEmployee
SET Deleted = 1
WHERE EmployeeAlternateKey = ?

    Column Mappings:

    Input Column | Destination Column
    EmployeeID   | Param_0

  The completed data flow should look like the following.


Task 5: Test the package
• Verify that the control flow for the Load Employee Data.dtsx package looks like the following.
• Start debugging the Load Employee Data.dtsx package, and when execution is complete view the data flow surface for each of the data flow tasks, noting the numbers of rows processed in each task. Then stop debugging and close SQL Server Data Tools.

Results: After this exercise, you should have an SSIS package that uses data flows to apply inserts, updates, and logical deletes in the data warehouse based on staging tables extracted by the CDC Control task and data flow components.


Exercise 2: Using a Lookup Transformation to Insert or Update Dimension Data

Scenario
Another BI developer has partially implemented an SSIS package to load product data into a hierarchy of dimension tables. You must complete this package by creating a data flow that uses a Lookup transformation to determine whether a product dimension record already exists, and then insert or update a record in the dimension table accordingly.

The main tasks for this exercise are as follows:
1. Examine an existing package.
2. Add a data flow to extract staged product data.
3. Add a Lookup transformation to find parent subcategory keys.
4. Add a Lookup transformation to find existing product records.
5. Add a destination to load new product records into the data warehouse.
6. Add an OLE DB Command transformation to update existing product records.
7. Test the package.

Task 1: Examine an existing package
• In the D:\10777A\Labfiles\Lab07B\Starter\Ex2 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools. Then open the Load Products Data.dtsx SSIS package.
• View the data flow for the Load Product Category Dimension task, and note the following:
  • The Staged Product Category Data source extracts product category data from the InternetSales and ResellerSales tables in the Staging database.
  • The Lookup Existing Product Categories task retrieves the ProductCategoryKey value for product categories that exist in the DimProductCategory table in the AWDataWarehouse database by matching the product category business key in the staging database to the product category alternative key in the data warehouse.
  • The Lookup No Match Output data flow path connects to the New Product Categories destination, and the Lookup Match Output data flow path from the Lookup Existing Product Categories task connects to the Update Existing Product Categories task.
  • The New Product Categories destination loads new product category records into the DimProductCategory table.
  • The Update Existing Product Categories task executes a Transact-SQL statement to update the ProductCategoryName column in the DimProductCategory table for an existing row based on the ProductCategoryKey.
• View the data flow for the Load Product Subcategory Dimension task, and note that this data flow inserts or updates product subcategory dimension data using a similar approach to the Load Product Category Dimension data flow, with the addition of a lookup task to retrieve the ProductCategoryKey in AWDataWarehouse for the parent category (which should have already been loaded).


Task 2: Add a data flow to extract staged product data
• Add a Data Flow task named Load Product Dimension to the control flow of the Load Products Data.dtsx package, and connect the success precedence constraint from the Load Product Subcategory Dimension task to the Load Product Dimension task.
• In the data flow for the Load Product Dimension data flow task, add an OLE DB source named Staged Product Data that uses the localhost.Staging connection manager and uses the following Transact-SQL command to retrieve product data.

SELECT DISTINCT ProductSubcategoryBusinessKey, ProductBusinessKey, ProductName,
StandardCost, Color, ListPrice, Size, Weight, Description
FROM dbo.InternetSales
UNION
SELECT DISTINCT ProductSubcategoryBusinessKey, ProductBusinessKey, ProductName,
StandardCost, Color, ListPrice, Size, Weight, Description
FROM dbo.ResellerSales

Task 3: Add a Lookup transformation to find parent subcategory keys
• In the data flow for the Load Product Dimension data flow task, add a Lookup transformation named Lookup Parent Subcategory, connect the output data flow path from the Staged Product Data source to it, and set the following configuration properties in the Lookup Transformation Editor dialog box:
  • On the General tab, in the Specify how to handle rows with no matching entries drop-down list, ensure Fail component is selected.
  • On the Connection tab, in the OLE DB connection manager drop-down list, select localhost.AWDataWarehouse; and in the Use a table or a view drop-down list, select [dbo].[DimProductSubcategory].
  • On the Columns tab, drag the ProductSubcategoryBusinessKey column in the Available Input Columns list to the ProductSubcategoryAlternateKey column in the Available lookup columns list. Then in the Available lookup columns list, select the checkbox for the ProductSubcategoryKey column to specify the following lookup columns.

    Lookup Column         | Lookup Operation | Output Alias
    ProductSubcategoryKey |                  | ProductSubcategoryKey

Task 4: Add a Lookup transformation to find existing product records
• In the data flow for the Load Product Dimension data flow task, add a Lookup transformation named Lookup Existing Products, connect the Lookup Match Output data flow path from the Lookup Parent Subcategory transformation to it, and set the following configuration properties in the Lookup Transformation Editor dialog box:
  • On the General tab, in the Specify how to handle rows with no matching entries drop-down list, select Redirect rows to no match output.
  • On the Connection tab, in the OLE DB connection manager drop-down list, select localhost.AWDataWarehouse; and in the Use a table or a view drop-down list, select [dbo].[DimProduct].


  • On the Columns tab, drag the ProductBusinessKey column in the Available Input Columns list to the ProductAlternateKey column in the Available lookup columns list. Then in the Available lookup columns list, select the checkbox for the ProductKey column to create the following lookup columns.

    Lookup Column | Lookup Operation | Output Alias
    ProductKey    |                  | ProductKey

Task 5: Add a destination to load new product records into the data warehouse
• In the data flow for the Load Product Dimension data flow task, add an OLE DB destination for SQL Server that uses the localhost.AWDataWarehouse connection manager, and name it New Products.
• Connect the Lookup No Match data flow path from the Lookup Existing Products transformation to the New Products destination.
• Set the following configuration properties for the New Products destination.

    Property                  | Setting
    OLE DB connection manager | localhost.AWDataWarehouse
    Data access mode          | Table or view – fast load
    Name of the table or view | [dbo].[DimProduct]

  Mappings:

    Input Column          | Destination Column
    ProductBusinessKey    | ProductAlternateKey
    ProductName           | ProductName
    StandardCost          | StandardCost
    Color                 | Color
    ListPrice             | ListPrice
    Size                  | Size
    Weight                | Weight
    Description           | Description
    ProductSubcategoryKey | ProductSubcategoryKey
    ProductKey            |

Task 6: Add an OLE DB Command transformation to update existing product records
• In the data flow for the Load Product Dimension data flow task, add an OLE DB Command transformation named Update Existing Products.
• Connect the Lookup Match Output data flow path from the Lookup Existing Products transformation to the Update Existing Products transformation.


• Set the following configuration properties for the Update Existing Products transformation.

  Connection Manager: localhost.AWDataWarehouse

  SqlCommand (on the Component Properties tab):

UPDATE dbo.DimProduct
SET ProductName = ?, StandardCost = ?, Color = ?, ListPrice = ?, Size = ?,
Weight = ?, Description = ?, ProductSubcategoryKey = ?
WHERE ProductKey = ?

  Column Mappings:

    Input Column          | Destination Column
    ProductName           | Param_0
    StandardCost          | Param_1
    Color                 | Param_2
    ListPrice             | Param_3
    Size                  | Param_4
    Weight                | Param_5
    Description           | Param_6
    ProductSubcategoryKey | Param_7
    ProductKey            | Param_8

• View the Load Product Dimension data flow and ensure that it looks like the following.


Task 7: Test the package
• With the Load Product Dimension data flow visible, start debugging the package and verify that all rows flow to the New Products destination (because the data warehouse contained no existing product records). When package execution is complete, stop debugging.
• Debug the package again and verify that all rows flow to the Update Existing Products transformation this time (because all staged product records were loaded to the data warehouse during the previous execution, so they all match existing records in the data warehouse). When package execution is complete, stop debugging and close SQL Server Data Tools.

Results: After this exercise, you should have an SSIS package that uses a Lookup transformation to determine whether product records already exist, and updates them or inserts them as required.


Exercise 3: Implementing a Slowly Changing Dimension

Scenario
You have an existing SSIS package that uses a Slowly Changing Dimension transformation to load reseller dimension records into a data warehouse. You want to examine this package and then create a new package that uses a Slowly Changing Dimension transformation to load customer dimension records into the data warehouse.

The main tasks for this exercise are as follows:
1. Execute a package to load a non-changing dimension.
2. Examine an existing package for a slowly changing dimension.
3. Add a Slowly Changing Dimension transformation to a data flow.
4. Test the package.

Task 1: Execute a package to load a non-changing dimension
• In the D:\10777A\Labfiles\Lab07B\Starter\Ex3 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools.
• Open the Load Geography Data.dtsx package and review the control flow and data flow defined in it. This package includes a simple data flow to load staged geography data into the data warehouse. Only new rows are loaded, and rows that match existing data in the data warehouse are discarded.
• Start debugging and observe the package execution as it loads geography data into the data warehouse. When package execution has completed, stop debugging.

Task 2: Examine an existing package for a slowly changing dimension
• Open the Load Reseller Data.dtsx SSIS package.
• Examine the data flow for the Load Reseller Dimension task, and note the following features of the data flow:
  • The Staged Reseller Data source extracts data from the Resellers table in the Staging database.
  • The Lookup Geography Key transformation looks up the geography key for the reseller in the DimGeography table in the AWDataWarehouse database.
  • The Reseller SCD transformation is a slowly changing dimension transformation that has generated the remaining transformations and destinations. You can double-click the Reseller SCD transformation to view the wizard used to configure the slowly changing dimension, and then click Cancel to avoid making any unintentional changes.
  • The Reseller SCD transformation maps the ResellerBusinessKey input column to the ResellerAlternateKey dimension column and uses it as a business key to identify existing records.
  • The Reseller SCD transformation treats AddressLine1, AddressLine2, BusinessType, GeographyKey, and NumberEmployees as historical attributes; Phone and ResellerName as changing attributes; and YearOpened as a fixed attribute.
• Start debugging and observe the data flow as the dimension is loaded. When package execution is complete, stop debugging.


Task 3: Add a Slowly Changing Dimension transformation to a data flow
• Open the Load Customer Data.dtsx SSIS package and view the data flow for the Load Customer Dimension task. Note that the data flow already contains a source named Staged Customer Data, which extracts customer data from the Staging database, and a Lookup transformation named Lookup Geography Key, which retrieves a GeographyKey value from the AWDataWarehouse database.
• Add a Slowly Changing Dimension transformation named Customer SCD to the data flow and connect the Lookup Match Output data flow path from the Lookup Geography Key transformation to the Customer SCD transformation.
• Use the Slowly Changing Dimension wizard to set the following configuration settings:
  • On the Select a Dimension Table and Keys page, in the Connection manager drop-down list, select localhost.AWDataWarehouse, and in the Table or view drop-down list, select [dbo].[DimCustomer]. Then specify the following column mappings.

    Input Columns       | Dimension Columns    | Key Type
    AddressLine1        | AddressLine1         | Not a key column
    AddressLine2        | AddressLine2         | Not a key column
    BirthDate           | BirthDate            | Not a key column
    CommuteDistance     | CommuteDistance      | Not a key column
                        | CurrentRecord        |
    CustomerBusinessKey | CustomerAlternateKey | Business key
    EmailAddress        | EmailAddress         | Not a key column
    FirstName           | FirstName            | Not a key column
    Gender              | Gender               | Not a key column
    GeographyKey        | GeographyKey         | Not a key column
    HouseOwnerFlag      | HouseOwnerFlag       | Not a key column
    LastName            | LastName             | Not a key column
    MaritalStatus       | MaritalStatus        | Not a key column
    MiddleName          | MiddleName           | Not a key column
    NumberCarsOwned     | NumberCarsOwned      | Not a key column
    Occupation          | Occupation           | Not a key column
    Phone               | Phone                | Not a key column
    Suffix              | Suffix               | Not a key column
    Title               | Title                | Not a key column


  • On the Slowly Changing Dimension Columns page, specify the following change types.

    Dimension Columns | Change Type
    AddressLine1      | Historical attribute
    AddressLine2      | Historical attribute
    BirthDate         | Changing attribute
    CommuteDistance   | Historical attribute
    EmailAddress      | Changing attribute
    FirstName         | Changing attribute
    Gender            | Historical attribute
    GeographyKey      | Historical attribute
    HouseOwnerFlag    | Historical attribute
    LastName          | Changing attribute
    MaritalStatus     | Historical attribute
    MiddleName        | Changing attribute
    NumberCarsOwned   | Historical attribute
    Occupation        | Historical attribute
    Phone             | Changing attribute
    Suffix            | Changing attribute
    Title             | Changing attribute

  • On the Fixed and Changing Attribute Options page, leave both options unselected.
  • On the Historical Attribute Options page, select Use a single column to show current and expired records. Then, in the Column to indicate current record drop-down list, select CurrentRecord, in the Value when current drop-down list, select True, and in the Expiration value drop-down list, select False.
  • On the Inferred Dimension Members page, uncheck the Enable inferred member support option.


• When you have finished the wizard, view the Load Customer Dimension data flow and ensure that it looks like the following image.

Task 4: Test the package
• Debug the package and verify that all rows pass through the New Output data flow path. When package execution is complete, stop debugging.
• Debug the package again and verify that no rows pass through the New Output data flow path, because all rows already exist and no changes have been made. When package execution is complete, stop debugging.
• Use SQL Server Management Studio to execute the Update Customers.sql script in the localhost instance of the database engine. This script updates two records in the staging database, changing one customer's phone number and another customer's marital status.
• In SQL Server Data Tools, debug the package again and verify that one row passes through the Historical Attribute Inserts Output data flow path, and one row passes through the Changing Attributes Updates Output. When package execution is complete, stop debugging.
• Close SQL Server Data Tools.

Results: After this exercise, you should have an SSIS package that uses a Slowly Changing Dimension transformation to load data into a dimension table.


Exercise 4: Using a MERGE Statement to Load Fact Data

Scenario
Your staging database is located on the same server as the data warehouse, and you want to take advantage of this colocation of data and use the MERGE statement to insert and update staged data into the Internet sales fact table. An existing package already uses this technique to load data into the reseller sales fact table.

The main tasks for this exercise are as follows:
1. Examine an existing package.
2. Create a package that merges Internet sales data.
3. Test the package.

Task 1: Examine an existing package
• In the D:\10777A\Labfiles\Lab07B\Starter\Ex4 folder, double-click AdventureWorksETL.sln to open the solution in SQL Server Data Tools. Then open the Load Reseller Sales Data.dtsx SSIS package.
• Examine the configuration of the Merge Reseller Sales task and note the following details.
  • The task uses the localhost.Staging connection manager to connect to the Staging database.
  • The task executes a Transact-SQL MERGE statement that retrieves reseller sales and related dimension keys from the Staging and AWDataWarehouse databases, matches these records with the FactResellerSales table based on the SalesOrderNumber and SalesOrderLineNumber columns, updates rows that match, and inserts new records for rows that do not match.
• Start debugging to run the package and load the reseller data. When package execution is complete, stop debugging.

Task 2: Create a package that merges Internet sales data
• Add a new SSIS package named Load Internet Sales Data.dtsx.
• Add an Execute SQL Task named Merge Internet Sales Data to the control flow of the Load Internet Sales Data.dtsx package.
• Configure the Merge Internet Sales Data task to use the localhost.Staging connection manager and execute a MERGE statement that retrieves Internet sales and related dimension keys from the Staging and AWDataWarehouse databases, matches these records with the FactInternetSales table based on the SalesOrderNumber and SalesOrderLineNumber columns, updates rows that match, and inserts new records for rows that do not match. You can use the code in the Merge Internet Sales.sql script file in the D:\10777A\Labfiles\Lab07B\Starter\Ex4 folder to accomplish this.

Task 3: Test the package
• View the control flow tab and start debugging the package, observing the execution of the Merge Internet Sales Data task. When execution is complete, stop debugging.
• Close SQL Server Data Tools.

Results: After this exercise, you should have an SSIS package that uses an Execute SQL task to execute a MERGE statement that inserts or updates data in a fact table.


Module Review and Takeaways
Review Questions
1. What should you consider when choosing between Change Data Capture and Change Tracking?
2. What should you consider when deciding whether or not to use the MERGE statement to load staging data into a data warehouse?


Module 8
Incorporating Data from the Cloud into a Data Warehouse
Contents:
Lesson 1: Overview of Cloud Data Sources 8-3
Lesson 2: SQL Azure 8-9
Lesson 3: The Windows Azure Marketplace DataMarket 8-19
Lab: Using Cloud Data in a Data Warehouse Solution 8-26


Module Overview
Cloud computing is becoming increasingly common in organizations. Typically, only parts of an organization's infrastructure are moved to the cloud, so it is important to learn how to connect your local resources with cloud-based services. In this module, you will learn how you can use cloud computing in your data warehouse infrastructure and about the tools and services available from Microsoft.
After completing this module, you will be able to:
• Describe cloud data scenarios.
• Describe Microsoft® SQL Azure.
• Describe the Microsoft Windows Azure Marketplace DataMarket.


Lesson 1
Overview of Cloud Data Sources
After completing this lesson, you will be able to:
• Describe scenarios for cloud data.
• Describe the Microsoft cloud platform for data.
• Describe how cloud data can be incorporated into a BI solution.


Cloud Data Scenarios
Cloud computing is the delivery of Internet-based infrastructure, application, and data services that would traditionally have been hosted within an organization. The services are typically paid for with a monthly subscription, which allows costs to increase and decrease in line with usage. In data warehouse scenarios there are several uses of cloud computing, including line-of-business databases, datasets, and data services, and these solutions can be deployed as public cloud, private cloud, or hybrid cloud.
Application Databases
Application databases are online transaction processing (OLTP) databases used by line-of-business applications. Cloud-based databases have several advantages and disadvantages when compared to on-premise databases:
• Advantages:
  • Cloud-based databases remove the initial costs of buying hardware and licensing software.
  • The cost of upgrading hardware if capacity or performance limits are reached is removed when using a cloud-based solution.
  • The cost of upgrading software licenses when versions change is removed when using a cloud-based solution.
  • If capacity fluctuates and data volumes go up and down, the monthly cost of the cloud-based database changes in line with database volumes. Being charged for what you use on a monthly basis removes the need to plan capacity based on maximum usage, which typically results in hardware that exceeds requirements for most of the time.


• Disadvantages:
  • After you have paid for the initial cost of hardware and software, the ongoing costs of an on-premise database are low, but with a cloud-based solution the monthly costs are ongoing.
  • You have complete control over the hardware configuration if you use an on-premise database, but you do not have this level of control with a cloud-based solution.
  • You may be reluctant to trust the security of a third party with your sensitive data.
  • You may be concerned that you might not meet legal and industry compliance regulations if your data is stored in an unknown country.
  • You must have a consistent and reliable Internet connection to use cloud services.
Third-Party Data
When building a data warehouse, you typically aim to have as much data as possible to improve your analysis. Using cloud-based datasets, this data can be supplemented with third-party datasets. Cloud-based datasets are incredibly diverse and include meteorological data, stock market data, and industrial production data. For example, if you manufacture hats, you might find that warmer, drier months lead to sales of different hats than cooler, wetter months. You might also speculate that people consider your hats to be a luxury purchase, so sales probably go down when the economy is less robust. By purchasing historical meteorological and share price data, you can analyze your sales using these variables and assess the exact impact of the weather and the economy on your sales. This allows you to plan production more effectively for the future.
In addition to providing datasets that can supplement your own organization's data, third-party datasets can be useful for performing data validation and cleansing. For example, you could subscribe to a dataset that is provided by a postal service and contains address information, and use this data to look up missing postal codes based on street addresses or validate customer addresses.
Note: A public cloud solution moves all of your processing and data to a cloud provider. A private cloud solution uses cloud methods and technologies, such as virtualization, but hosts them on-premise on company infrastructure. A hybrid solution is a combination of public cloud and private cloud resources. In reality, most solutions are hybrid cloud to a greater or lesser extent. With a hybrid cloud, sensitive data can be kept local and infrastructure that has already been purchased can continue to be used. Data that is less sensitive, services that you cannot provide internally, and additional data to improve analysis can be purchased from a cloud provider.


Microsoft Cloud Platform for Data
SQL Azure
SQL Azure is a Microsoft cloud-based database platform built on SQL Server technologies. It is a highly scalable solution that is built, hosted, and maintained by Microsoft. High availability and fault tolerance are built into the SQL Azure solution with no setup or administration from the user.
Because SQL Azure is built on SQL Server technologies, you can administer and develop your databases through familiar tools including SQL Server Management Studio, PowerPivot, and SQL Server Integration Services.
SQL Azure Data Sync
Microsoft SQL Azure Data Sync provides one-way and two-way synchronization between cloud-based and on-site databases, or between two cloud-based databases.
SQL Azure Reporting
Microsoft SQL Azure Reporting provides a cloud-based reporting solution for all SQL Azure subscribers. Reports can be created using SQL Server Data Tools and can be viewed through the Azure Portal, exported in multiple formats, or embedded in applications.
For More Information: For more information about SQL Azure, see http://go.microsoft.com/fwlink/?LinkID=246725.


Windows Azure Marketplace DataMarket
The Windows Azure Marketplace DataMarket is a portal to subscribe to cloud-based datasets and data services. After you have registered with the Azure Marketplace, you can subscribe to a provider with a few clicks. You then have access to sophisticated validation services and many datasets providing you with additional context for your business intelligence data. Many of the services are free, free for a certain number of uses a month, or offer a free trial period.
For More Information: For more information about the Windows Azure Marketplace, see http://go.microsoft.com/fwlink/?LinkID=246726.


Cloud Data and Services in the BI Ecosystem
Cloud-based services can replace or supplement many of the components of a BI infrastructure, including data sources, data cleansing, and reporting. You can implement whichever cloud-based components fit your needs.
Examples of ways in which you can incorporate cloud data into a BI solution include:
• Using SQL Server Integration Services to extract data from SQL Azure databases and load it into the data warehouse.
• Using the SQL Azure Data Sync service to synchronize data in SQL Azure and an on-premise database that provides source data for a data warehouse.
• Using a dataset from the Azure Marketplace in a SQL Server Data Quality Services (DQS) knowledge base (KB), and using it to cleanse and validate data.
• Using the SQL Azure Data Sync service to synchronize a subset of data in a data warehouse to SQL Azure and using SQL Azure Reporting to enable users to view business information in the resulting cloud-based data mart.


Lesson 2
SQL Azure
SQL Azure is the Microsoft cloud-based database offering. In this lesson, you will learn about SQL Azure and see the similarities between using SQL Azure and using SQL Server.
After completing this lesson, you will be able to:
• Describe SQL Azure.
• Explain the differences between SQL Azure and SQL Server.
• Describe the topology of SQL Azure.
• Configure the Windows Azure firewall.
• Connect to SQL Azure.
• Connect to SQL Azure using SQL Server Management Studio.
• Use SQL Azure as a data source in your BI ecosystem.


Getting Started with SQL Azure
Initially, you can start a free trial of SQL Azure to test the functionality. This requires a Windows Live ID, contact information, and credit card information in case you exceed the usage limits.
To fully sign up for SQL Azure, you have several options. You can choose to pay by credit card or invoice, you can choose pay-as-you-go or subscription tariffs, and you can get offers for subscribers to services such as MSDN. Tools are provided to choose the best offer for your needs and to estimate cost savings over a traditional, non-cloud-based infrastructure.
For More Information: For more information about the Windows Azure 3 Month free trial, see http://go.microsoft.com/fwlink/?LinkID=246727.
After you have created your account, you can create a new server instance and complete the configuration in the Management Portal at https://windows.azure.com/.
To configure your SQL Azure server and create your first database, follow these steps:
1. From the Management Portal homepage, click the Database link or click New Database Server in Common Tasks.
2. Select a region where you would like your database server hosted.
   Note: You should be aware of legal implications, compliance, and network latency when choosing where to site your server.
3. Supply a username and password for the server administrator.
4. Add a firewall rule to at least allow your computer to connect and, optionally, to allow any other clients that will need to access the server in the future.


5. In the Database section of the toolbar, click Create to create a new database.
6. Provide the name, edition, and size that you require for your database.
7. Select your new database and, in the Database section of the toolbar, click Manage to configure your new database. Alternatively, you can use SQL Server Management Studio to manage your database. Simply note down the Fully Qualified DNS Name and use this as the Server Name when you connect to a database instance, and add @servername to the end of your username.
For More Information: For more information about the Windows Azure Purchase Options, see http://go.microsoft.com/fwlink/?LinkID=246728.


Comparing SQL Azure with SQL Server
SQL Azure provides the same inputs and outputs as SQL Server, and therefore you can continue to use the same clients and many of the same administration tools.
The core difference in administration is the separation of physical administration and logical administration. Physical administration is the administration of the hardware, including servers, processing, and storage. Logical administration includes most SQL Server administration tasks, including databases, users, and logins. Physical administration is provided by Microsoft and you do not need to configure the physical aspects of your system. Logical administration remains your responsibility.
Because of the separation between physical and logical administration, there are some differences in most aspects of SQL Server, including administration, programming, and Transact-SQL, but these are very minor and would not typically require any changes to your practices.
For More Information: For more information about Guidelines and Limitations (SQL Azure Database), see http://go.microsoft.com/fwlink/?LinkID=246729.
Administrators are still responsible for all of the logical administration; in addition to basic tasks such as table creation, they should maintain indexes, administer security, and optimize queries in the same way that they did with SQL Server. Physical aspects such as file group and partition location are not available and, furthermore, load balancing, high availability, and backup are all automatically performed by SQL Azure. Features that reference physical hardware, such as Resource Governor or SQL Server Profiler, are not available when managing SQL Azure servers.


Provisioning of new databases with SQL Azure is a simple process of adding a database to your subscription, and there is no need to provision the hardware and software.
For More Information: For more information about the Windows SQL Azure Provisioning Model, see http://go.microsoft.com/fwlink/?LinkID=246730.
SQL Azure does not include Microsoft SQL Server Analysis Services, but SQL Azure can be used as an Analysis Services data source. SQL Azure also does not include Service Broker or Replication, although there is SQL Azure Data Sync to provide replication capabilities.
For More Information: For more information about how to compare SQL Server with SQL Azure, see http://go.microsoft.com/fwlink/?LinkID=246731.


Topology of SQL Azure
From the perspective of a client application, SQL Azure is the same as SQL Server. Applications connect with familiar libraries such as ODBC and ADO.NET over TCP/IP. They use Transact-SQL to submit queries and receive results as a tabular data stream (TDS), and therefore no application modifications are required. The only noticeable differences are that Windows authentication is not supported and that the username is passed in the format username@servername.
The queries are sent to a load balancer, which then passes the query on to the SQL Azure instance. Load balancing, replication, and high availability are all handled by the SQL Azure back-end.


Managing Firewall Settings
Although client applications can connect to SQL Azure in much the same way as SQL Server, there are additional security measures required for a cloud-based solution.
Each server running SQL Azure has firewall rules to restrict client access based on the client IP address. The default setting is to allow no access to the server from any client. By using the Windows Azure Platform Management Portal, you can define IP address ranges that are allowed to connect and whether or not to allow Windows Azure applications to connect to your SQL Azure server. Examples of valid ranges include:
• 0.0.0.0 to 0.0.0.0 – This range enables applications and services in Windows Azure to access the database server. For example, you can use this range to enable access to the database from a Windows Azure Web application you have created or from SQL Azure Reporting.
• 0.0.0.0 to 255.255.255.255 – This range enables access from any computer on the Internet. While creating a firewall rule for this range makes it easier to test your database functionality, you should avoid using this range in production systems if possible.
After you have enabled IP addresses, you can connect from these computers and use the sp_set_firewall_rule stored procedure in the master database to further configure firewall rules, as well as using the Windows Azure Platform Management Portal for firewall administration (a brief Transact-SQL sketch appears at the end of this topic).
In addition to the SQL Azure firewall, you should also ensure that your own firewalls allow outgoing TCP connections on port 1433. If they do not, connections will be blocked before they reach the Internet, but you might mistakenly think that the problem is caused by Windows Azure firewall settings.
For More Information: For more information about the SQL Azure Firewall, see http://go.microsoft.com/fwlink/?LinkID=246732.
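As a sketch of the Transact-SQL approach mentioned above, the following statements manage server-level firewall rules from the master database. The rule name and IP range are placeholder values, and the procedure and view names (sp_set_firewall_rule, sp_delete_firewall_rule, sys.firewall_rules) reflect the documented SQL Azure firewall objects rather than anything specific to this course's labs.

    -- Run while connected to the master database of the SQL Azure server.
    -- Placeholder rule name and IP range; adjust to your own client addresses.
    EXEC sp_set_firewall_rule
        @name = N'OfficeNetwork',
        @start_ip_address = '203.0.113.1',
        @end_ip_address = '203.0.113.254';

    -- Review the current server-level rules.
    SELECT name, start_ip_address, end_ip_address
    FROM sys.firewall_rules;

    -- Remove a rule that is no longer required.
    EXEC sp_delete_firewall_rule @name = N'OfficeNetwork';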


Connecting to SQL Azure
After you have configured your own firewall and the SQL Azure firewall to allow your client computers to connect to SQL Azure, they are ready to access data.
SQL Azure provides two models for data access. You can connect your on-premise client applications to SQL Azure, or you can port your application code to Windows Azure.
If you port your application code to Windows Azure, the SQL Azure firewall must be configured to allow access from Windows Azure applications on the 0.0.0.0 to 0.0.0.0 IP address range. One advantage of hosting your applications on Windows Azure is that it reduces network latency between the application and SQL Azure when compared to a local application connecting to SQL Azure.
If you choose to use on-premise applications, the data is sent between SQL Azure and the applications using TDS over secure sockets layer (SSL) to encrypt your data. While query performance is likely to be improved by a move to SQL Azure due to the improved hardware and load balancing, the physical distance separating the application and database is likely to increase latency. Latency should be considered when developing applications, and time-outs increased accordingly.
For More Information: For more information about SQL Azure Data Access, see http://go.microsoft.com/fwlink/?LinkID=246733.
In a BI infrastructure, you use SQL Azure in the same way that you would use an on-premise SQL Server database. Assuming that you have configured the firewall settings, you simply pass the connection information and access SQL Azure.


SQL Server Management Studio and SQL Azure
Most administration tasks using SQL Server Management Studio are exactly the same when using SQL Azure as they are when using SQL Server. There are some minor differences, which are listed here:
• The server-level security role for creating databases is dbmanager rather than dbcreator, and the server-level security role for creating logins is loginmanager rather than securityadmin.
• You cannot switch database with the USE command in SQL Azure. Instead, you must create a new connection to the target database.
• Most hardware-focused features, such as Resource Governor, are not available.
• Many tasks that can be performed using a graphical tool in SQL Server require Transact-SQL in SQL Azure. For example, creating a table or a user must be performed by using Transact-SQL (a brief sketch appears below).
For More Information: For more information about Administration (SQL Azure Database), see http://go.microsoft.com/fwlink/?LinkID=246734.
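The following is a brief sketch of the kind of Transact-SQL referred to in the last bullet above. The login, user, and table names are placeholder values, and the statements assume you connect as the server administrator (or as a member of the loginmanager role for the login). Because USE is not supported, the CREATE LOGIN statement runs on a connection to the master database and the remaining statements run on a separate connection to the user database.

    -- Sketch with placeholder names; not taken from the course lab files.
    -- Connection 1: the master database.
    CREATE LOGIN ReportReader WITH PASSWORD = 'Str0ng_Pa$$w0rd';

    -- Connection 2: the user database (for example, TreyResearch).
    CREATE USER ReportReader FOR LOGIN ReportReader;

    CREATE TABLE dbo.SurveyResponse
    (
        ResponseID int NOT NULL PRIMARY KEY,   -- the primary key provides the clustered index
        QuestionID int NOT NULL,
        ResponseText nvarchar(200) NULL
    );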


Using SQL Azure As a Data Source for a Data Warehouse
You can use SQL Azure just as you would any other data source. Therefore, you can use SSIS to transfer data from SQL Azure to a staging database. If you have followed the steps to configure the firewall and connect to the SQL Azure instance, then no additional configuration steps are required.
Another way to use SQL Azure as a source for a data warehousing solution is to synchronize it with an on-premise database, which could be a staging database. You can use SQL Azure Data Sync to synchronize multiple SQL Server and SQL Azure databases as long as at least one of them is a database in SQL Azure that can serve as a synchronization hub.
With Data Sync, you can define columns and filters on the data so that you only need to synchronize the data that you require. You can configure Data Sync to synchronize from the hub, to the hub, or with bi-directional synchronization, and you can configure how conflicts are resolved if they occur, by choosing whether data in the hub or the client should persist. Synchronization schedules can be configured from minutes to months and, therefore, you have a fine degree of control over your synchronization latency.
For More Information: For more information about SQL Azure Data Sync, see http://go.microsoft.com/fwlink/?LinkID=246735.


Lesson 3
The Windows Azure Marketplace DataMarket
The Windows Azure Marketplace DataMarket provides a portal for cloud-based data and services. In this lesson, you will see the benefits of the Windows Azure Marketplace DataMarket and how it can be used in your BI ecosystem.
After completing this lesson, you will be able to:
• Describe the Windows Azure Marketplace.
• Describe the data scenarios that can benefit from the Windows Azure Marketplace DataMarket.
• View and acquire data through the Windows Azure Marketplace DataMarket.
• Use the Windows Azure Marketplace DataMarket Add-In for Excel.
• Describe how to configure applications to connect to the Windows Azure Marketplace DataMarket.


What Is the Windows Azure Marketplace?
There are vast amounts of data available, both commercially and for free, on the Internet. This data can be extremely useful for organizations; however, it can be hard to find, is poorly regulated, and can be in many different formats. It can be difficult for an organization to find, accurately compare, and then use the data available for a particular topic. Even data sets that are in commonly used formats such as XML can have different schemas, and the workload to find, assess, and integrate the data can outweigh the benefits that the data itself provides.
The Windows Azure Marketplace is an online portal to buy and sell cloud-based applications and data sets. Although the data is provided by third parties, all invoicing is through the Azure Marketplace, and data is in a consistent format that is compatible with SQL Server. In addition, the portal provides tools to monitor usage. Furthermore, Microsoft validates all data publishers before any data is published.


Windows Azure Marketplace DataMarket Data Scenarios
The Windows Azure Marketplace DataMarket can be used in two core scenarios for data warehousing – datasets and data cleansing services.
Datasets
There are many datasets available on the Azure data market, and they can be integrated into your data warehousing solution to provide richer analysis. Often these resources are freely available from government agencies providing demographic information on many topics. By extending sources of data for reporting and analysis, trends and correlations can be found that would otherwise be invisible.
You can import datasets from the Windows Azure Marketplace DataMarket directly into your data warehousing ETL processes, or you can review the datasets in Microsoft Excel® and then save the data you need as a .csv file for incorporation into an ETL process.
Data Cleansing and Validation
Data cleansing is crucial for accurate BI analysis, and the features in SQL Server Data Quality Services (DQS) are invaluable for ensuring the quality of your data. In many scenarios, data quality can be improved in house; for example, some people in your organization might be able to identify that NY and New York are likely to be the same location, or that KY11 8JJ is a postal code in Dunfermline, Scotland. However, very few people would be able to validate every postal code of every customer address. Third-party data cleansing services can be integrated into your DQS solution and provide a level of specialization which would be unlikely, if not impossible, inside your organization.
For More Information: For more information about using DQS and Windows Azure Marketplace DataMarket data cleansing tools, see Module 9: Enforcing Data Quality.


Acquiring and Viewing Data
The range of data and services is straightforward to search through on the DataMarket web site at https://datamarket.azure.com. There is also a data market catalog that you can download if you would prefer an offline copy.
For More Information: For more information about the Data Market Catalog, see http://go.microsoft.com/fwlink/?LinkID=217152.
To acquire data from the Windows Azure Marketplace DataMarket, follow these steps:
1. Sign in with your Windows Live ID.
2. Explore the Windows Azure Marketplace DataMarket and find the dataset that you require.
3. Sign up to the subscription that meets your needs.
   Note: Many providers have a trial subscription or a free subscription with a limited number of transactions per month, enabling you to assess the usefulness of the data before you have to pay for it.
4. Follow the prompts online to run queries, view results as datasets or charts, and export data.
After you have subscribed to the data you need, you can access the data from within Microsoft Excel, Microsoft Visual Studio®, or using third-party tools.
For More Information: For more information about the Windows Azure Marketplace DataMarket add-in for Excel, see the next topic.


The Windows Azure Marketplace DataMarket Add-in for Excel
The Windows Azure Marketplace DataMarket add-in for Excel 2010 enables you to discover, subscribe to, and use Azure Marketplace datasets directly from Microsoft Excel 2010.
The add-in adds an Import data from DataMarket button to the Data ribbon.
Note: To install the Microsoft Windows Azure Marketplace DataMarket Add-in for Excel Community Technology Preview 2 (CTP 2), see http://go.microsoft.com/fwlink/?LinkID=246736.
When you click the Import data from DataMarket button, you can sign in to the Azure Marketplace using your Windows Live ID and then give the add-in access to your Marketplace account. You can then see all of the subscriptions that you have purchased. There is also a Browse button that launches a browser window to search for, and purchase, additional subscriptions.
Each of the purchased subscriptions also has an Import data link that launches the Query Builder, which enables you to filter the data to your requirements and then import the data to Excel.
For More Information: For more information about the Windows Azure Marketplace DataMarket Add-In for Excel, see http://go.microsoft.com/fwlink/?LinkID=246738.


Accessing the Windows Azure Marketplace DataMarket from Client Applications
In addition to importing data into your data warehouse or into Excel, you can also use Azure Marketplace data in your applications. When you have subscribed to a data source, there is a link to use Add Service Reference in Visual Studio to create client classes. This will create the whole object model for you in Visual Studio.
To use this in a .NET application, you will need to pass network credentials. The username for the network credentials can be anything you want, but the password must be the account key for the Azure Marketplace account. To find the account key, go to the Azure Marketplace, click the My Account tab, click Account Keys, and either copy an existing account key or add a new account key.
Each time you use the application, it passes the query to the provider on the Azure Marketplace and, therefore, you have completely up-to-date information.
For More Information: For more information about Building Applications with the Windows Azure DataMarket, see http://go.microsoft.com/fwlink/?LinkID=246737.


Lab Scenario
In this lab, you will incorporate data from the cloud into the Adventure Works data warehousing solution.
The Adventure Works Cycles marketing department has contracted a market research firm named Trey Research to conduct a survey. You have decided to create a cloud-based database in SQL Azure so that Trey Research employees can enter the results of the research without requiring access to on-premise Adventure Works systems.
When the research results are available, you must extract the data in SQL Azure during the ETL processes for the Adventure Works data warehouse.
The marketing department has also requested demographic data about population by gender in the international sales territories in which Adventure Works operates. You plan to obtain this data from the Windows Azure Marketplace DataMarket.


Lab 8: Using Cloud Data in a Data Warehouse Solution
Exercise 1: Creating a SQL Azure Solution
Scenario
Adventure Works Cycles has contracted a third-party market research company named Trey Research to gather data in a market survey. You have decided to provision a SQL Azure database in which Trey Research can store the survey data, and when it is ready you plan to incorporate this data into your data warehousing ETL process.
The main tasks for this exercise are as follows:
1. Prepare the lab environment.
2. Provision a SQL Azure server.
3. Create a database in SQL Azure.
4. Use SQL Server Management Studio to access SQL Azure.
Note: If an Internet connection is available, you can perform this lab with a trial account for the Microsoft Windows Azure platform. If no Internet connection is available, you should use the lab answer key to perform the lab in the provided simulation.

Task 1: Prepare the lab environment
• If a connection to the Internet is available, start the MSL-TMG1 virtual machine and then start the MIA-DC1 virtual machine. When MSL-TMG1 and MIA-DC1 are running, start the MIA-SQLBI virtual machine. Then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
• Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab08\Starter folder as Administrator.


Task 2: Provision a SQL Azure Server
• Start Internet Explorer and navigate to http://windows.azure.com.
• Sign in using your Windows Live credentials, and then click Database in the left hand pane.
• Select your subscription in the left pane, view the subscription information, and then create a new server in any available region.
• Enter the following administrator credentials for the new server:
  • Administrator Login: Student10777A
  • Password: Pa$$w0rd
  • Confirm password: Pa$$w0rd
• Add the following firewall rule:
  • Rule name: All Internet
  • IP range start: 0.0.0.0
  • IP range end: 255.255.255.255
• Allow other Windows Azure services to access this server.
• Expand your subscription and click the server that was created to view its details.

Task 3: Create a Database in SQL Azure
• Create a new database named TreyResearch.
• In the Properties pane to the right, view the properties of the server. Then copy the Fully Qualified DNS Name to the clipboard. (You may be prompted to allow the Silverlight application to access the clipboard.)

Task 4: Use SQL Server Management Studio to access SQL Azure
• Start SQL Server Management Studio and paste the Fully Qualified DNS Name of the SQL Azure server (which should still be on the clipboard) in the Server name box.
• Log in using SQL Server authentication as Student10777A with a password of Pa$$w0rd.
• Open the Create Marketing Data.sql script file in the D:\10777A\Labfiles\Lab08\Starter folder. This script creates a number of tables and populates them with some data.
• In the drop-down list of databases, select TreyResearch. Then execute the script.
• In a new query window, run a query to select all of the data from the dbo.AWMarketResearch table (a simple example query follows the exercise results below), and review the data that the script inserted.
• Note that the table contains a list of market research questions and the most common answers from respondents in a variety of market segments.
Results: After this exercise, you should have provisioned a SQL Azure database, created a table, and loaded it with data.
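The query mentioned in the last step of Task 4 can be as simple as the following; the table name comes from the lab script, and no other assumptions are made.

    -- Review the data inserted by the Create Marketing Data.sql script.
    SELECT *
    FROM dbo.AWMarketResearch;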


Exercise 2: Extracting Data from SQL Azure
Scenario
You need to import the data that is now hosted in SQL Azure into your data warehouse, and you need to create an SSIS package to automate the import.
The main tasks for this exercise are as follows:
1. Create an SSIS connection manager for SQL Azure.
2. Create an SSIS package to extract data from SQL Azure.
3. Test the package.

Task 1: Create an SSIS connection manager for SQL Azure
• Open the AdventureWorksETL.sln solution in the D:\10777A\Labfiles\Lab08\Starter folder with SQL Server Data Tools.
• Create a new OLE DB connection manager with the following properties:
  • In the Server name textbox, paste the fully qualified DNS name of the SQL Azure server you created in the previous exercise (which should still be on the clipboard).
  • Select the Use SQL Server Authentication option. Then in the User name box, type Student10777A, in the Password box type Pa$$w0rd, and select the Save my password checkbox.
  • In the Select or enter a database name drop-down list, select TreyResearch.

Task 2: Create an SSIS Package to extract data from SQL Azure
• Create a new SSIS package called Extract Market Data.dtsx.
• Create a new data flow task called Get Cloud Data.
• In the Get Cloud Data data flow, add an OLE DB source named SQL Azure that uses the connection manager you created previously to extract the contents of the dbo.AWMarketResearch table from the TreyResearch database.
• Add a destination named Staging DB that uses the localhost.Staging connection manager, and connect the data flow output from the SQL Azure source to the Staging DB destination. Map the input columns to the MarketResearch table in the Staging database.

Task 3: Test the package
• Start debugging and observe the package as it executes. When execution is complete, stop debugging.
Results: After this exercise, you should have created and tested an SSIS package to extract data from SQL Azure.


Exercise 3: Obtaining Data from the Windows Azure Marketplace DataMarket
Scenario
You are aware that you could improve your BI analysis if you had more population statistics, but you do not have this data in your organization. You have decided to import data from the Windows Azure Marketplace DataMarket to rectify this situation.
The main tasks for this exercise are as follows:
1. Register a Windows Azure Marketplace account.
2. Subscribe to a dataset.
3. Explore a dataset.
4. Use the Windows Azure Marketplace DataMarket Add-In for Excel.

Task 1: Register a Windows Azure Marketplace account
• Use Internet Explorer to navigate to https://datamarket.azure.com.
• Sign in to the Windows Azure Marketplace with your Windows Live credentials.
• If you have not previously registered a Windows Marketplace account, you must register one now. Enter the required account details, agree to the Azure Marketplace conditions, accept the terms of use, and register.

Task 2: Subscribe to a dataset
• In the Data section, view the Demographics category.
• Search for demographic statistics.
• In the list of results, click the UNSD Demographic Statistics – United Nations Statistics Division dataset.
• Sign up to the dataset and agree to the publisher’s terms and conditions.

Task 3: Explore a dataset
• Click EXPLORE THIS DATASET and run the default query to return a list of data series that you can access in this dataset.
• Note that the data series with an Id value of 1 contains data about population by sex and urban/rural residence.
• In the Build Query pane, specify that the query should return Values. Then configure the query to return values for records with a DataSeriesId value of 1 and run it.


• Note that the data series includes the following columns:
  • DataSeriesId: The unique identifier for the data series to which this record belongs.
  • CountryId: The unique identifier of the country for which this record shows population data (for example, 36 or 250).
  • CountryName: The name of the country for which this record shows population data (for example, Australia or France).
  • Year: The year for which this record shows population data.
  • AreaCode: The unique identifier of the area for which this record shows population data (0, 1, or 2).
  • Area: The description of the area for which this record shows population data (Total, Urban, or Rural).
  • SexCode: The unique identifier of the gender for which this record shows population data (0, 1, or 2).
  • Sex: The description of the gender for which this record shows population data (Both Sexes, Male, or Female).
  • RecordTypeCode: The unique identifier of the type of population record (for example, 9 or 3).
  • RecordType: The description of the type of population record (for example, Estimate, de facto or Census, de facto, complete tabulation).
  • ReliabilityCode: The unique identifier of the level of reliability of the population value in this record (for example, 0 or 1).
  • Reliability: An indication of the level of reliability of the population value in this record (for example, Final figure, complete or Provisional figure).
  • SourceYear: The year the data in this record was recorded.
  • Value: The population value for this record.
  • FootnoteSequenceId: A reference to a footnote for this dataset.
• Add a filter to the query to return only records with a ReliabilityCode of 0 (the reliability code for Final figure, complete), and run it.
• On the Visualize tab, create a chart with the following settings:
  • Chart Type: Line Chart
  • X-Axis Values: Year
  • Y-Axis Values: Value
• Note the formats to which you can export the query results on the Export tab.
• Note the URL that you can use to make a REST-based call for these query results on the Develop tab.
• Close Internet Explorer.


Task 4: Use the Windows Azure Marketplace DataMarket Add-In for Excel
• Start Microsoft Excel 2010, and on the ribbon, on the Data tab, click Import data from DataMarket.
• Sign in with your Windows Live credentials and allow the DataMarket Add-In for Excel to access your Windows Azure Marketplace account.
• Import data from the UNSD Demographic Statistics – United Nations Statistics Division subscribed dataset. Load data from Values and apply the following filters for the query:
  • DataSeriesId [Text] = 1
  • Year [Number] > 2000
  • AreaCode [Number] = 0 (the code for the Total area)
  • SexCode [Number] = 0 (the code for the Both Sexes population records)
  • ReliabilityCode [Number] = 0 (the code for the Final figure, complete population records)
  • CountryId [Number] = 36 (the ID for Australia) or 250 (the ID for France) or 276 (the ID for Germany) or 826 (the ID for United Kingdom of Great Britain and Northern Ireland) or 840 (the ID for United States of America)
• Select and group all of the CountryId filters as a single clause in the query.
• Import the data. Then view the data and save the workbook in CSV (comma-delimited) format as D:\10777A\Labfiles\Lab08\Demographics.csv.
Results: After this exercise, you should have registered an account with the Windows Azure Marketplace DataMarket, subscribed to a dataset, and imported that data using Excel.


Module Review and Takeaways
Review Questions
1. What are some of the considerations you should have if you are considering using cloud services?
2. How much configuration will client applications need to move from using a SQL Server database to using SQL Azure?
3. What are some of the uses of the Windows Azure Marketplace DataMarket in the BI ecosystem?


Module 9
Enforcing Data Quality
Contents:
Lesson 1: Introduction to Data Quality 9-3
Lesson 2: Using Data Quality Services to Cleanse Data 9-13
Lab 9A: Cleansing Data 9-20
Lesson 3: Using Data Quality Services to Match Data 9-29
Lab 9B: Deduplicating Data 9-38


Module Overview
Ensuring the high quality of data is essential if the results of data analysis are to be trusted. Microsoft® SQL Server® 2012 includes Data Quality Services (DQS) to provide a computer-assisted process for cleansing data values and identifying and removing duplicate data entities. This process reduces the workload of the data steward to a minimum while maintaining human interaction to ensure accurate results.
After completing this module, you will be able to:
• Describe how Data Quality Services can help you manage data quality.
• Use Data Quality Services to cleanse your data.
• Use Data Quality Services to match data.


Lesson 1
Introduction to Data Quality
Data quality is a major concern for anyone building a data warehousing solution. In this lesson, you will learn about the kinds of data quality issues that must be addressed in a data warehousing solution, and how SQL Server Data Quality Services can help you address these issues.
After completing this lesson, you will be able to:
• Describe the need for data quality management.
• Describe the features and components of Data Quality Services.
• Describe the features of a knowledge base.
• Describe the features of a domain.
• Explain how reference data can be used in a knowledge base.
• Create a Data Quality Services knowledge base.


What Is Data Quality, and Why Do You Need It?
As organizations consume more data from more data sources, the need for data quality management has become increasingly common in many businesses. Data quality is especially important in a data warehousing solution, because the reports and analysis generated from data in the data warehouse can form the basis of important business decisions. Business users must be able to trust the data they use to make these decisions.
Data Quality Issues
Common data quality issues include:
• Invalid data values – for example, an organization might categorize its stores as “wholesale” or “retail”. However, a user might use an application that allows free-form data entry to create a store with a category of “reseller” instead of “retail”, or they might accidentally type “whalesale” instead of “wholesale”. Any analysis or reporting that aggregates data by store type will then produce inaccurate results because of the additional, invalid categories.
• Inconsistencies – for example, an organization might have an application for managing customer accounts in which US states are stored using two-letter state codes (such as “WA” for Washington), and a second application that stores supplier addresses with a full state name (such as “California”). When data from both of these systems is loaded into the data warehouse, your data warehouse will contain inconsistent values for states.
• Duplicate business entities – for example, a customer relationship management system might contain customer records for Jim Corbin, Jimmy Corbin, James Corbin, and J Corbin. If the address and telephone number for these customers are all the same, then it might be reasonable to assume that all of these records relate to the same customer. Of course, it’s also possible that Jim Corbin has a wife named Jennifer and a son named James, so you must be confident that you have matched the records appropriately before deduplicating the data.


Data Quality Services Overview
Data Quality Services is a knowledge-based solution for managing data quality. With Data Quality Services, you can perform the following kinds of data quality management:
• Data Cleansing – identifying invalid data values and correcting them.
• Data Matching – finding duplicate data entities.
Data Quality Services is installed from the SQL Server 2012 installation media, and consists of the following components:
• Data Quality Services Server – a service that uses a knowledge base to apply data quality rules to data. The server must be installed on the same instance as the data that you wish to analyze. Two SQL Server catalogs are installed, and you can monitor, maintain, back up, and perform other administrative tasks on these from within SQL Server Management Studio. DQS_MAIN includes stored procedures, the DQS engine, and published knowledge bases. DQS_PROJECT includes data that is required for knowledge base management and data quality project activities.
• Data Quality Client – a wizard-based application that data stewards (typically business users) can use to create and manage data quality services knowledge bases and perform data quality services tasks. The client can either be installed on the same computer as the DQS server or used remotely.
• Data Cleansing SSIS Transformation – a data flow transformation for SQL Server Integration Services that you can use to cleanse data as it flows through a data flow pipeline.


What Is a Knowledge Base?
Data Quality Services enables you to improve data quality by creating a knowledge base about the data, and then applying the rules in the knowledge base to perform data cleansing and matching. A knowledge base stores all the knowledge related to a specific aspect of the business. For example, you could maintain one knowledge base for a customer database and another knowledge base for an employee database. Each knowledge base contains:
• Domains that define valid values and correction rules for data fields.
• Matching policies that define rules for identifying duplicate data entities.
Knowledge bases are usually created and maintained by data stewards, who are often business users with particular expertise in a specific area of the business.
Data Quality Services provides a basic knowledge base that includes domains for US address data (such as states and cities), which you can use to learn about data quality services and as a starting point for your own knowledge bases.


What Is a Domain?
Domains are central to a Data Quality Services knowledge base. Each domain identifies the possible values and rules for a data field (that is, a column in a dataset). The values for each domain are categorized as:
• Valid – for example, valid values for a US State domain might include “California” or “CA”.
• Invalid – for example, invalid values for a US State domain might include “8”.
• Error – for example, a common error for a US State domain might be “Calfornia” (with a missing “i”).
Values can be grouped as synonyms. For example, you might group “California”, “CA”, and “Calfornia” as synonyms for California, and you can specify a leading value to which all synonyms should be corrected. For example, you could configure the domain so that instances of “CA” and “Calfornia” are automatically corrected to “California”.
In addition to defining the values for a domain, you can create domain rules that validate new data values – for example, you could create a rule to ensure that all values in an Age domain are numbers or that all values in an Email Address domain include a “@” character.
You can also specify standardization settings for a text-based domain to enforce correct capitalization. This enables you to ensure that cleansed text values have consistent formats.
Often, you create domains to represent the most granular level of your data, for example FirstName, but the actual unit of storage comprises multiple domains, for example FullName. In this example, you can combine the FirstName and LastName domains to form a FullName composite domain. Composite domains are also used for address fields, which comprise a combination of address, city, state, postal code, and country data. Another use of composite domains is a rule that combines data from multiple domains. For example, you can verify that the string “98007” in a zip code domain corresponds to the string “Bellevue” in a city domain.
Matching can be performed on the individual domains that comprise the composite domain, but not on the composite domain itself.


What Is a Reference Data Service?
Many data quality problems are outside the core specialization of the organization in which they are being used. For example, suppose you are an Internet retailer and one of your core data problems, and highest unnecessary costs, is incorrect address data. You have made your web site as user-friendly as possible, but there are still an unacceptably high number of incorrectly addressed orders.
To cleanse data that is outside the knowledge of your organization, you can subscribe to third-party Reference Data Service (RDS) providers. Using the Microsoft Windows Azure Data Market, it is straightforward to subscribe to an RDS service and then use this service to validate and cleanse your data. For example, using the Windows Azure Data Market, you have purchased a subscription to an address verification service. You send data to the address verification service and it verifies and cleanses the data, reducing incorrect address information and, therefore, reducing your postage costs.
To use RDS to cleanse your data, you must follow these steps:
1. First, create a free DataMarket account key at the Windows Azure Marketplace.
2. Then, subscribe to a free or paid-for RDS provider’s service at the Marketplace.
3. Next, configure the reference data service details in DQS.
4. Then, map your domain to the RDS service.
5. Finally, use the knowledge base that contains the domain that maps to the RDS service to cleanse the data.


One of the key advantages of using the Windows Azure Data Market to provide DQS services is that the cost of the data service is typically based on the number of times you use the service per month. This allows you to scale up at busy times and reduce costs at quiet times.
For More Information: For more information about Application and Data Subscriptions for Windows Azure Marketplace, see http://go.microsoft.com/fwlink/?LinkID=246739.


Creating a Knowledge Base
Building a DQS knowledge base is an iterative process that involves the following steps:
• Knowledge Discovery – Using existing data to identify domain values.
• Domain Management – Categorizing discovered values as valid, invalid, or errors; specifying synonyms, leading values, and correction rules; and performing other domain configuration tasks.
The data steward can create the initial knowledge base from scratch, base it on an existing knowledge base, or import a knowledge base from a data file. Then the knowledge discovery process is used to identify data fields that need to be managed, map these fields to domains in the knowledge base (which can be created during knowledge discovery if required), and identify values for these fields.
After the knowledge base has been populated by the knowledge discovery process, the data steward manages the domains to control how Data Quality Services validates and corrects data values. Additionally, domain management may include configuring reference data services, or setting up term-based or cross-field relationships.
Creating a knowledge base is not a one-time activity. A data steward will continually use the knowledge discovery and domain management processes to enhance the knowledge base and manage the quality of new data values and domains.


Demonstration: Creating a Knowledge Base

Task 1: Create a knowledge base
1. Ensure MIA-DC1 and MIA-SQLBI are started, and log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd. Then, in the D:\10777A\Demofiles\Mod09 folder, run Setup.cmd as Administrator.
2. Click Start, click All Programs, click Microsoft SQL Server 2012 RC0, click Data Quality Services, and click Data Quality Client. When prompted, enter the server name localhost, and click Connect.
3. In SQL Server Data Quality Services, in the Knowledge Base Management section, click New Knowledge Base. Then create a knowledge base named Demo KB from the existing DQS Data knowledge base, for the Domain Management activity, and click Next.
4. Delete all of the existing domains other than US - State. Then, with US - State selected, on the Domain properties tab, change the name of the domain to State.
5. On the Domain Values tab, note the existing values. The leading value for each state is the full state name, and other possible values that should be corrected to the leading value are indented beneath each leading value.
6. Click Finish, and then when prompted to publish the knowledge base, click No.


Task 2: Perform knowledge discovery
1. In SQL Server Data Quality Services, under Recent Knowledge Base, click Demo KB and then click Knowledge Discovery.
2. On the Map page, in the Data Source drop-down list, select Excel File; in the Excel File box, browse to D:\10777A\Demofiles\Mod09\Stores.xlsx; in the Worksheet drop-down list, ensure Sheet1$ is selected; and ensure Use first row as header is selected. This worksheet contains a sample of store data that needs to be cleansed.
3. In the Mappings table, in the Source Column list, select State (String), and in the Domain list, select State.
4. In the Mappings table, in the Source Column list, select City (String), and then click the Create a Domain button and create a domain named City with the default properties.
5. Repeat the previous step to map the StoreType (String) source column to a new domain named StoreType.
6. Click Next, and then on the Discover page, click Start and wait for the knowledge discovery process to complete. When the process has finished, note that 11 new City and StoreType records were found and that there were three unique City values, 5 unique State values, and 4 unique StoreType values. Then click Next.
7. On the Manage Domain Values page, with the City domain selected, note the new values that were discovered.
8. Select the State domain and note that no new values were discovered. Then clear the Show Only New checkbox and note that all possible values for the State domain are shown, and the Frequency column indicates that the data included California, CA, Washington, and WA.
9. Select the StoreType domain and note the values that were discovered.
10. In the list of values, click Retail, hold the Ctrl key and click Resale, and click the Set selected domain values as synonyms button. Then right-click Retail and click Set as leading.
11. In the list of values, note that Whalesale has a red spell-check line. Then right-click Whalesale, and click Wholesale in the list of suggested spelling corrections. Note that the Type for the Whalesale value changes to Error and the Correct to value is automatically set to Wholesale.
12. Click Finish. If prompted to review more values, click No. When prompted to publish the knowledge base, click No.

Task 3: Perform domain management
1. In SQL Server Data Quality Services, under Recent Knowledge Base, click Demo KB and then click Domain Management.
2. In the Domain list, select StoreType. Then view the Domain Values tab and note that the values discovered in the previous task are listed with appropriate leading values and correction behavior.
3. Click the Add new domain value button, and then enter the value Reseller.
4. Click the Retail leading value, hold the Ctrl key and click the new Reseller value, and then click the Set selected domain values as synonyms button. Note that Reseller becomes a valid value that is corrected to the Retail leading value.
5. Click Finish, and when prompted to publish the knowledge base, click Publish. Then, when publishing is complete, click OK.


Lesson 2: Using Data Quality Services to Cleanse Data

One of the major tasks of a data quality management solution is to cleanse data by validating and correcting domain values. This lesson describes how you can use Data Quality Services to cleanse data and review data cleansing results.

After completing this lesson, you will be able to:
• Create a data cleansing project.
• View cleansed data.
• Use the Data Cleansing transformation in an SSIS data flow.


Creating a Data Cleansing Project

Data stewards can use the Data Quality Client application to create a data cleansing project that applies the knowledge in a knowledge base to data in a SQL Server database or a Microsoft Excel® workbook. When creating a data cleansing project, the data steward must:
1. Select the knowledge base to use for the project, and specify cleansing as the activity to be performed.
2. Select the data source containing the data to be cleansed and map the columns in it to the domains in the knowledge base.
3. Run the data cleansing process and then review the suggestions and corrections generated by Data Quality Services. The data steward can then approve or reject the suggestions and corrections.
4. Export the cleansed data to a database table, comma-delimited file, or Excel workbook.


Viewing Cleansed Data

The output from a data cleansing project includes the cleansed data as well as additional information about the corrections made by Data Quality Services. The columns in the output are named by combining the name of the domain and the type of data in the column. For example, the cleansed output for a domain named State is stored in a column named State_Output.

Cleansed data output includes the following types of column:
• Output – The values for all fields after data cleansing. All fields in the original data source generate output columns, even those not mapped to domains in the knowledge base (in which case they contain the original data values).
• Source – The original value for fields that were mapped to domains and cleansed.
• Reason – The reason the output value was selected by the cleansing operation. For example, a valid value might be corrected to a leading value defined for the domain, or Data Quality Services might have applied a cleansing algorithm and suggested a corrected value.
• Confidence – An indication of the confidence Data Quality Services estimates for corrected values. For values corrected to leading values defined in the knowledge base, this is usually 1 (or 100%). When Data Quality Services uses a cleansing algorithm to suggest a correction, the confidence is a value between 0 and 1.
• Status – The status of the output column. A value of correct indicates that the original value was already correct, and a value of corrected indicates that Data Quality Services changed the value.
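A cleansing project can also export its results to a SQL Server table instead of an Excel workbook. The following query is only an illustrative sketch: it assumes the results were exported to a hypothetical table named dbo.CleansedStores and that a domain named State was cleansed, so the actual column names depend on the domains and export options you choose.

    SELECT State_Source,        -- original value
           State_Output,        -- value after cleansing
           State_Reason,        -- why the output value was selected
           State_Confidence,    -- confidence for corrected values (0 to 1)
           State_Status         -- correct or corrected
    FROM dbo.CleansedStores
    WHERE State_Status = 'corrected';   -- show only rows that DQS changed

A query like this makes it easy for a data steward to review just the rows that the cleansing operation modified.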


Demonstration: Cleansing Data

Note: You must complete the previous demonstration in this module before performing this one.

Task 1: Create a data cleansing project
1. If it is not already running, start the Data Quality Client application, and connect to localhost.
2. In SQL Server Data Quality Services, in the Data Quality Projects section, click New Data Quality Project, and create a new project named Cleansing Demo based on the Demo KB knowledge base. Ensure the Cleansing activity is selected, and then click Next.
3. On the Map page, in the Data Source list, ensure SQL Server is selected. Then in the Database list, click DemoDQS; and in the Table/View list, click Stores.
4. In the Mappings table, map the City (varchar), State (varchar), and StoreType (varchar) source columns to the City, State, and StoreType domains. Then click Next.
5. On the Cleanse page, click Start. Then, when the cleansing process has completed, view the data in the Profiler tab, noting the number of corrected and suggested values for each domain, and click Next.
6. On the Manage and View Results page, ensure that the City domain is selected and, on the Suggested tab, note that DQS has suggested correcting the value New Yrk to New York. Click the Approve option to accept this suggestion, and then click the Corrected tab to verify that the value has been corrected.
7. Click the State and StoreType domains in turn, and on the Corrected tab, note the corrections that have been applied based on the values defined in the knowledge base. Then click Next.


8. On the Export page, view the output data preview. Then under Export cleansing results, in the Destination Type list, select Excel File; in the Excel file name box, type D:\10777A\Demofiles\Mod09\CleansedStores.xlsx; ensure that Standardize Output is selected; ensure that the Data and Cleansing Info option is selected; and click Export.
9. When the file download has completed, click Close. Then click Finish, and close SQL Server Data Quality Services.

Task 2: View cleansed data
1. Open D:\10777A\Demofiles\Mod09\CleansedStores.xlsx in Microsoft Excel.
2. Note that the output includes the following types of column:
   • Output – The values for all fields after data cleansing.
   • Source – The original value for fields that were mapped to domains and cleansed.
   • Reason – The reason the output value was selected by the cleansing operation.
   • Confidence – An indication of the confidence Data Quality Services estimates for corrected values.
   • Status – The status of the output column (correct or corrected).
3. Close Excel without saving any changes.


Using the Data Cleansing Data Flow Transformation

In addition to creating data cleansing projects to cleanse data interactively, you can use the Data Cleansing transformation to perform data cleansing in an SSIS data flow. Using the Data Cleansing transformation enables you to automate data cleansing as a part of the extract, transform, and load (ETL) processes used to populate your data warehouse.

To add the Data Cleansing transformation to a data flow in an SSIS package, perform the following steps:
1. Add the Data Cleansing transformation to the data flow and drag a data flow connection from a source or transformation that contains the data you want to cleanse to the input of the Data Cleansing transformation.
2. Edit the settings of the Data Cleansing transformation to connect to the data quality server, specify the knowledge base you want to use, and map the input columns in the data flow to domains in the knowledge base.
3. Drag the output from the Data Cleansing transformation to the next transformation or destination in the data flow, and map the output columns from the Data Cleansing transformation to the appropriate input columns in the transformation or destination. The output columns from the Data Cleansing transformation are the same as those generated by an interactive data cleansing project.


Lab Scenario

In this lab, you will add data cleansing to the Adventure Works data warehousing solution.

You have created an ETL solution for the Adventure Works data warehouse, and invited some data stewards to validate the process before putting it into production.

The data stewards have noticed some data quality issues in the staged customer data, and requested that you provide a way for them to cleanse data so that the data warehouse is based on consistent and reliable data. The data stewards have provided you with an Excel workbook containing some examples of the issues found in the data.

You plan to work with the data stewards to create a knowledge base for customer data, and enable them to use the Data Quality Services client and Excel to cleanse customer records. You will base the customer knowledge base on an existing knowledge base, removing any unnecessary domains. Then you will add domains for the customer records you need to cleanse and perform knowledge discovery to determine valid domain values and common errors in the source data.

You will use the knowledge base you have created to cleanse customer data from the source database. Then, when you are confident that the knowledge base is accurate, you will add the Data Cleansing transformation to an SSIS data flow to cleanse data automatically as it is extracted and loaded into the staging database.


Lab 9A: Cleansing Data

Exercise 1: Creating a DQS Knowledge Base

Scenario
You have integrated data from many data sources into your data warehouse. This has provided many benefits. However, users have observed some quality issues with the data, which you must correct.

The main tasks for this exercise are as follows:
1. Prepare the lab environment.
2. View existing data.
3. Create a Knowledge Base.
4. Perform knowledge discovery.

Task 1: Prepare the lab environment
• Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
• Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab09A\Starter folder as Administrator.

Task 2: View existing data
• Open D:\10777A\Labfiles\Lab09A\Starter\Sample Customer Data.xlsx in Microsoft Excel and examine the worksheets in the workbook.
• Note that there are multiple names for the same country on the Countries and States worksheets.
• Note that there are multiple names for the same state on the States worksheet.


• Note that some customers do not have a gender code of F or M on the Gender worksheet.
• On the Sample Customer Data worksheet, apply column filters to explore the data further and view the source records for the anomalous data.
• Close Excel without saving any changes to the workbook.

Task 3: Create a Knowledge Base
• Start the Data Quality Client application and connect to localhost.
• Create a new knowledge base with the following properties:
  • Name: Customer KB
  • Description: Customer data knowledge base
  • Create knowledge base from: Existing Knowledge Base (DQS Data)
  • Select Activity: Domain Management
• The new knowledge base inherits several domains from the DQS Data knowledge base you have based it on. You do not need all of these domains. Delete the following domains, but do not delete the Country/Region, Country/Region (two-letter leading), and US - State domains:
  • Country/Region (three-letter leading)
  • US - Counties
  • US - Last Name
  • US - Places
  • US - State (2-letter leading)
• View the domain values for the Country/Region, Country/Region (two-letter leading), and US - State domains.
• Change the domain name of the US - State domain to State.
• Create a domain with the following properties:
  • Domain Name: Gender
  • Description: Customer gender
  • Data Type: String
  • Use Leading Values: Selected
  • Normalize String: Selected
  • Format Output to: Upper Case
  • Language: English
  • Enable Speller: Selected
  • Disable Syntax Error Algorithms: Not selected


The knowledge base should now resemble the following image.

• View the domain values for the Gender domain, and notice that null values are allowed.
• Add new domain values for F, M, Female, and Male to the Gender domain.
• Set F and Female as synonyms, with F as the leading value.
• Set M and Male as synonyms, with M as the leading value.

The domain values for the Gender domain should now be similar to the following image.

• Click Finish but do not publish the knowledge base.


Task 4: Perform knowledge discovery
• Open the Customer KB knowledge base for knowledge discovery.
• Select the Sample Customer Data$ worksheet in the Customer Data.xlsx Excel workbook in D:\10777A\Labfiles\Lab09A\Starter as the source data for mapping. Use the first row as the header.
• In the Mappings table, select the following mappings.

  Source Column                   Domain
  CountryRegionCode (String)      Country/Region (two-letter leading)
  CountryRegionName (String)      Country/Region
  StateProvinceName (String)      State
  Gender (String)                 Gender

  The data source mapping should resemble the following image.

• Start the discovery process, and when it is complete, view the new values that have been discovered for the State domain, and set New South Wales and NSW as synonyms with New South Wales as the leading value as shown here (in the alphabetically ordered list of values, click New South Wales first, and then Ctrl + click NSW to select them both. Then click the Set selected domain values as synonyms button).
• View the new values that have been discovered for the Country/Region (two-letter leading) domain, and mark the value UK as an error that should be corrected to GB as shown here.


• View the new values that have been discovered for the Gender domain, and mark the value W as invalid and correct it to F as shown here.
• View the new values that have been discovered for the Country/Region domain, and remove the filter that causes the list to show only new values.
• Set United States and America as synonyms with United States as the leading value as shown here (in the alphabetically ordered list of values, click United States first, and then Ctrl + click America to select them both. Then click the Set selected domain values as synonyms button).
• Set United Kingdom and Great Britain as synonyms with United Kingdom as the leading value as shown here (in the alphabetically ordered list of values, click United Kingdom first, and then Ctrl + click Great Britain to select them both. Then click the Set selected domain values as synonyms button).
• Finish and publish the knowledge base.

Results: After this exercise, you should have created a knowledge base and performed knowledge discovery.


Exercise 2: Using a DQS Project to Cleanse Data

Scenario
Now that you have a published knowledge base, you can use it in a data quality project to perform data cleansing.

The main task for this exercise is as follows:
1. Create a data quality project.

Task 1: Create a data quality project
• Create a new data quality project with the following properties:
  • Name: Cleanse Customer Data
  • Description: Apply Customer KB to customer data
  • Use knowledge base: Customer KB
  • Select Activity: Cleansing
• On the Map page, select the InternetSales SQL Server database, and select the Customers table. Then in the Mappings table, select the following mappings.

  Source Column                     Domain
  CountryRegionCode (nvarchar)      Country/Region (two-letter leading)
  CountryRegionName (nvarchar)      Country/Region
  Gender (nvarchar)                 Gender
  StateProvinceName (nvarchar)      State

  The data source mapping should resemble the following image.


• Start the cleansing process, review the source statistics in the Profiler pane, and then on the Manage and View Results page note that Data Quality Services has found the value Astralia, which is likely to be a typographical error, and suggested it be corrected to Australia on the Suggested tab of the Country/Region domain, as shown here.
• Approve the suggested correction, and note that it is now listed on the Corrected tab. Then view the corrected values for the Country/Region, Country/Region (two-letter leading), Gender, and State domains.
• On the Export page, view the output data, and then export the data and cleansing info to an Excel file named D:\10777A\Labfiles\Lab09A\Starter\Cleansed Customers.xlsx.
• When the export is complete, finish the project and view the results of the cleansing process in Excel. It should look similar to the following image.

Results: After this exercise, you should have used a DQS project to cleanse data and export it as an Excel workbook.


Exercise 3: Using DQS in an SSIS Package

Scenario
You are happy with the data cleansing capabilities of DQS, and the results are accurate enough to be automated. You will edit an SSIS package to include a data cleansing component as part of a data flow.

The main tasks for this exercise are as follows:
1. Add a DQS Cleansing transformation to a data flow.
2. Test the package.

Task 1: Add a DQS Cleansing transformation to a data flow
• Open the D:\10777A\Labfiles\Lab09A\Starter\AdventureWorksETL.sln solution in Business Intelligence Development Studio.
• Open the Extract Internet Sales Data.dtsx SSIS package.
• Open the Extract Customers task, add a DQS Cleansing transformation, and rename it to Cleanse Customer Data. Remove the data flow between Customers and Staging DB and add a data flow from Customers to Cleanse Customer Data so that the data flow looks like this.
• Configure the following settings for Cleanse Customer Data:
  • Create a new Data quality connection manager for the localhost server and the Customer KB knowledge base.
  • Specify the following mapping.

    Input Column         Domain                                 Source Alias                  Output Alias                  Status Alias
    Gender               Gender                                 Gender_Source                 Gender_Output                 Gender_Status
    StateProvinceName    State                                  StateProvinceName_Source      StateProvinceName_Output      StateProvinceName_Status
    CountryRegionCode    Country/Region (two-letter leading)    CountryRegionCode_Source      CountryRegionCode_Output      CountryRegionCode_Status
    CountryRegionName    Country/Region                         CountryRegionName_Source      CountryRegionName_Output      CountryRegionName_Status


• Standardize the output.
• Connect the output data flow from Cleanse Customer Data to Staging DB and change the following column mappings in the Staging DB destination (leaving the remaining existing mappings as they are).

  Input Column                  Destination Column
  Gender_Output                 Gender
  StateProvinceName_Output      StateProvinceName
  CountryRegionCode_Output      CountryRegionCode
  CountryRegionName_Output      CountryRegionName

  The completed data flow should look like this.

Task 2: Test the package
• Start Debugging and observe the Extract Customers data flow as it executes, noting the number of rows processed by the Cleanse Customer Data transformation.

Results: After this exercise, you should have created and tested an SSIS package that cleanses data.


Lesson 3: Using Data Quality Services to Match Data

As well as cleansing data, you can use Data Quality Services to identify duplicate data entities. The ability to match data entities is useful when you need to deduplicate data to eliminate errors in reports and analysis caused by the same entity being counted more than once.

This lesson explains how to create a matching policy and then use it to find duplicate data entities in a data matching project.

After completing this lesson, you will be able to:
• Create a matching policy.
• Create a data matching project.
• View data matching results.


Creating a Matching Policy

A data warehouse is almost always composed of data from multiple sources, and often at least some of this data is provided by third parties. Furthermore, there are likely to be many transactions relating to the same customer or product, and unless you have a system that only allows existing customers to buy existing products, duplication is likely to occur.

For example, suppose you sell books on the Internet. To buy a book, a customer must first register with their name, address, a username, and a password. Customers often forget their username or their password and, although there is a Forgotten Username/Password link, many choose to register again. You have implemented a constraint to stop anyone using the same username, but you still have many instances in which a single customer occurs two or more times. This duplication can cause problems for data analysis. You will have more customers in your system than in reality. There might be more customers of a particular gender, more customers between certain age brackets, or more customers in a particular geographic area than occur in reality. It will also skew sales-per-customer analysis, which will return lower-than-accurate results.

By providing a constraint to prevent duplicate usernames, you have gone some way to preventing duplication but, as you can see from the example, this only reduces the problem slightly. You could enforce unique names, but that would prevent customers with common names from registering. You could enforce unique names at the same address, but that would prevent a customer who has the same name as a partner or child.
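As a minimal sketch of the kind of uniqueness constraint described above (the table and column names here are hypothetical and are not part of the lab databases):

    -- Each username may appear only once; this prevents exact duplicates,
    -- but not the same person registering again with a different username.
    ALTER TABLE dbo.Customers
    ADD CONSTRAINT UQ_Customers_Username UNIQUE (Username);

As the scenario notes, a constraint like this catches only exact repeats of the same value; identifying the same real-world customer behind different usernames requires the kind of weighted matching policy described next.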


Matching Policies

Data Quality Services can use a matching policy to assess the likelihood of records being duplicates. In cases with a high likelihood of duplication, the potential duplicates are assessed by a data steward before any changes are made. A data steward can add a matching policy to a knowledge base and create rules for the matching policy that help determine whether multiple data records represent the same business entity. A data matching rule compares one or more domains across records and applies weighted comparisons to identify matches. For each domain in the matching rule, the data steward defines the following settings:
• Similarity – You can specify that the rule should look for similar values based on fuzzy logic comparison, or an exact match.
• Weight – A percentage score to apply if a domain match succeeds.
• Prerequisite – Indicates that this particular domain must match for the records to be considered duplicates.

For each rule, the data steward specifies a minimum matching score. When the matching process occurs, the individual weightings for each successful domain match comparison are added together, and if the total is equal to or greater than the minimum matching score, and all prerequisite domains match, then the records are considered to be duplicates.
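The scoring logic can be pictured with a simple pairwise comparison. The following Transact-SQL is only an illustrative sketch: it assumes a hypothetical dbo.Stores table with a StoreID key, reuses the weights of the Is Same Store rule created in the demonstration later in this lesson, and simplifies every comparison to an exact match (DQS itself also supports fuzzy similarity comparisons, which this query does not replicate).

    -- Compare every pair of store records once and compute a weighted match score.
    SELECT a.StoreID AS StoreA,
           b.StoreID AS StoreB,
           CASE WHEN a.State = b.State THEN 1 ELSE 0 END AS PrerequisiteMet,   -- State must match
           (CASE WHEN a.StoreName     = b.StoreName     THEN 20 ELSE 0 END) +
           (CASE WHEN a.StreetAddress = b.StreetAddress THEN 20 ELSE 0 END) +
           (CASE WHEN a.City          = b.City          THEN 20 ELSE 0 END) +
           (CASE WHEN a.PhoneNumber   = b.PhoneNumber   THEN 30 ELSE 0 END) +
           (CASE WHEN a.StoreType     = b.StoreType     THEN 10 ELSE 0 END) AS MatchScore
    FROM dbo.Stores AS a
    JOIN dbo.Stores AS b ON a.StoreID < b.StoreID;   -- evaluate each pair only once

Pairs where PrerequisiteMet is 1 and MatchScore is greater than or equal to the rule's minimum matching score would be flagged as likely duplicates; in DQS, the data steward then reviews those clusters before any records are changed.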


Creating a Data Matching Project

Data stewards can use the Data Quality Client application to create a data matching project that applies the knowledge in a knowledge base to data in a SQL Server database or an Excel workbook. When creating a data matching project, the data steward must:
1. Select the knowledge base to use for the project, and specify matching as the activity to be performed.
2. Select the data source containing the data to be matched and map the columns in it to the domains in the knowledge base.
3. Run the data matching process and then review the clusters of matched records that Data Quality Services identifies based on the matching policies in the knowledge base.
4. Export the matched data to a database table, comma-delimited file, or Excel workbook. Additionally, you can specify a survivorship rule that eliminates duplicate records and export the surviving records. You can specify the following rules for survivorship:
   • Pivot record – A record chosen arbitrarily by Data Quality Services in each cluster of matched records.
   • Most complete and longest record – The record that has fewest missing data values and the longest values in each field.
   • Most complete record – The record that has fewest missing data values.
   • Longest record – The record containing the longest values in each field.


Viewing Data Matching Results

After the data matching process is complete, you can view and export the following results:
• Matches – The original dataset plus additional columns that indicate clusters of matched records.
• Survivors – The resulting dataset with duplicate records eliminated based on the selected survivorship rule.

When you export matches, the results include the original data and the following columns:
• Cluster ID – A unique identifier for a cluster of matched records.
• Record ID – A unique identifier for each matched record.
• Matching Rule – The rule that produced the match.
• Score – The combined weighting of the matched domains as defined in the matching rule.
• Pivot Mark – A matched record chosen arbitrarily by Data Quality Services as the pivot record for a cluster.
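If the matches are exported to a SQL Server table rather than to Excel, the clusters can be reviewed with a straightforward query. This is only a sketch: the table name dbo.MatchedStores is hypothetical, and the exact column names produced by the export may differ from the bracketed names shown here.

    -- List records that belong to a match cluster, grouped by cluster and ordered by score.
    SELECT [Cluster ID], [Record ID], [Matching Rule], [Score], [Pivot Mark], StoreName
    FROM dbo.MatchedStores
    WHERE [Cluster ID] IS NOT NULL        -- assumes unmatched records have no cluster identifier
    ORDER BY [Cluster ID], [Score] DESC;

Grouping on the cluster identifier keeps each set of potential duplicates together, and the pivot mark column shows which record DQS chose as the pivot for that cluster.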


Demonstration: Matching Data

Note: You must complete the previous demonstrations in this module before performing this one.

Task 1: Create a matching policy
1. Start the Data Quality Client application, and connect to localhost.
2. In SQL Server Data Quality Services, under Recent Knowledge Base, click Demo KB and then click Matching Policy.
3. On the Map page, in the Data Source drop-down list, select Excel File; in the Excel File box, browse to D:\10777A\Demofiles\Mod09\Stores.xlsx; in the Worksheet drop-down list, ensure Sheet1$ is selected; and ensure Use first row as header is selected. This worksheet contains a sample of store data that needs to be matched.
4. In the Mappings table, map the City (String), State (String), and StoreType (String) source columns to the City, State, and StoreType domains.
5. In the Mappings table, in the Source Column list, select PhoneNumber (String), and then click the Create a Domain button and create a domain named PhoneNumber with the default properties.
6. Repeat the previous step to map the StreetAddress (String) source column to a new domain named StreetAddress.
7. Click the Add a column mapping button to create a new row in the mapping table, and then repeat the previous step to map the StoreName (String) source column to a new domain named StoreName. Then click Next.
8. On the Matching Policy page, click the Create a matching rule button. Then, in the Rule Details section, change the rule name to Is Same Store.


9. In the Rule Editor table, click the Add a new domain element button. Then in the Domain column, ensure that StoreName is selected; in the Similarity column, ensure that Similar is selected; in the Weight column, enter 20; and leave the Prerequisite column unselected.
10. Repeat the previous steps to add the following rules.

    Domain           Similarity    Weight (%)    Prerequisite
    StreetAddress    Similar       20            No
    City             Exact         20            No
    PhoneNumber      Exact         30            No
    StoreType        Similar       10            No
    State            Exact                       Yes

11. Click Start and wait for the matching process to complete, and note that one match is detected in the sample data (Store 1 is the same as Store One). Then click Next.
12. On the Matching Results page, view the details in the Profiler tab, and then click Finish. When prompted to publish the knowledge base, click Publish, and when publishing is complete, click OK.

Task 2: Create a data matching project
1. In SQL Server Data Quality Services, in the Data Quality Projects section, click New Data Quality Project and create a new project named Matching Demo based on the Demo KB knowledge base. Ensure the Matching activity is selected, and then click Next.
2. On the Map page, in the Data Source list, ensure SQL Server is selected. Then in the Database list, click DemoDQS, and in the Table/View list, click Stores.
3. In the Mappings table, map the City (varchar), PhoneNumber (varchar), State (varchar), StoreName (varchar), StoreType (varchar), and StreetAddress (varchar) source columns to the City, PhoneNumber, State, StoreName, StoreType, and StreetAddress domains. Then click Next.

Note: When the Mappings table is full, click Add a column mapping to add an additional row.

4. On the Matching page, click Start, and when matching is complete, note that two matches were detected (Store 1 is the same as Store One and Store 16 is the same as Store Sixteen). Then click Next.
5. On the Export page, in the Destination Type drop-down list, select Excel File. Then select the following content to export:
   • Matching Results: D:\10777A\Demofiles\Mod09\MatchedStores.xlsx
   • Survivorship Results: D:\10777A\Demofiles\Mod09\SurvivingStores.xlsx
6. Select the Most complete record survivorship rule, and click Export. Then when the export has completed successfully, click Close.
7. Click Finish and close SQL Server Data Quality Services.


Task 3: View data matching results
1. Open D:\10777A\Demofiles\Mod09\MatchedStores.xlsx in Microsoft Excel. Note that this file contains all of the records in the dataset with additional columns to indicate clusters of matched records. In this case, there are two clusters, each containing two matches.
2. Open D:\10777A\Demofiles\Mod09\SurvivingStores.xlsx in Microsoft Excel. Note that this file contains the records that were selected to survive the matching process. The data has been deduplicated by eliminating duplicates and retaining only the most complete record.
3. Close Excel without saving any changes.


Lab Scenario

In this lab, you will enhance the data quality management capabilities of the Adventure Works data warehousing solution to support deduplication of customer records.

You have created a DQS knowledge base and used it to cleanse customer data. However, data stewards are concerned that the staged customer data may include duplicate entries for the same customer.

You have decided to extend the knowledge base and create a matching policy for customer records. For records to be considered a match, the following criteria must be true:
• The Country/Region column must be an exact match.
• A total matching score of 80 or higher based on the following weightings must be achieved:
  • An exact match of the Gender column has a weighting of 10.
  • An exact match of the City column has a weighting of 20.
  • An exact match of the EmailAddress column has a weighting of 30.
  • A similar FirstName column value has a weighting of 10.
  • A similar LastName column value has a weighting of 10.
  • A similar AddressLine1 column value has a weighting of 20.

Data stewards can then use this matching policy to identify potential duplicate records and consolidate them.
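To illustrate how these weightings combine (an example calculation only, not part of the lab steps): two records with an exact Country/Region match that also match exactly on City (20) and EmailAddress (30) and are judged similar on FirstName (10) and AddressLine1 (20) would score 20 + 30 + 10 + 20 = 80, meeting the threshold, whereas the same pair without the City match would score only 60 and would not be treated as a match.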


Lab 9B: Deduplicating Data

Exercise 1: Creating a Matching Policy

Scenario
You have implemented a data cleansing solution for data being staged. However, you have identified that the staged data contains multiple records for the same business entity, and you want to use a data matching solution to deduplicate the data.

The main tasks for this exercise are as follows:
1. Prepare the lab environment.
2. Create a matching policy.

Task 1: Prepare the lab environment
• Ensure that you have completed the previous lab.
• Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
• Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab09B\Starter folder as Administrator.

Task 2: Create a matching policy
• Start the Data Quality Client application and connect to the localhost server.
• Open the Customer KB knowledge base for the Matching Policy activity.
• Use the Sheet1$ worksheet in the D:\10777A\Labfiles\Lab09B\Starter\Sample Staged Data.xlsx Excel file as the data source for the matching policy.


• On the Map page, map the columns in the Excel worksheet to the following domains (adding new domains as required). You will need to add more column mappings after you create the first five mappings.

  Source Column                   Domain
  AddressLine1 (String)           A new domain named AddressLine1 with a String data type.
  City (String)                   A new domain named City with a String data type.
  CountryRegionCode (String)      Country/Region (two-letter leading)
  CountryRegionName (String)      Country/Region
  EmailAddress (String)           A new domain named EmailAddress with a String data type.
  FirstName (String)              A new domain named FirstName with a String data type.
  Gender (String)                 Gender
  LastName (String)               A new domain named LastName with a String data type.
  StateProvinceName (String)      State

• On the Matching Policy page, create a matching rule with the following details:
  • Rule name: Is Same Customer
  • Description: Checks for duplicate customer records
  • Min. matching score: 80

  Domain            Similarity    Weight    Prerequisite
  Country/Region    Exact                   Selected
  Gender            Exact         10        Unselected
  City              Exact         20        Unselected
  EmailAddress      Exact         30        Unselected
  FirstName         Similar       10        Unselected
  LastName          Similar       10        Unselected
  AddressLine1      Similar       20        Unselected

  The matching rule should look similar to the following image.


• Start the matching process and, when the matching process has finished, review the matches found by Data Quality Services, noting that there are duplicate records for three customers as shown here.
• Finish the matching process and publish the knowledge base.

Results: After this exercise, you should have created a matching policy and published the knowledge base.


Exercise 2: Using a DQS Project to Match Data

Scenario
You will now create a data quality project to apply the matching rules from the previous exercise. After this process is complete, you will have exported a deduplicated set of data. You will finally integrate this data into your database using Transact-SQL.

The main tasks for this exercise are as follows:
1. Create a data quality project for matching data.
2. Review matching results.
3. Apply the matching results to the staged customer data.

Task 1: Create a data quality project for matching data
• In SQL Server Data Quality Services, create a new data quality project with the following details:
  • Name: Deduplicate Customers
  • Description: Identify customer matches
  • Use knowledge base: Customer KB
  • Select Activity: Matching
• Use the Customers table in the Staging SQL Server database as the data source, and map the following columns to domains in the knowledge base.

  Source Column                   Domain
  FirstName (nvarchar)            FirstName
  LastName (nvarchar)             LastName
  Gender (nvarchar)               Gender
  AddressLine1 (nvarchar)         AddressLine1
  City (nvarchar)                 City
  CountryRegionName (nvarchar)    Country/Region
  EmailAddress (nvarchar)         EmailAddress


The mapping details should resemble the following image.

• Run the matching process and review the results.
• Export the results to the following Excel workbooks, specifying the Most complete record survivorship rule.

  Export Content          File Name
  Matching Results        D:\10777A\Labfiles\Lab09B\Starter\Matches.xlsx
  Survivorship Results    D:\10777A\Labfiles\Lab09B\Starter\Survivors.xlsx

Task 2: Review matching results
• Open D:\10777A\Labfiles\Lab09B\Starter\Matches.xlsx in Excel. It should look similar to the following image.


Note that the matching process found a match with a score of 90 for the following customer records:
• CustomerBusinessKey: 29261 (Robert Turner)
• CustomerBusinessKey: 29484 (Rob Turner)

• Open D:\10777A\Labfiles\Lab09B\Starter\Survivors.xlsx in Excel. It should look similar to the following image.
• Note that the survivors file contains all of the records that should survive de-duplication based on the matches that were found. It contains the record for customer 29261 (Robert Turner), but not for customer 29484 (Rob Turner).

Task 3: Apply the matching results to the staged customer data
• Open D:\10777A\Labfiles\Lab09B\Starter\Fix Duplicates.sql in SQL Server Management Studio, connecting to the localhost instance of the database engine by using Windows® authentication.
• Review the Transact-SQL code and note that it performs the following tasks (an illustrative sketch of this kind of script appears after the exercise results below):
  • Updates the InternetSales table so that all sales that are currently associated with the duplicate customer record become associated with the surviving customer record.
  • Deletes the duplicate customer record.
• Execute the SQL statement.

Results: After this exercise, you should have deduplicated data using a matching project and updated data in your database to reflect these changes.
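The statements below are only an illustrative sketch of the kind of logic that the provided Fix Duplicates.sql script performs; the real script is in the lab starter folder, and the schema assumed here (a Staging database with InternetSales and Customers tables keyed by CustomerBusinessKey) may not match it exactly.

    -- Re-point sales rows from the duplicate customer to the surviving customer.
    UPDATE Staging.dbo.InternetSales
    SET CustomerBusinessKey = 29261      -- surviving record (Robert Turner)
    WHERE CustomerBusinessKey = 29484;   -- duplicate record (Rob Turner)

    -- Remove the duplicate customer record.
    DELETE FROM Staging.dbo.Customers
    WHERE CustomerBusinessKey = 29484;

Updates of this kind should run before the duplicate row is deleted, so that no sales rows are left referencing a customer record that no longer exists.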


Module Review and Takeaways

Review Questions
1. Why is it necessary to have human interaction when creating a knowledge base?
2. If you do not have the knowledge or resources to create a domain in your knowledge base, what choices do you have?
3. What are potential issues with data deduplication?

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!