13.07.2015 Views

Applied XML Programming for Microsoft .NET.pdf - Csbdu.in


Applied XML Programming for Microsoft .NET
Dino Esposito

Microsoft Press
A Division of Microsoft Corporation
One Microsoft Way
Redmond, Washington 98052-6399

Copyright © 2003 by Dino Esposito

All rights reserved. No part of the contents of this book may be reproduced or transmitted in any form or by any means without the written permission of the publisher.

Library of Congress Cataloging-in-Publication Data [pending.]
Esposito, Dino, 1965-
Applied XML Programming for Microsoft .NET / Dino Esposito
p. cm.
Includes index.
ISBN 0-7356-1801-1
1. XML (Document markup language) 2. Microsoft .NET. I. Title.
QA76.76.H94 E85 2002
005.7'2--dc21 2002029546

Printed and bound in the United States of America.

1 2 3 4 5 6 7 8 9 QWT 7 6 5 4 3 2

Distributed in Canada by H.B. Fenn and Company Ltd.

A CIP catalogue record for this book is available from the British Library.

Microsoft Press books are available through booksellers and distributors worldwide. For further information about international editions, contact your local Microsoft Corporation office or contact Microsoft Press International directly at fax (425) 936-7329. Visit our Web site at www.microsoft.com/mspress. Send comments to:

ActiveX, IntelliSense, JScript, Microsoft, Microsoft Press, MS-DOS, Visual Basic, Visual Studio, Win32, Windows and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Other product and company names mentioned herein may be the trademarks of their respective owners.

The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious. No association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred.

Acquisitions Editor: Anne Hamilton
Project Editor: Lynn Finnel
Technical Editor: Marc Young

Body Part No. X08-81851


Dino Esposito

Dino Esposito is Wintellect's ADO.NET and XML expert and a trainer and consultant who specializes in .NET and Web applications. A frequent speaker at popular industry events such as Microsoft TechEd, VSLive!, DevConnections, and WinSummit, Dino is also a prolific author writing the monthly "Cutting Edge" column for MSDN Magazine and the "Diving into Data Access" column for MSDN Voices. He also regularly contributes to a number of other magazines, including Visual Studio Magazine, CoDe Magazine, and asp.netPRO Magazine (http://www.aspnetpro.com). During a few rare moments of spare time, Dino cofounded http://www.vb2themax.com, a Web site for Visual Basic and Visual Basic .NET developers.

Fond of sea and beaches, Dino lives in Italy, precisely in the Rome area, with his wife, Silvia, and two children, Francesco and Michela.

To Silvia, Francesco, and Michela

Acknowledgments

I can say it now: Several times I was about to start an XML book project, but then for one reason or another the project never took off. So I'd like to start by saying thanks to the people who believed in a fairly confused book idea and worked to make it happen. These people are Anne Hamilton and Jeannine Gailey. (By the way, all the best, Jeannine!)

Lynn Finnel brought the usual fundamental contribution as project editor. As Lynn originally described her role in the first e-mail we exchanged, being an editor is a delicate art, as you have to reconcile the needs of many people while meeting your own deadlines. Thanks again, Lynn.

And a warm thanks goes to Jennifer Harris, who edited the book, and technical reviewers Marc Young, Jim Fuchs, Julie Xiao, and Jean Ross.

Other people were involved with this book, mostly as personal reviewers. Francesco Balena tested some of the code and provided a lot of insight. In particular, Giuseppe Dimauro and Giuseppe Guerrasio helped to figure out the intricacies of the XmlSerializer class, and Ralph Westphal did the same with custom readers. Kenn Scribner has been the ideal extension to the MSDN documentation about Web services. Rainer Heller of Siemens offered a really interesting perspective on Web services interoperability. It was nice to discuss Web services in the more general context of a conversation based on the World Football Championships, an indirect demonstration that Web services are still interoperable today!

Thanks to all the Wintellect guys, and Jason Clark and Jeffrey Richter, in particular, for their friendly and effective support.

And now my family. I've noticed that many authors, when writing acknowledgments, promise their families that they will never repeat the experience. Although rewarding for themselves, they explain, writing a book is too hard on the rest of the family to be repeated. I'll be honest and sincere here. So, Silvia, and Francesco and Michela, set your mind at rest. I will do all I can to write even more books. But I love you all beyond imagination.

'til the next book
Dino


Table of Contents

Applied XML Programming for Microsoft .NET
Introduction
Part I - XML Core Classes in the .NET Framework
Chapter 1 - The .NET XML Parsing Model
Chapter 2 - XML Readers
Chapter 3 - XML Data Validation
Chapter 4 - XML Writers
Part II - XML Data Manipulation
Chapter 5 - The XML .NET Document Object Model
Chapter 6 - XML Query Language and Navigation
Chapter 7 - XML Data Transformation
Part III - XML and Data Access
Chapter 8 - XML and Databases
Chapter 9 - ADO.NET XML Data Serialization
Chapter 10 - Stateful Data Serialization
Part IV - Applications Interoperability
Chapter 11 - XML Serialization
Chapter 12 - The .NET Remoting System
Chapter 13 - XML Web Services
Chapter 14 - XML on the Client
Chapter 15 - .NET Framework Application Configuration
Afterword
Index
List of Figures
List of Tables
List of Sidebars


Introduction

It was about five years ago, a few days after I finished my first book, when the publisher came to me with a rather enticing proposal: "Why don't you start thinking about a new book?" Now I realize that all publishers make this sort of proposition, but at the time the proposal was definitely alluring, and a clear signal, I thought, of appreciation. "Because you seem to do so well with new technologies," they said, "we'd like you to have a look at this new stuff called XML." It was the first time I had heard about XML, which was not yet a W3C recommendation.

A lot of things have happened in the meantime, and XML has come a long way. You can be sure that, as I write this, a thousand or more IT managers are giving presentations that include XML in one way or another. Not many years ago, at a software conference, I heard a product manager emphasize the key role played by XML in the suite of products he was presenting. After the first dozen sentences to the effect that "this feature wouldn't have been possible without XML," one of the attendees asked a candid question: "Is there a function in which you didn't use XML?" The presenter's genuine enthusiasm led everyone there (including myself) to believe that programming would no longer be possible without a strong knowledge of XML. We were more than a little reassured by the speaker's answer: "Oh no, we didn't use XML in the compiler."

Regardless of the hype that often accompanies it, XML truly is a key element in software. Today, XML is more than just a software technology. XML is a fundamental aspect of all forms of programming, as essential as water and air to every human being. Just as human beings realistically need some infrastructure to take advantage of water and air, programming forms of life must be supported by software tools to be effective and express their potential in terms of interoperability, flexibility, and information. For XML, the most important of these tools is the parser.

An XML parser reads in XML text and outputs a memory representation of the contents. The input for an XML parser is always plain and platform-independent text, although potentially encoded in a variety of character sets, whereas the output of an XML parser is strictly tied to the underlying hardware and software platform. Depending on the operating system and the programming environment of choice, an XML parser can generate a Component Object Model (COM) object as well as a Java or a JScript class. No matter the kind of output, however, the end result is XML data in a programmable form.

The growing level of integration and orchestration that partner applications need makes the exchanged XML code more and more sophisticated and often requires the use of specialized dialects like Simple Object Access Protocol (SOAP) and XPath. As a result, XML programming requires ad hoc tools for reading and writing in these dialects; all the better if the tools are tightly integrated into some sort of programming framework.

Effective XML programming requires that you be able to generate XML in a more powerful way than merely concatenating strings. The XML API must be extensible enough to accommodate pluggable technologies and custom functionalities. And it must be serializable and integrate well with other elements of data storage and exchange, including databases, complex data types (arrays, tables, and lists), and (why not?) visual user interface elements. In simple terms, XML must no longer be a distinct API bolted onto the core framework, but instead be a fully integrated member of the family. This is just what XML is in the Microsoft .NET Framework. And this book is about XML programming with the .NET Framework.


What Is This Book About?

This book explores the array of XML tools provided by the .NET Framework. XML is everywhere in the .NET Framework, from remoting to Web services, and from data access to configuration. In the first part of this book, you'll find in-depth coverage of the key classes that implement XML in the .NET platform. Readers and writers, validation, and schemas are discussed with samples and reference information. Next the book moves on to XPath and XSL Transformations (XSLT) and the .NET version of the XML Document Object Model (XML DOM).

The final part of this book focuses on data access and interoperability and touches on SQL Server 2000 and its XML extensions and .NET Remoting and its cross-platform counterpart, XML Web services. You'll also find a couple of chapters about XML configuration files and XML data islands and browser-deployed managed controls.

What Does This Book Cover?

This book attempts to answer the following common questions:
• Can I read custom data as XML?
• What are the guidelines for writing custom XML readers?
• Is it possible to set up validating XML writers?
• How can I extend the XML DOM?
• Why should I use the XPath navigator object whenever possible?
• Can I embed my own managed classes in an XSLT script?
• How can I serialize a DataSet object efficiently?
• What is the DiffGram format?
• Are the SQL Server 2000 XML Extensions (SQLXML) worth using?
• Why does the XML serializer use a dynamic assembly?
• When should I use Web services instead of .NET Remoting?
• How can I embed managed controls in Web pages?
• How can managed controls access client-side XML data islands?
• How do I insert my own XML data in a configuration file?

All of the sample files discussed in this book (and even more) are available through the Web at the following address: http://www.microsoft.com/mspress/books/6235.asp. To open the Companion Content page, click on the Companion Content link in the More Information box on the right side of the page.

Although all the code shown in this book is in C#, the sample files are available both in C# and in Microsoft Visual Basic .NET. Here are some of the more interesting examples:
• An XML reader that reads CSV files and exposes their contents as XML
• An extended version of the XML DOM that detects changes to the disk file and automatically refreshes its data
• A Web service that offers dynamically created images
• An XML reader class with writing capabilities
• A class that serializes DataTable objects in a true binary format
• A tool to track the behavior of the XML serializer class
• A ListView control that retrieves its data from the host HTML page

These and other samples will get you on your way to XML in the .NET Framework.


What Do I Need to Use This Book?

Most of the examples in this book are Windows Forms or console applications. The key requirements for running these applications are the .NET Framework and Microsoft Visual Studio .NET. You also need to have SQL Server 2000 installed to make most of the samples work, and a few examples make use of Microsoft Access 2000 databases. The SQLXML 3.0 extensions are required for the samples in Chapter 8. The code has been tested with the .NET Framework SP1.

The SQL Server examples in this book assume that the sa account uses a blank password, although the use of such a blank password is strongly discouraged in any professional development environment. If your SQL Server sa account doesn't use a blank password, you'll need to add the sa password to the connection strings in the source code. For example, if your sa password is "Hello", the following connection string provides access to the Northwind database:

    string nwind = "SERVER=localhost;UID=sa;PWD=Hello;DATABASE=northwind;";

Some of the applications in this book require SOAP Toolkit 2.0 and SQLXML 3.0. These products are available at the following locations:
• SOAP Toolkit 2.0
  http://msdn.microsoft.com/downloads/default.asp?URL=/downloads/sample.asp?url=/MSDN-FILES/027/001/580/msdn-compositedoc.xml
• SQLXML 3.0
  http://msdn.microsoft.com/downloads/default.asp?URL=/downloads/sample.asp?url=/MSDN-FILES/027/001/824/msdn-compositedoc.xml

Contacting the Author

Please feel free to send any questions about this book directly to the author. Dino Esposito can be reached via e-mail at one of the following addresses:
•
•

In addition, you can contact the author at the Wintellect (http://www.wintellect.com) and VB2-The-Max (http://www.vb2themax.com) Web sites.

Support

Every effort has been made to ensure the accuracy of this book and the contents of the sample files. Microsoft Press provides corrections for books through the Web at the following address:
http://www.microsoft.com/mspress/support/

To connect directly to the Microsoft Press Knowledge Base and enter a query regarding a question or issue that you might have, go to:
http://www.microsoft.com/mspress/support/search.asp

If you have comments, questions, or ideas regarding this book or the sample files, please send them to Microsoft Press using either of the following methods:

Postal mail:
Microsoft Press
Attn: Microsoft .NET XML Programming Editor
One Microsoft Way
Redmond, WA 98052-6399

E-mail:

Please note that product support is not offered through the above mail addresses. For support information, please visit the Microsoft Product Support Web site at http://support.microsoft.com


Part I: XML Core Classes in the .NET Framework

Chapter List
Chapter 1: The .NET XML Parsing Model
Chapter 2: XML Readers
Chapter 3: XML Data Validation
Chapter 4: XML Writers

Part Overview


Chapter 1: The .NET XML Parsing Model

Overview

XML is certainly a hot topic in the software community these days. As you read this, probably a thousand or more IT managers are giving presentations that include XML in one way or another. In fact, it's becoming almost redundant to emphasize the effect that the use of XML can have on applications.

Today, XML is a natural element of all forms of programming life, just as water, sun, and minerals are fundamental resources for every human being. To take full advantage of XML, applications need some infrastructure built into the operating system or into the underlying software platform. Normally, an XML infrastructure takes the form of tools that provide for parsing, document validation, schema design, and transformations.

The Microsoft .NET Framework provides a comprehensive set of classes that let you work with XML documents and related technologies at various levels and in strict accordance with the most recent World Wide Web Consortium (W3C) standards and recommendations. The XML support available in the .NET Framework covers XML 1.0, XML namespaces, Document Object Model (DOM) Level 2 Core, XML Schema Definition (XSD) Language, Extensible Stylesheet Language Transformations (XSLT), and XPath expressions. In addition, XML core classes are tightly integrated with other key portions of the .NET Framework, including data access, serialization, and applications configuration.

In this chapter, we'll take an overall look at XML as it is used in the .NET Framework. In particular, we'll focus on the new and innovative parsing model based on the concept of reader components. This first chapter is aimed at providing you with the big picture of the .NET Framework XML API, the key elements of transition from the previous Component Object Model (COM)-based Win32 API, and a bird's-eye view of the interconnections between XML and various parts of the .NET Framework.

XML in the .NET Framework

The .NET Framework XML core classes can be categorized according to their functions: reading and writing documents, validating documents, navigating and selecting nodes, managing schema information, and performing document transformations. The entire XML support of the .NET Framework is implemented in a single assembly, system.xml.dll.

The most commonly used namespaces are listed here:
• System.Xml
• System.Xml.Schema
• System.Xml.XPath
• System.Xml.Xsl

The .NET Framework also provides for XML object serialization. The classes involved with this functionality are grouped in the System.Xml.Serialization namespace. XML serialization writes objects to, and reads them from, XML documents. This kind of serialization is particularly useful over the Web in combination with the Simple Object Access Protocol (SOAP) and within the boundaries of .NET Framework XML Web services.
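The round trip that XML serialization performs can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not code from the book's samples: the Employee type and the SerializationSketch helper are hypothetical names introduced here, while XmlSerializer, StringWriter, and StringReader are the standard framework classes.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical sample type, introduced only for this illustration.
// XmlSerializer needs a public type with a public parameterless constructor.
public class Employee
{
    public int ID;
    public string LastName;
}

public static class SerializationSketch
{
    // Writes the public fields of an Employee to an XML string.
    public static string ToXml(Employee emp)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Employee));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, emp);
            return writer.ToString();
        }
    }

    // Rebuilds an Employee instance from the XML produced by ToXml.
    public static Employee FromXml(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Employee));
        using (StringReader reader = new StringReader(xml))
        {
            return (Employee) serializer.Deserialize(reader);
        }
    }
}
```

Passing an Employee to ToXml and feeding the resulting string to FromXml should yield a new object with the same ID and LastName values. Only public fields and properties take part in this kind of serialization, which is one reason it suits data exchange over the Web better than it suits saving arbitrary object state.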


Related XML Standards

Table 1-1 lists the XML-related standards that have been implemented in the .NET Framework. The table also provides the official URL for each standard for further reference.

Table 1-1: W3C Standards Supported in the .NET Framework

Standard                        Reference
XML 1.0                         http://www.w3.org/TR/1998/REC-xml-19980210
XML namespaces                  http://www.w3.org/TR/REC-xml-names
XML Schema                      http://www.w3.org/TR/xmlschema-2
DOM Level 1 and Level 2 Core    http://www.w3.org/TR/DOM-Level-2
XPath                           http://www.w3.org/TR/xpath
XSLT                            http://www.w3.org/TR/xslt
SOAP 1.1                        http://www.w3.org/TR/SOAP

As a data exchange technology, XML is fully and tightly integrated into the .NET Framework. Table 1-2 provides a quick schematic view of the main areas of the .NET Framework in which significant traces of XML are clearly visible. Each area includes numerous classes and provides a set of application-level functions.

Table 1-2: Areas of the .NET Framework in Which XML Is Key

Category: Description
ADO.NET: Data container objects (for example, the DataSet object) are always transferred and remoted via XML. The .NET Framework also provides for two-way synchronized binding between data exposed in tabular format and XML format.
Configuration: Application settings are stored in XML files, making use of predefined and user-defined section readers. (More on readers later.)
Remoting: Remote .NET Framework objects can be accessed by using SOAP packets to prepare and perform the call.
Web services: SOAP is a lightweight XML protocol that Web services use for the exchange of information in a decentralized, distributed environment. Typically, you use SOAP to invoke methods on a Web service in a platform-independent fashion.
XML parsing: The core classes providing for XML parsing and manipulation through both the stream-based API and the XML Document Object Model (XMLDOM).
XML serialization: Supplies the ability to save and restore living instances of objects to and from XML documents.
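To make the ADO.NET row of Table 1-2 a little more concrete, the following sketch shows a DataSet rendering its contents as XML. It is an illustration under assumptions rather than one of the book's samples: the DataSetXmlSketch class and the "NorthwindSample" and "Employees" names are made up here, and the data is built in memory so that no database is needed.

```csharp
using System;
using System.Data;

public static class DataSetXmlSketch
{
    // Builds a tiny in-memory DataSet and returns its XML representation.
    public static string BuildXml()
    {
        // The DataSet name becomes the root element of the XML output.
        DataSet ds = new DataSet("NorthwindSample");

        // Hypothetical table and columns, defined only for this example.
        DataTable table = ds.Tables.Add("Employees");
        table.Columns.Add("ID", typeof(int));
        table.Columns.Add("LastName", typeof(string));
        table.Rows.Add(new object[] { 1, "Davolio" });

        // GetXml returns the data as an XML string, without schema information.
        return ds.GetXml();
    }
}
```

The string returned by BuildXml contains one element per row, with a child element for each column. The related WriteXml and ReadXml methods work the same way against streams and files.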


Although not strictly part of the .<strong>NET</strong> Framework, another group of classes deservesmention: the managed classes def<strong>in</strong>ed <strong>in</strong> the SQL Server 2000 <strong>XML</strong> Extensions(SQL<strong>XML</strong>). SQL<strong>XML</strong> 3.0 extends the <strong>XML</strong> capabilities of SQL Server 2000 by<strong>in</strong>troduc<strong>in</strong>g Web services support. SQL<strong>XML</strong> 3.0 makes it possible <strong>for</strong> you to exportstored procedures as SOAP-based Web services and also extends ADO.<strong>NET</strong>capabilities with server-side XPath queries and <strong>XML</strong> views. SQL<strong>XML</strong> 3.0 is available asa separate download, but it seamlessly <strong>in</strong>tegrates with the exist<strong>in</strong>g <strong>in</strong>stallation of the.<strong>NET</strong> Framework. We'll look at SQL<strong>XML</strong> 3.0 <strong>in</strong> more detail <strong>in</strong> Chapter 8.In general, the entire set of <strong>XML</strong> classes provided with the .<strong>NET</strong> Framework offers astandards-compliant, <strong>in</strong>teroperable, extensible solution to today's software developmentchallenges. This support is not a tacked-on API but a true part of the .<strong>NET</strong> Framework.NoteAlmost all of today's <strong>XML</strong> parsers support the latest W3Cspecification <strong>for</strong> the DOM Level 2 Core. The current specificationdoes not def<strong>in</strong>e a standard <strong>in</strong>terface to persist and restore contents,however, although the most popular <strong>XML</strong> parsers, such as<strong>Microsoft</strong>'s <strong>XML</strong> Core Services (MS<strong>XML</strong>)—<strong>for</strong>merly known as the<strong>Microsoft</strong> <strong>XML</strong> Parser—and some others based on Java, alreadyhave their own ways to persist objects to streams and to restoreobjects from them. 
These mechanisms have yet to be consideredas custom and plat<strong>for</strong>m-specific extensions. An official API <strong>for</strong>serializ<strong>in</strong>g documents to and from <strong>XML</strong> <strong>for</strong>mat will not be availableuntil DOM Level 3 Core achieves the status of a W3Crecommendation. As of summer 2002, DOM Level 3 Core isqualified as a work <strong>in</strong> progress. The publicly available draft def<strong>in</strong>esthe specification <strong>for</strong> a pair of Load and Save methods designed toenable load<strong>in</strong>g <strong>XML</strong> documents <strong>in</strong>to a DOM representation andsav<strong>in</strong>g a DOM representation as an <strong>XML</strong> document. For more<strong>in</strong><strong>for</strong>mation, refer to http://www.w3.org/TR/2002/WD-DOM-Level-3-Core-20020409.A known parser that already provides an experimentalimplementation of DOM Level 3 Core is IBM's <strong>XML</strong> Parser <strong>for</strong> Java(Xml4J). See http://www.alphaworks.ibm.com/tech/xml4j <strong>for</strong> more<strong>in</strong><strong>for</strong>mation.Core Classes <strong>for</strong> Pars<strong>in</strong>gRegardless of the underly<strong>in</strong>g plat<strong>for</strong>m, the available <strong>XML</strong> parsers fall <strong>in</strong>to one of twoma<strong>in</strong> categories: tree-based parsers and event-based parsers. Each parser category isdesigned accord<strong>in</strong>g to a different philosophical approach and, subsequently, has itsown pros and cons. The two categories are commonly identified with their two mostpopular implementations: <strong>XML</strong>DOM and Simple API <strong>for</strong> <strong>XML</strong> (SAX). The <strong>XML</strong>DOMparser is a generic tree-based API that renders an <strong>XML</strong> document as an <strong>in</strong>-memorystructure. 
The SAX parser provides an event-based API for processing each significant element in a stream of XML data.

Conceptually speaking, a SAX parser is diametrically opposed to an XMLDOM parser, and the gap between the two models is indeed fairly large. XMLDOM seems to be clearly defined in its set of functionalities, and there is not much more one can reasonably expect from the evolution of this model. Regardless of whether you like the XMLDOM model or find it suitable for your needs, you can't really expect to radically improve or change its way of working. In a certain sense, the downsides of the XMLDOM model (memory footprint and bandwidth required to process large documents) are structural and stem directly from design choices.

SAX parsers work by letting client applications pass live instances of platform-specific objects to handle parser events. The parser controls the whole process and pushes data to the application, which is in turn free to accept or simply ignore the data. The SAX model is extremely lean and has low space complexity.

The .NET Framework provides full support for the XMLDOM parsing model but not for the SAX model. The set of .NET Framework XML core classes supports two parser models: XMLDOM and a new model called an XML reader. The lack of support for SAX parsers does not mean that you have to give up the functionality that a SAX parser can bring, however. All the functions of a SAX parser can be easily and even more effectively implemented using an XML reader. Unlike a SAX parser, a .NET Framework XML reader works under the total control of the client application, enabling the application to pull out only the data it really needs and skip over the remainder of the XML stream.

Readers are based on .NET Framework streams and work in much the same way as a database cursor. Interestingly, the classes that implement this cursor-like parsing model also provide the substrate for the .NET Framework implementation of the XMLDOM parser.
Two abstract classes, XmlReader and XmlWriter, are at the very foundation of all .NET Framework XML classes, including XMLDOM classes, ADO.NET-related classes, and configuration classes. So in the .NET Framework you have two possible approaches when it comes to processing XML data. You can use either classes built directly on XmlReader and XmlWriter or classes that expose information through the well-known XMLDOM.

The set of XML core classes also includes tailor-made class hierarchies to support other related XML technologies such as XSLT, XPath expressions, and the Schema Object Model (SOM).

We'll look at XML core classes and related standards in the following chapters. In particular, Chapter 2, Chapter 3, Chapter 4, and Chapter 5 describe the core classes and parsing models. Chapter 6 and Chapter 7 examine the related standards, such as XPath and XSL.

XML and ADO.NET

The interaction between ADO.NET classes and XML documents takes one of two forms:
• Serialization of ADO.NET objects (in particular, the DataSet object) to XML documents and corresponding deserialization.
Data can be saved to XML in a variety of formats, with or without schema information, as a full snapshot of the in-memory data including pending changes and errors, or with just the current instance of the data.
• A dual-access model that lets you access and update the same piece of data either through a hierarchical programming interface or using the ADO.NET relational API. Basically, you can transform a DataSet object into an XMLDOM object and view the XMLDOM's subtrees as tables merged with the DataSet object's tables.

The ADO.NET DataSet class represents the only .NET Framework object that can be natively saved to XML. The XML representation of a DataSet object can have two different layouts: the ADO.NET normal form and the DiffGram format. In particular, the DiffGram format describes the history of the data and all recent changes. Each changed row in each table is represented by two nodes: the first node contains the snapshot of the row as it was originally read, and the second node contains the current values. The DiffGram represents a snapshot of the DataSet state and contents at a given moment. To write DiffGrams, ADO.NET uses an XmlWriter object.

The integration of and interaction between XML and ADO.NET classes are discussed in Chapter 8.

Application Configuration

Before Microsoft Windows 95, applications stored configuration settings to a text file with a .ini extension. INI files store information using name/value pairs grouped under sections. Ultimately, an INI file is a collection of sections, with each section consisting of any number of name/value pairs.

Windows 95 revamped the role of the system registry, a centralized data repository originally introduced with Windows NT. The registry is a collection of binary files that the operating system manages in exclusive mode. Client applications can read and write the contents of the registry only by using a tailor-made API. The registry works as a kind of hierarchical database consisting of root nodes (also known as hives), nodes, and entries. Each entry is a name/value pair.

All system, component, and application settings are supposed to be stored in the registry. The registry continues to increase in size, contributing to the creation of a configuration subsystem with a single (and critical) point of failure.
More recently, applications have been encouraged to store custom settings and preferences in a local file stored in the application's root folder. For .NET Framework applications, this configuration file is an XML file written according to a specific schema.

In addition, the .NET Framework provides a specialized set of classes to read and write settings. The key class is named AppSettingsReader and works as a kind of parser for a small fragment of XML code, mostly a node or two with a few attributes.

ASP.NET applications store configuration settings in a file named web.config that is located in the root of the application's virtual folder. Windows Forms applications, on the other hand, store their preferences in a file with the same name as the executable plus a .config extension (for example, myprogram.exe.config). The CONFIG file must be available in the same folder as the main executable. The schema of the CONFIG file is the same regardless of the application model.

The contents of a CONFIG file are logically organized into sections. The .NET Framework provides a number of predefined sections to accommodate Web and Windows Forms settings, remoting parameters, and ASP.NET run-time characteristics such as the authentication scheme and registered HTTP handlers and modules.

User-defined applications can extend the XML schema of the CONFIG file by defining custom sections with custom elements.
By default, however, the AppSettingsReader class supports only settings expressed in a few formats, such as name/value pairs and a single tag with as many attributes as needed. This schema fits the bill in most cases, but when you have complex structured information, it soon becomes insufficient. Information is read from a section using special objects called section handlers. If no predefined section structure fits your needs, you can provide a tailor-made configuration section handler to read your own XML data, as shown here:
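A minimal sketch of such a handler, written in C#, might look like the following. The section name menuItems, its item elements, the text and url attributes, and the class name MenuSectionHandler are all hypothetical and chosen only for illustration. The one piece taken from the framework itself is the IConfigurationSectionHandler interface, whose Create method receives the XML node of the section.

```csharp
using System;
using System.Collections.Specialized;
using System.Configuration;
using System.Xml;

// Hypothetical handler for a custom <menuItems> section. The section,
// element, and attribute names are invented for this sketch.
public class MenuSectionHandler : IConfigurationSectionHandler
{
    // Receives the XML node of the section and returns whatever object
    // the application finds convenient to work with.
    public object Create(object parent, object configContext, XmlNode section)
    {
        NameValueCollection items = new NameValueCollection();
        foreach (XmlNode child in section.ChildNodes)
        {
            // Skip comments and whitespace; only <item> elements matter here.
            if (child.NodeType != XmlNodeType.Element || child.Name != "item")
                continue;

            XmlAttribute text = child.Attributes["text"];
            XmlAttribute url = child.Attributes["url"];
            if (text != null && url != null)
                items.Add(text.Value, url.Value);
        }
        return items;
    }
}

public class SectionHandlerDemo
{
    public static void Main()
    {
        // In a real application the fragment would live in the CONFIG file.
        // It is parsed from a string here to keep the sketch self-contained.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<menuItems>" +
            "<item text='Home' url='default.aspx' />" +
            "<item text='Search' url='search.aspx' />" +
            "</menuItems>");

        MenuSectionHandler handler = new MenuSectionHandler();
        NameValueCollection items =
            (NameValueCollection) handler.Create(null, null, doc.DocumentElement);

        foreach (string key in items.AllKeys)
            Console.WriteLine("{0} -> {1}", key, items[key]);
    }
}
```

In an actual CONFIG file, the handler would be registered under the configSections element and the parsed object retrieved through the configuration API rather than by calling Create directly. The direct call above is only a way to exercise the parsing logic in isolation.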


⋮
A configuration section handler is simply a .NET Framework class that parses a particular XML fragment extracted from the CONFIG file. We'll look at custom section handlers in more detail in Chapter 15.

Interoperability

XML is key to making .NET Framework applications interoperate with each other and with external applications running on other software and hardware platforms. XML interoperability is a sort of blanket term that covers three .NET-specific technologies: XML Web services, remoting, and XML object serialization.

By rolling functionality into an XML Web service, you can expose the functionality to any application on the Web that, irrespective of platform, speaks HTTP and understands XML. Based on open standards (HTTP and XML, but also SOAP), XML Web services are an emerging technology for system interoperation and are supported by the major players in the IT industry. The .NET Framework provides a special infrastructure to build both remote services and proxy-based clients.

Actually, in the .NET Framework, an XML Web service is treated as a special case of an ASP.NET application: one that is saved with a different file extension (.asmx) and accessible through the SOAP protocol as well as through HTTP GET and POST commands.
Incoming calls for both .aspx files (ASP.NET pages) and .asmx files are processed by the same Internet Information Services (IIS) extension module, which then dispatches the request to distinct downstream factory components.

In an XML Web service, XML plays its role entirely behind the scenes. It is first used as the glue for the SOAP payloads that the communicating sides exchange. In addition, XML is used to express the results of a remote, cross-platform call. But what if you write a .NET XML Web service with one method returning, say, an ADO.NET DataSet object? How can a Java application handle the results? The answer is that the DataSet object is serialized to XML and then sent back to the client.

The .NET Framework provides two types of object serialization: serialization through formatters and XML serialization. The two live side by side but have different characteristics. XML serialization is the process that converts the public interface of an object to a particular XML schema. The goal is to simplify data exchange between components rather than to truly serialize objects that will then be deserialized to live, working instances.

Remoting is the .NET Framework counterpart of the Distributed Component Object Model (DCOM) and uses XML to configure both the client and the remote components. In addition, XML is used through SOAP to serialize outbound parameters and inbound return values.
Remoting is the official .NET Framework API for communicating applications, but it works only between .NET peers.

XML serialization, remoting, and XML Web services are covered in Part IV, specifically in Chapter 11, Chapter 12, and Chapter 13.

From MSXML to .NET Framework Classes

Prior to the advent of the .NET Framework, managing XML in the Microsoft world meant using the COM-based MSXML, now available in version 4.0, SP1. It goes without saying that Microsoft is still strongly committed to supporting XML the COM way, although this does not necessarily mean that we are going to have an MSXML 5.0 anytime soon. However, MSXML 4.0 represents an excellent parser for the Windows platform and has been updated to support W3C final recommendations for the XML Schema.

COM and .NET Framework XML Core Services

The first difference between MSXML and .NET Framework XML core classes that catches the eye is the fact that while MSXML supports XMLDOM and SAX parsers, the .NET Framework supplies an XMLDOM parser and XML readers and writers. (More on readers shortly.) This is just the most remarkable example of a common pattern, however. Quite a few key features of MSXML are apparently not supported in the .NET Framework XML core classes, but this hardly results in a loss of programming power.

In general, the biggest (and perhaps the only significant) difference between MSXML and .NET Framework XML classes is that the latter represents a set of classes fully integrated into an all-encompassing, self-contained framework. Several functionalities that MSXML has to provide on its own come for free in the .NET Framework from other compartments.
If you happen to use a certain MSXML function and you don't find a direct counterpart in the .NET Framework, check out the MSDN documentation before you panic. In the paragraphs that follow, we'll look at a few examples of .NET Framework functionality that provide the equivalent of some MSXML functionality.

MSXML supports asynchronous loading and validation while parsing. The .NET Framework XMLDOM parser, centered around the XmlDocument class, does not directly provide the same features, but proper use of the resources of the .NET Framework will let you obtain the same final behavior anyway.

MSXML also provides for a multithreaded HTTP client (the XmlHttp object) capable of issuing both synchronous and asynchronous calls to a remote URL. A similar feature is certainly available in the .NET Framework, but it has nothing to do with XML classes. If you just want your application to act as an HTTP client, use some of the classes in the System.Net namespace (for example, HttpWebRequest and HttpWebResponse).

In general, if you loved MSXML, you'll love .NET Framework XML classes too.
The overall programming interface, especially for XMLDOM processing, is similar, although the underlying implementation is radically different, and several methods and properties have been renamed.

Note
In MSXML 4.0, Microsoft introduced the same level of support for some relatively newer XML standards that are found in .NET Framework XML core classes, in particular XSD, the XML Schema object model, and XPath. If you look at MSXML 3.0, however, the differences between managed and unmanaged XML processing are clearer.

Using MSXML in the .NET Framework

As with other COM objects, you can import the MSXML type library within the boundaries of a .NET application. The layer of system code providing for COM importation in the .NET Framework is the COM Interop Services (CIS). CIS provides access to existing COM components in a codeless and seamless way, without requiring modification of the original component.

The CIS consists of two distinct parts: one part makes COM components usable from within .NET applications, and the other part does the opposite, namely, making .NET classes callable from within a COM component. To incorporate a COM object into a managed application, you must first create a .NET wrapper class that exposes all the public methods and properties found in the component's type library. Microsoft Visual Studio .NET, for example, creates such a class on the fly, immediately after adding the proper library reference to the current project.

During the process, the involved types are converted from COM types and adapted to fit into the .NET Framework type system. After the importation is complete, the original COM object is ready for use in the .NET Framework, and more importantly, it has preserved the original interface while adding some .NET Framework-specific members such as ToString and GetType. In the end, for a Microsoft Visual Basic 6.0 programmer who happens to use Visual Basic .NET, the code to be written is nearly identical.

Note
To generate a .NET wrapper class for a COM object, you can also use the tlbimp.exe utility from the command line. This utility gives you full control over the entire process, and by using command-line switches, you can intervene in many useful areas, including the (strong) name of the assembly and the wrapping namespace.

Although importing MSXML functionality into a .NET application is straightforward, you must have a good reason for doing so.
Jumping continuously in and out of the .NET common language runtime (CLR) can result in a performance hit, not to mention the fact that you end up using a programming model that, although perfectly functional, is not the best suited for the surrounding environment.

The .NET Framework XML API

The essence of XML in the .NET Framework is found in two abstract classes: XmlReader and XmlWriter. These classes are at the core of all other .NET Framework XML classes, including the XMLDOM classes, and are used extensively by various subsystems to parse or generate XML text. For example, ADO.NET data adapters retrieve the data to store in a DataSet object using a database reader, and the DataSet object serializes its contents to the DiffGram format using an XmlTextWriter object, which derives from XmlWriter.

XML readers and writers constitute the primitive I/O functions for XML documents and are used to build more sophisticated functionalities. So overall, you have two possible approaches when it comes to processing XML data. You can use any of the specialized classes built on top of XmlReader and XmlWriter, or you can use document classes that expose the contents through the well-known and classic XMLDOM.

The direct use of readers represents a stream-based approach to XML parsing that is fast and stateless.
The use of XMLDOM classes (for example, XmlDocument) represents the traditional XMLDOM parsing model. Readers are representative of a pull model, as opposed to the SAX parser's typical push model. You can certainly build a push model atop a pull model-based API. Unfortunately, the reverse is never true, and that's why there is no SAX support in the .NET Framework. (In Chapter 2, you'll learn the basics of implementing a SAX parser using .NET Framework XML readers.)

The XML API for the .NET Framework comprises the following set of functionalities:
• XML readers
• XML writers
• XML document classes

All of these functionalities must overcome the rather subtle problem of type mapping. The .NET Framework XML type system has several things in common with the XSD Schema type system, and ad hoc conversion classes in the .NET Framework provide for applicable transformations.
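To make the type-mapping problem concrete, the short C# sketch below uses the XmlConvert class from the System.Xml namespace, which converts between .NET values and their XSD lexical forms. Treating XmlConvert as representative of the conversion classes mentioned above is an assumption made for this illustration, and the sample values are arbitrary.

```csharp
using System;
using System.Xml;

public class TypeMappingDemo
{
    public static void Main()
    {
        // .NET value to XSD lexical form. The ordinary ToString method on a
        // Boolean yields "True", which is not a valid xsd:boolean literal,
        // so XmlConvert is used for text that goes into an XML document.
        string flag = XmlConvert.ToString(true);
        string number = XmlConvert.ToString(1234.5);

        // XSD lexical form to .NET value. XSD accepts "1" and "0" as well
        // as "true" and "false" for Boolean values.
        bool parsedFlag = XmlConvert.ToBoolean("1");
        int parsedInt = XmlConvert.ToInt32("42");

        Console.WriteLine("Boolean as XSD text: {0}", flag);
        Console.WriteLine("Double as XSD text:  {0}", number);
        Console.WriteLine("Parsed Boolean:      {0}", parsedFlag);
        Console.WriteLine("Parsed integer:      {0}", parsedInt);
    }
}
```

The conversions are independent of the current culture, which matters for numbers and dates: XML text written on a machine with one regional setting has to parse correctly on a machine with another.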


Before we go any further into this overview of the key groups of classes, let's look at readers and writers in general. Readers and writers represent two rather generic software components that find several concrete (and powerful) implementations throughout the .NET Framework. The reader component provides a relatively common programming interface to read information out of a file or a stream. The writer component offers a common set of methods to write information down to a file or a stream in a format-independent way. Not surprisingly, readers operate in read-only mode, whereas writers accomplish their tasks operating in write-only mode.

.NET Framework Readers and Writers

In the .NET Framework, the classes available from the System.IO namespace provide for both synchronous and asynchronous read/write operations on two distinct categories of data: streams and files. A file is an ordered and named collection of bytes and is persistently stored to a disk. A stream represents a block of bytes that is read from, and written to, a data store. The data store can be based on a variety of storage media, including memory, disk files, and remote URLs. A stream is a kind of superset of a file, or in other words, a file that can be saved to a variety of storage media including memory. To work with streams, the .NET Framework defines several flavors of reader and writer classes. Figure 1-1 shows how each class relates to the others.


Figure 1-1: Streams can be read and written using made-to-measure reader and writer classes.

The base classes are TextReader, TextWriter, BinaryReader, BinaryWriter, and Stream. With the exception of the binary classes, all of these classes are marked as abstract (MustInherit, if you speak Visual Basic) and cannot be directly instantiated in code. You can use abstract classes to reference live instances of derived classes, however.

In the .NET Framework, base reader and writer classes find a number of concrete implementations, including StreamReader and StringReader and their writing counterparts. By design, reader and writer classes work on top of .NET streams and provide programmers with a customized programming interface able to handle a particular type of underlying data or file format. Although each specific reader or writer class is tailor-made for the content of a given type of stream, they share a common set of methods and properties that defines the official .NET interface for reading and writing data.

The Cursor-Like Approach

A reader works in much the same way as a client-side database cursor. The underlying stream is seen as a logical sequence of units of information whose size and layout depend on the particular reader. Like a cursor, the reader moves through the data in a read-only, forward-only way.
Normally, a reader is not expected to cache any information, but this is only common practice rather than a strict requirement for all standard .NET readers.

ADO.NET data reader classes (for example, SqlDataReader) are simply .NET readers that move from one record to the next and expose the contents of the current record through a tailor-made interface. The unit of information read at every step is the database row. Similarly, a reader working on a disk file stream would treat the single byte as its atomic unit of information, whereas a text reader would perhaps specialize in extracting one row of text at a time.

XML readers are simply another, very peculiar, type of .NET reader. The class parses the contents of an XML file, moving from one node to the next. In this case, the unit of information processed is the XML node, be it an element, an attribute, a comment, or a processing instruction.

XML Readers

An XML reader exposes a programming interface through which callers can connect and pull out all the data they need. This is in no way different from what happens when you connect to a database and fetch data. The database server returns a reference to an internal object, the cursor, which manages all the query results and makes them available on demand.
This statement applies regardless of the fact that the database world might provide several flavors of cursors: client, scrollable, server-side, and so on.

With XML readers, client applications are returned a reference to an instance of the reader class, which abstracts the underlying data stream. Methods on the reader class allow you to scroll forward through the contents, moving from node to node rather than from byte to byte or from record to record. When viewed from the perspective of readers, an XML document ceases to be a tagged text file and becomes a serialized collection of nodes. Such a cursor model is specific to the .NET platform, and to date, you will not find a similar programming API available for other platforms, including Microsoft Win32.
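The node-by-node, forward-only movement described above can be sketched in a few lines of C#. The example uses XmlTextReader, a concrete reader class in the System.Xml namespace. The sample document and its element and attribute names are hypothetical, and the document is held in a string only to keep the sketch self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

public class ReaderDemo
{
    public static void Main()
    {
        // Hypothetical sample document.
        string xml =
            "<books>" +
            "<!-- sample data -->" +
            "<book id='A1'><title>First Title</title></book>" +
            "<book id='B2'><title>Second Title</title></book>" +
            "</books>";

        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        try
        {
            // Each call to Read advances the cursor to the next node.
            // There is no way to move backward.
            while (reader.Read())
            {
                // The application decides what to pull out. Here only the
                // start tags of <book> elements are of interest.
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "book")
                {
                    Console.WriteLine("Found book with id {0}", reader.GetAttribute("id"));
                }
                // Every other node (comments, text, end tags) is skipped
                // simply by not acting on it.
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The loop is driven entirely by the caller, which is what distinguishes this pull model from a parser that pushes events at its own pace.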


Readers vs. XMLDOM

XML readers don't require you to keep more data in memory than you actually need. When you open the XML document, a simple logical pointer that corresponds to a node is returned. You can easily skip over nodes to locate the one you need. In doing so, you don't tax the application's memory with any data other than that required to buffer the currently selected node.

In contrast, the XMLDOM (a full read/write parser model) has the drawback that it might require a significant memory footprint and a long time to set up large documents in memory. Once in memory, however, the document can be easily and quickly read, edited, and serialized. To search for a single node, or to change an individual property, you have to load the whole document in memory. As you can guess, this is not necessarily an optimal approach and might not be the appropriate way to go for most applications.

Taking the cursor-like approach to its limit, you can also observe an interesting convergence between readers and the XMLDOM. In fact, by visiting all element and attribute nodes in the stream and storing the related data in a memory tree, you build a dynamic and customized XMLDOM. Incidentally, this is just what happens in the .NET Framework when XMLDOM classes are instantiated using readers to load data and are serialized to disk using writers.

Readers vs. SAX

A SAX parser directly controls the evolution of the parsing process and pushes data to the client application.
A cursor parser (that is, an XML reader), on the other hand, plays a more passive role and leaves client applications to control the process.

Giving applications, not the parser, control over the parsing process promotes the pull model (as opposed to the SAX parser's push model), in which the parser is invoked to obtain a reference to the underlying XML document. The parser also exposes methods for the client to navigate through the obtained document.

In addition to providing a simplified programming interface, the pull model is on average more efficient than the push model. For example, the pull model allows client applications to implement selective node processing and just skip over unneeded nodes. With SAX and the push model, all data has to pass through the application, which is the only entity that can reliably determine what is of interest and what can be discarded.

Note
The push model, at least as implemented in SAX, can also be quite boring to code. SAX works by passing node contents to application-defined handlers. A handler is a living instance of an object that implements one or more interfaces according to the specification. So an application that needs to parse XML documents using SAX assigns instances of these objects to ad hoc properties on the SAX parser.
Once started, the parser calls back the handlers through the predefined interfaces whenever it parses some content that relates to a given handler.

XML Writers

The .NET XML API separates parsing from editing and writing and offers a set of methods that provides effective results for performance as well as usability. When writing, you create new XML documents working at a considerably high level of abstraction and explicitly indicate the XML elements to create: nodes, attributes, comments, or processing instructions. The writer works on a stream, dumping content incrementally, one node after the next, without the random access capabilities of the XMLDOM but also without its memory footprint.

To grasp the importance of XML writers, consider that, in general, the only alternative you have for writing XML contents to any storage media consists of preparing the entire output as a string and then writing it off. In this case, the markup nature of XML is more hindrance than real help, because you must yourself take care of the intricacies of quotation marks, attributes, indentation, and end tags.

In the .NET Framework, XML writers come to the rescue and let you write XML documents programmatically in much the same way you write them through text editors. For example, you can specify whether you want a namespace prefix, the padding character and the size of the indentation, the quotation mark and the newline character, and even how you want white spaces to be treated. To create nodes, you simply use ad hoc methods to write comments, attributes, and element nodes. The overall method of working is simple and extremely effective.

The .NET Framework provides several types of writers that use heterogeneous output devices: strings, HTTP response, and HTML documents. You could also use an XML text writer to dump contents to a stream object or a new text file.
In the latter two cases, you could also specify character encoding. If the encoding argument is null, the UTF-8 encoding (a Unicode encoding based on 8-bit code units) will be used.

XML writers, and in particular the XmlTextWriter class, are used throughout the .NET Framework for creating any sort of XML output. We'll look at XML writers in detail in Chapter 4.

The XML Document Object API in .NET

As mentioned, along with XML readers and writers, the .NET Framework also provides classes that load and edit XML documents according to the W3C DOM Level 1 and Level 2 Core. The key XMLDOM class in the .NET Framework is XmlDocument, which is not much different from the DOMDocument class that you might recognize from working with MSXML.

The XMLDOM supplies an in-memory tree-based representation of XML documents and supports both navigation and editing of the document. In addition, the XMLDOM classes can handle both XPath queries and XSLT.

Tightly coupled with the XmlDocument class is the XmlDataDocument class. It extends XmlDocument and focuses on XML storage and retrieval of structured tabular data. In particular, XmlDataDocument can import data from an ADO.NET DataSet object and export regular XML contents to the DataSet relational format. Regular XML content is a set of nodes with exactly one level of subnodes, with each node having the same number of children.
The ultimate goal of this requirement is enabling the XML contents to fit into a relational table.

The XMLDOM representation of an XML document is fully editable. Attributes and text can be randomly accessed, and nodes can be added and removed. You perform updates on a loaded XMLDOM document by first creating a node object (the XmlNode class) and then binding it to the existing tree. All in all, the underlying writing pattern is close to that of XML writers: you write nodes to the stream in one case, and you add nodes to the tree in the other. Of course, if you are using the XMLDOM, bear in mind that all changes occur in memory and must be flushed to the storage medium prior to return. (The XMLDOM API is described in detail in Chapter 5.)
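The parallel between the two writing patterns can be illustrated with a short sketch that produces the same small document twice, once by writing nodes to a stream and once by adding nodes to a tree. It is only an illustration under stated assumptions: the WriteSketch class, its method names, and the sample element names are hypothetical, and the output is kept in a string rather than saved to a storage medium.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WriteSketch
{
    // Writer pattern: nodes are written to the stream one after the next.
    public static string WithWriter()
    {
        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);
        writer.WriteStartElement("book");
        writer.WriteAttributeString("isbn", "0-7356-1801-1");
        writer.WriteElementString("publisher", "Microsoft Press");
        writer.WriteEndElement();
        writer.Flush();
        string result = output.ToString();
        writer.Close();
        return result;
    }

    // XMLDOM pattern: nodes are created first and then bound to the in-memory tree.
    public static string WithDom()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement book = doc.CreateElement("book");
        book.SetAttribute("isbn", "0-7356-1801-1");
        XmlElement publisher = doc.CreateElement("publisher");
        publisher.InnerText = "Microsoft Press";
        book.AppendChild(publisher);
        doc.AppendChild(book);
        // All changes live in memory; OuterXml only serializes the tree to a string.
        // A real application would call doc.Save(...) to flush them to storage.
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(WithWriter());
        Console.WriteLine(WithDom());
    }
}
```

The writer version never holds more than the node being written, whereas the XMLDOM version keeps the whole tree available for random access until it is serialized.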


XPath Expressions and XSLT

In the .NET Framework, XSLT and XPath expressions are fully supported but are implemented in classes distinct from those that parse and write XML text. This is a key feature of the overall .NET XML API. Each piece of functionality is provided through a small hierarchy of objects, although each subtree connects and interoperates well with others. Figure 1-2 demonstrates the interconnection between constituent APIs.

Figure 1-2: The XMLDOM API is built on top of readers and writers, but both XSLT and XPath expressions need to have a complete and XMLDOM-based vision of the entire XML document to process it.

XML readers and writers are the primitive elements of the .NET XML API. Whenever XML text must be parsed or written, all classes, directly or indirectly, refer to them. A more complex primitive element is the XMLDOM tree. Transformations and advanced queries must rely on the document in its entirety being held in memory and accessible through a well-known interface: the XMLDOM.

The XSLT Processor

The key class for XSLT is XslTransform. The class works as an XSLT processor and complies with version 1.0 of the XSLT recommendation. The class has two key methods, Load and Transform, whose behavior is for the most part self-explanatory.

Once you acquire an instance of the XslTransform class, you first load the source of an XSL document that contains the transformation rules.
By calling the Transform method, you actually perform the conversion from native XML to the output format. Prior to applying the transformation, the underlying XML document is loaded as a kind of XMLDOM tree. (The details of XSLT are covered in Chapter 7.)


The XPath Query Engine

XPath is a language that allows you to navigate within XML documents. Think of XPath as a general-purpose query language for addressing, sorting, and filtering both the elements and the text of an XML document.

The XPath notation is basically declarative. Any XPath expression is a path within the XML document that identifies the information with the given characteristics. The path defines a pattern, and the resulting selection includes all the nodes that match it. The selection is expressed through a notation that emphasizes the hierarchical relationship between the nodes. It works in much the same way files and folders work. For example, the XPath expression "book/publisher" means find the "publisher" element within the "book" element. The XPath navigation model works in the context of a hierarchy of nodes in the XML document's tree. XPath makes use of a variation of the XmlDocument class, named XPathDocument.

Running an XPath query is not actually different from executing a Transact-SQL (T-SQL) query on SQL Server. Instead of getting back a collection of rows, a valid XPath expression returns a collection of nodes. To scroll the returned nodes, you just use an XPath-customized version of a reader.
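As a rough sketch of the query-then-scroll pattern just described, the following C# fragment runs the "book/publisher" expression against a small in-memory document. It assumes the XPathDocument, XPathNavigator, and XPathNodeIterator classes from the System.Xml.XPath namespace, with the iterator playing the role of the reader-like cursor; the XPathSketch class and the SelectValues method are hypothetical names used only for this example.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class XPathSketch
{
    // Evaluates the expression from the document root and joins the node values.
    public static string SelectValues(string xml, string expression)
    {
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator navigator = doc.CreateNavigator();

        // The query returns a collection of nodes rather than a collection of rows.
        XPathNodeIterator nodes = navigator.Select(expression);

        string values = "";
        while (nodes.MoveNext())
        {
            if (values.Length > 0) values += ",";
            values += nodes.Current.Value;
        }
        return values;
    }

    public static void Main()
    {
        string xml = "<book><title>Applied XML</title><publisher>Microsoft Press</publisher></book>";
        Console.WriteLine(SelectValues(xml, "book/publisher"));
    }
}
```

Because the navigator starts at the document root, the relative path "book/publisher" selects the publisher children of the root book element.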
We'll look at XPath in more detail in Chapter 6.

Conclusion

In this chapter, we examined the building blocks of XML and explored the rationale behind XML readers and writers, a new and innovative way to perform basic operations on XML data sources. In the .NET Framework, XML readers introduce a database-like cursor model to navigate through data. The cursor model falls somewhere between the well-known XMLDOM and SAX models. Not as expensive as XMLDOM and more programmer-friendly than SAX, the .NET Framework cursor model presents XML as just another data format you can work on using a familiar approach.

As a developer, you are certainly familiar with I/O operations accomplished on a file or a database. Why should XML data sources be totally different? The node becomes just another atomic element, along with the database row or the byte. Ad hoc methods make it possible for you to move through nodes in a straightforward, effective way.

Readers and writers are not the only tools you can use to create XML-driven .NET applications. Another group of classes works according to the specification of the W3C DOM. XSLT and XPath expressions are a pair of XML-related technologies that are popular with developers and effective for arranging applications.
In the .NET Framework, you find made-to-measure classes that make XML-to-XML transformation and query evaluation fast and easy.

All the XML technologies introduced in this chapter will be covered in depth in the chapters that follow, beginning with XML readers in Chapter 2.

Further Reading

The W3C organization is currently working on a draft of the DOM Level 3 Core to include support for an abstract modeling schema and I/O serialization. Check out the most recent draft at http://www.w3.org/TR/2002/WD-DOM-Level3-ASLS-20020409. The approved standard, DOM Level 2 Core, is available at http://www.w3.org/TR/DOM-Level-2.

Relevant information about XML standards is available from the W3C Web site, at http://www.w3.org. If you want to learn more about the SAX specification, look at the new Web site for the SAX project, at http://www.saxproject.org.


A lot of useful developer-oriented documentation about XML is available on the Web sites of the companies that support XML. In addition to the Microsoft Web site (http://msdn.microsoft.com/xml), check out the Intel Developer Services Web site (http://cedar.intel.com). In particular, you'll find an essential guide to XML in the .NET Framework: http://cedar.intel.com/media/pdf/dotnet/net_jumpstart.pdf.

Finally, if you just want a good, all-encompassing book about XML programming, I heartily recommend the Microsoft Press Core Reference book XML Programming (http://www.microsoft.com/mspress/books/4798.asp), by R. Allen Wyke, Sultan Rehman, and Brad Leupen (Microsoft Press, 2002). For a more general look into XML as a unifying technology, Essential XML: Beyond Markup (Addison Wesley, 2000), by Don Box, Aaron Skonnard, and John Lam, is still one of the best books available.


Chapter 2: XML Readers

In the Microsoft .NET Framework, two distinct sets of classes provide for XML-driven reading and writing operations. These classes are known globally as XML readers and writers. The base class for readers is XmlReader, whereas XmlWriter provides the base programming interface for writers. In this chapter, we'll focus on a particular type of XML reader: the XML text reader. In Chapter 3, we'll zero in on validating readers and then move on to XML writers in Chapter 4.

The Programming Interface of Readers

XmlReader is an abstract class available from the System.Xml namespace. It defines the set of functionalities that an XML reader exposes to let developers access an XML stream in a noncached, forward-only, read-only way.

An XML reader works on a read-only stream by jumping from one node to the next in a forward-only direction. The XML reader maintains an internal pointer to the current node and its attributes and text but has no notion of previous and next nodes. You can't modify text or attributes, and you can move only forward from the current node. If you are visiting attribute nodes, however, you can move back to the parent node or access an attribute by index.
The visit takes place in node-first order, but other visiting algorithms can be arranged in custom reader classes. See the note on page 72 for more information about visiting algorithms.

The specification for the XmlReader class recommends that any derived class should check at least whether the XML source is well-formed and throw exceptions if an error is encountered. XML exceptions are handled through the tailor-made XmlException class. The XmlReader class specification does not say anything about XML validation. Throughout this chapter, you'll see that the .NET Framework provides several reader classes with and without validation capabilities. Valid sources for an XML reader are disk files as well as any flavor of .NET streams and text readers (for example, string readers).

An OOP Refresher

Throughout this book, I'll often use terms such as interface and class, sometimes qualified by helper adjectives such as abstract or base. Although a full explanation of these terms and their related object-oriented programming (OOP) concepts is beyond the scope of this book, a quick terminology refresher will help you get to the heart of the XML class hierarchy in the .NET Framework.

In the .NET Framework, an interface is a container for a named collection of method, property, and event definitions referred to as a contract. An interface can be used as a reference type, but it is not a creatable type.
Other types can implement one or more interfaces. In doing so, they adhere to the interface's contract and agree to provide actual implementation for all the methods, properties, and events in the contract.

A class is a container that can include data and function members (methods, properties, events, operators, and constructors). Classes support inheritance from other classes as well as from interfaces. Any class from which another class inherits is called a base class.

An abstract class can declare members without providing any implementation for them, leaving that task to derived classes. Like interfaces, abstract classes are not creatable but can be used as reference types. An abstract class differs from an interface in that it has a slightly richer set of internal members (constructors, constants, and operators). Members of an abstract class can be scoped as private, public, or protected, whereas members of an interface are mostly public. In addition, child classes can implement multiple interfaces but can inherit from only one class.
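The terms in this refresher can be tied together in a small C# sketch. All the type names here (INodeSource, NodeSourceBase, ArrayNodeSource) are hypothetical and invented purely for illustration; they are not part of the .NET Framework.

```csharp
using System;

// An interface: a contract with no implementation. It is not creatable.
public interface INodeSource
{
    string CurrentName { get; }
    bool MoveNext();
}

// An abstract class: not creatable either, but it can hold state and
// protected members that derived classes share.
public abstract class NodeSourceBase : INodeSource
{
    protected int position = -1;

    public abstract string CurrentName { get; }
    public abstract bool MoveNext();
}

// A concrete class: inherits from exactly one base class and supplies
// the implementation that the contract requires.
public class ArrayNodeSource : NodeSourceBase
{
    private readonly string[] names;

    public ArrayNodeSource(string[] names)
    {
        this.names = names;
    }

    public override string CurrentName
    {
        get { return names[position]; }
    }

    public override bool MoveNext()
    {
        position++;
        return position < names.Length;
    }
}

public static class OopSketch
{
    public static void Main()
    {
        // The interface is used as a reference type for a concrete instance.
        INodeSource source = new ArrayNodeSource(new string[] { "book", "title" });
        while (source.MoveNext())
        {
            Console.WriteLine(source.CurrentName);
        }
    }
}
```

The relationship between XmlReader and the concrete reader classes follows the same general shape: callers program against the abstract base type, and derived classes supply the behavior.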


The XmlReader Class

The XmlReader class defines methods that enable you to pull data from an XML source and to skip unwanted nodes. Bear in mind that each and every element in an XML stream is considered a node, meaning that node is a rather generic concept that applies to subtree roots as well as to attributes, processing instructions, entities, comments, and plain text.

The XmlReader class includes methods for reading XML content from an entire text file, returning the depth of the current XML node's subtree, and determining whether the contents of a given element is empty. You can also fairly easily read and navigate attributes and skip over elements and their contents. Valuable information such as the name and the contents of the current node is also returned via ad hoc properties.

Base Properties of XML Readers

Table 2-1 lists the public properties exposed by the XmlReader class. Notice that the values these properties contain depend on the actual reader class you are using in your code. The description of each property refers to the property's intended goal, but this description might not entirely reflect the actual role of the property in a derived reader class.

Table 2-1: Public Properties of the XmlReader Class

Property              Description
AttributeCount        Gets the number of attributes on the current node.
BaseURI               Gets the base URI of the current node.
CanResolveEntity      Gets a value indicating whether the reader can resolve entities.
Depth                 Gets the depth of the current node in the XML document.
EOF                   Indicates whether the reader has reached the end of the stream.
HasAttributes         Indicates whether the current node has any attributes.
HasValue              Indicates whether the current node can have a value.
IsDefault             Indicates whether the current node is an attribute that originated from the default value defined in the document type definition (DTD) or schema.
IsEmptyElement        Indicates whether the current node is an empty element, that is, one written in the short form (for example, <item/>) and therefore having no content.
Item                  Indexer property that returns the value of the specified attribute.
LocalName             Gets the name of the current node with any prefix removed.
Name                  Gets the fully qualified name of the current node.
NamespaceURI          Gets the namespace URI of the current node. Applies to Element and Attribute nodes only.
NameTable             Gets the name table object associated with the reader. (More on name table objects later.)
NodeType              Gets the type of the current node.
Prefix                Gets the namespace prefix associated with the current node.
QuoteChar             Gets the quotation mark character used to enclose the value of an attribute.
ReadState             Gets the state of the reader from the ReadState enumeration.
Value                 Gets the text value of the current node.
XmlLang               Gets the xml:lang scope within which the current node resides.
XmlSpace              Gets the current xml:space scope from the XmlSpace enumeration (Default, None, or Preserve).

Note
When you read any sort of documentation about XML, you are usually bombarded by a storm of similar-looking acronyms: URI, URL, and URN. Let's review these terms. A Uniform Resource Identifier (URI) is a string that unequivocally identifies a resource over the network. There are two types of URI: Uniform Resource Locator (URL) and Uniform Resource Name (URN). A URL is specified by the protocol prefix, the host name or IP address, the port (optional), and the path. A URN is simply a unique descriptive string. For example, the human-readable form of a CLSID (the 128-bit identifier of a COM object) is a URN.

A bit misleading is the fact that URNs are often created using URL-like strings. This regularly happens with XML namespaces, for example. The reason for this practice is that a URL has a high likelihood of being unique, especially if you use a path within your company's Web site.

An XML reader can pass through several different states. All the possible states are defined by the ReadState enumeration and are listed in Table 2-2.
The ReadState property contains a ReadState enumeration value and is expected to return the current state of the reader, but actual implementations of a reader class must ensure that the property always holds the correct value.

Table 2-2: Reader States

State                 Description
Closed                The reader is closed.
EndOfFile             The end of the file has been reached successfully, but the reader is not yet closed.
Error                 A critical error occurred, and the read operation can't continue.
Initial               The reader is in its initial position, waiting for the Read method to be called for the first time.
Interactive           The reader is open and functional.
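The states in Table 2-2 can be observed with a short sketch that records the ReadState property at a few points in a reader's life. This is an illustration only, assuming an XmlTextReader working on a well-formed in-memory string; the ReadStateSketch class and the TraceStates method are hypothetical names.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReadStateSketch
{
    // Returns the sequence of states the reader goes through, comma separated.
    public static string TraceStates(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));

        // Before the first call to Read, the reader is in its initial position.
        string trace = reader.ReadState.ToString();

        // After the first successful Read, the reader is open and functional.
        reader.Read();
        trace += "," + reader.ReadState;

        // Consume the remaining nodes until Read returns false.
        while (reader.Read()) { }
        trace += "," + reader.ReadState;

        // Closing the reader sets the internal state to Closed.
        reader.Close();
        trace += "," + reader.ReadState;

        return trace;
    }

    public static void Main()
    {
        Console.WriteLine(TraceStates("<book><title>XML</title></book>"));
    }
}
```

The Error state is not exercised here; it would come into play when the source is not well-formed and the reader throws an exception.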


The BaseURI property actually returns the URL of the node. Normally, the URL of a node (more generally, the URI) is bound to the resource name, be it a local file, a networked document, or a Web document. In these cases, the BaseURI property simply returns the URL-styled name of the resource. The following are examples of values that would be returned under these circumstances:

    file://c:/myfolder/mydoc.xml
    http://www.cpandl.com/myfolder/mydoc.xml

An XML document can result from the aggregation of various chunks of data (entities, schemas, and DTDs) coming from different network locations. In these cases, the BaseURI property tells you where these nodes come from. If the XML document is being processed through a stream (for example, an in-memory string), no URI is available and the BaseURI property returns the empty string.

Base Methods of XML Readers

Table 2-3 lists the public methods exposed by the XmlReader class. This table does not include the methods defined in the Object class and overridden in XmlReader, for example, ToString, GetType, and Equals.

Table 2-3: Public Methods of the XmlReader Class

Method                Description
Close                 Closes the reader and sets the internal state to Closed.
GetAttribute          Gets the value of the specified attribute. An attribute can be accessed by index, local name, or qualified name.
IsStartElement        Indicates whether the current content node is a start tag.
LookupNamespace       Returns the namespace URI to which the given prefix maps.
MoveToAttribute       Moves the pointer to the specified attribute. An attribute can be accessed by index, local name, or qualified name.
MoveToContent         Moves the pointer ahead to the next content node or to the end of the file. This method returns immediately if the current node is already a content node, such as non-white-space text, CDATA, Element, EndElement, EntityReference, or EndEntity.
MoveToElement         Moves the pointer back to the element node that contains the current attribute node. Relevant only when the current node is an attribute.
MoveToFirstAttribute  Moves to the first attribute of the current Element node.
MoveToNextAttribute   Moves to the next attribute of the current Element node.
Read                  Reads the next node and advances the pointer.
ReadAttributeValue    Parses the attribute value into one or more Text, EndEntity, or EntityReference nodes. (More on this in the section "Parsing Mixed-Content Attributes," on page 41.)
ReadElementString     Reads and returns the text from a text-only element.
ReadEndElement        Checks that the current content node is an end tag and advances the reader to the next node. Throws an exception if the node is not an end tag.
ReadInnerXml          Reads and returns all the content below the current node, including markup information.
ReadOuterXml          Reads and returns all the content in and below the current node, including markup information.
ReadStartElement      Checks that the current node is an element and advances the reader to the next node. Throws an exception if the node is not a start tag.
ReadString            Reads the contents of an element or a text node as a string. This method concatenates all the text up until the next markup. For attribute nodes, calling this method is equivalent to reading the attribute value.
ResolveEntity         Expands and resolves the current EntityReference node.
Skip                  Skips the children of the current node.

In addition to the methods listed in Table 2-3, the XmlReader class also features a couple of static (shared, if you speak only Microsoft Visual Basic) methods named IsName and IsNameToken. Both take a string and return a Boolean value. The return value indicates whether the given string complies with the respective definitions of a Name and a Nmtoken (name token) according to the W3C XML 1.0 Recommendation. In XML 1.0, a Name is a string that begins with a letter, an underscore (_), or a colon (:) and continues with letters, digits, periods, hyphens, underscores, and colons.
A Nmtoken, on the other hand, is any non-zero-length mixture of name characters, that is, letters, digits, periods, hyphens, underscores, and colons.

Note
A static member (as opposed to an instance member) of a class is a kind of global member that belongs to the type itself rather than to a specific instance of the class. Whereas an instance of a class contains a separate copy of all instance members, there is only one copy of each static member. Static members can't be referenced through an instance. Instead, you must reference them through the type name:

    Console.WriteLine(XmlReader.IsName("DinoEsposito"));

Members that in C# are called static and declared with the static keyword are called shared in Visual Basic .NET and are declared with the Shared keyword. Aside from this, their usage is identical.

Recognized Node Types

Each node in an XML source is of a certain type. The NodeType property is a read-only property that returns the type of the current node. The returned value belongs to the XmlNodeType enumeration, which comprises the node types listed in Table 2-4.


Table 2-4: Types of Nodes in the XmlNodeType Enumeration

Node Type / Description

Attribute
  Represents an attribute of an Element node. Attribute nodes can have two child node types, Text and EntityReference, which represent the value of the attribute. Note that an attribute is not the child of any other node type; in particular, it is not considered the child of an Element node.

CDATA
  Represents a CDATA section. A CDATA section is a block of escaped text used as is and is not recognized as markup text. A CDATA node can't have any child nodes.

Comment
  Represents a comment in the XML text. A Comment node can't have any child nodes.

Document
  Represents a document object that is the root of the document tree. Document provides access to the whole XML document and can have the following child node types: only one Element node (the actual root of the XML tree), ProcessingInstruction, Comment, and DocumentType.

DocumentFragment
  Represents a document fragment (namely, a node or an entire subtree) that is linked to a document without actually being part of it or contained in the same file.

DocumentType
  Represents a document type. A document type node is characterized by the <!DOCTYPE> tag. A DocumentType node can have child nodes of type Notation and Entity.

Element
  Represents the most common type of node found in XML documents. Element can have several types of child nodes, including other element nodes, text, comments, processing instructions, CDATA, and entity references.

EndElement
  Represents the end tag of an element node.

EndEntity
  Represents the end of an entity node.

Entity
  Represents an entity declaration. In XML, entities are much the same as macros, that is, names that point to expanded text.

EntityReference
  Represents a reference to an entity used in the body of XML documents.

None
  The node type returned by the XmlReader class if the Read method has not yet been called.

Notation
  Represents a notation in the document type declaration.

ProcessingInstruction
  Represents a processing instruction at the beginning of the XML document.

SignificantWhitespace
  Represents a significant white space character between markup text in a mixed-content model or white space within the scope of xml:space="preserve".

Text
  Represents the text content of an element.

Whitespace
  Represents an insignificant space between markup text.

XmlDeclaration
  Represents the XML declaration node. XmlDeclaration must be the first node in the document and can't have children. The node can have attributes that provide version and encoding information.

Table 2-4 includes all the possible types of nodes found within the body of an XML document, at least when the document is parsed through a .NET XML reader. Notice that the XML element that is normally perceived as being the node (that is, marked-up text) is said to be an element node. Attributes, comments, and even processing instructions are just other types of nodes. In light of this, when you move from one node to the next, you are not necessarily moving between nodes of the same type.

A lot of XML documents begin with several tags that do not represent any data content. The reader's MoveToContent method lets you skip all the heading information and position the pointer directly on the first content node.
In doing so, the method skips over the following node types: ProcessingInstruction, DocumentType, Comment, Whitespace, and SignificantWhitespace.

Specialized Reader Classes

The XmlReader class defines only the clauses and appendices in the contract that .NET XML applications sign with the actual parser class. Because XmlReader is an abstract class, you'll use it in your code only as a reference type when type casting is needed. In lieu of XmlReader, you can use any of its derived classes already defined in the .NET Framework. In addition, you can use any other custom reader class that third-party vendors, or you yourself, might have written. All of these reader classes share the programming interface with XmlReader, however, and provide an actual, albeit custom, implementation for each of the methods and properties listed in Table 2-1, on page 27, and Table 2-3, on page 30.

Implementations of the XmlReader class extend the base class and vary in their design to support different scenarios. The .NET Framework supplies the following reader classes:

• XmlTextReader Extremely fast; the reader ensures that the XML source is well-formed but neither validates it against a schema or a DTD nor resolves any embedded entity.
• XmlValidatingReader An XML reader that can validate the source using a DTD, an XML-Data Reduced (XDR) schema, and an XML Schema Definition (XSD). In addition, the reader is capable of expanding entities and also supports default attributes as defined in the DTD or schema.
• XmlNodeReader The reader specializes in parsing XML data from an XML Document Object Model (XML DOM) subtree and does not support validation.

In the next section, we'll examine the XmlTextReader class, probably the most frequently used .NET reader class. Validating readers will be covered in Chapter 3; node readers are discussed in Chapter 5. By the end of this chapter, you'll also have had in-depth exposure to the intricacies (and the flexibility) connected with the development of a custom reader class.

Parsing with the XmlTextReader Class

The XmlTextReader class is designed to provide fast access to streams of XML data in a forward-only and read-only manner. The reader verifies that the submitted XML is well-formed. It also performs a quick check for correctness on the referenced DTD, if one exists. In no case, though, does this reader validate against a schema or DTD. If you need more functionality (for example, validation), you must resort to other reader classes such as XmlNodeReader or XmlValidatingReader.

An instance of the XmlTextReader class can be created in a number of ways and from a variety of sources, including disk files, URLs, streams, and text readers. To process an XML file, you start by calling the constructor, as shown here:

    XmlTextReader reader = new XmlTextReader(file);

Note that all the public constructors available require you to indicate the source of the data, be it a stream, a file, or whatever else. The default constructor of the XmlTextReader class is marked as protected and, as such, is not intended to be used directly from user code.

After the reader is up and running, you have to explicitly open it using the Read method. This behavior is not unique to XML readers; it is common to all .NET reader components.
Readers move from their initial state to the first element using only the Read method. To move from any node to the next, you can continue using Read as well as a number of other more specialized methods, including Skip, MoveToContent, and ReadInnerXml.

To process the entire content of an XML source, you typically set up a loop based on the return value of the Read method. The Read method returns true if there's more content to be read, and false otherwise.

Accessing Nodes

The following example shows how to use an XmlTextReader object to parse the contents of an XML file and build the node layout. Let's begin by considering the following XML data:

    .NET
    Linux
    Win32
    Java

The corresponding node layout that we want to extrapolate consists of a block of XML data that comprises all the element nodes of the source file, as shown here:


To produce these results, I created the GetXmlFileNodeLayout function. This function scans the entire contents of the XML file and processes each node found along the way. Only two types of nodes are relevant for this example: the start and end tags of Element nodes. The XmlNodeType enumeration identifies these two types of nodes through the keywords Element and EndElement.

    private string GetXmlFileNodeLayout(string file)
    {
        // Open the stream
        XmlTextReader reader = new XmlTextReader(file);

        // Loop through the nodes
        StringWriter writer = new StringWriter();
        string tabPrefix = "";
        while (reader.Read())
        {
            // Write the start tag
            if (reader.NodeType == XmlNodeType.Element)
            {
                tabPrefix = new string('\t', reader.Depth);
                writer.WriteLine("{0}<{1}>", tabPrefix, reader.Name);
            }
            else
            {
                // Write the end tag
                if (reader.NodeType == XmlNodeType.EndElement)
                {
                    tabPrefix = new string('\t', reader.Depth);
                    writer.WriteLine("{0}</{1}>", tabPrefix, reader.Name);
                }
            }
        }

        // Write to the output window
        string buf = writer.ToString();
        writer.Close();

        // Close the stream
        reader.Close();
        return buf;
    }

The Boolean value that controls the main loop stops the loop when the reader's internal pointer reaches the end of the stream. GetXmlFileNodeLayout is designed to analyze all nodes but process only those of type Element or EndElement. The name of the node, formatted to look like a tag name, is output to a memory string as a line of text. After finding an Element or EndElement node, the function uses the reader's Depth property to get the nesting level of the current node and arranges a prefix string made of as many tab characters as the depth level. The prefix string is inserted into the output buffer before the node name to produce properly indented text.

You might have noticed that the GetXmlFileNodeLayout function accumulates the text that represents the node layout into a StringWriter object. The StringWriter object is a typical .NET writer class and offers a more friendly programming interface than the classic String class. StringWriter lets you express the content in lines and automatically provides for newline characters. In addition, its writing methods support placeholders and a variable-length parameters list.
GetXmlFileNodeLayout then uses the StringWriter object's ToString method to return the accumulated text as a plain string.

Note
The full source code for a Windows Forms application that uses the GetXmlFileNodeLayout function is available in this book's sample files. The application name is NodeLayout.

Reading and Converting Text

To read the content of the reader's current node, you normally use the Value property. This property, however, always returns a string that you might need to convert to a more specific type such as a date or a double. To convert a string to a .NET Framework type, you should use any of the XmlConvert class methods.

How is the XmlConvert class different from the System.Convert class, the .NET Framework primary tool for converting from one type to another? The two classes perform nearly identical tasks, but the XmlConvert class works according to the XSD data type specification and ignores the current locale. Let's look at an example that illustrates the difference between the two converting classes. Suppose that you have an XML fragment such as the following:

    2-8-2001
    150,000

The current locale dictates that the hire date is February 8, 2001, and the yearly salary is $150,000. If you convert the strings to specific .NET types using the System.Convert class, all will work as expected. If you convert using XmlConvert, you'll get errors:

    // Assume the reader is positioned on the hire date text
    DateTime dt = XmlConvert.ToDateTime(reader.Value);

    // Move the reader to the salary text
    reader.Read();
    double d = XmlConvert.ToDouble(reader.Value);
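The same failure can be reproduced without a reader. The following is a minimal sketch, not code from the sample files: the class name ConvertComparison and the choice of the en-US culture are assumptions made for illustration, and the two-argument ToDateTime overload is used because newer versions of the Framework flag the single-argument one as obsolete.

```csharp
using System;
using System.Globalization;
using System.Xml;

class ConvertComparison
{
    static void Main()
    {
        string hireDate = "2-8-2001";   // locale-style date from the text
        string salary = "150,000";      // contains a digit group symbol

        // System.Convert honors a culture; en-US is an assumption here.
        CultureInfo enUS = new CultureInfo("en-US");
        Console.WriteLine(Convert.ToDateTime(hireDate, enUS).ToString("yyyy-MM-dd"));
        Console.WriteLine(Convert.ToDouble(salary, enUS));

        // XmlConvert expects XSD lexical forms, so both strings are rejected.
        try
        {
            XmlConvert.ToDateTime(hireDate, XmlDateTimeSerializationMode.Unspecified);
        }
        catch (FormatException e)
        {
            Console.WriteLine("Date rejected: " + e.Message);
        }

        try
        {
            XmlConvert.ToDouble(salary);
        }
        catch (FormatException e)
        {
            Console.WriteLine("Salary rejected: " + e.Message);
        }
    }
}
```

Both XmlConvert calls are wrapped separately so that the date failure does not hide the salary failure.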


In particular, the XmlConvert class will not recognize the first string as a correct date. As for the salary, you'll get a message stating that the input string is not in the correct format.

If you had created the XML code programmatically using an XML writer (more on XML writers in Chapter 4) and .NET strong types, the XML fragment you're working with would be slightly different, as shown here:

    2001-02-08
    150000

To be understood in XML, a date must be in YYYY-MM-DD format and a double value should not include any locale-dependent element such as the digit group symbol. If the double value includes a fractional part, use a decimal point to separate it from the integer part. Likewise, XmlConvert recognizes Booleans only if they are expressed as true/false or 1/0 pairs.

Note
Another aspect that makes the difference between the System.Convert and XmlConvert classes even sharper is the fact that XmlConvert does not support custom format providers. The XmlConvert class works as a translator to and from .NET types and XSD types. When the conversion takes place, the result is rigorously locale independent.

Round-Tripping Non-XML Strings

Not all characters available on a given platform are necessarily valid XML characters. Only the characters included in the range of allowed characters defined in the XML specification (www.w3.org/TR/2000/REC-xml-20001006.html) can be safely used for element and attribute names.

The XmlConvert class provides key functions for tunneling non-XML names through XML over a round-trip to some servers. When names contain characters that are invalid in XML names, the methods EncodeName and DecodeName can adjust them to fit into an XML name schema. For example, several applications, including Microsoft SQL Server and Microsoft Office, allow and support Unicode characters in their documents. However, some of these characters are not valid in XML names. The typical circumstance that demonstrates the importance of XmlConvert occurs when you manipulate, say, a database column name containing blanks. Although SQL Server allows a column name such as Invoice Details, that would not be a valid name for an XML stream. The space must be replaced with its hexadecimal encoding.
A valid XML representation for the column name Invoice Details is the following string:

    Invoice_0x0020_Details

You can obtain that string by using EncodeName, as shown here:

    string xmlColName = XmlConvert.EncodeName("Invoice Details");

The reverse operation is accomplished by using DecodeName. This method translates an XML name back to its original form by unescaping any escaped sequence, as shown in the following code. Note that only fully escaped forms are detected. For example, only _0x0020_ is rendered as a blank space.


    string colName = XmlConvert.DecodeName("Invoice_0x0020_Details");

The only valid form of hexadecimal sequences is _0xHHHH_, where HHHH stands for a four-digit hexadecimal value. Similar forms are left unaltered, although they could easily be considered logically equivalent. For example, _0x20_ is not processed.

Character Encoding

XML documents can contain an attribute to specify the encoding. Character encoding provides a mapping between numeric indexes and corresponding characters that users read from a document. The following declaration shows how to set the required encoding for an XML document:

The Encoding property of the XML reader returns the character encoding found in the document. The default encoding attribute is UTF-8 (UCS Transformation Format, 8 bits).

In the .NET Framework, the System.Text.Encoding class gathers all supported encodings. Most of these encodings can be used with XML documents, with just a few exceptions. Encodings such as UTF-7 are invalid for XML documents because they require different byte values than UTF-8. UTF-8 encodes Unicode characters as sequences of one or more 8-bit units. UTF-7, on the other hand, represents them using only 7-bit values.

Accessing Attributes

Of all the node types supplied in the .NET Framework, only Element, DocumentType, and XmlDeclaration support attributes.
To check whether a given node contains attributes, use the HasAttributes Boolean property. The AttributeCount property returns the number of attributes available for the current node.

Once the reader's internal pointer is positioned on a certain node, you can directly read the value of a particular attribute using either the GetAttribute method or the indexer property Item. In both cases, overloads of the method and the property allow you to access attributes in various ways: by absolute position, by name, and by name and namespace. The returned value for an attribute is always a string; the task of converting it to a more specific data type is left to the programmer.

GetAttribute and Item provide a way to access attributes directly but require that you know the name or the ordinal position of the attribute being accessed. A third way to read attribute values is by moving the pointer to the attribute node itself and then using the Value property. You enumerate the attribute nodes using the MoveToFirstAttribute and MoveToNextAttribute methods. You can also change the pointer by moving directly to a given node using the MoveToAttribute method.

This next example demonstrates how to programmatically access any sequence of attributes for a node and concatenate their names and values in a single string. Consider the following XML fragment:

We want to create a method that, when run on this XML block of data, generates the following string:

    id="1" lastname="Users" firstname="Joe"


The method we create to do this is the user-defined function GetAttributeList. GetAttributeList takes a reference to the reader and extracts attribute values for the currently selected node.

    // Assume we call this method after having read the element node
    string GetAttributeList(XmlReader reader)
    {
        String buf = "";
        if (reader.HasAttributes)
            while (reader.MoveToNextAttribute())
                buf += reader.Name + "=\"" + reader.Value + "\" ";

        // Reposition the pointer on the element that owns the attributes
        reader.MoveToElement();
        return buf;
    }

When the pointer is not already positioned on an attribute node, calling MoveToNextAttribute is equivalent to calling MoveToFirstAttribute, which moves the pointer to the first attribute node.

An XML reader can move only forward, which means that no previously visited node can be revisited once you have moved on to another node. This rule has a very specific exception. When the pointer is positioned on an attribute node, you can move back to the parent node using the MoveToElement method. This exception exists because, after all, an attribute is a particular type of node that is used to qualify the contents of the parent. From this point of view, an attribute is seen as a sort of subnode, and moving between the attributes of a given node does not logically change the index of the current element node. Using MoveToAttribute and MoveToFirstAttribute, you can jump from one attribute node to the next in both directions.

Parsing Mixed-Content Attributes

Normally, the content of an attribute consists of a simple string of text.
If you need to use it as an instance of a more specific type (for example, a date or a Boolean value), you can convert the string using the methods of the static class XmlConvert (recommended) or even System.Convert.

In some situations, however, the content of an attribute is mixed and includes plain text along with entities. Although unable to resolve entity references, the XmlTextReader class can separate text from entities when both are embedded in an attribute's value. For this to happen, you must parse the attribute's content using the ReadAttributeValue method instead of simply reading the content via the Value property.

The following code demonstrates how to rewrite the GetAttributeList function so that it can preprocess mixed attributes and separate text from entities.

    // Assume we call this method after having read the element node
    string GetAttAttributeList(XmlReader reader)
    {
        String buf = "";
        if (reader.HasAttributes)
            while (reader.MoveToNextAttribute())
            {
                buf += reader.Name + "=\"";

                // Walk the tokens (text or entity references) of the value
                while (reader.ReadAttributeValue())
                {
                    if (reader.NodeType == XmlNodeType.EntityReference)
                        buf += "[" + reader.Name + "]";
                    else
                        buf += reader.Value;
                }
                buf += "\" ";
            }
        reader.MoveToElement();
        return buf;
    }

The ReadAttributeValue method parses the attribute value and isolates each constituent token, be it plain text or an entity. The function calls ReadAttributeValue repeatedly until the end of the attribute string is reached. Because by design the XmlTextReader parser does not resolve entities, there is not much you can do with the embedded entity other than recognizing and maybe skipping it. The preceding code, for instance, wraps the name of the entity in square brackets. When processing an element node such as this:

the GetAttAttributeList function produces the following string:

    ISBN="61801-1" author="[author], Italy"

Attribute Normalization

The W3C XML 1.0 Recommendation defines attribute normalization as the preliminary process that an attribute value should be subjected to prior to being returned to the application. The normalization process can be summarized in a few basic rules:

• Any referenced character (for example, &nbsp;) is expanded.
• Any white space character (blanks, carriage returns, linefeeds, and tabs) is replaced with a blank (ASCII 0x20) character.
• Any leading or trailing sequence of blanks is discarded.
• Any other sequence of blanks is replaced with a single blank character (ASCII 0x20).

All other characters (for example, the literals forming the value) are simply appended to the resulting normalized value.
Any entity reference found in the attribute value is recursively normalized. Of course, the normalization process applies only to the attributes defined outside of any CDATA section.

The XmlTextReader parser lets you toggle the normalization process on and off through the Normalization Boolean property. By default, the Normalization property is set to false, meaning that attribute values are not normalized. If the normalization process is disabled, an attribute can contain any character, including characters in the &#00; to &#20; range, which are normally considered invalid and not permitted. When normalization is on, using any of those character entities results in an XmlException being thrown.
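The character range check tied to the Normalization property can be tried out with a short program. This is only a sketch under stated assumptions: the item element, its name attribute, the ReadName helper, and the NormalizationDemo class are all hypothetical, and the program prints whatever the reader reports rather than asserting a particular result.

```csharp
using System;
using System.IO;
using System.Xml;

class NormalizationDemo
{
    // Reads the hypothetical "name" attribute with the given Normalization setting.
    static void ReadName(string xml, bool normalize)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.Normalization = normalize;
        try
        {
            reader.Read();   // moves to the <item> element
            string value = reader.GetAttribute("name");
            Console.WriteLine("Normalization={0}: {1} chars", normalize, value.Length);
        }
        catch (XmlException e)
        {
            // Reported when the reader rejects the character reference
            Console.WriteLine("Normalization={0}: {1}", normalize, e.Message);
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main()
    {
        // &#1; falls in the range that the text describes as normally invalid.
        string xml = "<item name=\"a&#1;b\" />";
        ReadName(xml, false);
        ReadName(xml, true);
    }
}
```

If the behavior described above holds, the first call should print the length of the attribute value and the second should print the message of an XmlException.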


Consider the following attribute value, in which the character reference &#10; denotes a linefeed character:

    AuthorDisplayName="Dino&#10;Esposito"

Let's try to read the AuthorDisplayName attribute using the XmlTextReader parser when the normalization is off. The following code shows how:

    reader.Normalization = false;
    reader.Read();
    Console.WriteLine(reader["AuthorDisplayName"]);

In the resulting string, the linefeed is preserved, and the output in the console window looks like this:

    Dino
    Esposito

Conversely, if you read the attribute when Normalization is set to true, the linefeed is replaced with a blank, and the output looks like this:

    Dino Esposito

Handling XML Exceptions

The XML reader throws an exception whenever it encounters a parsing error in the XML source. The reader makes use of the XmlException class to return detailed information about the last parsing error. Ad hoc information includes the line number, the character position, and a text description.
LinePosition and LineNumber, shown here, are the members that differentiate the XmlException class from the basic .NET Exception class:

    public class XmlException : SystemException
    {
        public int LinePosition { get; }
        public int LineNumber { get; }
    }

Although you can still catch XML parsing and validation exceptions through the basic Exception class, catching them through XmlException gives you more information and the certainty that the error relates only to the code handling XML data.

Note
If you have multiple XML documents in a single stream to parse in sequence, you can still use the same instance of the reader. However, prior to attacking a new stream, you must reset the internal state of the reader. The XmlTextReader class specifically defines a method, named ResetState, that simply resets the state of the reader to ReadState.Initial.

ResetState resets all the properties to their default values, with a few exceptions. Normalization, XmlResolver, and WhitespaceHandling are not affected by the state reset.

Handling White Spaces

In XML, white spaces are a special type of node. White spaces found in the body of an XML document can be classified in two groups: significant and insignificant. A white space is said to be significant when it appears in the text of an element node or when it appears to be within the scope of a white space declaration, as shown here:

    ⋮

Significant white spaces can't be removed from the document without affecting to some extent the validity and the contents of the document. An insignificant white space, on the other hand, is any white space that you do not need to preserve after reading the source document. White space is a blanket term that encompasses more than one character and does not refer only to blanks (ASCII 0x20). White spaces are also carriage returns (ASCII 0x0D), linefeeds (ASCII 0x0A), and tabs (ASCII 0x09).

The XmlTextReader class lets you control how white spaces are handled by using the property WhitespaceHandling. This property accepts and returns a value taken from the WhitespaceHandling enumeration, which lists three feasible options. The default option is All and indicates that both significant and insignificant spaces will be returned as distinct nodes: SignificantWhitespace and Whitespace, respectively. The None option indicates that no white space at all will be returned as a node. The third option, Significant, discards all insignificant white spaces and returns only nodes of type SignificantWhitespace. Interestingly, the WhitespaceHandling property is one of the few reader properties that can be changed at any time and will take effect immediately on the next read operation.

Resolving Entities

In XML, an entity is a named placeholder for some content or markup text. Entities can be declared both in-line and within a DTD or a schema.
The declaration syntax is shown here:

    <!ENTITY name "replacement text">

The following statement declares an entity named author that is associated with the contents "Dino Esposito":

    <!ENTITY author "Dino Esposito">

When it is declared in-line, the entity must be part of an all-encompassing <!DOCTYPE> node.

Once declared, entities are then used within the body of the XML document in place of their bound content. An entity can appear only within the scope of Element, Attribute, or EntityReference nodes. When used in an XML source, an entity is called an entity reference, and the parser connects to it through an EntityReference node. The following example shows how to use an entity in XML code:

    Microsoft Press
    &author;
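To make the EntityReference node concrete, here is a minimal sketch that declares the author entity in-line and lists the nodes a text reader reports. The sample document and the names EntityReferenceDemo, SampleXml, and ListNodes are illustrative assumptions, not part of the book's sample files. With the reader's default settings, the reference should surface as a node of type EntityReference named author rather than as expanded text.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class EntityReferenceDemo
{
    // Hypothetical sample: the DOCTYPE internal subset declares the
    // "author" entity, and the document body refers to it as &author;.
    public const string SampleXml =
        "<!DOCTYPE book [<!ENTITY author \"Dino Esposito\">]>" +
        "<book><author>&author;</author></book>";

    // Returns "NodeType:Name" for every node the reader reports.
    public static List<string> ListNodes(string xml)
    {
        var nodes = new List<string>();
        var reader = new XmlTextReader(new StringReader(xml));
        try
        {
            while (reader.Read())
                nodes.Add(reader.NodeType + ":" + reader.Name);
        }
        finally
        {
            reader.Close();
        }
        return nodes;
    }

    public static void Main()
    {
        foreach (string node in ListNodes(SampleXml))
            Console.WriteLine(node);
    }
}
```

The sketch only observes the node types; it does not try to expand the entity.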


An entity reference consists of the entity name bracketed by an ampersand (&) and a semicolon (;). Not all parsers automatically expand entities upon document loading. When the XmlTextReader class encounters an entity reference, it reports a node of type EntityReference whose Value property is set to the empty string. By design, the XmlTextReader parser can't resolve entities, although it exposes a ResolveEntity method. Calling this method always throws an exception. You must use XmlValidatingReader to have entities properly expanded. (We'll cover validating readers and validation schemas in Chapter 3.)

Resolving External References

In the .NET Framework, external XML resources identified by a URI are resolved through classes derived from the abstract class XmlResolver. Typical external resources are entities and DTDs; however, the XmlResolver class can also successfully process include and import elements for both XSD schemas and XSL style sheets.

The .NET Framework provides only one concrete resolver class built atop XmlResolver: XmlUrlResolver. Programmers can design and implement custom resolvers, however, either by inheriting from the XmlUrlResolver class or completely from scratch by overriding the methods and properties of XmlResolver. Let's take a look at the key aspects, and the main tasks, of a resolver.

The activity of an XML resolver revolves around two methods: GetEntity and ResolveUri. The former takes the specified URI and returns the Stream object that represents the desired contents. How the method actually manages to resolve the URI is implementation-specific.
GetEntity, however, expects an absolute URI. What if the URI read from the XML document is relative? Before calling GetEntity, you must call ResolveUri, passing both the relative URI and any base URI. ResolveUri is responsible for combining these URIs into an absolute URI.

Another problem a resolver must be ready to face arises when the resource referenced by the URI is protected and available only to authenticated users. In this case, the resolver must be passed valid credentials to carry out the task. Credentials are represented by an instance of the NetworkCredential class.

The NetworkCredential class can be used to support a variety of authentication schemes that make use of passwords. Among others, the list of authentication schemes includes basic and digest authentication and Kerberos. The class does not support other types of authentication such as those based on a public key. You provide the credentials to the resolver through the XmlResolver.Credentials property, as shown here:

    XmlUrlResolver resolver = new XmlUrlResolver();
    NetworkCredential cred = new NetworkCredential(user, pswd);
    resolver.Credentials = cred;
    reader.XmlResolver = resolver;

You can also use the CredentialCache class to bind the resolver in a single shot to a collection of URI/credential pairs, as shown in the following code. The collection will then be scanned for a matching URI each time the resolver is called to action.

    CredentialCache credCache = new CredentialCache();
    credCache.Add(new Uri(url1), "Basic", cred);
    credCache.Add(new Uri(url2), "Digest", cred);
    resolver.Credentials = credCache;
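The ResolveUri step described earlier in this section can be sketched as follows. The URLs are placeholders and the names ResolveUriDemo and Combine are assumptions made for illustration. The sketch only combines a base URI with a relative one and does not download anything.

```csharp
using System;
using System.Xml;

public static class ResolveUriDemo
{
    // Combines a base URI and a relative URI into an absolute URI,
    // which is the form GetEntity expects to receive.
    public static Uri Combine(string baseUri, string relativeUri)
    {
        var resolver = new XmlUrlResolver();
        return resolver.ResolveUri(new Uri(baseUri), relativeUri);
    }

    public static void Main()
    {
        // Hypothetical location of a document and of the DTD it references.
        Uri absolute = Combine("http://example.com/data/books.xml", "books.dtd");
        Console.WriteLine(absolute.AbsoluteUri);
    }
}
```

The resulting absolute URI is what you would then hand to GetEntity, together with any credentials set on the resolver.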


If credentials are needed but not provided, the resolver makes an attempt using default credentials, available from the CredentialCache.DefaultCredentials property. If the default credentials still don't provide access, the resolve attempt will fail. Default credentials represent the system credentials for the application security context, that is, the credentials of the logged-in user or the user being impersonated.

Reading Large Streams

The XmlTextReader class provides a few methods (ReadChars, ReadBinHex, and ReadBase64) tailored to read chunks of data out of a large stream of embedded text. These methods share almost the same prototype and overall logic, but differ in how they preprocess and return the fetched data:

    public int ReadChars (char[] array, int offset, int len);
    public int ReadBinHex(byte[] array, int offset, int len);
    public int ReadBase64(byte[] array, int offset, int len);

All three methods can be used only to read the text associated with an Element node. If you use any of them with nodes of other types, the method will fail. The read methods let you fetch up to the specified number of characters or bytes (len argument) from the content of the current element. The fetched data is placed in the array argument, starting at the given position within that array (offset argument). The return value indicates the number of characters or bytes actually read. This number equals len if the call was successful. The return value could be less than len if the stream is close to its end, however. Anomalous situations are identified through exceptions.

So what's the difference between these three methods? As their names imply, they differ in their decoding capabilities.
The ReadBinHex method decodes BinHex content, whereas ReadBase64 returns Base64-decoded binary bytes. The ReadChars method, on the other hand, reads the text as it is.

There are a few minor issues regarding the use of these methods. They do not perform any XML-specific tasks such as validating, resolving entities, or normalizing attribute values. While you're in the process of reading node content using a stream-based method, you can't read any attributes.

ReadChars, ReadBinHex, and ReadBase64 always return everything found between the start tag and the end tag of the element node they are working on. If the embedded text includes any markup (for example, a mixed-content node), that is returned as well, just as if you were reading a binary or a text file from a disk.

Note
The full source code for an application demonstrating incremental access to XML files is available in this book's sample files. The application name is IncrementalRead.

Note
Earlier in this chapter, you learned how to use a single instance of an XmlTextReader reader to process multiple XML streams. In that case, the key was using the ResetState method to reinitialize the reader's internal state. If needed, however, you can also do the reverse, that is, use different readers (for example, a text reader and a validating reader) to process distinct pieces of a single XML stream.
The method that makes this possible is GetRemainder, which returns the remainder of the buffered XML stream. GetRemainder scans and returns the portion of the buffer that has not yet been processed. The buffer is returned as a generic TextReader object.
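As an illustration of the chunked reading described in this section, the following sketch pulls the content of an element through ReadChars using a small buffer. The sample document, the buffer size, and the names ChunkedReadDemo and ReadInChunks are assumptions chosen for the example. It is not the IncrementalRead sample mentioned in the note.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ChunkedReadDemo
{
    // Reads the content of the document's root element in chunks of at
    // most chunkSize characters and returns the reassembled text.
    public static string ReadInChunks(string xml, int chunkSize)
    {
        var reader = new XmlTextReader(new StringReader(xml));
        try
        {
            // ReadChars works only when the reader sits on an element.
            reader.MoveToContent();

            char[] buffer = new char[chunkSize];
            var text = new StringBuilder();
            int read;

            // offset 0 is the position in the buffer where writing starts;
            // a return value of 0 means there is nothing left to read.
            while ((read = reader.ReadChars(buffer, 0, buffer.Length)) > 0)
                text.Append(buffer, 0, read);

            return text.ToString();
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReadInChunks("<data>0123456789</data>", 4));
    }
}
```

In a real application the chunks would typically be written to a file or another stream as they arrive instead of being accumulated in memory.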


The NameTable Object

One of the secrets behind the XML readers' great performance is the NameTable class, a helper class that works as a quickly accessible table of string objects. Several .NET classes, including, but not limited to, XmlDocument and XmlTextReader, make use internally of a NameTable object. User applications too can use a NameTable object to store potentially duplicated strings more efficiently. When stored in a name table, a string is said to be an atomized string.

The net effect of atomized strings is that XML readers can manage elements and attributes as references rather than values and can therefore function more effectively, especially in terms of memory footprint and speed of comparison. Comparing two object references is much faster than comparing all the characters that form a string.

The NameTable class, which inherits from the abstract class XmlNameTable, has a relatively simple programming interface and provides methods to add new items and to read them back. You add a new item to a name table using the Add method.

    NameTable table = new NameTable();
    string name = table.Add("Author");

You get the atomized string with the specified value from the table using the Get method.

    string name = table.Get("Author");

XML reader classes make internal use of name tables. The reader's name table can be accessed through the NameTable property.
The reader's name table contains an atom (a reference to the string object) for each distinct element or attribute name, completed with namespace information for uniqueness. If the XML document being processed contains, say, 1000 nodes with the same name, only one atomized entry will be created in the name table. Don't mistake the NameTable object for a worker table in which the reader stores all the document's nodes. Instead, the NameTable object is just a worker collection of unique names stored in a way that allows for more effective storage, retrieval, and comparison.

The NameTable object is internally implemented using an array of structures that mimics a hash table. Like a hash table, the array manages strings using hash codes. So when a new string is added to the table, a new hash code is generated and compared to the others existing in the array. If a matching string already exists in the table, a reference to the existing atom is returned; otherwise, a new entry is created and the corresponding reference (atom) is returned.
In case of overflow, the size of the array is doubled.

The NameTable object uses a homemade hash table rather than the official .NET Hashtable object because the Hashtable object is not as simple and compact as required in this context.

When creating a new instance of the XmlTextReader class, you can also indicate the specific NameTable object to use.

Designing a SAX Parser with .NET Tools

As mentioned in Chapter 1, significant differences exist between .NET XML readers (a kind of cursor-like parser) and Simple API for XML (SAX) parsers. All of these differences can be traced, directly or indirectly, to the differences existing between the push model, which is typical of SAX, and the pull model on which readers are based.

A SAX parser takes full control over the parsing process, extracts any predefined piece of XML code, duplicates it into local buffers, and finally pushes that data down to the calling application. The interaction between the parser and the application takes place through application-defined classes that, in turn, implement SAX-defined interfaces.


With SAX, the client application receives any data the parser is designed to push and can discard it if that result is of no interest. The data is always sent, however. The application has to build fairly sophisticated code to isolate the pieces of information it really needs (that is, the nodes of interest) and, more importantly, to add them to a custom data structure that represents the state.

XML readers follow the pull model, in which the parser is just one tool managed and governed by the caller application. This model allows for more selective processing (the application just skips over unneeded data) and even for an optimized interaction. In fact, the application puts data of interest directly in its final buffers rather than having the parser create and pass on temporary buffers.

The main advantage of SAX over XMLDOM, that is, the ability to visit XML data in a fast, forward-only, read-only way, is still the key feature of .NET XML readers. For this reason, you will not find any support for SAX in the .NET Framework, and frankly, the .NET XML infrastructure clearly works as a superset of SAX. However, if you still feel some nostalgia for the SAX model, consider that the pull model is flexible enough to let you build a push model on top of it.
Let's see how.

Applications interact with a SAX parser by writing and registering their own handlers, as shown here:

    Set saxParser.contentHandler = myCntHandler
    ' *** Set other handlers
    saxParser.parseURL(file)

In Visual Basic .NET, you create a new .NET class named SaxParser:

    Imports System.Xml

    Public Class SaxParser
        Public ContentHandler As SaxContentHandler

        Public Sub Parse(ByVal file As String)
            Dim reader As XmlTextReader = New XmlTextReader(file)
            ' Push every node found by the reader to the handler
            While reader.Read()
                ContentHandler.Process(reader.Name, reader.Value, _
                    reader.NodeType)
            End While
            reader.Close()
        End Sub
    End Class

The SaxParser class has a public member named ContentHandler that refers to a user-defined object in charge of processing the found nodes. The Parse method parses the content of the XML document using a reader, and whenever a new node is found, the method calls the content handler. The content handler class has a fixed interface represented by the following abstract class:

    Public MustInherit Class SaxContentHandler
        Public MustOverride Sub Process( _
            ByVal name As String, _
            ByVal value As String, _
            ByVal type As XmlNodeType)
    End Class

After the two classes have been compiled into an assembly, a client SAX application can simply reference and instantiate the parser and the content handler class. The world's simplest content handler class is shown here:

    Public Class MyContentHandler
        Inherits SaxContentHandler

        Public Overrides Sub Process( _
            ByVal name As String, _
            ByVal value As String, _
            ByVal type As XmlNodeType)
            ' Show only the names of element nodes
            If type = XmlNodeType.Element Then
                MsgBox(name)
            End If
        End Sub
    End Class

The SAX application initializes the parser as follows:

    Dim saxParser As New SaxParser()
    Dim myHandler As New MyContentHandler()
    saxParser.ContentHandler = myHandler
    saxParser.Parse(file)

Of course, the parser discussed here is fairly minimal, but the design guidelines are concrete and effective. As an aside, consider the fact that in the client application, the content handler class and the form are different classes, which makes updating the user interface from the content handler class a bit complicated.

Note
The full source code discussed here is provided in this book's sample files. The application is named SaxParser.

Parsing XML Fragments

The XmlTextReader class provides the basic set of functionalities to process any XML data coming from a disk file, a stream, or a URL. This kind of reader works sequentially, reading one node after the next, and deliberately does not provide any ad hoc search function to parse only a particular subtree.

In the .NET Framework, to process only fragments of XML data, excerpted from a variety of sources, you can take one of two routes.
You can initialize the text reader with the XML string that represents the fragment, or you can use another, more specific, reader class: the XmlNodeReader class.

The XmlNodeReader class works on the subtree rooted in the XmlNode object passed to the class constructor. A live instance of an XmlNode object is not something you can obtain through a text reader, however. Only the .NET XML DOM parser can create and return an XmlNode object. We'll examine the details of the XmlNodeReader class in Chapter 5, along with the .NET XML DOM parser.


If you have ever used Microsoft XML Core Services (MSXML), the Microsoft COM XML parser, you have probably noticed that it allows you to initialize the parser from a well-formed XML string. However, the long list of constructors that the XmlTextReader class boasts gives no clear indication that that same MSXML feature is also supplied by the .NET Framework. In this section, you'll learn how to parse XML data stored in a memory string. First I'll show you how to work with plain strings with no context information, and then I'll show you how to process XML fragments using specific context information for the parser, such as namespaces and document type declarations.

Parsing Well-Formed XML Strings

The trick to initializing a text reader from a string is all in packing the string into a StringReader object. One of the XmlTextReader constructors looks like this:

    public XmlTextReader(TextReader);

TextReader is an abstract class that represents a .NET reader object capable of reading a sequence of characters no matter where they are physically stored. The StringReader class inherits from TextReader and simply makes itself capable of reading the characters of an in-memory string.
Because StringReader derives from TextReader, you can safely use it to initialize XmlTextReader.

    string xmlText = "…";
    StringReader strReader = new StringReader(xmlText);
    XmlTextReader reader = new XmlTextReader(strReader);

The net effect of this code snippet is that the XML code stored in the xmlText variable is parsed just as if it were read from a disk file or an open stream or downloaded from a URL.

Important
Any class based on TextReader is inherently not thread-safe. Among other things, this means that the string object you are using to contain parsable XML data might be concurrently accessed from other threads. Of course, this happens only under special conditions, but it is definitely a plausible scenario. If you have a multithreaded application and the string itself happens to be globally visible throughout the application, one thread could break the well-formedness of the string while another thread is parsing it. To avoid this situation, create a thread-safe wrapper for the StringReader class using the TextReader class's static member Synchronized, and pass the wrapper to the XML reader, as shown here:

    string xmlText = "…";
    StringReader sr = new StringReader(xmlText);
    // Wrap the reader first, then parse through the thread-safe wrapper
    TextReader strReader = TextReader.Synchronized(sr);
    XmlTextReader reader = new XmlTextReader(strReader);

For performance reasons, you should use the thread-safe wrapper class only when strictly necessary. Even better, wherever possible, you should design your code to avoid the need for thread-safe classes.


Fragments and Parser Context

The context for an XML parser consists of all the information that can be used to customize the way in which the parser works. Context information includes the encoding character set, the DTD information needed to set all the default attributes and to expand entities, the namespaces, the language, and the white space handling.

If you specify the XML fragment using a StringReader object, as shown in the previous section, all elements of the parser context are set with default values. The parser context is fully defined by the XmlParserContext class. When instantiating an XmlTextReader class to operate on a string, you use the following constructor and specify a parser context:

    public XmlTextReader(
        string xmlFragment,
        XmlNodeType fragType,
        XmlParserContext context
    );

The xmlFragment parameter contains the XML string to parse. The fragType argument, on the other hand, represents the type of fragment. It specifies the type of the node at the root of the fragment. Only Element, Attribute, and Document nodes are permitted.

The XmlParserContext constructor has a few overloads.
The one with the shortest list of arguments, shown here, is probably the overload you will use most often:

    public XmlParserContext(
        XmlNameTable nt,
        XmlNamespaceManager nsMgr,
        string xmlLang,
        XmlSpace xmlSpace
    );

Creating a new parser context is as easy as running the following statements:

    NameTable table = new NameTable();
    table.Add("Author");
    XmlNamespaceManager mgr = new XmlNamespaceManager(table);
    mgr.AddNamespace("company", "urn:ThisIsMyBook");
    XmlParserContext context;
    context = new XmlParserContext(table, mgr, "en-US", XmlSpace.None);

The first parameter to this XmlParserContext constructor is a NameTable object. The name table is used to look up prefixes and namespaces as atomized strings. For performance reasons, you also need to pass a NameTable object (which inherits from the abstract XmlNameTable class) when creating a new instance of a namespace manager class.

Note
If the namespace manager and the parser context happen to use different NameTable objects, the XmlParserContext might not be able to recognize the namespaces brought in by the manager, resulting in an XML exception.
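Putting the pieces together, the following sketch parses a fragment whose company prefix is never declared in the fragment itself and relies on the parser context to supply it. The fragment text and the names FragmentDemo and NamespaceOfFirstElement are assumptions made for illustration; the prefix and namespace are the ones used in the statements above.

```csharp
using System;
using System.Xml;

public static class FragmentDemo
{
    // Parses an element fragment and returns the namespace URI the
    // reader associates with its first element.
    public static string NamespaceOfFirstElement(string fragment)
    {
        // The same NameTable is shared by the manager and the context.
        var table = new NameTable();
        var mgr = new XmlNamespaceManager(table);
        mgr.AddNamespace("company", "urn:ThisIsMyBook");
        var context = new XmlParserContext(table, mgr, "en-US", XmlSpace.None);

        var reader = new XmlTextReader(fragment, XmlNodeType.Element, context);
        try
        {
            reader.MoveToContent();   // move to the first element
            return reader.NamespaceURI;
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main()
    {
        // Hypothetical fragment that uses, but does not declare, the prefix.
        string fragment = "<company:author>Dino Esposito</company:author>";
        Console.WriteLine(NamespaceOfFirstElement(fragment));
    }
}
```

Without the namespace manager entry, the undeclared prefix would make the fragment unparsable by a namespace-aware reader.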


The second parameter to the XmlParserContext constructor is an XmlNamespaceManager object. The XmlNamespaceManager class is a type of collection class designed to contain and manage namespace information. It provides methods to add, remove, and search for namespaces. Namespaces are stored with their prefix and URI, which are passed to it through the AddNamespace method. If the prefix is an empty string, the namespace is considered to be the default.

The XmlParserContext class makes use of a namespace manager to collect all the namespaces that the fragment might use. A fragment is simply a small piece of XML code and, as such, is not expected to contain all namespace definitions that its nodes and attributes might use.

When a namespace manager is created, the class constructor automatically adds a couple of frequently used prefixes. These prefixes are listed in Table 2-5.

Table 2-5: Standard Namespace Prefixes Added to XmlNamespaceManager

    Prefix    Corresponding Namespace
    xmlns     http://www.w3.org/2000/xmlns/
    xml       http://www.w3.org/XML/1998/namespace

A third namespace prefix that is allowed is the empty string, which of course has no corresponding namespace URI. Thanks to this contrivance, you don't need to create a namespace manager instance to parse XML fragments unless nodes and attributes really contain custom namespaces. Added namespaces are not verified as conforming to the W3C Namespaces specification; the manager assumes that they already conform.

As mentioned in the section "The NameTable Object," on page 49, the namespace names are atomized and placed in the related NameTable object as soon as they are added to the collection.
When you call the XML reader's LookupNamespace method to search for the namespace that matches the specified prefix, the prefix string is atomized and added to the name table for faster subsequent lookups.

Any namespace declaration has a clear and well-defined scope. The namespace declaration can appear anywhere in the document, not just at the very beginning of it. The place in the source where the declaration appears determines the scope. A namespace controls all the XML elements rooted in the node in which it appears. In the following example, the namespace is applied to the node that declares it and to all of its descendants:

    ⋮
    Dino
    Esposito
    99
    ⋮

The namespace defined for that element does not apply to elements outside it. The namespace is effective from its point of declaration until the end of the element. After that, any other node not qualified with a namespace prefix is assumed to belong to whichever default namespace has been declared in the document.
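The scoping rule can be checked with a short sketch that calls LookupNamespace on every element of a small document. The document, the bk prefix, the urn:sample-books namespace, and the names NamespaceScopeDemo and LookupPerElement are all hypothetical. The prefix should resolve on the declaring element and its descendants and come back as null elsewhere.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NamespaceScopeDemo
{
    // Hypothetical document: the bk prefix is declared on <book> only.
    public const string SampleXml =
        "<books>" +
        "<book xmlns:bk=\"urn:sample-books\"><bk:title/></book>" +
        "<magazine/>" +
        "</books>";

    // Maps each element name to the namespace the prefix resolves to
    // at that point in the document (null when it is out of scope).
    public static Dictionary<string, string> LookupPerElement(string xml, string prefix)
    {
        var result = new Dictionary<string, string>();
        var reader = new XmlTextReader(new StringReader(xml));
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    result[reader.Name] = reader.LookupNamespace(prefix);
            }
        }
        finally
        {
            reader.Close();
        }
        return result;
    }

    public static void Main()
    {
        foreach (KeyValuePair<string, string> pair in LookupPerElement(SampleXml, "bk"))
            Console.WriteLine(pair.Key + " -> " + (pair.Value ?? "(not in scope)"));
    }
}
```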


You can specify other settings for the parser context using the properties of the XmlParserContext class, including Encoding, BaseURI, and DocTypeName. In particular, BaseURI is especially useful because it indicates the location from which the fragment was loaded.

Writing a Custom XML Reader

We have one more topic to consider on the subject of XML readers, which opens up a whole new world of opportunities: creating customized XML readers. An XML reader class is merely a programming interface for reading data that appears to be XML. The XmlTextReader class represents the simplest and the fastest of all possible XML readers but (and this is what really matters) it is just one reader. Its inherent simplicity and effectiveness stem from two key points. First, the class operates as a read-only, forward-only, nonvalidating parser. Second, the class is assumed to work on native XML data. It has no need, and no subsequent overhead, to map input data internally to XML data structures.

Virtually any data can be read, traversed, and queried as XML as long as a tailor-made piece of code takes care of mapping that data to an XML Schema. This mapping code can then be buried in a method that simply returns one of the standard reader objects or creates a custom XML reader class.

Note
What's the advantage of exposing data through XML?
<strong>XML</strong> providesa k<strong>in</strong>d of universal model <strong>for</strong> def<strong>in</strong><strong>in</strong>g a set of <strong>in</strong><strong>for</strong>mation (<strong>in</strong>foset),the type and layout of constituent items (<strong>XML</strong> Schema), and thequery commands (XPath). In the .<strong>NET</strong> Framework, <strong>XML</strong> readersprovide an effective way to deal with hierarchical, <strong>XML</strong>-shaped data.Because <strong>XML</strong> is just a metalanguage used to describe <strong>in</strong><strong>for</strong>mation,and not a data repository itself, the key difference between standard<strong>XML</strong> readers and custom <strong>XML</strong> readers is <strong>in</strong> the location and themodality of <strong>in</strong>tervention of the code that exposes data as <strong>XML</strong>. Suchcode is not part of the basic .<strong>NET</strong> <strong>XML</strong> reader classes butconstitutes the core of custom <strong>XML</strong> readers.Mapp<strong>in</strong>g Data Structures to <strong>XML</strong> NodesFor a long time, INI files have been a fundamental part of <strong>Microsoft</strong> W<strong>in</strong>dowsapplications. Although with the advent of <strong>Microsoft</strong> W<strong>in</strong>32 they were officially declaredobsolete, a lot of applications have not yet stopped us<strong>in</strong>g them. Understand<strong>in</strong>g thereasons <strong>for</strong> this persistence is not of much importance here, but when they weredesign<strong>in</strong>g the .<strong>NET</strong> Framework, the <strong>Microsoft</strong> architects decided not to <strong>in</strong>sert anymanaged classes to handle INI files. 
Although overall I agree with their decision, keep<strong>in</strong> m<strong>in</strong>d that if you need to access INI files from with<strong>in</strong> a .<strong>NET</strong> Framework application,you'll f<strong>in</strong>d at your disposal only workarounds, not a direct solution.You could, <strong>for</strong> <strong>in</strong>stance, read and write the content of an INI file us<strong>in</strong>g file and I/Oclasses, or you might resort to mak<strong>in</strong>g calls to the underly<strong>in</strong>g W<strong>in</strong>32 unmanagedplat<strong>for</strong>m. Recently, however, I came across a rather illum<strong>in</strong>at<strong>in</strong>g MSDN article <strong>in</strong> whichan even better approach is discussed. (See the section "Further Read<strong>in</strong>g," on page 74,<strong>for</strong> details and the URL.) The idea is this: Why not wrap the contents of INI files <strong>in</strong>to an<strong>XML</strong> reader? INI files are not well-<strong>for</strong>med <strong>XML</strong> files, but a custom reader could easilymap the contents of an INI file's sections and entries to <strong>XML</strong> nodes and attributes.45
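To make the idea concrete, here is a minimal sketch of the mapping itself, not of a full reader and not the approach of the MSDN article. It assumes that each INI section becomes an element named after the section and that each key=value entry becomes an attribute of that element. The IniToXmlSketch class, the Convert method, and the ini root element are hypothetical names chosen for illustration, and the sketch assumes that section and key names are valid XML names and that keys are unique within a section.

```csharp
using System;
using System.IO;
using System.Xml;

public static class IniToXmlSketch
{
    // Converts INI text to an XML string: one element per section,
    // one attribute per key=value entry (illustrative mapping only).
    public static string Convert(string iniText)
    {
        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);
        writer.WriteStartElement("ini");            // hypothetical root name

        bool sectionOpen = false;
        StringReader input = new StringReader(iniText);
        string line;
        while ((line = input.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0 || line.StartsWith(";"))
                continue;                           // skip blank lines and comments

            if (line.StartsWith("[") && line.EndsWith("]"))
            {
                if (sectionOpen)
                    writer.WriteEndElement();       // close the previous section
                writer.WriteStartElement(line.Substring(1, line.Length - 2));
                sectionOpen = true;
            }
            else
            {
                int eq = line.IndexOf('=');
                if (eq > 0 && sectionOpen)          // entries outside a section are ignored
                    writer.WriteAttributeString(
                        line.Substring(0, eq).Trim(),
                        line.Substring(eq + 1).Trim());
            }
        }

        if (sectionOpen)
            writer.WriteEndElement();
        writer.WriteEndElement();                   // close the root element
        writer.Close();
        return output.ToString();
    }

    public static void Main()
    {
        string ini = "[window]\nwidth=800\nheight=600\n; a comment\n[user]\nname=Dino\n";
        Console.WriteLine(Convert(ini));
    }
}
```

A custom reader would expose the same sections and entries node by node instead of building a string, but the mapping decision (sections to elements, entries to attributes) is the same one such a reader has to make.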


In the next few sections of this chapter, you'll learn how to build a custom XML reader working on top of comma-delimited CSV files.

Mapping CSV Files to XML

A CSV file consists of one or more lines of text. Each line contains strings of text separated by commas. Each line of a CSV file can be naturally associated with a database row in which each token maps to a column. Likewise, a line in a CSV file can also be correlated to an XML node with as many attributes as the comma-separated tokens. The following code shows a typical CSV file:

    Davolio,Nancy,Sales Representative
    Fuller,Andrew,Sales Manager
    Leverling,Janet,Sales Representative

A good XML representation of this structure is shown here:

Each row in the CSV file becomes a node in the XML representation, while each token is represented by a node attribute. In this case, the XML schema is ever-changing because the number of attributes varies with the number of commas in the CSV file. The number of total columns can be stored as an extra property. You can opt for an automatically generated sequence of attribute names such as col1, col2, and so on, or if the CSV file provides a header with column names, you can use those names. Of course, there is no way to know in advance, and in general, whether the first row has to be read as the first data row or just the header. A possible workaround is adding an extra property that tells the reader how to handle the first row.

Using the XML schema described so far, you can use the following pseudocode to read a given item of information in the second row:

    XmlCsvReader reader = new XmlCsvReader("employees.csv");
    reader.Read();
    reader.Read();
    Console.WriteLine(reader[1]);
    Console.WriteLine(reader["col2"]);

Another reasonable XML schema for a CSV file is shown here:

    Davolio Nancy Sales Representative


    Fuller Andrew Sales Manager
    Leverling Janet Sales Representative

Although more expressive, I find this format (an element normal form) to be a bit verbose, and more importantly, it would require more calls to Read or Skip methods to get to what you really need to know from CSV data: values.

Implementing a CSV-to-XML Reader

In this section, I'll take you through building a custom CSV-to-XML reader. A custom XML reader is built starting from the abstract XmlReader class, as shown in the following code. You override all abstract methods and properties and, if needed, add your own overloads and custom members.

    public class XmlCsvReader : XmlReader
    {
        ⋮
    }

The XmlCsvReader class we're going to build is the reader class that processes CSV files as XML documents. Given the structure of a CSV file, not all methods and properties defined by the abstract XML reader interface make sense. For example, a CSV file does not contain namespaces or entities. Likewise, it does not need a name table property. Aside from these few exceptions, a large part of the XmlReader class basic interface is preserved.

The key method for our custom reader is still Read, and Value is the principal property. We'll use a StreamReader object to access the file and move from line to line as the user calls Read. From an XML point of view, the structure of a CSV file is rather simple. It consists of just one level of nodes (the Depth property is always 0) and, consequently, there is no possibility for nested nodes. As you can imagine, this fact greatly simplifies the development and the internal logic of the reader.

Important: If you look at the full source code for the XmlCsvReader class, you'll notice that not all properties (see Table 2-1, on page 27) and methods (see Table 2-3, on page 30) defined for the XmlReader class are actually implemented or overridden. The reason is that although XmlReader is declared as an abstract class, not all methods and properties in the class are marked as abstract. Abstract methods and properties must be overridden in a derived class. Virtual methods and properties, on the other hand, can be overridden only if needed. Notice that abstract and virtual are C# and C++ specific keywords. In Visual Basic .NET, you declare an abstract class with the MustInherit keyword, an abstract method with MustOverride, and a virtual method with Overridable.

The Custom Reader's Constructors

The XmlCsvReader class comes with a couple of constructors: one takes the name of the file to open, and one, in addition to the file name, takes a Boolean value indicating whether the first line in the CSV file contains the titles of the columns, as shown here:

    LastName,FirstName,Title
    Davolio,Nancy,Sales Representative
    Fuller,Andrew,Sales Manager
    Leverling,Janet,Sales Representative

Both constructors reference an internal helper routine, InitializeClass, that takes care of any initialization steps.

    public XmlCsvReader(string filename)
    {
        InitializeClass(filename, false);
    }

    public XmlCsvReader(string filename, bool hasColumnHeaders)
    {
        InitializeClass(filename, hasColumnHeaders);
    }

    private void InitializeClass(string filename, bool hasColumnHeaders)
    {
        m_hasColumnHeaders = hasColumnHeaders;
        m_fileName = filename;
        m_fileStream = new StreamReader(filename);
        m_readState = ReadState.Initial;
        m_tokenValues = new NameValueCollection();
        m_currentAttributeIndex = -1;
        m_currentLine = "";
    }

In particular, the initialization routine creates a working instance of the StreamReader class and sets the internal state of the reader to the ReadState.Initial value. The CSV reader class needs a number of internal and protected members, as follows:

    StreamReader m_fileStream;          // Stream reader
    String m_fileName;                  // Name of the CSV file
    ReadState m_readState;              // Internal read state
    NameValueCollection m_tokenValues;  // Current element node
    String[] m_headerValues;            // Current headers for CSV tokens
    bool m_hasColumnHeaders;            // Indicates whether the CSV file has titles
    int m_currentAttributeIndex;        // Current attribute index
    string m_currentLine;               // Text of the current CSV line

The currently selected row is represented through a NameValueCollection structure, and the current attribute is identified by its ordinal and zero-based index. In addition, if the CSV file has a preliminary header row, the column names are stored in an array of strings.

The Read Method

The CSV reader implementation of the Read method lets you move through the various rows of data that form the CSV file. First the method checks whether the CSV file has headers. The structure of the CSV file does not change regardless of whether headers are present. It's the programmer who declares, using a constructor's argument, whether the reader must consider the first row as the header row or just a data row. If the header row is present, it must be read only the first time a read operation is performed on the CSV file, and only if the read state of the reader is set to Initial.

    public override bool Read()
    {
        // First read extracts headers if any
        if (m_readState == ReadState.Initial)
        {
            if (m_hasColumnHeaders)
            {
                string headerLine = m_fileStream.ReadLine();
                m_headerValues = headerLine.Split(',');
            }
        }

        // Read the new line and set the read state to interactive
        m_currentLine = m_fileStream.ReadLine();
        if (m_currentLine != null)
            m_readState = ReadState.Interactive;
        else
        {
            m_readState = ReadState.EndOfFile;
            return false;
        }

        // Populate the internal structure representing the current element
        m_tokenValues.Clear();
        String[] tokens = m_currentLine.Split(',');
        for (int i=0; i


    public override string Name
    {
        get
        {
            if (m_readState != ReadState.Interactive)
                return null;

            string buf = "";
            switch (NodeType)
            {
                case XmlNodeType.Attribute:
                    buf = m_tokenValues.Keys[m_currentAttributeIndex].ToString();
                    break;
                case XmlNodeType.Element:
                    buf = CsvRowName;
                    break;
            }
            return buf;
        }
    }

If the reader is not in interactive mode, all properties return null, including Name. If the current node type is an attribute, Name is the header name for the CSV token that corresponds to the attribute index. For example, if the reader is currently positioned on the second attribute, and the CSV has headers as shown previously, the name of the attribute is FirstName. Otherwise, if the node is an element, the name is a string that you can control through the extra CsvRowName property. By default, the property equals the word row.

The Value property is implemented according to a nearly identical logic. The only difference is in the returned text, which is the value of the currently selected attribute if the node is XmlNodeType.Attribute or the raw text of the currently selected CSV line if the node is an element.

    public override string Value
    {
        get
        {
            if (m_readState != ReadState.Interactive)
                return "";

            string buf = "";
            switch (NodeType)
            {
                case XmlNodeType.Attribute:
                    buf = this[m_currentAttributeIndex].ToString();
                    break;
                case XmlNodeType.Element:
                    buf = m_currentLine;
                    break;
            }
            return buf;
        }
    }

Who sets the node type? Actually, the node type is never explicitly set, but is instead retrieved from other data whenever needed. In particular, for this example, the index of the current attribute determines the type of the node. If the index is equal to -1, the node is an element simply because no attribute is currently selected. Otherwise, the node can only be an attribute.

    public override XmlNodeType NodeType
    {
        get
        {
            if (m_currentAttributeIndex == -1)
                return XmlNodeType.Element;
            else
                return XmlNodeType.Attribute;
        }
    }

The programming interface of an XML reader is quite general and abstract, so the actual implementation you provide (for example, for CSV files) is arbitrary to some extent, and several details can be changed at will. The NodeType property for a CSV file is an example of how customized the internal implementation can be. In fact, you return Element or Attribute based on logical conditions rather than the actual structure of the XML element read off disk.

Reading Attributes

Every piece of data in the CSV file is treated like an attribute. You access attributes using indexes or names. The methods in the XmlReader base interface that allow you to retrieve attribute values using a string name and a namespace URI are not implemented, simply because there is no notion of a namespace in a CSV file.

The following two function overrides demonstrate how to return the value of the currently selected attribute node by position as well as by name. The values of the current CSV row are stored as individual entries in the internal m_tokenValues collection.

    public override string this[int i]
    {
        get
        {
            return m_tokenValues[i].ToString();
        }
    }

    public override string this[string name]
    {
        get
        {
            return m_tokenValues[name].ToString();
        }
    }

The preceding code simply allows you to access an attribute using one of the following syntaxes:

    Console.WriteLine(reader[i]);
    Console.WriteLine(reader["col1"]);

You can also obtain the value of an attribute using one of the overloads of the GetAttribute method. The internal implementation for the CSV XML reader GetAttribute method is nearly identical to the this overrides.

Moving Through Attributes

When you call the Read method on the CSV XML reader, you move to the first available row of data. If the first row is managed as the header row, the first available row of data becomes the second row. The internal state of the reader is set to Interactive (meaning that it is ready to take commands) only after the first successful and content-effective reading.

Any single piece of information in the CSV file is treated as an attribute. In this way, the Read method can move you only from one row to the next. As with real XML data, when you want to access attributes, you must first select them. To move among attributes, you will not use the Read method; instead, you'll use a set of methods including MoveToFirstAttribute, MoveToNextAttribute, and MoveToElement.

The CSV XML reader implements attribute selection in a straightforward and effective way. Basically, the current attribute is tracked using a simple index that is set to -1 when no attribute is selected and to a zero-based value when an attribute has been selected. This index, stored in m_currentAttributeIndex, points to a particular entry in the collection of token values that represents each CSV row.

The CSV XML reader positions itself at the first attribute of the current row simply by setting the internal index to 0, as shown in the following code. It then moves to the next attribute by increasing the index by 1. In this case, though, you should also make sure that you're not specifying an index value that's out of range.

    public override bool MoveToFirstAttribute()
    {
        m_currentAttributeIndex = 0;
        return true;
    }

    public override bool MoveToNextAttribute()
    {
        if (m_readState != ReadState.Interactive)
            return false;

        if (m_currentAttributeIndex < m_tokenValues.Count-1)
            m_currentAttributeIndex ++;
        else
            return false;
        return true;
    }

You can also move to a particular attribute by index, and you can reset the attribute index to -1 to reposition the internal pointer on the parent element node.

    public override void MoveToAttribute(int i)
    {
        if (m_readState != ReadState.Interactive)
            return;
        m_currentAttributeIndex = i;
    }

    public override bool MoveToElement()
    {
        if (m_readState != ReadState.Interactive)
            return false;
        m_currentAttributeIndex = -1;
        return true;
    }

A bit trickier code is required if you just want to move to a particular attribute by name. The function providing this feature is an overload of the MoveToAttribute method.

    public override bool MoveToAttribute(string name)
    {
        if (m_readState != ReadState.Interactive)
            return false;

        for(int i=0; i


        }
        return false;
    }

The name of the attribute (determined by a header row or set by default) is stored as the key of the m_tokenValues named collection. Unfortunately, the NameValueCollection class does not provide for search capabilities, so the only way to determine the ordinal position of a given key is by enumerating all the keys, tracking the index position, until you find the key that matches the specified name.

As you've probably noticed, almost all the methods and properties in the CSV reader begin with a piece of code that simply returns if the reader's state is not Interactive. This is a specification requirement that basically dictates that an XML reader can accept commands only after it has been correctly initialized.

Exposing Data as XML

In a true XML reader, methods like ReadInnerXml and ReadOuterXml serve the purpose of returning the XML source code embedded in, or sitting around, the currently selected node. For a CSV reader, of course, there is no XML source code to return. You might want to return an XML description of the current CSV node, however.

Assuming that this is how you want the CSV reader to work, the ReadInnerXml method for a CSV XML reader can only return either null or the empty string, as shown in the following code. By design, in fact, each element has an empty body.

    public override string ReadInnerXml()
    {
        if (m_readState != ReadState.Interactive)
            return null;
        return String.Empty;
    }

In contrast, the outer XML text for a CSV node can be designed like a node with a sequence of attributes, as follows:

The source code to obtain this output is shown here:

    public override string ReadOuterXml()
    {
        if (m_readState != ReadState.Interactive)
            return null;

        StringBuilder sb = new StringBuilder("");
        sb.Append("


            sb.Append("=");
            sb.Append(QuoteChar);
            sb.Append(m_tokenValues[o.ToString()].ToString());
            sb.Append(QuoteChar);
            sb.Append("");
        }
        sb.Append("/>");
        return sb.ToString();
    }

The CSV XML Reader in Action

In this section, you'll see the CSV XML reader in action and learn how to instantiate and use it in the context of a realistic application. In particular, I'll show you how to load the contents of a CSV file into a DataTable object to appear in a Windows Forms DataGrid control. Figure 2-1 shows the application in action.

Figure 2-1: The CSV XML reader shows all the rows of a CSV file.

You start by instantiating the reader object, passing the name of the CSV file to be processed and a Boolean flag. The Boolean value indicates whether the values in the first row of the CSV source file must be read as the column names or as data. If you pass false, the row is considered a plain data row and each column name is formed by a prefix and a progressive number. You control the prefix through the CsvColumnPrefix property.

    // Instantiate the reader on a CSV file
    XmlCsvReader reader;
    reader = new XmlCsvReader("employees.csv", hasHeader.Checked);
    reader.CsvColumnPrefix = colPrefix.Text;
    reader.Read();

    // Define the target table
    DataTable dt = new DataTable();
    for(int i=0; i


        dt.Columns.Add(col);
    }
    reader.MoveToElement();

Before you load data rows into the table and populate the data grid, you must define the layout of the target DataTable object. To do that, you must scroll the attributes of one row, typically the first row. You move to each of the attributes in the first row and create a DataColumn object with the same name as the attribute and specified as a string type. You then add the DataColumn object to the DataTable object and continue until you've added all the attributes. The MoveToElement call restores the focus to the CSV row element.

    // Loop through the rows and populate a DataTable
    do
    {
        DataRow row = dt.NewRow();
        for(int i=0; i


Caution: I tried to keep this version of the CSV reader as simple as possible, which is always a good guideline. In this case, however, I went beyond my original intention and came up with a reader that is too simple! Don't be fooled by the fact that the sample code discussed here works just fine. As I built it, the CSV reader does not expose the CSV document as a well-formed XML document, but rather as a well-formed XML fragment. There is no root node, and no clear distinction is made between start and end element tags. In addition, the ReadAttributeValue method is not supported. As a result, if you use ReadXml to load the CSV into a DataSet object, only the first row would be loaded. If you run the CsvReader sample included in this book's sample files, you'll see an additional button on the form labeled Use ReadXML, which you can use to see this problem in action. In Chapter 9, after a thorough examination of the internals of ReadXml, we'll build an enhanced version of the CSV reader.

The DataGrid control shown in Figure 2-2 is read-only, but this does not mean that you can't modify rows in the underlying DataTable object and then save changes back to the CSV file. One way to accomplish this result would be by using a customized XML writer class, a kind of XmlCsvWriter. You'll learn how to create such a class in Chapter 4, while we're looking at XML writers.

Note: The full source code for both the CSV XML reader and the sample application making use of it is available in this book's sample files. The folder of interest is named CsvReader.

Important: The XmlTextReader class implements a visiting algorithm for the XML tree based on the so-called node-first approach. This means that for each XML subtree found, the root is visited first, and then recursively all of its children are visited, from the first to the last. Node-first is certainly not the only visiting algorithm you can implement, but it turns out to be the most sensible one for XML trees.

Another well-known visiting algorithm is the depth-first approach, which goes straight to the leaves of the tree and then pops back to outer parent nodes. The node-first approach is more effective for XML trees because it visits nodes in the order they are written to disk. Choosing to implement a different visiting algorithm would make the code significantly more complex and less effective from the standpoint of memory footprint. In short, you should have a good reason to plan and code any algorithm other than node-first.

In general, visiting algorithms other than node-first algorithms exist mostly for tree data structures, including well-balanced and binary trees. XML files are designed like a tree data structure but remain a very special type of tree.
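The difference between the two visiting orders is easy to see on a small tree. The following sketch is an illustration only: it walks an in-memory XmlDocument rather than a stream, the VisitOrderDemo class and its method names are hypothetical, and it uses generic collections, so it assumes a framework version that supports them. The ReaderOrder method lists element names in the order in which XmlTextReader reports their start tags, for comparison with the node-first walk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class VisitOrderDemo
{
    // Node-first: record the node, then its children from first to last.
    public static void NodeFirst(XmlNode node, List<string> visited)
    {
        if (node.NodeType == XmlNodeType.Element)
            visited.Add(node.Name);
        foreach (XmlNode child in node.ChildNodes)
            NodeFirst(child, visited);
    }

    // Leaves-first: record all children, then the node itself.
    public static void LeavesFirst(XmlNode node, List<string> visited)
    {
        foreach (XmlNode child in node.ChildNodes)
            LeavesFirst(child, visited);
        if (node.NodeType == XmlNodeType.Element)
            visited.Add(node.Name);
    }

    // Element names in the order an XmlTextReader reports their start tags.
    public static List<string> ReaderOrder(string xml)
    {
        List<string> visited = new List<string>();
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
                visited.Add(reader.Name);
        }
        reader.Close();
        return visited;
    }

    // Helper that runs one of the two recursive walks over a document.
    public static string Walk(string xml, bool nodeFirst)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        List<string> visited = new List<string>();
        if (nodeFirst)
            NodeFirst(doc.DocumentElement, visited);
        else
            LeavesFirst(doc.DocumentElement, visited);
        return string.Join(",", visited.ToArray());
    }

    public static void Main()
    {
        string xml = "<a><b><c/></b><d/></a>";   // hypothetical sample tree
        Console.WriteLine("Node-first:   " + Walk(xml, true));
        Console.WriteLine("Leaves-first: " + Walk(xml, false));
        Console.WriteLine("Reader:       " + string.Join(",", ReaderOrder(xml).ToArray()));
    }
}
```

Note that the leaves-first walk cannot report the root until every descendant has been seen, which is why such an order sits poorly with a forward-only stream: the reader would have to hold on to each open parent until its whole subtree had been read.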


Readers and XML Readers

To cap off our examination of XML readers and custom readers, let's spend a few moments looking at the difference between an XML reader and a generic reader for a non-XML data structure.

A reader is a basic and key concept in the .NET Framework. Several different types of reader classes do exist in the .NET Framework: binary readers, text readers, XML readers, and database readers, just to name a few. Of course, you can add your own data-specific readers to the list. But that's the point. How would you write your new reader? The simplest answer would be, you write the reader by inheriting from one of the existing reader classes.

A more precise answer should help you identify the best reader class to start from. The key criterion when you're choosing a base class is the kind of programming interface you expect from the new reader. Another minor, but not negligible, concern is whether the class allows for inheritance. Some reader classes are sealed and do not permit inheritance. (The data reader classes, such as SqlDataReader, belong to this category.)

Actually, you could build your own reader class from base classes such as BinaryReader, TextReader, and XmlReader. Typically, you choose the BinaryReader class if you need to manipulate primitive types in binary rather than text format. You choose the TextReader class whenever character input is critical. To successfully build on top of TextReader, the most complicated thing you might need to do is read a line of text between two successive instances of a carriage return. You choose the XmlReader class as the base class if the content of the data you expose can be rendered, or at least traversed, as XML. Because XML is a very specific flavor of text, the XmlReader class happens to be more powerful and richer than any other reader class. Not all data, however, maps to some reasonable extent to XML. If this is the case, simply plan a brand-new reader on top of BinaryReader or TextReader as applicable.

If you just want to implement a specialized XML reader (for example, a SAX reader or an XML reader supporting a different visiting algorithm), you might also consider starting from XmlTextReader, XmlNodeReader, or XmlValidatingReader. An XML specialized reader is basically a reader designed to handle data that is natively stored as well-formed XML.

Conclusion

So far, we've covered the basics of XML readers. By now, you should know how to parse an XML document irrespective of its physical location and storage medium. You know how to move between nodes, how to skip unneeded nodes, and how to read contents and attributes. In short, you have gotten the gist of XML readers.

The reader is a general concept that crosses the whole spectrum of .NET Framework functionalities and applies to XML as well as databases, files, and network protocols. You can also create custom XML readers to process non-XML data structures such as CSV files.

We've only scratched the surface of this topic; there's a lot more to be done. For example, we haven't yet looked at validation, which is the topic of Chapter 3.


Further Reading

An article that summarizes in a few pages the essence of XML readers and writers was written for the January 2001 issue of MSDN Magazine. Although based on a beta version of .NET, it is still of significant value and can be found at http://msdn.microsoft.com/msdnmag/issues/01/01/xml/xml.asp. Fresh, up-to-date, and handy information about XML in the .NET world (and other topics) can be found monthly in the "Extreme XML" column on MSDN Online.

If you need to know more about ADO.NET and its integration with XML, you can check out my book Building Web Solutions with ASP.NET and ADO.NET (Microsoft Press, 2002) or David Sceppa's book Microsoft ADO.NET (Core Reference) (Microsoft Press, 2002).

XML extensions for SQL Server 2000 are described in detail in Chapter 2.

Finally, for a very informative article about the development of XML custom readers, see "Implementing XmlReader Classes for Non-XML Data Structures and Formats," available on MSDN at http://msdn.microsoft.com/library/en-us/dndotnet/html/Custxmlread.asp.


Chapter 3: XML Data Validation

Overview

The base XML reader examined in Chapter 2, the XmlTextReader class, does not enable you to validate the contents of an XML source against a schema. The correctness of XML documents can be measured using two distinct and complementary metrics: the well-formedness of the document and its validity. Well-formedness refers to the overall syntax of the document. Validation applies at a deeper level and involves the semantics of the document, which must be compliant with a user-defined layout.

The XmlTextReader class ensures only that the document being processed is syntactically correct. By design, the XmlTextReader class deliberately avoids making a more advanced analysis of the nodes in the document and checking their internal dependencies. A more specialized class is available in the Microsoft .NET Framework for accomplishing this more complex task: the XmlValidatingReader class. This chapter will focus on techniques and classes available in the .NET Framework to perform validation on XML data.

Although validation is a key aspect in projects that involve critical document exchange across heterogeneous platforms, it does come at a price. Validating a document takes time, because the parser must analyze the constituent nodes; the number, type, and values of their attributes; and the node-to-node dependencies. When applications handle a fully validated document, they can be certain not only about the overall syntax but also about the contents. In a normal XML document, a node simply represents itself, a rather generic repository of hierarchical information. In a validated XML document, on the other hand, the same node to the application's eye represents a strongly typed and strongly defined piece of information. Basically, in a validated document, a node that holds an invoice number ceases to be a generic node and becomes what it was intended to be: the number of the invoice.

Clearly, a nonvalidating reader (and, more generally, a nonvalidating XML parser) will run faster than a validating reader, and that's why XML parsers usually provide XML validation as an option that can be programmatically toggled on and off. In .NET applications, you use XmlTextReader if you simply need to check well-formedness; you resort to XmlValidatingReader if you need to validate the document against a schema.

The XmlValidatingReader Class

The XmlValidatingReader class is an implementation of the XmlReader class that provides support for several types of XML validation: document type definitions (DTDs), XML-Data Reduced (XDR) schemas, and XML Schemas. The XML Schema language is also referred to as XML Schema Definition (XSD).
DTD and XSD are official recommendations issued by the W3C, whereas XDR is simply the Microsoft implementation of an early working draft of XML Schemas that will be superseded by XSD as time goes by.

You can use the XmlValidatingReader class to validate entire XML documents as well as XML fragments. An XML fragment is a string of XML code that does not have a root node. For example, the following XML string turns out to be a valid XML fragment but not a valid XML document. XML documents must have a root node.

    <firstname>Dino</firstname>
    <lastname>Esposito</lastname>
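To make the notion of a fragment concrete, here is a minimal sketch of how a rootless string might be walked with the string-based constructor of XmlValidatingReader. The FragmentDemo class, the ReadElementNames helper, and the element names are hypothetical and chosen for illustration only; validation is switched off because no schema is involved. Note also that later versions of the .NET Framework mark XmlValidatingReader as obsolete, so the compiler may emit a warning.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FragmentDemo
{
    // Reads a rootless XML fragment and returns the names of its elements.
    // (Hypothetical helper, not part of the book's sample code.)
    public static List<string> ReadElementNames(string fragment)
    {
        // The fragment constructors take a parser context; the name table
        // of the namespace manager is used when the first argument is null.
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(new NameTable());
        XmlParserContext context =
            new XmlParserContext(null, nsmgr, null, XmlSpace.None);

        XmlValidatingReader reader =
            new XmlValidatingReader(fragment, XmlNodeType.Element, context);
        reader.ValidationType = ValidationType.None;   // no schema here

        List<string> names = new List<string>();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
                names.Add(reader.Name);
        }
        reader.Close();
        return names;
    }

    public static void Main()
    {
        string fragment =
            "<firstname>Dino</firstname><lastname>Esposito</lastname>";
        foreach (string name in ReadElementNames(fragment))
            Console.WriteLine(name);
    }
}
```

The same string passed to an XmlTextReader created for a full document would fail at the second top-level element, which is the practical difference between a fragment and a document.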


The XmlValidatingReader class works on top of an XML reader, typically an instance of the XmlTextReader class. The text reader is used to walk through the nodes of the document, and then the validating reader gets into the game, validating each piece of XML based on the requested validation type.

Supported Validation Types

What are the key differences between the validation mechanisms (DTD, XDR, and XSD) supported by the XmlValidatingReader class? Let's briefly review the main characteristics of each mechanism.

• DTD  A DTD is a text file whose syntax stems directly from the Standard Generalized Markup Language (SGML), the ancestor of XML as we know it today. A DTD follows a custom, non-XML syntax to define the set of valid tags, the attributes each tag can support, and the dependencies between tags. A DTD allows you to specify the children for each tag, their cardinality, their attributes, and a few other properties for both tags and attributes. Cardinality specifies the number of occurrences of each child element.

• XDR  XDR is a schema language based on a proposal submitted by Microsoft to the W3C back in 1998. (For more information, see http://www.w3.org/TR/1998/NOTE-XML-data-0105.) XDRs are flexible and overcome some of the limitations of DTDs. Unlike DTDs, XDRs describe the structure of the document using the same syntax as the XML document. Additionally, in a DTD, all the data content is character data. XDR language schemas allow you to specify the data type of an element or an attribute.

• XSD  XSD defines the elements and attributes that form an XML document. Each element is strongly typed. Based on a W3C recommendation, XSD describes the structure of XML documents using another XML document. XSDs include an all-encompassing type system composed of primitive and derived types. The XSD type system is also at the foundation of the Simple Object Access Protocol (SOAP) and XML Web services.

DTD was considered the cross-platform standard until a couple of years ago. Then the W3C made official a newer standard, XSD, which is, technically speaking, far superior to DTD. Today, XSD is supported by almost all parsers on all platforms. Although the support for DTD will not be deprecated anytime soon, you'll be better positioned if you start migrating to XSD or building new XML-driven applications based on XSD instead of DTD or XDR.

As mentioned, XDR is an early hybrid specification that never reached the status of a W3C recommendation. It then evolved into XSD. The XmlValidatingReader class supports XDR mostly for backward compatibility, as XDR is fully supported by the Component Object Model (COM)-based Microsoft XML Core Services (MSXML).

Note  The .NET Framework provides a handy utility, named xsd.exe, that among other things can automatically convert an XDR schema to XSD. If you pass an XDR schema file (typically, a file with an .xdr extension), xsd.exe converts the XDR schema to an XSD schema, as shown here:

    xsd.exe myoldschema.xdr

The output file has the same name as the XDR schema, but with the .xsd extension.
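The cardinality rules that a DTD expresses can be seen at work in a short sketch. The code below assumes a hypothetical internal DTD in which a person element must contain exactly one firstname followed by exactly one lastname; the DtdCardinalityDemo class and its Validate helper are illustrative names, not part of the book's samples. It uses the validation event described later in this chapter to collect error messages.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class DtdCardinalityDemo
{
    // Hypothetical internal DTD: <person> requires one <firstname>
    // followed by one <lastname>, both holding character data only.
    public const string Dtd =
        "<!DOCTYPE person [" +
        "<!ELEMENT person (firstname, lastname)>" +
        "<!ELEMENT firstname (#PCDATA)>" +
        "<!ELEMENT lastname (#PCDATA)>" +
        "]>";

    // Validates the given document body against the DTD above and
    // returns the messages of all validation events raised.
    public static List<string> Validate(string body)
    {
        List<string> errors = new List<string>();
        XmlTextReader coreReader =
            new XmlTextReader(new StringReader(Dtd + body));
        XmlValidatingReader reader = new XmlValidatingReader(coreReader);
        reader.ValidationType = ValidationType.DTD;
        reader.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e)
            {
                errors.Add(e.Message);
            };

        while (reader.Read()) { }
        reader.Close();
        return errors;
    }

    public static void Main()
    {
        string valid =
            "<person><firstname>Dino</firstname>" +
            "<lastname>Esposito</lastname></person>";
        string invalid = "<person><firstname>Dino</firstname></person>";

        Console.WriteLine("Valid document errors: " + Validate(valid).Count);
        foreach (string message in Validate(invalid))
            Console.WriteLine("Invalid document: " + message);
    }
}
```

The second document is well-formed but omits the required lastname child, so only the validating reader, not the underlying text reader, has anything to complain about.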


The XmlValidatingReader Programming Interface

The XmlValidatingReader class inherits from the base class XmlReader but implements internally only a small set of all the functionalities that an XML reader exposes. The class always works on top of an existing XML reader, and many methods and properties are simply mirrored.

The dependency of validating readers on an existing text reader is particularly evident if you look at the class constructors. An XML validating reader, in fact, can't be directly initialized from a file or a URL. The list of available constructors comprises the following overloads:

    public XmlValidatingReader(XmlReader reader);
    public XmlValidatingReader(Stream xmlFragment, XmlNodeType fragType,
        XmlParserContext context);
    public XmlValidatingReader(string xmlFragment, XmlNodeType fragType,
        XmlParserContext context);

A validating reader can parse only XML documents for which a reader is provided, plus any XML fragments accessible through a string or an open stream. In the section "Under the Hood of the Validation Process," we'll look more closely at the internal architecture of an XML validating reader.
In the meantime, let's analyze more closely the programming interface of such a class, starting with properties.

XmlValidatingReader Properties

Table 3-1 lists the key public properties exposed by the XmlValidatingReader class. This table does not include those properties defined in the XmlReader base class for which the XmlValidatingReader class simply mirrors the behavior of the underlying reader. Refer to Chapter 2 for more information about the base properties of XmlReader.

Table 3-1: Key Properties of the XmlValidatingReader Class

Property: CanResolveEntity
    Always returns true because the XML validating reader can always resolve entities.

Property: EntityHandling
    Indicates how entities are handled. Allowable values for this property come from the EntityHandling enumeration. The default value is ExpandEntities, which means that all entities are expanded. If set to ExpandCharEntities, only character entities are expanded (for example, &apos;). General entities are returned as EntityReference node types.

Property: Namespaces
    Indicates whether namespace support is requested.

Property: NameTable
    Gets the name table object associated with the underlying reader.

Property: Reader
    Gets the XmlReader object used to construct this instance of the XmlValidatingReader class. The return value can be cast to a more specific reader type, such as XmlTextReader. Any change entered directly to the underlying reader object can lead to unpredictable results. Use the XmlValidatingReader interface to manipulate the properties of the underlying reader.

Property: Schemas
    Gets an XmlSchemaCollection object that holds a collection of preloaded XDRs and XSDs. Schema preloading is a trick used to speed up the validation process. Schemas, in fact, are cached, and there is no need to load them every time.

Property: SchemaType
    Gets the schema object that represents the current node in the underlying reader. This property is relevant only for XSD validation. The object describes whether the type of the node is one of the built-in XSD types or a user-defined simple or complex type.

Property: ValidationType
    Indicates the type of validation to perform. Feasible values come from the ValidationType enumeration: Auto, None, DTD, XDR, and Schema.

Property: XmlResolver
    Sets the XmlResolver object used for resolving external DTD and schema location references. The XmlResolver is also used to handle any import or include elements found in XSD schemas.

The validating reader uses the underlying reader to move around the document and implements most of its XmlReader-derived properties by simply mirroring the corresponding properties of the worker reader.

XmlValidatingReader Methods

Table 3-2 lists the methods exposed by the XmlValidatingReader class that are either new or whose behavior significantly differs from the corresponding methods of the XmlReader class.

Table 3-2: Public Methods of the XmlValidatingReader Class

Method: Read
    The underlying reader moves to the next node. At the same time, the validating reader gets the node information and validates it using the schema information and the previously cached information.

Method: ReadTypedValue
    Gets the value for the underlying node as a common language runtime (CLR) type. The mapping can take place only for XSDs. Whenever a direct mapping is not possible, the node value is returned as a string.

Method: Skip
    Skips the children of the current node in the underlying reader. You can't skip over badly formed XML text, however. In the XmlValidatingReader class, the Skip method also validates the skipped content.

As you can see, the programming interface of the XmlValidatingReader class does not explicitly provide a single method that can validate the entire contents of a document. The validating reader works incrementally, node by node, as the underlying reader does. Each validation error found along the way results in a particular event notification being returned to the caller application. The application is then responsible for defining an ad hoc event handler and behaving as needed.
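The Schemas property and the ReadTypedValue method from the tables above can be combined in a short sketch. It assumes a hypothetical one-element schema that declares quantity as an xs:int, preloads that schema into the reader's schema collection, and then asks for the typed value of the element. The TypedValueDemo class and the ReadQuantity helper are illustrative names only, and the exact CLR type returned depends on the XSD-to-CLR mapping applied by the framework.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class TypedValueDemo
{
    // Hypothetical schema with no target namespace: <quantity> is an xs:int.
    const string Xsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:element name='quantity' type='xs:int'/>" +
        "</xs:schema>";

    // Returns the typed value of the <quantity> element, or null if the
    // element is not found or carries no schema type information.
    public static object ReadQuantity(string xml)
    {
        XmlValidatingReader reader = new XmlValidatingReader(
            new XmlTextReader(new StringReader(xml)));
        reader.ValidationType = ValidationType.Schema;

        // Preload the schema; passing null as the namespace lets the
        // collection take it from the schema itself.
        reader.Schemas.Add(null, new XmlTextReader(new StringReader(Xsd)));

        object value = null;
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element &&
                reader.Name == "quantity")
            {
                value = reader.ReadTypedValue();
                break;
            }
        }
        reader.Close();
        return value;
    }

    public static void Main()
    {
        object value = ReadQuantity("<quantity>42</quantity>");
        if (value != null)
            Console.WriteLine(value.GetType().Name + ": " + value);
    }
}
```

Because the schema sits in the reader's own collection, the source document needs no schema reference of its own, which is the point of preloading.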


The ValidationEventHandler Event

The XmlValidatingReader class contains a public event named ValidationEventHandler, which is defined as follows:

    public event ValidationEventHandler ValidationEventHandler;

This event is used to pass information about any DTD, XDR, or XSD schema validation errors that have been detected. The handler for the event (also named ValidationEventHandler) has the following signature:

    public delegate void ValidationEventHandler(object sender,
        ValidationEventArgs e);

The ValidationEventArgs class is described by the following pseudocode:

    public class ValidationEventArgs : EventArgs
    {
        public XmlSchemaException Exception { get; }
        public string Message { get; }
        public XmlSeverityType Severity { get; }
    }

The Message property returns a description of the error. The Exception property, on the other hand, returns an ad hoc exception object (XmlSchemaException) with details about what happened. The schema exception class contains information about the line that originated the error, the source file, and, if available, the schema object that generated the error. The schema object (the SourceSchemaObject property) is available for XSD validation only.

The Severity property represents the severity of the validation event. The XmlSeverityType enumeration defines two levels of severity: Error and Warning. Error indicates that a serious validation error occurred when processing the document against a DTD, an XDR, or an XSD schema. If the current instance of the XmlValidatingReader class has no validation event handler set, an exception is thrown. Typically, a warning is raised when there is no DTD, XDR, or XSD schema to validate a particular element or attribute against. Unlike errors, warnings do not throw an exception if no validation event handler has been set.

The XmlValidatingReader in Action

Let's see how to validate an XML document. As mentioned, the XmlValidatingReader class is still a reader class, so it proceeds with an incremental validation as nodes are actually read. The caller is notified of any schema exception found for a node through the ValidationEventHandler event. This section describes in detail how to validate an XML document, including initializing an XML reader, handling validation errors, and setting and detecting the validation types.

Initialization of the Reader

To validate the contents of an XML file, you must first create an XML text reader to work on the file and then use this reader to initialize an instance of a validating reader. A validating reader can be initialized using a living instance of an XmlReader class (typically, an XmlTextReader object) or using an XML fragment taken from a stream or a memory string, as shown here:


    XmlTextReader _coreReader = new XmlTextReader(fileName);
    XmlValidatingReader reader = new XmlValidatingReader(_coreReader);

You move around the input document using the Read method as usual. Actually, you use the validating reader as you would any other XML .NET reader. At each step, however, the structure of the currently visited node is validated against the specified schema, and any error found is reported, through the validation event if a handler is set or through an exception otherwise.

To validate an entire XML document, you simply loop through its contents, as shown here:

    private bool ValidateDocument(string fileName)
    {
        // Initialize the validating reader
        XmlTextReader _coreReader = new XmlTextReader(fileName);
        XmlValidatingReader reader = new XmlValidatingReader(_coreReader);

        // Prepare for validation
        reader.ValidationType = ValidationType.Auto;
        reader.ValidationEventHandler +=
            new ValidationEventHandler(MyHandler);

        // Parse and validate all the nodes in the document
        while (reader.Read()) {}

        // Close the reader
        reader.Close();
        return true;
    }

The ValidationType property is set to the default value, ValidationType.Auto. In this case, the reader determines what type of validation (DTD, XDR, or XSD) is required by looking at the contents of the file. The caller application is notified of any error through a ValidationEventHandler event. In the preceding code, the MyHandler procedure runs whenever a validation error is detected, as shown here:

    private void MyHandler(object sender, ValidationEventArgs e)
    {
        // Logs the error that occurred
        PrintOut(e.Exception.GetType().Name, e.Message);
    }

Figure 3-1 shows the output of the sample program ValidateDocument. The list box tracks down all the errors that have been detected. The complete code listing for the sample application showing how to set up a validating parser is available in this book's sample files.
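The difference between running with and without a handler can be sketched as follows. The example assumes a hypothetical document whose internal DTD allows only text inside note, while the document body contains an extra child element; the EventVsExceptionDemo class and its two helpers are illustrative names, not the book's sample code. With a handler attached, errors are counted and parsing goes on; without one, the first error is expected to surface as an XmlSchemaException.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class EventVsExceptionDemo
{
    // Hypothetical invalid document: <note> may hold only text,
    // yet it contains an <extra/> child element.
    public const string InvalidXml =
        "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]><note><extra/></note>";

    static XmlValidatingReader CreateReader(string xml)
    {
        XmlValidatingReader reader = new XmlValidatingReader(
            new XmlTextReader(new StringReader(xml)));
        reader.ValidationType = ValidationType.DTD;
        return reader;
    }

    // With a handler: every error is reported and the loop continues.
    public static int CountErrorsWithHandler(string xml)
    {
        int count = 0;
        XmlValidatingReader reader = CreateReader(xml);
        reader.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e)
            {
                if (e.Severity == XmlSeverityType.Error)
                    count++;
            };
        while (reader.Read()) { }
        reader.Close();
        return count;
    }

    // Without a handler: the first error stops the loop with an exception.
    public static bool ThrowsWithoutHandler(string xml)
    {
        XmlValidatingReader reader = CreateReader(xml);
        try
        {
            while (reader.Read()) { }
            return false;
        }
        catch (XmlSchemaException)
        {
            return true;
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main()
    {
        Console.WriteLine("Errors seen by handler: " +
            CountErrorsWithHandler(InvalidXml));
        Console.WriteLine("Throws without handler: " +
            ThrowsWithoutHandler(InvalidXml));
    }
}
```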


Figure 3-1: The sample application dumps the most significant events of its life cycle: when parsing begins, when parsing ends, and all the validation errors that have been detected in between.

When you've finished with the validation process, you close the reader using the Close method. This operation also resets the reader's internal state to Closed. Closing the validating reader automatically closes the underlying text reader. However, no exception is raised if you also attempt to programmatically close the internal reader. The Close method simply returns when it is called on a reader that is already closed.

Handling Validation Errors

If you need to know the details of validation errors, you must define an event handler and pass it along to the validating reader. Whenever an error is found, the reader fires the event and then continues to parse. As a result, the event fires for all the errors detected, thus giving the caller application a chance to handle the errors separately.

In some situations, you might want to know simply whether a given XML document complies with a given schema. In this case, you don't need to know anything about the error other than the fact that it occurred. The following code provides a class with a static method named ValidateXmlDocument.
This method takes the name of an XML file, figures out the most appropriate validation schema, and returns a Boolean value.

    using System;
    using System.Xml;
    using System.Xml.Schema;

    public class XmlValidator
    {
        private static bool m_isValid = false;

        // Handle any validation errors detected
        private static void ErrorHandler(object sender,
            ValidationEventArgs e)
        {
            // Go on in case of warnings
            if (e.Severity == XmlSeverityType.Error)
                m_isValid = false;
        }

        // Validate the specified XML document (using Auto mode)
        public static bool ValidateXmlDocument(string fileName)
        {
            XmlTextReader _coreReader = new XmlTextReader(fileName);
            XmlValidatingReader reader =
                new XmlValidatingReader(_coreReader);
            reader.ValidationType = ValidationType.Auto;
            reader.ValidationEventHandler +=
                new ValidationEventHandler(XmlValidator.ErrorHandler);

            // Parse the document
            try
            {
                m_isValid = true;
                while (reader.Read() && m_isValid) {}
            }
            catch
            {
                m_isValid = false;
            }

            reader.Close();
            return m_isValid;
        }
    }

The ValidateXmlDocument method loops through the nodes of the document until the internal member m_isValid is false or the end of the stream is reached. The m_isValid member is set to true at the beginning of the loop and changes to false the first time an error is found. At this point, the document is certainly invalid, so there is no reason to continue looping.

Because the ValidateXmlDocument method is declared static (or Shared in Microsoft Visual Basic .NET), you don't need a particular instance of the class to issue the call, as shown here:

    if (!XmlValidator.ValidateXmlDocument("data.xml"))
        MessageBox.Show("Not a valid document!");

Note  The reader's internal mechanisms responsible for checking a document's well-formedness and schema compliance are distinct. So if a validating reader happens to work on a badly formed XML document, no event is fired, but an XmlException exception is raised.

Setting the Validation Type

The ValidationType property indicates what type of validation must be performed on the current document. To be effective, the property must be set before the first call to Read. Setting the property after the first call to Read throws an InvalidOperationException exception. If no value is explicitly assigned to the property, it defaults to the ValidationType.Auto value.

The ValidationType enumeration defines all the feasible values for the property, as listed in Table 3-3.

Table 3-3: Types of Validation

Type: None
    Creates a nonvalidating reader and ignores any validation errors

Type: Auto
    Determines the most appropriate type of validation by looking at the contents of the document

Type: DTD
    Validates according to the specified DTD

Type: Schema
    Validates according to the specified XSD schemas, including in-line schemas

Type: XDR
    Validates according to XDR schemas, including in-line schemas

When the validation type is set to Auto, the reader first attempts to locate a DTD declaration in the document. The DTD validation always takes precedence over other validation types. If a DTD is found, the document is validated accordingly. Otherwise, the reader looks for an XSD, either referenced or in-line. If no XSD is found, the reader makes a final attempt to find a referenced or an in-line XDR schema. If a schema is still not found, a nonvalidating reader is created. If more than one validation schema is specified in the document, only the first occurrence, in accordance with the order just discussed, is taken into account.

Detecting the Actual Validation Type

When the ValidationType property is set to Auto, you know at the end of the process whether the semantics of your XML document are valid. But valid against which schema? The Auto mode forces the parser to make various attempts until a validation schema type is found in the source code, whether it be DTD, XSD, or XDR. Is there a way to know what type of validation the parser is actually performing when working in Auto mode?

The validating reader class provides no help on this point, but with a bit of creativity you can easily identify the information you need. This information is not directly exposed, but it is right under your nose and can be inferred from the node type and the schema type without too much effort.

If the parser detects a node of type DocumentType, it can only be validating against a DTD. By definition, the DOCTYPE node must appear outside the information set (infoset), that is, in the document prolog before the root element. If no DOCTYPE node is found, check whether the SchemaType property evaluates to an XmlSchemaType object. This can happen only if an XML Schema Object Model (SOM) has been created, and hence only if XSD validation is taking place. The XmlSchemaType object has even more in store.
By checking the contents of the SourceUri property, you can also determine whether the schema is in-line or a reference. If the schema is in-line, the SourceUri property matches the URI of the XML document being processed. Finally, if the validation type is neither DTD nor XSD, it can only be XDR! The following source code illustrates a function that determines the actual validation type:

    // Requires: using System.IO; using System.Xml; using System.Xml.Schema;
    string GetActualValidationType(XmlValidatingReader reader,
        string filename)
    {
        string realValidationType = "";
        if (reader.ValidationType == ValidationType.Auto)
        {
            if (reader.NodeType == XmlNodeType.DocumentType)
                realValidationType = "Auto.DTD";
            else
            {
                if (reader.SchemaType is XmlSchemaType)
                {
                    XmlSchemaType xst = (XmlSchemaType) reader.SchemaType;
                    string xsd = Path.GetFileName(xst.SourceUri);
                    string doc = Path.GetFileName(filename);
                    if (xsd == doc)
                        realValidationType = "Auto.Schema.Inline";
                    else
                        realValidationType = "Auto.Schema.Ref (" + xsd + ")";
                }
            }
        }
        return realValidationType;
    }

This code alone is not sufficient to produce the desired effect. It must be used in combination with the main parsing loop, as shown in the following code. The function should be called from within the loop as you read nodes, and at the end of the loop, you should check the results. If neither DTD nor XSD has been detected, the document can be validated only through XDR.

    string valtype = "";
    while (reader.Read())
    {
        if (valtype == "")
            valtype = GetActualValidationType(reader, filename);
    }

    // No DTD, no XSD, so it must be XDR...
    if (valtype == "" && reader.ValidationType == ValidationType.Auto)
        valtype = "Auto.XDR";

Figure 3-2 shows how the ValidateDocument application implements this feature.

Figure 3-2: The ValidateDocument application determines the type of validation occurring under the umbrella of the Auto validation type.

Although it's easy to use, the Auto option is the most expensive of all in terms of performance because it must first figure out what type of validation to apply. Whenever possible, you should indicate explicitly the type of validation required.

Note  When the ValidationType property is set to None, the DTD-specific DOCTYPE node, if present, is not used for validation purposes. However, default attributes in the DTD are correctly reported. General entities are not automatically expanded but can be resolved using the ResolveEntity method.

Events vs. Exceptions

The typical way to detect validation errors is by means of a validation event handler. If a validation event handler is specified, no validation exception is ever raised. In practice, once the reader has found an error, it looks for an event handler. If a handler is found, the reader raises the event; otherwise, it throws an XmlSchemaException exception.

For the reader class, handling an exception is much more expensive than firing an event, so use the ValidationEventHandler event whenever possible and do not abuse exceptions. Using exceptions automatically stops the validation process after the first error. As shown in the section "Handling Validation Errors," you can obtain the same behavior from the event by using a slightly smarter Boolean guard for the loop.
Instead of using the following statement:

    while(reader.Read());

you resort to this:

    while(reader.Read() && !m_errorFound)

where the m_errorFound private member is updated in the body of the event handler according to what you want to do.

A Word on XML DOM

So far, we've looked exclusively at how the validation process works for XML readers. But what about the XmlDocument class for XML Document Object Model (XML DOM) parsing? How can you validate against a schema while building an XML DOM? We'll examine XML DOM classes in detail in Chapter 5, but for now a quick preview, limited to validation, is in order.

The XmlDocument class, the key .NET Framework class for XML DOM parsing, uses the Load method to parse the entire contents of a document into memory. When it is given a stream, a text reader, or a file name, however, the Load method does not validate the XML source code against a DTD or a schema; it can only check whether the XML is well-formed.

If you want to validate the in-memory tree while building it, use the following overload for the XmlDocument class's Load method:

    public virtual void Load(XmlReader);

You can create an XML DOM from a variety of sources, including a stream, a text reader, and a file name. If you load the document through an XML validating reader, you hit your target and obtain a fully validated in-memory DOM, as shown here:

    XmlTextReader _coreReader = new XmlTextReader(fileName);
    XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
    XmlDocument doc = new XmlDocument();
    doc.Load(reader);

As you'll see in Chapter 5, in the .NET Framework, an XML DOM is built using an internal reader. The programming interface of the XmlDocument class, however, in some cases allows you to specify the reader to use.
If this reader happens to be a validating reader, you are automatically provided with a fully validated in-memory DOM.

Under the Hood of the Validation Process

Before going any further with the details of DTD, XDR, and XSD validation, let's review what happens under the hood of the validation process and how the XmlValidatingReader class really operates.

As mentioned, a validating reader works on top of a less-specialized reader, typically an XML text reader. You initialize the validating reader simply by passing a reference to this object. Upon initialization, the validating reader copies a few settings from the underlying reader. In particular, the properties BaseURI, Normalization, and WhiteSpaceHandling get the same values as the underlying reader. During the initialization step, an internal validator object is created to manage the schema information on a per-node basis.

Important: Although one of the XmlValidatingReader constructors takes an instance of the XmlReader class as its parameter, actually that reader can only be an instance of the XmlTextReader class, or a class that derives from it. You can't use just any class that happens to inherit from XmlReader (for example, a custom XML reader). Internally, the XmlValidatingReader class assumes that the underlying reader is an XmlTextReader object and specifically casts the input reader to XmlTextReader.
If you use XmlNodeReader or a custom reader class, you will not get an error at compile time, but an exception will be thrown at run time.

Incremental Parsing

The validation takes place as the user moves the pointer forward using the Read method. After the node has been parsed and read, it is passed on to the internal validator object for further processing. The validator object operates based on the node type and the validation type requested. The validator object makes sure that the node has all the attributes and children it is expected to have.

The validator object internally invokes two flavors of objects: the DTD parser and the schema builder. The DTD parser processes the contents of the current node and its subtree against the DTD. The schema builder builds a SOM for the current node based on the XDR or XSD schema source code. The schema builder class is actually the base class for more specialized XDR and XSD schema builders. What matters, though, is that XDR and XSD schemas are treated in much the same way and with no difference in performance.

If a node has children, another temporary reader is used to read its XML subtree in such a way that the schema information for the node can be fully investigated. The overall diagram is shown in Figure 3-3.

Figure 3-3: The validating reader coordinates the efforts of the internal reader, the validator, and the event handler.

In general, an XML reader might or might not resolve entities, but an XML validating reader always does so. The EntityHandling property defines how entities are handled. The EntityHandling property can take one of two values defined in the EntityHandling enumeration, as described in Table 3-4.

Table 3-4: Ways to Handle Entities

    Action               Description
    ExpandCharEntities   Expands character entities and returns general entities as EntityReference nodes. You must then call the ResolveEntity method to expand a general entity.
    ExpandEntities       Default setting; expands all entities and replaces them with their underlying text.

A character entity is an XML entity that evaluates to a character and is expressed through the character's decimal or hexadecimal representation. For example, &#65; expands to A. Character entities are mostly used to guarantee the well-formedness of the overall document when this is potentially broken by that character.

A general entity is a normal XML entity that can expand to a string of any size, including a single character. A general entity is always expressed through text, even when it refers to a single character.

By default, the reader makes no distinction between the types of entities and expands them all when needed. By setting the EntityHandling property to ExpandCharEntities, however, you can optimize entity handling by expanding the general entities only when required. In this case, a call to Read expands only character entities. To expand general entities, you must resort to the ResolveEntity method or to GetAttribute, if the entity is part of an attribute.

The EntityHandling property can be changed on the fly; the new value takes effect when the next call to Read is made.

A Cache for Schemas

In the validating reader class, the Schemas property represents a collection (that is, an instance of the XmlSchemaCollection class) in which you can store one or more schemas that you plan to use later for validation.
Using the schema collection improves overall performance because the various schemas are held in memory and don't need to be loaded each and every time validation occurs. You can add as many XSD and XDR schemas as you want, but bear in mind that the collection must be completed before the first Read call is made.

To add a new schema to the cache, you use the Add method of the XmlSchemaCollection object. The method has a few overloads, as follows:

    public void Add(XmlSchemaCollection);
    public XmlSchema Add(XmlSchema);
    public XmlSchema Add(string, string);
    public XmlSchema Add(string, XmlReader);

The first overload populates the current collection with all the schemas defined in the given collection. The remaining three overloads build from different data and return an instance of the XmlSchema class, the .NET Framework class that contains the definition of an XSD schema.

Populating the Schema Collection

The schema collection actually consists of instances of the XmlSchema class, a kind of compiled version of the schema. The various overloads of the Add method allow you to create an XmlSchema object from a variety of input arguments. For example, consider the following method:

    public XmlSchema Add(
        string ns,
        string url
    );

This method creates and adds a new schema object to the collection. The compiled schema object is created using the namespace URI associated with the schema and the URL of the source. For example, let's assume that you have a clients.xsd file that begins as follows:

The corresponding Add statement to insert the schema into the collection looks like this:

    XmlTextReader _coreReader = new XmlTextReader(file);
    XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
    reader.Schemas.Add("urn:my-company", "clients.xsd");

While validating, the XmlValidatingReader class identifies the schema to use for a given XML source document by matching the document's namespace URI with the namespace URIs available in the collection. If the schema is an XDR schema, the item to match in the schema collection is the contents of the xmlns attribute. If the schema is an XSD schema, the targetNamespace attribute in the XSD source code is used.

When you add a new schema to the collection and the namespace URI argument (the first argument) is null or empty, the Add method automatically brings in the value of the xmlns attribute if the source file is an XDR schema and the value of the targetNamespace attribute if you are adding an XSD schema, as shown here:

    XmlTextReader _coreReader = new XmlTextReader(file);
    XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
    reader.Schemas.Add(null, "Clients.xsd");
    reader.ValidationType = ValidationType.Schema;
    reader.ValidationEventHandler += new ValidationEventHandler(MyHandler);

If the namespace URI you use already exists in the schema collection, the schema being added replaces the original one.

If necessary, you could also load the schema from an XML reader object by using the overload shown here:

    public XmlSchema Add(
        string ns,
        XmlReader reader
    );
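The reader-based overload can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's sample code: the in-memory schema, the SchemaCacheSketch class, and its BuildCache and Validate helpers are hypothetical names introduced here. Only the XmlSchemaCollection and XmlValidatingReader members discussed above are used. Later releases of .NET mark both classes as obsolete, so expect compiler warnings outside the .NET Framework 1.x environment this chapter targets.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaCacheSketch
{
    // Hypothetical in-memory schema standing in for a file such as clients.xsd.
    private const string Xsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' " +
        "targetNamespace='urn:my-company' elementFormDefault='qualified'>" +
        "<xs:element name='client' type='xs:string'/>" +
        "</xs:schema>";

    // Compiles the schema once so that many readers can share it.
    public static XmlSchemaCollection BuildCache()
    {
        XmlSchemaCollection cache = new XmlSchemaCollection();
        // A null namespace lets Add pick up the schema's targetNamespace.
        cache.Add(null, new XmlTextReader(new StringReader(Xsd)));
        return cache;
    }

    // Validates one document against the cached schemas and returns the error messages.
    public static List<string> Validate(string xml, XmlSchemaCollection cache)
    {
        List<string> errors = new List<string>();
        XmlValidatingReader reader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));

        // The collection must be complete before the first Read call.
        reader.Schemas.Add(cache);
        reader.ValidationType = ValidationType.Schema;
        reader.ValidationEventHandler += delegate(object sender, ValidationEventArgs e)
        {
            // Collect errors only; warnings are ignored in this sketch.
            if (e.Severity == XmlSeverityType.Error)
                errors.Add(e.Message);
        };

        while (reader.Read()) { }
        reader.Close();
        return errors;
    }

    public static void Main()
    {
        XmlSchemaCollection cache = BuildCache();
        Console.WriteLine(cache.Contains("urn:my-company"));
        Console.WriteLine(Validate("<client xmlns='urn:my-company'>Joe</client>", cache).Count);
        Console.WriteLine(Validate("<client xmlns='urn:my-company'><x/></client>", cache).Count);
    }
}
```

Building the collection once and copying it into each reader's Schemas property keeps schema compilation out of the per-document path, which is the point of the cache.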


Note: You can check whether a schema is already in the schema collection by using the Contains method. The Contains method can take either an XmlSchema object or a string representing the namespace URI associated with the schema. The former approach works only for XSD schemas. The latter covers both XSD and XDR schemas.

Different Treatments for XSD and XDR

Although you can store both XSD and XDR schemas in the schema collection, there are some differences in the way in which the XmlSchemaCollection object handles them internally. For example, the Add method returns an XmlSchema object if you add an XSD schema but returns null if the added schema is an XDR. In general, any method or property that manipulates the input or output of an XmlSchema object supports XSD schemas only.

Another difference concerns the behavior of the Item property in the XmlSchemaCollection class. The Item property takes a string representing the schema's namespace URI and returns the corresponding XmlSchema object. This happens only for XSDs, however. If you call the Item property on a namespace URI that corresponds to an XDR schema, null is returned.

The reason behind the different treatments for XDR and XSD schemas is that XDR schemas have no object model available in the .NET Framework, so when you need to handle them through objects, the system gracefully ignores the requests.

XDR schemas are there only to preserve backward compatibility; you will not find them supported outside the Microsoft Win32 platform.
It is important to pay attention to the methods and the properties you use to manage XDR in your code. The overall programming interface makes the effort to unify the methods and the properties to work on both XDRs and XSDs. But in some circumstances, those same methods and properties might lead to unpleasant surprises.

In a nutshell, you can cache an XDR schema for further and repeated use by the XmlValidatingReader class, but that's all that you can do. You can't check for the existence of an XDR schema through an XmlSchema object, nor can a reference to an XDR schema be returned. But you can do this, and more, for XSDs.

Important: The XmlSchemaCollection object is important to improving the overall performance of the validation process. If you are validating more than one document against the same schema (XDR or XSD), preload the schema in the reader's internal cache, represented by the Schemas property. While doing so, bear in mind that any insertion in the schema collection must be done prior to starting the validation process. You can add to the schema collection only when the reader's state is set to Initial.

Validating XML Fragments

As mentioned, the XmlValidatingReader class has the ability to parse and validate entire documents as well as XML fragments. To parse an XML fragment, you must resort to one of the other two constructors that the XmlValidatingReader class kindly provides, as shown here:


    public XmlValidatingReader(Stream, XmlNodeType, XmlParserContext);
    public XmlValidatingReader(string, XmlNodeType, XmlParserContext);

These constructors allow you to read XML fragments from a stream or a memory string and process them within the boundaries of a given parser context.

To bypass the root level rule for well-formed XML documents, you explicitly indicate what type of node the fragment happens to be. The node types for XML fragments are listed in Table 3-5.

Table 3-5: XML Fragment Node Types

    Type        Fragment Contents
    Attribute   The value of an attribute, including entities.
    Document    An entire XML document in which all the rules of well-formedness apply, including the root level rules.
    Element     Any valid element contents, including a combination of elements, comments, processing instructions, CDATA, and text. Root level rules are not enforced.

If you use any other value from the XmlNodeType enumeration, an exception is thrown. Entity references that are found in the element or the attribute body are expanded according to the value of the EntityHandling property.

When parsing a small XML fragment, you might need to take in extra information that can be used to resolve entities and add default attributes. For this purpose, you use the XmlParserContext class. (See Chapter 2 for more information about the XmlParserContext class.)
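As a rough illustration of the fragment constructors, the following sketch reads an Element fragment that has no single root. It assumes a hypothetical FragmentSketch class and an invented prefix binding ("c" mapped to urn:my-company) supplied through the parser context; neither comes from the book's sample code. Validation is switched off so that only the fragment-reading behavior is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FragmentSketch
{
    // Reads an Element fragment and lists each element with its namespace URI.
    public static List<string> ReadElements(string fragment)
    {
        NameTable table = new NameTable();
        XmlNamespaceManager ns = new XmlNamespaceManager(table);
        // Hypothetical prefix binding; the fragment itself never declares "c".
        ns.AddNamespace("c", "urn:my-company");
        XmlParserContext context = new XmlParserContext(table, ns, null, XmlSpace.None);

        // XmlNodeType.Element relaxes the single-root rule for the fragment.
        XmlValidatingReader reader =
            new XmlValidatingReader(fragment, XmlNodeType.Element, context);
        reader.ValidationType = ValidationType.None;

        List<string> found = new List<string>();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
                found.Add(reader.LocalName + " in " + reader.NamespaceURI);
        }
        reader.Close();
        return found;
    }

    public static void Main()
    {
        string fragment = "<c:client>Joe</c:client><c:client>Ann</c:client>";
        foreach (string item in ReadElements(fragment))
            Console.WriteLine(item);
    }
}
```

The prefix resolves only because the parser context carries the namespace manager; without that binding the same fragment would not be well-formed with respect to namespaces.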
The XmlParserContext argument of the XmlValidatingReader constructor is required if the requested validation mode is DTD or Auto. In this case, in fact, the parser context is expected to contain the reference to the DTD file against which the validation must be done. An exception is thrown if the ValidationType property is set to DTD and the XmlParserContext argument does not contain any DTD properties.

For all other validation types, the XmlParserContext argument can be specified without any DTD properties. Any schemas (XSDs or XDRs) used to validate the XML fragment must be referenced directly inside the XML fragment. When the validation is against schemas, the XmlParserContext argument is used primarily to provide information about namespace resolution.

Important: As mentioned, the XmlValidatingReader always works on top of an XML text reader and uses it to move around the nodes to validate. When you validate an XML fragment, however, you are not required to indicate a reader. So does the validating reader support a dual internal architecture to handle both cases? The fact that you don't have to pass an XML text reader to validate an XML fragment does not mean that a text reader can't be playing around in your code. Internally, both fragment-based constructors create a temporary text reader as their first task. The following pseudocode shows what happens:

    XmlTextReader coreReader = new XmlTextReader(xml, type, context);
    this = new XmlValidatingReader(coreReader);

At this point, the internal mechanisms of an XML validating reader and its programming interface should be clear. In the remainder of this chapter, we'll examine in more detail the three key types of validation: DTD, XDR, and XSD.

Using DTDs

The DTD validation guarantees that the source document complies with the validity constraints defined in a separate file, the DTD. A DTD file uses a formal grammar to describe both the structure and the syntax of XML documents. XML authors use DTDs to narrow the set of tags and attributes allowed in their documents. Validating against a DTD ensures that processed documents conform to the specified structure. From a language perspective, a DTD defines a newer and stricter XML-based syntax and a new tagged language tailor-made for a related group of documents.

Historically speaking, the DTD was the first tool capable of defining the structure of a document. The DTD standard was developed a few decades ago to work side by side with SGML, a recognized ISO standard for defining markup languages. SGML is considered the ancestor of today's XML, which actually sprang to life in the late 1990s as a way to simplify the too-rigid architecture of SGML.

DTDs use a proprietary syntax to define the syntax of markup constructs as well as additional definitions such as numeric and character entities.
You can correctly think of DTDs as an early form of an XML schema. Although doomed to obsolescence, DTD is today supported by virtually all XML parsers.

An XML document is associated with a DTD file by using the DOCTYPE special tag. The validating parser (for example, the XmlValidatingReader class) recognizes this element and extracts from it the schema information. The DOCTYPE declaration can either point to an inline DTD or be a reference to an external DTD file.

Developing a DTD Grammar

Let's look more closely at a DTD file. To build a DTD, you normally start writing the file according to its syntax. In this case, however, we'll start from an XML file named data_dtd.xml that will actually be validated through the DTD, as shown here:

    XML Core Classes
    Related Technologies
    XML and ADO.NET
    XML and Applications
    XML Interoperability

As you can see, the file describes a class through its modules and topics covered. The general information about the class (title, author, training company) is written using attributes. Each module spans a full day, and its description is implemented using plain text.

Any XML document that must be validated against a given DTD file includes a DOCTYPE tag through which it simply links to the DTD of choice, as shown here:

The word following DOCTYPE identifies the metalanguage described by the DTD. This information is extremely important for the validation process. If that word, the document type name, does not match the root element of the document, a validation error is raised. The text following the SYSTEM keyword is the URL from which the DTD will actually be downloaded.

The following listing demonstrates a DTD that is tailor-made for the preceding XML document:

The ELEMENT tag identifies a node element, whereas ATTLIST is the tag that groups all attributes of a given node. Attributes are normally declared as CDATA, meaning that their values are plain character data. In some cases, however, they can be allowed to take only the values defined by the specified entity. This is the case for the expandable attribute, whose only permitted values are true and false.

In the section "Further Reading," you'll find references for learning more about the DTD syntax.
What first catches the eye about DTDs is that they are written in a proprietary language that only mimics the typical markup of XML.

Validating Against a DTD

The following code snippet creates an XmlValidatingReader object that works on the sample XML file data_dtd.xml discussed in the section "Developing a DTD Grammar." The document is bound to a DTD file and is validated using the DTD validation type.


    XmlTextReader _coreReader = new XmlTextReader("data_dtd.xml");
    XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
    reader.ValidationType = ValidationType.DTD;
    reader.ValidationEventHandler += new ValidationEventHandler(MyHandler);
    while(reader.Read());

Remember that when the validation type is set to Auto, the DTD option is the first to be considered.

When the validation mode is set to DTD, the validating parser returns a warning if the file has no link to any DTDs. Otherwise, if a DTD is correctly linked and accessible, the validation is performed, and in the process, entities are expanded. If the linked DTD file is not available, an exception is raised. What you'll get is not a schema exception but a simpler FileNotFoundException exception.

If you mistakenly use a DTD to validate an XML file with schema information, a schema exception is thrown, but with a low severity level. In practice, you get a warning informing you that no DTD has been found in the XML file.
Figure 3-4 shows how the sample application handles this situation.

Figure 3-4: When you try to use a DTD to validate an XML document with schema information, the validating parser returns a warning.

In general, if you decide that schema warnings are not serious enough to break the ongoing validation process, you can skip them with the following code:

    private void MyHandler(object sender, ValidationEventArgs e)
    {
        // Warnings are ignored; only real errors are processed
        if (e.Severity == XmlSeverityType.Error)
        {
            // Handle the schema exception
        }
    }

Usage and Trade-Offs for DTDs

Unquestionably, the DTD validation format is an old one, albeit supported by virtually all available parsers. But if you are designing the validation layer for an XML-driven data exchange infrastructure today, there is no reason for you to discard XSDs.


XSDs are more powerful than DTDs, and more important, they recently achieved W3C recommendation status, so they are a standard too.

So when should you use DTDs instead of XSDs, and under what circumstances will DTDs give you a better trade-off? Compatibility and legacy code are the only possible answers to these questions. Especially if your application handles complex DTDs, porting them to an XSD can be costly and is in no way an easy task. There is no official and totally reliable tool to automatically convert DTDs to schemas. On the W3C Web site (www.w3.org), you'll find a conversion tool available for download, but I wouldn't trust it to do the job unsupervised and then take the output as a trustworthy result.

Converting DTDs to schemas is no simple matter; in fact, it can be as complex as translating spoken languages. Translating from English to Italian, for example, requires a reengineering of the entire text, not just an adaptation of individual words and sentences. So design is deeply involved. When converting DTDs to schemas, you should also consider rearchitecting tags into types and perhaps rearchitecting the way you expose data in light of the new features.

Certainly XSDs provide you with more functions than DTDs can. For one thing, schemas are all written in XML and don't require you to learn a new language. If you look at our basic DTD example in this context, you might not be scared by its unusual format.
As you move from textbook examples and enter the tough real world, the complexity of an inflexible language like DTD becomes more apparent.

XSDs provide you with a finer level of control over the cardinality of the tags and the attribute types. In addition, XSDs can be used to set up a system of schema inheritance in which more complex types are built atop existing ones.

All in all, if you currently have a huge, complex DTD, probably the best thing you can do is continue working with it while you carefully plan a migration to XSDs. DTDs and XSDs are both renowned standards, but especially if you are exchanging data between heterogeneous platforms, you're more likely to find a DTD-compliant parser than an XSD-compliant one. This situation will change over time, but not anytime soon. Check the supported functions for the XML parsers available on the target platform carefully before you drop DTDs.

Using XDR Schemas

As mentioned, XML-Data Reduced (XDR) schema validation is the result of a Microsoft implementation of an early draft of what today is XSDs. XDR was implemented for the first time in the version of MSXML that shipped with Microsoft Internet Explorer 5.0, back in the spring of 1999.

In the XDR schema specification, you'll find almost all of the ideas that characterize XSDs today. The main reason for XDR support in the .NET Framework is backward compatibility with existing MSXML-based applications.
To enable these applications to upgrade properly to the .NET Framework, XDR support has been retained intact. You will not find XDR support anywhere outside the Microsoft Windows platform, however.

If you have used Microsoft ActiveX Data Objects (ADO), and in particular the library's ability to persist the contents of a Recordset object to XML, you are probably a veteran of XDR. In fact, the XML schema used to persist ADO 2.x Recordset objects to XML is simply XDR.


Overview of XDR Schemas

The example XML document data_dtd.xml used to demonstrate DTDs contains information about the modules in which a given class is articulated. The following listing shows the XDR schema that provides a full description of the class:


Compared to the DTD schema, this XDR schema is certainly more verbose, but it also provides more detailed information. The idea behind an XDR schema is that you define attribute and element types and then use those entities to construct the hierarchy that makes the target document. For example, let's analyze more closely the block that refers to the root node, shown here:

The root element is declared as an element type, with the subtree formed by all the nodes located one level down from it: in this case, a single child element and a few attributes. Both attributes and child nodes have a type property that refers to other ElementType or AttributeType schema nodes. From this structure, it's easy to see how validating parsers work to verify the correctness of a node against a schema, be it XDR or XSD. They simply validate the node attributes and the child nodes one level down. By applying this simple algorithm recursively, they traverse and validate the entire tree.

From our sample XDR file, you can also appreciate the schema enhancements over the DTD model. In particular, you can set the type for each attribute and strictly control the cardinality of each node by using the minOccurs and maxOccurs properties. With DTDs, on the other hand, you are limited to the ?, *, and + occurrence operators and cannot declare an arbitrary fixed range of occurrences for a given node.

Looking ahead to XSD, you'll notice that the key improvement concerns typing. XSD defines a type system that extends the XDR type system and that, more importantly, has a direct counterpart in the .NET Framework type system.
(I'll have more to say about this in the section ".NET Type Mapping," on page 109.)

Validating Against an XDR

An XML document can include its XDR schema as in-line code or simply link it as an external resource. The XmlValidatingReader class determines that a given document requires XDR validation if an x-schema namespace declaration is found. The following sample document, named data_xdr.xml, points to an XDR schema stored in an external resource, the schema.xml file:


[Listing data_xdr.xml; only the element text survives: XML Core Classes, Related Technologies, XML and ADO.NET, XML and Applications, XML Interoperability]

The following code snippet demonstrates how to set up an instance of the XmlValidatingReader class to make it validate a file using XDR:

    // Wrap a plain text reader in a validating reader and request XDR validation.
    XmlTextReader _coreReader = new XmlTextReader("data_xdr.xml");
    XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
    reader.ValidationType = ValidationType.XDR;
    reader.ValidationEventHandler += new ValidationEventHandler(MyHandler);
    // Reading the whole document triggers validation node by node.
    while (reader.Read());

This is in no way different from what you've seen for DTD and what you will see for XSD in the section "Validating Against an XSD Document," on page 130. When you require XDR validation and no XDR schema information exists in the XML document, the parser always returns a warning similar to the one shown in Figure 3-5.

Figure 3-5: The parser has attempted to use XDR validation on a DTD-driven XML document.

The XML format for an ADO recordset provides the perfect, real-world example of an XML document that contains in-line XDR schema information, as shown here:


[Listing northwind.xml; only the end of the root tag survives: xmlns:z='#RowsetSchema'>]

This simple recordset contains just three columns taken from the Employees table in the Microsoft SQL Server 2000 Northwind database. The XDR schema is placed in-line under the schema tag. The structure of the document is expressed using a single element node (named row) and one attribute node for each column in the result set. Figure 3-6 demonstrates that this file (northwind.xml) is perfectly validated by the .NET XDR parser.

Figure 3-6: When the sample application operates in XDR validation mode, it can easily process XML files created by ADO.

Using the XML Schema API

As mentioned, XSD is a W3C recommendation that provides the tools you need to define the structure, contents, and semantics of an XML document. Compared to DTDs and XDRs, XSD has two key advantages. First, it is the official W3C recommendation for defining the structure of XML data. Second, it is the newest schema technology and, as such, has been built specifically to fix bugs and flaws in the other schemas (mostly problems with DTDs). And remember, more than an alternative schema technology, XDR is Microsoft's implementation of an early working draft of the XML Schema specification.

Although no developer can seriously think of denying the significance of XML, many perceive XML as a sort of extraneous entity that lies outside the main body of the code and that must be integrated through distinct objects. XML parsers process strings made of text and markup and come up with binary representations of that content. When you try to integrate this with the rest of the caller program, you must effectively transform text content into more specific data types.

The same issue arises in the other direction. To export your binary objects to XML, you perform a kind of text serialization that looks more like a normalization of involved types with subsequent loss of type information. You shouldn't be surprised by this information loss, because XML doesn't have a type system.

DTD is a format designed to describe the structure and the contents of a document rather than to endow XML with an effective type system. XDR, on the other hand, introduces the concept of typed attributes. XSD thinks a little bigger.
Not only does it reinforce the importance of typed attributes, but it also distinguishes between simple and complex types, simplifies type inheritance, and exposes a full-blown and official XML type system.

The .NET Framework has been designed around XML standards, including XSD. Although the .NET Framework type system is a separate entity from the XML type system, a conversion API does exist that greatly simplifies software interaction through integration technologies such as SOAP and Web services.

What Is a Schema, Anyway?

A schema is an XML file (with typical extension .xsd) that describes the syntax and semantics of XML documents using a standard XML syntax. An XML schema specifies the content constraints and the vocabulary that compliant documents must accommodate. For example, compliant documents must fulfill any dependencies between nodes, assign attributes the correct type, and give child nodes the exact cardinality.

The XML Schema specification is articulated into two distinct parts. Part I contains the definition of a grammar for complex types, that is, composite XML elements. Part II describes a set of primitive types (the XML type system) plus a grammar for creating new primitive types, said to be simple types. New types are defined in terms of existing types.

An XML schema also supports rather advanced and object-oriented concepts such as type inheritance.
In the .NET Framework, the Schema Object Model (SOM) provides a suite of classes held in the System.Xml.Schema namespace to read a schema from an XSD file. These classes also enable you to programmatically create a schema that can be either compiled in memory or written to a disk file.

Simple and Complex Types

XML simple types consist of plain text and don't contain any other elements. Examples of simple types are string, date, and various flavors of numbers (long, double, and integer). XML complex types can include child elements and attributes. In practice, a complex type is always rendered as an XML subtree. A complex type can be associated only with an XML element node, whereas a simple type applies to both elements and attributes.

The diagram in Figure 3-7 illustrates the structure of the XSD type system.
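The distinction between simple and complex types can be observed from code. What follows is only a minimal sketch, not a listing from this chapter: the inline schema, the TypeKinds class, and the KindOf helper are hypothetical names made up for illustration, and the schema is compiled with XmlSchemaSet, a class from later versions of the .NET Framework, rather than with the classes discussed in this chapter.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class TypeKinds
{
    // Hypothetical schema, used only for illustration: "zip" has a simple type,
    // "address" has a complex type (a child element plus an attribute).
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='zip' type='xs:string'/>
  <xs:element name='address'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='city' type='xs:string'/>
      </xs:sequence>
      <xs:attribute name='country' type='xs:string'/>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns "simple" or "complex" for a global element declared in the schema above.
    public static string KindOf(string elementName)
    {
        var set = new XmlSchemaSet();
        set.Add(null, XmlReader.Create(new StringReader(Xsd)));
        set.Compile();

        var element = (XmlSchemaElement)set.GlobalElements[new XmlQualifiedName(elementName)];
        return element.ElementSchemaType is XmlSchemaComplexType ? "complex" : "simple";
    }

    public static void Main()
    {
        Console.WriteLine("zip: " + KindOf("zip"));
        Console.WriteLine("address: " + KindOf("address"));
    }
}
```

The element that carries only text reports a simple type, whereas the element rendered as a subtree reports a complex type.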


Figure 3-7: The XSD type hierarchy.

As you can see, both simple and complex types descend from the generic type anyType. Simple types also have their own base class, named anySimpleType. You can build new simple types from existing types and combine simple and existing complex types to create new ad hoc types by restricting, summing up, or listing features and values of the base types.

.NET Type Mapping

All the data types that can be used in XSD documents have a .NET Framework counterpart. After an XSD has been compiled into a .NET Framework representation object model, you can access it using the SOM classes. I'll have more to say on this in the section "Modifying a Schema Programmatically," on page 123.

The infoset that results from the schema compilation is also defined in the XSD recommendation and is said to be the post-schema-validation infoset (PSVI). The SOM renders the PSVI fields using read-only properties.

The pre-schema-validation infoset (that is, the infoset describing the source contents of the XSD) is built while the schema is being edited, either by reading from a file or by using the SOM programmatically. The properties that express the pre-schema-validation infoset are all read/write.

In the SOM representation of the PSVI, the constituent elements of the schema are represented with instances of the XmlSchemaDatatype class. This class features two properties: ValueType and TokenizedType.
The former returns the .NET Framework type (a System.Type) that corresponds to the XSD type, and the latter returns the XML 1.0 tokenized type of a string value (an XmlTokenizedType such as CDATA, ID, or NMTOKEN). The .NET type returned by ValueType follows the conversions listed in Table 3-6.

Table 3-6: Mapping Between XSD and .NET Types

XSD Type            .NET Type                    Description
anyURI              System.Uri                   A URI reference
base64Binary        System.Byte[]                Base64-encoded binary data
boolean             System.Boolean               Boolean values
byte                System.SByte                 A byte, that is, an 8-bit signed integer
date                System.DateTime              Date based on the Gregorian calendar
dateTime            System.DateTime              An instant in time
decimal             System.Decimal               Decimal number with arbitrary precision
double              System.Double                Double-precision floating number
duration            System.TimeSpan              An interval of time
ENTITIES            System.String[]              List of XML 1.0 entities
ENTITY              System.String                An XML 1.0 entity
float               System.Single                Single-precision floating number
gDay                System.DateTime              Represents a day
gMonth              System.DateTime              Represents a month
gMonthDay           System.DateTime              Represents a period one day long
gYear               System.DateTime              Represents a year
gYearMonth          System.DateTime              Represents a period one month long
hexBinary           System.Byte[]                Hex-encoded binary data
ID                  System.String                An XML 1.0 ID element
IDREF               System.String                An XML 1.0 IDREF element
IDREFS              System.String[]              List of XML 1.0 IDREF elements
int                 System.Int32                 32-bit signed integer
integer             System.Decimal               Arbitrary long integer
language            System.String                Language identifier (see RFC 1766 at http://rfc.net/rfc1766.html)
long                System.Int64                 64-bit signed integer
Name                System.String                An XML name
NCName              System.String                Local name of XML elements (noncolonized)
negativeInteger     System.Decimal               Arbitrary long negative integer
NMTOKEN             System.String                An XML 1.0 NMTOKEN element
NMTOKENS            System.String[]              List of XML 1.0 NMTOKEN elements
nonNegativeInteger  System.Decimal               Arbitrary long integer >= 0
nonPositiveInteger  System.Decimal               Arbitrary long integer <= 0
normalizedString    System.String                String with normalized white spaces
NOTATION            System.String                An XML 1.0 NOTATION element
positiveInteger     System.Decimal               Arbitrary long positive integer
QName               System.Xml.XmlQualifiedName  An XML qualified name
short               System.Int16                 16-bit signed integer
string              System.String                A string type
time                System.DateTime              An instant in time
timePeriod          System.DateTime              A period of time
token               System.String                Normalized string with leading and trailing white spaces removed
unsignedByte        System.Byte                  8-bit unsigned integer
unsignedInt         System.UInt32                32-bit unsigned integer
unsignedLong        System.UInt64                64-bit unsigned integer
unsignedShort       System.UInt16                16-bit unsigned integer

The schema compiler is a piece of code that translates between XSD types and the type system of a particular platform. In the .NET Framework, the schema compiler compiles XSD into an XmlSchema object that exposes the schema information through methods and properties.

Effective serialization between XSD and binary classes on a given platform is a feature with tremendous potential. It could supersede today's XML parsing by automatically creating an instance of a class instead of creating a generic and unwieldy XML DOM or simply passing raw data to the application. In the .NET Framework, XML serialization is accomplished using the XmlSerializer class and exploiting the services of the XML Schema definition tool (xsd.exe).
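A few rows of Table 3-6 can be checked from code by asking the framework for the .NET type behind a built-in XSD type. The snippet below is a sketch under assumptions rather than code from this chapter: the TypeMapping class and the ClrTypeOf helper are hypothetical names, and XmlSchemaType.GetBuiltInSimpleType belongs to later versions of the .NET Framework. The lookup relies on the ValueType property of XmlSchemaDatatype.

```csharp
using System;
using System.Xml;
using System.Xml.Schema;

public static class TypeMapping
{
    const string XsdNamespace = "http://www.w3.org/2001/XMLSchema";

    // Looks up a built-in XSD simple type by its local name and returns
    // the .NET type that the schema classes associate with it.
    public static Type ClrTypeOf(string xsdTypeName)
    {
        XmlSchemaSimpleType simpleType =
            XmlSchemaType.GetBuiltInSimpleType(new XmlQualifiedName(xsdTypeName, XsdNamespace));
        if (simpleType == null)
            throw new ArgumentException("Not a built-in XSD simple type: " + xsdTypeName);
        return simpleType.Datatype.ValueType;
    }

    public static void Main()
    {
        // XSD type names are case-sensitive: "boolean", not "Boolean".
        foreach (string name in new[] { "int", "long", "boolean", "double", "string" })
            Console.WriteLine(name + " -> " + ClrTypeOf(name).FullName);
    }
}
```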
I'll cover XML serialization extensively in Chapter 11.

Note
The XML Schema definition tool (xsd.exe) is an executable available with the .NET Framework SDK. You'll find it in the BIN subdirectory of the .NET Framework installation path. Normally, this path is C:\Program Files\Microsoft Visual Studio .NET\FrameworkSDK. Among other things, xsd.exe can generate a C# or Visual Basic class from an XSD file and infer an XSD from a source XML file. This tool is also responsible for all the XSD-related magic performed by Visual Studio .NET.

Defining an XSD Schema

You have three options when creating an XSD schema. You can write it manually by combining the various tags defined by the XML Schema specification. A more effective option is represented by Visual Studio .NET, which provides a visual editor for XSD files with full IntelliSense support. The third option is based on the XML Schema definition tool (xsd.exe) mentioned in the previous section, which can infer the underlying schema from any well-formed XML document.

Of these options, the first is certainly the hardest to code and the one that you will probably use less frequently. It also happens to be the most useful tool for gaining an essential knowledge of the schema's structure and internals. Don't expect to find here an exhaustive explanation of the XSD syntax. For a comprehensive programmer's reference guide, use one of the resources listed in the section "Further Reading," on page 133.

Setting Up a Sample Schema

Let's start by creating a simple schema to describe an address. Like many real-world objects, an address too is rendered using a complex type, a kind of XML data structure. The following code shows the schema for an address. It's a fairly simple schema consisting of a sequence of five elements (street, number, city, state, and zip) plus an attribute named country. All constituent elements are string types.

An XSD file begins with a schema node prefixed by the standard XML schema namespace: http://www.w3.org/2001/XMLSchema. In the schema's root node, you might want to set the targetNamespace attribute to specify the namespace of all components in the schema being defined and any schemas imported using the include element.
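To make the address schema described above more tangible, here is one way it might look when embedded in a small C# program that also validates a document against it. Treat it as a hedged reconstruction from the description (five string elements in a sequence plus a country attribute), not as the chapter's listing. The AddressValidation class and its Validate helper are hypothetical, and validation goes through XmlReaderSettings, which comes from later versions of the .NET Framework, instead of the XmlValidatingReader class used in this chapter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class AddressValidation
{
    // Hypothetical reconstruction of the schema described in the text.
    public const string AddressXsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='address' type='AddressType'/>
  <xs:complexType name='AddressType'>
    <xs:sequence>
      <xs:element name='street' type='xs:string'/>
      <xs:element name='number' type='xs:string'/>
      <xs:element name='city' type='xs:string'/>
      <xs:element name='state' type='xs:string'/>
      <xs:element name='zip' type='xs:string'/>
    </xs:sequence>
    <xs:attribute name='country' type='xs:string'/>
  </xs:complexType>
</xs:schema>";

    // Validates a document against the schema and returns the messages reported.
    public static List<string> Validate(string xml)
    {
        var messages = new List<string>();
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        settings.Schemas.Add(null, XmlReader.Create(new StringReader(AddressXsd)));
        settings.ValidationEventHandler += (sender, e) => messages.Add(e.Severity + ": " + e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { }   // reading the document drives the validation
        }
        return messages;
    }

    public static void Main()
    {
        string valid =
            "<address country='Italy'><street>One Microsoft Way</street><number>1</number>" +
            "<city>Redmond</city><state>WA</state><zip>98052</zip></address>";
        // Same content, but state comes before city, which violates the sequence.
        string outOfOrder =
            "<address country='Italy'><street>One Microsoft Way</street><number>1</number>" +
            "<state>WA</state><city>Redmond</city><zip>98052</zip></address>";

        Console.WriteLine("valid document: " + Validate(valid).Count + " message(s)");
        foreach (string message in Validate(outOfOrder))
            Console.WriteLine(message);
    }
}
```

Because the complex type is arranged as a sequence, the order of the child elements matters, and swapping two of them is enough to make the document invalid.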
Below the root node, you can find any of the top-level elements listed in Table 3-7.

Table 3-7: Top-Level Elements for XML Schema Files

Element         Description
annotation      Contains a brief annotation about the structure.
attribute       Indicates a global attribute declaration.
attributeGroup  Groups attribute declarations for further use within the body of complex type definitions.
complexType     Defines an XML complex type.
element         Indicates a global element declaration.
group           Groups element declarations for further use within the body of complex type definitions.
import          Adds to the schema some definitions belonging to a different namespace. You reference the location of the external schema using the schemaLocation attribute.
include         Adds to the schema some definitions belonging to the same namespace as the current schema. The schemaLocation attribute lets you reference the external schema.
notation        Contains the definition of a notation to describe the format of non-XML data within an XML document.
redefine        Allows you to redefine in the current schema any components imported or included from an external schema.
simpleType      Defines an XML simple type.

In the preceding source code, the XSD file has one top-level element component, named address. It is followed by the declaration of the corresponding complex type, the AddressType sequence. The sequence element specifies the sequence of permitted nodes and related types. A complex type can be arranged using exactly one of the elements listed in Table 3-8.
The element chosen specifies the content and the structure of the resultant type.

Table 3-8: Elements That Specify the Contents for Complex Types

Element         Description
simpleContent   Contains text or a simpleType; the type has no child elements.
complexContent  Contains only elements or is empty (has no element contents).
group           Contains the elements defined in the referenced group.
sequence        Contains the elements defined in the specified sequence.
choice          Lists the types of contents permitted for the type.
all             A group that allows elements to appear once and in any order.

Linking Documents and Schemas

You might want to know how an XML document can link to the schema. An XML schema can be associated with document files in two ways: as in-line code or through external references. The second option decouples the document instance and the schema. The first option, on the other hand, simplifies deployment and data transportation because all information resides in a single place.

The XSD is inserted prior to the document's root node, whether as in-line code or as an external reference. The following XML document links to the previously defined XSD through the noNamespaceSchemaLocation attribute:


[Listing address.xml; the surviving fragment reads xsi:noNamespaceSchemaLocation="address.xsd" country="Italy">, followed by the element text One Microsoft Way, 1, Redmond, WA, 98052]

The schema can be tied to a namespace by using the schemaLocation attribute, as shown here:

[Listing; only the element text survives: One Microsoft Way, 1, Redmond, WA, 98052]

In this case, the XSD (address1.xsd) must be slightly modified by adding a targetNamespace attribute and setting an xmlns attribute to the target namespace URI, as follows:

Needless to say, the target namespace must match the designated namespace URI in the source document.

Complex Type Inheritance

With complex types, you simply define XML data structures that are in no logical way different from classes of object-oriented languages such as C# or Java. One key feature of those languages is the ability to derive new data types from existing classes. The same kind of inheritance can be achieved with XML schemas. To demonstrate, we'll build a new address type that, as in many European countries, also takes the province into account.

The address.xsd schema considered up to now contains more than just the definition of a complex type; it also contains a global element that will be included in any compliant document as an instance of the type. Let's first create a base class for the schema and name it xaddress.xsd, as shown in the following code. The new file differs from the earlier version in only one aspect: it now lacks the global element declaration.


The next step is to define a new schema for a type named EuAddressType. You use the include tag to import the existing address construct from the base type declaration, as shown in the following code:

At this point, you can declare the global element that, of course, will be of the new EuAddressType type, as follows:

Using the original address.xsd schema (with a global element of type AddressType) would raise a conflict because the address tag would be repeated. The final step is to define the extensions (or the restrictions) that characterize the new type. You use the extension tag or the restriction tag as needed. The following code adds a string element for the province to the definition:
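As a rough illustration of the extension mechanism, the following sketch derives EuAddressType from AddressType and then inspects the relationship through the SOM. Several things here are assumptions of mine and not taken from the chapter's listings: the element name province, the choice to keep both types in a single inline schema (so that the sample needs no files and no include tag), the empty state element in the sample document, and the EuAddress class with its helpers. Compilation goes through XmlSchemaSet, which comes from later versions of the .NET Framework.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class EuAddress
{
    // Hypothetical schema: base and derived type live in one document here,
    // whereas the text splits them across two files joined by the include tag.
    public const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:complexType name='AddressType'>
    <xs:sequence>
      <xs:element name='street' type='xs:string'/>
      <xs:element name='number' type='xs:string'/>
      <xs:element name='city' type='xs:string'/>
      <xs:element name='state' type='xs:string'/>
      <xs:element name='zip' type='xs:string'/>
    </xs:sequence>
    <xs:attribute name='country' type='xs:string'/>
  </xs:complexType>
  <xs:complexType name='EuAddressType'>
    <xs:complexContent>
      <xs:extension base='AddressType'>
        <xs:sequence>
          <xs:element name='province' type='xs:string'/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name='address' type='EuAddressType'/>
</xs:schema>";

    public static XmlSchemaSet Compile()
    {
        var set = new XmlSchemaSet();
        set.Add(null, XmlReader.Create(new StringReader(Xsd)));
        set.Compile();
        return set;
    }

    // Describes how a named global type was derived, for example "Extension of AddressType".
    public static string Derivation(string typeName)
    {
        var type = (XmlSchemaType)Compile().GlobalTypes[new XmlQualifiedName(typeName)];
        return type.DerivedBy + " of " + type.BaseXmlSchemaType.Name;
    }

    // Counts the validation errors raised while reading the document.
    public static int CountErrors(string xml)
    {
        int errors = 0;
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        settings.Schemas = Compile();
        settings.ValidationEventHandler += (sender, e) => errors++;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { }
        }
        return errors;
    }

    public static void Main()
    {
        string document =
            "<address country='Italy'><street>Via dei Tigli</street><number>123</number>" +
            "<city>Lamiacitta</city><state/><zip>12345</zip><province>Rm</province></address>";
        Console.WriteLine(Derivation("EuAddressType"));
        Console.WriteLine("errors: " + CountErrors(document));
    }
}
```

The elements added by the extension follow those of the base type, so the province element must come last in a compliant document.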


The following XML file is now perfectly valid:

[Listing; only the element text survives: Via dei Tigli, 123, Lamiacitta, 12345, Rm]

The validation program ValidateDocument described in the section "The XmlValidatingReader in Action," on page 81, successfully checks the schema conformance of the preceding document, as shown in Figure 3-8. In the section "Validating Against an XSD Document," on page 130, we'll examine in more detail what happens when an instance of the XmlValidatingReader class is called to process an XML schema.

Figure 3-8: The EuAddressType schema is successfully checked.


Creating an XML Schema with Visual Studio .NET

Visual Studio .NET provides a visual editor, the XML Editor, for XSD files. Instead of handling the intricacies of schema markup yourself, you can simply edit XML files using the drag-and-drop features and shortcut menus provided by the editor.

Figure 3-9 shows the XSD file from the previous section as it appears in the Visual Studio editor. The figure shows the components of the XSD file: a global element of type AddressType and the corresponding definition of the global element's complex type.

Figure 3-9: Sample XSD file edited with Visual Studio .NET.

As mentioned, Visual Studio .NET can also dynamically infer the schema from the currently displayed XML file. The task is actually accomplished by xsd.exe and can be easily repeated and controlled programmatically. The command line to use this tool is fairly straightforward, as shown here:

    xsd.exe file.xml

Let's ask Visual Studio .NET to infer the schema for the sample address.xml file, the file we designed to be compliant with the address.xsd schema. One would expect to obtain a file nearly identical to address.xsd. However, surprisingly enough, the resultant schema seems to be different, as shown in the following code. The schema inferred is lexically different from address.xsd but completely equivalent in terms of semantics.


[Listing of the inferred schema; only these attributes of the root node survive:]

    xmlns:mstns="http://tempuri.org/address1.xsd"
    xmlns="http://tempuri.org/address1.xsd"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
    attributeFormDefault="qualified"
    elementFormDefault="qualified">

If the difference isn't obvious from looking at the source code, take a quick look at the file in the XML Editor, as shown in Figure 3-10.
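Schema inference can also be tried from code. The sketch below does not call xsd.exe; it swaps in the XmlSchemaInference class, which later versions of the .NET Framework expose in System.Xml.Schema, so its output should not be expected to match the Visual Studio .NET result line for line. The InferDemo class and its helpers are hypothetical names. The point it illustrates is the one made in the text: a schema inferred from a document carries no named complex type such as AddressType.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class InferDemo
{
    // Infers a schema from the document text and compiles it.
    public static XmlSchemaSet Infer(string xml)
    {
        var inference = new XmlSchemaInference();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            XmlSchemaSet set = inference.InferSchema(reader);
            set.Compile();
            return set;
        }
    }

    // True when the global element is declared with an inline, unnamed type
    // instead of a reference to a named type.
    public static bool HasAnonymousType(XmlSchemaSet set, string elementName)
    {
        var element = (XmlSchemaElement)set.GlobalElements[new XmlQualifiedName(elementName)];
        return element.SchemaTypeName.IsEmpty && element.SchemaType != null;
    }

    public static void Main()
    {
        string document =
            "<address country='Italy'><street>One Microsoft Way</street><number>1</number>" +
            "<city>Redmond</city><state>WA</state><zip>98052</zip></address>";

        XmlSchemaSet set = Infer(document);
        Console.WriteLine("anonymous type: " + HasAnonymousType(set, "address"));

        // Print the inferred schema for inspection.
        foreach (XmlSchema schema in set.Schemas())
            schema.Write(Console.Out);
    }
}
```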


Figure 3-10: The graphical representation of the schema that Visual Studio inferred from the sample document.

The global address element is now described as simple content, as shown in the following code, and there is no reference to a named complex type like AddressType. In addition, the instance of the global element in the page is inserted using the ref keyword instead of the keyword pair name/type.

In the address.xsd schema, the address element was defined using the name/type pair, like this:

The ref attribute lets you declare an element that uses an existing element definition. You use the name/type pair when the element is of a previously defined, or included, complex type. The ref and name attributes are mutually exclusive.

Note
To understand the reason for such apparently odd behavior, consider the input data that you pass to Visual Studio .NET (and, under the hood, xsd.exe). Visual Studio .NET simply infers the schema, which means that it tries to figure out the schema based on the only observable source, the document text. In the source text, however, there is no mention of any complex type declarations. That's why the layout is correctly guessed but rendered using a simple content element.

The .NET Schema Object Model

Visual Studio .NET is not the only commercial tool capable of creating XML schemas in a visual fashion. XML Spy, for example, is another popular tool. The more powerful a tool is, however, the more details are hidden from the users.


For an effective programmatic manipulation of an XML schema, you need an object model. An object model enables you to build and edit schema information in memory. It also gives you access to each element that forms the schema, and it exposes read/write properties that mirror the pre-schema-validation and post-schema-validation infoset specifications.

The .NET Framework provides a hierarchy of classes under the System.Xml.Schema namespace to edit existing schemas or create new ones from the ground up. The root class of the hierarchy is XmlSchema. Once your application holds an instance of the class, it can load an existing XSD file and populate the internal properties and collections with the contained information. By using the XmlSchema programming interface, you can then add or edit elements, attributes, and other schema components. Finally, the class exposes a Write method that allows you to persist the current contents of the schema to a valid stream object.

Reading a Schema from a File

You can create an instance of the XmlSchema class in two ways. You can use the default constructor, which returns a new, empty instance of the class, or you can use the static Read method.

The Read method operates on schema information available through a stream, a text reader, or an XML reader. The schema returned is not yet compiled.
The Read method accepts a second argument: a validation event handler such as the ones discussed in the section "The XmlValidatingReader Programming Interface," on page 78. You can set this argument to null, but in this case you won't be able to catch and handle validation errors. The following code shows how to read and compile a schema using the .NET SOM:

XmlSchema schema;
XmlTextReader reader = new XmlTextReader(filename);
schema = XmlSchema.Read(reader, null);
schema.Compile(null);
//
// Do something here
//
⋮
reader.Close();

Once the schema has been compiled, you can access the constituent elements of the schema as defined by the PSVI. To access the actual types in the schema, you use the SchemaTypes collection. One of the differences between the information available before and after compilation is that an included complex type will not be detected until the schema is compiled. For example, in eu_address.xsd, we extended the AddressType type after importing it through the include tag. To programmatically detect the presence of the AddressType complex type, however, you must first compile the schema, which expands the include element that imports the type definition.

The following code snippet demonstrates how to get the list of complex types defined in the specified schema after compilation:

void ListComplexTypes(string filename)
{
    XmlSchema schema;


    // Open the XML reader
    XmlTextReader reader = new XmlTextReader(filename);
    try {
        schema = XmlSchema.Read(reader, null);
        schema.Compile(null);
    }
    catch {
        reader.Close();
        Console.WriteLine("Invalid schema specified.");
        return;
    }
    Console.WriteLine("{0} element(s) found.",
        schema.SchemaTypes.Count.ToString());

    // Loop through the collection of types
    foreach(XmlSchemaObject o in schema.SchemaTypes.Values)
    {
        if (o is XmlSchemaComplexType)
        {
            XmlSchemaComplexType t = (XmlSchemaComplexType) o;
            Console.WriteLine("{0} -- {1}", t.Name, o.ToString());
        }
        else
            // Printed once for each entry that is not a complex type
            Console.WriteLine("No complex types found");
    }
    reader.Close();
}

Figure 3-11 shows the tool in action on the eu_address.xsd schema.

Figure 3-11: Getting the list of complex types defined in the given XSD file.

Modifying a Schema Programmatically

After the schema has been read into memory, you can manipulate the structure of the schema, with the obvious limitation that indirect tags such as include, import, and redefine will be detected only as individual objects. These three tags, for example, will be detected as XmlSchemaInclude, XmlSchemaImport, and XmlSchemaRedefine,


respectively, but the effect they have on the overall schema and contained types is not yet perceived.

Immediately after reading a schema, however, you can edit its child items by adding new elements and removing existing ones. When you have finished, you compile the schema and, if all went fine, save it to disk. Compiling the schema prior to persisting changes is not strictly necessary to get a valid schema, but it helps to verify whether any errors were introduced during editing.

The following applet reads a schema from disk, verifies that it contains a particular complex type, and then extends the structure of the type by adding a new element. The type processed is AddressType, which is edited with the addition of a new provinceInitials node. The provinceInitials node is expected to contain the two uppercase initials of the province.

void EditComplexTypes(string filename)
{
    // Open and read the XML reader into a schema object
    XmlSchema schema;
    XmlTextReader reader = new XmlTextReader(filename);
    schema = XmlSchema.Read(reader, null);
    reader.Close();

    // Verify that the AddressType complex type is there
    XmlSchemaComplexType ct = GetComplexType(schema, "AddressType");
    if (ct == null)
    {
        Console.WriteLine("No type [AddressType] found.");
        return;
    }

    // Create the new element
    XmlSchemaElement provElem = new XmlSchemaElement();
    provElem.Name = "provinceInitials";

    // Define the in-line type of the element
    XmlSchemaSimpleType provinceType = new XmlSchemaSimpleType();
    XmlSchemaSimpleTypeRestriction provinceRestriction;
    provinceRestriction = new
        XmlSchemaSimpleTypeRestriction();
    provinceRestriction.BaseTypeName = new XmlQualifiedName("string",
        "http://www.w3.org/2001/XMLSchema");
    provinceType.Content = provinceRestriction;

    // Set the (in-line) type of the element
    provElem.SchemaType = provinceType;


    // Define the pattern for the content
    XmlSchemaPatternFacet provPattern = new XmlSchemaPatternFacet();
    provPattern.Value = "[A-Z]{2}";
    provinceRestriction.Facets.Add(provPattern);

    // Get the sequence for the AddressType
    XmlSchemaSequence seq = (XmlSchemaSequence) ct.Particle;
    seq.Items.Add(provElem);

    // Compile the schema
    schema.Compile(null);

    // Save the schema
    XmlTextWriter writer = new XmlTextWriter("out.xsd", null);
    writer.Formatting = Formatting.Indented;
    schema.Write(writer);
    writer.Close();
}

This code reads the schema using an XML reader and checks for a complex type named AddressType. If the type is not found, the application exits immediately. The search for a complex type is performed by scanning the contents of the schema's Items collection of XmlSchemaObject objects. XmlSchemaObject is the base class for all schema components. Figure 3-12 shows a non-exhaustive diagram of the schema object relationships. (See the section "Further Reading," on page 133, for additional references.)
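To see what the reader hands back before compilation, it can help to list the runtime class of each top-level object. The following is a minimal sketch, not part of the sample application: the class name SchemaInspector and the in-memory sample schema are hypothetical, and the code assumes that include directives surface through the schema's Includes collection while types and elements surface through Items.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaInspector
{
    // Hypothetical sample schema; the type and element names are illustrative only.
    public const string SampleXsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:include schemaLocation='address.xsd' />" +
        "<xs:complexType name='EuAddressType'>" +
        "<xs:sequence><xs:element name='country' type='xs:string' /></xs:sequence>" +
        "</xs:complexType>" +
        "<xs:element name='address' type='EuAddressType' />" +
        "</xs:schema>";

    // Returns the class name of every include directive and top-level item.
    public static List<string> ListTopLevelObjects(string xsdText)
    {
        XmlSchema schema;
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xsdText)))
        {
            // Read only; the schema is deliberately left uncompiled
            schema = XmlSchema.Read(reader, null);
        }

        List<string> names = new List<string>();
        foreach (XmlSchemaObject o in schema.Includes)
            names.Add(o.GetType().Name);
        foreach (XmlSchemaObject o in schema.Items)
            names.Add(o.GetType().Name);
        return names;
    }

    public static void Main()
    {
        foreach (string name in ListTopLevelObjects(SampleXsd))
            Console.WriteLine(name);
    }
}
```

Because the schema is never compiled here, the include directive is reported as an object in its own right and the file it points to is not loaded.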


Figure 3-12: The XmlSchemaObject class and some of its descendants.

The Items collection picks up all the top-level elements found below the schema root node. All elements that can be safely cast to XmlSchemaComplexType have their Name property checked against the requested type, as shown here:

XmlSchemaComplexType GetComplexType(XmlSchema schema, string typeName)
{
    XmlSchemaComplexType ct;
    foreach(XmlSchemaObject o in schema.Items)
    {
        if (o is XmlSchemaComplexType)
        {
            ct = (XmlSchemaComplexType) o;
            if (ct.Name == typeName)
                return ct;
        }
    }
    return null;
}

Once a reference to the complex type has been found, the code proceeds by creating the new schema element. The relative type is declared in-line in the


body of the element. It is a simple type defined as a restriction of the primitive XSD string type. When you define an XSD simple type by restriction, you apply some facets to it. A facet is a property that narrows the set of values allowed for that element. For example, length, minInclusive, and maxInclusive are facets that determine, respectively, the length of a value and the range of accepted values. Each facet defined in the XML Schema 1.0 specification has a corresponding class in the .NET SOM.

The provinceInitials element must fulfill a number of requirements. It has to be an uppercase string with a fixed length (2 characters). The pattern facet available in the XML Schema specification supports regular expressions to control the contents of an element at the finest level. The following code sets the uppercase and fixed-length constraints. (For more information about regular expressions, refer to the section "Further Reading," on page 133.)

XmlSchemaPatternFacet provPattern = new XmlSchemaPatternFacet();
provPattern.Value = "[A-Z]{2}";
provinceRestriction.Facets.Add(provPattern);

After the new element has been defined and given a type, you add it to the sequence of elements that form the type you want to extend. In this case, the provinceInitials element must become the next element in the sequence compositor of the AddressType type.
The programming interface of a complex type lets you access the sequence component through the Particle property, as shown here:

XmlSchemaSequence seq = (XmlSchemaSequence) ct.Particle;
seq.Items.Add(provElem);

At this point, the editing phase approaches an end. The new schema is now complete; all that remains is to save it to a disk file. The code discussed up to now, when applied to the address.xsd file, produces the following schema:
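One way to inspect the effect of these edits without touching the disk is to apply the same steps to an in-memory schema and print the result. The following is a minimal sketch under stated assumptions: the class name SchemaEditSketch is hypothetical, and the cut-down AddressType (with placeholder street and city elements) only stands in for the real address.xsd, whose content is not assumed here. The compile step is skipped, which the text notes is not strictly necessary before saving.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaEditSketch
{
    const string XsdNs = "http://www.w3.org/2001/XMLSchema";

    // Cut-down stand-in for address.xsd; street and city are placeholder names.
    public const string SampleXsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:complexType name='AddressType'><xs:sequence>" +
        "<xs:element name='street' type='xs:string' />" +
        "<xs:element name='city' type='xs:string' />" +
        "</xs:sequence></xs:complexType>" +
        "</xs:schema>";

    // Adds a provinceInitials element to AddressType and returns the edited schema text.
    public static string AddProvinceInitials(string xsdText)
    {
        XmlSchema schema;
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xsdText)))
        {
            schema = XmlSchema.Read(reader, null);
        }

        // Locate AddressType among the top-level items
        XmlSchemaComplexType ct = null;
        foreach (XmlSchemaObject o in schema.Items)
        {
            XmlSchemaComplexType candidate = o as XmlSchemaComplexType;
            if (candidate != null && candidate.Name == "AddressType")
                ct = candidate;
        }
        if (ct == null)
            throw new InvalidOperationException("No type [AddressType] found.");

        // In-line simple type: xs:string restricted by the [A-Z]{2} pattern
        XmlSchemaPatternFacet pattern = new XmlSchemaPatternFacet();
        pattern.Value = "[A-Z]{2}";
        XmlSchemaSimpleTypeRestriction restriction = new XmlSchemaSimpleTypeRestriction();
        restriction.BaseTypeName = new XmlQualifiedName("string", XsdNs);
        restriction.Facets.Add(pattern);
        XmlSchemaSimpleType provinceType = new XmlSchemaSimpleType();
        provinceType.Content = restriction;

        XmlSchemaElement provElem = new XmlSchemaElement();
        provElem.Name = "provinceInitials";
        provElem.SchemaType = provinceType;

        // Append the element to the sequence of AddressType
        ((XmlSchemaSequence) ct.Particle).Items.Add(provElem);

        // Serialize the edited schema to a string instead of a file
        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);
        writer.Formatting = Formatting.Indented;
        schema.Write(writer);
        writer.Close();
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(AddProvinceInitials(SampleXsd));
    }
}
```

Writing to a StringWriter rather than to out.xsd keeps the sketch self-contained; swapping in a file-based XmlTextWriter gives the behavior of the listing discussed above.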


Application-Embedded Schemas

Schema information is fundamental for letting client applications know about the structure of the XML data they get from servers. Especially in distributed applications, however, schema information is just an extra burden that takes up a portion of the bandwidth.

In some situations, you can treat the schema like the debug information in Windows executables: indispensable during the development of the application; useless and unneeded once the application is released. This pattern does not apply to all applications but, where possible, constitutes an interesting form of optimization. After the two communicating modules agree on an XML format and this format is hard-coded in software, how can the format of the XML data being exchanged be different?

When the generation of XML documents is not completely controlled by the involved applications, schema validation ceases to be an optional feature. Thanks to the SOM, however, there's still room for optimizing the use of the bandwidth by not sending the schema information along with the document. The first option that comes to mind is that the client application stores the schema locally and loads it when needed to validate incoming documents.
For .NET Framework applications, the XmlSchema.Read static method is just what you need to load existing schema files.

An alternative option entails creating and compiling a schema object dynamically and then using it to validate documents. The code discussed in the previous section provides a concrete example of how .NET Framework applications can use the SOM to create schemas on the fly.

Note
Several applications in Windows incorporate an internal schema parser. Apparently, those applications don't require you to specify a schema. If you pass them an XML document that does not comply with the internal schema, however, an error is raised. An application that works in this way is the Windows Script Host (WSH) environment (wscript.exe), the Windows shell-level script environment. Along with plain VBScript and JScript files, WSH supports an XML-based format characterized by a .wsf extension. Those files do not require schema information, but if you violate the documented layout rules, the file is not processed.

Deterministic and Nondeterministic Schemas

A schema validating parser works by matching the structure of the underlying XML document with the referenced XML schema document.
By compiling the schema, the parser gets enough information to determine whether a given node in the source XML document conforms to the layout depicted by the XSD.

As the parser moves from one node to the next, two different situations can occur. Either the parser can unambiguously match the current node structure with a valid XSD sequence or it can't. If exactly one match is found, the process can continue. If no match is found, the source document does not follow the XML schema. Parsing stops, and an exception is raised. A schema in which the match between one XML node and one XSD sequence is unique (if any) is said to be deterministic. Our sample address schema is deterministic, and the SOM parser processes it successfully.

Other flavors of XML schemas are called nondeterministic because the number of matches found can exceed one. In this case, the parser must look ahead to try to


determine the correct sequence and identify the correct piece of PSVI information. Nondeterministic does not mean invalid, but not all parsers can successfully handle such schemas. The .NET Framework schema parser, for example, does not support nondeterministic schemas. All files written according to the following (valid) schema are inevitably rejected:

The choice element makes the schema inherently more prone to become nondeterministic. The choice element permits exactly one of its child schema elements. However, when those child elements are sequences that begin with the same element, the schema becomes nondeterministic.

In the preceding XSD, as soon as the parser moves to the street node, it detects an ambiguity. What is the correct XmlSchemaSequence class to take into account? The correct class can be determined only by looking a certain number of nodes ahead. In this very unfortunate case, the parser would need to look at least five nodes ahead. Some parsers support the forward-checking feature up to a fixed number of nodes; some do not. The .NET SOM parser requires the schema to be deterministic. Figure 3-13 shows what happens when the sample application ValidateDocument grapples with a nondeterministic schema.
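The ambiguity can be reproduced with a small experiment. The following is a minimal sketch, not the ValidateDocument sample: the two schemas are hypothetical (only the street element name comes from the discussion above), and compilation goes through XmlSchemaSet, the class that later versions of the .NET Framework provide for schema compilation, rather than through the XmlSchema.Compile call used elsewhere in this chapter. The sketch assumes that the compiler reports an ambiguous content model as a validation error.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class DeterminismCheck
{
    // Hypothetical schema: both branches of the choice begin with street
    public const string AmbiguousXsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:element name='address'><xs:complexType><xs:choice>" +
        "<xs:sequence>" +
        "<xs:element name='street' type='xs:string' />" +
        "<xs:element name='zip' type='xs:string' />" +
        "</xs:sequence>" +
        "<xs:sequence>" +
        "<xs:element name='street' type='xs:string' />" +
        "<xs:element name='postcode' type='xs:string' />" +
        "</xs:sequence>" +
        "</xs:choice></xs:complexType></xs:element>" +
        "</xs:schema>";

    // Same vocabulary, with the shared street element factored out of the choice
    public const string DeterministicXsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:element name='address'><xs:complexType><xs:sequence>" +
        "<xs:element name='street' type='xs:string' />" +
        "<xs:choice>" +
        "<xs:element name='zip' type='xs:string' />" +
        "<xs:element name='postcode' type='xs:string' />" +
        "</xs:choice>" +
        "</xs:sequence></xs:complexType></xs:element>" +
        "</xs:schema>";

    // Compiles the schema and returns the number of errors reported.
    public static int CountCompileErrors(string xsdText)
    {
        int errors = 0;
        ValidationEventHandler handler = delegate(object sender, ValidationEventArgs e)
        {
            if (e.Severity == XmlSeverityType.Error)
            {
                errors++;
                Console.WriteLine(e.Message);
            }
        };

        XmlSchema schema;
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xsdText)))
        {
            schema = XmlSchema.Read(reader, handler);
        }

        XmlSchemaSet set = new XmlSchemaSet();
        set.ValidationEventHandler += handler;
        set.Add(schema);
        set.Compile();
        return errors;
    }

    public static void Main()
    {
        Console.WriteLine("Ambiguous schema errors: {0}", CountCompileErrors(AmbiguousXsd));
        Console.WriteLine("Deterministic schema errors: {0}", CountCompileErrors(DeterministicXsd));
    }
}
```

The second schema shows the usual remedy: moving the common leading element in front of the choice leaves the parser with exactly one candidate at every node, so no look-ahead is needed.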


Figure 3-13: The .NET SOM parser complains about the nondeterministic nature of the schema.

Validating Against an XSD Document

After this long digression into the XML Schema API in the .NET Framework, let's conclude this chapter by looking at what happens when the XmlValidatingReader class is called to operate on an XML file that includes, or references, an XML schema.

The following code shows how to set up the XML validator class to work on XSD files:

XmlTextReader _coreReader = new XmlTextReader(fileName);
XmlValidatingReader reader = new XmlValidatingReader(_coreReader);
reader.ValidationType = ValidationType.Schema;
reader.ValidationEventHandler += new ValidationEventHandler(MyHandler);
while(reader.Read());

When the ValidationType property is set to Schema, the parser tries to proceed anyway, even if the source file has no link to a schema file.

An interesting phenomenon occurs when the XML schema is embedded in the XML document that is being validated. In this case, the schema appears as a constituent part of the source document. In particular, it is a direct child of the document root element.

The schema is an XML subtree that is logically placed at the same level as the document to validate.
A well-formed XML document can't have two roots, however. Thus an all-encompassing root node with two children, the schema and the document, must be created, as shown here:

Applied XML Programming for Microsoft(r) .NET


The root element can't be successfully validated because there is no schema information about it. When the ValidationType property is set to Schema, the XmlValidatingReader class returns a warning for the root element if an in-line schema is detected, as shown in Figure 3-14. Be aware of this when you set up your validation code. A too-strong filter for errors could signal as incorrect a perfectly legal XML document if the XSD code is embedded.

Figure 3-14: The validating parser returns a warning when the ValidationType property is set to Schema and an in-line schema is used.

Note
The warning you get from XmlValidatingReader is only the tip of the iceberg. Although XML Schema as a format is definitely a widely accepted specification, the same can't be said for in-line schemas. An illustrious victim of this situation is the XML code you obtain from the WriteXml method of the DataSet object when the XmlWriteMode.WriteSchema option is set. The file you get has the XML schema in-line, but if you try to validate it using XmlValidatingReader, it does not work!

In general, the guideline is to avoid in-line XML schemas whenever possible.
This improves the bandwidth management (the schema is transferred one time at most) and shields you from bad surprises. As for the DataSet object, if you move the schema to a separate file and reference it from within the DataSet object's serialized output, everything works just fine. Alternatively, with the XmlValidatingReader object, you can preload the schema in the schema cache and then proceed with the parsing of the source. We'll delve deeper into DataSet serialization issues in Chapter 9.

Conclusion

XML validation is the parser's ability to verify that a given XML source document conforms to a specified layout. The intrinsic importance of validation, and related technologies, can't be denied, but a few considerations must be kept in mind.


For one thing, XML documents and schema information must be distinct elements. This improves performance when the document is transferred over the wire and keeps the memory footprint as lean as possible. In addition, validating a document to make sure it has the requested layout is not always necessary if the correctness of the data two applications exchange can be ensured by design. If the documents sent and received are generated programmatically and there is no (reasonable) way to tamper with them, validation can be an unneeded burden. In this case, you can rate the schema information as similar to debug information in Win32 executables: useful to speed up the development cycle, but useless in a production environment.

The really big thing behind XML validation is XSD, a W3C specification to define the structure, contents, and semantics of XML documents. XSD is another key element that enriches the collection of official and de facto current standards for interoperable software. It joins the group formed by HTTP for network transportation, XML for data description, SOAP for method invocation, XSL for data transformation, and XPath for queries.

With XSD, we have a standard but extremely rigorous way to describe the layout of the document that leaves nothing to the user's imagination.
XSD is the constituent grammar for the XML type system, and thanks to the broad acceptance gained by XML, it is a candidate to become a universal and cross-platform type system.

This chapter uses the features and programming interface of a special reader class, the XmlValidatingReader class, to demonstrate how XML validation is accomplished in the .NET Framework. In doing so, we have inevitably touched on the technologies that are involved with the schema definition: from the still-flourishing DTD, to the newest and standard XSD, and passing through the intermediate, and mostly Microsoft proprietary, XDR.

For the most part, this chapter covers issues revolving around XML validating parsers. It also opens a window into the world of XML-related technologies.

Further Reading

XML sprang to life in the late 1990s as a metalanguage scientifically designed to definitively push aside SGML. If you want to learn more about this ancestor of XML, still in use in some legacy e-commerce applications, have a look at the tutorial available at http://www.w3.org/TR/WD-html40-970708/intro/sgmltut.html.

In this chapter and in this book, you won't find detailed references to the syntax and structure of XML technologies. If you need to know all about DTD attributes and XSD components, you'll need to look elsewhere.
One resource that I've found extremely valuable is Essential XML Quick Reference, written by Aaron Skonnard and Martin Gudgin (Addison Wesley, 2001). This book is an annotated review of all the markup code around XML, including XSD, XSL, XPath, and SOAP (not coincidentally, the same XML standards fully supported by the .NET Framework). Another resource I would recommend is XML Pocket Consultant, written by William R. Stanek (Microsoft Press, 2002). For online resources, check out in particular http://www.xml.com.

An excellent article that describes the big picture behind XSD, Web services, and SOAP can be found on the MSDN Magazine Web site, at http://msdn.microsoft.com/msdnmag/issues/01/11/WebServ/WebServ0111.asp. A detailed tutorial on XSD can be found at http://www.w3.org/TR/xmlschema-0. I especially recommend this tutorial if you need a complete step-by-step guide to the intricacies and wonders of the XSD as defined by the W3C.

As for regular expressions, I don't know of any book or online resource that specifically untangles this topic. On the other hand, regular expressions are covered in almost every book aimed at the .NET Framework. In particular, take a look at Chapter 12 of


Francesco Balena's Programming Visual Basic .NET Core Reference (Microsoft Press, 2002).


Chapter 4: XML Writers

Overview

Creating XML documents in a programmatic way has never been a particularly complicated issue. You simply concatenate a few strings into a buffer and then flush the buffer to a storage medium when you have finished. The process is quick, easy, and straightforward. Could you ask for more? Well, actually, you should!

XML documents are text-based files, but they also contain a lot of markup text, and as you know, dealing with markup text can at times be boring or even annoying. Beyond being a bother, supplying the necessary quotation marks and angle brackets by hand can make your code more error-prone. Creating XML documents programmatically by simply putting one string of text after another is effective as long as you can absolutely guarantee that subtle errors will never sneak into the code mainstream, which is not much different from certifying that all of your manually created code is 100 percent bug-free.

The Microsoft .NET Framework provides a more productive, and even elegant, approach to writing XML code. Based on ad hoc tools, this approach simply applies the same pattern that has been the key to XML's rapid adoption: focus on the data and ignore the rest.
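The kind of subtle error that string concatenation invites is easy to reproduce. The following is a minimal sketch, with a hypothetical class name and hypothetical element names and sample values: it pastes a value into hand-built markup and then checks whether the result can be loaded as XML.

```csharp
using System;
using System.Text;
using System.Xml;

public static class ConcatenationPitfall
{
    // Naive approach: the value is pasted into the markup exactly as received
    public static string BuildByConcatenation(string value)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("<array><element value=\"");
        sb.Append(value);
        sb.Append("\"/></array>");
        return sb.ToString();
    }

    // Returns true if the text can be parsed as a well-formed XML document
    public static bool IsWellFormed(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // A plain value produces well-formed markup
        Console.WriteLine(IsWellFormed(BuildByConcatenation("plain text")));

        // An ampersand or a quotation mark in the data breaks the document
        Console.WriteLine(IsWellFormed(BuildByConcatenation("Tom & \"Jerry\"")));
    }
}
```

Nothing in the concatenation code is wrong in isolation; the document breaks only when the data happens to contain markup characters, which is exactly why such errors tend to surface late.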
Enter XML writers.

The XML Writer Programming Interface

An XML writer represents a component that provides a fast, forward-only way of outputting XML data to streams or files. More important, an XML writer guarantees, by design, that all the XML data it produces conforms to the W3C XML 1.0 and Namespace recommendations.

Suppose you have to render in XML the contents of a string array. The following code normally fits the bill:

void CreateXmlFile(String[] theArray, string filename)
{
    StringBuilder sb = new StringBuilder("");

    // Loop through the array and build the file
    sb.Append("");
    foreach(string s in theArray)
    {
        sb.Append("");
    }
    sb.Append("");

    // Create the file
    StreamWriter sw = new StreamWriter(filename);


    sw.Write(sb.ToString());
    sw.Close();
}

The output is shown in Figure 4-1. Apparently, everything is working just fine.

Figure 4-1: The sample XML file is successfully recognized and managed by Microsoft Internet Explorer.

One small drawback is that the XML code you get is not exactly in the format you expect, namely the format shown in Internet Explorer. The source code for the XML file in Figure 4-1 has no newline characters or indentation and appears to be an endless and hardly readable string of markup text. But this is no big deal. You can simply enhance the code a little bit by adding newline and tab characters.

In general, there is nothing really bad or wrong with this approach as long as the document file you need to create is simple, has minimal structure, and has only a few levels of nesting. When you have more advanced and stricter requirements such as processing instructions, namespaces, indentation, formatting, and entities, the complexity of your code can grow exponentially and, with it, the likelihood of introducing errors and bugs.

Let's rewrite our sample file using .NET XML writers, as shown in the following code.
A .NET XML writer features ad hoc write methods for each possible XML node type and makes the creation of XML output more logical and much less dependent on the intricacies, and even the quirkiness, of the markup languages.

    void CreateXmlFileUsingWriters(String[] theArray, string filename)
    {
        // Open the XML writer (default encoding charset)
        XmlTextWriter xmlw = new XmlTextWriter(filename, null);
        xmlw.Formatting = Formatting.Indented;

        xmlw.WriteStartDocument();
        xmlw.WriteStartElement("array");
        foreach(string s in theArray)
        {
            xmlw.WriteStartElement("element");
            xmlw.WriteAttributeString("value", s);
            xmlw.WriteEndElement();
        }
        xmlw.WriteEndDocument();

        // Close the writer
        xmlw.Close();
    }

Viewed in Internet Explorer, the final output for this file is the same as we saw in Figure 4-1. However, now newline and indentation characters have been inserted as appropriate, and the source code truly looks like this:

    <?xml version="1.0"?>
    <array>
      <element value="..." />
      ...
    </array>

An XML writer is a specialized class that knows only how to write XML data to a variety of storage media. It features ad hoc methods to write any special item that characterizes XML documents: from character entities to processing instructions, from comments to attributes, and from element nodes to plain text. In addition, and more important, an XML writer guarantees well-formed XML 1.0-compliant output. And you don't have to worry about a single angle bracket or the last element node that you left open.

Important: Because more often than not an XML writer class simply creates local or remote disk files, don't be too surprised if your code causes the .NET Code Access Security (CAS) system to throw a security exception. Partially trusted applications, and in particular Microsoft ASP.NET applications with default settings, have no access to the file system. Be aware that when you use XML writers, unless you take particular measures, sooner or later a security exception will be thrown.

The XmlWriter Base Class

XML writers are based on the XmlWriter abstract class that defines the .NET Framework interface for writing XML. The XmlWriter class is not directly creatable from user applications, but it can be used as a reference type for objects that are instances of classes derived from XmlWriter. Actually, the .NET Framework provides just one class that gives a concrete implementation of the XmlWriter interface: the XmlTextWriter class.
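Because XmlWriter can serve as a reference type, an output routine can be written against the abstract class and then handed any concrete writer. The following is a minimal sketch of that idea, not code from the book: the WriterDemo class and its WriteArray and BuildArrayXml methods are hypothetical names introduced for illustration, and the writer targets an in-memory StringWriter instead of a file so that the sample does not touch the file system.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WriterDemo
{
    // Hypothetical helper: depends only on the abstract XmlWriter type.
    public static void WriteArray(XmlWriter writer, string[] items)
    {
        writer.WriteStartElement("array");
        foreach (string s in items)
        {
            writer.WriteStartElement("element");
            writer.WriteAttributeString("value", s);
            writer.WriteEndElement();
        }
        writer.WriteEndElement();
    }

    // Builds the markup in memory using the concrete XmlTextWriter class.
    public static string BuildArrayXml(string[] items)
    {
        StringWriter buffer = new StringWriter();
        XmlTextWriter xmlw = new XmlTextWriter(buffer);
        xmlw.Formatting = Formatting.Indented;

        WriteArray(xmlw, items);   // passed through an XmlWriter reference
        xmlw.Close();

        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BuildArrayXml(new string[] { "one", "two" }));
    }
}
```

Keeping WriteArray typed against XmlWriter means the same routine would keep working if a different writer class derived from XmlWriter were substituted later.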


What the XmlWriter Class Can't Do

Although powerful and considerably feature-rich, an XML writer is not perfect; it still leaves some margin for errors. To be more precise, the XmlWriter class certainly generates 100-percent well-formed code, but only if you pass on correct information. In particular, an XML writer does not check for invalid characters in element and attribute names. It also does not guarantee that any Unicode characters you use fit into the current encoding schema. As a consequence, any characters outside the encoding schema are not escaped into character entities and might lead to incorrect output.

An XML writer also does not verify duplicate attributes; it simply dumps the text out when you call the appropriate method. Nor does an XML writer validate any identifiers (for example, the SYSTEM identifier) you specify when you create a DOCTYPE node.

In addition, the XmlWriter class does not validate against any schema or document type definition (DTD). Creating a validating writer is not difficult, however; I'll give you some tips on how to build one in the section "XML Validating Writers." By the way, an XmlValidatingWriter class is just one of the extensions to the System.Xml namespace slated for the next version of the .NET Framework.

Properties of the XmlWriter Class

Table 4-1 lists the properties that belong to the XmlWriter class.

Table 4-1: Properties of the XmlWriter Class

  Property    Description
  WriteState  Read-only property that gets the state of the writer. The state can be any value taken from the WriteState enumeration and describes the element being written.
  XmlLang     Read-only property that returns the current xml:lang scope. You set the language of the document by writing an xml:lang attribute to the output stream.
  XmlSpace    Read-only property that indicates the current xml:space scope through a value taken from the XmlSpace enumeration (Default, None, or Preserve).

All of these properties are read-only and abstract; that is, they must be overridden in any derived class. The behavior described in Table 4-1 simply indicates what the properties have been designed for and does not necessarily reflect the actual behavior of these properties in a custom implementation.

In general, the XmlWriter class properties serve to track the state in which another component might have left the writer. Note that these properties belong to the current instance of the writer object. If you are using the same writer to generate more documents on the same stream, these properties are not automatically reset when you start a new document.

XML Writer States

Table 4-2 summarizes the allowable states for an XML writer. Values come from the WriteState enumeration type.
Any XML writer is expected to properly and promptly update its WriteState property as various internal operations take place.

Table 4-2: States of an XML Writer

  State      Description
  Attribute  The writer enters this state when an attribute is being written.
  Closed     The Close method has been called, and the writer is no longer available for writing operations.
  Content    The writer enters this state when the contents of a node are being written.
  Element    The writer enters this state when an element start tag is being written.
  Prolog     The writer is writing the prolog (the section that declares the element names, attributes, and construction rules of valid markup for a data type) of a well-formed XML 1.0 document.
  Start      The writer is in an initial state, waiting for a write call to be issued.

When you create a writer, its state is set to Start, meaning that you are still configuring the object and the actual writing phase has not yet begun. The next state is Prolog, which is reached as soon as you call WriteStartDocument, the first write method you call. After that, the state transition depends primarily on the type of document you are writing and its contents.

The writer remains in Prolog state while you add nonelement nodes, including comments, processing instructions, and document types. When the first element node is encountered (the document root node), the state changes to Element. The state switches to Attribute when you call the WriteStartAttribute method but not when you write attributes using the more direct WriteAttributeString method. (In the latter case, the state remains set to Element.)
Writing an end tag switches the state to Content, and when you have finished writing and call WriteEndDocument, the state returns to Start until you start another document or close the writer.

Methods of the XmlWriter Class

Table 4-3 lists some of the methods that belong to the XmlWriter class. Only methods that are not directly involved with the writing of XML elements are included here.

Table 4-3: Nonwriting Methods of the XmlWriter Class

  Method        Description
  Close         Closes both the writer and the underlying stream. The writer can't be used to write additional text. Any attempt would cause an invalid operation exception to be thrown.
  Flush         Flushes whatever is in the buffer to the underlying streams and also flushes the underlying stream. After this method is called, the writer remains active and ready to write more to the same stream.
  LookupPrefix  Takes a namespace URI and returns the corresponding prefix. In doing so, the method looks for the closest matching prefix defined in the current namespace scope.

An XML writer accumulates text in an internal buffer. Normally, the buffer is flushed, and the XML text actually written, only when the writer is closed. By calling the Flush method, however, you can empty the buffer and write the current contents down to the stream. Some working memory is freed, the writer is not closed, and the operation can continue.

For example, let's assume that you use a file as the output stream. At some point, while generating the XML content, you call Flush. As a result, the file (existing or already created by the time Flush is called) is partially populated. However, it can't be accessed by other processes because the file is locked by your process. The XML file will be unlocked and made available to other processes only when the writer is closed, an action that, in turn, closes the stream and releases any underlying resources.

Table 4-4 summarizes the key methods of the XmlWriter class for writing specific XML elements such as attributes, entities, and nodes.

Table 4-4: Writing Methods of the XmlWriter Class

  Method                      Description
  WriteAttributeString        Writes an attribute with the specified value. The method adds start and end quotation marks.
  WriteCData                  Writes a CDATA block containing the specified text. The method adds start (<![CDATA[) and end (]]>) blocks for the element.
  WriteCharEntity             Writes the specified Unicode character in hexadecimal character entity reference format. For example, the & (ampersand) character is written as &#x26;.
  WriteComment                Writes a comment. The method adds start (<!--) and end (-->) blocks for the element.
  WriteDocType                Writes the DOCTYPE declaration with the specified name and optional attributes.
  WriteElementString          Writes an element node with the specified contents. It can produce the following output with a single call: <city>Rome</city>, where city is the name of the element and Rome is the contents to write.
  WriteEndAttribute           Closes a previous call made to WriteStartAttribute.
  WriteEndDocument            Closes any open elements or attributes and returns the writer to its initial state (Start).
  WriteEndElement             Closes the innermost open element using the short end tag (/>) where appropriate. The namespace scope moves one level up.
  WriteEntityRef              Writes an entity reference with the specified name. Takes care of the leading & and the trailing semicolon (;).
  WriteFullEndElement         Closes one element by using a full end tag (that is, the </name> form rather than the short /> form). This method is similar to WriteEndElement, but it always closes the innermost element using a full end tag. Just as with WriteEndElement, the namespace scope is moved one level up.
  WriteName                   Writes the specified name, ensuring that it is a valid name according to the W3C XML 1.0 recommendation.
  WriteNmToken                Writes the specified name, ensuring that it is a valid NmToken according to the W3C XML 1.0 recommendation.
  WriteProcessingInstruction  Writes a processing instruction using the required syntax <?name text?>.
  WriteQualifiedName          Writes the namespace-qualified name after looking up the prefix that is in scope for the specified namespace.
  WriteStartAttribute         Writes the start of an attribute. Switches the writer's state to Attribute.
  WriteStartDocument          Writes the XML 1.0 standard prolog declaration.
  WriteStartElement           Writes the specified start tag for the specified element node.
  WriteString                 Writes the specified text contents. Can be used with open attributes or element nodes.
  WriteWhitespace             Writes the specified white space.

Some of these methods are abstract; some are not. In particular, the XmlWriter class provides an implementation for one-shot methods that group a few more basic calls. For example, WriteAttributeString is implemented in XmlWriter like this:

    public void WriteAttributeString(string localName, string value)
    {
        WriteStartAttribute(null, localName, null);
        WriteString(value);
        WriteEndAttribute();
    }

Other, more specialized, writing methods available in the XmlWriter interface are listed in Table 4-5.

Table 4-5: Miscellaneous Writing Methods

  Method                    Description
  WriteAttributes           Writes all the attributes found at the current position in the specified XmlReader object. This method is actually implemented in XmlWriter. (This method will be discussed in more detail in the section "A Read/Write XML Streaming Parser.")
  WriteBase64               Encodes the specified binary bytes as base64 and writes out the resulting text. (Base64 encoding is designed to represent arbitrary byte sequences in a text form composed of the 65 US-ASCII characters [A-Za-z0-9+/=], where each character encodes 6 bits of the binary data.) You decode this text using the XmlReader class's ReadBase64 method. (These methods will be discussed in more detail in the section "Writing Encoded Data.")
  WriteBinHex               Encodes the specified binary bytes as BinHex and writes out the resulting text. (BinHex is an encoding scheme that converts binary data to ASCII characters.) You decode this text using the XmlReader class's ReadBinHex method. (These methods will be discussed in more detail in the section "Writing Encoded Data.")
  WriteChars                Writes a block of characters as text to the XML stream. This method is useful when you have to write a lot of text and want to do it one chunk at a time.
  WriteNode                 Copies everything from the specified reader to the writer, moving the XmlReader object to the end of the current element. This method is actually implemented in XmlWriter. (This method will be discussed in more detail in the section "A Read/Write XML Streaming Parser.")
  WriteRaw                  Writes unencoded text either from a string or from a buffer as is. Can contain markup text that would be parsed as appropriate.
  WriteSurrogateCharEntity  Generates and writes the surrogate character entity for the surrogate character pair.

A surrogate (or surrogate pair) is a pair of 16-bit Unicode encoding values that together represent a single character. Surrogate pairs are in effect 32-bit atomic characters, although they are represented by a pair of characters (low and high char). Surrogates are critical when you use the WriteChars method to split a large amount of text. If that text, arbitrarily split, contains surrogates, some special handling must be done to ensure that surrogate pairs are not split across different chunks.

If a split happens, a generic exception (Exception class) is thrown. By catching this exception, you force the application to continue writing until the erroneously split surrogate pair is safely copied into the output buffer.

The XmlTextWriter Class

As mentioned, XmlWriter is an abstract class, although a few of its methods have a concrete implementation. In the .NET Framework, there is just one class built on top of the base XmlWriter class: the XmlTextWriter class.


XmlTextWriter provides a standard implementation for all the methods and the properties described up to now, plus a few more. It maintains an internal stack to keep track of XML elements that have been opened but not yet closed. Each element node can be directly associated with a namespace, thus becoming the root of a namespace scope. If a namespace is not specified, the element is associated with the last declared namespace.

The XmlTextWriter class has three constructors. You can have the writer work on a file or on an open stream. In both cases, you must also specify the required encoding schema, as shown in the following code. If this argument is null, the Universal Character Set Transformation Format, 8-bit form (UTF-8) character encoding set is assumed.

    public XmlTextWriter(Stream w, Encoding encoding);
    public XmlTextWriter(string filename, Encoding encoding);

The third constructor allows you to build an XML text writer starting from a TextWriter object.

Encoding Schemas

In the .NET Framework, four different character encoding schemas are defined. Each schema corresponds to a class that inherits from the Encoding class. The classes are listed in Table 4-6.

Table 4-6: Available Character Encoding Schemas

  Property          Class            Description
  Encoding.ASCII    ASCIIEncoding    Encodes Unicode characters as single 7-bit ASCII characters.
  Encoding.Unicode  UnicodeEncoding  Encodes each Unicode character as two consecutive bytes.
  Encoding.UTF7     UTF7Encoding     Encodes Unicode characters using the UTF-7 character encoding set. (UTF-7 stands for Universal Character Set Transformation Format, 7-bit form.)
  Encoding.UTF8     UTF8Encoding     Encodes Unicode characters using the UTF-8 character encoding set.

The default character encoding schema is UTF-8, which supports all Unicode character values and surrogates. UTF-8 uses a variable number of bytes per character and is optimized for the lower 127 ASCII characters.

If you want to use the default encoding, pass null as the second argument in the constructor. Otherwise, use the static properties of the Encoding class to indicate which type of encoding you want. You don't need to create a new instance of an encoding class to create a writer that encodes data in a certain way. For example, to create an ASCII stream, you use the following code:

    XmlTextWriter xmlw = new XmlTextWriter(file, Encoding.ASCII);
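The effect of the encoding argument can be sketched with a small experiment. This is an illustrative example rather than code from the book: EncodingDemo and WriteSample are hypothetical names, and a MemoryStream stands in for the file so that the bytes can be inspected directly.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class EncodingDemo
{
    // Hypothetical helper: writes a tiny document to memory with the given
    // encoding and returns the raw bytes decoded back into a string.
    public static string WriteSample(Encoding encoding)
    {
        MemoryStream stream = new MemoryStream();
        XmlTextWriter xmlw = new XmlTextWriter(stream, encoding);

        xmlw.WriteStartDocument();
        xmlw.WriteElementString("city", "Rome");
        xmlw.WriteEndDocument();
        xmlw.Close();   // also closes the stream; ToArray still works afterward

        // A null encoding means the writer fell back to UTF-8.
        Encoding decoder = (encoding != null) ? encoding : Encoding.UTF8;
        return decoder.GetString(stream.ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(WriteSample(null));
        Console.WriteLine(WriteSample(Encoding.ASCII));
    }
}
```

When an explicit encoding such as Encoding.ASCII is supplied, the XML declaration should also carry a matching encoding attribute, so comparing the two printed documents is a quick way to see which schema was in effect.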


If you want the default encoding of the system instead, use Encoding.Default. Keep in mind that character encoding classes are located in the System.Text namespace.

Properties of the XML Text Writer

Table 4-7 lists the properties that are specific to the XmlTextWriter class, that is, the properties that the class does not inherit from XmlWriter.

Table 4-7: Properties of the XmlTextWriter Class

  Property     Description
  BaseStream   Returns the underlying stream object. If you created the writer from a file, this result is a FileStream object.
  Formatting   Indicates how the output is formatted. Allowed values are found in the Formatting enumeration type: None or Indented.
  Indentation  Gets or sets the number of times to write the IndentChar white space character for each level of nesting in the XML data. This property is ignored when Formatting is set to None.
  IndentChar   Gets or sets the white space character to be used for indenting when Formatting is set to Indented.
  Namespaces   Gets or sets support for namespaces. When this property is set to false, xmlns declarations are not written. Set to true by default.
  QuoteChar    Gets or sets the character to be used to surround attribute values. Can be a single (') or a double (") quotation mark; the default is a double quotation mark.

In theory, the indentation character can be any character; the property does not exercise any control over what you choose. To create XML 1.0-compliant code, however, the value of the IndentChar property must be a white space character such as a tab, a blank, or a carriage return. By default, each level of indentation is rendered with two blanks.

Note: When the XML text writer works on a file, it opens the file in exclusive write mode. If the file does not exist, it will be created. If the file exists already, it will be truncated to zero length.

The XmlTextWriter class has no data methods in addition to those described in Table 4-3, Table 4-4, and Table 4-5 as part of the XmlWriter class interface.

Writing Well-Formed XML Text

The XmlTextWriter class takes a number of precautions to ensure that the final XML code is perfectly compliant with the XML 1.0 standard of well-formedness. In particular, the class verifies that any special character found in the passed text is automatically escaped and that no elements are written in the wrong order (such as attributes outside nodes, or CDATA sections within attributes). Finally, the Close method performs a full check of well-formedness immediately prior to return. If the verification is successful, the method ends gracefully; otherwise, an exception is thrown.
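Two of the precautions just described, automatic escaping and the rejection of out-of-order calls, can be observed with a short test. The code below is a sketch under stated assumptions: WellFormedDemo and its two methods are hypothetical names, the writer targets an in-memory StringWriter, and the out-of-order call is assumed to surface as an InvalidOperationException.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormedDemo
{
    // Special characters in the passed text are escaped by the writer.
    public static string WriteEscapedText(string text)
    {
        StringWriter buffer = new StringWriter();
        XmlTextWriter xmlw = new XmlTextWriter(buffer);

        xmlw.WriteStartElement("note");
        xmlw.WriteString(text);
        xmlw.WriteEndElement();
        xmlw.Close();

        return buffer.ToString();
    }

    // Returns true if the writer rejects an attribute that is written
    // after the text content of the element has already started.
    public static bool AttributeAfterTextIsRejected()
    {
        XmlTextWriter xmlw = new XmlTextWriter(new StringWriter());
        try
        {
            xmlw.WriteStartElement("note");
            xmlw.WriteString("some text");
            xmlw.WriteAttributeString("value", "too late");
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
        finally
        {
            xmlw.Close();
        }
    }

    public static void Main()
    {
        Console.WriteLine(WriteEscapedText("a < b & c"));
        Console.WriteLine(AttributeAfterTextIsRejected());
    }
}
```

The second method is written to return a Boolean instead of letting the exception escape, which makes the behavior easy to check from calling code.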


Other controls that the XmlTextWriter class performs on the generated XML output ensure that each document starts with the standard XML prolog, shown in the following code, and that any DOCTYPE node always precedes the document root node:

    <?xml version="1.0"?>

This said, there is no absolute guarantee that users won't write badly formed code. If the bad format can be detected, the writer throws an exception. Otherwise, the file is considered correctly written, but client applications might complain about it, as in Figure 4-2.

Figure 4-2: An XML file created with the XmlTextWriter class has a duplicated attribute that the class did not discover.

The following code demonstrates how to write two identical attributes for a specified node:

    xmlw.WriteStartElement("element");
    xmlw.WriteAttributeString("value", s);
    xmlw.WriteAttributeString("value", s);
    xmlw.WriteEndElement();

In the check made just before dumping data out, the writer neither verifies the names and semantics of the attributes nor validates the schema of the resultant document, thus authorizing this code to generate bad XML.

Building an XML Document

Up to now, we've looked at several code snippets showing the XML text writer in action, but without going into details. Let's make up for this now. The necessary steps to create an XML document can be summarized as follows:

• Initialize the document. The output stream is already open, and at this stage you simply write the XML prolog, including the XML 1.0 default declaration and any other heading information that the recommendation mandates to precede actual data nodes. (Typically, this information consists of processing instructions, schema references, and the DTD.)
• Write data. At this stage, you create XML nodes such as element nodes, attributes, CDATA and parsable text, entities, white space, and whatever else you might need that the writer supports. The writer maintains an internal node stack and uses it to detect and block erroneous calls such as attributes being created outside the start tag. The writer is smart enough to complete the markup for nodes automatically. This means, for example, that the writer automatically inserts all missing end tags when the writer is closed and completes the markup for the start tag when writing of text or child nodes begins.
• Close the document. At this stage, you close the writer to flush both the contents of the writer and the underlying stream object. At this time only (or prior, if you call the Flush method), the XML text accumulated in an internal buffer is written out and undergoes a summary check for XML well-formedness.

Writing the XML Prolog

Once you have a living and functional instance of the XmlTextWriter class, the first XML element you add to it is the official XML 1.0 signature. You obtain this signature in a very natural and transparent way simply by calling the WriteStartDocument method. This method starts a new document and marks the XML declaration with the version attribute set to "1.0", as shown in the following code:

    // produces: <?xml version="1.0"?>
    writer.WriteStartDocument();

By using one of the WriteStartDocument overloads, you can also set the standalone attribute to "yes", as shown here:

    // produces: <?xml version="1.0" standalone="yes"?>
    writer.WriteStartDocument(true);

Note: A stand-alone XML document is declared to be totally independent of external resources such as DTDs or entities.

You close the document writing phase by calling the WriteEndDocument method, as shown in the following code. At this stage, all pending nodes are automatically closed, the internal stack is entirely cleared, and the writer is switched back to its initial state.

    writer.WriteStartDocument();
    // ...
    // Build the document here
    // ...
    writer.WriteEndDocument();

Important: The WriteStartDocument/WriteEndDocument pair is not required to produce an XML file. If you omit such calls, the writer will still work just fine. However, instead of a well-formed XML 1.0 document, you can get a well-formed XML fragment with no root rules applied.

When you need to insert a comment, use the WriteComment method. The syntax is straightforward, as shown here:

    writer.WriteComment("Do something here");
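The prolog calls described above can be combined into one small, self-contained run. This is a minimal sketch, assuming an in-memory StringWriter as the target; PrologDemo and BuildDocument are hypothetical names used only for illustration. Because a StringWriter reports its own encoding to the writer, the declaration in this sample may include an encoding attribute that a file-based writer created with a null encoding would not emit.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PrologDemo
{
    // Writes a stand-alone declaration, a comment, and two nested elements,
    // then lets WriteEndDocument close whatever is still open.
    public static string BuildDocument()
    {
        StringWriter buffer = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(buffer);
        writer.Formatting = Formatting.Indented;

        writer.WriteStartDocument(true);            // standalone="yes"
        writer.WriteComment("Do something here");
        writer.WriteStartElement("array");
        writer.WriteStartElement("element");

        // No explicit WriteEndElement calls: pending nodes are closed here.
        writer.WriteEndDocument();
        Console.WriteLine("State after WriteEndDocument: " + writer.WriteState);

        writer.Close();
        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BuildDocument());
    }
}
```

Printing the WriteState property after WriteEndDocument is a convenient way to watch the state transitions summarized in Table 4-2 while experimenting.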


No exception is raised if the comment text is null or empty. The follow<strong>in</strong>g code isgenerated by an empty comment:Another <strong>XML</strong> element you often f<strong>in</strong>d at the beg<strong>in</strong>n<strong>in</strong>g of an <strong>XML</strong> document is theprocess<strong>in</strong>g <strong>in</strong>struction. The method that writes such <strong>in</strong>structions isWriteProcess<strong>in</strong>gInstruction. It takes two arguments: the name of the <strong>in</strong>struction and avalue. The follow<strong>in</strong>g code demonstrates a typical process<strong>in</strong>g <strong>in</strong>struction:The process<strong>in</strong>g <strong>in</strong>struction dictates that the contents of the current document must betrans<strong>for</strong>med us<strong>in</strong>g the source of the specified style sheet document. A process<strong>in</strong>g<strong>in</strong>struction consists of a name (xml-stylesheet <strong>in</strong> this example) plus a value. The valuecan be a comb<strong>in</strong>ation of one or more name/ value pairs, however. 
When you create a processing instruction with the .NET XML API, you group all the name/value pairs in a single string, using blanks to separate consecutive pairs, as shown here:

    String text = "type=\"text/xsl\" href=\"transform.xsl\"";
    writer.WriteProcessingInstruction("xml-stylesheet", text);

The preceding code creates the following XML line:

    <?xml-stylesheet type="text/xsl" href="transform.xsl"?>

Important: The XML declaration is a kind of processing instruction. However, you can't create a typical XML 1.0 signature using the WriteProcessingInstruction method because WriteProcessingInstruction can be called only after the XML document has been initialized, that is, after WriteStartDocument has been called. At this point, any attempt to write the xml processing instruction would raise an argument exception.

Writing DOCTYPE and Entities

In an XML document, the document type subtree is a unique graph that contains references to external markup resources such as a DTD or a list of entities. As mentioned in the previous section, XML documents without such external references are said to be stand-alone and can declare their status in the XML signature through the standalone attribute.

To identify an external markup resource, two types of identifiers can be used: public and system.
In the .NET Framework, both identifiers appear as arguments of the WriteDocType method, as shown here:

    public override void WriteDocType(string name, string pubid, string sysid, string subset);

The name argument is mandatory and represents the name of the DOCTYPE root node. The subset argument, on the other hand, represents the text being written in the !DOCTYPE XML node. The pubid and sysid arguments represent the identifier of the DOCTYPE resource being defined. The key identifier is sysid, rendered in XML through the SYSTEM attribute. It normally evaluates to a URL that points to the remote location where the resource is stored. For example, the following code associates the MyDoc resource with the file.dtd URL:

    <!DOCTYPE MyDoc SYSTEM "file.dtd">


By using the pubid argument (PUBLIC attribute in XML code), you can reinforce the identification of the resource by also using a location-independent public name for it, which yields a declaration of the form <!DOCTYPE name PUBLIC "pubid" "sysid">.

You can use SYSTEM without PUBLIC, or both, or neither. You can't use PUBLIC alone.

You use the WriteDocType method to insert a reference to an in-line or external DTD file to be used for validation purposes. Alternatively, you can use the WriteDocType method to insert entity definitions. In this case, specify null values for both sysid and pubid arguments. The following code creates an entity named dinoe that evaluates to "Dino Esposito":

    writer.WriteDocType("MyDef", null, null, "<!ENTITY dinoe 'Dino Esposito'>");

The resulting XML text looks like this:

    <!DOCTYPE MyDef[<!ENTITY dinoe 'Dino Esposito'>]>

An entity declaration defines a macro to access pieces of XML text using a symbolic name. When a previously defined entity is then used in code, another method does the job of referencing it: WriteEntityRef. (More on this expansion in the next section.)

Writing Element Nodes and Attributes

The .NET XML API provides two methods for writing nodes. You use the WriteElementString method if you need to write a simple node around some text. You use the WriteStartElement/WriteEndElement pair if you need to specify attributes or if you need to control what's written as the body of the node.

The following instruction creates a node named MyNode and wraps it around the specified text.
If needed, the method also provides an overload in which you can add namespace information.

    writer.WriteElementString("MyNode", "Sample text");

The output looks like this:

    <MyNode>Sample text</MyNode>

By writing the start tag and the end tag of an element node as distinct pieces, you can add attributes, reference entities, and create CDATA sections. Here's how:

    // Open the document
    writer.WriteStartDocument();
    // Write DOCTYPE and entities
    writer.WriteDocType("MyDef", null, null, "");
    // Open the root
    writer.WriteStartElement("Cities");
    // Open the child
    writer.WriteStartElement("City");
    // Write the Zip attribute
    writer.WriteAttributeString("Zip", "12345");
    // Write the State attribute (reference an entity)
    writer.WriteStartAttribute("State", "");
    writer.WriteEntityRef("I");
    writer.WriteEndAttribute();
    // Write the body of the node (reference an entity)
    writer.WriteEntityRef("I-Capital");
    // Close the current innermost element (City)
    writer.WriteEndElement();
    // Close the current innermost element (Cities)
    writer.WriteEndElement();
    // Close the document
    writer.WriteEndDocument();

All the instructions in the preceding code work together to populate a single element node named City. The City node contains an attribute named Zip, which is created in one shot using the WriteAttributeString method. As with element nodes, attribute nodes too can be written in two ways, using either a one-shot method or a pair of start/end methods.

The State attribute demonstrates the alternative approach: it is opened and closed with separate statements. Meanwhile, a WriteEntityRef call determines the attribute's contents by referencing a previously defined entity. The final output for the Cities subtree is shown here:

    <Cities>
      <City Zip="12345" State="&I;">&I-Capital;</City>
    </Cities>

Internet Explorer correctly displays the document and expands all of its entities, as shown in Figure 4-3.
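The element, attribute, and entity calls discussed in this section can be combined into one self-contained sketch. It assumes an in-memory StringWriter target; the class name CitySketch and the entity values 'Italy' and 'Rome' are hypothetical and are not taken from the original listing.

```csharp
using System;
using System.IO;
using System.Xml;

public static class CitySketch
{
    // Writes a DOCTYPE with two entities, then one element that references them.
    public static string Build()
    {
        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);

        writer.WriteStartDocument();
        // Hypothetical entity values; the subset is written as raw text.
        writer.WriteDocType("MyDef", null, null,
            "<!ENTITY I 'Italy'><!ENTITY I-Capital 'Rome'>");

        writer.WriteStartElement("Cities");
        writer.WriteStartElement("City");

        // One-shot attribute
        writer.WriteAttributeString("Zip", "12345");

        // Start/end attribute pair mixing an entity reference and plain text
        writer.WriteStartAttribute("Country", "");
        writer.WriteEntityRef("I");
        writer.WriteString(", Europe");
        writer.WriteEndAttribute();

        // Element body made of an entity reference
        writer.WriteEntityRef("I-Capital");

        writer.WriteEndElement();   // closes City
        writer.WriteEndElement();   // closes Cities
        writer.WriteEndDocument();
        writer.Close();

        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build());
    }
}
```

The writer emits the entity references verbatim and does not check that the entities are defined; expansion is left to whatever parser later reads the document.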


Figure 4-3: A dynamically created XML document with entities and DOCTYPE definitions.

If you need to concatenate entities with plain text or if you just want to write the contents of an attribute, use the WriteString method. For example, the following code adds ", Europe" to the attribute Country:

    writer.WriteStartAttribute("Country", "");
    writer.WriteEntityRef("I");
    writer.WriteString(", Europe");
    writer.WriteEndAttribute();

Figure 4-4 shows the results of the concatenation.

Figure 4-4: The Country attribute is created by concatenating an entity reference and plain text.

As you might have noticed, the methods that close attributes and nodes do not take any arguments. The writer maintains an internal stack of opened attributes and nodes and automatically pops the innermost element when you close a node. Likewise, when a new node or attribute is opened, the writer simply pushes a new element onto the stack. If the newly added element is a node, in the resulting XML code, the node is nested one additional level.

At the end of the document, that is, when WriteEndDocument is called, all pending nodes are automatically popped off the stack and closed according to the last in, first out (LIFO) method. Let's consider what can happen if you disregard this simple rule and omit a call to WriteEndElement in a loop. The following code translates an array of strings into XML:


    writer.WriteStartDocument();
    writer.WriteStartElement("array");
    foreach(string s in theArray)
    {
        writer.WriteStartElement("element");
        writer.WriteAttributeString("value", s);
        writer.WriteEndElement();
    }
    writer.WriteEndDocument();

The root node is array and contains a series of child nodes named element, each with an attribute value, as shown here:

    <array>
      <element value="..." />
      <element value="..." />
    </array>

The element node is created entirely in the loop. If you don't explicitly close it by calling WriteEndElement, the final output would look like this:

    <array>
      <element value="...">
        <element value="..." />
      </element>
    </array>

Writing Raw XML Data

As we've seen, the XML writer saves the developer from a lot of the details concerning the markup text in an XML document. So what happens if you try to run the following command?

    writer.WriteString("


This command executes normally, but any occurrence of markup-sensitive characters is replaced by escaped characters, mostly entities. Thus, the less than sign (<) is written as &lt;. To emit text exactly as supplied, with no escaping, use the WriteRaw method.

If you try to write the same text using WriteString, the effect is different, as the following XML text demonstrates:

    More &gt;

Tip: The XmlConvert class represents a handy tool that can be used to achieve a couple of goals. First, it provides methods for converting XML Schema Definition (XSD) data types to the .NET Framework type system. For example, the method ToDateTime converts an XSD Date type to System.DateTime. In addition, the XmlConvert class also lets you encode and decode XML names so that they comply with the W3C standards. The encoding process escapes any invalid characters into entities consisting of the character's numeric representation in the current encoding set.

Formatting Text

The XmlTextWriter class allows you to specify a few properties to configure the way in which newline characters, quotation marks, and indentation are defined. Normally, XML documents use tab characters or blanks to indent child nodes, although an XML document rendered as an endless string is by all means a perfectly valid XML document.

As mentioned, the properties involved with XML formatting are Formatting, IndentChar, Indentation, and QuoteChar.
The first three are somewhat correlated, whereas the latter simply indicates the character to be used to enclose attribute values, by default the double quotation mark.

Formatting lets you control the formatting style by toggling it on and off altogether. When Formatting is set to Formatting.Indented (the other possible value is Formatting.None), the XML writer attributes a special role to IndentChar and Indentation that would otherwise be ignored. Indentation specifies the number of characters to indent for each level in the document's hierarchy. Conversely, IndentChar represents the character that will be used to indent the text of the new node. By default, Formatting is set to Formatting.None; once indentation is turned on, the default is two blanks per level.

Note that you should configure the writer's formatting properties before the document is actually opened, that is, prior to the WriteStartDocument call. The following code snippet demonstrates how to write a new XML document, indenting with a tab character any level of the hierarchy:

    XmlTextWriter writer = new XmlTextWriter(filename, null);
    writer.Formatting = Formatting.Indented;
    writer.Indentation = 1;
    writer.IndentChar = '\t';

As a final note, keep in mind that XML formatting normally indents element contents only and does not format mixed contents.

Supporting Namespaces

In the XmlTextWriter class, all the methods available for writing element nodes and attributes have overloads to work with namespaces. You simply add a new argument to the call and specify the namespace prefix of choice. A namespace is identified by a URN and is used to qualify both attribute and node names so that they belong to a particular domain of names.

Namespace Declaration

You insert a namespace declaration in the current node using the xmlns attribute. You can also optionally specify a namespace prefix. The prefix is a symbolic name that uniquely identifies the namespace.
To declare a namespace, add a special attribute to the node that roots the target scope of the namespace, as shown here:

    <node xmlns:prefix="namespace-URN">

You can write this XML text as raw text or use one of the methods of the writer object. Typically, you use one of the overloads of the WriteAttributeString method, as shown here:

    public void WriteAttributeString(string prefix, string attr, string ns, string value);

You can use this method to declare a namespace, but it remains primarily a method to add attributes. To obtain a namespace declaration like the one in our earlier examples, a few exceptions to the signature apply. In particular, for an xmlns attribute being written, you instruct the method to add an attribute whose name matches the prefix and whose prefix equals xmlns.

The third argument is expected to be the URN of the namespace for the attribute. In this case, however, the namespace prefix named xmlns points to the default XML namespace, so the ns argument must be set to null. Note that any attempt to set ns to a non-null value would result in an exception because the specified URN would not match the URN of the xmlns namespace prefix. The fourth and final argument, value, contains the URN of the namespace you are declaring. The following code shows how to declare a sample namespace rooted in the node <MyNode>:

    writer.WriteStartElement("MyNode");
    writer.WriteAttributeString("xmlns", "x", null, "dinoe:isbn-0735618011");

This code produces the following output:

    <MyNode xmlns:x="dinoe:isbn-0735618011">

Qualified Nodes

A namespace is unequivocally identified by a URN. Thus, whenever you need to indicate a namespace for an XML node, you should specify the URN. The following code shows how to use WriteElementString to write a qualified node based on the namespace declared in the previous section:

    writer.WriteElementString("value", "dinoe:isbn-0735618011", "...");

The output looks like the following XML code:

    <x:value>...</x:value>

As you can see, the method uses the specified URN to look up the closest prefix and then uses that prefix to generate the output text.

The LookupPrefix method is a public method that takes a URN and returns the closest prefix that matches it. By closest, I mean the topmost prefix available on the namespace stack. In other words, you can have the same namespace being referenced through different prefixes in different subtrees of the document. LookupPrefix simply scans the namespaces declared within the current document and returns when the most recent one has been found. The method traverses the XML tree starting from the current node and moving up from parent to parent until the root is reached.

The following code shows an alternative way to write the preceding XML data using LookupPrefix:

    string prefix = writer.LookupPrefix("dinoe:isbn-0735618011");
    writer.WriteStartElement(prefix, "value", null);
    writer.WriteString("...");
    writer.WriteEndElement();

The WriteStartElement method takes the prefix and the node name. It can also accept a third argument, the URN of the namespace.
If this argument is null or matches the closest URN for the prefix, the looked-up, existing namespace is used. The final XML code looks like this:

    <x:value>...</x:value>

If the third argument of WriteStartElement represents an unknown URN, the namespace is declared and prefixed in place. In this case, its scope ranges over the XML subtree rooted in the node being created. Consider the following statements:

    // Get the topmost prefix for the URN.
    string prefix = writer.LookupPrefix("dinoe:isbn-0735618011");

    // Write a node. Identify the namespace
    // using the most recent prefix/URN binding.
    writer.WriteStartElement(prefix, "value", null);
    writer.WriteString("...");
    writer.WriteEndElement();

    // Write a node. Since the URN associated with
    // the prefix does not match the specified URN, a new prefix/URN
    // binding is generated rooting in the new node.
    writer.WriteStartElement(prefix, "value", "despos:isbn-0735618011");
    writer.WriteString("...");
    writer.WriteEndElement();

The two nodes created look like the XML source code shown here:

    <x:value>...</x:value>
    <x:value xmlns:x="despos:isbn-0735618011">...</x:value>

The two nodes are scoped in different namespaces although they have the same name and even the same namespace prefix.

Qualified Attributes

To write qualified attributes, you use some of the overloads of the WriteAttributeString and WriteStartAttribute methods. According to the W3C XML 1.0 and Namespaces specifications, element nodes can have an associated namespace without a prefix, as shown here:

    <value xmlns="despos:isbn-0735618011">...</value>

This namespace can be obtained with the following code:

    writer.WriteStartElement("value", "despos:isbn-0735618011");
    writer.WriteString("...");
    writer.WriteEndElement();

Attributes, on the other hand, can't do without a prefix once they are bound to a namespace. If you don't indicate the prefix explicitly, one is generated automatically. Try the following code:

    writer.WriteStartElement("element");
    writer.WriteStartAttribute("value", "despos:isbn-0735618011");
    writer.WriteString(s);
    writer.WriteEndAttribute();
    writer.WriteEndElement();

The value attribute is associated with a namespace URN, but no prefix is set or retrieved through LookupPrefix. The resultant XML text is shown here:

    <element d1p1:value="..." xmlns:d1p1="despos:isbn-0735618011" />

An automatic prefix is generated to scope the attribute. There are two elements in the prefix generated by the .NET Framework: the depth level, d{n}, and the prefix index, p{n}.
The depth level is a one-based value that counts the depth of the node in the XML tree. The prefix index counts the number of namespaces defined in the body of the node. For example, consider the following code:

    writer.WriteStartElement("parent");
    writer.WriteStartElement("element");
    // First attribute
    writer.WriteStartAttribute("value", "despos:isbn-0735618011");
    writer.WriteString("...");
    writer.WriteEndAttribute();
    // Second attribute
    writer.WriteAttributeString("value", "urn:my-namespace", "...");
    writer.WriteEndElement();
    writer.WriteEndElement();

The corresponding output that the XmlTextWriter class generates is shown in the following code. Notice the presence of an extra parent node.

    <parent>
      <element d2p1:value="..." d2p2:value="..."
               xmlns:d2p2="urn:my-namespace"
               xmlns:d2p1="despos:isbn-0735618011" />
    </parent>

As you can see, the depth increased by 1 due to the extra parent node. In addition, the prefix index ranges from 1 to 2 to include all the namespaces in the node.

Getting the Qualified Name

The methods described up to now only allow you to create element and attribute nodes with fully qualified names. WriteQualifiedName is a method you can use to write out both element and attribute namespace-qualified names.

The WriteQualifiedName method takes two arguments, one for the node name and one for the namespace URN. Next it looks for the prefix associated with that URN and outputs the combined name in the form prefix:name. If you are writing element content, you get an exception if the namespace declaration does not exist. If the namespace argument maps to the current default namespace, the method generates no prefix. For attributes, if the specified namespace is not found, it is automatically registered and a related prefix is created as described in the previous section.

The WriteQualifiedName method, however, simply writes the name of the node and can't be used to create the node itself. From this point of view, it is only complementary to methods like WriteStartElement and WriteStartAttribute.
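The namespace-related calls covered in this section can be pulled together in a short sketch. It assumes an in-memory StringWriter target; the class name NamespaceSketch and the element name ref are hypothetical. The sketch declares a prefix, writes a qualified element by URN, and then uses WriteQualifiedName to emit a qualified name as text inside a separately created node.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NamespaceSketch
{
    public static string Build()
    {
        const string urn = "dinoe:isbn-0735618011";

        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);

        // Declare xmlns:x on the node that roots the namespace scope.
        writer.WriteStartElement("MyNode");
        writer.WriteAttributeString("xmlns", "x", null, urn);

        // Qualified element: the writer resolves the URN to the prefix x.
        writer.WriteElementString("value", urn, "...");

        // WriteQualifiedName writes only the text prefix:name;
        // the surrounding node must be created separately.
        writer.WriteStartElement("ref");   // hypothetical element name
        writer.WriteQualifiedName("value", urn);
        writer.WriteEndElement();

        writer.WriteEndElement();          // closes MyNode
        writer.Close();

        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build());
    }
}
```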
You need this method only when you have to write out the name of a node. When the writer is configured to support namespaces (which is the default), the WriteQualifiedName method also ensures that the output name conforms to the W3C Namespaces recommendation. You can turn namespace support on and off in a writer by setting the Namespaces property with a Boolean value as appropriate.

Tip: As the W3C XML Namespaces specification recommends, the prefix should be considered only as a placeholder for a namespace URN. Although you could use prefixes and real names interchangeably within the range of a document, bear in mind that an intensive use of prefixes can soon become misleading when the document must be accessed by different applications and when you use the same prefix repeatedly in the same document. Whenever possible, applications should use the namespace name rather than a prefix. The use of a prefix is more acceptable when only unique prefixes are used and possibly only one namespace is defined in the document.

Writing Encoded Data

As mentioned in the section "Methods of the XmlWriter Class," on page 141, the XML text writer object has two methods that write out binary data in an encoded form using the base64 and BinHex algorithms. The methods involved, WriteBase64 and WriteBinHex, have a rather straightforward interface. They simply take an array of bytes and write it out starting at a specified offset and for the specified number of bytes. (As you saw in Chapter 2, XML reader classes have matching ReadBase64 and ReadBinHex methods to comfortably read back encoded information.)

Note: In the .NET Framework, base64 encoding can also be performed through static methods exposed by the Convert class. In particular, the ToBase64String method takes an array of bytes and returns a base64-encoded string. Likewise, the FromBase64String method decodes a previously encoded string and returns it as an array of bytes. For some reason, the .NET Framework does not provide similar support for BinHex.
BinHex, therefore, is supported only through XML readers and writers.

In the section "The XML Writer Programming Interface," on page 136, you learned how to serialize an array of strings to XML using the following array:

    string[] theArray = {"Rome", "New York", "Sydney", "Stockholm", "Paris"};

Let's look at how to write this array to a base64-encoded form. The structure of the code we analyzed earlier does not need to be altered much. Only a couple of issues need to be addressed. The first concerns how strings are actually turned into an array of bytes. The second concerns the signature of the encoding methods. You use WriteBase64 to write both element and attribute content in base64 format (WriteBinHex does the same in BinHex format), as shown here:

    XmlTextWriter writer = new XmlTextWriter(filename, null);
    writer.Formatting = Formatting.Indented;
    writer.WriteStartDocument();
    writer.WriteComment("Array to Base64 XML");
    writer.WriteStartElement("array");
    writer.WriteAttributeString("xmlns", "x", null, "dinoe:isbn-0735618011");
    foreach(string s in theArray)
    {
        writer.WriteStartElement("x", "element", null);
        writer.WriteBase64(Encoding.Unicode.GetBytes(s), 0, s.Length*2);
        writer.WriteEndElement();
    }
    writer.WriteEndDocument();
    writer.Close();

Encoding-derived classes provide the GetBytes method, which simply translates strings into an array of bytes. You use Encoding.Unicode because that is the native format of .NET Framework strings in memory. When translating a Unicode string to an array of bytes, keep in mind that each Unicode character takes up two bytes. This code is slightly more efficient than using the following instruction, in which the conversion is performed internally:

    writer.WriteBase64(Encoding.Default.GetBytes(s), 0, s.Length);

In the case of very large arrays, you can consider using direct pointers and the unsafe copy method. The unsafe method has the clear advantage of reducing memory allocations, so the resulting code is slightly faster. (See the section "Further Reading," on page 199, for references to more information.)

Figure 4-5 shows the final output of this code.

Figure 4-5: The contents of an array serialized to base64-encoded XML text.

Encoding using BinHex is nearly identical, as Figure 4-6 demonstrates.
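The base64 serialization described above can be sketched in a self-contained form. The sketch assumes an in-memory StringWriter target instead of a file, omits the namespace prefix for brevity, and uses a hypothetical class name, Base64Sketch. It passes the length of the byte array itself rather than computing s.Length*2, which gives the same count for UTF-16 text without the arithmetic.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Base64Sketch
{
    // Serializes each string as a base64-encoded element body.
    public static string Build(string[] items)
    {
        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);

        writer.WriteStartElement("array");
        foreach (string s in items)
        {
            // UTF-16 bytes of the string, two per code unit
            byte[] bytes = Encoding.Unicode.GetBytes(s);

            writer.WriteStartElement("element");
            writer.WriteBase64(bytes, 0, bytes.Length);
            writer.WriteEndElement();
        }
        writer.WriteEndElement();
        writer.Close();

        return output.ToString();
    }

    public static void Main()
    {
        string[] theArray = { "Rome", "New York", "Sydney", "Stockholm", "Paris" };
        Console.WriteLine(Build(theArray));
    }
}
```

For short strings such as these, each element body should match what Convert.ToBase64String returns for the same bytes.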


Figure 4-6: The contents of an array serialized to BinHex-encoded XML text.

As for the code, simply change the WriteBase64 line to the following and you're pretty much done:

    writer.WriteBinHex(Encoding.Unicode.GetBytes(s), 0, s.Length*2);

Decoding Base64 and BinHex Data

Reading encoded data is a bit trickier, but not because the ReadBase64 and ReadBinHex methods feature a more complex interface. The difficulty lies in the fact that you have to allocate a buffer to hold the data and make some decision about its size. If the buffer is too large, you can easily waste memory; if the buffer is too small, you must set up a potentially lengthy loop to read all the data. In addition, if you can't process data as you read it, you need another buffer or stream in which you can accumulate incoming data.

Aside from this, however, decoding is as easy as encoding. The following code shows how to read the base64 XML document created in the previous section. The XML reader opens the file and loops over the contained nodes. The ReadBase64 method copies the specified number of bytes, starting at the specified offset, into a buffer that is assumed to be large enough.
ReadBase64 returns a value denoting the actual number of bytes read.

Encoding-derived classes also provide a method, GetString, to transform an array of bytes into a string, as shown here:

    XmlTextReader reader = new XmlTextReader(filename);
    while(reader.Read())
    {
        if (reader.LocalName == "element")
        {
            byte[] bytes = new byte[1000];
            int n = reader.ReadBase64(bytes, 0, 1000);
            // Decode only the n bytes actually read
            string buf = Encoding.Unicode.GetString(bytes, 0, n);
            // Output the decoded data
            Console.WriteLine(buf);
        }
    }
    reader.Close();

If in this code you replace the call to ReadBase64 with a call to ReadBinHex, you obtain a BinHex decoder as well.

Embedding Images in XML Documents

The technique described in the previous section can be used with any sort of binary data that can be expressed with an array of bytes, including images. Let's look at how to embed a JPEG image in an XML document.

The structure of the sample XML document is extremely simple. It will consist of a single <jpeg> node holding the BinHex data plus an attribute containing the original name, as shown here:

    writer.WriteStartDocument();
    writer.WriteComment("Contains a BinHex JPEG image");
    writer.WriteStartElement("jpeg");
    writer.WriteAttributeString("FileName", filename);

    // Get the size of the file
    FileInfo fi = new FileInfo(jpegFileName);
    int size = (int) fi.Length;

    // Read the JPEG file
    byte[] img = new byte[size];
    FileStream fs = new FileStream(jpegFileName, FileMode.Open);
    BinaryReader f = new BinaryReader(fs);
    img = f.ReadBytes(size);
    f.Close();

    // Write the JPEG data
    writer.WriteBinHex(img, 0, size);

    // Close the document
    writer.WriteEndElement();
    writer.WriteEndDocument();

This code uses the FileInfo class to determine the size of the JPEG file. FileInfo is a helper class in the System.IO namespace used to retrieve information about individual files.
The contents of the JPEG file are extracted using the ReadBytes method of the .NET binary reader. The contents are then encoded as BinHex and written to the XML document. Figure 4-7 shows the source code of the XML just created.
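To see what the BinHex text actually looks like, here is a minimal sketch that writes a few bytes into an in-memory XML string instead of a file. The BinHexWriteSketch class and its Encode method are names invented for this illustration and are not part of the book's sample code; only the XmlTextWriter and StringWriter calls are real .NET Framework APIs.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class BinHexWriteSketch
{
    // Wraps raw bytes in a single element as BinHex text and returns the XML.
    public static string Encode(string elementName, byte[] data)
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);

        writer.WriteStartElement(elementName);
        writer.WriteBinHex(data, 0, data.Length);   // two hex characters per byte
        writer.WriteEndElement();
        writer.Close();

        return sw.ToString();
    }
}
```

Each input byte becomes two hexadecimal characters, so the four bytes FF D8 FF E0 (the usual opening of a JPEG file) come out as eight characters of element text.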


Figure 4-7: An XML file containing a BinHex-encoded JPEG file.

The BinHex stream is now part of the XML document and, as such, can be reread using an XML reader and decoded into an array of bytes. The sample application shown in the following code does just that and, in addition, translates the bytes into a Bitmap object to display within a Windows Forms PictureBox control:

    XmlTextReader reader = new XmlTextReader(filename);
    reader.Read();
    reader.MoveToContent();
    if (reader.LocalName == "jpeg")
    {
        FileInfo fi = new FileInfo(filename);
        int size = (int) fi.Length;
        byte[] img = new byte[size];
        reader.ReadBinHex(img, 0, size);

        // Bytes to Image object
        MemoryStream ms = new MemoryStream();
        ms.Write(img, 0, img.Length);
        Bitmap bmp = new Bitmap(ms);
        ms.Close();

        // Fill the PictureBox control
        JpegImage.Image = bmp;
    }
    reader.Close();

The reader opens the XML file and jumps to the root node using MoveToContent. Next it gets the size of the XML file and uses it to oversize the buffer destined to contain the decoded JPEG file. Bear in mind that a BinHex stream is always significantly larger than the binary JPEG file it encodes (each byte becomes two hexadecimal characters), but this is the price you must pay for string encoding algorithms. The ReadBinHex method decodes the JPEG stream into the byte array, which is then copied into a MemoryStream object.
This step is necessary if you want to transform the array of bytes into a .NET Framework graphics object (say, the Bitmap object) that can then be bound to a PictureBox control, as shown in Figure 4-8.
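Because ReadBinHex returns the number of bytes it actually decoded, you can also read the image in small chunks rather than oversizing a single buffer. The following is a minimal sketch of that idea under a few assumptions: the BinHexReadSketch class and its Decode method are invented for this illustration, the document is passed as a string, and the BinHex text lives in the root element.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class BinHexReadSketch
{
    // Decodes the BinHex text of the root element into a byte array.
    public static byte[] Decode(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.MoveToContent();                 // position on the root element

        MemoryStream ms = new MemoryStream();
        byte[] chunk = new byte[1024];
        int n;
        do
        {
            // n is the number of decoded bytes placed in the chunk
            n = reader.ReadBinHex(chunk, 0, chunk.Length);
            ms.Write(chunk, 0, n);
        }
        while (n == chunk.Length);              // a short read means the element is exhausted

        reader.Close();
        return ms.ToArray();
    }
}
```

The returned array holds exactly the decoded bytes, so it can be wrapped in a MemoryStream or written to disk without carrying unused buffer space along.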


Figure 4-8: A PictureBox control displays a JPEG file just extracted from an XML file and properly decoded.

If you want to extract the image bits and create a brand-new JPEG file, use the following code. The name of the JPEG file is read out of the OriginalFileName attribute in the XML encoded document.

    string originalFileName = reader["OriginalFileName"];
    FileStream fs = new FileStream(originalFileName, FileMode.Create);
    BinaryWriter writer = new BinaryWriter(fs);
    writer.Write(img);
    writer.Close();

XML Validating Writers

As mentioned, XML text writers do not validate against schema or DTD files. In fact, writing the XML document and validating its contents are two distinct operations that can't occur at the same time. However, if you need to make sure that the document just written is valid against, say, a schema, you can proceed in the following way: write the document and, when finished, validate it using a validating reader. Sounds straightforward? Well, it isn't.

The difficulty lies in the fact that, to validate, you must reread the text just written. If you are using a file, you can simply open the file using an XML reader and then instantiate a validating reader. The task is trickier if you happen to use an output stream. In many cases, you can't read the contents of an output (and mostly write-only) stream.
In this case, a possible workaround is caching the entire XML document into a string. When you've finished, you simply pass the XML text to the validating reader. If validation succeeds, you write out the string to the expected output stream. To accumulate the XML output into a string, you use a StringWriter object to build the XML writer. The StringWriter class inherits from TextWriter and, as such, can be used to initialize an XML text writer using the following constructor:

    public XmlTextWriter(TextWriter w);

Because this constructor is not stream-based, you can't indicate an encoding scheme. Once you have run the statements listed in the following code, the remainder of the code does not need to be changed or altered. The big difference, though, is that now the text is accumulated in an in-memory buffer managed by StringWriter. Incidentally, this buffer is implemented using a StringBuilder object.

    StringWriter sw = new StringWriter();
    XmlTextWriter writer = new XmlTextWriter(sw);
    //
    // Write as usual
    //
    writer.Close();

Only after the XML writer has been closed does the string contain all the XML text generated by the application. You can copy that text into a local string variable using the ToString method and post-process it as appropriate, as shown here:

    string xml = sw.ToString();
    sw.Close();

In particular, you might want to pass down this string to an instance of the XmlValidatingReader class to apply schema validation. You can initialize the XmlValidatingReader class by passing the string as a whole and a node type of Document. Alternatively, you can use an XmlTextReader object working on the XML string through a StringReader object, as shown here:

    StringReader sr = new StringReader(xml);
    XmlTextReader xr = new XmlTextReader(sr);
    XmlValidatingReader reader = new XmlValidatingReader(xr);

Yet another option is to use the special all-inclusive validator object built in Chapter 3, the global XmlValidator object, as shown here:

    StringReader sr = new StringReader(xml);
    XmlTextReader xr = new XmlTextReader(sr);
    bool b = XmlValidator.ValidateXmlDocument(xr);

The XmlValidator object takes an XmlReader-derived class (or a file name) and handles internally all the details of the validation process, returning a Boolean value that indicates the success of the operation.
Figure 4-9 shows the output of the sample application.

Figure 4-9: The sample XML validating writer in action. It dumps out the XML text and the Boolean value resulting from the schema validation.
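The write-then-validate sequence can be sketched end to end as follows. This is an illustration under stated assumptions, not the book's ValidatingWriter sample: the ValidatingWriterSketch class, its method names, and the tiny inline schema are all invented here, and validation goes through XmlReader.Create with XmlReaderSettings, an API introduced in later versions of the .NET Framework, rather than through XmlValidatingReader.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper, for illustration only.
public static class ValidatingWriterSketch
{
    // Assumed sample schema: a <names> root holding one or more <name> elements.
    const string Xsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:element name='names'><xs:complexType><xs:sequence>" +
        "<xs:element name='name' type='xs:string' maxOccurs='unbounded'/>" +
        "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Step 1: accumulate the XML output in a string through a StringWriter.
    public static string WriteToString(string[] names)
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);
        writer.WriteStartElement("names");
        foreach (string n in names)
            writer.WriteElementString("name", n);
        writer.WriteEndElement();
        writer.Close();
        return sw.ToString();
    }

    // Step 2: reread the string with a validating reader.
    public static bool IsValid(string xml)
    {
        XmlSchemaSet schemas = new XmlSchemaSet();
        schemas.Add(null, XmlReader.Create(new StringReader(Xsd)));

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;

        bool valid = true;
        settings.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e) { valid = false; };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { }   // validation happens as nodes are read
        }
        return valid;
    }
}
```

Only if IsValid returns true would the application go on to copy the cached string to the real output stream.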


Note
The entire source code for a sample XML validating writer application can be found in this book's sample files. It is a console application named ValidatingWriter.

Writing a Custom XML Writer

As we've seen, an XML writer is a .NET Framework class that specializes in writing out XML text. Because there is just one flavor of XML, the need for customized versions of XmlTextWriter is extremely low. However, a lot of documents and objects out there might take significant advantage of an ad hoc, specialized, and seamless XML serialization class.

In the .NET Framework, all the XML files being used, from ADO.NET DiffGram objects to Web .config files, are written using XML writers. (ADO.NET DataSet objects are always remoted and serialized in a special XML format called the DiffGram; see Chapter 10.) In addition, the XML serializer saves and restores .NET Framework objects to and from XML documents. (I'll cover XML serialization in Chapter 11.) So the .NET Framework provides you with some tools to save existing objects into an XML layout.

The XML serializer is designed to map living instances of objects to an XML schema. Sometimes, though, you just need to produce a particular XML output, and the use of XML schemas is not a strict requirement.
In situations like this, what you can do is create an XML writer class and add to it as many specialized methods and properties as required by the structure you want to obtain.

Earlier in this chapter, we looked at a couple of simple XML writers that were used to create XML representations of string arrays and even JPEG images. In those cases, however, the expected output was so simple that there was no need to set up a class with more than one method. The next step is to analyze a more complex case: arranging a .NET XML writer class to produce the XML version of an ADO recordset starting from ADO.NET objects.

Implementing an ADO Recordset XML Writer

In Microsoft ADO.NET, the OleDbDataAdapter class allows you to import the contents of an ADO Recordset object into one or more DataTable objects. This kind of binding is unidirectional, however. You can import recordsets into ADO.NET objects, but you can't create an ADO Recordset object starting from, say, a DataSet or a DataTable object.

The two-way binding between ADO.NET and ADO is important because it can save you from planning a hasty port of Windows Distributed interNet Applications (DNA) applications to the .NET platform. If you have a Windows DNA application with middle-tier objects that use ADO to fetch data, chances are good that you can import ADO recordsets into ASP.NET pages.
In this way, as the first step of the porting, you simply refresh the user interface but leave unaltered the middle tier, the most critical part of a distributed system.

With this approach, you soon run into a subtle problem. How can you send down updated recordsets to the middle-tier objects? A possible workaround is to create a recordset from scratch by importing the ADO library in .NET Framework applications and then using the native methods to instantiate and populate the Recordset object. In this section, we'll look at an alternative approach: creating an ADO-specific XML file that COM-based middle-tier objects can read and internally transform into a living instance of the object.

Although the ADO.NET DataSet object can be easily serialized to XML, the schema used is not compatible with ADO. The XML schema used by ADO is based on XML Data-Reduced (XDR) schemas (see Chapter 3) and a few specific namespaces. In addition, it makes use of the XDR type system, which has no direct correspondence with the .NET Framework type system. But one thing at a time. Let's start with the new XmlRecordsetWriter class.

The XmlRecordsetWriter Programming Interface

The XmlRecordsetWriter class embeds an instance of the XmlTextWriter class but does not inherit from it. All the hard work of creating the XML output is accomplished through the internal writer, but the class programming interface is completely customized and largely simplified.

By design, the set of constructors of the XmlRecordsetWriter class is nearly identical to the constructors of the XmlTextWriter class, as shown here:

    protected XmlTextWriter Writer;

    public XmlRecordsetWriter(string filename)
    {
        Writer = new XmlTextWriter(filename, null);
        SetupWriter();
    }

    public XmlRecordsetWriter(Stream s)
    {
        Writer = new XmlTextWriter(s, null);
        SetupWriter();
    }

    public XmlRecordsetWriter(TextWriter tw)
    {
        Writer = new XmlTextWriter(tw);
        SetupWriter();
    }

The only difference is that the XmlRecordsetWriter constructors do not support an encoding character set. The parameter for encoding is always set to null.

Table 4-8 lists the methods exposed by the XmlRecordsetWriter class.

Table 4-8: Public Methods of the XmlRecordsetWriter Class

    Method               Description
    WriteContent         Loops on the specified ADO.NET source object and
                         writes a row of data. This method features overloads
                         to read from DataSet, DataTable, and DataView objects.
    WriteEndDocument     Ensures that all the pending nodes are closed and
                         releases the underlying writer and stream.
    WriteRecordset       One-shot method that groups together all the steps
                         necessary to create an XML recordset file. This method
                         features overloads to read from DataSet, DataTable,
                         and DataView objects.
    WriteSchema          Writes the schema information according to the XDR
                         syntax and reads column metadata from ADO.NET objects.
                         This method features overloads to read from DataSet,
                         DataTable, and DataView objects.
    WriteStartDocument   Writes the document's prolog, including the root node
                         with all the needed namespace declarations.

For writing schemas and content, the XmlRecordsetWriter class needs to read information out of some ADO.NET objects. For this reason, methods like WriteSchema, WriteContent, and WriteRecordset have the following four overloads:

    public void WriteXXX(DataSet ds)
    {
        WriteXXX(ds.Tables[0]);
    }

    public void WriteXXX(DataSet ds, string tableName)
    {
        WriteXXX(ds.Tables[tableName]);
    }

    public void WriteXXX(DataView dv)
    {
        WriteXXX(dv.Table);
    }

    public void WriteXXX(DataTable dt)
    {
        // Actual implementation here
    }

The node layout of an ADO Recordset object is shown in Figure 4-10.


Figure 4-10: Layout of the XML schema for ADO Recordset objects.

Creating an XML Recordset object involves four steps: writing the prolog, writing the schema, writing the contents, and, finally, closing all pending nodes. The XmlRecordsetWriter class allows you to create the XML code by controlling each step yourself or by calling one of the WriteRecordset overloads, shown here:

    public void WriteRecordset(DataTable dt)
    {
        WriteStartDocument();
        WriteSchema(dt);
        WriteContent(dt);
        WriteEndDocument();
    }

Creating the Recordset-Based Document

The WriteStartDocument method writes the root node, named xml, and all of the namespaces the document needs to reference, as follows:

    public void WriteStartDocument()
    {
        Writer.WriteStartDocument();
        Writer.WriteComment("Created by XmlRecordsetWriter");
        Writer.WriteStartElement("xml");
        Writer.WriteAttributeString("xmlns", "s", null,
            "uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882");
        Writer.WriteAttributeString("xmlns", "dt", null,
            "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882");
        Writer.WriteAttributeString("xmlns", "rs", null,
            "urn:schemas-microsoft-com:rowset");
        Writer.WriteAttributeString("xmlns", "z", null,
            "#RowsetSchema");
    }

The next step is creating the schema. The following listing demonstrates a sample, but valid, XML schema for an ADO Recordset object with two fields, firstname and lastname:

    <s:Schema id="RowsetSchema">
      <s:ElementType name="row" content="eltOnly">
        <s:AttributeType name="firstname" rs:number="1" />
        <s:AttributeType name="lastname" rs:number="2" />
        <s:extends type="rs:rowbase" />
      </s:ElementType>
    </s:Schema>

As you can see, this syntax is based on the Microsoft XDR schema, an early subset of today's XML Schema. The WriteSchema method generates it, as shown here:

    public void WriteSchema(DataTable dt)
    {
        // Open the schema tag (XDR)
        Writer.WriteStartElement("s", "Schema", null);
        Writer.WriteAttributeString("id", "RowsetSchema");
        Writer.WriteStartElement("s", "ElementType", null);
        Writer.WriteAttributeString("name", "row");
        Writer.WriteAttributeString("content", "eltOnly");

        // Write the column info based on the table passed
        int index = 0;
        foreach (DataColumn dc in dt.Columns)
        {
            index++;
            Writer.WriteStartElement("s", "AttributeType", null);
            Writer.WriteAttributeString("name", dc.ColumnName);
            Writer.WriteAttributeString("rs", "number", null,
                index.ToString());
            Writer.WriteEndElement();
        }

        Writer.WriteStartElement("s", "extends", null);
        Writer.WriteAttributeString("type", "rs:rowbase");

        // Close the schema tag(s)
        Writer.WriteEndElement();
        Writer.WriteEndElement();
        Writer.WriteEndElement();
    }

The information exposed by the schema depends on the source table used. The XmlRecordsetWriter class can do its job starting from data contained in any of the following objects: DataSet, DataTable, and DataView. However, because the DataTable object is the ADO.NET object that most closely matches the ADO Recordset object, the overloaded methods that receive a DataSet or a DataView object simply pass the related DataTable object to the overloaded method that receives a DataTable object. As mentioned, the XML recordset is built using a particular table in the specified DataSet object or using the table for the specified DataView object.

Processing Record Contents

The serialized contents of an ADO Recordset object consist of a bunch of <z:row> nodes grouped below a parent <rs:data> node. The WriteContent method simply loops through the rows in the table and creates the <z:row> nodes, as shown in the following code.
Next it loops over all the columns and adds an attribute for each data column found.

    public void WriteContent(DataTable dt)
    {
        // Write data
        Writer.WriteStartElement("rs", "data", null);
        foreach (DataRow row in dt.Rows)
        {
            Writer.WriteStartElement("z", "row", null);
            foreach (DataColumn dc in dt.Columns)
                Writer.WriteAttributeString(dc.ColumnName,
                    row[dc.ColumnName].ToString());
            Writer.WriteEndElement();
        }
        Writer.WriteEndElement();
    }

ADO Recordset objects do not support embedding more than one result set in a single XML file. For this reason, you must either develop a new XML format or use separate files, one for each result set.
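To check the overall shape of the output without touching the disk, the prolog, schema, and content steps can be condensed into a single in-memory routine. The sketch below is an assumption-laden condensation, not the XmlRecordsetWriter class itself: the RecordsetXmlSketch class and its ToRecordsetXml method are invented names, the XML declaration and comment are left out, and the result is returned as a string through a StringWriter.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

// Hypothetical condensation of the steps described above, for illustration only.
public static class RecordsetXmlSketch
{
    public static string ToRecordsetXml(DataTable dt)
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter w = new XmlTextWriter(sw);

        // Prolog: root node plus the namespaces ADO expects
        w.WriteStartElement("xml");
        w.WriteAttributeString("xmlns", "s", null, "uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882");
        w.WriteAttributeString("xmlns", "dt", null, "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882");
        w.WriteAttributeString("xmlns", "rs", null, "urn:schemas-microsoft-com:rowset");
        w.WriteAttributeString("xmlns", "z", null, "#RowsetSchema");

        // Schema: one AttributeType per column, numbered from 1
        w.WriteStartElement("s", "Schema", null);
        w.WriteAttributeString("id", "RowsetSchema");
        w.WriteStartElement("s", "ElementType", null);
        w.WriteAttributeString("name", "row");
        w.WriteAttributeString("content", "eltOnly");
        int index = 0;
        foreach (DataColumn dc in dt.Columns)
        {
            index++;
            w.WriteStartElement("s", "AttributeType", null);
            w.WriteAttributeString("name", dc.ColumnName);
            w.WriteAttributeString("rs", "number", null, index.ToString());
            w.WriteEndElement();
        }
        w.WriteStartElement("s", "extends", null);
        w.WriteAttributeString("type", "rs:rowbase");
        w.WriteEndElement();   // s:extends
        w.WriteEndElement();   // s:ElementType
        w.WriteEndElement();   // s:Schema

        // Content: one z:row per DataRow, one attribute per column
        w.WriteStartElement("rs", "data", null);
        foreach (DataRow row in dt.Rows)
        {
            w.WriteStartElement("z", "row", null);
            foreach (DataColumn dc in dt.Columns)
                w.WriteAttributeString(dc.ColumnName, row[dc].ToString());
            w.WriteEndElement();
        }
        w.WriteEndElement();   // rs:data

        w.WriteEndElement();   // xml
        w.Close();
        return sw.ToString();
    }
}
```

Calling ToRecordsetXml on a DataTable with firstname and lastname columns yields a schema with two numbered attribute types followed by one z:row element per row.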


Testing the XmlRecordsetWriter Class

For .NET Framework applications, using the XmlRecordsetWriter class is no big deal. You simply instantiate the class and call its methods, as shown here:

    void ButtonLoad_Click(object sender, System.EventArgs e)
    {
        // Create and display the XML document
        CreateDocument("adors.xml");
        UpdateUI("adors.xml");
    }

    void CreateDocument(string filename)
    {
        DataSet ds = LoadDataFromDatabase();
        XmlRecordsetWriter writer = new XmlRecordsetWriter(filename);
        writer.WriteRecordset(ds);
    }

Figure 4-11 shows the output of a sample application that creates the XML file and then displays it in a text box on the form.

Figure 4-11: An ADO XML Recordset object that has just been created and its contents displayed in a text box on the form.

The source DataSet object is fetched from the SQL Server Northwind database by executing the following query:

    SELECT employeeid, firstname, lastname FROM employees

The XML file that is created in this way is successfully recognized by ADO-driven applications, as shown in Figure 4-12.


Figure 4-12: COM-based applications based on ADO interoperate perfectly with the document that the XML writer has created by exporting ADO.NET data.

The following VBScript script proves just that:

    Const adClipString = 2
    Const adCmdFile = 256
    Set rs = CreateObject("ADODB.Recordset")
    rs.Open filename, Nothing, -1, -1, adCmdFile
    MsgBox rs.GetString(adClipString),, filename

Important
In the XML document that represents the data originally stored in an ADO.NET DataTable object, no type information exists. In spite of this, the XML document built so far is technically legal and correct, and all ADO-based applications can successfully manage it. All the various pieces of information in the document are rendered in the same way, that is, using Unicode strings, by means of the ADO adLongVarWChar data type.

Making those fields type-aware means adding some type information to the <s:AttributeType> node in the XML schema. You do this using a pair of attributes in the dt namespace (one of the namespaces defined in the root node) on a child <s:datatype> element. The <s:datatype> element describes the type of corresponding character data used in the parent attribute value. The main attribute of <s:datatype> is the dt:type attribute. For variable-length data types, XDR also allows you to specify a maximum length via the dt:maxLength attribute.

The .NET Framework type system and the ADO Recordset object recognize different types.
And ADO types are, in turn, different from predefined XDR data types. There's no easy way to obtain the XDR data type that corresponds to a .NET Framework Type object. Whenever type information is critical for the health of your application, you should figure out how to map the .NET Framework type of a DataTable object's column to an XDR type. In fact, you should exhaustively consider each .NET Framework type and map each to an element in another set of data types.

Comparing Writers and XML Writers

In the .NET Framework, a writer class is merely a document-producer object. It exposes ad hoc methods to let developers create the desired output using high-level tools. A method named WriteSchema that internally handles primitives to add nodes and attributes is much more understandable than, say, a StringBuilder object that you use to build markup text. An XML writer is just a specialized writer that handles XML text.

You can certainly design your own writer classes to quickly and easily enable developers to create certain compound documents. In doing so, though, you don't need to inherit from XmlWriter, XmlTextWriter, or BinaryWriter. Although you can, and often must, use those objects internally, the user-level interface should comprise methods and properties that reflect the nature and the structure of the final document.

As a general guideline, try to provide constructors that work over streams and text writers and to provide as many overloads as you can. For example, the XmlRecordsetWriter class can output its contents to streams and TextWriter-derived objects, including StringWriter objects.
The modular architecture of the .NET Framework makes achieving these goals relatively inexpensive and, therefore, there is no good reason for not exploiting it to the fullest.

A Read/Write XML Streaming Parser

XML readers and writers work in separate compartments and in an extremely specialized way. Readers just read, and writers just write. There is no way to force things to go differently, and in fact, the underlying streams are read-only or write-only as required. Suppose that your application manages lengthy XML documents that contain rather volatile data. Readers provide a powerful and effective way to read those contents. Writers, on the other hand, offer a fantastic tool to create that document from scratch. But if you want to read and write the document at the same time, you must necessarily resort to a full-fledged XML Document Object Model (XML DOM). What can you do to read and write an XML document without loading it entirely into memory?

In Chapter 5, I'll tackle the XML DOM model of a parser, which is the classic tool for performing read/write operations on an XML tree. The strength of the XML DOM parsers, but also their greatest drawback, lies in the fact that an XML DOM parser loads the whole XML document in memory, creates an ad hoc image of the tree, and lets you perform any sort of modification and search on the mapped nodes.
Keeping the nitty-gritty details of XML DOM warm for Chapter 5, in this section, we'll look at how to set up a mixed type of streaming parser that works as a kind of lightweight XML DOM parser.

The idea is that this parser will allow you to read the contents of a document one node at a time as with an XML (validating) reader but that, if needed, it can also perform some simple updates. By simple updates, I mean simply changing the value of an existing attribute, changing the contents of a node, or adding new attributes or nodes. For more complex operations, realistically nothing compares to XML DOM parsers.

Designing a Writer on Top of a Reader

In the .NET Framework, the XML DOM classes make intensive use of streaming readers and writers to build the in-memory tree and to flush it out to disk. Thus, readers and writers are definitely the only XML primitives available in the .NET Framework. Consequently, to build up a sort of lightweight XML DOM parser, we can only rely, once more, on readers and writers.


The inspiration for designing such a read/write streaming parser is database server cursors. With database server cursors, you visit records one after the next and, if needed, can apply changes on the fly. Database changes are immediately effective, and actually the canvas on which your code operates is simply the database table. The same model can be arranged to work with XML documents.

You will use a normal XML (validating) reader to visit the nodes in sequence. While reading, however, you are given the opportunity to change attribute values and node contents. Unlike the XML DOM, changes will have immediate effect. How can you obtain these results? The idea is to use an XML writer on top of the reader.

You use the reader to read each node in the source document and an underlying writer to create a hidden copy of it. In the copy, you can add some new nodes and ignore or edit some others. When you have finished, you simply replace the old document with the new one. You can decide to write the copy in memory or flush it in a temporary medium. The latter approach makes better use of the system's memory and saves you from possible troubles with the application's security level and zones.
(For example, partially trusted Windows Forms applications and ASP.NET applications running with default settings can't create or edit disk files.)

Built-In Support for Read/Write Operations

When I first began thinking about this lightweight XML DOM component, one of the key points I identified was an efficient way to copy (in bulk) blocks of nodes from the read-only stream to the write stream. Luckily enough, two somewhat underappreciated XmlTextWriter methods just happen to cover this tricky but boring aspect of two-way streaming: WriteAttributes and WriteNode.

The WriteAttributes method reads all the attributes available on the currently selected node in the specified reader. It then copies them as a single string to the current output stream. Likewise, the WriteNode method does the same for any other type of node. Note that WriteNode does nothing if the node type is XmlNodeType.Attribute.

The following code shows how to use these methods to create a copy of the original XML file, modified to skip some nodes. The XML tree is visited in the usual node-first approach using an XML reader. Each node is then processed and written out to the associated XML writer according to the index.
This code scans a document and writes out every other node.

    XmlTextReader reader = new XmlTextReader(inputFile);
    // XmlTextWriter has no constructor that takes only a file name;
    // a null encoding makes the writer emit UTF-8
    XmlTextWriter writer = new XmlTextWriter(outputFile, null);

    // Configure reader and writer
    writer.Formatting = Formatting.Indented;
    reader.MoveToContent();

    // Write the root
    writer.WriteStartElement(reader.LocalName);

    // Read and output every other node
    int i=0;
    while(reader.Read())
    {
        // C# does not convert int to bool, so compare explicitly
        if (i % 2 != 0)
            writer.WriteNode(reader, false);
        i++;
    }

    // Close the root
    writer.WriteEndElement();

    // Close reader and writer
    writer.Close();
    reader.Close();

You can aggregate the reader and the writer in a single new class and build a brand-new programming interface to allow for easy read/write streaming access to attributes or nodes.

Designing the XmlTextReadWriter Class

The XmlTextReadWriter class does not inherit from XmlReader or XmlWriter but, instead, coordinates the activity of running instances of both classes: one operating on a read-only stream, and one working on a write-only stream. The methods of the XmlTextReadWriter class read from the reader and write to the writer, applying any requested changes in the middle.

The XmlTextReadWriter class features three constructors, shown in the following code. These constructors let you indicate an input file and an optional output target, which can be a stream as well as a disk file. If the names of input and output files coincide, or if you omit the output file, the XmlTextReadWriter class uses a temporary file to collect the output and then automatically overwrites the input file.
The net effect of this procedure is that you simply modify your XML document without holding it all in memory, as XML DOM does.

    public XmlTextReadWriter(string inputFile)
    public XmlTextReadWriter(string inputFile, string outputFile)
    public XmlTextReadWriter(string inputFile, Stream outputStream)

The internal reader and writer are exposed through read-only properties named Reader and Writer, as shown here:

    public XmlTextReader Reader
    {
        get {return m_reader;}
    }

    public XmlTextWriter Writer
    {
        get {return m_writer;}
    }
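
The rule the constructors follow when picking an output target can be sketched in a few lines. This is only an illustration of the behavior described above, not the class's actual source: the OutputTarget class and its Resolve method are hypothetical names, and the case-insensitive path comparison is an assumption that suits Windows file systems.

```csharp
using System;
using System.IO;

public static class OutputTarget
{
    // Returns the path the writer should use. replaceInput is set to true
    // when the output goes to a temporary file that must later be copied
    // over the input file (no output given, or output same as input).
    public static string Resolve(string inputFile, string outputFile,
                                 out bool replaceInput)
    {
        bool sameFile = outputFile != null &&
            string.Equals(Path.GetFullPath(inputFile),
                          Path.GetFullPath(outputFile),
                          StringComparison.OrdinalIgnoreCase); // assumption

        replaceInput = (outputFile == null) || sameFile;

        // Path.GetTempFileName creates an empty file and returns its path
        return replaceInput ? Path.GetTempFileName() : outputFile;
    }
}
```

A constructor built on such a helper would store the flag and the temporary path in fields so that the final step of processing knows whether the input file must be overwritten.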


For simplicity, I assume that all the XML documents the class processes have no significant prolog (for example, processing instructions, comments, declarations, and DOCTYPE definitions). After all, the primary goal of this class is to provide for quick modification of simple XML files, mostly filled with settings of one kind or another. For more complete read/write manipulation of documents, you should resort to XML DOM trees.

Configuring the XmlTextReadWriter Class

Immediately after class initialization, the reader and the writer are configured to work properly. This process entails setting the policy for white spaces and setting the formatting options, as shown here:

    m_reader = new XmlTextReader(m_InputFile);
    m_writer = new XmlTextWriter(m_OutputStream, null);
    m_reader.WhitespaceHandling = WhitespaceHandling.None;
    m_writer.Formatting = Formatting.Indented;

    // Skip all noncontent nodes
    m_reader.Read();
    m_reader.MoveToContent();

I recommend that you configure the reader to ignore any white space so that it never returns any white space as a distinct node. This setting is correlated to the auto-formatting feature you might need on the writer.
If the reader returns white spaces as nodes and the writer indents any node being created, the use of the writer's WriteNode method can cause double formatting.

As you can see in the preceding code, the XmlTextReadWriter class also moves the internal reader pointer directly to the first content node, skipping any prolog node found in the source.

The XmlTextReadWriter Programming Interface

I designed the XmlTextReadWriter class with a minimal programming interface because, in most cases, what you really need is to combine the features of the reader and the writer to create a new and application-specific behavior such as updating a particular attribute on a certain node, deleting nodes according to criteria, or adding new trees of nodes. The class provides the methods listed in Table 4-9.

Table 4-9: Methods of the XmlTextReadWriter Class

    Method               Description
    AddAttributeChange   Caches all the information needed to perform a change on a node attribute. All the changes cached through this method are processed during a successive call to WriteAttributes.
    Read                 Simple wrapper around the internal reader's Read method.
    WriteAttributes      Specialized version of the writer's WriteAttributes method. Writes out all the attributes for the specified node, taking into account all the changes cached through the AddAttributeChange method.
    WriteEndDocument     Terminates the current document in the writer and closes both the reader and the writer.
    WriteStartDocument   Prepares the internal writer to output the document and adds default comment text and the standard XML prolog.

A read/write XML document is processed between two calls to WriteStartDocument and WriteEndDocument, shown in the following code. The former method initializes the underlying writer and writes a standard comment. The latter method completes the document by closing any pending tags and then closes both the reader and the writer.

    public void WriteStartDocument()
    {
        m_writer.WriteStartDocument();
        string text = String.Format("Modified: {0}", DateTime.Now.ToString());
        m_writer.WriteComment(text);
    }

    public void WriteEndDocument()
    {
        m_writer.WriteEndDocument();
        m_reader.Close();
        m_writer.Close();

        // If using a temp file name, overwrite the input
        if (m_ReplaceFile)
        {
            File.Copy(m_tempOutputFile, m_InputFile, true);
            File.Delete(m_tempOutputFile);
        }
    }

If you are not using a distinct file for output, WriteEndDocument also overwrites the original document with the temporary file in which the output has been accumulated in the meantime.

You can use any of the methods of the native interfaces of the XmlTextWriter and XmlTextReader classes. For simplicity, however, I endowed the XmlTextReadWriter class with a Read method and a NodeType property. Both are little more than wrappers for the corresponding method and property on the reader.
Here's how you initialize and start using the XmlTextReadWriter class:

    XmlTextReadWriter rw = new XmlTextReadWriter(inputFile);
    rw.WriteStartDocument();

    // Process the file

    rw.WriteEndDocument();

What happens between these two calls depends primarily on the nature and the goals of the application. You could, for example, change the value of one or more attributes, delete nodes, or replace the namespace. To accomplish whatever goal the application pursues, you can issue direct calls on the interface of the internal reader and writer as well as use the few methods specific to the XmlTextReadWriter class.

Bear in mind that reading and writing are completely distinct and independent processes that work according to slightly different models and strategies. When the reader is positioned on a node, no direct method can be called on the writer to make sure that just the value or the name of that node is modified. The following pseudocode, for example, does not correspond to reality:

    if (reader.Value > 100)
        writer.Value = 2*reader.Value;

To double the value of each node, you simply write a new document that mirrors the structure of the original, applying the necessary changes. To change the value of a node, you must first collect all the information about that node (including attributes) and then proceed with writing. One of the reasons for such an asymmetry in the reader's and writer's working model is that XML documents are hierarchical by nature and not flat like an INI or a CSV file.
In the section "A Full-Access CSV Editor," on page 192, I'll discuss a full read/write editor for CSV files for which the preceding pseudocode is much more realistic.

Testing the XmlTextReadWriter Class

Let's review three examples of how the XmlTextReadWriter class can be used to modify XML documents without using the full-blown XML DOM. Looking at the source code, you'll realize that a read/write streaming parser is mostly achieved by a smart and combined use of readers and writers.

By making assumptions about the structure of the XML source file, you can simplify that code while building the arsenal of the XmlTextReadWriter class with ad hoc properties such as Value or Name and new methods such as SetAttribute (which would be paired with the reader's GetAttribute method).

Changing the Namespace

For our first example, consider the problem of changing the namespace of all the nodes in a specified XML file. The XmlTextReadWriter parser will provide for this eventuality with a simple loop, as shown here:

    void ChangeNamespace(string prefix, string ns)
    {
        XmlTextReadWriter rw;
        rw = new XmlTextReadWriter(inputFile);
        rw.WriteStartDocument();

        // Modify the root tag manually
        rw.Writer.WriteStartElement(rw.Reader.LocalName);
        rw.Writer.WriteAttributeString("xmlns", prefix, null, ns);

        // Loop through the document
        while(rw.Read())
        {
            switch(rw.NodeType)
            {
                case XmlNodeType.Element:
                    rw.Writer.WriteStartElement(prefix, rw.Reader.LocalName, null);
                    rw.Writer.WriteAttributes(rw.Reader, false);
                    if (rw.Reader.IsEmptyElement)
                        rw.Writer.WriteEndElement();
                    break;
            }
        }

        // Close the root tag
        rw.Writer.WriteEndElement();

        // Close the document and any internal resources
        rw.WriteEndDocument();
    }

The code starts by manually writing the root node of the source file. Next it adds an xmlns attribute with the specified prefix and the URN. The main loop scans all the contents of the XML file below the root node. For each element node, it writes a fully qualified new node whose name is the just-read local name with a prefix and namespace URN supplied by the caller, as shown here:

    rw.Writer.WriteStartElement(prefix, rw.Reader.LocalName, null);

Because attributes are unchanged, they are simply copied using the writer's WriteAttributes method, as shown here:

    rw.Writer.WriteAttributes(rw.Reader, false);

The node is closed within the loop only if it has no further contents to process. Figure 4-13 shows the sample application. In the upper text box, you see the original file. The bottom text box contains the modified document with the specified namespace information.
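
The listing above handles only Element nodes, which suits flat settings files whose entries are empty elements. The following sketch shows one way the same loop might be extended to documents with nested elements and text. It is an illustration rather than the sample's source: the NamespaceChanger class is a hypothetical name, it works on strings instead of files, it uses the framework reader and writer directly rather than XmlTextReadWriter, and, unlike the listing, it places the root element in the new namespace too.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NamespaceChanger
{
    // Rewrites every element of the document so that it carries the
    // given prefix and namespace; attributes and text are copied as is.
    public static string Apply(string xml, string prefix, string ns)
    {
        var output = new StringWriter();
        using (var reader = new XmlTextReader(new StringReader(xml)))
        using (var writer = new XmlTextWriter(output))
        {
            reader.WhitespaceHandling = WhitespaceHandling.None;
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(prefix, reader.LocalName, ns);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty)
                            writer.WriteEndElement();
                        break;
                    case XmlNodeType.EndElement:
                        // Close the element opened when its start tag was read
                        writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

Because the writer tracks which prefixes are in scope, the namespace declaration is emitted once on the outermost element and reused by its descendants. Comments, processing instructions, and CDATA sections are dropped by this sketch; a fuller version would add cases for them.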


Figure 4-13: All the nodes in the XML document shown in the bottom text box now belong to the specified namespace.

Updating Attribute Values

The ultimate goal of our second example is changing the values of one or more attributes on a specified node. The XmlTextReadWriter class lets you do that in a single visit to the XML tree. You specify the node and the attribute name as well as the old and the new value for the attribute.

In general, the old value is necessary just to ensure that you update the correct attribute on the correct node. In fact, if an XML document contains other nodes with the same name, you have no automatic way to determine which is the appropriate node to update. Checking the old value of the attribute is just one possible workaround. If you can make some assumptions about the structure of the XML document, this constraint can be easily relaxed.

As mentioned, the update takes place by essentially rewriting the source document, one node at a time. In doing so, you can use updated values for both node contents and attributes. The attributes of a node are written in one shot, so multiple changes must be cached somewhere. There are two possibilities. One approach adds a richer set of properties and methods that more closely mimics the reader. You could expose a read/write Value property. Then, when the property is written, you internally cache the new value and make use of it when the attributes of the parent node are serialized.

Another approach, the one you see implemented in the following code, is based on an explicit and application-driven cache.
Each update is registered using an internal DataTable object made up of four fields: node name, attribute name, old value, and new value.

    rw.AddAttributeChange(nodeName, attribName, oldVal, newVal);

The same DataTable object will contain attribute updates for each node in the document. To persist the changes relative to a specified node, you use the XmlTextReadWriter class's WriteAttributes method, shown here:

    public void WriteAttributes(string nodeName)
    {
        if (m_reader.HasAttributes)
        {
            // Consider only the attribute changes for the given node
            DataView view = new DataView(m_tableOfChanges);
            view.RowFilter = "Node='" + nodeName + "'";

            while(m_reader.MoveToNextAttribute())
            {
                // Begin writing the attribute
                m_writer.WriteStartAttribute(m_reader.Prefix,
                    m_reader.LocalName, m_reader.NamespaceURI);

                // Search for a corresponding entry
                // in the table of changes
                DataRow[] rows = m_tableOfChanges.Select(
                    "Attribute='" + m_reader.LocalName +
                    "' AND OldValue='" + m_reader.Value + "'");
                if (rows.Length > 0)
                {
                    DataRow row = rows[0];
                    m_writer.WriteString(row["NewValue"].ToString());
                }
                else
                    m_writer.WriteString(m_reader.Value);
            }
        }

        // Move back the internal pointer
        m_reader.MoveToElement();

        // Clear the table of changes
        m_tableOfChanges.Rows.Clear();
        m_tableOfChanges.AcceptChanges();
    }

The following code, called by a client application, creates a copy of the source document and updates node attributes:

    void UpdateValues(string nodeName, string attribName,
        string oldVal, string newVal)
    {
        XmlTextReadWriter rw;
        rw = new XmlTextReadWriter(inputFile, outputFile);
        rw.WriteStartDocument();

        // Modify the root tag manually
        rw.Writer.WriteStartElement(rw.Reader.LocalName);

        // Prepare attribute changes
        rw.AddAttributeChange(nodeName, attribName, oldVal, newVal);

        // Loop through the document
        while(rw.Read())
        {
            switch(rw.NodeType)
            {
                case XmlNodeType.Element:
                    rw.Writer.WriteStartElement(rw.Reader.LocalName);
                    if (nodeName == rw.Reader.LocalName)
                        rw.WriteAttributes(nodeName);
                    else
                        rw.Writer.WriteAttributes(rw.Reader, false);
                    if (rw.Reader.IsEmptyElement)
                        rw.Writer.WriteEndElement();
                    break;
            }
        }

        // Close the root tag
        rw.Writer.WriteEndElement();

        // Close the document and any internal resources
        rw.WriteEndDocument();
    }

Figure 4-14 shows the output of the sample application from which the preceding code is excerpted.
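
For a single change, the same read-and-rewrite idea can be tried without the DataTable cache. The sketch below is a simplified variant, not the sample's implementation: the AttributeUpdater class is a hypothetical name, the document is passed as a string, and a namespace-free settings file is assumed, so elements and attributes are written by their plain names.

```csharp
using System;
using System.IO;
using System.Xml;

public static class AttributeUpdater
{
    // Copies the document, replacing oldVal with newVal on the attribute
    // attribName of every element named nodeName.
    public static string Update(string xml, string nodeName,
        string attribName, string oldVal, string newVal)
    {
        var output = new StringWriter();
        using (var reader = new XmlTextReader(new StringReader(xml)))
        using (var writer = new XmlTextWriter(output))
        {
            reader.WhitespaceHandling = WhitespaceHandling.None;
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement;
                        bool isTarget = (reader.LocalName == nodeName);
                        writer.WriteStartElement(reader.Name);

                        // Visit the attributes one at a time
                        while (reader.MoveToNextAttribute())
                        {
                            bool match = isTarget &&
                                reader.LocalName == attribName &&
                                reader.Value == oldVal;
                            writer.WriteAttributeString(reader.Name,
                                match ? newVal : reader.Value);
                        }

                        // Move back to the element that owns the attributes
                        reader.MoveToElement();
                        if (isEmpty)
                            writer.WriteEndElement();
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

Checking the old value plays the same role as in the DataTable approach: an element with the right name but a different current value is copied unchanged.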


Figure 4-14: The code can be used to change the value of the forecolor attribute from blue to black.

Adding and Deleting Nodes

A source XML document can also be easily read and modified by adding or deleting nodes. Let's look at a couple of examples.

To add a new node, you simply read until the parent is found and then write an extra set of nodes to the XML writer. Because there might be other nodes with the same name as the parent, use a Boolean guard to ensure that the insertion takes place only once. The following code demonstrates how to proceed:

    void AddUser(string name, string pswd, string role)
    {
        XmlTextReadWriter rw;
        rw = new XmlTextReadWriter(inputFile, outputFile);
        rw.WriteStartDocument();

        // Modify the root tag manually
        rw.Writer.WriteStartElement(rw.Reader.LocalName);

        // Loop through the document
        bool mustAddNode = true; // Only once
        while(rw.Read())
        {
            switch(rw.NodeType)
            {
                case XmlNodeType.Element:
                    rw.Writer.WriteStartElement(rw.Reader.LocalName);
                    if ("Users" == rw.Reader.LocalName && mustAddNode)
                    {
                        mustAddNode = false;
                        rw.Writer.WriteStartElement("User");
                        rw.Writer.WriteAttributeString("name", name);
                        rw.Writer.WriteAttributeString("password", pswd);
                        rw.Writer.WriteAttributeString("role", role);
                        rw.Writer.WriteEndElement();
                    }
                    else
                        rw.Writer.WriteAttributes(rw.Reader, false);
                    if (rw.Reader.IsEmptyElement)
                        rw.Writer.WriteEndElement();
                    break;
            }
        }

        // Close the root tag
        rw.Writer.WriteEndElement();

        // Close the document and any internal resources
        rw.WriteEndDocument();
    }

To delete a node, you simply ignore it while reading the document. For example, the following code removes a node in which the name attribute matches a specified string:

    while(rw.Read())
    {
        switch(rw.NodeType)
        {
            case XmlNodeType.Element:
                if ("User" == rw.Reader.LocalName)
                {
                    // Skip if name matches
                    string userName = rw.Reader.GetAttribute("name");
                    if (userName == name)
                        break;
                }

                // Write in the output file if no match has been found
                rw.Writer.WriteStartElement(rw.Reader.LocalName);
                rw.Writer.WriteAttributes(rw.Reader, false);
                if (rw.Reader.IsEmptyElement)
                    rw.Writer.WriteEndElement();
                break;
        }
    }

Figure 4-15 shows this code in action. The highlighted record has been deleted because of the matching value of the name attribute.

Figure 4-15: A sample application to test the class's ability to add and delete nodes.

Note: The entire sample code illustrating the XmlTextReadWriter class and its way of working is available in this book's sample files. The all-encompassing Microsoft Visual Studio .NET solution is named XmlReadWriter.

A Full-Access CSV Editor

In Chapter 2, we looked at the XmlCsvReader class as an example of a custom XML reader. The XmlCsvReader class enables you to review the contents of a CSV file through nodes and attributes and the now-familiar semantics of XML readers. In this section, I'll go one step further and illustrate a full-access CSV reader capable of reading and writing: the XmlCsvReadWriter class.

The new class inherits from XmlCsvReader and modifies only a few methods and properties. The XmlCsvReadWriter class works by using a companion output stream to which each row, once read and modified, is persisted prior to reading a new row. The XmlCsvReadWriter class is declared as follows:

    public class XmlCsvReadWriter : XmlCsvReader
    {
        public XmlCsvReadWriter(string filename, bool hasColumnHeaders, bool enableOutput)
        { ... }
        ...
    }

The class has a new constructor with a third argument, the Boolean value enableOutput, which specifies whether the class should use a hidden output stream. Basically, by setting enableOutput to true, you declare your intent to use the class as a reader/writer instead of a simple reader. When this happens, the constructor creates a temporary file and a stream writer to work on it. At the end of the reading, this output file contains the modified version of the CSV and is used to replace the original file. A new property, named EnableOutput, can be used to programmatically enable and disable the output stream.

Shadowing the Class Indexer

The Item indexer property (that is, the property that permits the popular reader[index] syntax) is declared as read-only in the abstract XmlReader base class. This means that any derived class can't replace that property with another one that is read/write. However, the XmlCsvReader class provides a complete implementation of the abstract functionality defined in XmlReader. So when deriving from XmlCsvReader, you can simply shadow the base Item property and replace it with a brand-new one with both get and set accessors.

The following code is at the heart of the new CSV reader/writer class. It extends the Item property to make it work in a read/write fashion. The get accessor is identical to the base class. The set accessor copies the specified value in the m_tokenValues collection, in which the attributes of the current CSV row are stored.
(See Chapter 2 for more details about the internal architecture of the CSV sample XML reader.)

    new public string this[int i]
    {
        get
        {
            return base[i].ToString();
        }
        set
        {
            // The Item[index] property is read-only, so
            // use the Item[string] overload
            string key = m_tokenValues.Keys[i].ToString();
            m_tokenValues[key] = value;
        }
    }

Notice the use of the new keyword to shadow the same property defined on the base class. This trick alone paves the road for the read/write feature.
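
The shadowing technique can be tried in isolation. The sketch below is a stand-alone illustration that assumes two small hypothetical classes, ReadOnlyRow and ReadWriteRow, in place of XmlCsvReader and XmlCsvReadWriter; only the NameValueCollection field mirrors the m_tokenValues member mentioned above.

```csharp
using System;
using System.Collections.Specialized;

// Stand-in for a base class whose indexer exposes only a get accessor
public class ReadOnlyRow
{
    protected readonly NameValueCollection m_tokenValues =
        new NameValueCollection();

    public void Add(string name, string value)
    {
        m_tokenValues.Add(name, value);
    }

    public string this[int i]
    {
        get { return m_tokenValues[i]; }
    }
}

// The derived class hides the base indexer with a read/write one
public class ReadWriteRow : ReadOnlyRow
{
    public new string this[int i]
    {
        get { return base[i]; }
        set
        {
            // The collection's integer indexer is read-only,
            // so look up the key and assign through the string indexer
            string key = m_tokenValues.Keys[i];
            m_tokenValues[key] = value;
        }
    }
}
```

Member hiding is resolved on the compile-time type of the variable. A ReadWriteRow instance accessed through a ReadOnlyRow reference therefore still exposes only the get accessor, although it reads the value stored through the derived indexer.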


Note: The new keyword is C#-specific. To achieve the same effect with Microsoft Visual Basic .NET, you must use the Shadows keyword. Also note that, when it comes to overloading a method in a derived class, you don't need to mark it in any way if the language of choice is C#. If you use Visual Basic .NET, the overload must be explicitly declared using the Overloads keyword.

In addition, bear in mind that a standard NameValueCollection object allows you to update a value only if you can pass the key string to the indexer, as shown here:

    public string this[int] {get;}
    public string this[string] {get; set;}

The new Item indexer property allows you to write code, as the following code snippet demonstrates:

    for(int i=0; i


        ow for headers, now add that prior to writing the
        // first data row.
        if (HasColumnHeaders && !m_firstRowRead)
        {
            m_firstRowRead = true;
            string header = "";
            foreach(string tmp in m_tokenValues)
                header += tmp + ",";
            m_outputStream.WriteLine(header.TrimEnd(','));
        }

        // Prepare and write the current CSV row
        string row = "";
        foreach(string tmp in m_tokenValues)
            row += m_tokenValues[tmp] + ",";
        m_outputStream.WriteLine(row.TrimEnd(','));
    }

    // Move ahead as usual
    return base.Read();
    }

If the first row in the source CSV file has been interpreted as the headers of the columns (HasColumnHeaders property set to true), this implementation of the Read method ensures that the very first row written to the output stream contains just those headers. After that, the current contents of the m_tokenValues collection are serialized to a comma-separated string and written to the output stream. Once this has been done, the Read method finally moves to the next line.

Closing the Output Stream

When you close the reader, the output stream is also closed. In addition, because the output stream was writing to a temporary file, that file is then copied over the source CSV file, replacing it, as shown here:

    public override void Close()
    {
        base.Close();
        if (EnableOutput)
        {
            m_outputStream.Close();
            File.Copy(m_tempFileName, m_fileName, true);
            File.Delete(m_tempFileName);
        }
    }

The net effect of this code is that any changes entered in the source CSV document are cached to a temporary file, which then replaces the original. The user won't perceive anything of these workings, however.

The CSV Reader/Writer in Action

Let's take a sample CSV file, read it, and apply some changes to the contents so that they will automatically be persisted when the reader is closed. Here is the source CSV file:

    LastName,FirstName,Title,Country
    Davolio,Nancy,Sales Representative,USA
    Fuller,Andrew,Sales Manager,USA
    Leverling,Janet,Sales Representative,UK
    Suyama,Michael,Sales Representative,UK

The idea is to replace the expression Sales Representative with another one, say, Sales Force. The sample application, nearly identical to the one in Chapter 2, loads the CSV file, applies the changes, and then displays it through a desktop DataGrid control, as follows:

    // Instantiate the reader on a CSV file
    XmlCsvReadWriter reader;
    reader = new XmlCsvReadWriter("employees.csv", hasHeader.Checked);
    reader.EnableOutput = true;
    reader.Read();

    // Define the schema of the table to bind to the grid
    DataTable dt = new DataTable();
    for(int i=0; i


            if (reader[i] == "Sales Representative")
                reader[i] = "Sales Force";
            row[i] = reader[i].ToString();
        }
        dt.Rows.Add(row);
    }
    while (reader.Read()); // Persist changes and move ahead

    // Flushes the changes to disk
    reader.Close();

    // Bind the table to the grid
    dataGrid1.DataSource = dt;

If the contents of a specified CSV attribute match the specified string, they are replaced. The change occurs initially on an internal collection and is then transferred to the output stream during the execution of the Read method. Finally, the reader is closed and the output stream flushed. Figure 4-16 shows the program in action.

Figure 4-16: The original CSV file has been read and updated on disk.

Conclusion

Readers and writers are at the foundation of every I/O operation in the .NET Framework. You find them at work when you operate on disk and on network files, when you serialize and deserialize, while you perform data access, even when you read and write configuration settings.

XML writers are ad hoc tools for creating XML documents using a higher-level metaphor and putting more abstraction between your code and the markup. By using XML writers, you go far beyond markup to reach a node-oriented dimension in which, instead of just accumulating bytes in a block of contiguous memory, you assemble nodes and entities to create the desired schema and infoset.

In this chapter, we looked primarily at the programming interface of .NET XML writers, specifically the XmlTextWriter class.
You learned how to create well-formed XML documents, how to add nodes and attributes, how to support namespaces, and how to encode text using BinHex and base64 encoding algorithms.


.NET XML writers only ensure the well-formedness of each individual XML element being generated. Writers can in no way guarantee the well-formedness of the entire document and can do even less to validate a document against a DTD or a schema. Although badly formed XML documents can only result from actual gross programming errors, the need for an extra step of validation is often felt in production environments, especially when the creation of the document depends on a number of variable factors and run-time conditions. For this reason, we've also examined the key points involved in the design and implementation of a validating XML writer.

This chapter also featured a few custom XML-driven writers. In this chapter, you learned how to write string arrays, JPEG images, and DataTable objects to specific XML schemas. It goes without saying that the techniques discussed here do not exhaust the options available in the .NET Framework for those tasks. For example, the XML serializer can sometimes be more effectively employed to obtain the same results. (XML serializers are covered in Chapter 11.)

These examples were provided with a double goal: to show one way to solve a problem, and to demonstrate custom XML writers. As a general guideline, bear in mind that the more specific an XML-based format is, the more a specialized writer class can help.
The key advantage of a writer class is perhaps not so much raw per<strong>for</strong>mancesav<strong>in</strong>gs but the resultant elegance, reusability, and efficiency of the design.We've also looked at an <strong>in</strong>termediate level of <strong>XML</strong> parser that falls somewhere betweenstream<strong>in</strong>g parsers such as readers and <strong>XML</strong> DOM. <strong>XML</strong> readers are great <strong>for</strong> pars<strong>in</strong>g<strong>XML</strong> documents, but they work <strong>in</strong> a read-only way. <strong>XML</strong> DOM parsers, on the otherhand, make updat<strong>in</strong>g documents a snap—but only after the documents have been fullyloaded <strong>in</strong> memory. The XmlTextReadWriter class <strong>in</strong>corporates a reader and a writerand coord<strong>in</strong>ates their <strong>in</strong>dependent activity through a simple new API. As a result, youcan parse a document one node at a time while ma<strong>in</strong>ta<strong>in</strong><strong>in</strong>g the ability to add, update,or delete nodes. The new class is not the cure-all <strong>for</strong> any <strong>XML</strong> pa<strong>in</strong>s, but it can be an<strong>in</strong>terest<strong>in</strong>g option <strong>in</strong> some situations.In Chapter 5, we'll exam<strong>in</strong>e the <strong>XML</strong> DOM classes that you must use when fullread/write access to <strong>XML</strong> documents is critical and when the ability to per<strong>for</strong>m searchestakes precedence over the memory footpr<strong>in</strong>t.Further Read<strong>in</strong>gThis chapter touches on a number of topics that you might want to know more about.Some are <strong>XML</strong>-related, but not so much .<strong>NET</strong>-related as to f<strong>in</strong>d an ideal place <strong>for</strong>discussion here. 
Some are not really XML-related but definitely belong to the .NET Framework and, as such, deserve at least a reference here.

One topic we spent a lot of time on in this chapter is XML namespaces and qualified names. The official site where the specification can be found is http://www.w3.org/TR/REC-xml-names. In Chapter 3, I covered XML validation and the various schemas involved in the process. If you think you need an XML crash course from a higher, non-.NET-Framework-related perspective, I can recommend two books. One is Essential XML, by Don Box, John Lam, and Aaron Skonnard (Addison-Wesley, 2000). This reference is great if you need to get the gist of XML in a platform-independent and language-independent context. Otherwise, look at the XML Programming Core Reference, by R. Allen Wyke, Sultan Rehman, Brad Leupen, and Ash Rofail (Microsoft Press, 2002), for more development-related considerations and tips.

A great source for learning about underdocumented features and tricks of the .NET Framework is certainly Jeffrey Richter's most recent book, Applied .NET Framework Programming (Microsoft Press, 2002). This book is a gold mine for all that boring stuff that revolves around string manipulation, character encoding, and memory management.

One of the examples discussed in this chapter entails the creation of an XML ADO Recordset object from ADO.NET-specific objects such as DataSet, DataTable, and DataView. A more thorough discussion of the integration between ADO and ADO.NET can be found in Chapter 8 of my book Building Web Solutions with ASP.NET and ADO.NET (Microsoft Press, 2002). Although that book is markedly ASP.NET-specific, the theme of how to efficiently use ADO from .NET Framework applications is fairly platform-independent and can be applied to Windows Forms as well.

Finally, this chapter touches on .NET code security. If you need to get started with security and are looking for a long-range perspective with concrete code snippets sprinkled here and there, by all means check out Jason Clark's excellent article at http://msdn.microsoft.com/msdnmag/issues/02/06/rich/rich.asp.


Part II: XML Data Manipulation

Chapter List
Chapter 5: The XML .NET Document Object Model
Chapter 6: XML Query Language and Navigation
Chapter 7: XML Data Transformation

Part Overview


Chapter 5: The XML .NET Document Object Model

Overview

In addition to XML readers and writers, the Microsoft .NET Framework provides classes that parse XML documents according to the W3C Document Object Model (DOM) Level 1 Core and the DOM Level 2 Core. These classes, available in the System.Xml namespace, build a complete in-memory representation of the contents of an XML document and make it programmatically accessible during both read and write operations.

The structure of the XML Document Object Model (XML DOM) is a general specification that is implemented using platform-specific features and components. The MSXML library provides a COM-based XML DOM implementation for the Microsoft Win32 platform. The System.Xml assembly provides a .NET Framework-specific implementation of the XML DOM centered on the XmlDocument class.

Although it is stored as flat text in a linear text file, XML content is inherently hierarchical. Readers simply parse the text as it is read out of the input stream. They never cache read information and work in a stateless fashion. As a result of this arrangement, you can neither edit nodes nor move backward. The limited navigation capabilities also prevent you from implementing node queries of any complexity. The XML DOM philosophy is quite different. XML DOM loads all the XML content in memory and exposes it through a suite of collections that, overall, offer a tree-based representation of the original content. In addition, the supplied data structure is fully searchable and editable.

Advanced searching and editing are the primary functions of the XML DOM, whereas readers (and Simple API for XML [SAX] parsers as well) are optimized for document inspection, simple searching, and any sort of read-only activity. In Chapter 2, we explored the characteristics of pull mode readers. Let's analyze now the .NET Framework programming interface for full-access XML document processing.

The XML DOM Programming Interface

The central element in the .NET XML DOM implementation is the XmlDocument class. The XmlDocument class represents an XML document and makes it programmable by exposing its nodes and attributes through ad hoc collections. Let's consider a simple XML document:

[XML listing with three employee records: 1 Nancy Davolio, 2 Andrew Fuller, 3 Janet Leverling. The element markup is missing from this copy.]

When processed by an instance of the XmlDocument class, this file creates a tree like the one shown in Figure 5-1.

Figure 5-1: Graphical representation of an XML DOM tree.

The XmlDocument class represents the entry point in the binary structure and the central console that lets you move through nodes, reading and writing contents. Each element in the original XML document is mapped to a particular .NET Framework class with its own set of properties and methods. Each element can be reached from the parent and can access all of its children and siblings. Element-specific information such as contents and attributes is available via properties.

Any change you enter is applied immediately, but only in memory. The XmlDocument class does provide an I/O interface to load from, and save to, a variety of storage media, including disk files. Subsequently, all the changes to constituent elements of an XML DOM tree are normally persisted all at once.

Note
The W3C DOM Level 1 Core and Level 2 Core do not yet mandate an official API for serializing documents to and from XML format. Such an API will come only with the DOM Level 3 specification, which at this time is only a working draft.

Before we look at the key tasks you might want to accomplish using the XML DOM programming interface, let's review the tools that this interface provides. In particular, we'll focus here on two major classes: the XmlDocument class and the XmlNode class.
A third class, XmlDataDocument, which is tightly coupled with XML DOM in general, and XmlDocument in particular, will be covered in Chapter 8. XmlDataDocument represents the connecting link between the hierarchical world of XML and the relational world of ADO.NET DataSet objects.

The XmlDocument Class

When you need to load an XML document into memory for full-access processing, you start by creating a new instance of the XmlDocument class. The class features two public constructors, one of which is the default parameterless constructor, as shown here:

    public XmlDocument();
    public XmlDocument(XmlNameTable);

While initializing the XmlDocument class, you can also specify an existing XmlNameTable object to help the class work faster with attribute and node names and optimize memory management. Just as the XmlReader class does, XmlDocument builds its own name table incrementally while processing the document. However, passing a precompiled name table can only speed up the overall execution. The following code snippet demonstrates how to load an XML document into a living instance of the XmlDocument class:

    XmlDocument doc = new XmlDocument();
    doc.Load(fileName);

The Load method always works synchronously, so when it returns, the document has been completely (and successfully, we hope) mapped to memory and is ready for further processing through the properties and methods exposed by the class.
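As a minimal sketch of the two constructors just described, the following program builds two documents on one shared name table and loads each from a string. The `<employees>` markup and the class name `LoadSketch` are invented for illustration and are not taken from the book's sample files; `LoadXml` is used instead of `Load` only so that the example needs no file on disk.

```csharp
using System;
using System.Xml;

class LoadSketch
{
    static void Main()
    {
        // A name table created up front and handed to both documents.
        NameTable names = new NameTable();

        XmlDocument doc1 = new XmlDocument(names);
        XmlDocument doc2 = new XmlDocument(names);

        // LoadXml parses a string; Load would accept a file name, stream, or reader.
        doc1.LoadXml("<employees><employee id=\"1\">Nancy</employee></employees>");
        doc2.LoadXml("<employees><employee id=\"2\">Andrew</employee></employees>");

        // Loading is synchronous: the tree is fully built once the call returns.
        Console.WriteLine(doc1.DocumentElement.Name);

        // Checks whether both documents expose the name table they were given.
        Console.WriteLine(object.ReferenceEquals(doc1.NameTable, doc2.NameTable));
    }
}
```

Whether sharing a name table pays off depends on how many documents with overlapping element and attribute names an application handles; for a single small document the default constructor is usually enough.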
As you'll see in a bit more detail later in this section, the XmlDocument class uses an XML reader internally to perform any read operation and to build the final tree structure for the source document.

Note
In spite of what the beginning of this chapter might suggest, the XmlDocument class is just the logical root class of the XML DOM class hierarchy. The XmlDocument class actually inherits from the XmlNode class and is placed at the same level as classes like XmlElement, XmlAttribute, and XmlEntity that you manipulate as child elements when processing an XML document. In other words, XmlDocument is not designed as a wrapper class for XML node classes. Its design follows the XML key guideline, according to which everything in a document is a node, including the document itself.

Properties of the XmlDocument Class

Table 5-1 lists the properties supported by the XmlDocument class. The table includes only the properties that the class introduces or overrides. These properties are specific to the XmlDocument class or have a class-specific implementation. More properties are available through the base class XmlNode, which we'll examine in more detail in the section "The XmlNode Base Class."

Table 5-1: Properties of the XmlDocument Class

Property: Description
BaseURI: Gets the base URI of the document (for example, the file path).
DocumentElement: Gets the root of the document as an XmlElement object.
DocumentType: Gets the node with the DOCTYPE declaration (if any).
Implementation: Gets the XmlImplementation object for the document.
InnerXml: Gets or sets the markup representing the body of the document.
IsReadOnly: Indicates whether the document is read-only.
LocalName: Returns the string #document.
Name: Returns the string #document.
NameTable: Gets the NameTable object associated with this implementation of the XmlDocument class.
NodeType: Returns the value XmlNodeType.Document.
OwnerDocument: Returns null. The XmlDocument object is not owned.
PreserveWhitespace: Gets or sets a Boolean value indicating whether to preserve white space during the load and save process. Set to false by default.
XmlResolver: Write-only property that specifies the XmlResolver object to use for resolving external resources. Set to null by default.

Note
In Table 5-1, you'll find the description of the property for a special type of XML node: the XmlNodeType.Document node. In some instances, this same property is shared with other nodes, in which case it behaves in a slightly different manner. So read this table with a grain of salt and replace the word document with the more generic word node when appropriate. For example, the OwnerDocument property returns null if the node is Document but returns the owner XmlDocument object in all other cases. Similarly, both Name and LocalName always return #document for XmlDocument, but they actually represent the qualified and simple (namespace-less) name of the particular node.

By default, the PreserveWhitespace property is set to false, which indicates that only significant white spaces will be preserved while the document is loaded. A significant white space is any white space found between markup in a mixed-contents node or any white space found within the subtree affected by the following declaration:

    xml:space="preserve"

All spaces are preserved throughout the document if PreserveWhitespace is set to true before the Load method is called. As for writing, if PreserveWhitespace is set to true when the Save method is called, all spaces are preserved in the output. Otherwise, the serialized output is automatically indented. This behavior represents a proprietary extension over the standard DOM specification.

The XmlDocument Implementation

The Implementation property of the XmlDocument class defines the operating context for the document object. Implementation returns an instance of the XmlImplementation class, which provides methods for performing operations that are independent of any particular instance of the DOM.

In the base implementation of the XmlImplementation class, the list of operations that various instances of XmlDocument classes can share is relatively short.
These operations include creating new documents, testing for supported features, and more important, sharing the same name table.

The XmlImplementation class is not sealed, so you could try to define a custom implementation object and use that to create new XmlDocument objects with some nonstandard settings (for example, PreserveWhitespace set to true by default). The following code snippet shows how to create two documents from the same implementation:

    XmlImplementation imp = new XmlImplementation();
    XmlDocument doc1 = imp.CreateDocument();
    XmlDocument doc2 = imp.CreateDocument();

The following code shows how XmlImplementation could work with a custom implementation object:

    MyImplementation imp = new MyImplementation();
    XmlDocument doc = imp.CreateDocument();

In the section "Custom Node Classes," when we examine XML DOM extensions, I'll have more to say about custom implementations.

Note
Two instances of XmlDocument can share the same implementation when the implementation is custom. Actually, all instances of XmlDocument share the same standard XmlImplementation object. Sharing the same implementation does not mean that the two objects are each other's clone, however. The XML implementation is a kind of common runtime that services both objects.

Methods of the XmlDocument Class

Table 5-2 lists the methods supported by the XmlDocument class. The list includes only the methods that XmlDocument introduces or overrides; more methods are available through the base class XmlNode. (See the section "The XmlNode Base Class.")

Table 5-2: Methods of the XmlDocument Class

Method: Description
CloneNode: Creates a duplicate of the document.
CreateAttribute: Creates an attribute with the specified name.
CreateCDataSection: Creates a CDATA section with the specified data.
CreateComment: Creates a comment with the specified text.
CreateDocumentFragment: Creates an XML fragment. Note that a fragment node can't be inserted into a document; however, you can insert any of its children into a document.
CreateDocumentType: Creates a DOCTYPE element.
CreateElement: Creates a node element.
CreateEntityReference: Creates an entity reference with the specified name.
CreateNode: Creates a node of the specified type.
CreateProcessingInstruction: Creates a processing instruction.
CreateSignificantWhitespace: Creates a significant white space node.
CreateTextNode: Creates a text node. Note that text nodes are allowed only as children of elements, attributes, and entities.
CreateWhitespace: Creates a white space node.
CreateXmlDeclaration: Creates the standard XML declaration.
GetElementById: Gets the element in the document with the given ID.
GetElementsByTagName: Returns the list of descendant elements that match the specified tag name.
ImportNode: Imports a node from another document.
Load: Loads XML data from the specified source.
LoadXml: Loads XML data from the specified string.
ReadNode: Creates an XmlNode object based on the information read from the given XML reader.
Save: Saves the current document to the specified location.
WriteContentTo: Saves all the children of the current document to the specified XmlWriter object.
WriteTo: Saves the current document to the specified writer.

As you can see, the XmlDocument class has a lot of methods that create and return instances of node objects. In the .NET Framework, none of the objects that represent a node type (Comment, Element, Attribute, and so on) has a publicly usable constructor. For this reason, you must resort to the corresponding factory method.

How can the XmlDocument class create and return instances of other node objects if no public constructor for them is available? The trick is that node classes mark their constructors with the internal modifier (Friend in Microsoft Visual Basic). The internal keyword restricts the default visibility of a type method or property to the boundaries of the assembly. The internal keyword works on top of other modifiers like public and protected. XmlDocument and other node classes are all defined in the System.Xml assembly, which ensures the effective working of factory methods. The following pseudocode shows the internal architecture of a factory method:

    public virtual XmlXXX CreateXXX( params )
    {
        return new XmlXXX ( params );
    }

Note
When the node class is XmlDocument, the methods WriteTo and WriteContentTo happen to produce the same output, although they definitely run different code. WriteTo is designed to persist the entire contents of the node, including the markup for the node, attributes, and children. WriteContentTo, on the other hand, walks its way through the collection of child nodes and persists the contents of each using WriteTo.
Here's the pseudocode:

    void WriteContentTo(XmlWriter w) {
        foreach(XmlNode n in this)
            n.WriteTo(w);
    }

A Document node is a kind of super root node, so the loop on all child nodes begins with the actual root node of the XML document. In this case, WriteTo simply writes out the entire contents of the document but the super root node has no markup. As a result, the two methods produce the same output for the XmlDocument class.

Events of the XmlDocument Class

Table 5-3 lists the events that the XmlDocument class fires under the following specific conditions: when the value of a node (any node) is being edited, and when a node is being inserted into or removed from the document.

Table 5-3: Events of the XmlDocument Class

Events: Description
NodeChanging, NodeChanged: The Value property of a node belonging to this document is about to be changed or has been changed already.
NodeInserting, NodeInserted: A node is about to be inserted into another node in this document or has been inserted already. The event fires whether you are inserting a new node, duplicating an existing node, or importing a node from another document.
NodeRemoving, NodeRemoved: A node belonging to this document is about to be removed from the document or has been removed from its parent already.

All these events require the same delegate for the event handler, as follows:

    public delegate void XmlNodeChangedEventHandler(
        object sender,
        XmlNodeChangedEventArgs e);

The XmlNodeChangedEventArgs class contains the event data. The class has four interesting properties:

• Action: Contains a value indicating what type of change is occurring on the node. Allowable values, listed in the XmlNodeChangedAction enumeration type, are Insert, Remove, and Change.
• NewParent: Returns an XmlNode object representing the new parent of the node once the operation is complete. The property will be set to null if the node is being removed. If the node is an attribute, the property returns the node to which the attribute refers.
• Node: Returns an XmlNode object that denotes the node that is being added, removed, or changed. Can't be set to null.
• OldParent: Returns an XmlNode object representing the parent of the node before the operation began. Returns null if the node has no parent, for example, when you add a new node.

Some of the actions you can take on an XML DOM are compound actions consisting of several steps, each of which could raise its own event.
For example, be prepared to handle several events when you set the InnerXml property. In this case, multiple nodes could be created and appended, resulting in as many NodeInserting/NodeInserted pairs. In some cases, the XmlNode class's AppendChild method might fire a pair of NodeRemoving/NodeRemoved events prior to actually proceeding with the insertion. By design, to ensure XML well-formedness, AppendChild checks whether the node you are adding already exists in the document. If it does, the existing node is first removed to avoid identical nodes in the same subtree.

The XmlNode Base Class

When you work with XML DOM parsers, you mainly use the XmlDocument class. The XmlDocument class, however, derives from a base class, XmlNode, which provides all the core functions to navigate and create nodes.

XmlNode is the abstract parent class of a handful of node-related classes that are available in the .NET Framework. Figure 5-2 shows the hierarchy of node classes.
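Before going further into XmlNode, here is a small sketch of the document events and of AppendChild discussed above. It assumes an invented `<employees>` document and a hypothetical `EventSketch` class. One handler is attached to the insert and remove events, a new element is created through the document's factory method, and an existing child is then re-appended so that any removal events raised ahead of the insertion show up in the log.

```csharp
using System;
using System.Xml;

class EventSketch
{
    static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<employees><employee>Nancy</employee><employee>Andrew</employee></employees>");

        // All document events share the XmlNodeChangedEventHandler delegate.
        // Handlers are attached after loading, so parsing itself is not logged.
        XmlNodeChangedEventHandler log = new XmlNodeChangedEventHandler(OnNodeEvent);
        doc.NodeInserting += log;
        doc.NodeInserted += log;
        doc.NodeRemoving += log;
        doc.NodeRemoved += log;

        // Node classes expose no public constructors; the document is the factory.
        XmlElement added = doc.CreateElement("employee");
        doc.DocumentElement.AppendChild(added);

        // Re-appending a node that is already in the tree moves it to the end,
        // which gives AppendChild a reason to remove it from its old position first.
        XmlNode first = doc.DocumentElement.FirstChild;
        doc.DocumentElement.AppendChild(first);
    }

    // Prints the action, the affected node, and its old and new parents.
    static void OnNodeEvent(object sender, XmlNodeChangedEventArgs e)
    {
        string oldParent = (e.OldParent == null) ? "(none)" : e.OldParent.Name;
        string newParent = (e.NewParent == null) ? "(none)" : e.NewParent.Name;
        Console.WriteLine("{0}: {1}, {2} -> {3}", e.Action, e.Node.Name, oldParent, newParent);
    }
}
```

Because the same handler serves both the "-ing" and the "-ed" event of each pair, every action is logged twice; separate handlers would be needed to tell the two phases apart.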


Figure 5-2: Graphical representation of the hierarchy of node classes and their relationships in the .NET Framework.
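The inheritance relationships among the node classes can also be inspected at run time. The sketch below is an illustration rather than part of the original text: it uses reflection to print the direct base class of a few System.Xml node types, and then checks the OwnerDocument behavior described in the note to Table 5-1. The class name `HierarchySketch` and the one-element document are made up for the example.

```csharp
using System;
using System.Xml;

class HierarchySketch
{
    // Prints a type name followed by the name of its direct base class.
    static void Show(Type t)
    {
        Console.WriteLine("{0} derives from {1}", t.Name, t.BaseType.Name);
    }

    static void Main()
    {
        Show(typeof(XmlDocument));       // the document is itself a node
        Show(typeof(XmlAttribute));
        Show(typeof(XmlElement));
        Show(typeof(XmlCharacterData));
        Show(typeof(XmlText));
        Show(typeof(XmlComment));

        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root id=\"1\">text</root>");

        // The document is not owned; every other node reports its owner document.
        Console.WriteLine(doc.OwnerDocument == null);
        Console.WriteLine(object.ReferenceEquals(doc.DocumentElement.OwnerDocument, doc));
    }
}
```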


Both XmlLinkedNode and XmlCharacterData are abstract classes that provide basic functionality for more specialized types of nodes. Linked nodes are nodes that you might find as constituent elements of an XML document just linked to a preceding or a following node. Character data nodes, on the other hand, are nodes that contain and manipulate only text.

Properties of the XmlNode Class

Table 5-4 lists the properties of the XmlNode class that derived classes can override if necessary. For example, not all node types support attributes and not all have child nodes or siblings. For situations such as this, the overridden properties can simply return null or the empty string. By design, all node types must provide a concrete implementation for each property.

Table 5-4: Properties of the XmlNode Class

Property: Description
Attributes: Returns a collection containing the attributes of the current node. The collection is of type XmlAttributeCollection.
BaseURI: Gets the base URI of the current node.
ChildNodes: Returns an enumerable list object that allows you to access all the children of the current node. The object returned derives from the base class XmlNodeList, which is a linked list connecting all the nodes with the same parent and the same depth level (siblings). No information is cached (not even the objects count), and any changes to the nodes are detected in real time.
FirstChild: Returns the first child of the current node or null. The order of child nodes reflects the order in which they have been added. In turn, the insertion order reflects the visiting algorithm implemented by the reader. (See Chapter 2.)
HasChildNodes: Indicates whether the current node has children.
InnerText: Gets or sets the text of the current node and all its children. Setting this property replaces all the children with the contents of the given string. If the string contains markup, the text will be escaped first.
InnerXml: Gets or sets the markup representing the body of the current node. The contents of the node are replaced with the contents of the given string. Any markup text will be parsed and the resulting nodes inserted.
IsReadOnly: Indicates whether the current node is read-only.
Item: Indexer property that gets the child element node with the specified (qualified) name.
LastChild: Gets the last child of the current node. Again, which node is the last one depends ultimately on the visiting algorithm implemented by the reader. Normally, it is the last child node in the source document.
LocalName: Returns the name of the node, minus the namespace.
Name: Returns the fully qualified name of the node.
NamespaceURI: Gets the namespace URI of the current node.
NextSibling: Gets the node immediately following the current node. Siblings are nodes with the same parent and the same depth.
NodeType: Returns the type of the current node as a value taken from the XmlNodeType enumeration.
OuterXml: Gets the markup code representing the current node and all of its children. Unlike InnerXml, OuterXml also includes the node itself in the markup with all of its attributes. InnerXml, on the other hand, returns only the markup found below the node, including text.
OwnerDocument: Gets the XmlDocument object to which the current node belongs.
ParentNode: Gets the parent of the current node (if any).
Prefix: Gets or sets the namespace prefix of the current node.
PreviousSibling: Gets the node immediately preceding the current node.
Value: Gets or sets the value of the current node.

The collection of child nodes is implemented as a linked list. The ChildNodes property returns an internal object of type XmlChildNodes. (The object is not documented, but you can easily verify this claim by simply checking the type of the object that ChildNodes returns.) You don't need to use this object directly, however. Suffice to say that it merely represents a concrete implementation of the XmlNodeList class, whose methods are, for the most part, marked as abstract. In particular, XmlChildNodes implements the Item and Count properties and the GetEnumerator method.

XmlChildNodes is not a true collection and does not cache any information.
When youaccess the Count property, <strong>for</strong> example, it scrolls the entire list, count<strong>in</strong>g the number ofnodes on the fly. When you ask <strong>for</strong> a particular node through the Item property, the listis scanned from the beg<strong>in</strong>n<strong>in</strong>g until a match<strong>in</strong>g node is found. To move through the list,the XmlChildNodes class relies on the node's NextSibl<strong>in</strong>g method. But which classactually implements the NextSibl<strong>in</strong>g method? Both NextSibl<strong>in</strong>g and PreviousSibl<strong>in</strong>g aredef<strong>in</strong>ed <strong>in</strong> the XmlL<strong>in</strong>kedNode base class.XmlL<strong>in</strong>kedNode stores an <strong>in</strong>ternal po<strong>in</strong>ter to the next node <strong>in</strong> the list. The objectreferenced is simply what NextSibl<strong>in</strong>g returns. Figure 5-3 how th<strong>in</strong>gs work.179
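As a minimal sketch of the forward-only traversal that this linked-list design favors, the following code counts the element children of a node by following FirstChild and NextSibling, so the list is walked exactly once. The SiblingWalk class, the CountElementChildren method, and the sample markup are illustrative assumptions, not part of the framework.

```csharp
using System;
using System.Xml;

static class SiblingWalk
{
    // Counts element children by following the FirstChild/NextSibling links.
    // Each link is followed once, so the child list is scanned a single time.
    public static int CountElementChildren(XmlNode parent)
    {
        int count = 0;
        for (XmlNode n = parent.FirstChild; n != null; n = n.NextSibling)
        {
            if (n.NodeType == XmlNodeType.Element)
                count++;
        }
        return count;
    }

    static void Main()
    {
        // Hypothetical sample: three elements, one text node, one comment.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root><a/>text<b/><!-- note --><c/></root>");
        Console.WriteLine(CountElementChildren(doc.DocumentElement));   // 3
    }
}
```

By contrast, an indexed loop of the form ChildNodes[i] would rescan the list from its beginning on every access, according to the behavior described above.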


Figure 5-3: The XmlLinkedNode class's NextSibling property lets applications navigate through the children of each node.

Scrolling forward through the list of child nodes is fast and effective. The same can't be said for backward scrolling. The list of nodes is not doubly linked, so a node does not store a pointer to the previous node in the list. For this reason, PreviousSibling reaches the target node by walking through the list from the beginning to the node that precedes the current one.

Tip
To summarize, when you are processing XML subtrees, try to minimize calls to PreviousSibling, Item, and Count because they walk through the collection of subnodes to get their expected output. Whenever possible, design your code to take advantage of forward-only movements and perform them using NextSibling.

Methods of the XmlNode Class

Table 5-5 lists the methods exposed by the XmlNode class.

Table 5-5: Methods of the XmlNode Class

Method: Description
AppendChild: Adds the specified node to the list of children of the current node. The node is inserted at the bottom of the list.
Clone: Creates a duplicate of the current node. For element nodes, duplication includes child nodes and attributes.
CloneNode: Creates a duplicate of the current node. Takes a Boolean argument indicating whether cloning should proceed recursively. If this argument is true, calling the CloneNode method is equivalent to calling Clone. Entity and notation nodes can't be cloned.
GetEnumerator: Returns an internal and node-specific object that implements the IEnumerator interface. The returned object provides the support needed to arrange for-each iterations.


GetNamespaceOfPrefix: Looks up the closest xmlns declaration for the given prefix and returns its namespace URI.
GetPrefixOfNamespace: Looks up the closest xmlns declaration for the given namespace URI and returns the prefix defined in that declaration.
InsertAfter: Inserts the specified node immediately after the specified reference node. If the node already exists, it is first removed. If the reference node is null, the insertion occurs at the beginning of the list.
InsertBefore: Inserts the specified node immediately before the specified reference node. If the node already exists, it is first removed. If the reference node is null, the insertion occurs at the bottom of the list.
Normalize: Ensures that there are no adjacent XmlText nodes by merging all adjacent text nodes into a single one according to a series of precedence rules.
PrependChild: Adds the specified node to the beginning of the list of children of the current node.
RemoveAll: Removes all the children of the current node, including attributes.
RemoveChild: Removes the specified child node.
ReplaceChild: Replaces the specified child node with a new one.
SelectNodes: Returns a list (XmlNodeList) of all the nodes that match a given XPath expression.
SelectSingleNode: Returns only the first node that matches the given XPath expression.
Supports: Verifies whether the current XmlImplementation object supports a specific feature.
WriteContentTo: Saves all the children of the current node to the specified XmlWriter object. The output is equivalent to InnerXml.
WriteTo: Saves the entire current node to the specified writer. The output is equivalent to OuterXml.

To locate one or more nodes in an XML DOM object, you can use either the ChildNodes collection or the SelectNodes method. With the former technique, you are given access to the unfiltered collection of child nodes. Note that in this context, child nodes means all and only the sibling nodes located one level below the current node.

The SelectNodes method (and the ancillary SelectSingleNode method) exploits the XPath query language to let you extract nodes based on logical conditions. In addition, XPath queries can go deeper than one level and even work on all descendants of a node. The .NET Framework XPath implementation is covered in Chapter 6. See the section "Further Reading" for resources providing detailed coverage of the XPath query language.
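The difference between the two techniques can be sketched with a small, made-up document. In the following code, the ChildNodesVersusXPath class, the library, shelf, and book element names, and the year attribute are all assumptions introduced for illustration; only ChildNodes and SelectNodes come from the XmlNode interface.

```csharp
using System;
using System.Xml;

static class ChildNodesVersusXPath
{
    // Hypothetical sample: one book sits directly below the root,
    // two more sit one level deeper, inside a shelf element.
    public const string Sample =
        "<library>" +
        "<shelf><book year='2001'/><book year='2003'/></shelf>" +
        "<book year='1999'/>" +
        "</library>";

    public static XmlElement LoadRoot()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.DocumentElement;
    }

    static void Main()
    {
        XmlElement root = LoadRoot();

        // ChildNodes: unfiltered, first level only (shelf and one book).
        Console.WriteLine(root.ChildNodes.Count);                            // 2

        // XPath: every book descendant, at any depth.
        Console.WriteLine(root.SelectNodes(".//book").Count);                // 3

        // XPath with a logical condition on an attribute value.
        Console.WriteLine(root.SelectNodes(".//book[@year > 2000]").Count);  // 2
    }
}
```

The ".//" prefix is what makes the query descend below the first level; without it, the expression "book" would match only the single book that is a direct child of the root.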


Working with XML Documents

To be fully accessible, an XML document must be entirely loaded in memory and its nodes and attributes mapped to corresponding objects derived from the XmlNode class. The process that builds the XML DOM starts when you call the Load method. You can load the XML document from a variety of sources, including disk files, URLs, streams, and text readers.

Loading XML Documents

The Load method always transforms the data source into an XmlTextReader object and passes it down to an internal loader object. The method has the following overloads:

    public virtual void Load(Stream);
    public virtual void Load(string);
    public virtual void Load(TextReader);
    public virtual void Load(XmlReader);

The loader is responsible for reading all the nodes in the document and does that through a nonvalidating reader. After a node has been read, it is analyzed, and the corresponding XmlNode object is created and added to the document tree. The entire process is illustrated in Figure 5-4.

Figure 5-4: The loading process of an XmlDocument object.

Note that before a new document is loaded, the current instance of the XmlDocument object is cleared. This means that if you reuse the same instance of the XmlDocument class to load a second document, the existing contents are entirely removed before proceeding.
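The clearing behavior is easy to observe with two in-memory sources. The following sketch, in which the LoadOverloadsDemo class and the first and second element names are assumptions made up for illustration, loads one document through the TextReader overload and then a second one through the Stream overload on the same XmlDocument instance.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class LoadOverloadsDemo
{
    // Loads two documents in a row into one XmlDocument instance and
    // returns the name of the root element that is left at the end.
    public static string LoadTwice()
    {
        XmlDocument doc = new XmlDocument();

        // Load(TextReader): the markup comes from an in-memory string.
        doc.Load(new StringReader("<first><child/></first>"));

        // Load(Stream) on the same instance: the <first> tree is discarded
        // before the new content is read.
        byte[] bytes = Encoding.UTF8.GetBytes("<second/>");
        using (MemoryStream stream = new MemoryStream(bytes))
        {
            doc.Load(stream);
        }

        return doc.DocumentElement.Name;
    }

    static void Main()
    {
        Console.WriteLine(LoadTwice());   // second
    }
}
```

If both documents must stay available at the same time, a separate XmlDocument instance per document is the straightforward choice.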


Important
Although an XML reader is always used to build an XML DOM, some differences can be noticed depending on whether the reader is built internally (that is, you call Load on a file or a stream) or explicitly passed by the programmer. In the latter case, if the reader is already positioned on a nonroot node, only the siblings of that node are read and added to the DOM. If the current reader's node can't be used as the root of a document (for example, attributes or processing instructions), the reader reads on until it finds a node that can be used as the root. Pay attention to the state of the reader before you pass it on to the XML DOM loader.

Let's see how to use the XML DOM to build a relatively simple example, the same code that we saw in action in Chapter 2 with readers. The following code parses the contents of an XML document and outputs its element node layout, discarding everything else, including text, attributes, and other nonelement nodes:

    using System;
    using System.Xml;

    class XmlDomLayoutApp
    {
        public static void Main(String[] args)
        {
            try {
                String fileName = args[0];
                XmlDocument doc = new XmlDocument();
                doc.Load(fileName);
                XmlElement root = doc.DocumentElement;
                LoopThroughChildren(root);
            }
            catch (Exception e) {
                Console.WriteLine("Error:\t{0}\n", e.Message);
            }
            return;
        }

        private static void LoopThroughChildren(XmlNode root)
        {
            // Opening tag of the current element
            Console.WriteLine("<{0}>", root.Name);
            foreach(XmlNode n in root.ChildNodes)
            {
                if (n.NodeType == XmlNodeType.Element)
                    LoopThroughChildren(n);
            }
            // Closing tag of the current element
            Console.WriteLine("</{0}>", root.Name);
        }
    }

After creating the XML DOM, the program begins a recursive visit that touches on all internal nodes of all types. The ChildNodes list returns only the first-level children of a given node. Of course, this is not enough to traverse the tree from the root to the leaves, so the LoopThroughChildren method is recursively called on each element node found. Let's call the program to work on the following XML file:

[XML listing not preserved; its elements carry the text values .NET, Linux, Win32, and Java]

The result we get using the XML DOM is shown here and is identical to what we got from readers in Chapter 2:

[output listing not preserved]

Well-Formedness and Validation

The XML document loader checks input data only for well-formedness. If parsing errors are found, an XmlException exception is thrown and the resulting XmlDocument object remains empty. To load a document and validate it against a DTD or a schema file, you must use the overload of the Load method that accepts an XmlReader object. You pass the Load method a properly initialized instance of the XmlValidatingReader class, as shown in the following code, and proceed as usual:

    XmlTextReader _coreReader;
    XmlValidatingReader reader;
    _coreReader = new XmlTextReader(xmlFile);
    reader = new XmlValidatingReader(_coreReader);
    doc.Load(reader);

Any schema information found in the file is taken into account and the contents are validated.
Parser errors, if any, are passed on to the validation handler you might have defined. (See Chapter 3 for more details on the working of .NET Framework validating readers.) If your validating reader does not have an event handler, the first exception stops the loading. Otherwise, the operation continues unless the handler itself throws an exception.

Loading from a String

The XML DOM programming interface also provides you with a method to build a DOM from a well-formed XML string. The method is LoadXml and is shown here:

    public virtual void LoadXml(string xml);

This method neither supports validation nor preserves white spaces. Any context-specific information you might need (DTD, entities, namespaces) must necessarily be embedded in the string to be taken into account.

Loading Documents Asynchronously

The .NET Framework implementation of the XML DOM does not provide for asynchronous loading. The Load method, in fact, always works synchronously and does not pass control back to the caller until it has completed. As you might guess, this can become a serious problem when you have huge files to process and a rich user interface.

In such situations (that is, when you are writing a Windows Forms rich client), using threads can be the most effective solution.
You transfer to a worker thread the burden of loading the XML document and update the user interface when the thread returns, as shown here:

    void StartDocumentLoading()
    {
        // Create the worker thread
        Thread t = new Thread(new ThreadStart(this.LoadXmlDocument));
        statusBar.Text = "Loading document...";
        t.Start();
    }

    void LoadXmlDocument()
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(InputFile.Text);

        // Update the user interface (in production code, marshal these
        // assignments to the UI thread with Control.Invoke)
        statusBar.Text = "Document loaded.";
        Output.Text = doc.OuterXml;
        Output.ReadOnly = false;
        return;
    }


While the secondary thread works, the user can freely use the application's user interface, and the huge size of the XML file is no longer a serious issue, at least as it pertains to loading.

Extracting XML DOM Subtrees

You normally build the XML DOM by loading the entire XML document into memory. However, the XmlDocument class also provides the means to extract only a portion of the document and return it as an XML DOM subtree. The key method to achieve this result is ReadNode, shown here:

    public virtual XmlNode ReadNode(XmlReader reader);

The ReadNode method begins to read from the current position of the given reader and doesn't stop until the end tag of the current node is reached. The reader is then left immediately after the end tag. For the method to work, the reader must be positioned on an element or an attribute node.

ReadNode returns an XmlNode object that contains the subtree representing everything that has been read, including attributes. ReadNode is different from ChildNodes in that it recursively processes children at any level and does not stop at the first level of siblings.

Visiting an XML DOM Subtree

So far, we've examined ways to get XML DOM objects out of an XML reader. Is it possible to call an XML reader to work on an XML DOM document and have the reader visit the whole subtree, one node after the next?

Chapter 2 introduced the XmlNodeReader class, with the promise to return to it later. Let's do that now. The XmlNodeReader class is an XML reader that enables you to read nodes out of a given XML DOM subtree.

Just as XmlTextReader visits all the nodes of the specified XML file, XmlNodeReader visits all the nodes that form an XML DOM subtree. Note that the node reader is capable of traversing all the nodes in the subtree, no matter how deep. Let's review a situation in which you might want to take advantage of XmlNodeReader.

The XmlNodeReader Class

Suppose you have selected a node about which you need more information. To scan all the nodes that form the subtree using XML DOM, your only option is to use a recursive algorithm like the one discussed with the LoopThroughChildren method in the section "Loading XML Documents." The XmlNodeReader class gives you an effective, ready-to-use alternative, shown here:

    // Select the root of the subtree to process
    XmlNode n = root.SelectSingleNode("Employee[@id=2]");
    if (n != null)
    {
        // Instantiate a node reader object
        XmlNodeReader nodeReader = new XmlNodeReader(n);

        // Visit the subtree
        while (nodeReader.Read())
        {
            // Do something with the node...
            Console.WriteLine(nodeReader.Value);
        }
    }

The while loop visits all the nodes belonging to the specified XML DOM subtree. The node reader class is initialized using the XmlNode object that is the root of the XML DOM subtree.

Updating Text and Markup

Once an XML document is loaded in memory, you can enter all the needed changes by simply accessing the property of interest and modifying the underlying value. For example, to change the value of an attribute, you proceed as follows:

    // Retrieve a particular node and update an attribute
    XmlNode n = root.SelectSingleNode("days");
    n.Attributes["module"].Value = "1";

To insert many nodes at the same time and in the same parent, you can exploit a little trick based on the concept of a document fragment. In essence, you concatenate all the necessary markup into a string and then create a document fragment, as shown here:

    XmlDocumentFragment df = doc.CreateDocumentFragment();
    // "item" is a placeholder element name
    df.InnerXml = "<item>Value</item><item>AnotherValue</item>";
    parentNode.AppendChild(df);

Set the InnerXml property of the document fragment node with the string, and then add the newly created node to the parent. The nodes defined in the body of the fragment will be inserted one after the next.

In general, when you set the InnerXml property on an XmlNode-based class, any detected markup text will be parsed, and the new contents will replace the existing contents.
For this reason, if you simply want to add new children to a node, pass through the XmlDocumentFragment class, as described in the previous paragraph, and avoid using InnerXml directly on the target node.

Detecting Changes

Callers are notified of any changes that affect nodes through events. You can set event handlers at any time and even prior to loading the document, as shown here:

    XmlDocument doc = new XmlDocument();
    doc.NodeInserted += new XmlNodeChangedEventHandler(Changed);
    doc.Load(fileName);

If you use the preceding code, you will get events for each insertion during the building of the XML DOM. The following code illustrates a minimal event handler:

    void Changed(object sender, XmlNodeChangedEventArgs e)
    {
        Console.WriteLine(e.Action.ToString());
    }

Note that by design XML DOM events give you a chance to intervene before and after a node is added, removed, or updated.
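The before-and-after pairing can be sketched by subscribing to both events of each pair and recording what arrives. In the following code, the ChangeLog class, the Record method, and the root element name are assumptions made up for illustration; the NodeInserting, NodeInserted, NodeRemoving, and NodeRemoved events belong to the XmlDocument class.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class ChangeLog
{
    // Adds a node to an empty document, removes it again, and returns
    // the sequence of notifications raised along the way.
    public static List<string> Record()
    {
        List<string> log = new List<string>();
        XmlDocument doc = new XmlDocument();

        XmlNodeChangedEventHandler before =
            (sender, e) => log.Add("before " + e.Action + " " + e.Node.Name);
        XmlNodeChangedEventHandler after =
            (sender, e) => log.Add("after " + e.Action + " " + e.Node.Name);

        doc.NodeInserting += before;
        doc.NodeInserted += after;
        doc.NodeRemoving += before;
        doc.NodeRemoved += after;

        // Creating a node raises no event; linking it to the tree does.
        XmlElement root = doc.CreateElement("root");
        doc.AppendChild(root);
        doc.RemoveChild(root);

        return log;
    }

    static void Main()
    {
        foreach (string entry in Record())
            Console.WriteLine(entry);
    }
}
```

The log contains four entries, in this order: before Insert root, after Insert root, before Remove root, after Remove root.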


Limitations of the XML DOM Eventing Model

Although you receive notifications before and after an action takes place, you can't alter the predefined flow of operations. In other words, you can perform any action while handling the event, but you can't cancel the ongoing operation. This also means that you can't just skip some nodes based on run-time conditions. In fact, the event handler function is void, and all the arguments passed with the event data structure are read-only. Programmers have no way to pass information back to the reader and skip the current node. There is only one way in which the event handler can affect the behavior of the reader. If the event handler throws an exception, the reader will stop working. In this case, however, the XML DOM will not be built.

Selecting Nodes by Query

As mentioned, the XML DOM provides a few ways to traverse the document forest to locate a particular node. The ChildNodes property returns a linked list formed by the child nodes placed at the same level. You move back and forth in this list using the NextSibling and PreviousSibling properties.

You can also enumerate the contents of the ChildNodes list using a foreach-style enumerator. This enumerator is built into the XmlDocument class and returned on demand by the GetEnumerator method, as shown here:

    foreach(XmlNode n in node.ChildNodes)
    {
        // Do something
    }

Direct Access to Elements

The GetElementById method of the XmlDocument class returns the first element in the document that has an ID attribute with the specified value. Note that ID is a particular XML type and not simply an attribute with that name. An attribute can be declared as an ID only in an XML Schema Definition (XSD) or a DTD schema. The following XML fragment defines an employeeid attribute of type ID. The attribute belongs to the Employee node.

[listing not preserved]

A corresponding XML node might look like this:

[listing not preserved]

As you can see, the source XML is apparently unaffected by the use of an ID attribute. An ID attribute can be seen as an XML primary key, and the GetElementById method (part of the W3C DOM specification) represents the search method that applications use to locate nodes. The following code retrieves the node element in the document whose ID attribute (employeeid) matches the specified value:

    employeeNode = node.GetElementById("1");

If you call GetElementById on a document whose elements have no ID attributes or no matching values, the method returns null. The search for a matching node stops when the first match is found.

Another query method at your disposal is GetElementsByTagName. As the name suggests, this method returns a list of nodes with the specified name. GetElementsByTagName looks similar to ChildNodes but differs in one aspect. Whereas ChildNodes returns all the child nodes found, including all elements and leaves, GetElementsByTagName returns only the element nodes with a particular name. The name specified can be expressed as a local as well as a namespace-qualified name.
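To make the role of the ID type concrete, here is a minimal sketch that declares the ID attribute in an internal DTD subset and then runs both query methods. The IdLookup class, the Staff root element, the identifiers e1 and e2, and the names Ann and Bob are made up for illustration. The sketch also assumes that the loader processes the internal DTD subset, which is the default behavior of LoadXml as far as I know; ID values are written as XML names (e1 rather than 1) because the ID type requires that.

```csharp
using System;
using System.Xml;

static class IdLookup
{
    // Hypothetical document: the internal DTD subset declares employeeid
    // as an attribute of type ID on the Employee element.
    public const string Sample =
        "<!DOCTYPE Staff [" +
        "<!ELEMENT Staff (Employee*)>" +
        "<!ELEMENT Employee (#PCDATA)>" +
        "<!ATTLIST Employee employeeid ID #REQUIRED>" +
        "]>" +
        "<Staff>" +
        "<Employee employeeid='e1'>Ann</Employee>" +
        "<Employee employeeid='e2'>Bob</Employee>" +
        "</Staff>";

    public static XmlDocument LoadSample()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc;
    }

    static void Main()
    {
        XmlDocument doc = LoadSample();

        // Lookup by ID value; the result is null when nothing matches.
        XmlElement hit = doc.GetElementById("e2");
        Console.WriteLine(hit == null ? "(no match)" : hit.InnerText);

        XmlElement miss = doc.GetElementById("e9");
        Console.WriteLine(miss == null ? "(no match)" : miss.InnerText);

        // Lookup by element name; only Employee elements are returned.
        Console.WriteLine(doc.GetElementsByTagName("Employee").Count);
    }
}
```

Without the ATTLIST declaration, employeeid would be an ordinary attribute, and GetElementById would find nothing even though the markup of the Employee elements is identical.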


XPath-Driven Access to Elements

The methods SelectNodes and SelectSingleNode provide more flexibility when it comes to selecting child nodes. Both methods support an XPath syntax (see Chapter 6) to select nodes along the XML subtree rooted in the current node. There are two main differences between these methods and the other methods we've examined, such as ReadNode and XmlNodeReader.

The first difference is that an XPath query lets you base the search at a deeper level than the current node. In other words, the query expression can select the level of child nodes on which the search will be based. All other search methods can work only on the first level of child nodes.

The second difference is that an XPath expression lets you select nodes based on logical criteria. The code in this section is based on the following XML layout:

[listing not preserved; as the queries below indicate, a MyDataSet root contains NorthwindEmployees nodes, which in turn contain Employee nodes with an id attribute]

By default, the SelectNodes and SelectSingleNode methods work on the children of the node that calls them, as follows:

    root.SelectNodes("NorthwindEmployees");
    root.SelectNodes("NorthwindEmployees/Employee");
    root.SelectNodes("NorthwindEmployees/Employee[@id>4]");

An XPath expression, however, can traverse the tree and move the context for the query one or more levels ahead, or even back. The first query selects all the NorthwindEmployees nodes found below the root (the MyDataSet node). The second query starts from the root but goes two levels deeper to select all the nodes named Employee below the NorthwindEmployees nodes. Finally, the third query adds a stricter condition and further narrows the result set by selecting only the Employee nodes whose id attribute is greater than 4.
By using special syntax constructs, you can have XPath queries start from the root node or any other ancestor node, regardless of which node runs the query. (More on this topic in Chapter 6.)

Creating XML Documents

If your primary goal is analyzing the contents of an XML document, you will probably find the XML DOM parsing model much more effective than readers in spite of the larger memory footprint and set-up time it requires. A document loaded through XML DOM can be modified, extended, shrunk, and, more important, searched. The same can't be done with XML readers; XML readers follow a different design center. But what are the advantages of creating XML documents using XML DOM?

To create an XML document using the XML DOM API, you must first create the document in memory and then call the Save method or one of its overloads. This system gives you great flexibility because no changes you make are set in stone until you save the document. In general, however, using the XML DOM API to create a new XML document is often overkill unless the creation of the document is driven by complex and sophisticated logic.

In terms of the internal implementation, it is worth noting that the XML DOM's Save method makes use of an XML text writer to create the document. So unless the content to be generated is complex and subject to a lot of conditions, using an XML text writer to create XML documents is faster.

The XmlDocument class provides a set of methods to create new nodes. These methods are named consistently with the writing methods of the XmlTextWriter class we encountered in Chapter 4. You'll find a CreateXXX method for each WriteXXX method provided by the writer. Actually, each CreateXXX method simply creates a new node in memory, and the corresponding WriteXXX method on the writer simply writes the node to the output stream.

Appending Nodes

Let's look at how to create a brand-new XML document that persists to XML the subdirectories found below a given path. The basic algorithm to implement can be summarized in the following steps:

1. Create any necessary nodes.
2. Link the nodes to create a tree.
3. Append the tree to the in-memory XML document.
4. Save the document.

The expected final output has the following layout:

[listing not preserved: a folders root element containing folder elements with text content]

The following code creates the XML prolog and appends to the XmlDocument instance the standard XML declaration and a comment node:

    XmlDocument doc = new XmlDocument();
    XmlNode n;

    // Write and append the XML heading
    n = doc.CreateXmlDeclaration("1.0", "", "");
    doc.AppendChild(n);

    // Write and append some comment
    n = doc.CreateComment("Content of the \"" + path + "\" folder ");
    doc.AppendChild(n);

The CreateXmlDeclaration method takes three arguments: the XML version, the required encoding, and a value ("yes" or "no") denoting whether the document can be considered stand-alone or has dependencies on other documents. All arguments are strings, including the encoding argument, as shown here:

[listing not preserved]


If specified, the encoding is written in the XML declaration and used by Save to create the actual output stream. If the encoding is null or empty, no encoding attribute is set, and the default Unicode Universal Character Set Transformation Format, 8-bit form (UTF-8) encoding is used.

CreateXmlDeclaration returns an XmlDeclaration node that you add as a child to the XmlDocument class. CreateComment, on the other hand, creates an XmlComment node that represents an XML comment, as shown here:

[listing not preserved]

Element nodes are created using the CreateElement method. The node is first configured with all of its expected child nodes and then added to the document, as shown here:

    XmlNode root = doc.CreateElement("folders");

For the purposes of this example, we need a way to access all the subdirectories of a given folder. In the .NET Framework, this kind of functionality is provided by the DirectoryInfo class in the System.IO namespace:

    DirectoryInfo dir = new DirectoryInfo(path);

To scan the subdirectories of the given path, you arrange a loop on top of the array of DirectoryInfo objects returned by the GetDirectories method, as follows:

    foreach (DirectoryInfo d in dir.GetDirectories())
    {
        n = doc.CreateElement("folder");

        //
        // Create attributes for the node
        //

        // Set the text for the node
        n.InnerText = "Content of " + d.Name;

        // Append the node to the rest of the document
        root.AppendChild(n);
    }

In the loop, you create any needed node, configure the node with attributes and text, and then append the node to the parent node.

When creating an element node using the CreateElement method, you can specify a namespace URI as well as a namespace prefix. With the following code, you add an xmlns attribute to the node declaration:

    XmlNode root = doc.CreateElement("folders", "urn:dino-e");

The final result is shown here:

    <folders xmlns="urn:dino-e">


If you use a namespace, you might reasonably want to use a prefix too. To specify anamespace prefix, resort to another overload <strong>for</strong> the CreateElement method <strong>in</strong> whichyou pass <strong>in</strong> the order, the prefix, the local name of the element, and the namespaceURI, as shown here:XmlNode root = doc.CreateElement("d", "folders", "urn:d<strong>in</strong>o-e");The node <strong>XML</strong> code changes to this:At this po<strong>in</strong>t, to also qualify the successive nodes with this namespace, callCreateElement with the prefix and the URI, as shown here:n = doc.CreateElement("d", "folder", "urn:d<strong>in</strong>o-e");NoteBear <strong>in</strong> m<strong>in</strong>d that although all the CreateXXX methods available <strong>in</strong>the XmlDocument class can create an <strong>XML</strong> node, that node is notautomatically added to the <strong>XML</strong> DOM. You must do that explicitlyus<strong>in</strong>g one of the several methods def<strong>in</strong>ed to extend the currentDOM.Append<strong>in</strong>g AttributesAn attribute is simply a special type of node that you create us<strong>in</strong>g the CreateAttributemethod. The method returns an XmlAttribute object. The follow<strong>in</strong>g code shows how tocreate a new attribute named path and how to associate it with a parent node:XmlAttribute a;a = doc.CreateAttribute("path");a.Value = path;node.Attributes.SetNamedItem(a);Like CreateElement, CreateAttribute too allows you to qualify the name of the attributeus<strong>in</strong>g a namespace URI and optionally a prefix. The overloads <strong>for</strong> both methods havethe same signature.You set the value of an attribute us<strong>in</strong>g the Value property. At this po<strong>in</strong>t, however, theattribute node is not yet bound to an element node. To associate the attribute with anode, you must add the attribute to the node's Attributes collection. 
The SetNamedItem method does this for you. The following code shows the finalized version of the loop that creates the XML file for our example:

    foreach (DirectoryInfo d in dir.GetDirectories())
    {
        n = doc.CreateElement("folder");
        a = doc.CreateAttribute("name");
        a.Value = d.Name;
        n.Attributes.SetNamedItem(a);
        a = doc.CreateAttribute("created");
        a.Value = d.CreationTime.ToString();
        n.Attributes.SetNamedItem(a);
        root.AppendChild(n);
        n.InnerText = "Content of " + d.Name;
    }

Figure 5-5 demonstrates the structure of the newly created XML file.

Figure 5-5: An XML file representing a directory listing created using the XML DOM API.

Persisting Changes

The final step in saving the XML document we have created is to attach the root node to the rest of the document and save the document, as shown here:

    doc.AppendChild(root);
    doc.Save(fileName);

To persist all the changes to a storage medium, you call the Save method, which has four overloads, shown here:

    public virtual void Save(Stream);
    public virtual void Save(string);
    public virtual void Save(TextWriter);
    public virtual void Save(XmlWriter);

The XML document can be saved to a disk file as well as to an output stream, including network and compressed streams. You can also integrate the class that manages the document with other .NET Framework applications by using writers, and you can combine multiple XML documents using, in particular, XML writers.

Whatever overload you choose, it is always an XML writer that does the job of persisting XML nodes to a storage medium. The XmlDocument class makes use of a specialized version of the XmlTextWriter class that simply works around one of the limitations of XML writers.

XML writers do not allow you to write element and attribute nodes for which you have a prefix but an empty namespace.
If the namespace URI is set to null, the writer looks up the closest definition for that prefix and figures out the namespace, if one exists. If the namespace is simply an empty string, however, an ArgumentException exception is thrown. The XML DOM internal writer overrides the WriteStartElement and WriteStartAttribute methods. If the namespace URI is empty when the prefix is not, the new overrides reset the prefix to the empty string and no exception is raised.
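The steps covered so far (creating qualified elements, attaching attributes, appending nodes, and saving) can be pulled together in a minimal sketch. The class name FolderDocumentSketch, the Build method, and the sample folder names are assumptions made for illustration; a StringWriter stands in for the disk file so that the Save(TextWriter) overload can be exercised without touching the file system.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the chapter's sample code.
public static class FolderDocumentSketch
{
    // Builds a <d:folders> document with one <d:folder> child per name
    // and returns the serialized markup.
    public static string Build(string[] folderNames)
    {
        const string ns = "urn:dino-e";
        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement("d", "folders", ns);

        foreach (string name in folderNames)
        {
            // Nodes returned by CreateXXX are detached until appended.
            XmlElement n = doc.CreateElement("d", "folder", ns);
            XmlAttribute a = doc.CreateAttribute("name");
            a.Value = name;
            n.Attributes.SetNamedItem(a);
            n.InnerText = "Content of " + name;
            root.AppendChild(n);
        }

        // Attach the root to the document before saving.
        doc.AppendChild(root);

        // Save(TextWriter) overload; the StringWriter replaces a file here.
        StringWriter writer = new StringWriter();
        doc.Save(writer);
        return writer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build(new string[] { "bin", "docs" }));
    }
}
```

Replacing the StringWriter with a file name or a stream selects one of the other Save overloads without changing the rest of the code.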


Extending the XML DOM

Although the .NET Framework provides a suite of rich classes to navigate, query, and modify the contents of an XML document, there might be situations in which you need more functionality. For example, you might want a node class with more informative properties or a document class with extra functions. To obtain that class, you simply derive a new class from XmlNode, XmlDocument, or whatever XML DOM class you want to override. Let's see how.

Custom Node Classes

As a general rule of thumb, you should avoid deriving node classes from the base class XmlNode. If necessary, derive node classes from a specialized and concrete node class like XmlElement or XmlAttribute. This will ensure that no key behavior of the node is lost in your implementation. But what kind of extensions can you reasonably build for a node?

I haven't encountered any huge flaws in the design of the XML DOM node classes, so if you need extensions, it's probably because you want to give nodes new methods or properties that simplify a particular operation you carry out quite often.

The Microsoft Developer Network (MSDN) documentation already provides an example of XML DOM extensions that adds line information to each node and then counts the number of element nodes a given document contains. (See the section "Further Reading," on page 244, for more information about this example.) As mentioned, the ChildNodes property of the XmlDocument class does not cache the number of elements in the list.
As a result, whenever you need to know the number of children and call the Count property, the entire list of nodes is walked from top to bottom. In addition, you have no way to distinguish between element nodes and leaf nodes.

In the MSDN documentation, you'll find a class that attempts to solve this problem by extending the XmlDocument class with a custom GetCount method, shown here:

    class LineInfoDocument : XmlDocument
    {
        ...
        public int GetCount()
        {
            return elementCount;
        }
        ...
    }

In the remainder of this section, however, we'll look at a more substantial improvement to the XmlDocument class. In particular, you'll learn how to build a kind of "sensitive" XML DOM that can detect any changes to the underlying disk file and notify the application so that the new contents can be reloaded.

Building a Hot-Plugging XML DOM

Being able to detect changes to files and folders as they occur is a feature that many developers would welcome. Win32 provides a set of functions to get notifications of incoming changes to the size, the contents, or the attributes of a given file or folder. Unfortunately, the feature is limited to notifying registered applications that a certain event occurred in the watched file or folder but provides no further information about what happened to which file or folder and why.


To clarify, this feature was introduced with Microsoft Windows 95 and was tailor-made for Windows Explorer. Have you ever noticed that when you have a Windows Explorer view open and you modify a file shown in that view, the Windows Explorer view automatically refreshes to show updated data? The trick behind this apparently magical behavior is that, just before a new folder view is opened, Windows Explorer registers a file notification object for the contents of that folder. When it gets a notification that something occurred to that folder's contents, Windows Explorer simply refreshes the view to show the new contents, whatever they are.

Later, Microsoft introduced, only for the Windows NT platform, an even more sophisticated mechanism that not only notifies applications of the event but also provides information about the type of change that occurred and the file or files affected. This extended feature relies on Win32 API functions supported only on Windows NT platforms, starting with Windows NT 4.0.

The .NET Framework wraps all this functionality into the FileSystemWatcher class, available from the System.IO namespace.
This class takes advantage of the Windows NT-based API and for this reason is not available with Microsoft Windows 98, Microsoft Windows Me, and older platforms.

Note: Because FileSystemWatcher is a wrapper for the Windows NT API, it works only on computers running Windows NT, Windows 2000, or Windows XP. But you could write a wrapper class using a less powerful Win32 API and have it work on all Win32 platforms.

An instance of the FileSystemWatcher class is at the foundation of the extended version of the XmlDocument class that we'll build in the next section. The new class, named XmlHotDocument, is capable of detecting any changes that have occurred in the underlying file and automatically notifies the host application of these changes.

The XmlHotDocument Class Programming Interface

The XmlHotDocument class inherits from XmlDocument and provides a new event and a couple of new properties, as shown in the following code. In addition, it overrides one of the overloads of the Load method: the overload that works on files. In general, however, nothing would really prevent you from extending the feature to also cover streams or text readers as long as those streams and readers are based on disk files.

    public class XmlHotDocument : XmlDocument
    {
        public XmlHotDocument() : base()
        {
            m_watcher = new FileSystemWatcher();
            HasChanges = false;
            EnableFileChanges = false;
        }
        ...
    }

As you can see, the preceding code includes the class declaration and the constructor's code.
Upon initialization, the class creates an instance of the file system watcher and sets the new public properties, HasChanges and EnableFileChanges, to false. Table 5-6 summarizes what's really new with the programming interface of the XmlHotDocument class.


Table 5-6: Programming Interface of the XmlHotDocument Class

Property or Event | Description
------------------|------------
EnableFileChanges | Boolean property that you use to toggle the watching system on and off. If set to true, the application receives notifications for each change made to the file loaded in the DOM. Set to false by default.
HasChanges | Boolean property that the class sets to true whenever there are changes in the underlying XML file that the application has not yet processed. Set to false by default; is reset when you call the Load method again.
UnderlyingDocumentChanged | Event that the class fires whenever a change is detected in the watched file.

In addition, the XmlHotDocument class has one private member: the reference to the FileSystemWatcher object used to monitor file system changes.

The Watching Mechanism

An instance of the FileSystemWatcher class is created in the class constructor but is not set to work until the caller application sets the EnableFileChanges property to true, as shown here:

    public bool EnableFileChanges
    {
        get { return m_watcher.EnableRaisingEvents; }
        set
        {
            if (value == true)
            {
                // Get the local path of the current file
                Uri u = new Uri(BaseURI);
                string filename = u.LocalPath;

                // Set the folder and the file to watch
                FileInfo fi = new FileInfo(filename);
                m_watcher.Path = fi.DirectoryName;
                m_watcher.Filter = fi.Name;

                // Set hooks for writing changes
                m_watcher.NotifyFilter = NotifyFilters.LastWrite;
                m_watcher.Changed +=
                    new FileSystemEventHandler(this.OnChanged);

                // Start getting notifications
                m_watcher.EnableRaisingEvents = true;
            }
            else
                m_watcher.EnableRaisingEvents = false;
        }
    }

EnableFileChanges is a read/write property that is responsible for setting up the watching system when set to true. The watching system consists of Path and Filter properties that you use to narrow the set of files and folders that must be watched for changes.

The Path property sets the folder to watch, while the Filter property restricts the set of files monitored in that folder. If you set the Filter property to an empty string, the entire contents of the folder will be watched; otherwise, only the files matching the filter string will be taken into account. In this case, we just need to monitor a single file, so we'll set the Filter property to the name of the document used to populate the current XML DOM.

Note: When setting the Filter property, avoid using fully qualified path names. Internally, the FileSystemWatcher class concatenates the Path and Filter properties to obtain the fully qualified path used to filter out files and folders involved in any file system-level event caught.

The XmlDocument class stores the name of the document being processed in its BaseURI property. Although the BaseURI property is a string, it stores the file name as a URI. As a result, a file name such as c:\data.xml is stored in the BaseURI property as file:///c:/data.xml. Note that in the .NET Framework, URIs are rendered through an ad hoc type, the Uri class.
To obtain the local path from a URI, you must first create a new Uri object and query its LocalPath property, as shown here:

    Uri u = new Uri(BaseURI);
    string filename = u.LocalPath;

Why can't we just use the file name in the URI form? To avoid the rather boring task of parsing the path string to extract the directory information, I use the FileInfo class and its handy DirectoryName property. Unfortunately, however, the FileInfo class can't handle file names in the URI format. The following code will throw an exception if filename is a URI:

    FileInfo fi = new FileInfo(filename);
    m_watcher.Path = fi.DirectoryName;
    m_watcher.Filter = fi.Name;

To finalize the watcher setup, you also need to define the change events that will be detected and register a proper event handler for each of them. You set the NotifyFilter property with any bitwise combination of flags defined in the NotifyFilters enumeration. In particular, you can choose values to detect changes in the size, attributes, name, contents, date, and security settings of each watched file. The following code simply configures the watcher to control whether the monitored file has something new written to it. The LastWrite flag actually causes an event to fire whenever the timestamp of the file changes, irrespective of the contents that you might have written to the file. In other words, the event also fires if you simply open and save the file without entering any changes.

    m_watcher.NotifyFilter = NotifyFilters.LastWrite;
    m_watcher.Changed += new FileSystemEventHandler(this.OnChanged);

    // Start getting notifications
    m_watcher.EnableRaisingEvents = true;

The changes you can detect are reported through four events: Changed, Created, Deleted, and Renamed. In this example, we are interested only in the changes that modify an existing file, so let's handle only the Changed event, as shown here:

    private void OnChanged(object source, FileSystemEventArgs e)
    {
        HasChanges = true;
        if (UnderlyingDocumentChanged != null)
            UnderlyingDocumentChanged(this, EventArgs.Empty);
    }

Any file system event passes to the handlers a FileSystemEventArgs object that contains information about the event, for example, the name of the file involved and a description of the event that just occurred. The XmlHotDocument class processes the Changed event by simply setting the HasChanges property to true and bubbling the event up to the caller application. In the process, the original event is renamed to a class-specific event named UnderlyingDocumentChanged. In addition, no argument is passed because the client application using the XML DOM needs to know only that some changes have occurred to the underlying document currently being processed.

After it is completely set up, the FileSystemWatcher class starts raising file system events only if you set its EnableRaisingEvents property to true.
Changing the value of this property to false is the only way you have to stop the watcher from sending further events.

Note: When monitoring a file or a folder through a FileSystemWatcher object, don't be surprised if you receive too many events and some events that are not strictly solicited. The class is a watchful observer of what happens at the file system level and correctly reports any change you registered for. Many operations that look like individual operations are actually implemented in several steps, each of which can cause an independent event. In addition, you might have software running in the background (for example, antivirus software) that performs disk operations that will be detected as well.

Using the XmlHotDocument Class

To take advantage of the new class in a client application, start by declaring and instantiating a variable of that type, as follows:

    XmlHotDocument m_hotDocument = new XmlHotDocument();

Next you register an event handler for the UnderlyingDocumentChanged event and call the Load method to build the XML DOM. When you think you are ready to start receiving file system notifications, set the EnableFileChanges property to true, as shown here:

    m_hotDocument.UnderlyingDocumentChanged +=
        new EventHandler(FileChanged);
    m_hotDocument.Load("data.xml");
    m_hotDocument.EnableFileChanges = true;

Note that you can't set EnableFileChanges to true before the XML DOM is built, that is, before the Load method has been called.

Registering a handler for the custom UnderlyingDocumentChanged event is not mandatory, but doing so gives your application an immediate notification about what happened. The value of the HasChanges property automatically indicates any underlying changes that the current XML DOM does not yet reflect, however. When you build an XML DOM, the HasChanges property is reset to false. Figure 5-6 shows the sample application immediately after startup.

Figure 5-6: A sample application making use of the XmlHotDocument class. No pending changes have been detected yet on the displayed XML file.

When another user, or another application, modifies the XML file that is being processed by the current instance of the XmlHotDocument object, an UnderlyingDocumentChanged event reaches the application.
The sample program shown in Figure 5-6 handles the event using the following code:

    void FileChanged(object sender, EventArgs e)
    {
        UpdateUI();
    }

The internal UpdateUI method simply refreshes the user interface, checking the state of the HasChanges property, as shown here:

    if (m_hotDocument.HasChanges)
        PendingChanges.Text = "*** Pending changes ***";

Figure 5-7 shows the application when it detects a change.


Figure 5-7: The sample application detects changes in the underlying XML file and updates the user interface.

At this point, the user can reload the XML DOM using the Load method again, as shown in the following code. As mentioned, calling the Load method resets the status of the HasChanges property, resulting in an up-to-date user interface.

    public override void Load(string filename)
    {
        // Load the DOM the usual way
        base.Load(filename);

        // Reset pending changes
        HasChanges = false;
    }

Figure 5-8 shows the application displaying the change.

Figure 5-8: Changes dynamically occurring in the XML document are now correctly reflected by the XML DOM used by the application.

A hot-plugging XML DOM is more than a made-to-measure example. It is a piece of code that you might find useful in all those circumstances in which you make use of extremely volatile XML documents.
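For reference, the fragments discussed in this section can be assembled into one compilable class. What follows is a sketch under a few assumptions rather than the exact listing of the sample application: the m_hooked flag (which prevents the Changed handler from being registered more than once), the guard on an empty BaseURI, and the HotDocumentDemo class with its temporary file are additions made for illustration.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Xml;

public class XmlHotDocument : XmlDocument
{
    private FileSystemWatcher m_watcher = new FileSystemWatcher();
    private bool m_hooked = false;   // assumption: avoids duplicate handlers

    public event EventHandler UnderlyingDocumentChanged;

    // True when the file on disk has changed since the last Load.
    public bool HasChanges { get; private set; }

    public bool EnableFileChanges
    {
        get { return m_watcher.EnableRaisingEvents; }
        set
        {
            if (!value) { m_watcher.EnableRaisingEvents = false; return; }
            if (BaseURI.Length == 0)
                throw new InvalidOperationException("Call Load(string) first.");

            // BaseURI holds a file:// URI; FileInfo needs a local path.
            FileInfo fi = new FileInfo(new Uri(BaseURI).LocalPath);
            m_watcher.Path = fi.DirectoryName;
            m_watcher.Filter = fi.Name;
            m_watcher.NotifyFilter = NotifyFilters.LastWrite;
            if (!m_hooked)
            {
                m_watcher.Changed += new FileSystemEventHandler(OnChanged);
                m_hooked = true;
            }
            m_watcher.EnableRaisingEvents = true;
        }
    }

    public override void Load(string filename)
    {
        base.Load(filename);
        HasChanges = false;          // the DOM is in sync again
    }

    private void OnChanged(object source, FileSystemEventArgs e)
    {
        HasChanges = true;
        EventHandler handler = UnderlyingDocumentChanged;
        if (handler != null) handler(this, EventArgs.Empty);
    }
}

// Hypothetical console driver; the temporary file name is generated at run time.
public static class HotDocumentDemo
{
    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(),
            Guid.NewGuid().ToString("N") + ".xml");
        File.WriteAllText(path, "<data/>");

        XmlHotDocument doc = new XmlHotDocument();
        doc.UnderlyingDocumentChanged += delegate { Console.WriteLine("File changed"); };
        doc.Load(path);
        doc.EnableFileChanges = true;

        File.WriteAllText(path, "<data updated='true'/>");
        Thread.Sleep(500);           // give the watcher time to report
        Console.WriteLine("Pending changes: " + doc.HasChanges);

        doc.EnableFileChanges = false;
        File.Delete(path);
    }
}
```

Keep in mind that FileSystemWatcher raises its events on a thread-pool thread unless its SynchronizingObject property is set, so a Windows Forms handler that touches controls would need to marshal the call back to the UI thread.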


Conclusion

This chapter presented the .NET Framework classes that provide XML DOM capabilities. Using these classes, primarily XmlDocument and XmlNode, you can parse XML documents, building in-memory and fully accessible representations of data. The overall programming interface of the XmlDocument class might look familiar to those of you who have spent some time working with the Microsoft COM-based MSXML library. The XmlDocument class provides methods to load XML documents from a variety of sources, including XML readers and streams. The loading of a document can happen only synchronously, but you can significantly lessen the impact of this design issue by using multiple threads.

To locate a node in the in-memory tree that represents the original XML document, you can proceed with a collection that returns only the first level of child nodes, or you can, more effectively, use an XPath query string to locate nodes by condition. If your goal is visiting all the nodes that are part of a given DOM subtree, you have two options, both of which have been described with code in this chapter. One possibility is writing your own recursive algorithm to visit all the child nodes below a given root. An alternative approach is based on the XmlNodeReader class, an XML reader class capable of reading nodes from an XML DOM source.

You also learned how to build XML documents from scratch using the XML DOM classes and the methods offered by the XmlDocument class.
Creating new documents using XML DOM is not as efficient as using XML writers, but because the document is first built in memory, you have an unprecedented level of flexibility and can fine-tune your document before it is written to the output stream.

XML DOM is a powerful object model that provides you with a rich set of methods and properties to manipulate the schema and contents of XML documents. Under the hood of the XML DOM interface, however, you still find XML reader and writer objects working hard to provide input and output functionalities. Extending the DOM is as easy as deriving a new class from XmlDocument, as you saw when we created a "sensitive" XML DOM class that detects incoming changes in the underlying XML file and fires ad hoc events to the caller application. In Chapter 6, we'll take the plunge into XPath and the .NET Framework classes that make it happen.

Further Reading

This chapter repeatedly mentions the XML DOM as the starting point for defining the set of methods and properties for the XmlDocument class. The .NET Framework classes support the interface defined by the DOM Level 1 Core and DOM Level 2 Core specifications.
If you are interested in the official papers, you can find them at http://www.w3.org/TR/REC-DOM-Level-1 and http://www.w3.org/TR/DOM-Level-2.

Another topic that has been mentioned quite often is XPath. We'll be looking at XPath in Chapter 6, but you won't find a complete reference to the syntax elements of the XPath query language there. (In general, this book is not a comprehensive reference for any of the XML-related standards.) For a thorough treatment of this topic, refer to Essential XML Quick Reference, by Aaron Skonnard and Martin Gudgin (Addison-Wesley, 2001), which provides short comments and descriptions and not much background information, but does cover in detail every single element of the syntax. By combining the information in that book with the general information available in this one, you should end up with a good grasp of the technology.

In this chapter, I developed an XML DOM extension that enables XML applications to detect ongoing changes in the XML files they are processing through the DOM. Another example of XML DOM extensions is available for download at http://www.gotdotnet.com/userfiles/XMLDom/extendDOM.zip.


Chapter 6: XML Query Language and Navigation

Overview

XML sprang to life as a metalanguage that can be used to describe any sort of data and documents using a truly hierarchical representation, or a representation that simply looks hierarchical. As XML gained broad acceptance from the software industry, the need for additional and related standards promptly arose. In Chapter 5, we looked at the XML Document Object Model (XML DOM), which represents the official object model for XML data containers.

Although it is rich and powerful, XML DOM alone does not address the needs of XML data retrieval. One of the key advantages of XML markup text over plain text is that it can be used to mark portions of the text with special tags and attributes. So how do you effectively retrieve parts of an XML document that have been marked in a certain way?

The need for an effective XML-based query language is as old as the need for a general-purpose data description language. In fact, a W3C-ratified standard for an XML query language followed shortly after the XML 1.0 recommendation. XPath is the query language defined to address parts of an XML document using a compact, relatively simple, but not XML-based syntax. More importantly, XPath is designed to define and provide a common syntax for accessing XML nodes through the XML DOM as well as from Extensible Stylesheet Language Transformations (XSLT) scripts.
(We'll look at XSLT in Chapter 7.)

In the Microsoft .NET Framework, XPath is fully supported through the classes defined in the System.Xml.XPath namespace. The .NET Framework implementation of XPath is based on a language parser and an evaluation engine. The overall architecture is similar to database queries. As with SQL commands, you prepare XPath expressions and submit them to a run-time engine for evaluation. The query is parsed and executed against a data source, an instance of the XML DOM. Next you get back some information representing the result set of the query.

What Is XPath, Anyway?

XPath is a general-purpose query language for addressing and filtering both the elements and the text of an XML document. As the name suggests, the XPath notation is basically declarative. A valid XPath expression looks like a path to a particular set of nodes or a value excerpted from the source document.

XPath works on top of a tree-based representation of the source document. The path expresses a node pattern using a notation that emphasizes the hierarchical relationship between the nodes. Although semantically speaking the closest similarity is with the SQL query language, from a syntax point of view, XPath expressions look a lot like a file system path composed of folder and file names. For example, consider the following simple XPath expression:

    customer/address

This expression states: find all the address nodes that happen to be children of the customer element. But on which nodes is this expression evaluated? An XPath expression is always evaluated in the context of a node.
The context node is designated by the application and represents the starting point of the query. Expressing the concept of the context node in terms of file system paths, we could say that the appropriate file system counterpart for the context node is the current directory.

The nodes affected by the expression form the context node-set. The final set of nodes that is actually returned to the application is a subset of the context node-set that includes only those nodes that match the specified criteria.
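The effect of the context node is easy to observe with the SelectNodes method of XmlNode. In the following minimal sketch, the same customer/address expression is evaluated against different context nodes of one document; the sample markup and the ContextNodeSketch class are hypothetical and serve only to illustrate the point.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the chapter's sample code.
public static class ContextNodeSketch
{
    // Invented sample data: three addresses under customers, one under a supplier.
    public const string SampleXml =
        "<customers>" +
        "<customer><address>A</address><address>B</address></customer>" +
        "<customer><address>C</address></customer>" +
        "<supplier><address>D</address></supplier>" +
        "</customers>";

    public static XmlDocument LoadSample()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc;
    }

    // Evaluates customer/address relative to the given context node.
    public static int CountAddresses(XmlNode contextNode)
    {
        return contextNode.SelectNodes("customer/address").Count;
    }

    public static void Main()
    {
        XmlDocument doc = LoadSample();

        // Context node <customers>: its customer children are searched.
        Console.WriteLine(CountAddresses(doc.DocumentElement));

        // Context node is the document itself: it has no customer child.
        Console.WriteLine(CountAddresses(doc));

        // Context node is a <customer>: it has no customer child either.
        Console.WriteLine(CountAddresses(doc.DocumentElement.FirstChild));
    }
}
```

Only the first call finds any nodes, because only the customers element has customer children. This mirrors the way a relative file system path resolves differently depending on the current directory.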


Context of XPath Queries

The context of an XPath query includes, but is not limited to, a context node and a context node-set. The XPath context also contains position and namespace information, variable bindings, and a standard library of functions. We'll look at the contents of the XPath context in detail in this section.

In the .NET Framework, the context node is the XmlNode object on which you call either the SelectNodes or the SelectSingleNode method. The context node-set is determined by the so-called axis of the query. The axis is a keyword that specifies the group of nodes that will then be filtered by the XPath expression.

XPath Axes

Continuing with the file system parallel, the axis is similar to the drive information in a file system path. Like the drive identifier, axis information is not strictly necessary, and a default value can be assumed if the axis is omitted.

If an XPath query has no axis element, the context node-set contains the direct children of the context node. As with drives, when specified, an axis defines the entire set of nodes that the following path will evaluate. Table 6-1 lists the available axes.

Table 6-1: XPath Axes

Axis | Description | Context Node-Set
self | The context node. | 7
child | Children of the context node. | 8, 9
parent | Parent of the context node. | 5
descendant | Nodes in the subtree rooted in the context node. The variant descendant-or-self adds the context node to the set. | 8, 9, 10
ancestor | Parent of the context node and then parent's parent, up to the document root. The variant ancestor-or-self adds the context node to the set. | 5, 1
following | All the nodes that will be visited after the context node, excluding its descendants. The XPath specification dictates that the document be visited in depth-first order, going as deep as possible on a path. | > 10
following-sibling | Following sibling nodes of the context node. | 11
preceding | All the nodes already visited according to the standard algorithm, excluding the ancestors of the context node. | 2, 3, 4, 6
preceding-sibling | Preceding sibling of the context node. | 6

The context node-set numbers in Table 6-1 refer to the XML tree in Figure 6-1 and indicate the nodes that would form the corresponding node-set once a given axis is specified. The context node is labeled 7.
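The axes can be tried out on a small tree. The sketch below is an assumption-laden reconstruction: it builds an 11-node tree whose numbering is consistent with the node-sets listed for the context node 7 (the placement of nodes 2, 3, and 4 is a guess), names each element nN after its visiting order, and evaluates axis::* from n7. Note that, per the XPath 1.0 specification, the following axis leaves out the descendants of the context node and the preceding axis leaves out its ancestors. The helper name Eval is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical tree: element nN stands for the node visited Nth.
string xml =
    "<n1>" +
      "<n2><n3/><n4/></n2>" +
      "<n5><n6/><n7><n8/><n9><n10/></n9></n7><n11/></n5>" +
    "</n1>";

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
XmlNode context = doc.SelectSingleNode("//n7");

// Evaluates "axis::*" from the context node and returns the matching
// node numbers in ascending order, separated by spaces.
string Eval(string axis)
{
    List<int> numbers = new List<int>();
    foreach (XmlNode n in context.SelectNodes(axis + "::*"))
        numbers.Add(int.Parse(n.Name.Substring(1)));
    numbers.Sort();
    return string.Join(" ", numbers);
}

string[] axes = { "self", "child", "parent", "descendant", "ancestor",
                  "following", "following-sibling",
                  "preceding", "preceding-sibling" };
foreach (string axis in axes)
    Console.WriteLine(axis + ": " + Eval(axis));

// Prints:
// self: 7
// child: 8 9
// parent: 5
// descendant: 8 9 10
// ancestor: 1 5
// following: 11
// following-sibling: 11
// preceding: 2 3 4 6
// preceding-sibling: 6
```

The results are sorted numerically so that the output does not depend on the order in which the processor happens to return nodes for reverse axes.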


Figure 6-1: A sample XML tree in which the node numbers indicate the order in which nodes are visited by the XPath query processor.

The XPath specification requires that the nodes be visited in depth-first order, starting from the root and then proceeding with all the children from left to right until a leaf is found. This order corresponds to the order in which nodes are read from an XML disk file.

Position Information

An XPath context is characterized by a position and a size. The position attribute is a one-based value that indicates the ordinal position of the context node in the context node-set to which it belongs. The size attribute, on the other hand, returns the size of the context node-set, that is, the number of nodes being processed by the expression. The number does not necessarily match the size of the final node-set returned to the caller application.

XPath and Namespaces

The XPath processor uses node information to determine whether a match exists with the current expression. The most important information used by XPath expressions is the node's name, type, and attributes.
XPath fully supports XML namespaces and splits the name of a node into two constituent parts: the namespace URI and the local name. The set of namespaces declared in scope for the context node is used to qualify node names in the expression.

Variable Bindings

An XPath expression can contain variable references that are resolved through a set of in-memory bindings established between variable names and actual values. Each variable holds a value whose type is normally one of the four base types: node-set, string, Boolean, and number. It is still possible, however, for a variable reference to contain a value of some other type.
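The following sketch shows position information and namespace qualification at work through the XML DOM. It assumes an invented namespace URI (urn:example-invoices) and an invented document; the XmlNamespaceManager class and the two-argument overloads of SelectNodes and SelectSingleNode are the standard System.Xml ones. Variable bindings are left out because they require a custom context and are beyond a short example.

```csharp
using System;
using System.Xml;

// Hypothetical document; the namespace URI is made up for the example.
string xml =
    "<inv:invoices xmlns:inv='urn:example-invoices'>" +
    "<inv:invoice id='1'/><inv:invoice id='2'/><inv:invoice id='3'/>" +
    "</inv:invoices>";

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);

// The prefix used in the query ("x") need not match the prefix used in
// the document; only the namespace URI has to match.
XmlNamespaceManager nsm = new XmlNamespaceManager(doc.NameTable);
nsm.AddNamespace("x", "urn:example-invoices");

XmlNodeList all = doc.SelectNodes("/x:invoices/x:invoice", nsm);

// position() is one-based; last() returns the size of the context node-set.
XmlNode second = doc.SelectSingleNode("/x:invoices/x:invoice[position() = 2]", nsm);
XmlNode lastNode = doc.SelectSingleNode("/x:invoices/x:invoice[position() = last()]", nsm);

// Unqualified names do not match elements that live in a namespace.
XmlNodeList unqualified = doc.SelectNodes("/invoices/invoice");

Console.WriteLine(all.Count);                        // 3
Console.WriteLine(second.Attributes["id"].Value);    // 2
Console.WriteLine(lastNode.Attributes["id"].Value);  // 3
Console.WriteLine(unqualified.Count);                // 0
```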


XPath Functions

Any implementation of the XPath parser must provide a function library that is used to evaluate expressions. Functions in the core library have no namespace information, but extension functions can have a namespace. Extension functions are defined within vendor-specific XPath implementations but can also be provided by specialized and XPath-based programming APIs such as XSLT and XML Pointer Language (XPointer) APIs.

The functions in the XPath core library work on the base XPath types: node-set, Boolean, string, and number. Type conversion is automatically performed whenever possible. The only type conversion not permitted is from any other type to node-sets. Table 6-2 lists some of the commonly used functions included in the library.

Table 6-2: Some Members of the XPath Core Library

Function | Description
last | A node-set function that returns the number of nodes in the current node-set
name | A node-set function that returns the fully qualified name of the specified node
text | Returns the text of the specified node (strictly speaking, text() is a node test rather than a core-library function, although it is written with the same call-like syntax)
position | A node-set function that returns the index of the context node in the current node-set
boolean | A Boolean function that converts a value to a Boolean
contains | A string function that indicates whether a string contains the specified substring
substring | A string function that returns the specified substring
starts-with | A string function that indicates whether the string begins with a given substring
ceiling | A number function that rounds a number up to the next integer
floor | A number function that rounds a number down to the next integer
round | A number function that rounds a number to the nearest integer

You will likely use the node-set functions most often. While being processed, an XPath expression is tokenized into subexpressions, and each subexpression is individually evaluated. The XPath processor is passed the subexpression and the context node-set. It returns a possibly narrowed node-set that will be iteratively used as the input argument for the next subexpression. During this process, the context node, position, and size can vary, whereas variable and function references as well as namespace declarations remain intact.

Location Paths

As mentioned, an XPath expression can return any of the following types: Boolean, string, number, or node-set. In most cases, however, it will return a set of nodes. The most frequently used type of XPath expression is the location path.

A location path looks a lot like a file system path and, like a file system path, can be either absolute or relative to the context node. When absolute, a location path begins with the forward slash (/). The following expression, for example, locates all the invoice nodes, irrespective of the node on which the expression is evaluated.

    /archive/invoices/invoice

In contrast, this expression attempts to retrieve the invoice nodes at the end of a particular path that starts from the current node:

    archive/invoices/invoice

Unabbreviated Syntax for a Location Path

A fully qualified location path consists of three pieces: an optional axis, a node test, and an optional predicate. The axis information defines the initial context node-set for the expression, whereas the node test is a sequence of node names that identifies a path in the node-set. The predicate is a logical expression that defines the criteria to filter the current node-set.

If the location path lacks any of its optional components, it is said to be in abbreviated form. The general, unabbreviated, syntax for a location path expression is shown here:

    axis::node-test[predicate]

The syntax dictates that the axis be separated from the rest of the expression by a double colon (::). This special separator once again recalls the parallel between axis information and drive information in a file system. The predicate is enclosed in square brackets. A location path can include multiple predicates that are written one after another like indexes in a multidimensional array.

The node test is a node-based expression that is evaluated for each node in the context node-set.
If the expression returns true, the node remains in the node-set; otherwise, it is removed. Typically, the node test takes the form of a path. Read as an expression, it returns true if the specified path exists below the context node and false otherwise. The following code demonstrates a fully qualified XPath location:

    descendant::invoice[@year = 2002]

The XPath processor first selects all the descendants of the context node. Next it selects from this set all the nodes whose year attribute equals 2002.

Tip
You can use the wildcard character (*) to indicate all the nodes in a given axis. For example, the expression child::* denotes all the children of the current context node. Likewise, descendant-or-self::* means all the descendants and the node itself.

Location Steps

A location path is composed of several child elements called location steps. Each location step is actually a location path and, as such, can be expressed in an abbreviated or fully qualified form, as appropriate. Location steps are separated by forward slashes, as shown in Figure 6-2.
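The differences between absolute and relative paths, and the axis::node-test[predicate] form, can be checked with a few queries. This is a sketch over an invented archive document (element names follow the expressions used above; attribute values are made up), written with top-level statements on the assumption of a recent C# compiler.

```csharp
using System;
using System.Xml;

// Hypothetical document shaped after the paths used in the text.
string xml =
    "<archive><invoices>" +
    "<invoice year='2001'/><invoice year='2002'/><invoice year='2002'/>" +
    "</invoices></archive>";

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
XmlNode archive = doc.DocumentElement;
XmlNode invoices = archive.FirstChild;

// Absolute path: the result does not depend on the context node.
int absolute = invoices.SelectNodes("/archive/invoices/invoice").Count;

// Relative path: evaluated from invoices, there is no archive child...
int relativeMiss = invoices.SelectNodes("archive/invoices/invoice").Count;
// ...but evaluated from the document node, the path exists.
int relativeHit = doc.SelectNodes("archive/invoices/invoice").Count;

// axis::node-test[predicate]
int filtered = archive.SelectNodes("descendant::invoice[@year = 2002]").Count;

// Two predicates in a row: the second one works on the output of the first.
int firstOf2002 = archive.SelectNodes("descendant::invoice[@year = 2002][1]").Count;

// Wildcards on an axis.
int children = archive.SelectNodes("child::*").Count;
int subtree = archive.SelectNodes("descendant-or-self::*").Count;

Console.WriteLine(absolute);      // 3
Console.WriteLine(relativeMiss);  // 0
Console.WriteLine(relativeHit);   // 3
Console.WriteLine(filtered);      // 2
Console.WriteLine(firstOf2002);   // 1
Console.WriteLine(children);      // 1
Console.WriteLine(subtree);       // 5
```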


Figure 6-2: A location path consists of one or more location steps, each of which can be expressed in full or abbreviated form.

Consider the following three-step expression:

    invoices/descendant::invoice[@year = 2002]/child::country[text() = 'USA']

The first step selects all the nodes named invoices below the context node. This node-set is then passed as the context node-set to the next location step. The second location step is expressed in an unabbreviated form and loops through all the descendants of each previously selected invoices node. When processed, each node plays the role of the context node and provides different position information. At the end of the second step, the node-set contains only the invoice nodes that have an invoices ancestor and a year attribute set to 2002.

The final step moves one level down and selects the country children of those invoice nodes whose text equals USA. Any invoice node that has no such country child contributes nothing to the result.

Note
The at sign (@) that you use to indicate a node attribute is actually an abbreviation for another particular axis type: the attribute. The full syntax for the year attribute is attribute::year. The XPath specification recommends a number of abbreviations that are commonly used in coding, including the following shortcuts: Use a period (.) to indicate the context node and two periods (..) to refer to the parent. When no axis is specified, child:: is assumed.
Finally, [n] means the nth node in the current context node-set; this array-like notation is equivalent to [position() = n].

Links Between Documents

The XPath query language is used to select a set of nodes in a given XML document. You typically use XPath to search for nodes in an XML DOM implementation of a data source and to filter the nodes to which a given transformation template in an XSL script must be applied.

Recently, another possible use for the XPath syntax has boldly emerged. I'm talking about XPointer, which is designed to become the standard way to link portions of external documents to XML documents.

What Is XPointer?

XPointer is used to locate data within an XML document. When XML documents need to point to external resources, they can declare an entity reference or, more effectively, include the whole resource, using the XML Inclusion (XInclude) syntax. XInclude, a W3C recommendation candidate, links the host document to an external resource, or a portion of it. XPointer defines the syntax you use to specify the addressed portion of the document.

Normally, to indicate a particular position in an XML document, you attach a fragment identifier to the document's URL. A fragment identifier is marked by a number sign (#) and follows the document's URL.
For example, the URL http://www.w3.org/TR/xptr/#conformance points to the portion of the document labeled with the conformance name.

With XPointer, you can use the XPath syntax to identify with greater flexibility a particular location in the external document.


How XPointer Uses XPath

An XPointer fragment identifier can be the name of a particular portion of the target document, but it could also be a more complex and expressive XPath query. For example, you could link a piece of information using the following syntax:

    invoices.xml#xpointer(/descendant::invoice[@id=201])

This expression references the particular descendant node named invoice having an id attribute equal to 201.

XPath in the XML DOM

In the .NET Framework, you can make use of XPath expressions in two ways: through the XML DOM or by means of a new and more flexible API based on the concept of the XPath navigator.

In the former case, you use XPath expressions to select nodes within the context of a live instance of the XmlDocument class. As we saw in Chapter 5, the XmlDocument class is the .NET Framework class that renders a given XML document as a hierarchical object model (XML DOM). This approach keeps the API close to the old MSXML programming style and has probably been supplied mostly for compatibility reasons.

The alternative approach consists of creating an instance of the XPathDocument class and obtaining from it an XPath navigator object. The navigator object is a generic XPath processor that works on top of any XML data store that exposes the IXPathNavigable interface. Rendered through the XPathNavigator class, the XPath navigator object parses and executes expressions using its Select method.
XPath expressions can be passed as plain text or as preprocessed, compiled expressions. As you can see, although the classes involved are different, the overall programming style is not much different from those pushed by MSXML and the .NET Framework XML DOM classes.

This said, though, the XPath navigator object represents a quantum leap from the SelectNodes method of the XmlDocument class. For one thing, it works on top of highly specialized document classes that implement IXPathNavigable and are optimized to perform both XPath queries and XSL transformations. In contrast, the XmlDocument class is a generic data container class that incorporates an XPath processor but is not built around it.

Several classes in the .NET Framework implement the IXPathNavigable interface, thus making their contents automatically selectable by XPath expressions. We'll look at the navigation API in more detail in the section "The .NET XPath Navigation API." For now, let's review the XPath support built into the XmlDocument class.

The XML DOM Node Retrieval API

When using XPath queries to query an XML DOM instance, you can use the SelectNodes method of the XmlDocument class. In particular, SelectNodes returns a collection that contains instances of all the XmlNode objects that match the specified expression.
If you don't need the entire node-set, but instead plan to use the query to locate the root of a particular subtree, use the SelectSingleNode method. SelectSingleNode takes an XPath expression and returns a reference to the first match found.

The SelectNodes and SelectSingleNode methods provide the same functionality as the methods available from the Component Object Model (COM)-based MSXML library that script and Microsoft Win32 applications normally use. It is worth noting that these methods are not part of the official W3C XML DOM specification but represent, instead, Microsoft extensions to the standard XML DOM.

At the application level, XML DOM methods and the XPath navigator supply different programming interfaces, but internally they run absolutely equivalent code.

The SelectNodes Internal Implementation

The SelectNodes method internally employs a navigator object to retrieve the list of matching nodes. The return value of the navigator's Select method is then used to initialize an undocumented internal node list class named System.Xml.XPath.XPathNodeList. As you have probably guessed, this class inherits from XmlNodeList, which is a documented class. To verify this statement, compile and run the following simple code:

    XmlDocument doc = new XmlDocument();
    doc.Load(fileName);
    XmlNodeList nodes = doc.SelectNodes("child::*");
    Console.WriteLine(nodes.ToString());

The true type of the variable nodes is XPathNodeList. If you try to reference that type in your code, you get a compile error due to the protection level of the class.

What's the difference between using SelectNodes and the XPath navigator object? The SelectNodes method uses a navigator that works on top of a generic XML document class, the XmlDocument class. The SelectNodes method's navigator object is, in fact, created by the XmlDocument class's CreateNavigator method.
If you choose to publicly manage a navigator, you normally create it from a more specific and XPath-optimized document class, the XPathDocument class.

The XPath expression is passed to the navigator as plain text:

    XmlNodeList SelectNodes(string xpathExpr, XmlNamespaceManager nsm)

Interestingly enough, however, if you use this overload of the SelectNodes method that handles namespace information, the XPath expression is first compiled and then passed to the processor.

As we'll see in the section "Compiling Expressions," only compiled XPath expressions support namespace information. In particular, they get namespace information through an instance of the XmlNamespaceManager class.

The SelectSingleNode Internal Implementation

The SelectSingleNode method is really a special case of SelectNodes. Unfortunately, there is no performance advantage in using SelectSingleNode in lieu of SelectNodes. The following pseudocode illustrates the current implementation of the SelectSingleNode method:

    public XmlNode SelectSingleNode(string xpathExpr)
    {
        XmlNodeList nodes = SelectNodes(xpathExpr);
        return nodes[0];
    }

The SelectSingleNode method internally calls SelectNodes and retrieves all the nodes that match a given XPath expression. Next it simply returns the first selected node to the caller. Using SelectSingleNode perhaps results in more easily readable code, but doing so certainly does not improve the performance of the application when you need just one node.

In the next section, we'll build a sample Microsoft Windows Forms application to start practicing with XPath expressions, thus turning all that theory about the XPath query language into concrete programming calls.

The Sample XPath Evaluator

The sample XPath Evaluator application is a Windows Forms application that loads an XML document and then performs an XPath query on it. The application's user interface lets you type in both the context node and the query string. Next it creates an XML DOM for the document and calls SelectNodes.

The output of the expression is rendered as an XML string rooted in an arbitrary results node, as shown here:

    <results>
      ... XML nodes that match ...
    </results>

The sample application is shown in Figure 6-3. You can find the code listing for this application in this book's sample files.

Figure 6-3: The XPath Evaluator sample application in action.

Initializing the Application

When the user clicks the Load button, a StreamReader object is used to load the specified XML document and refresh the left text box, which displays the contents of the XPath source document. I used the I/O API to read the document to preserve the newline characters.
An alternative approach consists of loading the document into the XmlDocument class and then getting the source through the document element's OuterXml property. In this case, however, what you get is a string of contiguous characters that does not display well in a fixed-width text box.
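The two routes for running an XPath query described in this part of the chapter can be compared side by side. The sketch below is illustrative only: the invoices document is invented, and the snippet assumes a recent C# compiler because it uses top-level statements. It runs the same expression through SelectNodes and SelectSingleNode on an XmlDocument and through a compiled expression on an XPathNavigator obtained from an XPathDocument.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Hypothetical sample data.
string xml =
    "<invoices><invoice id='200'/><invoice id='201'/><invoice id='202'/></invoices>";
string query = "/descendant::invoice[@id > 200]";

// Route 1: the XML DOM.
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
XmlNodeList nodes = doc.SelectNodes(query);
XmlNode first = doc.SelectSingleNode(query);

// SelectSingleNode returns null when nothing matches.
XmlNode none = doc.SelectSingleNode("/descendant::invoice[@id = 999]");

// Route 2: a navigator on top of the XPath-optimized document class.
XPathDocument xpathDoc = new XPathDocument(new StringReader(xml));
XPathNavigator nav = xpathDoc.CreateNavigator();
XPathExpression expr = nav.Compile(query);
XPathNodeIterator iterator = nav.Select(expr);

int navCount = 0;
string navIds = "";
while (iterator.MoveNext())
{
    navCount++;
    navIds += iterator.Current.GetAttribute("id", "") + " ";
}

Console.WriteLine(nodes.Count);                   // 2
Console.WriteLine(first.Attributes["id"].Value);  // 201
Console.WriteLine(none == null);                  // True
Console.WriteLine(navCount);                      // 2
Console.WriteLine(navIds.Trim());                 // 201 202
```

The DOM route hands back XmlNode objects that you can edit in place, whereas the navigator exposes a read-only cursor over the XPathDocument store.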


Setting the Context Node

As mentioned, the context node is the starting point of the query. The context node is important if you specify a relative expression. In this case, the context node (that is, the XmlNode object from which you call SelectNodes) determines the full path. The context node is simply ignored if the expression contains an absolute location path, in which case, the path must start from the XML root node.

The sample application first initializes the XML DOM and then sets the context node by calling SelectSingleNode on the document object. For the sake of generality, this application's user interface accepts a reference to the context node using an XPath expression, as shown here:

    XmlDocument doc = new XmlDocument();
    doc.Load(xmlFile);
    XmlNode cxtNode = doc.SelectSingleNode(ContextNode.Text);

In a real-world situation, you normally know what the context node is (typically, the XML document root) and can locate it more efficiently through the XML DOM itself, using the ChildNodes collection or a property such as DocumentElement. For example, the following code shows how to set the context node to the document's root:

    XmlNode cxtNode = doc.DocumentElement;
    XmlNodeList nodes = cxtNode.SelectNodes(xpathExpr);

Performing the XPath Query

After you type the XPath expression, you click the Eval button to run the query.
Note that the node names in an XPath expression are case-sensitive and must perfectly match the names in the original source document.

After the processor has processed the node list, the output string is built by calling the BuildOutputString method and then displayed in the form's results panel via the ShowResults method, as shown here:

    string buf = "";
    int nodeCount = 0;
    XmlNodeList nodes = null;
    try {
        nodes = cxtNode.SelectNodes(xpathExpr);
        nodeCount = nodes.Count;
    }
    catch {}
    if (nodes == null || nodeCount


    StringBuilder sb = new StringBuilder("<results>");
    foreach(XmlNode n in nodes)
        sb.Append(n.OuterXml);
    sb.Append("</results>");
    return sb.ToString();

Our sample application intentionally follows a more sophisticated approach to display formatted output in the text box. In addition, this code turns out to be a useful exercise for understanding the logic of XML writers.

If you want to generate XML output in the .NET Framework, unless the text is short and straightforward, you have no good reason for not using XML writers. Using XML writers also provides automatic and free indentation. Don't think that choosing an XML writer ties you to using a specific output stream. As the following code demonstrates, the output of an XML writer can be easily redirected to a string:

    // Requires the System.IO, System.Text, and System.Xml namespaces
    string BuildOutputString(XmlNodeList nodes)
    {
        // Create a string writer to hold the XML text. For efficiency,
        // the string writer is based on a StringBuilder object.
        StringBuilder sb = new StringBuilder("");
        StringWriter sw = new StringWriter(sb);

        // Instantiate the XML writer
        XmlTextWriter writer = new XmlTextWriter(sw);
        writer.Formatting = Formatting.Indented;

        // Write the first element (no WriteStartDocument call is needed)
        writer.WriteStartElement("results");

        // Loop through the children of each selected node and
        // recursively output attributes and text
        foreach(XmlNode n in nodes)
            LoopThroughChildren(writer, n);

        // Complete pending nodes and then close the writer
        writer.WriteEndElement();
        writer.Close();

        // Flush the contents accumulated in the string writer
        return sw.ToString();
    }

Let's see what happens when we process the following XML document:

    <MyDataSet>
      <NorthwindEmployees>
        <Employee>
          <employeeid>1</employeeid>
          <lastname>Davolio</lastname>
          <firstname>Nancy</firstname>
          <title>Sales Representative</title>
        </Employee>
        ⋮
      </NorthwindEmployees>
    </MyDataSet>

This document is the same XML representation of the Northwind's Employees database that we used in previous chapters. To see the application in action, let's set MyDataSet (the root) as the context node and try the following expression:

    NorthwindEmployees/Employee[employeeid > 7]

The XPath query has two steps. The first step restricts the search to all the NorthwindEmployees nodes in the source document. In this case, there is only one node with that name. The second step moves the search one level down and then focuses on the Employee nodes that are children of the current NorthwindEmployees context node. The predicate [employeeid > 7] includes in the final result only the Employee nodes with an employeeid child element greater than 7. The following XML output is what XPath Evaluator returns:

    <results>
      <Employee>
        <employeeid>8</employeeid>
        <lastname>Callahan</lastname>
        <firstname>Laura</firstname>
        <title>Inside Sales Coordinator</title>
      </Employee>
      <Employee>
        <employeeid>9</employeeid>
        <lastname>Dodsworth</lastname>
        <firstname>Anne</firstname>
        <title>Sales Representative</title>
      </Employee>
    </results>

Figure 6-4 shows the user interface of XPath Evaluator when it is set to work on our sample document and expression.
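The query discussed above can be reproduced outside the Windows Forms application with a few lines of code. This is a minimal sketch, not the sample application's listing: it uses a three-employee excerpt typed in by hand, and the lastname and firstname element names are assumptions (only employeeid and title are confirmed by the expressions in the text). The results are serialized with the simple OuterXml approach.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical three-employee excerpt; lastname and firstname are
// assumed element names.
string xml =
    "<MyDataSet><NorthwindEmployees>" +
    "<Employee><employeeid>1</employeeid><lastname>Davolio</lastname>" +
    "<firstname>Nancy</firstname><title>Sales Representative</title></Employee>" +
    "<Employee><employeeid>8</employeeid><lastname>Callahan</lastname>" +
    "<firstname>Laura</firstname><title>Inside Sales Coordinator</title></Employee>" +
    "<Employee><employeeid>9</employeeid><lastname>Dodsworth</lastname>" +
    "<firstname>Anne</firstname><title>Sales Representative</title></Employee>" +
    "</NorthwindEmployees></MyDataSet>";

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);

// MyDataSet (the root) is the context node.
XmlNode cxtNode = doc.DocumentElement;
XmlNodeList nodes = cxtNode.SelectNodes("NorthwindEmployees/Employee[employeeid > 7]");

// Serialize the node-set under an arbitrary results root.
StringBuilder sb = new StringBuilder("<results>");
foreach (XmlNode n in nodes)
    sb.Append(n.OuterXml);
sb.Append("</results>");
string output = sb.ToString();

Console.WriteLine(nodes.Count);  // 2
Console.WriteLine(output);
```

The predicate compares the string value of each employeeid child, converted to a number, with 7, so only the employees numbered 8 and 9 survive.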


Figure 6-4: The node set returned by XPath Evaluator.

Note
The preceding expression is an abbreviated form that could have been more precisely expressed as follows:

    NorthwindEmployees/Employee/self::*[child::employeeid > 7]

You apply the predicate to the context node itself (self) and verify that the employeeid node among its children has a value greater than 7.

The content of the final node-set is determined by the node that appears in the last step of the XPath expression. Predicates allow you to perform a sort of forward checking, that is, selecting nodes at a certain level but based on the values of child nodes. The expression NorthwindEmployees/Employee[employeeid > 7] is different from this one:

    NorthwindEmployees/Employee/employeeid[node() > 7]

In this case, the node set consists of employeeid nodes, as shown here:

    <results>
      <employeeid>8</employeeid>
      <employeeid>9</employeeid>
    </results>

Concatenating Multiple Predicates

An XPath expression can contain any number of predicates. If no predicate is specified, child::* is assumed, and all the children are returned. Otherwise, the conditions set with the various predicates are logically concatenated using a short-circuited AND operator. Predicates are processed in the order in which they appear, and the next predicate always works on the node-set generated by the previous one, as shown here:

    Employee[contains(title, 'Representative')][employeeid > 7]


This example first selects all the <Employee> nodes whose <title> child node contains the word Representative. Next the returned set is further filtered by discarding all the nodes with an <employeeid> not greater than 7.

Accessing the Selected Nodes
The SelectNodes method returns the XPath node set through an XmlNodeList data structure, that is, a list of references to XmlNode objects. If you need simply to pass on this information to another application module, you can serialize the list to XML using a plain for-each statement and the XmlNode class's OuterXml property.

Suppose, instead, that you want to access and process all the nodes in the result set. In this case, you set up a recursive procedure, like the following LoopThroughChildren routine, and start it up with a for-each statement that touches on the first-level nodes in the XPath node-set:

    foreach(XmlNode n in nodes)
        LoopThroughChildren(writer, n);

The following procedure is designed to output the node contents to an XML writer, but you can easily modify the procedure to meet your own needs.

    void LoopThroughChildren(XmlTextWriter writer, XmlNode rootNode)
    {
        // Process the start tag
        if (rootNode.NodeType == XmlNodeType.Element)
        {
            writer.WriteStartElement(rootNode.Name);

            // Process any attributes
            foreach(XmlAttribute a in rootNode.Attributes)
                writer.WriteAttributeString(a.Name, a.Value);

            // Recursively process any child nodes
            foreach(XmlNode n in rootNode.ChildNodes)
                LoopThroughChildren(writer, n);

            // Process the end tag
            writer.WriteEndElement();
        }
        else
            // Process any content text
            if (rootNode.NodeType == XmlNodeType.Text)
                writer.WriteString(rootNode.Value);
    }

This version of the LoopThroughChildren routine is an adaptation of the routine we analyzed in Chapter 5.
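The simpler OuterXml approach mentioned at the start of this section can be sketched as follows. This is only an illustration under stated assumptions: the inline document, the SelectNodesSketch class, and the Serialize helper are hypothetical names invented for the example, and the element names merely mirror the Northwind sample.

```csharp
using System;
using System.Text;
using System.Xml;

public static class SelectNodesSketch
{
    // Hypothetical inline stand-in for the chapter's sample file.
    public const string SampleXml =
        "<MyDataSet><NorthwindEmployees>" +
        "<Employee><employeeid>1</employeeid><lastname>Davolio</lastname>" +
        "<title>Sales Representative</title></Employee>" +
        "<Employee><employeeid>8</employeeid><lastname>Callahan</lastname>" +
        "<title>Inside Sales Coordinator</title></Employee>" +
        "<Employee><employeeid>9</employeeid><lastname>Dodsworth</lastname>" +
        "<title>Sales Representative</title></Employee>" +
        "</NorthwindEmployees></MyDataSet>";

    // Runs the query with the document element (MyDataSet) as the context
    // node and concatenates the OuterXml of every selected node.
    public static string Serialize(string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        XmlNodeList nodes = doc.DocumentElement.SelectNodes(xpath);

        StringBuilder sb = new StringBuilder();
        foreach (XmlNode n in nodes)
            sb.Append(n.OuterXml);
        return sb.ToString();
    }

    public static void Main()
    {
        // Single predicate
        Console.WriteLine(Serialize("NorthwindEmployees/Employee[employeeid > 7]"));

        // Two predicates applied one after the other
        Console.WriteLine(Serialize(
            "NorthwindEmployees/Employee[contains(title, 'Representative')][employeeid > 7]"));
    }
}
```

With this sample data, the second query should keep only the employee who satisfies both predicates, which makes the filtering order easy to verify by eye.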


A Better Way to Select a Single Node
In the section "The SelectSingleNode Internal Implementation," on page 255, I pointed out that SelectSingleNode is not as efficient as its signature and description might suggest. This XML DOM method is expected to perform an XPath query and then return only the first node. You might think that the method works smartly, returning to the caller as soon as the first node has been found.

Unfortunately, that isn't what happens. SelectSingleNode internally calls SelectNodes, loads all the nodes (potentially a large number), and then returns only the first node to the caller. The inefficiency of this implementation lies in the fact that a significant memory footprint might be required, albeit for a very short time.

So in situations in which you need to perform an XPath query to get only a subset of the final node-set (for example, exactly one node), you can use a smarter XPath expression. The basic idea is that you avoid generic wildcard expressions like the following:

    doc.SelectSingleNode("NorthwindEmployees/Employee");

Instead, place a stronger filter on the XPath expression so that it returns just the subset you want. For example, to get only the first node, use the following query:

    doc.SelectSingleNode("NorthwindEmployees/Employee[position() = 1]");

The same pattern can be applied to get a matching node in a particular position. For example, if you need to get the nth matching node, use the following expression:

    doc.SelectSingleNode("NorthwindEmployees/Employee[position()


Figure 6-5: Applications can access XPath through either XmlDocument or XPathDocument. In both cases, the actual query is performed by a .NET XPath navigator object.

As you can see, the XmlDocument and XPathDocument classes have different internal layouts. XmlDocument implements XML DOM, whereas XPathDocument provides a more agile and compact structure, designed to speed XPath-driven navigation. (Later in this chapter, in the section "The XPathDocument Class," on page 281, I'll have more to say about this.)

No matter the application-level API, the sequence of steps necessary to execute XPath queries on an XML data source is always the same:

1. Get a reference to an XPath-enabled document class (for example, an instance of an XPathDocument or XmlDocument class).
2. Create a navigator object for the class instance.
3. Optionally, precompile the XPath expression.
4. Call the Select method on the navigator object to act on the specified XPath expression.

The XPathNavigator Class
The programming interface of the navigator object is defined in the XPathNavigator abstract class. The XPathNavigator class represents a generic interface designed to act as a reader for any data that exposes its contents as XML.

Functionally speaking, the XPathNavigator class is not much different from a pseudoclass that simply groups together all the XML DOM methods (ChildNodes, SelectNodes, and SelectSingleNode) to navigate the document contents. The big difference lies in the fact that XPathNavigator is a distinct component completely decoupled from the document class. As mentioned, XPathNavigator represents a generic interface to navigate and read data from any XML-based, or XML-looking, contents.

The XPathNavigator class enables you to move from one node to the next and perform XPath queries. In the .NET Framework, only three classes support XPath navigators: XmlDocument, XPathDocument, and XmlDataDocument.

An XPath navigator works on top of a special breed of XML document class that is generically referred to as an XPath data store. An XPath data store is simply any .NET Framework class that exposes its contents as XML and that can be queried using XPath expressions. An XPath data store can be based on a native XML stream or other data sources exposed as XML. For example, both the XmlDocument and XPathDocument classes are built from well-formed XML data. In contrast, the XmlDataDocument class exposes as XML the contents of an ADO.NET DataSet object. In all cases, however, the XPath query and navigation API works just fine.

As a stand-alone class providing a programming interface, the navigator is much more than a simple collection of XPath-related methods. The XPath navigator is not bound to a particular document class and can be associated with a number of data container classes.

A .NET Framework class becomes XPath-enabled simply by implementing the IXPathNavigable interface. This interface consists of a single method, CreateNavigator, that creates and returns an instance of a document-specific navigator object, as shown here:

    public interface IXPathNavigable
    {
        XPathNavigator CreateNavigator();
    }

All document-specific navigators derive from the XPathNavigator abstract class.

XPath Navigators and XML Readers
The MSDN documentation defines an XPath navigator as a class that reads data from an XML-based data store using a cursor model. XPathNavigator, therefore, provides read-only, random access to the underlying XML-based data. The navigator has a notion of the current node and advances the internal pointer using a series of move methods. When the navigator is positioned on a given node, all of its properties reflect the value of that node. How is this different from the XML readers that we encountered in Chapter 2?

XPath navigators and XML readers are radically different objects, although both look like client-side cursors for reading XML data. Let's review the key differences:


• Connection model  Both readers and navigators work on top of a data source. Readers, however, work connected to the input stream, which is often a persistent storage medium like a disk file. Navigators always work on memory-mapped data sources like XML DOM or more optimized and specialized structures. Readers must be closed when you have finished with them; navigators are simply garbage-collected when they go out of scope. A parallel can be drawn with ADO.NET data readers and DataSet objects. A data reader object, like SqlDataReader, is connected to the data source, whereas a DataSet object is a disconnected object.

• Navigation interface  Readers are simple read-only and forward-only cursors. Navigators too are read-only, but they let you move forward and backward. The navigator's set of move methods is significantly richer. In particular, the set includes methods for going to the root of the underlying document, for reaching the parent node, for reaching the next and the previous sibling, for reaching the node where the given namespace is defined, and even more. In addition, you can synchronize the navigator position with the current position on another navigator object.

• Programming interface  Navigators provide rich XPath capabilities and supply methods that perform XPath queries and return groups of related nodes. You have a generic Select method but also ad hoc selection methods that specialize on the most common XPath axes, such as descendant, ancestor, and child. In addition, navigators can simply evaluate an XPath expression and return the value.

Conceptually, XPath navigators and XML readers occupy diametrically opposed positions in the .NET XML puzzle. Moreover, this difference is evident from their names. Navigators are designed to traverse XML-based or XML-looking data. XML readers are simply lower-level tools that you can use to read XML-based or XML-looking data and build the in-memory data structures that navigators rely on.

Note
As mentioned, XML readers and navigators work on XML-based or XML-looking data. XML-based data refers to data persisted, or just read, as well-formed XML. As we saw in Chapter 2, however, you can use specialized reader classes to publish non-XML data through a virtual XML tree. Likewise, a navigator can be built to work on top of a data store that creates a virtual XML tree from non-XML data. XML-looking data refers to just such virtual XML trees.

The XPathNavigator Programming Interface
Let's briefly review the properties and methods that form the programming interface of the XPathNavigator class. A valid instance of the class can be obtained by calling the CreateNavigator method on any .NET Framework class that implements the IXPathNavigable interface.

Properties of the XPathNavigator Class
Table 6-3 summarizes the properties of the XPathNavigator class. As you can see, most of these properties reflect the characteristics of the currently selected node.

Table 6-3: Properties of the XPathNavigator Class

Property          Description
BaseURI           Gets the base URI of the current node
HasAttributes     Indicates whether the current node has any attributes
HasChildren       Indicates whether the current node has any child nodes
IsEmptyElement    Indicates whether the current node is empty (for example, <node />)
LocalName         Gets the name of the current node without the namespace prefix
Name              Gets the fully qualified name of the current node
NamespaceURI      Gets the URI of the namespace associated with the current node
NameTable         Gets the name table associated with the navigator
NodeType          Gets the type of the current node
Prefix            Gets the namespace prefix associated with the current node
Value             Returns a string denoting the value of the current node
XmlLang           Gets the xml:lang scope for the current node

Like XML readers and XML DOM documents, an XPath navigator employs a name table to more efficiently store strings. The set of properties looks like the subset of properties that in the XmlTextReader class characterizes the current node.

Methods of the XPathNavigator Class
The tables in this section group the methods available in the XPathNavigator class into three main categories: move methods, selection methods, and miscellaneous methods. Table 6-4 lists the move methods.

Table 6-4: XPathNavigator Move Methods

Method                  Description
MoveTo                  Moves to the same position as the specified XPathNavigator object.
MoveToAttribute         Moves to the specified attribute of the current node.
MoveToFirst             Moves to the first sibling of the current node.
MoveToFirstAttribute    Moves to the first attribute of the current node.
MoveToFirstChild        Moves to the first child of the current node.
MoveToFirstNamespace    Moves to the first namespace in the current element node.
MoveToId                Moves to the node with an attribute of type ID whose value matches the given string.
MoveToNamespace         Moves to the namespace node with the specified prefix in the current element node. A namespace node is seen as an attribute node with the xmlns name. The real name of the namespace node is the prefix.
MoveToNext              Moves to the next sibling of the current node.
MoveToNextAttribute     Moves to the next attribute of the current node.
MoveToNextNamespace     Moves to the next namespace in the current element node.
MoveToParent            Moves to the parent of the current node.
MoveToPrevious          Moves to the previous sibling of the current node.
MoveToRoot              Moves to the root node of the document.

The MoveTo method attempts to synchronize the current instance of the XPathNavigator object with another instance. MoveTo returns true or false depending on the success or failure of the operation. Note that the synchronization always fails if the two navigators are actually implemented through different and incompatible classes. Two navigators have different implementations if the other navigator can't be cast to the current type.

Consider the following pseudocode:

    public bool MoveTo(XPathNavigator other)
    {
        InternalXPathNavigator nav = other as InternalXPathNavigator;
        if (nav == null)
            return false;
        ⋮
    }

In C#, the as operator behaves like a cast except that, when the conversion fails, it returns null rather than raising an exception. In the preceding pseudocode, the InternalXPathNavigator class represents the actual (and internal) navigator class you got from the document's CreateNavigator method. Each XPath-enabled document class actually instantiates a custom navigator class and returns that class when you call its CreateNavigator method.

The MoveTo method also might fail when the two navigators share the same implementation but point to different document instances. What happens in this case, however, depends on the specific implementation. In particular, MoveTo fails when the document class is XmlDocument or XmlDataDocument, but not when the underlying data object is an instance of XPathDocument.

Namespace Node Navigation
As you might have noticed in Table 6-4, there are three types of move methods: for element, attribute, and namespace nodes. Calling the wrong method on a node causes the whole operation to fail, and there is no change in the position of the navigator. Only MoveTo and MoveToRoot can be called on any node, irrespective of the type. In addition, attributes and namespaces also have ad hoc methods to return their values: GetAttribute and GetNamespace.

When you call either MoveToFirstNamespace or MoveToNextNamespace, you can specify an argument of type XPathNamespaceScope. The XPathNamespaceScope enumeration has three values: All, ExcludeXml, and Local. All returns all namespaces defined in the scope of the current node, including xmlns:xml, which is always declared implicitly. ExcludeXml returns all namespaces defined in the scope of the current node, excluding xmlns:xml. Local returns all namespaces that are defined locally at the current node. Whatever value you specify, the order of the namespaces returned is not defined. A namespace node is a special type of attribute node. When selected, the navigator's Name property returns the namespace prefix. The Value property, on the other hand, returns the URI.

Table 6-5 lists the XPathNavigator class's methods for selecting nodes through XPath queries.

Table 6-5: XPathNavigator Selection Methods

Method              Description
Select              Returns the node-set selected by the specified XPath expression. The context for the selection is the position of the navigator when the method is called. The XPath expression can be passed in as plain text or in a compiled form.
SelectAncestors     Selects all the ancestor element nodes of the current node. You can narrow the returned node-set by specifying a node name and a namespace URI to match.
SelectChildren      Selects all the child nodes of the current node. You can narrow the node-set by specifying a node name and a namespace URI to match. Attributes and namespace nodes are not included.
SelectDescendants   Selects all the descendant nodes of the current node. You can narrow the node-set by specifying a node name and a namespace URI to match. Attributes and namespace nodes are not included.

None of these methods produces any effect on the state of the XPathNavigator object. The following code snippet demonstrates how to select the descendants of a node. The code to get the ancestors is nearly identical.

    // Create the underlying XPath-enabled document object
    XPathDocument doc = new XPathDocument(fileName);

    // Create the navigator for the specified object
    XPathNavigator nav = doc.CreateNavigator();

    // Select the descendants of the current node that match
    // the specified criteria
    nav.SelectDescendants(nodeName, nsUri, selfIncluded);
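A self-contained variant of the preceding snippet might look like the following sketch. It assumes a small in-memory document instead of a file, and the DescendantsSketch class and CountDescendants helper are hypothetical names introduced only for illustration. The node-set returned by SelectDescendants is simply walked with MoveNext to count its members.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class DescendantsSketch
{
    // Counts the descendants of the document element that have the given
    // local name and no namespace. When selfIncluded is true, the context
    // node itself is also considered.
    public static int CountDescendants(string xml, string nodeName, bool selfIncluded)
    {
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // A new navigator starts at the root; move to the document element.
        nav.MoveToFirstChild();

        XPathNodeIterator it = nav.SelectDescendants(nodeName, "", selfIncluded);
        int count = 0;
        while (it.MoveNext())
            count++;
        return count;
    }

    public static void Main()
    {
        // Hypothetical document with a nested element of the same name.
        string xml = "<section><para/><section><para/></section></section>";

        Console.WriteLine(CountDescendants(xml, "para", false));
        Console.WriteLine(CountDescendants(xml, "section", false));
        Console.WriteLine(CountDescendants(xml, "section", true));
    }
}
```

Comparing the last two calls shows the effect of the Boolean argument: the outer section element is part of the count only when the context node is included.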


SelectDescendants, as well as SelectAncestors, has the following two overloads. The former takes a node type and returns only the nodes of that type, if any. The latter takes a node name and a namespace URI.

    XPathNodeIterator SelectDescendants(XPathNodeType, bool);
    XPathNodeIterator SelectDescendants(string, string, bool);

If you pass both the node name and the namespace URI as empty strings, all descendant nodes with no namespace information are selected. This method, and the homologous SelectAncestors and SelectChildren methods, is a specialized query performed along the corresponding XPath axis.

The Boolean argument you specify in the method signatures indicates whether the context node must be included in the final node-set. Setting the argument to true is equivalent to working along the descendant-or-self axis.

Important
As you might have noticed, all selection methods return a new type of object: the XPathNodeIterator class. This class will be covered in detail in the section "The XPathNodeIterator Class," on page 285. For now, suffice it to say that an XPath iterator provides a generic way to visit a set of selected nodes. From this point of view, an iterator is not much different from an enumerator, just a bit more specialized.

Table 6-6 lists the remaining XPathNavigator methods.

Table 6-6: XPathNavigator Miscellaneous Methods

Method            Description
Clone             Clones the navigator and returns a new object with the same current node.
ComparePosition   Compares the position of the current navigator with the position of the specified XPathNavigator object.
Compile           Compiles an XPath expression.
Evaluate          Evaluates the given XPath expression and returns the result.
GetAttribute      Gets the value of the specified attribute, if such an attribute exists on the current node.
GetNamespace      Gets the URI of the specified namespace prefix, if such a namespace exists on the current node.
IsDescendant      Indicates whether the specified navigator is a descendant of the current navigator. A navigator is a descendant of another navigator if it is positioned in a descendant node.
IsSamePosition    Indicates whether the current navigator is at the same position as the specified navigator.
Matches           Determines whether the current node matches the specified XPath expression.
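To make the cursor model more concrete, the following sketch combines a few of the move methods from Table 6-4 with some of the miscellaneous methods listed above. The inline document and the NavigatorSketch class are assumptions made for the example; only the XPathNavigator members themselves come from the .NET Framework.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class NavigatorSketch
{
    // Hypothetical inline document modeled on the Northwind sample.
    public const string SampleXml =
        "<MyDataSet><NorthwindEmployees>" +
        "<Employee><employeeid>1</employeeid><lastname>Davolio</lastname></Employee>" +
        "<Employee><employeeid>8</employeeid><lastname>Callahan</lastname></Employee>" +
        "</NorthwindEmployees></MyDataSet>";

    public static string Describe()
    {
        XPathNavigator nav =
            new XPathDocument(new StringReader(SampleXml)).CreateNavigator();

        nav.MoveToFirstChild();   // from the root to MyDataSet
        nav.MoveToFirstChild();   // NorthwindEmployees
        nav.MoveToFirstChild();   // first Employee

        // Clone creates an independent cursor on the same node.
        XPathNavigator bookmark = nav.Clone();

        nav.MoveToNext();         // second Employee; bookmark does not move
        bool same = nav.IsSamePosition(bookmark);

        // Evaluate returns object; a numeric expression yields a double.
        double count = (double) nav.Evaluate("count(../Employee)");

        // Matches tests the current node against a pattern.
        bool matches = nav.Matches("Employee[employeeid > 7]");

        // MoveTo synchronizes one navigator with another.
        bool synced = bookmark.MoveTo(nav) && bookmark.IsSamePosition(nav);

        return "same=" + same + " count=" + count +
               " matches=" + matches + " synced=" + synced;
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

Because both navigators here are created over the same XPathDocument instance, the MoveTo call is expected to succeed; navigators over different document classes would not synchronize.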


As you can see, several methods have to do with XPath expressions that are often rendered as instances of the XPathExpression class. But why do we need to express an XPath command using a new class?

XPath Expressions in the .NET Framework
An XPath expression is first of all a string that represents a location path, but an XPath expression is a bit more than a plain command string. It has a surrounding context that is just what the .NET Framework XPathExpression class encapsulates. The context of an expression includes the return type and the namespace information to handle the involved nodes.

The XPathExpression Class
Table 6-7 lists the methods and properties that characterize a .NET Framework XPath expression.

Table 6-7: Properties and Methods of the XPathExpression Class

Name          Description
Expression    Property that returns the XPath expression as a string.
ReturnType    Property that returns the computed result type of the expression.
AddSort       Method that sorts the nodes selected by the expression.
Clone         Method that clones the XPathExpression object.
SetContext    Method that sets the necessary information to use for resolving node namespaces. The information is passed packed into an object of type XmlNamespaceManager.

Looking at the programming interface of the XPathExpression class, you'll notice the methods Clone and AddSort. As its name suggests, Clone makes a deep copy of the object, creating a brand-new and identical object. AddSort, on the other hand, associates the expression with a sorting algorithm that will be automatically run once the node-set for the expression has been retrieved.

The XPathExpression class is not publicly creatable. To get a new instance of this class, you must take a plain XPath string expression and compile it into an XPathExpression object.

Compiling Expressions
Both the XML DOM SelectNodes method and the navigator object's Select method let you execute an XPath query indicating the expression as plain text. In spite of this simplified programming interface, in the .NET Framework, an XPath expression can execute only in its compiled form. This means that both the aforementioned methods silently compile the provided text into an XPathExpression before proceeding.

Note
In this context, the term compile does not mean that the XPath expression is transformed into an executable (and/or managed) piece of code. More simply, the action of compiling must be literally seen as the process that produces an object by collecting and putting together many pieces of information.

There are several advantages to compiling the expression yourself. For one thing, you can reuse the compiled object over and over. If you repeatedly call an XPath selection method to work on the same expression as plain text, each time the method will compile the text and instantiate an equivalent object. If you have a compiled expression, you save a few operations.


In addition, a compiled expression lets you know in advance about the expected return type. The return type is one of the values defined in the XPathResultType enumeration, shown in Table 6-8.

Table 6-8: XPath Return Types

Type         Description
Any          Represents any of the XPath node types
Boolean      Represents a Boolean value
Error        When returned, the expression does not evaluate to a correct XPath type
Navigator    Described as a value that returns a tree fragment; in the current version of the .NET Framework, implemented as a synonym of String
NodeSet      Represents a collection of nodes
Number       Represents a numeric, floating-point value
String       Represents a string value

The Boolean, NodeSet, Number, and String types come directly from the W3C specification; the others represent extensions. However, Any and Error do not introduce any new functionality but simply make the enumeration type more consistent.

If you use a compiled expression, you can add namespace information to process the nodes and define a sorting algorithm for the resultant node-set. All this extra information remains associated with the XPathExpression object and can be reused at will.

To compile an expression, you use the Compile method of the XPathNavigator class. The method takes a string and returns an XPathExpression object, as shown here:

    XPathDocument doc = new XPathDocument(fileName);
    XPathNavigator nav = doc.CreateNavigator();
    XPathExpression expr = nav.Compile(xpathExpr);

    // Output the expected return type
    Console.WriteLine(expr.ReturnType.ToString());

    // Execute the expression
    nav.Select(expr);

A compiled XPath expression can be consumed by a few navigator methods, including Select, Evaluate, and Matches.

Important
Unlike the navigator's Select method, the XML DOM SelectNodes method can't accept a compiled XPath expression. Internally, the SelectNodes method creates an instance of the navigator object that actually compiles the XPath string into an XPathExpression object. In this case, however, there is no object reuse.
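As a rough illustration of the points made in this section, the following sketch compiles expressions to inspect their return type without running them, and then reuses a single compiled expression with both Select and Evaluate. The inline document, the CompiledExpressionSketch class, and its two helper methods are hypothetical and exist only for the example.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class CompiledExpressionSketch
{
    // Hypothetical inline document modeled on the Northwind sample.
    public const string SampleXml =
        "<MyDataSet><NorthwindEmployees>" +
        "<Employee><employeeid>1</employeeid></Employee>" +
        "<Employee><employeeid>8</employeeid></Employee>" +
        "<Employee><employeeid>9</employeeid></Employee>" +
        "</NorthwindEmployees></MyDataSet>";

    static XPathNavigator CreateNavigator()
    {
        return new XPathDocument(new StringReader(SampleXml)).CreateNavigator();
    }

    // Compiles the expression and reports the expected return type
    // without executing the query.
    public static XPathResultType ReturnTypeOf(string xpath)
    {
        return CreateNavigator().Compile(xpath).ReturnType;
    }

    // Compiles once, then consumes the same XPathExpression object
    // through both Select and Evaluate. Returns the two counts added up.
    public static int SelectAndEvaluate(string xpath)
    {
        XPathNavigator nav = CreateNavigator();
        nav.MoveToFirstChild();   // MyDataSet becomes the context node

        XPathExpression expr = nav.Compile(xpath);

        int viaSelect = nav.Select(expr).Count;
        XPathNodeIterator viaEvaluate = (XPathNodeIterator) nav.Evaluate(expr);

        return viaSelect + viaEvaluate.Count;
    }

    public static void Main()
    {
        Console.WriteLine(ReturnTypeOf("NorthwindEmployees/Employee"));
        Console.WriteLine(ReturnTypeOf("count(//Employee)"));
        Console.WriteLine(SelectAndEvaluate("NorthwindEmployees/Employee[employeeid > 7]"));
    }
}
```

Checking ReturnType first is a cheap way to decide whether an expression typed in by a user should go to Select (node-sets) or to Evaluate (any other type).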


Setting Namespace Information
The information you can pass through the SetContext method helps the XPath processor to resolve any namespace references in the expression. If no prefix appears in the expression, it is assumed that the namespace URI for all nodes is the empty namespace. Otherwise, you must let the processor know about defined prefix and namespace URI mappings.

You create an XmlNamespaceManager object, pack it with all the needed information, and then use the SetContext method to register it with the XPath expression object, as shown here:

    // Create the navigator
    XPathDocument doc = new XPathDocument(fileName);
    XPathNavigator nav = doc.CreateNavigator();

    // Create and populate the XML namespace manager
    XmlNamespaceManager xnm = new XmlNamespaceManager(nav.NameTable);
    xnm.AddNamespace("dd", "urn:dino-e");
    xnm.AddNamespace("es", "http://www.contoso.com");

    // Set the expression's context
    XPathExpression expr = nav.Compile(xpathExpr);
    expr.SetContext(xnm);

The .NET XPath processor is designed to look for the namespace manager on the XPath expression object prior to proceeding.

Evaluating Expressions
As mentioned, when evaluated, an XPath expression can return any of four basic types: node-set, Boolean, number, or string. If the return type is a node-set, you can run the expression through both the Select method and the Evaluate method.

The Select method returns an object of type XPathNodeIterator that you can use to walk your way through the members of the node-set. Unlike Select, the Evaluate method returns a generic object type, which it is your responsibility to cast to the correct strong type, as in the following example:

    XPathNodeIterator iterator = (XPathNodeIterator) nav.Evaluate(expr);

Expressions that do not return a node-set can be used only with the Evaluate method. In this case, however, you must also cast the returned object to a strong type, as shown here:

    string buf = (string) nav.Evaluate(expr);

The Evaluate method has no effect on the state of the navigator. An interesting overload for the method is shown here:

    public object Evaluate(XPathExpression expr, XPathNodeIterator context);


Normally, the expression is evaluated using the current node in the navigator as the context node. Using this overload, however, you can control the context node for the expression. If the context argument is null, the method works as usual. Otherwise, if context points to a valid iterator object, the current node in the iterator is used to determine the context node for the XPath expression.

Sorting the Node-Set

An interesting extension to the XPath programming model built into the XPathExpression class and the XPath processor is the ability to sort the node-set before it is passed back to the caller. To add a sorting algorithm, call the AddSort method of the XPathExpression object. AddSort has two overloads, as follows:

public void AddSort(object expr, IComparer comparer);
public void AddSort(object expr, XmlSortOrder order, XmlCaseOrder caseOrder, string lang, XmlDataType dataType);

The expr argument denotes the sort key. It can be a string representing a node name or another XPathExpression object that evaluates to a node name. In the first overload, the comparer argument refers to an instance of a class that implements the IComparer interface. The interface supplies a Compare method that is actually used for comparing a pair of values. Use this overload if you need to specify a custom algorithm to sort nodes.

Using the Comparer Object

To sort arrays of objects, the .NET Framework provides a few predefined comparer classes, including Comparer and CaseInsensitiveComparer. The former class compares objects (including strings) with respect to the case. The latter class does the same, but irrespective of the case. To use both classes in your code, be sure to import the System.Collections namespace.

You rarely need to instantiate the Comparer class yourself; it provides a shared instance through the Default static property, as shown here:

expr.AddSort("lastname", Comparer.Default);

If you need to create your own comparer class, do as follows:

class MyOwnStringComparer : IComparer
{
    public int Compare(object x, object y)
    {
        string strX = (string) x;
        string strY = (string) y;

        // 0 if equals, >0 if x>y, <0 if x<y
        return String.Compare(strX, strY);
    }
}
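The two AddSort overloads can be compared side by side in a short sketch. Everything named here (SortDemo, OrdinalIgnoreCaseComparer, the sample titles and prices) is an assumption made for illustration and is not taken from the book's sample application. The first query sorts by title through a custom IComparer; the second sorts by price as a number, in descending order, through the five-argument overload.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class SortDemo
{
    // Hypothetical sample data.
    public const string SampleXml =
        "<books>" +
        "<book><title>beta</title><price>9</price></book>" +
        "<book><title>Alpha</title><price>100</price></book>" +
        "<book><title>gamma</title><price>25</price></book>" +
        "</books>";

    // Custom comparer: ordinal comparison that ignores case.
    private sealed class OrdinalIgnoreCaseComparer : IComparer
    {
        public int Compare(object x, object y)
        {
            return string.Compare(Convert.ToString(x), Convert.ToString(y),
                                  StringComparison.OrdinalIgnoreCase);
        }
    }

    private static XPathNavigator CreateNavigator()
    {
        return new XPathDocument(new StringReader(SampleXml)).CreateNavigator();
    }

    // Runs the (already sorted) expression and collects each book's title.
    private static List<string> CollectTitles(XPathNavigator nav, XPathExpression expr)
    {
        List<string> titles = new List<string>();
        XPathNodeIterator it = nav.Select(expr);
        while (it.MoveNext())
        {
            XPathNavigator copy = it.Current.Clone();
            copy.MoveToFirstChild();   // <title> is the first child of <book>
            titles.Add(copy.Value);
        }
        return titles;
    }

    // First overload: sort key plus a custom IComparer.
    public static List<string> TitlesByCustomComparer()
    {
        XPathNavigator nav = CreateNavigator();
        XPathExpression expr = nav.Compile("/books/book");
        expr.AddSort("title", new OrdinalIgnoreCaseComparer());
        return CollectTitles(nav, expr);
    }

    // Second overload: sort key, order, case order, language, and data type.
    public static List<string> TitlesByPriceDescending()
    {
        XPathNavigator nav = CreateNavigator();
        XPathExpression expr = nav.Compile("/books/book");
        expr.AddSort("price", XmlSortOrder.Descending, XmlCaseOrder.None,
                     "", XmlDataType.Number);
        return CollectTitles(nav, expr);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", TitlesByCustomComparer()));
        Console.WriteLine(string.Join(", ", TitlesByPriceDescending()));
    }
}
```

The sort must be added before the expression is passed to Select. Choosing XmlDataType.Number matters for the price key: compared as text, the value 100 would sort before 25 and 9 in ascending order.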


Figure 6-6: The sample application sorts nodes by title and lastname.

To generate the output shown in this figure, I made use of an XPath iterator to visit all the nodes and their own subtrees. We'll examine this code in detail in the section "Visiting the Selected Nodes," on page 286, but first we'll take a look at the internal layout of the XML document classes the navigator relies on.

XPath Data Stores

As mentioned, an XPath navigator works on top of an ad hoc document class. The .NET Framework provides three XPath-enabled classes: XPathDocument, XmlDocument, and XmlDataDocument. These classes have in common the IXPathNavigable interface.

In theory, any .NET Framework class can become XPath-enabled. In practice, however, only a subset of classes are good candidates. In the first place, the class must act as the in-memory repository of some sort of content. Second, this content must be, or must be exposed as, XML. When these two prerequisites are met, classes can reasonably implement the IXPathNavigable interface and create their own navigators.

An XPath navigator is always class-specific and is built by inheriting from the abstract class XPathNavigator. Although in practice you always use navigators through the generic reference type of XPathNavigator, each class has its own navigator object. Table 6-9 lists these internal, undocumented classes; they are programmatically inaccessible, and often each is implemented in a different way. Despite this complexity, however, the classes' application-level programming interface is common and is based on their base class XPathNavigator.

Table 6-9: Document-Specific Navigator Classes

Document Class     Corresponding Internal Navigator Class
XPathDocument      System.Xml.XPath.XPathDocumentNavigator
XmlDocument        System.Xml.DocumentXPathNavigator
XmlDataDocument    System.Xml.DataDocumentXPathNavigator

The document-specific navigator exploits the internal layout of the document class to provide the navigation API. A document-specific navigator can also have new methods and properties that make sense to a particular implementation. In this case, however, the navigator's author must carefully document the new features; otherwise, it would be hard for a caller to exploit them through the generic XPathNavigator interface.

In the following sections, we'll review the characteristics of the various XPath-enabled document classes.

The XPathDocument Class

The XPathDocument class provides a highly optimized, read-only in-memory cache for XML documents. Specifically designed to serve as an XPath data container, the class does not provide any information or identity for nodes. XPathDocument simply creates an underlying web of node references to let the navigator operate quickly and effectively. XPathDocument does not respect any XML DOM specification and has only one method: CreateNavigator.

The internal architecture of the XPathDocument class looks like a linked list of node references. Nodes are managed through an internal class (XPathNode) that represents a small subset of the XmlNode class, which is the official XML DOM node class in the .NET Framework. You can access the XML nodes of the document only through the properties exposed by the navigator object. (See Table 6-3.)

The following code shows how to create a new, XPathDocument-driven navigator object:

XPathDocument doc = new XPathDocument(fileName);
XPathNavigator nav = doc.CreateNavigator();

The returned navigator is positioned at the root of the document.
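A quick way to check the navigator's initial position is to inspect its NodeType property. The sketch below assumes an inline XML string loaded through a StringReader instead of a file; the XPathDocumentDemo class and the FromText helper are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class XPathDocumentDemo
{
    // Builds a read-only XPathDocument from XML text and returns its navigator.
    public static XPathNavigator FromText(string xml)
    {
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        return doc.CreateNavigator();
    }

    public static void Main()
    {
        XPathNavigator nav = FromText("<books><book>First</book></books>");

        // A freshly created navigator sits on the document root, above <books>.
        Console.WriteLine(nav.NodeType);                    // Root

        // One step down reaches the document element.
        nav.MoveToFirstChild();
        Console.WriteLine(nav.NodeType + " " + nav.Name);   // Element books
    }
}
```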
The XPathDocument class supports only XML-based data sources, and you can initialize it from disk files, streams, text, and XML readers.

Tip
You can also initialize an XPath document using the output returned by the ExecuteXmlReader method of the SqlCommand ADO.NET class. The method builds and returns an XML reader using the result set of a SQL query, as shown here:

SqlCommand cmd = new SqlCommand(query, conn);
XmlReader reader = cmd.ExecuteXmlReader();
XPathDocument doc = new XPathDocument(reader);

The XmlDocument Class

XmlDocument is the class that represents the .NET Framework implementation of the W3C-compliant XML DOM. This aspect of XmlDocument was covered in detail in Chapter 5.

Unlike XPathDocument, the XmlDocument class provides read/write access to the nodes of the underlying XML document. In addition, each node can be individually accessed and sets of nodes can be selected through XPath queries run by the SelectSingleNode and SelectNodes methods, respectively.

The XmlDocument class also enables you to create a navigator object. In this case, however, the navigator will work on a much richer and more complex web of node references. The following code shows how to get the navigator for the XmlDocument class:

XmlDocument doc = new XmlDocument();
doc.Load(fileName);
XPathNavigator nav = doc.CreateNavigator();
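The read/write nature of XmlDocument can be sketched as follows. XmlDocumentDemo and UpdateAndQuery are hypothetical names, and the document is inline sample data rather than a file. A node obtained through SelectSingleNode is modified in place, and a navigator created on the same document reads the new value.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class XmlDocumentDemo
{
    // Updates the first title through the DOM, then reads it back through a navigator.
    // Returns "<number of books>:<first title>".
    public static string UpdateAndQuery()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<books>" +
            "<book><title>First</title></book>" +
            "<book><title>Second</title></book>" +
            "</books>");

        // Individual node access, with write support.
        XmlNode firstTitle = doc.SelectSingleNode("/books/book[1]/title");
        firstTitle.InnerText = "Updated";

        // Node-set selection through the DOM API.
        XmlNodeList books = doc.SelectNodes("/books/book");

        // The navigator works on the same in-memory nodes.
        XPathNavigator nav = doc.CreateNavigator();
        string title = (string) nav.Evaluate("string(/books/book[1]/title)");

        return books.Count + ":" + title;
    }

    public static void Main()
    {
        Console.WriteLine(UpdateAndQuery());   // 2:Updated
    }
}
```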


In particular, XmlDocument's navigator class extends the interface of the standard navigator by implementing the IHasXmlNode interface. This interface defines just one method, GetNode, as shown here:

public interface IHasXmlNode
{
    XmlNode GetNode();
}

Using this method, callers can access and query the currently selected node of the navigator. This feature is simply impossible to implement for navigators based on XPathDocument because it exploits the different internal layout of the XmlDocument class. By design, the XPathDocument class minimizes the memory footprint and does not provide node identity.

If the GetNode method is an extension to the XPathNavigator base class, how can callers take advantage of it? Here's a code snippet:

XmlDocument doc = new XmlDocument();
doc.Load(fileName);
XPathNavigator nav = doc.CreateNavigator();
XmlNode node = ((IHasXmlNode) nav).GetNode();

At this point, the caller program has gained full access to the node and can read and update it at will.

Note
When created, the XmlDocument navigator is not positioned on the root of the document. Instead, it is positioned on the node from which the CreateNavigator method was called.

The XmlDataDocument Class

The XmlDataDocument class is an extension of XmlDocument designed to allow the manipulation of a relational DataSet object through XML. The class also allows for rendering XML data as a relational DataSet object, but this aspect is less important here. (We will return to this topic in Chapter 8.)

The XmlDataDocument class provides a CreateNavigator method to let callers navigate the XML representation of an ADO.NET DataSet object. This is a neat example of the fact that the .NET Framework navigation API applies equally to XML-based data and XML-looking data. Like the XmlDocument navigator, the XmlDataDocument navigator also is not positioned on the root of the document but is positioned on the node from which the CreateNavigator method was called.

Custom Navigator Objects

The .NET Framework navigation API is extensible with navigator objects that work on top of particular XML documents or any other data exposed through a virtual XML node structure. To XPath-enable a given data source, you create a class that inherits from XPathNavigator. You can associate this new navigator class with a document class or make it a stand-alone creatable class. The MSDN documentation includes an example class named FileSystemNavigator. I extracted it from the documentation and compiled the C# and Microsoft Visual Basic code into an assembly. The assembly is available in this book's sample files.

The file system navigator supports a virtual node structure similar to the following:


⋮

Notice that the sample file system navigator places all subfolders of the context folder at the same level, thus losing any hierarchical information. The following code snippet shows how the custom navigator can be created and used:

XPathNavigator nav = new FileSystemNavigator("c:\\folder");

// Exclude the folder itself but not all the subfolders.
// (If you run this on c:\ a VERY LONG list of nodes is returned...)
XPathNodeIterator it = nav.Select("descendant::*[position()>1]");
while (it.MoveNext())
    Console.WriteLine(it.Current.Name);

In this case, the architecture of the sample code makes it significantly harder to execute a query that selects only the children of the context folder. The preceding listing returns all the folders and files below the context folder, regardless of their actual parent folder. The predicate [position()>1] skips over the context folder name.

Tip
When you plan to build a navigator for a persistent data source (for example, a database, the file system, or the registry), you can do without a document class. A document class is key when there is no other API to provide the in-memory infrastructure for navigation. In the previous example, the DirectoryInfo and FileInfo classes provide the core API used by the FileSystemNavigator object. In this case, they actually play the role of the XPath document class.

XPath Iterators

When the XPath expression originates a node-set, the navigator object always returns it using a new breed of object, the node iterator. The node iterator is a relatively simple object that provides an agile, common interface to navigate an array of nodes.
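As a rough sketch of that interface, the helper below walks a node-set with MoveNext and reads the iterator's Current, CurrentPosition, and Count members. IteratorDemo, Describe, and the sample markup are hypothetical names used only for illustration; the navigator returned by Current is cloned before it is used, so the iterator's own position is never disturbed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class IteratorDemo
{
    // Returns one line per selected node: "<position>/<count> <name>=<value>".
    public static List<string> Describe(string xml, string xpath)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathNodeIterator it = nav.Select(xpath);

        List<string> lines = new List<string>();
        while (it.MoveNext())
        {
            // Work on a copy of the current navigator, not on the iterator's own one.
            XPathNavigator node = it.Current.Clone();
            lines.Add(it.CurrentPosition + "/" + it.Count + " " + node.Name + "=" + node.Value);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("<r><a>x</a><a>y</a></r>", "/r/a"))
            Console.WriteLine(line);
    }
}
```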
The base class for XPath iterators is XPathNodeIterator.

The node iterator does not cache any information about the identity of the nodes involved. It simply works as an indexer on top of the navigator object that operated the XPath query. All the functionalities you might find in the implementation of any XPathNodeIterator classes could have been easily packed into the navigator itself. Why then does the .NET Framework provide the navigation and the iteration API as distinct components?


First, decoupling data containers from navigators, and navigators from iterators, makes good sense from a software design standpoint. The ultimate reason for keeping the navigation and the iteration API distinct, however, is that in this way the results of any XPath query can be easily accessed and processed from different programming environments: XML DOM, XPath, and, last but not least, XSLT.

The XPathNodeIterator Class

The XPathNodeIterator class has no public constructor and can be created only by the parent navigator object. The iterator provides forward-only access to the nodes selected by an XPath query. Callers use the iterator's methods and properties to access all the nodes included in the node-set. Figure 6-7 illustrates the relationship between callers, navigators, and iterators. A caller passes an XPath expression. The navigator executes the command and gets a node-set. The caller then receives an iterator object to access the members of the node-set. Current, Count, and MoveNext are the key members of the iterator's programming interface.

Figure 6-7: The relationship between callers, navigators, and iterators.

Properties of the Iterator Object

Table 6-10 summarizes the properties exposed by the XPathNodeIterator class.

Table 6-10: Properties of the XPathNodeIterator Class

Property          Description
Count             Returns the number of elements in the node-set. This value refers to the top-level nodes and does not consider child nodes.
Current           Returns a reference to a navigator object rooted in the iterator's current node.
CurrentPosition   Gets the index of the currently selected node.

The Current property is the key property for callers to drill down into the structure of the selected node. In the XPath Evaluator sample application we discussed earlier in this chapter, at a certain point we had to examine the subtree of each node included in the node-set. The code in Figure 6-4 uses a recursive routine (named LoopThroughChildren) to navigate the subtree of a given node.

The navigator/iterator pair makes that task quite straightforward to accomplish. The Current property already returns a reference to the XPathNavigator object rooted in the currently selected node. Pay attention to the fact that what you get is not a copy of the navigator but a simple reference. If you need to dig into the node structure, make a deep copy of the navigator first. For this purpose, you can use the navigator's Clone method.

Methods of the Iterator Object

Table 6-11 lists the public methods of an iterator object.

Table 6-11: Methods of the XPathNodeIterator Class

Method     Description
Clone      Makes a deep copy of the current XPathNodeIterator object
MoveNext   Moves to the next node in the navigator's selected node-set

When MoveNext is called, the iterator adjusts some internal pointers and refreshes its Current and CurrentPosition properties. When the iterator is first returned to the caller, there is no currently selected node.
Only after the first call to MoveNext does the Current property point to a valid navigator object.

Visiting the Selected Nodes

Let's review the typical way in which an XPath iterator works. Suppose that you just executed an XPath command using an XPathNavigator object, as shown here:

XPathDocument doc = new XPathDocument(fileName);
XPathNavigator nav = doc.CreateNavigator();
XPathNodeIterator iterator = nav.Select(expr);

To visit all the selected nodes, you set up a loop controlled by the iterator's MoveNext method, as follows:

while (iterator.MoveNext())
{
    XPathNavigator nav2 = iterator.Current.Clone();
    ⋮
}

In real-world applications, you need to drill down into the subtree of each node referenced in the XPath node-set. You should not use the navigator returned by the Current property to move away from the node-set. Instead, you should clone the object and use the cloned navigator to perform any additional moves. The following code snippet generates the output shown in Figure 6-6:

while (iterator.MoveNext())
{
    XPathNavigator _copy = iterator.Current.Clone();
    string buf = "";

    // Move to the first child node and read its value
    _copy.MoveToFirstChild();
    buf += _copy.Value + ". ";

    // Move to the next sibling and read its value
    _copy.MoveToNext();
    buf += _copy.Value;

    // Move to the next sibling and read its value
    _copy.MoveToNext();
    buf += ", " + _copy.Value;

    // Move to the next sibling and read its value
    _copy.MoveToNext();
    buf += "\t[" + _copy.Value + "]";

    // Write out the final result
    Console.WriteLine(buf);
}

Of course, the cloned and the original XPathNavigator objects are totally distinct and independent objects, and the clone is not affected by any subsequent changes made to the original navigator.

Conclusion

On the long road to standardization, XPath seems like the first significant step toward a universal query language to keep up with the universal protocol (HTTP), the universal data description language (XML), and the universal remote procedure call protocol (SOAP).

With XPath, you gain the ability to identify and process a group of related nodes from an XML-driven data source. This ability can be exploited by a number of different client environments. XML DOM classes, for example, can use XPath for in-memory data retrieval. XPath is also great for querying XML representations of relational data held both in disconnected structures (such as XmlDataDocument) and in more traditional APIs like XML Extensions for SQL Server 2000.
(See Chapter 10.)

XSLT is another programming environment that successfully leverages XPath. XSLT is particularly powerful when it comes to applying code templates to XML subtrees. XPath supplies the underlying means to identify those nodes declaratively. XPathNavigator supports XSLT and can be used as an input mechanism to the XslTransform class. We'll look at XSLT in more detail in Chapter 7.

This chapter presented two high-level APIs to evaluate XPath expressions: the XML DOM-based API and the newest, .NET Framework-specific navigation API. As we've seen, under the hood, the two APIs make use of the same core code. What's new with XPath in the .NET Framework is the concept of the navigator object, especially in conjunction with the iterator object.

The navigator is a self-contained API used to navigate an XML-based, or XML-looking, data source. The iterator is a child object that comes in handy for accessing the results of XPath queries run by the navigator. All the underlying data structures are extremely optimized and compact. So if you're looking for efficiency, run your XPath queries using the navigation API.

Further Reading

The official XPath specification is available at http://www.w3.org/TR/xpath. This chapter also mentioned XPointer and XInclude as XPath-related technologies.
You can find their current W3C status and most recent specifications at http://www.w3.org/TR/xptr and http://www.w3.org/TR/xinclude.

Like many other XML-related technologies, XPath is well covered in different forms in Essential XML Quick Reference, written by Aaron Skonnard and Martin Gudgin (Addison-Wesley, 2001) and mentioned in previous chapters. For even quicker and more compact references, check out "The XML Files," a monthly column in MSDN Magazine, at http://msdn.microsoft.com/msdnmag. Finally, the following URL points you to a recent and useful article about XPath and namespaces: http://msdn.microsoft.com/library/en-us/dnexxml/html/xml05202002.asp.


Chapter 7: XML Data Transformation

Overview

XML was first introduced as a metalanguage for data description. Why is it a metalanguage and not just a language? In general, the prefix meta indicates an evolutionary transformation process. A metalanguage represents a well-defined interface that evolves and is transformed into derived languages. XML is simply the foundation interface for a number of specific markup languages, each of which is based on its own vocabulary and schema.

The schema syntactically differentiates XML languages from each other. XML is key for data exchange and interoperability, and the schema is essential for providing XML documents with a typed and well-defined structure. Unfortunately, in the imperfect world in which we live, schemas often express the same semantics through different syntaxes.

An XML transformation is simply the XML workaround for this relatively common situation. An XML transformation is a user-defined algorithm that attempts to express the semantics of a given document using another equivalent syntax. A transformation is much like a type cast in programming. You can always try to coerce the type, but in doing so you might have to accept compromises like syntax adaptations and, sometimes, loss of data and logic.

In XML, the transformation process is seen as the application of a style sheet to the source document. The style sheet is a declarative and user-defined document that is referred to as extensible. The term Extensible Stylesheet Language (XSL) indicates a metalanguage designed for expressing style sheets for XML documents. An XSL file contains the set of rules that will be used to transform a document into another, possibly equivalent, document.

XSL files were originally conceived as the XML counterpart of HTML's cascading style sheets (CSS). In this context, XSL files were simply extensible and user-definable tools to render an XML markup in HTML for display purposes. The growing complexity of style sheets, as well as the advent of XML schemas, changed the perspective of XSL and led to XSL Transformations (XSLT).

What Is XSLT, Anyway?

The goal of XSL has evolved over time. Today, XSL is a blanket term for a number of derived technologies that altogether better qualify and implement the original idea of styling XML documents. The various components that fall under the umbrella of XSL are the actual software entities that you use in your code:

• XSLT: Rule-based language for transforming XML documents into any other text-based format. XSLT provides for XML-to-XML transformation, which mostly means schema transformation. An XSLT program is a generic set of transformation rules whose output can be any text-based language, including HTML, Rich Text Format (RTF), and Wireless Markup Language (WML), to name just a few.

• XPath: Query language that XSLT programs use to select specific parts of an XML document. The result of XPath expressions is then parsed and processed by the XSLT processor. Normally, the XSLT processor works sequentially on the source document, but it resorts to XPath if it needs to access and refer to particular groups of nodes. XPath was covered in Chapter 6.


• XSL Formatting Objects (XSL-FO): Advanced styling features expressed by an XML vocabulary that defines the semantics of a set of formatting elements. Most of these formatting objects are borrowed from CSS, Level 2 (CSS2) properties, but others have been added. (See the section "Further Reading," on page 343, for more information.)

XSL and XSLT are not the same thing. XSL still refers to the page styling, of which XML transformations to arbitrary text are just one aspect, albeit the most important aspect. This chapter will focus on the Microsoft .NET Framework implementation of XSLT.

Before going any further with the .NET Framework core classes for data transformation, let's briefly recap the main concepts of XSLT and the programming tools it provides to developers.

XSLT Template Programming

XSLT is a process that combines two XML documents (the XML source file and the style sheet) to produce a third document. The resultant document can be an XML document, an HTML page, or any text-based file the style sheet has been instructed to generate.

The source document must meet only one requirement: it must be a well-formed XML document. The style sheet must be a valid XML document that contains the transformation logic expressed using the elements in the XSLT vocabulary. An XSLT style sheet can be seen as a sequence of templates. Each template takes one or more source elements as input and returns some output text based on literals as well as transformed input data. Figure 7-1 illustrates the transformation process.

Figure 7-1: An overview of the XSLT process.

The core part of the transformation process is the application of templates to XML source elements. Other ancillary steps might include the expansion of elements to text, the execution of some script code, and the selection of a subset of nodes using XPath queries. The layout of a generic XSLT script is shown here:


⋮⋮⋮The root node of an XSLT script is . The node belongs to theofficial W3C namespace <strong>for</strong> XSLT 1.0. (Note that the .<strong>NET</strong> Framework supports onlyXSLT 1.0, but the W3C committees are currently work<strong>in</strong>g on a draft of XSLT 1.1.)Below the node are a variety of nodes, each of which conta<strong>in</strong>sa match attribute. The match attribute conta<strong>in</strong>s a valid XPath expression that selectsthe source node (or nodes) that will be used to fill the template.The template consists of some output literal text <strong>in</strong>terspersed with XSLT placeholdertags. At compile time, the XSLT processor reads source data <strong>for</strong> any match<strong>in</strong>g nodesand dynamically populates all the placeholders. The source markup text is poured <strong>in</strong>tothe template <strong>in</strong> various <strong>for</strong>ms accord<strong>in</strong>g to the particular XSLT <strong>in</strong>struction used. Text orattribute values can be copied or preprocessed us<strong>in</strong>g script code or extension objects.In addition, you can apply some basic flow constructs such as if, when, and <strong>for</strong>-each aswell as process nodes <strong>in</strong> a particular order or filtered by an ad hoc XPath expression.The f<strong>in</strong>al output of each template must <strong>for</strong>m a syntactically valid fragment <strong>in</strong> the targetlanguage—be it <strong>XML</strong>, HTML, RTF, or some other language. You are not required to<strong>in</strong>dicate the target language explicitly, although the XSLT vocabulary provides a tailormade<strong>in</strong>struction to declare what the expected output will be. The ma<strong>in</strong> requirement <strong>for</strong>the XSLT style sheet is that its overall text be well-<strong>for</strong>med <strong>XML</strong>. In addition, it mustmake syntactically correct use of all the XSLT <strong>in</strong>structions it needs. 
The syntax of each embedded XSLT command, therefore, is validated against the official XSLT schema.

Although an XSLT style sheet is not necessarily composed of explicitly declared templates, in many real-world cases, it is. In other situations, you can have an XSLT style sheet that consists of plain XSLT instructions not grouped as individually callable templates.

A template to the XSLT language is much like a function to other high-level programming languages. You can group more instructions under a function or a method, but you can also embed in the source program instructions to run sequentially.

In the body of an XSLT style sheet, a template is always defined with inline code, but it can be configured, and subsequently invoked, in two ways: it can have implicit or explicit arguments. With implicit arguments, you use the match attribute to select the nodes for the template to process. In this case, you apply the template to the matching nodes.

With explicit arguments, you give the template a name and optionally some arguments and let other templates call it explicitly. Like a DLL function, the invoked template can try to determine its context by using XPath expressions, or it can work in isolation, using only the passed arguments. In this case, you call the template to operate on some arguments. We'll look at some examples of template calls in the section "From XML to HTML." In the meantime, Figure 7-2 illustrates the process of applying templates to nodes.

Figure 7-2: Applying an XSLT template to source markup text.

XSLT Instructions

The XSLT vocabulary consists of special tags that represent particular operations you can perform on the source markup text or passed arguments. Although the overall syntax is that of a rigorous XML dialect, you can easily recognize the main constructs of a high-level programming language.

The following subsections summarize the main XSLT instructions you are likely to run across in your XSLT experience. The XSLT instructions are divided into four categories: templates, data manipulation, control flow, and layout.

Template Instructions

An XSLT template is a mixed-content template consisting of verbatim text and expandable placeholders. A template can be applied to a selected group of nodes as well as invoked by other templates with or without arguments. Table 7-1 lists the main commands for working with templates. All of these XSLT elements are qualified with the xsl prefix, but bear in mind that xsl is just an arbitrary, although common, namespace prefix. Feel free to replace it with another prefix in your own code.

Table 7-1: XSLT Instructions for Templates

Instruction
    Description

<xsl:template>
    Defines the transformation rules for the nodes that match the XPath expression set in the match attribute. The template must be explicitly applied to its nodes using the <xsl:apply-templates> command.
    The <xsl:template> instruction can also be used to declare a template that will then be called by name using the <xsl:call-template> command. In this case, use the name attribute instead of match.

<xsl:apply-templates>
    Applies all the possible templates to the elements that match the XPath description. The select attribute selects the target elements. In general, a single element can be affected by multiple templates.

<xsl:call-template>
    Executes the specified template. The name attribute indicates the name of the previously declared template to execute.

<xsl:param>
    Defines a formal argument for a named template. The name attribute indicates the name of the argument. The parameter can have a default value. You specify a default value using either an XPath expression (via the select attribute) or a template as the body of the element.

<xsl:with-param>
    Defines an actual parameter for a template call. The name attribute indicates the matching parameter. The actual value can be expressed using either an XPath expression (via the select attribute) or the body of the element.

When you set the select attribute, the template (or the parameter) will execute in the context of the selected nodes. Any further XPath expression to locate the text of a particular node or attribute must be based in that context.

Data Manipulation Instructions

The commands listed in Table 7-2 are helpful for extracting data out of source nodes and then preprocessing it using in-place code.

Table 7-2: XSLT Instructions for Data Manipulation

Instruction
    Description

<xsl:value-of>
    Returns the value of the specified attribute or the text associated with the given node. You select nodes using XPath expressions. Of course, attributes must be prefixed with an at sign (@).
    This command works more or less as a macro that expands at run time.

<xsl:copy-of>
    Returns the entire node-set that corresponds to the results of the specified XPath expression.

<xsl:sort>
    Specifies sort criteria for the node-set being processed by <xsl:apply-templates> or <xsl:for-each> instructions. In this case, you use the select keyword to indicate the sort key and data-type for the type of sorting (text or number). The order attribute indicates the direction, and case-order designates which case comes first in the sort.

FuncName() (enclosed in a Microsoft extension tag)
    Evaluates a user-defined function and returns the output. The function can access the underlying XML Document Object Model (XML DOM) using the this keyword as the entry point to the document root node. The enclosing tag is a Microsoft extension to the XSL implementation.

Each XSLT implementation supports a different set of languages for writing user-defined functions. For example, Microsoft's XML Core Services (MSXML) supports only Microsoft Visual Basic, Scripting Edition (VBScript) and JScript. The .NET Framework transformation classes, on the other hand, include support for C# and Microsoft Visual Basic .NET. (More on this later.)

Note
The syntax shown for the XSLT instructions is largely incomplete. I limited the descriptions to the most important and most frequently used attributes.
More attributes are actually available; you can find them documented and explained in the MSDN documentation as well as in the resources listed in the section "Further Reading."

Control Flow Instructions

The XSLT vocabulary includes some tags that represent control flow statements such as conditional and iterative statements. Table 7-3 summarizes the most important commands.

Table 7-3: XSLT Instructions for Control Flow

Instruction
    Description

<xsl:for-each>
    Applies the rules in the body to each element that matches the given XPath expression. The node-set can be sorted by putting an <xsl:sort> in the body.

<xsl:if>
    Applies the internal template only if the specified XPath expression evaluates to true.

<xsl:choose>
    Similar to the C# switch statement; represents a multiple-choice statement. Each test is expressed using an <xsl:when> statement, while the <xsl:otherwise> element represents the default choice. The <xsl:choose> statement evaluates the <xsl:when> blocks in order until a test expression returns true. When that happens, the corresponding template is applied. If no test is successful, the <xsl:otherwise> template is invoked.

Although this list of commands lacks a for statement, you can still realize a loop that runs a specified number of times by using the XPath position function. Of course, position returns the index of the current context node and is not a general variable counter. On the other hand, XSLT instructions are designed to work on XPath node-sets, not to arrange general-purpose programs.

Layout Instructions

A typical task for an XSLT script is the creation of new elements and attributes. Sometimes attributes and node elements can be hard-coded in script; sometimes this is just impossible to do. The XSLT statements listed in Table 7-4 let you programmatically create layout elements.

Table 7-4: XSLT Instructions for Layout

Instruction
    Description

<xsl:element>
    Creates an element with the specified name. The namespace attribute indicates the URI of the created element, if any. The <xsl:element> element contains a template for the attributes and children of the created element.

<xsl:attribute>
    Creates an attribute node and attaches it to an output element. The name attribute denotes the name of the attribute, and namespace indicates the namespace URI, if any. The contents of this element specify the value of the attribute.
    Note that <xsl:attribute> can also be used directly on output elements, not only in conjunction with <xsl:element>.

<xsl:processing-instruction>
    Generates a processing instruction in the output text. The name attribute represents the name of the processing instruction. The contents of the element provide the text of the processing instruction.

<xsl:comment>
    Generates a comment node in the output text. The text generated by the body of <xsl:comment> appears between the typical comment wrappers <!-- and -->.

In addition to the instructions described in this section, the XSLT vocabulary contains a few more elements to define data-bound variables (<xsl:variable>), raw text (<xsl:text>), or numbers (<xsl:number>). In particular, a data-bound variable can be given a name and its value calculated either by evaluating an XPath expression or by applying the template in the body of the tag.

After our brief but intensive tour of the XSLT programming interface, let's see how to turn some of these instructions into concrete calls in a real XSLT script. We'll look at a couple of typical examples: converting XML documents to HTML pages, and transforming an XML document into an equivalent schema.

From XML to HTML

Let's return to our faithful XML document (data.xml) from previous chapters and turn it into a compelling HTML page.
This sample XML document contains information about the employees in the Northwind database's Employees table.

The idea is to create a final HTML page that renders the information about employees through a table. The structure of the XSLT script is shown in the following code:

Northwind's Employees


⋮
more templates here
⋮

As the match attribute indicates, the main <xsl:template> instruction applies to the root of the XML document. The XSLT script produces a simple HTML page with a fixed H1 heading and a table. The table is generated by applying all matching templates to the nodes that match the following XPath expression:

MyDataSet/NorthwindEmployees/Employee

The actual templates that make the final HTML page are defined later in the document. To start off, you define a template for each <Employee> node, as shown here:

The template defines a wrapper table row and then calls into the child templates, one for each significant piece of information to be rendered. As you've probably guessed, each child template defines a table cell. For example, the following template selects a child node below the current Employee node and renders the text of the node in boldface:

As you can see, the node selection is always performed using XPath expressions. The "." expression for the <xsl:value-of> node refers to the text of the current node. A similar pattern is used for other templates, as follows:
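How these templates fit together can be sketched roughly as follows. This is an illustration under assumptions, not the original emplist.xsl: the child element names employeeid, lastname, firstname, and title, the sample employee record, and the class name EmployeeTableSketch are hypothetical. Only the MyDataSet/NorthwindEmployees/Employee path and the H1 text come from the discussion above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

public static class EmployeeTableSketch
{
    // Assumed sample data; the child element names are hypothetical.
    const string Source = @"
<MyDataSet>
  <NorthwindEmployees>
    <Employee>
      <employeeid>1</employeeid>
      <lastname>Davolio</lastname>
      <firstname>Nancy</firstname>
      <title>Sales Representative</title>
    </Employee>
  </NorthwindEmployees>
</MyDataSet>";

    const string StyleSheet = @"
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html' indent='no' />

  <!-- Root template: fixed heading plus a table filled by other templates -->
  <xsl:template match='/'>
    <html><body>
      <h1>Northwind's Employees</h1>
      <table>
        <xsl:apply-templates select='MyDataSet/NorthwindEmployees/Employee' />
      </table>
    </body></html>
  </xsl:template>

  <!-- One table row per Employee; each child template adds a cell -->
  <xsl:template match='Employee'>
    <tr>
      <xsl:apply-templates select='employeeid' />
      <xsl:apply-templates select='lastname' />
      <xsl:apply-templates select='title' />
    </tr>
  </xsl:template>

  <!-- The dot refers to the text of the current node -->
  <xsl:template match='employeeid'>
    <td><b><xsl:value-of select='.' /></b></td>
  </xsl:template>

  <!-- Context node is lastname; '..' climbs to Employee to reach a sibling -->
  <xsl:template match='lastname'>
    <td><xsl:value-of select='.' />, <xsl:value-of select='../firstname' /></td>
  </xsl:template>

  <xsl:template match='title'>
    <td><xsl:value-of select='.' /></td>
  </xsl:template>
</xsl:stylesheet>";

    public static string Run()
    {
        XslTransform xslt = new XslTransform();
        xslt.Load(new XmlTextReader(new StringReader(StyleSheet)));
        XPathDocument doc = new XPathDocument(new StringReader(Source));
        StringWriter output = new StringWriter();
        xslt.Transform(doc, null, output);
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

The Employee template selects its children explicitly, so each cell is produced by the template that matches that child and nothing else from the source leaks into the row.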


In the first template, the context node is the matched element, but at a certain point we need to access one of its sibling nodes. The XPath syntax includes the double-dot symbol (..), which is a shortcut for the parent of the current context node. (See Chapter 6.)

The final HTML output for the source XML document is shown in Figure 7-3.

Figure 7-3: The HTML page generated from a source XML file.

To display the HTML output as plain text, you must perform the transformation programmatically, using either the MSXML object model or the newest .NET Framework classes. Alternatively, you can view the output using a specialized browser with the direct browsing functionality. Microsoft Internet Explorer has provided this capability since version 5.0.

Linking the Style Sheet to the HTML Page

Internet Explorer applies a silent and automatic transformation to all XML documents you view through it. However, an XML document can override the default Internet Explorer style sheet by using a processing instruction that simply links an XSLT script. The following code demonstrates how to add the style sheet from the previous section (emplist.xsl) to the source file (data.xml) so that double-clicking it generates the output shown in Figure 7-3. A style sheet can have either a .xsl or a .xml extension.

    <?xml-stylesheet type="text/xsl" href="emplist.xsl"?>

You register a style sheet with an XML document using a processing instruction with a couple of attributes: type and href.
The type attribute must be set to the string text/xsl. The href attribute instead references the URL of the XSLT script. If you insert more than one processing instruction for XSLT scripts, only the final instruction will be considered.

Calling Templates

The previous example used <xsl:apply-templates> exclusively to perform template-based transformations. When you know that only one template applies to a given block of XML source code, you might want to use a more direct instruction: <xsl:call-template>.

If you plan to use the <xsl:call-template> instruction, you must first give the target template a name. For example, the following code defines a template named EmployeeIdTemplate:

How do you call into this template? Just use the following code:

There is one difference you should be aware of. With <xsl:apply-templates>, you use the select attribute to select a node-set for the template, as shown here:

As a result, the template works on the selected node and retrieves the value with the following expression:

When you use the <xsl:call-template> instruction, on the other hand, you call the template by name, but it works on the currently selected context node. The ongoing context node is <Employee>, and you must explicitly indicate the child node in the body of <xsl:value-of>, as shown here:

From Schema to Schema

Transforming an XML document into an XML document with another schema is in no way different from transforming XML into HTML. The real difference is that you use another target XML vocabulary.

The following XSLT script is designed to simplify the structure of our sample data.xml file. The original file is structured like this:


The expected target schema is simpler and contains only two levels of nodes, as shown in the following code. In addition, all employee information is now coded using attributes instead of child nodes, and last and first names are merged into a single value.

The following script performs the magic:
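A conversion of this kind might look like the sketch below. It is an assumption-laden illustration rather than the original script: the source child names (employeeid, lastname, firstname, title), the target element names (Employees, Employee), the sample record, and the class name SchemaSketch are all hypothetical. The attribute names id, name, and title follow the description in the text.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

public static class SchemaSketch
{
    // Assumed source layout: three levels, data held in child elements.
    const string Source =
        "<MyDataSet><NorthwindEmployees><Employee>" +
        "<employeeid>1</employeeid><lastname>Davolio</lastname>" +
        "<firstname>Nancy</firstname><title>Sales Representative</title>" +
        "</Employee></NorthwindEmployees></MyDataSet>";

    // Single template: one new element per Employee, data moved to attributes.
    const string StyleSheet = @"
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' omit-xml-declaration='yes' indent='no' />
  <xsl:template match='/'>
    <Employees>
      <xsl:for-each select='MyDataSet/NorthwindEmployees/Employee'>
        <Employee>
          <xsl:attribute name='id'>
            <xsl:value-of select='employeeid' />
          </xsl:attribute>
          <!-- Last and first names merged into a single value -->
          <xsl:attribute name='name'>
            <xsl:value-of select='lastname' />, <xsl:value-of select='firstname' />
          </xsl:attribute>
          <xsl:attribute name='title'>
            <xsl:value-of select='title' />
          </xsl:attribute>
        </Employee>
      </xsl:for-each>
    </Employees>
  </xsl:template>
</xsl:stylesheet>";

    public static string Run()
    {
        XslTransform xslt = new XslTransform();
        xslt.Load(new XmlTextReader(new StringReader(StyleSheet)));
        XPathDocument doc = new XPathDocument(new StringReader(Source));
        StringWriter output = new StringWriter();
        xslt.Transform(doc, null, output);
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Whitespace-only text in the style sheet is stripped by the processor, so the indentation inside each xsl:attribute does not end up in the attribute values; only the literal comma and space between the two names survive.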


This script includes only one template, rooted in a single source node, and creates a new element for each child node. The new element has a few attributes: id, name, and title. The <xsl:value-of> instruction is used to read node values into the newly created attributes. The final output is shown here:

As you can see, transforming XML into another arbitrary text-based language is simply a matter of becoming familiar with a relatively small vocabulary of ad hoc tags. The XSLT vocabulary is a bit peculiar because some of its tags look a lot like high-level programming language statements. But grasping the essence of XSLT is not all that difficult.

The .NET Framework XSLT Processor

In the .NET Framework, the core class for XSLT is XslTransform. Located in the System.Xml.Xsl namespace, the XslTransform class implements the XSLT processor. You make use of this class in two steps: first you load the style sheet in the processor, and then you apply transformations to as many source documents as you need.

The XslTransform class supports only the XSLT 1.0 specification. A style sheet declares itself compliant with this version of the specification by including the following namespace:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">


By the way, note that the version attribute is mandatory to ensure the correctness of the style sheet document.

The key methods in the XslTransform class are Load and Transform. They perform the two steps just mentioned. In particular, you use the Load method to read the style sheet from a variety of sources. The Transform method, on the other hand, applies the transformation rules set in the style sheet to a given XML source document.

A Quick XSLT Transformer

Earlier in the chapter, we used XSLT scripts to transform an XML source document into something else, say, an HTML page or another XML schema. The scripts were tested simply by adding a processing instruction to the XML source document. Such an instruction tells specialized browsers, like Internet Explorer 5 and later, to use the referenced XSLT script to transform the XML document before displaying it.

A .NET Framework application can programmatically control the entire transformation process using the XslTransform class. The following console application represents a quick command-line XSLT transformer.
It takes three arguments (the XML source, the XSLT style sheet, and the output file), sets up the processor, and saves the results of the transformation to the output file.

    using System;
    using System.Xml;
    using System.Xml.Xsl;

    class QuickXslTransformer
    {
        // Loads the style sheet and transforms the source file to the output file
        public QuickXslTransformer(string source, string stylesheet, string output)
        {
            XslTransform xslt = new XslTransform();
            xslt.Load(stylesheet);
            xslt.Transform(source, output);
        }

        public static void Main(string[] args)
        {
            try
            {
                QuickXslTransformer o;
                o = new QuickXslTransformer(args[0], args[1], args[2]);
            }
            catch (Exception e)
            {
                // Covers missing arguments as well as load and transform errors
                Console.WriteLine("Unable to apply the XSLT transformation.");
                Console.WriteLine("Error:\t{0}", e.Message);
                Console.WriteLine("Exception: {0}", e.GetType().ToString());
            }
            return;
        }
    }

The heart of the application is found in the following three lines of rather self-explanatory code:

    XslTransform xslt = new XslTransform();
    xslt.Load(stylesheet);
    xslt.Transform(source, output);

The style sheet can be loaded from a variety of sources, including XPath documents, XML readers, local disk files, and URLs. The Load method compiles the style sheet and uses the stored information to initialize the XSLT processor. When Load returns, the processor is ready to perform any requested transformation.

The Transform method loads an XML document, runs the XSLT script, and writes the results to the specified stream. Transform is particularly handy, because it saves you from explicitly loading the source document and creating the output file. As we'll see in more detail in the section "Performing Transformations," Transform uses an intermediate XPath document to transform the XML.

Note
Several other programming environments allow you to exercise total control over the XSLT process. In particular, in Microsoft Win32, the combined use of two distinct instances of the Microsoft.XMLDOM COM object lets you programmatically perform an XSLT transformation.
The following JScript code illustrates how to proceed:

    // Collects arguments from the WSH command line
    source = WScript.Arguments(0);
    stylesheet = WScript.Arguments(1);
    output = WScript.Arguments(2);

    // Instantiates the XMLDOM for the source
    xml = new ActiveXObject("Microsoft.XMLDOM");
    xml.async = false;    // load synchronously so the document is ready
    xml.load(source);

    // Instantiates the XMLDOM for the style sheet
    xsl = new ActiveXObject("Microsoft.XMLDOM");
    xsl.async = false;
    xsl.load(stylesheet);

    // Creates the output
    fso = new ActiveXObject("Scripting.FileSystemObject");
    f = fso.CreateTextFile(output);
    f.Write(xml.transformNode(xsl.documentElement));
    f.Close();


The XslTransform Class

Now that we've seen how the XslTransform class implements the .NET Framework processor to transform XML data into arbitrary text using XSL style sheets, let's look more closely at its programming interface.

As shown in the following code, XslTransform has only the default constructor. In addition, it is a sealed class, meaning that you can use it only as is and other classes can't inherit from it.

    public sealed class XslTransform
    {
        ⋮
    }

The programming interface of the class is fairly simple and consists of just one public property and a couple of methods.

Properties of the XslTransform Class

The only property that the XslTransform class exposes is XmlResolver, which handles an instance of the XmlResolver class. Interestingly, the XmlResolver property is write-only; that is, you can set it, but you can't check the currently set resolver object.

As we've seen in previous chapters, the XmlResolver object is used to resolve external references found in the documents being processed. In this context, the XmlResolver property is used only during the transformation process. It is not used, for example, to resolve external resources during load operations.

If you don't create a custom resolver object, an instance of the XmlUrlResolver class is used.

Methods of the XslTransform Class

The XslTransform class supplies two methods specific to its activity: the Load and Transform methods mentioned earlier.
The Load and Transform methods are described in more detail in Table 7-5.

Table 7-5: Methods of the XSLT Processor

Method
    Description

Load
    Loads the specified XSLT style sheet document from a number of possible sources, including remote URLs and XML readers. The method has several overloads, including overloads that let you specify a custom XmlResolver object to load any style sheets referenced through xsl:import and xsl:include statements.

Transform
    Transforms the specified XML data using the loaded XSLT style sheet and writes the results to a given stream. Some of the method's overloads let you specify an argument list as input to the transformation.

The following code snippet shows how to use an XmlResolver object with credentials to access a remote XSLT style sheet:

    // Requires the System.Net, System.Xml, and System.Xml.Xsl namespaces;
    // uid, pswd, domain, and stylesheet are strings supplied by the caller
    XmlUrlResolver resolver = new XmlUrlResolver();
    NetworkCredential cred = new NetworkCredential(uid, pswd, domain);
    resolver.Credentials = cred;
    XslTransform xslt = new XslTransform();
    xslt.Load(stylesheet, resolver);

The XslTransform class is also unique from the threading and security standpoints. Let's see why.

Threading Considerations

XslTransform is guaranteed to operate in a thread-safe way only during transform operations. In other words, although an instance of the class can be shared by multiple threads, only the Transform method can be called safely from multiple threads. For the sake of your code, you must ensure that both of the following conditions are met:

• The Load method is not concurrently called from within different threads.
• No other method (for example, Transform) is called on the object during load operations.

In a nutshell, the XslTransform class is multithreaded only with respect to transformations. The reasons for this behavior stem from the internal architecture of the class, which is summarized in Figure 7-4.

Figure 7-4: The Load method is not thread-safe, and its state can be overwritten and spoiled by concurrent calls. The Transform method, on the other hand, reads the shared state and can run concurrently from multiple threads.

When the Load method is called, the style sheet is compiled and its contents are used to set the internal state of the object. For performance reasons, this code is not grouped into a critical section, which would serialize the threads' access to the internal state. After loading the style sheet, the XSLT processor needs to modify its state to reflect the loaded document.
The operation does not occur atomically within the virtual boundaries created by a lock statement. As a result, concurrently running threads could in theory access the same instance of the processor and break the data consistency. The load operation is thread-sensitive because it alters the global state of the object.

The transform operation, on the other hand, is inherently thread-safe because it performs read-only access to the processor's state. Nothing bad can happen if concurrent threads apply transformations using the same processor.
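One way to respect these rules is sketched below, under assumptions: the class name SharedProcessorSketch, the item-counting style sheet, and the helper names are hypothetical. The style sheet is loaded once inside a lock before any worker starts, and the workers then share the processor only for Transform calls, each with its own source document and output writer.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

public static class SharedProcessorSketch
{
    static readonly XslTransform Processor = new XslTransform();
    static readonly object LoadLock = new object();

    // Hypothetical style sheet that outputs the number of item elements.
    const string StyleSheet = @"
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text' />
  <xsl:template match='/'>
    <xsl:value-of select='count(//item)' />
  </xsl:template>
</xsl:stylesheet>";

    // Serializes loading. The lock guards against concurrent Load calls;
    // the caller must still make sure no Transform runs during a load.
    static void LoadStyleSheet()
    {
        lock (LoadLock)
        {
            Processor.Load(new XmlTextReader(new StringReader(StyleSheet)));
        }
    }

    // Safe to call from several threads once loading has finished.
    static string CountItems(string xml)
    {
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        StringWriter output = new StringWriter();
        Processor.Transform(doc, null, output);
        return output.ToString();
    }

    public static string[] Run()
    {
        LoadStyleSheet(); // completes before any worker thread exists

        string[] results = new string[4];
        Thread[] workers = new Thread[results.Length];
        for (int i = 0; i < workers.Length; i++)
        {
            int slot = i; // per-thread copy of the loop variable
            workers[i] = new Thread(() =>
            {
                results[slot] = CountItems("<list><item/><item/></list>");
            });
            workers[i].Start();
        }
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Run()));
    }
}
```

If the style sheet ever has to be reloaded while workers are running, a reader/writer lock around both Load and Transform (or a fresh XslTransform instance swapped in after loading) would be needed; the sketch avoids that case by loading up front.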


To avoid threading risks, be aware that loading a style sheet is an unprotected operation. Either lock the operation yourself, or avoid spawning concurrent threads that perform style sheet loading on the same processor.

Security Considerations

The XslTransform class has a link demand permission set attached. A link demand specifies which permissions direct callers must have to run the code, as shown in the following example. Callers' rights are checked during just-in-time compilation.

    [PermissionSet(SecurityAction.LinkDemand, Name="FullTrust")]
    public sealed class XslTransform
    {
        ⋮
    }

The permission set attribute for the XslTransform class is expressed by name and points to one of the built-in permission sets, FullTrust. What does this mean to you? Only callers with fully trusted access to all the local resources can safely call into the XSLT processor. The check involves direct callers only, not the callers' callers.

Try running the XSLT Quick Security Tester sample application over a network. Because of the class security settings, a security exception is thrown. Figure 7-5 shows the security exception dialog box.

Figure 7-5: The XSLT processor class works only if called by locally trusted callers. An XSLT application can work well as long as you invoke it locally, but it will raise a security exception if you run it over a network share.

Under the Hood of the XSLT Processor

In the overall behavior of the .NET Framework XSLT processor, three phases can be clearly identified: loading the style sheet document, setting up the internal state, and performing the transformations. Although you see, and interact with, only a single class (XslTransform), a lot of internal classes are involved in the process.

The first two phases occur within the context of the Load method. Of course, you can't call the Transform method before a previous call to Load has successfully terminated. If you do, you will experience an XsltException exception on the Transform method.
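Because nothing can be transformed until Load has succeeded, it can help to treat loading as a guarded step of its own. The sketch below is one possible shape for that guard, not the book's code: the GuardedLoad class and its TryLoad method are hypothetical names. It assumes that a missing file surfaces as FileNotFoundException, that XSLT errors surface as XsltException (the base class of XsltCompileException), and that markup that is not well-formed surfaces as XmlException.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: reports why a style sheet could not be loaded.
public static class GuardedLoad
{
    // Returns null on success, or a short description of the failure.
    public static string TryLoad(XslTransform xslt, string stylesheetPath)
    {
        try
        {
            xslt.Load(stylesheetPath);
            return null;
        }
        catch (FileNotFoundException ex)
        {
            return "Style sheet not found: " + ex.FileName;
        }
        catch (XsltException ex)
        {
            // XsltCompileException derives from XsltException and carries
            // the location of the error in the style sheet.
            return String.Format("Style sheet error at line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
        catch (XmlException ex)
        {
            return "Style sheet is not well-formed XML: " + ex.Message;
        }
    }

    public static void Main(string[] args)
    {
        XslTransform xslt = new XslTransform();
        string path = args.Length > 0 ? args[0] : "missing.xsl";
        string error = TryLoad(xslt, path);
        Console.WriteLine(error == null ? "Loaded; Transform can now be called." : error);
    }
}
```

Only when TryLoad returns null is it meaningful to go on and call Transform on the same XslTransform instance.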


Load always works synchronously, so when it returns, you can be sure that the loading step has been completed. Load does not return any value that denotes the failure or the success of the operation. When something goes wrong, it throws an exception instead. In particular, you will get a FileNotFoundException exception if you are pointing to a missing style sheet, and you will get a more generic XsltCompileException exception if the XSLT script contains errors. An XsltCompileException exception provides you with a line number and a line position indicating where the error occurred in the style sheet.

Loading the Style Sheet

The input style sheet can be loaded from four sources: a URL, an XML reader, an XPath document, or an XPath navigator. Whatever the source, the Load method first expresses it as an XPath navigator. As discussed in Chapter 6, an XPath navigator represents a generic interface able to navigate over any XML-based, or XML-looking, data store. The XPathNavigator class enables you to move from one node to the next and to retrieve node-sets using XPath queries.

The source style sheet is normalized to an XPath navigator mostly for performance reasons. The style sheet must be compiled and, given the compiler's architecture, a navigator is an extremely efficient object for performing the task. Compiling is a process that simply extracts information from the original style sheet and stores it in handy data structures for further use.
The entire set of these data structures is said to be the state of the XSLT processor. Figure 7-6 illustrates the flow of the Load method.

Figure 7-6: The style sheet is first normalized to an XPath navigator and then compiled.

Managing the Processor's State

The style sheet compiler populates three internal data structures with the data read from the source. The compiled style sheet object shown in Figure 7-6 represents an index of the style sheet contents. The other two data structures are tables containing compiled versions of the XPath queries to execute and the actions that the various templates require.

As mentioned, the state of the XSLT processor is not set atomically, which might pose problems if you are using the XSLT processor from within a multithreaded application. Once set by the Load method, the processor's state is not modified until the same Load method is called again.

Performing Transformations

The transformation method, depicted in Figure 7-7, takes at least two explicit arguments (the source XML document and the output stream) plus a couple of implicit parameters. The compiled style sheet object is of course one of the implicit input arguments. The second implicit parameter is the XmlResolver property. As mentioned, the XmlResolver property is designed to help the processor resolve external resources.

Figure 7-7: The XSLT processor generates the output text based on the source XML document and the internally stored information about the style sheet.

The Transform method can also take a third explicit argument: an object of class XsltArgumentList. The argument contains the namespace-qualified arguments used as input to the transformation process. (More on this in the section "Creating a .NET Framework Argument List.")

The XML source document is normalized as an XPath navigator and passed down in this form to the XSLT processor.
Interestingly, the internal processor class has two types of overloads. Some of the overloads work as void methods and simply write to the specified stream. Others work as functions and specifically return an XML reader object. As you'll see in a moment, this feature provides an interesting opportunity: implementing asynchronous XSLT transformations.

Note
How easy is it to normalize XML readers, URLs, and documents to XPath navigators? Remember that you can always create an XPathDocument object from any XML file or reader. Once you have a reference to an XPathDocument object, or an instance of any other object that implements the IXPathNavigable interface, you simply call the CreateNavigator method and you're done. The CreateNavigator method, of course, is part of the IXPathNavigable interface.

Applying Transformations

The XSL style sheet and the XML source can be loaded from a variety of sources, including local disk files and remote URLs. You can't load style sheets and source documents from a stream, but because you can easily obtain an XML reader from a stream, a workaround is quickly found. Whatever the input format, the content is transformed into an XPath navigator object immediately after reading.

In light of this, passing style sheet and XML source data directly as XPath documents or navigators is advantageous from two standpoints: you save conversion time, and you work with objects whose internal storage mechanism is lighter and more compact.

Choosing optimized forms of storage like XPath documents binds you to a read-only manipulation of the data. If you need to edit the document before a transformation is performed, load it into an XmlDocument object and apply all the changes. When you have finished, pass the XmlDocument object to the XslTransform class.
As you'll recall from Chapter 6, XmlDocument implements the IXPathNavigable interface and as such can be used with the Transform method.

The Load and Transform methods have several overloads each. In all this richness of call opportunities, not all possible combinations of input and output channels are supported. For example, you can load the source document from a URL, but only if you output to another URL or disk file. Likewise, if you want to transform to a text writer, you can't load the source from a file. Table 7-6 and Table 7-7 provide a quick-access view of the available overloads.

Table 7-6: Load Method Overloads

Return Type   Style Sheet Source   XML Resolver
void          File or URL          No
void          XPath document       No
void          XPath navigator      No
void          XML reader           No
void          File or URL          Yes
void          XPath document       Yes
void          XPath navigator      Yes
void          XML reader           Yes
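The stream workaround mentioned in this section can be sketched as follows: wrap each stream in an XML reader, load the style sheet from one reader, and build an XPath document from the other. This is a minimal illustration, not the book's code; the StreamTransform class, its Run method, and the sample markup are hypothetical. The overloads used are the reader-based Load and the XPath document/text writer Transform listed in the chapter's overload tables.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical helper: transforms stream input by going through XML readers.
public static class StreamTransform
{
    public static string Run(Stream stylesheet, Stream source)
    {
        XslTransform xslt = new XslTransform();

        // No stream-based Load overload is used: the stream is wrapped in a reader.
        xslt.Load(new XmlTextReader(stylesheet));

        // The source goes into an XPath document, the lightest supported store.
        XPathDocument doc = new XPathDocument(new XmlTextReader(source));

        // Text writer output requires an XPath document or navigator as input.
        StringWriter output = new StringWriter();
        xslt.Transform(doc, null, output);
        return output.ToString();
    }

    public static void Main()
    {
        // Illustrative in-memory data standing in for real streams.
        byte[] xsl = Encoding.UTF8.GetBytes(
            @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='/'><xsl:value-of select='count(/list/item)'/> items</xsl:template>
</xsl:stylesheet>");
        byte[] xml = Encoding.UTF8.GetBytes("<list><item/><item/><item/></list>");

        Console.WriteLine(Run(new MemoryStream(xsl), new MemoryStream(xml)));
    }
}
```

The same Run method would work unchanged with file streams or network streams, since only the reader wrappers touch the underlying stream.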


Table 7-7: Transform Method Overloads

Return Type   XML Source        Argument List      Output
void          File or URL       (none)             File or URL
void          XPath document    XsltArgumentList   Stream
void          XPath navigator   XsltArgumentList   Stream
void          XPath document    XsltArgumentList   Text writer
void          XPath navigator   XsltArgumentList   Text writer
void          XPath document    XsltArgumentList   XML writer
void          XPath navigator   XsltArgumentList   XML writer
XmlReader     XPath document    XsltArgumentList   (none)
XmlReader     XPath navigator   XsltArgumentList   (none)

The interface of the Load method is fairly regular. It always returns void, and it supports four reading media, with or without an XML resolver object.

The programming interface of the Transform method is much less regular. The overloads that return an XML reader work only on XPath documents or navigators. The overload that manages URLs or files is an exception, perhaps provided for the sake of simplicity. The remaining overloads are grouped by the type of the output media: stream, text writer, or XML writer. For each of them, you can have a source XML document read from an XPath document or an XPath navigator.

Design Considerations

The style sheet and the source XML document are two equally important arguments for the XSLT processor. The XslTransform programming interface requires that you indicate them in different steps, however.
In doing so, the emphasis falls on a particular use: transforming multiple documents using the same style sheet.

Although optimized for a particular scenario, such a design doesn't tax those programmers who use the style sheet for a single transformation. In this case, the only, and very minimal, drawback is that you have to write three lines of code instead of one!

Look at the following class. It provides a static method for performing XSLT transformations. It doesn't explicitly provide for style sheet reuse, but it does save you two lines of code!

    public class QuickXslt
    {
        public static bool Transform(string source, string stylesheet, string output)
        {
            try
            {
                XslTransform xslt = new XslTransform();
                xslt.Load(stylesheet);
                xslt.Transform(source, output);
                return true;
            }
            catch (Exception)
            {
                // Any failure is flattened into a Boolean result
                return false;
            }
        }
    }

The Transform method shown in the preceding code also catches any exceptions and flattens them into a Boolean value. Using this global method is as easy as writing the following code:

    public static void Main(string[] args)
    {
        bool b = QuickXslt.Transform(args[0], args[1], args[2]);
        Console.WriteLine(b.ToString());
    }

By design, the static Transform method accepts only disk files or URLs.

Tip
By passing an XML reader to the XslTransform class's Load and Transform methods, you can load both the style sheet and the source document from an XML subtree. In this case, in fact, the XslTransform class will start reading from the reader's current node and continue through the entire subtree.

Another interesting consideration that applies to XSLT concerns the process as a whole. The style sheet is always loaded synchronously. The transformation, on the other hand, can occur asynchronously, at least to some extent.
Let's see why.

Asynchronous Transformations

The Transform method has a couple of overloads that return an XML reader, as shown here:

    public XmlReader Transform(XPathNavigator input, XsltArgumentList args);
    public XmlReader Transform(IXPathNavigable input, XsltArgumentList args);

The signature, and the behavior, of these overloads are slightly different from the others. As you can see, the method does not accept any argument representing the output stream. The second argument can be an XsltArgumentList object, which serves other purposes that we'll get into in the section "Creating a .NET Framework Argument List." The input document must be an XPath navigator or an XPath document referenced through the IXPathNavigable interface.


XSLT Output Records

The output of the transformation process is not written out to a stream but created in memory and returned to the user via an XML reader. The overall transformation process works by creating an intermediate data structure (referred to as the navigator input) in which the content of the style sheet is used as the underlying surface. Any XSLT tag found in the style sheet source is replaced with expanded text or any sequence of calls that results from embedded templates.

The final output looks like a compiled program in which direct statements are interspersed with calls to subroutines. In an XSLT program, these statements are called output records, while templates play the role of subroutines. Figure 7-8 shows how the XSLT processor generates its output.

Figure 7-8: An XML reader lets you access the output records one at a time.

When the Transform method gets an output stream to write to, the XSLT processor loops through all the records and accumulates the text into the specified buffer. If an XML reader has been requested, the processor creates an instance of an internal reader class and returns that to the caller. The exact name of the internal reader is System.Xml.Xsl.ReaderOutput. No transformation is performed until the caller explicitly asks to read the cached output records. Figure 7-9 shows how the XSLT processor returns its output.
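Because the returned reader does the work only when it is read, the reading loop can be handed to a secondary thread. The following is a rough sketch of that idea, assuming the reader can safely be consumed on a thread other than the one that created it; the DeferredTransform class, the ReadOnWorker method, and the sample markup are hypothetical and do not come from the book.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical helper: pulls the output records on a worker thread.
public static class DeferredTransform
{
    public static string ReadOnWorker(XslTransform xslt, XPathDocument doc)
    {
        // Returns immediately with a reader over the output records.
        XmlReader reader = xslt.Transform(doc, null);

        StringBuilder names = new StringBuilder();
        Thread worker = new Thread(() =>
        {
            // Each Read call pulls the next output record from the processor.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    names.Append(reader.Name).Append(' ');
            }
            reader.Close();
        });
        worker.Start();

        // The calling thread is free to do other work here.

        worker.Join();
        return names.ToString().Trim();
    }

    public static void Main()
    {
        // Illustrative style sheet and source document.
        XslTransform xslt = new XslTransform();
        xslt.Load(new XmlTextReader(new StringReader(
            @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='/'>
    <report>
      <xsl:for-each select='list/e'><item><xsl:value-of select='.'/></item></xsl:for-each>
    </report>
  </xsl:template>
</xsl:stylesheet>")));
        XPathDocument doc = new XPathDocument(
            new StringReader("<list><e>a</e><e>b</e></list>"));

        Console.WriteLine(ReadOnWorker(xslt, doc));
    }
}
```

In this sketch the caller joins the worker right away to keep the example deterministic; a real application would typically signal completion through an event or a callback instead.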


Figure 7-9: The XSLT processor instantiates a reader object and returns. No transformation is performed until you "read" the internal data using the methods and the properties of the returned reader.

The XSLT Record Reader

The ReaderOutput class builds a virtual XML tree on top of the compiled style sheet, thus making it navigable using the standard XML reader interface. When the Transform method returns, the reader is in its initial state (and therefore it is not yet initialized for reading).

Each time you pop an element from the reader, a new output record is properly expanded and returned. In this way, you have total control over the transformation process and can plan and realize a number of fancy features. For example, you could provide feedback to the user, discard nodes based on runtime conditions and user roles, or cause the process to occur asynchronously on a secondary thread.

The reader interface exposes the XSLT records as XML nodes, the same XML nodes you will find by visiting the output document. The following code snippet demonstrates how to set up a user-controlled transformation:

    // The XML source must be an XPath document or an XPath navigator
    XPathDocument doc = new XPathDocument(source);

    // No arg-list to provide in this case
    XmlReader reader = xslt.Transform(doc, null);

    // Perform the transformation, record by record
    while (reader.Read())
    {
        // Do something
    }

Figure 7-10 shows the user interface of a sample application. It includes a list box control that is iteratively populated with information excerpted from the reader's current node. Each row in the list box corresponds to an output record generated by the XSLT processor.

Figure 7-10: The HTML file generated by the transformation, rendered as a node tree, is received one row at a time.

In the reading loop, all nodes are analyzed and serialized to XML text, as shown in the following code. In this way, each row in the list box corresponds to the line of text that is sent to an output stream if you opt for a synchronous transformation.

    void ReadOutputRecords(XmlReader reader)
    {
        // Clear the list box
        OutputList.Items.Clear();

        // Read the records
        while (reader.Read())
        {
            string buf = "";
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    buf = String.Format("{0}<{1} {2}>",
                        new String(' ', 2*reader.Depth),
                        reader.Name,
                        GetNodeAttributes(reader));
                    break;
                case XmlNodeType.EndElement:
                    buf = String.Format("{0}</{1}>",
                        new String(' ', 2*reader.Depth),
                        reader.Name);
                    break;
                case XmlNodeType.Text:
                    buf = String.Format("{0}{1}",
                        new String(' ', 2*reader.Depth),
                        reader.Value);
                    break;
            }
            OutputList.Items.Add(buf);
        }
    }

The final text is indented using a padding string whose size depends on the reader's Depth property. Node names and values are returned by the Name and Value properties. For element nodes, attributes are read using a piece of code that we examined in detail in Chapter 2:

    string GetNodeAttributes(XmlReader reader)
    {
        if (!reader.HasAttributes)
            return "";

        string buf = "";
        while (reader.MoveToNextAttribute())
            buf += String.Format("{0}=\"{1}\" ", reader.Name, reader.Value);

        // Reposition the reader on the parent element
        reader.MoveToElement();
        return buf;
    }

Output Formats

An XSLT style sheet can declare the output format of the serialized text using the <xsl:output> statement. This statement features several attributes, the most important of which is method. The method attribute can be set with any of the following keywords: xml, html, or text. By default, the output format is XML unless the root tag of the results document equals <html>. In this case, the output is in HTML.

Differences between XML and HTML are minimal. If the output format is HTML, the XML well-formedness is sacrificed in the name of a greater programmer-friendliness. This means that, for example, empty tags will not have an end tag. In addition to


method, other attributes of interest are indent, encoding, and omit-xml-declaration, which respectively indent the text, set the preferred character encoding, and omit the typical XML prolog.

If you add an <xsl:output> statement to the previously considered style sheets, the source code of the results document will be significantly different, but not its overall meaning. If you choose to output plain text, on the other hand, the XSLT processor will discard any markup text in the style sheet and output only text.

As a final note, consider that <xsl:output> is a discretionary behavior that not all XSLT processors provide and not all in the same way. In particular, when the Transform method is writing to a text writer or an XML writer, the .NET Framework XSLT processor ignores the encoding attribute in favor of the corresponding property on the object.

Passing and Retrieving Arguments

As mentioned, XSLT scripts can take arguments. You can declare arguments globally for the entire script or locally to a particular template. Arguments can have a default value that will make them always available as a variable in the scope. Aside from the default value, in XSLT there are no other differences between arguments and variables. For example, the style sheet declaration <xsl:param name="MaxNumOfRows" select="6" /> declares a parameter named MaxNumOfRows and initializes it with a default value of 6.

The script retrieves the argument using its public name prefixed with a dollar sign ($). In particular, a conditional statement (an <xsl:if> instruction whose test compares the node position with $MaxNumOfRows) applies to the template only if five employee nodes have not yet been processed.

Note that you can't use the less-than sign (<) directly in the test expression; it must be escaped as &lt;. The greater-than sign (>) can be safely used, however. If, like me, you don't like escaped strings, you can invert the terms of the comparison.

Note
Parameters can be associated only with templates or with the global script. You can't associate parameters with other XSLT instructions such as <xsl:for-each>.
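To see a default parameter value and an inverted comparison at work, consider the following sketch. The style sheet embedded in the string, the ParamDemo class, and the generated employee data are illustrative assumptions rather than the book's sample files. The test is written as $MaxNumOfRows > position() so that no escaped less-than sign is needed, and the XsltArgumentList passed to Transform overrides the default when it is supplied.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical demo: a global parameter with a default value of 6.
public static class ParamDemo
{
    public const string Stylesheet =
        @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:param name='MaxNumOfRows' select='6'/>
  <xsl:template match='/'>
    <xsl:for-each select='Employees/Employee'>
      <xsl:if test='$MaxNumOfRows > position()'>
        <xsl:value-of select='lastname'/>
        <xsl:text>;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Pass null to rely on the default value declared in the style sheet.
    public static string Run(XsltArgumentList args)
    {
        // Build a source document with eight employee nodes (E1 to E8).
        StringBuilder xml = new StringBuilder("<Employees>");
        for (int i = 1; i <= 8; i++)
            xml.AppendFormat("<Employee><lastname>E{0}</lastname></Employee>", i);
        xml.Append("</Employees>");

        XslTransform xslt = new XslTransform();
        xslt.Load(new XmlTextReader(new StringReader(Stylesheet)));

        XPathDocument doc = new XPathDocument(new StringReader(xml.ToString()));
        StringWriter output = new StringWriter();
        xslt.Transform(doc, args, output);
        return output.ToString();
    }

    public static void Main()
    {
        // Default value: only the first five employees pass the test.
        Console.WriteLine(Run(null));

        // Overriding the parameter from .NET code with an XPath number (double).
        XsltArgumentList args = new XsltArgumentList();
        args.AddParam("MaxNumOfRows", "", 3.0);
        Console.WriteLine(Run(args));
    }
}
```

With the default in place, the condition holds for positions 1 through 5; with the override of 3, it holds for positions 1 and 2 only.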


Calling Templates with Arguments

When you call a parameterized XSLT template, you give actual values to formal parameters using the <xsl:with-param> instruction. For example, a call to the sample Employee template that contains an <xsl:with-param name="MaxNumOfRows" select="7" /> element gives the MaxNumOfRows argument a value of 7.

If the called template has no such parameter, nothing happens, and the argument will be ignored. The <xsl:with-param> instruction can be associated with both <xsl:call-template> and <xsl:apply-templates> instructions.

Creating a .NET Framework Argument List

The Transform method lets you pass arguments to the style sheet using an instance of the XsltArgumentList class. When you pass arguments to an XSLT script in this way, you can't specify what template call will actually use those arguments. You just pass arguments globally to the XSLT processor. The internal modules responsible for processing templates will then read and import those arguments as appropriate.

Creating an argument list is straightforward. You create an instance of the XsltArgumentList class and populate it with values, as shown here:

    XsltArgumentList args = new XsltArgumentList();
    args.AddParam("MaxNumOfRows", "", 7);

The AddParam method creates a new entry in the argument list. AddParam requires three parameters: the (qualified) name of the parameter, the namespace URI (if the name is qualified by a namespace prefix), and an object representing the actual value. Regardless of the .NET Framework type you use to pack the entry into the argument list, the parameter value must correspond to a valid XPath type: string, Boolean, number, node fragment, or node-set.
The number type corresponds to a .NET Framework double type, whereas node fragments and node-sets are equivalent to XPath navigators and XPath node iterators. (See Chapter 6 for more information about these data types.)

The XsltArgumentList Class

Despite what its name suggests, XsltArgumentList is not a collection-based class. It does not derive from ArrayList or from a collection class, nor does it implement any of the typical list interfaces like IList or ICollection.

The XsltArgumentList class is built around a couple of child hash tables: one to hold XSLT parameters and one to gather the so-called extension objects. An extension object is simply a living instance of a .NET Framework object that you can pass as an argument to the style sheet. Of course, this feature is specific to the .NET XSLT processor. We'll look at extension objects in more detail in the section "XSLT Extension Objects."

The programming interface of the XsltArgumentList class is described in Table 7-8. It provides only methods.

Table 7-8: Methods of the XsltArgumentList Class

Method                  Description
AddExtensionObject      Adds a new managed object to the list. You can specify the namespace URI or use the default namespace by passing an empty string. If you pass null, an exception is thrown.
AddParam                Adds a parameter value to the list. You must indicate the name of the argument and optionally the associated namespace URI.
Clear                   Removes all parameters and extension objects from the list.
GetExtensionObject      Returns the object associated with the given namespace.
GetParam                Gets the value of the parameter with the specified (qualified) name.
RemoveExtensionObject   Removes the specified object from the list.
RemoveParam             Removes the specified parameter from the list.

As with parameters, the style sheet identifies an extension object through its class name and an associated namespace prefix.

Practical Examples

Before we take the plunge into more advanced topics such as using managed objects with XSLT style sheets, let's recap and summarize what we've looked at so far in a couple of real-world examples. First we'll transform a Microsoft ADO.NET DataSet object into a Microsoft ActiveX Data Objects (ADO) Recordset object. Of course, this transformation will not involve the binary image of the objects, just their XML representation.

Second we'll look at a Microsoft ASP.NET example to introduce you to the use of a very handy control: the XML Web server control.
The XML Web server control is capable of rendering an XML document in the body of a Web page with or without XSLT formatting.

Transforming DataSet Objects into Recordset Objects

Exporting the contents of ADO.NET DataSet objects to legacy ADO applications is a problem that we encountered and solved in Chapter 4. That solution was based on a special breed of XML writer. In this section, we'll reconsider that approach and use an XSLT style sheet to accomplish the same task.

Bear in mind that using a style sheet to convert a DataSet object to a Recordset object does not necessarily lead to faster code. If we merely consider the transformation process, I do recommend that you always use the writer. Your code is not taxed by the XSLT processor and, perhaps more importantly, you can use a more familiar programming style. The writer is written in C# or Visual Basic and, as such, provides you with total control over the generated output. An XSLT style sheet is something different, even though it is often referred to as a program.

A style sheet is a kind of mask that you put on top of a document to change its appearance; the document can then be saved in its new form. Using a style sheet also decouples the transformation process from the rest of the application. You can modify the logic of the transformation without touching or recompiling a single line of code.


Writing an XSLT style sheet to transform a DataSet object into a Recordset object is useful for other reasons as well. First, the style sheet code needed is not trivial and requires a good working knowledge of both XPath and XSLT. Look at it as a useful exercise to test your level of familiarity with the technologies. Second, you can apply the style sheet directly to the binary DataSet object, without first serializing the object to XML.

The ability to style a binary DataSet object is provided by the XmlDataDocument class. As mentioned in Chapter 6, XmlDataDocument is an XPath document class. It implements the IXPathNavigable interface and, as such, can be directly passed as an argument to the Transform method. (We'll examine the XmlDataDocument class in detail in Chapter 8.)

Getting the DataSet Object

The following code fetches some records from the Northwind database's Employees table and stores them into a DataSet object:

    string conn = "DATABASE=northwind;SERVER=localhost;UID=sa;";
    string comm = "SELECT firstname, lastname, title, notes FROM employees";
    SqlDataAdapter adapter = new SqlDataAdapter(comm, conn);
    DataSet data = new DataSet("Northwind");
    adapter.Fill(data, "Employees");

The DataSet object is named Northwind and contains just one DataTable object, Employees. As we'll see in a moment, the names of the DataSet and DataTable objects play a key role in the XML representation of the objects. By default, a DataSet object is named NewDataSet, and a DataTable object is named Table.
(We'll look at ADO.NET XML serialization in great detail in Chapter 9 and Chapter 10.)

The XML representation of a DataSet object looks like this:

    <Northwind>
      <Employees>
        <firstname>...</firstname>
        <lastname>...</lastname>
        ⋮
      </Employees>
      ⋮
    </Northwind>

Tip
You can get the string representing the XML version of the DataSet object through the DataSet method GetXml. The text does not include schema information. You can get the schema script separately by calling the GetXmlSchema method. To persist the XML representation to a stream, use the WriteXml method instead.

Transforming the DataSet Object

Transforming a DataSet object into a Recordset object poses a couple of problems. The first is that you have to infer and write the Recordset object's schema. The second is that the XML layout of the DataSet object depends on a number of different parameters. In particular, the root of the XML version of the DataSet object depends on


the object's DataSetName property. Likewise, each table record is grouped under a node whose name matches the DataTable object's TableName property.

You could easily work around the first issue by writing a more generic XSLT script. As for the second problem, because a DataSet object can contain multiple tables, you must necessarily know the name of the table you want to process and render as a Recordset object. The name of the table must be passed to the XSLT processor through the argument list.

The following code shows how to transform the DataSet object into an XPath document and load it into the processor. The result of the transformation is directly written out to an auto-indent XML writer. The argument passed to the style sheet is the name of the first table in the specified DataSet object.

    // Requires the System.Xml and System.Xml.Xsl namespaces;
    // data is the DataSet object filled earlier

    // Set up the style sheet
    XslTransform xslt = new XslTransform();
    xslt.Load("ado.xsl");

    // Create an XPath document from the DataSet
    XmlDataDocument doc = new XmlDataDocument(data);

    // Prepare the output writer
    XmlTextWriter writer = new XmlTextWriter(outputFile, null);
    writer.Formatting = Formatting.Indented;

    // Set some arguments
    XsltArgumentList args = new XsltArgumentList();
    args.AddParam("TableName", "", data.Tables[0].TableName);

    // Call the transformer and close the writer upon completion
    xslt.Transform(doc, args, writer);
    writer.Close();

The XmlDataDocument class internally creates an XML DOM representation of the DataSet content.
That content then becomes the input for the XSLT style sheet.

The ADO Style Sheet

Let's analyze the XSLT code necessary to transform a DataSet object into the XML version of an ADO Recordset object. The following listing shows the overall layout:
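The pieces described in this section (an XmlDataDocument wrapped around the DataSet, a TableName argument, and attribute names computed at run time) can be tried in isolation. What follows is only a minimal sketch under stated assumptions, not the ado.xsl style sheet itself: the embedded style sheet is a simplified, hypothetical one that emits plain rows and row elements rather than the ADO Recordset layout, the DataSet is built in memory so that no SQL Server is required, and the names DataSetStyler, BuildSample, and Transform are invented for illustration. Later versions of the .NET Framework flag XslTransform and XmlDataDocument as obsolete, hence the pragma.

```csharp
#pragma warning disable 0618   // XslTransform and XmlDataDocument are obsolete in later frameworks
using System;
using System.Data;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public class DataSetStyler
{
    // Hypothetical, simplified style sheet (an assumption, not the book's ado.xsl):
    // one template on the document root, one TableName parameter, and one
    // attribute per child node, named through a curly-bracket expression.
    const string StyleSheet = @"
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:param name='TableName' select=""string('Table')"" />
  <xsl:template match='/*'>
    <rows>
      <xsl:for-each select='*[local-name()=$TableName]'>
        <row>
          <xsl:for-each select='*'>
            <xsl:attribute name='{local-name()}'><xsl:value-of select='.' /></xsl:attribute>
          </xsl:for-each>
        </row>
      </xsl:for-each>
    </rows>
  </xsl:template>
</xsl:stylesheet>";

    // Builds a small in-memory stand-in for the Northwind query result.
    public static DataSet BuildSample()
    {
        DataSet data = new DataSet("Northwind");
        DataTable employees = data.Tables.Add("Employees");
        employees.Columns.Add("firstname", typeof(string));
        employees.Columns.Add("lastname", typeof(string));
        employees.Rows.Add(new object[] { "Nancy", "Davolio" });
        return data;
    }

    // Styles the DataSet directly, without serializing it to XML first.
    public static string Transform(DataSet data, string tableName)
    {
        XslTransform xslt = new XslTransform();
        xslt.Load(new XmlTextReader(new StringReader(StyleSheet)));

        XmlDataDocument doc = new XmlDataDocument(data);

        XsltArgumentList args = new XsltArgumentList();
        args.AddParam("TableName", "", tableName);

        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);
        writer.Formatting = Formatting.Indented;
        xslt.Transform(doc, args, writer);
        writer.Close();
        return output.ToString();
    }

    public static void Main()
    {
        DataSet data = BuildSample();
        Console.WriteLine(Transform(data, data.Tables[0].TableName));
    }
}
```

Running Main should print a rows element that contains one row element whose firstname and lastname attributes carry the sample values. The attribute names come from the curly-bracket expression in the name attribute, and passing a different table name changes which nodes the outer loop selects.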


    ⋮

The style sheet contains a single template that applies to the first node in the document, that is, the DataSet object's root. Because the match is found using a generic XPath expression that selects the first child, the template will work on the DataSet object's root, whatever its name might be.

The style sheet can accept one argument (TableName) that defaults to the string Table. Note that if you omit the XPath string function, Table denotes a node-set value rather than a string.

The XML version of an ADO Recordset object consists of two distinct blocks, schema and rows, grouped under an <xml> node. Here's the code for the Recordset schema:

    RowsetSchema
    row
    eltOnly


    s:rowbase

After you create the <xml> node with all of its required namespace declarations, you create a schema node with an id attribute. The schema tree contains the definitions of all the element and attribute types that will be used later. Note that ADO expresses the Recordset object in XML using the XML-Data Reduced (XDR) schema instead of the newer XML Schema Definition (XSD) schema. (See Chapter 3.)

In particular, the Recordset schema defines a <row> element to render a table row. The node will contain as many attributes as there are columns in the source table. To define all the attributes in the Recordset schema, you must visit all the children of a table node in the DataSet object. The actual name of the node will be specified by the $TableName style sheet argument.

The sample listing emphasizes a couple of for-each statements. The first statement selects the first node whose local, unqualified name matches the $TableName argument. The second loop enumerates the children of this node and creates an attribute schema definition for each.

The final step involves the creation of the data rows. Each source row corresponds to a <z:row> node whose attributes map to the source columns, as shown here:

This listing also includes a couple of nested for-each statements that run in the context of the DataSet object's root. The outer loop selects all the nodes whose name matches the $TableName parameter, whereas the innermost loop creates an attribute for each child node found. The <z:row> node is expected to have as many attributes as the child


nodes of the corresponding source tree and be named after them. In other words, the name of the attribute must be determined dynamically.

In an XSLT script, you create an attribute using the <xsl:attribute> instruction. The instruction has a name attribute to let you assign a name to the attribute. The name attribute can only be set with a literal, however. What if you must use an XPath expression to decide the name? In that case, you use the following special XPath syntax:

By wrapping the expression in curly brackets, you tell the processor that the attribute must be assigned the result of the specified expression.

Figure 7-11 illustrates a sample application that runs a query against SQL Server and saves the output in ADO-compliant XML.

Figure 7-11: The DataSet-to-Recordset style sheet converter in action.

Caution
The style sheet discussed in this example works well even if the DataSet object contains multiple tables. In fact, it has been designed to process only the nodes that match a given table name. The style sheet will produce incorrect XML output if a relationship exists between two tables and the corresponding DataRelation object has the Nested property set to true.
In this case, the records of the child table are serialized below each parent row, thus resulting in a discrepancy between the declared schema and the actual contents of each row.

A possible workaround is to use a second parameter, n, that specifies the number of columns in the table to be processed. While you define the schema, you stop the loop after the first n child nodes, discarding all the rows placed there because of the nested relationship.

The XML Web Server Control

The XML Web server control is used to output the contents of an XML document directly in an ASP.NET page. The control can display the source XML as is or as the results of an XSLT transformation.

The XML Web server control, denoted by the <asp:xml> tag, is a declarative counterpart to the XslTransform class. The XML Web server control has no more


features than the XslTransform class. More precisely, the XML Web server control makes use of the XslTransform class internally.

You use the XML Web server control when you need to embed XML documents in a Web page. For example, the control is extremely handy when you need to create XML data islands for the client to consume. Data islands consist of XML data referenced or included in an HTML page. The XML data can be included in-line within the HTML, or it can be in an external file. By combining this control's ability with the ADO XML style sheet we created in the previous section, you can transform a DataSet object into an ADO Recordset object and send it to the browser to be processed by client script procedures.

Let's take a closer look at the programming interface of the XML Web server control.

Programming the XML Web Server Control

In addition to the typical and standard properties of all server controls, the XML Web server control provides the properties listed in Table 7-9.
The document properties represent the source XML data, and the transform properties handle the instance of the XslTransform class to be used and the style sheet.

Table 7-9: Properties of the XML Web Server Control

Property               Description
Document               Sets the XML source document using an XmlDocument object
DocumentContent        Sets the XML source document using a string
DocumentSource         Sets the XML source document using a file
Transform              Sets the XslTransform class to use for transformations
TransformArgumentList  Gets or sets the argument list for transformations
TransformSource        Sets the style sheet to use for transformations

You can specify a source document using a file, a string, or an XML DOM object. A style sheet, on the other hand, can be specified using a file or a preconfigured XslTransform object. The output of the transformation, if any, is the Web page output stream.

The settings are mutually exclusive, and the last setting always wins. For example, if you set both Document and DocumentSource, no exception is thrown, but the first assignment is overridden. Although Table 7-9 emphasizes the writing of these properties, they are all read/write properties. For the DocumentContent property, however, only the set accessor has a significant implementation.
If you attempt to read the property, an empty string is returned.

The DocumentContent property can be set programmatically by using a string variable or declaratively by placing text between the start and end tags of the control, as shown here:

    <asp:xml runat="server">
    ... xml data ...
    </asp:xml>

You can optionally specify an XSL style sheet document that formats the XML document before it is written to the output. The output of the style sheet must be HTML,


XML, or plain text. It can't be, for example, ASP.NET source code or a combination of ASP.NET layout declarations. Let's look at a few practical examples.

Server-Side Transformations

The following listing demonstrates a simple but effective way to describe a portion of your Web page using XML code. The actual XML-to-HTML transformation is automatically and silently performed by the style sheet.

    1
    Nancy
    Davolio
    Sales Representative
    ...

The XML Web server control can have an ID and can be programmatically accessed. This opens up a new possibility. You can now check the browser's capabilities and decide dynamically which style sheet is most appropriate.

You can also describe the entire page with XML and use a style sheet to translate the page into HTML, as shown in the following code. This is not always, and not necessarily, the best solution to gain flexibility, but the XML Web server control definitely makes implementing that solution considerably easier.

If you need to pass in an argument, simply create and populate an instance of the XsltArgumentList class and pass it to the control using the TransformArgumentList property.

Creating Client-Side Data Islands

A data island is a block of data that is embedded in the body of an HTML page and is invisible to the user. Storing data in hidden fields is certainly the oldest and most widely supported way of implementing data islands.
You can think of XML data islands as islands of XML data dispersed in the sea of HTML pages.

Modern browsers (Internet Explorer 5.0 and later) support an ad hoc client-side tag, <xml>, to store islands of data, hiding them from view, as shown here:

    <xml>
    ... XML data goes here ...
    </xml>


Don't confuse the Internet Explorer 5.0 client-side <xml> HTML tag with the server-side <asp:xml> control. In Chapter 14, we'll return to data islands, and you'll learn how to define them from within server pages. For now, let's just say that an XML data island is XML text wrapped in an <xml> HTML tag. Not all browsers support this. The example described here requires Internet Explorer 5.0 or later.

Used in conjunction with the <xml> tag, the XML Web server control can be very helpful and effective. The following code flushes the contents of the specified XML file into a particular data island:

If needed, you can first apply a transformation. For example, you can embed an ADO XML Recordset object in a data island. In this case, set the TransformSource property of the XML Web server control with the proper style sheet.

Internet Explorer 5.0 automatically exposes the contents of the <xml> tag through an XML DOM object. Hold on, though: that's not managed code! What you get is a scriptable MSXML COM object. The following ASP.NET page includes some VBScript code that retrieves the contents of the data island. (More on this in Chapter 14.)

    void Page_Load(object sender, EventArgs e)
    {
        button.Attributes["onclick"] = "ReadXmlData()";
    }

    Sub ReadXmlData()
        ' data is the name of the <xml> tag and
        ' represents an MSXML XML DOM object
        window.alert(data.DocumentElement.nodeName)
    End Sub

    Client-side Data Islands
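The server-side settings discussed in this section can also be applied from code rather than declaratively. The following is a minimal code-behind sketch, assuming a page whose markup declares an <asp:xml> control inside the data island; the control ID xmlIsland, the file names employees.xml and ado.xsl, and the class names IslandHelper and IslandPage are hypothetical. The properties it sets (DocumentSource, TransformSource, and TransformArgumentList) are the ones listed in Table 7-9, and the code needs a reference to the System.Web assembly.

```csharp
using System;
using System.Web.UI;
using System.Xml.Xsl;

public class IslandHelper
{
    // Builds the argument list for a style sheet that declares a
    // TableName parameter, such as the DataSet-to-Recordset style sheet.
    public static XsltArgumentList BuildArguments(string tableName)
    {
        XsltArgumentList args = new XsltArgumentList();
        args.AddParam("TableName", "", tableName);
        return args;
    }
}

public class IslandPage : Page
{
    // Hypothetical ID of the <asp:xml> control declared in the page markup
    protected System.Web.UI.WebControls.Xml xmlIsland;

    protected void Page_Load(object sender, EventArgs e)
    {
        // Hypothetical source file and style sheet names
        xmlIsland.DocumentSource = "employees.xml";
        xmlIsland.TransformSource = "ado.xsl";

        // Hand the style sheet its TableName argument
        xmlIsland.TransformArgumentList = IslandHelper.BuildArguments("Employees");
    }
}
```

Keeping the argument list in a small helper is only a design convenience: it lets the same parameters be reused when the style sheet is applied outside a page, for example through the XslTransform class directly.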


XSLT Extension Objects

Let's complete our examination of transformations by analyzing the XSLT extension objects. As mentioned, the XsltArgumentList class can contain both parameters and extension objects. Parameters are simply value types, whereas extension objects are instances of .NET classes. When passed to the Transform method, both parameters and extension objects can be invoked from style sheets.

The behavior of a style sheet can be extended in various ways. For example, you can use a scripting instruction to run VBScript or JScript interpreted code. Before the advent of the .NET Framework, this was the only option available. With the .NET Framework, given the other characteristics of the XSLT processor, that instruction is by far the less interesting alternative.

In addition, in the .NET Framework, that instruction has been superseded by the <msxsl:script> element. This new instruction works in much the same way, but it supports managed languages, thus providing access to the entire .NET Framework.

Processing Embedded Scripts

When the style sheet is loaded in the XslTransform class, all defined functions are wrapped in a class and compiled to the .NET Framework intermediate language (IL). They then become available to XPath expressions as native functions.

The .NET Framework XSLT processor accepts external scripts through the <msxsl:script> element.
The script must use only XPath-compliant types even though, in most cases, type coercion is automatically provided by the processor. The type conformance is fundamental for input parameters and return values. Each script can internally use any .NET Framework type, paying some attention to the required namespaces. The following namespaces are imported by default: System, System.Text, System.Xml, System.Text.RegularExpressions, System.Xml.XPath, System.Xml.Xsl, System.Collections, and Microsoft.VisualBasic. Classes in other system namespaces can be used too, but their names must be fully qualified. For example, to use a DataSet object, you must call it System.Data.DataSet.

Important
An embedded script can't call into a user-defined namespace. The XSLT subsystem knows nothing about dependent assemblies and so can't reference them at compile time. To work around this issue, use extension objects.

The <msxsl:script> Instruction

The <msxsl:script> instruction has the following syntax:

    <msxsl:script language="language" implements-prefix="prefix">
    </msxsl:script>


Supported languages are C#, Visual Basic, and JScript. The language attribute is not mandatory and, if not specified, defaults to JScript. The implements-prefix attribute is mandatory, however. It declares a namespace and associates the user-defined code with it. The namespace must be defined somewhere in the style sheet. In addition, to make use of the <msxsl:script> instruction, the style sheet must include the following namespace:

    xmlns:msxsl="urn:schemas-microsoft-com:xslt"

Let's see how to define a simple script. To start off, we'll declare the extra namespaces in the style sheet's root node, as shown here:

This declaration is necessary to be able to call the <msxsl:script> instruction. The namespace simply groups under a single roof some user-defined scripts. The prefix dino is now necessary to qualify any calls to any functions defined in a <msxsl:script> block. Script blocks can be defined as children of the <xsl:stylesheet> node, at the same level as templates.

The following script concatenates first and last names, separated by a comma:

    public string PrepareName(string last, string first)
    {
        return last + ", " + first;
    }

In the body of the style sheet, typically in a template, you call the function, as follows:

If you enclose parameters in quotation marks, they will be treated as literals. To ensure that the function receives only node values, use the same expressions you would use with the select attribute of an <xsl:value-of> instruction. The preceding script runs from the context of a <lastname> node in the following schema:

    ...
    ...


The dot symbol (.) indicates the value of the current node, whereas ../firstname stands for the sibling of the current context node, named <firstname>.

When a function is declared, it is contained in a script block. Style sheets, however, can contain multiple <msxsl:script> blocks. All blocks are namespace-scoped and independent from each other. You can call a function defined in another block only when both functions share the same namespace and language.

Why should we use the same language to call into a function defined in another block? Isn't the .NET Framework totally language-neutral? The explanation for this discrepancy is found under the hood of <msxsl:script>. The instruction works as a mere code runner. It groups all script blocks in one or more all-encompassing classes. Blocks with the same namespace flow into the same dynamically created class.

In light of this, calling into external blocks is only possible because both involved functions, the caller and the callee, are members of the same managed class. For the same reasons, you can't use different languages. What the .NET Framework provides is the ability to invoke a compiled class irrespective of its source language. In no way does the .NET Framework provide you with the ability to write and compile a single class using different languages.

The CDATA Section

When an <msxsl:script> element is declared, you should enclose all of its code in a CDATA section. The main purpose of the CDATA delimiters is to protect the source code from the XML parser.
A style sheet document is in fact still an XML document and as such gets parsed, as shown here:

Wrapped in a CDATA section, the user-defined code can contain any unescaped character that would otherwise confuse the parser. The most common example is


    double CalculateSubTotal(XPathNodeIterator nodeset)
    {
        double total = 0;
        // Current exposes the navigator positioned on the node being visited
        while (nodeset.MoveNext())
            total += System.Convert.ToDouble(nodeset.Current.Value);
        return total;
    }

You call this function passing an XPath expression that evaluates to a node-set and then use the iterator's methods to navigate the nodes.

Passing Managed Objects to the Style Sheet

Using the <msxsl:script> instruction lets you execute managed code, which is advantageous from at least two standpoints. First, you write extension code using high-level languages, thus accessing the true power of the .NET Framework. Second, you move some of the style sheet logic into functions, thus rendering it with more appropriate tools than XSLT instructions.

The <msxsl:script> instruction does not represent the optimal solution, however. The main problem is that you still have code defined in the body of the style sheet. In addition, this code is silently and automatically transformed into managed code through the intervention of a system tool, the <msxsl:script> instruction, whose activity is neither monitored nor controllable. For this reason, the XSLT processor allows you to define a second group of parameters: extension objects.

How Managed Extension Objects Work

The idea behind extension objects is simple.
Instead of defining embedded scripts and leaving the <msxsl:script> instruction the task of grouping them into a dynamically created and compiled class, you just create and pass a managed class yourself!

Unlike embedded scripts, which are natively defined in the body of the style sheet, extension objects are external resources that must be plugged into the style sheet in some way. You can't use the <xsl:param> mechanism, however, because XSLT parameters must be XPath types. On the other hand, conceptually speaking, an extension object is just an external argument you pass to the style sheet. For this reason, the XsltArgumentList class defines a parallel set of methods specifically to handle extension objects. (See the section "Passing and Retrieving Arguments," on page 323.)

The XSLT processor maps the parameters in the argument list to the <xsl:param> instructions in the style sheet. The extension objects, on the other hand, are plugged in using the same internal mechanism that triggers when the <msxsl:script> code is gathered and then compiled. In abstract terms, using embedded scripts and using extension objects are somewhat equivalent. But using extension objects provides you with greater flexibility and improves the overall software design.

Script and Extension Object Trade-Offs

Using extension objects is preferable over using embedded scripts for at least three reasons. First, extension objects provide much better code encapsulation, not to mention the possibility of class reuse.
Second, you end up with more compact, layered style sheets, with significant advantages also in terms of more seamless code maintenance.


Finally, using classes lets you exploit the true potential of the .NET Framework more easily. You no longer have to worry about CDATA sections. And you can cascade calls from one class to another, with each class compiled separately and written in any language. An additional pleasant side effect is that you can call methods in classes belonging to custom namespaces as well as system namespaces.

Extension Objects in Action

The following code demonstrates how to register extension objects for use with the XSLT processor:

    // Create and configure the extension object
    ExtensionObject o = new ExtensionObject();
    // *** set properties on the object if needed

    // Register the object with the XSLT processor
    XsltArgumentList args = new XsltArgumentList();
    args.AddExtensionObject("urn:dino-objects", o);

    // The style sheet must be loaded (xslt.Load) before Transform is called
    XslTransform xslt = new XslTransform();
    xslt.Transform(doc, args, writer);

The ExtensionObject class in this code snippet is any .NET class that is visible to the caller program. When you add a live instance of the object to the argument list, you must specify the namespace URI that will be used throughout the style sheet to qualify the object.

The style sheet must include the corresponding namespace declaration with its own style sheet-wide prefix, as in the following example:

Finally, you invoke the methods on the object's interface using XPath expressions, as with embedded scripts.
For example, if the ExtensionObject class has a DoSomething method, the following would be perfectly valid code:

As with embedded scripts, methods of extension objects must publicly handle .NET Framework types that can be converted to XPath types.

Conclusion

XML data is a key element for any modern distributed and tiered system. But XML data alone is not really usable, and even when it is usable, it turns out to be not very


profitable, because XML is a metalanguage that needs further instantiation and specialization.

You can think of XML as an abstract class for data description languages. As with abstract classes, you can use XML as a reference but not to perform complex tasks. So XML does matter but only if you pair it with other related technologies. In Chapter 6, we analyzed XPath as the emerging language for performing queries. I can't say whether XPath is the definitive query tool or just a temporary technology that will soon be replaced by something else, perhaps XQuery. XPath is a key technology to enable powerful and effective data transformation, which is just what this whole chapter has been about.

In abstract terms, transforming XML data means making data usable by actual applications and by end-users. XSLT is simply a subset of the XML style sheet language, but it probably represents the core part. This chapter provided a quick refresher course in the XSLT vocabulary of instructions and then focused on the .NET Framework implementation of the XSLT processor.

In the .NET Framework, the XSLT processor is contained in a single class, the XslTransform class. This chapter explained the programming interface of the XSLT processor and unveiled some of its internal features.
We also looked at security and threading aspects and a few concrete examples of style sheet definitions and use.

With this chapter, the second part of the book, dedicated to data manipulation via XML-related standards, comes to an end. In Part III, we'll look at a new programming aspect of XML: XML and databases. Chapter 8 in particular will discuss how to read and write data from and to databases in XML format.

Further Reading

For further study of the XSL initiative and XSLT in particular, the official specification is available at http://www.w3.org/TR/xslt. It refers to XSLT 1.0, which is the version currently supported by the .NET Framework. For a sneak preview of what's coming next, the working draft of XSLT 1.1 is downloadable from http://www.w3.org/TR/xslt11.

In our examination of the XSL technology as a whole, XSL Formatting Objects (XSL-FO) were introduced. To learn more, have a look at the following online tutorial: http://www.dpawson.co.uk/xsl/sect3/bk/index.html. In general, useful links for online material about XSL and related technologies are listed at http://www.w3.org/Style/XSL.


Part III: XML and Data Access

Chapter List

Chapter 8: XML and Databases
Chapter 9: ADO.NET XML Data Serialization
Chapter 10: Stateful Data Serialization

Part Overview


Chapter 8: XML and Databases

Overview

Most likely, the majority of today's computer experts and students would associate the idea of a database with a relational database. Since their introduction in the early 1970s, relational databases have enjoyed extraordinary success. Relational databases have grown so steadily and progressively that along the way they've lost the qualifying adjective relational and become the only commonly accepted way to design a database.

Today, relational databases like Microsoft SQL Server 2000, Oracle 9i, and IBM DB2 are the favorite tools for storing and working with data. Modern databases do a lot of things, but what a (relational) database still does best is store data. Relational databases won out over other data models such as the hierarchical and reticular (network) models mostly because of their inherent simplicity and natural way of modeling data and arranging queries. Relational databases use Structured Query Language (SQL) to search the information they contain.

Recent developments in the computer industry have raised the need for total software integration and communication. As a side effect, data modeled into a system must often be transformed into analogous, but not identical, models in order to be stored or linked on different systems.
Enter XML and its innate ability to describe data.

More and more often today you need to extract data out of databases and model it into a particular data schema using XML. So why not just ask the database itself to return data as XML, possibly formatted in a supplied schema? XML support is built into (or will be built into) almost all database management systems (DBMS) currently available. In particular, Microsoft SQL Server 2000 comes with an embedded engine capable of returning data as XML. This feature is built as an extension to the traditional SELECT command, and data is rendered as XML before being sent back to the client. Oracle 9i provides a slightly different model that treats XML as a native data type. XML data can be stored in ad hoc relational tables as well as in binary large object (BLOB) fields that can be either binary or ASCII.

Whatever the vendor approach, XML and databases represent a key alliance for the present and the future of data-driven and interoperable applications.
In this chapter, we'll review the essential aspects of XML in SQL Server 2000, and you'll learn how to take advantage of these features from within a Microsoft .NET Framework environment.

Reading XML Data from Databases

With SQL Server 2000, you have two basic ways to retrieve data as XML: you can use the XML extensions to the SELECT command, or you can execute a query on a particular text or BLOB field that is known to contain XML data. SQL Server does not mark these fields with a special attribute or data type to indicate that they contain XML data, however.

With the first technique, you typically use the FOR XML clause in a traditional query command. In response, the DBMS executes the query in two steps. First it executes the SELECT statement, and next it applies the FOR XML transformation to a rowset. The resulting XML is then sent to the client as a one-column rowset.


Note: Although specific to the OLE DB specification, the term rowset is often used generically to indicate a set of rows that contain columns of data. Rowsets are key objects that enable all OLE DB data providers to expose query result set data in a tabular format.

The FOR XML extensions let you consider XML mostly as a data output format. With the alternative technique for retrieving data as XML, you can store raw XML data in a text or BLOB field and retrieve that data using an ordinary query, preferably a scalar, single-field query. In both cases, the Microsoft ADO.NET object model, along with the Microsoft .NET Framework XML core classes, provides a number of handy features to extract XML data quickly and effectively.

SQL Server 2000 XML Extensions

The XML support in SQL Server 2000 provides URL-driven access to the database resources, XML-driven data management, and the possibility of using XPath queries to select data from relational tables. SQL Server 2000 does not create ad hoc storage structures for XML data. It does provide an ad hoc infrastructure for reading, writing, and querying relational data through the XML logical filter.

The following list gives you a bird's-eye view of the key XML features available in SQL Server 2000 and its latest extension, SQLXML 3.0:

• Access SQL Server through a URL.
  An ISAPI filter running on top of Internet Information Services (IIS) lets you send query commands directly to SQL Server over HTTP. You simply point to a properly formatted URL, and what you get back is the result set formatted as XML data.
• Create XML schema-driven views of relational data. Similar to CREATE VIEW, this feature lets you represent a result set as an XML document written according to a given XML Schema Definition (XSD) or XML-Data Reduced (XDR) schema. You specify the mapping rules between the native fields and XML attributes and elements. The resultant XML document can be treated as a regular XML Document Object Model (XML DOM) object and queried using XPath expressions.
• Return fetched data as XML. This feature is at the foundation of the entire XML support in SQL Server 2000. An engine internal to the database is capable of formatting raw column data into XML fragments and exposing those fragments as strings to callers. This capability is incorporated in the SELECT statement and can be controlled through a number of clauses and attributes.
• Insert data represented as an XML document. Just as you can read relational data into hierarchical XML documents, you can write XML data into tables. The source document is preprocessed by a system stored procedure named sp_xml_preparedocument.
  The parsed document is then passed on to a special module, named OPENXML, that provides a rowset view of the XML data. At this point, to ordinary Transact-SQL (T-SQL) commands, XML native data looks like ordinary result sets.

SQLXML 3.0 is an extension to SQL Server 2000 designed to keep current with evolving W3C standards for XML and other requested functions. Available as a free download at http://msdn.microsoft.com/downloads, SQLXML 3.0 also provides a set of managed classes for exposing some of its functionality to .NET Framework applications. SQLXML 3.0 includes the ability to expose stored procedures as Web services via the Simple Object Access Protocol (SOAP) and adds support for ADO.NET DiffGrams and client-side XML transformations.

XML Extensions to the SELECT Statement

In SQL Server 2000, you can query existing relational tables and return results as XML documents rather than as standard rowsets. The query is written and runs normally. If the SELECT statement contains a trailing FOR XML clause, the result set is then transformed into a string of XML text. Within the FOR XML clause, you can specify one of the XML modes described in Table 8-1.

Table 8-1: Modes of the FOR XML Extension

Mode        Description
AUTO        Returns query results as a sequence of <table> XML elements, where table is the name of the table. Fields are rendered as node attributes. If the additional ELEMENTS clause is specified, fields are rendered as child nodes instead of attributes.
EXPLICIT    The query defines the schema of the XML document being returned.
RAW         Returns query results as a sequence of generic <row> nodes with as many attributes as the selected fields.

The mode is valid only in the SELECT command for which it has been set. In no way does the mode affect any subsequent queries.
XML-driven queries can be executed directly or from within stored procedures.

Tip: The XML data contains an XDR schema if you append the XMLDATA attribute to the FOR XML mode of choice, as shown here:

    SELECT * FROM Employees FOR XML AUTO, XMLDATA

Schema information is incorporated in a <Schema> node prepended to the document.

The FOR XML AUTO Mode

The AUTO mode returns data packed as XML fragments, that is, without a root node. The alias of the table determines the name of each node. If the query joins two tables on the value of a column, the resulting XML schema provides nested elements. Let's consider the following simple query:

    SELECT CustomerID, ContactName FROM Customers FOR XML AUTO

The XML result set has the form shown here:

    <Customers CustomerID="ALFKI" ContactName="Maria Anders" />
    <Customers CustomerID="ANATR" ContactName="Ana Trujillo" />
    ...

Try now with a command that contains an INNER JOIN, as follows:

    SELECT Customers.CustomerID, Customers.ContactName, Orders.OrderID
    FROM Customers
    INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
    FOR XML AUTO

Interestingly, in this case the XML output automatically groups child records below the parent:

    <Customers CustomerID="ALFKI" ContactName="Maria Anders">
      <Orders OrderID="10643" />
      <Orders OrderID="10692" />
      ...
    </Customers>
    ...

If the ELEMENTS attribute is also specified, the data rows are rendered in XML through elements rather than as attributes. Let's consider the following query:

    SELECT CustomerID, ContactName FROM Customers FOR XML AUTO, ELEMENTS

The XML output is similar to this:

    <Customers>
      <CustomerID>ALFKI</CustomerID>
      <ContactName>Maria Anders</ContactName>
    </Customers>
    <Customers>
      <CustomerID>ANATR</CustomerID>
      <ContactName>Ana Trujillo</ContactName>
    </Customers>
    ...

In the case of INNER JOINs, the output becomes the following:

    <Customers>
      <CustomerID>ALFKI</CustomerID>
      <ContactName>Maria Anders</ContactName>
      <Orders>
        <OrderID>10643</OrderID>
      </Orders>
      <Orders>
        <OrderID>10692</OrderID>
      </Orders>
      ...
    </Customers>
    ...

The FOR XML AUTO mode always resolves table dependencies in terms of nested rows. The overall XML stream is not completely well-formed. Instead of an XML document, the output is an XML fragment, making it easier for clients to concatenate more result sets into a single structure.

Note: If you also add the BINARY BASE64 option to a FOR XML query, any binary data that is returned will automatically be encoded using a base64 algorithm.

The FOR XML RAW Mode

As its name suggests, the FOR XML RAW mode is the least rich mode in terms of features and options. When designed using this mode, the query returns an XML fragment that, at first glance, might look a lot like the fragment produced by the FOR XML AUTO option. You obtain an XML fragment made of <row> nodes with as many attributes as the columns. For example, consider the following simple query:

    SELECT CustomerID, ContactName FROM Customers FOR XML RAW

The output is shown here:

    <row CustomerID="ALFKI" ContactName="Maria Anders" />
    <row CustomerID="ANATR" ContactName="Ana Trujillo" />
    ...

You can't change the name of the node, nor can you render attributes as nested nodes. So far, so good: the RAW mode is only a bit less flexible than the AUTO mode. However, the situation changes when you use joined tables.

The schema of XML data remains intact even when you process multiple tables. The INNER JOIN statement from the previous section, run in FOR XML RAW mode, produces the following output:

    <row CustomerID="ALFKI" ContactName="Maria Anders" OrderID="10643" />
    <row CustomerID="ALFKI" ContactName="Maria Anders" OrderID="10692" />
    ...

Even with the naked eye, you can see that the RAW mode produces a less optimized and more redundant output than the AUTO mode.
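The difference between the two modes on joined data can be sketched outside the database. The following Python snippet is only an emulation under stated assumptions, not SQL Server's implementation: the sample rows, the `raw_mode` and `auto_mode` helpers, and the third customer's order number are illustrative, and grouping relies on rows for the same customer being adjacent.

```python
from itertools import groupby
from xml.sax.saxutils import quoteattr

# Hypothetical joined rows (CustomerID, ContactName, OrderID), already
# ordered by customer, standing in for the INNER JOIN result set.
rows = [
    ("ALFKI", "Maria Anders", 10643),
    ("ALFKI", "Maria Anders", 10692),
    ("ANATR", "Ana Trujillo", 10308),
]

def raw_mode(rows):
    """RAW-style: one flat <row> per record, so parent data repeats."""
    return [
        "<row CustomerID=%s ContactName=%s OrderID=%s/>"
        % (quoteattr(cid), quoteattr(name), quoteattr(str(oid)))
        for cid, name, oid in rows
    ]

def auto_mode(rows):
    """AUTO-style: one <Customers> per customer with nested <Orders>."""
    out = []
    for (cid, name), group in groupby(rows, key=lambda r: (r[0], r[1])):
        out.append("<Customers CustomerID=%s ContactName=%s>"
                   % (quoteattr(cid), quoteattr(name)))
        for _, _, oid in group:
            out.append("  <Orders OrderID=%s/>" % quoteattr(str(oid)))
        out.append("</Customers>")
    return out

print("\n".join(raw_mode(rows)))
print("\n".join(auto_mode(rows)))
```

In the RAW-style output the customer attributes appear once per order, whereas the AUTO-style output states them once per customer, which is the redundancy the text refers to.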
The ELEMENTS clause is not supported in RAW mode, whereas XMLDATA and BINARY BASE64 are perfectly legitimate.

Limitations of FOR XML

The FOR XML clause is not valid in all cases in which a SELECT statement is acceptable. In general, FOR XML can be used only when the selection produces direct output going to the SQL Server client, whatever that output is. Let's review a couple of common scenarios in which you can't make use of the FOR XML clause. For a more complete overview, please refer to SQL Server's Books Online.

FOR XML Can't Be Used in Subselections

SQL Server 2000 allows you to use the output of an inner SELECT statement as a virtual table to which an outer SELECT statement can refer. The inner query can't return XML data if you plan to use its output to perform further operations. For example, the following query is not valid:

    SELECT * FROM (SELECT * FROM Employees FOR XML AUTO) AS t

Likewise, the FOR XML clause is not valid in a SELECT statement that is used to create a view. For example, the following statement is not allowed:

    CREATE VIEW MyOrders AS
    SELECT OrderId, OrderDate FROM Orders FOR XML AUTO

In contrast, you can select data from a view and return it as XML. In addition, FOR XML can't be used with cursors.

FOR XML Can't Be Used with Computed Columns

The current version of SQL Server does not permit GROUP BY and aggregate functions to be used with FOR XML AUTO. Aggregate functions and GROUP BY clauses can be safely used if the XML query is expressed in RAW mode, however. The following code returns the expected results:

    SELECT min(unitprice) AS price, max(quantity) AS quantity
    FROM [order details] FOR XML RAW

The only caveat is that you must explicitly name the computed columns using the AS keyword. The output is a single <row> node with one attribute per named column:

    <row price="..." quantity="..." />

Table 8-1 mentioned a third FOR XML mode, the EXPLICIT mode. The EXPLICIT mode goes beyond the rather basic goals of both AUTO and RAW.
It is designed to enable users to build a personal schema to render relational data in XML. The EXPLICIT mode is one of the ways that programmers have to create custom XML views of stored data.

Client-Side XML Formatting

SQLXML 3.0 extends the base set of SQL Server 2000 XML extensions by including client-side formatting capabilities in addition to the default server-side XML formatting. From within a .NET Framework application, you use SQLXML 3.0 managed classes (more on this in the section "SQLXML Managed Classes," on page 386) to set up a command that returns XML data.

When the command executes, the managed classes (at least in this version of the SQLXML library) call into a middle-tier OLE DB provider (SQLXMLOLEDB) object, which in turn calls into the OLE DB provider for SQL Server. The command that hits the database does not contain the FOR XML clause. When the rowset gets back to the SQLXMLOLEDB provider, it is transformed into XML according to the syntax of the FOR XML clause and returned to the client. Figure 8-1 compares server-side and client-side XML formatting.
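The idea of client-side formatting, in which a plain query is executed and the rowset is turned into XML outside the database, can be illustrated with a minimal sketch. This is not the SQLXMLOLEDB provider or any SQLXML API: it assumes an in-memory SQLite table as a stand-in for Customers, and `rowset_to_raw_xml` is a hypothetical helper that mimics only the RAW-style layout.

```python
import sqlite3
from xml.sax.saxutils import quoteattr

def rowset_to_raw_xml(cursor):
    """Render every fetched row as a generic <row> element, RAW-style."""
    names = [d[0] for d in cursor.description]
    fragments = []
    for record in cursor:
        attrs = " ".join(
            "%s=%s" % (name, quoteattr(str(value)))
            for name, value in zip(names, record)
            if value is not None  # this sketch simply skips NULL columns
        )
        fragments.append("<row %s/>" % attrs)
    return "".join(fragments)

# In-memory stand-in for the Customers table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID TEXT, ContactName TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [("ALFKI", "Maria Anders"), ("ANATR", "Ana Trujillo")],
)

# The query that reaches the database carries no FOR XML clause;
# the XML is produced afterwards, on the client side.
cursor = conn.execute(
    "SELECT CustomerID, ContactName FROM Customers ORDER BY CustomerID"
)
xml_fragment = rowset_to_raw_xml(cursor)
print(xml_fragment)
```

As with server-side FOR XML, the result is a rootless fragment; a caller that needs a well-formed document would have to wrap it in a root element of its own.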


Figure 8-1: The client-side XML formatting feature of SQLXML 3.0 makes use of intermediate OLE DB providers to execute the query and transform the results.

As you'd expect, the two intermediate OLE DB providers cause more performance problems than they ever attempt to resolve. On the other hand, SQLXML 3.0 is not specifically designed for the .NET Framework, although it contains a few managed classes that we'll look at in the section "SQLXML Managed Classes," on page 386. In a nutshell, keep in mind that SQLXML 3.0 provides client-side XML formatting but that this feature is rather inefficient. For .NET Framework applications, a much better approach for client-side XML rendering is represented by the XmlDataDocument class. (See the section "The XmlDataDocument Class," on page 372.)

Creating XML Views

Just as a CREATE VIEW statement in SQL lets you create a virtual table by collecting columns from one or more tables, an XML view provides an alternative and highly customizable way to present relational data in XML.

Building an XML view consists of defining a custom XML schema and mapping to its elements the columns and the tables selected by the query. Once built, an XML view can be used like its close cousin, the SQL view.
In particular, an XML view can be queried using XPath expressions and transformed using XSL Transformation (XSLT) scripts. An XML view is simply a stream of XML data, so it can be consumed in any way .NET allows XML to be consumed. In the .NET Framework, you can use XML views through readers, XML DOM, or even specialized classes, such as those in SQLXML 3.0.

There are two possible ways to create XML views: you can use the FOR XML EXPLICIT mode of the SELECT statement, or you can build an annotated XDR or XSD schema. To use an XSD schema, you must install SQLXML 3.0 first.
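To make the point about querying a view with XPath concrete, here is a minimal sketch. It assumes the XML view has already been fetched as text and wrapped in a root element; the element names, attribute names, and sample values are illustrative, and Python's ElementTree supports only a subset of XPath rather than the full language.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view; names and values are illustrative only.
view = """
<Employees>
  <Employee ID="1" Name="Davolio"><City>Seattle</City></Employee>
  <Employee ID="2" Name="Fuller"><City>Tacoma</City></Employee>
  <Employee ID="3" Name="Leverling"><City>Kirkland</City></Employee>
</Employees>
"""
root = ET.fromstring(view)

# Attribute predicate: the employee whose ID attribute is 2.
fuller = root.find("./Employee[@ID='2']")

# Child-text predicate: names of the employees located in Seattle.
seattle = [e.get("Name") for e in root.findall("./Employee[City='Seattle']")]

print(fuller.get("Name"), seattle)
```

The same selections could be expressed against the relational tables with a WHERE clause, which is worth keeping in mind when weighing whether an XML view adds value.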


The FOR XML EXPLICIT Mode

The query defines the shape of the generated XML document. The ultimate goal of the query is making hierarchical data fit into a tabular rowset. An EXPLICIT mode query creates a virtual table in which all the information fetched from the tables is organized in such a way that it can then be easily rendered in XML. The definition of the schema is free, and of course, programmers must ensure that the final output is well-formed XML.

Any FOR XML EXPLICIT query requires two extra metacolumns, named Tag and Parent. The values in these columns are used to generate the XML hierarchy. The Tag column contains a unique numeric index for each XML root node that is expected to have children in the XML schema. The Parent column contains a tag value that links a given node to a particular, and previously defined, subtree.

To add columns, you must use a relatively complex syntax for column aliases. Each selected column must have an alias defined according to the following syntax:

    SELECT column_name AS [ParentNode!ParentTag!TagName!Directive]

The ParentNode item represents the name of the node element that is expected to be the parent of the column data. The ParentTag is the tag number of the parent.
The TagName item indicates the name of the XML element that contains the column data. Finally, the Directive element can take various values, the most common ones being no value or element. If no value is specified, the column data is rendered as an attribute named TagName; otherwise, it is rendered as a child element named TagName.

It's interesting to note that an EXPLICIT mode query consists of one or more tables that result from SELECT statements potentially involving multiple tables and joined data. Let's see what's needed to obtain the following XML representation of the rows in the Northwind database's Employees table:

    <Employee ID="..." Name="...">
      <PersonalData>
        <BirthDate>birthdate</BirthDate>
        <City>city</City>
      </PersonalData>
      <JobData>
        <HireDate>hiredate</HireDate>
        <Title>title</Title>
      </JobData>
      <Notes>notes</Notes>
    </Employee>

The <Employee>, <PersonalData>, and <JobData> lines (boldface in the printed listing) represent the roots of the three subtrees of XML data being created. Each subtree corresponds to a different tag, and each must be filled by resorting to a different SELECT statement.

To begin filling the subtrees, consider the following query:

    SELECT 1 AS Tag,
        NULL AS Parent,
        employeeid AS [Employee!1!ID],
        lastname AS [Employee!1!Name]

This statement fills in the first tag (the fragment's root), which has no parent and contains two attributes, ID and Name. The employeeid and the lastname columns will fill respectively the ID and the Name attributes of an <Employee> node with no parent.

The first table always defines the structure of the XML view. Successive tables can only fill in holes; nothing new will be added. Consequently, to obtain the previous schema, you must write the first tag as follows:

    SELECT 1 AS Tag,
        NULL AS Parent,
        employeeid AS [Employee!1!ID],
        titleofcourtesy + ' ' + lastname + ', ' + firstname
            AS [Employee!1!Name],
        NULL AS [PersonalData!2!BirthDate!element],
        NULL AS [PersonalData!2!City!element],
        NULL AS [JobData!3!HireDate!element],
        NULL AS [JobData!3!Title!element],
        lastname AS [Employee!1!Notes!element]
    FROM Employees

The columns with NULL values will be selected by successive queries. In particular, you'll notice PersonalData and JobData trees with tag IDs of 2 and 3, respectively. The former contains a pair of BirthDate and City elements. The latter holds elements named Title and HireDate.

To unify all the subtables, you must use the UNION ALL statement. The complete statement is shown here:

    SELECT
        1 AS Tag,
        NULL AS Parent,
        employeeid AS [Employee!1!ID],
        titleofcourtesy + ' ' + lastname + ', ' + firstname
            AS [Employee!1!Name],
        NULL AS [PersonalData!2!BirthDate!element],
        NULL AS [PersonalData!2!City!element],
        NULL AS [JobData!3!HireDate!element],
        NULL AS [JobData!3!Title!element],
        lastname AS [Employee!1!Notes!element]
    FROM Employees
    UNION ALL
    SELECT
        2, 1,
        employeeid,
        titleofcourtesy + ' ' + lastname + ', ' + firstname,
        birthdate,
        city,
        hiredate,
        title,
        notes
    FROM Employees
    UNION ALL
    SELECT
        3, 1,
        employeeid,
        titleofcourtesy + ' ' + lastname + ', ' + firstname,
        birthdate,
        city,
        hiredate,
        title,
        notes
    FROM Employees
    ORDER BY [Employee!1!ID]
    FOR XML EXPLICIT

The T-SQL UNION ALL operator combines the results of two or more SELECT statements into a single result set. All participating result sets must have the same number of columns, and corresponding columns must have compatible data types.

Using an Annotated Mapping Schema

A more lightweight alternative to FOR XML EXPLICIT views is the annotated schema. SQL Server 2000 lets you create XML views by defining an XDR schema with special annotations that work like placeholders for selected data. Basically, instead of defining the schema using a new syntax and combining multiple virtual tables, you use a standard XML data definition language and map elements to columns using ad hoc annotations.

The base version of SQL Server 2000 supports only XDR. If you want to use XSD, you must install SQLXML 3.0. (To review the differences between XDR and XSD, see Chapter 3.)

The following listing shows a simple XSD annotated schema that defines a parent node with a couple of child nodes:


    sql:field="FirstName" type="xsd:string" />

The annotations sql:relation and sql:field facilitate the mapping between the source table and the resulting XML data. In particular, sql:relation indicates that the given node is related to the specified table. The sql:field annotation indicates the column that should be used to populate the given element. If no sql:field annotation is provided, SQL Server expects to find a perfect match between the element or attribute name and a column. In the preceding schema, the EmployeeID attribute is linked directly by name.

Note: Annotated schemas do not allow you to use expressions when selecting columns. The sql:field annotation can accept only the name of an existing column; it can't accept an expression that evaluates to a column name.

Are XML Views Effective?

The FOR XML EXPLICIT clause and annotated schemas are two somewhat equivalent ways to query relational tables and return data formatted according to a particular XML schema.
XSD mapping is more powerful than XDR, but all in all, in terms of raw functionality, explicit and schema mapping are two nearly identical options for building XML views.

Certainly the FOR XML EXPLICIT clause can lead to hard-to-maintain code, whereas annotated schemas are probably easier to read and maintain and, in addition, keep the schema distinct from the query and the data.

The real XML mapping schema issue is this: What's the added value that XML views bring to your code? Are you sure that the ability to execute XPath queries justifies the creation of an XML view? The XPath query engine is certainly inferior to SQL Server's query engine, at least for complex queries like the ones you might need to perform on real-world data. In addition, for read/write solutions, writing data back to the native relational tables can be less than effective if done through XML. We'll return to this topic when we look at the OPENXML provider in the section "The OPENXML Rowset Provider," on page 376.

One scenario in which reading relational data as XML turns out to be clearly effective is when you need to turn fetched data into more manageable or easily interoperable structures.
If you need to exchange an invoice document with commercial partners, using an XML representation of the data is certainly useful, because you process data in an intermediate, platform-independent and application-independent format, while preserving the ability to create views and perform queries locally. In addition, having the database return and accept XML data with a custom layout helps considerably.

In this scenario, another reasonable step you might need to take is transforming the XML data into high-level data structures such as classes. For .NET Framework applications, XML serialization is a key technology that you must absolutely be familiar with. We'll examine XML serialization in Chapter 11.

Let's look now at how ADO.NET and XML classes can be used to read and process relational data expressed in a hierarchical shape.

XML Data Readers

.NET Framework applications delegate all their low-level data access tasks to a special breed of connector objects called managed data providers. The object model around these connector components is known as ADO.NET. Basically, a data provider is the software component that enables any .NET Framework application to connect to a data source and execute commands to retrieve and modify data.

A .NET Framework data provider component interfaces client applications through the objects in the ADO.NET namespace and exposes any provider-specific behavior directly to consumers. A .NET Framework data provider component creates a minimal layer between the physical data source and the client code, thereby increasing performance without sacrificing functionality.

A .NET Framework data provider is fully integrated with the surrounding environment (the .NET Framework), so any results that a command generates are promptly and automatically packed into a familiar data structure (the ADO.NET and XML classes) for further use.

A key architectural goal for .NET Framework data providers is that they must be designed to work on a rigorous per-data source basis.
They expose connection, transaction, command, and reader objects, all working according to the internal capabilities and structure of the DBMS. As a result, the programming interface of, say, the Microsoft Access data provider will not be completely identical to that of the SQL Server provider. An area in which this difference is palpable is XML data queries.

OLE DB and .NET Framework Managed Data Providers

Prior to the advent of the .NET Framework, OLE DB was considered the emerging data access technology. It was well positioned to definitively replace, in the hearts and the code of developers, another well-known standard for universal data access: open database connectivity (ODBC).

OLE DB is the data access technology that translates the Universal Data Access (UDA) vision into concrete programming calls. Introduced about five years ago, UDA describes a scenario in which all the data that can be expressed in a tabular format can be accessed and manipulated through a common API, no matter the actual binary format and the storage medium. According to the UDA vision, special modules (the OLE DB providers) would be called to expose the contents of a data source to the world. Another family of components (the OLE DB consumers) would consume such contents by interacting with the providers through a common API.

In designing the intermediate API for OLE DB providers and consumers to communicate through, Microsoft decided to use the key software technology of the time: the Component Object Model (COM). In this design approach, the consumer had to instantiate a COM object, query for a number of interfaces, and handle the results. The provider had to implement the same number of interfaces (and even more) and access the wrapped data source at every method invocation. The methods defined in the OLE DB interfaces are quite general and are not tied to the features of a particular data source.

Compared to OLE DB providers, .NET Framework data providers implement a much smaller set of interfaces and always work within the boundaries of the .NET Framework common language runtime (CLR). A .NET Framework managed data provider and an OLE DB provider are different components mostly in the outermost interface, which clients use to communicate. Under the hood, they look much more similar than you may expect. In particular, both components use the same low-level API to talk to the physical data source. For example, both the .NET Framework data provider and the OLE DB provider access SQL Server 7.0 and later using Tabular Data Stream (TDS) packets. Both components hook up SQL Server at the wire level, thereby providing nearly identical performance, each from its native environment: Microsoft Win32 for OLE DB and the .NET Framework for managed data providers.

Reading from XML Queries

The SQL Server .NET Framework data provider makes available a particular method in its command class, SqlCommand, that explicitly lets you obtain an XML reader whenever the command text returns XML data.
In other words, you can choose to execute a SQL command with a trailing FOR XML clause and then pick up the results directly using an XML reader. Let's see how.

The following code sets up a command that returns XML information about all the employees in the Northwind database:

    string nwind = "DATABASE=northwind;SERVER=localhost;UID=sa;";
    string query = "SELECT * FROM Employees FOR XML AUTO, ELEMENTS";
    SqlConnection conn = new SqlConnection(nwind);
    SqlCommand cmd = new SqlCommand(query, conn);

In general, an ADO.NET command can be run using a variety of execute methods, including ExecuteNonQuery, ExecuteReader, and ExecuteScalar. These methods differ in the format in which the result set is packed. The SQL Server 2000 ad hoc command class, SqlCommand, supplies a fourth execute method, ExecuteXmlReader, which simply returns the result set as an XML reader.

You use the ExecuteXmlReader method as a special type of constructor for an XmlTextReader object, as shown here:

    conn.Open();
    XmlTextReader reader = (XmlTextReader) cmd.ExecuteXmlReader();
    ProcessXmlData(reader);
    reader.Close();
    conn.Close();

The ExecuteXmlReader method executes the command and returns an instance of an XmlTextReader object to access the result set. Of course, ExecuteXmlReader fails, throwing an InvalidOperationException exception, if the command does not return an XML result.
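A FOR XML query typically yields a sequence of sibling elements rather than a single-rooted document, so code that consumes the reader should be prepared for an XML fragment. The following is a minimal sketch of that consumption step only. It assumes .NET Framework 2.0 or later (XmlReader.Create and XmlReaderSettings are newer than the 1.x classes used in this chapter), and the SampleFragment string is a hypothetical stand-in for server output, not actual Northwind data returned by SQL Server.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentDemo
{
    // Hypothetical stand-in for a FOR XML AUTO, ELEMENTS result:
    // several sibling elements with no single root element.
    public const string SampleFragment =
        "<Employees><EmployeeID>1</EmployeeID><LastName>Davolio</LastName></Employees>" +
        "<Employees><EmployeeID>2</EmployeeID><LastName>Fuller</LastName></Employees>";

    // Collects the text of every <LastName> element in the fragment.
    public static List<string> ReadLastNames(string xmlFragment)
    {
        List<string> names = new List<string>();

        XmlReaderSettings settings = new XmlReaderSettings();
        // Allow more than one top-level element.
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        // The using block closes the reader even if an exception is thrown.
        using (XmlReader reader = XmlReader.Create(new StringReader(xmlFragment), settings))
        {
            while (reader.ReadToFollowing("LastName"))
                names.Add(reader.ReadElementContentAsString());
        }
        return names;
    }

    public static void Main()
    {
        foreach (string name in ReadLastNames(SampleFragment))
            Console.WriteLine(name);
    }
}
```

With a live connection, the same loop could run over the reader returned by ExecuteXmlReader. Wrapping both the connection and the reader in using blocks keeps the connection from staying open longer than needed.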


The SqlCommand class performs no preliminary check on the structure of the T-SQL command being executed to statically determine whether the command returns XML data. This means that any error that invalidates the operation is detected on the server. A client-side check could verify that the command text incorporates a correct FOR XML clause prior to sending the text to the database. However, such a test would also catch as erroneous a perfectly legitimate situation: selecting XML data from a text or a BLOB field. So while performing a preliminary check could still make sense for some user applications, it would be ineffective if done from within the command class.

Note: Although the ExecuteXmlReader method returns a generic XmlReader object, the true type of the returned object is always XmlTextReader. You can use this object at will, for example to create a validating reader. Bear in mind, however, that the more you use the XML reader, the longer the connection stays open.

The application shown in Figure 8-2 uses the schema we analyzed in the section "The FOR XML EXPLICIT Mode," on page 356, while examining the FOR XML EXPLICIT clause. The application runs the same SELECT command we used in that section and then walks its way through the result set using an XML reader.
The information read is used to fill up a treeview control.

Figure 8-2: The application retrieves data from SQL Server using an explicit schema, reads the information through an XML reader, and populates a treeview control.

The following code illustrates how to extract information from the previously described schema and add nodes to the treeview. The ProcessXmlData routine has an extra Boolean argument used to specify whether you want the application's user interface to be generic. If the user interface is not generic, it makes assumptions about the structure of the XML data and attributes specific semantics to each element. If the user interface is generic, the sample application treats the data as a generic XML stream.

    void ProcessXmlData(XmlTextReader reader, bool bUseGenericMode)
    {
        // Clear the treeview
        dataTree.Nodes.Clear();
        dataTree.BeginUpdate();

        // Process elements
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                // Maintain a hash table of nodes at various
                // depths so that each element can figure out
                // what its parent is
                int depth = reader.Depth;
                int parentDepth = depth - 1;
                string text = "";

                if (m_ParentNodes.ContainsKey(parentDepth))
                {
                    TreeNode n = (TreeNode) m_ParentNodes[parentDepth];
                    text = PrepareOtherDataDisplayText(reader,
                        bUseGenericMode);
                    m_ParentNodes[depth] = n.Nodes.Add(text);
                }
                else
                {
                    // Only first-level nodes
                    text = PrepareEmployeeDisplayText(reader,
                        bUseGenericMode);
                    m_ParentNodes[depth] = dataTree.Nodes.Add(text);
                }
            }
        }
        dataTree.EndUpdate();
    }

Figure 8-3 shows the user interface in generic mode.
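The depth-based bookkeeping in ProcessXmlData can be tried without a treeview control. The sketch below is an illustration rather than the book's sample: OutlineNode and DepthTreeBuilder are hypothetical names, the input is an in-memory string instead of a SQL Server result, and XmlReader.Create assumes .NET Framework 2.0 or later. It keeps the latest node seen at each depth in a dictionary and attaches every new element to the node registered one level up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical stand-in for TreeNode: a label plus child nodes.
public class OutlineNode
{
    public string Text;
    public List<OutlineNode> Children = new List<OutlineNode>();
    public OutlineNode(string text) { Text = text; }
}

public static class DepthTreeBuilder
{
    public static List<OutlineNode> Build(string xmlFragment)
    {
        List<OutlineNode> roots = new List<OutlineNode>();
        // Latest node inserted at each depth, keyed by XmlReader.Depth.
        Dictionary<int, OutlineNode> latestAtDepth = new Dictionary<int, OutlineNode>();

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        using (XmlReader reader = XmlReader.Create(new StringReader(xmlFragment), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                    continue;

                OutlineNode node = new OutlineNode(reader.Name);
                OutlineNode parent;
                if (latestAtDepth.TryGetValue(reader.Depth - 1, out parent))
                    parent.Children.Add(node);   // nested element
                else
                    roots.Add(node);             // first-level element

                // Register this node as the parent for deeper elements.
                latestAtDepth[reader.Depth] = node;
            }
        }
        return roots;
    }

    private static void Print(OutlineNode node, int indent)
    {
        Console.WriteLine(new string(' ', indent * 2) + node.Text);
        foreach (OutlineNode child in node.Children)
            Print(child, indent + 1);
    }

    public static void Main()
    {
        string xml = "<Employee><Order/><Order><Detail/></Order></Employee><Employee/>";
        foreach (OutlineNode root in Build(xml))
            Print(root, 0);
    }
}
```

Because the reader visits nodes in document order, the entry stored for a given depth is always the most recent open ancestor candidate, which is why a single slot per depth is enough.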


Figure 8-3: The user interface of the application now shows only XML elements.

A quick comment regarding the algorithm used to populate this treeview object: I make use of a small hash table to keep track of the latest node inserted at a given level, as reported by the Depth property of the XML text reader. Each element that is expected to have a parent (that is, an element with a depth greater than 0) looks upward for a TreeNode object in the table and adds its description to the node. Next, the node itself registers as a parent node for its own level of depth.

Under the Hood of ExecuteXmlReader

Internally, the ExecuteXmlReader method first calls ExecuteReader and then creates a new instance of an XmlTextReader object. The XML reader is configured to work on an internal stream object whose class name is SqlStream. The SqlStream class represents the data stream that SQL Server uses to return rows to callers. The format of the SQL Server data stream is the TDS.

Note: The SqlStream class is defined internally to the System.Data assembly and is marked with the internal modifier. This keyword makes the class accessible only to the other classes defined in the same assembly. The Microsoft Visual Basic .NET counterpart to the internal keyword is Friend.

The following listing shows the pseudocode of ExecuteXmlReader. What happens under the lid of this method leads straight to the conclusion that the ability to execute a database command to XML can also be added to the OleDbCommand class as well as to the command classes in a number of other managed providers.
We'll examine this concept in more detail in a moment.

    public XmlReader ExecuteXmlReader()
    {
        // Execute the command
        SqlDataReader datareader = ExecuteReader();

        // Obtain the TDS stream for the command
        SqlStream tdsdata = new SqlStream(datareader);

        // Create the XML text reader
        // (No context information specified)
        XmlReader xmlreader = new XmlTextReader(tdsdata,
            XmlNodeType.Element, null);

        // Close the temporary data reader but leave the
        // stream open
        datareader.Close();
        return xmlreader;
    }

As long as the XML reader is open and in use, the underlying database connection remains open.

At the end of the day, the trick that makes it possible to access the result set as XML is simply the availability of the data through a stand-alone XML reader object. SQL Server 2000 transforms the contents of its low-level TDS stream into XML and then builds an XML text reader from that. The whole process takes place on the server.

Reading from Text Fields

The key requirement for XML readers working on top of SQL commands is that the commands return XML data. With SQL Server 2000, this certainly happens if you use any of the FOR XML clauses. It also happens if the query returns one or more rows that, in combination, can be seen as a unique XML stream.

Text or ntext fields that contain XML data can be selected and then processed using an XML text reader. (The ntext data type is a variable-length Unicode data type that can hold a maximum of 1,073,741,823 characters. An ntext column stores a 16-byte pointer in the data row, and the data is stored separately.) Of course, the query must include a single column and possibly a single record. Let's consider the following query from a modified version of the Northwind database.
I created the XmlNet database by duplicating the Northwind database's Employees table and then wrapping all the strings stored in the Notes column in a pair of XML tags. The Notes column is of type ntext.

    SELECT notes FROM employees

Although the SELECT command listed here does not explicitly return XML data, you can run it through the ExecuteXmlReader method, as shown here:

    string nwind = "DATABASE=xmlnet;SERVER=localhost;UID=sa;";
    string query = "SELECT notes FROM employees";
    SqlConnection conn = new SqlConnection(nwind);
    SqlCommand cmd = new SqlCommand(query, conn);
    conn.Open();
    XmlTextReader reader = (XmlTextReader) cmd.ExecuteXmlReader();
    ProcessNotes(reader);
    reader.Close();
    conn.Close();


The XML reader will loop through the nodes, moving from one record to the next, as shown here:

    void ProcessNotes(XmlTextReader reader)
    {
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                    MessageBox.Show(reader.Value);
            }
        }
        catch {}
        finally
        {
            MessageBox.Show("Closed...");
        }
    }

The connection remains open until the reader is closed. Alternatively, you can store the results in a string variable and use that string to create a new XmlTextReader object. (See Chapter 2.) This technique gives you an extra advantage: you can work with the reader while you are disconnected from the database.

An XML Reader for Data Readers

An XML reader can work on top of different data containers, including streams, files, and text readers. By writing a custom XML reader, you can also navigate non-XML data using the same XML reader metaphor. In this case, you create a virtual XML tree and make the underlying data look like XML. (In Chapter 2, you learned how to visit CSV files the XML way.)

The ability to expose result sets via XML is specific to SQL Server 2000 and potentially to any other native managed provider for DBMS systems with full support for XML queries. You can't, for example, use the ExecuteXmlReader method with an object of class OleDbCommand.

Recall from the section "Under the Hood of ExecuteXmlReader," on page 366, the internal structure of ExecuteXmlReader. The ExecuteXmlReader method simply creates an XML text reader based on the internal stream used to carry data back and forth.
What about creating a custom XML reader by building a virtual XML tree around the provider-specific data reader? In this way, you could easily extend any .NET Framework data provider with the equivalent of the ExecuteXmlReader method. This approach is not as effective as using the internal stream, but it does work and can be applied to all data providers.

Building the XML Data Reader

Let's rework the CSV reader example from Chapter 2 and build an XmlDataReader class inheriting from XmlReader, as follows:

    public class XmlDataReader : XmlReader
    {
        ...
    }

The base class is for the most part abstract, thus requiring you to override several methods and properties. When designing an XML reader, a key step is defining the XML virtual tree that the underlying data will populate. In this case, we'll try for a relatively simple XML schema that closely resembles the schema of the FOR XML RAW mode, as shown here:

    ...

The XmlDataReader class features only one constructor, which takes any object that implements the IDataReader interface. The programming interface of a data reader object like OleDbDataReader and SqlDataReader consists of two distinct groups of functions: the IDataReader and IDataRecord interfaces. The former includes basic methods such as Read, Close, and GetSchemaTable. The latter contains specific reading methods including GetName, GetValue, and the Item indexer property.

By making the constructor accept a reference to the IDataReader interface, you enable the XmlDataReader class to support any data reader object. Internally, the class defines the following protected members:

    protected IDataReader m_dataReader;
    protected IDataRecord m_dataRecord;
    protected ReadState m_readState;
    protected int m_currentAttributeIndex;

The idea is to map the reading methods of the XmlDataReader class to the data reader object and use the m_currentAttributeIndex member to track down the currently selected attribute, as shown in the following code.
Of course, each XML attribute corresponds to a column in the underlying result set.

    public XmlDataReader(IDataReader dr)
    {
        m_dataReader = dr;
        m_readState = ReadState.Initial;
        m_dataRecord = (IDataRecord) dr;
        m_currentAttributeIndex = -1;
    }

Notice that the same object is passed as a reference to IDataReader but can also be cast to IDataRecord. This is possible as long as the real object implements both interfaces, but for data reader objects this is true by design.

The XmlDataReader Implementation

Let's review the implementation of a few properties and methods to grasp the essence of the reader, as shown in the following code. The entire source code is available for download in this book's sample files.

    // Return the number of attributes (that is, the field count)
    public override int AttributeCount
    {
        get {return m_dataRecord.FieldCount;}
    }

    // Indexer property that works by index and name
    public override string this[int i]
    {
        get {return m_dataRecord.GetValue(i).ToString();}
    }
    public override string this[string name]
    {
        get {return m_dataRecord[name].ToString();}
    }

    // Return the value of the current attribute
    public override string Value
    {
        get {
            if (m_readState != ReadState.Interactive)
                return "";

            string buf = "";
            if (NodeType == XmlNodeType.Attribute)
                buf = this[m_currentAttributeIndex].ToString();
            return buf;
        }
    }

The Read method calls into the Read method of the data reader and updates its state accordingly, as shown in the following code. The Close method closes the data reader and resets the internal state.

    public override bool Read()
    {
        // Read the new row and set the state
        bool canReadMore = m_dataReader.Read();
        m_readState = (canReadMore ? ReadState.Interactive :
            ReadState.EndOfFile);
        return canReadMore;
    }


    public override void Close()
    {
        m_dataReader.Close();
        m_readState = ReadState.Closed;
    }

The XML data reader object can work atop any provider-specific data reader, thus providing a free XML transformation service that is functionally equivalent to ExecuteXmlReader. The so-called XML transformation takes place on the client, but the connection with the database remains open until you close the reader.

Note: A custom XML reader does not really transform rows into XML schemas. The XmlDataReader object simply causes a data record to look like an XML fragment. You can derive new classes from XmlDataReader to support more complex XML schemas. For such simple XML layouts at least, this approach is even slightly more efficient than using FOR XML. Both solutions use an underlying data reader and expose an XML reader, but XmlDataReader requires no server-side rowset-to-XML transformation.

Using XML with OLE DB Data Providers

Let's see how to use the XmlDataReader class with an instance of the OLE DB data reader. As usual, you create an OleDbCommand object, execute the command, and get a living instance of the OleDbDataReader class.
Next you pass the OLE DB data reader to the XmlDataReader constructor, as shown here:

    string nwind, query;
    nwind = "PROVIDER=sqloledb;SERVER=localhost;" +
            "DATABASE=northwind;UID=sa;";
    query = "SELECT employeeid, firstname, lastname," +
            " title FROM employees";
    OleDbConnection conn = new OleDbConnection(nwind);
    OleDbCommand cmd = new OleDbCommand(query, conn);

    // Create the XML data reader
    conn.Open();
    OleDbDataReader dr = cmd.ExecuteReader();
    XmlDataReader reader = new XmlDataReader(dr);
    ProcessDataReader(reader);
    reader.Close();
    conn.Close();

The reader can be used on demand to walk through the contents of the result set, as shown here:

    private void ProcessDataReader(XmlReader reader)
    {
        ResultsListBox.Items.Clear();
        while (reader.Read())
            ResultsListBox.Items.Add(reader.ReadOuterXml());
        reader.Close();
    }

This code generates the output shown in Figure 8-4.

Figure 8-4: FOR XML RAW output obtained using the XmlDataReader class and an OLE DB data provider.

A Disconnected XML Data Reader

By design, a data reader object works while connected, and so do any XML readers you might build on top of it. However, the .NET Framework provides a class that has the ability to expose a disconnected set of rows (a DataSet object) as XML. The DataSet object is designed as a disconnected object with no relationship to any living instance of a DBMS. The XmlDataDocument class takes a DataSet object and transforms it into an XML DOM object, that is, the XmlDocument class we analyzed in Chapter 5. In a nutshell, the XmlDataDocument class provides a client-side XML DOM representation of a disconnected set of rows. Let's see how.

The XmlDataDocument Class

The XmlDataDocument class inherits from XmlDocument, and although it is defined in the System.Data assembly, it belongs to the System.Xml namespace. A combined use of the XmlDataDocument class and the DataSet class provides access to the same data using two otherwise alternative approaches: relational and hierarchical. When a DataSet class and an XmlDataDocument class are synchronized, they work on the same set of data and detect each other's changes in real time.

The XmlDataDocument class has a DataSet property that is bound to the related DataSet object. The class does not duplicate the DataSet contents but simply holds a reference to the object.
When the DataSet property is set, the XmlDataDocument registers a listener module for each DataSet event that indicates a change in the data. By hooking the events, the XmlDataDocument class can stay in sync with the DataSet contents.

Event hooking also works the other way around. In Chapter 5, we saw that whenever an application changes the contents of the XML DOM, a NodeChanged event fires. The XmlDataDocument class registers an event handler for NodeChanged and passes the changes down to the referenced DataSet object.
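The synchronization described above can be observed with a small in-memory DataSet. The following is a sketch under stated assumptions rather than the book's sample: the SyncDemo class, the table layout, and the XPath expression are made up for illustration, and only the DataSet-to-DOM direction is exercised. Note also that recent versions of the .NET Framework flag XmlDataDocument as obsolete, so a compiler warning is to be expected.

```csharp
using System;
using System.Data;
using System.Xml;

public static class SyncDemo
{
    // Returns the LastName value as seen through the DOM,
    // before and after the row is edited through the DataSet.
    public static string[] EditThroughDataSet()
    {
        // Hypothetical one-table DataSet with a single row.
        DataSet data = new DataSet("Northwind");
        DataTable employees = data.Tables.Add("Employees");
        employees.Columns.Add("EmployeeID", typeof(int));
        employees.Columns.Add("LastName", typeof(string));
        employees.Rows.Add(1, "Davolio");

        // The document holds a reference to the DataSet, not a copy.
        XmlDataDocument doc = new XmlDataDocument(data);
        string xpath = "/Northwind/Employees[EmployeeID=1]/LastName";

        string before = doc.SelectSingleNode(xpath).InnerText;

        // Change the relational view; the hierarchical view should follow.
        employees.Rows[0]["LastName"] = "Fuller";
        string after = doc.SelectSingleNode(xpath).InnerText;

        return new string[] { before, after };
    }

    public static void Main()
    {
        string[] values = EditThroughDataSet();
        Console.WriteLine("Before: " + values[0]);
        Console.WriteLine("After:  " + values[1]);
    }
}
```

Editing in the opposite direction, through the DOM, is subject to additional constraints on the DataSet and is left out of this sketch.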


Synchronizing with a DataSet Object

You can synchronize a DataSet object with an XmlDataDocument object in various ways. For example, you can start by populating a DataSet object with schema and data and then pass it on to a new XmlDataDocument object, as shown here:

    DataSet data = new DataSet();
    // Populate the DataSet with schema and data
    XmlDataDocument dataDoc = new XmlDataDocument(data);

In this case, the XML DOM object is created from the relational data. Alternatively, you can set up the DataSet object with schema only, associate it with the XmlDataDocument class, and then populate the XML DOM object with XML data, as shown in the following code. In this way, the DataSet object is filled with hierarchical data.

    DataSet data = new DataSet();
    // Populate the DataSet only with schema information
    XmlDataDocument dataDoc = new XmlDataDocument(data);
    dataDoc.Load(xmlfile);

Note that an exception is thrown if you attempt to load an XmlDataDocument object synchronized with a DataSet object that contains data.

You can take a third route. You can instantiate and load an XmlDataDocument object and then extract the corresponding DataSet object from it, as shown here:

    XmlDataDocument dataDoc = new XmlDataDocument();
    DataSet data = dataDoc.DataSet;
    // Add schema information to the DataSet
    dataDoc.Load(xmlfile);

In this case, no DataSet object is explicitly passed in by the user. The default constructor creates an empty DataSet object anyway that is then filled when the XmlDataDocument object is loaded.
A client application can get a reference to the internal DataSet object by using the DataSet property.

An important issue to consider is that the DataSet object can't be filled if no schema information has been set. You can manually create tables and columns in the DataSet object or read the information from an XML stream using the ReadXmlSchema method. (More on this topic in Chapter 9.)

XML Data Fidelity

To fill a DataSet object with XML data, you can use one of two methods. The first method is to use the DataSet object's ReadXml method (see Chapter 9). The second method is to load the data as XML into an instance of the XmlDataDocument class, and then use the XmlDataDocument.DataSet property to access the filled DataSet object. The two approaches differ significantly in terms of data fidelity.

When ReadXml is used and the data is written back as XML, all extra XML information such as white spaces, processing instructions, and CDATA sections is irreversibly lost. This happens because the DataSet relational format simply does not know how to handle information that is meaningful only to the hierarchical model.

When the DataSet object is filled using an XML document loaded into XmlDataDocument, the DataSet object still contains a simplified and adapted representation of the hierarchical contents but the original XML document is preserved intact.
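The fidelity difference can be sketched by round-tripping the same small document through both routes. This is an illustration under assumptions, not the book's code: FidelityDemo and the Source string are hypothetical, a comment is used as the "extra" XML information, and the schema is inferred from the document itself through InferXmlSchema because the DataSet needs schema information before XmlDataDocument.Load is called.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

public static class FidelityDemo
{
    // Hypothetical document carrying a comment that has no relational meaning.
    public const string Source =
        "<Northwind><!-- exported --><Employees><LastName>Davolio</LastName></Employees></Northwind>";

    // Route 1: load with DataSet.ReadXml and write the data back as XML.
    public static string RoundTripViaReadXml(string xml)
    {
        DataSet data = new DataSet();
        data.ReadXml(new StringReader(xml));
        return data.GetXml();
    }

    // Route 2: load with XmlDataDocument and serialize the DOM.
    public static string RoundTripViaXmlDataDocument(string xml)
    {
        XmlDataDocument doc = new XmlDataDocument();
        // Give the internal DataSet a schema before loading any data.
        doc.DataSet.InferXmlSchema(new StringReader(xml), new string[0]);
        doc.Load(new StringReader(xml));
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(RoundTripViaReadXml(Source));
        Console.WriteLine(RoundTripViaXmlDataDocument(Source));
    }
}
```

In the first route the comment has nowhere to live in the relational model, so it should be absent from the output. In the second route the DOM keeps the comment node while the DataSet sees only the row data.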


Nested Data Relations

If the DataSet object to be synchronized with an XmlDataDocument object contains one or more relations (instances of the DataRelation object), you should set the Nested property of the DataRelation object to true. In this way, the child rows of the relation will be nested within the parent column when written as XML data or synchronized with an XmlDataDocument object. By default, the Nested property of the DataRelation object is false.

Reading Data as XML

Representing a DataSet object with an instance of the XmlDataDocument class allows you to use XPath expressions to select data. In general, using XPath queries to select XML data makes sense especially if you have XML DOM data disconnected and stored in memory, that is, if you use XmlDataDocument. In doing so, you actually work on an XML DOM object and don't in any way tax the database. Pay attention when using this technique in Microsoft ASP.NET applications. In this case, the client lives on the Web server, and you end up occupying the Web server's memory with potential hits on the overall performance and scalability.

Using XPath to query XML representations of data relationally stored in SQL Server (for example, annotated schemas) seems to be a rather twisted and ineffective way to execute queries. The query engine of SQL Server outperforms the XPath query engine, not to mention that to run slower queries, you still have to pay the price of transforming relational data into XML.

Reading database contents as XML makes sense only if you need to represent that information in an intermediate format for further transformations and processing. Currently, the best approach is still relying on FOR XML using the EXPLICIT operator if you need complex schemas. SQL Server 2000 supports XDR schemas, and to use XSD, you should resort to SQLXML 3.0. Unfortunately, SQLXML 3.0 relies on the OLE DB provider for data access and is not recommended for .NET Framework applications. If you find the FOR XML EXPLICIT syntax too quirky, look ahead to the discussion of .NET Framework XML serialization in Chapter 11.

Writing XML Data to Databases

So much for reading database contents as XML. Now let's review the options available for persisting data to relational DBMS systems using XML representations of data.
SQL Server 2000 supports three basic ways for expressing database changes using XML: OPENXML, XML bulk loading, and Updategrams.

OPENXML is a SQL Server 2000 keyword that represents a rowset provider such as a table or a view. The net effect of OPENXML is not really different from that of another relatively popular T-SQL keyword, OPENROWSET. The OPENROWSET keyword represents an alternative to accessing tables in a linked server and an ad hoc method of accessing data using any OLE DB providers. Both keywords can be referenced as if they were actual table names in the FROM clause of a query and in an INSERT or UPDATE command. The difference between the two keywords is that OPENXML renders the contents of an XML file as a rowset, whereas OPENROWSET does the same with the results of an OLE DB query.

XML bulk loading is a technique that lets you load semistructured XML data into SQL Server tables. Functionally similar to OPENXML, bulk loading is implemented through a COM object and provides higher performance when large amounts of XML data must be processed.

Finally, Updategrams are an XML description of the changes that must be applied to the database. Updategrams are a syntax that applies to an annotated XML view to denote insertions, deletions, and updates. The mapping schema of the XML view contains the necessary information to map XML elements and attributes to tables and columns in the database. From a .NET Framework perspective, Updategrams look a lot like DiffGrams. In SQL Server 2000, however, Updategrams are the native XML language to denote database changes.

The OPENXML Rowset Provider

OPENXML is a T-SQL function that takes care of inserting data represented as an XML document. OPENXML parses the contents of the XML document and exposes it as a rowset. As a result, the records in the rowset can be stored in database tables. OPENXML is not a write-only keyword that you can use only with INSERT or UPDATE. Because it is a generic rowset provider, you can use it with statements such as SELECT and SELECT INTO, and in general wherever a source table or view is accepted.

OPENXML takes up to three arguments, as shown here:

OPENXML (handle, rowpattern [, flags])
[WITH (SchemaDeclaration | TableName)]

The first argument (handle) is the handle of the internal representation of an XML document. The document handle is created by the sp_xml_preparedocument system stored procedure. The rowpattern argument is the XPath expression that selects the nodes in the source XML that must be processed as database rows.

The flags argument is optional and, if specified, indicates how attributes and elements in the selected nodes should be processed.
By default, the mapping is attribute-centric (a flags value of 1). Attribute-centric mapping accepts input values only from the attributes of the selected nodes. The mapping between attributes and columns is determined by name. Alternatively, you can specify element-centric mapping (a value of 2). Element-centric mapping is similar to attribute-centric mapping except for the fact that it accepts input values from the text of child element nodes.

Caution: You could also opt for mixed mapping (a value of 3) by combining attribute-centric and element-centric mapping. In this case, attribute-centric mapping is applied first, and then for all still unmatched columns, an element-centric mapping is applied. You should use this feature only when absolutely necessary. Using a double flag can significantly slow performance.

The WITH clause is optional and can be used to define the schema of the target table. If a table with the desired schema already exists, you simply indicate the table name. This is what commonly happens when you use OPENXML to write data. When you use OPENXML with a SELECT statement, you can specify the schema of the columns being returned. (More details on the syntax of OPENXML can be found in SQL Server 2000 Books Online.)

OPENXML in Action

The first step in using OPENXML is calling the sp_xml_preparedocument stored procedure to parse the XML document.
The stored procedure returns a tree representation of the nodes in the XML document, and this in-memory image becomes the input for OPENXML. The stored procedure returns the handle of the document as an output parameter. Here's an example of how to use OPENXML:

DECLARE @handle int
EXEC sp_xml_preparedocument @handle OUTPUT,
N'[XML source document not preserved]'
INSERT Employees
SELECT * FROM OPENXML(@handle, N'/ROOT/Employees') WITH Employees
EXEC sp_xml_removedocument @handle

This code adds a couple of records to the Employees table in the Northwind database. Notice that the XPath expression selects all the <Employees> nodes in the source document.

The sp_xml_removedocument stored procedure removes the internal representation of the specified XML document that was previously built by sp_xml_preparedocument. If not explicitly invalidated, the handle of the document is valid for the duration of the connection to SQL Server.

Threshold and Performance

OPENXML uses the Microsoft XML Core Services (MSXML) COM parser to build a binary representation of the source document. Next it performs some XPath queries to select the proper node-set to be processed to build the physical rowset to interface with SQL Server.

In general, you should avoid using XPath beyond a certain threshold. If you realize that your code is relying on XPath for complex queries that run often, you are probably using the wrong tool to address your needs. A temporary relational table would probably serve you better.

A parsed document is stored in the internal cache of SQL Server 2000. The memory that the MSXML parser can use to generate binary images of the source XML can reach up to one-eighth of the total memory available to SQL Server. To avoid running out of memory, free up binary images as soon as document handles go out of scope by using sp_xml_removedocument.
Be sure to use the stored procedure in a timely manner, however. If you free up memory that will be used later, SQL Server can only reparse the source document, which is probably worse than occupying more memory. To be on the safe side, keep the number of documents in memory under control, and don't forget to call sp_xml_removedocument.

Keep in mind that OPENXML has been designed and optimized to handle documents up to 50 KB in size. Over that threshold, constantly monitor the response time, and decide whether you can continue with OPENXML or need something different, like XML bulk loading.

XML Bulk Loading

XML Bulk Load is a COM component available for SQL Server 2000 that reads data out of an XML file and, according to an XDR or XSD schema, copies the data into database tables and columns. Unlike OPENXML, XML bulk loading is optimized to work with large quantities of data.

The bulk loader reads the XML data as a stream. Step by step, it identifies the database tables and columns involved and prepares and executes SQL statements against SQL Server. When the bulk loader encounters an XML element, it uses the schema information to associate the element with a record in a table. The record is actually written when the end tag for that element is found. This algorithm ensures that in the case of parent-child relationships, all the children are processed before the parent row.

Transacted Loading

Unlike the T-SQL BULK INSERT statement, XML bulk loading is a sort of add-on. Because XML bulk loading is not natively part of SQL Server 2000, it never runs within an implicit transaction, as normally happens with T-SQL statements. As a result, you must manage transactions yourself. On the other hand, bulk loading is the kind of operation that sometimes does need to run in a transacted context.

It goes without saying that if you can afford to run bulk loading without transactions, doing so would be greatly beneficial to the overall performance of the application. Nontransacted loading makes a lot of sense when you have to fill up empty databases. In a transactionless scenario, you lose the ability to roll back changes, but because your databases were originally empty, if something goes wrong, you can clear the database and start over.

Note: In nontransacted mode, XML bulk loading takes advantage of the methods of the OLE DB IRowsetFastLoad interface to do the job. Not all OLE DB providers supply the IRowsetFastLoad interface, but the SQLOLEDB provider does.

When XML bulk loading works in transacted mode, the component creates a temporary file for each table involved in the operation. The files will gather all the changes for the tables. When a commit occurs, the contents of the various files are flushed into the corresponding SQL Server table using the BULK INSERT statement.

XML Bulk Loading in Action

Let's see how XML bulk loading really works. As mentioned, XML bulk loading is implemented through a COM object whose progID attribute is SQLXMLBulkLoad. The following Visual Basic 6.0 code shows how to use the object:

conn = "PROVIDER=sqloledb;SERVER=localhost;" & _
       "database=Northwind;UID=sa"
Set bulk = CreateObject("SQLXMLBulkLoad.SQLXMLBulkload.3.0")
bulk.ConnectionString = conn
bulk.Execute "schema.xml", "data.xml"

To perform bulk loading, you set the connection string and then call the Execute method. The method takes two arguments: the schema and the XML source data. Inline schemas are ignored, as are schema files referenced in the source file. As a result, you must always supply schema information and data through distinct XML files.

Finally, note that XML documents are checked for well-formedness, but their contents are never validated against any schema. Any contents outside the root node of the document are simply discarded.

The following listing shows a typical source for a bulk loading operation. It adds a couple of employees, each with a few related territories.


[Listing: XML markup not preserved. The surviving values describe two employees: 991, Dino Esposito, Roma; and 992, Francesco Esposito, Roma.]

The schema that would make it possible for the bulk loader to interpret and process this information is shown here:


This schema first defines a relationship between the Employees table and the EmployeeTerritories table. The relationship is based on the common field EmployeeID. Next the schema describes the elements and the attributes that form the data source. The sql:relation annotation identifies the source table, whereas sql:relationship points to the relationship.

Bulk Loading in .NET Framework Applications

As you've probably noticed, very little about XML bulk loading is specifically related to the .NET Framework world. XML bulk loading and, more generally, a lot of SQLXML 3.0 features are still based on COM. This means a couple of things. First, the only way you can take advantage of such features is through the .NET Framework COM interop layer. (COM interop allows COM clients to access .NET objects and .NET code to access COM objects.) Be aware that, although highly optimized, the performance of COM interop services isn't the same as you get by calling managed code. If you have no alternative, you should use COM interop services; otherwise, choose a more .NET Framework-specific approach.

XML bulk loading can't be directly invoked from within managed code. Managed code must yield to COM code to do the job. As of SQLXML 3.0 SP1, the COM object that provides XML bulk loading is named xblkld3.dll and is normally located under the following path: C:\Program Files\Common Files\System\Ole DB. You can use either Microsoft Visual Studio .NET or the tlbimp.exe command-line utility to generate a .NET Framework wrapper class.

The Updategram Template

An Updategram is an XML file that contains information about the changes that must be entered in one or more database tables. In addition to incoming changes, the Updategram can also contain optional mapping information to better associate elements in the XML source with columns in the database.

Below the root tag, an Updategram can have one or more <sync> blocks. Each of these blocks can contain one or more pairs of <before> and <after> blocks. Using <before> and <after> blocks, you can specify the new expected state of the source. If a record exists only in the <before> block, a DELETE operation is performed. If the record appears only in the <after> block, an INSERT operation occurs. If the record appears in both blocks, an UPDATE statement is run. Records that do not appear in either block are left intact.

Structure of an Updategram

The schema of an Updategram is illustrated here:


...

The contents of each <sync> block represent an atomic unit of processing for which the Updategram guarantees a transactional behavior: either all the changes take effect or none do. You can use different pairs of <before> and <after> blocks to group changes that must be executed in a certain order.

All the keywords in an Updategram are defined in the namespace urn:schemas-microsoft-com:xml-updategram. The namespace must be associated with each Updategram, although with arbitrary prefixes, as in the following example:

By default, the Updategram maps any first-level element below the <before> and <after> blocks to a table of the same name in the current database. Any attribute in that node is implicitly mapped to columns in that table. For example, in the preceding sample script, the Updategram would work on the Customers table, removing the row with a customerid attribute of 999 and replacing it with a new row with a customerid attribute of 1999.

You can specify a mapping schema using the mapping-schema attribute, as shown in the following code. The attribute references an XML file (typically an XDR or XSD file) that describes the nature of the mapping in much the same way as described earlier for XML bulk loading.
(See the section "XML Bulk Loading in Action.")

Note: The schema for XML bulk loading does not recognize the sql:identity annotation to flag identity auto-increment columns, which means that XML bulk loading is unable to handle tables with this feature. On the other hand, Updategrams handle identity columns nicely. You simply annotate the column in the schema and set the sql:identity attribute to Ignore if you need to rely on the SQL Server-generated values or to useValue if a user-provided value should be used instead.

NULL values also require special handling. In practice, you declare an alternative text-based representation for NULL values and use that throughout the Updategram. The nullvalue attribute indicates the alternative text, as shown here:

...

Submitting Commands Through Updategrams

Updategrams can be executed in various ways. You can send the Updategram text to SQL Server over HTTP. Alternatively, you can write the XML contents out to a file and then point the browser (or any other HTTP-enabled software) to that URL so that the contents are executed. Or you can use an Updategram with ADO.

The following Visual Basic 6.0 code shows how to proceed. Notice that you must copy the Updategram to a stream and receive the response over another stream object.

Dim cmd As New ADODB.Command
Dim conn As New ADODB.Connection
Dim strIn As New ADODB.Stream
Dim strOut As New ADODB.Stream
conn.Provider = "SQLOLEDB"
conn.Open "SERVER=localhost;DATABASE=northwind;UID=sa;"
conn.Properties("SQLXML Version") = "SQLXML.3.0"
Set cmd.ActiveConnection = conn
cmd.Dialect = "{5d531cb2-e6ed-11d2-b252-00c04f681b71}"
' SQLxml is a string variable that holds the Updategram text
strIn.Open
strIn.WriteText SQLxml
strIn.Position = 0
Set cmd.CommandStream = strIn
strOut.Open
cmd.Properties("Output Stream").Value = strOut
cmd.Properties("Output Encoding").Value = "UTF-8"
cmd.Execute , , adExecuteStream

Notice also that you need to set the command dialect to a particular globally unique identifier (GUID), DBGUID_MSSQLXML, and set a few properties on the command and the connection objects.

Concurrency Issues

Updategrams are batches that work by looping on source data and executing a sequence of commands. What happens if, due to the system concurrency, rows that you are going to modify have been changed since the time you last read them?


Updategrams have been designed to provide three levels of protection against this kind of conflict, as follows:

• Blind updates: You specify only the primary key of the record in the <before> block. In this case, the change is persisted without first checking whether the current status of the record is consistent with the expected one.
• Partial conflict detection: The <before> block contains the primary key as well as any other field you plan to update. When the Updategram executes, the change is applied only if the specified fields haven't been changed in the meantime.
• Total conflict detection: All the columns in the row are checked, and the change fails if any of them has been modified. You can obtain this form of protection either by listing all the fields in the <before> block or by using the table timestamp column, if one exists. A timestamp column will be updated whenever a user writes something to the row.

Updategrams and DiffGrams

If you're familiar with ADO.NET, you'll no doubt notice a close similarity, both conceptual and physical, between Updategrams and DiffGrams. Although ADO.NET DiffGrams are a newer format, and perhaps the format of the future, SQL Server 2000 currently supports only Updategrams natively.

In the section "SQLXML Managed Classes," we'll take a quick tour of the managed classes in SQLXML 3.0. You'll notice that some of these classes apparently enable you to send DiffGrams to SQL Server. Although this is possible, the actual implementation is not particularly effective. The source DiffGram is in fact internally transformed into an Updategram and then processed by SQL Server.

Apart from the patent similarity in their schemas, Updategrams and DiffGrams have slightly different goals. Updategrams have been designed to update SQL Server; DiffGrams are mostly a stateful way to persist the contents of a DataSet object. (ADO.NET DiffGrams are covered in Chapter 10.) Converting DiffGrams to Updategrams is certainly possible at the schema level, but Updategrams are unquestionably more powerful objects. Together with SQLXML 3.0 and the SQL Server XML extensions, Updategrams let you control concurrency, control the order of updates, perform transactional updates, and specify parameters.

On the other hand, there is not yet a .NET Framework class that works like an Updategram. (And SQLXML 3.0 is still a hybrid, half COM and half managed code.) Most of the batch update features you find in Updategrams can be implemented in ADO.NET using the DataSet object's Update method and the provider-specific data adapter object. Nothing comes for free, though, and you must write a lot of code to emulate Updategrams in the .NET Framework.

XML Batch Update

In ADO.NET, as well as in ADO, you can persist the changes made to a set of records stored in memory using a procedure called a batch update. This procedure consists of a loop that looks for changed records in the DataSet object (or the Recordset object in ADO) and issues a command to the back-end database.
From the programmer's perspective, a batch update is ideal for working in disconnected scenarios. In ADO.NET, although it is not yet perfect, it has been significantly improved and made applicable to real-world usage.

Thanks to the ADO.NET XML serialization mechanism (see Chapter 9), you can load a DataSet object from XML data, enter the needed changes, and then proceed with the batch update. The ADO.NET DiffGram is one of the possible XML representations for a DataSet object. Although, all in all, the Updategram is a more powerful and richer object for XML-driven updates, an ADO.NET batch update is still an option to consider when you're updating a database starting with XML data.

The ADO.NET batch update is a step-by-step procedure implemented through a sequence of individual statements, all running from the client environment. Once again, this is different from Updategrams, in which all data is downloaded to SQL Server and applied as a server-side batch.

The closest you can get to this model with ADO.NET is using a data-tier component that decouples any middle-tier objects from the database. The middle-tier object applies all the needed changes to the DataSet object and then passes the object on to another component, possibly located on the same machine as SQL Server. The DataSet object is remoted as XML and is rebuilt at the destination. Finally, the changes are applied in batch update mode but through a specialized and scalable data-tier component and with a more effective use of the bandwidth.

SQLXML Managed Classes

SQLXML 3.0 comes with a handful of managed classes designed to expose the functionality of SQLXML 3.0 inside the .NET Framework. SQLXML managed classes allow you to bring XML data read from SQL Server into .NET Framework applications, process the data, and send any updates back to SQL Server as an ADO.NET DiffGram. The managed classes are exposed by the microsoft.data.sqlxml assembly.

SQLXML does not get along perfectly with the .NET Framework data provider for SQL Server.
SQLXML needs to address special XML-driven functionalities of SQL Server 2000 that the .NET Framework data provider simply does not support. As a result, the SQL Server .NET Framework provider can handle traditional SQL queries, including FOR XML queries, but it can't execute XML templates (for example, Updategrams) or server-side XPath queries over XML views. For this reason, SQLXML managed classes rely on the SQLXMLOLEDB OLE DB provider for all of the tasks that involve a SQL Server connection.

Figure 8-5 illustrates the key role that the SqlXmlCommand class and its ExecuteStream method play in the overall SQLXML 3.0 architecture.
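Because the managed classes send updates back as an ADO.NET DiffGram, it can help to see what a DiffGram contains before involving SQLXML at all. The following minimal sketch uses only the standard DataSet API and no SQLXML class; the DiffGramSketch class, the Customers table, and its values are hypothetical and chosen for illustration. It modifies one row and serializes the pending change in DiffGram format.

```csharp
using System;
using System.Data;
using System.IO;

public static class DiffGramSketch
{
    public static string ModifiedRowAsDiffGram()
    {
        // Hypothetical in-memory table standing in for data read from SQL Server.
        DataSet data = new DataSet("NorthwindSketch");
        DataTable customers = data.Tables.Add("Customers");
        customers.Columns.Add("customerid", typeof(int));
        customers.Columns.Add("city", typeof(string));
        customers.Rows.Add(999, "Roma");

        // Mark the current contents as the original, unchanged state.
        data.AcceptChanges();

        // Enter a pending change; the row now holds both an original
        // and a current version.
        customers.Rows[0]["city"] = "Milano";

        // Serialize current and original values together as a DiffGram.
        StringWriter writer = new StringWriter();
        data.WriteXml(writer, XmlWriteMode.DiffGram);
        return writer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ModifiedRowAsDiffGram());
    }
}
```

The output pairs the current row with its original values in a separate before section, which is the same before-and-after idea that an Updategram expresses with its own blocks.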


Figure 8-5: SQLXML managed classes go to SQL Server 2000 using the ExecuteStream method of the SqlXmlCommand class and the SQLXMLOLEDB OLE DB provider.

The set of SQLXML managed classes consists of two main classes, SqlXmlCommand and SqlXmlAdapter, plus a few ancillary classes like SqlXmlParameter and SqlXmlException. SqlXmlCommand is the fundamental class used to execute an XML-driven command against SQL Server. The SqlXmlAdapter class is actually a wrapper for the command that simply exposes the results through a DataSet object.

The SqlXmlCommand Class

The SqlXmlCommand class represents any XML command you can send to SQL Server 2000. As mentioned, you should use this class only to issue those XML-related commands that the .NET Framework data provider for SQL Server does not natively support. Because the class relies on an OLE DB provider, overusing it from within a .NET Framework application is rather inefficient.

Do not use SqlXmlCommand to execute a simple FOR XML query, but take it into account when you need to work with Updategrams, server-side XPath queries (assuming that an XPath query makes sense at all in the context of the application), or XML views.

SqlXmlCommand Properties

The properties available in the SqlXmlCommand class let you configure the query. Unlike most ADO.NET command classes, the SqlXmlCommand class provides a command stream property that applications can use to pass potentially lengthy input data such as Updategrams. Table 8-2 summarizes the properties of the SqlXmlCommand class.


Table 8-2: Properties of the SqlXmlCommand Class

| Property | Description |
|---|---|
| BasePath | Gets or sets the base path used to resolve an XSL file (XslPath property), a mapping schema file (SchemaPath property), or any other external schema reference in an XML template. |
| ClientSideXml | Boolean property; indicates that the conversion of the rowset to XML should occur on the client instead of on the server. |
| CommandStream | Gets or sets the input stream for the command. Use this property to execute a command from a file (for example, a template or an Updategram). CommandStream and CommandText are mutually exclusive; if you set CommandStream, CommandText is automatically set to null. |
| CommandText | Gets or sets the text of the command to execute. CommandText and CommandStream are mutually exclusive; if you set CommandText, CommandStream is automatically set to null. |
| CommandType | Identifies the type of the command you want to execute. Feasible values are defined in the SqlXmlCommandType enumeration. |
| Namespaces | Enables the execution of XPath queries that use namespaces. |
| OutputEncoding | Specifies the encoding for the stream that is returned when the command executes. UTF-8 is the default encoding. |
| RootTag | Gets or sets the name of the root element for XML generated by command execution. |
| SchemaPath | Gets or sets the name of the mapping schema for XPath queries. The path can be absolute or relative. If relative, the BasePath property is used to resolve the path. |
| XslPath | Gets or sets the name of the XSL file to use for XML data transformations. The path can be absolute or relative. |

Streams play a key role in the SqlXmlCommand class.
Not only can you use a stream to specify the input of a command, but you can also pick up the results of the command from an output stream. You can also control the encoding of this output stream. For a better understanding of these properties, review the ADO example about Updategrams in the section "Submitting Commands Through Updategrams," on page 383.

Supported Command Types

The SqlXmlCommand class can execute a variety of commands. The allowable command types are defined in the SqlXmlCommandType enumeration and are shown in Table 8-3.


Table 8-3: Command Types

| Type | Description |
|---|---|
| DiffGram | Executes an ADO.NET DiffGram. |
| Sql | Executes an ordinary SQL command that returns XML. The default setting. |
| Template | Executes an XML template (for example, creates an XPath-driven view). The command text is specified via the command input stream. |
| TemplateFile | Executes an XML template via the specified file. The name of the file is set through the CommandText property. |
| UpdateGram | Executes an Updategram. |
| XPath | Executes an XPath command. |

A template is an XML document that contains T-SQL commands wrapped in ad hoc XML elements, as shown here:

    <ROOT xmlns:sql="urn:schemas-microsoft-com:xml-sql">
      <sql:query>
        SELECT * FROM Employees FOR XML AUTO
      </sql:query>
    </ROOT>

The template specifies a sequence of commands to produce a particular result set. Overall, a template is a dynamically defined stored procedure expressed using XML syntax and supporting XPath queries.

SqlXmlCommand Methods

On instantiation, the SqlXmlCommand class creates an instance of the SQLXMLOLEDB provider. Interestingly, it does not make use of an explicit wrapper assembly but instead gets a COM object type using the static method GetTypeFromCLSID from the Type class. Next it instantiates the COM object using the Activator class.

Note: The Activator class contains methods to create types of objects locally or remotely, or obtain references to existing remote objects. Functionally equivalent to the new operator, Activator enables you to create instances of objects whose type is passed as an argument.
With Activator, you can sometimes experience difficulties addressing a particular parameter-rich constructor. The Activator object will be covered in detail in Chapter 12.

The methods provided by the SqlXmlCommand class are described in Table 8-4.

Table 8-4: Methods of the SqlXmlCommand Class

| Method | Description |
|---|---|
| CreateParameter | Creates a SqlXmlParameter object that represents a parameter for the command |
| ClearParameters | Clears the parameters that were created for the command |
| ExecuteNonQuery | Executes the command but does not return anything |
| ExecuteStream | Executes the command and returns a new Stream object |
| ExecuteToStream | Executes the command and writes the query results to the specified existing stream |
| ExecuteXmlReader | Executes the command and returns an XmlReader object |

ExecuteStream is the key method in the interface in the sense that all other execute methods fall back internally to it. In particular, ExecuteNonQuery merely wraps a call to ExecuteStream, whereas ExecuteXmlReader creates and returns an XmlTextReader object built using the stream obtained from ExecuteStream.

ExecuteToStream does not use ExecuteStream internally, but the two methods have a similar architecture and use the same internal worker method. Basically, ExecuteStream calls an internal executor and sets it to work on a memory stream. The memory stream (MemoryStream class) is then returned as a generic Stream object. ExecuteToStream, instead, reads from, and writes to, the user-provided stream object. Figure 8-6 shows these two methods in action.

Figure 8-6: ExecuteStream and ExecuteToStream in action.

The following code shows how to use a SqlXmlCommand object. Notice that the connection string for SqlXmlCommand must use the SQLOLEDB provider because SQLXML 3.0 does not support the .NET Framework managed data provider.

    string conn = "PROVIDER=sqloledb;SERVER=(local);" +
        "DATABASE=northwind;UID=sa";
    SqlXmlCommand cmd = new SqlXmlCommand(conn);
    cmd.CommandText = "SELECT * FROM Employees" +
        " FOR XML AUTO, BINARY BASE64";
    Stream stm = cmd.ExecuteStream();

    // Consumes the stream content
    StreamReader sr = new StreamReader(stm);
    Console.WriteLine(sr.ReadToEnd());
    sr.Close();

The Employees table contains a BLOB field with a picture of each employee. If you want the binary field returned encoded as a string, use the BINARY BASE64 keyword in the FOR XML clause.

If the command that SqlXmlCommand executes does not return XML, an exception is raised because streaming is not supported over a result set with multiple columns. SqlXmlCommand works just fine on non-XML queries as long as they return a single column of data.

Tip: The ExecuteToStream method comes in handy for automatically sending the result set over a special stream like the output stream of an ASP.NET page or the console.

Executing Server-Side XPath Queries

A typical functionality of the SQLXML library is executing server-side XPath queries over SQL Server data. Personally, I would not recommend this practice; I believe that a well-designed SQL query outperforms any XPath engine.
The XPath language does let you address hierarchically structured data more easily, but keep in mind that a server-side XPath query requires a preliminary step, the relational-to-XML data transformation, as shown here:

    SqlXmlCommand cmd = new SqlXmlCommand(conn);
    cmd.CommandText = "Emp[@EmployeeID >3]";
    cmd.CommandType = SqlXmlCommandType.XPath;
    cmd.SchemaPath = "MappingSchema.xml";
    cmd.RootTag = "Northwind";
    Stream stOut = cmd.ExecuteStream();

When the command type is XPath, you must set the SchemaPath property on the SqlXmlCommand object. The property points to an XSD or XDR file that defines the XML schema on which the XPath expression is called to operate. For example, consider the following schema:
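A mapping schema of the kind this discussion relies on can be sketched as follows. This is only an illustration under stated assumptions, not the original listing: the annotated XSD is embedded in a C# string and written to MappingSchema.xml (the file name used in the XPath example), it assumes the standard SQLXML mapping annotations sql:relation and sql:field, an Emp element bound to the Employees table, and a reachable Northwind database with the connection string used elsewhere in the chapter.

```csharp
using System;
using System.IO;
using Microsoft.Data.SqlXml;   // SQLXML 3.0 managed classes

class XPathSchemaSketch
{
    // Hypothetical annotated XSD: maps <Emp> to the Employees table and
    // the FName/LName attributes to the FirstName/LastName columns.
    const string MappingSchema =
@"<xsd:schema xmlns:xsd=""http://www.w3.org/2001/XMLSchema""
            xmlns:sql=""urn:schemas-microsoft-com:mapping-schema"">
  <xsd:element name=""Emp"" sql:relation=""Employees"">
    <xsd:complexType>
      <xsd:attribute name=""EmployeeID"" type=""xsd:integer"" />
      <xsd:attribute name=""FName"" sql:field=""FirstName"" type=""xsd:string"" />
      <xsd:attribute name=""LName"" sql:field=""LastName"" type=""xsd:string"" />
    </xsd:complexType>
  </xsd:element>
</xsd:schema>";

    static void Main()
    {
        // Persist the schema so that SchemaPath can point to it.
        using (StreamWriter w = new StreamWriter("MappingSchema.xml"))
            w.Write(MappingSchema);

        // Connection string as in the chapter's examples (assumed environment).
        string conn = "PROVIDER=sqloledb;SERVER=(local);" +
                      "DATABASE=northwind;UID=sa";

        SqlXmlCommand cmd = new SqlXmlCommand(conn);
        cmd.CommandText = "Emp[@EmployeeID >3]";       // XPath over the XML view
        cmd.CommandType = SqlXmlCommandType.XPath;
        cmd.SchemaPath = "MappingSchema.xml";
        cmd.RootTag = "Northwind";                     // wraps the <Emp> elements

        // Send the resulting XML straight to the console stream.
        cmd.ExecuteToStream(Console.OpenStandardOutput());
    }
}
```

With a schema along these lines, the XPath expression filters on the EmployeeID attribute of the Emp elements, and RootTag gives the selected elements a single enclosing element.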


This schema addresses a layout in which FName and LName map to FirstName and LastName and the target table is Employees. Given this underlying XML schema, using the following command text to select all the employees with an ID greater than 3 makes sense:

    Emp[@EmployeeID >3]

The SqlXmlParameter Class

To pass parameters to a SqlXmlCommand object, you must use instances of the SqlXmlParameter class. Here's an example:

    string conn = "PROVIDER=sqloledb;SERVER=(local);" +
        "DATABASE=northwind;UID=sa";
    SqlXmlCommand cmd = new SqlXmlCommand(conn);

    // Define the command text
    StringBuilder sb = new StringBuilder("");
    sb.Append("SELECT * FROM Employees ");
    sb.Append("WHERE employeeid=? ");
    sb.Append("FOR XML AUTO, BINARY BASE64");
    cmd.CommandText = sb.ToString();

    // Set the parameter
    SqlXmlParameter p = cmd.CreateParameter();
    p.Value = 2;

    // Execute the command
    Stream stm = cmd.ExecuteStream();

When you have several parameters set on a particular instance of a SqlXmlCommand object and you want to reuse that instance for another command, use the ClearParameters method to clear the parameters collection in a single shot.
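Reusing a command instance as just described can be sketched as follows. This is a minimal illustration rather than a definitive pattern: the two queries, the Dump helper, and the parameter values are assumptions made for the example, and the code presumes a reachable Northwind database with the connection string used earlier.

```csharp
using System;
using System.IO;
using Microsoft.Data.SqlXml;   // SQLXML 3.0 managed classes

class ReuseCommandSketch
{
    static void Main()
    {
        // Connection string as in the chapter's examples (assumed environment).
        string conn = "PROVIDER=sqloledb;SERVER=(local);" +
                      "DATABASE=northwind;UID=sa";
        SqlXmlCommand cmd = new SqlXmlCommand(conn);

        // First command: a single positional parameter.
        cmd.CommandText = "SELECT EmployeeID, LastName FROM Employees " +
                          "WHERE employeeid=? FOR XML AUTO";
        SqlXmlParameter p = cmd.CreateParameter();
        p.Value = 2;
        Dump(cmd.ExecuteStream());

        // Reuse the same instance: drop the old parameters first,
        // then set the new command text and its own parameter.
        cmd.ClearParameters();
        cmd.CommandText = "SELECT CustomerID, CompanyName FROM Customers " +
                          "WHERE country=? FOR XML AUTO";
        SqlXmlParameter q = cmd.CreateParameter();
        q.Value = "Italy";
        Dump(cmd.ExecuteStream());
    }

    // Hypothetical helper: writes the content of the stream to the console.
    static void Dump(Stream stm)
    {
        using (StreamReader sr = new StreamReader(stm))
            Console.WriteLine(sr.ReadToEnd());
    }
}
```

Without the call to ClearParameters, the parameter created for the first query would still be attached to the command when the second query runs.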


The SqlXmlAdapter Class

The SqlXmlAdapter class is a shrink-wrapped adapter class. It does not implement the IDataAdapter interface, so technically speaking, it can't be presented as an adapter object. Nevertheless, the class provides adapter-like methods such as Fill and Update, as shown in the following code. These are also the only public methods for the class.

    void Fill(DataSet ds);
    void Update(DataSet ds);

The SqlXmlAdapter class also provides three constructors, shown in the following code, whose signatures reinforce the idea that this adapter is a mere wrapper class for SqlXmlCommand. In other words, the SqlXmlAdapter class is more a command that manages DataSet objects than a true data adapter object as it is described in the ADO.NET specification.

    public SqlXmlAdapter(SqlXmlCommand cmd)
    public SqlXmlAdapter(string commandText,
        SqlXmlCommandType cmdType,
        string connectionString)
    public SqlXmlAdapter(Stream commandStream,
        SqlXmlCommandType cmdType,
        string connectionString)

These constructors use the information they receive to set up an internal instance of the SqlXmlCommand class. The Fill method makes full use of all the information passed through the constructor. For the Update method, on the other hand, only the connection string information is actually needed.

Filling an XML Adapter

The Fill method is rather simple. First it executes the embedded XML command using ExecuteStream.
Next it uses the returned memory stream to fill the specified DataSet object through its ReadXml method.

The ReadXml method populates a DataSet object by reading XML data from a variety of sources, including streams and text readers, and inferring the schema. We'll examine the inference process in detail in Chapter 9. For now, suffice it to say that ReadXml can detect any in-line or referenced XSD schema or determine the schema dynamically.

Once the DataSet object has been filled from the XML stream generated by the command execution, all the changes are accepted so that the DataSet object appears intact and with no pending changes.

Updating Using an XML Adapter

The Update method takes a DataSet object and applies its pending changes to the target database. The parameters specified on instantiation contain the details about the connection string. The embedded SqlXmlCommand object has command text and a command type that are simply ignored during Update. Let's see why.

When Update executes, the embedded command object is used to perform the task, but its command text and command type properties are silently and temporarily overwritten with DataSet-specific settings.

The Update method writes the contents of the DataSet object to a newly created memory stream. The DataSet object is serialized as a DiffGram. Next the contents of the stream (that is, the DiffGram representation of the DataSet object) are copied into the CommandText property of the underlying SqlXmlCommand object. The CommandType property is set to Template, and ExecuteStream is called to update the database. If all goes well, the DataSet changes are committed using the DataSet object's AcceptChanges method.

Although COM is still involved, the SqlXmlAdapter object represents a way to architecturally improve the batch update mechanism in ADO.NET. By using SqlXmlAdapter, you actually obtain a DataSet object that is serialized as a DiffGram directly to SQL Server and processed entirely on the server. To optimize the bandwidth, you can pass a DataSet object that contains only changed rows. The GetChanges method provides for that.

Note: Using GetChanges with ADO.NET batch updating is not a significant optimization. It simply reduces the total number of iterations, but the eliminated iterations are no-ops by design. Instead, using GetChanges with SqlXmlAdapter can be a key optimization, as it truly minimizes the amount of data being transferred from the client to SQL Server.

Conclusion

In this chapter, we have explored the connections between databases (SQL Server 2000 in particular) and XML. Several DBMS products provide XML support in various forms. The industry standard, however, requires that a DBMS provide for direct XML result sets and accept changes expressed as XML streams. SQL Server 2000 adheres to these requirements.

The difficulty lies in .NET and the different connecting model it introduces: .NET Framework data providers instead of OLE DB providers.
For .NET Framework applications, fetching data as XML is much easier and more effective than persisting changes as XML. For COM applications, the same features are more balanced. The reason is that SQL Server 2000 came out much earlier than the .NET Framework, but the .NET Framework still came too soon to allow the managed provider to be designed with a broader perspective.

As a result, the SQL Server managed provider is unaware of XML extensions to support FOR XML queries and their limitations. Incidentally, this feature, combined with the power of .NET Framework XML readers, produces a really powerful toolkit. The truth, however, is that today the SQL Server managed provider is designed and optimized for traditional SQL commands, period.

SQLXML 3.0 is an add-on conceived to extend the SQL Server 2000 support for XML. SQLXML 3.0 is just that, however; in no way does it represent an integration with the .NET Framework managed provider model. For this reason, it is entirely based on COM OLE DB providers. The managed classes are wrappers around the SQLXMLOLEDB provider and, as such, require your code to silently jump out of the CLR during execution. This does not mean that you should not use SQLXML 3.0; just be aware of the managed classes' understandable, but still not optimal, design.

Hopes for the future? That's easy: my wish is that SQLXML 3.0 will be improved and integrated with the .NET Framework managed provider. As a side effect of this integration, ADO.NET should be enriched with a kind of Updategram object specifically designed for server-side batch updates.

In Chapter 9, we'll tackle DataSet serialization and the theme of XML serialization for key ADO.NET objects in general, including DataTable and DataView objects. We'll also take another look at DiffGrams. DiffGrams will be explored in depth in Chapter 10.


Further Reading

This chapter touched on a number of SQL Server 2000 issues and, in particular, a number of points related to T-SQL, the SQL dialect of SQL Server. The online documentation that comes with the product (SQL Server's Books Online) is certainly a good starting point to learn more. If you're interested in SQL Server 2000 from an architectural point of view, I recommend Kalen Delaney's Inside SQL Server 2000 (Microsoft Press, 2000). Delaney's book covers the basics of the T-SQL language, but it is not an in-depth guide to T-SQL, and should be accompanied by another text more specifically targeted to the SQL Server dialect. One that I've found useful is Ken Henderson's The Guru's Guide to Transact-SQL (Addison-Wesley, 2000).

Programming Microsoft SQL Server 2000 with XML by Graeme Malcolm (Microsoft Press, 2001) is a good introductory text for exploring XML extensions in SQL Server 2000. Because the book is a bit outdated, it does not cover SQLXML 3.0 and managed extensions.

Another topic introduced in this chapter is ADO.NET and batch updating. My book Building Web Solutions with ASP.NET and ADO.NET (Microsoft Press, 2002) includes a practical chapter on batch updating from the ASP.NET perspective. A broader and in some respects more thoughtful and technology-oriented coverage can be found in Francesco Balena's Programming Visual Basic .NET (Microsoft Press, 2002). If you're interested in the entire spectrum of ADO.NET technologies, take a look at David Sceppa's Microsoft ADO.NET (Microsoft Press, 2002).


Chapter 9: ADO.NET XML Data Serialization

Overview

XML is the key element responsible for the greatly improved interoperability of the Microsoft ADO.NET object model when compared to Microsoft ActiveX Data Objects (ADO). In ADO, XML was merely an I/O format (nondefault) used to persist the contents of a disconnected recordset. The participation of XML in the building and in the inner workings of ADO.NET is much deeper. The aspects of ADO.NET in which the interaction and integration with XML is stronger can be summarized in two categories: object serialization and remoting, and a dual programming interface.

In ADO.NET, you have several options for saving objects to, and restoring objects from, XML documents. In effect, this capability belongs to one object only, the DataSet object, but it can be extended to other container objects with minimal coding. Saving objects like DataTable and DataView to XML is essentially a special case of the DataSet object serialization.

As we saw in Chapter 8, ADO.NET and XML classes provide for a unified, intermediate API that is made available to programmers through a dual, synchronized programming interface: the XmlDataDocument class.
You can access and update data using either the hierarchical node-based approach of XML or the relational approach of column-based tabular data sets. At any time, you can switch from a DataSet representation of the data to an XML Document Object Model (XML DOM) representation, and vice versa. Data is synchronized, and any change you enter in either model is immediately reflected and visible in the other.

In this chapter, we'll explore the XML features built around the DataSet object and other ADO.NET objects for data serialization and deserialization. You'll learn how to persist and restore data contents, how to deal with schema information, and even how schema information is automatically inferred from the XML source.

Serializing DataSet Objects

Like any other .NET Framework object, a DataSet object is stored in memory in a binary format. Unlike other objects, however, the DataSet object is always remoted and serialized in a special XML format, called a DiffGram. (We'll look at the DiffGram format and the related API in more detail in Chapter 10.) When the DataSet object crosses the boundaries of the application domains (AppDomains), or the physical borders of the machine, it is automatically rendered as a DiffGram.
At its dest<strong>in</strong>ation,the DataSet object is silently rebuilt as a b<strong>in</strong>ary and immediately usable object.In ADO.<strong>NET</strong>, serialization of an object is per<strong>for</strong>med either through the publicISerializable <strong>in</strong>terface or through public methods that expose the object's <strong>in</strong>ternalserialization mechanism. As .<strong>NET</strong> Framework objects, ADO.<strong>NET</strong> objects can plug <strong>in</strong>tothe standard .<strong>NET</strong> Framework serialization mechanism and output their contents tostandard and user-def<strong>in</strong>ed <strong>for</strong>matters. The .<strong>NET</strong> Framework provides a couple of built<strong>in</strong><strong>for</strong>matters: the b<strong>in</strong>ary <strong>for</strong>matter and the Simple Object Access Protocol (SOAP)<strong>for</strong>matter. A .<strong>NET</strong> Framework object makes itself serializable by implement<strong>in</strong>g themethods of the ISerializable <strong>in</strong>terface—specifically, the GetObjectData method, plus aparticular flavor of the constructor. Accord<strong>in</strong>g to this def<strong>in</strong>ition, both the DataSet and theDataTable objects are serializable.In addition to the official serialization <strong>in</strong>terface, the DataSet object supplies analternative, and more direct, series of methods to serialize and deserialize itself, but <strong>in</strong> aclass-def<strong>in</strong>ed <strong>XML</strong> <strong>for</strong>mat only. To serialize us<strong>in</strong>g the standard method, you create<strong>in</strong>stances of the <strong>for</strong>matter object of choice (b<strong>in</strong>ary, SOAP, or whatever) and let the324


<strong>for</strong>matter access the source data through the methods of the ISerializable <strong>in</strong>terface.The <strong>for</strong>matter obta<strong>in</strong>s raw data that it then packs <strong>in</strong>to the expected output stream.In the alternative serialization model, the DataSet object itself starts and controls theserialization and deserialization process through a group of extra methods. TheDataTable object does not offer public methods to support such an alternative andembedded serialization <strong>in</strong>terface, nor does the DataView object.In the end, both the official and the embedded serialization eng<strong>in</strong>es share the same setof methods. The overall architecture of DataSet and DataTable serialization isgraphically rendered <strong>in</strong> Figure 9-1.Figure 9-1: Both the DataSet object and the DataTable object implement the ISerializable<strong>in</strong>terface <strong>for</strong> classic .<strong>NET</strong> Framework serialization. The DataSet object also publiclyexposes the <strong>in</strong>ternal API used to support classic serialization.All the methods that the DataSet object uses <strong>in</strong>ternally to support the .<strong>NET</strong> Frameworkserialization process are publicly exposed to applications through a group of methods,one pair of which clearly stands out—ReadXml and WriteXml. The DataTable object, onthe other hand, does not publish the same methods, although this feature can be easilyobta<strong>in</strong>ed with a little code. (I'll demonstrate this <strong>in</strong> the section "Serializ<strong>in</strong>g FilteredViews," on page 417.)As you can see <strong>in</strong> the architecture depicted <strong>in</strong> Figure 9-1, both objects always pass<strong>XML</strong> data to .<strong>NET</strong> Framework <strong>for</strong>matters. This means that there is no .<strong>NET</strong> Framework-325


provided way to serialize ADO.<strong>NET</strong> objects <strong>in</strong> b<strong>in</strong>ary <strong>for</strong>mats. We'll return to this topic <strong>in</strong>the section "Custom B<strong>in</strong>ary Serialization," on page 424.The DataSet Object's Embedded API <strong>for</strong> <strong>XML</strong>Table 9-1 presents the DataSet object methods you can use to work with <strong>XML</strong>, both <strong>in</strong>read<strong>in</strong>g and <strong>in</strong> writ<strong>in</strong>g. This list represents the DataSet object's <strong>in</strong>ternal <strong>XML</strong> API, whichis at the foundation of the serialization and deserialization processes <strong>for</strong> the object.Table 9-1: The DataSet Object's Embedded Serialization APIMethodGetXmlGetXmlSchemaReadXmlReadXmlSchemaWriteXmlWriteXmlSchemaDescriptionReturns an <strong>XML</strong> representation of the data currentlystored <strong>in</strong> the DataSet object. No schema <strong>in</strong><strong>for</strong>mation is<strong>in</strong>cluded.Returns a str<strong>in</strong>g that represents the <strong>XML</strong> schema<strong>in</strong><strong>for</strong>mation <strong>for</strong> the data currently stored <strong>in</strong> the object.Populates the DataSet object with the specified <strong>XML</strong>data read from a stream or a file. Dur<strong>in</strong>g the process,schema <strong>in</strong><strong>for</strong>mation is read or <strong>in</strong>ferred from the data.Loads the specified <strong>XML</strong> schema <strong>in</strong><strong>for</strong>mation <strong>in</strong>to thecurrent DataSet object.Writes out the <strong>XML</strong> data, and optionally the schema,that represents the DataSet object to a storagemedium—that is, a stream or a file.Writes out a str<strong>in</strong>g that represents the <strong>XML</strong> schema<strong>in</strong><strong>for</strong>mation <strong>for</strong> the DataSet object. 
Can write to astream or a file.Note that GetXml returns a str<strong>in</strong>g that conta<strong>in</strong>s <strong>XML</strong> data. As such, it requires moreoverhead than simply us<strong>in</strong>g WriteXml to write <strong>XML</strong> to a file. You should not use GetXmland GetXmlSchema unless you really need to obta<strong>in</strong> the DataSet representation orschema as dist<strong>in</strong>ct str<strong>in</strong>gs <strong>for</strong> <strong>in</strong>-memory manipulation. The GetXmlSchema methodreturns the DataSet object's <strong>XML</strong> Schema Def<strong>in</strong>ition (XSD) schema; there is no way toobta<strong>in</strong> the DataSet object's <strong>XML</strong>-Data Reduced (XDR) schema.As Table 9-1 shows, when you're work<strong>in</strong>g with DataSet and <strong>XML</strong>, you can managedata and schema <strong>in</strong><strong>for</strong>mation as dist<strong>in</strong>ct entities. You can take the <strong>XML</strong> schema out ofthe object and use it as a str<strong>in</strong>g. Alternatively, you could write the schema to a disk fileor load it <strong>in</strong>to an empty DataSet object. Alongside the methods listed <strong>in</strong> Table 9-1, theDataSet object also features two <strong>XML</strong>-related properties: Namespace and Prefix.Namespace specifies the <strong>XML</strong> namespace used to scope <strong>XML</strong> attributes and elementswhen you read them <strong>in</strong>to a DataSet object. The prefix to alias the namespace is stored<strong>in</strong> the Prefix property. The namespace can't be set if the DataSet object alreadyconta<strong>in</strong>s data.Writ<strong>in</strong>g Data as <strong>XML</strong>The contents of a DataSet object can be serialized as <strong>XML</strong> <strong>in</strong> two ways that I'll callstateless and stateful. 
Although these expressions are not common throughout the ADO.NET documentation, I believe that they capture the gist of the two XML schemas that can be used to persist a DataSet object's contents. A stateless representation takes a snapshot of the current instance of the data and renders it according to a particular XML schema (defined in Chapter 1 as the ADO.NET normal form). A stateful representation, on the other hand, contains the history of the data in the object and includes information about changes as well as pending errors. Keep in mind that stateless and stateful refer to the data in the DataSet object but not to the DataSet object as a whole.

In this chapter, we'll focus on the stateless representation of the DataSet object, with just a glimpse at the stateful representation, the DiffGram format. In Chapter 10, we'll delve into the DiffGram's structure and goals.

The XML representation of a DataSet object can be written to a file, a stream, an XmlWriter object, or a TextWriter object (and so, through a StringWriter, to a string) using the WriteXml method. It can include, or not include, XSD schema information. The actual behavior of the WriteXml method can be controlled by passing the optional XmlWriteMode parameter. The values in the XmlWriteMode enumeration determine the output's layout. The overloads of the method are shown in the following listing:

    public void WriteXml(Stream, XmlWriteMode);
    public void WriteXml(string, XmlWriteMode);
    public void WriteXml(TextWriter, XmlWriteMode);
    public void WriteXml(XmlWriter, XmlWriteMode);

WriteXml provides four additional overloads with the same structure as this code but with no explicit XmlWriteMode argument.

The stateless representation of the DataSet object takes a snapshot of the current status of the object. In addition to data, the representation includes tables, relations, and constraints definitions. The rows in the tables are written only in their current versions, unless you use the DiffGram format, which would make this a stateful representation.
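The effect of the XmlWriteMode argument is easiest to see by serializing the same DataSet object three times. The following is a minimal sketch, not code from the book's sample files: the class name WriteModesSketch, its two methods, and the sample row are assumptions made for illustration. It relies only on the WriteXml(TextWriter, XmlWriteMode) overload listed above.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical helper class, used only for this illustration.
static class WriteModesSketch
{
    // Builds a tiny DataSet with one accepted row that is then modified,
    // so that the row has both an original and a current version.
    public static DataSet BuildSample()
    {
        DataSet ds = new DataSet("NorthwindInfo");
        DataTable employees = ds.Tables.Add("Employees");
        employees.Columns.Add("employeeid", typeof(int));
        employees.Columns.Add("lastname", typeof(string));
        employees.Rows.Add(1, "Davolio");
        ds.AcceptChanges();                        // original version: Davolio
        employees.Rows[0]["lastname"] = "Fuller";  // current version: Fuller
        return ds;
    }

    // Serializes the DataSet to a string using the given write mode.
    public static string Serialize(DataSet ds, XmlWriteMode mode)
    {
        using (StringWriter sw = new StringWriter())
        {
            ds.WriteXml(sw, mode);
            return sw.ToString();
        }
    }
}
```

With IgnoreSchema and WriteSchema only the current value (Fuller) is written, which is the stateless snapshot; with DiffGram the original value (Davolio) is written as well, which is what makes that format stateful.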
The following schema shows the ADO.NET normal form, that is, the XML stateless representation of a DataSet object:

    <DataSetName>
        <Table1>
            <field1>...</field1>
            <field2>...</field2>
        </Table1>
        <Table1>
            <field1>...</field1>
            <field2>...</field2>
        </Table1>
        <Table2>...</Table2>
        ⋮
    </DataSetName>

The root tag is named after the DataSet object. If the DataSet object has no name, the string NewDataSet is used. The name of the DataSet object can be set at any time through the DataSetName property or via the constructor upon instantiation. Each table in the DataSet object is represented as a block of rows. Each row is a subtree rooted in a node with the name of the table. You can control the name of a DataTable object via the TableName property. By default, the first unnamed table added to a DataSet object is named Table. A trailing index is appended if a table with that name already exists.

The following listing shows the XML data of a DataSet object named NorthwindInfo:

    <NorthwindInfo>
        <Employees>
            <employeeid>1</employeeid>
            <lastname>Davolio</lastname>
            <firstname>Nancy</firstname>
        </Employees>
        ⋮
        <Territories>
            <employeeid>1</employeeid>
            <territoryid>06897</territoryid>
        </Territories>
        ⋮
    </NorthwindInfo>

Basically, the XML representation of a DataSet object contains rows of data grouped under a root node. Each row is rendered with a subtree in which child nodes represent columns. The contents of each column are stored as the text of the node. The link between a row and the parent table is established through the name of the row node. In the preceding listing, the <Employees>…</Employees> subtree represents a row in a DataTable object named Employees.

Modes of Writing

Table 9-2 summarizes the writing options available for use with WriteXml through the XmlWriteMode enumeration.

Table 9-2: The XmlWriteMode Enumeration

Write Mode      Description
DiffGram        Writes the contents of the DataSet object as a DiffGram, including original and current values.
IgnoreSchema    Writes the contents of the DataSet object as XML data without a schema.
WriteSchema     Writes the contents of the DataSet object, including an in-line XSD schema. The schema can't be inserted as XDR, nor can it be added as a reference.

IgnoreSchema is the default option. The following code demonstrates the typical way to serialize a DataSet object to an XML file:

    StreamWriter sw = new StreamWriter(fileName);
    dataset.WriteXml(sw);    // Defaults to IgnoreSchema
    sw.Close();

Tip
In terms of functionality, calling the GetXml method and then writing its contents to a data store is identical to calling WriteXml with XmlWriteMode set to IgnoreSchema. Using GetXml can be convenient, but in terms of raw overhead, calling WriteXml on a StringWriter object is slightly more efficient, as shown here:

    StringWriter sw = new StringWriter();
    ds.WriteXml(sw, XmlWriteMode.IgnoreSchema);
    // Access the string using sw.ToString()

The same considerations apply to GetXmlSchema and WriteXmlSchema.

Preserving Schema and Type Information

The stateless XML format is a flat format. Unless you explicitly add schema information, the XML output is weakly typed. There is no information about tables and columns, and the original content of each column is normalized to a string. If you need a higher level of type and schema fidelity, start by adding an in-line XSD schema.

In general, a few factors can influence the final structure of the XML document that WriteXml creates for you. In addition to the overall XML format (DiffGram or a plain hierarchical representation of the current contents), important factors include the presence of schema information, nested relations, and how table columns are mapped to XML elements.

Note
To optimize the resulting XML code, the WriteXml method drops column fields with null values. Dropping the null column fields doesn't affect the usability of the DataSet object: you can successfully rebuild the object from XML, and data-bound controls can easily manage null values. This feature can become a problem, however, if you send the DataSet object's XML output to a non-.NET platform. Other parsers, unaware that null values are omitted for brevity, might fail to parse the document. If you want to represent null values in the XML output, replace the null values (System.DBNull type) with other neutral values (for example, blank spaces).

Writing Schema Information

When you serialize a DataSet object, schema information is important for two reasons. First, it adds structured information about the layout of the constituent tables and their relations and constraints. Second, extra table properties are persisted only within the schema. Note, however, that schema information describes the structure of the XML document being created and is not a transcript of the database metadata.

The schema contains information about the constituent columns of each DataTable object. (Column information includes name, type, any expression, and all the contents of the ExtendedProperties collection.)

The schema is always written as an in-line XSD. As mentioned, there is no way for you to write the schema as XDR, as a document type definition (DTD), or even as an added reference to an external file. The following listing shows the schema source for a DataSet object named NorthwindInfo that consists of two tables: Employees and Territories. The Employees table has three columns: employeeid, lastname, and firstname. The Territories table includes employeeid and territoryid columns. (These elements appear in boldface in this listing.)
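A schema of the shape just described can be regenerated with a few lines of code. The sketch below is an illustration under stated assumptions rather than the book's own sample: the class name SchemaSketch is hypothetical, and the column types (int for employeeid, string for the others) are guesses, since the text names the columns but not their types. It uses the WriteXmlSchema overload that takes a TextWriter.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical helper class, used only for this illustration.
static class SchemaSketch
{
    // Builds an empty NorthwindInfo DataSet with the two tables named in the text.
    // The column types are assumptions; the source does not state them.
    public static DataSet BuildNorthwindInfo()
    {
        DataSet ds = new DataSet("NorthwindInfo");

        DataTable employees = ds.Tables.Add("Employees");
        employees.Columns.Add("employeeid", typeof(int));
        employees.Columns.Add("lastname", typeof(string));
        employees.Columns.Add("firstname", typeof(string));

        DataTable territories = ds.Tables.Add("Territories");
        territories.Columns.Add("employeeid", typeof(int));
        territories.Columns.Add("territoryid", typeof(string));

        return ds;
    }

    // Returns the XSD schema of the DataSet as a string.
    public static string GetSchema(DataSet ds)
    {
        using (StringWriter sw = new StringWriter())
        {
            ds.WriteXmlSchema(sw);
            return sw.ToString();
        }
    }
}
```

Printing SchemaSketch.GetSchema(SchemaSketch.BuildNorthwindInfo()) shows one xs:element per table inside an xs:choice block, with the columns as child elements of each table.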


The <xs:choice> element describes the body of the root node as an unbounded sequence of <Employees> and <Territories> nodes. These first-level nodes indicate the tables in the DataSet object. The children of each table denote the schema of the DataTable object. (See Chapter 3 for more information about XML schemas.)

The schema can be slightly more complex if relations exist between two or more pairs of tables. The msdata namespace contains ad hoc attributes that are used to annotate the schema with ADO.NET-specific information, mostly about indexes, table relationships, and constraints.

In-Line Schemas and Validation

Chapter 3 hinted at why the XmlValidatingReader class is paradoxically unable to validate the XML code that WriteXml generates for a DataSet object with an in-line schema, as shown here:

    <DataSetName>
        <xs:schema> ... </xs:schema>
        <Table1> ... </Table1>
        <Table2> ... </Table2>
    </DataSetName>

In the final XML layout, schema information is placed at the same level as the table nodes, but includes information about the common root (DataSetName, in the preceding code) as well as the tables (Table1 and Table2). Because the validating parser is a forward-only reader, it can match the schema only for nodes placed after the schema block. The idea is that the parser first reads the schema and then checks the compliance of the remainder of the tree with the just-read information, as shown in Figure 9-2.

Figure 9-2: How the .NET Framework validating reader parses a serialized DataSet object with an in-line schema.

Due to the structure of the XML document being generated, what comes after the schema does not match the schema! Figure 9-3 shows that the validating parser we built in Chapter 3 around the XmlValidatingReader class does not recognize (I'd say, by design) a serialized DataSet object when an in-line schema is incorporated.

Figure 9-3: The validating parser built in Chapter 3 does not validate an XML DataSet object with an in-line schema.

Is there a way to serialize the DataSet object so that its XML representation remains parsable when an in-line schema is included? The workaround is fairly simple.


Serializing to Valid XML

As you can see in Figure 9-2, the rub lies in the fact that the in-line schema is written in the middle of the document it is called to describe. This fact, in addition to the forward-only nature of the parser, irreversibly alters the parser's perception of what the real document schema is. The solution is simple: move the schema out of the DataSet XML serialization output, and group both nodes under a new common root, as shown here:

    <Wrapper>
        <xs:schema> ... </xs:schema>
        <DataSetName>
            ⋮
        </DataSetName>
    </Wrapper>

Here's a code snippet that shows how to implement this solution:

    // Assumes file holds the output file name; the file-name constructor
    // also takes an encoding (null means UTF-8).
    XmlTextWriter writer = new XmlTextWriter(file, null);
    writer.Formatting = Formatting.Indented;
    writer.WriteStartElement("Wrapper");
    ds.WriteXmlSchema(writer);
    ds.WriteXml(writer);
    writer.WriteEndElement();
    writer.Close();

If you don't use an XML writer, the WriteXmlSchema method would write the XML declaration in the middle of the document, thus making the document wholly unparsable. You can also mark this workaround with your own credentials using a custom namespace, as shown here:

    writer.WriteStartElement("de", "Wrapper", "dinoe-xml-07356-1801-1");

Figure 9-4 shows the new document displayed in Microsoft Internet Explorer.

Figure 9-4: The DataSet object's XML output after modification.

Figure 9-5 shows that this new XML file (validdataset.xml) is successfully validated by the XmlValidatingReader class. The validating parser raises a warning about the new root node; this feature was covered in Chapter 3.
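To experiment with the wrapper layout without touching the file system, the same sequence of calls can target an in-memory writer. This is a minimal sketch under that assumption: the class name WrapperSketch and the method WriteWrapped are hypothetical, and the only change from the snippet above is that the XmlTextWriter wraps a StringWriter.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

// Hypothetical helper class, used only for this illustration.
static class WrapperSketch
{
    // Writes the schema first and the data second, both under a <Wrapper> root,
    // and returns the resulting document as a string.
    public static string WriteWrapped(DataSet ds)
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);
        writer.Formatting = Formatting.Indented;

        writer.WriteStartElement("Wrapper");
        ds.WriteXmlSchema(writer);   // schema block comes first
        ds.WriteXml(writer);         // data follows the schema
        writer.WriteEndElement();
        writer.Close();

        return sw.ToString();
    }
}
```

Because both calls go through the same XmlWriter, no XML declaration is emitted between the schema and the data, which is the point made in the text.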


Figure 9-5: The validating parser raises a warning but accepts the updated XML file.

A reasonable concern you might have is about the DataSet object's ability to read back such a modified XML stream. No worries! The ReadXml method is still perfectly able to read and process the modified schema, as shown here:

    DataSet ds = new DataSet();
    ds.ReadXml("ValidDataset.xml", XmlReadMode.ReadSchema);
    ds.WriteXml("standard.xml");

Note
Although paradoxical, this behavior (whether it's by design or a bug) does not deserve much hype. At first glance, this behavior seems to limit true cross-platform interoperability, but after a more thoughtful look, you can't help but realize that very few XML parsers today support in-line XML schemas. In other words, what appears to be a clamorous and incapacitating bug is actually a rather innocuous behavior that today has a very limited impact on real applications. Real-world cross-platform data exchange, in fact, must be done using distinct files for schema and data.

Customizing the XML Representation

The schema of the DataSet object's XML representation is not set in stone and can be modified to some extent. In particular, each column in each DataTable object can specify how the internal serializer should render its content. By default, each column is rendered as an element, but this feature can be changed to any of the values in the MappingType enumeration. The DataColumn property that specifies the mapping type is ColumnMapping.

Customizing Column Mapping

Each row in a DataTable object originates an XML subtree whose structure depends on the value assigned to the DataColumn object's ColumnMapping property. Table 9-3 lists the allowable column mappings.

Table 9-3: The MappingType Enumeration

Mapping          Description
Attribute        The column is mapped to an XML attribute on the row node.
Element          The column is mapped to an XML node element. The default setting.
Hidden           The column is not included in the XML output unless the DiffGram format is used.
SimpleContent    The column is mapped to simple text. (Only for tables containing exactly one column.)

The column data depends on the row node. If ColumnMapping is set to Element, the column value is rendered as a child node, as shown here:

    <Table>
        <Column>value</Column>
        ⋮
    </Table>

If ColumnMapping is set to Attribute, the column data becomes an attribute on the row node, as shown here:

    <Table Column="value">
        ⋮
    </Table>

By setting ColumnMapping to Hidden, you can filter the column out of the XML representation. The two preceding settings carry over unchanged to the DiffGram format. A column marked Hidden is also serialized in the DiffGram format, but with a special attribute that indicates that it was originally marked hidden for serialization. The reason is that the DiffGram format is meant to provide a stateful and high-fidelity representation of the DataSet object.

Finally, the SimpleContent setting renders the column content as the text of the row node, as shown here:

    <Table>value</Table>

For this reason, this setting is applicable only to tables that have a single column.

Persisting Extended Properties

Many ADO.NET classes, including DataSet, DataTable, and DataColumn, use the ExtendedProperties property to enable users to add custom information. Think of the ExtendedProperties property as a kind of generic cargo variable similar to the Tag property of many ActiveX controls. You populate it with name/value pairs and manage the contents using the typical and familiar programming interface of collections. For example, you can use the DataTable object's ExtendedProperties collection to store the SQL command that should be used to refresh the table itself.

The set of extended properties is lost at serialization time, unless you choose to add schema information. The WriteXml method adds extended properties to the schema using an ad hoc attribute prefixed with the msprop namespace prefix. Consider the following code:

    ds.Tables["Employees"].ExtendedProperties.Add("Command",
        EmployeesCommand.Text);
    ds.Tables["Territories"].ExtendedProperties.Add("Command",
        TerritoriesCommand.Text);


When the tables are serialized, the Command slot is rendered as follows:

    <xs:element name="Employees" msprop:Command="...">
    <xs:element name="Territories" msprop:Command="...">

ExtendedProperties holds a collection of objects and can accept values of any type, but you might run into trouble if you store values other than strings there. When the object is serialized, any extended property is serialized as a string. In particular, the string is what the object's ToString method returns. This can pose problems when the DataSet object is deserialized.

Not all types can be successfully and seamlessly rebuilt from a string. For example, consider the Color class. If you call ToString on a Color object (say, Blue), you get something like Color [Blue]. However, no constructor on the Color class can rebuild a valid object from such a string. For this reason, pay careful attention to the nonstring types you store in the ExtendedProperties collection.

Rendering Data Relations

A DataSet object can contain one or more relations gathered under the Relations collection property. A DataRelation object represents a parent/child relationship set between two DataTable objects. The connection takes place on the value of a matching column and is similar to a primary key/foreign key relationship. In ADO.NET, the relation is entirely implemented in memory and can have any cardinality: one-to-one, one-to-many, and even many-to-one.

More often than not, a relation entails table constraints. In ADO.NET, you have two types of constraints: foreign-key constraints and unique constraints. A foreign-key constraint denotes an action that occurs on the columns involved in the relation when a row is either deleted or updated. A unique constraint denotes a restriction on the parent column whereby duplicate values are not allowed. How are relations rendered in XML?

If no schema information is required, relations are simply ignored. When a schema is not explicitly required, the XML representation of the DataSet object is a plain snapshot of the currently stored data; any ancillary information is ignored. There are two ways to accurately represent a DataRelation object within an XML schema: you can use the <msdata:Relationship> annotation or specify an <xs:keyref> element. The WriteXml method uses the latter solution.

The msdata:Relationship Annotation

The msdata:Relationship annotation is a Microsoft XSD extension that ADO.NET and XML programmers can use to explicitly specify a parent/child relationship between non-nested tables in a schema. This annotation is ideal for expressing the content of a DataRelation object. In turn, the content of an msdata:Relationship annotation is transformed into a DataRelation object when ReadXml processes the XML file. Let's consider the following relation:

    DataRelation rel = new DataRelation("Emp2Terr",
        ds.Tables["Employees"].Columns["employeeid"],
        ds.Tables["Territories"].Columns["employeeid"]);
    ds.Relations.Add(rel);

The following listing shows how to serialize this relation to XML:


    <xs:schema id="NorthwindInfo" ...>
        <xs:annotation>
            <xs:appinfo>
                <msdata:Relationship name="Emp2Terr"
                    msdata:parent="Employees"
                    msdata:child="Territories"
                    msdata:parentkey="employeeid"
                    msdata:childkey="employeeid" />
            </xs:appinfo>
        </xs:annotation>
        ⋮

This syntax is simple and effective, but it has one little drawback: it only describes a relation. When you serialize a DataSet object to XML, you might want to obtain a hierarchical representation of the data, if a parent/child relationship is present. For example, which of the following XML documents do you find more expressive? The sequential layout shown here is the default:

    <NorthwindInfo>
        <Employees> ... </Employees>
        <Territories> ... </Territories>
        <Territories> ... </Territories>
    </NorthwindInfo>

The following layout provides a hierarchical view of the data: all the territories' rows are nested below the logical parent row:

    <NorthwindInfo>
        <Employees>
            ...
            <Territories> ... </Territories>
            <Territories> ... </Territories>
        </Employees>
    </NorthwindInfo>

As an annotation, msdata:Relationship can't express this schema-specific information. Another piece of information is still needed. For this reason, the WriteXml method uses the <xs:keyref> element to describe the relationship along with nested type definitions to create a hierarchy of nodes.

The XSD keyref Element

In XSD, the keyref element allows you to establish links between elements within a document in much the same way a parent/child relationship does. The WriteXml method uses keyref to express a relation within a DataSet object, as shown here:

    <xs:keyref name="Emp2Terr" refer="Constraint1">
        ⋮
    </xs:keyref>


The name attribute is set to the name of the DataRelation object. By design, the refer attribute points to the name of a key or unique element defined in the same schema. For a DataRelation object, refer points to an automatically generated unique element that represents the parent table, as shown in the following code. The child table of a DataRelation object, on the other hand, is represented by the contents of the keyref element.

    <xs:unique name="Constraint1">
        <xs:selector xpath=".//Employees" />
        <xs:field xpath="employeeid" />
    </xs:unique>

The keyref element's contents consist of two mandatory subelements, selector and field, both of which contain an XPath expression. The selector subelement specifies the node-set across which the values selected by the expression in field must be unique. Put more simply, selector denotes the parent or the child table, and field indicates the parent or the child column. The final XML representation of our sample DataRelation object is shown here:

    <xs:unique name="Constraint1">
        <xs:selector xpath=".//Employees" />
        <xs:field xpath="employeeid" />
    </xs:unique>
    <xs:keyref name="Emp2Terr" refer="Constraint1">
        <xs:selector xpath=".//Territories" />
        <xs:field xpath="employeeid" />
    </xs:keyref>

This code is functionally equivalent to the msdata:Relationship annotation, but it is completely expressed using the XSD syntax.

Nested Data and Nested Types

The XSD syntax is also important for expressing relations in XML using nested subtrees. Neither msdata:Relationship nor keyref is adequate to express the relation when nested tables are required. Nested relations are expressed using nested types in the XML schema.

In the following code, the Territories type is defined within the Employees type, thus matching the hierarchical relationship between the corresponding tables:

    ⋮
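The difference between the sequential and the nested layout can be reproduced in code by toggling the Nested property of the DataRelation object. The following is a sketch for illustration only: the class name NestedSketch and the sample rows are assumptions, while the relation itself mirrors the Emp2Terr relation used earlier.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical helper class, used only for this illustration.
static class NestedSketch
{
    // Builds a two-table DataSet with one relation and serializes it,
    // with child rows either nested under the parent row or kept sequential.
    public static string Serialize(bool nested)
    {
        DataSet ds = new DataSet("NorthwindInfo");

        DataTable employees = ds.Tables.Add("Employees");
        employees.Columns.Add("employeeid", typeof(int));
        employees.Columns.Add("lastname", typeof(string));

        DataTable territories = ds.Tables.Add("Territories");
        territories.Columns.Add("employeeid", typeof(int));
        territories.Columns.Add("territoryid", typeof(string));

        employees.Rows.Add(1, "Davolio");
        territories.Rows.Add(1, "06897");

        DataRelation rel = new DataRelation("Emp2Terr",
            employees.Columns["employeeid"],
            territories.Columns["employeeid"]);
        ds.Relations.Add(rel);
        rel.Nested = nested;   // true: hierarchical output; false: sequential output

        using (StringWriter sw = new StringWriter())
        {
            ds.WriteXml(sw);   // defaults to IgnoreSchema
            return sw.ToString();
        }
    }
}
```

Comparing NestedSketch.Serialize(false) with NestedSketch.Serialize(true) shows the Territories row moving from a sibling of the Employees row to a child of it.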


By using keyref and nested types, you have a single syntax, the XML Schema language, to render in XML the contents of any ADO.NET DataRelation object. The Nested property of the DataRelation object specifies whether the relation must be rendered hierarchically (that is, with child rows nested under the parent) or sequentially (that is, with all rows treated as children of the root node).

Important
When reading an XML stream to build a DataSet object, the ReadXml method treats the <msdata:Relationship> annotation and the <xs:keyref> element as perfectly equivalent pieces of syntax. Both are resolved by creating and adding a DataRelation object with the specified characteristics. When ReadXml meets nested types, in the absence of explicit relationship information, it ensures that the resultant DataSet object has tables that reflect the hierarchy of types and creates a DataRelation object between them. This relation is given an auto-generated name and is set on a pair of automatically created columns.

Serializing Filtered Views

As mentioned, in ADO.NET both the DataSet object and the DataTable object implement the ISerializable interface, thus making themselves accessible to any .NET Framework serializers. Only the DataSet object, however, exposes additional methods (for example, WriteXml) to let you explicitly save the contents to XML. We'll explore the various aspects of ADO.NET object serialization in the section "Binary Data Serialization," on page 422.

In the meantime, let's see how to extend the DataTable and DataView objects with the equivalent of a WriteXml method.

Serializing DataTable Objects

The .NET Framework does not allow you to save a stand-alone DataTable object to XML. (A stand-alone DataTable object is an object not included in any parent DataSet object.) Unlike the DataSet object, the DataTable object does not provide you with a WriteXml method. Nevertheless, when you persist a DataSet object to XML, any contained DataTable object is regularly rendered to XML. How is this possible?

The DataSet class includes internal methods that can be used to persist an individual DataTable object to XML. Unfortunately, these methods are not publicly available. Saving the contents of a stand-alone DataTable object to XML is not particularly difficult, however, and requires only one small trick.

The idea is that you create a temporary, empty DataSet object, add the table to it, and then serialize the DataSet object to XML. Here's some sample code:

    public static
    void WriteDataTable(DataTable dt, string outputFile,
        XmlWriteMode mode)
    {
        DataSet tmp = CreateTempDataSet(dt);
        tmp.WriteXml(outputFile, mode);
    }

This code is excerpted from a sample class library that provides static methods to save DataTable and DataView objects to XML. Each method has several overloads and mimics as much as possible the DataSet object's WriteXml method. In the preceding sample code, the input DataTable object is incorporated in a temporary DataSet object that is then saved to a disk file. The following code creates the temporary DataSet object and adds the DataTable object to it:

    private static DataSet CreateTempDataSet(DataTable dt)
    {
        // Create a temporary DataSet
        DataSet ds = new DataSet("DataTable");

        // Make sure the DataTable does not already belong to a DataSet
        if (dt.DataSet == null)
            ds.Tables.Add(dt);
        else
            ds.Tables.Add(dt.Copy());
        return ds;
    }

Note that a DataTable object can't be linked to more than one DataSet object at a time. If a given DataTable object has a parent object, its DataSet property is not null. If the property is not null, the temporary DataSet object must be linked to an in-memory copy of the table.

The class library that contains the various WriteDataTable overloads is available in this book's sample files and is named AdoNetXmlSerializer. A client application uses the library as follows:

    StringWriter writer = new StringWriter();
    AdoNetXmlSerializer.WriteDataTable(m_data, writer);
    // Show the serialization output
    OutputText.Text = writer.ToString();
    writer.Close();

Figure 9-6 shows the sample application in action.
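The client code above calls an overload that writes to a TextWriter, which the excerpt does not show. The sketch below is one way such an overload could look. It is a guess at the shape, not the code from the book's AdoNetXmlSerializer library: the class name DataTableXmlSketch is hypothetical, and detaching the table in a finally block is a choice made here so that a stand-alone table stays stand-alone after the call.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical helper class; not the AdoNetXmlSerializer library from the book.
static class DataTableXmlSketch
{
    // Serializes a DataTable through a temporary DataSet, as described in the text.
    public static void WriteDataTable(DataTable dt, TextWriter writer, XmlWriteMode mode)
    {
        // A table can belong to only one DataSet, so copy it if it already has a parent.
        DataTable table = (dt.DataSet == null) ? dt : dt.Copy();

        DataSet tmp = new DataSet("DataTable");
        tmp.Tables.Add(table);
        try
        {
            tmp.WriteXml(writer, mode);
        }
        finally
        {
            // Detach the table so that the caller's stand-alone table
            // is not left attached to the temporary DataSet.
            tmp.Tables.Remove(table);
        }
    }
}
```

Passing a StringWriter, as the client code does, leaves the serialized table available through ToString.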


Figure 9-6: An application that passes some data to a DataTable object and then persists it to XML.

So much for DataTable objects. Let's see what you can do to serialize to XML the contents of an in-memory, possibly filtered, view.

Inside the DataView Object

The DataView class represents a customized view of a DataTable object. The relationship between DataTable and DataView objects is governed by the rules of a well-known design pattern: the document/view model. According to this model, the DataTable object acts as the document, and the DataView object acts as the view. At any moment, you can have multiple, different views of the same underlying data. More important, you can manage each view as an independent object with its own set of properties, methods, and events.

The view is implemented by maintaining a separate array with the indexes of the original rows that match the criteria set on the view. By default, the table view is unfiltered and contains all the records included in the table. By configuring the RowFilter and RowStateFilter properties, you can narrow the set of rows that fit into a particular view. Using the Sort property, you can apply a sort expression to the rows in the view. Figure 9-7 illustrates the internal architecture of the DataView object.

Figure 9-7: A DataView object maintains an index of the table rows that match the criteria.

When any of the filter properties is set, the DataView object gets from the underlying DataTable object an updated index of the rows that match the criteria. The index is a simple array of positions. No row objects are physically copied or referenced at this time.
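
To make the document/view relationship concrete, here is a minimal sketch that builds one table and two independent views over it. The class name ViewSketch, the Orders table, and its sample values are assumptions made for illustration; only the DataView members named in the text (RowFilter and Sort) are used.

```csharp
using System.Data;

// Hypothetical example class; the table and its values are invented.
public static class ViewSketch
{
    public static DataTable BuildOrders()
    {
        DataTable dt = new DataTable("Orders");
        dt.Columns.Add("OrderID", typeof(int));
        dt.Columns.Add("Quantity", typeof(int));
        dt.Rows.Add(10248, 12);
        dt.Rows.Add(10249, 40);
        dt.Rows.Add(10250, 35);
        dt.Rows.Add(10251, 6);
        return dt;
    }

    // Returns a filtered, sorted view; the underlying table is not modified.
    public static DataView LargeOrders(DataTable dt)
    {
        DataView view = new DataView(dt);
        view.RowFilter = "Quantity > 10";   // narrows the rows in the view
        view.Sort = "Quantity DESC";        // orders the rows in the view
        return view;
    }
}
```

A second, unfiltered DataView created over the same table still exposes all four rows, which is the point of the model: each view keeps its own criteria while the table remains the single copy of the data.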


Linking Tables and Views

The link between the DataTable object and the DataView object is typically established at creation time through the constructor, as shown here:

    public DataView(DataTable table);

However, you could also create a new view and associate it with a table at a later time using the DataView object's Table property, as in the following example:

    DataView dv = new DataView();
    dv.Table = dataSet.Tables["Employees"];

You can also obtain a DataView object from any table. In fact, the DefaultView property of a DataTable object simply returns a DataView object initialized to work on that table, as shown here:

    DataView dv = dt.DefaultView;

Initially, the view is unfiltered, and the index array contains as many elements as there are rows in the table.

Getting Views of Rows

The contents of a DataView object can be scrolled through a variety of programming interfaces, including collections, lists, and enumerators. The GetEnumerator method in particular ensures that you can walk your way through the records in the view using the familiar foreach statement.

The following code shows how to access all the rows that fit into the view:

    DataView myView = new DataView(table);
    foreach(DataRowView rowview in myView)
    {
        // Dereferences the DataRow object
        DataRow row = rowview.Row;
        ⋮
    }

When client applications access a particular row in the view, the class expects to find it in an internal rows cache.
If the rows cache is not empty, the specified row is returned to the caller via an intermediate DataRowView object. The DataRowView object is a wrapper for the DataRow object that contains the actual data. You access row data through the Row property. If the rows cache is empty, the DataView class fills it with an array of DataRowView objects, each of which references an original DataRow object. The rows cache can be empty either because it has not yet been used or because the sort expression or the filter string has been changed in the meantime.

Serializing DataView Objects

The AdoNetXmlSerializer class also provides overloaded methods to serialize a DataView object. You build a copy of the original DataTable object with all the rows (and only those rows) that match the view, as shown here:

    public static void WriteDataView(DataView dv, string outputFile,
        XmlWriteMode mode)
    {
        DataTable dt = CreateTempTable(dv);
        WriteDataTable(dt, outputFile, mode);
    }

You create a temporary DataTable object and then serialize it to XML using the previously defined methods. The structure of the internal CreateTempTable routine is fairly simple, as shown here:

    private static DataTable CreateTempTable(DataView dv)
    {
        // Create a temporary DataTable with the same structure
        // as the original
        DataTable dt = dv.Table.Clone();

        // Fill the DataTable with all the rows in the view
        foreach(DataRowView rowview in dv)
            dt.ImportRow(rowview.Row);
        return dt;
    }

The ImportRow method creates a new row object in the context of the table. Like many other ADO.NET objects, the DataRow object can't be referenced by two container objects at the same time. Using ImportRow is logically equivalent to cloning the row and then adding the clone as a reference to the table. Figure 9-8 shows a DataView object saved to XML.

Figure 9-8: Saving a DataView object to XML.

Binary Data Serialization


There are basically two ways to serialize ADO.NET objects: using the object's own XML interface, and using .NET Framework data formatters. So far, we have reviewed the DataSet object's methods for serializing data to XML, and you've learned how to persist other objects like DataTable and DataView to XML. Let's look now at what's needed to serialize ADO.NET objects using the standard .NET Framework data formatters.

The big difference between methods like WriteXml and .NET Framework data formatters is that in the former case, the object itself controls its own serialization process. When .NET Framework data formatters are involved, any object can behave in one of two ways. The object can declare itself as serializable (using the Serializable attribute) and passively let the formatter extrapolate any significant information that needs to be serialized. This type of object serialization uses .NET Framework reflection to list all the properties that make up the state of an object.

The second behavior entails the object implementing the ISerializable interface, thus passing the formatters the data to be serialized. After this step, however, the object no longer controls the process. A class that neither is marked with the Serializable attribute nor implements the ISerializable interface can't be serialized.
No ADO.NET class declares itself as serializable, and only DataSet and DataTable implement the ISerializable interface. For example, you can't serialize a DataColumn or a DataRow object to any .NET Framework formatter.

Ordinary .NET Framework Serialization

The .NET Framework comes with two predefined formatter objects defined in the System.Runtime.Serialization.Formatters namespace: the binary formatter and the SOAP formatter. The classes that provide these two serializers are BinaryFormatter and SoapFormatter. The former is more efficient, is faster, and produces more compact output. The latter is designed for interoperability and generates a SOAP-based description of the class that can be easily consumed on non-.NET platforms.

Note: A formatter object is merely a class that implements the IFormatter interface to support the serialization of a graph of objects. The SoapFormatter and BinaryFormatter classes also implement the IRemotingFormatter interface to support remote procedure calls across AppDomains. No technical reasons prevent you from implementing custom formatters. In most cases, however, you only need to tweak the serialization process of a given class instead of creating an extension to the general serialization mechanism. Quite often, this objective can be reached simply by implementing the ISerializable interface.

The following code shows what's needed to serialize a DataTable object using a binary formatter:

    BinaryFormatter bf = new BinaryFormatter();
    StreamWriter swDat = new StreamWriter(outputFile);
    bf.Serialize(swDat.BaseStream, dataTable);
    swDat.Close();

The Serialize method causes the formatter to flush the contents of an object to a binary stream. The Deserialize method does the reverse: it reads from a previously created binary stream, rebuilds the object, and returns it to the caller, as shown here:

    DataTable dt = new DataTable();
    BinaryFormatter bf = new BinaryFormatter();
    StreamReader sr = new StreamReader(sourceFile);
    dt = (DataTable) bf.Deserialize(sr.BaseStream);
    sr.Close();

When you run this code, something surprising happens. Have you ever tried to serialize a DataTable object, or a DataSet object, using the binary formatter? If so, you certainly got a binary file, but with a ton of XML in it. Unfortunately, XML data in serialized binary files only makes them huge, without the portability and readability advantages that XML normally offers. As a result, deserializing such files might take a while to complete, usually seconds.

There is an architectural reason for this odd behavior. The DataTable and DataSet classes implement the ISerializable interface, thus making themselves responsible for the data being serialized. The ISerializable interface consists of a single method, GetObjectData, whose output the formatter takes and flushes into the output stream.

Can you guess what happens next? By design, the DataTable and DataSet classes describe themselves to serializers using an XML DiffGram document. The binary formatter takes this rather long string and appends it to the stream.
In this way, DataSet and DataTable objects are always remoted and transferred using XML, which is great. Unfortunately, if you are searching for a more compact representation of persisted tables, the ordinary .NET Framework run-time serialization for ADO.NET objects is not for you. Let's see how to work around it.

Custom Binary Serialization

To optimize the binary representation of a DataTable object (or a DataSet object), you have no other choice than mapping the class to an intermediate object whose serialization process is under your control. The entire operation is articulated into a few steps:

1. Create a custom class, and mark it as serializable (or, alternatively, implement the ISerializable interface).
2. Copy the key properties of the DataTable object to the members of the class. Which members you actually map is up to you. However, the list must certainly include the column names and types, plus the rows.
3. Serialize this new class to the binary formatter, and when deserialization occurs, use the restored information to build a new instance of the DataTable object.

Let's analyze these steps in more detail.

Creating a Serializable Ghost Class

Assuming that you need to persist only columns and rows of a DataTable object, a ghost class can be quickly created. In the following example, this ghost class is named GhostDataTable:

    [Serializable]
    public class GhostDataTable
    {
        public GhostDataTable()
        {
            colNames = new ArrayList();
            colTypes = new ArrayList();
            dataRows = new ArrayList();
        }

        public ArrayList colNames;
        public ArrayList colTypes;
        public ArrayList dataRows;
    }

This class consists of three serializable ArrayList objects that contain column names, column types, and data rows.

The serialization process now involves the GhostDataTable class rather than the DataTable object, as shown here:

    private void BinarySerialize(DataTable dt, string outputFile)
    {
        BinaryFormatter bf = new BinaryFormatter();
        StreamWriter swBin = new StreamWriter(outputFile);

        // Instantiate and fill the worker class
        GhostDataTable ghost = new GhostDataTable();
        CreateTableGraph(dt, ghost);

        // Serialize the object
        bf.Serialize(swBin.BaseStream, ghost);
        swBin.Close();
    }

The key step here is how the DataTable object is mapped to the GhostDataTable class. The mapping takes place in the folds of the CreateTableGraph routine.

Mapping Table Information

The CreateTableGraph routine populates the colNames array with column names and the colTypes array with the names of the data types, as shown in the following code. For each row, the dataRows array receives an array that represents all the values in that row.

    void CreateTableGraph(DataTable dt, GhostDataTable ghost)
    {
        // Insert column information (names and types)
        foreach(DataColumn col in dt.Columns)
        {
            ghost.colNames.Add(col.ColumnName);
            ghost.colTypes.Add(col.DataType.FullName);
        }

        // Insert rows information
        foreach(DataRow row in dt.Rows)
            ghost.dataRows.Add(row.ItemArray);
    }

The DataRow object's ItemArray property is an array of objects. It turns out to be particularly handy, as it lets you handle the contents of the entire row as a single, monolithic piece of data.
Internally, the get accessor of ItemArray is implemented as a simple loop that reads and stores one column after the next. The set accessor is even more valuable, because it automatically groups all the changes in a pair of BeginEdit/EndEdit calls and fires column-changed events as appropriate.

Sizing Up Serialized Data

The sample application shown in Figure 9-9 demonstrates that a DataTable object serialized using a ghost class can be up to 80 percent smaller than an identical object serialized the standard way.

Figure 9-9: The difference between ordinary and custom binary serialization.

In particular, consider the DataTable object resulting from the following query:

    SELECT * FROM [Order Details]

The table contains five columns and 2155 records. It would take up half a megabyte if serialized to the binary formatter as a DataTable object. By using an intermediate ghost class, the size of the output is 83 percent less. Looking at things the other way round, the output of the standard serialization process is about 490 percent larger than the output you obtain using the ghost class.

Of course, not all cases give you such an impressive result. In all the tests I ran on the Northwind database, however, I got an average 60 percent reduction. The more the table content consists of numbers, the more space you save. The more BLOB fields you have, the less space you save. Try running the following query, in which photo is the BLOB field that contains an employee's picture:

    SELECT photo FROM employees

The ratio of savings here is only 25 percent and represents the bottom end of the Northwind test results.
Interestingly, if you add only a couple of traditional fields to the query, the ratio increases to 28 percent. The application shown in Figure 9-9 (included in this book's sample files) is a useful tool for fine-tuning the structure of the table and the queries for better serialization results.

Deserializing Data

Once the binary data has been deserialized, you hold an instance of the ghost class that must be transformed back into a usable DataTable object. Here's how the sample application accomplishes this:

    DataTable BinaryDeserialize(string sourceFile)
    {
        BinaryFormatter bf = new BinaryFormatter();
        StreamReader sr = new StreamReader(sourceFile);
        GhostDataTable ghost =
            (GhostDataTable) bf.Deserialize(sr.BaseStream);
        sr.Close();

        // Rebuild the DataTable object
        DataTable dt = new DataTable();

        // Add columns
        for(int i=0; i


Loading DataSet Objects from XML

The contents of an ADO.NET DataSet object can be loaded from an XML stream or document, for example, from an XML stream previously created using the WriteXml method. To fill a DataSet object with XML data, you use the ReadXml method of the class.

The ReadXml method fills a DataSet object by reading from a variety of sources, including disk files, .NET Framework streams, or instances of XmlReader objects. In general, the ReadXml method can process any type of XML file, but of course the nontabular and rather irregularly shaped structure of XML files might create some problems and produce unexpected results when the files are rendered in terms of rows and columns.

In addition, the ReadXml method is extremely flexible and lets you load data according to a particular schema or even infer the schema from the data.

Building DataSet Objects

The ReadXml method has several overloads, all of which are similar. They take the XML source plus an optional XmlReadMode value as arguments, as shown here:

    public XmlReadMode ReadXml(Stream, XmlReadMode);
    public XmlReadMode ReadXml(string, XmlReadMode);
    public XmlReadMode ReadXml(TextReader, XmlReadMode);
    public XmlReadMode ReadXml(XmlReader, XmlReadMode);

The ReadXml method creates the relational schema for the DataSet object according to the read mode specified and regardless of whether a schema already exists in the DataSet object.
The following code snippet is typical code you would use to load a DataSet object from XML:

    StreamReader sr = new StreamReader(fileName);
    DataSet ds = new DataSet();
    ds.ReadXml(sr);
    sr.Close();

The return value of the ReadXml method is an XmlReadMode value that indicates the modality used to read the data. This information is particularly important when no reading mode is specified or when the automatic default mode is set. In either case, you don't really know how the schema for the target DataSet object has been generated.

Modes of Reading

Table 9-4 summarizes the reading options available for use with the ReadXml method; allowable options are grouped in the XmlReadMode enumeration.

Table 9-4: XmlReadMode Enumeration Values

Read Mode | Description
Auto | Default option; indicates the most appropriate way of reading by looking at the source data.
DiffGram | Reads a DiffGram and adds the data to the current schema. If no schema exists, an exception is thrown. Information that doesn't match the existing schema is discarded.
Fragment | Reads and adds XML fragments until the end of the stream is reached.
IgnoreSchema | Ignores any in-line schema that might be available and relies on the DataSet object's existing schema. If no schema exists, no data is loaded. Information that doesn't match the existing schema is discarded.
InferSchema | Ignores any in-line schema and infers the schema from the XML data. If the DataSet object already contains a schema, the current schema is extended. An exception is thrown in the case of conflicting table namespaces and column data types.
ReadSchema | Reads any in-line schema and loads both data and schema. An existing schema is extended with new columns and tables, but an exception is thrown if a given table already exists in the DataSet object.

The default read mode is XmlReadMode.Auto. When this mode is set, or when no read mode has been explicitly set, the ReadXml method examines the XML source and chooses the most appropriate option.

The first possibility checked is whether the XML data is a DiffGram. If it is, the XmlReadMode.DiffGram mode is used. If the XML data is not a DiffGram but references an XDR or an XSD schema, the InferSchema mode is used. ReadSchema is used only if the document contains an in-line schema. In both the InferSchema and ReadSchema cases, the ReadXml method checks first for an XDR (referenced or in-line) schema and then for an XSD schema.
If the DataSet object already has a schema, the read mode is set to IgnoreSchema. Finally, if no schema information can be found, the InferSchema mode is used.

Reading XML Data

Although ReadXml supports various types of sources (streams, files, and text readers), the underlying routine used in all cases reads data using an XML reader. The following pseudocode illustrates the internal architecture of the ReadXml overloads:

    public XmlReadMode ReadXml(Stream stream)
    {
        return ReadXml(new XmlTextReader(stream));
    }

    public XmlReadMode ReadXml(TextReader reader)
    {
        return ReadXml(new XmlTextReader(reader));
    }

    public XmlReadMode ReadXml(string fileName)
    {
        return ReadXml(new XmlTextReader(fileName));
    }
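
The value returned by ReadXml can be observed directly. The sketch below is an illustration under stated assumptions: the class name ReadModeSketch, the Staff/Employee document, and the helper names are invented, and the code relies only on the ReadXml(TextReader) and WriteXml(TextWriter, XmlWriteMode) overloads of the DataSet class. It loads a schemaless document in automatic mode and then reloads the same data after an in-line schema has been written with it.

```csharp
using System.Data;
using System.IO;

// Hypothetical example class; the document content is invented.
public static class ReadModeSketch
{
    // A two-row document with no schema information at all.
    public const string PlainXml =
        "<Staff>" +
        "<Employee><ID>1</ID><LastName>Davolio</LastName></Employee>" +
        "<Employee><ID>2</ID><LastName>Fuller</LastName></Employee>" +
        "</Staff>";

    // Loads the XML in automatic mode and reports the mode actually used.
    public static XmlReadMode LoadAuto(DataSet ds, string xml)
    {
        using (StringReader reader = new StringReader(xml))
        {
            return ds.ReadXml(reader);
        }
    }

    // Re-serializes a DataSet with its schema written in-line.
    public static string WithInlineSchema(DataSet ds)
    {
        using (StringWriter writer = new StringWriter())
        {
            ds.WriteXml(writer, XmlWriteMode.WriteSchema);
            return writer.ToString();
        }
    }
}
```

With an empty DataSet, the first load is expected to report InferSchema, because no schema can be found anywhere; loading the output of WithInlineSchema into another empty DataSet is expected to report ReadSchema.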


The XML source is read one node after the next until the end is reached. The information read is transformed into a DataRow object that is added to a DataTable object. Of course, the layout of both the DataTable object and the DataRow object is determined based on the schema read or inferred.

Merging DataSet Objects

When loading the contents of XML sources into a DataSet object, the ReadXml method does not merge new and existing rows whose primary key information matches. To merge an existing DataSet object with a DataSet object just loaded from an XML source, you must proceed in a particular way.

First you create a new DataSet object and fill it with the XML data. Next you merge the two objects by calling the Merge method on either object, as shown in the following code. The Merge method is used to merge two DataSet objects that have largely similar schemas.

    target.Merge(source);

The target DataSet object is the object on which the merge occurs. The source DataSet object provides the information to merge but is not affected by the operation. Determining which DataSet object must be the target and which will be the source is up to you and depends on the data your application needs to obtain. During the merging, the rows that get overwritten are those with matching primary keys.

An alternative way to merge existing DataSet objects with contents read from XML is through the DiffGram format.
Loading a DiffGram using ReadXml will automatically merge rows that have matching primary keys. When using the XmlReadMode.DiffGram format, the target DataSet object must have the same schema as the DiffGram; otherwise, the merge operation fails and an exception is thrown.

Reading Schema Information

The XmlReadMode.IgnoreSchema option causes the ReadXml method to ignore any referenced or in-line schema. The data is loaded into the existing DataSet schema, and any data that does not fit is discarded. If no schema exists in the DataSet object, no data will be loaded. Of course, an empty DataSet object has no schema information, as shown in the following listing. If the XML source is in the DiffGram format, the IgnoreSchema option has the same effect as XmlReadMode.DiffGram.

    // No schema in the DataSet, no data will be loaded
    DataSet ds = new DataSet();
    StreamReader sr = new StreamReader(fileName);
    ds.ReadXml(sr, XmlReadMode.IgnoreSchema);

Reading In-Line Schemas

The XmlReadMode.ReadSchema option works only with in-line schemas and does not recognize external references to schema files. The ReadSchema mode causes the ReadXml method to add new tables to the DataSet object, but if any tables defined in the in-line schema already exist in the DataSet object, an exception is thrown.
You can't use the ReadSchema option to change the schema of an existing table.

If the DataSet object does not contain a schema (that is, the DataSet object is empty) and there is no in-line schema, no data is read or loaded. ReadXml can read only in-line schemas defined using the XDR or XSD schema language. DTD documents are not supported.
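
The IgnoreSchema rule described earlier (data is loaded only into a schema that already exists, and anything that does not fit is dropped) can be checked with a small sketch. The class name IgnoreSchemaSketch, the Staff/Employee document, and the one-column schema are assumptions made for illustration; the DataSet is deliberately given the same name as the document's root element.

```csharp
using System.Data;
using System.IO;

// Hypothetical example class; the document and schema are invented.
public static class IgnoreSchemaSketch
{
    public const string Xml =
        "<Staff>" +
        "<Employee><ID>1</ID><LastName>Davolio</LastName></Employee>" +
        "</Staff>";

    // Loads the sample document into the given DataSet, ignoring any schema
    // in the source and relying on whatever schema the DataSet already has.
    public static DataSet Load(DataSet target)
    {
        using (StringReader reader = new StringReader(Xml))
        {
            target.ReadXml(reader, XmlReadMode.IgnoreSchema);
        }
        return target;
    }

    // Builds a DataSet whose schema knows the ID column but not LastName.
    public static DataSet WithPartialSchema()
    {
        DataSet ds = new DataSet("Staff");
        DataTable employee = ds.Tables.Add("Employee");
        employee.Columns.Add("ID", typeof(int));
        return ds;
    }
}
```

Loading into a new, empty DataSet is expected to leave it without tables, whereas loading into the DataSet returned by WithPartialSchema is expected to produce one Employee row that carries the ID value and silently drops LastName.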


Reading External Schemas

An XML source that imports XDR or XSD schema information from an external resource can't be handled through ReadSchema. External references are resolved through the InferSchema option by inferring the schema from the external file.

The InferSchema option is generally quite slow because it has to determine the structure by reading the source. With externally referenced schemas, however, the procedure is considerably faster. The ReadXml method simply reads the schema information from the given URL in the same way as the ReadXmlSchema method does; no true inferential process is started.

By design, external schema resolution is implemented in the InferSchema reading mode rather than in ReadSchema. When called to operate in automatic mode on a file that references an external schema, the ReadXml method returns InferSchema. In turn, ReadSchema does not work if called to work on external schemas.

The ReadSchema and InferSchema options are complementary. The former reads only in-line schema and ignores external references. The latter does the reverse, ignoring any in-line schema that might be present in the source.

Reading Fragments

When the XmlReadMode.Fragment option is set, the DataSet object is loaded from an XML fragment. An XML fragment is a valid piece of XML that identifies elements, attributes, and documents.
The XML fragment for an element is the markup text that fully qualifies the XML element (node, CDATA, processing instruction, or comment). The fragment for an attribute is the Value attribute; the fragment for a document is the entire content set.

When the XML data is a fragment, the root level rules for well-formed XML documents are not applied. Fragments that match the existing schema are appended to the appropriate tables, and fragments that do not match the schema are discarded. ReadXml reads from the current position to the end of the stream. The XmlReadMode.Fragment option should not be used to populate an empty, and therefore schemaless, DataSet object.

Inferring Schema Information

When the ReadXml method works with the XmlReadMode.InferSchema option set, the data is loaded only after the schema has been completely read from an external source or after the schema has been inferred. Existing schemas are extended by adding new tables or by adding new columns to existing tables, as appropriate.

In addition to the ReadXml method, you can use the DataSet object's InferXmlSchema method to load the schema from a specified XML file into the DataSet object. You can control, to some extent, the XML elements processed during the schema inference operation.
The signature of the InferXmlSchema method allows you to specify an arrayof namespaces whose elements will be excluded from <strong>in</strong>ference, as shown here:void InferXmlSchema(Str<strong>in</strong>g fileName, Str<strong>in</strong>g[] rgNamespace);The InferXmlSchema method creates an <strong>XML</strong> DOM representation of the <strong>XML</strong> sourcedata and then walks its way through the nodes, creat<strong>in</strong>g tables and columns asappropriate.A Sample ApplicationTo demonstrate the various effects of ReadXml and other read<strong>in</strong>g modes, I've createda sample application and a few sample <strong>XML</strong> documents. Us<strong>in</strong>g the application isstraight<strong>for</strong>ward. You select an <strong>XML</strong> file, and the code attempts to load it <strong>in</strong>to a DataSetobject us<strong>in</strong>g the XmlReadMode option you specify. The results are shown <strong>in</strong> a DataGridcontrol. As shown <strong>in</strong> Figure 9-10, the bottom text box displays the schema of theDataSet object as read or <strong>in</strong>ferred by the read<strong>in</strong>g method.351
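The reading modes discussed so far can be tried out with a minimal sketch like the one below. It is not the book's sample application: the class name SchemaProbe, its method names, and the sample XML are assumptions made for illustration. It loads a document with an explicit XmlReadMode value and, separately, infers only the schema while passing an array of excluded namespaces.

```csharp
using System.Data;
using System.IO;

public static class SchemaProbe
{
    // Hypothetical sample document in ADO.NET normal form.
    public const string SampleXml =
        "<NewDataSet>" +
        "<Employees><employeeid>1</employeeid><lastname>Davolio</lastname></Employees>" +
        "<Employees><employeeid>2</employeeid><lastname>Fuller</lastname></Employees>" +
        "</NewDataSet>";

    // Loads schema and data using the requested reading mode.
    public static DataSet Load(string xml, XmlReadMode mode)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            ds.ReadXml(reader, mode);
        }
        return ds;
    }

    // Infers the schema only (no rows are loaded), skipping elements
    // that belong to any of the listed namespaces.
    public static DataSet InferSchemaOnly(string xml, string[] excludedNamespaces)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            ds.InferXmlSchema(reader, excludedNamespaces);
        }
        return ds;
    }
}
```

Calling SchemaProbe.Load(SchemaProbe.SampleXml, XmlReadMode.InferSchema).GetXmlSchema() returns the inferred schema as text, which is roughly what the bottom text box of the sample application is described as showing.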


Figure 9-10: ReadXml correctly recognizes an XML document in ADO.NET normal form.

In Figure 9-10, the selected XML document is expressed in the ADO.NET normal form (that is, the default schema generated by WriteXml), and the ReadXml method handles it correctly.

Not all XML sources smoothly fill out a DataSet object, however. Let's consider what happens with the following XML document:

    XML Core Classes
    XML-related Technologies
    XML and ADO.NET
    Remoting and Web services
    Miscellaneous and Samples

This document is not in ADO.NET normal form even though it contains information that can easily fit in a table of data. As you can see in Figure 9-11, the .NET Framework inference algorithm identifies three distinct tables in this document: class, days, and day. Although acceptable, this is probably not what one would expect.


Figure 9-11: The schema that ReadXml infers from the specified, nonstandard XML file.

I would read this information as a single table, day, contained in a DataSet object. My interpretation is a logical rather than an algorithmic reading of the data, however. The final schema consists of three connected tables, shown in Figure 9-12, of which the first two tables simply contain a foreign key field that normalizes the entire data structure.
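The three-table outcome can be reproduced with a small sketch. The document below is hypothetical: the element names class, days, and day come from the text, but the attributes and their values are invented for illustration and are not the book's sample file. Under the inference rules, an element that carries attributes or child elements is treated as a table, so each nesting level becomes its own DataTable.

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;

public static class InferenceDemo
{
    // Hypothetical document; attribute names and values are assumptions.
    public const string ClassXml =
        "<class title=\"Sample\">" +
        "<days total=\"2\">" +
        "<day id=\"1\">XML Core Classes</day>" +
        "<day id=\"2\">XML-related Technologies</day>" +
        "</days>" +
        "</class>";

    // Returns the names of the tables that schema inference creates.
    public static List<string> InferTableNames(string xml)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            ds.ReadXml(reader, XmlReadMode.InferSchema);
        }

        List<string> names = new List<string>();
        foreach (DataTable table in ds.Tables)
        {
            names.Add(table.TableName);
        }
        return names;
    }
}
```

A reader who wants the single day table instead would typically define the schema up front and load the data with a mode that does not infer anything.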


Figure 9-12: How Microsoft Visual Studio .NET renders the XML schema inferred by ReadXml.

Choosing the Correct Reading Mode

If you save the contents of a DataSet object to XML and then read it back via ReadXml, pay attention to the reading mode you choose. Each reading mode has its own set of features, and to the extent that it is possible, you should exploit those features.

Although it is fairly easy to use, the XmlReadMode.Auto mode is certainly not the most effective way to read XML data into a DataSet object. Avoid using this mode as much as possible, and instead use a more direct, data-specific option.

Binding XML to Data-Bound Controls

XML data sources are not in the official list of allowable data sources for the .NET Framework data-bound client and server controls. Many .NET Framework classes can be used as data sources, not just those dealing with database contents. In general, any object that exposes the ICollection interface is a potential source for data binding. As a result, you can bind a Microsoft Windows Forms data-bound control or a Web Forms data-bound control to any of the following data structures:

• In-memory .NET Framework collection classes, including arrays, dictionaries, sorted and linked lists, hash tables, stacks, and queues
• User-defined data structures, as long as the structure exposes ICollection or one of its child interfaces, such as IList
• Database-oriented classes such as DataTable and DataSet
• Views of data represented by the DataView class

You can't directly bind XML documents, however, unless you load XML data in one of the aforementioned classes. Typically, you load XML data into a DataTable or a DataSet object. This operation can be accomplished in a couple of ways. You can load the XML document into a DataSet object using the ReadXml method. Alternatively, you can load the XML document into an instance of the XmlDataDocument class and access the internally created DataSet object.

Loading from Custom Readers

In Chapter 2, we built a custom XML reader for loading CSV files into a DataTable object. As mentioned, however, that reader is not fully functional and does not work through ReadXml. Let's see how to rewrite the class to make it render the CSV content as a well-formed XML document.

Our target XML schema for the CSV document would be the following:

    ⋮

Of course, this is not the only schema you can choose. I have chosen it because it is both compact and readable. If you decide to use another schema, the code for the reader should be changed accordingly. The target XML schema is a crucial aspect, as it specifies how the Read method should be implemented. Figure 9-13 illustrates the behavior of the Read method.

Figure 9-13: The process of returning an XML schema for a CSV file.
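To make the idea of rendering CSV content as well-formed XML concrete, here is a minimal sketch that does the conversion eagerly with an XmlWriter instead of a custom reader. It is not the book's CSV reader. The element names csv and row, the fallback attribute names col1, col2, and so on, and the class name CsvToXml are all assumptions; the layout simply follows the description of one node per line of text with one attribute per token.

```csharp
using System.Data;
using System.IO;
using System.Text;
using System.Xml;

public static class CsvToXml
{
    // Renders CSV text as XML: one root element, one element per line,
    // one attribute per comma-separated token.
    public static string Render(string csvText, bool hasColumnHeaders)
    {
        StringBuilder output = new StringBuilder();
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;

        using (StringReader lines = new StringReader(csvText))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            string[] headers = null;
            if (hasColumnHeaders)
            {
                string headerLine = lines.ReadLine();
                if (headerLine != null) headers = headerLine.Split(',');
            }

            writer.WriteStartElement("csv");          // assumed root name
            string line;
            while ((line = lines.ReadLine()) != null)
            {
                if (line.Length == 0) continue;       // skip blank lines
                string[] tokens = line.Split(',');
                writer.WriteStartElement("row");      // assumed row name
                for (int i = 0; i < tokens.Length; i++)
                {
                    // Header names must be valid, unique XML names.
                    string name = (headers != null && i < headers.Length)
                        ? headers[i].Trim()
                        : "col" + (i + 1);
                    writer.WriteAttributeString(name, tokens[i]);
                }
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    // Feeds the rendered XML to ReadXml, as a custom reader ultimately would.
    public static DataSet Load(string csvText, bool hasColumnHeaders)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(Render(csvText, hasColumnHeaders)))
        {
            ds.ReadXml(reader, XmlReadMode.InferSchema);
        }
        return ds;
    }
}
```

The sketch trades the streaming behavior of a true XmlReader for brevity: the whole file is converted in memory before ReadXml sees it, which a custom reader avoids.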


The reader tracks the current node and sets internal variables to influence the next node to be returned. For example, when returning an Element node, the reader records that there's an open node to close. Given this extremely simple schema, a Boolean member is enough to implement this behavior. In fact, no embedded nodes are allowed in a CSV file. In more complex scenarios, you might want to use a stack object.

The Read Method

When a new node is returned, the reader updates the node's depth and state. In addition, the reader stores fresh information in node-specific properties such as Name, NodeType, and Value, as shown here:

    public override bool Read()
    {
        if (m_readState == ReadState.Initial)
        {
            if (m_hasColumnHeaders)
            {
                string m_headerLine = m_fileStream.ReadLine();
                m_headerValues = m_headerLine.Split(',');
            }

            SetupRootNode();
            m_readState = ReadState.Interactive;
            return true;
        }

        if (m_readState != ReadState.Interactive)
            return false;

        // Return an end tag if there's one opened
        if (m_mustCloseRow)
        {
            SetupEndElement();
            return true;
        }

        // Return an end tag if the document must be closed
        if (m_mustCloseDocument)
        {
            m_readState = ReadState.EndOfFile;
            return false;
        }

        // Open a new tag
        m_currentLine = m_fileStream.ReadLine();
        if (m_currentLine != null)
            m_readState = ReadState.Interactive;
        else
        {
            SetupEndRootNode();
            return true;
        }

        // Populate the internal structure representing the current element
        m_tokenValues.Clear();
        string[] tokens = m_currentLine.Split(',');
        for (int i=0; i
        ⋮
        m_currentAttributeIndex = -1;
    }

When traversing a document using an XML reader, the ReadXml method visits attributes in a loop and reads attribute values using ReadAttributeValue.

Setting Attributes

Attributes are not read through calls made to the Read method. A reader provides ad hoc methods to access attributes either randomly or sequentially. When one of these methods is called (say, MoveToNextAttribute), the reader calls an internal method that refreshes the state so that Name and NodeType can now point to the correct content, as shown here:

    private void SetupAttribute()
    {
        m_nodeType = XmlNodeType.Attribute;
        m_name = m_tokenValues.Keys[m_currentAttributeIndex];
        m_value = m_tokenValues[m_currentAttributeIndex].ToString();
        if (m_parentNode == "")
            m_parentNode = m_name;
    }

A node is associated with a line of text read from the CSV file. Each token of information becomes an attribute, and attributes are stored in a collection of name/value pairs. (This part of the architecture was described in detail in Chapter 2.) The m_parentNode member tracks the name of the element acting as the parent of the current attribute. Basically, it represents the node to move to when MoveToElement is called. Again, in this rather simple scenario, a string is sufficient to identify the parent node of an attribute.
For more complex XML layouts, you might need to use a custom class.

Reading Attributes Using ReadXml

The ReadXml method accesses all the attributes of an element using a loop like this:

    while (reader.MoveToNextAttribute())
    {
        // Use ReadAttributeValue to read attribute values
    }
    ⋮

To load XML data into a DataSet object, the ReadXml method uses an XML loader class that basically reads the source and builds an XmlDocument object. This document is then parsed, and DataRow and DataTable objects are created and added to the target DataSet object. While building the temporary XmlDocument object, the loader scrolls attributes using MoveToNextAttribute and reads values using ReadAttributeValue.

ReadAttributeValue does not really return the value of the current attribute. This method, in fact, simply returns a Boolean value indicating whether there's more to read about the attribute. By using ReadAttributeValue, however, you can read through the text and entity reference nodes that make up the attribute value. Let's say that this is a more general way to read the content of an attribute; certainly, it is the method that ReadXml uses indirectly. To let ReadXml read the value of an attribute, you must provide a meaningful implementation for ReadAttributeValue. In particular, if the current node is an attribute, your implementation should set the new node type to XmlNodeType.Text, increase the depth by 1, and return true.

    public override bool ReadAttributeValue()
    {
        if (m_nodeType == XmlNodeType.Attribute)
        {
            m_nodeType = XmlNodeType.Text;
            m_depth ++;
            return true;
        }

        return false;
    }

ReadAttributeValue parses the attribute value into one or more Text, EntityReference, or EndEntity nodes. This means that the XML loader won't be able to read the value unless you explicitly set the node type to Text. (We don't support references in our sample CSV reader.) At this point, the loader will ask the reader for the value of a node of type Text. Our implementation of the Value property does not distinguish between node types, but assumes that Read and other move methods (for example, MoveToNextAttribute) have already stored the correct value in Value. This is just what happens. In fact, the attribute value is read and stored in Value right after positioning on the attribute, before ReadAttributeValue is called. In other cases, you might want to check the node type in the Value property's get accessor prior to returning a value.

In general, understanding the role of ReadAttributeValue and integrating this method with the rest of the code is key to writing effective custom readers. Nevertheless, as you saw in Chapter 2, if you don't care about ReadXml support, you can write XML readers even simpler than this.
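The attribute-reading pattern described above can be observed with a system-provided reader. The following is a minimal sketch, not the loader's actual code: the class and method names are assumptions, and it uses only the standard XmlReader members MoveToContent, MoveToNextAttribute, ReadAttributeValue, and MoveToElement. The attribute name is captured before the value is read, because ReadAttributeValue repositions the reader on the Text nodes that make up the value.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class AttributeWalker
{
    // Collects the attributes of the document element, reading each value
    // node by node through ReadAttributeValue.
    public static Dictionary<string, string> ReadRootAttributes(string xml)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();                  // position on the document element
            while (reader.MoveToNextAttribute())
            {
                string name = reader.Name;           // read the name while on the attribute
                StringBuilder value = new StringBuilder();

                // Each successful call moves to the next node inside the value.
                while (reader.ReadAttributeValue())
                {
                    if (reader.NodeType == XmlNodeType.Text)
                    {
                        value.Append(reader.Value);
                    }
                }
                result[name] = value.ToString();
            }
            reader.MoveToElement();                  // return to the owning element
        }
        return result;
    }
}
```

A custom reader that returns false from ReadAttributeValue on the first call would leave every value in this loop empty, which is the failure mode the text warns about.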
But the specialness of an XML reader is precisely that you can use it with any method that accepts an XML reader! So dropping the support for the DataSet object's ReadXml method would be a significant loss.

Note: How ReadXml works with custom readers is in no way different from the way it works with system-provided XML readers. However, understanding how ReadXml works with XML readers can help you to build effective and functional custom XML readers.

Conclusion

In ADO.NET, XML is much more than a simple output format for serializing data. You can use XML to streamline the entire contents of a DataSet object, but you can also choose the actual XML schema and control the structure of the resulting XML document.

There are several ways to persist a DataSet object's contents. You can create a snapshot of the currently stored data using a standard layout referred to here as the ADO.NET normal form. This data format can include schema information or not. Saving to the ADO.NET normal form does not preserve the state of the DataSet object and discards any information about the previous state of each row. If you want stateful persistence, resort to the DiffGram XML format. DiffGrams are the subject of Chapter 10.

In this chapter, we also examined how ADO.NET objects integrate with the standard .NET Framework run-time serialization mechanism. DataSet and DataTable objects always expose themselves to data formatters as XML DiffGrams, thus resulting in larger output files. We looked at a technique for reducing the size of the serialized data as much as 500 percent.

In ADO.NET, the deserialization process is tightly coupled with the inference engine, which basically attempts to algorithmically extract the layout of the XML stream. When loading XML into a DataSet object, the inference engine is involved more frequently than not. Because it is not a lightweight piece of code, you should always opt for a clear and effective reading mode and use the inference engine only when absolutely necessary.

As mentioned, in the next chapter we'll tackle a very special XML serialization format: the DiffGram. Among other things, the DiffGram format is the format used to deliver DataSet objects to other platforms through Web services. It is also ideal for setting up intermittent applications, that is, applications that can work both connected to and disconnected from the system.

Further Reading

Object serialization and ADO.NET are the key topics of this chapter. You'll find a lot of books out there covering ADO.NET from various perspectives. I recommend Microsoft ADO.NET, Core Reference, by David Sceppa (Microsoft Press, 2002).

It's more difficult to locate a book that provides thorough coverage of object serialization. Chapter 11 in Programming Microsoft Visual Basic .NET, Core Reference, by Francesco Balena (Microsoft Press, 2002), is an excellent and self-contained reference. If you want a shorter but complete overview, have a look at the following online article: http://msdn.microsoft.com/library/en-us/dnadvnet/html/vbnet09252001.asp.


Chapter 10: Stateful Data Serialization

Highlights

The DataSet object is designed with data disconnection in mind and with the assumption that optimistic concurrency is the default. In a multiple-user environment, optimistic concurrency occurs when applications do not lock a row while reading it. In contrast, a pessimistic form of concurrency involves locking rows at the data source to prevent users from modifying data in a way that affects other users. The DataSet object abstracts from the physical data source and qualifies itself as a superarray component capable of containing in-memory data.

As a container of disconnected data, the DataSet object accepts any sort of update to the rows it contains, so you can add new rows to any child tables, and you can update or delete existing rows. All these changes are persisted in memory and are not passed on to a persistent storage medium until an explicit update operation is conducted. Such an update requires a new connection and applies an array of changes in a single shot. For this reason, a DataSet update operation is often referred to as a batch update. When the batch update is completed, the DataSet in-memory changes are automatically committed to ensure consistency between the in-memory cache and the underlying storage medium.

As a result, each row of data stored in a DataSet object can have a history of changes that applications might be interested in knowing about and exploiting. All this information is irreversibly lost when you serialize a DataSet object to the Microsoft ADO.NET normal form using the standard option of the WriteXml method. (We examined this type of serialization in Chapter 9.)

An alternative XML schema for serializing the contents of a DataSet object is the DiffGram format. The DiffGram format of the WriteXml method can provide a stateful representation of the DataSet contents, as opposed to the stateless nature of the normal form. Because of its ability to preserve the state of the constituent rows, the DiffGram format is also used to remote a DataSet object through both the Microsoft .NET Framework remoting architecture and Web services. But let's start by taking a closer look at the structure of a DiffGram script.

Overview of the DiffGram Format

A DiffGram is an XML serialization format that includes both the original values and the current values of each row in each table. In particular, a DiffGram contains the current instance of rows with the up-to-date values, plus a section where all the original values for changed rows are grouped.

Each row is given a unique identifier that is used to track changes between the two sections of the DiffGram. This relationship looks a lot like a foreign key relationship. The following listing outlines the structure of a DiffGram:

    <diffgr:diffgram
        xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
        xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
      <DataSet>
        ...
      </DataSet>
      <diffgr:before>
        ...
      </diffgr:before>
      <diffgr:errors>
        ...
      </diffgr:errors>
    </diffgr:diffgram>

The <diffgr:diffgram> root node can have up to three children. The first is the DataSet object with its current contents, including newly added rows and modified rows but not deleted rows. The actual name of this subtree depends on the DataSetName property of the source DataSet object. If the DataSet object has no name, the subtree's root is NewDataSet.

The subtree rooted in the <diffgr:before> node contains enough information to restore the original state of all modified rows. For example, it still contains any row that has been deleted as well as the original contents of any modified row. All columns affected by any change are tracked in the <diffgr:before> subtree.

The last subtree is <diffgr:errors>, which contains information about any errors that have occurred in a particular row. The DataRow class provides a few methods and properties that programmers can use to set an error on any column in the row. Errors can be set at any time, not necessarily when the data is entered. For example, in distributed applications, it's typical for one user to create some data that another user has to validate. In this situation, the reviewer can set an error message on each column of a row to signal that something is wrong with that column. The Microsoft Windows Forms DataGrid control then detects any pending errors on displayed rows and marks them with a red exclamation point, providing the user with visual feedback that a particular column contains an error.

The following listing shows a sample DiffGram in which row 1 has been modified, row 2 has been deleted, row 3 has an error, and a new row has been added:
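A DiffGram with that mix of changes can be generated with a short sketch like the one below. The table, column names, and values are assumptions chosen for illustration rather than the author's sample data. The code accepts three rows as the baseline, then modifies the first, flags an error on the third, deletes the second, adds a fourth, and serializes the result with XmlWriteMode.DiffGram.

```csharp
using System.Data;
using System.IO;

public static class DiffGramDemo
{
    // Builds a DataSet holding one modified, one deleted, one errored,
    // and one newly added row. All names and values are illustrative.
    public static DataSet BuildChangedDataSet()
    {
        DataSet ds = new DataSet("MyDataSet");
        DataTable table = ds.Tables.Add("Employees");
        table.Columns.Add("employeeid", typeof(int));
        table.Columns.Add("lastname", typeof(string));

        table.Rows.Add(1, "Davolio");
        table.Rows.Add(2, "Fuller");
        table.Rows.Add(3, "Leverling");
        ds.AcceptChanges();                              // baseline: all rows unchanged

        table.Rows[0]["lastname"] = "Davolio-Smith";     // row 1: modified
        table.Rows[2].RowError = "Name needs review";    // row 3: has an error
        table.Rows[1].Delete();                          // row 2: deleted
        table.Rows.Add(4, "Peacock");                    // new row: inserted
        return ds;
    }

    // Serializes the DataSet contents in the DiffGram format.
    public static string WriteDiffGram(DataSet ds)
    {
        using (StringWriter writer = new StringWriter())
        {
            ds.WriteXml(writer, XmlWriteMode.DiffGram);
            return writer.ToString();
        }
    }
}
```

Printing DiffGramDemo.WriteDiffGram(DiffGramDemo.BuildChangedDataSet()) shows all three sections: the data instance, the <diffgr:before> block with the original and deleted rows, and the <diffgr:errors> block.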


Some of the attributes and nodes that form a DiffGram come from a couple of Microsoft proprietary namespaces. The default prefixes are msdata and diffgr. In particular, the msdata namespace contains a number of attributes that are annotations for the data in the stream. We'll look at these attributes and the entire structure of the DiffGram in the section "DiffGram Format Annotations," on page 448.

The Current Data Instance

The first section of the DiffGram represents the current instance of the data. Although it's not strictly mandatory from a syntax standpoint, of the three constituent subtrees, the data instance is the only subtree that you will always find in a DiffGram. A DiffGram without data is just the representation of an empty DataSet object. The <diffgr:before> and <diffgr:errors> subtrees are not present if the source DataSet object has no pending changes and errors.

A DiffGram is stateful and is like a superset of the ADO.NET XML normal form. The data instance is nearly identical to the normal form, which is a simple, stateless snapshot of data. The major difference between the DiffGram's data instance and the normal form is that the DiffGram format does not include schema information. To make the overall DiffGram format truly stateful, you must combine the data with two other subtrees: the original data and the pending errors. By combining the contents of the three subtrees, a client can rebuild a faithful representation of the original DataSet contents.

Note: Like the normal form, not even the DiffGram can be considered a serialization format for the DataSet as an object. The DiffGram is a serialization format for the contents of a DataSet object. To be a valid serialization of the DataSet object itself, the DiffGram would need to contain schema information. Incidentally, the implementation of the ISerializable interface that both the DataSet object and the DataTable object provide manages to return a special version of the DiffGram format that differs from this because it incorporates schema information. You'll learn how to build DiffGram documents that contain a schema in the section "The DiffGram Viewer Application," on page 457.

Data Generator Objects

As mentioned, the data subtree in a DiffGram is similar to the ADO.NET normal form for XML we looked at in Chapter 9. In both cases, the XML code being generated by the WriteXml method represents a snapshot of the data currently stored in the DataSet object's tables. The data written out faithfully tracks any pending updates and deletions that have occurred in the meantime. As Figure 10-1 shows, the similarity between the first block of a DiffGram and the XML normal form is not just cosmetic, nor is it due to mere chance.

Figure 10-1: Components that work under the hood of the DataSet object's WriteXml method.

The same internal component, the XML tree writer, is used to generate both the ADO.NET XML normal form and the data instance block in a DiffGram. A pleasant side effect of this architecture is that all the mapping features for DataColumn objects we examined in Chapter 9 (see the discussion of the MappingType enumeration in the section "Customizing the XML Representation," on page 411) are still valid in the context of a DiffGram. You can decide whether a given column is better rendered using an attribute or an element, or whether the column should be hidden altogether.

The Hidden Flag

The MappingType.Hidden flag reveals a slight difference in the XML code that WriteXml generates for DiffGrams. A column mapped as hidden is still part of the DiffGram's data instance, but qualified with a particular attribute, as shown here:

    Davolio
    Nancy
    ...

For example, assume that you marked the employeeid column as hidden, as shown here:

    DataColumn col = ds.Tables["Employees"].Columns["employeeid"];
    col.ColumnMapping = MappingType.Hidden;

The employeeid column is not rendered as an <employeeid> element or an employeeid attribute, but a custom attribute is always used. The name of this attribute is hiddenXXX, where XXX represents the name of the column; in this case, hiddenemployeeid. The new attribute belongs to the msdata namespace.

Note: In the context of the DiffGram, the msdata:hiddenXXX attribute is a full replacement for the hidden column. In other words, the information is not hidden at all, but the name of the column is a bit camouflaged.

DiffGram Format Annotations

Another remarkable difference between the ADO.NET XML normal form and the DiffGram's data instance is that the latter includes extra attributes such as id, hasChanges, hasErrors, and rowOrder. The extra attributes come from a couple of custom namespaces that are referenced at the beginning of the DiffGram.
These special attributes are used to flag nodes, thus relating elements across the various sections: data instance, changes, and errors.

Table 10-1 lists all the DiffGram special attributes, also commonly referred to as annotations.

Table 10-1: DiffGram Annotations

Attribute            Description
diffgr:error         Contains the text that describes the error for the row or a column on the row.
diffgr:hasChanges    Indicates that the row has been modified or inserted.
diffgr:hasErrors     Indicates that the row contains an error.
diffgr:id            Returns the unique ID used to couple rows across sections.
diffgr:parentId      Returns the unique ID for the parent row.
msdata:hiddenXXX     Replacement attribute for columns marked as hidden. XXX denotes the actual name of the column.
msdata:rowOrder      Tracks the ordinal position of the row in the DataSet object.

There's no special reason for annotations to come from different namespaces; it's just a more rational categorization. Attributes in the diffgr namespace relate elements from different blocks. Attributes in the msdata namespace represent working information that is useful to know when you're processing the DiffGram.

Cross-Section Links

Each row rendered in a DiffGram is given a unique ID. The ID is automatically generated and consists of the table name followed by a one-based index, for example, Employees1, Employees2, and so on. The diffgr:id attribute is used as a key to retrieve the original data and the errors of a row from the <diffgr:before> and <diffgr:errors> sections. The following DiffGram contains a modified row:

    ...

The same row can be referenced in any, or even all, of the DiffGram blocks. If the row is currently part of the DataSet object, you will find it in the data instance block. If the row has been updated or deleted, it will have a corresponding entry in the <diffgr:before> section. If error messages have been associated with any of the row's columns, another record will be found in the <diffgr:errors> section. The diffgr:id attribute is used to pair related elements.

The msdata:rowOrder attribute is a simple zero-based index that tracks the ordinal position of the row in the source DataSet object. This information is not updated when a row is deleted. An msdata:rowOrder value of 1 indicates that the row was the second in the table when the DiffGram was created.

Catching Changes in the Data

The diffgr:hasChanges attribute indicates the type of change that has occurred in the row.
This attribute can take any of the values listed in Table 10-2.

Table 10-2: Values for the diffgr:hasChanges Attribute

Value       Description
descent     Indicates that the row has one or more children, from a parent/child relationship, that have been modified.
inserted    Indicates that the row has been added.
modified    Indicates that the row has been modified. The original values are stored in the corresponding row in the <diffgr:before> section.
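
As a rough illustration of the values in Table 10-2, the following sketch builds a tiny table, leaves one update, one deletion, and one insertion pending, and then counts the annotations in the resulting DiffGram. The class name HasChangesSketch, the NorthwindInfo DataSet name, and the sample rows are assumptions made up for this example; only the DataSet, DataTable, and XmlDocument calls come from the framework.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

public static class HasChangesSketch
{
    // Namespace URI that the diffgr prefix is bound to in a DiffGram.
    const string DiffgrNs = "urn:schemas-microsoft-com:xml-diffgram-v1";

    // Builds a small Employees table, commits it, and then leaves one modified,
    // one deleted, and one added row pending before writing the DiffGram.
    public static string BuildDiffGram()
    {
        DataSet ds = new DataSet("NorthwindInfo");      // hypothetical name
        DataTable table = ds.Tables.Add("Employees");
        table.Columns.Add("employeeid", typeof(int));
        table.Columns.Add("lastname", typeof(string));
        table.Rows.Add(1, "Davolio");
        table.Rows.Add(2, "Fuller");
        table.Rows.Add(3, "Leverling");
        ds.AcceptChanges();                    // every row is now Unchanged

        table.Rows[1]["lastname"] = "Smith";   // pending update
        table.Rows[2].Delete();                // pending deletion
        table.Rows.Add(4, "Peacock");          // pending insertion

        StringWriter sw = new StringWriter();
        ds.WriteXml(sw, XmlWriteMode.DiffGram);
        return sw.ToString();
    }

    // Counts the nodes selected by an XPath expression that may use the diffgr prefix.
    public static int Count(string diffGram, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(diffGram);
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("diffgr", DiffgrNs);
        return doc.SelectNodes(xpath, ns).Count;
    }
}
```

Calling Count(xml, "//*[@diffgr:hasChanges='modified']") and the same query with 'inserted' on the string returned by BuildDiffGram should each report one row, while Count(xml, "//diffgr:before/*") should report two entries: the original values of the updated row and the deleted row, which appears only in that section.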


An added row has no corresponding element in the <diffgr:before> section. A deleted row has no corresponding element in the data instance block, but there will be an entry in the <diffgr:before> block. Looking at the data instance, you can quickly and easily identify the modified and added rows: each has a diffgr:hasChanges attribute set to a self-explanatory value. But what about deleted rows?

The msdata:rowOrder values are assigned consecutively, so by design any hole in the sequence denotes a deleted row. Let's look more closely at how a DiffGram is actually loaded in memory and transformed into a DataSet object.

Reading Back DiffGrams

When reading a DiffGram, the DataSet object's ReadXml method first loads the data instance and creates all the necessary tables and rows. Each row is put in the added or modified state, as appropriate. All the diffgr:id values are temporarily copied into an internal hash table defined as a property of the DataSet object. Each entry in the hash table references a DataRow object in the table being created.

Next, ReadXml processes the <diffgr:before> section and reads the old values for the available rows. If a match can be found between a row in the <diffgr:before> section and a row already loaded in the table, the just-read values are stored as the original values of the table row. ReadXml looks for a match between the diffgr:id attribute in the <diffgr:before> section and the contents of the hash table. Figure 10-2 shows how the DataSet object is built.

Figure 10-2: The DataSet is built by reading the DiffGram sections one after the next and using the row IDs to pair elements in the various blocks.


If no match is found, ReadXml deduces that the row in the <diffgr:before> section was deleted from the table when it was at the position that the msdata:rowOrder attribute indicates. The method inserts a new row in the table at the same position and populates it with the values read from the <diffgr:before> section. Next, the row is marked for deletion using the Delete method of the DataRow object.

The final step consists of reading the values from the <diffgr:errors> section and updating the RowError property of the corresponding DataRow object in the table accordingly.

The Row Commit Model

The DataSet, DataTable, and DataRow objects maintain a local cache of changes. When a row is modified, deleted, or added, its state changes to one of the values of the DataRowState enumeration. (See the .NET Framework documentation for details.) Similarly, when a row is added, modified, or deleted from a table, the internal state of the table is altered, resulting in pending changes for the affected rows.

Pending changes can be either accepted or rejected at the DataSet, DataTable, or DataRow level. Accepting a pending change means that the row updates (changes always involve a row) are committed to the table. Rejecting a pending change rolls back the state of the table, and the table appears as though the change never occurred. A DiffGram can track pending changes, that is, in-memory changes that have not yet been committed. Table 10-3 lists the allowable states for a DataRow object.

Table 10-3: States of a DataRow Object

State        Description
Added        The row has been added to the table, but AcceptChanges has not yet been called.
Deleted      The row is marked for deletion from the parent table.
Detached     Either the row has been created but not yet added to the table, or the row has been removed from the rows collection.
Modified     Some columns within the row have been changed.
Unchanged    No changes have been made since the last call to AcceptChanges. This is also the state of all rows when the table is first created.

The AcceptChanges method commits all the changes and accepts the current values as the new original values of the table, clearing pending changes. RejectChanges rolls back all the pending changes. We'll encounter the row commit model again in the section "A Save-and-Resume Application."

The Original Data Section

The DiffGram has a layered structure in which current values, original values for the modified rows, and pending errors are stored in distinct sections. The state of the DataSet object is rebuilt by combining the contents of these sections. The original values are stored in the <diffgr:before> section as a change with respect to the current data instance.

The DataRow object maintains several versions of itself that are internally stored in an array of rows. The versions are grouped in the DataRowVersion enumeration, shown in Table 10-4.


Table 10-4: Values for the DataRowVersion Enumeration

Value       Description
Current     Contains the current values of the row
Default     The default row version, according to the current state of the row
Original    Contains the original values for the row, that is, the values stored when AcceptChanges was last called
Proposed    Contains proposed values for the row

Only the Current and Original versions are permanently stored in the DataRow object. The Proposed versions have a shorter life and are available only during the row edit phase. A row is in edit mode only during the time that elapses between two successive calls to the BeginEdit and EndEdit methods. When reading values from a DataRow object, you can also specify which of the available versions you want, as shown here:

    // The indexer returns boxed objects, so compare with Equals rather than ==
    if (row[0].Equals(row[0, DataRowVersion.Original]))
    {
        ...
    }

The <diffgr:before> section contains information that the ReadXml method will use to restore the Original version of each row referenced in the data instance. Newly added rows have no previous state and, consequently, are not listed in the <diffgr:before> section.

Deleted rows are present only in the <diffgr:before> section, as they have no current data to show. Deleted rows are detected by matching the diffgr:id attribute of original rows in the DiffGram with the IDs of the rows in the current data instance. Rows in the <diffgr:before> section that have no counterpart in the current data instance are first inserted in the table and then deleted.
Although this approach might appear a bit odd, it's probably the most sensible way to add a logically deleted row to a DataTable object.

Note: ADO.NET provides two ways to delete rows: the Delete method of the DataRow class and the Remove method of the table's Rows collection. The Delete method deletes the row logically by changing the state of the row. The row is marked as deleted, but it is not detached from the DataTable object. The Remove method, on the other hand, performs a physical deletion and detaches the row from the table. The detached DataRow object is not automatically destroyed and remains valid as long as it does not go out of scope. (Out of scope objects are automatically garbage-collected and destroyed.) Valid DataRow objects can be readded to the same DataTable object (or to another DataTable object) at any time.

No matter how many columns in a row have effectively been updated, in the <diffgr:before> section, the original row is stored in its entirety. The XML layout of the row depends on the column mappings, as shown here:

    <diffgr:before>
      <Employees diffgr:id="Employees2" msdata:rowOrder="1">
        <employeeid>2</employeeid>
        <lastname>Fuller</lastname>
        <firstname>Andrew</firstname>
      </Employees>
    </diffgr:before>


Although this solution is clearly not optimal, because unchanged columns are stored twice, it closely reflects the internal architecture of the DataRow object and, as such, speeds up the restoration of the DataRow object in the destination DataTable object.

Note: The DataRow class maintains its various versions by implementing an array of subobjects: one for the current values, one for the original version, and one for intermediate proposed values. Other internal properties indicate at any moment which is the current version and what the state of the row is.

As a final note, consider that for each column in a DataRow object, only the original and the current values are tracked, and no intermediate values are buffered. For example, suppose that you perform the following operation on an unchanged row:

    // 1 is the current value of the field
    row[field] = 2;

The row state changes to Modified, the original value (1) is persisted in the Original copy of the row, and the new value (2) is registered as the current value. Next, the following code runs:

    // 2 is the current value of the field
    // 1 is the original value of the field
    row[field] = 3;
    // 3 is NOW the current value of the field
    // 1 is the original value of the field

The original copy of the row remains intact, but the current version is updated. As a result, the intermediate value (2) is overwritten and is irreversibly lost.

Note: Building an automatic mechanism for tracking the entire history of a row is probably unnecessary in most cases. If you need a more powerful mechanism to track changes, you can build a parallel table of changes for each row in the table. Each entry in the custom table would point to a particular DataRow object and contain a collection of changes organized as you prefer.

Tracking Pending Errors

The DataRow class provides a few methods for handling row errors. You can set a general error message on the entire row, and you can set a column-specific message. To set a general error message, you use the RowError property. To set and read a column-specific message, you use the pair of methods SetColumnError and GetColumnError. Other helper methods available are GetColumnsInError and ClearErrors.

A column or row with an error is in no way different from a column or row without pending errors. In this context, an error is simply a description of contents that the user, or the application, finds erroneous or inconsistent. Nothing prevents you from using error properties as general-purpose cargo variables in which to store custom information and annotations.

Note: If you choose to use error properties as general-purpose cargo variables, keep in mind that some advanced Windows Forms and Web Forms controls can, in the presence of error flags, refresh their own user interfaces accordingly. For example, the Windows Forms DataGrid control displays a red exclamation mark on the columns in error.

The DataRow Error Programming Interface

The tables in this section provide a quick overview of the properties and methods available in the DataRow class for setting and getting error messages. These messages are then tracked in the <diffgr:errors> section of the DiffGram. Table 10-5 lists the error-related properties of the DataRow class.

Table 10-5: Error-Related DataRow Properties

Property     Description
HasErrors    Indicates whether the row contains errors
RowError     Gets or sets a custom error description for the row

The HasErrors property returns true when either the RowError property contains a value or at least one column has a nonempty error message. If you want to know about all the columns with errors, use the GetColumnsInError method to obtain an array containing the DataColumn objects with errors.

Table 10-6 shows the error-related methods of the DataRow class.

Table 10-6: Error-Related DataRow Methods

Method               Description
ClearErrors          Clears all the pending errors for the row. Does not distinguish between errors set using RowError and errors set using SetColumnError.
GetColumnError       Gets the error description for the specified column.
GetColumnsInError    Returns an array of the DataColumn objects with errors.
SetColumnError       Sets the error description for the specified column.

Contents of the <diffgr:errors> Section

A table row is assigned an element in the <diffgr:errors> section if its HasErrors property returns true.
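In this case, the element that represents the row in the data section has an extra attribute, diffgr:hasErrors, as shown here:

    <Employees diffgr:id="Employees1" msdata:rowOrder="0" diffgr:hasErrors="true">
      <employeeid>1</employeeid>
      <lastname>Davolio</lastname>
      <firstname>Nancy</firstname>
    </Employees>

The preceding element is coupled with another element in the <diffgr:errors> section in which the error messages are tracked, as follows:

    <diffgr:errors>
      <Employees diffgr:id="Employees1" diffgr:error="...">
        <employeeid diffgr:error="..." />
        <lastname diffgr:error="..." />
      </Employees>
    </diffgr:errors>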


The diffgr:error attribute on the row node (<Employees> in the preceding sample code) contains the text stored in the RowError property. For each column with a custom error description, a new child element is created with the name of the column and a diffgr:error attribute. In the sample code, the employeeid and lastname columns contain errors. Note that the RowError property is not automatically filled when at least one column is in error.

Caution: The XML schema of the elements in the <diffgr:errors> section is not affected by column mappings, unlike the current data and the <diffgr:before> sections we examined earlier.

The DiffGram Viewer Application

To fully demonstrate the workings of XML DiffGrams, nothing is better than taking a DataSet object, entering some changes, and seeing how the corresponding DiffGram representation varies. For this purpose, I created the DiffGram Viewer Windows Forms application, shown in Figure 10-3. The application is available in this book's sample files.

Figure 10-3: The DiffGram Viewer sample application in action.

This application executes a couple of SQL commands to obtain a DataSet object filled with two tables, Employees and Territories. The names of the DataSet object and the in-memory tables can be changed at will using text boxes. Next, the application creates a relation between the tables, sets the nesting property to true, and creates the DiffGram.
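
The relation, nesting, and DiffGram steps just described can be sketched without a database as follows. This is only an approximation of what the viewer does, not its actual source: the class name NestedDiffGramSketch, the relation name Emp2Terr, the column names, and the sample rows are all assumptions, and the tables are filled in memory instead of through SQL commands.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

public static class NestedDiffGramSketch
{
    // Builds two related in-memory tables and returns their DiffGram.
    // The nested flag is copied to the Nested property of the relation.
    public static string Write(bool nested)
    {
        DataSet ds = new DataSet("NorthwindInfo");      // hypothetical name
        DataTable emp = ds.Tables.Add("Employees");
        emp.Columns.Add("employeeid", typeof(int));
        emp.Columns.Add("lastname", typeof(string));
        DataTable terr = ds.Tables.Add("Territories");
        terr.Columns.Add("territoryid", typeof(string));
        terr.Columns.Add("employeeid", typeof(int));

        emp.Rows.Add(1, "Davolio");
        terr.Rows.Add("06897", 1);

        // Relate the tables on employeeid and choose the nesting behavior.
        DataRelation rel = ds.Relations.Add("Emp2Terr",
            emp.Columns["employeeid"], terr.Columns["employeeid"]);
        rel.Nested = nested;
        ds.AcceptChanges();

        // Write the DiffGram to an in-memory string writer.
        StringWriter sw = new StringWriter();
        ds.WriteXml(sw, XmlWriteMode.DiffGram);
        return sw.ToString();
    }

    // Returns true when a Territories row is rendered inside an Employees row.
    public static bool TerritoryIsChildOfEmployee(string diffGram)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(diffGram);
        return doc.SelectSingleNode("//Employees/Territories") != null;
    }
}
```

Comparing TerritoryIsChildOfEmployee(Write(true)) with TerritoryIsChildOfEmployee(Write(false)) is a quick way to see the effect of the nesting property on the layout of the data instance.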


The DiffGram is created using an in-memory string writer, and the output text is written to a multiline, read-only text box. Clicking the Edit button opens a new form with a DataGrid control for editing rows. The DataGrid control is bound to the DataSet object generated by the query, and is shown in Figure 10-4.

Figure 10-4: At the end of the editing phase, the updated DataSet object is resaved as a DiffGram and the pending changes are displayed.

The child form allows you to set errors and enter any type of changes. When the form is dismissed, the main application automatically saves the bound DataSet object back to a DiffGram and refreshes the user interface. As a result, you can easily test the DiffGram and view how the output varies after data changes.

A nice feature of the DiffGram Viewer application is that it lets you toggle the DiffGram view between plain text and XML. The XML view is provided by Internet Explorer, as shown in Figure 10-5.

Figure 10-5: The DiffGram displayed in Internet Explorer.

The DiffGram Viewer application makes use of the WebBrowser ActiveX control, which is imported almost seamlessly by Microsoft Visual Studio .NET. The following code shows how to refresh such a Web view. To view the DiffGram using the WebBrowser control, the DiffGram must first be saved to disk as a temporary XML file.

    void RefreshWebBrowser()
    {
        // Url is a form property that points to the DiffGram file
        object o1=null, o2=null, o3=null, o4=null;
        WebBrowser.Navigate(Url, ref o1, ref o2, ref o3, ref o4);
    }

A DiffGram has no trace of relationships between tables unless the Nested property of the DataRelation object is set to true. This system is reasonable in light of what we saw in Chapter 9. ADO.NET serializes information about table relationships using XML Schema constructs. Because a DiffGram does not include schemas, it can't contain static information about table relationships. When the Nested property is set to true, the parent/child relationship is expressed by grouping child rows as a subtree of the parent row.

Persisting a DataSet Object to a DiffGram

A DiffGram is programmatically created by calling the WriteXml method of the DataSet class. To save data to a DiffGram, however, you must explicitly set the XmlWriteMode argument of the method to the flag XmlWriteMode.DiffGram, as shown in the following code. The XML data created in this way does not include schema information. We'll return to this important point in the section "Schema Information in the DiffGram."

    // Prepare the output stream
    StreamWriter sw = new StreamWriter(fileName);
    XmlTextWriter writer = new XmlTextWriter(sw);
    writer.Formatting = Formatting.Indented;

    // Create the diffgram
    ds.WriteXml(writer, XmlWriteMode.DiffGram);
    writer.Close();

The DiffGram contains all the rows from all the tables found in the DataSet object. You can create ad hoc subsets of the DataSet object to narrow the information being saved. In particular, you can use the DataSet object's GetChanges method to save only those rows that contain uncommitted changes, as shown here:

    // GetChanges returns null when there are no pending changes
    DataSet dsChanges = ds.GetChanges();
    if (dsChanges != null)
        dsChanges.WriteXml(writer, XmlWriteMode.DiffGram);

The GetChanges method also has a few overloads that let you control the type of changes you are interested in. For example, the following code prepares a DiffGram containing only the rows that have been inserted:

    DataSet dsChanges = ds.GetChanges(DataRowState.Added);
    if (dsChanges != null)
        dsChanges.WriteXml(writer, XmlWriteMode.DiffGram);

Loading a DataSet Object from a DiffGram

When you try to build a DataSet object from an XML DiffGram, you must first ensure that the target DataSet object has a schema that is compatible with the data in the DiffGram.


In no case does the ReadXml method (the only DataSet method that can load a DiffGram) infer the schema or extend an existing schema with new elements. ReadXml works by merging the rows read from the DiffGram with existing rows in the DataSet object. The DiffGram row identifier (the diffgr:id attribute) is used to pair DiffGram and DataSet object rows.

Any incompatibility between the current schema of the DataSet object and the data in the DiffGram throws an exception and causes the merge operation to fail. As a result, you can't load a DiffGram into an empty, newly created DataSet object. You can create the target DataSet object simply by cloning an existing object that you know has the correct schema. Or, more realistically, you might want to read the schema from an external source using the ReadXmlSchema method.
The following code snippet shows how to create a DiffGram and its schema in distinct files:

    // Prepare the output stream for the DiffGram
    StreamWriter diffStrm = new StreamWriter(diffgramFile);
    XmlTextWriter writer = new XmlTextWriter(diffStrm);
    writer.Formatting = Formatting.Indented;

    // Create the diffgram from the ds DataSet
    ds.WriteXml(writer, XmlWriteMode.DiffGram);
    writer.Close();

    // Prepare the output stream for the schema (the writer variable is reused)
    StreamWriter xsdStrm = new StreamWriter(schemaFile);
    writer = new XmlTextWriter(xsdStrm);
    writer.Formatting = Formatting.Indented;

    // Create the schema from the ds DataSet
    ds.WriteXmlSchema(writer);
    writer.Close();

The schema written with WriteXmlSchema is an XML Schema and includes table, relation, and constraint definitions.

Schema Information in the DiffGram

In general, the schema and the data should be kept in separate files and handled as truly independent entities. Nevertheless, the schema and the data are tightly coupled, and if serialization is involved, you might want to consider putting schema information in-line in the data.

In the .NET Framework, the WriteXml method does not provide the capability to include schema information along with the data when it writes a DiffGram. This is more of a design choice than an objective difficulty. An indirect confirmation comes from the XML string you get from a Web service method that returns a DataSet object. The output is a DiffGram extended with schema information, as shown here:

    <DataSet>
      <xs:schema> ... </xs:schema>
      <diffgr:diffgram> ... </diffgr:diffgram>
    </DataSet>


By design, the current DiffGram implementation does not include schema information. However, I can't see any reason for not providing the schema option in future versions.

The DataSet representation you get from a Web service method offers a glimpse of what could be a possible enhancement of the DiffGram format. Technically speaking, the Web service serialization of a DataSet object is not a DiffGram, but rather a new XML format that incorporates a DiffGram. In addition, this new format is not produced by WriteXml but comes courtesy of the XML serializer, a different breed of data formatter that we'll explore in Chapter 11.

Creating DiffGrams with Schemas

The DiffGram Viewer application includes a Save With Schema check box that enables you to persist the DataSet object using the XML serializer. The final output, shown in Figure 10-6, is the same as you would obtain through a Web service. (This happens because .NET Framework Web services are actually serviced by the XML serializer.)

Figure 10-6: A DataSet object serialized through the XML serializer class.

The code that saves the DataSet object to a DiffGram changes as follows:

    StreamWriter sw = new StreamWriter(fileName);
    XmlTextWriter writer = new XmlTextWriter(sw);
    writer.Formatting = Formatting.Indented;

    // Create the diffgram
    if (!bUseSchema)
        ds.WriteXml(writer, XmlWriteMode.DiffGram);
    else
    {
        XmlSerializer ser = new XmlSerializer(typeof(DataSet));
        ser.Serialize(writer, ds);
    }
    writer.Close();

If schema information must be included, the application makes use of the XML serializer defined in the System.Xml.Serialization namespace. The constructor of the XML serializer takes the type of the data to process as an argument; you then invoke the Serialize method. In Chapter 11, I'll unveil what really happens at this stage and how the XML serializer sets itself up to work on a particular data type. For now, suffice it to say that once the instance of the serializer has been configured, you simply call the Serialize method on the object instance to be persisted. When the object is a DataSet, the output is a DiffGram and a schema, that is, an XML Schema and a DiffGram rooted under a common node. The name of the root matches the name of the type being serialized (for example, DataSet) and can't be modified programmatically.

Loading DiffGrams with Schemas

To read back a DiffGram and a schema into a DataSet object, you call the XML deserializer. Deserialization is the process of reading an XML document and building an object instance that coincides with a given XML Schema. With DataSet objects, the schema and the data are stored as distinct nodes under a common root. The data is expressed as a DiffGram.

To set up the serializer, follow the same steps as in the previous section.
You instantiate the XmlSerializer class and pass the type of the object to process, as shown here:

    XmlSerializer ser = new XmlSerializer(typeof(DataSet));
    // Deserialize reads from a reader opened on the saved file
    XmlTextReader reader = new XmlTextReader(fileName);
    DataSet dsNew = (DataSet) ser.Deserialize(reader);
    reader.Close();

To deserialize, call the Deserialize method and cast the object you get to the DataSet type.

DiffGrams and Remoting

When a DataSet object is serialized to a .NET Framework formatter, it directly controls the format of its data through the methods of the ISerializable interface. In particular, a serializable class implements the GetObjectData method, as shown here:

    void GetObjectData(SerializationInfo info, StreamingContext context)

The class passes its data to the formatter by adding entries to the SerializationInfo object using the AddValue method. A DataSet object serializes itself by adding a couple of entries, as shown in the following pseudocode:

    info.AddValue("XmlSchema", this.GetXmlSchema());
    this.WriteXml(strWriter, XmlWriteMode.DiffGram);
    info.AddValue("XmlDiffGram", strWriter.ToString());

The information stored in the SerializationInfo is then flushed to a binary stream or a Simple Object Access Protocol (SOAP) stream, according to the formatter in use.

The gist of this story is that a DataSet object is remoted using a couple of XML documents (one for the schema and one for the data), and the data is rendered using a DiffGram. To make DiffGrams really usable, the availability of schema information is vital.

A Save-and-Resume Application

As a stateful data format, a DiffGram is particularly useful for building save-and-resume applications. In this context, a save-and-resume application is a desktop or Web application that can work both online and offline. For such applications, the connection to the rest of the back-end system is optional and is not guaranteed to be up all the time. From the connectivity standpoint, a save-and-resume application is intermittent and must be able to get its core data either remotely (for example, from the central system) or locally (for example, from data persisted to files).

In this section, we'll build a Windows Forms application that connects to a database, downloads some data, and disconnects. From this point on, the application works disconnected, the data it needs is stored locally, and the application can be used anywhere and shut down and resumed any number of times. All the changes made to the local data are correctly tracked and reported as insertions, deletions, and updates. At a later time, the application reconnects to the system and submits its changes.

In this description, common words such as connection, back-end system, data, database, and updates are treated as blanket terms that each application can implement as needed. For example, a simple query executed on a SQL Server table in the sample application can easily become a call to a middle-tier object.
Similarly, a simple connection to SQL Server in the sample application could be viewed as a login in a distributed application.

Note: While looking at the sample application discussed here, keep in mind that it is just a sample. Focus on the technologies involved and their interactions rather than on the implementation details. The overall context of the sample application, while representative of a common type of application, is certainly not a real-world scenario!

Setting Up the Application

The key functions of a save-and-resume application can be summarized in three categories. First, the application must be able to work disconnected, thus transparently using a local copy of the back-end database. Next, the application must allow you to review, filter, and reject changes. Finally, the application must allow you to reconnect and submit changes at any time. Figure 10-7 shows the key elements of the architecture.


Figure 10-7: Constituent parts of a disconnected save-and-resume application.

At startup, the application loads data either from the local store or from a centralized repository. Applications can determine what route should be taken first according to their own features and requirements. Likewise, they can provide distinct user interface elements to trigger the local and remote downloads independently.

The DataSet object is ideal for storing a disconnected database. It can contain multiple, even indexed, tables, as well as relations and constraints. Once rebuilt, the DataSet object is used to populate the user interface, which also provides for editing. In this chapter and in Chapter 9 and Chapter 11, we examine the various options available for serializing a DataSet object: .NET Framework formatters, the ADO.NET normal form, DiffGrams, and XML serializers.

A disconnected application should allow users to accumulate and review changes to the original data across several work sessions. This means that the local data store must persist the state of each change and possibly the history of each row. The DataSet object provides for just this situation.

The DataSet object is also ideal for gathering all the modified rows to be submitted to the back-end system for permanent updates. The DataSet object has been designed with disconnection in mind and to be used in save-and-resume applications.
In save-and-resume scenarios, the serialization of the object is a critical aspect in improving overall client-side performance and efficiency.

Creating the Local Data Store

The sample application shown in Figure 10-8 is a simple Windows Forms application containing an editable DataGrid control. The grid is bound to a DataSet object that can be obtained by executing a SQL query or by reading a local DiffGram file.
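The local DiffGram store can be tried out without the sample application. What follows is only a minimal sketch: the class name LocalStoreSketch, the two temporary files, and the sample rows are assumptions made for illustration and are not taken from the sample application. It writes a DataSet object with a pending edit to a schema file and a DiffGram file, and then rebuilds it in a fresh DataSet object.

```csharp
using System;
using System.Data;
using System.IO;

class LocalStoreSketch
{
    static void Main()
    {
        // Hypothetical file names, placed in the temp folder for the demo
        string dir = Path.GetTempPath();
        string schemaFile = Path.Combine(dir, "employees_sketch.xsd");
        string diffgramFile = Path.Combine(dir, "employees_sketch.xml");

        // Build a small table that stands in for data downloaded earlier
        DataSet ds = new DataSet("LocalStore");
        DataTable t = ds.Tables.Add("Employees");
        t.Columns.Add("ID", typeof(int));
        t.Columns.Add("LastName", typeof(string));
        t.Rows.Add(new object[] { 1, "Davolio" });
        t.Rows.Add(new object[] { 2, "Fuller" });
        ds.AcceptChanges();                    // rows are now Unchanged

        t.Rows[0]["LastName"] = "Smith";       // a pending, uncommitted edit

        // Save: one file for the schema, one for the stateful data
        ds.WriteXmlSchema(schemaFile);
        ds.WriteXml(diffgramFile, XmlWriteMode.DiffGram);

        // Resume: schema first, then the DiffGram
        DataSet resumed = new DataSet();
        resumed.ReadXmlSchema(schemaFile);
        resumed.ReadXml(diffgramFile, XmlReadMode.DiffGram);

        // The row state and both row versions are available again
        DataRow row = resumed.Tables["Employees"].Rows[0];
        Console.WriteLine(row.RowState);
        Console.WriteLine(row["LastName", DataRowVersion.Original]);
        Console.WriteLine(row["LastName", DataRowVersion.Current]);
    }
}
```

Loading the schema before the DiffGram matters because the DiffGram carries no schema information of its own.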


Figure 10-8: The sample save-and-resume application.

The code that populates the data grid looks like this:

    void PopulateGrid()
    {
        if (!File.Exists(m_diffgramFile))
            LoadFromDatabase();
        else
            LoadFromLocalStore();

        // Load methods fill the m_dataSet member
        grid.DataSource = m_dataSet;
        grid.DataMember = "Employees";
    }

Once the data loads, users can start working and enter changes as appropriate. The DataSet object tracks any changes and signals those changes to the application through the HasChanges method. Here's the code to load the data from the local store:

    private void LoadFromLocalStore()
    {
        // Load the schema into the DataSet
        m_dataSet.ReadXmlSchema(m_schemaFile);

        // Load the data
        m_dataSet.ReadXml(m_diffgramFile, XmlReadMode.DiffGram);
    }

The sample application uses a DiffGram to implement the local store. More precisely, the local store consists of two distinct files: one for the data (the DiffGram) and one for the schema. As mentioned, a DiffGram can't be used to populate a DataSet object without schema information. This is not your only option, however.

You can use the XML serializer to persist a DataSet object to a file that stores schema and data in the same place. In all these cases, the final output format is XML. If you want a more compact format, opt for the binary .NET Framework formatter and consider using a ghost class, as described in Chapter 9.

Reviewing and Rejecting Changes

Users of the sample application enter changes through the interface of the DataGrid control. Each change is detected, and controls in the user interface are enabled and disabled to reflect those changes. For example, the Review Changes button is enabled if there are changes to review.

Detecting Ongoing Changes

In a Windows Forms application, data sources associated with data-bound controls are managed by a special breed of component, the binding manager. BindingManagerBase is the abstract class for binding managers; the actual classes you will work with are CurrencyManager and PropertyManager.

The PropertyManager class keeps track of a simple binding between a data-bound control property and a data source scalar value. The CurrencyManager class plays a more sophisticated role. CurrencyManager handles complex data binding and maintains bindings between a data source and all the list controls (for example, the DataGrid control) that bind to it or to one of its member tables. The CurrencyManager class takes care of synchronizing the controls bound to the same data source and provides a uniform interface for clients to access the current item for the list.
Both manager classes have a property named Current and fire position-related events such as CurrentChanged and PositionChanged; the CurrencyManager class also fires ItemChanged. The Current property returns the currently selected item, whatever that is for the particular binding class. For example, for the DataGrid class, the current item is the nth bound element: a DataRowView object (a view over a DataRow) if a DataTable is bound, or a string if an array of strings is bound.

To access the binding manager for a particular data source, you use the Form object's BindingContext collection, as shown here:

    CurrencyManager m_bmbEmployees;
    m_bmbEmployees = (CurrencyManager) BindingContext[m_dataSet, "Employees"];
    m_bmbEmployees.ItemChanged += new ItemChangedEventHandler(CurrentChanged);

This code also registers a handler for the ItemChanged event. The binding manager automatically fires the event whenever an item in the bound data source (the Employees table in the grid's DataSet object) changes. In other words, the handler executes whenever a change occurs and refreshes the application's user interface accordingly.

Selecting Changed Rows

As mentioned, the DataSet object registers all the changes but retains the original values of the modified rows. Thanks to these features, setting up a form to review the current changes is not at all difficult. Let's see how to proceed.

The idea is to create a view of the table, possibly a copy of the table that includes only the changes.
The GetChanges method can be used to obtain a copy of the DataTable object (or the DataSet object) that includes only the changed rows, as shown here:

    DataTable dtChanges = m_dataSet.Tables["Employees"].GetChanges();
    if (dtChanges == null)
        return;
    DataView dv = dtChanges.DefaultView;
    dv.RowStateFilter = DataViewRowState.Added |
                        DataViewRowState.ModifiedOriginal |
                        DataViewRowState.Deleted;
    gridChanges.DataSource = dv;

A DataView object can be obtained from the table through the DefaultView property. Normally, the DefaultView property returns an unfiltered view of the table contents. The RowStateFilter property enables you to select the rows to be displayed in the view based on their state. With the preceding code, only the rows added and deleted are shown. In addition, the view includes the original version of the modified rows.

Because the dtChanges table has already been constructed to contain all the changes, a good question would be, Do we really need to set the RowStateFilter property to Added, ModifiedOriginal, and Deleted? Shouldn't such rows already be displayed in the view? This consideration applies only to added and deleted rows. By default, the modified rows are displayed with the current values, not the original values. The goal of the Review Changes feature is to display pending changes, so we need to display the original values to let users make comparisons with the current values. The Changes window, shown in Figure 10-9, allows you to see any changes to the data.

Figure 10-9: The Review Changes feature in action. The bottom grid shows the original version of the modified rows.

Rejecting Changes

Pending changes can be rejected by calling the RejectChanges method. RejectChanges is available on the DataSet class as well as on the DataTable and DataRow classes.
By calling RejectChanges on the DataSet class, you cancel all the pending changes in all the tables in the DataSet object. Similarly, calling RejectChanges on a DataTable object rejects all the changes on the table. Finally, calling the method on the DataRow class simply cancels the current changes on the given row.

Whereas RejectChanges performs an in-memory rollback, AcceptChanges does the opposite and commits all the pending changes. When changes are committed, the original values of each involved row are overwritten with the current values and the row state is reset to Unchanged. Uncommitted changes are key to performing a batch update to the back-end system.
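The difference between the two methods is easy to observe on a single row. The following is a minimal sketch rather than code from the sample application; the class name ChangeTrackingSketch and the sample data are assumptions made for illustration.

```csharp
using System;
using System.Data;

class ChangeTrackingSketch
{
    static void Main()
    {
        DataTable t = new DataTable("Employees");
        t.Columns.Add("ID", typeof(int));
        t.Columns.Add("LastName", typeof(string));
        t.Rows.Add(new object[] { 1, "Davolio" });
        t.AcceptChanges();                     // baseline: the row is Unchanged

        DataRow row = t.Rows[0];

        // An edit followed by a rollback restores the original value
        row["LastName"] = "Smith";
        Console.WriteLine(row.RowState);       // Modified
        row.RejectChanges();
        Console.WriteLine(row["LastName"]);    // Davolio
        Console.WriteLine(row.RowState);       // Unchanged

        // An edit followed by a commit overwrites the original value
        row["LastName"] = "Jones";
        t.AcceptChanges();
        Console.WriteLine(row["LastName", DataRowVersion.Original]);  // Jones
        Console.WriteLine(row.RowState);       // Unchanged

        // Nothing is left to submit once the changes are committed
        Console.WriteLine(t.GetChanges() == null);   // True
    }
}
```

The last line shows why AcceptChanges should not be called before a batch update: once the changes are committed in memory, GetChanges has nothing left to report.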


Submitting Changes

Data submission is the process in which all in-memory changes are passed on to the back-end system for permanent storage and global availability. In ADO.NET, this submission does not consist of a block of data being sent to the database (SQL Server 2000 or any other database) in a single shot as an Updategram or a text stream. An ADO.NET batch update executes individual statements on the target system, one for each change that needs to be submitted. For the most part, statements will be SQL statements.

The Batch Update

The DataSet object can submit data to the database in batch mode by using the data adapter's Update method, as shown in the following code. Data can be submitted only on a per-table basis. When you call Update without specifying a table name, the code assumes a default name of Table. If no table exists with that name, an exception is raised.

    adapter.Update(dataSet, tableName);

The Update method first examines the RowState property of each table row. It then prepares and calls a tailor-made INSERT, UPDATE, or DELETE statement for each inserted, updated, or deleted row in the specified DataTable object. The Update method belongs to a data adapter object, so you need a connection string, or a connection object, to proceed.

Rows are scanned and processed according to their natural order (their position in the table's Rows collection). If you need to process rows in a particular order, you must divide the overall update process into various subprocesses, each working on the selected rows you need.
For example, if you have parent/child related tables, you might want to start by updating rows in both tables. Next you delete rows in the child table and then in the parent table. Finally, you insert new rows in the parent table and finish with child insertions.

The following code shows how to submit only the rows that have been added to the in-memory table:

    // Submit all the rows that have been added to a given table
    DataRow[] arrayOfRows = table.Select("", "", DataViewRowState.Added);
    adapter.Update(arrayOfRows);

This arrangement is made possible by the fact that one of the Update overloads takes an array of DataRow objects, which provides for the greatest flexibility.

Detecting and Resolving Update Conflicts

Data disconnection is based on a clearly optimistic vision of concurrency. What happens if, by the time you attempt to apply your changes to the back-end system, someone else has modified the same records? Technically speaking, in this case, you have a data conflict. How conflicts are handled is strictly application-specific, but the reasonable options can be easily summarized in three categories, as follows:

• First-win  The conflict is resolved by silently and automatically dropping the latest change, that is, the change that you were trying to submit. To implement a first-win approach, you simply set the ContinueUpdateOnError property on the data adapter to true. If ContinueUpdateOnError is set to true, no exception is thrown when an error occurs during the update of a row. The error information is stored in the RowError property of the corresponding row. The batch update process continues with subsequent rows.

• Last-win  Your change is applied regardless of the current status of the row. To implement this approach, you have only to ensure that the SQL command used to carry the update is not so restrictive that it generates a data conflict. A data conflict occurs when the SQL command finds no row to affect. If you build the SQL command so that it updates or deletes rows that match a primary key field, no data conflict will ever be raised. Conflict-aware SQL code is code generated by ADO.NET command builders in which the WHERE clause ensures that the current status and the original status of the row match prior to proceeding with the statement.

• Ask-the-user  Take this route when neither of the two preceding options will work in all possible cases you foresee handling. By default, a data conflict raises a DBConcurrencyException exception. This exception is not raised if you set the ContinueUpdateOnError property to true. The Row property of the exception class returns a reference to the row in error. By reading the properties of such a DataRow object, you have access to both proposed and original values. You have no access to the underlying value, but you can obtain that value by issuing another query against the database. Resolving the conflict ultimately means opting either for a first-win or a last-win approach, but you let the user decide which.
Your goal is to provide the user with enough information to make the correct choice.

The following code uses the "ask-the-user" approach for resolving update data conflicts:

    OleDbDataAdapter da = new OleDbDataAdapter();
    da.ContinueUpdateOnError = true;
    da.SelectCommand = new OleDbCommand("SELECT * FROM employees", m_conn);
    OleDbCommandBuilder cb = new OleDbCommandBuilder(da);
    da.Update(m_dataSet, "Employees");

Figure 10-10 shows the sample application when a change fails.
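When ContinueUpdateOnError is left at its default value of false, a conflict surfaces as a DBConcurrencyException instead. The following sketch shows what a handler might look like. Because a real conflict needs a live database, the SimulatedUpdate method is a hypothetical stand-in for the adapter's Update method and simply pretends that the record with ID 2 was changed by another user. The class name and the data are likewise assumptions.

```csharp
using System;
using System.Data;

class ConflictSketch
{
    // Hypothetical stand-in for adapter.Update(DataRow[]); it fails on
    // ID 2 as if someone else had modified that record in the database.
    static void SimulatedUpdate(DataRow[] rows)
    {
        foreach (DataRow row in rows)
        {
            if ((int) row["ID"] == 2)
            {
                DBConcurrencyException ex =
                    new DBConcurrencyException("Concurrency violation (simulated).");
                ex.Row = row;
                throw ex;
            }
            row.AcceptChanges();   // a successful update commits the row
        }
    }

    static void Main()
    {
        DataTable t = new DataTable("Employees");
        t.Columns.Add("ID", typeof(int));
        t.Columns.Add("LastName", typeof(string));
        t.Rows.Add(new object[] { 1, "Davolio" });
        t.Rows.Add(new object[] { 2, "Fuller" });
        t.AcceptChanges();

        t.Rows[0]["LastName"] = "Smith";
        t.Rows[1]["LastName"] = "Jones";

        DataRow[] pending = t.Select("", "", DataViewRowState.ModifiedCurrent);
        try
        {
            SimulatedUpdate(pending);
        }
        catch (DBConcurrencyException ex)
        {
            // The row in error exposes both the original and proposed values
            DataRow row = ex.Row;
            Console.WriteLine("Conflict on employee {0}: original '{1}', proposed '{2}'",
                row["ID"],
                row["LastName", DataRowVersion.Original],
                row["LastName", DataRowVersion.Current]);

            // The user's decision would go here. This sketch picks
            // first-win and drops the local change.
            row.RejectChanges();
        }

        Console.WriteLine(t.Rows[1]["LastName"]);   // Fuller
    }
}
```

In a real application, the catch block would also query the database for the underlying value before asking the user, and a last-win choice would resubmit the row with a less restrictive command.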


Figure 10-10: The user interface of the application when the batch update fails on a row.

Notice the custom error message on the row in error. This message is obtained using the following code:

    // Select all the rows in error after the batch update
    foreach (DataRow row in m_dataSet.Tables["Employees"].GetErrors())
    {
        string msg = row.RowState.ToString() + " row. Failed.";
        row.RowError = msg;
    }

Resolving update conflicts and reconciling tables after a batch update procedure can be expensive. Sometimes, it might be even more costly than working connected. Choosing the right application perspective is a delicate task with quite a simple guideline: go for disconnection if you have a low degree of data contention and your tables aren't updated frequently with highly volatile data.

Conclusion

If you've recently programmed data-driven applications, disconnected programming is nothing new for you. Disconnected scenarios are key in the era of the Internet, as they let you gain in scalability and mobility, bringing simplification to the software and to the user. For disconnected applications, effective local copies of the data are more than vital: they're absolutely mandatory.

For .NET Framework applications, the DataSet object is the ideal candidate to take the position of the client-side data container in disconnected, intermittent applications. In this chapter and in Chapter 9, we analyzed various options for serializing the contents of a DataSet object to output streams.
In this chapter in particular, we analyzed a stateful way to persist the DataSet contents.

In general, there are two different angles from which you should look at the DataSet object's serialization. One is the physical layout of the data when stored to disk; the other is the statefulness of the format. Normally, a DataSet object serializes itself using a couple of XML blocks: schema and data. This data can then be saved as is to a text or SOAP output stream or saved to a more compact binary stream. In this case, however, the verbosity of XML patently wins over the compactness of binary data. As a result, the size of the final stream is often unacceptably large. You must resort to tricks such as the ghost class discussed in Chapter 9 to overcome this difficulty.

As for the data format, you can choose among the stateless ADO.NET normal form, the DiffGram format, and the DiffGram with a schema. In the first case, you take a snapshot of the current data, disregarding original values, ongoing changes, and pending row errors. The DiffGram format is stateful and maintains a history of the changes and pending errors. Unfortunately, the DiffGram format does not include schema information. Schema information is fundamental for constructing a DataSet object from XML data. By using the XML serializer class, you obtain a new XML format in which schema and DiffGram data are grouped under a common umbrella. Incidentally, XML serializers are the topic of Chapter 11.

Further Reading

In my book Building Web Solutions with ASP.NET and ADO.NET (Microsoft Press, 2002), I devoted Chapter 7 to disconnected applications and batch updates. In that chapter, I discuss save-and-resume applications from the Web perspective.
A wider coverage of disconnected ADO.NET can be found in Francesco Balena's Programming Visual Basic .NET (Microsoft Press, 2002) and David Sceppa's Microsoft ADO.NET Core Reference (Microsoft Press, 2002). Both books will more than get you started, so deciding which works better for you is more of a matter of personal preference. If you want to focus on ADO.NET, go for Sceppa's book; if you want to look at ADO.NET as a part of the larger .NET Framework, pick up Balena's book.

Data binding is a key enhancement in the .NET Framework. Although based on a shared model such as ADO.NET, data binding is implemented in radically different ways in Windows Forms and Web Forms applications. Insights into Windows Forms data binding can be found in the following Microsoft Developer Network (MSDN) articles: http://msdn.microsoft.com/library/en-us/dndive/html/data06132002.asp and http://msdn.microsoft.com/msdnmag/issues/02/02/cutting/cutting0202.asp.


Part IV: Applications Interoperability

Chapter List

Chapter 11: XML Serialization
Chapter 12: The .NET Remoting System
Chapter 13: XML Web Services
Chapter 14: XML on the Client
Chapter 15: .NET Framework Application Configuration

Part Overview


Chapter 11: XML Serialization

Overview

Serialization is the run-time process that converts an object, or a graph of objects, to a linear sequence of bytes. You can then use the resultant block of memory either for storage or for transmission over the network on top of a particular protocol. In the Microsoft .NET Framework, object serialization can have three different output forms: binary, Simple Object Access Protocol (SOAP), and XML. We touched on binary serialization in Chapter 9 while examining how to work around XML DiffGram code in serialized DataSet and DataTable objects. In this chapter, we'll look briefly at SOAP serialization and then move on to the core topic, XML serialization.

Run-time object serialization (for example, binary and SOAP) and XML serialization are significantly different technologies with different implementations and, more important, different goals. Nevertheless, both forms of serialization do just one key thing: they save the contents and the state of living objects out to memory, and from there to any other storage media. Run-time serialization is governed by .NET Framework formatter objects. XML serialization takes place under the aegis of the XmlSerializer class.

The XML serialization process converts the public interface of an object to a particular XML schema. Such a mechanism is widely used throughout the .NET Framework as a way to save the state of an object into a stream or a memory buffer.
In Chapter 10, we saw XML serialization used as a way to persist the DiffGram-with-schema scripts that describe a DataSet object. Web services use the XmlSerializer class to encode object instances being returned by methods.

The Object Serialization Process

In the .NET Framework, object serialization is offered through the classes in the System.Runtime.Serialization namespace. These classes provide type fidelity and support deserialization. As you probably know, the deserialization process is the reverse of serialization. Deserialization takes in stored information and recreates objects from that information.

Object serialization in the .NET Framework allows you to store public, protected, and private fields and automatically handles circular references. A circular reference occurs when a child object references a parent object and the parent object also references the child object. Serialization classes in the .NET Framework can detect these circular references and resolve them. Serialization can generate output data in multiple formats by using different made-to-measure formatter modules. The two system-provided formatters are represented by the BinaryFormatter and SoapFormatter classes, which write the object's state in binary format and SOAP format, respectively.

Classes make themselves serializable through formatters in two ways: they can either support the [Serializable] attribute or implement the ISerializable interface.
With the [Serializable] attribute, the class author has nothing else to do, as the serialization is governed by caller applications and the class data is obtained through reflection. The ISerializable interface, on the other hand, enables the class author to exercise closer control over how the bits of the living object are actually persisted.

A formatter is the .NET Framework object that obtains the serialized data from the target object. Data is requested either by calling the GetObjectData method on the ISerializable interface or through the services of the FormatterServices static class. In particular, the GetSerializableMembers method returns all the serializable members for a particular class.

In the .NET Framework, formatters are of two types, depending on the nature of the underlying stream they use. The binary formatter (available through the BinaryFormatter class) saves data to a binary stream. The SOAP formatter (available through the SoapFormatter class) saves data to a text stream, automatically encoding information in a SOAP message before writing.

The SOAP Formatter

To use the SOAP formatter, you must reference a distinct assembly, System.Runtime.Serialization.Formatters.Soap. You add this separate assembly through the Add Reference dialog box or manually on the compiler's command line through the /reference switch. In addition to linking the assembly to the project, you still have to import the namespace with the same name as the assembly, as shown here:

    using System.Runtime.Serialization.Formatters.Soap;

At this point, you prepare the output stream, instantiate the SOAP formatter, and call the Serialize method, as follows:

    // emp is the object instance to process
    StreamWriter writer = new StreamWriter(filename);
    SoapFormatter soap = new SoapFormatter();
    soap.Serialize(writer.BaseStream, emp);
    writer.Close();

Note that the Serialize method accepts only a stream object, which makes serializing to in-memory strings a little more difficult.

Let's consider a rather simple class, such as the following Employee class:

    [Serializable]
    public class Employee
    {
        public int ID;
        public string FirstName;
        public string LastName;
        public string Position;
        public int[] Territories;
    }

Upon instantiation, only the numeric ID field has a determined value (0).
All the other members are null, as shown here:

    Employee emp = new Employee();

After the Employee class has been instantiated, the SOAP formatter generates the following script:


[SOAP listing not preserved in this copy; only the ID value, 0, survives]

As you can see, the class representation is perfect, and the fidelity between the SOAP description and the class is total. Information about the namespace is preserved and null values are listed. But what about types?

Retrieving Type Information

The formatter's TypeFormat property lets you indicate how type descriptions are laid out in the serialized stream. By default, TypeFormat is set to TypesWhenNeeded, which means that type information is inserted only when strictly necessary. This is true for arrays of objects, generic Object objects, and nonprimitive value types. If you want to force type description, use either the TypesAlways or the XsdString option. The difference between these two options is in the format used to describe the type: SOAP in the former case; XSD in the latter. All the type format options are gathered in the FormatterTypeStyle enumeration.

Serializing to Strings

Because the SOAP formatter and the binary formatter write only to streams, to avoid creating disk files you can use the MemoryStream object, as shown here:

    // emp is the object instance to process
    MemoryStream ms = new MemoryStream();
    SoapFormatter soap = new SoapFormatter();
    soap.Serialize(ms, emp);

Reading back data is a bit trickier. First you must get the size of the serialized stream. This information is stored in the Length property of the MemoryStream class.
Bear in mind, however, that once Serialize has written its data, the internal pointer sits at the end of the stream; reading Length does not move it. To be able to read the specified number of bytes, you must first reset the internal pointer. The Seek method serves just this purpose, as shown here:

    int size = (int) ms.Length;       // Size of the serialized data
    byte[] buf = new byte[size];
    ms.Seek(0, SeekOrigin.Begin);     // Rewind to the start of the stream
    ms.Read(buf, 0, size);
    ms.Close();
    string soapText = Encoding.UTF8.GetString(buf);

The MemoryStream object reads data only as bytes. Especially in a strongly typed environment like the .NET Framework, an array of bytes and a string are as different as apples and oranges. Fortunately, the encoding classes provide handy conversion methods. The Encoding class belongs to the System.Text namespace and exposes ready-made encodings, such as UTF8, through static properties.

Deserializing Objects

To rebuild a living instance of a previously serialized object, you call the Deserialize method on the specified formatter. The deserializer returns an object that you cast to the particular class type you need, as shown here:

    StreamReader reader = new StreamReader(filename);
    Employee emp1 = (Employee) soap.Deserialize(reader.BaseStream);
    reader.Close();

The .NET Framework serialization mechanism also allows you to control the postdeserialization processing and explicitly handle data being serialized and deserialized. In this way, you are given a chance to restore transient state and data that, for one reason or another, you decide not to serialize. Remember that by marking a field with the [NonSerialized] attribute, you keep it out of the serialized stream.

By implementing the IDeserializationCallback interface, a class indicates that it wants to be notified when the deserialization of the entire object is complete. The class can easily complete the operation by re-creating parts of the state and adding any information not made serializable.
The OnDeserialization method is called after the type has been deserialized.

Finally, it goes without saying that you can't serialize to, say, SOAP, and then expect to deserialize using the binary formatter. See the section "Further Reading," on page 518, for more information about run-time binary and SOAP serialization.

From SOAP to XML Serialization

A second, very special type of .NET Framework serialization is XML serialization. Compared to ordinary .NET Framework object serialization, XML serialization is so different that it shouldn't even be considered another type of formatter. It is similar to SOAP and binary formatters because it also persists and restores the object's state, but when you examine the way each serializer works, you see many significant differences.

XML serialization is handled by using the XmlSerializer class, which also enables you to control how objects are encoded into elements of an XML schema. In addition to differences in goals and implementation details, the strongest difference between run-time and XML serialization is in the level of type fidelity they provide.

Run-time object serialization guarantees full type fidelity. For this reason, binary and SOAP serialization are particularly well-suited to preserving the state of an object across multiple invocations of an application.
For example, .NET Framework remoting (see Chapter 12) uses run-time serialization to marshal objects by value from one AppDomain to another. Whereas run-time serialization is specifically aimed at serializing object instances, XML serialization is a system-provided (as opposed to object-provided) mechanism for serializing the data stored in an object instance into a well-formed schema.

The primary goal of XML serialization is making another application, possibly an application running on a different platform, effectively able to consume any stored data. Let's recap the key differences between run-time and XML serialization:

• Persisted properties: Run-time serialization takes into account any properties, regardless of the scope a property has in the context of the class. XML serialization, on the other hand, avoids private, protected, and read-only properties; does not handle circular references; and works only with public classes. In addition, if one property is set to null in the particular instance being serialized, the XML serializer just ignores the property. By default, the XML serializer does not include type information.
• Object identity: Run-time serialization maintains information about the original class name, namespace, and assembly. All this information (the object's identity) is irreversibly lost with XML serialization.
• Control of the output: Run-time serialization lets you indicate the data to serialize by adding values to a cargo collection. You can't control how these values are actually written, however. The schema of the persisted data is fixed and hard-coded in the formatter. In this respect, the XML serializer is much more flexible. The XML serializer lets you specify namespaces, the name of the XML element that will contain a particular property, and even whether a given property should be rendered as an attribute, text, or an element.

Important: During serialization, the .NET Framework formatters get information dynamically from the target object and write any bytes to the specified stream. The XML serializer uses any object information to create a couple of highly specialized reader and writer classes in a C# source file. The file is then silently compiled into a temporary assembly.
As a result, XML serialization and deserialization for an object are actually performed using the classes in the temporary assembly. (More on this in the section "The Temporary Assembly," on page 513.)

One final note about SOAP and XML serialization: Although it's more powerful in terms of the information carried, SOAP is significantly more verbose than XML serialization and of course much less flexible. In fact, SOAP is just a particular XML dialect with vocabulary and syntax rules defined by the SOAP specification. With XML serialization, you define the schema you want, and the process is designed to return a more compact output.

The XML Serializer

The central element in the XML serialization architecture is the XmlSerializer class, which belongs to the System.Xml.Serialization namespace. The XML serialization process is articulated in the following steps:

1. The serializer generates an XSD schema for the target class that includes all the public properties and fields.
2. Using this XSD schema, the serializer generates a C# source file with a made-to-measure reader and writer class. The source file is compiled into a temporary assembly.

The Serialize and Deserialize methods are simply higher level interfaces for those writer and reader classes. This list does not cover all the features of XML serialization, but it certainly focuses on the key aspects.
Let's look more closely at these key aspects before we move on to more advanced issues such as customizing the XSD schema being generated and hooking up the deserialization process.

The Programming Interface

The XmlSerializer class has a rather limited programming interface, with no properties, only a few methods, and a handful of events. XmlSerializer has several constructors with important functionalities, however. As you'll see in the following sections, the constructor is the place where most of the serializer's activity occurs.


The XmlSerializer Class's Constructors

Table 11-1 lists all the public constructors available in the XmlSerializer class. This list does not include the default class constructor because it is declared as protected and, as such, is not intended to be used directly from the user's code.

Table 11-1: Constructors of XmlSerializer

Constructor | Description
XmlSerializer(Type) | Serializes objects of the specified type.
XmlSerializer(XmlTypeMapping) | Allows you to customize the default mapping between properties and XSD elements. Adds type information to elements. Useful if you don't have the source code for the class being serialized.
XmlSerializer(Type, string) | Serializes objects of the specified type using XML elements in the given default namespace.
XmlSerializer(Type, Type[]) | Serializes objects of the specified type and all child objects listed in the specified array of extra types.
XmlSerializer(Type, XmlAttributeOverrides) | Allows you to customize the default mapping between properties and XSD elements. No type information is added to elements. Useful if you don't have the source code for the class being serialized.
XmlSerializer(Type, XmlRootAttribute) | Allows you to specify the root element of the XML output.
XmlSerializer(Type, XmlAttributeOverrides, Type[], XmlRootAttribute, string) | Sums up all the previous settings and provides a signature to set any combination of features in a single shot.

Let's review the code necessary to set up and use an XML serializer class:

    [Serializable]
    public class Employee
    {
        protected int m_ID;
        public int ID
        {
            get {return m_ID;}
        }
        public string FirstName;
        public string LastName;
        public string Position;
        public int[] Territories;

        public Employee()
        {
            m_ID = -1;
        }
        public Employee(int empID)
        {
            m_ID = empID;
        }
        public override string ToString()
        {
            return LastName + ", " + FirstName;
        }
    }

This class has one read-only member (ID), a couple of constructors, and a protected member. To begin, let's use the simplest constructor and see what happens:

    Employee emp = new Employee(1);
    emp.LastName = "Esposito";
    emp.FirstName = "Dino";
    StringWriter writer = new StringWriter();
    XmlSerializer ser = new XmlSerializer(typeof(Employee));
    ser.Serialize(writer, emp);
    string xmlText = writer.ToString();
    writer.Close();

The output generated is rather compact and does not include null and less-than-public fields, as shown here:

[XML listing; element markup lost in extraction. Surviving values: Dino, Esposito.]

The read-only ID property is ignored, as are all protected members. In addition, public properties set to null are blissfully discarded.

Caution: If the class being serialized does not provide the default constructor, an exception is thrown and the class won't be processed further. The XmlSerializer class raises an InvalidOperationException exception stating that the class can't be successfully reflected. The true reason for the exception is slightly more subtle, however. The XmlSerializer class needs to create internally an instance of the target class to collect all the information necessary to create the serialization reader and writer objects. The serializer can't make assumptions about the constructors available on the class, so it always uses the default constructor. If there is no such constructor, an exception is thrown.

Configuring the Root Node

By default, the root element is defined by the serializer. However, the serializer gives you a chance to intervene and change things around a bit. For example, you can create an XmlRootAttribute object, set some of its properties, and pass it on to the serializer constructor, as shown here:

    XmlRootAttribute root = new XmlRootAttribute();
    root.ElementName = "NorthwindEmployee";
    root.Namespace = "urn:dino-e";
    root.IsNullable = true;
    XmlSerializer ser = new XmlSerializer(typeof(Employee), root);

The subsequent output is shown here:

[XML listing; element markup lost in extraction. Surviving values: Dino, Esposito.]

Alternatively, instead of creating an XmlRootAttribute object, you can simply set another attribute to the class being serialized, as shown here:

    [XmlRootAttribute(ElementName="NorthwindEmployee")]
    public class Employee
    { ... }

Although the final effect on the XML code is the same, the two approaches are not identical. To set the attribute, you must have access to the source code for the class.
If you resort to the XmlRootAttribute object, you can change the root node of each class, including those classes available only in a compiled form.

The XmlRootAttribute class, whether applied as an attribute or passed as an object, lets you set a default namespace for all elements in the XML document being generated. If you want to set only the namespace, however, use another constructor overload, as follows:

    XmlSerializer ser = new XmlSerializer(typeof(Employee), "urn:dino-e");

In this case, the root node remains intact but an extra xmlns attribute is added.
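The two ways of configuring the root node can be put side by side in a small sketch. It is not taken from the book: the trimmed-down Employee class and the RootNodeDemo helper are illustrative assumptions, and the code assumes a C# compiler with the System.Xml.Serialization types available.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical, trimmed-down version of the Employee class used in the text.
public class Employee
{
    public string FirstName;
    public string LastName;
}

public static class RootNodeDemo
{
    // Renames the root element and assigns it a default namespace
    // through an XmlRootAttribute object passed to the constructor.
    public static string SerializeWithRoot(Employee emp)
    {
        XmlRootAttribute root = new XmlRootAttribute();
        root.ElementName = "NorthwindEmployee";
        root.Namespace = "urn:dino-e";

        XmlSerializer ser = new XmlSerializer(typeof(Employee), root);
        using (StringWriter writer = new StringWriter())
        {
            ser.Serialize(writer, emp);
            return writer.ToString();
        }
    }

    // Keeps the default root name and sets only the default namespace.
    public static string SerializeWithNamespace(Employee emp)
    {
        XmlSerializer ser = new XmlSerializer(typeof(Employee), "urn:dino-e");
        using (StringWriter writer = new StringWriter())
        {
            ser.Serialize(writer, emp);
            return writer.ToString();
        }
    }
}
```

Calling RootNodeDemo.SerializeWithRoot(emp) should yield a document whose root is NorthwindEmployee, whereas SerializeWithNamespace(emp) keeps Employee as the root and only adds the xmlns declaration.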


Methods of the XmlSerializer Class

Table 11-2 describes the methods exposed by the XmlSerializer class. As you'd expect, this list does not include methods such as ToString and Equals that are inherited from Object and overridden.

Table 11-2: Methods of the XmlSerializer Class

Method | Description
CanDeserialize | Indicates whether the contents pointed to by the specified XmlReader object can be successfully deserialized using this instance of the serializer class.
Deserialize | Deserializes an XML document read from a stream, text, or an XML reader.
FromTypes | Static method that returns an array of XmlSerializer objects created from an array of types. Useful for speeding operations when you need to create multiple serializers for different types.
Serialize | Serializes an object into an XML document.

As with the Deserialize method, the output for the Serialize method can be a stream, text, or an XML writer.

Events of the XmlSerializer Class

Table 11-3 lists the events that the XmlSerializer class triggers during the deserialization process.

Table 11-3: Events of the XmlSerializer Class

Event | Description
UnknownAttribute | Fires when the deserializer encounters an XML attribute of unknown type.
UnknownElement | Fires when the deserializer encounters an XML element of unknown type.
UnknownNode | Fires when the deserializer encounters any XML node of unknown type, including Attribute and Element.
UnreferencedObject | Fires when the deserializer encounters a recognized type that is not used. Occurs during the deserialization of a SOAP-encoded XML stream. (More on this topic in the section "Deserializing XML Data to Objects," on page 496.)

UnknownNode is a more generic event that fires for all unknown nodes. It reaches the client application before more specific events such as UnknownAttribute and UnknownElement arrive.

Serializing Objects to XML

The [Serializable] attribute, which makes a class serializable through formatters, is not inheritable and must be explicitly assigned to derived classes. No such explicit conditions exclude some classes from the benefits of the XML serialization technique. This certainly does not mean that all the classes can be serialized to XML, however.
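To make the events in Table 11-3 more concrete, here is a hedged sketch that hooks UnknownNode and UnknownElement while deserializing a document containing an element the target class does not declare. The Employee class, the UnknownNodeDemo helper, and the Nickname element are illustrative assumptions; the sketch also uses generics and lambdas, which are newer than the C# version the book targets.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical sample type for illustration.
public class Employee
{
    public string FirstName;
    public string LastName;
}

public static class UnknownNodeDemo
{
    // Names reported by the two handlers during the last Load call.
    public static readonly List<string> Nodes = new List<string>();
    public static readonly List<string> Elements = new List<string>();

    public static Employee Load(string xml)
    {
        Nodes.Clear();
        Elements.Clear();

        XmlSerializer ser = new XmlSerializer(typeof(Employee));
        // Generic event: records the name of any node the serializer cannot map.
        ser.UnknownNode += (sender, e) => Nodes.Add(e.Name);
        // Specific event: records only unmapped elements.
        ser.UnknownElement += (sender, e) => Elements.Add(e.Element.Name);

        using (StringReader reader = new StringReader(xml))
        {
            return (Employee) ser.Deserialize(reader);
        }
    }
}
```

Passing a document such as `<Employee><FirstName>Dino</FirstName><LastName>Esposito</LastName><Nickname>D</Nickname></Employee>` to UnknownNodeDemo.Load should still return a populated Employee, with Nickname reported through the handlers rather than causing a failure.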


The most restrictive condition in qualifying for XML serialization is not having circular references. A lot of relatively complex .NET Framework classes can't be serialized to XML for this reason. Want an illustrious example? Consider the DataTable class.

If you try to serialize an instance of the DataTable class, you get a fairly unclear error message. Try the following code:

    DataTable dt = new DataTable();
    XmlSerializer ser = new XmlSerializer(typeof(DataTable));
    ser.Serialize(writer, dt);

The debugger stops on the constructor line and displays a message about a certain error that occurred during reflection of the DataTable class. Like many other Microsoft ADO.NET and XML classes, the DataTable class has circular references. For example, DataTable contains the Rows property, which is a collection of DataRow objects. In turn, each DataRow object has a Table property that points to the parent DataTable object. This is clearly a circular reference, and, as such, is an appropriate justification for the run-time error.

Why Is the DataSet Object XML-Serializable?

The DataSet class (and the XmlNode and XmlElement classes) contains at least one circular reference: specifically, the Tables collection, whose child DataTable objects reference the parent DataSet object. Nevertheless, the DataSet object is serializable through the XmlSerializer class. Why is this so?

The internal module that imports the XML schema for the type to serialize (the same module that does not handle circular references) specifically checks for the DataSet type.
If the object turns out to be a DataSet object, the standard schema importation process aborts, and an alternative schema is applied. The schema importer uses the methods of the IXmlSerializable interface to serialize and deserialize a DataSet object.

The MSDN documentation only touches on the IXmlSerializable interface, which is defined in the System.Xml.Serialization namespace. This interface is not intended to be used by applications, at least not yet. IXmlSerializable defines three methods: GetSchema, ReadXml, and WriteXml. Despite their names, these ReadXml and WriteXml methods have nothing to do with the methods we saw in Chapter 9 and Chapter 10. Both serialization methods return void and are implemented privately by DataSet; ReadXml accepts a single XmlReader argument, and WriteXml accepts a single XmlWriter argument.

You can serialize to XML any class that has no circular references, a default constructor, and at least one public property. If the class implements the ICollection or IEnumerable interface, other constraints apply. In addition to these classes, the XML serializer supports three more classes as an exception to the previous rules: DataSet, XmlNode, and XmlElement.

The XmlSerializerNamespaces Class

A few of the Serialize overloads can take an extra parameter that denotes the XML namespaces and prefixes that the XmlSerializer uses to generate qualified names.
The XmlRootAttribute class we examined in the section "Configuring the Root Node," on page 486, is useful for defining the default namespace but provides no way for you to use more namespaces and prefixes.

The XmlSerializerNamespaces class can be used to cache multiple namespace URIs and prefixes that the target class will reference through attributes. You populate the namespace container as follows:

    XmlSerializer ser = new XmlSerializer(typeof(Employee));
    XmlSerializerNamespaces ns = new XmlSerializerNamespaces();
    ns.Add("d", "urn:dino-e-xml");
    ns.Add("x", "urn:mspress-xml");
    ser.Serialize(writer, emp, ns);

After it is populated, the instance of the XmlSerializerNamespaces class is passed on to one of the overloads of the Serialize method. The source class can associate properties with namespaces using a couple of attributes, XmlType and XmlElement, as shown in the following code. In particular, you use XmlType to provide a namespace to all the members of a class. XmlElement applies the namespace information to only the current element. Of course, you can use XmlType and XmlElement together, but can't use XmlType with a property. We'll return to XML attributes in the section "The XmlElement Attribute," on page 501.

    [XmlType(Namespace ="urn:dino-e-xml")]
    public class Employee
    {
        public string FirstName;
        [XmlElement(Namespace ="urn:mspress-xml")]
        public string LastName;
        public string Position;
        ...
    }

The resultant XML code is shown here. All the elements have the d prefix except the element that maps to the LastName property.

[XML listing; element markup lost in extraction. Surviving values: Dino, Esposito, CEO.]
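As a self-contained sketch of the namespace mechanism just described, the following code combines the attributed class with an XmlSerializerNamespaces container. The NamespaceDemo helper is an illustrative name, not part of the book's code, and the Employee class is reduced to the three fields shown above.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// XmlType assigns a namespace to every member of the class;
// XmlElement overrides it for a single member.
[XmlType(Namespace = "urn:dino-e-xml")]
public class Employee
{
    public string FirstName;

    [XmlElement(Namespace = "urn:mspress-xml")]
    public string LastName;

    public string Position;
}

public static class NamespaceDemo
{
    public static string Serialize(Employee emp)
    {
        XmlSerializer ser = new XmlSerializer(typeof(Employee));

        // Associate a prefix with each namespace URI used by the class.
        XmlSerializerNamespaces ns = new XmlSerializerNamespaces();
        ns.Add("d", "urn:dino-e-xml");
        ns.Add("x", "urn:mspress-xml");

        using (StringWriter writer = new StringWriter())
        {
            ser.Serialize(writer, emp, ns);
            return writer.ToString();
        }
    }
}
```

In the resulting text, FirstName and Position should carry the d prefix and LastName the x prefix; without the XmlSerializerNamespaces argument the serializer would still qualify the elements but would pick the prefixes itself.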


[XML listing fragment; element markup lost in extraction. Surviving values: Esposito, CEO, 1, 2, 3.]

Classes that must be serialized to XML can't use most of the more common collection classes. For example, the ArrayList class is serializable, but NameValueCollection, Hashtable, and ListDictionary are not. The reason lies in the extra constraints set for the classes that implement ICollection and IEnumerable.

In particular, a class that implements IEnumerable must also implement a public Add method that takes a single parameter. This condition filters out dictionaries and hash tables but keeps ArrayList and StringCollection objects on board. In addition, the type of the argument you pass to Add must be polymorphic with the type returned by the Current property of the underlying enumerator object.

A class that implements the ICollection interface can't be serialized if it does not have an integer indexer, that is, a public Item indexed property that accepts integer indexes. The class must also have a public Count property of type integer. The type of the argument passed to Add (only one argument is allowed) must be compatible with the type returned by Item.

Serializing Enumerated Types

XML serialization supports enumerated types. The serialized stream contains the named constant that identifies the value. The enum value is stored as a string, and neither the actual value nor the type is serialized.
During deserialization, the named value is reassociated with the underlying enum value through the Enum.Parse static method.

The Notion of Serializability

Having the Add method take exactly one argument is a strong, but rather inevitable, constraint that is needed to wed consistency with effectiveness of coding. Unlike run-time serialization, XML serialization never actively involves objects. XML serialization instead treats objects as passive entities. It parses their interface through reflection and irrevocably decides whether a given object can be serialized.

The basic notion of serializability is different in the two approaches. Run-time serialization is a more rigorous process based on the assumption that classes make themselves serializable by taking clear actions. XML serialization, on the other hand, is a centralized process that involves classes only for the details of the final XML schema. The XML serialization process makes assumptions about what the classes should do (or, better yet, should have done) to be serializable.

Collection classes, in particular, are seen simply as a collection of objects of a given type. By enforcing this basic concept, the XML serializer discards all collections that do not provide such an interface, that is, the Add method to append new objects of that type and the Item property (or the enumerator) to return a particular object of that type.
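The collection constraints discussed above (a public single-argument Add method, an integer indexer, and a Count property) can be sketched with a small custom collection. The OrderCollection and CollectionDemo names are hypothetical, and the Order class is reduced to a single field for brevity; this is one possible shape under those assumptions, not the book's own code.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml.Serialization;

public class Order
{
    public int ID;
}

// Hypothetical collection that meets the serializer's ICollection rules:
// an integer indexer, a public Count, and an Add method whose argument
// type matches the type returned by the indexer.
public class OrderCollection : ICollection
{
    private ArrayList items = new ArrayList();

    public Order this[int index]
    {
        get { return (Order) items[index]; }
    }
    public void Add(Order order) { items.Add(order); }
    public int Count { get { return items.Count; } }

    public void CopyTo(Array array, int index) { items.CopyTo(array, index); }
    public bool IsSynchronized { get { return false; } }
    public object SyncRoot { get { return this; } }
    public IEnumerator GetEnumerator() { return items.GetEnumerator(); }
}

public class Employee
{
    public string LastName;
    public OrderCollection Orders = new OrderCollection();
}

public static class CollectionDemo
{
    public static string Serialize(Employee emp)
    {
        XmlSerializer ser = new XmlSerializer(typeof(Employee));
        using (StringWriter writer = new StringWriter())
        {
            ser.Serialize(writer, emp);
            return writer.ToString();
        }
    }

    public static Employee Deserialize(string xml)
    {
        XmlSerializer ser = new XmlSerializer(typeof(Employee));
        using (StringReader reader = new StringReader(xml))
        {
            return (Employee) ser.Deserialize(reader);
        }
    }
}
```

Only the child Order objects are written; any other public property added to OrderCollection would be left out of the XML, which is worth keeping in mind when a collection carries state of its own.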


When designing classes destined to be serialized to XML, either avoid collection classes altogether or express their contents as an array of basic objects. One possibility is to use the ArrayList class as the container and a user-defined class to store element information. Alternatively, you could write your own collection class. In this case, however, consider that no public or private properties on the collection class would be serialized, only the child objects would be.

Tip: As mentioned, XML serialization skips over read-only data members. You can overcome this built-in behavior with a simple and inexpensive trick. Add an empty set accessor to a read-only property, as shown in the following code, and the serializer will treat the member as a read/write property. The empty set accessor will still prevent the variable from being updated, however.

    public int ID
    {
        get {return m_ID;}
        set {}
    }

The only drawback is that no compile error will be raised for (innocuous) lines of code that might attempt to assign a value to the property.

Serializing Child Classes

If a class contains a public member that belongs to a nonprimitive, user-defined class, that member would be recursively serialized as an element nested within the main XML document.
Let's see what happens with the following classes:

    public class Employee
    {
        ...
        public Order LastOrder;
        public ArrayList Orders;
        ...
    }
    public class Order
    {
        public int ID;
        public DateTime Date;
        public double Total;
    }

The Orders member is intended to be a collection of Order objects, as shown here:

    emp.LastOrder = new Order();
    emp.LastOrder.ID = 123;
    emp.LastOrder.Date = new DateTime(2002,8,12);
    emp.LastOrder.Total = 1245.23;
    emp.Orders = new ArrayList();
    Order ord1 = new Order();
    ord1.ID = 98;
    ord1.Date = new DateTime(2002,7,4);
    ord1.Total = 145.90;
    emp.Orders.Add(ord1);
    Order ord2 = new Order();
    ord2.ID = 101;
    ord2.Date = new DateTime(2002,7,24);
    ord2.Total = 2000.00;
    emp.Orders.Add(ord2);

After initializing the members as shown in the preceding code, the final output looks like this:

[XML listing; element markup lost in extraction. Surviving values, in order:]
    ...
    123, 2002-08-12T00:00:00.0000000+02:00, 1245.23
    98, 2002-07-04T00:00:00.0000000+02:00, 145.9
    101, 2002-07-24T00:00:00.0000000+02:00, 2000

As you can see, the XML code being generated contains very little type information. This is not a specific feature of XML serialization, however. The run-time object serialization process also considers type information optional, at least in most cases. This standpoint is quite reasonable. Serialization is just a way to persist the state of an object. During deserialization, an instance of the object will be created from the referenced assembly and its properties configured with the stored information. The serialization process needs mapping information rather than type information.

That said, you can see in the preceding listing that the ArrayList object is serialized with type information in its item nodes. This happens because the ArrayList class manages generic object references, whereas concrete types are needed for serialization and deserialization. To force .NET Framework formatters to include type information, you simply set the TypeFormat property of the formatter. Let's look at how to accomplish this with the XML serializer.

Adding Type Information

One of the constructors of the XmlSerializer class takes an argument of type XmlTypeMapping. The XmlTypeMapping class is used to encode and serialize an object to SOAP. The following code is used to add XSD type definitions to a serialized class:

    SoapReflectionImporter imp = new SoapReflectionImporter();
    XmlTypeMapping tm = imp.ImportTypeMapping(typeof(Employee));
    XmlSerializer ser = new XmlSerializer(tm);

Let's assume the following class definition:

    public class Employee
    {
        public int ID;
        public string FirstName;
        public string LastName;
    }

The typed XML output looks like this:

[XML listing; element markup lost in extraction. Surviving values: 4, Dino, Esposito.]

The final output gets a bit more complicated if custom types are involved.
For example, consider the following nested classes:

public class Employee
{
    public int ID;
    public string FirstName;
    public string LastName;
    public Order LastOrder;
}
public class Order
{
    public int Number;
    public DateTime Date;
    public double Total;
}

In this case, when SOAP encoding is involved, the serializer does not generate a well-formed XML document. More precisely, the XML code is correct, but the document has no root, because the child class is written at the same level as the parent class. If you don't explicitly serialize to a writer with a user-defined root, a writing exception is thrown.

The following code demonstrates how nested classes are encoded. As you can see, without the custom root element, the XML serializer would have generated only an XML fragment.

[XML listing: markup not recoverable. Values shown: employee ID 4, FirstName Dino, LastName Esposito; order Number 55, Date 2002-07-04T00:00:00.0000000+02:00, Total 2000.]

SOAP type mapping can also be used to map one type to another. In other words, while generating type information, you can also rename elements and slightly change the structure of the final serialized document. To exploit this feature in depth, you create attribute overrides, as shown here:

SoapAttributes attrib1 = new SoapAttributes();
SoapElementAttribute elem1 = new SoapElementAttribute("FamilyName");
attrib1.SoapElement = elem1;
SoapAttributeOverrides sao = new SoapAttributeOverrides();
sao.Add(typeof(Employee), "LastName", attrib1);


The preceding code creates an attribute override based on an element named FamilyName. This new element is added to an attribute overrides collection. In particular, the FamilyName attribute overrides the LastName element on the Employee type. The following code snippet shows how to hide a source element, in this case FirstName:

SoapAttributes attrib2 = new SoapAttributes();
attrib2.SoapIgnore = true;
sao.Add(typeof(Employee), "FirstName", attrib2);

The attribute overrides are gathered in the SoapAttributeOverrides collection, which is then used to initialize the SoapReflectionImporter class, as shown here, and then can be used in the type mapping in the serializer:

SoapReflectionImporter imp = new SoapReflectionImporter(sao);

We'll return to this topic in the section "XML Serialization Attributes." In particular, you'll learn how to add type information to plain XML serialization, when no SOAP-encoded types are involved.

Deserializing XML Data to Objects

The deserialization process is controlled by the Deserialize method for a variety of sources, including streams, XML readers, and text readers. Remember that by using the trick discussed in Chapter 2 for XML readers (packing a string into a StringReader object), you can also easily deserialize from strings.

Although officially you can deserialize from streams and text readers, the deserialization process is actually a matter of invoking an XML reader; more precisely, a very special breed of XML reader, optimized for serialization and for the specific class involved. Connected to the deserialization process is the CanDeserialize method. This method returns a Boolean value indicating whether the XML reader is correctly positioned on the start element of the XML data. In addition, CanDeserialize ensures that the start element of the XML data is compatible with the originally saved class.

Normally, you call CanDeserialize in the context of a more general strategy designed to trap as many errors and exceptions as possible. If the application always deserializes data that the XML serializer has previously created, a call to CanDeserialize can easily be redundant. The call becomes crucial, however, as soon as your application begins to deserialize XML data whose genuineness and quality are not guaranteed. It is worth noting that CanDeserialize works only on XML readers, whereas Deserialize can successfully handle streams and text readers too.

From a programming perspective, deserializing is not rocket science, as the following code clearly demonstrates:

StreamReader reader = new StreamReader(fileName);
Employee emp = (Employee) ser.Deserialize(reader);
reader.Close();

During the deserialization stage, a few events can be fired. In particular, the UnknownElement, UnknownAttribute, and UnknownNode events signal when unknown and unexpected nodes are found in the XML text being deserialized. The UnknownNode event is more generic than the other two and triggers regardless of the node type on which the exception is detected. In case of unknown element or attribute nodes, the UnknownNode event is fired first.
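
The string trick and the CanDeserialize check described in this section can be combined in one helper. What follows is a minimal sketch rather than code from the book's sample application: the Employee class is a cut-down stand-in, and the DeserializeFromStringSketch class and its Load method are hypothetical names introduced only for illustration. Because CanDeserialize accepts only an XML reader, the string is wrapped in a StringReader and then in an XmlTextReader.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical target class, reduced to two members for this sketch.
public class Employee
{
    public string FirstName;
    public string LastName;
}

public static class DeserializeFromStringSketch
{
    // Returns null when the start element does not match the Employee type.
    public static Employee Load(string xml)
    {
        XmlSerializer ser = new XmlSerializer(typeof(Employee));

        // Pack the string into a StringReader, then into an XML reader,
        // because CanDeserialize works only on XML readers.
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        try
        {
            if (!ser.CanDeserialize(reader))
                return null;
            return (Employee) ser.Deserialize(reader);
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main()
    {
        // Root element matches the class name, so deserialization proceeds.
        Employee emp = Load(
            "<Employee><FirstName>Dino</FirstName><LastName>Esposito</LastName></Employee>");
        Console.WriteLine(emp == null ? "Cannot deserialize" : emp.LastName);

        // Root element is not Employee, so the check fails before Deserialize runs.
        Employee other = Load("<Order><ID>98</ID></Order>");
        Console.WriteLine(other == null ? "Cannot deserialize" : other.LastName);
    }
}
```

Checking first keeps untrusted input from reaching Deserialize with an incompatible root, at the cost of one extra call that is redundant when the data is known to come from the same serializer.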


Hooking Up the Deserialization Process

The following code demonstrates how to register event handlers for the events described in the previous section:

XmlSerializer ser = new XmlSerializer(typeof(Employee));
ser.UnknownElement += new XmlElementEventHandler(GotUnknownElement);
ser.UnknownAttribute += new XmlAttributeEventHandler(GotUnknownAttribute);
ser.UnknownNode += new XmlNodeEventHandler(GotUnknownNode);

Each event requires its own event handler class and passes a distinct data structure to the client code. All the event data structures share the properties listed in Table 11-4.

Table 11-4: Common Properties of Deserialization Event Handlers

Property                   Description
LineNumber                 Gets the line number of the unknown XML attribute
LinePosition               Gets the column number in the line of the unknown XML attribute
ObjectBeingDeserialized    Gets the object being deserialized

In addition, the XmlElementEventArgs, XmlAttributeEventArgs, and XmlNodeEventArgs classes add some extra and more specific properties. Figure 11-1 shows a sample application that lets you enter some XML code.

Figure 11-1: Tracing deserialization events.

The application then attempts to map the code to the following class:

public class Employee
{
    public string LastName;
    public string FirstName;
    public string Position;
}

Any exceptions are traced in the bottom pane of the window. As shown in Figure 11-1, the ID attribute and the Title node have nothing to do with the target schema. By default, the deserializer ignores unknown nodes.

The XmlElementEventArgs class has an extra property named Element whose type is XmlElement. Likewise, XmlAttributeEventArgs features an extra Attr property that is an instance of the XmlAttribute type. The XmlNodeEventArgs class also includes a group of additional properties that look like a subset of the XmlNode class properties.

Importing Unmatched Data

The most compelling reason to use deserialization events is that they enable you to attempt to fix incoming data that doesn't perfectly match your target schema. For example, our target class contains a Position member, so the deserializer expects to find a <Position> element in the source code. If a needed element is not found, no event is triggered. However, if an unexpected node is found, the user code receives a notification.

If you know that the contents of one or more unknown elements can be adapted to populate target members, an event handler is the best place in which to have your custom code plug in and do the job. For example, suppose that the <Title> node contains the same information as Position, but expressed with a different element name.
The following code shows how to fix things up and have the information fill the Position property in the target class:

void GotUnknownElement(object sender, XmlElementEventArgs e)
{
    if (e.Element.Name == "Title")
    {
        Employee emp = (Employee) e.ObjectBeingDeserialized;
        emp.Position = e.Element.InnerText;
    }
}

You can also easily combine information coming from multiple unknown elements. In this case, however, you must figure out an application-specific way to cache crucial information across multiple invocations of the event handler. The event handler is invoked for each unknown node, although the event's ObjectBeingDeserialized property is cumulatively set with the results of the deserialization.

Shaping the XML Output

XML serialization enables you to shape the final form of the XML data being created. Although the code of the class is not directly involved in the generation of the output, the programmer is given a couple of tools to significantly influence the serialization process.

The first approach is fairly static and works by setting attributes on the various members of the class to be serialized. According to the attribute set, a given member can be rendered as an attribute, an element, or plain text, or it can be ignored altogether. The second approach is more dynamic and, more importantly, does not require the availability of the class source code. This approach is particularly effective for achieving a rather odd yet realistic result: shaping an XML flow you can't control to fit into a data structure you can't modify.

XML Serialization Attributes

The XmlAttributes class represents a collection of .NET Framework attributes that let you exercise strict control over how the XmlSerializer class processes an object. The XmlAttributes class is similar to the SoapAttributes class mentioned in the section "Adding Type Information." Both classes perform the same logical operation, but the former outputs to XML, whereas the latter returns SOAP-encoded messages with type information.

Each property of the XmlAttributes class corresponds to an attribute class. The available XmlAttributes properties and their corresponding attribute classes are listed here:

• XmlAnyAttribute: Corresponds to the XmlAnyAttributeAttribute attribute and applies to properties that return an array of XmlAttribute objects. A property marked with this attribute is populated with any unknown attribute detected during the deserialization process.
• XmlAnyElements: Corresponds to the XmlAnyElementAttribute attribute and applies to properties that return an array of XmlElement objects. A property marked with this attribute contains all the unknown elements found.
• XmlArray: Corresponds to the XmlArrayAttribute attribute and applies to all properties that return an array of user-defined objects. This attribute causes the contents of the property to be rendered as an XML array.
An XML array is a subtree in which child elements are recursively serialized and appended to a common parent node.
• XmlArrayItems: Corresponds to the XmlArrayItemAttribute attribute and applies to all properties that return an array of objects. Tightly coupled with the previous attribute, XmlArrayItemAttribute describes the type of the items in the array. XmlArrayItemAttribute specifies how the serializer renders items inserted into an array.
• XmlAttribute: Corresponds to the XmlAttributeAttribute attribute and applies to public properties, causing the serializer to render them as attributes. By default, if no attribute is applied to a public read/write property, it will be serialized as an XML element.
• XmlChoiceIdentifier: Corresponds to the XmlChoiceIdentifierAttribute attribute and implements the xsd:choice XSD data structure. The xsd:choice data type resembles the C++ union structure and consists of additional properties, only one of which is valid for each instance. The XmlChoiceIdentifierAttribute attribute lets you express the choice of which data member to consider for serialization.
• XmlDefaultValue: Corresponds to the XmlDefaultValueAttribute attribute and gets or sets the default value of an XML element or attribute.
• XmlElements: Corresponds to the XmlElementAttribute attribute and forces the serializer to render a given public field as an XML element.
• XmlEnum: Corresponds to the XmlEnumAttribute attribute and specifies the way in which an enumeration member is serialized. You use this attribute class to change the enumeration that the XmlSerializer generates and recognizes when deserializing.
• XmlIgnore: Corresponds to the XmlIgnoreAttribute attribute and specifies whether a given property should be ignored and skipped or serialized to XML as the type dictates. The attribute requires no further properties to be specified.
• XmlRoot: Corresponds to the XmlRootAttribute attribute and overrides any current settings for the root node of the XML serialization output, replacing it with the specified element.
• XmlText: Corresponds to the XmlTextAttribute attribute and instructs the XmlSerializer class to serialize a public property as XML text. The property to which this attribute is applied must return primitive and enumeration types, including an array of strings or objects. If the return type is an array of objects, the Type property of the XmlTextAttribute type must be set to string, and the objects will then be serialized as strings. Only one instance of the attribute can be applied in a class.
• XmlType: Corresponds to the XmlTypeAttribute attribute and can be used to control how a type is serialized. When a type is serialized, the XmlSerializer class uses the class name as the XML element name. The TypeName property of the XmlTypeAttribute class lets you change the XML element name. The IncludeInSchema property lets you specify whether the type should be included in the schema.

The XmlElement Attribute

The key XML attributes are XmlElement and XmlAttribute. XmlElement, in particular, has a few interesting properties: IsNullable, DataType, ElementName, and Namespace. IsNullable lets you specify whether the property should be rendered even if set to null. DataType allows you to specify the XSD type of the element the serializer will generate. ElementName indicates the name of the element.
Finally, Namespace associates the element with a namespace URI. If you want to use a namespace prefix, add a reference to that namespace using the XmlSerializerNamespaces class. The following declaration sets all four properties:

[XmlElement(Namespace="urn:mspress-xml", IsNullable=true,
    DataType="nonNegativeInteger", ElementName="FamilyName")]

When the IsNullable property is set to true and the property has a null value, the serializer renders the element with a nil attribute that equals true, as shown here:

<FamilyName xsi:nil="true" />

If you specify the DataType property, the type name must match exactly the XSD type name. Specifying the DataType property does not actually change the serialization format; it affects only the schema for the member.

The XmlAttribute Attribute

The XmlAttribute attribute also supports the DataType and the Namespace properties. IsNullable is not supported. In addition, you can replace the default name of the attribute with the string assigned to the AttributeName property. As with elements, the default name of the attribute is the name of the parent class member.

The XmlEnum Attribute

If your class definition contains an enumeration type, the XmlEnum attribute lets you modify the named constants used to define each value member, as shown here:

public enum SeatsAvailable
{
    [XmlEnum(Name = "AisleSeat")]
    Aisle,
    [XmlEnum(Name = "CentralSeat")]
    Central,
    [XmlEnum(Name = "WindowSeat")]
    Window
}

You use the Name property to modify the name of the enum member.

The XML Schema Definition Tool

Installed as part of the .NET Framework SDK, the XML Schema Definition Tool (xsd.exe) has several purposes. When it comes to XML serialization, the tool is helpful in a couple of scenarios. For example, you can use xsd.exe to generate source class files that are the C# or Microsoft Visual Basic .NET counterpart of existing XSD schemas. In addition, you can make the tool scan the public interface exposed by managed executables (DLL or EXE) and extrapolate an XML schema for any of the contained classes.

In the first case, the tool automatically generates the source code of a .NET Framework class that is conformant to the specified XML schema. This feature is extremely handy when you are in the process of writing an application that must cope with a flow of XML data described by a fixed schema. In a matter of seconds, the tool provides you with either C# or Visual Basic source files containing a number of classes that, when serialized through XmlSerializer, conform to the schema.

Another common situation in which xsd.exe can help considerably is when you don't have the source code for the classes your code manages.
In this case, the tool can generate an XML schema document from any public class implemented in a DLL or an EXE.

Overriding Attributes

A fairly common scenario for XML serialization is when you call into middle-tier class methods, get back some XML data, and then map that information onto other classes. In real-world situations, you can't control or modify the layout of the incoming XML data or the structure of the target classes.

This is certainly nothing new for experienced developers who have been involved in the design and development of distributed, multitiered systems. Normally, you resolve the issue by writing adapter components that use hard-coded logic to transform the inbound XML flow into fresh instances of the target classes. Although the map of the solution is certainly effective and reasonable, a number of submerged obstacles can make your trip through the data long and winding.

First you must parse the XML data and extract the significant information. Next you copy each piece of information into a newly created instance of a target class.
The XML serialization mechanism was designed to resolve this difficulty, thus making the process of initializing classes from XML data both effective and efficient.

Adapting Data to Classes

Reading incoming XML data is itself a kind of deserialization. However, as we've seen, the XML deserializer can only re-create an instance of the type you pass when you create the XmlSerializer object. How can you comply with any difference in the schema of the target class and the incoming XML data? That task is handled by the attribute overrides process for the XmlSerializer object, shown in Figure 11-2.

Figure 11-2: Attribute overrides are crucial architectural elements to allow effective XML-to-class mapping.

The XML serializer works on top of a particular type: the target class. While deserializing, the deserializer engine attempts to fit incoming data into the properties of the target class, taking into careful account any attributes set for the various properties.

What happens if the source and the destination follow incompatible schemas? This might seem a rather odd situation (how could you deserialize data that you haven't previously serialized?), but in practice it exemplifies the real goal of XML serialization. Beyond any technological and implementation details, XML serialization is simply a way to automatically instantiate classes from XML data.

This is not simply the problem of transforming one schema into another; instead, you must transform a schema into a class. If you don't want to write an ad hoc piece of code, you have only the following few options:

• Modify the source data to make it fit the target class through default XML serialization.
This solution is impractical if you don't have access to the component that generates this flow.
• Modify the target class with static attributes to make it support in deserialization the schema of the incoming data. This solution is impractical if you don't have access to the source code for the class, for example, if the class is deployed through an assembly.
• Override the attributes of the target class using dynamic hooks provided by the objects you can create and store in an XmlAttributeOverrides class. We'll examine this solution more closely in the section "The XmlAttributeOverrides Class."
• If the differences involve data, too, and therefore can't be addressed with schema elements, resort to deserialization events, as described in the section "Deserializing XML Data to Objects."

Attribute overriding is a technique that lets you change the default way in which serialization and deserialization occur. In addition to the case just mentioned, attribute overrides are also useful for setting up different (and selectable) serialization/deserialization schemes for a given class.

The XmlAttributeOverrides Class

You pass an instance of the XmlAttributeOverrides class to the XmlSerializer constructor. As a result, the serializer will use the data contained in the XmlAttributeOverrides object to override the serialization attributes set on the class. The XmlAttributeOverrides class is a collection and contains pairs consisting of the object types that will be overridden and the changes to apply.

As shown in the following code, you first create an instance of the XmlAttributes class, a helper class that contains all the pairs of overriding objects. Next you create an attribute object that is appropriate for the object being overridden. For example, create an XmlElementAttribute object to override a property. In doing so, you can optionally change the element name or the namespace. Then store the override in the XmlAttributes object. Finally, add the XmlAttributes object to the XmlAttributeOverrides object and indicate the element to which all those overrides will apply.

// Create the worker collection of changes
XmlAttributes changes = new XmlAttributes();

// Add the first override (change the element's name)
XmlElementAttribute newElem = new XmlElementAttribute();
newElem.ElementName = "New name";
changes.XmlElements.Add(newElem);

// Create the list of overrides
XmlAttributeOverrides over = new XmlAttributeOverrides();

// Fill the overrides list (Employee is the target class)
over.Add(typeof(Employee), "Element-to-Override", changes);

The instance of the XmlAttributeOverrides class is associated with the XML serializer at creation time, as shown here:

XmlSerializer ser = new XmlSerializer(typeof(Employee), over);

Note: Attribute overriding also enables you to use derived classes in lieu of the defined classes. For example, suppose you have a property of a certain type.
To force the serializer (both in serialization and deserialization) to use a derived class, follow the steps outlined in the preceding code but also set the Type property on the overriding element, as shown here:

// Manager is a class that inherits from Employee
newElem.Type = typeof(Manager);

Attribute overriding is a useful technique, and in the next section, we'll see it in action.

Mapping SQL Server Data to Classes

In Chapter 8, we saw the ExecuteXmlReader method exposed by the SqlCommand class in the SQL Server managed provider. The ExecuteXmlReader method executes a command against the database and returns an XML reader if the output of the command can be expressed as a well-formed XML document or fragment. Let's see what's needed to transform that output into an instance of a class. The following code is at the heart of the example. You call into a method, the method executes an SQL XML command, the data flows into the serializer, and an instance of a particular class is returned.

Employee emp = LoadEmployeeData(empID);

The following code shows the body of the LoadEmployeeData method:

private Employee LoadEmployeeData(int empID)
{
    // Create the serializer
    XmlSerializer ser = PrepareEmployeeTypeSerializer();

    // Prepare the connection and the SQL command
    SqlConnection conn = new SqlConnection(NWindConnection);
    SqlCommand cmd = PrepareSqlCommand(empID, conn);
    conn.Open();

    // Execute the command; ExecuteXmlReader is declared to return
    // XmlReader, so no cast to a concrete reader type is needed
    Employee emp = null;
    XmlReader reader = cmd.ExecuteXmlReader();

    // Deserialize the incoming data
    if (ser.CanDeserialize(reader))
        emp = (Employee) ser.Deserialize(reader);
    else
        Console.WriteLine("Cannot deserialize");

    // Clean-up
    reader.Close();
    conn.Close();
    return emp;
}

The serializer is tailor-made for the Employee class shown here:

public class Employee
{
    public string FirstName;
    public string LastName;
    public string Position;
    public DateTime Hired;
}

The SQL command used in our example is shown here:

SELECT firstname, lastname, title, hiredate FROM employees
WHERE employeeid=@empID
FOR XML AUTO

The final XML output takes the following form:

<employees firstname="..." lastname="..." title="..." hiredate="..." />

As you can see, the class requires some attribute overrides to adapt to the actual XML stream coming from SQL Server. In general, you can modify either the SQL command or the class source to make each fit the other's structure. This is not always possible, however. When it's not possible, attribute overrides are the only safe way to make two immutable and incompatible flows of data interoperate.

Overriding the Class Name

In this scenario, the serializer is used only to deserialize data coming from SQL Server. No previous serialization has been explicitly done. The deserializer reads the inbound data and determines an ad hoc class structure. It then matches this inferred structure with the specified type to be deserialized to, in this case Employee.

The first issue to consider is the name of the class. The deserializer takes the class name from the root of the stream. In our example, the inferred class name would be employees. This issue is easily resolved by creating an alias for the SQL Server table. Add an AS Employee clause to the table name, and you're done. As mentioned, however, this solution is not possible at all if you don't have enough rights to modify hard-coded SQL code. An XmlRoot attribute is another way to work around the problem.

The attribute can be assigned either statically or dynamically. Again, static attributes require that you have access to the class source code.
Let's create attributes dynamically, as follows:

XmlAttributes changesRoot = new XmlAttributes();
XmlRootAttribute newRoot = new XmlRootAttribute();
newRoot.ElementName = "employees";
changesRoot.XmlRoot = newRoot;

You create an XmlRootAttribute object and set its ElementName property to the name of the source root tag, in this case employees. Next you assign the newly created root attribute to the XmlRoot property of the XmlAttributes object that gathers all the attribute overrides for a particular element, in this case the class as a whole. To become effective, the changes must be added to an XmlAttributeOverrides object, which will then be passed to the type-specific serializer's constructor, as shown here:

XmlAttributeOverrides over = new XmlAttributeOverrides();
over.Add(typeof(Employee), changesRoot);

Overriding Class Properties

Each property of the Employee class must be renamed and remapped to match one of the source XML attributes because we assume we're working on the data flow of a FOR XML AUTO query, in which each field is rendered as an attribute. No remapping would be needed if you assumed the data flow of a FOR XML AUTO ELEMENTS query, in which fields are represented with elements.

Renaming properties is necessary because the deserializer works in a strictly case-sensitive fashion and considers firstname completely different from FirstName. The following code maps the FirstName member to the firstname attribute:


    XmlAttributes changesFirstName = new XmlAttributes();
    XmlAttributeAttribute newFirstName = new XmlAttributeAttribute();
    newFirstName.AttributeName = "firstname";
    changesFirstName.XmlAttribute = newFirstName;
    over.Add(typeof(Employee), "FirstName", changesFirstName);

You need a distinct XmlAttributes object for each element you want to override. The XmlAttributes object collects all the overrides you want to enter for a given element. In this case, after creating a new XmlAttributeAttribute object, we change the attribute name and store the resultant object in the XmlAttribute property of the overrides container.

When the overrides are for a specific element, you use a particular overload of the XmlAttributeOverrides class's Add method. In this case, you specify a third argument: the name of the element being overridden. The following code replaces the current settings of the FirstName property:

    over.Add(typeof(Employee), "FirstName", changesFirstName);

The code is slightly different if you need to override an element instead of an attribute, as shown here:

    XmlAttributes changesFirstName = new XmlAttributes();
    XmlElementAttribute newFirstName = new XmlElementAttribute();
    newFirstName.ElementName = "firstname";
    changesFirstName.XmlElements.Add(newFirstName);
    over.Add(typeof(Employee), "FirstName", changesFirstName);

A different attribute class is involved, XmlElementAttribute, with a slightly different programming interface.

Similar code should be written for each class property you want to map to a source XML attribute or element.

Caution: If the name of the XML root does not match the name of the target class, the deserializer can't proceed further, and the CanDeserialize method returns false.
If the root and class names match, the deserialization can take place. Any unmatched attributes and elements are treated as unknown objects, and the proper deserialization event is fired.

Mixing Overrides and Events

Up to now, we have considered a simple scenario in which a direct mapping exists between elements in the source XML and properties in the target class. In this case, all of your overrides end up changing the structure of the XML code being deserialized. But what if you need to apply some logic in the middle of your code? Let's consider a scenario in which the XML source contains a birthdate field but your class contains an Age property instead. In this case, an attribute override is no longer useful and hooking the deserialization process is the only way.

Earlier in this chapter, we discussed deserialization events. If the birthdate value is expressed as an attribute, you write an UnknownAttribute handler; otherwise, resort to an UnknownElement event handler. The following code snippet shows how to determine the correct value for the Age property based on birthdate:


    // Unknown attribute detected
    if (e.Attr.Name == "birthdate")
    {
        Employee emp = (Employee) e.ObjectBeingDeserialized;
        DateTime dt = DateTime.Parse(e.Attr.Value);
        emp.Age = (int) (DateTime.Now.Year - dt.Year);
    }

Populating Collection Properties

An even more complex scenario arises when the source XML contains embedded data, the result of INNER JOIN operations being rendered in XML. Consider the following statement:

    SELECT firstname, lastname, title, hiredate, birthdate,
        terr.territorydescription
    FROM Employees AS employees
    INNER JOIN EmployeeTerritories AS empterr
        ON employees.employeeid=empterr.employeeid
    INNER JOIN Territories AS terr
        ON empterr.territoryid=terr.territoryid
    WHERE employees.employeeid=@empID
    FOR XML AUTO

The XML output for the empID parameter that equals 1 is shown here:

This output changes a little bit if you use the ELEMENTS clause, as follows:

    Nancy Davolio ... Wilton Neward

The application is always notified of any terr elements through an UnknownElement event. Suppose you also want any territory description to populate a StringCollection property in the Employee class. The following code shows how to handle the event and accumulate the data for an unknown element in the string collection:

    if (e.Element.Name == "terr")
    {
        if (emp.Territories == null)
            emp.Territories = new StringCollection();
        object o = e.Element.Attributes["territorydescription"].Value;
        emp.Territories.Add(o.ToString());
    }

If the territory description is not expressed as an attribute, you can use the InnerText property of the terr element to get its value.

Figure 11-3 shows the sample application in action. The application retrieves the data for a particular employee, copies the data into an instance of the Employee class, and then displays the data through the user interface.

Figure 11-3: Using the XML serializer to deserialize the output of a SQL Server XML query.

Note: The query used in this sample application restricts the output to at most one record so that the final XML output will be an XML document instead of an XML fragment. XML fragments are not read by the XML serializer.

Performance Considerations

The first time you write test code that invokes the XML serializer, you'll notice that it takes a while to complete when compared to SOAP or binary serialization. When the serializer object is created, an unknown assembly is loaded. If you run the sample application and monitor the output window, you'll see something like this:

    'Sql2Class_CS.exe': Loaded 'qsxgw21i', No symbols loaded.


The name of the first assembly varies each time you create the XML serializer, a clear sign that it is a temporary assembly created on the fly. We'll examine the internal architecture of the XML serializer in the next section.

For now, consider that each instantiation of the XmlSerializer class results in an ad hoc assembly being created and loaded. After that, the reading and writing performance you get from the XML serializer is not different from that of other types of reading/writing tools. The creation of the assembly takes several milliseconds (probably several hundred milliseconds) as compared to the one or two milliseconds that serializing a class might take. This means that using the XML serializer taxes you for about half a second each time you instantiate the XmlSerializer class.

This book's sample files include a console application named Perf-Test that demonstrates the differences in performance you get when using XML serialization and ad hoc user code. The output is the same, but custom code runs significantly faster. On the other hand, the XML serializer saves you from writing and testing complex code for complex classes.
Keep these issues in mind if you are using XML serialization and want to improve the overall performance.

Note: The full source code for the sample application demonstrating the deserialization of SQL Server XML queries to .NET Framework classes is available in this book's sample files. The application is named Sql2Class. The application demonstrates attribute overriding and works with FOR XML and FOR XML ELEMENTS queries. In addition, it compares the performance of the serializer and a piece of ad hoc code mapping XML data to the same class.

Inside the XML Serializer

The XML serializer is a powerful tool that can transform a fair number of .NET Framework classes into portable XML code. The key thing to note is that the serializer is a kind of compiler. It first imports type information from the class and then serializes it to the output stream. It also works the other way around. The serializer reads XML data and maps elements to the target class members.

Normally, serialization and deserialization are functions that each class implements in whatever way it determines is more convenient for its data. This is precisely what happens with run-time object serialization. XML serialization works differently, however. With the XML serializer, you have a compiler tool that takes information out of the class and conveys it to the stream.
Each class is particular and, in a certain way, unique. How can a generic tool work efficiently on all possible classes? This is where the temporary assembly comes in.

The Temporary Assembly

The following listing shows the pseudocode that makes up the constructor of the XmlSerializer class:

    public XmlSerializer(Type type)
    {
        // Look up the assembly in the internal cache
        tempAssembly = Cache[type];

        // If no assembly is found, create a new one
        if (tempAssembly == null)
        {
            // Import type mapping information
            XmlReflectionImporter importer = new XmlReflectionImporter();
            XmlTypeMapping map = importer.ImportTypeMapping(type);

            // Generate the assembly and add it to
            // the cache for that type
            tempAssembly = GenerateTempAssembly(map);
            Cache.Add(type, tempAssembly);
        }
    }

The XmlSerializer class maintains an internal table of type/assembly pairs. If no known assembly exists to handle the type, a new assembly is promptly generated and cached; otherwise, the existing assembly is used to serialize and deserialize. (More on the assembly's contents in the next section.)

Each instance of the XmlSerializer class maintains a reference to the assembly to be used for reading and writing operations. In the preceding pseudocode, tempAssembly is the name of this data member. Both the Serialize method and the Deserialize method use this reference to obtain the tailor-made reader and writer objects to work on the particular type.

The Assembly Cache

The assembly cache is built around a hash table that contains objects of type TempAssembly. As the ILDASM shows in Figure 11-4, the assembly cache corresponds to a class named TempAssemblyCache.
The XmlSerializer class holds a static TempAssemblyCache member that is shared by all instances of the XmlSerializer you might create.

Figure 11-4: Peeking into the System.Xml.Serialization namespace.

The TempAssembly class maintains information about the assembly that provides reader and writer classes optimized to XML serialize and XML deserialize classes of a certain type. To build a type-specific assembly, the serializer needs fresh information about the type being serialized. An infrastructure class named XmlReflectionImporter retrieves this information through the .NET Framework reflection API. The type data is packed into an XmlTypeMapping structure and then passed to the internal method that provides for the assembly generation.
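The type-import step described above can be observed from user code, because XmlReflectionImporter and XmlTypeMapping are exposed publicly in System.Xml.Serialization. What follows is only a minimal sketch, not the serializer's actual internal code: the Employee class is a hypothetical stand-in for the chapter's class, and MappingProbe and RootNameFor are names invented for illustration. It shows how an XmlRoot override changes the root element name recorded in the type mapping.

```csharp
using System;
using System.Xml.Serialization;

// Hypothetical stand-in for the chapter's Employee class.
public class Employee
{
    public string FirstName;
    public string LastName;
}

// Illustrative helper; not part of the .NET Framework.
public static class MappingProbe
{
    // Imports the type mapping for Employee and returns the root element
    // name it records, optionally after applying an XmlRoot override.
    public static string RootNameFor(string overrideRoot)
    {
        XmlAttributeOverrides over = new XmlAttributeOverrides();
        if (overrideRoot != null)
        {
            XmlAttributes changesRoot = new XmlAttributes();
            changesRoot.XmlRoot = new XmlRootAttribute(overrideRoot);
            over.Add(typeof(Employee), changesRoot);
        }

        // The importer reads the type through reflection and packs the
        // result into an XmlTypeMapping object.
        XmlReflectionImporter importer = new XmlReflectionImporter(over);
        XmlTypeMapping map = importer.ImportTypeMapping(typeof(Employee));
        return map.ElementName;
    }

    public static void Main()
    {
        Console.WriteLine(RootNameFor(null));
        Console.WriteLine(RootNameFor("employees"));
    }
}
```

Inspecting the mapping this way is a cheap check that a set of overrides says what you intend before you pay the cost of generating the temporary assembly.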


Caution: The main purpose of the assembly cache is to save you from repeatedly re-creating the assembly for the same type in the same application session. Unfortunately, this seems to work only if you use the simplest XmlSerializer constructor, as shown here:

    XmlSerializer ser = new XmlSerializer(type);

All other constructors (that is, those that can accept namespaces, type mapping, and attribute overrides) never look into the cache to find matching assemblies. The net effect of this behavior is that if you use, say, attribute overrides, as we did earlier, the assembly for the type is generated each time you call the constructor, even if the type is always the same.

To work around this, use global instances of the XmlSerializer class, one for each type you plan to work on. This workaround is not strictly required if you use the simple constructor, but using a global serializer for each type results in slightly more efficient code because you avoid any access to the cache. With any other constructor, not doing so will certainly result in significantly slower code. Generating the assembly pays for itself in a single serializing or deserializing operation.

Assembly Creation

The assembly is created from dynamically generated C# source code. The code contains two classes whose names are hard-coded as XmlSerializationReader1 and XmlSerializationWriter1. The former class works like a tailor-made reader for the type being deserialized.
The latter class is an ad hoc writer that dumps out to XML the contents of the specified object instance. The classes are generated in the Microsoft.Xml.Serialization.GeneratedAssembly namespace.

The serializer's constructor uses an internal code-writer object to transform all type information stored in XmlTypeMapping into C# source code. The C# source file, as well as the assembly, are generated in a temporary folder, the path returned by Path.GetTempPath. Normally, the following temporary path is used:

    C:\Documents And Settings\[user name]\Local Settings\Temp

If you monitor this folder with a tool like the one shown in Figure 11-5, you'll discover what really happens when you call the XmlSerializer constructor.

Figure 11-5: Unveiling the clandestine life of the temporary assembly.


As you can see, the first file created is a C# source file whose name has been randomly generated. Next the serializer invokes the C# compiler, and the assembly is soon created! The files are cached in memory and deleted from disk immediately after having been created. It's almost impossible to programmatically catch those files and make a copy for further perusal. The XML Serialization Notifier tool (XmlSerial_CS) shown in Figure 11-5 (and available in this book's sample files) uses the FileSystemWatcher class to monitor file system events that take place in a given folder. The only trick I've come up with to get my hands on the serializer's internal files is dropping the delete permission on the folder.

Figure 11-6 shows the files generated using this trick for the serializer instance shown in Figure 11-5.

Figure 11-6: The ILDASM view of the temporary assembly's contents.

Serialization Writers and Readers

Let's take a brief look at what happens under the hood of the XmlSerializationReader and XmlSerializationWriter classes. The MSDN documentation touches on these two classes, which form the substrate of the classes contained in the temporary assembly. The XmlSerializationReader and XmlSerializationWriter classes are meant for internal use and are not intended to be used directly from user code. More interesting than the actual contents of the classes is how the Serialize and Deserialize methods interact with them.

Serializing to XML

The Serialize method first gets a reference to the class-specific type writer.
An instance of the XmlSerializationWriter1 class is returned by the TempAssembly class that represents the temporary assembly. Once the Serialize method holds a reference to the actual serialization writer, it calls the write method that outputs XML code to a text writer.

Deserializing from XML

Although a CanDeserialize method is provided, the Deserialize method never calls it. If the type is not fully serializable, or if errors occurred somewhere along the way, the Deserialize method fails, throwing an exception.

If no errors occur, the Deserialize method asks the temporary assembly to return a reference to the reader object to be used. The reader object is simply an instance of the XmlSerializationReader1 class. The method that actually returns the object is one of the ReadN_XXX methods, where N is the method index and XXX is the type.
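The advice given earlier in this chapter, to keep one global serializer per type when attribute overrides are involved, can be sketched as follows. This is an illustration under stated assumptions rather than the book's sample code: the Employee class, the EmployeeSerializer holder, and its Map and Parse helpers are hypothetical names, and the input string only imitates the shape of a FOR XML AUTO row. Current Microsoft documentation lists the (Type) and (Type, String) constructors as the ones that reuse generated assemblies, so a serializer built with overrides is worth creating once and keeping.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for the chapter's Employee class.
public class Employee
{
    public string FirstName;
    public string LastName;
    public int Age;
}

// Illustrative holder: builds the override-based serializer once,
// so the temporary assembly is generated a single time per process.
public static class EmployeeSerializer
{
    public static readonly XmlSerializer Instance = Create();

    static XmlSerializer Create()
    {
        XmlAttributeOverrides over = new XmlAttributeOverrides();

        // Map the class to the root tag produced by the query.
        XmlAttributes changesRoot = new XmlAttributes();
        changesRoot.XmlRoot = new XmlRootAttribute("employees");
        over.Add(typeof(Employee), changesRoot);

        // Map each member to the lowercase source attribute.
        Map(over, "FirstName", "firstname");
        Map(over, "LastName", "lastname");

        XmlSerializer ser = new XmlSerializer(typeof(Employee), over);
        ser.UnknownAttribute += OnUnknownAttribute;
        return ser;
    }

    static void Map(XmlAttributeOverrides over, string member, string xmlName)
    {
        XmlAttributes changes = new XmlAttributes();
        changes.XmlAttribute = new XmlAttributeAttribute(xmlName);
        over.Add(typeof(Employee), member, changes);
    }

    // birthdate has no matching member, so it surfaces here.
    static void OnUnknownAttribute(object sender, XmlAttributeEventArgs e)
    {
        if (e.Attr.Name == "birthdate")
        {
            Employee emp = (Employee) e.ObjectBeingDeserialized;
            DateTime dt = DateTime.Parse(e.Attr.Value, CultureInfo.InvariantCulture);
            // Same year-difference approximation used earlier in the chapter.
            emp.Age = DateTime.Now.Year - dt.Year;
        }
    }

    public static Employee Parse(string xml)
    {
        using (StringReader reader = new StringReader(xml))
        {
            return (Employee) Instance.Deserialize(reader);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        Employee emp = EmployeeSerializer.Parse(
            "<employees firstname=\"Nancy\" lastname=\"Davolio\" " +
            "birthdate=\"1948-12-08T00:00:00\" />");
        Console.WriteLine(emp.FirstName + " " + emp.LastName + ", " + emp.Age);
    }
}
```

Every call to Parse reuses the same serializer instance, so only the first use pays for assembly generation. A static field is the simplest form of the "global instance" idea; a dictionary keyed by type would serve when several types are involved.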


Conclusion

The XML serializer is a double-edged sword. On one hand, it lets you serialize and deserialize even complex .NET Framework classes to and from XML with very few lines of code. To accomplish this, the serializer needs to create an assembly on the fly. If you don't use a global instance of the serializer for each type, you can easily add hundreds of milliseconds of overhead to each call, definitely not a pleasant prospect.

On the other hand, appropriately used, XML serialization produces more compact code than run-time SOAP serialization. If you add type information, and SOAP type information in particular, the ratio changes, however. The moral of this story is don't ever mix XML and SOAP; use only the process you need.

Serialization is one of the new frontiers of XML. It is not clear yet whether today's SOAP, extensions to SOAP, or a brand-new dialect will become the universal platform for describing objects. Currently, XML serialization is a hybrid, incomplete technology. Originally designed as a tool running underneath the .NET Framework implementation of Web services, XML serialization entered prime time a bit too early, or if not too early, certainly not optimized.

If you look at XML serialization as a way to save and resume objects to and from a tag-based description, the current architecture makes sense because it is fairly unobtrusive and even efficient. The apparently odd use of a temporary assembly is fully justified in a Web service context.
As we'll see in Chapter 13, the return type of a Web method is serialized back to the caller using an instance of the XmlSerializer class. In this context, a Web service class does not need to use attribute overriding or other features that require a rich constructor. This could be just the unofficial explanation for the fact that assembly caching is enabled only for the simplest constructor. This was originally the core of what we know today as XML serialization. All the rest was untidily tacked on when someone pushed XML serialization into prime time.

If you look at object serialization from a broader perspective, you can't help wondering why run-time object serialization and XML serialization are so different. My hunch is that XML serialization was initially designed as an internal tool and was tailor-made for use with Web services. In that context, a dynamic assembly is useful and speeds up the process. The XML serializer then came to be seen, and with good reason, as a more powerful and useful tool to be made public and with a richer programming interface. This project is still incomplete. Overall, XML serialization touches a programmer's sensitive nerve, but at least in this version of the .NET Framework, it comes with a clearly inconsistent design, although with some great ideas sprinkled here and there. It's as if the technology was rushed out the door with no further thought.
A glimpse of the potential future of XML serialization is buried in the recesses of the DataSet object, in the IXmlSerializable interface. Forcing objects to make themselves XML serializable by implementing a particular interface is a clean way toward much faster, better designed, consistent, and more effective code.

Further Reading

This chapter focused on XML serialization. For a more thorough coverage of object serialization in general, look at Programming Visual Basic .NET by Francesco Balena (Microsoft Press, 2002). Chapter 11 of that book provides a comprehensive explanation of run-time object serialization in the .NET Framework, including XML serialization.

SOAP was also repeatedly mentioned in this chapter. A good introduction to SOAP that successfully weds philosophy and technology can be found in Don Box's article "Young Person's Guide to the Simple Object Access Protocol: SOAP Increases Interoperability Across Platforms and Languages," in MSDN Magazine, March 2000.


A good source for learning about SOAP in general terms and not specifically from a .NET Web service perspective is Understanding SOAP, by Kennard Scribner and Mark Stiver (SAMS, 2000). For an in-depth reference discussing both SOAP and Web services from a .NET Framework angle, try Building XML Web Services for the Microsoft .NET Platform, by Scott Short (Microsoft Press, 2002).


Chapter 12: The .NET Remoting System

The Microsoft .NET Framework infrastructure for remoting is the set of system services that enable .NET applications to communicate and exchange data and objects. In this chapter and Chapter 13, you'll find an annotated overview of the two technologies that constitute the .NET answer to the universal demand for a seamless and effective mechanism for building distributed and interoperable applications: .NET Remoting and Web services.

Before we begin our technical examination of the .NET Remoting architecture, a broader perspective is necessary to understand how .NET Remoting (that is, a non-XML technology) fits into a book about XML.

Interprocess Communications in the .NET Framework

Web services and .NET Remoting are distinct, stand-alone technologies that share a common root but have different sets of features and, more important, different goals. Both Web services and .NET Remoting let you publish functions over a network and handle incoming calls. Both share an architectural design that includes layers for request/response handling, object serialization, and data transportation.
Both share underlying network protocols such as Simple Object Access Protocol (SOAP) and HTTP.

Overall, Web services and .NET Remoting are two distinct and independent sides of the same coin. Web services, a clearly XML-based technology, are a special case of the .NET Remoting infrastructure. The .NET Framework infrastructure for remoting can be seen as an abstract approach to interprocess communication. Web services and .NET Remoting are technologies that represent concrete implementations of that abstract interface. As distinct implementations, they end up using different building blocks to set up constituent features such as object serialization, type description, and reflection. The actual underlying technologies that make Web services and .NET Remoting happen are chosen according to the final goal of each technology.

Web services are targeted to cross-platform communication and heterogeneous systems. .NET Remoting doesn't allow for cross-platform communication, but it is highly optimized for .NET-to-.NET communication.
In a nutshell, .NET Remoting takes in the best aspects of its Microsoft Win32 predecessor, Distributed Component Object Model (DCOM), and elegantly fills in the gaps.

In this chapter and Chapter 13, we'll examine the major features of each technology and demonstrate that a common, platform-independent piece of code (say, a .NET Framework class) can be exposed in both models and perform in the same way in .NET Framework as well as Win32 and Linux applications.

.NET Remoting as a Better DCOM

Prior to the advent of the .NET Framework, DCOM was the underlying technology of choice for any sort of remote communication between Microsoft Windows applications. Based on a proprietary binary protocol, DCOM has suffered since its conception from a number of shortcomings. For this reason, DCOM never charmed its way into the average programmer's heart, although it did prove to be functional and effective. DCOM is somewhat quirky to set up and configure, and under certain, but relatively frequent, circumstances, it also raises serious interoperability exceptions that basically put you in the unenviable position of having to change the connectivity engine for the sake of the application or simply give up.


Note: Some programmers believe that .NET Remoting is even harder to set up than DCOM. They point out that DCOM, at least, has a tool (dcomcnfg.exe) to help with the setup and configuration of remote components; .NET Remoting has no such tool (although the Control Panel applet called the .NET Framework Configuration tool [mscorcfg.msc] provides a minimal amount of configuration support). My personal opinion, however, is that the tasks required to set up a .NET Remoting application are far simpler to understand than the equivalent DCOM tasks.

Aware of the ubiquity of HTTP, which allows you to legitimately penetrate any system through the always open port 80, at a certain point users began asking more and more for distributed applications capable of interconnecting and interoperating with any sort of remote system. For a time, the most natural response to such a demand seemed to be taking the official Windows Component Object Model (COM) and attaching a logical wire to both ends. DCOM became the network extension of COM, thus building a new infrastructure on the same successful component technology. Seamless integration, a short learning curve, and a concrete possibility of retaining existing investments in COM-based applications and tools were understandably the most intriguing benefits of DCOM.

DCOM works as a wrapper for COM components.
DCOM takes care of all that boring stuff about low-level network protocols and leaves you free to concentrate your efforts on the bread and butter of your business: planning and realizing great and effective solutions for customers.

DCOM is a binary protocol that has in its favor a theoretically excellent measure of performance, especially when compared to text-based interactions such as those taking place over HTTP and the Internet. DCOM applications are fundamentally location independent, as the protocol infrastructure covers the physical distance between users in the way it finds best. For example, DCOM automatically creates a pair of proxy/stub modules for any interprocess and intermachine communication and resolves the call within the boundary of the current process, whenever this is possible and plausible. So where are the jarring notes with DCOM?

DCOM Shortcomings

In many Internet scenarios, the level of connectivity allowed between a client and a server is subject to a variety of restrictions. For example, on its way to the remote server component, a client component might run across a proxy server that filters and controls outbound network traffic. As a result, the proxy might prevent the client from properly interacting with the object of its software desire. Furthermore, a firewall might filter any incoming Internet requests to protect the server components from any unauthorized contact.
A firewall normally defines the combination of network ports, packets, and protocols that is acceptable for the safety and the health of the network environment running behind it.

The ultimate effect of such restrictions is that a DCOM client and a server can set up and carry out a conversation only through a quite narrow set of protocol and port combinations. When opening a port and sending out the packets that constitute a method request, DCOM dynamically selects a network port in the range 1024 through 65,535. Unfortunately, system administrators normally prohibit inbound Internet traffic from passing through these ports and penetrating into intranet microcosms.

Using DCOM over the Internet is not particularly reliable, or at least not as reliable as it is in intranet scenarios. The fact that DCOM can use such a wide range of ports makes coding significantly easier. In fact, programmers don't have to worry about possible conflicts with other applications attempting to access the same port. In addition, dynamic port allocation also increases the overall level of flexibility because the particular communication port doesn't have to be hard-coded or persisted somewhere as an application-specific argument. On the down side, system administrators don't usually agree to leave such a wide range of ports open to inbound traffic because doing so could leave a major hole in security.

Note: The DCOM security model is based on the assumption that developers and administrators configure the security settings properly for each component. The net effect of this approach is that the same binary code works unchanged both in environments in which the security is of no concern (for example, on a local single machine) and in environments in which the code needs to be processed in a secure fashion (as in a fully distributed environment).

DCOM Extensions for the Internet

Over the years, DCOM has been extended to work around this security issue. In particular, the COM Internet Services (CIS) layer has given DCOM the capability to work over port 80 thanks to a new transportation protocol called Tunneling Transmission Control Protocol (TTCP). CIS works as an Internet Server Application Programming Interface (ISAPI) filter and requires Microsoft Internet Information Services (IIS) 4.0 or later to run on the server machine. Basically, TTCP works by fooling the firewall.
At the very beginning of each DCOM operation, TTCP shakes hands with the server, declaring its intention to use HTTP over port 80. If the firewall agrees, what follows is a traffic pattern of non-HTTP packets that are blissfully delivered over port 80 of those firewalls that are lazy enough to accept binary packets over HTTP. All in all, CIS gives DCOM a good chance of entering through a window when it finds that the front door is locked. This peculiarity also affects the way DCOM components work. In fact, server components can't call back the client component to sink events or send notifications.

.NET Remoting to the Rescue

What's new and better with .NET Remoting? The advent of the .NET Framework pushed COM-related technology aside, and DCOM is no exception. The .NET Framework architecture for remoting arose completely redesigned, with two key goals to pursue: allowing for seamless and location-independent coding while providing a fully operational way of interacting with restricted servers.

Note: The previous statement about the diminished status of COM-related technologies doesn't imply that existing COM components are obsolete in the new .NET world. On the contrary, the .NET Framework integrates seamlessly with COM components and the Win32 API through ad hoc interoperability mechanisms such as COM Callable Wrappers (CCWs) and P/Invoke.
By adopting a "leave-no-COM-object-behind" philosophy, the .NET Framework designers ensured the continued success of existing COM-related technologies even as developers migrate to the .NET Framework.

The .NET Remoting classes allow for optimized and effective communication between .NET Framework applications. They don't offer even the possibility of being used in any other scenario. For cross-platform scenarios in which heterogeneous environments are involved, you must use Web services. But if you need to set up communication between two .NET Framework applications, nothing is better and more efficient than .NET Remoting.

What Is .NET Remoting?

The entire set of services that enable .NET Framework applications to communicate with each other falls under the umbrella of .NET Remoting. Such applications can reside on the same computer, can work on different computers in the same LAN, and can even be scattered across the world in heterogeneous networks but on homogeneous platforms, that is, platforms that can host the common language runtime (CLR) and access the .NET Framework.

The .NET Remoting architecture enables you to use different transportation protocols, serialization formats, object lifetime schemes, and modes of object creation. In addition, programmers can directly plug into the flow of messages that each communication originates and can hook up activities at various stages of the process.

At a lower level of abstraction, however, the only thing .NET Remoting can do for you is enable communication and data exchange between different application domains (AppDomains).

Application Domains

The .NET Framework CLR provides a feature-rich execution environment for code. Within the CLR, code finds available services like garbage collection, security, versioning, and threading.
Executable code must be loaded into the CLR to be managed while running, however.

Note: Currently, only the Microsoft Windows XP operating system is equipped with a CLR-aware program loader capable of running a .NET Framework executable within the context of a CLR instance. For compatibility with all non-XP Windows operating systems, all .NET Framework executables include a tailor-made stub program that operating systems automatically launch when executables don't match the current system platform. This stub passes the control to another piece of code that instantiates the CLR and loads the managed code into it. See the section "Further Reading," on page 559, for additional resources on this topic.

To run an application's code, the instance of the CLR must obtain a pointer to an AppDomain. AppDomains are separate units of processing that the CLR recognizes in a running process. All .NET Framework processes run at least one AppDomain, known as the default AppDomain, that is created during the CLR initialization. An application can have additional AppDomains. Each AppDomain is independently configured and given personal settings for security, reference paths, and configuration files.

AppDomains are separated and isolated from one another in a way that resembles process separation in Win32.
The CLR enforces isolation by preventing direct calls between objects residing in different AppDomains. From the CPU perspective, AppDomains are much more lightweight than Win32 processes and provide for a more lightweight mechanism of isolation between processing units. The .NET Framework provides the remoting API as a tailor-made set of system services to access an object that resides in an external AppDomain. Figure 12-1 illustrates such an inter-AppDomain communication.


Figure 12-1: Inter-AppDomain communication in the .NET Framework.

Why AppDomains Do It Better

Managed code needs an AppDomain to run, but it must also pass through a verification process before it can be run. Code that passes such a test is said to be type-safe. Type-safe code never reads memory that has not been previously written, never calls a method using an incorrect number of arguments, and always assigns a return value to functions. In summary, type-safe code can't cause memory faults, which in Win32 were one of the reasons to have a physical separation between process memory contexts. The certainty of running type-safe code allows the CLR to provide a level of isolation as strong as process boundaries, but more cost-effective because an AppDomain is a logical process and as such is more lightweight than a true process.

Note: Direct use of pointers is allowed in C# as long as you explicitly mark your code (classes, methods, and interfaces) as unsafe by using the unsafe keyword. Unsafe code loads and runs in an AppDomain, just like managed code, but isn't verified to be type-safe. Unsafe code is supported by the C# compiler only.

Unlike Win32 processes, you can have several AppDomains running within the boundaries of the same .NET Framework application. Individual domains can be stopped without stopping the entire process, but you can't unload only a single assembly within an AppDomain.
Managed code running in an AppDomain is carried out by a particular thread. However, threads and AppDomains are orthogonal entities in the sense that you can have several threads active during the execution of the AppDomain's code, but a single thread is in no way limited to running only within the context of a given AppDomain.

Location Transparency

From an application's standpoint, an external AppDomain can transparently be another AppDomain in the same process, the default AppDomain in another process on the same machine, and even an AppDomain residing on a physically distant machine. All the low-level details that make each of these scenarios unique are transparently handled by .NET Remoting; the user is responsible only for higher-level aspects such as actual network paths or the URLs used to set up the communication.

Remotable Objects

The overall architecture that makes .NET Remoting happen is extremely modular and flexible enough to let you customize several aspects of the service. For example, you can decide whether remote objects should be marshaled on the local platform by value or by reference. Similarly, you can control how objects are activated and whether the activation should take place on the client or on the server. Programmers also can intervene in the object's lifetime and specify the most suitable communications channel and formatter module for transporting messages to and from remote applications.

A remotable object can be implemented in one of two ways. One possibility is that you design the class to be serializable so that its instance data can be marshaled from the server to the client. At the receiving end, the client unmarshals the data and creates another instance of the class with the same values as the instance on the server. This approach is referred to as marshal by value (MBV). The other possibility is that the class allows for its object reference to be marshaled. When unmarshaled on the client, the object reference becomes a proxy to the remote instance. This second approach is known as marshal by reference (MBR).
Unlike MBV, MBR preserves the object's identity.

No matter how you design your remotable objects (MBV or MBR), a network connection must always exist between the client application and the remote object for .NET Remoting to work.

Note: .NET Remoting doesn't support the automatic download of the assembly containing the type of the instance that is being marshaled (unlike other remote access technologies, such as Java's Remote Method Invocation [RMI]). Instead, the assembly for the type needs to exist on the client beforehand. How the assembly gets on the client is outside the purview of .NET Remoting.

Marshaling Objects by Value

Marshaling by value downloads the entire object's contents to the client, which uses the instance data to initialize a client-side object of that type. The client obtains a perfect local clone of the original object and can work with it completely oblivious to the fact that the object data has been downloaded from a remote location.

In general, MBV is not recommended when you have to cope with large objects with several properties. With MBV, you take the risk of consuming a significant portion of bandwidth to perform the full object's data download, thus subjecting the client to a potentially long wait to execute only one or two methods. MBV also imposes some constraints on the remotable objects. In particular, any objects that need to be consumed by value must qualify as serializable, which is not the case for all objects. In addition to objects that deliberately make themselves nonserializable, some objects are objectively hard to serialize.
In this list, you certainly find classes that represent or contain database connections. More generally, the list includes all those objects that can't be reasonably represented outside their native environment. This happens when all or part of the information stored in an object does not make sense once the object is transferred to the client. If the object has any implicit dependencies on server-side resources, you can't just use it from the client. For example, if the class has a method that accesses a SQL Server table, you could call it from the client only if the same SQL Server table is accessible from the current location.


When to Marshal by Value

So how do you know when MBV is a good option? Let's say that MBV is a compelling option when the following conditions are true:
• The object is not particularly large and complex.
• You're going to make intensive use of the object.
• You have no special security concerns.
• The object has no dependencies on remote resources such as files, databases, devices, or system resources.

Some rather illustrious .NET Framework classes that support remoting through the MBV technique are the DataSet and DataTable classes.

MBV Objects

The .NET Remoting system serializes all the internal data of MBV objects and passes the stream to the calling AppDomain, as illustrated in Figure 12-2.

Figure 12-2: How .NET Remoting marshals objects by value.

After the data is in the client AppDomain, a new local object is instantiated and initialized and starts handling calls. To write remotable objects that are exchanged by value, you need to make them serializable by marking the class with the SerializableAttribute attribute; a class that needs to control how it is serialized can additionally implement ISerializable. Aside from this, nothing else is required for instances of the class to be passed by value across AppDomains.

Marshaling Objects by Reference

When an object is marshaled by reference, the client process receives a reference to the server-side object, rather than a copy. This means that any call directed to the object is always resolved on the server within the native context of the object.
The remoting infrastructure governs the call, collecting all information about the call and sending it to the server process. On the server, the correct object is located and asked to execute the call using the client's arguments. When the call is finished, the results are packaged and sent back to the client. Unlike MBV, MBR uses the network only for transmitting arguments and return values. Figure 12-3 shows the architecture of MBR remoting.
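
To make the two marshaling styles concrete, here is a minimal sketch of how a type's declaration determines the way it crosses an AppDomain boundary. The type names (SalesSummary, SalesService, LocalOnlyHelper) and the Describe helper are hypothetical and exist only for illustration; Describe is not a framework API, it simply restates the rule of thumb that a class deriving from MarshalByRefObject travels by reference, a serializable class travels by value, and anything else is not remotable.

```csharp
using System;

// Hypothetical sample types; the names are illustrative and not part of the chapter's code.

// MBV candidate: marked serializable and not derived from MarshalByRefObject,
// so a copy of its data is sent to the caller.
[Serializable]
public class SalesSummary
{
    public int Year;
    public decimal Total;
}

// MBR candidate: derives from MarshalByRefObject, so remote callers receive
// a proxy and every call executes where the real object lives.
public class SalesService : MarshalByRefObject
{
    public SalesSummary GetSummary(int year)
    {
        // Placeholder data only; a real service would query a data source.
        SalesSummary summary = new SalesSummary();
        summary.Year = year;
        summary.Total = 0m;
        return summary;
    }
}

// Neither serializable nor a MarshalByRefObject: cannot cross an AppDomain boundary.
public class LocalOnlyHelper
{
}

public static class MarshalingStyle
{
    // Illustrative helper (an assumption of this sketch, not a framework method).
    public static string Describe(Type type)
    {
        if (typeof(MarshalByRefObject).IsAssignableFrom(type))
            return "MBR";
        if (type.IsSerializable)
            return "MBV";
        return "nonremotable";
    }
}
```

Under these assumptions, MarshalingStyle.Describe(typeof(SalesService)) yields "MBR", while SalesSummary is classified as "MBV" and LocalOnlyHelper as "nonremotable". Note that the return value of GetSummary is itself an MBV object, so a remote caller would receive a copy of the summary even though the service stays on the server.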


Figure 12-3: How .NET Remoting marshals objects by reference.

The .NET Remoting implementation of MBR provides for a proxy/stub pair and a physical channel for network transportation. The proxy represents the remote object to the client, as it simply mirrors the same set of methods and properties. Each client invocation of a remote method actually hits the local proxy, which, in turn, takes care of routing the call down to the server. A method invocation originates a message that travels on top of a channel and a transmission protocol.

Each message passes through a chain of hook objects (called sinks) on each side of the transport channel. Sinks are nearly identical to Windows hooks. By defining and registering a sink, the programmer can perform a specific operation at a specific stage of the remoting process. Because the creation of the proxy takes place automatically, the programmer has little to do other than creating an instance of the target object and issuing the call.

If the object resides in an external AppDomain, the remoting infrastructure creates a local proxy for it to perform the requested operation. But how can the code determine whether a given object is local, lives in a remote AppDomain, or just doesn't exist? In spite of the sophisticated code that constitutes the remoting infrastructure, programming remote objects is mostly a matter of setup.
Once the client has been properly configured, you normally create a new instance of the remote class using the new operator, no matter what type of class you're calling and where it resides. Clients must declare to the CLR which classes are remote and provide connection information. Remote objects, in turn, must be publicly available and bound to a given channel.

The MarshalByRefObject Class

Inheriting from the MarshalByRefObject class is the key that enables user classes to be accessed across AppDomain boundaries in applications that support remoting. MarshalByRefObject is the base class for objects that communicate across AppDomains. Serializable classes that do not inherit from MarshalByRefObject, when instantiated from a remote assembly, are implicitly marshaled by value. Other classes are simply considered nonremotable.

So if you want to write a remote component that uses the network efficiently and always runs on the server, the only thing you have to do is create the class inheriting from MarshalByRefObject, as follows:

    public class NorthwindService : MarshalByRefObject
    {
        public DataSet GetSalesReport(int year) {...}
    }

For example, the NorthwindService class shown here is ideally suited to act as a remote console that clients access through transparent proxies.

Note: When creating a remotable object, you normally limit the class to inheriting from MarshalByRefObject. In some situations, however, you might want to override some of the parent class's methods. In particular, you might want to replace the InitializeLifetimeService method and configure the object's lifetime. We'll return to this topic in the section "Memory Management," on page 551.

The ObjRef Class

When a MarshalByRefObject object is being remoted, the .NET Remoting system packs all the relevant information into an ObjRef object. An ObjRef object is a serializable representation of the original MBR object. This intermediary object enables the .NET Remoting system to transfer an object reference across the boundaries of AppDomains. In effect, the entire action of marshaling by reference can be summarized with the creation of an ObjRef object.

An ObjRef object contains information that describes the type and the class of the object being marshaled, the exact location, and any communication-related information such as port and protocols. The ObjRef instance is created on the server when the MBR object is first referenced; next it is transferred into the target AppDomain, possibly in another process or on another machine.
On the client, the ObjRef object is then deserialized, and the real proxy is created to access the remote instance of the MBR object. This operation is globally known as unmarshaling.

The RealProxy Class

RealProxy is an abstract class that represents a remoting proxy. Any remoting client transparently uses an instance of this class to issue calls to the remote object. The overall .NET Framework model for distributed programming is designed to create the illusion that remote objects are actually working locally. This is true for .NET Remoting as well as for Web services, even though the effect is obtained with radically different techniques.

Note: .NET Remoting creates the local instance of the remote object using dynamically created proxies that result from the run-time deserialization process. Basically, the deserialization of the ObjRef class generates a transparent proxy to handle user calls. With Web services, a proxy class is statically added to the application's project at design time when the Web service is referenced as an external library. The generation of the source code for the class and the subsequent addition to the project are automatically handled by Visual Studio .NET. However, the wsdl.exe utility (part of the .NET Framework SDK) allows you to generate the class yourself.

The RealProxy class hidden behind the software creates the illusion that remoting clients actually work locally. The proxy is transparently invoked whenever a method is called on the remote object.
The RealProxy class executes the method by forwarding any calls to the real object using the remoting infrastructure.
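
The forwarding behavior described above can be observed without any network setup by crossing an AppDomain boundary inside a single process. The following is a minimal sketch that assumes the .NET Framework (where AppDomain.CreateDomain and the System.Runtime.Remoting namespace are available); the DomainReporter and ProxyDemo names are hypothetical and used only for this demonstration.

```csharp
using System;
using System.Runtime.Remoting;

// Hypothetical MBR type used only for this demonstration.
public class DomainReporter : MarshalByRefObject
{
    // Executes in whichever AppDomain hosts the real object.
    public string GetHostDomainName()
    {
        return AppDomain.CurrentDomain.FriendlyName;
    }
}

public static class ProxyDemo
{
    // Creates a DomainReporter inside a second AppDomain and returns the
    // domain name reported by that instance. isProxy tells the caller whether
    // the local reference is a transparent proxy rather than the object itself.
    public static string CallAcrossDomains(string domainName, out bool isProxy)
    {
        AppDomain other = AppDomain.CreateDomain(domainName);
        try
        {
            DomainReporter reporter = (DomainReporter) other.CreateInstanceAndUnwrap(
                typeof(DomainReporter).Assembly.FullName,
                typeof(DomainReporter).FullName);

            // The local variable holds a proxy, not the real object.
            isProxy = RemotingServices.IsTransparentProxy(reporter);

            // The call is forwarded and runs inside the other AppDomain.
            return reporter.GetHostDomainName();
        }
        finally
        {
            // Individual AppDomains can be unloaded without stopping the process.
            AppDomain.Unload(other);
        }
    }
}
```

Calling ProxyDemo.CallAcrossDomains("Worker", out isProxy) from the default AppDomain should return the friendly name of the second domain rather than that of the caller, because the method body runs where the real object lives. If DomainReporter did not derive from MarshalByRefObject and was not serializable, unwrapping the instance would fail, since the type could not cross the boundary at all.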


If you want to play with the proxy object yourself, you can get a reference to the real proxy that sits behind the transparent proxy by using the following code:

    RemotingServices.GetRealProxy(localObject);

The variable localObject is the local instance of the remote object that you have created using the new operator. (More on this in a moment.) As mentioned, RealProxy is only an abstract class. The actual proxy object belongs to the RemotingProxy class in the System.Runtime.Remoting.Proxies namespace.

Building a Remote Service

Let's take the plunge into .NET Remoting and start building a service that can be exploited and consumed from remote clients. In Chapter 13, we'll extend the service to make it openly available to Internet clients too. In this way, you can really grab the essence of .NET Framework distributed programming and understand the key differences that keep .NET Remoting and Web services separate even though they're both children of a common model for remotable objects.

A .NET Remoting server and a Web service are both .NET Framework classes. As such, they can inherit from a parent class and can be left open to further inheritance. As you'll see in more detail in Chapter 13, a Web service class can optionally inherit from the WebService class, but there is no syntax obligation.
A .NET Remoting server class must inherit from MarshalByRefObject.

The object-oriented nature of the .NET Framework makes sharing classes between a .NET Remoting server and a Web service straightforward. However, because of the inheritance difference just mentioned, you can't have the Web service and the .NET Remoting server descend from the same base class of functionality. The .NET Framework, in fact, does not permit inheritance from multiple classes.

We'll start by writing a helper class that constitutes the programming interface for both the .NET Remoting server in this chapter and the Web service we'll create in Chapter 13. The remote service is actually a class built around the Northwind database that lets you obtain gross sales information on a per-year basis. A nice feature of this service is that it lets you obtain information in two ways: as raw tabular data to format and analyze or as a ready-to-print, snazzy bar chart.

Writing the Data Provider Class

Because our final goal is exposing a common set of functionalities through both the .NET Remoting server and the Web service interfaces, let's group all the needed core code into a separate middle-tier class that both higher-level layers can easily call.
We'll call this helper class SalesDataProvider and bury in its code all the details about connection strings, SQL commands, and bar chart creation. The class outline is shown here:

    namespace XmlNet.CS
    {
        public class SalesDataProvider
        {
            // Constructor(s)
            public SalesDataProvider() {...}

            // Internal properties
            private string m_conn = "DATABASE=northwind;SERVER=...;UID=sa;";
            private int m_Year = 0;

            // Returns sales details for the specified year
            public DataTable GetSalesReport(int theYear) {...}

            // Create a bar chart with the sales data for the specified year
            public string GetSalesReportBarChart(int theYear) {...}

            // INTERNAL METHODS

            // Fetch the data
            private DataTable ExecuteQuery(int theYear) {...}

            // Draw the bar chart based on the data in the specified table
            private string CreateBarChart(DataTable dt) {...}

            // Encode the specified bitmap object as BinHex XML
            private string SaveBitmapAsEncodedXml(Bitmap bmp) {...}
        }
    }

The class contains only a couple of public methods, GetSalesReport and GetSalesReportBarChart. These methods will also form the public interface of the .NET Remoting server we'll build in this chapter and the Web service slated for Chapter 13.

Implementation Details

GetSalesReport takes an integer that indicates the year to consider and returns a DataTable object with two columns: one containing employee last names and one showing total sales for the year for each employee.
The method runs the following SQL query against the Northwind database:

    SELECT e.lastname AS Employee, SUM(price) AS Sales FROM
        (SELECT o.employeeid, od.orderid,
                SUM(od.quantity*od.unitprice) AS price
         FROM Orders o, [Order Details] od
         WHERE Year(o.orderdate)=@TheYear AND od.orderid=o.orderid
         GROUP BY o.employeeid, od.orderid) AS t1
    INNER JOIN Employees e ON t1.employeeid=e.employeeid
    GROUP BY t1.employeeid, e.lastname
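
As a rough illustration of how a method such as ExecuteQuery might run this query, the sketch below binds the year to the @TheYear parameter and fills a DataTable through a SqlDataAdapter. The class name SalesQuery, the helper BuildCommand, and the connection string supplied by the caller are assumptions made for the example; the class outline leaves the real method body elided.

```csharp
using System.Data;
using System.Data.SqlClient;

public class SalesQuery
{
    // The query text from the chapter, with @TheYear left as a parameter.
    private const string Sql =
        "SELECT e.lastname AS Employee, SUM(price) AS Sales FROM " +
        "(SELECT o.employeeid, od.orderid, " +
        "SUM(od.quantity*od.unitprice) AS price " +
        "FROM Orders o, [Order Details] od " +
        "WHERE Year(o.orderdate)=@TheYear AND od.orderid=o.orderid " +
        "GROUP BY o.employeeid, od.orderid) AS t1 " +
        "INNER JOIN Employees e ON t1.employeeid=e.employeeid " +
        "GROUP BY t1.employeeid, e.lastname";

    // Hypothetical helper: builds the command without opening a connection.
    public static SqlCommand BuildCommand(int theYear)
    {
        SqlCommand cmd = new SqlCommand(Sql);
        cmd.Parameters.Add("@TheYear", SqlDbType.Int).Value = theYear;
        return cmd;
    }

    // Hypothetical counterpart of ExecuteQuery: runs the command and
    // returns the rows as a two-column DataTable (Employee, Sales).
    public static DataTable ExecuteQuery(string connectionString, int theYear)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            SqlCommand cmd = BuildCommand(theYear);
            cmd.Connection = conn;
            SqlDataAdapter adapter = new SqlDataAdapter(cmd);
            DataTable table = new DataTable("SalesReport");
            adapter.Fill(table);   // Fill opens and closes the connection itself
            return table;
        }
    }
}
```

Passing the year as a typed parameter, rather than concatenating it into the SQL text, keeps the query text constant and avoids quoting problems.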


The query involves three tables (Employees, Orders, and Order Details) and basically calculates the total amount of each order issued in the specified year by a particular employee. Finally, the amounts of all orders are summed and returned together with the employee's last name.

GetSalesReportBarChart works in two steps: first it gets the sales data by calling GetSalesReport, and then it uses this information to create the bar chart. The bar chart is generated as an in-memory bitmap object and is drawn using the GDI+ classes in the System.Drawing namespace. To make the image easily transportable over the wire for .NET Remoting clients as well as for Web service clients, the GetSalesReportBarChart method converts the bitmap to JPEG, encodes the bits as BinHex, and puts the results in an XML string.

Using GDI+ to Create Charts

GDI+ is the latest incarnation of the classic Windows Graphical Device Interface (GDI), a graphics subsystem that enables you to write device-independent applications. The .NET Framework encapsulates the full spectrum of GDI+ functionalities in quite a few managed classes that wrap the GDI+ low-level functions, thus making them available to Web Forms and Windows Forms applications.

GDI+ services fall into three broad categories: 2-D vector graphics, imaging, and typography.
The 2-D vector graphics category includes drawing primitives such as lines, curves, and any other figures that are specified by a set of points on a coordinate system. The imaging category includes functions for displaying, manipulating, and saving pictures as bitmaps and metafiles. The typography category concerns the display of text in a variety of fonts, sizes, and styles. Only the imaging functions are key to the GetSalesReportBarChart implementation.

In GDI+, the Graphics class represents the managed counterpart of the Win32 GDI device context. You can think of it as the central console from which you call all primitives. Everything you draw, or fill, through a Graphics object acts on a particular canvas. Typical drawing surfaces are the window background (including control backgrounds), the printer, and in-memory bitmaps.

The following code creates a new bitmap object and gets a Graphics object from it:

    Bitmap bmp = new Bitmap(500, 400);
    Graphics g = Graphics.FromImage(bmp);
    g.Clear(Color.Ivory);

From this point on, any drawing methods called on the Graphics object will result in changes to the bitmap. For example, the Clear method clears the bitmap's background using the specified color.

Creating a bar chart is as easy as creating and filling a certain number of rectangles, as shown in the following code.
We need to create a bar for each employee in the DataTable object and give it a height that is both proportional to the maximum value to draw and based on the scale given by the bitmap's size.

    // Save the names of the fields to use to get data
    string fieldLabel, fieldValue;
    fieldLabel = dt.Columns[0].ColumnName;
    fieldValue = dt.Columns[1].ColumnName;

    // For each employee...
    for(int i=0; i<dt.Rows.Count; i++)
    {
        //
        // Set up some internal variables to determine
        // size and position of the bar and the
        // companion text
        //

        // Draw the value (top of the bar)
        g.DrawString(dt.Rows[i][fieldValue].ToString(),
            fnt, textBrush, x, yCaption);

        // Draw the bar
        Rectangle bar = new Rectangle(x, yBarTop, barWidth - 10, barHeight);
        LinearGradientBrush fill = new LinearGradientBrush(bar,
            Color.SpringGreen, Color.Yellow,
            LinearGradientMode.BackwardDiagonal);
        g.FillRectangle(fill, bar);
        fill.Dispose();

        // Draw the employee name (bottom of the bar)
        g.DrawString(dt.Rows[i][fieldLabel].ToString(),
            fnt, textBrush, x, barBottom + textHeight);
    }

At the end of the loop, the bar chart is completely rendered in the Bitmap object. The bitmap is still held in memory in an intermediate, internal format, however. Two more steps are necessary: converting the bitmap to a public format such as JPEG, BMP, or GIF, and figuring out a way to persist or transfer its content.

Encoding Images as BinHex

Converting a Bitmap object to one of the commonly used image formats is a nonissue. You call the Save method on the Bitmap object, pick one of the supported formats, and you're done. The real difficulty has to do with the planned use of this helper class. Remember, we designed this class for later use within a .NET Remoting server and a Web service. When Web services in particular are involved, having the helper class save the image to persistent storage just doesn't make sense.
An alternative approach would be saving the bitmap locally on the server in a location accessible for download via FTP or HTTP. Creating files on the server might pose security problems, however, and normally forces the system administrator to change default settings to allow local files to be created.

The SalesDataProvider helper class was designed to return the dynamically created image as an encoded text string packed in an XML document. This approach is not optimal in a .NET Remoting scenario, but it probably represents the only option if you also have to publish the function through a Web service.


As we saw in Chapter 4, the XmlTextWriter class provides methods for encoding and writing arrays of bytes, and an image, no matter the format, is just an array of bytes. A further step is needed to transform the Bitmap object into an array of bytes that make up a JPEG image. To convert a Bitmap object to a real-world image format, you must use the Save method. The Save method can accept only a file name or a stream, however.

To solve this problem, you first save the bitmap as a JPEG image to a memory stream. Next you read back the contents of the stream as an array of bytes and write it to an XmlTextWriter object as BinHex or base64 code, as shown here:

    // Save the bitmap to a memory stream
    MemoryStream ms = new MemoryStream();
    bmp.Save(ms, ImageFormat.Jpeg);
    int size = (int) ms.Length;

    // Read back the bytes of the image. ToArray copies exactly
    // ms.Length bytes, whereas GetBuffer returns the whole internal
    // buffer, unused capacity included.
    byte[] img = ms.ToArray();
    ms.Close();

The preceding code snippet converts the instance of the Bitmap object that contains the bar chart to an array of bytes (the img variable) that represents the JPEG version of the bitmap.

As the final step, you encode the bytes as BinHex (or base64, if you prefer) and write them to an XML stream, as shown here:

    // Prepare the writer
    StringWriter buf = new StringWriter();
    XmlTextWriter xmlw = new XmlTextWriter(buf);
    xmlw.Formatting = Formatting.Indented;

    // Write the XML document
    xmlw.WriteStartDocument();
    xmlw.WriteComment("Sales report for " + m_Year.ToString());
    xmlw.WriteStartElement("jpeg");
    xmlw.WriteAttributeString("Size", size.ToString());
    xmlw.WriteBinHex(img, 0, size);
    xmlw.WriteEndElement();
    xmlw.WriteEndDocument();

    // Extract the string and close the writer
    string tmp = buf.ToString();
    xmlw.Close();
    buf.Close();
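
For the receiving end of this exchange, the following sketch shows one way a client could turn such an XML string back into the original bytes with XmlTextReader.ReadBinHex. The class name BinHexCodec and both method names are assumptions introduced for the example, and Encode simply mirrors the writer code above so that the round trip can be tried without a bitmap.

```csharp
using System;
using System.IO;
using System.Xml;

public class BinHexCodec
{
    // Writes the bytes as a BinHex-encoded <jpeg> element with a Size attribute.
    public static string Encode(byte[] data)
    {
        StringWriter buf = new StringWriter();
        XmlTextWriter xmlw = new XmlTextWriter(buf);
        xmlw.WriteStartDocument();
        xmlw.WriteStartElement("jpeg");
        xmlw.WriteAttributeString("Size", data.Length.ToString());
        xmlw.WriteBinHex(data, 0, data.Length);
        xmlw.WriteEndElement();
        xmlw.WriteEndDocument();
        xmlw.Close();
        return buf.ToString();
    }

    // Reads the Size attribute, then decodes that many bytes from the element.
    public static byte[] Decode(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.MoveToContent();                      // skips the prolog, stops on <jpeg>
        int size = int.Parse(reader.GetAttribute("Size"));
        byte[] data = new byte[size];

        // ReadBinHex may deliver the content in several chunks.
        int total = 0;
        int n;
        while (total < size &&
               (n = reader.ReadBinHex(data, total, size - total)) > 0)
        {
            total += n;
        }
        reader.Close();

        if (total != size)
            throw new InvalidOperationException("Decoded size does not match Size.");
        return data;
    }
}
```

With the decoded array in hand, a client could wrap it in a MemoryStream and pass it to Image.FromStream to obtain a displayable image.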


The XmlTextWriter object is still a stream-based component that needs a destination to write to. Unlike the Bitmap object, however, the XmlTextWriter object can be forced to write the output to a string. To do that, you initialize the XML text writer with an instance of the StringWriter object. The final string with the XML code can be obtained with a call to the StringWriter object's ToString method.

The format of the XML text returned is shown here:

    <?xml version="1.0" encoding="utf-16"?>
    <!--Sales report for ...-->
    <jpeg Size="...">FFD8FF...E00010</jpeg>

Notice that the comment and the size of the file are strictly call-specific parameters. As written by the preceding code, the Size attribute holds the size in bytes of the JPEG image. The BinHex-encoded text is twice as long, because each byte is written as two hexadecimal characters. Having that value available is not strictly necessary, but once it's on the client, it can simplify the task of transforming the XML stream back into a JPEG image.

StringWriter and Unicode Encoding

The XML output generated by the GetSalesReportBarChart method uses the Unicode encoding scheme, UTF-16, instead of the default UTF-8. This would be fine if not for the fact that Microsoft Internet Explorer returns an error when you double-click the XML file.
The error has nothing to do with the XML itself; it is more a bug (or perhaps even a feature) of Internet Explorer and the internal style sheet Internet Explorer uses to display XML documents.

In general, UTF-16 is used whenever you write XML text to a StringWriter object. When a TextWriter object (StringWriter inherits from TextWriter) is passed to the XmlTextWriter constructor, no explicit encoding argument is allowed. In this case, the XmlTextWriter object transparently inherits the encoding set in the writer object being passed. The StringWriter class hard-codes its Encoding property to UTF-16, and there's no way for you to change it, because the property is marked as read-only. If you want to generate XML strings with an encoding scheme other than UTF-16, drop StringWriter objects in favor of memory streams.

The helper class shared by the remotable object and the Web service is now ready to use. Let's look more closely at the remote service component.

Writing the Remote Service Component

As mentioned, a remotable component has just one requirement: the class that represents the object must inherit from MarshalByRefObject. Unless you need to exercise stricter control over the object lifetime, you don't need to override any of the methods defined in the base class for MBR objects.

Apart from the parent class, a remotable class is not different from any other class in the .NET Framework.
All of its public methods are callable by clients, the class can implement any number and any type of interfaces, and the class can reference any other external class.

Because we already put all the core code in the SalesDataProvider class, writing the remote service class, ServiceSalesProvider, is a snap. The class is a simple wrapper for SalesDataProvider, as shown here:

    public class ServiceSalesProvider : MarshalByRefObject
    {
        // Properties
        protected SalesDataProvider m_dataManager;

        // Constructor
        public ServiceSalesProvider()
        {
            m_dataManager = new SalesDataProvider();
        }

        // GetSalesReport
        public DataSet GetSalesReport(int theYear)
        {
            DataSet ds = new DataSet();
            ds.Tables.Add(m_dataManager.GetSalesReport(theYear));
            return ds;
        }

        // GetSalesReportBarChart
        public string GetSalesReportBarChart(int theYear)
        {
            return m_dataManager.GetSalesReportBarChart(theYear);
        }
    }

The SalesDataProvider protected member is initialized only once, when the ServiceSalesProvider class instance is constructed. After that, any call to the various methods is resolved using the same instance of the helper class.

The ServiceSalesProvider class has two public methods with the same names as the methods in SalesDataProvider. The implementation of these methods is straightforward and fairly self-explanatory. The only aspect worth noting is that the remotable GetSalesReport method adds the DataTable object returned by the corresponding method on the SalesDataProvider class to a newly created DataSet object. The DataSet object is then returned to the caller.

Note: When writing remotable classes, be sure that all the methods use and return serializable classes. No extra steps are required if you decide to write your own, user-defined classes as long as they include SerializableAttribute or implement the ISerializable interface.

Publishing the Remote Service Component

To be usable in a distributed environment, a remotable class must be configured and exposed so that interested callers can reach it. A remotable object needs a running host application to handle any incoming calls.
In addition, the object must specify what protocol, port, and name a potential client must use to issue its calls. All requirements that callers must fulfill are stored in the remote object's configuration file.
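
To make the three pieces of information (protocol, port, and name) tangible, here is a minimal sketch that sets them in code instead of a configuration file and then reaches the object through the resulting URL. It assumes the .NET Framework with a reference to System.Runtime.Remoting.dll. The EchoService class, port 8085, and the Echo.rem name are hypothetical choices made for the example, not values taken from the chapter.

```csharp
using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Http;

// Hypothetical stand-in for a remotable service class.
public class EchoService : MarshalByRefObject
{
    public string Echo(string text) { return "echo: " + text; }
}

public class PublishDemo
{
    public const int Port = 8085;                 // assumed free port
    public const string ObjectUri = "Echo.rem";   // assumed object name

    // Host side: the protocol (http channel), the port, and the name.
    public static void Publish()
    {
        // Later framework versions prefer the RegisterChannel overload
        // that also takes an ensureSecurity flag.
        ChannelServices.RegisterChannel(new HttpChannel(Port));
        RemotingConfiguration.RegisterWellKnownServiceType(
            typeof(EchoService), ObjectUri, WellKnownObjectMode.SingleCall);
    }

    // Client side: the same three pieces of information form the URL.
    public static string Call(string text)
    {
        string url = "http://localhost:" + Port + "/" + ObjectUri;
        EchoService proxy =
            (EchoService) Activator.GetObject(typeof(EchoService), url);
        return proxy.Echo(text);
    }
}
```

In a real deployment, Publish would run in the host process and Call in a separate client process; calling both from one program is only a convenient way to try the sketch.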


The Host Application

The host application can be IIS or a custom program (for example, a console application or a Microsoft Windows NT service) written by the same team that authored the class. Unlike DCOM, the .NET Remoting system does not automatically start up the host application whenever a client call is issued. To minimize network traffic, .NET Remoting assumes that the host application on the server is always up, running, and listening to the specified port. This is not an issue if you choose IIS as the host, as IIS is generally up all the time.

If you use a custom host, you must make sure it is running when a call is issued. A simple, yet effective, host program is shown here:

    // MyHost.cs -- compiled to MyHost.exe
    using System;
    using System.Runtime.Remoting;

    public class MyHost
    {
        public static void Main()
        {
            RemotingConfiguration.Configure("MyHost.exe.config");
            Console.WriteLine("Press Enter to terminate...");
            Console.ReadLine();
        }
    }

The key statement in the preceding code is this:

    RemotingConfiguration.Configure("MyHost.exe.config");

The host program reads the given configuration file and organizes itself to listen on the specified channels and ports for calls directed to the remote object. The configuration file contains information about the remote class name, the assembly that contains the class, the required activation mode (Client, Singleton, or SingleCall), and, if needed, the object URI. Here is the configuration file that fully describes the ServiceSalesProvider class:
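
A minimal sketch of such a configuration file follows, built from the standard .NET Remoting configuration elements. The SingleCall mode, the XmlNet.CS namespace for the service class, the assembly name ServiceSalesProvider, and the object URI ServiceSalesProvider.rem are assumptions chosen for illustration rather than values confirmed by the text.

```xml
<configuration>
  <system.runtime.remoting>
    <application>
      <!-- What is published: mode, "type, assembly", and object URI (all assumed) -->
      <service>
        <wellknown mode="SingleCall"
                   type="XmlNet.CS.ServiceSalesProvider, ServiceSalesProvider"
                   objectUri="ServiceSalesProvider.rem" />
      </service>
      <!-- Where the host listens: the predefined http channel on port 80 -->
      <channels>
        <channel ref="http" port="80" />
      </channels>
    </application>
  </system.runtime.remoting>
</configuration>
```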


We'll look more closely at channels and activation modes in a moment. For now, keep in mind that the contents of this configuration file tell the host application (whatever it is) which channels and ports to listen to and the name and the location of the class. In this example, the host application listens on the HTTP channel through port 80.

Predefined Channels

A channel is the element in the .NET Remoting architecture that physically moves bytes from one endpoint to the other. A channel takes a stream of bytes, creates a package according to a particular protocol, and routes the package to the final destination across remoting boundaries. A channel object listens for incoming messages and sends outbound messages. The messages it handles consist of packets written in accordance with a variety of network protocols.

The .NET Framework provides two predefined channels, tcp and http, both of which are bidirectional and work as senders and receivers. The tcp channel uses a binary formatter to serialize data to a binary stream and transport it to the target object using TCP through the specified port. The http channel transports messages to and from remote objects using SOAP, typically through port 80. A channel can connect two AppDomains in the same process as well as two machines over a network.

An object can legitimately decide to listen on both channels.
In this case, the <channels> subtree in the configuration file lists one <channel> entry for each channel.

A client can select any of the channels registered on the server to communicate with the remote object. At least one channel must be registered with the remoting system on the server.

Using IIS as the Remoting Host

If you write your own host application, you can make it as flexible as you need. If you decide to use IIS as the host, some constraints apply. To use IIS instead of a handcrafted host as the activation agent, you must first create a virtual directory (say, SalesReport) and copy the object's assembly into the BIN subdirectory. The configuration file must have a fixed name, web.config, and must reside in the virtual directory's root, as shown in Figure 12-4.


Figure 12-4: The SalesReport virtual directory created to make the remotable object accessible.

If you choose IIS as the activation agent, you must be aware of a few things. IIS can listen only to the http channel; any other channel you indicate is simply ignored. The way IIS applies the information read from the web.config file is hard-coded and can't be programmatically controlled or changed. However, you can create a global.asax file in the virtual folder, hook the Application_Start event, and then execute some custom code. In addition, the inevitable use of SOAP as the underlying protocol increases the average size of network packets.

Note: As often happens, the use of IIS as the activation agent has pros and cons. You don't need to write any extra code, but you lose a bit in flexibility. Regaining the lost flexibility is still possible, but at the price of writing nontrivial code. For example, you can write an Application_Start event handler and apply extra binary formatters at both ends of the http channel. In this way, the SOAP packets will contain binary data and you'll save some bytes.

Using IIS as the activation agent is natural when you plan to expose the same remote service through .NET Remoting and Web services. So let's assume in our example application that IIS is the activation agent and SalesReport is the virtual directory.

Activation Policies

In addition to the remotable object's identity, channels, and ports, the server configuration file also contains another important piece of information: the object activation policy.
An MBR remotable object can be either server-activated or client-activated. Server-activated objects are created by the server only when the client invokes the first method through the local proxy. Client-activated objects are created on the server as soon as the client instantiates the object using either the new operator or methods of the System.Activator class.

In addition, server-activated objects can be declared as Singleton or SingleCall objects. A Singleton object has exactly one instance to serve all possible clients. A SingleCall object, on the other hand, requires that each incoming call be served by a new instance of the remotable object. A remotable object declares its required activation policy in the configuration file through specific subtrees placed below the <service> node.

Server-Side Activation

Server-activated objects are remotable objects whose entire life cycle is directly controlled by the host application. Server-activated objects are instantiated on the server only when the client calls a method on the object. The object is not instantiated if the client simply calls the new operator or the methods of the System.Activator object. This policy is slightly more efficient than client-side activation because it saves a network round-trip made for the sole purpose of creating an instance of the target object. In addition, this approach makes better use of server memory by delaying the object instantiation as much as possible.

What happens when the client code apparently instantiates the remote object? Consider the following client-side sample code:

    ServiceSalesProvider ssp = new ServiceSalesProvider();
    string img = ssp.GetSalesReportBarChart(theYear);

The remoting client treats the remote object as a local object and calls the new operator on it. The object has been previously registered as a well-known type, so the .NET Remoting system knows about it. In particular, the .NET Remoting system knows that any object of type ServiceSalesProvider is just a local proxy for a remote object. When the client calls new or System.Activator on the well-known type, only the remoting proxy is created in the client application domain.

The real instantiation of the object will take place on the server at a later time, when a non-null instance is needed to serve the first method call. Because the constructor is called implicitly and outside the control of the client, only the default constructor is supported.
This means that if your class has a constructor that takes some arguments, that constructor is never taken into account by the host application and never used to create instances of the remotable class.

Note: As part of the .NET Framework reflection API, the System.Activator object provides a CreateInstance method that you can use to create instances of dynamically determined types. (Instantiating types this way is a kind of .NET Framework late binding.) Interestingly, this method supports a nice feature that would have fit well in the .NET Remoting system too (and hopefully will in a future version). The CreateInstance method has an overload that takes an array of Object values. It then uses the size of the array and the actual types boxed in the various objects to match one of the constructors declared on the target type. However, maybe for performance concerns or perhaps just to simplify the feature, the .NET Remoting infrastructure does not supply this facility.

If you need to publish a remotable type whose instances must be created using a specific, nondefault constructor, you should resort to client activation.

Well-Known Objects

From the perspective of a .NET Remoting client, server-activated objects are said to be well-known objects. Well-known objects have two possible working modes: Singleton and SingleCall. In the former case, one instance of the object services all calls from all clients.
In the latter case, a new instance of the object is created to service each call. A well-known object declares its working mode using the <wellknown> tag in the configuration file under the <service> tag, as shown here:
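
A minimal sketch of such a declaration is given below. The type name, the assembly name, and the object URI are illustrative assumptions; switching the working mode only requires changing the mode attribute to Singleton.

```xml
<service>
  <!-- mode: SingleCall or Singleton; type: "full type name, assembly display name" -->
  <wellknown mode="SingleCall"
             type="XmlNet.CS.ServiceSalesProvider, ServiceSalesProvider"
             objectUri="ServiceSalesProvider.rem" />
</service>
```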


The mode attribute specifies the working mode of the well-known object. Allowed values are Singleton and SingleCall, defined in the WellKnownObjectMode enumeration. The type attribute contains two pieces of information. It is a comma-separated string in which the first token represents the fully qualified name of the remotable type and the second part of the string points to the assembly in which the remotable type is defined. You must use the display name of the assembly without the DLL extension. The assembly must be located either in the global assembly cache (GAC) or on the server in a location that the host application can reach.

If the host application is a normal console application or a Windows NT service, the directory of the application's executable is a safe place to store the remotable type's assembly. Similarly, you can store the assembly in any other path that the host application is configured to probe when searching for assemblies. If you use IIS as the activation agent, all the assemblies needed for the remotable type must be located in the BIN directory of the host application.

Giving Well-Known Types a URI

A well-known type also needs to be identified by a unique URI. The URI must be unique for the type and not for the object.
This name represents remote objects of a certain type and is the means by which the client gets a proxy pointing to the specified object. The server-side remoting infrastructure maintains a list of all published well-known objects, and the object URI is the key to access this internal table. Well-known objects must explicitly indicate the URI. For client-activated objects, a unique URI is transparently generated (and used) for a particular instance of the class.

When an object is hosted in IIS, the objectUri name must have a .soap or .rem extension, as shown in Figure 12-5. This naming convention enables IIS to recognize the incoming call as a remoting request that must be routed to a particular handler.


Figure 12-5: The IIS application mapping table for .rem and .soap URIs.

When IIS detects a remoting call, it passes the call to the ad hoc HTTP handler registered to handle .soap and .rem resources. Although the object URI gives the impression of being a URL (that is, a true server-side resource), it is only a name and should in no way correspond to a physical file. Whether the URI should be a string or the name of a physical resource depends on the expectations of the handler. The remoting handler uses .soap and .rem URIs as strings to retrieve the proxy for the type.

Singleton Objects

When an object declares itself as a Singleton type, the host application uses only a single instance of the object to service all incoming calls. So when a call arrives, the host attempts to locate the running instance of the object. If such an instance exists, the request for execution is processed. Otherwise, the host creates the unique instance of the remote class (using the default constructor) and forwards the request to it.

What happens if two requests arrive at the same time? The .NET Remoting subsystem arranges for them to be automatically serviced by distinct threads. This requires that Singleton objects be thread-safe. Note that this is not a mandatory programming rule but is more of a practical guideline for real-world scenarios.

State management for Singleton objects is certainly possible in theory, but it must be coded in the body of the object in much the same way as you do with Active Server Pages (ASP) and Microsoft ASP.NET pages and even Web services. The idea is that you use a shared cache that all clients can access (a sort of ASP.NET Application object), unless you apply a filter on a per-client basis (a sort of ASP.NET Session object).

The lifetime of a Singleton well-known object is managed by the .NET Remoting system through a special module called the lease manager (LM). (See the section "Memory Management," later in this chapter, for more information.)
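
The thread-safety and shared-cache guidance for Singleton objects can be illustrated with a small C# sketch. The class below is hypothetical (SharedSalesCache, GetReport, and Loads are illustrative names, not part of the book's sample code), and the string it builds merely stands in for a real database query. It assumes that a simple lock around the shared table is acceptable for the expected load.

```csharp
using System;
using System.Collections;

// Hypothetical Singleton-style remotable type (names are illustrative,
// not taken from the book's sample code).
public class SharedSalesCache : MarshalByRefObject
{
    private readonly object sync = new object();
    private readonly Hashtable reports = new Hashtable();  // year -> report text
    private int loads;                                      // simulated queries run so far

    // Number of times the (simulated) back-end query actually ran.
    public int Loads
    {
        get { lock (sync) { return loads; } }
    }

    // Concurrent calls are serialized on the lock, so the shared table
    // is never read and written at the same time.
    public string GetReport(int year)
    {
        lock (sync)
        {
            string report = (string) reports[year];
            if (report == null)
            {
                report = "Sales report for " + year;  // stands in for a database query
                reports[year] = report;
                loads++;
            }
            return report;
        }
    }
}
```

Because a Singleton instance serves every client, the table plays the role of the shared cache mentioned above; per-client state would require an extra key (for example, a client identifier) in addition to the year.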


SingleCall Objects

A well-known type declared as a SingleCall object has a new instance created whenever a request arrives. The host application creates a new instance of the SingleCall object, executes the requested method, and then routes any return values back to the client. After that, the object goes out of scope and is left to the garbage collector.

Although it's not completely impossible, preserving state from one call to the next is impractical for SingleCall objects. The lifetime of the object instance is extremely short and barely covers the duration of the method call. You can try either storing information in a database (or any sort of persistent storage medium) or parking data in other objects with a different lifetime scheme.

Client-Side Activation

Client-activated objects are created at the client's request, as the result of a call to the new operator or to the System.Activator object. Each remoting client works with its own instance of the object and can control it at will. For example, the client can use any of the available constructors. In addition, persisting the state during the session is straightforward and does not require any special coding. On the down side, sharing state between clients is difficult, and to do so, you must resort to a database, a disk file, or any other global object in the current AppDomain.

To reflect a client-activated object, you change the contents of the <service> tag in the server's configuration file. Instead of the <wellknown> tag, you use the <activated> tag. This tag supports only the type attribute.
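
As a rough sketch of the change just described, the server's <service> section for a client-activated object might look like the following. The namespace MyCompany and the assembly name SalesReportAssembly are placeholders introduced here for illustration only.

```xml
<configuration>
  <system.runtime.remoting>
    <application>
      <service>
        <!-- No mode and no objectUri: only the type attribute is used. -->
        <!-- MyCompany and SalesReportAssembly are placeholder names. -->
        <activated type="MyCompany.ServiceSalesProvider, SalesReportAssembly" />
      </service>
    </application>
  </system.runtime.remoting>
</configuration>
```
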
No object URI is necessary with client-activated objects. More precisely, the URI is still necessary, but because the activation is requested by the client at a very specific moment in time, the URI can be silently generated by the .NET Remoting infrastructure and attached to each call.

As with Singleton objects, the lifetime of a client-activated object is controlled by the LM. The instance of the object remains active until the proxy is destroyed.

Choosing the Activation Mode That Fits

Theoretically, none of the working modes examined up to now affects the way in which you code your remotable classes. For example, a client-activated object is in no way different from a Singleton object. All options can be set declaratively and, again speaking theoretically, each object can be configured to work in different ways simply by changing a few entries in the server's configuration file.

Intriguing as this possibility is, such flexibility is not realistic in practice because a real-world object might want to exploit in depth the specific features of a working mode. In other words, you should thoughtfully and carefully choose the configuration options for your remote object and then stick to that configuration as long as the user's requirements are stable. For example, if you determine that the Singleton mode is appropriate for your component, you will probably want to implement an internal state management engine to share some variables.
If you later decide to set the object to work in, say, SingleCall mode, the state management engine becomes useless.

Let's analyze our ServiceSalesProvider class to determine the most appropriate options. To begin, the object needs to query a back-end database (Northwind). Even this little requirement is enough to lead us to discard the option of making the object available by value. As a marshal-by-reference (MBR) object, the remotable class can be client-activated or server-activated. Which is better for us?

The ServiceSalesProvider class doesn't need a nondefault constructor, so both client-activated and server-activated modes are fine. The object is expected to work as a one-off service and has no need to maintain per-client state, so you can discard the client-activated option and go for server-driven activation. OK, but should you opt for Singleton or SingleCall?

SingleCall (that is, a short-lived instance that serves the request and dies) is certainly an option. If you use the object as a Singleton, however, you can architect slightly more efficient code and avoid having to query SQL Server each and every time a request comes in. The remoting code included in this book's sample files makes use of the ServiceSalesProvider class configured to run as a SingleCall object.

Memory Management

SingleCall objects present no problems in terms of memory management. They require a new object instance that is extremely volatile and does not survive the end of the method's code. Singleton and client-activated objects, on the other hand, need a mechanism to determine when they can be safely destroyed. In COM, this issue was resolved by implementing reference counting.
In the .NET Remoting system, the same tasks are accomplished using a new module: the LM.

Unlike reference counting, the LM works on a per-AppDomain basis and allows objects to be released even though clients still hold a reference. Let's quickly review the differences between these two approaches.

Old-Fashioned Reference Counting

Reference counting requires clients (including, of course, distributed and remote clients) to communicate with the server each time they connect or disconnect. The object maintains the number of currently active client instances, and when the count goes to 0, the object destroys itself.

In the presence of an unreliable network, however, chances are good that some objects might remain with a reference count that never goes to 0. If this weren't bad enough, the continual sequence of AddRef/Release calls would generate significant network traffic.

The Lease Manager (LM)

The idea behind leasing is that each object instance is leased to the client for a given amount of time fixed by the LM. The lease starts when the object is created. By default, each Singleton or client-activated object is given 5 minutes to process incoming calls. When the interval ends, the object is marked for deletion.
During the object's lifetime, however, any processed client call renews the lease: if less than a fixed renewal time (by default, 2 minutes) remains, the lease is extended to that value, thus increasing the overall lease time.

Note that leasing is managed exclusively on the server and doesn't require additional network traffic, apart from the traffic needed for normal method execution. The initial lease time and the renewal period can be set both programmatically and declaratively in the configuration file.

Getting a Sponsor

Another mechanism for controlling an object's lifetime is sponsorship. Both clients and server objects can register with the AppDomain's LM to act as sponsors of a particular object. Prior to marking an object for deletion when its lease expires, the .NET Remoting run time gives sponsors a chance to renew the lease. By implementing sponsors, you can control the lifetime of objects based on logical criteria rather than strict time intervals.

In summary, nothing can guarantee that clients will always find their server objects up and running. When a remoting client attempts to access an object that is no longer available, a RemotingException exception is thrown. One way to resolve the exception is by creating a new instance of the remote object and repeating the operation that failed.

Calling a Remote Service

Let's see what a client must do to call a method on a remote object. To begin, add a project reference to the assembly that contains the remote object by right-clicking References in Solution Explorer, choosing Add Reference from the shortcut menu, and traversing the network to locate the target assembly. The project reference lets the client application know about the types defined in the assembly.

Note: Even if your remotable object is hosted by IIS, when you reference the assembly from a remoting client, choose the Add Reference option. The Add Web Reference command on the same shortcut menu is reserved for Web services and, more importantly, starts a completely different linking procedure. (More on this in Chapter 13.)

Referencing a remote assembly is only the first step to being able to call any of its methods.

Configuring the Caller

The remote object must be registered with the local application before you can successfully use it.
The .NET Remoting system must be aware that objects of certain types represent instances of remote objects. In this way, ad hoc code can be generated to obtain the necessary proxy.

You configure the client application either through a configuration file or programmatically by calling the RegisterWellKnownClientType method on the static RemotingConfiguration object, as shown here:

    RemotingConfiguration.RegisterWellKnownClientType(
        typeof(ServiceSalesProvider),
        "http://www.contoso.com/SalesReport/ServiceSalesProvider.rem");

To register a well-known type, you pass in the type and object URI. If the object is not server-activated, and therefore is not a well-known object, you use the RegisterActivatedClientType method instead, as follows:

    RemotingConfiguration.RegisterActivatedClientType(
        typeof(ServiceSalesProvider),
        "http://www.contoso.com/SalesReport");

In this case, you don't need to pass an explicit object URI. However, you still need to indicate the remote path for the target object. Because we are working with IIS as the host, the remote path must be the URL of the virtual directory. If a custom host is used, instead of the URL, you use a TCP address and the port, as shown here:

    RemotingConfiguration.RegisterActivatedClientType(
        typeof(ServiceSalesProvider),
        "tcp://192.345.34.1:8082");

You can also direct the caller application to read setup information from a configuration file located in the same path as the executable. In this case, the convention is to give the file the same name as the executable plus a .config extension. You then pass the file name to the Configure method, as shown here:

    RemotingConfiguration.Configure("MyClient.exe.config");

A client configuration file has the same layout as the server's. The differences between the client and the server-side configuration files are minimal and are all related to the use of the <client> tag instead of <service>.

The server object publishes the list of supported channels, and based on that list, the client can decide which channel to use. Note that servers must register at least one channel. Clients are not required to indicate a channel. If a client doesn't indicate a channel, the .NET Remoting system uses one of the default channels. On the other hand, a client that plans to use a given channel must first register with it. The application can run the channel registration procedure itself or let it run by default under the control of the RemotingConfiguration object.

Channels are registered on a per-AppDomain basis and must have unique names in that context. On physical machines, however, only one channel can listen to a given port.
In other words, at any time you can't have more than one channel registered to work on a given port on a given machine.

A client enabled to make remote calls on a remote object simply creates an instance of the desired class using the language-specific operator for instantiation (new in C# and Visual Basic). Alternatively, the client can use the System.Activator object, a managed counterpart of the VBScript CreateObject and GetObject functions.
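
A client configuration file of the kind described above, suitable for passing to RemotingConfiguration.Configure, might be sketched as follows. The namespace MyCompany and the assembly name SalesReportAssembly are placeholders; the URL is the one used in this chapter's RegisterWellKnownClientType call.

```xml
<configuration>
  <system.runtime.remoting>
    <application>
      <client>
        <!-- On the client, a well-known type is identified by its full URL. -->
        <!-- MyCompany and SalesReportAssembly are placeholder names. -->
        <wellknown type="MyCompany.ServiceSalesProvider, SalesReportAssembly"
                   url="http://www.contoso.com/SalesReport/ServiceSalesProvider.rem" />
      </client>
    </application>
  </system.runtime.remoting>
</configuration>
```

For a client-activated object, the <wellknown> entry would give way to an <activated> entry carrying only the type attribute, with the remote path specified through the url attribute of the enclosing <client> tag.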


Writing the Client Component

Figure 12-6 shows the initial user interface of the client application we'll use to query for sales reports and bar charts. You select the year of interest and click one of the two buttons: Get Data to display sales information as a DataSet object, or Get Chart to display the information as a bar chart saved as a JPEG image. The form contains a DataGrid control (invisible by default) and a PictureBox control. The DataGrid object will display the contents of the DataSet object, whereas the PictureBox object will show the image.

Figure 12-6: The sample application in action, waiting for user input.

Accessing the Raw Data

Once the remote assembly has been referenced by the project and the remote type configured in the form's Load event, you can write the client application and use the remote type as if it were a local type. The following code shows what happens when you click to get raw data:

    private void ButtonGetData_Click(object sender, System.EventArgs e)
    {
        // Get the year to process
        int theYear = Convert.ToInt32(Years.Text);

        // Instantiate the object and issue the call
        ServiceSalesProvider ssp = new ServiceSalesProvider();
        DataSet ds = ssp.GetSalesReport(theYear);

        // Turn on and fill the DataGrid control
        // Also turn off the picture box
        PictureContainer.Visible = false;
        Data.Visible = true;
        Data.DataSource = ds.Tables[0];

        // Update the UI
        Title.Text = "Sales Report for " + theYear.ToString();
    }

The two lines that instantiate the object and issue the call demonstrate that, at this point, using the remote object is in no way different from using any other local, or system, class.

Figure 12-7 shows the sales information displayed in DataSet format.

Figure 12-7: The sample application displaying downloaded sales data in DataSet format.

Accessing BinHex-Encoded Images

Calling the GetSalesReportBarChart method is not all that different from calling the GetSalesReport method, but more work is needed to make the downloaded data usable. As mentioned, the GetSalesReportBarChart method draws a bar chart, converts it to JPEG, encodes the image as a BinHex string, and packs everything into an XML document. The content of the document is then returned as a string, as shown here:

    ServiceSalesProvider ssp = new ServiceSalesProvider();
    string encImage = ssp.GetSalesReportBarChart(theYear);

The next step is transforming the string into a bitmap and displaying it in the PictureBox control. The following procedure takes the BinHex image description and creates an equivalent Bitmap object.
Because the string is an XML document, an XmlTextReader object is needed to parse the contents and then decode the BinHex data.

    private Bitmap EncodedXmlToBitmap(string encImage)
    {
        Bitmap bmp = null;

        // Parse the XML data using a string reader
        StringReader buf = new StringReader(encImage);
        XmlTextReader reader = new XmlTextReader(buf);
        reader.Read();
        reader.MoveToContent();

        // The root node of the document is <jpeg>
        if (reader.LocalName == "jpeg")
        {
            // Get the size of the BinHex data
            int encodedSize = Convert.ToInt32(reader["Size"].ToString());

            // Read and decode the BinHex data
            byte[] img = new byte[encodedSize];
            reader.ReadBinHex(img, 0, encodedSize);

            // Transform the just read bytes into an Image object
            MemoryStream ms = new MemoryStream();
            ms.Write(img, 0, img.Length);
            bmp = new Bitmap(ms);
            ms.Close();
        }

        reader.Close();
        return bmp;
    }

You decode the image data using the ReadBinHex method on the XmlTextReader class. Next you copy the resultant array of bytes into a temporary memory stream. This step is necessary because a Bitmap object can't be created directly from an array of bytes.

Finally, the returned Bitmap object is bound to the PictureBox control in the form, as shown in the following code:

    PictureContainer.SizeMode = PictureBoxSizeMode.StretchImage;
    PictureContainer.Image = bmp;

Figure 12-8 shows the results.

Figure 12-8: The sample application displaying an encoded bar chart.

The client can easily create a local copy of the JPEG file. The following code snippet shows how to proceed:

    // img is the array of bytes obtained from ReadBinHex
    FileStream fs = new FileStream(fileName, FileMode.Create);
    BinaryWriter writer = new BinaryWriter(fs);
    writer.Write(img);
    writer.Close();

Tip: When converting a Bitmap object to JPEG, you can control the compression ratio to obtain a better image. However, JPEG is not a compression scheme designed for text and simple figures like bar charts. In fact, JPEG was originally designed to effectively compress photographic images. To ensure a better image, you might want to use the GIF format or control the compression ratio of the final JPEG image. You can do that by using one of the overloads of the Bitmap object's Save method.

Using the System.Activator Class

A remoting client can obtain a proxy to make calls to a remote object in two ways: by using the new operator or by using methods of the System.Activator class. The Activator class provides two methods, CreateInstance and GetObject. Clients of well-known objects use GetObject, whereas clients of client-activated objects use CreateInstance.

GetObject returns a proxy for the well-known type served at the specified URL location, as shown in the following code. GetObject is a wrapper placed around the global RemotingServices.Connect method.
The proxy is built on the client from the remote object metadata and exposed to the client application as the original type.

    ServiceSalesProvider ssp;
    ssp = (ServiceSalesProvider) Activator.GetObject(
        typeof(ServiceSalesProvider),
        "http://www.contoso.com/SalesReport/ServiceSalesProvider.rem");

From this relatively simple explanation, it should be clear that .NET Remoting is no less quirky than DCOM, but unlike DCOM, the .NET Framework successfully hides a great wealth of low-level details.

CreateInstance differs from GetObject in that it actually creates a new remote instance of the object, as shown here:

    // Set the URL of the remote object
    // (UrlAttribute is defined in System.Runtime.Remoting.Activation)
    object[] attribs = new object[1];
    attribs[0] = new UrlAttribute(url);

    // Create the instance of the object
    ServiceSalesProvider ssp;
    ssp = (ServiceSalesProvider) Activator.CreateInstance(
        typeof(ServiceSalesProvider), null, attribs);

Conclusion

The .NET Remoting system enables you to access .NET Framework objects across the boundaries of AppDomains. It represents the actual implementation of a programming model designed for interprocess communication. Another facet of this model is .NET XML Web services. Although .NET XML Web services allow you to expose .NET Framework objects to any client that can use HTTP, .NET Remoting is optimized for .NET-to-.NET communication. Communication between the client and the remotable object can take place using SOAP or binary payloads transported over HTTP or TCP. .NET Remoting can transfer any serializable CLR types; it is not limited to XML Schema Definition (XSD) types or complex custom types as rendered by the .NET XML serializer.

This chapter illustrated the key features of the .NET Remoting system and showed you how to set up a remotable object that exposes nontrivial functionalities. In particular, you learned how to expose JPEG images through XML documents. Of course, if the goal of your distributed system is simply to create and return dynamic images, .NET Remoting might not be for you. But from a broader standpoint that encompasses Web services, .NET Remoting not only makes sense, it is also compelling. The example we've constructed in this chapter has two aims. First, it demonstrates that .NET Remoting and Web services are just two remoting interfaces and that the same core class can outfit both.
Second, it shows that to come up with truly efficient and effective code, you must always take the most appropriate route and create specialized code instead of pursuing the promises of code universality and platform independence.

This chapter covered only the first side of remoting: .NET Remoting for CLR types. In Chapter 13, we'll look at Web services, a truly interoperable infrastructure ideal for rolling up your functionalities and making them available to a potentially infinite set of clients.

Further Reading

Although this chapter touched on all the key aspects of the .NET Remoting technology, it revealed only the tip of the iceberg. Throughout the chapter, I've noted several aspects of .NET Remoting whose coverage was simply beyond the scope of a book about XML. Principal among the resources that cover these topics in more detail is the MSDN .NET Framework documentation, but many other appropriate resources are also available.

I mentioned that Windows XP and newer systems boast a modified loader that looks directly into the source Portable Executable (PE) file to find .NET Framework-specific metadata.
To understand the entire loading process of managed executables in Windows XP as well as in Windows 2000, I know just one resource: Jeffrey Richter's excellent book Applied Microsoft .NET Framework Programming (Microsoft Press, 2002).

In the October 2002 issue of MSDN Magazine, you can find an article of mine that, like this chapter, attempts to explain the ABCs of .NET Remoting. In that article, you'll find a deeper discussion of architectural aspects (channels, formatters, and sink chains) than we've covered here.

The internal engine that performs memory management for instances of remote objects is the lease manager (LM). Jeff Prosise, in Chapter 15 of his book Programming Microsoft .NET (Microsoft Press, 2002), explains a lot about it.

Finally, if you're just looking for a complete .NET Remoting book, here it is: Microsoft .NET Remoting, by Scott McLean, James Naftel, and Kim Williams (Microsoft Press, 2002).


Chapter 13: XML Web Services

Overview

The term Web service is relatively new, but the idea behind Web services has been around for a while. A Web service is an interface-less Web site designed for programmatic access. This means that instead of invoking URLs representing Web pages, you invoke URLs that represent methods on remote objects. Similarly, instead of getting back colorful and animated HTML code, you get back XML Schema Definition (XSD) data types packed in XML messages. Aside from these higher-level differences, the underlying models for a Web site and a Web service are the same. In addition, any security measure you can implement on a Web site can be duplicated in a Web service. To summarize, the Web service model is just another programming model running on top of HTTP.

A Web service is a software application that can be accessed over the Web by other software. Web services are applicable in any type of Web environment, be it Internet, intranet, or extranet. All you need to locate and access a Web service is a URL. In theory, a number of Internet-friendly protocols might be working through that URL. In practice, the protocol for everyday use of Web services is always HTTP.

How is a Web service different from a remote procedure call (RPC) implementation of distributed interfaces? For the most part, a Web service is an RPC mechanism that uses the Simple Object Access Protocol (SOAP) to support data interchange.
This general definition represents the gist of a Web service, but it focuses only on the core behavior. A Web service is more than just a business object available over an HTTP-accessible network. A number of evolving industry standards are supported today, including the Universal Description, Discovery, and Integration (UDDI) standard and the Web Services Description Language (WSDL); others, such as the Web Services Security (WS-Security) and the Global XML Web Services Architecture (GXA), will be supported soon. These industry standards contribute to setting up a full and powerful environment for remote object-oriented access and programming.

In this chapter, we'll look at implementing and programming Web services in the Microsoft .NET Framework. We'll also take a look at the Web infrastructure that makes these services available and at the functionalities you can obtain and publish. To demonstrate the breakthrough that Web services represent in the software industry, we'll rewrite the .NET Remoting code example from Chapter 12 to make it work as a Web service.
In doing so, we'll also be able to examine the differences between the .NET Remoting and Web service architectures and determine in which scenarios each architecture is suitable.

The .NET Framework Infrastructure for Web Services

Although Web services and the .NET Framework were introduced at roughly the same time, there is no strict dependency between the two, and the presence of one does not necessarily imply the presence of the other. The .NET Framework is simply one of the platforms that support Web services and that provide effective tools and system classes to create and consume Web services. No one person invented Web services, but all the big players in the IT arena are rapidly adopting and transforming the raw idea of "software callable by other software" into something that fits their respective development platforms.

Regardless of how a Web service is created, and whether it is vendor-specific or platform-specific, the way in which a Web service is exposed to the public is the same.


Any Web service can be imported and incorporated into vendor-specific and platform-specific solutions, as long as the service adheres to accepted standards, like HTTP, SOAP, and WSDL, to name a few. Web services guarantee interoperability because they are based entirely on open standards. By rolling your functionalities into a Web service, you can expose them to anyone on the Web who speaks HTTP and understands XML. Of course, for this to happen, some infrastructure that deals with Web communication and data transportation is still required. No worries, though: this is just what the major IT players are building into their development platforms.

The primary factor in industry-wide adoption of Web services is SOAP. Although it is a bit verbose, SOAP offers a standard way to define the method to call and the arguments to pass. In addition, SOAP exploits a standard, rich, and extensible type system: the XSD type system. In the .NET Framework, the XSD type system is extended with a set of .NET Framework classes, namely the classes that the XML serializer can handle. (Chapter 11 covers the XML serializer in detail.)

Note
Web service clients are not forced to use SOAP as the protocol for issuing their calls. HTTP-GET and HTTP-POST are effective as well, and even more compact if you look at the size of the individual payload. SOAP is not a stand-alone protocol; it simply defines the XML vocabulary used to express method invocations.
The SOAP payload does need a transportation protocol, however, and usually SOAP packets travel over HTTP-POST commands.

The Simple Object Access Protocol (SOAP)

SOAP is a simple, lightweight XML-based protocol for exchanging information on the Web. SOAP defines a messaging framework that is independent from any application or transportation protocol. Although, as mentioned, SOAP packets travel mostly as HTTP-POST commands, SOAP neither mandates nor excludes any network and transportation protocol.

The most important part of the SOAP specification consists of an envelope for encapsulating data. The SOAP envelope defines a one-way message and is the atomic unit of exchange between SOAP senders and receivers. The SOAP specification also supports a request/response message exchange pattern, although it does not mandate a specific message pattern.
The remaining, optional parts of the SOAP specification are data encoding rules for representing application-defined data types and a binding between SOAP and HTTP.

Note
Although SOAP is often associated with HTTP alone, it has been designed according to general principles so that you can use SOAP in combination with any transportation protocol or mechanism that is able to transport the SOAP envelope, including SMTP and FTP.

The following code shows a simple SOAP envelope that invokes a GetSalesReport method on the specified Web server:

POST /salesreport/SalesReportService.asmx HTTP/1.1
Host: expo-star
Content-Type: text/xml; charset=utf-8
Content-Length: length
SOAPAction: "xmlnet/cs/0735618011/GetSalesReport"
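The XML body that travels after those HTTP headers can be sketched with the XML writer classes of the .NET Framework. The following is a minimal sketch rather than the exact payload the service expects: it assumes the SOAP 1.1 envelope namespace, reuses the service namespace and the theYear parameter name that appear in this chapter, and uses 1997 as a purely illustrative argument value. The SoapEnvelopeSketch class and its Build method are hypothetical names.

```csharp
using System;
using System.IO;
using System.Xml;

public static class SoapEnvelopeSketch
{
    // SOAP 1.1 envelope namespace (assumed; the chapter does not spell it out).
    const string SoapNs = "http://schemas.xmlsoap.org/soap/envelope/";

    // Builds a request envelope for a method that takes a single parameter.
    public static string Build(string serviceNs, string method,
                               string paramName, string value)
    {
        StringWriter text = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(text);
        writer.Formatting = Formatting.Indented;

        writer.WriteStartElement("soap", "Envelope", SoapNs);
        writer.WriteStartElement("soap", "Body", SoapNs);
        // The method element carries the Web service namespace as its default namespace.
        writer.WriteStartElement(method, serviceNs);
        writer.WriteElementString(paramName, serviceNs, value);
        writer.WriteEndElement();   // method element
        writer.WriteEndElement();   // soap:Body
        writer.WriteEndElement();   // soap:Envelope
        writer.Flush();
        return text.ToString();
    }

    public static void Main()
    {
        // 1997 is an illustrative value only.
        Console.WriteLine(Build("xmlnet/cs/0735618011",
                                "GetSalesReport", "theYear", "1997"));
    }
}
```

The envelope has a single Body child named after the method, and each argument becomes a child element of that method element. A real client would send this text as the body of the POST request and set Content-Length accordingly.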


SOAP is not magic: it is a simple XML-based, message-based protocol whose packets normally travel over HTTP. The Web server must have a special listener ready to catch incoming calls on port 80. These listeners are integrated with the Web servers, as is the case with Internet Information Services (IIS).

IIS Support

A .NET Framework Web service is a Microsoft ASP.NET application with an .asmx extension that is accessed over HTTP. ASP.NET, as a whole, is part of the .NET Framework that works on top of IIS, taking care of files with special extensions such as .aspx and .asmx. One of the key components of the ASP.NET infrastructure is the Internet Server Application Programming Interface (ISAPI) filter that IIS invokes when it gets a call for files with a certain extension. For example, Figure 13-1 shows the settings in the IIS Configuration Manager that associate .asmx files with a system module named aspnet_isapi.dll.

Figure 13-1: The IIS mapping between .asmx files and the appropriate ASP.NET ISAPI filter.

As mentioned, calls for Web services normally come through port 80, the default HTTP port. For .NET Framework Web services, such calls are always directed to URLs with an .asmx extension. IIS intercepts these calls and passes all the related packets on to the registered ASP.NET ISAPI filter (aspnet_isapi.dll). The filter connects to a worker process named aspnet_wp.exe, which implements the HTTP pipeline that ASP.NET uses to process Web requests.
Both executables are made of ordinary Win32 code. The ASP.NET layer built atop IIS is shown in Figure 13-2.


Figure 13-2: The ASP.NET architecture to process page and Web service requests.

The connection between the IIS process (the executable named inetinfo.exe) and the HTTP pipeline (the worker executable named aspnet_wp.exe) is established through a named pipe, that is, a Win32 mechanism for transferring data over a network. As you'd expect, a named pipe works just like a pipe: you enter data in one end, and the same data comes out at the other end. Pipes can be established both locally to connect processes and between remote machines.

After the ASP.NET worker process receives a request, it routes that request through the .NET Framework HTTP pipeline. The entry point of the pipeline is the HttpRuntime class. This class is responsible for packaging the HTTP context for the request, which is nothing more than familiar Active Server Pages (ASP) objects such as Request, Response, Server, and the like. These objects are packed into an instance of the HttpContext class, and then a .NET Framework application is started.

The WebService Class

In the .NET Framework, a Web service is an ordinary class with public and protected methods. The Web service class is normally placed in a source file that is saved with an .asmx extension. Web service files must contain the @ WebService directive that informs the ASP.NET run time about the nature of the file, the language in use throughout, and the main class that implements the service, as shown here:

<%@ WebService Language="C#" Class="MyWebService" %>

The Language attribute can be set to C#, VB, or JS.
The main class must match the name declared in the Class attribute and must be public, as shown here:


public class MyWebService : WebService
{
    ⋮
}

Indicating the base class for a .NET Framework Web service is not mandatory. A Web service can also be architected starting from the ground up using a new class. Inheriting the behavior of the WebService class has some advantages, however. A Web service based on the System.Web.Services.WebService class has direct access to common ASP.NET objects, including Application, Request, Cache, Session, and Server. These objects are packed into an HttpContext object, which also includes the time when the request was made. If you don't have any need to access the ASP.NET object model, you can do without the WebService class and simply implement the Web service as a class with public methods. With the WebService base class, however, a Web service also has access to the ASP.NET server User object, which can be used to verify the credentials of the current user executing the method.

Note
The Class attribute is normally set to a class residing in the same file as the @ WebService directive, but nothing prevents you from specifying a class within a separate assembly. In such cases, the entire Web service file consists of a single line of code: the @ WebService directive itself. The actual implementation is contained in the specified class, and the assembly that contains the class must be placed in the Bin subdirectory of the virtual folder where the Web service resides.

The @ WebService directive supports two additional attributes: Debug and CodeBehind.
The former is a Boolean property that indicates whether the Web service should be compiled with debug symbols. The latter specifies the source file that contains the class implementing the Web service when the class is neither located in the same file nor resident in a separate assembly.

The WebService Attribute

The WebService attribute is optional and does not affect the activity of the Web service class in terms of what is published and executed. The WebService attribute is represented by an instance of the WebServiceAttribute class and enables you to change three default settings for the Web service: the namespace, the name, and the description.

The syntax for configuring the WebService attribute is declarative and somewhat self-explanatory. Within the body of the WebService attribute, you simply insert a comma-separated list of names and values, as shown in the following code. The keyword Description identifies the description of the Web service, whereas Name points to the official name of the Web service.

[WebService(Name="Northwind Sales Report Web Service",
    Description="The Northwind Sales Report Web Service")]
public class SalesReportWebService : WebService
{
    ⋮


}

Changing the name and description of the Web service is mostly a matter of consistency. The .NET Framework assumes that the name of the implementing class is also the name of the Web service; no default description is provided. The Name attribute is used to identify the service in the WSDL text that explains the behavior of the service to prospective clients. The description is not used in the companion WSDL text; it is retrieved and displayed by the IIS default page only for URLs with an .asmx extension.

Changing the Default Namespace

Each Web service should have a unique namespace that makes it clearly distinguishable from other services. By default, the .NET Framework gives each new Web service the same default namespace: http://tempuri.org. This namespace comes with the strong recommendation to change it as soon as possible and certainly prior to publishing the service on the Web.

Note
Using a temporary name does not affect the overall functionality, but it will affect consistency and violate Web service naming conventions. Although most namespace names out there look like URLs, you don't need to use real URLs. A name that you're reasonably certain is unique will suffice.

The only way to change the default namespace of a .NET Framework Web service is by setting the Namespace property of the WebService attribute, as shown in the following code.
This example uses a custom path that merges the namespace of the class providing the sample service with the ISBN of this book.

[WebService(Namespace="xmlnet/cs/0735618011",
    Name="Northwind Sales Report Web Service",
    Description="The Northwind Sales Report Web Service")]

The namespace information is used extensively in the WSDL definition of the Web service.

Building a .NET Web Service

As mentioned, a Web service is a class that optionally inherits from WebService. As such, the class can implement any number of interfaces and, as long as you don't need to directly access common ASP.NET objects, can also inherit from any other .NET Framework or user-defined class. The definition of the class must necessarily be coded in an .asmx file. The file is made available to potential clients through a Web server virtual directory and is accessed through a URL. Any client that can issue HTTP commands can connect to the Web service unless security settings restrict the client's access to the service.

What happens after a client points to the URL is the focus of the rest of this chapter. Let's start by analyzing the internal structure of the Web service class.

Exposing Web Methods

Unlike the .NET Framework remotable classes described in Chapter 12, in a Web service class, public methods are not automatically exposed to the public. To be effectively exposed over the Web, a Web service method requires a special attribute in addition to being declared as public. Only methods marked with the WebMethod attribute gain the level of visibility sufficient to make them available over the Web.

The WebMethod Attribute

In practice, the WebMethod attribute represents a member modifier similar to public, protected, or internal. Only public methods are affected by WebMethod, and the attribute is effective only to callers invoking the class over the Web. This characteristic increases the overall flexibility of the class design. A software component allowed to instantiate the Web service class sees all the public methods and does not necessarily recognize the service as a Web service. However, when the same component is invoked as part of a Web service, the IIS and ASP.NET infrastructure ensure that external callers can see only methods marked with the WebMethod attribute. Any attempt to invoke untagged methods via a URL results in a failure.

The WebMethod attribute features several properties that you can use to adjust the behavior of the method. Table 13-1 lists the properties.

Table 13-1: Properties of the WebMethod Attribute

Property: Description

BufferResponse: Set to true by default, this property indicates that the IIS run time should buffer the method's entire response before sending it to the client. Even if set to false, the response is partially buffered; however, in this case, the size of the buffer is limited to 16 KB.

CacheDuration: Specifies the number of seconds that the IIS run time should cache the response of the method. This information is useful when you can foresee that the method will handle several calls in a short period of time. The default is 0, meaning no caching. The caching engine is smart enough to recognize and cache page invocations that use different parameter values.

Description: Provides the description for the method. The value of the property is then embedded into the WSDL description of the service.

EnableSession: Set to false by default, this property makes available to the method the Session object of the ASP.NET environment. Depending on how Session is configured, using this property might require cookie support on the client or a Microsoft SQL Server 2000 installation on the server.

MessageName: Allows you to provide a publicly callable name for the method. When you set this property, the resulting SOAP messages for the method target the name you set instead of the actual name. Use this property to give distinct names to overloaded methods in the event that you use the same class as part of the middle tier and a Web service.

TransactionOption: Specifies the level of COM+ transactional support you want for the method. A Web service method can have only two behaviors, regardless of the value you select from the standard TransactionOption enumeration: either it does not require a transaction or it must be the root of a new transaction.

The following code snippet shows how to set a few method attributes:

[WebService(Namespace="xmlnet/cs/0735618011",
    Name="Northwind Sales Report Web Service",
    Description="The Northwind Sales Report Web Service")]
public class SalesReportWebService : WebService
{
    [WebMethod(CacheDuration=60,
        Description="Returns sales for the specified year")]
    public DataSet GetSalesReport(int theYear)
    {
        ⋮
    }
}

Don't be fooled by appearances: attributes must be strongly typed in the declaration. In other words, the value you assign to CacheDuration must be a true number and not a quoted string containing a number. This is a general rule for attributes in the .NET Framework, not a peculiarity of Web services.

Transactional Methods

The behavior of a Web service method in the COM+ environment deserves a bit of attention. The inherent reliance of Web services on HTTP inevitably prevents them from being enlisted in running transactions; in the case of a rollback, it would be difficult to track and cancel performed operations. For this reason, a Web method can do either of two things: it can work in nontransacted mode, or it can start a nondistributed transaction.

For consistency, the TransactionOption property of the WebMethod attribute takes values from the .NET Framework's TransactionOption enumeration. The behavior of some of the values in this enumeration, however, is different from what their names suggest.
In particular, the Disabled, NotSupported, and Supported values from the TransactionOption enumeration always cause the method to execute without a transaction. Both Required and RequiresNew, on the other hand, create a new transaction.

Note
When a transactional method throws an exception or an externally thrown exception is not handled, the transaction automatically aborts. If no exceptions occur, the transaction automatically commits at the end of the method being called.
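As an illustration of the commit and abort rules just described, here is a minimal sketch of a transactional Web method. It is meant to live in an .asmx file hosted by ASP.NET and assumes a reference to the System.EnterpriseServices assembly, which defines the TransactionOption enumeration. The ArchiveService class, the ArchiveYear method, and the range check are hypothetical and serve only to show where an exception would abort the transaction.

```csharp
using System;
using System.EnterpriseServices;   // TransactionOption enumeration
using System.Web.Services;

// Hypothetical service used only to illustrate transactional Web methods.
public class ArchiveService : WebService
{
    // RequiresNew (like Required) makes the method the root of a new transaction.
    [WebMethod(TransactionOption = TransactionOption.RequiresNew)]
    public void ArchiveYear(int theYear)
    {
        // An exception that leaves the method causes the transaction to abort.
        if (theYear < 1900)
            throw new ArgumentOutOfRangeException("theYear");

        // Transacted data access code would go here (omitted in this sketch).

        // Returning normally lets the transaction commit automatically.
    }
}
```

Choosing Disabled, NotSupported, or Supported in place of RequiresNew would, according to the rules above, run the same method without any transaction.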


Format of SOAP Messages for a Web Method

Although SOAP dictates that the messages being exchanged between the Web service and its clients must be in XML, it says nothing about the actual schema of the XML. The .NET Framework provides an attribute-based mechanism to let you control the format of the XML packed in the SOAP message. To customize the structure of a SOAP message, you can intervene in two places: you can modify the layout of the information being packed beneath the <soap:Body> tag, and you can change the way in which parameter values are formatted.

The options available for formatting the body of the message are RPC and Document; the latter is the default format for the .NET Framework. The Document style refers to formatting the body of the method call according to an XSD schema. Typically, the body is given by a sequence of message parts whose actual syntax is specified by other properties such as Use and ParameterStyle. The RPC style formats the body of the SOAP message according to the formatting rules outlined in the SOAP specification, section 7.

The SoapDocumentMethod and the SoapRpcMethod attributes apply to an individual method. If you want the same attributes to apply to all methods in the Web service, use the SoapDocumentService and SoapRpcService attributes with the same syntax.

The SoapDocumentMethod Attribute

As mentioned, the Document body style is set by default.
If you need to change some of its default settings, you can use the SoapDocumentMethod attribute implemented in the SoapDocumentMethodAttribute attribute class. The Use property of the attribute specifies whether parameters are formatted in the Encoded or Literal style. (Both values come from the SoapBindingUse enumeration.)

The Literal flag formats parameters using a predefined XSD schema for each parameter, whereas Encoded encodes all message parts using the encoding rules set in the SOAP specification, section 5. Literal is the default option.

The ParameterStyle property specifies whether the parameters are encapsulated within a single message part following the <soap:Body> element or whether each parameter is an individual message part. The first option, SoapParameterStyle.Wrapped, is the default. To send each parameter as an individual message part, set the ParameterStyle property to SoapParameterStyle.Bare.

The following code snippet attempts to return a string encoded in a SOAP message instead of described by an XSD document:

[WebMethod(CacheDuration=60)]
[SoapDocumentMethod(Use=SoapBindingUse.Encoded)]
public string GetSalesReportBarChart(int theYear)
{
    ⋮
}

This script represents the SOAP request message for the method when the request is SOAP-encoded:

POST /salesreport/SalesReportService.asmx HTTP/1.1
Host: expo-star
Content-Type: text/xml; charset=utf-8
Content-Length: length
SOAPAction: "xmlnet/cs/0735618011/GetSalesReportBarChart"


The default request for the same method is shown here:

POST /salesreport/SalesReportService.asmx HTTP/1.1
Host: expo-star
Content-Type: text/xml; charset=utf-8
Content-Length: length
SOAPAction: "xmlnet/cs/0735618011/GetSalesReportBarChart"

Caution
The DataSet object can't be used with a Web service method if the parameters for the method are SOAP-encoded. This means that you can't use the SoapRpcMethod attribute with the method. In addition, when you use the default SoapDocumentMethod attribute, be sure that the Use property is set to SoapBindingUse.Literal.
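To make the caution concrete, here is a minimal sketch of a method that returns a DataSet with the Literal binding stated explicitly. It is meant to be hosted in an .asmx file. The ReportService class name, the table name, and the column names are hypothetical, and the single row is placeholder data rather than real sales figures.

```csharp
using System.Data;
using System.Web.Services;
using System.Web.Services.Description;   // SoapBindingUse
using System.Web.Services.Protocols;     // SoapDocumentMethod, SoapParameterStyle

// Hypothetical service showing a DataSet returned with the Literal binding.
public class ReportService : WebService
{
    [WebMethod]
    [SoapDocumentMethod(Use = SoapBindingUse.Literal,
                        ParameterStyle = SoapParameterStyle.Wrapped)]
    public DataSet GetSalesReport(int theYear)
    {
        // Placeholder schema and data, for illustration only.
        DataSet ds = new DataSet("SalesReport");
        DataTable table = ds.Tables.Add("Sales");
        table.Columns.Add("Year", typeof(int));
        table.Columns.Add("Total", typeof(decimal));
        table.Rows.Add(new object[] { theYear, 0m });
        return ds;
    }
}
```

Switching Use to SoapBindingUse.Encoded, or replacing the attribute with SoapRpcMethod, would produce the SOAP-encoded situation that the caution warns against for DataSet objects.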


The SoapRpcMethod Attribute

The RPC format is expressed by the SoapRpcMethod attribute and specifies that all parameters are encapsulated within a single XML element named after the Web service method, as shown in the following code. The RPC style does not support the Literal binding mode; only the SOAP-encoded binding mode (Encoded) is accepted.

POST /salesreport/SalesReportService.asmx HTTP/1.1
Host: expo-star
Content-Type: text/xml; charset=utf-8
Content-Length: length
SOAPAction: "xmlnet/cs/0735618011/GetSalesReportBarChart"

You must include the System.Web.Services.Protocols and System.Web.Services.Description namespaces in the Web service source to use SOAP formatting attributes.

Note
Web service methods in which the OneWay property of either the SoapRpcMethod attribute or the SoapDocumentMethod attribute is set to true do not have access to ASP.NET objects packed in the HttpContext object. References to these objects are still allowed, but null is always returned.

The Sales Report Web Service

To see a concrete example of a Web service, let's transform the remote service created in Chapter 12 into a Web service. The Web service class makes externally available a group of functions nearly identical to that of the .NET Remoting component. In doing so, it also uses the same internal class, thus demonstrating a true reuse of code.


The SalesReportService.asmx file is located in the same virtual folder as the remote object. The following code shows the implementation of the Sales Report Web service. The main class is named SalesReportWebService.

using System;
using System.Web.Services;
using System.Data;
using System.Data.SqlClient;
using XmlNet.CS;

[WebService(Namespace="xmlnet/cs/0735618011",
    Name="Northwind Sales Report Web Service",
    Description="The Northwind Sales Report Web Service")]
public class SalesReportWebService
{
    // Returns the sales report for the specified year as a DataSet.
    [WebMethod(CacheDuration=60)]
    public DataSet GetSalesReport(int theYear)
    {
        SalesDataProvider m_dataManager;
        m_dataManager = new SalesDataProvider();

        DataSet ds = new DataSet();
        ds.Tables.Add(m_dataManager.GetSalesReport(theYear));
        return ds;
    }

    // Returns the bar chart for the specified year as a string.
    [WebMethod(CacheDuration=120)]
    public string GetSalesReportBarChart(int theYear)
    {
        SalesDataProvider m_dataManager;
        m_dataManager = new SalesDataProvider();
        return m_dataManager.GetSalesReportBarChart(theYear);
    }
}

The class features two methods, GetSalesReportBarChart and GetSalesReport, that are simply wrappers around the same methods of the SalesDataProvider class. As we saw in Chapter 12, the SalesDataProvider class provides the implementation of business logic, including the code necessary to draw graphics.
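Because any client that can issue HTTP commands can reach the service, even a plain HTTP-GET request is enough to call a method. The following is a minimal sketch under several assumptions: the HTTP-GET protocol is enabled for the service, the service answers at the expo-star host and path used in the request listings of this chapter, and 1997 is an illustrative argument value. The HttpGetClientSketch class and its methods are hypothetical names.

```csharp
using System;
using System.Net;

public static class HttpGetClientSketch
{
    // Builds a URL of the form <service>.asmx/<method>?<name>=<value>.
    public static string BuildUrl(string serviceUrl, string method,
                                  string paramName, string value)
    {
        return serviceUrl.TrimEnd('/') + "/" + method + "?" +
               Uri.EscapeDataString(paramName) + "=" +
               Uri.EscapeDataString(value);
    }

    // Sends the GET request and returns the raw XML response as text.
    public static string Call(string url)
    {
        using (WebClient client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    public static void Main(string[] args)
    {
        string url = BuildUrl(
            "http://expo-star/salesreport/SalesReportService.asmx",
            "GetSalesReport", "theYear", "1997");
        Console.WriteLine(url);

        // Contact the server only when explicitly asked to.
        if (args.Length > 0 && args[0] == "--send")
            Console.WriteLine(Call(url));
    }
}
```

The response to such a call is plain XML rather than a SOAP envelope, which is why HTTP-GET payloads are more compact. SOAP over HTTP-POST remains the more general option because it can carry complex argument types.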


If you compare this code with the remotable object in Chapter 12, you can't help but notice a close resemblance. For the most part, this similarity depends on the use of an intermediate, common class. Just this fact proves the extreme flexibility of the .NET Framework. Rolling your own functionalities into an interface-less Web site is just one side of a coin that has on its other side .NET Remoting accessibility. Later in this chapter, after we finish our implementation of a Web service, we'll complete the comparison between .NET Remoting and Web services. Figure 13-3 shows the typical user interface that IIS and ASP.NET provide for Web services, mostly for testing purposes.

Figure 13-3: The standard user interface for .NET Framework Web services.

If you test the Web service using a Netscape browser, you might get a slightly different user interface, depending on the version of the browser and the level of support it provides for cascading style sheets (CSS). Also bear in mind that the Web service console shown in Figure 13-3 assumes that your client machine has a program registered to handle XML files. The response of the method is saved to a local XML file that is then displayed through the registered program.
On many <strong>Microsoft</strong> W<strong>in</strong>dowsmach<strong>in</strong>es, the default handler of <strong>XML</strong> files is Internet Explorer.Figure 13-4 shows what happens when you test the GetSalesReport method with thedefault (and test-only) user <strong>in</strong>terface.466


Figure 13-4: Testing the GetSalesReport Web method.

Under the Hood of a Web Method Call

Any call made to a Web service method is resolved by an HTTP handler module tailor-made for Web services. In the ASP.NET and IIS architectures, an HTTP handler is a Web server extension that handles all the URLs of a certain type. Once the incoming call has been recognized as a Web service call, an instance of the WebServiceHandlerFactory class is created. The just-created object compiles the Web service class into an assembly (only the first time). Next the Web service factory class analyzes the request bits and parses the contents of the message (probably, but not necessarily, a SOAP payload). If successful, the request is transformed into method information. An ad hoc data structure contains information such as the name of the method, the list of formal and actual parameters, whether the method is void, and the returned type.

The method information is then passed to a call handler that will actually take care of executing the method. According to the information specified in the request, the call handler can contain context information (for example, Session) and work either synchronously or asynchronously. Finally, the server object is instantiated, the method is invoked, and the return value is written to the output stream. Figure 13-5 illustrates the process.
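The steps just described (resolve the method, bind the actual parameters, instantiate the server object, invoke, serialize the return value) can be sketched with plain reflection. This is an illustrative simplification under stated assumptions, not the ASP.NET implementation: DemoService, CallSketch, and Handle are hypothetical names, and the arguments arrive as strings rather than as a parsed SOAP payload.

```csharp
using System;
using System.IO;
using System.Reflection;
using System.Xml.Serialization;

// Hypothetical stand-in for a Web service class; not part of the book's samples.
public class DemoService
{
    public int Add(int a, int b) { return a + b; }
}

public static class CallSketch
{
    // Resolves the method by name, converts the raw string arguments to the
    // formal parameter types, invokes the method on a fresh instance, and
    // returns the result serialized as XML.
    public static string Handle(Type serviceType, string methodName, string[] rawArgs)
    {
        // Step 1: turn the request into method information
        MethodInfo method = serviceType.GetMethod(methodName);
        if (method == null)
            throw new ArgumentException("Unknown method: " + methodName);

        ParameterInfo[] formals = method.GetParameters();
        if (formals.Length != rawArgs.Length)
            throw new ArgumentException("Wrong number of arguments.");

        // Step 2: bind actual parameters to the formal parameter types
        object[] actuals = new object[formals.Length];
        for (int i = 0; i < formals.Length; i++)
            actuals[i] = Convert.ChangeType(rawArgs[i], formals[i].ParameterType);

        // Step 3: instantiate the server object and invoke the method
        object server = Activator.CreateInstance(serviceType);
        object returnValue = method.Invoke(server, actuals);

        // Step 4: write the return value as XML (nothing to write for void)
        if (method.ReturnType == typeof(void))
            return String.Empty;

        XmlSerializer ser = new XmlSerializer(method.ReturnType);
        StringWriter output = new StringWriter();
        ser.Serialize(output, returnValue);
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Handle(typeof(DemoService), "Add", new string[] { "2", "3" }));
    }
}
```

Running the program prints an XML document whose root element carries the sum, which mirrors how the return value of a Web method ends up in the response stream.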


Figure 13-5: Processing a Web service call.

Setting Caching Properties

As mentioned, the CacheDuration property of the WebMethodAttribute class sets the length of time in seconds that the Web service should cache the page output. This feature demonstrates once again the tight integration between Web services and the ASP.NET run-time infrastructure. The CacheDuration property is implemented using the ASP.NET Cache object. Just before instantiating the server object, the Web service handler configures the Cache object. In particular, the Web service handler sets the cache to work on the server, as shown here:

    Response.Cache.SetCacheability(HttpCacheability.Server);

In addition, the Web service handler sets the expiration time and configures the caching subsystem for parametric output, as follows:

    Response.Cache.VaryByHeaders["SOAPAction"] = true;
    Response.Cache.VaryByParams["*"] = true;


The VaryByHeaders property enables you to cache multiple versions of a page, depending on the value of the HTTP header (or headers) you specify; in this example, the header is SOAPAction. The VaryByParams property, on the other hand, lets you maintain different caches for each set of distinct values of the specified parameters. In this case, using the asterisk (*) indicates that all parameters must be considered when caching a page.

Note: Under certain conditions, the CacheDuration attribute can constitute a significant improvement for your Web services. Ideally, you might want to set this attribute when your method returns a large amount of data (for example, a DataSet object) but receives quite a few requests distributed throughout the day. The caching mechanism (the same mechanism available to all ASP.NET applications) lets you distinguish cached copies of the output that are also based on parameters. Under these circumstances, generating a new data set every time the method is called isn't efficient, unless, of course, user requirements mandate that you return fresh data. The advantage in performance can be relevant and significant. In my experimentation, I was able to get response times up to 8 times faster, with 2 or 3 times faster being the average.

The Role of the XML Serializer

As shown in Figure 13-5, the return value of the method call is packed as XML using the XML serializer that we saw in action in Chapter 11. The following script represents the pseudocode that creates the response for a Web service method:

    Response.ContentType = ContentType.Compose("text/xml", Encoding.UTF8Encoding);
    ser.Serialize(outputStream, returnValue);

The XML serializer can't process all .NET Framework types. Remember, the XML serializer doesn't work with types that have circular references and only packs public and read/write members. The XML serializer doesn't ensure type fidelity; it simply produces an effective XSD (or SOAP-encoded) representation of the data.

Note: A Web service can't return an ADO.NET object other than the DataSet object for the simple reason that the XmlSerializer class doesn't know how to handle them. On the other hand, XmlSerializer can normally handle arrays of primitive objects, and this can help when you're creating workarounds for returning complex data like that stored in many ADO.NET objects.

Disabling HTTP-POST and HTTP-GET

As we'll see in more detail in the section "Invoking a Web Service Through Script" later in this chapter, you can invoke a Web service method using a SOAP message as well as a plain HTTP-POST or HTTP-GET command. The latter two protocols have been introduced to make accessing a Web service easier than ever. However, leaving the Web service door open to plain HTTP packets can constitute a potential security hole.

If you want to disable the HTTP-POST and HTTP-GET support on a machine-wide basis, do as follows. First locate the machine.config file (more on configuration files in Chapter 15) in the local system. The file is normally located in the config subdirectory of the .NET Framework installation path. A typical path is shown here:

    c:\winnt\microsoft.net\framework\v1.0.3705\config\machine.config
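For orientation, the fragments involved in the steps that follow can be sketched as below, assuming the ASP.NET 1.x configuration schema. The element names (webServices, protocols, add, remove) and the four protocol names come from that schema rather than from the text here, so check them against your own machine.config before editing. The first fragment shows what the machine-wide protocol list typically looks like:

```xml
<!-- Assumed shape of the protocol list inside system.web in machine.config -->
<webServices>
  <protocols>
    <add name="HttpSoap" />
    <add name="HttpPost" />
    <add name="HttpGet" />
    <add name="Documentation" />
  </protocols>
</webServices>
```

The second fragment is a sketch of a per-service web.config that removes the two plain HTTP protocols for one virtual directory only:

```xml
<!-- Assumed per-service override, placed in the Web service's virtual directory -->
<configuration>
  <system.web>
    <webServices>
      <protocols>
        <remove name="HttpPost" />
        <remove name="HttpGet" />
      </protocols>
    </webServices>
  </system.web>
</configuration>
```

The per-service form is usually the safer choice, because it leaves other applications on the same machine untouched.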


The machine.config file is an XML file that contains a section similar to the following:

⋮

To disable HTTP-POST and HTTP-GET support for all Web services on the server, simply comment out the lines corresponding to "HttpPost" and "HttpGet". You can also disable HTTP-POST and HTTP-GET support on a per-service basis. In this case, do not enter any changes in the machine.config file; instead, create a web.config file in your Web service's virtual directory and add the following XML to the file:

⋮
⋮

Note: If you open up the machine.config file and look in the webServices section, you can't help but notice the special Documentation protocol. This protocol is the key that enables the ASP.NET run time to deliver a help page, such as the one shown in Figure 13-3, when you point your browser to an .asmx resource. The default help page is generated by a file named DefaultWsdlHelpGenerator.aspx, which is located in the same folder as machine.config. The page is modifiable, but if you need to enter changes, I'd recommend that you create and register your own generator page. The generator page can be changed with the following configuration code:

Of course, the help page can be customized for all Web services by adding the preceding code to machine.config, or it can be customized for a particular Web service by adding the code to the service's web.config file.

Building a .NET Framework Web Service Client

Whether you use Microsoft Visual Studio .NET or a simple text editor to code the .asmx file, writing Web services using the .NET Framework is definitely an easy task. And as you'll see, writing client applications to use those services is even easier.

You can call a Web service through a URL using either the HTTP-GET or the HTTP-POST command. You can do that also from within an ASP.NET page using the WebRequest .NET Framework class. From within Visual Studio .NET, referencing a Web service is nearly identical to adding a reference to another assembly. What you get is a proxy class through which your Windows Forms or Web Forms application can reach its URL across port 80, just like a user's browser. In doing so, firewall problems disappear and HTTP on top of Secure Sockets Layer (SSL) or any other form of encryption can be used to transfer data.

Connecting to a Web service is similar to connecting to a .NET Framework remotable object in that in both cases you end up using a proxy class. The big difference is in the characteristics of the proxy. The .NET Remoting proxy is a dynamically created object that works transparently under the hood of the remote object instance.
The client has the impression that it is working with a local object that silently posts all calls to the remote object.

The Web service proxy is a statically created class that must be compiled and linked to the project. The .NET Framework provides a tool to generate such a class. This tool, named wsdl.exe, takes the Web service WSDL script and generates a Microsoft Visual Basic .NET or a C# class (the default) that mirrors methods for synchronous and asynchronous calls. From the client perspective, calling into the proxy class is a local call. Each call, however, results in a round-trip to the server. The following command line generates the C# proxy for the previously written Web service:

    wsdl.exe http://server/salesreport/salesreportservice.asmx?wsdl

The wsdl.exe utility is part of the .NET Framework SDK, and among its other options, it allows you to specify the protocol for the call and the language for the source code. The utility is also silently invoked by Visual Studio .NET when you reference a Web service using the Add Web Reference menu command in Solution Explorer.

The Proxy Class

The proxy class generated for a Web service is added to the project and is in effect a local class. The difference from the remoting architecture is that .NET Remoting uses a dynamically generated class whose method information is hard-coded in the object information being marshaled (the ObjRef object). With a Web service, there is no dynamic class creation.
The following source code represents the proxy for the Sales Report Web service:

    using System;
    using System.Data;                          // DataSet
    using System.Xml.Serialization;
    using System.Web.Services.Description;      // SoapBindingUse
    using System.Web.Services.Protocols;
    using System.Web.Services;

    [System.Web.Services.WebServiceBindingAttribute(
        Name="Northwind Sales Report Web ServiceSoap",
        Namespace="xmlnet/cs/0735618011")]
    public class NorthwindSalesReportWebService : SoapHttpClientProtocol
    {
        public NorthwindSalesReportWebService()
        {
            // Feel free to change this URL
            this.Url = "http://expo-star/salesreport/salesreportservice.asmx";
        }

        // Synchronous call to GetSalesReport
        [SoapDocumentMethodAttribute(
            "xmlnet/cs/0735618011/GetSalesReport",
            RequestNamespace="xmlnet/cs/0735618011",
            ResponseNamespace="xmlnet/cs/0735618011",
            Use=SoapBindingUse.Literal,
            ParameterStyle=SoapParameterStyle.Wrapped)]
        public DataSet GetSalesReport(int theYear)
        {
            object[] results = Invoke("GetSalesReport",
                new object[] {theYear});
            return ((DataSet)(results[0]));
        }

        // Asynchronous pair for GetSalesReport
        public IAsyncResult BeginGetSalesReport(int theYear,
            AsyncCallback callback, object asyncState)
        {
            return BeginInvoke("GetSalesReport",
                new object[] {theYear}, callback, asyncState);
        }

        public DataSet EndGetSalesReport(IAsyncResult asyncResult)
        {
            object[] results = EndInvoke(asyncResult);
            return ((DataSet)(results[0]));
        }

        // Synchronous call to GetSalesReportBarChart
        [SoapDocumentMethodAttribute(
            "xmlnet/cs/0735618011/GetSalesReportBarChart",
            RequestNamespace="xmlnet/cs/0735618011",
            ResponseNamespace="xmlnet/cs/0735618011",
            Use=SoapBindingUse.Literal,
            ParameterStyle=SoapParameterStyle.Wrapped)]
        public string GetSalesReportBarChart(int theYear)
        {
            object[] results = Invoke("GetSalesReportBarChart",
                new object[] {theYear});
            return ((string)(results[0]));
        }

        // Asynchronous pair for GetSalesReportBarChart
        public IAsyncResult BeginGetSalesReportBarChart(int theYear,
            AsyncCallback callback, object asyncState)
        {
            return BeginInvoke("GetSalesReportBarChart",
                new object[] {theYear}, callback, asyncState);
        }

        public string EndGetSalesReportBarChart(IAsyncResult asyncResult)
        {
            object[] results = EndInvoke(asyncResult);
            return ((string)(results[0]));
        }
    }

In addition to the class constructor, the proxy contains a public method for each Web method defined on the Web service. The proxy also provides a pair of Begin and End members for each Web method; these members are used to set up asynchronous calls.

Note: This proxy class uses the SoapDocumentMethod attribute. Up to now, we've used the SoapDocumentMethod and SoapRpcMethod attributes for server files. One thing developers often miss is that the SOAP-related settings you use on the server must be repeated on the client. Normally, the wsdl.exe utility takes care of this for formatting attributes. However, if you use SOAP extensions, you have to assign the same attributes to the proxy class manually, in addition, of course, to making the necessary assemblies available on the client.

Changing the Web Service Reference

The proxy constructor sets the Url property of the proxy class to the original URL of the Web service.
The value of the property can be changed at design time and even programmatically. The Url property is inherited from the base class WebClientProtocol, one of the proxy's ancestors.
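Changing the URL programmatically amounts to assigning the inherited property before the first call. The sketch below assumes a .NET Framework project that references System.Web.Services; ProxySketch and UrlDemo are hypothetical stand-ins (a real project would use the wsdl.exe-generated proxy instead), and the URL is a placeholder.

```csharp
using System;
using System.Web.Services;
using System.Web.Services.Protocols;

// Hypothetical stand-in for a generated proxy. Like a real proxy, it derives
// from SoapHttpClientProtocol and carries a WebServiceBinding attribute.
[WebServiceBinding(Name="ProxySketchSoap", Namespace="urn:example:sketch")]
public class ProxySketch : SoapHttpClientProtocol
{
}

public static class UrlDemo
{
    // Creates a proxy, points it at a different endpoint, and returns the
    // URL the proxy will use for subsequent calls.
    public static string Retarget(string newUrl)
    {
        ProxySketch service = new ProxySketch();
        service.Url = newUrl;      // replaces any URL set in the constructor
        service.Timeout = 30000;   // milliseconds; also inherited from WebClientProtocol
        return service.Url;
    }

    public static void Main()
    {
        // Placeholder URL; no request is sent because no Web method is invoked.
        Console.WriteLine(Retarget("http://localhost/salesreport/salesreportservice.asmx"));
    }
}
```

Because the assignment only affects the proxy instance at hand, each instance can target a different server, which is handy when switching between test and production endpoints.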


In situations in which the URL can't be determined unequivocally or might change on a per-user basis or because of other run-time factors, you can ask the wsdl.exe utility not to hard-code the URL in the source. By using the /urlkey command-line switch, you instruct the utility to dynamically read the Web service URL from the application's configuration file. If you use a switch such as /urlkey:ActualUrl, the proxy class constructor changes as follows:

    using System.Configuration;
    ⋮
    public NorthwindSalesReportWebService()
    {
        // Look up the URL in the application's configuration file
        String urlSetting = ConfigurationSettings.AppSettings["ActualUrl"];
        if ((urlSetting != null))
            this.Url = urlSetting;
        else // Defaults to the URL used to build the proxy
            this.Url = "http://server/salesreport/salesreportservice.asmx";
    }

ConfigurationSettings.AppSettings is a special property that provides access to the application settings defined in the appSettings section of the configuration file. Configuration files are XML files that allow you to change settings without recompiling the application. Configuration files also allow administrators to apply security and restriction policies that affect how applications run on various machines. (We'll cover configuration files in Chapter 15.)

The name and location of the configuration file depend on the nature of the application. For ASP.NET pages and Web services, the file is named web.config and is located in the root directory of the application. You can also have other web.config files located in child directories. Child configuration files inherit the settings defined in configuration files located in parent directories. For Windows Forms applications, the configuration file takes the name of the executable plus a .config extension. Such a file must be resident in the same folder as the main executable.

Issuing Calls to the Web Service

Once a client application is linked to the Web service, it simply creates a new instance of the proxy class and calls its methods. Bear in mind that calling into a Web service is a potentially lengthy operation that might take a few seconds to complete. If you find that the method call takes too long, go for an asynchronous call.

This book's sample files include a Windows Forms application that uses the Web service proxy to get data from the site. Surprisingly enough, the code is nearly identical to the related application we built in Chapter 12 as a remoting client. (See the following listing.) The key difference is in the name of the class to call. In addition, with a Web service you don't need to initialize the class in the form's Load event because the proxy class is statically linked to the project.

    NorthwindSalesReportWebService service;
    service = new NorthwindSalesReportWebService();
    string img = service.GetSalesReportBarChart(theYear);
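For a call that takes too long, the Begin and End members of the proxy offer a way out. The following is a minimal sketch, assuming the generated NorthwindSalesReportWebService proxy is compiled into the project and the service is reachable; AsyncCallSketch is a hypothetical name, and the year is an arbitrary sample value.

```csharp
using System;

public class AsyncCallSketch
{
    public static void Main()
    {
        NorthwindSalesReportWebService service = new NorthwindSalesReportWebService();

        // Start the request; the call returns as soon as the request is queued.
        // No callback and no state object are supplied in this sketch.
        IAsyncResult ar = service.BeginGetSalesReportBarChart(1997, null, null);

        // The client is free to do other work while the request is in flight.
        Console.WriteLine("Request sent, doing other work...");

        // Collect the result; this blocks only if the response hasn't arrived yet.
        string img = service.EndGetSalesReportBarChart(ar);
        Console.WriteLine("Received " + img.Length + " characters of image data.");
    }
}
```

In a Windows Forms client you would more likely pass an AsyncCallback to the Begin method and call the End method from inside that callback, so that the user interface thread never blocks.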


Figure 13-6 shows the Windows Forms client in action.

Figure 13-6: A Windows Forms Web service client in action.

Invoking a Web Service Through Script

A Web service is always invoked by using an ordinary HTTP packet that contains information about the method to call and the arguments to use. This HTTP packet reaches the Web server by traveling as a GET or POST command. You can invoke a Web service method using one of these commands:

• A POST command that embeds a SOAP request
• A POST command that specifies the method name and parameters
• A GET command whose URL contains the method name and parameters

To invoke a method in a Web service, SOAP is not strictly necessary. You can use GET or POST commands, which result in a more compact body. However, the benefits of using SOAP become clearer as the complexity of data increases. GET and POST commands support primitive types, including arrays and enumerations. SOAP, on the other hand, relies on a portable and more complex type system based on XML schemas. In addition, in the .NET Framework, Web services also support classes that the XML serializer can handle.

A Windows Script Host Example

To give you a practical demonstration of how Web services are really just HTTP-accessible software agents, let's write a Windows Script Host (WSH) script that allows plain Microsoft Visual Basic, Scripting Edition (VBScript) code to download information from a remote server.
To send HTTP commands from VBScript code, we'll use the Microsoft.XmlHttp object, a native component of Microsoft Internet Explorer 5.0 and MSXML 3.0 and later versions. The following script calls the method GetSalesReport by using a GET command:

    Const HOST = "http://expo-star/"
    Const URL = "salesreport/salesreportservice.asmx/"
    Const TheYear = 1997

    ' Create the HTTP object
    Set xmlhttp = CreateObject("Microsoft.XMLHTTP")
    xmlhttp.open "GET", _
        HOST & URL & "GetSalesReport?TheYear=" & TheYear, _
        False

    ' Send the request synchronously
    xmlhttp.send ""

    ' Store the results in a file named RAW_OUTPUT.XML
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set f = fso.CreateTextFile("raw_output.xml")
    f.Write xmlhttp.responseText
    f.Close

The resultant XML string (the body of the response) is stored in a local XML file.

Extracting a JPEG Image from the XML Output

We've now built a Web service that returns JPEG images, BinHex-encoded and packed in an XML string. Let's see how to get the image and save it locally as a distinct JPEG file. And because we used a GET command in our previous example, we'll use a POST command this time.

With POST commands, you have to use a URL without parameters and store the parameter information in the body of the message, as shown in the following code. In addition, you must indicate the content type of the message.

    Const HOST = "http://expo-star/"
    Const URL = "salesreport/salesreportservice.asmx/"
    Const TheYear = 1997

    ' Create the HTTP object
    Set xmlhttp = CreateObject("Microsoft.XMLHTTP")
    xmlhttp.open "POST", _
        HOST & URL & "GetSalesReportBarChart", _
        False

    ' Set the Content-Type header to the specified value
    xmlhttp.setRequestHeader "Content-Type", _
        "application/x-www-form-urlencoded"

    ' Send the request synchronously
    xmlhttp.send "TheYear=" & TheYear

    ' Get the results as an XMLDOM
    Set xmldoc = xmlhttp.responseXml

    ' Extract the XML-based image description from the response
    img = xmldoc.text

    ' Store the results in a file named RAW_OUTPUT.XML
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set f = fso.CreateTextFile("raw_output.xml")
    f.Write img
    f.Close

    ' Extract the JPEG image from raw output
    Set shell = CreateObject("WScript.Shell")
    shell.Run "jpegextractor.exe raw_output.xml image.jpg"

This script first invokes the method and gets the results as an XML Document Object Model (XML DOM) object. The inner text of the document is saved to a local variable and then to a temporary file (raw_output.xml). Finally, a small managed utility (jpegextractor.exe) parses the XML stream, extracts and decodes the JPEG bits, and saves them to a file. The result is a JPEG file representing the sales report for the year you specify.

Note: The jpegextractor.exe utility is available as source code in this book's sample files, along with the Web service, the scripts, and the client applications discussed in this chapter.

.NET Remoting vs. Web Services

Web services were designed to overcome a few Web architecture problems, particularly in the area of component interoperability. Web services are key tools for accessing otherwise inaccessible functionalities exposed over heterogeneous hardware and software platforms.

If we stopped our analysis here, the conclusion would be rather obvious: Web services are the first fundamental software development of the new millennium.
Although Web services will certainly represent a milestone in the history of computer programming, the more we design them and use them, the more we realize they have serious limitations. Consequently, and perhaps unfortunately, using a Web service isn't always the best solution.

Which Came First?

I perceive the .NET Framework Web services as a special case of .NET Remoting, but one could argue for the opposite scenario as well. Putting Web services at the center of the interoperability universe and considering .NET Remoting as a platform-specific implementation does make a lot of sense. In general, the way you look at the newest Microsoft remoting technologies depends on your individual perspective.

If you look at interoperability from a .NET Framework–specific viewpoint, you will probably agree with my perception and put Web services on a secondary plane. If your situation spans more vendors and more platforms, you'll recognize that the unquestionable similarity between the Web service API and the .NET Remoting API stems from the fact that .NET Remoting has borrowed some features from the Web service specification.

So which came first, the .NET Remoting egg or the Web service chicken? If you're considering .NET Remoting, you are looking at Microsoft's remoting technologies mostly from a .NET Framework perspective. The key issue is slightly different, however. Instead of focusing on which technology came first, you should ask what each technology can do for you. And your final choice should favor the technology that most closely meets your needs.

When to Use .NET Remoting

.NET Remoting is ideal for .NET-to-.NET communication. More exactly, it's been designed for precisely that purpose.
As a .NET Framework–specific technology, .NET Remoting lets you use all common language runtime (CLR) types, detects and handles local calls differently, and distinguishes the atomic unit of processing at a different level: the application domain (AppDomain) level instead of the process level. And .NET Remoting increases its performance by allowing the use of binary protocols.

The Special Case of Win32/COM

If you need to set up communication between a .NET Framework application and a Win32 or COM application, you might consider an ad hoc DLL, a COM object, or even a memory mapped file as an alternative to using Web services. In this scenario, you can't use .NET Remoting because one of the applications is either a Win32 or a COM application, that is, a non-.NET-Framework application.

When to Use Web Services

Web services are ideal in a couple of scenarios. First, they are the only safe way to go if your goal is targeting a non-Microsoft platform. If you have to access code running on Linux or want to make your .NET Framework component available to a Linux client, by all means, go for Web services.
Second, you should use Web services when, irrespective of the involved platforms, the user requirements mandate that the application must be programmatically accessible through a URL.

Web Service Issues

As a software application that makes itself available only through Internet connections, a Web service is at risk of being, or becoming, a slow application. For this reason, optimization is more than ever a critical factor. Overall performance is affected mostly by the network latency but also, in small part, by the format of the protocol being used. HTTP and SOAP are both based on text, and SOAP in particular is a quite verbose protocol. This results in packets significantly larger than those typical of binary protocols such as Common Object Request Broker Architecture (CORBA) or even Distributed COM (DCOM).

When trying to improve the usability of a Web service (the area in which you should focus your optimization efforts), you should address the following tasks:

• Performing asynchronous calls
• Compressing packets using a SOAP extension
• Minimizing round-trips
• Enhancing the interface of the Web service with mobile code

Asynchronous calls let an application invoke a method and continue running as usual until the response is downloaded on the client. The mechanism exploits the features of asynchronous programming in the .NET Framework.

A SOAP extension is like a hook that you register with the Web service to access the raw SOAP XML either as it is about to be transmitted or as it is received. A SOAP extension works both on the client, by using proxy classes, and on the server. When you want to perform tricks or customize the underlying XML, SOAP extensions provide the right connection point. One valid use of SOAP extensions is for encrypting or compressing method parameters for improved performance and security.

The last two tasks, minimizing round-trips and creating mobile code, are somewhat more complex. Minimizing round-trips is a key aspect of optimization that goes deeper than simply improving performance using software tricks. Mobile code is a concept that is quite popular in the Java community and involves software agents that execute some user code on the server. Let's look at these two topics in more detail.

Minimizing Round-Trips

Each call addressing a Web service method requires a round-trip.
Because all Web service activity takes place over the Internet, you can't always expect a rapid response. And because the round-trip is permanently tied to the request of an operation on the Web service, the best (and possibly the only) way to minimize round-trips is to merge several logically distinct functions into one method. The open issue in this approach concerns using additional methods in the interface or additional parameters in the prototype of certain methods. Simple, succinct, and direct methods enhance overall design but certainly do not minimize round-trips, because to execute two functions, you need at least two round-trips.

On the other hand, incorporating more functionality in the body of a single and more complex method is effective in terms of performance but not necessarily in terms of the service usability. A client might receive more information than needed, paying the price in increased downloading time. Moreover, a client might be forced to use an overly complex signature, exposing itself to the risk of getting the requested information by trial and error.

Creating Mobile Code

Although the .NET Framework environment attempts to make you comfortable with Web service client programming, you must still call into remote methods over the Internet. Minimizing round-trips with a smart design is only the first step.
What if you can't easily come up with a sequence of operations to pack into a new method? What if the next step depends on run-time conditions? In the database world, you use stored procedures to concatenate multiple SQL calls with some logic. Why can't the same concept be ported to the Web?

Mobile code technology has already been tested on other platforms, although with slightly different purposes, and sooner or later it will make its way to the .NET Framework run time. By mobile code, I mean the ability that certain server applications (for example, Web services) might have to execute code sent by clients. Created to allow software agents to transport code to specialized servers for longtime executions, mobile code is a concept that proves useful also in the land of Web services. Interestingly enough, mobile code can solve many problems but exacerbate others.
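To make the idea concrete, here is a minimal sketch, written in plain JavaScript rather than against a real Web service, that counts round-trips for a chatty client and for a client that ships its logic to the server. Everything in it (createFakeService, invoke, execute, the two sample methods, and their return values) is hypothetical and runs in-process; no network call is made. It evaluates caller-supplied text with the Function constructor, which is precisely the kind of risk that makes mobile code a security concern.

```javascript
// Simulated service: every call to invoke() or execute() counts as one round-trip.
// All names and return values are hypothetical; nothing here touches the network.
function createFakeService() {
  var roundTrips = 0;
  var api = {
    getCustomerId: function (name) { return name === "Davolio" ? 1 : -1; },
    getOrderCount: function (id) { return id === 1 ? 3 : 0; }
  };
  return {
    // One remote method call costs one round-trip.
    invoke: function (method, arg) {
      roundTrips++;
      return api[method](arg);
    },
    // "Mobile code": the client sends a script that runs next to the API.
    // WARNING: running caller-supplied code is unsafe without authorization
    // or a sandbox; this sketch has neither.
    execute: function (source) {
      roundTrips++;
      var script = new Function("service", source);
      return script(api);
    },
    roundTripCount: function () { return roundTrips; }
  };
}

// Chatty client: the second call depends on the result of the first.
var chatty = createFakeService();
var id = chatty.invoke("getCustomerId", "Davolio");
var chattyResult = id > 0 ? chatty.invoke("getOrderCount", id) : 0;

// Mobile-code client: the same logic, shipped as text in a single request.
var mobile = createFakeService();
var mobileResult = mobile.execute(
  "var id = service.getCustomerId('Davolio');" +
  "return id > 0 ? service.getOrderCount(id) : 0;"
);

console.log(chattyResult, chatty.roundTripCount()); // prints: 3 2
console.log(mobileResult, mobile.roundTripCount()); // prints: 3 1
```

The second client gets the same answer in one round-trip instead of two, at the cost of letting the server run code it did not write.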


Mobile code allows you to send C# or Visual Basic .NET code to a Web service, where it can be compiled and executed on the fly. Once the user code has been given access to the methods of the Web service, it can execute any operations and combine the Web service calls in any suitable order, all in a single round-trip.

Mobile code is not perfect, however. But the problems it makes more acute tend to be problems that you'll have to address anyway for the sake of the Web service's stability and success. For example, using mobile code poses serious security concerns. How can you ensure that the code accepted by the Web service is safe for the Web server? You can work around this issue in several ways: You can enable the compilation feature only for authorized users. Or, better yet, you can allow the resulting dynamic assembly to run in a sort of sandbox, where potentially dangerous calls are simply forbidden.

True Interoperability

Another issue that must be consistently addressed to guarantee the widespread acceptance and success of Web service technology is data interoperability. Although several recent articles claim that interoperability is the key feature of Web services, the truth is that Web services are currently fully interoperable only within the boundaries of the .NET Framework.

At this time, you can safely transmit over the Web only primitive types that are included in the XSD type system. What happens to a .NET Framework class or a user-defined class?
In the section "The Role of the XML Serializer," on page 579, you saw that the XML serializer takes care of writing the return value of a Web service method call. The XML serializer is actually responsible for the data types (custom and .NET Framework classes) that will be sent to callers. The XML serializer is not perfect, and more important, it is not standard. So how could a Java application quickly and easily understand and deserialize the XML stream it gets from the XmlSerializer class? Only when a recognized standard for serializing classes to XML is available will true interoperability between platforms be realized.

Conclusion

Web services are often presented as the perfect tool for today's programmers. Web services are interoperable, are based on open standards such as SOAP and WSDL, and, more importantly, are fully integrated with the .NET platform. This apparent point of strength in Web services (the perfect and seamless integration with the rest of the .NET Framework) on closer examination turns out to be, if not a weakness, a reliable indicator of where Web services are limited. Aspects such as security, interoperability, and code optimization are undermining the stability of the technology. Don't be fooled by the hype that vendors are attaching to the blanket term Web service. A lot of work has been done, but a lot still remains.

In this chapter, we looked at Web services from the perspective of usability instead of as a programming topic.
We examined the key operations you might want to accomplish with a Web service and the core code that makes this happen. We did not touch on topics such as state management, authentication, and service discovery, which are bread and butter for serious Web service developers. Instead, we focused on comparing Web services with .NET Remoting.

In Chapter 14 and Chapter 15, we'll address some ancillary topics related to application interoperability. One of these topics regards the use of XML data from the client side of a Web application, specifically an ASP.NET application.


Further Reading

This chapter provides an essential introductory reference to .NET Framework Web services; for a thorough guide, have a look at Scott Short's Building XML Web Services for the Microsoft .NET Platform (Microsoft Press, 2002). Concrete examples covering a possible .NET Framework implementation of the mobile code feature can be found in the article "Using an Eval Function in Web Services," in the September 2002 issue of MSDN Magazine.

For more information about Web service–related standards, here are some useful URLs: You'll find the SOAP specification at http://www.w3.org/TR/soap. The UDDI official Web site is http://www.uddi.org. From that Web site, I recommend the "UDDI Executive White Paper," which is available for download at http://www.uddi.org/pubs/uddi_executive_white_paper.pdf. Notes about the WSDL standard can be found at http://www.w3.org/tr/wsdl. Finally, if you need an introduction to the WS-Security initiative, get a copy of the June 2002 issue of MSDN Magazine and read the "XML Files" column.


Chapter 14: XML on the Client

Overview

All the technologies and programming interfaces we've looked at up to now work regardless of the surrounding environment, be it the Microsoft Windows desktop, an MS-DOS console, or a Web server. As long as the Microsoft .NET Framework is available, XML-based code works just fine. When you move on to Web applications, however, things change a little bit. Using XML on the client side of a Web application poses a few extra problems and affects the browsers you can use.

In this chapter, you'll learn how to embed XML data in the body of server-side generated HTML pages and how to access that data using script code on the client. To do this, you don't need managed code or the XML classes of the .NET Framework. We'll also investigate a little-used feature of the .NET Framework and Component Object Model (COM) interaction and import a Windows Forms application into an HTML page as a special type of Microsoft ActiveX control. Finally, we'll review the possible ways to make the embedded Windows Forms application access the XML data nested in the same HTML page.

To use this chapter's Web applications included with the book's sample files, follow this procedure:

1. Copy the EmbReaders subfolder to your Web server's root (usually c:\inetpub\wwwroot).
2. Create an IIS virtual folder named EmbReaders, and point it to the preceding folder.
3. Point your browser to the dataisland.aspx and dataislandstep2.aspx files in the EmbReaders IIS virtual folder.

XML Support in Internet Explorer

Internet Explorer versions 5.0 and later provide good support for XML on the client. Among the supported features are direct browsing and data islands. Direct browsing is the browser's ability to automatically apply an Extensible Stylesheet Language Transformation (XSLT) to the XML files being viewed. In particular, Internet Explorer uses a default, built-in style sheet unless the document points to a specific style sheet. The default style sheet produces the typical tree-based view of nodes you're familiar with. If, as mentioned in Chapter 7, the XML document includes its own style sheet reference (the xml-stylesheet processing instruction), the direct browsing function automatically applies the style sheet and displays the resulting HTML code.

A data island is an XML document that exists within an HTML page. In general, a data island can contain any kind of text, not just XML text. Since version 5.0, Internet Explorer provides extra support for XML data islands. If you use the special <xml> tag to wrap the text, the browser automatically exposes the contents as an XML Document Object Model (XML DOM) object and allows you to script against the document. The XML DOM object is expressed as a COM object created by the MSXML parser.
The advantage for developers is that the XML data travels with the rest of the page and doesn't have to be loaded using ad hoc script or through another tag. On the other hand, because the XML data is an integral part of the page, the size of the page itself grows. Determining the best way to include XML data for client-side processing is application-specific, but the <xml> tag certainly represents an interesting and compelling option.
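The wrapping described above can be sketched as a small string-building helper. This is only an illustration of the shape of a data island: the buildDataIsland function, the xmldoc ID, and the employee element names are assumptions made for the example, and only the <xml> wrapper with its id attribute comes from the text.

```javascript
// Hypothetical helper: wraps XML text in the <xml> data island tag.
function buildDataIsland(id, xmlText) {
  // A nested <xml> tag would end the island early, so reject it up front.
  if (/<\/?xml[\s>]/i.test(xmlText)) {
    throw new Error("XML text must not contain a nested <xml> tag");
  }
  return '<xml id="' + id + '">\n' + xmlText + '\n</xml>';
}

// Illustrative content; the element names are made up for this sketch.
var island = buildDataIsland(
  "xmldoc",
  "<Employees><Employee>" +
  "<lastname>Davolio</lastname><firstname>Nancy</firstname>" +
  "</Employee></Employees>"
);
console.log(island);
```

The check for a nested <xml> tag reflects the fact that Internet Explorer treats an inner end tag as the end of the island, so anything after it would leak into the visible page.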


The Data Island (<xml>) Tag

The <xml> tag marks the beginning of a data island, and the ID attribute provides the name you use to reference the XML DOM object. The XML text can be inserted in the data island either in-line or through an external reference to a URL. The following code snippet shows an XML data island with in-line text:

DavolioNancy

The <xml> tag simply wraps the XML data; it is not part of the data. Internet Explorer does not throw an exception if the XML text is not well-formed, but if the XML data is not well-formed, the MSXML parser fails to load it, and no XML DOM object is made available to client-side scripts.

The following code snippet demonstrates the use of the src attribute with the <xml> tag. If this attribute is specified, the XML data from the specified URL and the host page are downloaded separately.

The contents of the XML data island are not displayed as a portion of the page. This means that if you attempt to view any of the preceding HTML pages using Internet Explorer, an empty page will be displayed. In fact, the pages have no contents other than the data island.

Note
The XML data island should not include a nested <xml> tag. If this happens, no error is returned, but the nested </xml> end tag closes the data island's open <xml> tag.
As a result, the XML text that follows the nested element becomes part of the HTML body and is treated as displayable contents.

The Role of the MSXML Parser

Internet Explorer uses the COM-based MSXML parser to load the contents of the XML data island into a programmable XML DOM object. The parser is included in the Internet Explorer installation, so for this feature to work, you don't have to install an additional tool. Of course, the availability of a client-side XML parser is a necessary condition for handling XML data on the client.

In the next section, we'll review alternative ways to embed nondisplayable XML data in HTML pages. Some of these tricks also work with Internet Explorer 4.0 and old Netscape browsers. Bear in mind, however, that although you can figure out several ways to embed XML data in HTML pages, you always need a client-side, script-accessible XML parser to consume that data effectively. COM objects and Java classes are probably the most popular and broadly available tools to process client-side XML. In this chapter, we'll look at a third approach that requires the availability of the .NET Framework.

Accessing Data Islands Through Script

Let's expand the previously created HTML pages with some script code to see what's needed to programmatically access the embedded XML data island. The following HTML page contains a button that, when clicked, prompts you with the XML contents of the data island:

DavolioNancy
function getDataIsland() {
    alert(xmldoc.XMLDocument.xml);
}

Figure 14-1 shows the page in action.


Figure 14-1: Extracting and displaying the contents of the XML data island.

As mentioned, when Internet Explorer encounters the <xml> tag, it extracts the XML data and initializes an XMLDOMDocument COM object. The object is created and returned by an internal instance of the MSXML parser. Internet Explorer calls the loadXML method on the parser and initializes the XML DOM object using the data island contents. The document instance is then added to the HTML object model and made available to scripts via the document.all collection, as shown here:

    var doc = document.all("xmldoc");

The document.all property is a name/value collection that contains all the elements found in the HTML page. To simplify coding, Internet Explorer also provides an object instance named after the ID of the data island. The data island contents can be referenced using either the document.all collection or the property with the same name as the ID.

Once you hold the reference to the data island, you use the XMLDocument property to access the actual contents, as shown here:

    var dataIslandText = xmldoc.XMLDocument.xml;

This expression demonstrates how to access the entire XML text stored in the data island.
If you need to access a subset of the XML DOM object, you can narrow the set of nodes by using an XPath query or by moving to a particular root node.

Handling Parsing Errors

If errors occur during the parsing of the data island contents, Internet Explorer does not raise exceptions; any error is silently trapped and a null object is returned. The code shown in the previous section for accessing the data island does not produce run-time errors in the case of badly formed XML text, but an empty string is returned. To check for errors, use the parseError property of the XMLDOMDocument object.

The parseError property is a reference to an XMLDOMParseError object. The XMLDOMParseError object returns information about the last parser error. This information includes the error number, line number, character position, and a text description.
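The fields listed above can be combined into a single diagnostic message. The sketch below assumes the MSXML property names errorCode, line, linepos, and reason; the describeParseError helper and the stand-in object with its values are made up so that the code can run outside Internet Explorer.

```javascript
// Hypothetical helper: builds one message from an MSXML-style parse-error object.
function describeParseError(parseError) {
  if (parseError.errorCode === 0) {
    return ""; // zero means the last parse succeeded
  }
  return "Error " + parseError.errorCode +
         " at line " + parseError.line +
         ", position " + parseError.linepos +
         ": " + parseError.reason;
}

// Stand-in with made-up values, used here in place of xmldoc.parseError.
var sampleError = { errorCode: -1, line: 3, linepos: 12, reason: "Tag mismatch" };
console.log(describeParseError(sampleError));
// prints: Error -1 at line 3, position 12: Tag mismatch
```

In a page whose data island has the ID xmldoc, the call would be describeParseError(xmldoc.parseError), and an empty result would mean the island was parsed without errors.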


The following code shows a version of the script code from the previous section modified to provide error handling:

function getDataIsland()
{
    if (xmldoc.parseError.errorCode == 0)
        alert(xmldoc.XMLDocument.xml);
    else
        alert("ERROR: " + xmldoc.parseError.reason);
}

Note
All the code we've looked at up to now as part of a static HTML page can be dynamically generated by Active Server Pages (ASP) or Microsoft ASP.NET code. Later in this chapter, in the section "Creating Data Islands in ASP.NET," on page 603, we'll examine ASP.NET pages that produce HTML code with child XML data islands.

Other Ways to Embed XML Data

The main reason for embedding XML data in a special tag is that an XML document is formed by a sequence of markup delimiters that in most cases are unrecognized by a Web browser. By using a special tag like the <xml> tag, you instruct the browser to treat the embedded information in an appropriate way. Note that although an XML data island is a general concept, the special <xml> tag is a peculiarity of Internet Explorer versions 5.0 and later. Other browsers, including older versions of Internet Explorer, don't support the <xml> tag and don't provide alternative specific tags.

Normally, Web browsers ignore any tag they encounter that is not part of the predefined HTML vocabulary. Most browsers don't raise errors; instead, they send all the text found between the start and end tags in the main body of the page.
Consider the following HTML page:

Hello, world

This page produces the following output when viewed with Internet Explorer 5.0 and Netscape Communicator 4.5 and later versions. Neither browser recognizes the tag; they simply ignore the tag and inject the inner text in the body.

Hello, world

Data islands let you embed external blocks of data so that they have no impact on the final page being rendered but are accessible programmatically. In other words, the contents of a data island must be invisible to the user but not to the other child components of the page.


Let's look briefly at how to simulate data islands with Internet Explorer 4.0 and older HTML 3.2 browsers such as Netscape 4.x. This information will be useful if you create ASP.NET pages with embedded islands of data that can be viewed through a variety of browsers.

Data Islands in Internet Explorer 4.0

Internet Explorer 4.0 already provides great support for Dynamic HTML (DHTML). For our purposes, this means that once you've assigned an ID to a tag, you can later retrieve the tag by name and run a script against it. Internet Explorer 4.0 also provides good support for cascading style sheets (CSS), which means that you can use ad hoc attributes to control the visibility style of any tag you want.

If you plan to embed XML text in an HTML page using an ordinary tag, keeping the text invisible is only half the task. The key is forcing the browser not to process the embedded text as HTML. In Internet Explorer 4.0, the tag is one of few that offers this capability. When you combine display styles and implicit ID-based object references, you can write code similar to the following:

XML data island

You wrap the XML code in any HTML or custom tag you want, making sure to assign it a unique ID and set the CSS display attribute to none. As a result, the contents of the XML data island will be accessible through the expression shown here and, more important, won't affect the page rendering:

    xmldoc.innerHTML

What you get using this technique is not an XML DOM object, however, but a plain string.
Initializing a valid XML DOM object and actually parsing and manipulating the XML contents is completely up to you.

Using Hidden Fields

HTML 3.2-compliant browsers make things slightly more difficult. You can't count on CSS support, and you can't expect to find a rich object model attached to all tags. A good compromise can be assigning the XML source code to an INPUT control marked as hidden, as shown here:

Assigning a name attribute to the INPUT tag lets you retrieve the XML code later through the following code:

    oForm = document.forms[0];
    oInput = oForm["xml"];
    alert(oInput.value);

Be sure to use the exact case for names, and be sure to wrap the INPUT tag in a FORM tag. Neither is necessary with Internet Explorer, but Netscape's browsers require both.
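One detail worth keeping in mind when XML goes into a hidden field: the text sits inside an attribute value, so quotes, ampersands, and angle brackets have to be escaped, and the browser decodes them again when script reads the value property. The sketch below shows one way to produce such markup; escapeAttribute, buildHiddenIsland, and the sample XML are illustrative names and data, not part of the book's sample code.

```javascript
// Escapes text so it can sit inside a double-quoted HTML attribute value.
function escapeAttribute(text) {
  return text
    .replace(/&/g, "&amp;")   // ampersands first, so later entities stay intact
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: emits the hidden INPUT inside the FORM
// that Netscape browsers require around it.
function buildHiddenIsland(name, xmlText) {
  return "<form>" +
         '<input type="hidden" name="' + name + '" value="' +
         escapeAttribute(xmlText) + '">' +
         "</form>";
}

// Illustrative data only.
console.log(buildHiddenIsland("xml", '<book title="XML">data</book>'));
```

With markup built this way, the script shown above (document.forms[0]["xml"].value) returns the original, unescaped XML string.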


Note
In general, you can name the outer form as well and use the name to select the particular form that contains the hidden field. However, bear in mind that if you use this technique from within ASP.NET pages, only one form is available.

The <script> Tag

Another possible trick for embedding XML data in an HTML page entails using the <script> tag. There are two possible ways of overloading the <script> element so that it accepts XML contents. The approaches differ in the trick they use to inform the <script> tag that it is actually handling XML data.

You can use the language or the type attribute. Set the language attribute to xml, or set the type attribute to text/xml, as shown in the following code:

XML content here
XML content here

You can also reference the XML data through the src attribute by making the attribute point to an external URL, as shown here:

In all these cases, you should give the tag a unique ID and use it to access the XML data either directly or through the document.all collection.

Note
Overall, if you can control the version of the client browser, the <xml> tag is by far the preferable and more flexible solution. Otherwise, I suggest that you embed any XML data in a hidden field.

Creating Data Islands in ASP.NET

To create data islands in ASP.NET, you can use the <asp:xml> server control to inject XML code in the body of the HTML <xml> tag.
We saw this technique in action in Chapter 7 when we examined XSLT and used the <asp:xml> control to apply server-side transformations. The <asp:xml> control can also be used to inject plain XML code without any preliminary transformation.

The following code demonstrates an ASP.NET page that is functionally equivalent to the HTML page discussed in the previous section. The page creates a couple of data islands by importing the contents of a local XML file and then using a hidden field. The page contains two buttons bound to client-side scripting to read the XML source.

void Page_Load(object sender, EventArgs e)
{
    button1.Attributes["onclick"] = "getDataFromXmlTag()";
    button2.Attributes["onclick"] = "getDataFromHiddenField()";
    RegisterHiddenField("xml", "my data");
}

function getDataFromXmlTag() {
    // Get the data island content from the IE5+ <xml> tag
    if (xmldoc.parseError.errorCode == 0)
        alert(xmldoc.XMLDocument.xml);
    else
        alert("ERROR: " + xmldoc.parseError.reason);
}

function getDataFromHiddenField() {
    // Get the data island content from a hidden field
    oForm = document.forms[0];
    oInput = oForm["xml"];
    alert(oInput.value);
}

Creating Data Islands


To create a hidden field, you can use the plain INPUT HTML tag with the type attribute set to the hidden keyword. In ASP.NET, however, you can also use the new RegisterHiddenField method exposed by the Page object. The advantage of this technique is that you can create and add the field dynamically. The following code shows how it works:

    RegisterHiddenField("xml", "my data");

The method takes two arguments: the unique name of the input field and the contents to be output. When the method executes, no actual HTML code is generated, but a reference is added to an internal collection to keep track of the hidden fields to be created. The hidden input field is actually added to the output when the HTML code for the page is rendered.

Embedding .NET Framework Components in Internet Explorer

The one key reason for creating data islands or, more generally, for embedding XML data in the folds of an HTML page is to cache data on the client to outfit some of the controls on the page. In the previous section, we saw how to embed a data island and how to retrieve its contents. Once retrieved, the XML data can be passed on to client-side components for further processing or can be manipulated via script.
As you can imagine, the latter option is less effective because it is based on interpreted code and because, in general, script languages aren't particularly rich in programming features.

So far, COM objects and Java classes have been the most popular technologies used by developers to write client-side components running in the context of Web pages. COM objects and Java classes can be passed, or can directly access, XML data stored in embedded blocks and can then apply some business logic. Both COM objects and Java classes require special support from the browser.

The advent of the .NET Framework added a third option to this list. In addition to writing COM components (including ActiveX controls) or Java classes (including applets), you can now write Windows Forms controls and embed them in HTML pages and ASP.NET-generated Web forms.

In the rest of this chapter, we'll examine the foundation of Windows Forms controls and the tools and techniques you need to know to embed these controls in HTML pages. Next we'll build a sample control that imports the contents of a data island, parses the XML text using a .NET Framework reader, and finally displays the resultant data through a data-bound control.

Building Windows Forms Controls for HTML Pages

Internet Explorer versions 5.5 and later support a special syntax for the <object> tag that lets you embed managed objects in Web applications.
The object must be an instance of a class that inherits from the System.Windows.Forms.Control class either directly or indirectly. The assembly that contains the class is downloaded to the client if it is not already cached. Of course, for this feature to work, the .NET Framework must be installed on the client.

The following code shows how to embed a .NET Framework user-defined class into a Web page:


    <object id="grid"
        classid="http:DataListView.dll#XmlNet.CS.DataListView"
        height="300" width="100%">
    </object>

The id attribute identifies the instance of the control, whereas the width and height attributes specify the dimensions of the control's site. The key attribute to consider is classid. Normally, classid identifies the CLSID of the COM object or the ActiveX control to embed. Its typical syntax consists of the keyword clsid followed by a colon and the text representation of the object's CLSID:

    classid="clsid:[CLSID]"

Since version 5.5, Internet Explorer supports an extended format that looks like this:

    classid="http:[assembly URL]#[full class name]"

To instruct the browser to download the DataListView assembly from the root of the virtual directory, use the following code snippet:

    classid="http:DataListView.dll#XmlNet.CS.DataListView"

The class to instantiate is XmlNet.CS.DataListView. The class must be referenced with its fully qualified name. The assembly doesn't necessarily have to be a DLL; it can be an EXE file instead.

Note
The size of the object must be set explicitly; otherwise, the control will not be displayed in the HTML page. The size can be specified in one of two ways: you can set the width and height attributes of the <object> tag, or you can indicate a size in the control class constructor.

Locating Assemblies

The HTML document can provide information about the locations of the assemblies to download as well as a configuration file in which additional information can be stored. Applications hosted in Internet Explorer indicate the location of the configuration file through a <link> tag, whose href attribute indicates the URL of the configuration file.
By default, Internet Explorer creates a unique application domain (AppDomain) over the entire site that contains the HTML page, which means that all the managed components involved run in the same AppDomain. This is not necessarily a bad thing; however, it is a setting that can be overridden using configuration files. When a configuration file is specified, all pages that point to the same file are created in the same domain.

All dependent assemblies should be available in the same directory as the control (that is, at the URL indicated through the classid attribute). If needed, however, you can download assemblies from other Web sites using the <codeBase> setting in a configuration file. The <codeBase> setting specifies where the common language runtime (CLR) can find a needed assembly. The syntax of the <codeBase> setting is shown here:

    <codeBase version="[assembly version]" href="[URL of assembly]" />


To load assemblies from directories other than the application base directory, you can resort to the <probing> element in the configuration file. In this case, you dictate that the runtime searches for assemblies in the listed subdirectories of the application base. The application base is the directory that contains the configuration file or, if no configuration file is used, the directory that contains the control.

Note
If your control references only assemblies stored in the global assembly cache, you don't need to take any additional measures. Those assemblies are always correctly located.

Setting Up the Virtual Directory

To successfully test HTML pages that contain managed controls, you should create an ad hoc virtual directory and access the page through Internet Information Services (IIS). In other words, you can't simply prepare an HTML document and double-click it from Windows Explorer.

In addition, the virtual directory must have the Execute Permissions setting configured to Scripts Only, as shown in Figure 14-2.

Figure 14-2: The virtual directory for the page that embeds a managed control must be configured to run only scripts.

The reason for this is that if you configure Execute Permissions to Scripts And Executables, IIS will be fooled by the assembly's .dll or .exe extension and will treat the control's assembly as an ISAPI application. As a result, the control won't be hosted by the browser.

A Data Display Custom Control

The browser control class must be derived from Control or from another Control-derived class. The control can't be a form or any other Form-derived type. In addition, the control class must be publicly accessible and must contain a public default constructor that takes no parameters. Aside from these requirements, a browser-embeddable control is nothing special and does not require you to take any particular steps other than those you would take for any other kind of Windows Forms control.

Let's build a sample control named DataListView and make it inherit from the Windows Forms ListView control. We will also add a new method that receives an XML string and loads the parsed text into a DataSet object. If successful, the DataSet object will then be used to populate the view. The input XML string can be set programmatically from any source and in particular can be extracted from a data island.

The DataListView Control

The DataListView class inherits from ListView, but unlike the parent class, it always works in Details mode. The view mode and the font are set during the initialization phase. The following code is invoked from within the constructor:

    protected void SetupControl()
    {
        this.View = View.Details;
        this.Font = new Font("Verdana", 8f);
        this.FullRowSelect = true;
    }

Although the control is automatically configured to work in Details mode, no columns are added to the view until the user interface is populated with data. The Details view provides clickable columns of data arranged in a grid.

Populating the Control's User Interface

Load is the key method of the DataListView control.
It is also the only extension made to the programming interface of the parent class, as shown here:

    public void Load(string xmldata)

The Load method expects to receive an XML string that can be successfully parsed into a DataSet object. The resultant object, if any, is used to populate the ListView class. Unlike other list controls, the ListView class does not fully support the .NET Framework's complex data binding. In fact, the ListView class does not provide a DataSource property. To populate its user interface with data read out of a data-bindable object, you must loop through the rows and update the list items yourself.

The following code illustrates the behavior of the Load method:

    public void Load(string xmldata)
    {
        DataSet ds = new DataSet();
        StringReader reader = new StringReader(xmldata);
        ds.ReadXml(reader);
        reader.Close();

        // Store the current data source and its view object
        m_data = ds.Tables[0];
        m_viewOfData = new DataView(m_data);

        // Add columns
        this.Columns.Clear();
        for(int j=0; j<m_data.Columns.Count; j++)
        {
            int size = 130;
            this.Columns.Add(m_data.Columns[j].ColumnName,
                size, HorizontalAlignment.Left);
        }

        // Add rows
        FillTable();
    }

The first task accomplished is transforming the input XML data into a DataSet object. The XML data is read and parsed by the ReadXml method of the DataSet object. ReadXml normally works on streams and files, but you can make it work on a string if you pass the string through a StringReader object.

Once the input XML data has been transformed into a DataSet object, the first table in the DataSet object is extracted and its columns and rows processed. (In this example, the control arbitrarily processes only the first table.) For each column in the table, the DataListView control creates and adds a new column with default settings and size. Next the table rows are enumerated. Each row becomes a new line in the ListView object. The first column maps to the ListView primary item; the other columns are rendered as ListView subitems, as shown in the following code:

    private void FillTable()
    {
        // Clear existing rows
        this.Items.Clear();

        // Add new rows
        for(int i=0; i …
    }
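The parse-and-populate steps described above can be sketched as follows. This is a minimal illustration, not the book's listing: the class name DataListViewSketch and the method ParseFirstTable are hypothetical, and the loop over the DataView is an assumption about how the rows are enumerated. It relies only on DataSet.ReadXml, DataView, and the ListViewItem subitem collection.

```csharp
using System.Data;
using System.IO;
using System.Windows.Forms;

// Hypothetical helper class; the names are assumptions made for illustration.
public static class DataListViewSketch
{
    // Parse an XML string into a DataSet and return a view over its first table.
    // Throws if the XML does not yield at least one table.
    public static DataView ParseFirstTable(string xmldata)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xmldata))
        {
            ds.ReadXml(reader);   // schema is inferred from the XML text
        }
        return new DataView(ds.Tables[0]);
    }

    // Rebuild the rows of a ListView from the (possibly sorted or filtered) view.
    public static void FillTable(ListView list, DataView view)
    {
        list.Items.Clear();
        int columnCount = view.Table.Columns.Count;

        for (int i = 0; i < view.Count; i++)
        {
            DataRowView row = view[i];

            // The first column becomes the primary item text
            ListViewItem item = new ListViewItem(row[0].ToString());

            // The remaining columns become subitems
            for (int j = 1; j < columnCount; j++)
                item.SubItems.Add(row[j].ToString());

            list.Items.Add(item);
        }
    }
}
```

Reading rows from the DataView rather than from the DataTable means that any Sort or RowFilter setting applied to the view is reflected the next time the list is refilled.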


Accessing the Data Island Contents

In the previous section, we learned how to extract the contents of an XML data island, regardless of the technique that was used to store it in an existing HTML page. The content of an XML data island is a plain string and as such can be passed on to the Load method for further processing. The following code demonstrates how:

    function getDataFromXmlTag()
    {
        // Get the data island content from the IE5+ <xml> tag
        if (xmldoc.parseError.errorCode == 0)
        {
            var g = document.all("grid");
            var data = xmldoc.XMLDocument.xml;
            g.Load(data);
        }
        else
            alert("ERROR: " + xmldoc.parseError.reason);
    }

The content of the data island is extracted, parked in a temporary variable, and then passed on to Load. In the next section, we'll see a sample page in action.

Adding Sorting and Filtering Capabilities

To make the DataListView control even more useful, you can add advanced view capabilities. Adding sorting and filtering capabilities to the DataListView control is surprisingly simple thanks to the programming power of the .NET Framework. To add sorting and filtering features, you use the Sort and RowFilter properties of the embedded DataView object.

Data sorting is triggered when the user clicks on the column's header.
The base ListView control already provides the ColumnClick event and an ad hoc delegate (the ColumnClickEventHandler class) to handle the event, as shown in the following code. The event data, gathered in the ColumnClickEventArgs class, provides a Column member that indicates the zero-based index of the column clicked. The actual sorting of the displayed data is up to you.

    // Execute when the user clicks on a column's header
    private void SortData(object sender, ColumnClickEventArgs e)
    {
        // Prepare a view with sorted data
        PrepareSortedDataView(e.Column);

        // Refresh the view to reflect sorting
        FillTable();
    }

    // Configure the internal DataView to support sorting
    private void PrepareSortedDataView(int colPos)
    {
        // Set the column to sort by
        m_viewOfData.Sort = m_data.Columns[colPos].ColumnName;

        // Arrange the auto-reverse sorting
        if (m_columnSorted == colPos)
        {
            // If the same column is clicked twice,
            // invert the direction (note the leading space)
            m_viewOfData.Sort += " DESC";
            m_columnSorted = -1;
        }
        else
            // Store the index of the currently sorted column
            m_columnSorted = colPos;
    }

Implementing row filtering is even easier. You simply expose a read/write property called, say, RowFilter and make it work as a wrapper around the DataView object's RowFilter property, as shown here:

    private string m_rowFilter = "";
    public string RowFilter
    {
        get {return m_rowFilter;}
        set
        {
            // Store the filter string
            m_rowFilter = value;

            // Pass the information on to the DataView
            m_viewOfData.RowFilter = m_rowFilter;

            // Refresh the view
            FillTable();
        }
    }

What we have built so far is a ListView-based control that features data-binding functionality along with advanced capabilities for sorting and filtering the data. This control can be initialized from an XML string that can be deserialized to a DataSet object. The DataListView control can be used with any Windows Forms application, but when embedded in an HTML or ASP.NET page, the programming interface lends itself very well to filling the control with the contents of an XML data island.
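To see the auto-reverse sorting and the filter wrapper in isolation, the following sketch applies the same logic to a stand-alone DataView. The class name SortToggleSketch, the method names, and the sample rows are assumptions made for illustration only; the only framework features used are the Sort and RowFilter properties of DataView.

```csharp
using System.Data;

// Hypothetical helper class; names and sample data are illustrative assumptions.
public static class SortToggleSketch
{
    // Sorts the view by the given column. If the same column was the last one
    // sorted, the direction is reversed. Returns the column index to remember
    // for the next click (-1 right after a reversal).
    public static int ApplySort(DataView view, int colPos, int lastSorted)
    {
        string name = view.Table.Columns[colPos].ColumnName;
        if (lastSorted == colPos)
        {
            // The space before DESC is required: "lastname DESC"
            view.Sort = name + " DESC";
            return -1;
        }
        view.Sort = name;
        return colPos;
    }

    // Builds a small in-memory table to experiment with.
    public static DataView SampleView()
    {
        DataTable t = new DataTable("Employees");
        t.Columns.Add("lastname", typeof(string));
        t.Columns.Add("country", typeof(string));
        t.Rows.Add(new object[] { "Fuller", "USA" });
        t.Rows.Add(new object[] { "Buchanan", "UK" });
        t.Rows.Add(new object[] { "Davolio", "USA" });
        return new DataView(t);
    }
}
```

A caller would keep the returned index between clicks: the first call to ApplySort(view, 0, -1) sorts by lastname in ascending order, and a second call with the returned value sorts the same column in descending order. Setting view.RowFilter to an expression such as country = 'USA' then narrows the rows without touching the underlying table.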


Tip
To significantly improve your programming experience when developing browser-embeddable Windows Forms controls, you might want to create a simple test application that hosts the control. Only when the control works as expected should you write the test HTML or ASP.NET page. Testing a control embedded in Internet Explorer can be quite frustrating because the CLR does not download again assemblies that are already in the cache. This means that you have to physically empty the assembly cache or replace the local copy of the assembly before you can see changes in action.

Putting It All Together

The DataListView control is an effective tool for displaying a snapshot of data cached on the client. An ASP.NET page that makes use of the DataListView control differs in some respects from any other ordinary data-bound ASP.NET page. First and foremost, using the DataListView control or similar controls in a Web application requires a rich client such as Internet Explorer and requires that the .NET Framework be installed on the client. As you might expect, these requirements make such a Web application more suitable for controlled environments like an intranet than for the Internet.

On the other hand, caching data on the client allows you to page through data, as well as sort and filter rows, without repeated access to the database and without tying up Web server memory with server-side cached objects.
Writing a browser-managed control also lets you exploit the power of the .NET Framework on the client, although with some limitations. The DataListView control will run as partially trusted code, and although the control can administratively receive more privileges and permissions, the core code you write should not presume itself to be more than a partially trusted application. In particular, this means that file I/O should be avoided to the extent that it is possible and replaced with isolated storage whenever data persistence becomes a strong necessity.

Note
The GetSalesReportBarChart method of the Web service built in Chapter 13 creates the JPEG image that represents the chart as an in-memory image just to avoid security restrictions for file I/O. For the most part, the location of the assembly determines the restrictions it will be subject to. Locations are articulated in zones, including MyComputer, Intranet, and Internet.

Registry, clipboard, and network access are restricted also. Network access is restricted to the URL from which the control's assembly was downloaded. Printing is allowed only through the Windows Forms common dialog box, and no direct access to the resource is permitted. Finally, both run-time and XML serialization are considered restricted functionalities whose full access is reserved for fully trusted applications.

With these considerations in mind, let's finalize the DataListView control and build an ASP.NET page that makes use of it. A sneak preview of the final page is shown in Figure 14-3.
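As a rough illustration of the advice to prefer isolated storage over direct file I/O, the sketch below saves and reloads an XML string through the System.IO.IsolatedStorage classes. The class name IsolatedCacheSketch and its method names are hypothetical, and the choice of the per-assembly user store is an assumption: which isolation scope a downloaded control is actually granted depends on the security policy in force on the client.

```csharp
using System.IO;
using System.IO.IsolatedStorage;

// Hypothetical helper class; names and store scope are illustrative assumptions.
public static class IsolatedCacheSketch
{
    // Writes the XML text to a file in the isolated store, replacing any old copy.
    public static void Save(string fileName, string xmldata)
    {
        using (IsolatedStorageFile store = IsolatedStorageFile.GetUserStoreForAssembly())
        using (IsolatedStorageFileStream stream =
            new IsolatedStorageFileStream(fileName, FileMode.Create, store))
        using (StreamWriter writer = new StreamWriter(stream))
        {
            writer.Write(xmldata);
        }
    }

    // Returns the stored XML text, or null if the file does not exist.
    public static string Load(string fileName)
    {
        using (IsolatedStorageFile store = IsolatedStorageFile.GetUserStoreForAssembly())
        {
            if (store.GetFileNames(fileName).Length == 0)
                return null;

            using (IsolatedStorageFileStream stream =
                new IsolatedStorageFileStream(fileName, FileMode.Open, store))
            using (StreamReader reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

The code never names a physical path; the runtime maps the file name to a location inside the store, which is what makes this approach workable for code that lacks general file I/O permission.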


Figure 14-3: An ASP.NET page that creates and consumes an XML data island.

Serializing DataSet Objects to Data Islands

The sample page shown in Figure 14-3 is named dataisland.aspx and is available in this book's sample files, along with the source code for the DataListView control. The following code shows the body of the page. Key parts of the code are shown in boldface, in particular the data island definition and the managed control declaration.

Consuming Data Islands


The data island is created using the <asp:xml> server control, which reads a previously created XML file. The employees.xml file is simply the XML normal form of a DataSet object. The DataSet object is serialized to the data island, and the page is sent to the browser. On the client, some JavaScript code takes care of extracting the data island contents as XML text and passing it on to the Load method of the managed control. Internally, the Load method rebuilds the DataSet object and uses it to populate its own user interface. Figure 14-4 shows the ASP.NET page in action, with a filter applied and with the data sorted in ascending order by last name.

Figure 14-4: Sorting and filtering data on the client.

Note
When embedding script code in Web pages to be consumed over the Internet, you should use the JavaScript language to reach the widest possible range of browsers. VBScript is limited to Internet Explorer. In this example, however, we're making serious assumptions about the capabilities of the client: .NET Framework installed, support for the extended syntax of the <object> tag, and ability to host managed code. This means that your browser must be Internet Explorer 5.5 or, more likely, Internet Explorer 6.0 or later.
So in this case you can reasonably drop JavaScript in favor of VBScript.

From MSXML Documents to .NET XML Documents

When Internet Explorer detects the <xml> tag in a client page, it automatically extracts the tag's contents, creates an internal instance of the MSXML parser, and makes the data available through an XMLDOMDocument object. Note that XMLDOMDocument is not a managed object created from any of the .NET Framework classes but rather an instance of a COM object that constitutes the XML DOM representation of the data island contents. The following pseudocode, written in JScript, illustrates this point; the variable xmldoc is an XMLDOMDocument object.

    // Extract the data island contents
    // xmldoc is the ID of the <xml> tag
    var xmldata = document.all("xmldoc").innerHTML;

    // Instantiate MSXML
    var parser = new ActiveXObject("Microsoft.XMLDOM");

    // Parse the contents of the data island and make
    // it available as an XML DOM object. The object is given
    // the same name as the <xml> tag's ID.
    // (loadXML itself returns a Boolean success flag.)
    parser.loadXML(xmldata);
    var xmldoc = parser;

If your final goal is consuming the data island within the body of a managed control, there is no need to pass through a COM-based intermediate representation of the XML data. In this case, in fact, the parser that will actually process the data is the .NET Framework XML reader. The reader needs only a string of XML data, not a COM object. On the other hand, whenever you use the <xml> tag, Internet Explorer automatically creates the XMLDOMDocument object. So if the final destination of the data island is a Windows Forms control, you might want to speed things up a little by not using the <xml> tag, which will produce a useless COM XML DOM object. Using a hidden field offers the same functionality at a lower price. But keep in mind that this option is valid only if you plan to consume the data island contents through embedded managed code.

The Role of Script Code

To establish a connection between the host environment and the managed control, you must use script code, JavaScript in particular. For this reason, while you're designing the interface of the managed control, don't forget what the actual callers of those methods will be.
A JavaScript client has different capabilities than a .NET Framework client, so you should keep the signature of public methods as simple as possible and avoid using arrays and other complex and user-defined types.

In the dataisland.aspx sample code, the connection between the data island and the managed control is made through the Load method. The Load method accepts a simple string, which results in a signature that the JavaScript code can easily match, as shown here:

    // At this point, Internet Explorer has already created
    // the XMLDOMDocument. You can retrieve the content of the
    // data island either through the XMLDocument object or
    // the innerHTML property.
    var data = xmldoc.XMLDocument.xml;

    // Pass the data island content to the managed control
    var listView = document.all("grid");
    listView.Load(data);

Avoiding Problems with Submit Buttons

While developing the sample ASP.NET page to test the DataListView object, I ran into an interesting snag. I originally used the <asp:button> tag to insert a button to load the data island into the control. As a result, the data island was correctly read and the control filled, but a moment later the page refreshed, and the control lost its state and was displayed as empty. What happened? The reason for this strange behavior is that the <asp:button> tag always generates a submit button, that is, an <input> element whose type attribute is set to submit.


As a result, the page first executes the client-side script associated with the HTML button and fills the control with the XML data. Next the browser posts the page back to the server as the submit button type mandates. This behavior is undesired for a couple of reasons. First, it produces an unneeded round-trip to the Web server. Second, the round-trip cancels the changes to the user interface that have been made on the client and that constitute the core of our efforts and our main reason for building and using a managed control. On the other hand, the Windows Forms control is not a server-side control and does not have access to the ViewState property to control its state when the page posts back.

This problem has a simple workaround: don't use the <asp:button> tag to insert a button that is expected to interact with the managed control through client-side script code. Instead, use the <input> tag and explicitly set the type attribute to button. Also, don't set the runat attribute; if you do, the onclick attribute will be mistaken for server-side code to be executed. In this way, the browser executes the associated client-side script code and refreshes the page accordingly, but no postback occurs.

Using Hidden Fields and SQL Queries

Despite the fact that the <xml> tag is the official way of defining XML data islands with Internet Explorer, a hidden field is probably a better solution. With a hidden field, Internet Explorer doesn't preprocess the XML data into a COM-based XML DOM object.
Such preprocessing is welcome if you are going to process the XML data using script code. No parsing is needed if you only plan to pass the XML data island to a managed control, however. Using a hidden field or a hidden tag is a valid approach to inserting XML data in the body of an HTML page.

The following code illustrates how to create a hidden field that contains dynamically generated XML data. The data is the output you get from the XML normal form of a DataSet object. In this sample code, the DataSet object is obtained by running a query against the Customers table in the Northwind database.

    private void Page_Load(object sender, EventArgs e)
    {
        if (!IsPostBack)
        {
            string xmldata = GetDataAsXml();
            RegisterHiddenField("xml", xmldata);
        }
    }

    private string GetDataAsXml()
    {
        // SqlDataAdapter lives in the System.Data.SqlClient namespace
        SqlDataAdapter adapter = new SqlDataAdapter(
            "SELECT customerid, companyname, contactname, " +
            "contacttitle, city, country FROM customers",
            "SERVER=localhost;DATABASE=northwind;UID=sa;");
        DataSet ds = new DataSet();
        adapter.Fill(ds);
        return ds.GetXml();
    }

Figure 14-5 shows the sample page in action.

Figure 14-5: The sample page now shows filtered data from the Customers table. The XML data has been carried using a hidden field.

Note
Another key technique you can use to refresh the page using client-side data leverages DHTML. Although this approach can be effective and powerful, it doesn't combine well with managed code. DHTML refers to the page object model and is designed for scripting. The page object model is exposed as a suite of COM objects, and driving it from within managed code is certainly possible but not particularly easy.

Conclusion

Using XML data islands to import sensitive data into HTML pages is a technique that deserves further investigation. Creating XML data islands is easier with ASP.NET but was not rocket science even prior to the advent of the .NET Framework. Accessing the contents of a data island on the client is still based on JavaScript code, and therefore is not a feature that has been affected by the .NET Framework.
So what's the problem with using XML and the .NET Framework on the client?

The .NET Framework classes provide a far richer object model that has a lot to offer in terms of XML data manipulation, as we saw in Chapter 8, Chapter 9, and Chapter 10. Exploiting this bounty of functions on the client is possible thanks to the browser-deployable Windows Forms controls that we examined in this chapter. Code that uses XML and the .NET Framework on the client, although based on ASP.NET code, is not Internet-oriented because it imposes two key restrictions on the client environment: the browser must be Internet Explorer 5.5 (or later), and the .NET Framework must be installed on the client machine. (Because you often end up installing Internet Explorer 6.0 with the .NET Framework, this is really a single requirement.)

Passing data to managed controls is relatively easy; each component can define its own interface. However, any interaction between the user and the control can take place only through script code. Keep this in mind when you're designing the programming interface of the managed controls.

The key concept that this chapter has pursued is that you can split your Web functions and balance them between the client and the server without renouncing managed code and the power of the .NET Framework. To do so, you create a Windows Forms rich client and embed it in an HTML or ASP.NET page using the <object> tag. Next you pass server-side data (for example, the results of a SQL query) to the client using XML data islands and script code to invoke properties and methods on the managed controls.

Admittedly, the concepts illustrated in this chapter are probably not the most common way to use XML in a .NET Framework environment. In my ADO.NET and XML seminars, however, I often get questions that touch on, directly or indirectly, the use of XML in a client-side scenario.
This chapter should answer some of the most frequentlyasked questions.In Chapter 15, we'll f<strong>in</strong>ish our exam<strong>in</strong>ation of <strong>XML</strong> <strong>in</strong> the .<strong>NET</strong> Framework, <strong>in</strong>clud<strong>in</strong>gapplication configuration, the <strong>for</strong>mat of .config files, and ways to extend and customizethem.Further Read<strong>in</strong>gIn an article published <strong>in</strong> MSDN Magaz<strong>in</strong>e <strong>in</strong> June 2000 ("Creat<strong>in</strong>g and Optimiz<strong>in</strong>gPer<strong>for</strong>mance <strong>for</strong> <strong>XML</strong> Document/View Web Applications"), I discussed ways to use<strong>XML</strong> on the client us<strong>in</strong>g COM technologies. In particular, I explored <strong>XML</strong>implementations of the document/view architecture. The book <strong>XML</strong> <strong>Programm<strong>in</strong>g</strong> CoreReference (<strong>Microsoft</strong> Press, 2002) also conta<strong>in</strong>s chapters that illustrate the use of <strong>XML</strong>on the client.Internet Explorer has played a key role <strong>in</strong> this chapter as the richest browser availabletoday. You can get an <strong>in</strong>side look at the expanded capabilities of Internet Explorer 6.0through the <strong>Microsoft</strong> Internet Explorer 6 Resource Kit, (<strong>Microsoft</strong> Press, 2001).F<strong>in</strong>ally, Jason Clark's excellent piece "Code Access Security and Distribution Features<strong>in</strong> .<strong>NET</strong> Enhanced Client-Side Apps" (MSDN Magaz<strong>in</strong>e, June 2002) celebrates thereturn of the rich client <strong>in</strong> the W<strong>in</strong>dows Forms plat<strong>for</strong>m. Among other th<strong>in</strong>gs, this articlecovers .<strong>NET</strong> Framework browser controls and provides a handful of useful caveats andtips.503


Chapter 15: .NET Framework Application Configuration

Overview

To the extent that it is possible, all applications, regardless of platform, should be designed in a parametric way and should read some of their settings from an external file. Simply by updating the configuration file, developers and system administrators can change the way in which the application works as well as elements of the user interface. In Microsoft Windows 3.x, user preferences and application settings were usually stored in INI files located in the Windows folder or in the application's main directory. This practice was retained in Microsoft Win32, although since Windows 95, the system registry has become the recommended store for Win32 and Component Object Model (COM) application settings. With both INI files and the registry, however, the developer had a certain degree of freedom in designing the layout of the data. Various guidelines have been suggested over time, but in fact the structure of INI files and registry subtrees was different from one application to the next.

The Microsoft .NET Framework defines a tailor-made, XML-based API to access configuration files and, in doing so, forces developers to adopt a common, rich, and predefined schema for storing application settings. Using configuration files, administrators can control which resources a user can access, which versions of assemblies an application will use and from where, and which connection strings should be used. Configuration files can also include application-specific settings such as the buttons to be displayed on the toolbar, the size and position of controls, and other, more specific, state information. Using configuration files, you give your application a set of dynamic properties and eliminate the need to recompile every time different settings should be applied.

.NET Framework configuration files are XML files saved with the .config extension and named and located according to the type of the application. Managed code can use the classes in the System.Configuration namespace to read settings from the configuration files but not to write settings to those files. Configuration files are considered plain XML files, and appropriate XML writers should be used to edit their contents.

In this chapter, we'll delve into the .NET Framework configuration engine, reviewing the characteristics of the main classes involved and how key tasks are accomplished. We'll analyze the various types of configuration files and their overall schemas, and you'll learn how to customize a .config file with custom tags and custom contents.

Configuration Files

The .NET Framework provides three basic types of configuration files: machine, application, and security. Despite their different contents and goals, all configuration files are XML files and share the same schema. For example, all configuration files begin with a <configuration> node and then differentiate their contents and child nodes according to the final goal and the information contained. In this chapter, we'll focus primarily on application configuration files, but this section also provides a quick introduction to the other types of configuration files.


The XML Schema for Configuration Settings

As mentioned, configuration files are standard XML files that follow a particular schema. This schema defines all possible configuration settings for machine, security, and application configuration files. The .NET Framework provides you with ad hoc classes to read configuration settings, but none to write them. You need to be familiar with XML readers and writers if you want to directly edit the configuration files. (In light of this, bear in mind that XML element and attribute names are case-sensitive.)

All the configuration files are rooted in the <configuration> element. Table 15-1 lists the first-level children of the <configuration> element. Each node has a specified number of child elements that provide a full description of the setting. For example, the <system.web> element optionally contains the <authorization> tag, in which you can store information about the users who can safely access the URL resources.

Table 15-1: Children of the <configuration> Element

Element                            Description
<appSettings>                      Contains custom application settings in the specified XML format.
<configSections>                   Describes the configuration sections for custom settings. If this element is in a configuration file, it must be the first child of the root.
<mscorlib>\<cryptographySettings>  Cryptography schema; describes the elements that map friendly algorithm names to classes that implement cryptography algorithms.
<runtime>                          Run-time settings schema; describes the elements that configure assembly binding and run-time behavior.
<startup>                          Startup settings schema; contains the elements that specify which version of the common language runtime (CLR) must be used.
<system.diagnostics>               Describes the elements that specify trace switches and listeners that collect, store, and route messages.
<system.net>                       Network schema; specifies elements to indicate how the .NET Framework connects to the Internet, including the default proxy, authentication modules, and connection parameters.
<system.runtime.remoting>          Settings schema; configures the client and server applications that implement remoting.
<system.web>                       Microsoft ASP.NET configuration section schema; contains the elements that control how ASP.NET Web applications behave.
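To see which of the elements in Table 15-1 a given file actually uses, you can walk the first-level children of the root with the XML DOM. The following is a minimal sketch, not part of the .NET Framework configuration API: the class name ConfigOutline and the method GetTopLevelSections are hypothetical, and the code assumes a compiler with generics support (later than the version 1.0 tools this chapter targets).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ConfigOutline
{
    // Returns the names of the element nodes found directly
    // under the <configuration> root, in document order.
    public static List<string> GetTopLevelSections(string configXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(configXml);

        // Every configuration file is expected to be rooted in <configuration>.
        XmlElement root = doc.DocumentElement;
        if (root == null || root.Name != "configuration")
            throw new XmlException("Root element must be <configuration>.");

        List<string> names = new List<string>();
        foreach (XmlNode child in root.ChildNodes)
        {
            // Skip comments and white space; keep only elements.
            if (child.NodeType == XmlNodeType.Element)
                names.Add(child.Name);
        }
        return names;
    }
}
```

Passing the text of a configuration file to GetTopLevelSections yields names such as configSections or appSettings, which you can compare against the table. Because element names are case-sensitive, the comparison with "configuration" is deliberately exact.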


Because we're focusing on application configuration files in this chapter, for our purposes, two of these elements have particular importance: <configSections> and <appSettings>. The <configSections> element defines the sections that will be used in the rest of the document to group information. The <appSettings> element contains user-defined nodes whose structure has been previously defined in the <configSections> node.

Armed with this working knowledge of the internal layout of configuration files, let's learn a bit more about the two configuration file types that won't receive an in-depth exposure in this chapter: machine and security configuration files.

Machine Configuration Files

Machine configuration files are named machine.config and are located in the CONFIG subdirectory of the .NET Framework installation path. A typical path is shown here:

    C:\WINNT\Microsoft.NET\Framework\v1.0.3705\CONFIG

The machine.config file contains machine-wide settings that apply to assembly binding, built-in remoting channels, and the ASP.NET runtime. In particular, the machine.config file contains information about the browser capabilities, registered HTTP handlers, and page compilation. The following listing provides an excerpt from a machine.config file:

    ⋮
    ⋮


The machine.config file typically contains remoting, ASP.NET, and diagnostics sections, plus the <configSections> element. Declaring a section in the machine.config file enables you to use that section in any configuration file on that computer, unless the setting is explicitly overwritten in the application configuration file.

Security Configuration Files

Security configuration files contain information about the code groups and the permission sets associated with a policy level. A policy level describes all the security measures for a given context. There are three policy levels for security: enterprise, machine, and user. The CLR grants permissions to an assembly based on the minimum set of permissions granted by any of the policy levels; that is, an assembly receives only the permissions that every level grants.

Note: A code group is a logical grouping of code that specifies certain conditions for membership. Any code that meets the given criteria can be included in the group. Code groups have associated permission sets. A permission set, in turn, defines the resources that can be accessed at execution time.

The name and the location of the security configuration file depend on the policy level. The configuration file for the enterprise policy level is named enterprisesec.config and resides in the same directory as the machine.config file. Contained in the same folder but with a different name, the security.config file characterizes the machine policy level. The enterprise level groups security settings for the entire enterprise; the machine policy level, on the other hand, defines the security for the local machine. Both levels can be configured only by an administrator.

The user policy configuration file is configurable by the current logged-on user. It is named security.config and resides in a folder under the user profile subtree. A typical path is shown here:

    C:\Documents and Settings\[UserName]\Application Data\Microsoft\CLR Security Config\v1.0.3705

Note: The paths for security configuration files are specific to each operating system. The paths mentioned here refer to Microsoft Windows 2000. For other systems' paths, refer to the MSDN documentation.

Editing the contents of the file, and thereby modifying the security policies, is a potentially critical task that should be accomplished using the .NET Framework Configuration tool (mscorcfg.msc, a Microsoft Management Console snap-in reachable from Administrative Tools in Control Panel) or the Code Access Security Policy tool (caspol.exe).

Application Configuration Files

As the name suggests, application configuration files are designed to contain settings specific to an application. The settings stored in the file are consumed by the CLR as well as by the application itself. The CLR reads information such as assembly binding policy, the location of remoted objects, and ASP.NET settings, if applicable. The application reads settings that correspond to the parameters it needs to work.

The name and the location of the application configuration file depend on the application's model, which can be one of the following: Windows Forms or console executable, ASP.NET application or Web service, or Internet Explorer-hosted application.

The Configuration File for Executables

For Windows Forms and console-based applications, the configuration file resides in the same directory as the application. The name of the file is the name of the application (including the .exe extension) followed by a .config extension. For example, if the application is named MyProgram.exe, the configuration file must be named MyProgram.exe.config.

Note: Windows Forms applications, as well as any other type of .NET Framework application, can in some situations use a configuration file with a custom name and format. This is possible when the only information stored in the file is application-specific settings.

The ASP.NET web.config File

ASP.NET and Web service configuration files are named web.config and are located in the root of the virtual directory. When you request a particular page, however, the ASP.NET runtime determines the correct settings by looking at all web.config files found, proceeding from the virtual folder root down to the actual path of the requested resource, which is typically a child directory.

Innermost configuration files can overwrite settings defined at an outer level. Likewise, pages located in internal folders inherit the settings of configuration files found at upper levels. For example, suppose you have two web.config files, one in the root of the Web application and one in the OtherPages subfolder.
The innermost configuration file is in no way involved when the URL points to a page in the root folder. However, when a page in the OtherPages subfolder is requested, the contents of the two web.config files are merged. In the case of conflicting settings, the innermost values win.

Internet Explorer-Hosted Applications

As we saw in Chapter 14, managed controls hosted in Internet Explorer can also have a configuration file. The name of this file doesn't have to follow specific rules, but the file must be located in the same virtual directory as the application. You simply indicate the file and its location using the <link> tag, as shown here:

In this declaration, location is a placeholder that denotes the URL to the actual configuration file. Whatever the name of the file, the format must be compliant with the standard XML schema described in the section "The XML Schema for Configuration Settings," on page 624.

Managing Configuration Settings

Application settings, including general user preferences and state information, are saved in the <appSettings> section of a configuration file. The following code snippet shows some typical output:
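As a minimal sketch of what such a section can look like, the following C# fragment embeds a hypothetical configuration file in a string and reads its entries with the XML DOM. The two keys are the ones the sample form reads later in this chapter; the values, the class name AppSettingsSample, and the method ReadAppSettings are assumptions made for illustration only.

```csharp
using System.Collections.Specialized;
using System.Xml;

public static class AppSettingsSample
{
    // Hypothetical file content: the keys match those used by the sample
    // form in this chapter, the values are made up.
    public const string SampleConfig =
        "<configuration>" +
        "  <appSettings>" +
        "    <add key=\"LastLeftTopPosition\" value=\"100,80\" />" +
        "    <add key=\"LastSize\" value=\"400,300\" />" +
        "  </appSettings>" +
        "</configuration>";

    // Collects every <add> entry below <appSettings> as a key/value pair.
    public static NameValueCollection ReadAppSettings(string configXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(configXml);

        NameValueCollection settings = new NameValueCollection();
        XmlNodeList nodes = doc.SelectNodes("configuration/appSettings/add");
        foreach (XmlElement entry in nodes)
        {
            // The indexer replaces any earlier value stored under the same key.
            settings[entry.GetAttribute("key")] = entry.GetAttribute("value");
        }
        return settings;
    }
}
```

Each setting is stored as a comma-separated string here, so that a single entry can carry both coordinates of a position or both dimensions of a size.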


The settings in the preceding sample file refer to the position and size of a window when the application is closed. The syntax of the <appSettings> section is defined as follows:

    <appSettings>
        <add key="..." value="..." />
        <remove key="..." />
        <clear />
    </appSettings>

The <add> element adds a new setting to the internal collection. This new setting has a value and is identified by a unique key. The <remove> element removes a specified setting from the collection. The setting is identified using the key. Finally, the <clear> element clears all the settings that have previously been defined in the section.

Note: The <remove> and the <clear> elements are particularly useful in ASP.NET configuration files in which a hierarchy of files can be created. For example, you can use the <clear> element to remove all settings from your application that were defined at a higher level in the configuration file hierarchy.

In general, the requirement that an application setting must be composed of a name/value pair is arbitrary. By default, the <appSettings> section is configured to use the name/value form. All sections used in a configuration file, including the <appSettings> section, must be declared in the initial <configSections> block. The following code snippet demonstrates the standard declaration of the <appSettings> section:

The <section> element takes two attributes: name and type. The name attribute denotes the name of the section being declared. The type attribute indicates the name of the managed class that reads and parses the contents of the section from the configuration file. The value of the type attribute is a comma-separated string that includes the class name and the assembly that contains it.

The <section> element also has two optional attributes: allowDefinition and allowLocation. These attributes apply only to ASP.NET applications and are ignored when other types of applications are running. The allowDefinition attribute specifies in which configuration files the section can be used: everywhere (Everywhere), the machine configuration file only (MachineOnly), or the machine and the application configuration file (MachineToApplication). This attribute provides a way to control ASP.NET settings inheritance. The allowLocation attribute specifies whether the section can be used within the <location> section.
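The effect of the <add>, <remove>, and <clear> elements on inherited settings can be simulated in a few lines. This is only an illustrative sketch, not the code of the framework's section handler: the class name SettingsMerger, the method Apply, and the keys used in the usage note are hypothetical.

```csharp
using System.Collections.Specialized;
using System.Xml;

public static class SettingsMerger
{
    // Applies the children of an <appSettings> element, in document order,
    // to a copy of the settings inherited from outer configuration files.
    public static NameValueCollection Apply(
        NameValueCollection inherited, string sectionXml)
    {
        // Work on a copy so the outer-level settings are left untouched.
        NameValueCollection result = new NameValueCollection(inherited);

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(sectionXml);

        foreach (XmlNode node in doc.DocumentElement.ChildNodes)
        {
            if (node.NodeType != XmlNodeType.Element)
                continue;

            XmlElement e = (XmlElement) node;
            switch (e.Name)
            {
                case "add":      // add a setting or override an inherited one
                    result[e.GetAttribute("key")] = e.GetAttribute("value");
                    break;
                case "remove":   // drop one inherited setting by key
                    result.Remove(e.GetAttribute("key"));
                    break;
                case "clear":    // drop everything defined so far
                    result.Clear();
                    break;
            }
        }
        return result;
    }
}
```

For instance, if the outer level defines Theme and PageSize, a section containing a <remove> for PageSize followed by an <add> for Theme leaves a single Theme entry with the inner value, which mirrors the rule that innermost values win.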


Note: User applications don't need to declare the <appSettings> section because the section is already declared in the system's machine.config file, as we saw in the section "Machine Configuration Files," on page 626. You don't need to repeat the <appSettings> declaration unless you want to modify some of the attributes, including the name/value format of the settings.

The ConfigurationSettings Class

To programmatically read application settings, you use the ConfigurationSettings class. ConfigurationSettings is a small, sealed class that simply provides one static method (GetConfig) and one static property (AppSettings).

The AppSettings property is a read-only NameValueCollection object designed to get the information stored in the <appSettings> section. If no setting is specified, or if no <appSettings> section exists, an empty collection is returned.

Note: To have a read-only NameValueCollection object, you need to use a class that derives from NameValueCollection and sets the protected member IsReadOnly to true. This is exactly what happens under the hood of the AppSettings property. The helper collection class that the AppSettings property returns is an undocumented class named ReadOnlyNameValueCollection.

The GetConfig method returns the configuration settings for the specified section, as shown here:

    public static object GetConfig(string sectionName);

Although the method signature indicates an object return type, the actual return value you get from a call to GetConfig is a class derived from NameValueCollection. In particular, the class is ReadOnlyNameValueCollection if the section is <appSettings>.

Note: In general, the object returned by GetConfig is determined by the handler class specified for the section. If the handler is NameValueSectionHandler or a related class, you get settings stored in a name/value collection. As we'll see in the section "Types of Section Handlers," on page 640, other options exist that could result in a different way of packing settings for applications.

The AppSettings property acts as a wrapper for the GetConfig method. The actual implementation of the property consists of a call to GetConfig in which the section name defaults to appSettings. The following pseudocode demonstrates:

    public static NameValueCollection AppSettings
    {
        get { return (NameValueCollection) GetConfig("appSettings"); }
    }

The real code is a bit more sophisticated than this, however. After GetConfig returns, the get accessor verifies that the returned value is not null. GetConfig returns null if the specified section is empty or does not exist. If the returned object is null, the get accessor of the AppSettings property creates an empty collection and returns that to the caller. The pseudocode is shown here:

    public static NameValueCollection AppSettings
    {
        get
        {
            ReadOnlyNameValueCollection o;
            o = (ReadOnlyNameValueCollection) GetConfig("appSettings");
            if (o == null)
            {
                // No section found: hand back an empty, read-only collection
                o = new ReadOnlyNameValueCollection();
                o.IsReadOnly = true;
            }
            return o;
        }
    }

Internally, the GetConfig method first determines the name and location of the configuration file to access and then proceeds by creating a specialized XML text reader to operate on the XML document. Each XML node read is parsed and the contents stored as name/value pairs in a ReadOnlyNameValueCollection object. To parse the contents of each XML node found, the method uses an instance of the section handler class specified in the section declaration within the <configSections> block. To read the <appSettings> section, GetConfig resorts to the NameValueSectionHandler handler. This handler parses all the <add> nodes below <appSettings> and adds entries to the collection. We'll look at section handler objects in more detail in the section "Customizing the XML Schema for Your Data," on page 646.

The Section Handler

In our sample machine.config file, the <appSettings> section is read through an instance of the NameValueFileSectionHandler class. What's the difference between this class and the NameValueSectionHandler class?

The MSDN documentation doesn't provide further information about the NameValueFileSectionHandler class; it notes only that the class is intended to be used only by the .NET Framework. But the NameValueFileSectionHandler class is actually a wrapper for the NameValueSectionHandler class, which provides an extra, although undocumented, feature.
In particular, the NameValueFileSectionHandler section handler allows the application settings to be stored in a separate file in accordance with the following syntax:

    <appSettings file="myfile.config" />

The file pointed to by the file attribute is read as if it is an <appSettings> section in the configuration file. Note that the root element of the myfile.config file must match the section that refers to it. So if the file attribute belongs to the <appSettings> section, the root element of the file being pointed to must be named <appSettings>.

The NameValueFileSectionHandler object processes the contents of the embedded file using the NameValueSectionHandler class. If no file is embedded in the <appSettings> section but the default documented schema is used, the two section handlers are functionally equivalent.

Although undocumented, the following code represents a perfectly valid schema for the application's configuration file. The sample application AppSettings, available in this book's sample files, demonstrates how to take advantage of this syntax.
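The behavior just described can be approximated with the XML DOM. The sketch below is an assumption-laden illustration rather than the handler's actual code: the class ExternalSettingsSketch and its methods are hypothetical, the external file is resolved relative to the folder of the main configuration file, and entries from the external file are assumed to override inline entries that share a key.

```csharp
using System;
using System.Collections.Specialized;
using System.IO;
using System.Xml;

public static class ExternalSettingsSketch
{
    // Reads <appSettings> from configPath and, when a file attribute is
    // present, also reads the settings stored in the referenced file.
    public static NameValueCollection Load(string configPath)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(configPath);

        NameValueCollection settings = new NameValueCollection();
        XmlElement section =
            (XmlElement) doc.SelectSingleNode("configuration/appSettings");
        if (section == null)
            return settings;            // no section: empty collection

        ReadAddNodes(section, settings);

        // GetAttribute returns an empty string when the attribute is absent.
        string file = section.GetAttribute("file");
        if (file.Length > 0)
        {
            // Assumption: the path is relative to the main file's folder.
            string dir = Path.GetDirectoryName(Path.GetFullPath(configPath));
            XmlDocument external = new XmlDocument();
            external.Load(Path.Combine(dir, file));

            // The root of the external file must match the referring section.
            if (external.DocumentElement.Name != section.Name)
                throw new XmlException(
                    "Root element must be <" + section.Name + ">.");

            ReadAddNodes(external.DocumentElement, settings);
        }
        return settings;
    }

    // Copies every <add key="..." value="..." /> child into the collection.
    private static void ReadAddNodes(XmlElement parent, NameValueCollection target)
    {
        foreach (XmlNode node in parent.SelectNodes("add"))
        {
            XmlElement entry = (XmlElement) node;
            target[entry.GetAttribute("key")] = entry.GetAttribute("value");
        }
    }
}
```

With a main file that contains only an <appSettings> element carrying a file attribute, and a myfile.config whose root is <appSettings>, Load returns the entries stored in myfile.config. A mismatched root element raises an exception, reflecting the matching rule stated above.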


The myfile.config file contains the actual settings, as shown here:

Using Settings Through Code

Now that you know how to read settings, let's create a sample application that uses persistent settings to refresh its own user interface. This application, shown in the following code, is a simple Windows Forms program that always appears at the same size and in the same position as when it was last closed. The settings are stored in a myfile.config file and are read using the AppSettings property of the ConfigurationSettings class.

    // Requires the System.Configuration and System.Drawing namespaces
    private void Form1_Load(object sender, System.EventArgs e)
    {
        // Read settings (each is a "first,second" string, or null if missing)
        string wndPos = ConfigurationSettings.AppSettings["LastLeftTopPosition"];
        string wndSize = ConfigurationSettings.AppSettings["LastSize"];

        // Update internal members
        string[] tmp;
        if (wndPos != null)
        {
            int m_top, m_left;
            tmp = wndPos.Split(',');
            m_left = Convert.ToInt32(tmp[0]);
            m_top = Convert.ToInt32(tmp[1]);
            this.Location = new Point(m_left, m_top);
        }
        if (wndSize != null)
        {
            int m_width, m_height;
            tmp = wndSize.Split(',');
            m_width = Convert.ToInt32(tmp[0]);
            m_height = Convert.ToInt32(tmp[1]);
            this.Size = new Size(m_width, m_height);
        }
    }

At loading, the form reads the settings from the configuration file, extracts position and size information, and updates the Location and Size properties. Next the form is displayed in the same location and at the same size as when it was closed.

Enumerating All Settings

The AppSettings property is a static member, so the same collection is shared by all code running in the application domain (AppDomain). If you need to access all the application settings, or simply to count them, you don't need to read one setting after the next. The property already contains all the settings in an easily manageable NameValueCollection object. The following code shows how to enumerate all the settings in a drop-down list:

    // Enumerating a NameValueCollection yields the keys of the settings
    foreach(string s in ConfigurationSettings.AppSettings)
        SettingList.Items.Add(s);
    SettingList.SelectedIndex = 0;

Figure 15-1 shows the sample application in action. The drop-down list contains all the settings.

Figure 15-1: Reading and using configuration settings programmatically.

Updating Settings

The .NET Framework does not provide any facilities for updating a configuration file. How you create and maintain the application's file is up to you and might require different approaches for different cases.
As long as the size of the file is limited to just a few KB, loading the entire document into an XmlDocument object is plausible and results in an effective and familiar programming interface. To add new nodes, you use the methods of the XML Document Object Model (XML DOM); to locate a particular node to update, you use XPath queries. (XML DOM is covered in Chapter 5, and XPath expressions are covered in Chapter 6.)

Let's proceed to persisting the location and size of the form. When the form is about to close, a Closing event is fired to let users perform some clean-up operations and other finalizing tasks, for example, persisting state information. The following code illustrates the event handler used in the sample application:

    private void Form1_Closing(object sender, CancelEventArgs e)
    {
        // Load the config file as an XML document
        // (Assume that the config file exists)
        string configFile;
        configFile = Assembly.GetExecutingAssembly().Location + ".config";
        XmlDocument doc = new XmlDocument();
        doc.Load(configFile);

        // Some internal variables
        XmlNodeList settings;
        XmlElement node, appSettingsNode;
        string query;

        // Get the <appSettings> node
        query = "configuration/appSettings";
        appSettingsNode = (XmlElement) doc.SelectSingleNode(query);
        if (appSettingsNode == null)
            return;
        ⋮
    }

This code first loads the configuration file into an instance of the XmlDocument class. The name of the file is obtained by combining the name of the currently executing assembly with the .config extension. Next the code gets a reference to the <appSettings> node. The reference to the node is obtained through an XPath query executed by SelectSingleNode. By design, the <appSettings> subtree is always a direct child of the <configuration> root node. The following code demonstrates how to update (or, if needed, to create) a setting.

    // Get the LastLeftTopPosition setting
    query = "configuration/appSettings/add[@key='LastLeftTopPosition']";
    settings = doc.SelectNodes(query);

    // If the node does not exist, create it
    if (settings.Count > 0)
        node = (XmlElement) settings[0];
    else
    {
        // Create the <add> node
        node = doc.CreateElement("add");
        XmlAttribute attKey = doc.CreateAttribute("key");
        attKey.Value = "LastLeftTopPosition";
        node.Attributes.SetNamedItem(attKey);
        XmlAttribute attVal = doc.CreateAttribute("value");
        node.Attributes.SetNamedItem(attVal);
    }

    // Append the node (a node that is already in the tree is moved to the end)
    appSettingsNode.AppendChild(node);

    // Update the value attribute
    node.Attributes["value"].Value = String.Format("{0},{1}", this.Left, this.Top);

Finally, you save the file and persist the changes, as shown here:

    doc.Save(configFile);

The XmlDocument class is particularly useful for performing this kind of task because it allows you to selectively access a particular node. If you have dozens of settings to persist, you might want to take a different route and rewrite the configuration file from scratch each time. In this case, using an XML writer can result in more effective code.

If the configuration file contains information other than application settings and this information takes up a lot of room, referencing an external configuration file from the <appSettings> node can become an attractive option. Although the <appSettings> node's file attribute is not documented, it works just fine and enables you to separate application and user settings from the rest of the settings.

The AppSettingsReader Class

A more specialized tool for reading application settings is the AppSettingsReader class. This class provides a single method, named GetValue, for reading values of a particular type from the configuration file.
The GetValue method takes two arguments, the name of the setting to retrieve and the type to return, as shown here:

    public object GetValue(string key, Type type);

The GetValue method retrieves the value of the given setting using the AppSettings property and then converts it to the specified type. Unlike the AppSettings property of the ConfigurationSettings object, which always returns a string, the GetValue method works in a strongly typed way. Suppose that you have a setting named ReleaseDate whose value is a date string. You can load the value directly into a DateTime object. Here's how:

    AppSettingsReader reader = new AppSettingsReader();
    DateTime relDate = (DateTime) reader.GetValue("ReleaseDate", typeof(DateTime));
    MessageBox.Show(relDate.ToShortDateString());

Note that the GetValue method is not marked as static, which means that you need an instance of the AppSettingsReader class to call the method. As mentioned, the GetValue method is a simple wrapper for the AppSettings property, which is a static member. If you plan to use AppSettingsReader in your application, you're better off instantiating the object only once during the startup phase.

Creating New Configuration Sections

The <appSettings> section is one of many predefined configuration sections provided by the .NET Framework. Programmers can also create their own sections.
To create a new section, you need to accomplish two basic tasks: declare the section in the <configSections> block, and fill the section with custom data.

One of the key bits of information you need to specify while declaring a new section is the name of the section handler class. The section handler class can be one of the predefined classes provided by the .NET Framework or a class that you write from scratch or inherit from an existing class. The section handler object is responsible for reading and parsing the actual contents of the setting.

Declaring a New Section

The <configSections> node contains the declarations of all the sections in the various configuration files. The predefined sections are declared in the machine.config file that the .NET Framework installs. Custom sections must be registered by the application that plans to use them. The application's configuration file is a good place for inserting this information.

The <configSections> node can accept up to four child nodes: <section>, <sectionGroup>, <remove>, and <clear>. The <remove> element removes a previously defined section, or a section group, from the <configSections> block. The <clear> element clears all previously defined sections and section groups.

Note: The <remove> and <clear> elements don't affect the actual data stored in the configuration file. Removing a section doesn't erase the related data from the file, but the data becomes unreachable because of the missing section declaration.

A new section is registered using the <section> element. As mentioned, the name attribute of this element specifies the name of the section and the type attribute specifies the name of the section handler class. The name of the configuration section class should contain full assembly information, including version, culture, and public key token, if any.
All the predefined handlers are defined in the same assembly and therefore share the same information, as in the following example:

    System, Version=1.0.3300.0, Culture=neutral, PublicKeyToken=b77a5c561934e089

Note: When you create a custom assembly with no strong name (a strong name is necessary if you want to put the assembly in the global assembly cache), the version number is defined in the assemblyinfo file that Microsoft Visual Studio .NET automatically adds to the project. The culture is neutral, and the public key token is null. Here's an example:

    AppSettings_CS, Version=1.0.9.0, Culture=neutral, PublicKeyToken=null

The custom section follows the <configSections> block and contains the actual configuration settings. A new section named userPreferences that accepts name/value pairs is declared with a <section> element in <configSections> and then filled with <add> entries in its own <userPreferences> node.
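As a rough illustration of the declaration and data layout just described, the following sketch embeds a hypothetical configuration file in a string, locates the userPreferences node, and hands it to the predefined NameValueSectionHandler class. The ReleaseDate value, the class name UserPreferencesSketch, and the direct call to Create are assumptions made for the sake of a self-contained example; in a real application the configuration system reads the file and invokes the handler for you. On current .NET versions the code assumes a reference to the System.Configuration.ConfigurationManager package.

```csharp
using System;
using System.Collections.Specialized;
using System.Configuration;
using System.Xml;

public static class UserPreferencesSketch
{
    // Hypothetical configuration content: a <section> declaration
    // followed by the section data expressed as <add> name/value pairs.
    // The type attribute is shown for completeness; this sketch does
    // not resolve it because the handler is instantiated directly.
    const string ConfigXml = @"
<configuration>
  <configSections>
    <section name=""userPreferences""
             type=""System.Configuration.NameValueSectionHandler, System, Version=1.0.3300.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"" />
  </configSections>
  <userPreferences>
    <add key=""ReleaseDate"" value=""2002-10-01"" />
  </userPreferences>
</configuration>";

    public static NameValueCollection Read()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(ConfigXml);

        // The section data lives outside the <configSections> block
        XmlNode section = doc.SelectSingleNode("configuration/userPreferences");

        // Call the handler by hand (illustration only)
        IConfigurationSectionHandler handler = new NameValueSectionHandler();
        return (NameValueCollection) handler.Create(null, null, section);
    }
}
```

Calling UserPreferencesSketch.Read()["ReleaseDate"] then returns the string stored in the value attribute, which is the same kind of lookup an application performs on the collection returned by GetConfig.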


Sections can be grouped under a <sectionGroup> element. Declaring a section group creates a namespace and ensures that no naming conflicts arise with other configuration sections defined by someone else. Section groups can also be nested within each other. For example, the userPreferences section can be declared inside a section group named AppName. A node with the group name must then also wrap the settings subtree.

To read the settings of a custom section, you use the GetConfig method, passing the fully qualified name of the section to retrieve. For example, the following code returns the settings in the userPreferences section:

    NameValueCollection settings;
    settings = (NameValueCollection) ConfigurationSettings.GetConfig("AppName/userPreferences");
    MessageBox.Show(settings["ReleaseDate"]);

Note: A new section, or section group, that is defined in the machine.config file is visible to all applications. This setting can be changed using the allowDefinition attribute for ASP.NET applications only. In contrast, sections defined in the application configuration file are visible only to the local application.

Types of Section Handlers

A section handler is a .NET Framework class that implements the IConfigurationSectionHandler interface. It interprets and processes the configuration settings stored in a configuration section and returns a configuration object based on the configuration settings.
The returned object is accessed by the GetConfig method. The data type returned by the GetConfig method depends on the section handler defined for the particular section.

The .NET Framework provides a few predefined section handlers, listed in Table 15-2. All of these section handlers belong to the System.Configuration namespace and are implemented in the System assembly.


Table 15-2: Predefined Section Handlers

DictionarySectionHandler
    Reads name/value pairs and groups them in a hash table object.

IgnoreSectionHandler
    The System.Configuration classes ignore the sections marked with this handler because their contents will be processed by other components. This handler is an alternative to using and declaring custom handlers.

NameValueFileSectionHandler
    Reads name/value pairs from a file referenced in the <appSettings> section and groups them in a NameValueCollection object.

NameValueSectionHandler
    Reads name/value pairs and groups them in a NameValueCollection object.

SingleTagSectionHandler
    Reads settings from attributes stored in a single XML node. The data is returned as a hash table.

In the .NET Framework, the classes in the System.Configuration namespace are responsible for parsing the contents of the configuration files. These classes are designed to process the entire contents of the configuration files. The classes also throw an exception when a configuration section lacks a corresponding entry in the <configSections> block and when the layout of the data does not match the declaration.

Of the five section handlers, we have examined NameValueSectionHandler and NameValueFileSectionHandler. The DictionarySectionHandler class is very similar; it differs only in that it stores settings in a hash table instead of in a NameValueCollection object. Collection objects are more efficient if they are used to store a small number of items (ideally fewer than 10), whereas a hash table provides better performance with large collections of items.
The IgnoreSectionHandler and SingleTagSectionHandler classes deserve a bit more attention, and we'll look at them next.

The IgnoreSectionHandler Section Handler

A few subsystems in the .NET Framework store configuration data in the machine.config file but process the data themselves, without relying on the services provided by the System.Configuration classes. For example, the machine.config file contains remoting and startup information that is processed outside the configuration engine. To prevent parsing exceptions on such sections, you can use a dummy section handler, IgnoreSectionHandler. This handler marks sections whose data is read and handled by systems other than the classes in System.Configuration. It could be argued that such data should be stored in a system configuration file, like the machine.config file, or in a custom file. In the machine.config file, remoting configuration settings are processed by the remoting classes, whereas HTTP run-time configuration settings are processed by a custom handler.


In both cases, configuration settings need a customized and more sophisticated layout than name/value pairs. In the first scenario, the handler is embedded in the remoting subsystem; in the second scenario, the handler conforms to the configuration guidelines but is simply not one of the predefined handlers. As mentioned, by design the configuration classes read through all the contents of a configuration file and throw exceptions whenever they encounter something wrong. Custom settings handled outside the configuration namespace must therefore have a section handler, although one that does nothing: the IgnoreSectionHandler handler.

The SingleTagSectionHandler Section Handler

The SingleTagSectionHandler class supports a simpler schema for storing configuration settings. Unlike NameValueSectionHandler, which supports name/value pairs defined within <add> nodes, the SingleTagSectionHandler class uses a single XML node with as many attributes as needed. Each attribute maps to a setting, and the name of the attribute is also the key to access the value.

In other words, the SingleTagSectionHandler class provides an attribute-based view of the configuration settings, whereas the NameValueSectionHandler class (and DictionarySectionHandler as well) provides an element-based representation.
A single tag section handler therefore stores all of its settings as attributes of one element.

Under the Hood of Section Handlers

As mentioned, a configuration section handler is simply a managed class that implements the IConfigurationSectionHandler interface. The classes that implement the IConfigurationSectionHandler interface define the rules for transforming pieces of XML configuration files into usable objects. The created objects can be of an arbitrary type. The following code shows the interface signature:

    public interface IConfigurationSectionHandler
    {
        object Create(object parent, object configContext, XmlNode section);
    }
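To get a feel for the Create contract, it can help to call a predefined handler by hand on an XML fragment. The sketch below does so with SingleTagSectionHandler, which also illustrates the attribute-based layout described earlier. The element name formLayout, its attributes, and the class name SingleTagSketch are hypothetical, and invoking Create directly is only an illustration; normally the configuration system makes this call. On current .NET versions the code assumes a reference to the System.Configuration.ConfigurationManager package.

```csharp
using System;
using System.Collections;
using System.Configuration;
using System.Xml;

public static class SingleTagSketch
{
    public static Hashtable Read()
    {
        // Hypothetical section: one element, one attribute per setting
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<formLayout left=\"100\" top=\"50\" />");

        // Work through the interface, as a configuration reader would
        IConfigurationSectionHandler handler = new SingleTagSectionHandler();

        // parent and configContext are null outside web.config scenarios
        object result = handler.Create(null, null, doc.DocumentElement);

        // SingleTagSectionHandler returns its settings as a hash table
        // keyed by attribute name
        return (Hashtable) result;
    }
}
```

Each attribute name becomes a key in the returned table, so SingleTagSketch.Read()["left"] yields the string stored in the left attribute.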


The interface includes a single method, Create, that configuration readers call to obtain an object that represents the contents of a particular setting. This method takes three arguments: a parent object, a context object, and a section XML node. In general, the configuration object can be obtained by combining the information read and composed in a parent directory with the current settings. This information is stored in the parent argument. A configuration setting can't always have a parent path, however; this is possible only with web.config files, which are specifically designed to support configuration inheritance. For all other configuration files, the parent argument of the Create method is always null. The parent argument being passed should not be altered, and if a modification is necessary, you first clone the object and then modify it.

Note: If it isn't null, the parent argument is guaranteed to be an object returned by a previous call made to the Create method on the same section handler object. Therefore, by design, the type of the parent argument is identical to the return type of the current implementation of Create. For example, if the Create method returns a NameValueCollection object, the parent argument can only be an object of type NameValueCollection or null.

A section handler object might be used in any configuration file, including a web.config file. For this reason, when implementing the IConfigurationSectionHandler interface, you should check the value in the parent argument and act accordingly.
We'll look at an example of this in the section "Implementing the DataSet Section Handler," on page 653.

The configContext argument is non-null only if you use the section handler within a web.config file in an ASP.NET application. In this case, the argument evaluates to an object of type HttpConfigurationContext, whose only significant member is a property named VirtualPath. The VirtualPath property contains the virtual path to web.config with respect to the ongoing Web request. In this way, you can determine the level of configuration nesting at which your handler is called to operate.

Finally, the section argument is the XML DOM node object rooted at the section to be handled. The argument is an XML DOM subtree that represents the data to be processed.

Note: To better understand the rather symbolic role played by the IgnoreSectionHandler section handler class, consider what the implementation of its Create method looks like:

    object Create(object parent, object context, XmlNode section)
    {
        return null;
    }

No information is returned, but neither is an exception thrown.

Customizing Attribute Names

Configuration settings are stored using predefined attribute names: key for the setting's name, and value for the actual contents. Such names are hard-coded as protected members in the NameValueSectionHandler and DictionarySectionHandler classes. Their associated properties are named KeyAttributeName and ValueAttributeName, respectively. To customize those names, you must derive a new class, override the properties, and use the new class as your section handler.


The following code demonstrates a class that inherits from NameValueSectionHandler and simply renames the attributes to be used for the settings. Instead of the default names key and value, SettingKey and SettingValue are used.

    public class CustomNameValueSectionHandler : NameValueSectionHandler
    {
        public CustomNameValueSectionHandler() : base()
        {
        }

        protected override string KeyAttributeName
        {
            get { return "SettingKey"; }
        }

        protected override string ValueAttributeName
        {
            get { return "SettingValue"; }
        }
    }

Note that the KeyAttributeName and ValueAttributeName properties are read-only, protected, and virtual. You must retain the same modifier and override the properties in a new class. There is no need to make the properties read/write. The preceding class is defined in the sample application AppSettings_CS available in this book's sample files and enables you to access a configuration file whose settings use the SettingKey and SettingValue attributes.

The beauty of these section handlers is that they encapsulate all the logic necessary to access settings in the configuration file. The application is not affected by the actual layout of the setting. As a result, reading such a value requires the same high-level code, regardless of the attribute names you use, as shown here:

    NameValueCollection coll;
    coll = (NameValueCollection) ConfigurationSettings.GetConfig("AppName/CustomSection");
    MessageBox.Show(coll["Property"]);

Customizing the XML Schema for Your Data

The predefined XML schema for configuration files fits the bill in most cases, but when you have complex and structured information to preserve across application sessions, none of the existing schemas appear to be powerful enough. At this point, you have two main workarounds. You can simply avoid using a standard configuration file and instead use a plain XML file written according to the schema that you feel is appropriate for the data. Alternatively, you can embed your XML configuration data in the standard application configuration file but provide a tailor-made configuration section handler to read it. A third option also exists. You could insert the data in the configuration file, register the section with a null handler (IgnoreSectionHandler), and then use another piece of code (for example, a custom utility) to read and write the settings.

Before we look more closely at designing and writing a custom configuration handler according to the XML schema you prefer, let's briefly compare the various approaches. In terms of performance and programming power, all approaches are roughly equivalent, but some key differences still exist.
In theory, using an ad hoc file results in the most efficient approach because you can create made-to-measure, and subsequently faster, code. However, this is only a possibility: if your code happens to be badly written, the performance of your whole application might still be bad. The System.Configuration classes are designed to serve as a general-purpose mechanism for manipulating settings. They work great on average but are not necessarily the best option when an effective manipulation of the settings is key to your code. On the other hand, the System.Configuration classes, and the standard configuration files, require you to write a minimal amount of code. The more customization you want, the more code you have to write, with all the risks (mostly errors and bugs) that this introduces.

As a rule of thumb, using the standard configuration files should be the first option to evaluate. Resort to custom files only if you want to control all aspects of data reading (for example, if you want to provide feedback while loading), if performance is critical, or if you just don't feel comfortable with the predefined section handlers. Finally, although it's reasonable to use the IgnoreSectionHandler handler in the context in which the .NET Framework uses it, I don't recommend using IgnoreSectionHandler in user applications. A custom section handler or a custom file is preferable.

If you're considering creating a custom file based on a customized XML schema, DataSet objects present an interesting option.
Assuming that the data to be stored lends itself to being represented in a tabular format, you could write an XML configuration file using the Microsoft ADO.NET normal form and load that data into a DataSet object. Loading data requires a single call to the ReadXml method, and managing data is easy due to the powerful interface of the DataSet class. We'll look at an example of the DataSet section handler next.

Note: In the section "Customizing Attribute Names," on page 645, we analyzed a custom section handler inherited from the NameValueSectionHandler class. That trivial handler was simply aimed at overriding some of the standard features of one of the predefined handlers. A truly custom section handler is a more sophisticated object that uses an XML reader to access a portion of the configuration file and parse the contents.
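The single call to ReadXml mentioned above can be tried in isolation before any section handler is involved. The following minimal sketch assumes a hypothetical settings document (the Settings, Window, Name, Left, and Top names are invented for the example) laid out so that the DataSet can infer one table from the repeated elements. It reads the XML from a string rather than from a file to stay self-contained.

```csharp
using System;
using System.Data;
using System.IO;

public static class ReadXmlSketch
{
    // Hypothetical tabular settings: each repeated <Window> element
    // is inferred as a row, each child element as a column.
    const string SettingsXml = @"
<Settings>
  <Window><Name>Main</Name><Left>100</Left><Top>50</Top></Window>
  <Window><Name>Tools</Name><Left>400</Left><Top>80</Top></Window>
</Settings>";

    public static DataSet Load()
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(SettingsXml))
        {
            // No schema is supplied, so ReadXml infers one from the data
            ds.ReadXml(reader);
        }
        return ds;
    }
}
```

After loading, the settings are reachable through the usual DataSet interface, for example ReadXmlSketch.Load().Tables["Window"].Rows. Because the schema is inferred, every column is typed as a string; a real application might prefer to supply an explicit schema.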


Creating a DataSet Section Handler

Let's look at a practical example of a new section handler named DatasetSectionHandler. This section handler reads XML data from a configuration file and stores it in a new DataSet object. The data must be laid out in a format that the ReadXml method can successfully process. The typical format is the ADO.NET normal form that we examined in Chapter 9.

Along with the custom section handler, let's write an application that can handle configuration data through a DataSet object. Suppose you have a Windows Forms application that can be extended with plug-in modules. We won't look at the details of how this could be done here; instead, we'll focus on how to effectively store configuration data as XML. (In the section "Further Reading," on page 655, you'll find a reference to a recent article that addresses this topic fully.) We'll analyze the plug-in engine for Windows Forms applications only, but the same pattern can be easily applied to Web Forms applications as well.

Extending Windows Forms Application Menus

The sample application shown in Figure 15-2 allows users to add custom menu items below the first item on the Tools menu. Such menu items are linked to external plug-in modules. In this context, a plug-in module is simply a class dynamically loaded from an assembly.
More generally, the plug-in class will need to implement a particular interface, or inherit from a given base class, because the application needs to have a consistent way to call into any plug-in class. (For more information and a complete example of extensible .NET Framework applications, check out the article referenced in the section "Further Reading," on page 655.) In our sample application, we'll limit ourselves to creating a context-sensitive MessageBox call for each new registered plug-in.

Figure 15-2: A Windows Forms application that can be extended with plug-in modules that integrate with the menu.

At loading, the sample application calls the following routine to set up the menu:

    private void SetupMenu()
    {
        // Access the menu config file
        string path = "TypicalWinFormsApp/PlugIns";
        DataSet configMenu = (DataSet) ConfigurationSettings.GetConfig(path);

        // Add dynamic items to existing popup menus
        if (configMenu != null)
            AddMenuToolsPlugIns(configMenu);
    }

The configuration settings (that is, the menu items to be added to the Tools menu) are read from the configuration file using the ConfigurationSettings class, as usual. Nothing in the preceding code reveals the presence of a custom section handler and a completely custom XML schema for the settings. The only faint clue is the use of a DataSet object.

After it has been successfully loaded from the configuration file, the DataSet object is passed to a helper routine, AddMenuToolsPlugIns, which will modify the menu. We'll return to this point in the section "Invoking Plug-In Modules," on page 650; in the meantime, let's review the layout of the configuration file.

The XML Layout of the Configuration Settings

The data corresponding to plug-in modules is stored in a section group named TypicalWinFormsApp. The actual section is named PlugIns. Each plug-in module is identified by an assembly name, a class name, and display text. The display text constitutes the caption of the menu item, whereas the assembly name and the class name provide for a dynamic method call.
As mentioned, in a real-world scenario, you might force the class to implement a particular interface so that it's clear to the calling application which methods are available for the object it is instantiating.

The sample configuration file for the application shown in Figure 15-2 registers the following plug-in entries (display text, assembly name, class name):

    Add new tool...    MyToolsPlugIns    MyPlugIn.AddNewTool
    Special tool...    MyToolsPlugIns    MyPlugIn.SpecialTool
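To make the layout more concrete, here is one way such a file and its handler could fit together. This is a sketch under stated assumptions, not the book's listing: the element names MenuTools, Text, Assembly, and Class are taken from the column and table names that the sample code reads, while the exact nesting, the class name SketchDataSetSectionHandler, and the direct call to Create are assumptions made so that the example runs on its own. On current .NET versions the code assumes a reference to the System.Configuration.ConfigurationManager package.

```csharp
using System;
using System.Configuration;
using System.Data;
using System.Xml;

// Hypothetical stand-in for a DataSet-returning section handler.
public class SketchDataSetSectionHandler : IConfigurationSectionHandler
{
    public object Create(object parent, object configContext, XmlNode section)
    {
        // XmlNodeReader lets ReadXml consume the in-memory DOM subtree
        DataSet ds = new DataSet();
        ds.ReadXml(new XmlNodeReader(section));
        return ds;
    }
}

public static class PlugInConfigSketch
{
    // Assumed layout: a section group, a section declaration, and one
    // <MenuTools> element per plug-in. The type attribute is shown for
    // completeness; this sketch instantiates its own handler instead.
    const string ConfigXml = @"
<configuration>
  <configSections>
    <sectionGroup name=""TypicalWinFormsApp"">
      <section name=""PlugIns""
               type=""XmlNet.CS.DatasetSectionHandler, DatasetSectionHandler"" />
    </sectionGroup>
  </configSections>
  <TypicalWinFormsApp>
    <PlugIns>
      <MenuTools>
        <Text>Add new tool...</Text>
        <Assembly>MyToolsPlugIns</Assembly>
        <Class>MyPlugIn.AddNewTool</Class>
      </MenuTools>
      <MenuTools>
        <Text>Special tool...</Text>
        <Assembly>MyToolsPlugIns</Assembly>
        <Class>MyPlugIn.SpecialTool</Class>
      </MenuTools>
    </PlugIns>
  </TypicalWinFormsApp>
</configuration>";

    public static DataSet Load()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(ConfigXml);
        XmlNode section = doc.SelectSingleNode(
            "configuration/TypicalWinFormsApp/PlugIns");

        // Call the handler by hand (illustration only)
        IConfigurationSectionHandler handler = new SketchDataSetSectionHandler();
        return (DataSet) handler.Create(null, null, section);
    }
}
```

With this layout, PlugInConfigSketch.Load().Tables["MenuTools"] holds one row per plug-in, which matches what a routine such as AddMenuToolsPlugIns expects to iterate over. The sketch ignores the parent argument; a handler meant for web.config files would also need to merge inherited settings.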


I deliberately left a few standard application settings (the appSettings section) in this listing just to demonstrate that custom sections can happily work side by side with standard system and application settings. In particular, the sample application depicted in Figure 15-2 also supports the same save and restore features described in the section "Using Settings Through Code," on page 634.

The section element points to the class XmlNet.CS.DatasetSectionHandler, which is declared and implemented in the DatasetSectionHandler assembly. The net effect of this section declaration is that whenever an application asks for a PlugIns section, the preceding section handler is involved, its Create method is called, and a DataSet object is returned. We'll look at the implementation of the section handler in the section "Implementing the DataSet Section Handler," on page 653.

Invoking Plug-In Modules

The AddMenuToolsPlugIns procedure modifies the application's Tools menu, adding all the items registered in the configuration file.
The following code shows how it works:

    private void AddMenuToolsPlugIns(DataSet ds)
    {
        DynamicMenuItem mnuItem;
        DataTable config;

        // Get the table that represents the settings for the menu
        config = ds.Tables["MenuTools"];
        if (config == null)
            return;

        // Add a separator
        if (config.Rows.Count > 0)
            menuTools.MenuItems.Add("-");

        // Start position for insertions
        int index = menuTools.MenuItems.Count;

        // Populate the Tools menu
        foreach (DataRow configMenuItem in config.Rows)
        {
            mnuItem = new DynamicMenuItem(
                configMenuItem["Text"].ToString(),
                new EventHandler(StdOnClickHandler));
            mnuItem.AssemblyName = configMenuItem["Assembly"].ToString();
            mnuItem.ClassName = configMenuItem["Class"].ToString();

            menuTools.MenuItems.Add(index, mnuItem);
            index += 1;
        }
    }

The DataSet object that the section handler returns is built from the XML code rooted in the PlugIns element. This XML produces a DataSet object with one table, named MenuTools. The MenuTools table has three columns: Text, Assembly, and Class. Each row in the table corresponds to a plug-in module.

The preceding code first adds a separator and then iterates on the rows of the table and adds menu items to the Tools menu, as shown in Figure 15-3. In the code, menuTools is just the name of the Tools pop-up menu in the sample application.

Figure 15-3: Registered plug-in modules appear on the Tools menu of the application.

To handle a click on a menu item in a Windows Forms application, you need to associate an event handler object with the menu item. Visual Studio .NET does this for you at design time for static menu items. For dynamic items, this association must be established at run time, as shown here:

    DynamicMenuItem mnuItem;
    mnuItem = new DynamicMenuItem(
        configMenuItem["Text"].ToString(),
        new EventHandler(StdOnClickHandler));

A menu item is normally represented by an instance of the MenuItem class. What is that DynamicMenuItem class all about then? DynamicMenuItem is a user-defined class that extends MenuItem with a couple of properties particularly suited for menu items that represent calls to plug-in modules. Here's the class definition:

    public class DynamicMenuItem : MenuItem
    {
        public string AssemblyName;
        public string ClassName;

        public DynamicMenuItem(string text, EventHandler onClick)
            : base(text, onClick)
        {
        }
    }


The new menu item class stores the name of the assembly and the class to use when clicked. An instance of this class is passed to the event handler procedure through the sender argument, as shown here:

    private void StdOnClickHandler(object sender, EventArgs e)
    {
        // Get the current instance of the dynamic menu item
        DynamicMenuItem mnuItem = (DynamicMenuItem) sender;

        // Display a message box that proves we know the corresponding
        // assembly and class name
        string msg = "Execute a method on class [{0}] from assembly [{1}]";
        msg = String.Format(msg, mnuItem.ClassName, mnuItem.AssemblyName);
        MessageBox.Show(msg, mnuItem.Text);
    }

In a real-world context, you can use the assembly and class information to dynamically create an instance of the class using the Activator object that we encountered in Chapter 12, as follows:

    // Assuming that the class implements the IAppPlugIn interface;
    // asm is the assembly name, cls is the class name
    IAppPlugIn o = (IAppPlugIn) Activator.CreateInstance(asm, cls).Unwrap();

    // Assume that the IAppPlugIn interface has a method Execute()
    o.Execute();

Figure 15-4 shows the message box that appears when you click a custom menu item in the sample application. All the information displayed is read from the configuration file.

Figure 15-4: The message box that appears when a custom menu item is clicked.
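The Activator call can be tried in isolation with a sketch like the one below. The IAppPlugIn interface is only assumed in the text, so its shape here is hypothetical: Execute returns a string purely so the result can be printed, and the "plug-in" class lives in the running assembly instead of a separate one.

```csharp
using System;
using System.Reflection;

namespace PlugInSketch
{
    // Hypothetical contract that every plug-in class is assumed to implement
    public interface IAppPlugIn
    {
        string Execute();
    }

    // Stand-in for a class that would normally live in a plug-in assembly
    public class AddNewTool : IAppPlugIn
    {
        public string Execute() { return "AddNewTool executed"; }
    }

    public static class PlugInLoader
    {
        // Creates the class by assembly and type name, then calls it
        // through the interface instead of through reflection
        public static string Run(string asm, string cls)
        {
            object o = Activator.CreateInstance(asm, cls).Unwrap();
            IAppPlugIn plugIn = o as IAppPlugIn;
            if (plugIn == null)
                throw new InvalidOperationException(
                    cls + " does not implement IAppPlugIn");
            return plugIn.Execute();
        }
    }

    public static class Program
    {
        public static void Main()
        {
            // For the sketch, the plug-in is loaded from the running assembly
            string asm = Assembly.GetExecutingAssembly().GetName().Name;
            Console.WriteLine(PlugInLoader.Run(asm, "PlugInSketch.AddNewTool"));
        }
    }
}
```

Checking the interface with the as operator, rather than casting directly, lets the host report a misconfigured class name with a clear message instead of an InvalidCastException.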


Implementing the DataSet Section Handler

To top off our examination of section handlers, let's review the source code for the custom section handler that we've been using, shown here:

    using System;
    using System.Data;
    using System.Xml;
    using System.Configuration;

    namespace XmlNet.CS
    {
        public class DatasetSectionHandler : IConfigurationSectionHandler
        {
            // Constructor(s)
            public DatasetSectionHandler()
            {
            }

            // IConfigurationSectionHandler.Create
            public object Create(object parent, object context, XmlNode section)
            {
                DataSet ds;

                // Copy the parent DataSet if not null. Copy() duplicates
                // schema and rows; Clone() would copy the schema only and
                // the parent's settings would be lost in the merge
                if (parent == null)
                    ds = new DataSet();
                else
                    ds = ((DataSet) parent).Copy();

                // Read the data using a node reader
                DataSet tmp = new DataSet();
                XmlNodeReader nodereader = new XmlNodeReader(section);
                tmp.ReadXml(nodereader);

                // Merge with the parent and return
                ds.Merge(tmp);
                return ds;
            }
        }
    }
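The read-and-merge steps of Create can be exercised outside the configuration system with a sketch like the following. The SectionHandlerSketch class and its helper names are hypothetical, the XML shape is an assumption based on the table and column names given in this chapter, and the parent is duplicated with Copy() so that its rows carry over into the merged result.

```csharp
using System;
using System.Data;
using System.Xml;

public static class SectionHandlerSketch
{
    // Mirrors the steps of Create: load the node into a temporary DataSet,
    // then merge it into a copy of the parent (if any)
    public static DataSet Load(DataSet parent, XmlNode section)
    {
        DataSet ds = (parent == null) ? new DataSet() : parent.Copy();
        DataSet tmp = new DataSet();
        tmp.ReadXml(new XmlNodeReader(section));
        ds.Merge(tmp);
        return ds;
    }

    // Builds a PlugIns node with a single MenuTools entry (assumed layout)
    public static XmlNode SampleSection(string text, string cls)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<PlugIns><MenuTools>" +
            "<Text>" + text + "</Text>" +
            "<Assembly>MyToolsPlugIns</Assembly>" +
            "<Class>" + cls + "</Class>" +
            "</MenuTools></PlugIns>");
        return doc.DocumentElement;
    }

    public static void Main()
    {
        // The first call plays the role of a parent configuration file,
        // the second the role of a child file that inherits from it
        DataSet parent = Load(null,
            SampleSection("Add new tool...", "MyPlugIn.AddNewTool"));
        DataSet child = Load(parent,
            SampleSection("Special tool...", "MyPlugIn.SpecialTool"));

        DataTable config = child.Tables["MenuTools"];
        foreach (DataRow row in config.Rows)
            Console.WriteLine("{0} -> {1}, {2}",
                row["Text"], row["Assembly"], row["Class"]);
    }
}
```

Because the inferred table has no primary key, Merge appends the child's rows to the parent's rather than matching and overwriting them.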


The DatasetSectionHandler class implements the IConfigurationSectionHandler interface and provides the default constructor. The most interesting part of this code is the Create method, which reads the current section specified through the section argument and then merges the resultant DataSet object with the parent, if a non-null parent object has been passed. Because configuration inheritance proceeds from top to bottom, the base DataSet object for merging is the parent.

The XML data to be parsed is passed via an XmlNode object, that is, an object that represents the root of an XML DOM subtree. To make an XML DOM subtree parsable by the DataSet object's ReadXml method, you must wrap it in an XmlNodeReader object, one of the XML reader objects that we encountered in Chapter 2 and Chapter 5. When called to action on the configuration file from the section "The XML Layout of the Configuration Settings," on page 649, the XmlNode object passed to the handler points to the PlugIns node.

Conclusion

The .NET Framework API for reading configuration settings is designed to greatly simplify the code needed on the client. This API represents the perfect example of smooth XML integration. No matter how the configuration data is organized and where the data is located, the code you use to access the data is nearly identical. The only significant drawback I've noticed in the current implementation of the configuration API is that you can't rely on a common and official API to update settings.
However, as this chapter showed, using XML writers or, better yet, XML DOM documents provides a quick and effective workaround.

In this chapter, we reviewed the fundamentals of the .NET Framework configuration subsystem, the files in which it is articulated, and their related locations. Next we reviewed the properties and methods commonly used to access configuration settings. The final part of the chapter addressed the topics and the tasks involved in an in-depth customization of configuration files. In particular, you learned how to create new sections and new section handlers, and we examined a comprehensive example.

Reading

The configuration API is described in detail in the MSDN documentation. I've noticed only a few omissions and a few points about which that text is unclear, and I've tried to include that information in this chapter. The final example presented in this chapter represents a hot topic for many developers: building desktop applications that can be extended with external plug-in modules. I discussed this topic at length and with extensive code examples in an article that appeared in the "Cutting Edge" column of the July 2002 issue of MSDN Magazine.


Afterword

Overview

While writing this book, I accumulated a few thoughts that I'd like to share with you as my final considerations about XML and the Microsoft .NET Framework. If you consider these ideas individually, they might appear completely unrelated to one another, but considered all together, they form a sort of filter through which you can reconsider and review this book's contents from a higher level perspective. These are the four main concepts:

• XML is a native data type in the .NET Framework.
• We need a parsing model that falls in the middle between the XML Document Object Model (XML DOM) and Simple API for XML (SAX).
• The capability to query data effectively is key.
• We need more than the Simple Object Access Protocol (SOAP) and the XML Schema Definition (XSD) for true interoperability.

Some of these ideas address cross-platform issues whose solution is beyond the capabilities and interests of individual vendors. The W3C is working on XQuery, an evolution of the XPath query language, which will provide a data model for XML documents as well as a set of operators for that data model and a query language based on these operators. (For more information, refer to http://www.w3.org/XML/Query.)

To date, the recent WS-I initiative (see http://www.ws-i.org) appears to be the Web services counterpart to the W3C.
The goal of the consortium behind the WS-I initiative is to promote true interoperability across Web services implementations. To the extent that I can envision things, the most effective way to make this happen is by defining new XML-based standards at least for security and object representation.

Native XML in the .NET Framework

Prior to the advent of the .NET Framework, we were used to writing XML-driven Microsoft Windows applications based on the MSXML COM-based library. Unlike classes in the .NET Framework, however, MSXML is a bolted-on API that communicates with the rest of the application but does not really integrate with it. Communication entails the activity or the process of passing information to others. It is based on some set of signals that both parties understand and that encode the information being exchanged. Integration, on the other hand, means that items are combined so that they are closely linked and form one unit. This distinction is significant.

The MSXML library can be imported into your code but remains an external, self-contained black box that acts as a server component. .NET Framework applications, on the other hand, use XML classes along with other classes in the .NET Framework, resulting in a homogeneous combination of "equal-sized" pieces.
As a self-contained component, MSXML must itself provide advanced features such as asynchronous parsing. This feature is apparently lacking in the XML classes of the .NET Framework. By integrating XML classes with other classes in the .NET Framework, however, you can easily obtain the same functionality and even gain more control over the overall process.
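As a sketch of that combination, the code below runs an ordinary synchronous parse on a worker thread so that the caller is not blocked. The AsyncParseSketch class is hypothetical, and the Task type it relies on was added to the framework well after the version this book covers; at the time, a Thread or an asynchronous delegate would have played the same role.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using System.Xml;

public static class AsyncParseSketch
{
    // Counts element nodes with a forward-only, read-only reader
    public static int CountElements(string xml)
    {
        int count = 0;
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    count++;
            }
        }
        return count;
    }

    // Runs the same parse on a thread-pool thread
    public static Task<int> CountElementsAsync(string xml)
    {
        return Task.Run(() => CountElements(xml));
    }

    public static void Main()
    {
        string xml =
            "<employees><employee id=\"1\"/><employee id=\"2\"/></employees>";
        Task<int> pending = CountElementsAsync(xml);

        // The caller is free to do other work here; reading Result
        // blocks only until the background parse has finished
        Console.WriteLine("Elements: {0}", pending.Result);
    }
}
```

The parsing code itself knows nothing about threading, which is the extra control the text alludes to: the application, not the XML library, decides how and where the work is scheduled.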


Neither XML DOM nor SAX

The .NET Framework supports the XML DOM but not SAX. The XML DOM is the classic way to process XML documents, but it also turns out to be ineffective for certain classes of documents, mostly very large and volatile documents. The SAX model was developed to provide an alternative approach. The idea behind SAX is great; the actual programming model is much less ideal. SAX uses the push model, whereas a pull model is certainly more effective and flexible.

The .NET Framework provides a third parsing model based on the concept of the reader. The reader is a kind of read-only, forward-only cursor that doesn't cache anything; it just reads as quickly as possible.

Programmers need classes that implement the XML DOM because the XML DOM is a recognized standard and because it is useful in a number of realistic scenarios. However, XML DOM can't be the only API available to work with XML documents. A lower level set of tools is needed. The .NET reader is just this. In fact, the XML DOM implementation in the .NET Framework is built using readers.

Query Is Key

An XML document is primarily a repository of information and as such must be searchable. But how? XPath was the first answer to the demand for a query tool to extract node-sets out of XML documents. But more powerful tools are needed.
Today, XPath 2.0 is on the way, with XQuery 1.0 running close behind.

XPath as we know it today, and as supported by the .NET Framework, is a language for addressing parts of an XML document. XPath 2.0 presents itself as an expression language for processing sequences of text. It also comes with built-in support for querying XML documents. But what's the difference between addressing and querying? And between XPath and XQuery?

I think that the difference between addressing and querying can be summarized by resorting to a SQL metaphor. A simple SELECT statement with a WHERE clause addresses a subset of rows; a more complex SELECT statement that includes UNION, GROUP BY, INNER JOIN, and temporary tables does much more and actually performs a query.

XPath 1.0 addresses parts of the documents; XQuery performs complex queries and supports more data types. From a syntax point of view, XPath 2.0 is a subset of XQuery but with a number of key features already included. Stepping from XPath 1.0 to XPath 2.0 positions you nicely for a further jump to XQuery when it becomes a W3C recommendation.

A good reference for clearing up any confusion you might have about XPath and XQuery is the following: http://www.xml.com/pub/a/2002/03/20/xpath2.html.

The Dream of True Interoperability

That XML can be exchanged between heterogeneous platforms and understood anywhere is a fact. Web services are a relatively new type of software that exploits this aspect of XML.
The rub lies in the fact that in the real world, data must be used once it has been transferred. XML data must be converted to usable objects. But which tool can take care of this mapping process? An easy answer would be the parser, but the parser is a generic tool that processes XML data and returns an XML-specific object, not an application-specific object. For example, while parsing employee data, the parser can create an XML DOM object that contains a tree of nodes set to employee data. There is no way for the parser to return an application-specific object such as an Employee class with properties and methods.

Just as SOAP provides a universal technique for defining a method call, another protocol should provide the ability to describe a class. I'd like to have a simple class definition protocol that would let servers and clients exchange documents that contain structure and data of a given class instance. A specialized type of parser would be needed with the extra ability to deserialize the class description into a valid instance of a type. Sound confusing? Think of the .NET Framework XML serializer (or the SOAP formatter). The XML serializer provides the ability to save and restore instances of classes. The saved data contains information about the structure of the class and its instance data. I believe that the .NET Framework already contains a prototype of the parser of the future.

It will be interesting to see how many of the features predicted or called for in this book will find their place in the next version of the .NET Framework (code-named Whidbey).
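The save-and-restore behavior of the XML serializer can be sketched as follows. The Employee class, its members, and the sample values are hypothetical and stand in for any application-specific type.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical application-specific class with data and behavior
public class Employee
{
    public int Id;
    public string LastName;

    public string Describe() { return Id + ": " + LastName; }
}

public static class SerializerSketch
{
    // Saves an instance as an XML document held in a string
    public static string Save(Employee emp)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Employee));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, emp);
            return writer.ToString();
        }
    }

    // Restores a typed instance, not a tree of XML nodes
    public static Employee Restore(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Employee));
        using (StringReader reader = new StringReader(xml))
        {
            return (Employee) serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        Employee emp = new Employee();
        emp.Id = 1;
        emp.LastName = "Smith";

        string xml = Save(emp);
        Employee copy = Restore(xml);

        // The restored object exposes methods as well as data
        Console.WriteLine(copy.Describe());
    }
}
```

Unlike a parser, Restore hands back an object on which Describe can be called directly, which is the kind of mapping from XML to application types that the text asks for.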
