12.07.2015 Views

Representing and utilizing DDI in relational databases

Representing and utilizing DDI in relational databases

Representing and utilizing DDI in relational databases

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> Utiliz<strong>in</strong>g <strong>DDI</strong> <strong>in</strong>Relational DatabasesContributorsAlerk Am<strong>in</strong>, CentERdata [a.am<strong>in</strong>@uvt.nl]Ingo Barkow, DIPF - Leibniz Institute for Educational Research <strong>and</strong> Educational Information[barkow@dipf.de]Stefan Kramer, Cornell Institute for Social <strong>and</strong> Economic Research [stefan.kramer@cornell.edu]David Schiller, Research Data Center (FDZ) of the German Federal Employment Agency (BA)at the Institute for Employment Research (IAB) [david.schiller@iab.de]Jeremy Williams, Cornell Institute for Social <strong>and</strong> Economic Research [jw568@cornell.edu]Table of contentsContributors ............................................................................................................................... 1Target Audience ......................................................................................................................... 2Problem Statement .................................................................................................................... 2Introduction ................................................................................................................................ 2<strong>DDI</strong> <strong>in</strong> Relational Databases vs. <strong>in</strong> XML: Pros <strong>and</strong> Cons, <strong>and</strong> Other Approaches ...................... 3<strong>Represent<strong>in</strong>g</strong> the <strong>DDI</strong> model with<strong>in</strong> a <strong>relational</strong> database ....................................................... 3<strong>Represent<strong>in</strong>g</strong> the <strong>DDI</strong> model <strong>in</strong> XML <strong>in</strong>stances ....................................................................... 5Issues with <strong>DDI</strong> specification changes <strong>in</strong> <strong>relational</strong> <strong>databases</strong> <strong>and</strong> <strong>in</strong> XML ............................. 6Consider<strong>in</strong>g a hybrid RDB-XML database approach for <strong>DDI</strong> .................................................. 7Model<strong>in</strong>g <strong>DDI</strong> <strong>in</strong> Relational Databases ....................................................................................... 7<strong>DDI</strong> elements .......................................................................................................................... 7XML hierarchy ........................................................................................................................ 8References ............................................................................................................................. 8Recursive structures ............................................................................................................... 8Substitution groups ................................................................................................................. 9Controlled vocabularies .......................................................................................................... 9Database IDs .........................................................................................................................10Advanced Cases .......................................................................................................................10Version<strong>in</strong>g (<strong>in</strong>clud<strong>in</strong>g late-bound references) .........................................................................10Model<strong>in</strong>g schemes which <strong>in</strong>clude other schemes ..................................................................11Multiple language support......................................................................................................12H<strong>and</strong>l<strong>in</strong>g unknown or external elements <strong>in</strong> <strong>DDI</strong> .....................................................................13Ensur<strong>in</strong>g Application Compatibility <strong>in</strong> Transferr<strong>in</strong>g <strong>DDI</strong> Between Databases .............................14A Look <strong>in</strong>to the Future: A More Abstract Representation of <strong>DDI</strong> ................................................16Conclusion ................................................................................................................................16Acknowledgments .....................................................................................................................17<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 1 of 17


currently uses XML-based structures to describe the content of the model. For <strong>DDI</strong> version 2 5 , aDocument Type Def<strong>in</strong>ition (DTD) was used; this was changed to an XML Schema (XSD) for <strong>DDI</strong>version 3 6 . These schemas are employed to structure metadata content <strong>in</strong> the form of <strong>DDI</strong><strong>in</strong>stances. Essentially <strong>DDI</strong> can represent metadata <strong>in</strong> the form of XML files based on the <strong>DDI</strong>XML Schema stored on a common file share, or can be put <strong>in</strong>to an XML database (like BaseX 7or eXist 8 ) to enable collaborative work with multiple users. Another possibility is to represent <strong>DDI</strong><strong>in</strong> <strong>relational</strong> <strong>databases</strong> (RDBs).It is obvious that <strong>DDI</strong> can only serve the scientific community if it is actively used by a sufficientnumber of stakeholders. In order to achieve this goal, the <strong>DDI</strong>-based documentation has to beeasy to underst<strong>and</strong> <strong>and</strong> easy to <strong>in</strong>tegrate <strong>in</strong>to the exist<strong>in</strong>g data structure of the data providers. Italso has to be compatible with future developments <strong>in</strong> the area of data storage. Relational<strong>databases</strong> are a widely used <strong>and</strong> flexible solution for data storage. Br<strong>in</strong>g<strong>in</strong>g <strong>DDI</strong> together withthe capability of <strong>relational</strong> database systems will promote both data storage for the purpose ofscientific research <strong>and</strong> the <strong>DDI</strong> st<strong>and</strong>ard itself.This paper outl<strong>in</strong>es the advantages <strong>and</strong> disadvantages of represent<strong>in</strong>g <strong>DDI</strong> <strong>in</strong> <strong>relational</strong><strong>databases</strong> as an alternative to an XML structure. In addition, it discusses the benefits <strong>and</strong>drawbacks of us<strong>in</strong>g <strong>relational</strong> <strong>databases</strong> for the <strong>DDI</strong> model, gives some h<strong>in</strong>ts about futuresolutions, provides a short <strong>in</strong>troduction on the topic of how to model <strong>DDI</strong>, discusses applicationcompatibility, <strong>and</strong> po<strong>in</strong>ts out some challenges <strong>in</strong> “advanced cases.”<strong>DDI</strong> <strong>in</strong> Relational Databases vs. XML: Pros <strong>and</strong> Cons,<strong>and</strong> Other ApproachesThe idea of stor<strong>in</strong>g <strong>DDI</strong> <strong>in</strong>stances <strong>in</strong> a <strong>relational</strong> database, as opposed to an XML database, isoften a hot topic among developers. From the perspective of <strong>DDI</strong> solely as a “storage” st<strong>and</strong>ard,an XML database has certa<strong>in</strong> advantages. But when th<strong>in</strong>k<strong>in</strong>g of <strong>DDI</strong> as a transport formatbetween applications, the actual storage format for each application should be the one that bestmeets that application’s needs. In many cases, a <strong>relational</strong> database is the better option. Thefollow<strong>in</strong>g section of the paper demonstrates the advantages of us<strong>in</strong>g a <strong>relational</strong> database.<strong>Represent<strong>in</strong>g</strong> the <strong>DDI</strong> model with<strong>in</strong> a <strong>relational</strong> databaseThe first reason to consider a <strong>relational</strong> database model for <strong>DDI</strong> arises from an organizationalpo<strong>in</strong>t of view. Many agencies have been stor<strong>in</strong>g primary data <strong>and</strong> associated metadata fortimespans measured <strong>in</strong> decades, <strong>and</strong> a very common storage method is the <strong>relational</strong>database, as its tabular structure is ideal for stor<strong>in</strong>g rectangular data result<strong>in</strong>g from datacollection activities. Therefore, those agencies have a high level of expertise <strong>in</strong> us<strong>in</strong>g the5 http://www.ddialliance.org/Specification/<strong>DDI</strong>-Codebook/6 http://www.ddialliance.org/Specification/<strong>DDI</strong>-Lifecycle/7 http://basex.org/8 http://exist.sourceforge.net/<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 3 of 17


elational database model. Chang<strong>in</strong>g their present table-based metadata st<strong>and</strong>ard (whateverthat may be) to a <strong>DDI</strong> representation which is also table-based should thus be <strong>in</strong>tuitive to them.Us<strong>in</strong>g XML for storage, on the other h<strong>and</strong>, might be problematic as these agencies do not havethe experience or resources to convert the metadata <strong>and</strong> change the surround<strong>in</strong>g tools to thenew structure. XML may be known to them, but mostly as an import or export format. Theymight therefore be reluctant to utilize <strong>DDI</strong> <strong>in</strong> XML format for reasons of transformation costs orleav<strong>in</strong>g their area of expertise.In addition to organizational considerations, there are also structural advantages to us<strong>in</strong>g a<strong>relational</strong> database. Therefore, agencies often represent their microdata <strong>in</strong>ternally <strong>in</strong> the form ofa <strong>relational</strong> database as a central stor<strong>in</strong>g mechanism because it is ideal for process<strong>in</strong>grectangular data (e.g., SPSS data files, ASCII data files) <strong>in</strong> tables <strong>and</strong> can manage the filestructures of multiple studies by <strong>in</strong>put <strong>and</strong> output processes. If the metadata are stored <strong>in</strong> thesame database as the microdata, the movement from metadata to data output worksseamlessly as native database methods such as connect<strong>in</strong>g tables by referential <strong>in</strong>tegrity canbe used. The metadata can be l<strong>in</strong>ked to the associated research data. A user can therefore firstsearch the metadata <strong>and</strong> then move easily to the connected data. This model can even beextended to create custom data extracts (like a variable shopp<strong>in</strong>g basket), where an extract ofthe dataset, <strong>in</strong>clud<strong>in</strong>g the related metadata as a k<strong>in</strong>d of codebook, can be selected <strong>and</strong>downloaded, e.g., via a Web <strong>in</strong>terface. In an XML-based <strong>DDI</strong> environment this can also bedone, but with much more effort, as two different structural models have to be merged. In aworst case scenario, an external service has to l<strong>in</strong>k between an XML metadata structure basedon <strong>DDI</strong> <strong>and</strong> an ASCII file conta<strong>in</strong><strong>in</strong>g the microdata.Relational <strong>databases</strong> have existed on the market for decades, <strong>and</strong> have led to the developmentof many tools for work<strong>in</strong>g with them. If one extends the idea of comb<strong>in</strong><strong>in</strong>g metadata <strong>and</strong>microdata <strong>in</strong>to a <strong>relational</strong> database model, then the next step can be chang<strong>in</strong>g the databasemodel <strong>in</strong>to an analytical one. Relational <strong>databases</strong> can be enhanced to become analytical ormultidimensional <strong>databases</strong> (e.g., onl<strong>in</strong>e analytical process<strong>in</strong>g [OLAP] cubes 9 ). With this model,enhanced analytical or statistical methods from the area of Bus<strong>in</strong>ess Intelligence (e.g., datam<strong>in</strong><strong>in</strong>g, process m<strong>in</strong><strong>in</strong>g) can be applied to the data. These methods might lead to completelynew research questions <strong>and</strong> new knowledge. This change of model would be difficult to realize<strong>in</strong> a complete XML-based environment.A less complex example is stor<strong>in</strong>g more than one survey <strong>in</strong> a structure. In a <strong>relational</strong> database,the tabular structure can be designed to support multiple surveys <strong>in</strong> one structure by add<strong>in</strong>gadditional adm<strong>in</strong>istrative tables. In a <strong>DDI</strong> structure based on XML files, this is difficult torepresent; <strong>and</strong> it is difficult <strong>in</strong> an XML database, as the structure is largely based on the orig<strong>in</strong>al<strong>DDI</strong> XML schema, which normally (as it is file-based) dem<strong>and</strong>s a separate XML file for eachsurvey. In an XML database structure each survey on its own has to be represented as aseparate XML database or at least as a separate <strong>in</strong>stance of an XML database (if the XML9 http://www.mendeley.com/research/provid<strong>in</strong>g-olap-onl<strong>in</strong>e-analytical-process<strong>in</strong>g-touseranalysts-an-it-m<strong>and</strong>ate/<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 4 of 17


database supports <strong>in</strong>stances). The problem can be solved by add<strong>in</strong>g additional programm<strong>in</strong>grout<strong>in</strong>es surround<strong>in</strong>g the XML structure to emulate referential <strong>in</strong>tegrity by XML database l<strong>in</strong>kage.Nevertheless, the <strong>relational</strong> database offers these possibilities natively or with much less effort.Performance is not addressed <strong>in</strong> this paper, as the authors currently cannot prove that work<strong>in</strong>gwith <strong>DDI</strong> <strong>in</strong> a <strong>relational</strong> database is always faster than <strong>in</strong> an XML database or XML file structure.The performance of <strong>DDI</strong> with<strong>in</strong> different systems depends heavily on the structure used, <strong>and</strong>therefore benchmark<strong>in</strong>g would not make much sense, as the results would not berepresentative of different <strong>in</strong>stances or different database products. The authors believe the<strong>relational</strong> database might have some advantages because of its long existence <strong>and</strong> heavyperformance optimizations (e.g., <strong>in</strong>dexes, stored procedures, user-def<strong>in</strong>ed functions, managedcode, file groups, raw device mapp<strong>in</strong>g), which for the most part do not exist <strong>in</strong> XML <strong>databases</strong>,but this impression cannot be verified <strong>and</strong> is therefore not further discussed here.A representation of metadata with<strong>in</strong> a <strong>relational</strong> database can also be <strong>in</strong>dependent of the <strong>DDI</strong>version or <strong>in</strong>stance. Some agencies use an <strong>in</strong>ternal structure for their metadata that is notbased on <strong>DDI</strong> but conta<strong>in</strong>s all the necessary <strong>in</strong>formation needed to exchange data with otheragencies. For them, <strong>DDI</strong> <strong>in</strong> its XML form can be used as an import <strong>and</strong> export format, where thenecessary files are read or created by extract, transform or load processes (so called ETLprocesses).For example, ICPSR offers an “Export Study-level metadata” (of <strong>DDI</strong> 2.1 or 3.1, asof Oct. 2011) function for studies <strong>in</strong> its data archive 10 <strong>in</strong> this manner. A possible advantage ofthis method would be that the surround<strong>in</strong>g processes can always be adapted to the desired orrequired <strong>DDI</strong> version(s), which is far less challeng<strong>in</strong>g than updat<strong>in</strong>g native <strong>DDI</strong> XML <strong>in</strong>stances tothe appropriate version. Nevertheless, a major drawback of <strong>relational</strong> <strong>databases</strong> import<strong>in</strong>g XMLfile structures is the possibility for <strong>in</strong>formation loss. If for some nodes with<strong>in</strong> the XML <strong>in</strong>stancethere is no representation with<strong>in</strong> the database structure, this content will simply be lost dur<strong>in</strong>gthe import process, or the import will not work at all if there is a structural check disallow<strong>in</strong>gthese k<strong>in</strong>ds of partial imports.In a <strong>DDI</strong>-RDB model all import <strong>and</strong> export processes have to be h<strong>and</strong>led by ETL (Extract-Transform-Load) processes. This means <strong>DDI</strong> XML structures have to be parsed <strong>and</strong>transformed. If an unknown element comes up there are essentially two strategies to h<strong>and</strong>le this-- discard or store. Discard means the loss of <strong>in</strong>formation which can be considered bad tool<strong>in</strong>g.Stor<strong>in</strong>g also causes a problem as this <strong>in</strong>volves a high degree of program logic. A strategy couldbe that the orig<strong>in</strong>al <strong>DDI</strong> XML structure is kept as backup <strong>and</strong> can be attached to a later export.However, here the danger of creat<strong>in</strong>g errors <strong>in</strong> the new <strong>DDI</strong>-XML structure is even higher asthere might be also problems with <strong>DDI</strong> version<strong>in</strong>g when re-assembl<strong>in</strong>g the metadata.<strong>Represent<strong>in</strong>g</strong> the <strong>DDI</strong> model <strong>in</strong> XML <strong>in</strong>stancesAlthough the <strong>relational</strong> database conta<strong>in</strong>s a lot of additional features, the “native” way torepresent the <strong>DDI</strong> content is to store <strong>DDI</strong> as an <strong>in</strong>stance specified by the XML schema. Thisleads to the logical advantage of a direct representation of the content <strong>in</strong> the correct schema. A10 http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/<strong>in</strong>dex.jsp<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 5 of 17


<strong>DDI</strong> <strong>in</strong>stance us<strong>in</strong>g the full set of <strong>DDI</strong> elements will be far superior to a construct with<strong>in</strong> a<strong>relational</strong> database, as not all functionalities of <strong>DDI</strong> can be represented easily <strong>in</strong> the latter.Problems arise with a <strong>relational</strong> database, as will be shown further below, <strong>in</strong> represent<strong>in</strong>gversion<strong>in</strong>g <strong>in</strong> <strong>DDI</strong> 11 , or po<strong>in</strong>t<strong>in</strong>g to another agency by us<strong>in</strong>g referential URNs. In native XML thesolution can be quite easily expressed, but <strong>in</strong> <strong>relational</strong> <strong>databases</strong> this is possible only withheavy additional programm<strong>in</strong>g (e.g., <strong>in</strong>crement<strong>in</strong>g version<strong>in</strong>g by surround<strong>in</strong>g Web services orus<strong>in</strong>g analytical <strong>databases</strong> with slowly chang<strong>in</strong>g dimensions to represent the time or version).However, most agencies do not use <strong>DDI</strong> <strong>in</strong> its full specification, but only a small subset ofelements; here, the advantages of the XML approach may not weigh heavily. Essentially, if anagency uses the full <strong>DDI</strong> specification, the XML implementation is superior as this is the bestpossibility to express <strong>DDI</strong> as designed by the <strong>DDI</strong> Alliance.Issues with <strong>DDI</strong> specification changes <strong>in</strong> <strong>relational</strong> <strong>databases</strong> <strong>and</strong> <strong>in</strong>XMLA problem all implementations of <strong>DDI</strong> share is h<strong>and</strong>l<strong>in</strong>g new versions of the specification (e.g.,<strong>DDI</strong> 3.1 to <strong>DDI</strong> 3.2). If a new version of <strong>DDI</strong> is extended with new structures, or there arechanges <strong>in</strong> the structure itself, this causes significant problems <strong>in</strong> implementation. In the case ofthe <strong>DDI</strong>-RDB, this means construct<strong>in</strong>g a new import <strong>and</strong> export mechanism for the new version.Furthermore it might lead to a change <strong>in</strong> the overall database model to support both versions. Ina worst case scenario, the structures are not compatible anymore, leav<strong>in</strong>g the organization withtwo different <strong>databases</strong> or at least database partitions for stor<strong>in</strong>g the <strong>in</strong>formation, which is aconsiderable problem <strong>in</strong> data management.But the <strong>DDI</strong>-XML method faces challenges with specification changes as well. Either the <strong>DDI</strong>-XML structure has to be transformed, or there have to be multiple versions of <strong>DDI</strong>-XML <strong>in</strong> theXML database. This leads also to changes <strong>in</strong> the application logic of the associated tool. If onechooses the simple solution from above <strong>and</strong> only changes the nodes which are known to theagency, aga<strong>in</strong> this leads to the <strong>in</strong>consistency problems mentioned before.A sizable advantage for the <strong>DDI</strong>-XML representation here is its hierarchical structure. <strong>DDI</strong>-XMLis capable of express<strong>in</strong>g complex structures <strong>in</strong> an organized manner <strong>and</strong> can use built-<strong>in</strong> XMLfeatures like <strong>in</strong>heritance or validation aga<strong>in</strong>st the schema. A <strong>DDI</strong>-RDB has to use additionalprogram logic to emulate this behavior. In some cases, <strong>in</strong>heritance can only be expressed byus<strong>in</strong>g complex jo<strong>in</strong> operations between tables or self-jo<strong>in</strong>s with<strong>in</strong> a table, lead<strong>in</strong>g to a decrease<strong>in</strong> speed while access<strong>in</strong>g the <strong>in</strong>formation. These performance issues can be decreased by us<strong>in</strong>gadvanced database optimization techniques like partitioned view, partitioned tables or managedcode, but <strong>in</strong> the end there is still a structural disadvantage.11 See also: <strong>DDI</strong> Best Practice on Version<strong>in</strong>g <strong>and</strong> Publication -http://dx.doi.org/10.3886/<strong>DDI</strong>BestPractices08<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 6 of 17


Consider<strong>in</strong>g a hybrid RDB-XML database approach for <strong>DDI</strong>As described before, the import process <strong>in</strong>to a <strong>relational</strong> database from an XML structure mightlead to problems, or fail altogether, if there is not sufficient mapp<strong>in</strong>g between certa<strong>in</strong> nodes <strong>and</strong>the tabular structure. The question is what will happen to the parts which cannot be imported?Do they get discarded or are they stored externally (e.g., <strong>in</strong> a table conta<strong>in</strong><strong>in</strong>g str<strong>in</strong>gs whichwere not imported or keep<strong>in</strong>g the orig<strong>in</strong>al file as backup)? The disadvantage of h<strong>and</strong>l<strong>in</strong>g this<strong>in</strong>formation externally would be very complex import <strong>and</strong> export h<strong>and</strong>l<strong>in</strong>g. Furthermore,search<strong>in</strong>g long str<strong>in</strong>gs with<strong>in</strong> one table cell leads to a major loss <strong>in</strong> performance as <strong>relational</strong>tables are optimized for short cell lengths.Another way to keep the imported XML structure <strong>in</strong>tact without los<strong>in</strong>g performance or logicallosses would be to use the XML features of commercial <strong>databases</strong>. Some database systems,such as Microsoft SQL Server 2008 R2 12 or Oracle 11g 13 , have added support for manag<strong>in</strong>gXML natively with<strong>in</strong> the cell structure of their tables. This <strong>in</strong>cludes advanced features like XML<strong>in</strong>dexes, XML data type (thus XML will not be h<strong>and</strong>led as str<strong>in</strong>g, but recognized as XML) <strong>and</strong>XPath search expressions with<strong>in</strong> table cells.Us<strong>in</strong>g the hybrid approach, the advantages of <strong>relational</strong> <strong>databases</strong> (e.g., multiple studies, highperformance) can be comb<strong>in</strong>ed with the flexibility of XML <strong>databases</strong> <strong>and</strong> enable easier h<strong>and</strong>l<strong>in</strong>gof <strong>DDI</strong> between different systems. A thorough evaluation of this approach, however, is outsideof the scope of this paper.Model<strong>in</strong>g <strong>DDI</strong> <strong>in</strong> Relational DatabasesThere are many strategies for represent<strong>in</strong>g <strong>DDI</strong>, based on its XML expression, <strong>in</strong> a <strong>relational</strong>database. This section provides some <strong>in</strong>formation about how to model various <strong>DDI</strong> elements<strong>and</strong> relationships. These ideas can be <strong>in</strong>corporated <strong>in</strong>to the strategy chosen for a particularapplication. However, there are many other factors that should also be considered. These<strong>in</strong>clude requirements for performance <strong>and</strong> scalability <strong>and</strong> which <strong>DDI</strong> versions to support, as wellas support with<strong>in</strong> the chosen database, programm<strong>in</strong>g language, <strong>and</strong> application framework. Allof these must be considered when design<strong>in</strong>g a database schema for use with<strong>in</strong> a particularapplication. The follow<strong>in</strong>g examples are based on <strong>DDI</strong> version 3 14 , but the techniques usedshould apply to future versions of <strong>DDI</strong>.<strong>DDI</strong> elementsMost <strong>DDI</strong> elements consist of a number of attributes <strong>and</strong> sub-elements. For <strong>in</strong>stance, theVariable element has several attributes of different types, <strong>in</strong>clud<strong>in</strong>g id (str<strong>in</strong>g), isGeographic(boolean), urn (URI), <strong>and</strong> more. Additionally, the Variable element conta<strong>in</strong>s many sub-elements.12 http://msdn.microsoft.com/en-us/library/ms189887.aspx13 http://www.oracle.com/technetwork/database/features/xmldb/<strong>in</strong>dex.html14 http://www.ddialliance.org/Specification/<strong>DDI</strong>-Lifecycle/<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 7 of 17


These <strong>in</strong>clude the VariableName, Label, Description, ConceptReference, Representation, <strong>and</strong>more.The natural representation for these types of elements <strong>in</strong> a <strong>relational</strong> database is a table. In aVariables table, each row would represent a s<strong>in</strong>gle <strong>DDI</strong> Variable. The columns <strong>in</strong> the Variablestable would be used to store the various attributes <strong>and</strong> sub-elements of the <strong>DDI</strong> Variableelement. For most simple types (such as str<strong>in</strong>gs, numbers, booleans, dates/times), this caneasily be represented via the correspond<strong>in</strong>g database field types (varchars/texts, <strong>in</strong>ts, booleans,datetime). This works well for fields that can only be used once (required or optional) <strong>in</strong> a table.For example, the id <strong>and</strong> isGeographic attributes are only used once <strong>in</strong> the Variable element. Forsub-elements such as Label, <strong>DDI</strong> allows 0 to unlimited Labels for each Variable. Model<strong>in</strong>g this<strong>in</strong> a <strong>relational</strong> database will require a more complex structure than a simple str<strong>in</strong>g field. Thefollow<strong>in</strong>g sections describe how to model these complex relationships.XML hierarchyThe most basic relationship between <strong>DDI</strong> elements comes from the XML structure. In thesimplest case, there are many Schemes <strong>in</strong> <strong>DDI</strong> (such as VariableScheme orControlConstructScheme) which are lists. Each VariableScheme can have any number ofVariables, but each Variable belongs to only one VariableScheme. To model these types ofrelationships, the best option is usually a one-to-many relationship, where the foreign keyrelates the two tables. For example, to model VariableScheme <strong>and</strong> Variable, the Variables tablewould have a variable_scheme_id field that references the appropriate record <strong>in</strong> theVariableSchemes table. This strategy works not only for <strong>DDI</strong> Schemes, but for most elements <strong>in</strong>the <strong>DDI</strong> hierarchy. For example, a StudyUnit has many DataCollections, <strong>and</strong> a DataCollectionhas many CollectionEvents. These can also be modeled with one-to-many relationships.ReferencesA major feature of <strong>DDI</strong> is the ability to reuse elements, usually via References. For example, aVariable can reference the QuestionItem (or QuestionItems) it is based upon. Additionally, aQuestionItem can be referenced by many Variables. A many-to-many relationship is required toproperly model <strong>DDI</strong> References. This is accomplished <strong>in</strong> an RDB us<strong>in</strong>g a jo<strong>in</strong> table. For eachpossible <strong>DDI</strong> reference, a separate jo<strong>in</strong> table is needed to store the reference. While a jo<strong>in</strong> tableworks f<strong>in</strong>e for specific references, it does not work well for “late-bound” references, which aresupported <strong>in</strong> <strong>DDI</strong>. These are discussed <strong>in</strong> the Advanced Cases section of this paper.Recursive structuresThere are some <strong>DDI</strong> elements which can have sub-elements of the same type. Some examplesof this are Groups <strong>and</strong> ControlConstructs. These types of elements have a one-to-manyrelationship with themselves. While this can be modeled like any other one-to-many element,there are several other possibilities that may improve performance. Some common strategiesfor implement<strong>in</strong>g trees <strong>in</strong> <strong>relational</strong> <strong>databases</strong> are Path Enumeration, Nested Sets, NestedIntervals, or solutions us<strong>in</strong>g Common Table Expressions. The best option depends on many<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 8 of 17


factors, <strong>in</strong>clud<strong>in</strong>g the nature of the application, programm<strong>in</strong>g language/library support, <strong>and</strong>database support.Substitution groupsThere are many <strong>DDI</strong> elements that serve as placeholders for which several options arepossible. One example is ResponseDoma<strong>in</strong>. A ResponseDoma<strong>in</strong> may not be directly used <strong>in</strong> a<strong>DDI</strong> Instance. Instead, it may be substituted with the appropriate element, such as aCategoryDoma<strong>in</strong>, CodeDoma<strong>in</strong>, DateTimeDoma<strong>in</strong>, GeographicDoma<strong>in</strong>, NumericDoma<strong>in</strong>, orTextDoma<strong>in</strong>. Other examples of substitution groups <strong>in</strong>clude ControlConstruct <strong>and</strong>ValueRepresentation.<strong>DDI</strong> substitution groups can be implemented <strong>in</strong> an RDB us<strong>in</strong>g <strong>in</strong>heritance. The placeholderelement (such as ResponseDoma<strong>in</strong>) becomes the parent class <strong>in</strong> the RDB, <strong>and</strong> the othersubstitution elements (such as CategoryDoma<strong>in</strong>, etc.) become the child classes. There areseveral options for implement<strong>in</strong>g <strong>in</strong>heritance, <strong>in</strong>volv<strong>in</strong>g either a s<strong>in</strong>gle table to hold all classes,or multiple tables.In the s<strong>in</strong>gle-table solution, one should create a s<strong>in</strong>gle table to hold the entire class hierarchy.There should be columns for all properties of all of the possible child classes. Many of thesecolumns will be NULL because they do not apply to a particular record. For example, a row thatis a CategoryDoma<strong>in</strong> will have NULL fields for all of the columns that apply to CodeDoma<strong>in</strong>s,TextDoma<strong>in</strong>s, etc. This type of solution is <strong>in</strong>efficient with respect to space, but it elim<strong>in</strong>ates jo<strong>in</strong>s<strong>and</strong> unions as all properties for a record are <strong>in</strong> a s<strong>in</strong>gle row <strong>in</strong> the table.In a multiple-table solution, a table can be created for each child class, with the appropriatecolumns to store the fields for that class. Additionally, each child table will have a foreign keypo<strong>in</strong>t<strong>in</strong>g to a record <strong>in</strong> a parent-class table. For example, the ResponseDoma<strong>in</strong> table will bepo<strong>in</strong>ted to by each of the child tables. This solution is more space efficient, but may requireseveral jo<strong>in</strong>s to retrieve records.The best solution depends on many factors, <strong>in</strong>clud<strong>in</strong>g the needs of the application, as well asthe programm<strong>in</strong>g language <strong>and</strong> database support.Controlled vocabulariesThere are several <strong>DDI</strong> fields whose values should come from a Controlled Vocabulary, which ismanaged by the <strong>DDI</strong> Alliance’s Controlled Vocabularies Work<strong>in</strong>g Group 15 . As the vocabulary isa list of items, each Controlled Vocabulary should be represented <strong>in</strong> its own table. The variouselements which need to refer to a row <strong>in</strong> this table will use a foreign key field, us<strong>in</strong>g a st<strong>and</strong>ardone-to-many relationship.15 http://www.ddialliance.org/alliance/work<strong>in</strong>g-groups#cvwg<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 9 of 17


Database IDsEach record <strong>in</strong> a table needs an ID. This can be an <strong>in</strong>ternal database ID (often created with auto<strong>in</strong>crement), or it can be a <strong>DDI</strong> ID. For performance reasons, it is often better to use an <strong>in</strong>ternaldatabase ID, us<strong>in</strong>g the fastest type for jo<strong>in</strong>s. This allows for auto <strong>in</strong>crement to create unique IDswhen creat<strong>in</strong>g new records, <strong>and</strong> leads to extremely fast jo<strong>in</strong>s when the ID columns are <strong>in</strong>dexed.The <strong>in</strong>ternal database ID can be an <strong>in</strong>teger (or unsigned <strong>in</strong>teger) 16 , though many <strong>databases</strong>support universally unique identifiers (UUIDs) 17 . Us<strong>in</strong>g UUIDs has certa<strong>in</strong> advantages, as thesecan be used to generate completely unique <strong>DDI</strong> IDs.The <strong>DDI</strong> ID can be stored <strong>in</strong> an additional column. As <strong>DDI</strong> IDs can conta<strong>in</strong> both letters <strong>and</strong>numbers, they would have to be stored as varchar or text. This makes them less fast for jo<strong>in</strong>s,but still useful for search<strong>in</strong>g for elements, based on <strong>DDI</strong> ID.Advanced CasesThe follow<strong>in</strong>g discussion provides more depth on some aspects of <strong>DDI</strong> management that wereaddressed earlier.Version<strong>in</strong>g (<strong>in</strong>clud<strong>in</strong>g late-bound references)When there are multiple versions of an element (such as a QuestionItem), another element(such as a Variable) can reference a specific version of the QuestionItem, or it can use a “latebound”reference. In <strong>DDI</strong>, this is done by us<strong>in</strong>g the letter “L” <strong>in</strong> the version (e.g., “1.L” or“3.0.0.L”), to reference the latest version of an element. <strong>DDI</strong> supports both version number<strong>in</strong>g,<strong>and</strong> also version timestamps. The timestamp aspect can be automatically managed by some<strong>databases</strong>, such as SQL Server 2008 R2 or Oracle 11i. Nevertheless there is currently nosimilar mechanism for represent<strong>in</strong>g the version number<strong>in</strong>g. Furthermore, chang<strong>in</strong>g the versionnumber of an element <strong>in</strong> <strong>DDI</strong> very often leads to a cascad<strong>in</strong>g effect where the version number ofother elements has to be <strong>in</strong>creased as well.The version numbers will therefore have to be h<strong>and</strong>led programmatically. Basically there arethree options to solve this:1. A mechanism native to <strong>databases</strong> would be to create an array of different triggers on thetables to <strong>in</strong>crement the version numbers of elements as well as ma<strong>in</strong>ta<strong>in</strong> copies of theold elements <strong>in</strong> history tables (as there might be references to older versions of theelements). This leads to a lot of performance issues <strong>in</strong> runn<strong>in</strong>g SQL INSERT orUPDATE statements. Furthermore, as triggers can be considered unmanaged code(very often they are only present <strong>in</strong> the database <strong>and</strong> not <strong>in</strong> the source code controlsystem of the outer programm<strong>in</strong>g framework), they are hard to document <strong>and</strong> regularlycause problems <strong>in</strong> larger programm<strong>in</strong>g teams.16 http://en.wikipedia.org/wiki/Integer_(computer_science)17 http://en.wikipedia.org/wiki/Universally_unique_identifier<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 10 of 17


2. A way to solve this issue would be the option of us<strong>in</strong>g managed code (e.g., an externalWeb service programmed <strong>in</strong> C# access<strong>in</strong>g a SQL Server via the .NET Framework orsimilar solutions <strong>in</strong> JAVA or PHP frameworks). This does not elim<strong>in</strong>ate the problem ofperformance issues, but at least makes the code more manageable <strong>and</strong> allows theusage of repositories.3. The last option would be us<strong>in</strong>g a feature of Data Warehous<strong>in</strong>g to represent the version<strong>and</strong> validity by slowly chang<strong>in</strong>g dimensions (see, e.g., Kimball 2002) 18 . This means thetabular structure gets additional fields which specify the start <strong>and</strong> end po<strong>in</strong>t of the validityof an object as well as the version number. Although this can also be done <strong>in</strong> <strong>relational</strong><strong>databases</strong>, this technique is orig<strong>in</strong>ally designed for the dimension tables with<strong>in</strong> analytical<strong>databases</strong> <strong>and</strong> therefore might cause huge SQL statements to be represented <strong>in</strong> a fully<strong>relational</strong> structure. A solution could be to use an analytical database for the history dataor ETL processes to <strong>DDI</strong> <strong>and</strong> a <strong>relational</strong> database for the current version.Another problem of version<strong>in</strong>g is that <strong>DDI</strong> allows for unlimited length version numbers (e.g.,a.b.c.d.e.f...), which would normally be implemented as a text field. Str<strong>in</strong>g process<strong>in</strong>g with<strong>in</strong>table cells (even us<strong>in</strong>g <strong>in</strong>dex<strong>in</strong>g) is generally slow with<strong>in</strong> <strong>databases</strong> <strong>and</strong> should therefore beavoided. A way to limit the search burden would be to limit the depth of the version numbers so<strong>in</strong>teger fields can be used. This will also allow late b<strong>in</strong>d<strong>in</strong>g to be done via SQL WHEREstatements, rather than via str<strong>in</strong>g process<strong>in</strong>g.As version<strong>in</strong>g is one of the key problems <strong>in</strong> us<strong>in</strong>g <strong>DDI</strong> <strong>in</strong> a <strong>relational</strong> database <strong>in</strong>frastructure,many of these problems have to be discussed among application developer, database designer<strong>and</strong> database adm<strong>in</strong>istrator to f<strong>in</strong>d the fitt<strong>in</strong>g solution between functionality <strong>and</strong> performance <strong>in</strong>the software environment <strong>in</strong> question. As different <strong>databases</strong> have different features to <strong>in</strong>creaseperformance (e.g., partitioned tables for history process<strong>in</strong>g <strong>in</strong> SQL Server), the choice of thebest of these three options has to be clarified.Model<strong>in</strong>g schemes which <strong>in</strong>clude other schemesThere are many schemes <strong>in</strong> <strong>DDI</strong> which can <strong>in</strong>clude other schemes by reference. This featurecan be used when creat<strong>in</strong>g a new version of a scheme, or even when <strong>in</strong>clud<strong>in</strong>g elements from acompletely unrelated scheme.There are two ma<strong>in</strong> methods for implement<strong>in</strong>g this <strong>in</strong> a <strong>relational</strong> database. The first is toimplement a structure very similar to the <strong>DDI</strong> XML structure. The second is to “resolve” all of the<strong>in</strong>cluded schemes, <strong>and</strong> just store the “complete” version. The descriptions below will useVariableSchemes as an example, but the methods can be extended to other schemes.Implement<strong>in</strong>g the first method <strong>in</strong>volves several tables. A VariableSchemeReferences table isused to store the references. Each VariableSchemeReference should <strong>in</strong>clude both the “target”VariableScheme that is be<strong>in</strong>g referenced, as well as the “source” VariableScheme that will18 Ralph Kimball, Mary Ross: The Data Warehouse Toolkit. The Complete Guide to DimensionalModel<strong>in</strong>g. 2nd Edition. John Wiley & Sons, New York : 2002.<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 11 of 17


<strong>in</strong>clude the target. This makes the VariableSchemeReferences table <strong>in</strong> effect a many-to-manyjo<strong>in</strong> table for the VariableSchemes table to itself.Each VariableSchemeReference can conta<strong>in</strong> several Exclude elements, which should be stored<strong>in</strong> a VariableExcludes table. Each VariableExclude belongs to a VariableSchemeReference,<strong>and</strong> has a many-to-many relationship with the Variables table, which should be modeled with ajo<strong>in</strong> table. The VariableSchemeReference should also <strong>in</strong>clude an ItemMap, which po<strong>in</strong>ts to thechanged elements. This is modeled by a VariableItemMaps table, where each row conta<strong>in</strong>s theID of a VariableSchemeReference, a source variable <strong>in</strong> the Variables table, a target variable <strong>in</strong>the Variables table, <strong>and</strong> a Correspondence. The correspondence can either be modeled withone or more text fields, or its own table.While the above method of model<strong>in</strong>g the Reference closely follows <strong>DDI</strong>, it comes at a cost.There are a number of tables to manage, which <strong>in</strong>creases application complexity. Additionally,try<strong>in</strong>g to figure out the elements <strong>in</strong> a VariableScheme becomes a very complex <strong>and</strong> resource<strong>in</strong>tensiveprocess, with many queries required to comb<strong>in</strong>e all of the elements to get one list ofvariables.An alternative method is to store only the “resolved” schemes <strong>and</strong> <strong>in</strong>stead of a one-to-manyrelationship between VariableSchemes <strong>and</strong> Variables, use a many-to-many relationship. In thismanner, each Variable will belong to all of the schemes that reference it. Us<strong>in</strong>g just the jo<strong>in</strong>table, it is very easy to list all of the Variables that belong to the VariableScheme. All importantdescriptive fields such as the ItemMap, Correspondence, etc., can be stored <strong>in</strong> the jo<strong>in</strong> table,document<strong>in</strong>g how a Variable is <strong>in</strong>cluded <strong>in</strong> the VariableScheme. While this second methoddeviates more from <strong>DDI</strong>, it has many advantages for application development <strong>and</strong> performance.Most read/write operations on Variables <strong>and</strong> VariableSchemes become much faster, <strong>and</strong> themodel is much simpler to underst<strong>and</strong> <strong>and</strong> ma<strong>in</strong>ta<strong>in</strong>.Multiple language supportMost text fields <strong>in</strong> <strong>DDI</strong> have support for multiple languages. This is usually implemented byrepeat<strong>in</strong>g the element, with a different locale for each element.Support for multiple languages is extremely important <strong>in</strong> applications, <strong>and</strong> therefore almostevery database <strong>and</strong> application framework has some support for hold<strong>in</strong>g localized version ofstr<strong>in</strong>gs. Because of the prevalence of support for this, the best option is usually to use themechanism supported by the application framework. These text str<strong>in</strong>gs can be processed by thetools offered with<strong>in</strong> the framework <strong>and</strong> stored <strong>in</strong> additional tables attached to the RDB structure.Many-to-many relationships can furthermore provide the means to store multiple translations <strong>in</strong>multiple languages for the <strong>in</strong>dividual item.In <strong>DDI</strong>, one should first identify the text str<strong>in</strong>gs that will be translatable <strong>in</strong> the application -- thiswill usually be Labels, Descriptions, QuestionText, <strong>and</strong> other fields – <strong>and</strong> then implement thesefields as the application framework recommends to enable multiple language support. Analternative method would be us<strong>in</strong>g an external st<strong>and</strong>ard for translation like the XML Localization<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 12 of 17


Interchange File Format (XLIFF). This XML schema is used by several translation softwaresuites (e.g., Trados or OLT - Open Language Translator). A process between the <strong>relational</strong>database <strong>and</strong> the XLIFF format could work <strong>in</strong> a manner similar to export<strong>in</strong>g a <strong>DDI</strong> structure outof the database. The export creates an XLIFF file with the str<strong>in</strong>gs to be translated, <strong>and</strong> anexternal tool like Trados can be used to translate the content. Afterward the str<strong>in</strong>gs can be reimportedto the appropriate places <strong>in</strong> the correspond<strong>in</strong>g translation tables.H<strong>and</strong>l<strong>in</strong>g unknown or external elements <strong>in</strong> <strong>DDI</strong>From the perspective of a <strong>relational</strong> database user, the biggest advantage of us<strong>in</strong>g the <strong>DDI</strong>-XMLmodel is easier exchange of metadata with another agency by import<strong>in</strong>g <strong>and</strong> export<strong>in</strong>g thatmetadata <strong>in</strong>to/from the database. Nevertheless, there are also some limitations of this process.As long as the XML schema of <strong>DDI</strong> is not violated, a <strong>DDI</strong>-XML based implementation can simplyimport all the code from other agencies, although the elements of the <strong>DDI</strong> schema might beunknown as the agency chose not to implement the full set of <strong>DDI</strong>. These unknown <strong>DDI</strong>elements <strong>in</strong> the XML file structure will not be processed by the tools of the import<strong>in</strong>g agency, butsimply ignored. However, they can still reside untouched <strong>in</strong> the XML file or XML database, readyto be added when an export takes place. When the metadata content is modified <strong>and</strong> laterexported, all the <strong>in</strong>formation from the orig<strong>in</strong>al import will still be available <strong>in</strong> the structure readyto be processed by the next agency which might support these elements. The XML code cantherefore more or less “pass through” agencies as long as it is not modified or referenced.Nevertheless, there is a danger of <strong>in</strong>consistency. The previous agency might have changedelements which could have <strong>in</strong> the full set of <strong>DDI</strong> an impact on other elements. As the tools of theimport<strong>in</strong>g agency are only aware of the elements they know <strong>and</strong> do not modify the unknownelements, but only export them as they were imported orig<strong>in</strong>ally, the result is an <strong>in</strong>ternal<strong>in</strong>consistency between those elements, which does not cause a problem for the export<strong>in</strong>gagency, but might have huge impacts on the import<strong>in</strong>g agency.In a <strong>relational</strong> database this k<strong>in</strong>d of behavior is much trickier to emulate. If the <strong>DDI</strong> XMLstructure from another agency is imported, the content has to be processed <strong>and</strong> divided as ithas to be converted from a tree structure to a tabular structure. For the conversion there cannotbe unknown elements as every s<strong>in</strong>gle one of them has to be parsed.Essentially there are three ways to h<strong>and</strong>le the process.1.) The <strong>relational</strong> database conta<strong>in</strong>s the full <strong>DDI</strong> set of a specific version (e.g., <strong>DDI</strong> Lifecycle 3.1)which means at least with<strong>in</strong> this sett<strong>in</strong>g there cannot be unknown elements.2.) The <strong>relational</strong> database discards all unknown elements <strong>in</strong> the import process; therefore theycannot be forwarded anymore, result<strong>in</strong>g <strong>in</strong> a loss of orig<strong>in</strong>al metadata.3.) The <strong>relational</strong> database stores the unknown elements as text str<strong>in</strong>gs or <strong>in</strong> the case ofenterprise <strong>databases</strong> as <strong>in</strong>ternal XML structures to provide them later for export<strong>in</strong>g.Nevertheless, the mechanism to perform this export is much more difficult to develop <strong>in</strong> an RDBenvironment <strong>and</strong> has the same risk of <strong>in</strong>consistency described <strong>in</strong> the XML world.<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 13 of 17


A special case of unknown elements is when external resources embedded <strong>in</strong>to the <strong>DDI</strong>structure have to be imported (e.g., multimedia resources used <strong>in</strong> educational sciences) as theimport process <strong>in</strong> both cases (<strong>relational</strong> or XML) cannot really h<strong>and</strong>le them because the storagestructure is normally not prepared for external elements. Here the import<strong>in</strong>g application has tobe adapted <strong>in</strong>dividually to fulfill those needs.Ensur<strong>in</strong>g Application Compatibility <strong>in</strong> Transferr<strong>in</strong>g<strong>DDI</strong> Between DatabasesUs<strong>in</strong>g a <strong>relational</strong> data model for storage (or a hybrid RDB-XML approach) has thedisadvantage of not be<strong>in</strong>g able to easily store the entire <strong>DDI</strong> schema, which often warrants thedynamic generation of <strong>DDI</strong> for transmission <strong>and</strong> consumption by applications. Thus, it isimportant to establish the set of elements be<strong>in</strong>g used <strong>in</strong> a given <strong>in</strong>stance. <strong>DDI</strong> applicationcompatibility can be def<strong>in</strong>ed as the ability for software applications driven by <strong>DDI</strong> XML topredictably underst<strong>and</strong> expected <strong>in</strong>puts <strong>and</strong> yield predictable outputs for potentially disparate<strong>in</strong>stances of <strong>DDI</strong>. Due to the flexible nature of the <strong>DDI</strong> schema, a mach<strong>in</strong>e-actionable means ofdef<strong>in</strong><strong>in</strong>g used <strong>and</strong> unused elements <strong>in</strong> a given <strong>in</strong>stance is necessary <strong>in</strong> order to validate whetherdifferent <strong>in</strong>stances of <strong>DDI</strong> are compatible. The <strong>DDI</strong> Profile module has been developed tofacilitate the automation of this requirement us<strong>in</strong>g XPath statements 19 . Use of this module is notm<strong>and</strong>ated by the <strong>DDI</strong> specification, but best practices have been established about the creationof <strong>DDI</strong> Profiles 20 . There is also a best practice document about high-level architecture for <strong>DDI</strong>application developers, which suggests the use of <strong>DDI</strong> Profiles <strong>in</strong> the context of application<strong>in</strong>teroperability 21 . This paper does not aim to reiterate best practice, but to further elucidate howone might use <strong>DDI</strong> Profiles <strong>in</strong> the context of build<strong>in</strong>g software applications capable ofcommunicat<strong>in</strong>g compatibility across organizations as well as with<strong>in</strong> organizations’ <strong>in</strong>stances of<strong>DDI</strong> metadata.The <strong>DDI</strong> Profile is def<strong>in</strong>ed <strong>in</strong> the <strong>DDI</strong> documentation as a simple collection of XPaths thatdescribe the objects with<strong>in</strong> <strong>DDI</strong> that are either used or not used for particular purposes. The <strong>DDI</strong>Profile facilitates shar<strong>in</strong>g by clearly stat<strong>in</strong>g what is expected <strong>in</strong> the <strong>DDI</strong> metadata received orsent by an organization <strong>and</strong> def<strong>in</strong>es which parts of <strong>DDI</strong> an organization or system can h<strong>and</strong>le 22 .The use of a Profile is not m<strong>and</strong>atory, but when one is be<strong>in</strong>g used, it should be referenced <strong>in</strong> allof the <strong>DDI</strong> <strong>in</strong>stances that conform to it. This is done us<strong>in</strong>g the URN of the profile <strong>in</strong> the<strong>DDI</strong>ProfileReference element declared at the end of the StudyUnit module as follows 23 :19 http://www.w3.org/TR/xpath/20 Best Practice on Creat<strong>in</strong>g a <strong>DDI</strong> Profile (2009-02-15):http://dx.doi.org/10.3886/<strong>DDI</strong>BestPractices0621 Best Practice on High-Level Architectural Model for <strong>DDI</strong> Applications (2009-02-22):http://dx.doi.org/10.3886/<strong>DDI</strong>BestPractices1222 <strong>DDI</strong> Profile module schema: http://www.ddialliance.org/Specification/<strong>DDI</strong>-Lifecycle/3.1/XMLSchema/ddiprofile.xsd23 Correspondences with S<strong>and</strong>a Ionescu, May 25, 2011 - Aug. 22, 2011<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 14 of 17


URN_of_profileAdd<strong>in</strong>g this reference is foundational to application compatibility as it provides access to theoutput logic conta<strong>in</strong>ed <strong>in</strong> the profile, <strong>in</strong>form<strong>in</strong>g developers <strong>in</strong>teract<strong>in</strong>g with the <strong>in</strong>stance of whichelements from the greater <strong>DDI</strong> Schema to expect as <strong>in</strong>put. (The proper use of identifiers <strong>and</strong>URNs is outside the scope of this discussion, but can be found described <strong>in</strong> the <strong>DDI</strong> Identifier 24<strong>and</strong> <strong>DDI</strong> URN Resolution 25 best practice documents.)Whether generat<strong>in</strong>g <strong>DDI</strong> dynamically from a <strong>relational</strong> data model, or transmitt<strong>in</strong>g it from anXML database, there are several types of validation required for <strong>in</strong>teroperability. Beyond XMLvalidation <strong>and</strong> <strong>DDI</strong> Schema validation, an application serv<strong>in</strong>g <strong>DDI</strong> should provide facilities bywhich other applications can validate compatibility between systems. For example, anapplication developer from organization ABC <strong>in</strong>teract<strong>in</strong>g with a <strong>DDI</strong> store from organization XYZvia a Web service to create a cross-organization metadata search would most likely bedelighted to f<strong>in</strong>d functionality that would analyze the profiles from their organization <strong>and</strong>communicate <strong>in</strong>compatibilities between the <strong>in</strong>stances that needed to be addressed beforeimplementation. This functionality would expedite the development process <strong>and</strong> could be builtrelatively easily by loop<strong>in</strong>g through the XPath statements conta<strong>in</strong>ed <strong>in</strong> ABC’s profile <strong>and</strong>return<strong>in</strong>g the results when performed on XYZ’s <strong>in</strong>stance of <strong>DDI</strong>.24 Management of <strong>DDI</strong> 3.0 Unique Identifiers (2009-02-15):http://dx.doi.org/10.3886/<strong>DDI</strong>BestPractices1025 <strong>DDI</strong> 3.0 URNs <strong>and</strong> Entity Resolution (2009-03-21):http://dx.doi.org/10.3886/<strong>DDI</strong>BestPractices11<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 15 of 17


Applications between organizations are not the only architecture where <strong>DDI</strong> profiles would bevery useful. Another case to consider is one <strong>in</strong> which a large organization is made up of manysmaller autonomous, <strong>DDI</strong>-generat<strong>in</strong>g units. This organization would like to f<strong>in</strong>d a way to mergedatasets <strong>in</strong>to a model derived from common fields from each unit. This task would be extremelyexpensive <strong>and</strong> tedious without a way to communicate used <strong>and</strong> unused fields <strong>in</strong> a st<strong>and</strong>ardizedway. <strong>DDI</strong> Profiles provide that method <strong>in</strong> the same way that was described <strong>in</strong> the previousexample. Even greater efficiency can be found when each unit establishes common profiles withthe other units before generat<strong>in</strong>g <strong>DDI</strong>, so that subsequent generation of metadata <strong>in</strong> one unitwill be <strong>in</strong>teroperable as long as it adheres to the agreed upon field set.Use of <strong>DDI</strong> Profiles to communicate elements used <strong>and</strong> not used <strong>in</strong> an organization has manybenefits besides <strong>in</strong>teroperability between autonomous units’ <strong>in</strong>stances of <strong>DDI</strong>. It would also bepivotal <strong>in</strong> facilitat<strong>in</strong>g metadata tool development throughout the <strong>DDI</strong> Lifecycle 26 . In the case ofmany survey agencies, the creation of metadata <strong>in</strong>volves a cha<strong>in</strong> of tools, the output of onefeed<strong>in</strong>g <strong>in</strong>to the next. For example, the output of a questionnaire designer would feed <strong>in</strong>to aquestionnaire eng<strong>in</strong>e, send<strong>in</strong>g its output through a data cleaner, the output of which is f<strong>in</strong>ally<strong>in</strong>gested by tools related to dissem<strong>in</strong>ation.A Look <strong>in</strong>to the Future: A More AbstractRepresentation of <strong>DDI</strong>While the previous sections discussed issues of h<strong>and</strong>l<strong>in</strong>g <strong>DDI</strong>-based metadata <strong>in</strong> a <strong>relational</strong>database, XML structure, or hybrid systems, the future may hold another possibility. <strong>DDI</strong> as ast<strong>and</strong>ard underwent significant changes <strong>in</strong> its history, e.g., <strong>DDI</strong> Codebook (prior to 2.5) isexpressed as a DTD while <strong>DDI</strong> Lifecycle (all versions of <strong>DDI</strong> 3.x) is represented as an XMLschema. Some agencies, as already mentioned, use the XML format only for import <strong>and</strong> exportpurposes, while they <strong>in</strong>ternally map the metadata to their own proprietary <strong>relational</strong> <strong>databases</strong>t<strong>and</strong>ard. The underly<strong>in</strong>g pr<strong>in</strong>ciple this reveals is that <strong>DDI</strong> does not need to rely upon aparticular technical representation, but is valuable as an abstract model, the manifestation ofwhich can be <strong>in</strong> different representations, such as UML, RDF, etc. This means an abstractionlayer could exist between the relations <strong>and</strong> nodes of a possible “<strong>DDI</strong> 4” <strong>and</strong> the technicalrepresentations used <strong>in</strong> a given system. The advantage is that a technical representation can begenerated out of the abstract model. At the time of the writ<strong>in</strong>g of this paper, discussions with<strong>in</strong>the <strong>DDI</strong> Alliance were occurr<strong>in</strong>g to take exactly this step, towards conceptualiz<strong>in</strong>g thespecification as an abstract model.ConclusionThis paper discussed various aspects, pros, <strong>and</strong> cons of manag<strong>in</strong>g <strong>DDI</strong> metadata <strong>in</strong> <strong>relational</strong><strong>databases</strong> vs. XML structures, <strong>and</strong> issues of application compatibility when transferr<strong>in</strong>g <strong>DDI</strong>metadata between different stores <strong>and</strong> agencies. The authors <strong>in</strong>vite discussion of this paper on26 See graphic depiction at: http://www.ddialliance.org/sites/default/files/what-is-ddi-diagram.jpg<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 16 of 17


the <strong>DDI</strong> users’ email discussion list 27 , <strong>and</strong> at future meet<strong>in</strong>gs of the <strong>DDI</strong> <strong>and</strong> broader socialscience data communities.AcknowledgmentsThe Leibniz Institute for Educational Research <strong>and</strong> Educational Information (DIPF) hosted aworkshop <strong>in</strong>clud<strong>in</strong>g the topic of <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong> <strong>in</strong> Frankfurt, Germany, on April 7-8,2011, dur<strong>in</strong>g which the development of this document was begun. The contributors would like tothank the follow<strong>in</strong>g colleagues for provid<strong>in</strong>g <strong>in</strong>put on a late-Oct. 2011 draft: S<strong>and</strong>a Ionescu,University of Michigan (USA); Jeremy Iverson, Algenta Technologies (USA); <strong>and</strong> JohannaVompras, University of Bielefeld (Germany); <strong>and</strong> Mary Vardigan, University of Michigan (USA),who edited the f<strong>in</strong>al version of the paper.27 http://www.ddialliance.org/community/listserv<strong>Represent<strong>in</strong>g</strong> <strong>and</strong> <strong>utiliz<strong>in</strong>g</strong> <strong>DDI</strong> <strong>in</strong> <strong>relational</strong> <strong>databases</strong>DOI: http://dx.doi.org/10.3886/<strong>DDI</strong>OtherTopics02 1 December 2011 Page 17 of 17

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!