how data are spread among physical disk drives. The database theoretician would note that our data model is in Second Normal Form but not in Third Normal Form. In a table that is part of a Third Normal Form data model, all columns are directly dependent on the whole key. The column current_version_p is not dependent on the table key but rather on two other non-key columns (editorial_status and version_date). SQL programmers refer to this kind of performance-enhancing storage of derivable data as "denormalization".

If you want to serve 10 million requests per day directly from an RDBMS running on a server of modest capacity, you may need to break some rules. However, the most maintainable production data models usually result from beginning with Third Normal Form and adding a handful of modest and judicious denormalizations that are documented and justified.

Note that any data model in Third Normal Form is also in Second Normal Form. A data model in Second Normal Form is in First Normal Form.

6.14 Version Control (for Computer Programs)

Note that a solution to the version control problem for site content (stuff in the database) still leaves you, as an engineer, with the problem of version control for the computer programs that implement the site. These are most likely in the operating system file system and edited by a handful of professional software developers. During this class you may decide that it is not worth the effort to set up and use version control, in which case your de facto version control system becomes backup tapes so make sure that you've got daily backups. However, in the long run you need to learn about approaches to version control for Internet application development.

Throughout this section, keep in mind that a project with a very clear publishing objective, specs that never change, and one very smart developer, does not need version control.
A project with evolving objectives, changing specifications, and multiple contributors needs version control.

Classical Solution: one development area per developer

Classically version control is used by C developers with each C programmer working from his or her own directory. This makes sense because there is no persistence in the C world. Code is compiled. A binary runs that builds data structures in RAM. When the program terminates it doesn't leave anything behind. The entire "tree" of

A pragmatic approach would seem to start by keeping all the documents in the RDBMS: articles, user comments, discussion forum postings, etc. Either once per night or every time a new document was added, update a full-text search system's collection. Pages that are part of the standard user experience and workflow operate from the RDBMS. The search box at the upper right corner of every page, however, queries against the full-text search system. Let's call this a split-system design.

**** insert figure *****

Figure 1: A split-system approach to providing full-text search. The application's content is stored in a relational database management system. Scripts periodically maintain a second copy in a specialized text database. The Web server program performs queries, inserts, and updates to the RDBMS. When a user requests a full-text search, however, the query is sent to the text database.

One argument against the split-system approach is that two copies of the document collection are being kept. In an age of $200 disk drives of absurdly high capacity, this isn't a powerful argument.
It is nearly impossible to fill a modern disk drive with words typed by humans. One can fill up a disk drive with video or audio streams, but not text. And in any case some full-text search systems can build an index to a document collection without themselves keeping the original document around, i.e., you would in fact have only one copy of the document in the RDBMS.

A second argument against using RDBMS and full-text search systems simultaneously is that the collections will get out of sync. If the Web server crashes in the middle of an RDBMS transaction, all work is rolled back. If the Web server was simultaneously inserting a document into a full-text search system, it is possible that the full-text database will contain a document that is not in fact available on the main pages of the site--the site being generated from the RDBMS. Alternatively the RDBMS insert might succeed while the full-text insert fails, leading to a document that is available on the site but not searchable. This argument, too, ultimately lacks power. It is true that the RDBMS is a convenient and nearly foolproof means of managing transactions and concurrency. However, it is not the only way. If one were to hire sufficiently careful programmers and sufficiently dedicated system and database administrators it would be possible to keep two databases in sync.
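The two failure modes described above (a document indexed but rolled back in the RDBMS, and a document committed but never indexed) can be detected by a periodic reconciliation script. What follows is a minimal sketch under stated assumptions, not a recommended production design: the `documents` table, its `doc_id` column, and the functions `committed_ids` and `reconcile` are hypothetical names, an in-memory SQLite database stands in for the RDBMS, and a plain Python set stands in for the full-text system's list of indexed document IDs.

```python
import sqlite3

def committed_ids(conn):
    """IDs of documents committed in the RDBMS (hypothetical `documents` table)."""
    return {row[0] for row in conn.execute("select doc_id from documents")}

def reconcile(rdbms_ids, index_ids):
    """Return (IDs missing from the index, IDs indexed but absent from the RDBMS)."""
    return sorted(rdbms_ids - index_ids), sorted(index_ids - rdbms_ids)

# Stand-in for the RDBMS: three committed documents.
conn = sqlite3.connect(":memory:")
conn.execute("create table documents (doc_id integer primary key, body text)")
conn.executemany("insert into documents values (?, ?)",
                 [(1, "first"), (2, "second"), (3, "third")])
conn.commit()

# Stand-in for the full-text index: document 3 was committed,
# but its full-text insert failed.
index_ids = {1, 2}

# Simulated crash: document 4 is indexed, then the RDBMS transaction rolls back.
conn.execute("insert into documents values (4, 'draft')")
index_ids.add(4)
conn.rollback()

to_add, to_remove = reconcile(committed_ids(conn), index_ids)
print(to_add, to_remove)
```

In this sketch the RDBMS is treated as the authority: anything in `to_add` would be re-indexed and anything in `to_remove` would be deleted from the text database.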
way 1 1/16

One might argue that this sentence makes better literature as "All happy families resemble one another, but each unhappy family is unhappy in its own way," but the full-text search software finds it more useful in this form.

After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of "resemble" is more interesting than "happy" because "resemble" occurs less frequently in standard English. Stopwords such as "is" are thrown away altogether. Stemming is another useful refinement. In the index and in queries we convert all words to their stems. The stem word for "families", for example, is "family". With stemming, a query for "families" would match a document containing "family" and vice versa.

Given a body of histograms it is possible to answer queries such as "Show me documents that are similar to this one" or "Show me documents whose histogram is closest to a user-entered string." The inter-document similarity query can be handled by comparing histograms already stored in the text database. The search string "platinum mines in New Zealand" might be processed first by throwing away the stopwords "in" and "new". By using histogram comparison the software would deliver articles that have the most occurrences of "platinum", "mines", and "Zealand". Suppose that "Zealand" is a rarer word than "platinum". Then a document with one occurrence of "Zealand" is favored over one with one occurrence of "platinum". A document with one occurrence of each word is preferred to an article where only one of those words shows up.
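The histogram, stopword, stemming, and rarity steps described so far can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular full-text product works: the stopword list and the `stem` function are toy stand-ins, and word rarity is estimated from the document collection itself (an inverse-document-frequency weight) rather than from a table of standard English word frequencies, which is not available here. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Toy stopword list, for illustration only.
STOPWORDS = {"is", "in", "new", "the", "a", "of"}

def stem(word):
    """Toy stemmer: 'families' -> 'family', 'mines' -> 'mine'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def histogram(text):
    """Map each stemmed non-stopword to (occurrences / total words in text)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(stem(w) for w in words if w not in STOPWORDS)
    return {term: n / len(words) for term, n in counts.items()}

def rarity_weights(histograms):
    """Weight rare terms more heavily (collection-based stand-in for
    'prevalence in standard English')."""
    n_docs = len(histograms)
    doc_freq = Counter(term for h in histograms for term in h)
    return {t: math.log(1 + n_docs / df) for t, df in doc_freq.items()}

def rank(query, docs):
    """Return document names, best match first."""
    hists = {name: histogram(text) for name, text in docs.items()}
    weights = rarity_weights(list(hists.values()))
    terms = set(histogram(query))
    def score(name):
        return sum(hists[name].get(t, 0.0) * weights.get(t, 0.0) for t in terms)
    return sorted(hists, key=score, reverse=True)

docs = {
    "a": "platinum mines Zealand",
    "b": "platinum prices rose sharply",
    "c": "the weather in Zealand is mild",
}
print(rank("platinum mines in New Zealand", docs))
```

Because each histogram entry is a fraction of the document's total word count, a short document made up mostly of query terms scores higher than a long one that mentions them in passing.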
A document that contains only the words "platinum mines Zealand" is a better match than a document that contains 100,000 words, three of which happen to match the query terms.

The power of this kind of system is enticing and raises the question "Can we run our entire Web application from a specialized full-text search database system?" Indeed, why not chuck the RDBMS altogether?

We don't chuck the RDBMS because we put it in to handle the problem of concurrency: two users trying to update the same item simultaneously. A better query tool is nice but we can't adopt it as our primary database management system unless it handles the concurrency problem as well as the RDBMS.

software is checked out from a version control repository into the file system of the development computer. Changed files are checked back into the repository when the programmer is satisfied.

A shallow objection to this development method in the world of database-backed Internet applications is that it becomes very tedious to make a small change. The programmer checks out the tree onto a development server. The programmer installs an RDBMS, creates an RDBMS user and a tablespace. The programmer exports the RDBMS from the production site into a dump file, transfers that dump file over the network to the development machine, and imports it into the RDBMS installation on the development server. Keep in mind that for many Internet applications the database may approach 1 Terabyte in size and therefore it could take hours or days to transfer and import the dump file. Finally, the programmer finds a free IP address or port and sets up an HTTP server rooted at the development tree. Ready to code!

A deeper objection to applying this development method to our world is that it is an obstacle to collaboration.
In the Internet application business, developers always work with the publisher and users. Those collaborators need to know, at all times, where to find the latest running version of the software so that they can offer criticism and advice. If there are 10 software developers on a service it is not reasonable to ask the publishers and users to check 10 separate development sites.

A Solution for Our Times

1. three HTTP servers (can be on one physical computer)
2. two or three RDBMS users/tablespaces (can be in one RDBMS instance)
3. one version control repository

Let's go through these item by item.

Item 1: Three HTTP Servers

Suppose that a publisher's overall objective is to serve an Internet application accessible at "foobar.com". This implies a production server, rooted in the file system at /web/foobar/ (Server 1). It is too risky to have programmers making changes on the live production site. This implies a development server, rooted at /web/foobar-dev/ (Server 2). Perhaps this is enough. When everyone is happy with the way that the dev server is functioning, declare a code freeze, test a