13.07.2015 Views

Software Engineering for Internet Applications - Student Community


way 1 1/16

One might argue that this sentence makes better literature as "All happy families resemble one another, but each unhappy family is unhappy in its own way," but the full-text search software finds it more useful in this form.

After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of "resemble" is more interesting than "happy" because "resemble" occurs less frequently in standard English. Stopwords such as "is" are thrown away altogether. Stemming is another useful refinement. In the index and in queries we convert all words to their stems. The stem word for "families", for example, is "family". With stemming, a query for "families" would match a document containing "family" and vice versa.

Given a body of histograms it is possible to answer queries such as "Show me documents that are similar to this one" or "Show me documents whose histogram is closest to a user-entered string." The inter-document similarity query can be handled by comparing histograms already stored in the text database. The search string "platinum mines in New Zealand" might be processed first by throwing away the stopwords "in" and "new". By using histogram comparison the software would deliver articles that have the most occurrences of "platinum", "mines", and "Zealand". Suppose that "Zealand" is a rarer word than "platinum". Then a document with one occurrence of "Zealand" is favored over one with one occurrence of "platinum". A document with one occurrence of each word is preferred to an article where only one of those words shows up.
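The histogram approach described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular full-text product works. The stopword list, the toy `stem` function, and the scoring formula are all assumptions made for the example. In particular, word rarity is estimated from the document collection itself rather than from a table of standard English frequencies.

```python
import math
import re
from collections import Counter

# Assumed, illustrative stopword list; real systems ship much longer ones.
STOPWORDS = {"a", "an", "and", "are", "but", "in", "is", "its",
             "new", "of", "the", "to"}


def stem(word):
    """Toy stemmer (an assumption): handles only simple English plurals."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # "families" -> "family"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 4:
        return word[:-1]                # "mines" -> "mine"
    return word


def histogram(text):
    """Count stemmed words in text, discarding stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOPWORDS)


def rarity_weights(histograms):
    """Weight each term by how few documents in the collection contain it."""
    n = len(histograms)
    doc_freq = Counter()
    for h in histograms:
        doc_freq.update(h.keys())
    return {term: math.log(1 + n / df) for term, df in doc_freq.items()}


def score(query_hist, doc_hist, weights):
    """Sum rarity-weighted term frequencies, normalized by document length."""
    total = sum(doc_hist.values())
    if total == 0:
        return 0.0
    return sum(weights.get(t, 0.0) * doc_hist[t] / total for t in query_hist)


def search(query, docs):
    """Return document names ordered from best to worst match."""
    hists = {name: histogram(text) for name, text in docs.items()}
    weights = rarity_weights(list(hists.values()))
    q = histogram(query)
    return sorted(hists, key=lambda name: score(q, hists[name], weights),
                  reverse=True)


# Hypothetical four-document collection for demonstration.
docs = {
    "short": "platinum mines Zealand",
    "long": "platinum mines Zealand " + "filler " * 1000,
    "one_term": "platinum prices rose",
    "unrelated": "happy families resemble one another",
}

ranked = search("platinum mines in New Zealand", docs)
print(ranked)
```

Dividing by document length is what lets the three-word document rank first here, ahead of the long document that contains the same three terms among a thousand others.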
A document that contains only the words "platinum mines Zealand" is a better match than a document that contains 100,000 words, three of which happen to match the query terms.

The power of this kind of system is enticing and raises the question "Can we run our entire Web application from a specialized full-text search database system?" Indeed, why not chuck the RDBMS altogether?

We don't chuck the RDBMS because we put it in to handle the problem of concurrency: two users trying to update the same item simultaneously. A better query tool is nice but we can't adopt it as our primary database management system unless it handles the concurrency problem as well as the RDBMS.

... software is checked out from a version control repository into the file system of the development computer. Changed files are checked back into the repository when the programmer is satisfied.

A shallow objection to this development method in the world of database-backed Internet applications is that it becomes very tedious to make a small change. The programmer checks out the tree onto a development server. The programmer installs an RDBMS, creates an RDBMS user and a tablespace. The programmer exports the RDBMS from the production site into a dump file, transfers that dump file over the network to the development machine, and imports it into the RDBMS installation on the development server. Keep in mind that for many Internet applications the database may approach 1 terabyte in size and therefore it could take hours or days to transfer and import the dump file. Finally, the programmer finds a free IP address or port and sets up an HTTP server rooted at the development tree. Ready to code!

A deeper objection to applying this development method to our world is that it is an obstacle to collaboration.
In the Internet application business, developers always work with the publisher and users. Those collaborators need to know, at all times, where to find the latest running version of the software so that they can offer criticism and advice. If there are 10 software developers on a service it is not reasonable to ask the publishers and users to check 10 separate development sites.

A Solution for Our Times

1. three HTTP servers (can be on one physical computer)
2. two or three RDBMS users/tablespaces (can be in one RDBMS instance)
3. one version control repository

Let's go through these item by item.

Item 1: Three HTTP Servers

Suppose that a publisher's overall objective is to serve an Internet application accessible at "foobar.com". This implies a production server, rooted in the file system at /web/foobar/ (Server 1). It is too risky to have programmers making changes on the live production site. This implies a development server, rooted at /web/foobar-dev/ (Server 2). Perhaps this is enough. When everyone is happy with the way that the dev server is functioning, declare a code freeze, test a
