10.07.2015 Views

Download pdf - Free Books



Hack 61 Adding a Little Google to Your Word

Use Google with Microsoft Word for better spelling suggestions than the traditional dictionary.

Some of the hacks we cover in this book are very useful, some are weird, and some of them are not exactly useful but have a definite cool factor. The first version of CapeSpeller (http://www.capescience.com/google/spell.shtml) fit into that last category. Send a word via email and receive a spelling suggestion in return.

While cool, there weren't many scenarios where you'd absolutely need to use it. But the newer version of CapeSpeller is far more useful; it's now designed to integrate with Microsoft Word and provide spelling suggestions powered by Google as an alternative to the standard Word/Office dictionary.

Now, why in the world would you want another spellchecker in Word? Doesn't it already have a rather good one? Indeed it does, but it employs a traditional dictionary, which falls over when faced with certain proper nouns, jargon, and acronyms. Google's dictionary [Hack #16] is chock-full of these sorts of up-to-the-minute, hip, and non-traditional suggestions.

61.1 Using CapeSpeller

There are several steps to acquiring and installing CapeSpeller.

1. First, you'll need to have the Microsoft SOAP Toolkit installed (http://msdn.microsoft.com/downloads/default.asp?URL=/code/sample.asp?url=/msdnfiles/027/001/580/msdncompositedoc.xml). It's a fairly small download but may take a little wrangling to get squared away. You must be running Internet Explorer 5 or later. You may also have to update your Windows Installer depending on what version of Windows you're using. The CapeSpeller site (http://www.capescience.com/google/spell.shtml) provides more details.

2. Once you've got the SOAP Toolkit squared away, you'll have to get two code items from CapeScience. The first is a zipped executable that's available from http://www.capescience.com/google/download/CapeSpeller.zip. Download that one, unzip it, and run the executable.

3. After you've downloaded and installed the executable, download the source code. The source code contains a place for you to copy and paste your Google API developer's key. Unless you've got a legit developer's key there, you won't be able to get spelling suggestions from Google.

4. The final thing you'll need to do to get CapeSpeller to work with Word is to set up a macro. CapeScience offers instructions for setting up a spellcheck macro at http://www.capescience.com/google/spelltoword.shtml.


Hack 62 Permuting a Query

Run all permutations of query keywords and phrases to squeeze the last drop of results from the Google index.

Google, ah, Google. Search engine of over 3 billion pages and 3 zillion possibilities. One of Google's charms, if you're a search engine geek like me, is trying various tweaks with your Google search to see what exactly makes a difference to the results you get.

It's amazing what makes a difference. For example, you wouldn't think that word order would make much of an impact but it does. In fact, buried in Google's documentation is the admission that the word order of a query will impact search results.

While that's an interesting thought, who has time to generate and run every possible iteration of a multiword query? The Google API to the rescue! This hack takes a query of up to four keywords or "quoted phrases" (as well as supporting special syntaxes) and runs all possible permutations, showing result counts by permutation and the top results for each permutation.

You'll need to have the Algorithm::Permute Perl module for this program to work correctly (http://search.cpan.org/search?query=algorithm%3A%3Apermute&mode=all).

62.1 The Code

    #!/usr/local/bin/perl
    # order_matters.cgi
    # Queries Google for every possible permutation of up to 4 query keywords,
    # returning result counts by permutation and top results across permutations.
    # order_matters.cgi is called as a CGI with form input

    # Your Google API developer's key
    my $google_key='insert key here';

    # Location of the GoogleSearch WSDL file
    my $google_wdsl = "./GoogleSearch.wsdl";

    use strict;

    use SOAP::Lite;
    use CGI qw/:standard *table/;
    use Algorithm::Permute;

    print
      header( ),
      start_html("Order Matters"),
      h1("Order Matters"),
      start_form(-method=>'GET'),
      'Query: &nbsp; ', textfield(-name=>'query'),
      ' &nbsp; ',
      submit(-name=>'submit', -value=>'Search'), br( ),
      'Enter up to 4 query keywords or "quoted phrases"',
      end_form( ), p( );

    if (param('query')) {

      # Glean keywords, dropping the empty and undefined pieces split leaves behind
      my @keywords = grep { defined($_) && !/^\s*$/ }
        split /([+-]?".+?")|\s+/, param('query');

      # Too many keywords: report and stop ("last" is only valid inside a loop)
      if (@keywords > 4) {
        print 'Only 4 query keywords or phrases allowed.', end_html( );
        exit;
      }

      my $google_search = SOAP::Lite->service("file:$google_wdsl");

      print
        start_table({-cellpadding=>'10', -border=>'1'}),
        Tr([th({-colspan=>'2'}, ['Result Counts by Permutation' ])]),
        Tr([th({-align=>'left'}, ['Query', 'Count'])]);

      my $results = {}; # keep track of what we've seen across queries

      # Iterate over every possible permutation
      my $p = new Algorithm::Permute( \@keywords );
      while (my $query = join(' ', $p->next)) {

        # Query Google
        my $r = $google_search ->
          doGoogleSearch(
            $google_key,
            $query,
            0, 10, "false", "", "false", "", "latin1", "latin1"
          );
        print Tr([td({-align=>'left'}, [$query, $r->{'estimatedTotalResultsCount'}] )]);
        @{$r->{'resultElements'}} or next;

        # Assign a rank: 10 points for the first result, 9 for the second, and so on
        my $rank = 10;
        foreach (@{$r->{'resultElements'}}) {
          $results->{$_->{URL}} = {
            title => $_->{title},
            snippet => $_->{snippet},
            seen => ($results->{$_->{URL}}->{seen} || 0) + $rank
          };
          $rank--;
        }
      }

      print
        end_table( ), p( ),
        start_table({-cellpadding=>'10', -border=>'1'}),
        Tr([th({-colspan=>'2'}, ['Top Results across Permutations' ])]),
        Tr([th({-align=>'left'}, ['Score', 'Result'])]);

      # Highest cumulative score first
      foreach ( sort { $results->{$b}->{seen} <=> $results->{$a}->{seen} } keys %$results ) {
        print Tr(td([
          $results->{$_}->{seen},
          b($results->{$_}->{title}||'no title') . br( ) .
          a({href=>$_}, $_) . br( ) .
          i($results->{$_}->{snippet}||'no snippet')
        ]));
      }

      print end_table( );
    }

    print end_html( );

62.2 Running the Hack

The hack runs via a web form that is integrated into the code. Call the CGI and enter the query you want to check (up to four words or phrases). The script will first search for every possible combination of the search words and phrases, as Figure 6-1 shows.

Figure 6-1. List of permutations for applescript google api
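To make the permutation step concrete, here is a minimal Python sketch of the same idea. It is not part of the original Perl script: the function names split_keywords and permute_query are invented for illustration, and Python's itertools.permutations stands in for Algorithm::Permute, so the permutations may come out in a different order than in the figure.

```python
import itertools
import re


def split_keywords(query):
    """Split a query into bare words and "quoted phrases" (optionally prefixed by + or -)."""
    # Try a quoted phrase first; otherwise take a run of non-space characters.
    return re.findall(r'[+-]?"[^"]+"|\S+', query)


def permute_query(query, limit=4):
    """Return every ordering of the query's keywords as a query string.

    The limit of 4 mirrors the cap in order_matters.cgi: 4 keywords already
    mean 24 searches, and 5 would mean 120.
    """
    keywords = split_keywords(query)
    if len(keywords) > limit:
        raise ValueError("Only %d query keywords or phrases allowed." % limit)
    return [" ".join(p) for p in itertools.permutations(keywords)]


queries = permute_query("applescript google api")
for q in queries:
    print(q)
```

Three keywords yield six queries; a quoted phrase counts as a single keyword and is never split or reordered internally.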


The script then displays the top search results across all permutations of the query, as Figure 6-2 shows.

Figure 6-2. Top results for permutations of applescript google api
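The Score column can be read as a simple positional tally. The Python sketch below assumes the scoring is meant to work as the "Assign a rank" comment suggests: within each permutation the first result earns 10 points, the second 9, and so on, and a URL's points are summed across permutations. The function name aggregate_scores and the sample URLs are hypothetical, not taken from the book.

```python
from collections import defaultdict


def aggregate_scores(results_by_query, top=10):
    """Sum positional scores for each URL across all permuted queries."""
    scores = defaultdict(int)
    for urls in results_by_query.values():
        for position, url in enumerate(urls[:top]):
            scores[url] += top - position  # 10 for first place, 9 for second, ...
    # Highest cumulative score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists for two permutations (placeholder URLs)
sample = {
    "applescript google api": ["http://a.example/", "http://b.example/"],
    "google api applescript": ["http://b.example/", "http://c.example/"],
}

ranked = aggregate_scores(sample)
for url, score in ranked:
    print(score, url)
```

A page that ranks well under every word order accumulates a high score, while a page that appears for only one ordering stays near the bottom of the table.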


62.3 Using the Hack

At first blush, this hack looks like a novelty with few practical applications. But if you're a regular researcher or a web wrangler, you might find it of interest.

If you're a regular researcher (that is, there are certain topics that you research on a regular basis), you might want to spend some time with this hack and see if you can detect a pattern in how your regular search terms are impacted by changing word order. You might need to revise your searching so that certain words always come first or last in your query.

If you're a web wrangler, you need to know where your page appears in Google's search results. If your page loses a lot of ranking ground because of a shift in a query arrangement, maybe you want to add some more words to your text or shift your existing text.


Hack 63 Tracking Result Counts over Time

Query Google for each day of a specified date range, counting the number of results at each time index.

Sometimes the results of a search aren't of as much interest as knowing the number thereof. How popular is a particular keyword? How many times is so-and-so mentioned? How do differing phrases or spellings stack up against each other?

You may also wish to track the popularity of a term over time to watch its ups and downs, spot trends, and notice tipping points. Combining the Google API and daterange: [Hack #11] syntax is just the ticket.

This hack queries Google for each day over a specified date range, counting the number of results for each day. This leads to a list of numbers that you could enter into Excel and chart, for example.

There are a couple of caveats before diving right into the code. First, the average keyword will tend to show more results over time as Google adds more pages to its index. Second, Google doesn't stand behind its date-range search; results shouldn't be taken as gospel.

This hack requires the Time::JulianDay (http://search.cpan.org/search?query=Time%3A%3AJulianDay) Perl module.

63.1 The Code

    #!/usr/local/bin/perl
    # goocount.pl
    # Runs the specified query for every day between the specified
    # start and end dates, returning date and count as CSV.
    # usage: goocount.pl query="{query}" start={date} end={date}
    # where dates are of the format: yyyy-mm-dd, e.g. 2002-12-31

    # Your Google API developer's key
    my $google_key='insert key here';

    # Location of the GoogleSearch WSDL file
    my $google_wdsl = "./GoogleSearch.wsdl";

    use SOAP::Lite;
    use Time::JulianDay;
    use CGI qw/:standard/;

    # For checking date validity
    my $date_regex = '(\d{4})-(\d{1,2})-(\d{1,2})';

    # Make sure all arguments are passed correctly
    ( param('query') and param('start') =~ /^(?:$date_regex)?$/
      and param('end') =~ /^(?:$date_regex)?$/ ) or
      die qq{usage: goocount.pl query="{query}" start={date} end={date}\n};

    # Julian date manipulation; a missing date defaults to yesterday
    my $query = param('query');
    my $yesterday_julian = int local_julian_day(time) - 1;
    my $start_julian = (param('start') =~ /$date_regex/)
      ? julian_day($1,$2,$3) : $yesterday_julian;
    my $end_julian = (param('end') =~ /$date_regex/)
      ? julian_day($1,$2,$3) : $yesterday_julian;

    # Create a new Google SOAP request
    my $google_search = SOAP::Lite->service("file:$google_wdsl");

    print qq{"date","count"\n};

    # Iterate over each of the Julian dates for your query
    foreach my $julian ($start_julian..$end_julian) {
      my $full_query = "$query daterange:$julian-$julian";

      # Query Google
      my $result = $google_search ->
        doGoogleSearch(
          $google_key, $full_query, 0, 10, "false", "", "false",
          "", "latin1", "latin1"
        );

      # Output one CSV row per day (inside the loop, where $julian and
      # $result are still in scope)
      print
        '"',
        sprintf("%04d-%02d-%02d", inverse_julian_day($julian)),
        qq{","$result->{estimatedTotalResultsCount}"\n};
    }

63.2 Running the Hack

Run the script from the command line, specifying a query, start, and end dates. Perhaps you'd like to track mentions of the latest Macintosh operating system (code name "Jaguar") leading up to, on, and after its launch (August 24, 2002). The following invocation sends its results to a comma-separated (CSV) file for easy import into Excel or a database:

    % perl goocount.pl query="OS X Jaguar" \
      start=2002-08-20 end=2002-08-28 > count.csv

Leaving off the > and CSV filename sends the results to the screen for your perusal:

    % perl goocount.pl query="OS X Jaguar" \
      start=2002-08-20 end=2002-08-28

If you want to track results over time, you could run the script every day (using cron under Unix or the scheduler under Windows), with no dates specified, to get the count for the previous day (the script defaults both dates to yesterday).


Just use >> filename.csv to append to the file instead of writing over it. Or you could get the results emailed to you for your daily reading pleasure.

63.3 The Results

Here's that search for Jaguar, the new Macintosh operating system:

    % perl goocount.pl query="OS X Jaguar" \
      start=2002-08-20 end=2002-08-28
    "date","count"
    "2002-08-20","18"
    "2002-08-21","7"
    "2002-08-22","21"
    "2002-08-23","66"
    "2002-08-24","145"
    "2002-08-25","38"
    "2002-08-26","94"
    "2002-08-27","55"
    "2002-08-28","102"

Notice the expected spike in new finds on release day, August 24th.

63.4 Working with These Results

If you have a fairly short list, it's easy to just look at the results and see if there are any spikes or particular items of interest about the result counts. But if you have a long list or you want a visual overview of the results, it's easy to use these numbers to create a graph in Excel or your favorite spreadsheet program.

Simply save the results to a file, and then open the file in Excel and use the chart wizard to create a graph. You'll have to do some tweaking, but just generating the chart gives an interesting overview, as shown in Figure 6-3.

Figure 6-3. Excel graph tracking mentions of OS X Jaguar

63.5 Hacking the Hack


You can render the results as a web page by altering the code ever so slightly (the changes are all in the print statements) and directing the output to an HTML file (>> filename.html):

    ...
    print
      header( ),
      start_html("GooCount: $query"),
      start_table({-border=>undef}, caption("GooCount:$query")),
      Tr([ th(['Date', 'Count']) ]);

    foreach my $julian ($start_julian..$end_julian) {
      my $full_query = "$query daterange:$julian-$julian";
      my $result = $google_search ->
        doGoogleSearch(
          $google_key, $full_query, 0, 10, "false", "", "false",
          "", "latin1", "latin1"
        );

      # One table row per day (inside the loop, where $julian and
      # $result are still in scope)
      print
        Tr([ td([
          sprintf("%04d-%02d-%02d", inverse_julian_day($julian)),
          $result->{estimatedTotalResultsCount}
        ]) ]);
    }

    print
      end_table( ),
      end_html;
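The daterange: queries that goocount.pl builds rely on Julian day numbers rather than calendar dates. As a rough illustration in Python (not part of the original Perl script), the sketch below converts calendar dates to Julian day numbers and builds one single-day query per date. The names JDN_OFFSET, julian_day, and daily_queries are invented here; the offset is the fixed difference between the Julian day number and Python's date ordinal, under which January 1, 2000 is day 2451545.

```python
from datetime import date, timedelta

# Julian Day Number minus Python's proleptic Gregorian ordinal
JDN_OFFSET = 1721425


def julian_day(d):
    """Return the Julian day number for a calendar date."""
    return d.toordinal() + JDN_OFFSET


def daily_queries(query, start, end):
    """Yield (iso_date, full_query) pairs, one single-day daterange: query per day."""
    day = start
    while day <= end:
        jd = julian_day(day)
        yield day.isoformat(), "%s daterange:%d-%d" % (query, jd, jd)
        day += timedelta(days=1)


queries = list(daily_queries("OS X Jaguar", date(2002, 8, 20), date(2002, 8, 28)))
for iso_date, full_query in queries:
    print(iso_date, full_query)
```

Using the same number on both sides of the hyphen restricts each search to a single day, which is what makes a per-day count possible.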


If you're easily entertained like me, you might amuse yourself for a while just by clicking and dragging the nodes around. But there's more to do than that.

64.2 Expanding Your View

Hold your mouse over one of the items in the group of pages. You'll notice that a little box with an H pops up. Click on that and you'll get a box of information about that particular node, as shown in Figure 6-5.

Figure 6-5. Node information pop-up box

The box of information contains title, snippet, and URL: pretty much everything you'd get from a regular search result. Click on the URL in the box to open that URL's web page itself in another browser window.

Not interested in visiting web pages just yet? Want to do some more search visualization? Double-click on one of the nodes. TouchGraph uses the API to request from Google pages similar to the URL of the node you double-clicked. Keep double-clicking at will; when no more pages are available, a green C will appear when you put your mouse over the node (no more than 30 results are available for each node). If you do it often enough, you'll end up with a whole screen full of nodes with lines denoting their relationship to one another, as Figure 6-6 shows.

Figure 6-6. Node mass expanded by double-clicking on nodes


64.3 Visualization Options

Once you've generated similarity page listings for a few different sites, you'll find yourself with a pretty crowded page. TouchGraph has a few options to change the look of what you're viewing.

For each node, you can show page title, page URL, or "point" (the first two letters of the title). If you're just browsing page relationships, the title's probably best. However, if you've been working with the applet for a while and have mapped out a plethora of nodes, the "point" or URL options can save some space. The URL option removes the www and .com from the URL, leaving the other domain suffixes. For example, www.perl.com will show as perl, while www.perl.org shows as perl.org.

Speaking of saving space, there's a zoom slider on the upper-right side of the applet window. When you've generated several distinct groups of nodes, zooming out allows you to see the different groupings more clearly. However, it also becomes difficult to see relationships between the nodes in the different groups.

TouchGraph offers the option to view the "singles," the nodes in a group that have a relationship with only one other node. This option is off by default; check the Show Singles checkbox to turn it on. I find it's better to leave them out; they crowd the page and make it difficult to establish and explore separate groups of nodes.

The Radius setting specifies how many nodes will show around the node you've clicked on. A radius of 1 will show all nodes directly linked to the node you've clicked, a radius of 2 will show all nodes directly linked to the node you've clicked as well as all nodes directly linked to those nodes, and so on. The higher the radius, the more crowded things get. The groupings do, however, tend to settle themselves into nice little discernible clumps, as shown in Figure 6-7.

Figure 6-7. Node mass with Radius set to 4
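Two of the options above are easy to picture in code. The Python sketch below is only an illustration of the behavior as described, not TouchGraph's actual implementation (which isn't shown here): url_label mimics the URL display option, and nodes_within_radius treats the Radius setting as a breadth-first walk that stops a fixed number of links away from the clicked node. Both function names and the sample graph are hypothetical.

```python
from collections import deque


def url_label(host):
    """Shorten a host name the way the URL option is described: drop www and .com."""
    if host.startswith("www."):
        host = host[len("www."):]
    if host.endswith(".com"):
        host = host[:-len(".com")]
    return host


def nodes_within_radius(graph, start, radius):
    """Return the set of nodes reachable from start in at most radius links."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # already at the edge of the radius; don't expand further
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Hypothetical similarity graph: each node lists the nodes it is linked to
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}

print(url_label("www.perl.com"), url_label("www.perl.org"))
print(sorted(nodes_within_radius(graph, "a", 1)))
print(sorted(nodes_within_radius(graph, "a", 2)))
```

Each extra step of radius pulls in the neighbors of everything already on screen, which is why the display gets crowded so quickly at higher settings.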


A drop-down menu beside the Radius setting specifies how many search results (that is, how many connections) are shown. A setting of 10 is, in my experience, optimal.

64.4 Making the Most of These Visualizations

Yes, it's cool. Yes, it's unusual. And yes, it's fun dragging those little nodes around. But what exactly is the TouchGraph good for?

TouchGraph does two rather useful things. First, it allows you to see at a glance the similarity relationship between large groups of URLs. You can't do this with several flat results to similar URL queries. Second, if you do some exploration you can sometimes get a list of companies in the same industry or area. This comes in handy when you're researching a particular industry or topic. It'll take some exploration, though, so keep trying.

TouchGraph Google Browser created by Alex Shapiro (http://www.touchgraph.com/).


Hack 65 Meandering Your Google Neighborhood

Google Neighborhood attempts to detangle the Web by building a "neighborhood" of sites around a URL.

It's called the World Wide Web, not the World Wide Straight Line. Sites link to other sites, building a "web" of sites. And what a tangled web we weave.

Google Neighborhood attempts to detangle some small portion of the Web by using the Google API to find sites related to a URL you provide, scraping the links on the sites returned, and building a "neighborhood" of sites that link to both the original URL and each other.

If you'd like to give this hack a whirl without having to run it yourself, there's a live version available at http://diveintomark.org/archives/2002/06/04.html#who_are_the_people_in_your_neighborhood. The source code (included below) for Google Neighborhood is available for download from http://diveintomark.org/projects/misc/neighbor.py.txt.

65.1 The Code

Google Neighborhood is written in the Python (http://www.python.org) programming language. Your system will need to have Python installed for you to run this hack.

    """neighbor.cgi
    Blogroll finder and aggregator"""

    __author__ = "Mark Pilgrim (f8dy@diveintomark.org)"
    __copyright__ = "Copyright 2002, Mark Pilgrim"
    __license__ = "Python"

    try:
        import timeoutsocket # http://www.timotasi.org/python/timeoutsocket.py
        timeoutsocket.setDefaultSocketTimeout(10)
    except:
        pass

    import urllib, urlparse, os, time, operator, sys, pickle, re, cgi
    from sgmllib import SGMLParser
    from threading import *

    BUFFERSIZE = 1024
    IGNOREEXTS = ('.xml', '.opml', '.rss', '.rdf', '.pdf', '.doc')
    INCLUDEEXTS = ('', '.html', '.htm', '.shtml', '.php', '.asp', '.jsp')
    IGNOREDOMAINS = ('cgi.alexa.com', 'adserver1.backbeatmedia.com',
                     'ask.slashdot.org', 'freshmeat.net', 'readroom.ipl.org',
                     'amazon.com', 'ringsurf.com')
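The listing breaks off after these constants, so how neighbor.cgi actually applies them isn't shown here. As a guess at their purpose, the sketch below uses them to decide whether a scraped link is worth keeping: skip ignored domains, skip feed and document extensions, and keep only page-like extensions. The function is_candidate_link is hypothetical, and the sketch is written for Python 3 (urllib.parse), whereas the original listing targets Python 2.

```python
import posixpath
from urllib.parse import urlparse

# Constants copied from the listing above
IGNOREEXTS = ('.xml', '.opml', '.rss', '.rdf', '.pdf', '.doc')
INCLUDEEXTS = ('', '.html', '.htm', '.shtml', '.php', '.asp', '.jsp')
IGNOREDOMAINS = ('cgi.alexa.com', 'adserver1.backbeatmedia.com',
                 'ask.slashdot.org', 'freshmeat.net', 'readroom.ipl.org',
                 'amazon.com', 'ringsurf.com')


def is_candidate_link(url):
    """Return True if a link looks like an ordinary page on a non-ignored site.

    This is an assumed use of the constants, not the original script's logic.
    """
    parts = urlparse(url)
    host = (parts.hostname or '').lower()
    # Treat subdomains of an ignored domain as ignored too (an assumption)
    if any(host == d or host.endswith('.' + d) for d in IGNOREDOMAINS):
        return False
    ext = posixpath.splitext(parts.path)[1].lower()
    if ext in IGNOREEXTS:
        return False
    return ext in INCLUDEEXTS


links = [
    "http://www.python.org/index.html",
    "http://www.amazon.com/books",
    "http://example.org/feed.rss",
    "http://example.org/",
]
kept = [link for link in links if is_candidate_link(link)]
print(kept)
```

The empty string in INCLUDEEXTS lets extensionless URLs such as a site's home page through, which matters because blogroll links usually point at a bare domain.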
