Spotlight on Spotlight - Carol Smith Home Page
1. The target HTML page was opened in a web browser.
2. Because a straight copy/paste routine would have captured undesirable information
and hyperlinks extraneous to the message, each page was then reloaded via
the page’s ‘Printer-friendly’ hyperlink.
3. The message was copied in its entirety using the cmd-a/cmd-c keyboard shortcuts
(Macintosh).
4. Using the cmd-v keyboard shortcut (Macintosh), the message was then pasted into a
new plain text document in Apple’s TextEdit application.
5. Two corrections were made to each plain text document:
a. The phrase “Return to Message” (a hyperlink in the original page) was
deleted from the end of each document.
b. In order to avoid web crawler agents, the original HTML pages provide e-mail
addresses in .gif format. For this reason, each author’s e-mail address
information needed to be entered manually.
6. Each plain text file was then saved to the hard drive.
7. A small percentage of message board postings were accompanied by .jpg
attachments, typically scanned documents relating to the message. Each of these
attachments (22 in all) was saved as a separate data set file. Each attachment first
had to be loaded into a separate browser window; for some unknown reason,
attachments could only be saved as .gif images without this extra step, even though
the extension of the attachment indicated it was a .jpg file.
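Correction 5a was made by hand in the procedure above. If it were scripted instead, it might look like the following minimal sketch. The function name `strip_return_link` and its `marker` parameter are hypothetical, introduced here only for illustration.

```python
def strip_return_link(text: str, marker: str = "Return to Message") -> str:
    """Remove a trailing marker phrase from a pasted message.

    Trailing whitespace is always trimmed; the marker is removed only
    when it is the last thing in the document.
    """
    stripped = text.rstrip()
    if stripped.endswith(marker):
        # Drop the marker, then any blank lines that preceded it.
        stripped = stripped[: -len(marker)].rstrip()
    return stripped
```

A document that does not end with the phrase is returned unchanged apart from trailing whitespace, so the function is safe to run over every file in the data set.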
After some consideration, it was decided to name each text file sequentially, beginning with
001, 002, 003, etc. If an initial message board posting received replies, each posting in a
single thread was given the same number, but distinguished with sequential letters; e.g.,
001a, 001b, 001c, etc. Some thought was given as to whether file names should indicate
the level of depth in a particular thread; that is, if a posting was the second reply to a reply of
an initial posting, it would be labeled 001aab. This level of complexity was deemed unnecessary,
however, as any thread in question could be easily located in its original web location, should
the sequence of postings become of interest.
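The naming scheme can be illustrated with a short sketch. It assumes that a posting with no replies keeps the bare number (001) and that, within a thread, the initial posting takes the letter a; the text does not state either point explicitly. The function name `thread_file_names` is hypothetical.

```python
from string import ascii_lowercase


def thread_file_names(thread_number: int, posting_count: int) -> list[str]:
    """Return the file names for one thread under the flat naming scheme.

    thread_number: sequential number of the thread (1, 2, 3, ...).
    posting_count: initial posting plus all replies, at any depth.
    """
    if not 1 <= posting_count <= len(ascii_lowercase):
        # The source does not say what happens beyond 26 postings.
        raise ValueError("posting_count must be between 1 and 26")
    base = f"{thread_number:03d}"  # zero-padded: 001, 002, ...
    if posting_count == 1:
        return [base]  # assumption: a lone posting gets no letter
    return [base + letter for letter in ascii_lowercase[:posting_count]]
```

Because reply depth is deliberately not encoded, a reply to a reply simply takes the next letter in sequence rather than a compound label such as 001aab.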
Data Set Issues
As described in the Functional Analysis section below, two decisions made during the
creation of the initial data set proved problematic and required further data set modification:
1. Because Mac files do not require extensions (.txt, .doc, etc.), extensions were not
initially entered during the file-naming step.
2. Documents were initially saved to separate sub-folders for each of the Internet
message boards (i.e., “Ancestry-Minnick”; “Ancestry-Minick”; “Ancestry-Minck”;
“Ancestry-Minnich”; “Ancestry-Minich”; “Ancestry-Mink”). Attachments were
further segregated into folders within these folders, labeled “Ancestry-Minnick-Images”,
etc. Finally, all six sub-folders were contained within a single top-level folder
labeled “Spotlight Data Set.”
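The text does not say how the missing extensions were eventually added. One way to do it in bulk is sketched below, under the assumption that every extension-less file in the folder tree is a plain text message. The function name `add_txt_extension` is hypothetical; the sketch leaves attachments (which already carry an extension) and hidden files untouched.

```python
from pathlib import Path


def add_txt_extension(root: Path) -> list[Path]:
    """Append ".txt" to every extension-less file under root.

    Returns the new paths of the files that were renamed.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name.startswith("."):
            continue  # skip folders and hidden files such as .DS_Store
        if path.suffix:
            continue  # already has an extension (e.g. a .jpg attachment)
        target = path.with_name(path.name + ".txt")
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target)
    return renamed
```

Called on the top-level folder, for example `add_txt_extension(Path("Spotlight Data Set"))`, it walks all six message board sub-folders and their image folders in one pass.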
It should also be noted that the fielded format of the documents in their original web format