[t]he majority of OCR software suppliers define accuracy in terms of a percentage figure based on the number of correct characters per volume of characters converted. This is very likely to be a misleading figure, as it is normally based upon the OCR engine attempting to convert a perfect laser-printed text of the modernity and quality of, for instance, the printed version of this document. In our experience, gaining character accuracies of greater than 1 in 5,000 characters (99.98%) with fully automated OCR is usually only possible with post-1950's printed text, whilst gaining accuracies of greater than 95% (5 in 100 characters wrong) is more usual for post-1900 and pre-1950's text [...]. (2009, no pag.)

As pedocs' staffing is not adequate for intellectual control of digitization and OCR results, some of the digitized material might not be an entirely accurate representation of the original source. This might, for example, cause publications not to be findable in the database due to spelling errors in the title. pedocs addresses this problem by mentioning it in the policy document, which since its last update informs users and producers that “[i]n specific cases, pedocs can use scanning and OCR software to digitize documents that are provided in print only. Please note that the results from these procedures may contain errors depending on the printed master document” (pedocs Guidelines).

To further address this problem, pedocs might want to consider adding a note to the records of digitized documents, pointing out to users that what they are viewing is the result of a digitization process with OCR. This measure would help to ensure the authenticity of the digital object in that “provenance” information added to a record leads to greater transparency concerning the origin and creation of the digital object users are viewing (cf. Factor et al. 2009).
93 A further, if somewhat more taxing, possibility is the inclusion of the pre-OCR version of the digitization in addition to the version which underwent OCR: the image file could be attached to the record for the digitized document along with the OCR version, together with a description explaining the relationship between the two files. This practice would give users the opportunity to compare for themselves how strongly the OCR version deviates from the mere image. However, as the number of files to be archived would be doubled for digitized publications, this does not seem a feasible method.

pedocs is still working on finding the best possible solution to the problem by testing different OCR software. Part of the challenge is also to find a solution which complies with accessibility standards for webpages (“Barrierefreiheit”). According to Dr. Julia Kreusch of the pedocs team, in the normal use case, what users will see when accessing a publication is the image file, which will normally be an authentic reproduction of the original print version of the publication. Hence, a “provenance note” as just suggested may not go to the heart of the problem.

93 Since mid-October 2009, documents published in pedocs receive a cover sheet (see below), which can also contain a note saying that the publication was digitized by pedocs (see, for example, http://nbn-resolving.de/urn/resolver.pl?urn=urn:nbn:de:0111-opus-16178 – 24.10.2009).

It was stated above that one way of protecting the integrity and authenticity of files during submission, as well as the information entered into the submission form by authors, is the use of secure connections to prevent someone from interfering with the
transmission. However, pedocs does not currently use secure connections (e.g. SSL or SSH protocols) in the upload process.

Ingest: Generate AIP

After the metadata have been completed and/or modified, and the uploaded file has, where necessary, been converted to PDF, the SIP is transformed into a pre-stage AIP, i.e. it is saved in the pedocs file system on the DIPF servers in Berlin.

In accordance with the requirements of the criteria catalogs, pedocs records receive a Uniform Resource Name (URN) allowing them to be permanently addressed and identified. As an institution not organized in a library association, DIPF was allocated a sub-namespace and is hence able to assign URNs within this namespace. Such URNs are structured in the following manner: urn:nbn:de:[four digit number]-[unique production number][check digit]. 94 While the actual “object” to which the URN is assigned is the record for the published document, this document itself can then be addressed via the URL embedded in the record, which is noted in the metadata along with the URN. This URL has the structure http://www.pedocs.de/volltexte/[Year]/[ID]/ and contains an identification number which is unique within pedocs. Both the URN and the URL are then registered with the DNB, which is also notified should changes to the URL occur. 95

Although URNs were assigned by the pedocs software from the outset, they have been registered with the DNB only since October 2009. Currently, newly assigned URNs are registered immediately, whereas previously assigned URNs will be registered retrospectively. This is why the URN field is not currently visible in all records: the field was suppressed as long as the URN was not functional. In the future, all metadata records and the cover sheets added to the documents will contain a reference to the URN. As outlined above, the TRAC criteria also require that existing persistent identifiers be included in the metadata.
No such field exists in the pedocs metadata schema, nor is it planned to implement one or to submit previously assigned identifiers to the DNB. 96

The pedocs software allows the attachment of more than one file to a record; however, structural metadata indicating, for example, in which order these files should be viewed are not added. Thus, in the previously cited record with the ID 807, it is not obvious to users that the file “Dokument 2.pdf” (logically) precedes “Dokument 1.pdf” and in fact contains explanatory notes relevant to understanding the content of the main file. pedocs is aware of this problem (which also occurs in Qucosa, the other OPUS installation considered in this work) and is working on a solution which will allow staff to determine the order in which files are displayed in the record. However, the problem might also simply be solved by adding short descriptions to files where more than one document is attached to a record.

94 See www.persistent-identifier.de – 11.10.2009.

95 See the DNB homepage (http://www.d-nb.de/netzpub/erschl_lza/np_urn.htm – 03.11.2009) for further information about URNs at the DNB and the DNB's associated strategy.

96 According to Kreusch, an existing persistent identifier might be included in the field for general remarks. This field, however, will not be submitted to the DNB.
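
The URN structure described under “Ingest: Generate AIP” (urn:nbn:de:[four digit number]-[unique production number][check digit]) can be illustrated with a short sketch of how the trailing check digit might be computed and verified. The sketch rests on several assumptions that are not stated in the text above: the character conversion table is reproduced from memory of the check-digit scheme used for urn:nbn:de and should be compared with the documentation at www.persistent-identifier.de before any real use; the function names are hypothetical; and the reading of the URN cited in footnote 93 as the stem “urn:nbn:de:0111-opus-1617” followed by the check digit 8 is an inference from the structure given above, not a statement by pedocs.

```python
# Sketch of a check-digit calculation for urn:nbn:de identifiers.
# ASSUMPTION: this conversion table is reproduced from memory and must be
# verified against the official documentation before being relied upon.
_CHAR_CODES = {
    "0": "1", "1": "2", "2": "3", "3": "4", "4": "5",
    "5": "6", "6": "7", "7": "8", "8": "9", "9": "41",
    "a": "18", "b": "14", "c": "19", "d": "15", "e": "16", "f": "21",
    "g": "22", "h": "23", "i": "24", "j": "25", "k": "42", "l": "26",
    "m": "27", "n": "13", "o": "28", "p": "29", "q": "31", "r": "12",
    "s": "32", "t": "33", "u": "11", "v": "34", "w": "35", "x": "36",
    "y": "37", "z": "38",
    "-": "39", ":": "17", "_": "43", "/": "45", ".": "47", "+": "49",
}


def urn_check_digit(urn_stem: str) -> int:
    """Return the check digit for a URN given WITHOUT its final check digit.

    Hypothetical helper: each character is replaced by its code, the codes
    are concatenated into one digit string, every digit is multiplied by its
    1-based position, and the sum is integer-divided by the last digit of
    that string. The last digit of the quotient is the check digit.
    """
    try:
        digits = "".join(_CHAR_CODES[ch] for ch in urn_stem.lower())
    except KeyError as exc:
        raise ValueError(f"character not allowed in URN: {exc.args[0]!r}") from exc
    if not digits:
        raise ValueError("empty URN")
    weighted_sum = sum(int(d) * pos for pos, d in enumerate(digits, start=1))
    # No code in the table ends in 0, so this division cannot fail.
    quotient = weighted_sum // int(digits[-1])
    return quotient % 10


def has_valid_check_digit(urn: str) -> bool:
    """Check whether the last character of a full URN matches its check digit."""
    stem, last = urn[:-1], urn[-1]
    return last.isdigit() and urn_check_digit(stem) == int(last)


if __name__ == "__main__":
    # URN taken from footnote 93; the split into stem and check digit is assumed.
    print(urn_check_digit("urn:nbn:de:0111-opus-1617"))         # 8
    print(has_valid_check_digit("urn:nbn:de:0111-opus-16178"))  # True
```

Under these assumptions, the calculation reproduces the final digit of the URN cited in footnote 93. A check of this kind could be used to catch transcription errors in URNs before they are registered with the DNB or printed on cover sheets.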