12.01.2015 Views

Download - Academy Publisher


A PDF document is organized as a tree whose root node is the Catalog dictionary of the PDF file. Four subtrees lie below the root: the Pages Tree, the Outline Tree, Article Threads, and Named Destinations. In the page tree, every page object is a leaf node, and each page inherits attribute values from its parent node as defaults for its corresponding attributes. Named destinations allow other objects in the PDF document to refer to a page region by a character-string name.
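
The inheritance of page attributes described above can be illustrated with a minimal sketch. It assumes the page tree has already been parsed into nested Python dictionaries; the function name `resolve_page_attributes`, the constant `INHERITABLE`, and the sample tree are hypothetical and serve only as illustration.

```python
# Hypothetical in-memory model of a page tree; parsing the PDF file is omitted.
# Page attributes that may be inherited from ancestor nodes.
INHERITABLE = ("Resources", "MediaBox", "CropBox", "Rotate")


def resolve_page_attributes(node, inherited=None):
    """Yield one attribute dict per leaf page, with inherited defaults filled in."""
    inherited = dict(inherited or {})
    # A value set on this node overrides the one passed down from its ancestors.
    for key in INHERITABLE:
        if key in node:
            inherited[key] = node[key]
    if node.get("Type") == "Page":
        yield dict(inherited)
    else:
        for kid in node.get("Kids", []):
            yield from resolve_page_attributes(kid, inherited)


page_tree = {
    "Type": "Pages",
    "MediaBox": [0, 0, 612, 792],
    "Kids": [
        {"Type": "Page"},  # inherits MediaBox from the parent
        {"Type": "Page", "MediaBox": [0, 0, 595, 842], "Rotate": 90},
    ],
}
pages = list(resolve_page_attributes(page_tree))
```

In this sketch the first page takes its `MediaBox` from the parent node, while the second page overrides it with its own value.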

IV. RESEARCH ON THE CONVERSION FROM PDF DOCUMENTS TO PS FILES

PDF is a document format, while PostScript is a programming language. This fundamental difference inevitably leads to differences in how document content is expressed, and therefore in how PDF documents are produced and used, including how a hard copy of a PDF file is output on a PostScript printer or a non-PostScript printer.

A. The feasibility of the conversion from PDF documents to PS files

PDF is rooted in PostScript, so the two share many similarities, and these similarities make document conversion feasible. The PDF file format uses the imaging model of the PostScript language to express text, graphics, and other objects. As in a PostScript program, a PDF page description draws a page by placing “paint” on selected regions. Its imaging characteristics can be summarized as follows:

Painted page objects can be abstracted as “figures”, which may be character shapes, digitally sampled images such as photographs, or regions (graphics) defined by curves and straight lines.

The imaging operation is relatively flexible: it is not limited to print media or film, and can also target paper and other recording media. Paint of any color may be used when imaging.

Because the page is treated as a “figure” when imaging, it can be clipped to another shape, so that only the part of the “figure” inside that shape appears on the page.

A typical PostScript program uses the procedure-definition facility of the PostScript language to define its own set of operators, which it then uses to describe pages and control output. PDF, in contrast, defines its own operator set, and most of its operators are very similar to PostScript operators.

Although a PDF document cannot be interpreted directly by a PostScript interpreter, a conversion function in the interpreter can transform the page descriptions of the PDF document into a PostScript program. The basic method is as follows:

The difference between PostScript and PDF lies mainly in the page description operators, so the solution is to insert PostScript procedure sets (ProcSets) that implement the PDF page description operators.

Extract the content of each page of the PDF document. Because the PDF document format does not require pages to be stored in logical order, each page must be described separately. The page description part of a conventional PostScript program uses application-specific procedures, for example “m” for “moveto”, “l” for “lineto”, and so on.
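
The idea of a procedure set that maps PDF operators onto PostScript operators can be sketched as follows. This is only an illustration under stated assumptions: the table `PDF_TO_PS` covers a small subset of path and painting operators, and the helper name `build_prolog` is hypothetical.

```python
# Illustrative subset of PDF operators and the PostScript operators they map to.
PDF_TO_PS = {
    "m": "moveto",
    "l": "lineto",
    "c": "curveto",
    "h": "closepath",
    "S": "stroke",
    "f": "fill",
    "w": "setlinewidth",
}


def build_prolog(mapping):
    """Return PostScript source defining each PDF operator name as a procedure."""
    lines = ["%%BeginProlog"]
    for pdf_op, ps_op in mapping.items():
        # Produces a definition such as: /m { moveto } bind def
        lines.append(f"/{pdf_op} {{ {ps_op} }} bind def")
    lines.append("%%EndProlog")
    return "\n".join(lines)


prolog = build_prolog(PDF_TO_PS)
```

With such a prolog in place, a page content fragment like `0 0 m 100 100 l S` can be passed through largely unchanged, since each short name is now defined as a PostScript procedure.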

Decoding the compressed text, image, and graphics data is not a simple task for ordinary users. PostScript Level 2 printers can handle most of the compressed data in PDF documents without prior decoding, with the exception of data encoded with the Flate compression algorithm. Most PDF documents now use Flate compression, so the converter must first decompress the Flate-encoded data and then complete the conversion from PDF to PS.
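
As a minimal sketch of the decompression step, Flate-encoded stream data can be inflated with Python's standard `zlib` module. The function name `decode_flate_stream` is hypothetical, and the sketch assumes the stream uses no predictor, so the raw bytes can be inflated directly.

```python
import zlib


def decode_flate_stream(raw: bytes) -> bytes:
    """Inflate the body of a stream whose filter is FlateDecode.

    Assumes no predictor is applied, which is an assumption of this sketch.
    """
    return zlib.decompress(raw)


# Round trip on an illustrative page content fragment.
content = b"0 0 m 100 100 l S"
compressed = zlib.compress(content)
decoded = decode_flate_stream(compressed)
```

The decoded bytes can then be written into the PostScript output in place of the compressed stream.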

Insert resources, such as fonts, into the PostScript file, replacing or inserting font definitions as necessary. The basis for this is the font specifications in the PDF file.

Place the information in the correct order. The result should be, in the usual sense, a PostScript program file containing all the visible parts of the document; hyperlinks, comments, bookmarks, and other PDF elements are no longer included.
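
The ordering step can be sketched as below. The layout assumes the common document structuring comments of PostScript files (prolog, setup, pages, trailer); the function name `assemble_postscript` and its parameters are hypothetical, and the inputs are taken to be already-converted PostScript fragments.

```python
def assemble_postscript(prolog: str, resources: list, pages: list) -> str:
    """Join converted fragments into one PostScript program in output order."""
    parts = ["%!PS-Adobe-3.0", f"%%Pages: {len(pages)}", "%%EndComments"]
    # Operator definitions come first so that page content can use them.
    parts += ["%%BeginProlog", prolog, "%%EndProlog"]
    # Shared resources such as font definitions follow the prolog.
    parts += ["%%BeginSetup", *resources, "%%EndSetup"]
    for number, content in enumerate(pages, start=1):
        parts += [f"%%Page: {number} {number}", content, "showpage"]
    parts += ["%%Trailer", "%%EOF"]
    return "\n".join(parts) + "\n"


ps_text = assemble_postscript(
    prolog="/m { moveto } bind def\n/l { lineto } bind def\n/S { stroke } bind def",
    resources=["% font definitions would be inserted here"],
    pages=["0 0 m 100 100 l S", "10 10 m 50 50 l S"],
)
```

Only the visible page content is written; nothing in this sketch carries hyperlinks, comments, or bookmarks into the output.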

This yields a PostScript language document, which can then be sent to the printer and printed. In view of this, the conversion from PDF documents to PS documents is entirely feasible, and its key lies in implementing the interpreter.

B. The conversion algorithm from PDF documents to PS files

a) The description of the algorithm<br />

To convert a PDF file to PS format, first extract the page information of the PDF file, which includes resources, content, annotations, and so on. Then express each page in the PS language: first write the resources referenced by the page, and then write the content of the page, so this part only requires expressing each element in the corresponding PS language. This part is less complex than the extraction of the PDF information, because most PDF operators are the same as those of PostScript; it is relatively simple as long as the correspondence between the two operator sets is established in advance. The algorithm below constructs the information extractor for the PDF file.

Perform lexical analysis and syntax analysis of the PDF document, to recognize individual lexical tokens and syntactic objects respectively. The object formats have been introduced above, and the analyzer needs to recognize them.
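
A minimal sketch of the lexical analysis step is given below. It covers only a simplified subset of PDF syntax (no nested parentheses in strings, no stream bodies), and the names `TOKEN_RE` and `tokenize` are hypothetical.

```python
import re

# Token classes for a simplified subset of PDF syntax.
TOKEN_RE = re.compile(
    rb"""
    (?P<comment>%[^\r\n]*)
  | (?P<dict_open><<) | (?P<dict_close>>>)
  | (?P<array_open>\[) | (?P<array_close>\])
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<string>\((?:\\.|[^\\()])*\))
  | (?P<hexstring><[0-9A-Fa-f\s]*>)
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))
  | (?P<keyword>[A-Za-z][A-Za-z0-9*']*)
    """,
    re.VERBOSE,
)


def tokenize(data: bytes):
    """Yield (kind, text) pairs for each lexical token, skipping comments."""
    pos = 0
    while pos < len(data):
        if data[pos:pos + 1].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(data, pos)
        if match is None:
            raise ValueError(f"unexpected byte at offset {pos}")
        pos = match.end()
        if match.lastgroup != "comment":
            yield match.lastgroup, match.group().decode("latin-1")


tokens = list(tokenize(b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj"))
```

A syntax analyzer would then group these tokens into objects, for example combining `2`, `0`, and `R` into an indirect reference.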

Locate the file trailer by its keywords, because the trailer contains the address of the Catalog and the address of the cross-reference table. Then analyze the content of the cross-reference table, that is, the offset of each indirect object, and build the corresponding object index.
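
The trailer lookup and cross-reference analysis can be sketched as follows, assuming a classic (uncompressed) cross-reference table with a single section. The function names `find_startxref` and `parse_xref_table` and the tiny in-memory sample file are hypothetical.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset of the cross-reference table, read from the file tail."""
    tail_start = max(0, len(data) - 1024)
    idx = data.rfind(b"startxref", tail_start)
    if idx < 0:
        raise ValueError("startxref keyword not found")
    match = re.match(rb"startxref\s+(\d+)", data[idx:])
    if match is None:
        raise ValueError("no offset after startxref")
    return int(match.group(1))


def parse_xref_table(data: bytes, offset: int) -> dict:
    """Map object number -> byte offset for the in-use entries of the table."""
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("xref keyword not found at the given offset")
    offsets = {}
    i = 1
    while i < len(lines) and not lines[i].startswith(b"trailer"):
        first, count = map(int, lines[i].split())
        for n in range(count):
            entry = lines[i + 1 + n].split()
            if entry[2] == b"n":  # 'n' marks an in-use entry, 'f' a free one
                offsets[first + n] = int(entry[0])
        i += 1 + count
    return offsets


# Tiny illustrative file: a header, one object, an xref table, and a trailer.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(len(body)).encode("ascii") + b"\n%%EOF\n"
)
xref_offset = find_startxref(sample)
object_offsets = parse_xref_table(sample, xref_offset)
```

The resulting dictionary serves as the object index: given an object number, the extractor can seek directly to the byte offset where that object begins.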

Analyze the content of the Catalog, which is the root object of the PDF file, and obtain its attributes, such as the Outline attribute and the Pages attribute, that is, the addresses of the indirect objects they refer to.
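
Reading the indirect references out of the Catalog can be sketched with a simple pattern match. This is a simplification that assumes the Catalog dictionary is available as raw bytes and that the entries of interest are plain indirect references; the names `REF_RE` and `catalog_references` and the sample dictionary are hypothetical.

```python
import re

# Matches entries of the form "/Key <object number> <generation> R".
REF_RE = re.compile(rb"/(\w+)\s+(\d+)\s+(\d+)\s+R")


def catalog_references(catalog_dict: bytes) -> dict:
    """Map each Catalog key to the (object number, generation) it references."""
    return {
        key.decode("ascii"): (int(num), int(gen))
        for key, num, gen in REF_RE.findall(catalog_dict)
    }


catalog = b"<< /Type /Catalog /Pages 2 0 R /Outlines 5 0 R >>"
refs = catalog_references(catalog)
```

Each reference can then be resolved through the cross-reference table to locate the page tree and the outline tree.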
