Download - Academy Publisher
The PDF document structure is a tree whose root node is the Catalog dictionary of the PDF file. There are four subtrees below the root node: the Pages Tree, the Outline Tree, Article Threads, and Named Destinations. In the page tree, every page object is a leaf node, and it inherits the attribute values of its parent node as the default values of its own corresponding attributes. The function of named destinations is to let other objects in the PDF document refer to a page region by a character string name.
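
The attribute inheritance in the page tree can be illustrated with a minimal sketch. It assumes a hypothetical in-memory model in which each node is a Python dict with an optional "Parent" link; the function name `resolve_attribute` and the sample values are illustrative and do not come from the paper.

```python
# Attributes that a page may inherit from its ancestors in the page tree.
INHERITABLE = ("Resources", "MediaBox", "CropBox", "Rotate")

def resolve_attribute(node, key):
    """Return the node's own value for key, or an inherited one if allowed."""
    if key in node:
        return node[key]
    if key not in INHERITABLE:
        return None  # non-inheritable attributes are never looked up in parents
    parent = node.get("Parent")
    while parent is not None:
        if key in parent:
            return parent[key]
        parent = parent.get("Parent")
    return None

# Hypothetical three-level page tree: root -> intermediate node -> page (leaf).
root = {"Type": "Pages", "MediaBox": [0, 0, 612, 792], "Rotate": 0}
kids = {"Type": "Pages", "Parent": root, "Rotate": 90}
page = {"Type": "Page", "Parent": kids}

media_box = resolve_attribute(page, "MediaBox")  # found on the root node
rotation = resolve_attribute(page, "Rotate")     # nearest ancestor wins
```

The lookup stops at the nearest ancestor that defines the attribute, so a value set on an intermediate node overrides the one on the root.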
IV. RESEARCH ON THE CONVERSION FROM PDF DOCUMENTS TO PS FILES
PDF is a document format, whereas PostScript is a programming language. This fundamental difference inevitably leads to differences in how document content is expressed, in how PDF documents are produced and used, and in how a hard copy of a PDF file is output on a PostScript printer or a non-PostScript printer.
A. The feasibility of the conversion from PDF documents to PS files
PDF has its roots in PostScript, so the two have many similarities, and these similarities determine the feasibility of document conversion. The PDF file format uses the imaging model of the PostScript language to express text, graphics and other objects. Like a PostScript language program, a PDF page description draws a page by placing "paint" on selected regions. Its imaging characteristics can be summarized as follows:
Painted page objects may be abstracted as "figures", which can be character shapes, digitally sampled photographs, or regions (graphics) defined by curves and straight lines.
The imaging operation is relatively flexible: it is not limited to printing on film, and can also render onto paper and other media. Paint of any color may be used when imaging.
Because page content is treated as a "figure" when imaging, it can be clipped to another shape, so that only the part of the "figure" inside that shape appears on the page.
A typical PostScript language program uses the procedure definition facility of the PostScript language to define its own set of operators, which are used to describe the pages and control output. PDF, in contrast, defines its own operator set, and most of its operators are very similar to the PostScript operators.
Although a PDF document cannot be interpreted by a PostScript interpreter, a converter can transform the page description of the PDF document into a PostScript language program. The basic method is as follows:
The difference between PostScript and PDF lies mainly in the page description operators, so the solution to this problem is to insert a set of PostScript language procedure definitions (ProcSets) that implement the PDF page description operators.
Extract the content of each page of the PDF document. Because the PDF document format does not require pages to be stored in logical order, it is necessary to describe each page separately. The description part of a traditional PostScript language program uses such private procedures, for example representing “moveto” by “m”, “lineto” by “l” and so on.
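
As a minimal sketch of the procedure-definition idea, the following Python code generates a small PostScript prolog that defines a few PDF path operators in terms of PostScript operators. The function name `build_prolog` and the choice of operators are assumptions made for illustration; a real converter would need many more definitions, including text and color operators.

```python
# A few PDF operators and the PostScript operators that behave the same way.
PDF_TO_PS = {
    "m": "moveto",
    "l": "lineto",
    "c": "curveto",
    "h": "closepath",
    "S": "stroke",
    "f": "fill",
    "q": "gsave",
    "Q": "grestore",
}

def build_prolog(mapping):
    """Emit PostScript definitions such as '/m { moveto } bind def'."""
    lines = ["%!PS"]
    for pdf_op, ps_op in mapping.items():
        lines.append(f"/{pdf_op} {{ {ps_op} }} bind def")
    return "\n".join(lines)

# Both languages are postfix, so this path fragment can follow the prolog as is.
page_content = "100 100 m 200 100 l S"
program = build_prolog(PDF_TO_PS) + "\n" + page_content + "\nshowpage\n"
```

Because operands precede operators in both languages, the simple path operators need no reordering once the short names are defined.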
Decoding the compressed text, image and graphics data is not a simple task for ordinary users. For PostScript Level 2 printers, the compressed data in PDF documents does not need to be decoded before conversion, except for data encoded with the Flate compression algorithm. Most PDF documents now use the Flate compression algorithm, so the converter must first decompress the Flate-encoded data and then complete the conversion from PDF to PS.
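
The Flate decompression step can be sketched with Python's standard `zlib` module, since Flate-encoded PDF streams use the zlib/deflate format. The helper name `decode_flate` is hypothetical, and this sketch assumes a stream without predictor parameters, which would require an extra post-processing step.

```python
import zlib

def decode_flate(stream_bytes):
    """Decompress the raw bytes of a Flate-encoded stream."""
    return zlib.decompress(stream_bytes)

# Round trip on a small page description, standing in for a real stream body.
content = b"100 100 m 200 100 l S"
compressed = zlib.compress(content)
decoded = decode_flate(compressed)
```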
Insert resources, such as fonts, into the PostScript file, replacing and inserting font definitions as necessary. The basis for this is the font specifications in the PDF file.
Place the information in the correct order. The result should be, in the usual sense, a PostScript program file containing all the visible parts of the document; hyperlinks, comments, bookmarks and other PDF elements will no longer be included in the document.
This yields a PostScript language document, which can then be printed by sending it to the printer. In view of this, the conversion from PDF documents to PS documents is entirely feasible, and its key lies in the implementation of the converter.
B. The conversion algorithm from PDF documents to PS files
a) The description of the algorithm
To convert a PDF file to PS format, first extract the page information of the PDF file, which includes resources, content, annotations and so on. Then express each page in the PS language: first write out the resources referred to by the page, and then write out the content of the page, expressed in the corresponding PS language. This part is not as complex as the extraction of the PDF information, because most PDF operators are the same as those of PostScript; it is therefore relatively simple as long as the correspondence between the two operator sets is established in advance. Below is the algorithm for constructing the information extractor of the PDF file.
Perform the lexical analysis and the syntax analysis of the PDF document, to recognize the individual lexical tokens and the grammatical objects respectively. The object format has been introduced above, and we need to implement its identification.
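
The lexical analysis step could look roughly like the following regex-based tokenizer. It is a simplified sketch, not the authors' implementation: it works on text rather than raw bytes, does not handle nested parentheses in literal strings or stream bodies, and the token category names are chosen here for illustration.

```python
import re

# Alternatives are tried in order, so "<<" is matched before a hex string.
TOKEN_RE = re.compile(r"""
    (?P<comment>%[^\r\n]*)
  | (?P<dict_open><<)
  | (?P<dict_close>>>)
  | (?P<hexstring><[0-9A-Fa-f\s]*>)
  | (?P<string>\((?:\\.|[^\\()])*\))
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))
  | (?P<delim>[\[\]{}])
  | (?P<keyword>[^\s()<>\[\]{}/%]+)
""", re.VERBOSE)

def tokenize(text):
    """Split PDF syntax into (kind, lexeme) pairs, dropping comments."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "comment":
            tokens.append((match.lastgroup, match.group()))
    return tokens

tokens = tokenize("1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj")
```

A syntax analyzer can then group these tokens into objects, for example by treating the sequence number, number, `R` as an indirect reference.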
Find the location of the file trailer by its keywords, because the trailer contains the address of the Catalog and the address of the cross-reference table. Then analyze the content of the cross-reference table, which gives the byte offset of each indirect object, and build an index of these offsets.
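
A minimal sketch of this step is shown below. It assumes a classic cross-reference table (not a cross-reference stream) with a single section and no incremental updates; the function names are hypothetical. The sample file is assembled in code so that its offsets are consistent.

```python
import re

def find_startxref(data):
    """Read the cross-reference table offset that follows 'startxref'."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("startxref keyword not found")
    return int(data[pos + len(b"startxref"):].split()[0])

def parse_xref_table(data, offset):
    """Parse a classic xref table into {object number: byte offset}."""
    lines = data[offset:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("expected 'xref' keyword")
    table = {}
    i = 1
    while i < len(lines) and not lines[i].startswith(b"trailer"):
        first, count = (int(x) for x in lines[i].split())
        for n in range(count):
            off, _gen, flag = lines[i + 1 + n].split()
            if flag == b"n":  # 'n' marks an in-use entry, 'f' a free one
                table[first + n] = int(off)
        i += 1 + count
    return table

def find_root(data):
    """Return (object number, generation) of the Catalog named by /Root."""
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[data.rfind(b"trailer"):])
    if match is None:
        raise ValueError("/Root entry not found in trailer")
    return int(match.group(1)), int(match.group(2))

# Assemble a tiny two-object file with matching offsets.
header = b"%PDF-1.4\n"
obj1 = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
obj2 = b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
off1 = len(header)
off2 = off1 + len(obj1)
xref_off = off2 + len(obj2)
xref = b"xref\n0 3\n0000000000 65535 f \n%010d 00000 n \n%010d 00000 n \n" % (off1, off2)
trailer = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_off
pdf = header + obj1 + obj2 + xref + trailer

offsets = parse_xref_table(pdf, find_startxref(pdf))
root_ref = find_root(pdf)
```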
Analyze the content of the Catalog, which is the root object of the PDF file, and obtain the attributes of the Catalog, such as the Outlines attribute, the Pages attribute and so on, each of which is an indirect object address.
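
As a rough illustration of reading the Catalog attributes, the sketch below collects every dictionary key whose value is an indirect reference. It assumes the Catalog dictionary is already available as text and uses a regular expression rather than a full parser; the name `catalog_references` and the sample object numbers are hypothetical.

```python
import re

# Matches entries of the form "/Key <object number> <generation> R".
REF_RE = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b")

def catalog_references(catalog_text):
    """Map each key holding an indirect reference to (object, generation)."""
    return {key: (int(num), int(gen))
            for key, num, gen in REF_RE.findall(catalog_text)}

catalog = "<< /Type /Catalog /Pages 2 0 R /Outlines 5 0 R >>"
refs = catalog_references(catalog)
```

Each reference can then be resolved through the cross-reference table to locate the corresponding object in the file.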