PDFlib Text Extraction Toolkit (TET) Manual

More documents

Recommendations

Info

utf8_to_utf16( ), or TET_utf32_to_utf16( ) or until an exception is thrown. Clients must copy the string if they need it longer. Bindings This function is not available in Unicode-capable language bindings. The memory used for the converted string will be managed by TET, and must not be freed by the client. C++ Perl PHP C string utf16_to_utf8(string utf16string) string TET_utf16_to_utf8(resource tet, string utf16string) const char *TET_utf16_to_utf8(TET *tet, const char *utf16string, int len, int *size) Convert a string from UTF-16 format to UTF-8. utf16string The string to be converted. A Byte Order Mark (BOM) in the string will be interpreted. If it is missing the platform’s native byte ordering is assumed. len (C language binding only) Length of utf16string in bytes. size (C language binding only) C-style pointer to a memory location where the length of the returned string (in bytes) will be stored. If the pointer is NULL it will be ignored. Returns Bindings The converted UTF-8 string. The generated UTF-8 string will start with the UTF-8 BOM (\xEF\xBB\xBF). On EBCDIC platforms the conversion result including the BOM will finally be converted to EBCDIC. The returned string is valid until the next call to any function other than TET_utf16_to_utf8( ), TET_utf8_to_utf16( ), or TET_utf32_to_utf16( ) or until an exception is thrown. Clients must copy the string if they need it longer. This function is not available in Unicode-capable language bindings. C++ Perl PHP C string utf32_to_utf16(string utf32string, string ordering) string TET_utf32_to_utf16(resource p, string utf32string, string ordering) const char *TET_utf32_to_utf16(TET *tet, const char *utf32string, int len, const char *ordering, int *size) Convert a string from UTF-32 format to UTF-16. utf32string The string to be converted, which must contain a valid UTF-32 sequence. If a Byte Order Mark (BOM) is present, it will be interpreted len (C language binding only) Length of utf32string in bytes. ordering Specifies the byte ordering of the result string: > utf16 or an empty string: the converted string will not have any BOM, and will be stored in the platform’s native byte order. > utf16le: the converted string will be formatted in little endian format, and will be prefixed with the little-endian BOM (\xFF\xFE). > utf16be: the converted string will be formatted in big endian format, and will be prefixed with the big-endian BOM (\xFE\xFF). size (C language binding only) C-style pointer to a memory location where the length of the returned string (in bytes) will be stored. 124 Chapter 10: TET Library API Reference
Returns Scope Bindings The converted UTF-16 string. The returned string is valid until the next call to any function other than TET_utf16_to_utf8( ), TET_utf8_to_utf16( ), or TET_utf32_to_utf16( ) or until an exception is thrown. Clients must copy the string if they need it longer. any This function is not available in Unicode-capable language bindings. C++ void create_pvf(string filename, const void *data, size_t size, string optlist) C# Java void create_pvf(String filename, byte[] data, String optlist) Perl PHP TET_create_pvf(resource tet, string filename, string data, string optlist) VB Sub create_pvf(filename As String, data, optlist As String) C void TET_create_pvf(TET *tet, const char *filename, int len, const void *data, size_t size, const char *optlist) Create a named virtual read-only file from data provided in memory. filename (Name string) The name of the virtual file. This is an arbitrary string which can later be used to refer to the virtual file in other TET calls. len (C language binding only) Length of filename (in bytes) for UTF-16 strings. If len=0 a null-terminated string must be provided. data A reference to the data for the virtual file. In COM this is a variant of byte containing the data comprising the virtual file. In C and C++ this is a pointer to a memory location. In Java this is a byte array. In Perl and PHP this is a string. size (C and C++ only) The length in bytes of the memory block containing the data. optlist An option list according to Table 10.2. The following option can be used: copy Details The virtual file name can be supplied to any API function which uses input files. Some of these functions may set a lock on the virtual file until the data is no longer needed. Virtual files will be kept in memory until they are deleted explicitly with TET_delete_ pvf( ), or automatically in TET_delete( ). Each TET object will maintain its own set of PVF files. Virtual files cannot be shared among different TET objects. Multiple threads working with separate TET objects do not need to synchronize PVF use. If filename refers to an existing virtual file an exception will be thrown. This function does not check whether filename is already in use for a regular disk file. Unless the copy option has been supplied, the caller must not modify or free (delete) the supplied data before a corresponding successful call to TET_delete_pvf( ). Not obeying to this rule will most likely result in a crash. Table 10.2 Options for TET_create_pvf( ) option copy description (Boolean) TET will immediately create an internal copy of the supplied data. In this case the caller may dispose of the supplied data immediately after this call. The copy option will automatically be set to true in the COM, .NET, and Java bindings (default for other bindings: false). In other language bindings the data will not be copied unless the copy option is supplied. 10.2 General Functions 125
Page 1 and 2:
PDFlib GmbH München, Germany www.p
Page 3 and 4:
Contents 0 First Steps with TET 7 0
Page 5:
10.9 Option Handling 151 10.10 pCOS
Page 8 and 9:
0.2 Applying the TET License Key Us
Page 11 and 12:
1 Introduction The PDFlib Text Extr
Page 13 and 14:
may extract some text which is not
Page 15 and 16:
2 TET Command-Line Tool 2.1 Command
Page 17 and 18:
Constructing TET command lines. The
Page 19 and 20:
Extract images and generate TETML i
Page 21 and 22:
3 TET Library Language Bindings Thi
Page 23 and 24:
TET_CATCH(tet) { printf("Error %d i
Page 25 and 26:
3.4 COM Binding Installing the TET
Page 27 and 28:
3.6 .NET Binding The .NET edition o
Page 29 and 30:
3.8 PHP Binding Installing the TET
Page 31 and 32:
3.9 Python Binding Installing the T
Page 33 and 34:
Exception Handling in RPG. TET clie
Page 35 and 36:
4 TET Connectors TET connectors pro
Page 37 and 38:
4.2 TET Connector for the Lucene Se
Page 39 and 40:
Indexing metadata fields. The TET c
Page 41 and 42:
4.4 TET Connector for Oracle The TE
Page 43 and 44:
these options on the command line.
Page 45 and 46:
TET PDF IFilter is freely available
Page 47 and 48:
tracts text and metadata from the P
Page 49 and 50:
5 Configuration 5.1 Indexing protec
Page 51 and 52:
5.2 Resource Configuration and File
Page 53 and 54:
tet/3.0/resource /tet/3.0/resource/
Page 55 and 56:
Geometry. The geometry features may
Page 57 and 58:
6 Text Extraction 6.1 Document Doma
Page 59 and 60:
How to search with Acrobat 8/9: not
Page 61 and 62:
6.2 Unicode Concepts Unicode encodi
Page 63 and 64:
artificial characters which will be
Page 65 and 66:
width (x, y) beta fontsize baseline
Page 67 and 68:
6.4 Support for Chinese, Japanese,
Page 69 and 70:
Options for text filtering. There a
Page 71 and 72:
6.6 Content Analysis PDF documents
Page 73 and 74: Dehyphenation. Hyphenated words at
Page 75 and 76: Table 6.3 Document styles docstyle=
Page 77 and 78: Fig. 6.5 Sample font reports create
Page 79 and 80: glyphmapping {{fontname=Warnock* to
Page 81 and 82: 7 Image Extraction 7.1 Image Extrac
Page 83 and 84: 7.2 Image Geometry Using TET_get_im
Page 85 and 86: 7.3 Image Analysis Image merging. S
Page 87 and 88: 7.4 Restrictions and Caveats Image
Page 89 and 90: 8 TET Markup Language (TETML) 8.1 C
Page 91 and 92: Depending on the sele
Page 93 and 94: You can specify the amount of text
Page 95 and 96: Table 8.3 TETML elements TETML elem
Page 97 and 98: 8.4 Transforming TETML with XSLT Ve
Page 99 and 100: Run the program as follows: nxslt3.
Page 101 and 102: 8.5 XSLT Samples The TET distributi
Page 103 and 104: Alphabetical list of words in the d
Page 105 and 106: 9 The pCOS Interface The pCOS (PDFl
Page 107 and 108: 9.2 Handling Basic PDF Data Types p
Page 109 and 110: 9.3 Composite Data Structures and I
Page 111 and 112: Path prefixes. Prefixes can be used
Page 113 and 114: Table 9.3 Universal pseudo objects
Page 115 and 116: Table 9.4 Pseudo objects for PDF ob
Page 117 and 118: Table 9.5 Pseudo objects for resour
Page 119 and 120: 9.6 Encrypted PDF Documents pCOS su
Page 121 and 122: 10 TET Library API Reference 10.1 O
Page 123: 10.2 General Functions Perl PHP res
Page 127 and 128: 10.3 Exception Handling C++ string
Page 129 and 130: 10.4 Document Functions C++ int ope
Page 131 and 132: Table 10.3 Document options for TET
Page 133 and 134: C++ C int open_document_callback(vo
Page 135 and 136: Table 10.5 Page options for TET_ope
Page 137 and 138: Table 10.6 Suboptions for the conte
Page 139 and 140: Table 10.7 Suboptions for the layou
Page 141 and 142: Table 10.9 Suboptions for the struc
Page 143 and 144: C++ const TET_char_info *get_char_i
Page 145 and 146: 10.7 Image Retrieval Functions C++
Page 147 and 148: C++ int write_image_file(int doc, i
Page 149 and 150: 10.8 TET Markup Language (TETML) Fu
Page 151 and 152: 10.9 Option Handling C++ void set_o
Page 153 and 154: 10.10 pCOS Functions The full pCOS
Page 155 and 156: If the object has type stream all f
Page 157 and 158: A TET Library Quick Reference The f
Page 159: B Revision History Revision history
Page 162 and 163: unsupported types 87 XMP metadata 8
show all

PDFlib Text Extraction Toolkit (TET) Manual

Create successful ePaper yourself

Delete template?

Save as template?