RTF files specify the encoding and the properties of each font they use. A rendering application uses the properties to determine the best-matching font on a platform where the exact font specified is not available. It also uses the font's encoding to interpret the characters found in the RTF file correctly.
However, this mechanism does not support custom fonts that use a special mapping of their constituent characters to Unicode codepoints. This is what the stdfonts.config file is for. upCast comes with a default file embedded in the application JAR. You may extend or override it by providing a custom stdfonts.config file, located in the Encodings directory inside the application's support directory. There, you can specify standard font properties based on the font name, in particular any custom encoding or codepage the font uses.
stdfonts.config can be found at the following location in the package hierarchy in the upcast.jar JAR file:
/de/infinityloop/resources/config/stdfonts.config
You can modify this file to suit your requirements, or supply an override file in the location mentioned above. An informal description of the file format and the necessary properties follows, and after it the search algorithm upCast uses to find the properties for a given font in a CSS-like rule.
The following special properties are used in the stdfonts.config file:
-ilx-rtf-font-family
Determines the general RTF font family a font belongs to based on its design. An RTF rendering application will use this information to find a font with similar appearance when an exact match cannot be found.
Supported values: roman, swiss, symbol, modern, script, decor, tech, bidi
-ilx-codepage
This indicates the Windows codepage the font uses for its encoding.
Supported values: codepageAsInteger, -1, 10000, 32001, 32002, 32004
The special values have the following meaning:
-1: Uses the default encoding of the platform. This is the best choice for normal fonts.
10000: Identifies the Mac Roman encoding.
32001: Identifies the standard encoding of the Symbol font.
32002: Identifies the encoding of the Wingdings font.
32004: Identifies the encoding of the Zapf Dingbats font.
The file structure is line based. Each line identifies a set of font names with a set of properties:
fontlist ::= propertyset
fontlist ::= font ( ', ' font)*
font ::= fontname | '"' fontname '"'
propertyset ::= '\-ilx-rtf-font-family: ' ffval '; \-ilx-codepage: ' [0-9]+ ';'
ffval ::= 'roman' | 'swiss' | 'symbol' | 'modern' | 'script' | 'decor' | 'tech' | 'bidi'
fontname ::= name of font
Note that you must use CSS-style escapes (or numerical character entities of the form &#...;) to specify font names that contain characters outside the ASCII range. Examples can be seen in the default stdfonts.config file packaged with upCast.
All lines starting with // are comment lines. Empty lines are ignored as well.
To avoid having to explicitly define every font in the stdfonts.config file which might ever occur in a stylesheet, upCast employs a multi-stage search algorithm for a matching property definition entry as follows:
First, a user-supplied stdfonts.config, if one exists, is prepended to the default one supplied by upCast. Within this single concatenated file, the following search algorithm is employed:
A search for the exact name (considering case) is performed. The first matching entry is used if it exists.
A search for the exact name, but ignoring case, is performed. The first matching entry is used if it exists.
A search for a font name is performed that matches the start of the actual name. So if the characteristics for "Univers Bold" are requested, and there is an entry "Univers" in stdfonts.config, then its properties are used. Case is ignored.
A search for a font name is performed that is contained in the actual name. So if the characteristics for "L Univers 44" are requested, and there is an entry "Univers" in stdfonts.config, then its properties are used because the string "Univers" is contained in the actual font name. Case is ignored.
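The four search stages above can be sketched as follows. This is only an illustration, not upCast's actual implementation: the function name `find_font_properties`, the representation of entries as (name, properties) pairs, and the property dictionaries are all hypothetical. It also assumes that the first matching entry wins in stages three and four, which the description states explicitly only for the first two stages.

```python
def find_font_properties(font_name, entries):
    """Return the properties of the first entry matching font_name, or None.

    entries is a list of (entry_name, properties) pairs in file order,
    with the user-supplied entries placed before the default ones.
    """
    wanted = font_name.lower()
    stages = (
        lambda entry: entry == font_name,                # 1. exact name, case-sensitive
        lambda entry: entry.lower() == wanted,           # 2. exact name, ignoring case
        lambda entry: wanted.startswith(entry.lower()),  # 3. entry matches the start of the name
        lambda entry: entry.lower() in wanted,           # 4. entry is contained in the name
    )
    # Each stage scans the whole concatenated list before the next stage runs.
    for matches in stages:
        for entry_name, properties in entries:
            if matches(entry_name):
                return properties
    return None


# Hypothetical entries: the user file is prepended to the default file.
user_entries = [("Univers", {"family": "swiss", "codepage": 42001})]
default_entries = [
    ("Univers", {"family": "swiss", "codepage": -1}),
    ("Symbol", {"family": "symbol", "codepage": 32001}),
]
entries = user_entries + default_entries

print(find_font_properties("Univers Bold", entries))  # matched in stage 3
print(find_font_properties("L Univers 44", entries))  # matched in stage 4
```

Because the user-supplied entries come first in the list, a user entry for "Univers" shadows the default entry of the same name at every stage.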
upCast comes complete with virtually all default encodings you can use in RTF or Word, including many two-byte ones. This means that you normally do not need to provide a custom encoding.
The default encodings are hard-coded with optimizations done for each specific encoding to provide efficient access, since the mapping functions are called for each character that passes through upCast. These default encodings are therefore not directly user-accessible. However, there are occasions where you will need a custom encoding, especially when you are using custom fonts.
upCast provides a custom encoding loader and handler which lets you specify your own mappings from character codepoint in the font to Unicode by means of a simple text file. Both one-byte and two-byte encodings can be specified in this way.
To create a custom encoding, you need to create an ASCII text file with the extension .encoding which specifies both the mapping of the individual codepoints to Unicode and also states which codepage it implements. You can also give it a name for easily spotting it in the UI portions of the application. upCast looks for custom encoding files in the <application-support>/Encodings/ directory at startup. All encodings it finds are added to the internal set of default encodings. By specifying a codepage in a custom encoding that has a default equivalent, you may override any of the factory-supplied encodings.
Since the mapping is built on the fly, encoding-specific optimizations cannot be performed, and the use of custom encodings may slow down processing slightly.
A custom encoding per se is not tied to anything but the codepage it implements. To tie a codepage to a specific font, you need to extend or override the stdfonts.config file. In this mapping file, you simply list the font's name and associate it with a codepage using the keywords as described.
It is recommended to use codepage values greater than 40000 for custom encodings, as upCast will not use these codepages internally. Which you use for custom encodings is up to you. upCast reserves the range from 32000 to 35000 for internal use, so you should not use these. Also note that when you override a default encoding, every font that is specified to use that encoding will use the custom one.
File names can be arbitrary, but must have the extension .encoding. The files should be placed into <application-support>/Encodings/ for upCast to scan and load them automatically at startup.
The file structure is simple: one mapping entry per line, and all lines starting with #, // or ; are treated as comments. To create a two-byte encoding, separate the two bytes by a comma.
A mapping entry has the form (notation similar to BNF):
mapping ::= <srcbyte> [',' <srcbyte>] '=' <unicodechar>
with:
srcbyte ::= hexNumber | decimalNumber
unicodechar ::= hexNumber | decimalNumber
hexNumber ::= ('0x' | '0X' | '$')[0-9A-Fa-f]+
decimalNumber ::= [0-9]+
The following rather silly example maps what is a space in codepage 1252 fonts to the at-sign:
@codepage 42001
@encodingname Silly Encoding
$20=$40
Options. Two special options are supported:
@codepage
This specifies the codepage this encoding represents.
You can specify either an existing encoding to override its definition, or create custom codepages for specific fonts, in which case you should choose a codepage number higher than 40000.
@encodingname
This is a descriptive name for the encoding so you can easily spot it in upCast's UI.
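To make the file format concrete, here is a minimal sketch of a reader for .encoding files that follows the grammar given above. It is not upCast's loader: the function names `parse_number` and `parse_encoding`, the tuple keys used for source bytes, and the pass-through of unmapped bytes in the usage lines are assumptions made for illustration.

```python
def parse_number(token):
    """Parse a hexNumber ('0x', '0X' or '$' prefix) or a decimalNumber."""
    token = token.strip()
    if token.startswith("$"):
        return int(token[1:], 16)
    if token[:2].lower() == "0x":
        return int(token[2:], 16)
    return int(token, 10)


def parse_encoding(text):
    """Return (codepage, name, mapping) for the contents of an .encoding file.

    mapping maps a tuple of one or two source bytes to a Unicode character.
    """
    codepage, name, mapping = None, None, {}
    for raw in text.splitlines():
        line = raw.strip()
        # Empty lines and lines starting with #, // or ; carry no mapping.
        if not line or line.startswith(("#", "//", ";")):
            continue
        if line.startswith("@codepage"):
            codepage = int(line.split(None, 1)[1])
        elif line.startswith("@encodingname"):
            name = line.split(None, 1)[1]
        else:
            source, target = line.split("=", 1)
            # A comma separates the two bytes of a two-byte encoding.
            key = tuple(parse_number(part) for part in source.split(","))
            mapping[key] = chr(parse_number(target))
    return codepage, name, mapping


SILLY = """\
@codepage 42001
@encodingname Silly Encoding
$20=$40
"""

codepage, name, mapping = parse_encoding(SILLY)
# Assumption for this demo only: bytes without a mapping pass through unchanged.
decoded = "".join(mapping.get((byte,), chr(byte)) for byte in b"a b")
print(codepage, name, decoded)
```

Keying the mapping on byte tuples lets one-byte and two-byte entries share a single table; a real loader would probably choose a faster structure.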
upCast has a built-in mechanism for converting any Unicode character to any other Unicode character or even entity notation on export. This is done by means of a Unicode translation map, which is a plain ASCII text file. You can specify a Unicode translation map in various export filters as the final stage a character needs to pass before actually getting written to the output file or stream.
The file structure is simple: one conversion entry per line, and all lines starting with #, // or ; are treated as comments.
A conversion entry has the form (notation similar to BNF):
conversion ::= unicodeNumber '=' replacement
with:
unicodeNumber ::= hexNumber | decimalNumber
replacement ::= string | hexNumber | decimalNumber
hexNumber ::= ('0x' | '0X' | '$')[0-9A-Fa-f]+
decimalNumber ::= [0-9]+
string ::= '"' (asciiChar)* '"'
asciiChar ::= a one-byte character in the range from 32 to 127, excluding '"'
The following rather silly example describes the effect of each entry in comments:
// First, we simply convert all spaces to a dot:
32="."
// Then, we convert all capital letter A's to a
// full, empty tag: <letter_a />
65="<letter_a />"
// And then, we discard all small
// letters 'u' completely:
0x75=""
Options. There is one special option to specify default behaviour:
@option-default-mapping
This specifies how a certain range of codes should be preset. This saves you typing if you need some range of characters to be written not in UTF-8 encoding, but e.g. as character references.
You can specify this option anywhere in a Unicode translation map; it takes effect at that specific location. You may use this to initialize a certain code range and then overwrite selected codepoints by specifying additional, normal translation rules as described above, which will then override the initialization performed by this option.
An integer value specifying the start codepoint of the code range.
An integer value specifying the end codepoint of the range.
A string constant identifying the algorithm to use for filling the specified code range.
decimalNumericalEntities: The code range is filled with character entities in decimal notation, e.g. &#1234;.
@option-default-mapping 128 32767 decimalNumericalEntities
This line fills the Unicode translation map for all codepoints from 128 to 32767 (inclusive) with numerical character entities.
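The interplay of normal conversion entries and the default-mapping option can be sketched like this. The code is an illustration under stated assumptions, not upCast's implementation: the names `parse_translation_map` and `translate` are hypothetical, and only the decimalNumericalEntities algorithm is handled.

```python
def parse_number(token):
    """Parse a hexNumber ('0x', '0X' or '$' prefix) or a decimalNumber."""
    token = token.strip()
    if token.startswith("$"):
        return int(token[1:], 16)
    if token[:2].lower() == "0x":
        return int(token[2:], 16)
    return int(token, 10)


def parse_translation_map(text):
    """Build a dict from codepoint to replacement string, in file order."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "//", ";")):
            continue
        if line.startswith("@option-default-mapping"):
            _, start, end, algorithm = line.split()
            if algorithm != "decimalNumericalEntities":
                raise ValueError("algorithm not covered by this sketch: " + algorithm)
            # Preset the whole range; later entries may overwrite single codes.
            for code in range(parse_number(start), parse_number(end) + 1):
                table[code] = "&#%d;" % code
            continue
        source, replacement = line.split("=", 1)
        replacement = replacement.strip()
        if len(replacement) >= 2 and replacement[0] == '"' and replacement[-1] == '"':
            value = replacement[1:-1]                # quoted string, may be empty
        else:
            value = chr(parse_number(replacement))   # a single codepoint
        table[parse_number(source)] = value
    return table


def translate(text, table):
    """Pass every character through the map; unmapped characters are kept."""
    return "".join(table.get(ord(ch), ch) for ch in text)


SILLY_MAP = '''\
@option-default-mapping 128 32767 decimalNumericalEntities
// spaces become dots, capital A becomes a tag, small u is dropped
32="."
65="<letter_a />"
0x75=""
'''

table = parse_translation_map(SILLY_MAP)
result = translate("A but \u00e9", table)
print(result)
```

Since entries are applied in file order, a rule placed after the option line overrides the preset value for that codepoint, which mirrors the behaviour described for the option.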
This table associates arbitrary CSS <length> properties with a pair of unit and precision information. This is useful when the created style information in either the CSS stylesheet or the style overrides in the XML output should be human readable, in which case you would provide a table with a unit of measurement that people are most familiar with (inches or centimeters, e.g.), and a reasonable precision like 2 decimal digits.
The default table uses cm as default unit, with a precision of 1 or 2 decimal digits, and pt for special properties like font-size.
The file structure is simple: one unit association entry per line, and all lines starting with // are treated as comments.
Options. There are two special options to specify default behaviour:
@option-default-length-unit
This specifies the default unit to use for all <length> properties not specified explicitly in the unit table.
@option-default-length-precision
This specifies the default precision to be used for all <length> properties not specified explicitly in the unit table.
These options must be specified before any unit association for a specific CSS property.
Here's an example of a CSS property unit table similar to the one used as the default table in upCast:
@option-default-length-unit:mm
@option-default-length-precision:2
font-size:pt,1
border-top-width:pt,1
border-right-width:pt,1
border-bottom-width:pt,1
border-left-width:pt,1
-ilx-border-vertical-inside-width:pt,1
-ilx-border-horizontal-inside-width:pt,1
text-indent:mm,1
width:mm,1
height:mm,1
margin-left:mm,1
margin-right:mm,1
margin-top:mm,1
margin-bottom:mm,1
padding-left:mm,1
padding-right:mm,1
padding-top:mm,1
padding-bottom:mm,1
line-height:pt,1
border-spacing:pt,2
letter-spacing:pt,2
-ilx-list-marker-offset:tw,0
-ilx-header-offset:mm,1
-ilx-footer-offset:mm,1
size:mm,1
An association entry has the form (notation similar to BNF):
association ::= propertyName ':' unit ',' precision
with:
propertyName ::= CSS-property-name-identifier
unit ::= 'm' | 'cm' | 'mm' | 'pt' | 'in' | 'pc' | 'px' | 'emu' | 'tw' | 'hp'
precision ::= [0-9]+
tw is a twip and the basic length unit used in RTF; 1tw = 0.05pt
emu is a unit used in RTF shape objects; 1cm=360,000emu
hp is a half-point and the unit used in RTF for specifying font sizes; 1hp = 0.5pt
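As an illustration of how such a table might be applied, the following sketch parses a unit table and formats a length given in points. Everything here is hypothetical rather than upCast's code: the names `parse_unit_table` and `format_length`, the output style such as "12.0pt", the fallback defaults of cm and 2 digits when no option line is present, and the px factor, which assumes the CSS convention of 96 pixels per inch.

```python
# Size of one unit expressed in points (1in = 72pt = 2.54cm).
POINTS_PER_UNIT = {
    "pt": 1.0,
    "in": 72.0,
    "pc": 12.0,
    "px": 0.75,                      # assumption: 96 px per inch
    "m": 7200.0 / 2.54,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
    "tw": 0.05,                      # twip: 1tw = 0.05pt
    "hp": 0.5,                       # half-point: 1hp = 0.5pt
    "emu": 72.0 / 2.54 / 360000.0,   # 1cm = 360,000emu
}


def parse_unit_table(text):
    """Return (default_unit, default_precision, table) for a unit table."""
    default_unit, default_precision, table = "cm", 2, {}  # assumed fallbacks
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        name, value = line.split(":", 1)
        if name == "@option-default-length-unit":
            default_unit = value.strip()
        elif name == "@option-default-length-precision":
            default_precision = int(value)
        else:
            unit, precision = value.split(",")
            table[name.strip()] = (unit.strip(), int(precision))
    return default_unit, default_precision, table


def format_length(prop, points, parsed):
    """Format a length given in points using the unit and precision for prop."""
    default_unit, default_precision, table = parsed
    unit, precision = table.get(prop, (default_unit, default_precision))
    return "%.*f%s" % (precision, points / POINTS_PER_UNIT[unit], unit)


UNIT_TABLE = """\
@option-default-length-unit:mm
@option-default-length-precision:2
font-size:pt,1
margin-left:mm,1
-ilx-list-marker-offset:tw,0
"""

parsed = parse_unit_table(UNIT_TABLE)
print(format_length("font-size", 12, parsed))
print(format_length("margin-left", 72, parsed))
print(format_length("text-indent", 36, parsed))  # not listed, uses the defaults
```

Storing every unit as a factor relative to points keeps the conversion to a single division, whichever unit the table selects.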
upCast supports the use of a Catalog file. A Catalog file is basically a mapping definition between PUBLIC DTD identifiers and the location of a physical copy of that specific DTD (or, more generally, entity). The application supports the Catalog file format as defined in
http://www.oasis-open.org/specs/tr9401.html
Catalog files need to reside in the Application Support folder and have the name catalog for upCast to be able to find and use them.
upCast will ask you at first launch whether you want to install a default catalog file (if there isn't already one installed). It is highly recommended to have this default installed, as it places a copy of the upCast DTD locally on your machine and lets you validate the XML (upCast DTD) filter output without requiring an active connection to the internet.
During this initial procedure, you also get the choice to specify a different default XML Catalog for upCast to use. Choose the respective option in the dialog presented and pick the catalog using a standard file chooser.
You can also change the XML Catalog used by selecting
> at any time while upCast is running.
You may also wish to add an entry to the catalog for the XHTML 1.0 Strict DTD and place a copy on your local machine, so that you can validate XHTML files as well without requiring an internet connection. Such an entry might look like this:
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "file:///localpath/to/xhtml1-strict.dtd"
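To show what such an entry accomplishes, here is a small sketch that looks up a public identifier in catalog text. It is deliberately incomplete and not how upCast reads catalogs: it recognizes only PUBLIC entries written with double-quoted literals, and the names `load_public_map` and `resolve_public` are hypothetical. Keeping the first entry when a public identifier occurs twice is an assumption of this sketch.

```python
import re

# Matches lines of the form: PUBLIC "public identifier" "system identifier"
PUBLIC_ENTRY = re.compile(r'^\s*PUBLIC\s+"([^"]*)"\s+"([^"]*)"', re.MULTILINE)


def load_public_map(catalog_text):
    """Collect PUBLIC entries into a dict from public id to location."""
    public_map = {}
    for public_id, location in PUBLIC_ENTRY.findall(catalog_text):
        # Assumption: an earlier entry takes precedence over a later duplicate.
        public_map.setdefault(public_id, location)
    return public_map


def resolve_public(public_id, public_map):
    """Return the local location for public_id, or None if it is not listed."""
    return public_map.get(public_id)


CATALOG = (
    'PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"file:///localpath/to/xhtml1-strict.dtd"\n'
)

public_map = load_public_map(CATALOG)
print(resolve_public("-//W3C//DTD XHTML 1.0 Strict//EN", public_map))
```

A validator that consults this map before fetching a DTD can work offline for every identifier the catalog lists.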