Chapter 13. RTF Conversion Technical Information

13.1. General
13.2. Bookmark Handling
13.3. Calculated fields
13.3.1. Simple fields (flat fields)
13.3.2. Nested fields (hierarchical fields, expressions)
13.4. The hidden element
13.5. Overview: Supported RTF Symbols

13.1. General

upCast Technology converts Microsoft Word binary (*.doc) files[2] and RTF documents to XML. RTF is an acronym for Rich Text Format. It is a document exchange format defined by Microsoft, and many conventional document processing products use it as their exchange format. However, RTF, like all conventional document processing systems, is largely layout-driven. That is where upCast Technology comes into play.

upCast does not require time-consuming configuration tweaking to produce excellent conversion results. upCast's unique strength is to automatically introduce additional, hierarchical information into the resulting XML document from the rather "flat" RTF input document structure. As an example, consecutive list items are automatically bracketed as a list (also nested to any depth), and regions between headings of the same level get automatically bracketed as sections.
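
The automatic sectioning mentioned above can be pictured with a small sketch. The following Python is not upCast's implementation. It only illustrates, under the assumption that the input is a flat list of (level, text) pairs where level 0 stands for ordinary text and level n for a heading of level n, how regions between headings of the same level can be bracketed into nested sections. The function name build_sections and the dictionary layout are hypothetical.

```python
def build_sections(items):
    """Nest a flat list of (level, text) pairs into sections.

    level 0 means body text; level n >= 1 means a heading of that level.
    """
    root = {"level": 0, "title": None, "children": []}
    stack = [root]  # currently open sections, outermost first
    for level, text in items:
        if level == 0:
            # Body text belongs to the innermost open section.
            stack[-1]["children"].append(text)
            continue
        # A heading closes every open section of the same or a deeper level.
        while len(stack) > 1 and stack[-1]["level"] >= level:
            stack.pop()
        section = {"level": level, "title": text, "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root


# Hypothetical flat input, as a layout-driven document might present it.
flat = [
    (1, "Introduction"),
    (0, "Some text."),
    (2, "Background"),
    (0, "More text."),
    (1, "Usage"),
    (0, "Final text."),
]
tree = build_sections(flat)
```

Here the two level 1 headings become sibling sections, and "Background" ends up nested inside "Introduction". The same stack idea carries over to bracketing consecutive list items as nested lists.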

In doing this job, upCast takes into account not only the logical structure information that is explicitly contained in documents (e.g. heading 1, heading 2, …). It also looks for certain layout characteristics to recreate the full logical structure of your documents, which may not be expressed to its full extent when using layout-driven word processing applications.

upCast aims to reconstruct the logical structure of your documents as you, a human reader, readily perceive it. While this is an easy job for a human being, it is a hard one for a computer program. It took several years of development by a team of computer specialists to bring upCast to its current level.

upCast does not try to preserve the visual appearance of the input documents as closely as possible. However, by using Cascading Style Sheets (CSS), it generally preserves the layout well without sacrificing the logical structure.

13.2. Bookmark Handling

Note

This is only for your information.

upCast incorporates a special handling scheme for bookmarks. This was necessary to overcome some oddities in the RTF that applications such as MS Word require in order to create multiple references to a single footnote. In RTF, the bookmark is part of the reference to the footnote, not part of the footnote element itself. This, however, is most probably not what you would expect to see in your XML export. Therefore, upCast now delays bookmark generation in this case so that the bookmark shows up as the first element within the footnote, and a reference to that bookmark points into the footnote.

This all happens without your having to set any parameters or specially prepare the input documents. We just mention it because this is different from how previous versions of upCast handled this situation.

13.3. Calculated fields

Important

Please note that the instruction portion of fields is not standardized in the RTF specification, but proprietary to the specific application. Therefore, upCast might have trouble with fields that interfere with the subset of MS Word generated fields that upCast handles in a special way for your convenience. If you happen to run into such a conflict, please do notify us!

13.3.1. Simple fields (flat fields)

upCast actively supports fields. Fields are tagged, and upCast even extracts the original instruction that was used to calculate the result of the field. This is useful when you want to dynamically replace special fields (like DATE or TIME) at the time of final publishing, or when you need to handle the field instructions yourself because the field result generated automatically by Word does not meet your requirements.

Fields generally have the following form:

<gentext kind="FIELDTYPE" data="instructions">the Word calculated field result</gentext>

As a rule, do not discard the contents of gentext elements: they hold the field results, which are standard document text.

The gentext tag around the numbering information of headings has a kind attribute value of headingnumberstring.
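
As an illustration of replacing a field result at publishing time, the following Python sketch walks an exported document and rewrites the result of every DATE field. It assumes that the gentext element carries no XML namespace; the sample document, the data value "DATE", and the function name refresh_dates are made up for this example and are not taken from upCast output.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical export fragment containing one DATE field.
SAMPLE = (
    '<para>Printed on <gentext kind="DATE" data="DATE">'
    "2001-01-01</gentext>.</para>"
)


def refresh_dates(root, today):
    """Replace the stored result of every DATE field with the given date."""
    for field in root.iter("gentext"):
        if field.get("kind") == "DATE":
            # Keep the element and its data attribute; only the result
            # text that Word calculated is replaced.
            field.text = today.isoformat()
    return root


root = refresh_dates(ET.fromstring(SAMPLE), date(2024, 5, 17))
result = ET.tostring(root, encoding="unicode")
```

Fields of any other kind are left untouched, so their results remain in the document as ordinary text.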

13.3.2. Nested fields (hierarchical fields, expressions)

Nested fields, that is fields which contain fields in their instruction part, are a little more complicated. Since it is only the instruction part that is nested (the result is flat again), there is no easy way to express that in XML.

In upCast, nested fields get flattened out. This means that even a nested field is exported as a simple <gentext kind="" data="">content</gentext> combination. The difference lies in the data attribute value, which adopts a syntax similar to the one you'll see in MS Word's Online Help system. This means that the instruction of a nested field is enclosed in curly braces { and }.

To calculate the page number within a specific section, you would use a bookmark at the start of that section and subtract that from the overall page number at a specific location. The data attribute therefore might look something like data="= { PAGEREF } - { PAGE }". The result of such a construction, which is the content of that gentext tag, would still be a single number as #PCDATA.
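
If you need to get at the nested instructions yourself, the curly brace syntax can be taken apart with a simple depth counter. The sketch below is one possible approach, not part of upCast; the function name nested_instructions is hypothetical, and it assumes that braces occur only as field delimiters (braces inside quoted text are not treated specially).

```python
def nested_instructions(data):
    """Return the top-level { ... } groups found in a data attribute value."""
    groups = []
    depth = 0
    start = None
    for pos, char in enumerate(data):
        if char == "{":
            if depth == 0:
                start = pos + 1  # content begins after the opening brace
            depth += 1
        elif char == "}":
            if depth == 0:
                raise ValueError("unbalanced closing brace")
            depth -= 1
            if depth == 0:
                groups.append(data[start:pos].strip())
    if depth != 0:
        raise ValueError("unbalanced opening brace")
    return groups


fields = nested_instructions("= { PAGEREF } - { PAGE }")
# fields is now ["PAGEREF", "PAGE"]
```

A group that itself contains braces is returned as one string, so the function can be applied again to that string to descend one level further.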

13.4. The hidden element

The hidden element indicates content that the user (or the word processing application!) has marked to not be rendered visually. Whereas earlier versions of upCast simply dropped hidden content from the processed document, this content is now tagged using <hidden></hidden>.

upCast processes the hidden element as it sees it in the incoming RTF stream. This means that the resulting XML document largely depends on how the source word processing application uses and handles the hidden attribute. Unfortunately, some RTF symbols indicating logical, structural information (like index entries/targets) are rendered visually into the document if they are not explicitly marked as hidden. The same is true for Table Of Contents (RTF symbol: \tc) entries. It is important to note that, although these symbols and groups are hidden, their logical and structural meaning is still valid and applies.

A second point to note is that special structural elements and constructs (like footnote, annotation, …) do not inherit any surrounding hidden state. This has been true for visual properties like font size or font in previous versions of upCast and now also extends to the hidden state.

Now, how are you supposed to handle the hidden element in your own XML application?

First, you should never discard the contents of hidden elements completely without parsing. Otherwise, you risk losing important structural and logical information like footnotes, index entries, annotations, or any other kind of "sub-documents".

Second, ensure that you do not inherit the hidden state into the following document subgroups/elements:

  • footnote

  • index

  • annotation

  • certain high level gentext elements (e.g. TOC)
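
The two rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the elements are assumed to be named hidden, footnote, index and annotation without an XML namespace, the sample document is made up, and the gentext case is left out because it depends on the kind of field. The function name visible_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements that start over with a non-hidden state (see the list above).
RESET_HIDDEN = {"footnote", "index", "annotation"}


def visible_text(elem, hidden=False):
    """Collect text that is not hidden, without skipping any subtree."""
    if elem.tag == "hidden":
        hidden = True
    elif elem.tag in RESET_HIDDEN:
        hidden = False
    parts = []
    if elem.text and not hidden:
        parts.append(elem.text)
    for child in elem:
        # Always descend: a hidden region may contain footnotes etc.
        parts.append(visible_text(child, hidden))
        # The tail text after a child belongs to this element.
        if child.tail and not hidden:
            parts.append(child.tail)
    return "".join(parts)


SAMPLE = (
    "<para>Shown <hidden>secret "
    "<footnote>Note text.</footnote> more secret</hidden> end.</para>"
)
text = visible_text(ET.fromstring(SAMPLE))
```

In this sample the footnote text survives although it sits inside a hidden region, while the hidden text around it is left out. A real application would route the footnote to its own output rather than merge it into the running text.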

13.5. Overview: Supported RTF Symbols

  • section styles: not supported

  • paragraph styles: supported. Most CSS1 properties supported.

  • character styles: supported. Most CSS1 properties supported.

  • text: supported. Standard western character sets supported; two-byte encodings supported via add-on module; Unicode input supported. Output as Unicode in UTF-8; customizable Unicode output mapping table.

  • special characters: partially supported. Linebreaks/pagebreaks intelligently supported.

Table 13.1. Paragraphs and Text

  • headings: supported. Paragraph level attribute is used for automatic sectioning (also nested); numbering specially tagged (optional).

  • headers, footers: supported. Differentiation between left/right/first-page headers/footers.

  • footnotes: supported. Footnotes may contain other elements like lists and tables.

  • table of contents: partially supported. TOC fields may be marked up in export filters supporting it.

Table 13.2. Structuring elements

  • lists: supported. Retrieves numbering type; nested lists supported; handles both "old-style" Word95 and new Word97/2000 lists (using listtable and listoverridetable RTF keywords).

Table 13.3. Lists

  • tables: supported. Nested tables supported (Word 2000); nesting of lists and tables supported; table borders and cell background color supported; horizontally and vertically merged cells supported. Not supported: non-rectangular shaped tables.

  • visually created "tables" (using tabulators): not supported

Table 13.4. Tables

  • bookmarks: supported

  • references: supported. Dedicated support for REF, NOTEREF, PAGEREF field types.

  • hyperlinks: supported. Dedicated support for HYPERLINK field type.

  • index entries: supported. Also main and up to 8 levels of sub-entries (using \xe and : delimiter).

Table 13.5. References and Links

  • referenced images: supported. Dedicated support for INCLUDEPICTURE fields.

  • embedded images: mostly supported. Embedded binary images are written as files to disk and are referenced appropriately from the XML/XHTML output; supported formats: PNG, BMP, JPEG, WMF, PICT; most WMF images can also be automatically converted to a bitmap format (optional). Image captions not supported (only as normal flowing text).

  • mathematical formulae: not supported. The formula editor writes a WMF alternate representation, which will be handled like embedded images. Manual formula creation using fields (like EQ, …) is not directly supported, but field instructions are accessible.

  • forms: supported

  • drawing objects, shapes: not supported

  • OLE objects, other embedded objects: not supported. If these objects generate an image as an alternate representation, this will be handled like embedded images.

  • textbox: partially supported. Only as object, not as positioned paragraph.

  • fields: mostly supported. Nested fields are flattened, field instructions and type (EQ, DATE, =, …) can be retrieved; results are standard document text.

  • generated content (e.g. index, list of figures etc.): not directly supported

Table 13.6. Embedded objects

  • standard properties: supported

  • user defined document properties: supported

Table 13.7. Document properties



[2] Requirements for converting Microsoft Word binary files: upCast installed using the provided upCast Installer for Windows and running on Microsoft Windows 95/98/2000/NT/XP with an installed version of Microsoft Word 97 or later available.