upCast Technology converts Microsoft Word binary (*.doc) files[2] and RTF documents to XML. RTF stands for Rich Text Format, a document exchange format defined by Microsoft and used as an exchange format by many conventional document processing products. However, RTF, like all conventional document processing systems, is largely layout-driven, and that is where upCast Technology comes into play.
upCast does not require time-consuming configuration tweaking to produce excellent conversion results. upCast's unique strength is to automatically introduce additional, hierarchical information into the resulting XML document from the rather "flat" RTF input document structure. As an example, consecutive list items are automatically bracketed as a list (also nested to any depth), and regions between headings of the same level get automatically bracketed as sections.
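The sectioning idea can be illustrated with a minimal sketch. This is not upCast's actual algorithm, only one common way to turn a flat sequence of headings and paragraphs into nested sections; the `Section` class, the `nest_sections` function, and the input tuples are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    level: int
    title: str
    body: list = field(default_factory=list)      # paragraphs directly in this section
    children: list = field(default_factory=list)  # nested subsections


def nest_sections(flat):
    """Turn flat ("heading", level, text) / ("para", text) items into a tree."""
    root = Section(level=0, title="(document)")
    stack = [root]  # currently open sections, outermost first
    for item in flat:
        if item[0] == "heading":
            _, level, text = item
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(level=level, title=text)
            stack[-1].children.append(section)
            stack.append(section)
        else:
            # A paragraph belongs to the innermost open section.
            stack[-1].body.append(item[1])
    return root


# Hypothetical flat input, as a layout-driven document might present it.
flat_input = [
    ("heading", 1, "Introduction"),
    ("para", "Some text."),
    ("heading", 2, "Background"),
    ("para", "More text."),
    ("heading", 1, "Usage"),
]
tree = nest_sections(flat_input)
```

Here the region between the two level-1 headings becomes one section, and the level-2 heading opens a subsection inside it.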
In doing so, upCast takes into account not only the logical structure that is explicitly contained in documents (e.g. heading 1, heading 2, …). It also looks for certain layout characteristics to recreate the full logical structure of your documents, which layout-driven word processing applications may not express explicitly.
upCast aims to reconstruct the logical structure of your documents as you, a human reader, easily perceive it. While this is an easy job for a human being, it is quite a hard one for a computer program. It took a team of computer specialists several years of development to bring upCast to its current level.
upCast does not try to preserve the visual appearance of the input documents as closely as possible. However, by using Cascading Style Sheets (CSS), it generally does a pretty decent job at preserving the layout without sacrificing the logical structure.
This is only for your information.
upCast incorporates a special handling scheme for bookmarks. This was necessary to overcome some oddities in the RTF that applications such as MS Word require in order to create multiple references to a single footnote. In RTF, the bookmark is part of the reference to the footnote, not part of the footnote element itself. This, however, is most probably not what you'd expect to see in your XML export. Therefore, upCast now delays bookmark generation in this case so that the bookmark shows up as the first element within the footnote, and a reference to that bookmark points into the footnote.
This all happens without your having to set any parameters or specially prepare the input documents. We just mention it because this is different from how previous versions of upCast handled this situation.
Please note that the instruction portion of fields is not standardized in the RTF specification, but proprietary to the specific application. Therefore, upCast might have trouble with fields that interfere with the subset of MS Word generated fields that upCast handles in a special way for your convenience. If you happen to run into such a conflict, please do notify us!
upCast actively supports fields. Fields are tagged, and upCast even extracts the original instruction that was used to calculate the result of the field. This is useful when you want to dynamically replace special fields (like DATE or TIME) at the time of final publishing, or when you need to handle the field instructions yourself because the field result generated automatically by Word does not meet your requirements.
Fields generally have the following form:
<gentext kind="FIELDTYPE" data="instructions">the Word calculated field result</gentext>
Make sure you do not generally discard the contents of gentext elements.
The gentext tag around the numbering information of headings has a kind attribute value of headingnumberstring.
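As a minimal sketch of replacing field results at publishing time, the following uses Python's standard `xml.etree.ElementTree` module rather than any upCast API. The `gentext` element and its `kind` attribute follow the form shown above; the sample document, the `para` element, the `refresh_fields` function, and the replacement value are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical export fragment; real output may use different surrounding elements.
sample = (
    '<para>Printed on '
    '<gentext kind="DATE" data="DATE">2003-01-15</gentext>'
    '.</para>'
)


def refresh_fields(root, replacements):
    """Replace the text of gentext elements whose kind appears in `replacements`."""
    for gentext in root.iter("gentext"):
        kind = gentext.get("kind")
        if kind in replacements:
            gentext.text = replacements[kind]
    return root


root = ET.fromstring(sample)
refresh_fields(root, {"DATE": "2024-05-01"})
text = "".join(root.itertext())
```

Fields whose kind is not listed in `replacements` keep the result Word calculated, so their contents are not discarded.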
Nested fields, that is fields which contain fields in their instruction part, are a little more complicated. Since it is only the instruction part that is nested (the result is flat again), there is no easy way to express that in XML.
In upCast, nested fields get flattened out. This means that even a nested field is exported as a simple <gentext kind="…" data="…">content</gentext> combination. The difference lies in the data attribute value, which adopts a syntax similar to the one you'll see in MS Word's Online Help system. This means that the instruction of a nested field is enclosed in curly braces { and }.
To calculate the page number within a specific section, you would use a bookmark at the start of that section and subtract that from the overall page number at a specific location. The data attribute therefore might look like something along the lines of data="= { PAGEREF … } - { PAGE }". The result of such a construction, which is the contents of that gentext tag, would still be a single number as #PCDATA.
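If you need to process such a flattened instruction yourself, the curly-brace syntax can be parsed back into a nested structure. The following is a sketch under the assumption that braces only ever delimit nested field instructions; the `parse_instruction` function and the bookmark name `sec_start` are hypothetical and not part of upCast.

```python
def parse_instruction(data):
    """Parse a flattened field instruction into nested lists.

    Plain text becomes stripped strings; each { ... } group becomes a nested list.
    """
    pos = 0

    def parse_group(closing):
        nonlocal pos
        parts, text = [], []
        while pos < len(data):
            ch = data[pos]
            pos += 1
            if ch == "{":
                # Flush text collected so far, then parse the nested field.
                if "".join(text).strip():
                    parts.append("".join(text).strip())
                text = []
                parts.append(parse_group(closing=True))
            elif ch == "}":
                if not closing:
                    raise ValueError("unbalanced '}'")
                break
            else:
                text.append(ch)
        else:
            # Reached the end of the input without seeing a closing brace.
            if closing:
                raise ValueError("missing '}'")
        if "".join(text).strip():
            parts.append("".join(text).strip())
        return parts

    return parse_group(closing=False)


# "sec_start" is a made-up bookmark name used only for this example.
parsed = parse_instruction("= { PAGEREF sec_start } - { PAGE }")
```

The outer list holds the operators of the top-level instruction, and each inner list holds the instruction of one nested field.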
The hidden element indicates content that the user (or the word processing application!) has marked to not be rendered visually. Whereas earlier versions of upCast simply dropped hidden content from the processed document, this content is now tagged using <hidden>…</hidden>.
upCast processes the hidden element as it sees it in the incoming RTF stream. This means that the resulting XML document is largely dependent on the usage and handling of the hidden attribute in the source Word processing application. Unfortunately, some RTF symbols indicating logical, structural information (like index entries/targets) are rendered visually into the document if they are not explicitly marked as hidden. The same is true for Table Of Contents (RTF symbol: \tc) entries. It is important to note that, though these symbols and groups are hidden, their logical and structuring meaning is still valid and applies.
A second point to note is that special structural elements and constructs (like footnote, annotation, … ) do not inherit any surrounding hidden state. This has been true for visual properties like fontsize or font in previous versions of upCast and now also extends to the hidden state.
Now, how are you supposed to handle the hidden element in your own XML application?
First, you should never discard the contents of hidden elements completely without parsing. Otherwise, you risk losing important structural and logical information like footnotes, index entries, annotations, or any other kind of "sub-documents".
Second, ensure that you do not inherit the hidden state into the following document subgroups/elements:
footnote
index
annotation
certain high level gentext elements (e.g. TOC)
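Both rules can be sketched with Python's standard `xml.etree.ElementTree` module. The element names `hidden`, `footnote`, `index`, and `annotation` are taken from the text above, but their exact spelling in your export, the `para` element, and the `kept_text` function are assumptions; the gentext case is left out because it depends on which kind values you want to keep.

```python
import xml.etree.ElementTree as ET

# Elements assumed NOT to inherit a surrounding hidden state.
KEEP_INSIDE_HIDDEN = {"footnote", "index", "annotation"}


def kept_text(element, hidden=False):
    """Collect text, skipping hidden text but still descending into hidden content."""
    if element.tag == "hidden":
        hidden = True
    elif element.tag in KEEP_INSIDE_HIDDEN:
        hidden = False  # sub-documents start over with a non-hidden state
    parts = []
    if element.text and not hidden:
        parts.append(element.text)
    for child in element:
        parts.extend(kept_text(child, hidden))
        # Tail text follows the child but belongs to this element's state.
        if child.tail and not hidden:
            parts.append(child.tail)
    return parts


# Hypothetical fragment: a footnote nested inside hidden content.
sample = (
    "<para>Visible "
    "<hidden>secret <footnote>Kept note</footnote> more secret</hidden>"
    "text.</para>"
)
parts = kept_text(ET.fromstring(sample))
```

The hidden text itself is skipped, while the footnote nested inside it survives because the traversal parses the hidden element instead of discarding it wholesale.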
section styles | not supported |
paragraph styles | supported | most CSS1 properties supported |
character styles | supported | most CSS1 properties supported |
text | supported | standard western character sets supported; two-byte encodings supported via add-on module; Unicode input supported; output as Unicode in UTF-8; customizable Unicode output mapping table |
special characters | partially supported | linebreaks/pagebreaks intelligently supported |
Table 13.1. Paragraphs and Text
headings | supported | paragraph level attribute is used for automatic sectioning (also nested); numbering specially tagged (optional) |
headers, footers | supported | differentiation between left/right/first-page headers/footers |
footnotes | supported | footnotes may contain other elements like lists and tables |
table of contents | partially supported | TOC fields may be marked up in export filters supporting it |
Table 13.2. Structuring elements
lists | supported | retrieves numbering type; nested lists supported; handles both "old-style" Word95 and new Word97/2000 lists (using listtable and listoverridetable RTF keywords) |
Table 13.3. Lists
tables | supported | nested tables supported (Word 2000); nesting of lists and tables supported; table borders and cell background color supported; horizontally and vertically merged cells supported; not supported: non-rectangular shaped tables |
visually created "tables" (using tabulators) | not supported |
Table 13.4. Tables
bookmarks | supported |
references | supported | dedicated support for REF, NOTEREF, PAGEREF field types |
hyperlinks | supported | dedicated support for HYPERLINK field type |
index entries | supported | also main and up to 8 levels of sub-entries (using \xe and : delimiter) |
Table 13.5. References and Links
referenced images | supported | dedicated support for INCLUDEPICTURE fields |
embedded images | mostly supported | embedded binary images are written as files to disk and are referenced appropriately from the XML/XHTML output; supported formats: PNG, BMP, JPEG, WMF, PICT; most WMF images can also be automatically converted to a bitmap format (optional); image captions not supported (only as normal flowing text) |
mathematical formulae | not supported | the formula editor writes a WMF alternate representation, which will be handled like embedded images; manual formula creation using fields (like EQ, …) is not directly supported, but field instructions are accessible |
forms | supported |
drawing objects, shapes | not supported |
OLE objects, other embedded objects | not supported | if these objects generate an image as an alternate representation, this will be handled like embedded images |
textbox | partially supported | only as object, not as positioned paragraph |
fields | mostly supported | nested fields are flattened; field instructions and type (EQ, DATE, =, …) can be retrieved; results are standard document text |
generated content (e.g. index, list of figures etc.) | not directly supported |
Table 13.6. Embedded objects
[2] Requirements for converting Microsoft Word binary files: upCast must be installed using the provided upCast Installer for Windows and run on Microsoft Windows 95/98/2000/NT/XP with Microsoft Word 97 or later installed.