upCast is first and foremost a sophisticated RTF to XML converter. It relies on visual properties, explicit layout and structural information present in the RTF format, but also uses heuristic methods to produce very clean and usable XML output. Additionally, by way of its built-in post-processing modules, it can serve as an XML validator, XML workflow automation tool, CSS authoring tool (design-by-example), XHTML 1.0 (strict) export filter or XSLT processor.
Main features of upCast's RTF to XML conversion are:
converts, among others, most Word 95, Word 97, Word 2000, Word XP and Word 2003 documents into XML format
fully recreates the logical document structure, without any manual help
offers a powerful table translation including support for nesting tables as introduced in Word 2000
processes footnotes, hyperlinks, headers, footers and references
easily deals with any combination of nested lists, tables and any combination of layout elements that might occur in RTF documents
supports Unicode
includes style sheet support via Cascading Style Sheets (CSS)
offers numerous output options, including valid XHTML 1.0 (strict)
automatically converts most embedded WMF (vector) images into bitmaps (JPEG, PNG or PICT)
handles large files
has an intuitive, uncluttered, yet powerful graphical user interface
is platform-independent thanks to Java technology
features built-in Java API hooks for customization on low and intermediate conversion levels
includes an RTF correction module to process most RTF documents as exported by non-Microsoft products, even when the RTF contains errors
upCast was designed as a highly modular application. Some of the modules are visible to the user by way of import filters and export filters, but others work silently under the hood.
In doing its conversion job, upCast relies on this modern and flexible module-based architecture, which uses state-of-the-art software design patterns and fine-tuned structure recognition heuristics based on layout analysis techniques. The architecture was developed by infinity-loop in close collaboration with the computer science department of the Technische Universität München (Chair of Prof. J. Schlichter). In simplified terms, it consists of an import module, a kernel module (including an RTF correction plug-in and a heuristics module), a post-processing module, and an output module.
This carefully chosen modular approach enables upCast to get the best possible XML from your source documents without any user interaction. You can then process and edit the result with your XML tools of choice, without spending hours editing and tweaking the generated XML individually for every single source document.
Both import and export of documents are handled via so-called filters. An import filter reads a document and processes it so that it can be stored in a unified format within upCast. An export filter generates a specific output based on the internal unified document format. Figure 2.2, “Filters in upCast” shows this architecture:
It is important to understand that several export filters can be active at the same time and that they are processed in a defined order, namely from top to bottom in the order listed in the export filter list. This allows for powerful complex processing actions to take place within a single document conversion job.
For example, you might first convert a document from RTF to the upCast DTD, then post-process it within upCast using XSLT to produce some proprietary DTD, and finally call an external command-line tool, such as a database loader application, to read and store the file in your document database. In this case, you'd use three filters in sequence: XML (upCast DTD), XSLT Processor and Commandline.
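The ordering described above can be illustrated with a minimal Java sketch. This is not upCast's actual API: the names `FilterChainSketch`, `UnifiedDocument`, `ImportFilter`, `ExportFilter` and `run` are hypothetical and were invented for this illustration. The sketch also assumes that each export filter can see the output of the filter listed before it, which is what the three-filter example suggests but which is not stated explicitly here. The filter bodies are simple string operations standing in for real XML serialization, XSLT processing and an external tool call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch only: none of these types belong to upCast's real API.
class FilterChainSketch {

    /** Stand-in for the unified internal document format. */
    static final class UnifiedDocument {
        final String content;
        UnifiedDocument(String content) { this.content = content; }
    }

    /** Reads a source document and converts it into the unified format. */
    interface ImportFilter {
        UnifiedDocument read(String source);
    }

    /** Generates one specific output; also receives the previous filter's output (assumption). */
    interface ExportFilter {
        String name();
        String export(UnifiedDocument doc, String previousOutput);
    }

    /** Small helper to build an export filter from a label and a function. */
    static ExportFilter filter(String label, BiFunction<UnifiedDocument, String, String> body) {
        return new ExportFilter() {
            public String name() { return label; }
            public String export(UnifiedDocument doc, String previousOutput) {
                return body.apply(doc, previousOutput);
            }
        };
    }

    /** Runs the import filter once, then every export filter from top to bottom. */
    static List<String> run(ImportFilter in, List<ExportFilter> exportFilters, String source) {
        UnifiedDocument doc = in.read(source);
        List<String> log = new ArrayList<>();
        String previous = null;
        for (ExportFilter f : exportFilters) {   // list order is processing order
            previous = f.export(doc, previous);
            log.add(f.name() + ": " + previous);
        }
        return log;
    }

    /** Mirrors the three-filter example: XML (upCast DTD), XSLT Processor, Commandline. */
    static List<String> demo() {
        ImportFilter rtfImport = source -> new UnifiedDocument(source.trim());
        List<ExportFilter> chain = Arrays.asList(
            // Placeholder for serializing the unified document to XML.
            filter("XML (upCast DTD)", (doc, prev) -> "<doc>" + doc.content + "</doc>"),
            // Placeholder for an XSLT transformation to another vocabulary.
            filter("XSLT Processor", (doc, prev) -> prev.replace("doc", "article")),
            // Placeholder for handing the result to an external tool.
            filter("Commandline", (doc, prev) -> "stored " + prev.length() + " chars"));
        return run(rtfImport, chain, "  Hello  ");
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

Reordering the entries in `chain` changes the result, which is the point of a defined top-to-bottom order: in this sketch the XSLT step would have no XML to transform if it were listed first.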