Chapter 3: The Specification and Validation Framework

3 The Specification and Validation Framework

In this section, we introduce the functions and components of the specification and validation framework and remark on its advantages over other technologies. We then provide a detailed description of the Structured Document Definition Language.

The framework we are proposing presents several advantages over current technologies for document specification and validation. First, other than the abstraction layer contained in the SDD-Type structure (see 3.1.1), the framework is generic. This allows documents to be validated in their native format and permits the extension of the framework to new classes of documents through the creation of new SDD-Types. Second, the SDDL allows the content of a document to be constrained through datatypes or regular expressions (see 3.2.4). Third, SDDL namespaces (see 3.2.3) allow the same tag to be treated as a completely different tag depending on its context. This feature gives users stricter control over a document's structure and content, particularly for documents that have been encoded with a set of generic tags that do not fully meet the more focused needs of a particular project. Finally, from the user's point of view, the framework reduces the learning curve: all classes of documents can be specified using the same specification language, and errors during validation are presented using a unified set of error messages.

3.1 Main Components

The framework we are proposing implements three distinct functions: i) SDD-Type creation; ii) SDD creation; and iii) document validation. Figure 3.1 illustrates the detailed functions implemented by the proposed framework. Two phases can be outlined: creation, which covers functions i) and ii), and deployment, which covers function iii). The three functions are related sequentially, from one through three, with the output of each function informing the functions that follow. Each function requires a distinct set of skills from the user; however, a user is not required to complete each function every time document validation is performed. For example, one or more users in function three (document validation) will rely on an SDD created in function two to validate many different document instances. The SDD may have been created by a different user, who in turn relies on an SDD-Type to inform the many SDDs that this user writes. In comparison to SDD creation and document validation, the creation of SDD-Types is a rare event.

Figure 3.1 Specification and Validation Framework.

3.1.1 SDD-Type Creation

The output of this phase is an SDD-Type, a data structure that provides information on the distinct properties of a particular class of documents to be validated. For example, the SDD-Type records whether the tags for a document type are allowed to contain attributes and which characters constitute legal data. Examples of document classes are Java property files, XML, and HTML. There is a one-to-one relationship between an SDD-Type and the document class that it defines. The SDD-Type data structure also provides access to a type-specific lexer that will be called during validation (function three) to break up a document into a generic set of tokens (see Appendix C). The SDD-Type thus acts as a layer of abstraction between a particular type of document and the other parts of the framework. Implementing an SDD-Type requires programming skills, detailed knowledge of the document type being described, and an understanding of lexical analysis. Table 3.1 contains a list of properties that must be defined for an SDD-Type.

Property                     Description

attributes                   Boolean, format allows attributes
singletons                   Boolean, format allows singleton tags
nested tags                  Boolean, format allows nested tags
tag name chars               Characters that can appear in a tag name
tag regular expression       Regular expression for a tag name
attribute name chars         Characters that can appear in an attribute name
attribute reg. expression    Regular expression for an attribute name
lexer                        Lexical analyzer for the document format
attribute lexer              Lexical analyzer for attributes in the format

Table 3.1 SDD-Type properties.
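
The properties in Table 3.1 could be gathered into a single Java object. The following is a minimal sketch rather than the framework's actual data structure: the class names SddTypeInfo and SddTypes, the simplified name patterns, and the choice to disallow singletons for Java properties are all assumptions made for illustration, and the two lexer properties are left out.

```java
import java.util.regex.Pattern;

/** Hypothetical holder for the Table 3.1 properties (the two lexers are omitted). */
final class SddTypeInfo {
    final String documentClass;
    final boolean attributes;      // format allows attributes
    final boolean singletons;      // format allows singleton tags
    final boolean nestedTags;      // format allows nested tags
    final Pattern tagName;         // regular expression for a tag name
    final Pattern attributeName;   // regular expression for an attribute name

    SddTypeInfo(String documentClass, boolean attributes, boolean singletons,
                boolean nestedTags, String tagNameRegex, String attributeNameRegex) {
        this.documentClass = documentClass;
        this.attributes = attributes;
        this.singletons = singletons;
        this.nestedTags = nestedTags;
        this.tagName = Pattern.compile(tagNameRegex);
        this.attributeName = Pattern.compile(attributeNameRegex);
    }

    /** True if the whole name matches this document class's tag-name pattern. */
    boolean isLegalTagName(String name) {
        return tagName.matcher(name).matches();
    }
}

final class SddTypes {
    // Simplified, assumed name rule; the real XML rule is richer.
    static final String XML_NAME = "[A-Za-z_][A-Za-z0-9._-]*";

    static final SddTypeInfo XML =
        new SddTypeInfo("XML", true, true, true, XML_NAME, XML_NAME);

    // Assumed values for a flat format; "(?!)" is a pattern that never matches,
    // standing in for "this format has no attribute names".
    static final SddTypeInfo JAVA_PROPERTIES =
        new SddTypeInfo("Java properties", false, false, false, "[A-Za-z0-9._-]+", "(?!)");

    public static void main(String[] args) {
        System.out.println(XML.isLegalTagName("contact-info"));
        System.out.println(JAVA_PROPERTIES.nestedTags);
    }
}
```

An SDD compiler could consult flags such as attributes and nestedTags to reject SDD constructs that the chosen document class does not support.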

3.1.2 SDD Creation

This step uses the SDDL (see 3.2) and an SDD-Type from function one to create an SDD. Regardless of whether the user creates the SDD with a text editor or with the aid of a graphical tool (see 3.3), the result will be a specification written in SDDL for a single class of document defined by the chosen SDD-Type. Thus, there is a one-to-many relationship between an SDD-Type and the SDDs that use it, as opposed to the one-to-one relationship between SDD-Types and their corresponding document classes (see Figure 3.2). The SDD encodes information on the SDD-Type to use and the structure and content constraints to apply during the validation phase. Depending on the SDD-Type in use, certain features of the SDDL may not be available. For example, the tags in a Java properties file do not have attributes, nor can they contain other tags. Therefore, when compiling an SDD for a Java properties file, the user will receive compile errors if attributes are specified or tags are nested. Users implementing SDDs will require a working knowledge of the SDDL and the principles of structured documents, as well as detailed information on the structure and syntax of the target documents that the SDD will be used to validate.

Figure 3.2 Relationship between SDD-Types and SDDs.

3.1.3 Document Validation

Document validation is the most frequently invoked phase of the framework. Users at this stage are concerned with determining if documents are valid in relation to a supplied SDD and using the error messages generated during validation to correct problems in the documents.

The Validator consists of the following components (see Figure 3.3): type-specific lexers, parsing component, data validator, attribute validator, and error handler. To guide its work, the Validator uses a Compiled-SDD, that is, an SDD that has been converted into a hierarchy of objects that fully describe the characteristics of each element in the SDD. From the Compiled-SDD, the Validator determines the document class being validated and subsequently queries the appropriate SDD-Type to obtain the correct type-specific lexer. This lexer is used to tokenize incoming documents into a stream of tokens. Every class of documents will be decomposed into tokens drawn from the same set (see Table 3.2). Therefore, once a document has been tokenized, the Validator is no longer concerned with its class. If any errors are detected during validation, an error report will be generated. If there are no errors, the user will simply be notified that validation is complete. The Validator's components are described below:

i) Parser: An implementation of Tomita's generalized LR parser. The parser obtains the grammar and parsing tables from the Compiled-SDD.

ii) Attribute Validator: Uses information from the Compiled-SDD to determine if a tag's attributes are valid. The validator checks if the attributes are formatted correctly using a type-specific attribute lexer obtained from the appropriate SDD-Type. It then checks if the attributes are allowed and if their content is valid.

iii) Data Validator: Uses information from the Compiled-SDD to determine if the data content for a tag is valid. Validity is based on the types of characters allowed or regular expressions.

iv) Error Handler: Handles the creation and formatting of the error report.

Figure 3.3 Input and output to the Validator and its main components.

The first step in validation requires the user to load the text-file version of a previously created SDD into memory by invoking the SDD Compiler. All SDDs are compiled using the same compiler. Once compiled, the Compiled-SDD supports read-only or read-write interfaces to its content depending on the tool to which it is passed. For example, if it were passed to the SDD Developer, the user, acting now as SDD Creator, would be able to modify the contents of the Compiled-SDD and create a new version of the original SDD. In the case of document validation, the Compiled-SDD will be passed to the Validator as a read-only structure to be consulted by the Validator's components during validation. In addition to providing information on structure and data constraints to the Validator, the Compiled-SDD also informs the Validator which type-specific lexer the Validator should invoke when processing document input. For example, if the Compiled-SDD is for an XML-type document instance, the SDD-Type for XML will be queried for an XML lexer. If the Compiled-SDD is for a Java properties-type document instance, the SDD-Type for Java properties will be queried.

Finally, the user must invoke the Validator by passing it a selection of one or more documents to validate. At this point the Validator invokes the appropriate lexer to decompose a document into a generic set of tokens (see Table 3.2). Each token is analyzed by the Validator as soon as it is recognized by the lexer. In some cases (TAG_OPEN, TAG_CLOSE, or TAG_SINGLETON), the token is passed to the Validator's parsing component. In other cases, such as DATA or CHAR_ENTITY, the token is passed to the Validator's data validation component, which will query the Compiled-SDD regarding the validity of the data within the constraints of the current context, i.e., the tag that is currently open. During validation, the Validator will log any errors to an error report which the user will evaluate. No effort is made to auto-correct errors. Validation will likely be an iterative process in which the user submits documents for validation, corrects the documents that have errors based on the error report, then resubmits the corrected documents for validation. Users performing document validation will require knowledge of the structure and syntax of the documents they are validating and the ability to interpret the error messages that the validation tool provides.

Token Type          Description                                   Validator Component Handler

TAG_OPEN            Denotes an open tag.                          Parser
TAG_CLOSE           Denotes a close tag.                          Parser
TAG_SINGLETON       Denotes an open tag that is also a singleton. Parser
DATA                Non-structural document content.              Data Validator
NO_OP               Signals that the token should be ignored.     --
LEXICAL_ERROR       Signals an error found by the lexer.          Error Handler
CHAR_ENTITY         A special type of non-structural content.     Data Validator
ATTRIBUTE_STRING    Unparsed tag attributes.                      Attribute Validator
ATTR_NAME           Name of an attribute.                         Attribute Validator
ATTR_VALUE          Value of an attribute.                        Attribute Validator

Table 3.2 Generic Token-Types returned by type-specific lexers.
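
The routing in Table 3.2 can be pictured as a lookup from token type to Validator component. The sketch below only illustrates that idea under assumed names (TokenType, Component, TokenRouter); it is not taken from the Validator's implementation, and it models only the dispatch decision, not the components themselves.

```java
import java.util.EnumMap;
import java.util.Map;

/** The generic token types of Table 3.2. */
enum TokenType {
    TAG_OPEN, TAG_CLOSE, TAG_SINGLETON, DATA, NO_OP,
    LEXICAL_ERROR, CHAR_ENTITY, ATTRIBUTE_STRING, ATTR_NAME, ATTR_VALUE
}

/** Hypothetical labels for the Validator components; NONE means "ignore the token". */
enum Component { PARSER, DATA_VALIDATOR, ATTRIBUTE_VALIDATOR, ERROR_HANDLER, NONE }

final class TokenRouter {
    private static final Map<TokenType, Component> ROUTES =
        new EnumMap<TokenType, Component>(TokenType.class);

    static {
        // Structural tokens go to the parser.
        ROUTES.put(TokenType.TAG_OPEN, Component.PARSER);
        ROUTES.put(TokenType.TAG_CLOSE, Component.PARSER);
        ROUTES.put(TokenType.TAG_SINGLETON, Component.PARSER);
        // Non-structural content goes to the data validator.
        ROUTES.put(TokenType.DATA, Component.DATA_VALIDATOR);
        ROUTES.put(TokenType.CHAR_ENTITY, Component.DATA_VALIDATOR);
        // Attribute tokens go to the attribute validator.
        ROUTES.put(TokenType.ATTRIBUTE_STRING, Component.ATTRIBUTE_VALIDATOR);
        ROUTES.put(TokenType.ATTR_NAME, Component.ATTRIBUTE_VALIDATOR);
        ROUTES.put(TokenType.ATTR_VALUE, Component.ATTRIBUTE_VALIDATOR);
        // Lexer errors are reported; NO_OP tokens are dropped.
        ROUTES.put(TokenType.LEXICAL_ERROR, Component.ERROR_HANDLER);
        ROUTES.put(TokenType.NO_OP, Component.NONE);
    }

    /** Returns the component that Table 3.2 lists as the handler for this token type. */
    static Component route(TokenType type) {
        return ROUTES.get(type);
    }

    public static void main(String[] args) {
        for (TokenType t : TokenType.values()) {
            System.out.println(t + " -> " + route(t));
        }
    }
}
```

Because every type-specific lexer emits tokens from this one set, a single routing table of this kind would serve all document classes.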

3.2 The Structured Document Definition Language (SDDL)

The Structured Document Definition Language provides a syntax for describing the structure and constraining the content of structured documents. It serves as the language for communicating document structure and content requirements to the Validator through SDDs, which are specifications written in SDDL. In Figure 3.2, the SDDL is used to encode the SDDs for the various document instances. In Figure 3.3, the Compiled-SDD is created from an SDD specified using SDDL.

Depending on the nature of the document class being described, an SDD may be hierarchical or flat. (Type-specific information of this sort is obtained from the SDD-Type for the specified document class.) An SGML/XML document, for example, would typically have a hierarchical structure with tags being enclosed within other tags, while a Java properties file would have a flat structure since a tag within a properties file cannot contain other tags. Recursion is allowed within hierarchical definitions. The elements (TAGs, BLOCKs, and ORs) constitute the language's primary building blocks. Several different types of operators can be applied to TAGs and BLOCKs to define the frequency with which a structure can occur in a given context. The language provides two methods for describing content: datatypes and regular expressions. Attributes can also be specified, and their values can be constrained using datatypes and regular expressions. The subsections below describe the specification language in detail and illustrate the language's features through an increasingly complex example of an SDD for an XML document. The grammar for the specification language is given in Appendix A with explanatory notes in 3.2.7. Appendix B contains an example of an SDD for a Java properties file.

3.2.1 Elements

Elements are the building blocks of the Structured Document Definition Language. They provide the foundation and the scaffolding for specifications, as all definitions for attributes and data are tightly bound to an element. An SDD comprises a list of elements, ordered hierarchically or sequentially, to define a document's structure. Recursive element definitions are allowed for document types that support hierarchical structures. Three types of elements are available: TAGs, BLOCKs, and ORs, as follows:

TAG elements define the names of tags, which are document-type specific, and may or may not have a close tag associated with them. TAGs that do not have an associated close tag are referred to as singletons. TAGs with an associated close tag may contain other elements of any type. Users may also describe the types of data that a TAG element can contain and any attributes associated with it. Only TAG elements are allowed to contain data (see 3.2.4) and attribute (see 3.2.5) descriptions. TAG elements require a frequency specification (see 3.2.2) and may be associated with a namespace (see 3.2.3).

BLOCK elements wrap other elements and are usually used to define a repeating sequence of elements. Like TAGs, BLOCK elements require a frequency specification and may be associated with a namespace.

OR elements are used to define alternatives. Only one of the alternatives per element instance can be applied during parsing. An element whose content is defined using ORs must contain at least two OR elements.

Type      Syntax               Frequency   Other Elements   Data   Attributes

TAG       <name>...</name>     yes         yes              yes    yes
TAG(s)    <name/>              yes         no               no     yes
BLOCK     block...end_block    yes         yes              no     no
OR        or...end_or          no          yes              no     no

Table 3.3 Element types and their characteristics. (TAG(s) denotes singleton TAGs.)

Figure 3.4 is a skeletal example of an SDD featuring each type of element. The "1"s which follow the TAG and BLOCK elements are frequency specifications (see 3.2.2). The angle brackets encasing the tag names are part of the SDD syntax and bear no connection to the syntax of the markup being processed.

<?sdd version="2" type="XML"?>

<document>1

<body>1

or

end_or

or

end_or

</body>

<footer>1

block1

end_block

</footer>

</document>

Figure 3.4 An XML-type SDD showing the use of TAGs, BLOCKs, and ORs.

3.2.2 Frequency

Frequency defines how many times a TAG or BLOCK may occur within a specific context. Users may specify that an element occurs an absolute number of times, zero or more times, one or more times, or at most once. In addition, users can use the % operator to specify that an element can appear any number of times and that its position relative to other elements does not matter. An example of the % operator's use would be a paragraph in which tags denoting titles and cross-references are interspersed throughout the text in no particular order. Users may also specify a range. For example, the frequency 2+5 means that the element must appear at least twice but no more than five times. The % operator can also be specified as a range. For example, 2%5 means that the element must appear at least twice but no more than five times and that its position relative to other elements is irrelevant. Table 3.4 provides a summary of the frequency operators and their meanings.

Operator/Syntax   Min   Max        Notes

X                 X     X
+                 1     infinity
?                 0     1
*                 0     infinity
%                 0     infinity   relative order is irrelevant
X%Y               X     Y          relative order is irrelevant
%Y                0     Y          relative order is irrelevant
X%                X     infinity   relative order is irrelevant
X+Y               X     Y
+Y                0     Y
X+                X     infinity

Table 3.4 Frequency operators. (X and Y denote positive integers and X < Y when they appear together).
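
The rows of Table 3.4 can be read as a small parsing rule: an optional lower bound, an operator, and an optional upper bound. The sketch below shows one way to turn a frequency string into minimum and maximum counts. The class name Frequency, its fields, and the use of Integer.MAX_VALUE to stand for infinity are assumptions for illustration and are not part of the SDDL itself.

```java
/** Hypothetical model of a frequency specification from Table 3.4. */
final class Frequency {
    static final int UNBOUNDED = Integer.MAX_VALUE;  // stands in for "infinity"

    final int min;
    final int max;
    final boolean unordered;  // true for the % forms: relative order is irrelevant

    Frequency(int min, int max, boolean unordered) {
        this.min = min;
        this.max = max;
        this.unordered = unordered;
    }

    /** Parses forms such as "3", "+", "?", "*", "%", "2+5", "+5", "2+", "2%5", "%5", "2%". */
    static Frequency parse(String spec) {
        if (spec.equals("?")) return new Frequency(0, 1, false);
        if (spec.equals("*")) return new Frequency(0, UNBOUNDED, false);
        if (spec.equals("+")) return new Frequency(1, UNBOUNDED, false);

        int op = Math.max(spec.indexOf('+'), spec.indexOf('%'));
        if (op < 0) {                       // a bare number X means exactly X times
            int n = Integer.parseInt(spec);
            return new Frequency(n, n, false);
        }
        boolean unordered = spec.charAt(op) == '%';
        String low = spec.substring(0, op);
        String high = spec.substring(op + 1);
        int min = low.isEmpty() ? 0 : Integer.parseInt(low);
        int max = high.isEmpty() ? UNBOUNDED : Integer.parseInt(high);
        if (min >= max) {                   // Table 3.4 requires X < Y in a range
            throw new IllegalArgumentException("lower bound must be below upper bound: " + spec);
        }
        return new Frequency(min, max, unordered);
    }

    /** True if an element occurring 'count' times satisfies this frequency. */
    boolean accepts(int count) {
        return count >= min && count <= max;
    }

    public static void main(String[] args) {
        Frequency f = Frequency.parse("2%5");
        System.out.println(f.min + ".." + f.max + " unordered=" + f.unordered);
    }
}
```

Note the asymmetry that the table itself specifies: a bare + has a minimum of one, while +Y and a bare % both have a minimum of zero.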

The SDD from Figure 3.4 has been extended in Figure 3.5 with additional tags and new frequency operators. Readers should note that some elements, such as para and italic, appear multiple times within the specification. After such tags are first defined, subsequent references only specify the element's placement and frequency, omitting the remainder of its definition.

<?sdd version="2" type="XML"?>

<document>1

<header>1

<contact-info>* </contact-info>

</header>

<body>1

or

<abstract>1

<para>1+3

</para>

</abstract>

end_or

or

<essay>1

<para>+

</essay>

end_or

</body>

<footer>1

block+

<citation>1

<italic>%

</citation>

<annotation>1

<italic>%

</annotation>

end_block

</footer>

</document>

Figure 3.5 An XML-type SDD showing the use of frequency operators.

3.2.3 Namespaces

Namespaces are one of the most powerful features of the SDDL. A namespace is a label that is associated with an element, either a TAG or BLOCK. For BLOCKs, the association allows the SDD author to reuse a BLOCK element without having to redefine it each time. Figure 3.6 illustrates the use of namespaces with BLOCKs through an excerpt from an SDD before and after the application of namespaces.

block+

end_block

</sources>

<related_works>1

block+

<citation>1

<annotation>1

end_block

</related_works>

--------------------------------------------------------------

block+ namespace=cite_anno

end_block

</sources>

<related_works>1

block+ namespace=cite_anno

</related_works>

Figure 3.6 SDD snippets illustrating the use of namespaces with BLOCKs.

When applied to TAGs, namespaces provide a means to redefine a TAG's definition, including its internal structure, content constraints, and attributes. A specification, for example, might contain two different types of para elements. One type, to be used within the body of the document, might allow charts, images, and lists to be embedded in the text. Another type would allow text only and be used in the document's abstract. This type of restrictive definition can be applied without changing the markup of the source document and facilitates customizing a generic document structure to the particular needs of a specific project. Figure 3.7 illustrates the use of namespaces for both TAGs and BLOCKs. Comments (see 3.2.6), demarcated with <! ... !>, have also been inserted to clarify the use of namespaces in the example.

<?sdd version="2" type="XML"?>

<document>1

<header>1

<title>1 namespace=header <!this version of title does not allow italic tags!> </title>

<contact-info>* </contact-info>

</header>

<body>1

or

<abstract>1

<para>1+3 namespace=abstract.para <!this version of para does not allow title tags!>

</para>

</abstract>

end_or

or

<essay>1

<para>+ namespace=essay.para <!this version of para does allow title tags!>

<title>% namespace=essay.para <!this version of title allows italic tags!>

<italic>%

</title>

<italic>%

<underline>%

</para>

</essay>

end_or

</body>

<footer>1

<sources>1

block+ namespace=cite.anno

<citation>1

<italic>%

</citation>

<annotation>1

<italic>%

</annotation>

end_block

</sources>

<related_works>?

block+ namespace=cite.anno <!reuse of block that previously appeared within the sources tag!>

</related_works>

</footer>

</document>

Figure 3.7 An XML-type SDD showing the use of namespaces.
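
One way to picture the effect of namespaces on TAGs is a definition table keyed by both tag name and namespace, so that para in abstract.para and para in essay.para resolve to different definitions. The sketch below is an assumed illustration only: the class TagDefinitions, its key format, and the use of a plain description string in place of a full element definition are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical table of element definitions keyed by (tag name, namespace). */
final class TagDefinitions {
    private final Map<String, String> byKey = new HashMap<String, String>();

    // A null namespace stands for a tag declared without one.
    private static String key(String tag, String namespace) {
        return (namespace == null ? "" : namespace) + "|" + tag;
    }

    /** Stores a definition; a plain string stands in for the real element definition. */
    void define(String tag, String namespace, String definition) {
        byKey.put(key(tag, namespace), definition);
    }

    /** Returns the stored definition, or null if this tag/namespace pair is undefined. */
    String lookup(String tag, String namespace) {
        return byKey.get(key(tag, namespace));
    }

    public static void main(String[] args) {
        TagDefinitions defs = new TagDefinitions();
        defs.define("para", "abstract.para", "does not allow title tags");
        defs.define("para", "essay.para", "allows title, italic, and underline tags");
        System.out.println(defs.lookup("para", "essay.para"));
    }
}
```

The markup in the source document is the same in both cases; only the definition selected during validation differs.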

3.2.4 Data Specification

Data content may be constrained using datatypes or regular expressions. Datatypes allow the user to pick individual characters or classes of characters to which the data contained within a TAG element must conform. The frequency or sequence of the characters in the TAG's content does not matter. Datatype specifications are denoted between curly braces and consist of a datatype keyword or one or more characters. For example, the following datatype specification {LOWERCASE}{-_.} allows for lowercase letters along with hyphens, underscores, or periods. If the validator encountered any other type of character while processing this TAG's content, an error would be reported. To facilitate the shortest datatype specifications possible, users may use the "!" operator to signify that a datatype is not allowed. For example, the specification {PCDATA}!{DIGIT} allows for all characters except digits. The characters allowed in a datatype specification and the meaning of some of the datatype keywords are document-type specific. The characters "<" and ">", for instance, are not legal content within an SGML/XML document.

A TAG's content may also be constrained using regular expressions. These are specified using the rgx keyword as follows: rgx="regular expression". A TAG may have only one regular expression, and the entire content of the TAG must match this expression.

The three keywords optional, loose, and strict also affect the validation of a TAG's content. The optional keyword instructs the validator that the TAG's content is optional and that it should not issue an error if the TAG has no content. Users can specify at the application level whether the validator should use datatypes or regular expressions when validating content. If the SDD provides only one type of specification, either datatype or regular expression, then that type will be used when validating regardless of the application-level property. If the SDD provides both types, then the validator will use the type specified through the application-level property. The loose and strict keywords act as local overrides for this application-level property, instructing the validator to use the specified datatype, in the case of loose, or regular expression, in the case of strict, when validating a TAG's content. Table 3.5 provides a list of the keywords associated with restricting data content. The SDD example from Figure 3.7 is extended in Figure 3.8 with datatypes and regular expressions.

Keyword      Syntax         Meaning

PCDATA       {PCDATA}       Datatype meaning all legal characters.
LOWERCASE    {LOWERCASE}    Datatype denoting lowercase letters.
UPPERCASE    {UPPERCASE}    Datatype denoting uppercase letters.
ALLALPHA     {ALLALPHA}     Datatype denoting upper and lowercase letters.
DIGIT        {DIGIT}        Datatype denoting numeric characters 0-9.
ENTITIES     {ENTITIES}     Datatype for document-type specific character entities.
optional     optional       Signals that the content for a TAG is optional.
loose        loose          Signals that datatypes should be used to validate a TAG's content.
strict       strict         Signals that a regular expression should be used to validate a TAG's content.

Table 3.5 Keywords associated with data content specifications.
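
The rules above can be sketched in a few lines of Java. This is an illustrative approximation, not the Data Validator itself: the names DatatypeSpec and ContentRules are hypothetical, PCDATA is assumed to mean "any character except < and >" (the text says its meaning is document-type specific), ENTITIES is not modelled, and java.util.regex is used in place of the Perl 5 regular expression language that the SDDL specifies. How loose and strict behave when only one kind of specification is present is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical datatype check: every character must be allowed and none denied. */
final class DatatypeSpec {
    private final List<String> allowed = new ArrayList<String>();
    private final List<String> denied = new ArrayList<String>();

    DatatypeSpec allow(String item) { allowed.add(item); return this; }  // {item}
    DatatypeSpec deny(String item)  { denied.add(item);  return this; }  // !{item}

    private static boolean covers(String item, char c) {
        if (item.equals("PCDATA"))    return c != '<' && c != '>';  // assumed XML-like rule
        if (item.equals("LOWERCASE")) return c >= 'a' && c <= 'z';
        if (item.equals("UPPERCASE")) return c >= 'A' && c <= 'Z';
        if (item.equals("ALLALPHA"))  return covers("LOWERCASE", c) || covers("UPPERCASE", c);
        if (item.equals("DIGIT"))     return c >= '0' && c <= '9';
        return item.indexOf(c) >= 0;  // otherwise a literal character list such as "-_."
    }

    private static boolean anyCovers(List<String> items, char c) {
        for (String item : items) {
            if (covers(item, c)) return true;
        }
        return false;
    }

    /** Frequency and sequence of characters do not matter; only membership does. */
    boolean accepts(String data) {
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (anyCovers(denied, c) || !anyCovers(allowed, c)) return false;
        }
        return true;
    }
}

final class ContentRules {
    enum Mode { DATATYPE, REGEX }

    /** The entire content must match the expression. */
    static boolean matchesRegex(String regex, String data) {
        return Pattern.matches(regex, data);
    }

    /** localOverride is DATATYPE for loose, REGEX for strict, or null if neither is given. */
    static Mode choose(boolean hasDatatype, boolean hasRegex,
                       Mode applicationDefault, Mode localOverride) {
        if (hasDatatype && !hasRegex) return Mode.DATATYPE;  // only one kind supplied
        if (hasRegex && !hasDatatype) return Mode.REGEX;
        return localOverride != null ? localOverride : applicationDefault;
    }
}
```

For example, new DatatypeSpec().allow("LOWERCASE").allow("-_.") corresponds to the specification {LOWERCASE}{-_.}, and new DatatypeSpec().allow("PCDATA").deny("DIGIT") corresponds to {PCDATA}!{DIGIT}.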

<?sdd version="2" type="XML"?>

<document>1

<document.id>1 rgx="ID-[A-Z]{3}[0-9]{3}[ae]" </document.id>

<header>1

<title>1 namespace=header{PCDATA} <!this version of title does not allow italic tags!> </title>

<subtitle>?{PCDATA} </subtitle>

<author>+{LOWERCASE}{UPPERCASE}{ENTITIES}{.-} </author>

<contact-info>*{LOWERCASE}{UPPERCASE}{DIGIT}{.@-} rgx="[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+" </contact-info>

</header>

<body>1

or

<abstract>1

<para>1+3 namespace=abstract.para{PCDATA} <!this version of para does not allow title tags!>

<italic>%{PCDATA} </italic>

<underline>%{PCDATA} </underline>

</para>

</abstract>

end_or

or

<essay>1

<para>+ namespace=essay.para{PCDATA} <!this version of para does allow title tags!>

<title>% namespace=essay.para{PCDATA} <!this version of title allows italic tags!>

<italic>%

</title>

<italic>%

<underline>%

</para>

</essay>

end_or

</body>

<footer>1

<sources>1

block+ namespace=cite.anno

<citation>1{PCDATA}

<italic>%

</citation>

<annotation>1{PCDATA}

<italic>%

</annotation>

end_block

</sources>

<related_works>?

block+ namespace=cite.anno <!reuse of block that previously appeared within the sources tag!>

</related_works>

</footer>

</document>

Figure 3.8 An XML-type SDD showing the use of data specifications.

3.2.5 Attribute Specifications

Some data formats, such as SGML and XML, allow markup to have attributes. Attributes can be required or optional. Their specifications consist of an attribute's name along with its values, which are specified using datatypes and/or regular expressions. The language constructs and syntax used to constrain data associated with an element also apply to data associated with an attribute. Both TAG and singleton TAG elements may contain attributes. Each attribute specification is demarcated with square brackets. The # symbol denotes that an attribute is required, meaning that the tag will not be considered valid if the attribute is missing. The loose or strict keywords follow but are optional. Next is the attribute's name, followed by an equal sign which precedes a regular expression and/or list of datatypes. Table 3.6 contains sample attribute specifications along with explanations of the syntax. Figure 3.9 extends the SDD from Figure 3.8 with attribute specifications. Note that the <italic> and <underline> tags have been replaced with a single tag that uses attributes.

Attribute Specification                    Explanation

[# id="A[0-9]+"]                           The attribute "id" is required and has values composed of an A followed by one or more digits.
[direction="north|south|east|west"]        The attribute "direction" is optional and must use the values: north, south, east, or west.
[# type={LOWERCASE}{-.}]                   The attribute "type" is required and has values composed of lowercase letters, dashes, and periods.
[sig={DIGIT}{UPPERCASE}"[0-9]+[A-Z]?"]     The attribute "sig" is optional and may contain values composed of digits and uppercase letters or adhere to the pattern of one or more digits followed by an optional uppercase letter.

Table 3.6 Sample attribute specifications.
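
The checks that the first two rows of Table 3.6 imply (required attributes are present, no unexpected attributes appear, and values match their patterns) can be sketched as follows. The classes AttributeSpec and AttributeCheck and the wording of the messages are hypothetical, only regular-expression constraints are modelled, and java.util.regex again stands in for Perl 5 regular expressions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical attribute specification: [# name="regex"] or [name="regex"]. */
final class AttributeSpec {
    final String name;
    final boolean required;   // true when the specification starts with '#'
    final Pattern value;

    AttributeSpec(String name, boolean required, String regex) {
        this.name = name;
        this.required = required;
        this.value = Pattern.compile(regex);
    }
}

final class AttributeCheck {
    /** Returns one message per problem; an empty list means the attributes are valid. */
    static List<String> validate(List<AttributeSpec> specs, Map<String, String> actual) {
        List<String> errors = new ArrayList<String>();
        Map<String, AttributeSpec> byName = new HashMap<String, AttributeSpec>();
        for (AttributeSpec spec : specs) {
            byName.put(spec.name, spec);
            String v = actual.get(spec.name);
            if (v == null) {
                if (spec.required) errors.add("missing required attribute: " + spec.name);
            } else if (!spec.value.matcher(v).matches()) {
                errors.add("invalid value for attribute: " + spec.name);
            }
        }
        for (String name : actual.keySet()) {
            if (!byName.containsKey(name)) errors.add("attribute not allowed: " + name);
        }
        return errors;
    }

    public static void main(String[] args) {
        List<AttributeSpec> specs = new ArrayList<AttributeSpec>();
        specs.add(new AttributeSpec("id", true, "A[0-9]+"));
        specs.add(new AttributeSpec("direction", false, "north|south|east|west"));

        Map<String, String> attrs = new HashMap<String, String>();
        attrs.put("direction", "up");
        for (String error : validate(specs, attrs)) {
            System.out.println(error);
        }
    }
}
```

In the main method above, the tag lacks the required id attribute and carries a direction value outside the permitted list, so two problems are reported.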

<?sdd version="2" type="XML"?>

<document>1 [# category={LOWERCASE}{-}] [ reviewedby={LOWERCASE}{UPPERCASE}]

<document.id>1 rgx="ID-[A-Z]{3}[0-9]{3}[ae]" </document.id>

<header>1

<title>1 namespace=header{PCDATA} <!this version of title does not allow italic tags!> </title>

<subtitle>?{PCDATA} </subtitle>

<author>+{LOWERCASE}{UPPERCASE}{ENTITIES}{.-} [# id="[A-Z]{2}[0-9]+"] </author>

<contact-info>*{LOWERCASE}{UPPERCASE}{DIGIT}{.@-} rgx="[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+" </contact-info>

</header>

<body>1

or

<abstract>1

<para>1+3 namespace=abstract.para{PCDATA} <!this version of para does not allow title tags!>

<emphasis>%{PCDATA} [# type="italic|underline"] </emphasis>

</para>

</abstract>

end_or

or

<essay>1

<para>+ namespace=essay.para{PCDATA} <!this version of para does allow title tags!>

<title>% namespace=essay.para{PCDATA} <!this version of title allows italic tags!>

<emphasis>%

</title>

<emphasis>%

<classification/>% [# term={LOWERCASE}{-.}]

</para>

</essay>

end_or

</body>

<footer>1

<sources>1

block+ namespace=cite.anno

<citation>1{PCDATA}

<emphasis>% namespace=cite.anno{PCDATA} [# type="italic"] </emphasis>

</citation>

<annotation>1{PCDATA}

<emphasis>% namespace=cite.anno

</annotation>

end_block

</sources>

<related_works>?

block+ namespace=cite.anno <!reuse of block that previously appeared within the sources tag!>

</related_works>

</footer>

</document>

Figure 3.9 An XML-type SDD showing the use of attribute specifications. The attributes have been marked in bold for readability.

3.2.6 Comments

Comments are demarcated using "<!" for the start of a comment and "!>" for the end of a comment. Comments are associated with the most recent element and may appear anywhere within an element's definition. However, comments may not appear within a specification construct and comments may not be nested. Figure 3.10 provides examples of legal and illegal comment usage.

Legal:

<emphasis>% namespace=cite.anno <! Comment !> {PCDATA} [# type="italic"] </emphasis>

<emphasis>% namespace=cite.anno {PCDATA}<! Comment !> [# type="italic"] </emphasis>

<emphasis>% namespace=cite.anno {PCDATA} [# type="italic"] <! Comment !> [id="[abc]+[0-9]?"] </emphasis>

Illegal:

<emphasis>% namespace <! Comment !>=cite.anno {PCDATA} [# type="italic"] </emphasis>

<emphasis>% namespace=cite.anno {PCDATA <! Comment !>} [# type="italic"] </emphasis>

<emphasis>% namespace=cite.anno {PCDATA} [# <! Comment !> type="italic"] [id="[abc]+[0-9]?"] </emphasis>

Figure 3.10 Examples of legal and illegal comment usage.

3.2.7 SDDL Grammar

A grammar for the SDDL in BNF format is presented in Appendix A. Four of the non-terminals are not defined. The first, <regular_expr>, is defined by the Perl 5 regular expression language. The other three (<datatype_chars>, <tag_chars>, and <attr_name_chars>) are type-specific. The definitions of these non-terminals will differ depending on the underlying type for which the SDD is intended. For example, some types may require that tag names be restricted to three uppercase letters, while others are less restrictive, allowing tag names of any non-zero length containing uppercase and lowercase letters as well as digits. The information needed to define these type-specific non-terminals is provided at runtime by the SDD-Type data structure described in 3.1.1.

3.2.8 Illegal Structures

There are two types of structures that, while allowed by the SDDL grammar, are illegal. Both structures are impractical and pose problems for the parsing and conversion algorithms described in Chapter 4. First, recursive definitions that include required self-references are not allowed. Such definitions state that the content of an element, at some level in its tree of required elements, contains a reference to itself and that this recursive element is required. In other words, its frequency specification states that it must occur at least once. See the sample SDDs in Figures 3.11 and 3.12 for examples of illegal SDDs and Figure 3.13 for an example that appears to be illegal but is valid. From a user's perspective, this type of recursion is impractical because the document can never be completed; the recursion is infinite. When converting from an SDD to a grammar, this type of recursion will send the conversion algorithm into an infinite loop (see 4.1).

<?sdd version="2" type="XML"?>

<doc>1

<list>1

<item>+

<list>1

</item>

</list>

</doc>

Figure 3.11 SDD with an illegal recursive definition (list).

<?sdd version="2" type="XML"?>

<doc>1

<para>+

</para>

</doc>

Figure 3.12 SDD with an illegal recursive definition (para).

<?sdd version="2" type="XML"?>

<xxx>1

<zzz>*

<xxx>1

</zzz>

</xxx>

Figure 3.13 SDD with a legal recursive definition. This is legal because zzz is not required.
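
The first restriction amounts to a cycle check over "required child" relationships: an element is illegal if, following only children whose minimum frequency is at least one, it can reach itself. The sketch below illustrates that check under simplifying assumptions. It treats every element as a plain TAG and ignores the special handling that BLOCKs and ORs would need, and the class RecursionCheck is hypothetical rather than the conversion algorithm of Chapter 4.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical detector for required self-references in an element hierarchy. */
final class RecursionCheck {
    // For each element, the children that must occur at least once.
    private final Map<String, List<String>> required = new HashMap<String, List<String>>();

    /** Records that 'parent' contains 'child' with the given minimum frequency. */
    void add(String parent, String child, int min) {
        if (!required.containsKey(parent)) {
            required.put(parent, new ArrayList<String>());
        }
        if (min >= 1) {
            required.get(parent).add(child);  // optional children cannot force recursion
        }
    }

    /** True if 'element' can reach itself through required children only. */
    boolean hasRequiredSelfReference(String element) {
        return reaches(element, element, new HashSet<String>());
    }

    private boolean reaches(String from, String target, Set<String> seen) {
        List<String> children = required.get(from);
        if (children == null) return false;
        for (String child : children) {
            if (child.equals(target)) return true;
            // seen.add returns false for elements already explored, which stops loops.
            if (seen.add(child) && reaches(child, target, seen)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RecursionCheck illegal = new RecursionCheck();   // structure of Figure 3.11
        illegal.add("doc", "list", 1);
        illegal.add("list", "item", 1);                  // item has frequency +
        illegal.add("item", "list", 1);
        System.out.println(illegal.hasRequiredSelfReference("list"));

        RecursionCheck legal = new RecursionCheck();     // structure of Figure 3.13
        legal.add("xxx", "zzz", 0);                      // zzz has frequency *
        legal.add("zzz", "xxx", 1);
        System.out.println(legal.hasRequiredSelfReference("xxx"));
    }
}
```

The Figure 3.13 structure passes because the only path from xxx back to itself runs through zzz, which is optional.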

The second type of illegal structure concerns BLOCKs and ORs that do not contain required content. The reasons for this restriction are more subtle. This type of structure converts to a hidden-left-recursive grammar (see 4.2) that will cause the parsing algorithm described in Chapter 4 to enter an infinite loop. From a user's perspective, such structures are of questionable value and likely indicate poor design. The SDD in Figure 3.14, for example, states that the defined BLOCK must occur one or more times. However, the content of the BLOCK may legally be empty as the TAG x may occur zero or more times. A BLOCK has no visible identifiers at the document's syntactic level other than the TAG elements that comprise its content. Therefore, it is impossible to tell that a BLOCK has occurred unless some part of its content is present. This situation presents a contradiction: the BLOCK would be deemed to have occurred whether one or more x's are present or none are.

<?sdd version="2" type="XML"?>

block+

end_block

Figure 3.14 SDD with an illegal BLOCK.

3.3 GUI Toolset

As described in section 1.1.1, there are three main steps involved in document validation: i) SDD-Type creation; ii) SDD creation; and iii) document validation. These features are made accessible through the GUI toolset component of our framework, henceforth known as the Structured Document Validator (SDValidator). This section describes the main components of the SDValidator, focusing on dynamic SDD-Type recognition and loading, SDD creation and modification, and document validation. The descriptions that follow present a high-level overview of the SDValidator's principal features. This is not a tutorial on how to use the application.

In the interest of platform independence, the SDValidator application has been implemented in Java using JDK 1.3 and has been tested on Linux, Windows NT, Windows 98, and Windows 2000 using Sun's 1.3 Java Runtime Environment. The application's multiple document interface (MDI) is realized through tabs split across two tab panes. The upper pane holds tabs related to SDD creation and document editing; the lower pane houses tabs that display error messages. Figure 3.15 shows the application with open tabs in both panes. When deployed, the executable and application data reside in three directories, as described in Table 3.7, where <application> denotes a directory path and <user.home> denotes the path to the user's home directory. The user-defined application properties noted in Table 3.7 include such information as the location and size of the application's frame window at shutdown, preferred fonts and background colors, and "favorite" directories.

Figure 3.15 Open tabs in application.

Path                             Contents
<application>SDValidator/bin     Location for executable jar file.
<user.home>SDValidator           User-defined application properties.
<user.home>SDValidator/typelib   Contains SDD-Type jar files.

Table 3.7 SDValidator's directory structure.

3.3.1 SDD-Type Recognition and Loading

As explained in 3.1.1, SDD-Types are data structures that contain information specific to a document class. When deployed, the SDD-Types are encoded as a collection of Java class files that are then compressed into a jar file. Although the SDValidator application is not used to create SDD-Types, it must be able to interface with them. The interfacing occurs at runtime, so new SDD-Types can be added as needed without modifying the SDValidator application. SDD-Types are loaded dynamically during application startup through a survey of the SDD-Type jars stored in the typelib directory. Users can therefore add or remove SDD-Type information simply by managing the contents of the typelib directory.
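
The startup survey might look roughly like the following sketch. It is not the SDValidator source: the class TypeLibSurvey and its methods are hypothetical, the code uses the java.nio.file API, which postdates the JDK 1.3 used for the application, and it stops at constructing a class loader because the way an SDD-Type is located inside its jar is not described here. The only library calls used are Files.newDirectoryStream with a glob pattern and the standard URLClassLoader constructor.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a startup survey of the typelib directory. */
final class TypeLibSurvey {

    /** Lists the jar files found directly inside typelibDir, sorted by path. */
    static List<Path> findTypeJars(Path typelibDir) throws IOException {
        List<Path> jars = new ArrayList<>();
        if (!Files.isDirectory(typelibDir)) {
            return jars; // no typelib directory: no SDD-Types are available
        }
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(typelibDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar);
            }
        }
        Collections.sort(jars); // directory order is unspecified, so sort for stability
        return jars;
    }

    /** Builds one class loader that can resolve classes from every surveyed jar. */
    static URLClassLoader loaderFor(List<Path> jars) throws IOException {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = jars.get(i).toUri().toURL();
        }
        return new URLClassLoader(urls, TypeLibSurvey.class.getClassLoader());
    }
}
```

Because the survey runs each time the application starts, dropping a jar into the directory or deleting one changes the set of available SDD-Types on the next launch, with no change to the application itself.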

3.3.2 SDD Creation and Modification

The SDValidator application implements the second function, SDD creation, in two different ways. First, the user can manually enter a new SDD or modify an existing one in a specialized text-editing window and then compile the SDD as a second step. In addition to normal text-editing features such as search/replace and copy/cut/paste, a Compile SDD option is active when users work in an SDD editing window. Alternatively, the user can use the SDD Developer (Figure 3.16) to create a new SDD or modify one that has already been compiled. The Developer provides a visual interface that frees the user from the details of the SDDL syntax. Within the Developer, SDDs are presented as a tree with TAG, BLOCK, and OR elements as nodes. Users can add elements to the tree as children, siblings, or parents of existing nodes. Elements may also be removed from the tree and either discarded or pasted into a new location in the tree. After selecting a particular element in the tree, users have access to all information associated with the element. SDDs created or modified through the Developer are always in a compiled and up-to-date state, so a separate compilation step is not necessary. The full range of SDDL language features is available through the Developer.

Figure 3.16 SDD Developer.
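
The tree operations offered by the Developer (adding an element as a child, sibling, or parent of an existing node, and cutting and pasting elements) can be illustrated with a small data structure. This is a sketch under assumptions, not the Developer's actual model: SddTreeNode and its method names are hypothetical, and the node carries only a kind and a name rather than the full information associated with an element.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a node in the Developer's SDD tree. */
final class SddTreeNode {

    enum Kind { TAG, BLOCK, OR }

    final Kind kind;
    final String name;
    private SddTreeNode parent;
    private final List<SddTreeNode> children = new ArrayList<>();

    SddTreeNode(Kind kind, String name) {
        this.kind = kind;
        this.name = name;
    }

    SddTreeNode getParent() {
        return parent;
    }

    List<SddTreeNode> getChildren() {
        return Collections.unmodifiableList(children);
    }

    /** Removes this node, with its subtree, from its parent ("cut"). */
    void detach() {
        if (parent != null) {
            parent.children.remove(this);
            parent = null;
        }
    }

    /** Adds child as the last child of this node; also serves as "paste". */
    void addChild(SddTreeNode child) {
        child.detach();
        child.parent = this;
        children.add(child);
    }

    /** Inserts sibling immediately after this node under the same parent. */
    void addSiblingAfter(SddTreeNode sibling) {
        if (parent == null) {
            throw new IllegalStateException("the root has no siblings");
        }
        sibling.detach();
        int index = parent.children.indexOf(this);
        parent.children.add(index + 1, sibling);
        sibling.parent = parent;
    }

    /** Puts newParent in this node's place and makes this node its child.
        No cycle checks are performed in this sketch. */
    void insertParent(SddTreeNode newParent) {
        newParent.detach();
        SddTreeNode oldParent = parent;
        if (oldParent != null) {
            int index = oldParent.children.indexOf(this);
            oldParent.children.set(index, newParent);
            newParent.parent = oldParent;
        }
        parent = newParent;
        newParent.children.add(this);
    }
}
```

In this sketch, moving an element is simply detach followed by addChild on the destination node, which mirrors the remove-then-paste workflow described above.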

3.3.3 Document Validation

Validation can be performed on a single file, a collection of files from a directory, or all the files in a directory. Regardless of the number of documents to process, validation is performed using the "current SDD." The SDValidator allows multiple compiled SDDs to be available in the application simultaneously; the current SDD is chosen via a drop-down selection box in the tool bar. Errors encountered during validation are written to an error report visible in the lower tab pane. For each error, the report shows the file name of the validated document and a brief description that includes line and column numbers. It is assumed that in most cases validation will be an iterative process in which a user validates a collection of documents en masse, then corrects errors and revalidates documents individually.

Figure 3.17 Validated document and its corresponding error report.
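
The information carried by the error report, and the way it supports the iterative workflow, can be sketched as follows. The class ErrorReport, its methods, and the layout of the formatted text are hypothetical and are not taken from the SDValidator. The sketch only records what the report is described as holding: a file name plus, for each error, a line number, a column number, and a brief description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an error report grouped by validated file. */
final class ErrorReport {

    /** One error: position in the document plus a brief description. */
    static final class Entry {
        final int line;
        final int column;
        final String description;

        Entry(int line, int column, String description) {
            this.line = line;
            this.column = column;
            this.description = description;
        }
    }

    // LinkedHashMap keeps files in the order in which they were validated.
    private final Map<String, List<Entry>> byFile = new LinkedHashMap<>();

    void add(String fileName, int line, int column, String description) {
        byFile.computeIfAbsent(fileName, k -> new ArrayList<>())
              .add(new Entry(line, column, description));
    }

    /** Files that still need correction and individual revalidation. */
    List<String> filesWithErrors() {
        return new ArrayList<>(byFile.keySet());
    }

    /** Renders the report as plain text, one file heading followed by its errors. */
    String format() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<Entry>> file : byFile.entrySet()) {
            out.append(file.getKey()).append('\n');
            for (Entry e : file.getValue()) {
                out.append("  line ").append(e.line)
                   .append(", column ").append(e.column)
                   .append(": ").append(e.description).append('\n');
            }
        }
        return out.toString();
    }
}
```

After a batch run, filesWithErrors gives the list of documents to correct and revalidate one at a time, which is the second half of the iterative process described above.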