SDValidator Help Pages -- Syntax Reference

Version 0.9

Syntax Reference for the SDD Language

Comments are used to add explanatory notes to the SDD. They may be associated with the SDD, a tag element, or a block element. They play no role during validation. The text of comments is enclosed in <! and !>.

Examples:
Code	Description
<?sdd version="2.1" type="XML"?> <! author: Joe Sddwriter last edit: 5/1/05 !>	Example of a top-level comment.
block <!comment!> <xx> <! some comments !> </xx> <eee/> <!another comment!> end_block	Examples of comments associated with a block element, a tag, and a singleton tag.
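Because comments play no role during validation, a tool that reads an SDD can discard them first. The following Python sketch shows one way to do that. It is not SDValidator's own code: the function name strip_comments is hypothetical, comments are assumed not to nest, and the sequence <! is assumed never to appear inside a quoted regular expression.

```python
import re

# A comment starts at "<!" and runs to the nearest "!>" (non-greedy match).
_COMMENT = re.compile(r"<!.*?!>", re.DOTALL)


def strip_comments(sdd_text):
    """Return sdd_text with every <! ... !> comment removed.

    Simplification: runs of whitespace are collapsed to one space, which
    would also alter repeated spaces inside a quoted regular expression.
    """
    without_comments = _COMMENT.sub(" ", sdd_text)
    return " ".join(without_comments.split())


if __name__ == "__main__":
    example = ("block <!comment!> <xx> <! some comments !> </xx> "
               "<eee/> <!another comment!> end_block")
    print(strip_comments(example))
```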


Tag Element

Tag elements define the names of tags, which are document-type specific, and may or may not have a close tag associated with them. Tags that do not have an associated close tag are referred to as singletons. Tags with an associated close tag may contain other elements (tags, blocks, and ors). Users may also describe the types of data that an element can contain and any attributes associated with it. Only tag elements are allowed to contain data and attribute descriptions. Tag elements require a frequency specification and may be associated with a namespace.

When a tag element appears multiple times within an SDD, it is fully defined on its first appearance. Subsequent references specify only the element's placement and frequency, omitting the remainder of its definition.

Examples:
Code	Description
<doc> </doc>	An open tag followed by its associated close tag.
<classify/>	A singleton tag--one that does not have an associated close tag.
<doc> <classify/> </doc>	A tag that contains another element--in this case the singleton classify tag.


Block Element

Block elements wrap other elements (tags, blocks, and ors) and are usually used to define a repeating sequence of elements. Like tags, block elements require a frequency specification and may be associated with a namespace.

Restrictions:

Block elements should have more than one child element.
Blocks must have required content. At least one of a block's child elements must have a frequency with a minimum count of one.

Examples:
Code	Description
block <image/> <classify/> end_block	A block element containing two singleton tags.


Or Element

Or elements are used to define alternatives. Only one of the alternatives per element instance can be applied during parsing. An element whose content is defined using ors must contain at least two or elements.

Restrictions:

Ors must have required content. At least one of an or's child elements must have a frequency with a minimum count of one.
The first child element of an or element should not be optional.
The first child elements of a sequence of or elements should be unique.

Examples:
Code	Description
<doc> or <xx> </xx> end_or or <yy> </yy> <zz/> end_or </doc>	A sequence of two or elements defining the content for the doc tag.


Frequency

Frequency defines how many times a tag or block element may occur within a specific context. Users may specify that an element occurs an absolute number of times, zero or more times, one or more times, or zero or one time. Users may also specify a range. For example, the frequency 2+5 means that the element must appear at least twice but no more than five times.

In addition, users can use the % operator to specify that an element can appear any number of times and that its position relative to other elements does not matter. An example of the % operator's use would be a paragraph in which tags denoting titles and cross-references are interspersed throughout the text in no particular order. The % operator can also be specified as a range. For example, 2%5 means that the element must appear at least twice but no more than five times and that its position relative to other elements is irrelevant.

Frequency Operators
Operator	Min	Max	Example	Notes
X	X	X	<doc>2 </doc>	If a frequency operator is not specified, 1 is used as a default.
+	1	infinity	<doc>+ </doc>
?	0	1	<doc>? </doc>
*	0	infinity	<doc>* </doc>
X+Y	X	Y	<doc>2+5 </doc>	defines a floor and ceiling
+Y	0	Y	<doc>+5 </doc>	defines a ceiling
X+	X	infinity	<doc>2+ </doc>	defines a floor
%	0	infinity	<doc>% </doc>	relative order in relation to siblings is irrelevant
X%Y	X	Y	<doc>2%5 </doc>	relative order in relation to siblings is irrelevant; defines a floor and ceiling
%Y	0	Y	<doc>%5 </doc>	relative order in relation to siblings is irrelevant; defines a ceiling
X%	X	infinity	<doc>2% </doc>	relative order in relation to siblings is irrelevant; defines a floor
Note: X and Y denote positive integers and X < Y when they appear together.
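The operator table can be read as a mapping from an operator string to a minimum, a maximum, and an ordering flag. The Python sketch below illustrates that mapping. It is not the validator's implementation: the names Frequency and parse_frequency are hypothetical, the operator is assumed to have already been separated from the tag it follows, and None is used here to stand for "infinity".

```python
import re
from typing import NamedTuple, Optional


class Frequency(NamedTuple):
    minimum: int
    maximum: Optional[int]  # None stands for "infinity"
    ordered: bool           # False when the % operator is used


# Covers +, X+Y, +Y, X+ and the matching % forms.
_RANGE = re.compile(r"([0-9]*)([+%])([0-9]*)")


def parse_frequency(spec: str) -> Frequency:
    """Translate a frequency operator into (minimum, maximum, ordered)."""
    if spec == "":
        return Frequency(1, 1, True)      # no operator: 1 is the default
    if spec == "?":
        return Frequency(0, 1, True)
    if spec == "*":
        return Frequency(0, None, True)
    if re.fullmatch(r"[0-9]+", spec):
        count = int(spec)
        return Frequency(count, count, True)   # absolute number of times
    match = _RANGE.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a frequency operator: {spec!r}")
    low, operator, high = match.groups()
    ordered = operator == "+"
    if not low and not high:
        # A bare + means one or more; a bare % means zero or more.
        return Frequency(1 if ordered else 0, None, ordered)
    minimum = int(low) if low else 0
    maximum = int(high) if high else None
    if low and high and minimum >= maximum:
        raise ValueError("X must be less than Y")
    return Frequency(minimum, maximum, ordered)


if __name__ == "__main__":
    for operator in ["", "2", "+", "?", "*", "2+5", "+5", "2+", "%", "2%5"]:
        print(repr(operator), parse_frequency(operator))
```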


Namespaces

A namespace is a label that is associated with an element, either a tag or block. For blocks, the association allows the SDD author to reuse a block element without having to redefine it each time.

When applied to tags, namespaces provide a means to redefine a tag's definition, including its internal structure, content constraints, and attributes. A specification, for example, might contain two different types of para elements. One type, to be used within the body of the document, might allow charts, images, and lists to be embedded in the text. Another type would allow text only and be used in the document's abstract. This type of restrictive definition can be applied without changing the markup of the source document and facilitates customizing a generic document structure to the particular needs of a specific project.


Defining Attributes

Some data formats, such as SGML and XML, allow markup to have attributes. Attributes can be required or optional. Their specifications consist of an attribute's name along with its values, which are specified using datatypes and/or regular expressions. The language constructs and syntax used to constrain data associated with an element also apply to data associated with an attribute. Both tag and singleton tag elements may contain attributes. Each attribute specification is demarcated with square brackets. The # symbol denotes that an attribute is required, meaning that the tag will not be considered valid if the attribute is missing. The loose or strict keywords follow but are optional. Next is the attribute's name, followed by an equal sign, which precedes a regular expression and/or list of datatypes.

Examples:
Code	Description
[# id="A[0-9]+"]	The attribute "id" is required and has values constrained by a regular expression that allows an A followed by one or more digits.
[direction="north|south|east|west"]	The attribute "direction" is optional and has values constrained by a regular expression limited to: north, south, east, or west.
[# type={LOWERCASE}{-.}]	The attribute "type" is required and has values composed of lowercase letters, dashes, and periods.
[sig={DIGIT}{UPPERCASE}"[0-9]+[A-Z]?"]	The attribute "sig" is optional and may contain values composed of digits and uppercase letters or adhere to a regular expression that requires one or more digits followed by an optional uppercase letter.
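To make the parts of an attribute specification concrete, the Python sketch below splits one into its required flag, optional keyword, name, datatypes, and regular expression. It is an illustration only, not the validator's parser. The names AttributeSpec and parse_attribute are hypothetical, and the exact whitespace rules and the set of characters allowed in an attribute name are assumptions, since the examples above do not pin them down.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttributeSpec:
    name: str
    required: bool            # True when the # symbol is present
    mode: Optional[str]       # "loose", "strict", or None
    datatypes: List[str]      # e.g. ["{DIGIT}", "{UPPERCASE}"]
    regex: Optional[str]      # text between the double quotes, if any


# Assumed layout: [ # keyword name = {datatypes...} "regex" ]
_ATTRIBUTE = re.compile(r"""
    \[ \s*
    (?P<required>\#)? \s*
    (?: (?P<mode>loose|strict) \s+ )?
    (?P<name>[A-Za-z_][A-Za-z0-9_.:-]*) \s* = \s*
    (?P<datatypes>(?: !? \{ [^}]* \} )*) \s*
    (?: " (?P<regex>.*) " )? \s*
    \]
""", re.VERBOSE)


def parse_attribute(spec: str) -> AttributeSpec:
    """Split one bracketed attribute specification into its parts."""
    match = _ATTRIBUTE.fullmatch(spec.strip())
    if match is None:
        raise ValueError(f"not an attribute specification: {spec!r}")
    return AttributeSpec(
        name=match.group("name"),
        required=match.group("required") is not None,
        mode=match.group("mode"),
        datatypes=re.findall(r"!?\{[^}]*\}", match.group("datatypes")),
        regex=match.group("regex"),
    )


if __name__ == "__main__":
    print(parse_attribute('[# id="A[0-9]+"]'))
    print(parse_attribute('[sig={DIGIT}{UPPERCASE}"[0-9]+[A-Z]?"]'))
```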


Defining Data Constraints

Data content may be constrained using datatypes or regular expressions. Datatypes allow the user to pick individual characters or classes of characters to which the data contained within a tag element must conform. The frequency or sequence of the characters in the tag's content does not matter when using a datatype constraint. Datatype specifications are denoted between curly braces and consist of a datatype keyword or one or more characters. For example, the following datatype specification {LOWERCASE}{-_.} allows for lowercase letters along with hyphens, underscores, or periods. If the validator encountered any other type of character while processing this tag's content, an error would be reported. To facilitate the shortest datatype specifications possible, users may use the "!" operator to signify that a datatype is not allowed. For example, the specification {PCDATA}!{DIGIT} allows for all characters except digits. The characters allowed in a datatype specification and the meaning of some of the datatype keywords are document-type specific. For example, the characters "<" and ">" are not legal content within an SGML/XML document.

Keyword	Syntax	Description
PCDATA	{PCDATA}	Datatype meaning all legal characters.
LOWERCASE	{LOWERCASE}	Datatype denoting lowercase letters.
UPPERCASE	{UPPERCASE}	Datatype denoting uppercase letters.
ALLALPHA	{ALLALPHA}	Datatype denoting upper and lowercase letters.
DIGIT	{DIGIT}	Datatype denoting numeric characters 0-9.
ENTITIES	{ENTITIES}	Datatype for document-type specific character entities.
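A datatype constraint is a per-character membership test: order and repetition do not matter, and the ! operator removes characters from the allowed set. The Python sketch below illustrates this under several assumptions. The function name first_illegal_char is hypothetical, letters and digits are limited to ASCII, {PCDATA} is treated as "every character" (the real set is document-type specific), and {ENTITIES} is not handled.

```python
import re
import string

# Keyword datatypes from the table above (ASCII only; an assumption).
KEYWORDS = {
    "LOWERCASE": set(string.ascii_lowercase),
    "UPPERCASE": set(string.ascii_uppercase),
    "ALLALPHA": set(string.ascii_letters),
    "DIGIT": set(string.digits),
}

# One datatype: optional "!" followed by {KEYWORD} or {literal characters}.
_DATATYPE = re.compile(r"(!?)\{([^}]*)\}")


def first_illegal_char(spec, content):
    """Return the first character of content that spec rejects, or None."""
    allowed, denied = set(), set()
    allow_all = False
    for negated, body in _DATATYPE.findall(spec):
        if body == "ENTITIES":
            raise NotImplementedError("ENTITIES is document-type specific")
        if body == "PCDATA":
            if negated:
                raise ValueError("!{PCDATA} is not supported in this sketch")
            allow_all = True
            continue
        # Anything that is not a keyword is a list of literal characters.
        chars = KEYWORDS.get(body, set(body))
        (denied if negated else allowed).update(chars)
    for ch in content:
        if ch in denied or not (allow_all or ch in allowed):
            return ch
    return None


if __name__ == "__main__":
    print(first_illegal_char("{LOWERCASE}{-_.}", "file_name.txt"))
    print(first_illegal_char("{LOWERCASE}{-_.}", "file_name-1.txt"))
    print(first_illegal_char("{PCDATA}!{DIGIT}", "abc 4"))
```

An empty specification allows no characters at all, so any non-empty content is rejected at its first character.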

A tag's content may also be constrained using regular expressions. These are specified using the rgx keyword as follows: rgx="regular expression". A tag may only have one regular expression and the entire content of the tag must match this expression. The table below provides some examples of regular expressions. To read more about regular expressions, see OROMatcher's User Guide or Perl Documentation on Regular Expressions.

Regular Expression	Meaning
word	Matches word and nothing else.
This phrase\.	Matches This phrase. and nothing else, including the different cases, whitespace, and punctuation. Note the backslash before the period. This is required because the period has special meaning to the regular expression engine. It signifies a match of anything except a newline.
[a-c0-2]+	Matches any string of at least one character in length composed of the characters and digits a, b, c, 0, 1, 2. So, a011bc2b would be a match but a011bc3b would not because of the 3.
[A-C8-9]{3}	Matches any string of length 3 characters from the set A, B, C, 8, or 9.
[A-C8-9]{3,5}	Matches any string of length 3, 4, or 5 characters from the set A, B, C, 8, or 9.
[A-C8-9]{3,}	Matches any string of length 3 or more characters from the set A, B, C, 8, or 9.
[0-9]{3}-[0-9]{2}-[0-9]{4}	Matches the Social Security Number format: 000-00-0000.
he[l]*p	Matches the word help with an unlimited number of l's, including none. For example, hep and helllllp would both match.
&(?![a-zA-Z]+;)	This example uses a negative lookahead (?! ... ) to find instances of an & that is not followed by a sequence of letters and a semicolon.
(?!Further Reading|NAICS Code\(s\))[A-Za-z \(\)]+	This example uses a negative lookahead (?! ... ) to match any string containing upper and lower case letters, spaces, and parens except for the phrases "Further Reading" and "NAICS Code(s)".
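Several of the table's patterns can be tried out with a short Python sketch. Python's re module is used here only as a stand-in for the Perl-style engine the validator relies on. The patterns shown behave the same way in both, but that equivalence should not be assumed for every expression. The helper name content_matches is hypothetical. It uses re.fullmatch because the entire content of the tag must match the expression.

```python
import re


def content_matches(pattern, content):
    """True when the whole of content matches pattern."""
    return re.fullmatch(pattern, content) is not None


if __name__ == "__main__":
    checks = [
        (r"[a-c0-2]+", "a011bc2b"),
        (r"[a-c0-2]+", "a011bc3b"),
        (r"This phrase\.", "This phrase."),
        (r"This phrase\.", "this phrase."),
        (r"[A-C8-9]{3,5}", "AB9"),
        (r"[A-C8-9]{3,5}", "ABC899"),
        (r"[0-9]{3}-[0-9]{2}-[0-9]{4}", "000-00-0000"),
        (r"he[l]*p", "hep"),
        (r"he[l]*p", "helllllp"),
    ]
    for pattern, content in checks:
        print(pattern, repr(content), content_matches(pattern, content))

    # The lookahead example is a search pattern rather than a whole-content
    # match: it locates an & that does not begin an entity such as &amp;.
    unescaped = re.compile(r"&(?![a-zA-Z]+;)")
    print(bool(unescaped.search("AT&T")), bool(unescaped.search("AT&amp;T")))
```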

The three keywords optional, loose, and strict also affect the validation of a tag's content. The optional keyword instructs the validator that the tag's content is optional and that it should not issue an error if the tag has no content. Loose means that data validation should be carried out using the least restrictive method available. Strict means that the most restrictive method should be used.

The macro level, an application-level property accessed through Validation | Configuration, controls how data validation will be handled for all the elements in an SDD, i.e., whether the validator should use datatypes (loose) or regular expressions (strict) when validating content. If the SDD provides only one type of specification, either datatype or regular expression, then that type will be used when validating regardless of the application-level property. If the SDD provides both types, then the validator will use the type specified through the application-level property. The loose and strict keywords act as local overrides for this application-level property, instructing the validator to use the specified datatype, in the case of loose, or regular expression, in the case of strict, when validating a tag's content.

The table below summarizes the application's behavior in various circumstances: each cell lists the order in which the validator will select the different types of validation. The ultimate default if neither datatypes nor an rgx have been specified is an empty datatype.

Rows give the macro level; columns give the element level (keyword in SDD).
Macro Level	Default	Loose	Strict
Loose	1. Use DataType; 2. Use Regular Expression	1. Use DataType; 2. Use Regular Expression	1. Use Regular Expression; 2. Use DataType
Strict	1. Use Regular Expression; 2. Use DataType	1. Use DataType; 2. Use Regular Expression	1. Use Regular Expression; 2. Use DataType
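The selection rules in the table can be expressed as a small decision function. The Python sketch below is one reading of the table, not the validator's code. The names validation_order and select_method are hypothetical, and the element-level keyword is passed as None when the SDD gives neither loose nor strict.

```python
DATATYPE = "datatype"
REGEX = "regular expression"


def validation_order(macro_level, keyword=None):
    """Return the two methods in the order the table lists them.

    macro_level is "loose" or "strict" (the application-level property);
    keyword is "loose", "strict", or None (no element-level override).
    """
    effective = keyword if keyword is not None else macro_level
    if effective == "loose":
        return (DATATYPE, REGEX)
    if effective == "strict":
        return (REGEX, DATATYPE)
    raise ValueError(f"expected 'loose' or 'strict', got {effective!r}")


def select_method(macro_level, keyword, has_datatype, has_regex):
    """Pick the first method in table order that the SDD actually provides."""
    provided = {DATATYPE: has_datatype, REGEX: has_regex}
    for method in validation_order(macro_level, keyword):
        if provided[method]:
            return method
    # Neither was specified: the ultimate default is an empty datatype.
    return DATATYPE


if __name__ == "__main__":
    print(select_method("loose", None, True, True))
    print(select_method("loose", "strict", True, True))
    print(select_method("strict", None, True, False))
```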


Character Entities

In progress.


Sample SDD

In progress.
