[SCL] SCL syntax

Frank Farance frank at farance.com
Sun Feb 22 00:38:54 CST 2004


Pat-

I took a quick look at the syntax and noticed a couple of problems.  The first is that you shouldn't mandate a particular character set encoding; most programming languages don't.  For example:

        Any SCL Core, or core, expression is encoded as a sequence of Unicode characters ( UCS-2, from the Unicode 'base multilingual plane' according to ISO/IEC 10646-1:1999). The standard encoding is UTF-8 (ISO 10646-1:2000 Annex D ) . Only characters in the US-ASCII subset are reserved for special use in the core itself, so that the language can be encoded as an ASCII text string if required. UCS-2 characters outside that range are represented in ASCII text by a character coding sequence of the form \unnnn or \unnnnnn where n is a hexadecimal digit character. When transforming an ASCII text string to UTF-8, such a sequence should be replaced by the corresponding UTF-8 byte encoding. This document uses the ASCII encoding.

The compatibility of encodings is properly handled by C and C++ (but not Java or JavaScript).  Furthermore, UCS-2 is obsolete and *should not* be used.  As for ASCII, ISO/IEC 8859-1, UTF-16, UTF-32, or UTF-8, they are all fine -- just say you want the *repertoire* of 10646-1, not any specific encoding.
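
To make the repertoire-versus-encoding distinction concrete, here is a minimal sketch in Python (my choice of language for illustration; the sample string is an assumption, not SCL text).  The same sequence of characters survives a round trip through several encodings; only the byte counts differ.

```python
# Illustrative only: the sample string is an assumption, not SCL text.
text = "caf\u00e9 \u03a9"  # contains U+00E9 and U+03A9

# The abstract characters, independent of any encoding.
code_points = [ord(ch) for ch in text]

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = text.encode(encoding)
    decoded = encoded.decode(encoding)
    # Byte lengths differ between encodings, but the characters do not.
    assert [ord(ch) for ch in decoded] == code_points
    print(encoding, len(encoded), "bytes")
```

A specification that names only the repertoire leaves the choice among these encodings to the implementation.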

And, yes, character sequences like "\unnnn" or "\Unnnnnnnn" (note the capital 'U') should be supported (but not the sequence "\unnnnnn").

Getting back to the syntax problems: the modern way of specifying this kind of syntax is to use several phases of translation.  I believe I've mentioned this before on this list.  If you try to put it all in one translation phase, you'll run into several sticky specification problems, such as awkward syntaxes and ambiguous provisions (e.g., requirements).  Here's an example from the standard for the C programming language (ISO/IEC 9899), which is the most widely implemented programming language:

        5.1.1.2 Translation phases

        The precedence among the syntax rules of translation is specified by the following phases. [footnote 1]

                1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

                2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.

                3. The source file is decomposed into preprocessing tokens [footnote 2] and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.

                4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.

                5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character. [footnote 3]

                6. Adjacent string literal tokens are concatenated.

                7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

                8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

        [footnote 1] Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.

        [footnote 2] As described in 6.4, the process of dividing a source file’s characters into preprocessing tokens is context-dependent. For example, see the handling of < within a #include preprocessing directive.

        [footnote 3] An implementation need not convert all non-corresponding source characters to the same execution character.

Again, I suggest that you consider a multiple-phase approach (but not as many phases as C).  Right now, it appears that you are attempting to put this all in one translation phase -- it just won't be doable in any straightforward way.
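
As a rough sketch of what a small number of phases might look like, the following Python separates decoding, tokenizing, and parsing into independent steps.  The function names and the token pattern are hypothetical and are not taken from the SCL draft; the point is only that each phase can be specified on its own.

```python
import re

# Hypothetical token pattern: a parenthesis, or a run of characters that
# are neither white space nor parentheses.  Not the SCL lexical grammar.
TOKEN = re.compile(r"[()]|[^\s()]+")


def phase1_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Phase 1: map physical bytes to characters."""
    return raw.decode(encoding)


def phase2_tokenize(chars: str) -> list[str]:
    """Phase 2: split characters into lexical tokens, dropping separators."""
    return TOKEN.findall(chars)


def phase3_parse(tokens: list[str]) -> list:
    """Phase 3: group tokens into nested lists, one per parenthesized form."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise SyntaxError("unbalanced ')'")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise SyntaxError("missing ')'")
    return stack[0]


def translate(raw: bytes) -> list:
    """Run the three phases in order."""
    return phase3_parse(phase2_tokenize(phase1_decode(raw)))


print(translate(b"(and (P x) (Q y))"))
```

Because the encoding is consumed entirely in phase 1, the later phases never need to mention it.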

I don't understand why you have the restriction:

      The double-quote symbol " is prohibited in SCL Core. This permits quoted SCL Core syntax to be included inside XHTML markup without being rendered by a Web browser. Otherwise, SCL Core syntax permits the use of the full Unicode character set. 

One should never have the expectation that text can be inserted literally into another context.  There are always "escaping" and "expansion" functions to insert text.  For example, an SCL statement that includes the character sequence:

        <!--

is likely to have problems.  I can cite other examples, too.  So the restriction on double quote just seems artificial and doesn't seem to solve a real problem (in my opinion).
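
For illustration, here is how an escaping function handles this when text is embedded in markup.  The sketch uses Python's standard `html.escape`; the sample text is a made-up SCL-like string, not an example from the draft.

```python
from html import escape, unescape

# Hypothetical SCL-like text containing characters that are awkward in markup.
scl_text = '(name "x") <!-- (P & Q)'

# Escape on the way in; the double quote needs no special prohibition.
embedded = escape(scl_text, quote=True)
print(embedded)

# Unescaping recovers the original text exactly.
assert unescape(embedded) == scl_text
```

With an escaping step like this, neither the double quote nor the `<!--` sequence has to be restricted in the language itself.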

Some of the terminology in the document is non-standard.  I'd prefer standard terminology to be used (ISO/IEC 2382-15 Information Technology Vocabulary -- Programming Languages).  Here are some of the terms and their definitions (yes, I know we can all nit-pick, but these have broad agreement and use):

        15.01.01
        lexical token, lexical element, lexical unit: A string of one or more characters of the alphabet of a programming language that, by convention, represents an elemental unit of meaning.  Examples: A literal such as 2G5 or an identifier such as last_name in Pascal.

        15.01.02
        language construct: A syntactically allowable part of a program that may be formed from one or more lexical tokens in accordance with the rules of a programming language.

        15.01.03
        identifier (in programming languages): A lexical token that names a language construct.  Examples: The names of variables, arrays, records, labels, procedures, etc.  NOTE – An identifier usually consists of a letter optionally followed by letters, digits, or other characters.

        15.01.04
        predefined identifier: An identifier that is defined as part of a programming language.  Example: A reserved word.  NOTE – If a predefined identifier is not reserved, then a declaration using that identifier redefines its meaning for the scope of the declaration.

        15.01.05
        reserved word: A predefined identifier that cannot be redefined by a programmer.  NOTE – Not all programming languages have reserved words.

        15.01.06
        delimiter (in programming languages), separator (deprecated in this sense): A lexical token that indicates the beginning or the end of another lexical token or of a character string considered as a syntactic unit.  NOTE 1: Special characters or reserved words may serve as delimiters.  NOTE 2: Contrast with separator.

        15.01.07
        separator: A delimiter that prevents adjacent lexical tokens or syntactic units from being interpreted as a single item.  Examples: The space character or a format effector.  NOTE – Contrast with delimiter.

        15.01.11
        comment, remark: A language construct exclusively used to include text that has no intended effect on the execution of the program.  Examples: An explanation to a human reader; data for an automatic documentation system.

So "lexical token" should be used instead of "lexical item" or "lexeme", and "separator" (or, possibly, "delimiter") should be used instead of "interlexical character".

When you state:

        open and closing parenthesis (U+0028 and U+0029)

are you aware that 10646 says that the meaning of U+0028 and U+0029 may be reversed if the script in use is a right-to-left script rather than a left-to-right script?  Considering that you want to support all of 10646, what is your solution?
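
As a quick illustration of the property in question, Python's standard `unicodedata` module exposes the "mirrored in bidirectional text" flag from the Unicode character database.  This only shows that the parentheses carry the flag; it says nothing about how SCL should deal with it.

```python
import unicodedata

# mirrored() returns 1 for characters whose glyph is mirrored in
# right-to-left text, and 0 otherwise.
for ch in "()a":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.mirrored(ch))
```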

Regarding whitespace, you probably want to have a translation phase that understands some notion of "logical lines".  Then you can include all the format effectors (tab, form feed, line feed, carriage return, etc.) and some of the characters you've missed, such as characters from the C1 character set (e.g., U+0085 is Next Line).
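
A minimal sketch of such a "logical lines" phase in Python follows.  The set of terminators is my assumption (CR LF, CR, LF, NEL, plus the Unicode line and paragraph separators); the draft would need to state its own list, and tab and form feed are left to be treated as ordinary separators within a line.

```python
import re

# Assumed set of line terminators: CR LF, CR, LF, NEL (U+0085),
# LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029).
# CR LF is listed first so that it counts as one terminator, not two.
LINE_END = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")


def to_logical_lines(text: str) -> list[str]:
    """Split text at any recognized line terminator."""
    return LINE_END.split(text)


sample = "(P a)\r\n(Q b)\u0085(R c)\r(S d)"
print(to_logical_lines(sample))
```

Later phases then see a uniform sequence of lines regardless of which terminator the source used.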

The following syntax is ambiguous:

        The backslash \ (reverse solidus U+005C) character is reserved for special use. Followed by the lowercase letter u (U+0075) and a four- or six-digit hexadecimal code, it is used to transcribe non-ASCII UTF-8 characters in an ASCII character stream, as explained above. Any string of this form plays the same SCL syntactic role in an ASCII string rendering as a single ordinary character. The combination \' is used to encode a single quote inside an SCL quoted string, and the double-backslash \\ indicates the backslash character itself.

How do I write U+1234 followed by the characters '5' and '6'?  The obvious spelling could equally be read as a single six-digit code:

        \u123456

I suggest that you use the convention already established in several programming language standards:

        \uhhhh  (for characters in the range U+0000 to U+FFFF)
        \Uhhhhhhhh (for all characters)
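
With fixed lengths the ambiguity disappears.  The following Python sketch expands the two forms; the function name is hypothetical, and the sketch deliberately ignores the interaction with the double-backslash escape, which a real specification would have to order explicitly.

```python
import re

# \u followed by exactly four hex digits, or \U followed by exactly eight.
UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_ucn(text: str) -> str:
    """Replace each \\uhhhh or \\Uhhhhhhhh with the character it names."""
    def repl(match: re.Match) -> str:
        digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(digits, 16))
    return UCN.sub(repl, text)


# \u takes exactly four digits, so the trailing "56" stays literal.
assert expand_ucn(r"\u123456") == "\u1234" + "56"
# Characters outside the range U+0000 to U+FFFF use the eight-digit form.
assert expand_ucn(r"\U0001F600") == "\U0001F600"
```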

Regarding the EBNF metasyntax, you've forgotten the terminator symbol (;):

        OLD:
        special = "@" | "'" | "="

        NEW:
        special = "@" | "'" | "=" ;

In addition, since you are using 10646, you probably don't want to limit your identifiers to ASCII-only characters; you probably want to allow other characters in identifiers.  This work has already been researched.  The current list includes extended letters, extended digits, and extended special characters:

Extended non-digit, non-special (e.g., letters):

 	Latin: 0041-005A, 0061-007A, 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
 	Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
 	Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
 	Armenian: 0531-0556, 0561-0587
 	Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2
 	Arabic: 0621-063A, 0640-0652, 0670-06B7, 06BA-06BE, 06C0-06CE, 06D0-06DC, 06E5-06E8, 06EA-06ED
 	Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
 	Bengali: 0981-0983, 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2, 09B6-09B9, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD, 09DF-09E3, 09F0-09F1
 	Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36, 0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D, 0A59-0A5C, 0A5E, 0A74
 	Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD, 0AD0, 0AE0
 	Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D, 0B5F-0B61
 	Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
 	Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D, 0C60-0C61
 	Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
 	Malayalam: 0D02-0D03, 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D, 0D60-0D61
 	Thai: 0E01-0E3A, 0E40-0E5B
 	Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0-0EB9, 0EBB-0EBD, 0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
 	Tibetan: 0F00, 0F18-0F19, 0F35, 0F37, 0F39, 0F3E-0F47, 0F49-0F69, 0F71-0F84, 0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
 	Georgian: 10A0-10C5, 10D0-10F6
 	Hiragana: 3041-3093, 309B-309C
 	Katakana: 30A1-30F6, 30FB-30FC
 	Bopomofo: 3105-312C
 	CJK Unified Ideographs: 4E00-9FA5
 	Hangul: AC00-D7A3

The following are digit characters:
 	Digits: 0030-0039, 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F33

The following are special characters:
 	Special characters: 005F, 00B5, 00B7, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029
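
Tables like these translate directly into a range lookup.  The Python sketch below uses only a small excerpt of the ranges listed above, and the rule that an identifier may not start with a digit is my assumption (borrowed from common programming-language practice), not something stated in the SCL draft.

```python
# A small excerpt of the ranges listed above; a real table would carry them all.
LETTER_RANGES = [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
                 (0x038E, 0x03A1), (0x03A3, 0x03CE), (0x4E00, 0x9FA5)]
DIGIT_RANGES = [(0x0030, 0x0039), (0x0660, 0x0669)]
SPECIAL_RANGES = [(0x005F, 0x005F), (0x00B7, 0x00B7)]


def in_ranges(ch, ranges):
    """True if the code point of ch falls inside any (low, high) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_identifier(text):
    """Assumed rule: first character is a letter or special character;
    the remaining characters may also be digits."""
    if not text:
        return False
    start = LETTER_RANGES + SPECIAL_RANGES
    rest = start + DIGIT_RANGES
    return in_ranges(text[0], start) and all(in_ranges(c, rest) for c in text[1:])


print(is_identifier("last_name"), is_identifier("\u03b1\u03b2"), is_identifier("2x"))
```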

The "specialform" production is properly called a set of "predefined identifiers".

In your "simple logical expression written in introductory text-book form", I have no idea what that simple expression means because I don't know what the functions/relations/operators/expressions/whatever "Boy(x)", "Girl(x)", and "Kissed(x,y)" mean.  Rather than telling me the English equivalent of the logical statement (I've always doubted John Sowa's explanations of this kind in his CG translations), simply tell me what "Boy(x)", "Girl(x)", and "Kissed(x,y)" represent.

Regarding the following provision:

        Boolean sentences require implication and iff to be binary, but allow and and or to have any number of arguments (including zero).

What are the results of:

        - "and" with zero arguments
        - "or" with zero arguments
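
For comparison, one common convention (which may or may not be what the SCL draft intends) is that each operator with no arguments yields its identity element: true for "and", false for "or".  Python's built-ins follow that convention; the wrapper names here are hypothetical.

```python
def scl_and(*args: bool) -> bool:
    """Conjunction over any number of arguments; all(()) is True."""
    return all(args)


def scl_or(*args: bool) -> bool:
    """Disjunction over any number of arguments; any(()) is False."""
    return any(args)


print(scl_and(), scl_or())
```

Whatever the answer is, the draft should state it explicitly.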

In the production:

         char = special | hexa | ...

why is "hexa" included when it is already included in the alphabet?  Didn't you mean "digit" instead of "hexa"?

Regarding the use of backslash:

        Any occurrence of the backslash character \ not immediately followed by the character ' or u simply indicates the backslash character itself

don't you want to support line folding for long lines, e.g.:

'This is a very very \
long line that just \
keeps on going'
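
As a sketch, line folding of this kind is a one-line phase if it runs before tokenizing, in the same way as C's backslash-newline splicing.  The function name is hypothetical.

```python
def splice_lines(text: str) -> str:
    """Delete each backslash that is immediately followed by a newline."""
    return text.replace("\\\n", "")


folded = "'This is a very very \\\nlong line that just \\\nkeeps on going'"
print(splice_lines(folded))
```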


Enough for now ...

-FF

______________________________________________________________________
Frank Farance, Farance Inc.    T: +1 212 486 4700   F: +1 212 759 1605
mailto:frank at farance.com       http://farance.com
Standards/Products/Services for Information/Communication Technologies


