[SCL] SCL syntax
Pat Hayes
phayes at ihmc.us
Mon Feb 23 11:53:16 CST 2004
>Pat-
>
>I took a quick look at the syntax. I've noticed a couple problems.
Thanks for your input. I've made a number of detailed changes as a result.
>First is that you should mandate a particular
>character set encoding ... most programming
>languages don't.
Did you mean: should NOT mandate?
The text now reads
"Any SCL Core, or core, expression is encoded as
a sequence of Unicode characters as defined in
ISO/IEC 10646-1. Any character encoding which
supports the repertoire of ISO 10646-1 may be
used, but UTF-8 (ISO 10646-1:2000 Annex D) is
preferred."
> For example:
>
> Any SCL Core, or core, expression is
>encoded as a sequence of Unicode characters (
>UCS-2, from the Unicode 'base multilingual
>plane' according to ISO/IEC 10646-1:1999). The
>standard encoding is UTF-8 (ISO 10646-1:2000
>Annex D ) . Only characters in the US-ASCII
>subset are reserved for special use in the core
>itself, so that the language can be encoded as
>an ASCII text string if required. UCS-2
>characters outside that range are represented in
>ASCII text by a character coding sequence of the
>form \unnnn or \unnnnnn where n is a hexadecimal
>digit character. When transforming an ASCII text
>string to UTF-8, such a sequence should be
>replaced by the corresponding UTF-8 byte
>encoding. This document uses the ASCII encoding.
>
>The compatibility of encodings is properly
>handled by C and C++ (but not Java or
>Javascript). Furthermore, UCS-2 is obsolete and
>*should not* be used.
OK, I took UCS-2 from other standard
specifications but no doubt they were out of
date. The intention is to be as wide-ranging as
possible.
> Regarding ASCII, ISO/IEC 8859-1, UTF-16,
>UTF-32, or UTF-8, they are all fine -- just say
>you want the *repertoire* of 10646-1, not any
>specific encoding.
Ah, I see how it goes. OK, I am happy with that.
"Repertoire" in this context is new to me, so if
the above wording is inappropriate please suggest
an improvement.
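
To illustrate the distinction as I understand it, here is a small sketch in Python. It is not part of the SCL text; the sample string and the helper names round_trips and fits are made up for the illustration. The repertoire is the set of characters; an encoding is just one way of serializing those characters as bytes.

```python
# Illustrative sample: a quantifier symbol, ASCII text and an accented letter.
SAMPLE = "(\u2200 x) caf\u00e9"


def round_trips(text, encoding):
    """True if encoding and then decoding gives back the same characters."""
    return text.encode(encoding).decode(encoding) == text


def fits(text, encoding):
    """True if every character of the text can be represented in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The three UTF encodings all carry the full repertoire; only the bytes differ.
for name in ("utf-8", "utf-16", "utf-32"):
    print(name, round_trips(SAMPLE, name), len(SAMPLE.encode(name)), "bytes")

# ASCII and Latin-1 cover only part of the repertoire.
print("ascii", fits(SAMPLE, "ascii"))
print("latin-1", fits(SAMPLE, "latin-1"))
```

The last two cases are where an escape notation such as \uhhhh becomes necessary.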
>
>And, yes, character sequences like "\unnnn" or
>"\Unnnnnnnn" (note the capital 'U') should be
>supported (but not the sequence "\unnnnnn")
OK, I have made that change.
>
>Getting back to the syntax problems: the modern
>way of specifying this kind of syntax is to
>use several phases of translation.
Well, we have done this, in effect, by splitting
the lexicalization from the logical syntax.
Although I'm not sure what you mean by
'translation'. (?)
For SCL, it does not seem appropriate to specify
a detailed algorithmic process such as the
example you give below; particularly as the
processes used by applications might vary, and
SCL text might be embedded in other notational or
even lexical systems which mandate their own
local character- or string-handling processes.
> I believe I've mentioned this before on this
>list. If you try to put it all in one
>translation phase, you'll run into several
>sticky specification problems, such as awkward
>syntaxes and ambiguous provisions (e.g.,
>requirements). Here's an example from the C
>programming language standard (ISO/IEC 9899),
>which is the most widely implemented programming
>language:
>
> 5.1.1.2 Translation phases
>
> The precedence among the syntax rules of
>translation is specified by the following
>phases. [footnote 1]
>
> 1. Physical source file
>multibyte characters are mapped, in an
>implementation-defined manner, to the source
>character set (introducing new-line characters
>for end-of-line indicators) if necessary.
>Trigraph sequences are replaced by corresponding
>single-character internal representations.
>
> 2. Each instance of a backslash
>character (\) immediately followed by a new-line
>character is deleted, splicing physical source
>lines to form logical source lines. Only the
>last backslash on any physical source line shall
>be eligible for being part of such a splice. A
>source file that is not empty shall end in a
>new-line character, which shall not be
>immediately preceded by a backslash character
>before any such splicing takes place.
>
> 3. The source file is decomposed
>into preprocessing tokens [footnote 2] and
>sequences of white-space characters (including
>comments). A source file shall not end in a
>partial preprocessing token or in a partial
>comment. Each comment is replaced by one space
>character. New-line characters are retained.
>Whether each nonempty sequence of white-space
>characters other than new-line is retained or
>replaced by one space character is
>implementation-defined.
>
> 4. Preprocessing directives are
>executed, macro invocations are expanded, and
>_Pragma unary operator expressions are executed.
>If a character sequence that matches the syntax
>of a universal character name is produced by
>token concatenation (6.10.3.3), the behavior is
>undefined. A #include preprocessing directive
>causes the named header or source file to be
>processed from phase 1 through phase 4,
>recursively. All preprocessing directives are
>then deleted.
>
> 5. Each source character set
>member and escape sequence in character
>constants and string literals is converted to
>the corresponding member of the execution
>character set; if there is no corresponding
>member, it is converted to an
>implementation-defined member other than the
>null (wide) character. [footnote 3]
>
> 6. Adjacent string literal tokens are concatenated.
>
> 7. White-space characters
>separating tokens are no longer significant.
>Each preprocessing token is converted into a
>token. The resulting tokens are syntactically
>and semantically analyzed and translated as a
>translation unit.
>
> 8. All external object and
>function references are resolved. Library
>components are linked to satisfy external
>references to functions and objects not defined
>in the current translation. All such translator
>output is collected into a program image which
>contains information needed for execution in its
>execution environment.
>
> [footnote 1] Implementations shall
>behave as if these separate phases occur, even
>though many are typically folded together in
>practice.
>
> [footnote 2] As described in 6.4, the
>process of dividing a source file's characters
>into preprocessing tokens is context-dependent.
>For example, see the handling of < within a
>#include preprocessing directive.
>
> [footnote 3] An implementation need not
>convert all non-corresponding source characters
>to the same execution character.
>
>Again, I suggest that you consider a multiple
>phase approach (but not as many phases as C).
>Right now, it appears that you are attempting to
>put this all in one translation phase -- it just
>won't be doable in any straightforward way.
Actually into two. Lexicalization splits a
character stream into lexemes, discarding
whitespace, and the logical syntax parses a
lexeme sequence.
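
The two-stage split can be sketched as follows. This is only a toy in Python under my own assumptions: the token rule (parentheses as delimiters, whitespace as separator, everything else a name) and the function names lexicalize and parse are hypothetical, and quoted strings and escapes are left out. It is not the SCL grammar.

```python
import re

# Hypothetical token rule: a parenthesis, or a run of characters that are
# neither whitespace nor parentheses.
TOKEN = re.compile(r"[()]|[^\s()]+")


def lexicalize(chars):
    """Stage 1: split a character stream into tokens, discarding whitespace."""
    return TOKEN.findall(chars)


def parse(tokens):
    """Stage 2: parse a token sequence into nested lists."""

    def read(pos):
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if tok == "(":
            items = []
            pos += 1
            while pos < len(tokens) and tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            if pos >= len(tokens):
                raise SyntaxError("missing closing parenthesis")
            return items, pos + 1  # skip the ")"
        if tok == ")":
            raise SyntaxError("unexpected closing parenthesis")
        return tok, pos + 1

    expr, pos = read(0)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return expr


print(parse(lexicalize("(and (Boy x)\n   (Girl y))")))
```

The parser never sees whitespace or line breaks, so any alternative character-handling conventions only affect the first stage.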
>
>I don't understand why you have the restriction:
Er... I don't. It should have been deleted, sorry.
That was in a leftover note from an earlier
draft.
>
> The double-quote symbol " is prohibited in
>SCL Core. This permits quoted SCL Core syntax to
>be included inside XHTML markup without being
>rendered by a Web browser. Otherwise, SCL Core
>syntax permits the use of the full Unicode
>character set.
>
>One should never have the expectation that text
>can be inserted literally into another context.
>There are always "escaping" and "expansion"
>functions to insert text. For example, an SCL
>statement that includes the character sequence:
>
> <!--
>
>is likely to have problems. I can cite other
>examples, too. So the restriction on double
>quote just seems artificial and doesn't seem to
>solve a real problem (in my opinion).
>
>Some of the terminology in the document is
>non-standard. I'd prefer standard terminology
>to be used (ISO/IEC 2382-15 Information
>Technology Vocabulary -- Programming Languages
Well, but SCL isn't a programming language, so does this apply?
>). Here's some of the terms and their
>definitions (yes, I know we can all nit-pick,
>but these have broad agreement and use):
>
> 15.01.01
> lexical token, lexical element, lexical
>unit: A string of one or more characters of the
>alphabet of a programming language that, by
>convention, represents an
>elemental unit of meaning. Examples: A literal
>such as 2G5 or an identifier such as last_name
>in Pascal.
This is what I call a lexeme. I guess 'lexical
token' would be fine. I'll make that change.
>
> 15.01.02
> language construct: A syntactically
>allowable part of a program that may be formed
>from one or more lexical tokens in accordance
>with the rules of a programming language.
>
> 15.01.03
> identifier (in programming languages): A
>lexical token that names a language construct.
>Examples: The names of variables, arrays,
>records, labels,
>procedures, etc.
Sorry, but even if this is an ISO standard I
would ignore it, since it is incoherent. A
language construct is a syntactically well-formed
part of a program, but examples include arrays,
records and procedures?? This is nonsense. Like
many things written by engineers, it confuses use
and mention. Arrays and procedures are not
syntactic parts of programs, they are *denoted*
by program text. They are part of a run-time
virtual machine state, not part of a program.
But in any case, logical languages like SCL do
not contain any identifiers, so the issue is moot.
>NOTE An identifier usually consists of a letter optionally followed
>by letters, digits, or other characters.
>
> 15.01.04
> predefined identifier: An identifier
>that is defined as part of a programming
>language. Example: A reserved word. NOTE If
>a predefined identifier is not reserved, then a
>declaration using that identifier redefines its
>meaning for the scope of the
>declaration.
>
> 15.01.05
> reserved word: A predefined identifier that cannot be redefined by a
>programmer. NOTE Not all programming languages have reserved words.
I used 'reserved element' rather than 'word'
because these things are not considered to be
logical words (= names).
>
> 15.01.06
> delimiter (in programming languages),
>separator (deprecated in this sense): A lexical
>token that indicates the beginning or the end of
>another lexical token or of a character string
>considered as a syntactic unit. NOTE 1: Special
>characters or reserved words may serve as
>delimiters. NOTE 2: Contrast with separator.
>
> 15.01.07
> separator: A delimiter that prevents
>adjacent lexical tokens or syntactic
>units from being interpreted as a single item.
>Examples: The space character or a format
>effector. NOTE Contrast with delimiter.
This is a useful distinction and I will
incorporate it, although again these definitions
are incoherent as written (if a space is a
delimiter, then it must by 15.01.06 be a lexical
token; but then what sense can be attributed to
the use of 'contrast with' in that definition?)
Parentheses and single-quotes are delimiters, whitespace is a separator.
>
> 15.01.11
> comment, remark: A language construct
>exclusively used to include text that has no
>intended effect on the execution of the program.
>Examples: An explanation to a human reader; data
>for an automatic documentation system.
>
>So "lexical token" should be used instead of
>"lexical item" or "lexeme", and "separator" (or,
>possibly, "delimiter") should be used instead of
>"interlexical character".
Right
>
>When you state:
>
> open and closing parenthesis (U+0028 and U+0029)
>
>are you aware that 10646 says that the meaning
>of U+0028 and U+0029 may be reversed if the
>script in use is a right-to-left script rather
>than a left-to-right script? Considering that
>you want to support all of 10646, what is your
>solution?
SCL core syntax is a left-to-right script. What is the problem?
>
>Regarding whitespace, you probably want to have
>a translation phase that understands some notion
>of "logical lines". Then you can include all
>the format effectors (tab, form feed, line feed,
>carriage return, etc.) and some of the
>characters you've missed, such as characters
>from the C1 character set (e.g., U+0085 is Next
>Line).
>
>The following syntax is ambiguous:
>
> The backslash \ (reverse solidus U+005C)
>character is reserved for special use. Followed
>by the lowercase letter u (U+0075) and a four-
>or six-digit hexadecimal code, it is used to
>transcribe non-ASCII UTF-8 characters in an
>ASCII character stream, as explained above. Any
>string of this form plays the same SCL syntactic
>role in an ASCII string rendering as a single
>ordinary character. The combination \' is used
>to encode a single quote inside an SCL quoted
>string, and the double-backslash \\ indicates
>the backslash character itself.
>
>How do I write U+1234 followed by the characters '5' and '6':
>
> \u123456
>
>I suggest that you use the convention already
>established in several programming language
>standards:
>
> \uhhhh (for characters in the range U+0000 to U+FFFF)
> \Uhhhhhhhh (for all characters)
Good point. Will do.
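
A minimal sketch in Python of why the fixed-width forms remove the ambiguity. The name decode_escapes is hypothetical, and the sketch deliberately ignores the \' and \\ combinations: \u always consumes exactly four hexadecimal digits and \U exactly eight, so whatever follows is ordinary text.

```python
import re

# \u followed by exactly 4 hex digits, or \U followed by exactly 8.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(ascii_text):
    """Replace each \\uhhhh or \\Uhhhhhhhh sequence by the character it names."""

    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return ESCAPE.sub(replace, ascii_text)


# U+1234 followed by the ordinary characters '5' and '6': no ambiguity.
print(decode_escapes(r"\u123456") == "\u1234" + "56")

# A character outside the range U+0000 to U+FFFF needs the eight-digit form.
print(decode_escapes(r"\U0001D11E") == "\U0001D11E")
```

With the old four-or-six-digit rule, the first input could equally have been read as a single six-digit code.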
>
>Regarding the EBNF metasyntax, you've forgotten the terminator symbol (;):
>
> OLD:
> special = "@" | "'" | "="
>
> NEW:
> special = "@" | "'" | "=" ;
>
>In addition, since you are using 10646, you
>probably don't want to limit your identifiers to
>ASCII-only characters
I don't. The note by the nonascii production
indicates that any Unicode non-control character
can be counted as a character in a namestring. In
fact, I would be happy to allow pictures, movies
and sound files as logical names, if there were
some principled way to do so.
>, you probably want to include other characters
>as identifiers. This work has already been
>researched. The current list includes extended
>letters, extended digits, and extended special
>characters:
Why only these?
>
>Extended non-digit, non-special (e.g., letters):
>
> Latin: 0041-005A, 0061-007A, 00AA, 00BA,
>00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217,
>0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
> Greek: 0386, 0388-038A, 038C, 038E-03A1,
>03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0,
>03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45,
>1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D,
>1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4,
>1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC,
>1FF2-1FF4, 1FF6-1FFC
> Cyrillic: 0401-040C, 040E-044F,
>0451-045C, 045E-0481, 0490-04C4, 04C7-04C8,
>04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
> Armenian: 0531-0556, 0561-0587
> Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2
> Arabic: 0621-063A, 0640-0652, 0670-06B7,
>06BA-06BE, 06C0-06CE, 06D0-06DC, 06E5-06E8,
>06EA-06ED
> Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
> Bengali: 0981-0983, 0985-098C, 098F-0990,
>0993-09A8, 09AA-09B0, 09B2, 09B6-09B9,
>09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD,
>09DF-09E3, 09F0-09F1
> Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10,
>0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36,
>0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D,
>0A59-0A5C, 0A5E, 0A74
> Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D,
>0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3,
>0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD,
>0AD0, 0AE0
> Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10,
>0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39,
>0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D,
>0B5F-0B61
> Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90,
>0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F,
>0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9,
>0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
> Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10,
>0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C3E-0C44,
>0C46-0C48, 0C4A-0C4D, 0C60-0C61
> Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90,
>0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CBE-0CC4,
>0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
> Malayalam: 0D02-0D03, 0D05-0D0C,
>0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D3E-0D43,
>0D46-0D48, 0D4A-0D4D, 0D60-0D61
> Thai: 0E01-0E3A, 0E40-0E5B
> Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A,
>0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5,
>0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0-0EB9,
>0EBB-0EBD, 0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
> Tibetan: 0F00, 0F18-0F19, 0F35, 0F37,
>0F39, 0F3E-0F47, 0F49-0F69, 0F71-0F84,
>0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD,
>0FB1-0FB7, 0FB9
> Georgian: 10A0-10C5, 10D0-10F6
> Hiragana: 3041-3093, 309B-309C
> Katakana: 30A1-30F6, 30FB-30FC
> Bopomofo: 3105-312C
> CJK Unified Ideographs: 4E00-9FA5
> Hangul: AC00-D7A3
>
>The following are digit characters:
> Digits: 0030-0039, 0660-0669, 06F0-06F9,
>0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF,
>0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF,
>0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F33
OK, but if I allow these to be used as digits in
numeral strings then I need to specify their
interpretation. I thought of that, but decided
that the Arabic digits were almost universally
used in commercial and scientific applications,
so decided that it was best to require that
Arabic numbers are the only SCL-recognized forms
for digits.
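
To make the restriction concrete, here is a sketch in Python. The function names is_scl_digit and is_scl_numeral are hypothetical, and "Arabic digits" is taken to mean the characters 0030-0039 from the list above, not the Arabic-Indic digits 0660-0669. General-purpose Unicode tests accept the other digit ranges too, so an implementation would have to check the range explicitly.

```python
import unicodedata


def is_scl_digit(ch):
    """True only for the digits U+0030 to U+0039."""
    return "0" <= ch <= "9"


def is_scl_numeral(text):
    """True if the text is a non-empty string of U+0030..U+0039 digits."""
    return len(text) > 0 and all(is_scl_digit(ch) for ch in text)


arabic_indic_three = "\u0663"  # from the range 0660-0669 in the list above

# Unicode itself classifies this character as a digit with value 3 ...
print(arabic_indic_three.isdigit(), unicodedata.digit(arabic_indic_three))

# ... but under the restriction it is not part of a numeral.
print(is_scl_numeral("2004"), is_scl_numeral(arabic_indic_three))
```

Such characters would presumably still be usable inside names; they just would not be read as numerals.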
>
>The following are special characters:
> Special characters: 005F, 00B5, 00B7,
>02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1,
>02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE,
>203F-2040, 2102, 2107, 210A-2113, 2115,
>2118-211D, 2124, 2126, 2128, 212A-2131,
>2133-2138, 2160-2182, 3005-3007, 3021-3029
>
>The "specialform" production is properly called
>a set of "predefined identifiers".
Well, not really. First, they aren't identifiers:
second, I don't want to use a word which is a
cognate of "define" here, since that often
communicates an erroneous impression. In a sense,
the only SCL meanings that are *defined* are
those of the logical syntax itself.
I will revert to the previous terminology of
'special name', which is blandly uninformative.
>
>In your "simple logical expression written in
>introductory text-book form", I have no idea
>what that simple expression means because I
>don't know what the
>functions/relations/operators/expressions/whatever
>"Boy(x)", "Girl(x)", and "Kissed(x,y)" mean.
Boy(y) is intended to mean 'y is a Boy' and so on.
>Rather than telling me the English equivalent to
>the logical statement (I have always doubted
>John Sowa's explanations in his CG translations),
>simply tell me what "Boy(x)", "Girl(x)", and
>"Kissed(x,y)" represent.
There is nothing to tell. They represent whatever
you want to interpret them as meaning, in fact.
Logical names aren't PL identifiers: they don't
have computationally locatable entities attached
to them which determine their meaning.
>
>Regarding the following provision:
>
> Boolean sentences require implication
>and iff to be binary, but allow and and or to
>have any number of arguments (including zero).
>
>What are the results of:
>
> - "and" with zero arguments
> - "or" with zero arguments
I don't understand the question. What results do you mean?
The truth-conditions are that (and) is vacuously
true and (or) is vacuously false. I will add some
text to draw attention to this (not written yet).
>In the production:
>
> char = special | hexa | ...
>
>why is "hexa" included when it is already
>included in the alphabet? Didn't you mean
>"digit" instead of "hexa"?
Yes, thanks.
>
>Regarding the use of backslash:
>
> Any occurrence of the backslash
>character \ not immediately followed by the
>character ' or u simply indicates the backslash
>character itself
>
>don't you want to support line folding for long lines, e.g.:
>
>'This is a very very \
>long line that just \
>keeps on going'
Actually, no. I think this is an anachronism
(with its historical roots in the age of teletype
machines), given the ubiquity of line-wrapped
display modes for text. All my text-processors
treat paragraphs as extremely long lines and do a
graphical text-fit to the viewing window.
Above changes now made.
Pat
--
---------------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973 home
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32501 (850)291 0667 cell
phayes at ihmc.us http://www.ihmc.us/users/phayes