Character set and encoding

Ocean programs are always written in UTF-8. Consequently, an ASCII-only program is also a valid Ocean text.
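As a small illustration of the ASCII compatibility, the following Python sketch checks whether a byte sequence is well-formed UTF-8. The function name `is_strict_utf8` and the sample inputs are assumptions made for this example; they are not part of Ocean. Note that the check is stricter than the rules in this section, since it does not allow for the special conditions described further on.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every ASCII byte is also a complete one-byte UTF-8 sequence.
ascii_source = "x = 1".encode("ascii")

# The same is not true for other legacy encodings: a lone 0xE9
# (Latin-1 "e with acute accent") is not well-formed UTF-8.
latin1_source = "\u00e9".encode("latin-1")

print(is_strict_utf8(ascii_source))
print(is_strict_utf8(latin1_source))
```

A source file in a legacy encoding such as Latin-1 would therefore have to be converted to UTF-8 before it is handed to the compiler.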

The reason for this restriction is simple: it is not the job of the compiler to support x different language environments on y different platforms. However, almost all platforms have (free) programs that can convert text to UTF-8, and UTF-8 gives access to the whole of Unicode while remaining backwards compatible with ASCII and being completely insensitive to byte order (endianness).

The core of Ocean is written in ASCII for easy implementation. However, many aliases and dyadic operators use characters outside ASCII.

A known problem is that Unicode is a changing definition. The solution is to support only a limited set of Unicode characters as valid language elements; all texts that have no direct meaning to the language (like text literals, enumerated values, and some labels) can use the full range of Unicode.

Special condition: first bytes

Please note that the parser should skip a UTF-8 encoded BOM (byte order mark: u0FEFF, encoded as the bytes EF BB BF) when it appears as the first bytes of a file/stream, but that this character is illegal anywhere else in the file.
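The BOM rule can be sketched in Python as follows. This is a minimal sketch, not the Ocean parser: the function name `strip_leading_bom` is hypothetical, and it assumes that exactly one leading BOM is skipped and that a later BOM is reported as an error by raising `ValueError`.

```python
import codecs


def strip_leading_bom(data: bytes) -> bytes:
    """Skip one leading UTF-8 BOM and reject a BOM anywhere else.

    Assumption: only a single leading BOM is skipped.
    """
    bom = codecs.BOM_UTF8  # the bytes EF BB BF
    if data.startswith(bom):
        data = data[len(bom):]
    # In well-formed UTF-8 the byte sequence EF BB BF can only be the
    # encoding of the BOM character, so a plain byte search is enough.
    if bom in data:
        raise ValueError("BOM is only allowed as the first bytes of the input")
    return data
```

Working on the raw bytes before decoding keeps the check independent of how the rest of the stream is decoded.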

Special condition: last byte

If the last byte of the file is a SUB (substitute, ^Z: u0001A), it is ignored (thrown away).
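A minimal Python sketch of this rule is shown here. The name `strip_trailing_sub` is hypothetical, and the sketch assumes that only the single final byte is dropped, so a SUB that precedes it is left in place for the parser to deal with.

```python
SUB = b"\x1a"  # Substitute, ^Z


def strip_trailing_sub(data: bytes) -> bytes:
    """Drop a single SUB byte if it is the very last byte of the input."""
    if data.endswith(SUB):
        return data[:-1]
    return data
```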

Special condition: multiple encoding

It is not an error if the file uses surrogate pairs (see the UTF-16 encoding, where they encode u10000-u10FFFF) within the UTF-8 stream, but the compiler may warn against this practice. The end result is still a single Unicode character.
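One possible way to accept such input is sketched below in Python. The function name `decode_with_surrogates` and the returned warning flag are assumptions for illustration only. The sketch relies on Python's `surrogatepass` error handler to let the three-byte surrogate sequences through, and then round-trips the text through UTF-16 so that each high/low pair is combined into one character.

```python
from typing import Tuple


def decode_with_surrogates(data: bytes) -> Tuple[str, bool]:
    """Decode UTF-8 that may contain UTF-8 encoded surrogate pairs.

    Returns the decoded text and a flag telling the caller whether
    surrogates were seen, so that it can issue a warning.
    """
    # "surrogatepass" accepts the 3-byte encodings of u0D800-u0DFFF.
    text = data.decode("utf-8", "surrogatepass")
    saw_surrogates = any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)
    if saw_surrogates:
        # Writing the surrogates out as UTF-16 code units and decoding
        # them again merges each pair into a single character. An
        # unpaired surrogate makes the strict decode step raise
        # UnicodeDecodeError.
        text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text, saw_surrogates


# u1F600 written as the surrogate pair D83D DE00, each half in UTF-8.
pair_bytes = b"\xed\xa0\xbd\xed\xb8\x80"
text, warn = decode_with_surrogates(pair_bytes)
print(len(text), hex(ord(text)), warn)
```

The same character written directly as a four-byte UTF-8 sequence decodes to the identical result, only without the warning flag being set.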