Motivation
The following post is just an experiment in using a Lisp-like syntax, with hints of JSON, for storing structured data as text. The main motivation was the need for a less verbose alternative to XML that is still readable and editable (a weak point of JSON). It should also support pattern matching and common primitive datatypes.
The current specification has errors and is incomplete. The programming elements are currently in the toy stage.
Specification
(@text (#
Quick overview

  An experimental typed data format using s-expression-like syntax for storing structured data.

  A tuple can be a named, unnamed, or typed pair of parentheses that encloses an ordered sequence of whitespace-separated expressions.
  A typed tuple may be a @int, @float, @number, @bool, @bytes, @rule, @text, @note or @tuple.
  A named or unnamed tuple is an untyped @tuple tuple.
  A @note is used for documenting or ignoring tuples and may be skipped during processing.

  A pair may be a
    :   that separates a named key and value expression
    ==  that separates two expressions in an assertion
    ->  that separates two expressions in a substitution
    =>  that separates two expressions in a combination

  A named, unnamed and @tuple typed tuple may contain key-value pairs, untyped text, named, unnamed, @tuple, or @text tuples.

  A document must start with a typed tuple. If a document discards @note typed tuples and starts with one, an assertion failure must be raised.

Syntax

  ()           - encloses a named, unnamed or typed tuple
  (#           - begins a text block (nested text blocks are allowed but unbalanced text blocks must use double-quotation instead)
  #)           - ends a text block
  ""           - encloses a text string with escaped characters
  @text        - gives one or more text blocks or strings
  :            - a key-value pair
  =>           - rule that matches the left side and then combines it with the right side
  ->           - rule that matches the left side and then substitutes it with the right side
  ==           - rule that matches the left side and then asserts that the right side is present
  ''           - encloses a pattern matching variable that specifies a type or named reference
  @tuple       - gives a sequence of untyped text, @text, named, unnamed, @rule, or @note tuples
  ("unnamed")  - equivalent to @tuple
  (Named)      - equivalent to @tuple
  @note        - equivalent to @tuple
  @rule        - gives a sequence of rules
  @bytes       - gives a sequence of base64 encoded text containing unsigned bytes
  @int         - gives a sequence of signed integers in 32- or by default 64-bit
  @float       - gives a sequence of floating point numbers in 32- or by default 64-bit
  @number      - gives a sequence of unbounded numbers
  @bool        - gives a sequence of true or false

Pattern matching

  - Application of pattern matching is handled at the application level
  - Pattern matching variables can be used to
      (Data value: 'PI')                 - assign variable to the value of a pair
      (@rule 'PI' -> (@float 3.14))      - substitute a variable with an expression
      (@rule (Data) == (size: '@int'))   - assert that a tuple has a certain structure

Examples

  - Named tuples
      (Document title: "just another format" Document)
      (Dir name: "root"
        files: (
          (File name: "readme" data: (@bytes (#MTIz#)))
        )
      )
  - Unnamed tuples
      (title: "just another format")
      (name: "root"
        files: (
          (name: "readme" data: (@bytes (#MTIz#)))
        )
      )
  - Typed tuples in key-value pairs
      (Cell name: "a" value: (@int:32 42))
      (Screen fullscreen: (@bool false))
  - Untyped tuples in key-value pairs
      (Cell name: "b" value: 42)
      (Screen fullscreen: false)

Parsing syntax

  start ::= tuple-typed | rule-typed | int-typed | note-typed | float-typed | bytes-typed | bool-typed

  tuple ::= tuple-unnamed | tuple-named | tuple-typed
  tuple-unnamed ::= unnamed<tuple-seq>
  tuple-named ::= named<name,tuple-seq>
  tuple-typed ::= typed<tuple-type,tuple-seq>
  tuple-type ::= '@tuple'
  tuple-seq ::= {rule | note | text | pair | tuple, pad}

  unnamed<body> ::= '(' body ')'
  named<tag,body> ::= '(' tag (pad body)? (pad tag)? ')'
  typed<tag,body> ::= '(' tag (pad body)? (pad tag)? ')'

  name ::= [a-zA-Z][-A-Za-z0-9]*
  pair ::= name ':' pad (primitive | tuple)
  primitive ::= bytes | int | float | bool | text | variable | number-untyped

  rule ::= rule-typed
  rule-typed ::= typed<rule-type,rule-options? rule-seq>
  rule-type ::= '@rule'
  rule-options ::= {pair, pad}
  rule-seq ::= {rule-combine | rule-substitute | rule-assert, pad}
  rule-combine ::= tuple pad '=>' pad tuple
  rule-substitute ::= (tuple pad '->' pad tuple) | variable pad '->' pad (tuple | primitive)
  rule-assert ::= tuple pad '==' pad tuple

  variable ::= ''' variable-type '''
  variable-type ::= name | int-type | float-type | bool-type | bytes-type | text-type | tuple-type

  note-typed ::= typed<note-type,tuple-seq>
  note-type ::= '@note'

  bytes ::= bytes-typed
  bytes-typed ::= typed<bytes-type,bytes-untyped>
  bytes-type ::= '@bytes'
  bytes-untyped ::= '(#' << base64 alphabet with whitespace trimmed away >> '#)'

  number-untyped ::= int-untyped | float-untyped

  int ::= int-untyped | int-typed
  int-typed ::= typed<int-type,int-seq>
  int-untyped ::= << characters giving an integer of any size >>
  int-type ::= '@int' (':32' | ':64')?
  int-seq ::= {int-untyped, pad}

  float ::= float-untyped | float-typed
  float-typed ::= typed<float-type,float-seq>
  float-untyped ::= << characters giving a floating point of any size >>
  float-type ::= '@float' (':32' | ':64')?
  float-seq ::= {float-untyped, pad}

  bool ::= bool-untyped | bool-typed
  bool-typed ::= typed<bool-type,bool-seq>
  bool-untyped ::= 'true' | 'false'
  bool-type ::= '@bool'
  bool-seq ::= {bool-untyped, pad}

  text ::= text-untyped | text-typed
  text-typed ::= text-string | named<text-type,text-seq>
  text-untyped ::= text-string | text-block
  text-type ::= '@text'
  text-seq ::= {text-string | text-block, pad}
  text-string ::= '"' << any characters until unescaped double-quote >> '"'
  text-block ::= '(#' << any characters until text-block is balanced >> '#)'

  pad ::= [\s]+

How to handle text

  - Text-strings are escaped by replacing '\' with '\\' then '"' with '\"', and unescaped by replacing '\"' with '"' then '\\' with '\'.
  - Text-blocks are not escaped, but are verified to be balanced. If they are not balanced they become text-strings.
    A text-block is balanced if every embedded '(#' is matched with a corresponding '#)'.
    A text-block cannot end with the '(' character.

Examples of invalid expressions

  (Data name: "A" name: "B")   - duplicate key-value pair
  Data                         - neither a primitive nor a named tuple
  name: "A"                    - pair not enclosed in parentheses
  @tuple                       - not enclosed in parentheses
  '@byte'                      - variable does not contain a valid type

Examples of valid expressions

  (name: "Kyrre")              - unnamed tuple
  "Some characters!"           - implicit (@text "Some characters!")
  (state: true)                - implicit (state: (@bool true))
  (width: 1024)                - implicit (width: (@number 1024))
  (@int 4)                     - implicit (@int:64 4)
  (@float 3.14)                - implicit (@float:64 3.14)
#) @text)
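The text-handling rules in the specification can be sketched in Python. This is only an illustration under a few assumptions: the function names are made up for this example, and the balance check scans left to right for non-overlapping '(#' and '#)' markers, which the specification does not spell out. The "cannot end with '('" rule is presumably there because a trailing '(' followed by the closing '#)' would read as a new '(#' opener.

```python
def escape_text_string(raw: str) -> str:
    """Escape backslashes first, then double quotes, and wrap in quotes."""
    escaped = raw.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def unescape_text_string(quoted: str) -> str:
    """Reverse of escape_text_string: quotes first, then backslashes."""
    body = quoted[1:-1]  # drop the enclosing double quotes
    return body.replace('\\"', '"').replace("\\\\", "\\")


def is_balanced_block(body: str) -> bool:
    """Check that every '(#' in body has a matching '#)' (assumed scan order)."""
    depth = 0
    i = 0
    while i < len(body):
        marker = body[i:i + 2]
        if marker == "(#":
            depth += 1
            i += 2
        elif marker == "#)":
            depth -= 1
            if depth < 0:  # a closer appeared before its opener
                return False
            i += 2
        else:
            i += 1
    # A block must also not end with '(' according to the specification.
    return depth == 0 and not body.endswith("(")


def encode_text(raw: str) -> str:
    """Emit a text-block when balanced, otherwise fall back to a text-string."""
    if is_balanced_block(raw):
        return f"(#{raw}#)"
    return escape_text_string(raw)
```

With these helpers, encode_text("plain") would give a text-block, while a value containing a stray '#)' falls back to a quoted text-string.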
Example
Data with redundant information stored in rules:
(@tuple
  (@note "Rules for naming all unnamed tuples and discarding notes")
  (@rule forall: true
    (name: '@text' type: '@text') => (Node)
    (from: '@tuple' to: '@tuple') => (Link)
    (inputs: (
      (@rule (name: '@text' datatype: '@text') => (Socket))
    ) ) => ()
    (outputs: (
      (@rule (name: '@text' datatype: '@text') => (Socket))
    ) ) => ()
    (name: "Out" datatype: '@text') => ()
    '@note' -> ()
  )
  (@note "Rules for substituting in data")
  (@rule forall: true
    'test-data' -> (@bytes (#QUI9PQ==#))
    'instructions' -> "some data in text format"
  )
  (@note "The data to be transformed")
  (name: "Group1" type: "Group"
    nodes: (
      (name: "Source" type: "Value" value: 'test-data'
        inputs: ((name: "In" datatype: "bytedata"))
        outputs: ((name: "Out" datatype: "bytedata"))
      )
      (name: "Transform" type: "Process" data: 'instructions'
        inputs: ((name: "In" datatype: "bytedata"))
        outputs: ((name: "Out" datatype: "bytedata"))
      )
      (name: "Target" type: "Value" value: ""
        inputs: ((name: "In" datatype: "bytedata"))
        outputs: ((name: "Out" datatype: "bytedata"))
      )
    )
    links: (
      (from: (node: "Source" socket: "Out") to: (node: "Transform" socket: "In"))
      (from: (node: "Transform" socket: "Out") to: (node: "Target" socket: "In"))
    )
  )
@tuple)
The resulting data after the rules of the tuple have been applied to itself:
(@tuple
  (Node name: "Group1" type: "Group"
    nodes: (
      (Node name: "Source" type: "Value" value: (@bytes (#QUI9PQ==#))
        inputs: ((Socket name: "In" datatype: "bytedata"))
        outputs: ((Socket name: "Out" datatype: "bytedata"))
      )
      (Node name: "Transform" type: "Process" data: "some data in text format"
        inputs: ((Socket name: "In" datatype: "bytedata"))
        outputs: ((Socket name: "Out" datatype: "bytedata"))
      )
      (Node name: "Target" type: "Value" value: "output"
        inputs: ((Socket name: "In" datatype: "bytedata"))
        outputs: ((Socket name: "Out" datatype: "bytedata"))
      )
    )
    links: (
      (Link
        from: (node: "Source" socket: "Out")
        to: (node: "Transform" socket: "In")
      )
      (Link
        from: (node: "Transform" socket: "Out")
        to: (node: "Target" socket: "In")
      )
    )
  )
@tuple)
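To make the transformation above more concrete, here is a rough Python sketch of how an application might apply the substitution (->) and combination (=>) rules to already-parsed data. Everything in it is an assumption made for illustration: tuples are modelled as dicts, unnamed sequences as lists, a tuple's name is stored under a hypothetical "_name" key, and the Socket rule is applied globally rather than scoped to inputs/outputs as in the example. Discarding notes and parsing the text format are left out.

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A pattern-matching variable such as 'test-data' (hypothetical model)."""
    name: str


# Assumed type checks for the variable types used in the patterns below.
TYPE_CHECKS = {
    "@text": lambda v: isinstance(v, str),
    "@tuple": lambda v: isinstance(v, (dict, list)),
}


def matches(pattern: dict, node: dict) -> bool:
    """True if every key in the pattern exists in node with the expected type."""
    return all(
        key in node and TYPE_CHECKS[typ](node[key])
        for key, typ in pattern.items()
    )


def apply_rules(value, substitutions: dict, combines: list):
    """Recursively substitute variables and name unnamed tuples."""
    if isinstance(value, Var):
        return substitutions.get(value.name, value)
    if isinstance(value, list):
        return [apply_rules(v, substitutions, combines) for v in value]
    if isinstance(value, dict):
        out = {k: apply_rules(v, substitutions, combines) for k, v in value.items()}
        if "_name" not in out:
            for pattern, name in combines:  # first matching rule wins
                if matches(pattern, out):
                    out = {"_name": name, **out}
                    break
        return out
    return value


substitutions = {
    "test-data": base64.b64decode("QUI9PQ=="),
    "instructions": "some data in text format",
}
combines = [
    ({"name": "@text", "type": "@text"}, "Node"),
    ({"from": "@tuple", "to": "@tuple"}, "Link"),
    ({"name": "@text", "datatype": "@text"}, "Socket"),
]
data = {
    "name": "Group1", "type": "Group",
    "nodes": [
        {"name": "Source", "type": "Value", "value": Var("test-data"),
         "inputs": [{"name": "In", "datatype": "bytedata"}],
         "outputs": [{"name": "Out", "datatype": "bytedata"}]},
    ],
    "links": [
        {"from": {"node": "Source", "socket": "Out"},
         "to": {"node": "Transform", "socket": "In"}},
    ],
}
result = apply_rules(data, substitutions, combines)
```

Matching on a subset of keys is a design guess: the Node rule only lists name and type, yet the example result names tuples that also carry value, inputs and outputs.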
Ideas for extensions
- Chaining together separate files
- Nesting separate files with pattern matching variables – like referencing large bytedata and instantiating tuples
- Use separate files for validation, update and typing of data
Have you seen YAML? It’s quite mature and is very similar to what you have just described:
http://www.yaml.org/
I was aware of the name YAML but I’ve never used it. It looks a little more serious than what I scribbled down, and it’s easy to see I’ve replicated things like datatypes, casting, binary data and referenced data. YAML also seems to have class-like instancing? That’s something I miss in XML, where you instead see ugly solutions using namespaces or wrapped tags. One thing I do like about XML is that grouped items are named and you can collapse the structure in an editor. JSON, which apparently is also a subset of YAML, I don’t find so easy to understand or edit, but from what I gather it’s mostly a data-serialization format. The only part of what I posted that I think is interesting is the pattern matching against structure and type, but it has some flaws and limitations, so I’m obviously not a competent language designer. It’s fun to experiment with, though, and searching for XML alternatives shows I’m not alone. :)