Getting started
Key concepts
CDuce is a strongly-typed functional programming language adapted
to the manipulation of XML documents. Its syntax is reminiscent
of the ML family, but CDuce has a completely different type system.
Let us directly introduce some key concepts:
- Values are the objects manipulated by
CDuce programs; we can distinguish several kinds of values:
  - Basic values: integers, characters.
  - XML documents and fragments: elements, tag names, strings.
  - Constructed values: pairs, records, sequences.
  - Functional values.
- Types denote sets of values that share common
structural and/or behavioral properties. For instance,
Int denotes the set of all integers,
and <a href=String>[] denotes XML elements
with tag a that have an attribute href
(whose content is a string), and with no sub-element.
- Expressions are fragments of CDuce programs
that produce values. For instance, the expression 1 + 3
evaluates to the value 4. Note that values can
be seen either as special cases of expressions, or as
the result of evaluating expressions.
- Patterns are ``types + capture variables''. They make it possible
to extract some sub-values from an input value, which can then be
used in the rest of the program. For instance, the pattern
<a href=x>[] extracts the value of the
href attribute and binds it to the value
identifier x.
A first example
let x = "Hello, " in
let y = "world!" in
x @ y
The expression binds two strings to value identifiers x
and y, and then concatenates them. The general form
of the local binding is:
let p = e in e'
where p is a pattern and e,
e' are expressions.
Note:
A small aside about the examples in this tutorial and their usage. The
first program, which evaluates to "Hello, world!", can be tried directly on the on-line
prototype: just select and copy it, click on the link to the on-line
interpreter in the side bar (we suggest you open it in a new window), paste it in the execution window and run it. The
second example, instead, cannot be run. This is visually signaled by the fact
that it contains text in italics. We use italics for meta notation:
e and e' stand for generic expressions, so it is useless to run
this code (you would just obtain an error signaling that e is
not bound or that the quote in e' is not closed). This also holds in general in what follows: code without
italicized text can be copied and pasted into the on-line prototype as it is
(of course you must first paste the declarations of the types it uses);
this is not possible whenever the code contains italicized text.
Patterns are much more than simple variables. They can be used to decompose
values. For instance, if the words Hello and world are the two components of a pair, we can capture each of them and concatenate them as follows:
let (x,y) = ("Hello, " , "world!") in x @ y
Patterns can also check types. So for instance
let (x & String, y) = e in x
would return a (static) type error if the first projection of e does not have the static type String.
The form let x&t = e in e' is used so often that we introduced a special syntax for it:
let x : t = e in e'
Note the blank spaces around the colon [1].
This is because the XML recommendation allows colons to occur in identifiers: see the User's Manual section on namespaces. The same holds true for the functional arrow symbol ->, which must be surrounded by blanks, and for colons in the formal parameters of a function: see the corresponding paragraph of the User's Manual.
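The pattern forms described above can be tried on concrete values. The following is a minimal sketch: the values ("Hello, ", 42, "index.html") and the identifier greeting are made up for illustration, and the phrases are separated by ;; so that they can be pasted together.

```cduce
(* Decompose a pair while checking statically that its first
   component is a string. *)
let (x & String, y) = ("Hello, ", 42) in x
;;

(* The abbreviated form of let x & t = e in e'; note the blanks
   around the colon. *)
let greeting : String = "Hello, " in
greeting @ "world!"
;;

(* Extract the href attribute of an <a> element that has no content. *)
let <a href=x>[] = <a href="index.html">[] in x
```

Replacing "Hello, " by an integer in the first phrase should be rejected by the type checker, since the first projection would no longer have the static type String.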
XML documents
CDuce uses its own notation to denote XML documents. In the next table we
present an XML document on the left and the same document in CDuce notation on
the right (in the rest of this tutorial we visually distinguish XML code from CDuce code by putting the former in light yellow boxes):
<?xml version="1.0"?>
<parentbook>
  <person gender="F">
    <name>Clara</name>
    <children>
      <person gender="M">
        <name>Pål André</name>
        <children/>
      </person>
    </children>
    <email>clara@lri.fr</email>
    <tel>314-1592654</tel>
  </person>
  <person gender="M">
    <name> Bob </name>
    <children>
      <person gender="F">
        <name>Alice</name>
        <children/>
      </person>
      <person gender="M">
        <name>Anne</name>
        <children>
          <person gender="M">
            <name>Charlie</name>
            <children/>
          </person>
        </children>
      </person>
    </children>
    <tel kind="work">271828</tel>
    <tel kind="home">66260</tel>
  </person>
</parentbook>
|
let parents : ParentBook =
  <parentbook>[
    <person gender="F">[
      <name>"Clara"
      <children>[
        <person gender="M">[
          <name>['Pål ' 'André']
          <children>[]
        ]
      ]
      <email>['clara@lri.fr']
      <tel>"314-1592654"
    ]
    <person gender="M">[
      <name>"Bob"
      <children>[
        <person gender="F">[
          <name>"Alice"
          <children>[]
        ]
        <person gender="M">[
          <name>"Anne"
          <children>[
            <person gender="M">[
              <name>"Charlie"
              <children>[]
            ]
          ]
        ]
      ]
      <tel kind="work">"271828"
      <tel kind="home">"66260"
    ]
  ]
Note the straightforward correspondence between the two notations:
instead of using a closing tag, we enclose the content of each
element in square brackets. In CDuce square brackets denote sequences,
that is, heterogeneous (ordered) lists of blank-separated elements. In
CDuce strings are not a primitive data type but are sequences of
characters. For the purposes of the example we used different notations to
denote strings, since in CDuce "xyz", ['xyz'],
['x' 'y' 'z'], [ 'xy' 'z' ], and [
'x' 'yz' ] define the same string literal. Note also that the
"Pål André" string is accepted, as CDuce supports Unicode
characters.
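The equivalence of the string notations can be checked with a small sketch. The identifiers s1, s2 and s3 are made up for illustration; each binding is annotated with String, so the type checker has to accept every notation as a sequence of characters.

```cduce
(* Three notations for the same sequence of characters. *)
let s1 : String = "xyz" in
let s2 : String = ['xyz'] in
let s3 : String = ['x' 'yz'] in
(* Strings are sequences, so they can be concatenated with @. *)
s1 @ s2 @ s3
```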
Loading XML files
The program on the right-hand side in the previous section starts
by binding the variable parents to the XML document. It
also specifies that parents has the type ParentBook: this is optional but it
usually allows earlier detection of type errors.
If the XML document on the left-hand side is stored in a file, say,
parents.xml, then it can be loaded from the file by load_xml
"parents.xml", as the builtin function load_xml converts an
XML document stored in a file into the CDuce expression representing it. However,
load_xml has type String->Any, where
Any is the type of all values. Therefore, if we try to reproduce the
same binding as above by writing the following declaration
let parents : ParentBook = load_xml "parents.xml"
we would obtain a type error, since we would be trying to use an expression of type
Any where an expression of type ParentBook is expected.
The right way to reproduce the binding above is:
let parents : ParentBook =
  match load_xml "parents.xml" with
    x & ParentBook -> x
  | _ -> raise "parents.xml is not a document of type ParentBook"
Before assigning the result of the load_xml expression to the
variable parents, this expression matches it against the type
ParentBook. If the match succeeds (i.e., if the XML document in the file has
type ParentBook) then it performs the assignment (the variable
x is bound to the result of the load_xml expression by the pattern
x&ParentBook); otherwise it raises an exception.
Of course, an exception such as "parents.xml is not a document of type
ParentBook" is not very informative about why the document failed the match
and where the error might be. In CDuce it is possible to ask the program to
perform this check and raise an informative exception (a string that describes
and locates the problem) by using the dynamic type check construction
(e :? t), which checks whether the expression
e has type t and either returns the
result of e or raises an informative exception:
let parents = load_xml "parents.xml" :? ParentBook
This performs the same test as the previous program but, in case of failure, gives
the programmer information on the reasons why the type check failed.
The dynamic type check can also be used in a let construction as follows
let parents :? ParentBook = load_xml "parents.xml"
which is completely equivalent to the previous one.
The command load_xml "parents.xml" is just an abbreviated form of
load_xml "file://parents.xml". If CDuce is compiled with
netclient or curl support, then it is also possible to use other URI schemes such as
http:// or ftp://. A special scheme string: is always supported: the string
following the scheme is parsed as it is [2].
So, for instance, load_xml "string:exp"
parses literal XML code exp (it corresponds to XQuery's { exp }), while load_xml
("string:" @ x) parses the XML code associated with the string variable x. Thus the following definition of x
let x : Any = <person>[ <name>"Alice" <children>[] ]
is completely equivalent to this one
let x = load_xml "string:<person><name>Alice</name> <children/></person>"
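The string: scheme and the dynamic type check can be combined, for instance to validate XML text built at run time. This is only a sketch: the identifier src and the document it holds are made up for illustration, and the Person type is assumed to be declared as in the section on type declarations.

```cduce
(* XML source held in a string; note the gender attribute, which the
   Person type requires. *)
let src = "<person gender=\"F\"><name>Alice</name><children/></person>" in
(* Parse the string, then check the result against Person at run time. *)
(load_xml ("string:" @ src) :? Person)
```

If the parsed document did not have type Person, the check would raise an exception describing the mismatch rather than return a value.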
Type declarations
First, we declare some types:
type ParentBook = <parentbook>[Person*]
type Person = FPerson | MPerson
type FPerson = <person gender="F">[ Name Children (Tel | Email)*]
type MPerson = <person gender="M">[ Name Children (Tel | Email)*]
type Name = <name>[ PCDATA ]
type Children = <children>[Person*]
type Tel = <tel kind=?"home"|"work">['0'--'9'+ '-'? '0'--'9'+]
type Echar = 'a'--'z' | 'A'--'Z' | '_' | '0'--'9'
type Email = <email>[ Echar+ ('.' Echar+)* '@' Echar+ ('.' Echar+)+ ]
The type ParentBook describes XML documents that store information
about persons. A tag <tag attr1=... attr2=... ...>
followed by a sequence type denotes an XML document type. Sequence
types classify ordered lists of heterogeneous elements and are
denoted by square brackets that enclose regular expressions over types
(note that a regular expression over types is not a type: it
just describes the content of a sequence type, and is therefore
meaningless unless it is enclosed in square brackets). The definitions above
state that a ParentBook element is formed by a possibly empty sequence
of persons. A person is either of type FPerson or
MPerson according to the value of the gender attribute.
An equivalent definition for Person would thus be:
<person gender="F"|"M">[ Name Children (Tel | Email)*]
A person element is composed of a sequence formed of a name
element, a children element, and zero or more telephone and e-mail
elements, in this order. Name elements contain strings, which are encoded as sequences of
characters. The PCDATA keyword is equivalent to the
regexp Char*, so String,
[Char*], [PCDATA], [PCDATA*
PCDATA], ..., are all equivalent notations. Children are
composed of zero or more Person elements. Telephone elements have an
optional (as indicated by =?) string attribute whose
value is either ``home'' or ``work'', and their content is a single
string made of two non-empty sequences of numeric characters separated by
an optional dash character. Had we wanted to state that a phone number
is an integer with at least, say, 5 digits (of course this is
meaningful only if no phone number starts with 0) we would have used an
interval type such as <tel kind=?"home"|"work">[10000--*],
where * denotes plus infinity, while on the left-hand side of -- (as in *--100) it denotes minus infinity.
Echar is the type of characters in e-mail
addresses. It is used in the regular expression defining Email to
precisely constrain the form of the addresses. An XML document satisfying
these constraints is shown in the section on XML documents above.
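To see the declarations at work, here is a minimal sketch of a value checked against Person. It assumes the type declarations above are in scope; the identifier alice, the phone number and the e-mail address are made up for illustration.

```cduce
(* Accepted: a name, then children, then any mix of tel and email
   elements, in that order. *)
let alice : Person =
  <person gender="F">[
    <name>"Alice"
    <children>[]
    <tel kind="home">"555-0100"
    <email>"alice@example.org"
  ]

(* Moving <email> before <name>, or writing a phone number that
   contains letters, should make the annotation fail with a static
   type error. *)
```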
[1]
Actually only the first blank is necessary. CDuce accepts let x :t = e in e'
as well.
[2]
All these schemes are available for load_html and load_file as well.