Classification and Regression Trees {#estwagon}
===================================
[TOC]

# Overview {#cart-overview}

As part of its tools for statistical modelling, EST includes methods
for automatically building decision trees and decision lists from
feature data, to predict either fixed classes (classification) or
gaussians (regression). \ref Wagon is
the basic program that provides this facility.

The construction of CARTs (classification and regression trees) is
best described in @cite breiman1984classification and has become a common basic
method for building statistical models from simple feature data.
CART is powerful because it can deal with incomplete data, multiple
types of features (floats, enumerated sets) both in input features and
predicted features, and the trees it produces often contain rules
which are humanly readable.

Decision trees contain a binary question (yes/no answer) about
some feature at each node in the tree. The leaves of the tree
contain the best prediction based on the training data. Decision
lists are a reduced form of this where one answer to each question
leads directly to a leaf node. A tree's leaf node may be a single
member of some class, a probability density function (over some
discrete class), a predicted mean value for a continuous feature
or a gaussian (mean and standard deviation for a continuous value).

Theoretically the predicted value may be anything for which a function
can be defined that gives a measure of *impurity*
for a set of samples, and a distance measure between impurities.


The basic algorithm is: given a set of samples (feature vectors), find
the question about some feature which splits the data minimising the
mean "impurity" of the two partitions. Recursively apply this
splitting to each partition until some stop criterion is reached
(e.g. a minimum number of samples in the partition).


The basic CART building algorithm is a *greedy* algorithm
in that it chooses the locally best
discriminatory feature at each stage in the process. This is
suboptimal, but a full search for a fully optimized set of questions
would be computationally very expensive. Although there are
pathological cases, in most data sets this greediness is not a problem.

The basic building algorithm starts with a set of feature vectors
representing samples. At each stage all possible questions about all
possible features are asked of the data, to find out how each
question splits the data. A measurement of impurity of each
partitioning is made and the question that generates the least impure
partitions is selected. This process is applied recursively to each
sub-partition until some stop criterion is met (e.g. a
minimal number of samples in a partition).
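
The loop below is a minimal sketch of this greedy recursive splitting,
not Wagon's implementation: samples are assumed to be dicts keyed by
feature name, and the `impurity` and `possible_questions` helpers are
supplied by the caller (sketches of both are given in the following
sections).

    from statistics import mean, pstdev

    def make_leaf(samples, predictee):
        # A regression-style leaf in the spirit of Wagon's "(stddev mean)" leaves.
        values = [s[predictee] for s in samples]
        return (pstdev(values), mean(values))

    def build_tree(samples, predictee, impurity, possible_questions, stop=50):
        # Do not consider splitting partitions smaller than the stop value.
        if len(samples) <= stop:
            return make_leaf(samples, predictee)
        best = None
        for question in possible_questions(samples):
            yes = [s for s in samples if question(s)]
            no = [s for s in samples if not question(s)]
            if not yes or not no:
                continue
            # Combined impurity of the two partitions; smaller is better.
            score = impurity(yes, predictee) + impurity(no, predictee)
            if best is None or score < best[0]:
                best = (score, question, yes, no)
        if best is None:
            return make_leaf(samples, predictee)
        _, question, yes, no = best
        return (question,
                build_tree(yes, predictee, impurity, possible_questions, stop),
                build_tree(no, predictee, impurity, possible_questions, stop))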

## Impurities {#estwagonimpurities}

The *impurity* of a set of samples is designed to
capture how similar the samples are to each other. The smaller
the number, the less impure the sample set is.

For sample sets with continuous predictees Wagon uses the variance
times the number of sample points. The variance alone could
be used, but this would overly favour very small sample sets. As the
test that uses the impurity is trying to minimise it over
a partitioning of the data, multiplying each part by the number
of samples will encourage larger partitions, which we have found
lead to better decision trees in general.

For sample sets with discrete predictees Wagon uses the entropy
times the number of sample points. Again the number of sample
points is used so that small sample sets are not unfairly favoured.
The entropy for a sample set is calculated as:

\f[ E = - \sum_{x \in class} prob(x) \log(prob(x)) \f]

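As an illustration, the two measures could be written roughly as follows
(a sketch, not the Wagon source; samples are again assumed to be dicts
keyed by feature name, and either function could serve as the `impurity`
argument of the earlier building sketch):

    import math
    from collections import Counter
    from statistics import pvariance

    def continuous_impurity(samples, predictee):
        # Variance of the predictee values, weighted by the number of samples.
        values = [s[predictee] for s in samples]
        return pvariance(values) * len(values)

    def discrete_impurity(samples, predictee):
        # Entropy of the class distribution, weighted by the number of samples.
        counts = Counter(s[predictee] for s in samples)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        return entropy * total
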
Other impurity measures could be used if required. For example an
experimental clustering technique used for unit selection actually used an
impurity calculated as the mean Euclidean distance between all vectors
of parameters in the sample set. However the above two are more
standard measures.

## Question forming {#estwagonoverviewquestion}

Wagon has to automatically form questions about each feature in the
data set.

For discrete features questions are built for each member of the set,
e.g. whether feature n has value x. Our implementation does not currently
support more complex questions, which could achieve better results
(though at the expense of training time). Questions about a feature
taking a value in some subset of the class members may give smaller trees. If the
data requires distinguishing values a, b and c from d, e and f, our
method would require three separate questions, while if subset
questions could be formed this could be done in one step, which would
not only give a smaller tree but also not unnecessarily split the
samples for a, b and c. In general subset forming is exponential in
the number of items in the class, though there are techniques that can
reduce this with heuristics. However these are currently not supported.
Note, however, that the tree formalism produced by Wagon does support
such questions (with the operator "in"), but Wagon will never produce
these questions, though other tree building techniques (e.g. by hand)
may use this form of question.

For continuous features Wagon tries to find a partition of
the range of the values that best optimizes the average
impurity of the partitions. This is currently done by linearly
splitting the range into a predefined number of subparts (10 by default)
and testing each split. This again isn't optimal, but it does
offer reasonable accuracy without requiring vast amounts of
computation.
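
Both kinds of question generation are sketched below. The sample-as-dict
representation and the helper names are assumptions made for the
illustration; the real candidate questions are generated internally by
Wagon.

    def discrete_questions(samples, feature):
        # One "is" question per value the feature takes in this sample set.
        values = {s[feature] for s in samples}
        return [lambda s, f=feature, v=v: s[f] == v for v in values]

    def continuous_questions(samples, feature, nsplits=10):
        # Linearly split the observed range into nsplits parts and propose
        # one "<" question per interior boundary.
        lo = min(s[feature] for s in samples)
        hi = max(s[feature] for s in samples)
        step = (hi - lo) / nsplits
        thresholds = [lo + i * step for i in range(1, nsplits)]
        return [lambda s, f=feature, t=t: s[f] < t for t in thresholds]

In the building sketch given earlier, the `possible_questions` helper
would simply concatenate the questions generated for every feature that
is not being ignored.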


## Tree Building criteria {#estwagonoverviewtreebuild}

There are many ways to constrain the tree building algorithm to help
build the "best" tree. Wagon supports many of these (though there
are almost certainly others that it does not).

In the most basic form of the tree building algorithm a fully
exhaustive classification of all samples would be achieved. This, of
course, is unlikely to be good when given samples that are not
contained within the training data. Thus the object is to build a
classification/regression tree that will be most suitable for new
unseen samples. The most basic method to achieve this is not to build
a full tree but to require that there are at least n samples in a
partition before a question split is considered. We refer to this as
the *stop* value. A number like 50 as a stop value
will often be good, but depending on the amount of data you have, its
distribution, etc., different stop values may produce more general
trees.

A second method for building "good" trees is to *hold out* some of
the training data and build a (probably over-trained) tree with a small
stop value, then prune the tree back to where it best matches the
held out data. This can often produce better results than a fixed
stop value, as it effectively allows the stop value to vary through
different parts of the tree depending on how general the prediction
is when compared against the held out data.
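
The sketch below shows one simple form of this kind of pruning
(reduced-error pruning against the held out set). It is an illustration
only, not Wagon's pruning code, and it reuses the tuple tree
representation and the `make_leaf` helper from the building sketch
above.

    def predict(tree, sample):
        # Walk the tree; a leaf is a (stddev, mean) pair, so return its mean.
        while isinstance(tree, tuple) and len(tree) == 3:
            question, yes, no = tree
            tree = yes if question(sample) else no
        return tree[1]

    def sse(tree, rows, predictee):
        # Summed squared error of the tree's predictions on a sample set.
        return sum((predict(tree, s) - s[predictee]) ** 2 for s in rows)

    def prune(tree, train, heldout, predictee):
        # Bottom-up: prune the children first, then collapse this node to the
        # leaf its *training* samples would give whenever that leaf does no
        # worse than the subtree on the *held out* samples.
        if not (isinstance(tree, tuple) and len(tree) == 3):
            return tree
        question, yes, no = tree
        pruned = (question,
                  prune(yes, [s for s in train if question(s)],
                        [s for s in heldout if question(s)], predictee),
                  prune(no, [s for s in train if not question(s)],
                        [s for s in heldout if not question(s)], predictee))
        leaf = make_leaf(train, predictee)
        if sse(leaf, heldout, predictee) <= sse(pruned, heldout, predictee):
            return leaf
        return pruned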


It is often better to try to build more balanced trees. A small stop
value may cause the tree building algorithm to find small coherent
sets of samples with very specific questions. The resulting tree becomes
heavily lop-sided and (perhaps) not optimal. Rather than using the
same literal stop value everywhere, more balanced trees can be built if the stop
value is defined to be some percentage of the number of samples under
consideration. This percentage we call a *balance*
factor. The stop value is then the largest of the
defined fixed stop value or the balance factor times the number of
samples.
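
Equivalently, for a partition of N samples the effective stop value is:

\f[ stop_{effective} = \max( stop_{fixed}, \; balance \times N ) \f]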


To some extent the multiplication of the entropy (or variance) by
the number of samples in the impurity measure is also a way to
combat imbalance in tree building.


A good technique we have found is to build trees in a *stepwise*
fashion. In this case, instead of considering all
features at once when building the tree, we incrementally build trees, looking
for the individual feature that best increases the accuracy of the built
tree on the provided test data. Unlike within the tree building
process, where we are looking for the best question over all features,
this technique limits which features are available for consideration.
It first builds a tree for each individual feature on its own,
looking for the single feature that provides the best tree. Then,
having selected that feature, it builds n-1 trees pairing the best feature from
the first round with each of the remaining features. This process
continues until no more features add to the accuracy or some
stopping criterion (a required percentage improvement) is not reached.
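
The outer loop of this selection might look like the sketch below. The
`build_and_score` helper, which would train a tree on the chosen
features and return its accuracy on the test data, is an assumption
made for the illustration.

    def stepwise_select(features, build_and_score, min_improvement=0.0):
        # Greedily grow the selected feature set, one feature per round,
        # keeping whichever feature most improves test-set accuracy.
        selected, best_score = [], None
        remaining = list(features)
        while remaining:
            scores = [(build_and_score(selected + [f]), f) for f in remaining]
            score, feature = max(scores)
            if best_score is not None and score - best_score <= min_improvement:
                break  # no remaining feature improves the tree enough
            selected.append(feature)
            remaining.remove(feature)
            best_score = score
        return selected, best_score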


This technique is also a greedy technique, but we've found that when
many features are presented, especially when some are highly
correlated with each other, stepwise building produces a significantly
more robust tree on external test data. It also typically builds
smaller trees. But of course there is a cost in computation time.

While using the stepwise option each new feature added is printed out.
Care should be taken in interpreting what this means. It does not
necessarily give the order and relative importance of the features,
but it may be useful in showing which features are particularly important
to this build.

Stepwise tests each successive tree against the specified test set (the balance,
held out and stop options are respected for each build). As this
uses the test set while optimizing the tree, it is not valid
to view the specified test set as a genuine test set. Another
externally held test set should be used to test the accuracy of the
generated tree.

## Data format {#cart-formats}

The input data for wagon (and some other model building tools
in the Edinburgh Speech Tools library) should consist of
feature vectors, and a description of the fields in these vectors.

### Feature vectors {#estwagon-featvectors}

A feature vector file has one sample per line, with feature values
given as whitespace-separated tokens. If your feature values contain
whitespace then you must quote them using double quotes.


The (Festival) program `dumpfeats` is specifically
designed to generate such files from databases of utterances but
these files may be generated from any data source.

Each vector must have the same number of features (and in the
same order). Features may be specified as "ignored" in the
description (or in actual use) so it is common that data files
contain more features than are always used in model building. By
default the first feature in a data file is the predictee, though
(at least in wagon) the predictee field can be named at tree building time
to be other than the first field.

Features can be discrete or continuous but at present must be
single valued; "multi-valued" or "list-valued" features are not
currently supported. Note this means that a feature in different
samples may have different values, but in a particular sample a particular
feature can only have one value.

A typical example is:

    0.399 pau sh 0 0 0 1 1 0 0 0 0 0 0
    0.082 sh iy pau onset 0 1 0 0 1 1 0 0 1
    0.074 iy hh sh coda 1 0 1 0 1 1 0 0 1
    0.048 hh ae iy onset 0 1 0 1 1 1 0 1 1
    0.062 ae d hh coda 1 0 0 1 1 1 0 1 1
    0.020 d y ae coda 2 0 1 1 1 1 0 1 1
    0.082 y ax d onset 0 1 0 1 1 1 1 1 1
    0.082 ax r y coda 1 0 0 1 1 1 1 1 1
    0.036 r d ax coda 2 0 1 1 1 1 1 1 1

Note it is common to have thousands, even hundreds of thousands, of
samples in a data file, and the number of features can often be in the
hundreds, though it can also be less than ten depending on what is being
described.
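
Reading such a file is straightforward. For example, the sketch below
uses `shlex.split`, which honours the double-quoted tokens described
above (though it also accepts single quotes and backslash escapes,
which the format does not define):

    import shlex

    def read_feature_vectors(path):
        # One sample per line; tokens are whitespace separated and may be
        # double-quoted if they contain whitespace.
        with open(path) as f:
            return [shlex.split(line) for line in f if line.strip()]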

### Data descriptions {#estwagon-datadescr}

A data file also requires a description file which names and classifies
the features in a datafile. Features must have names so they can
be referred to in the decision tree (or other model output) and also
be classified into their type. The basic types available for
features are:

 - `continuous` for features that range over reals (e.g. duration of phones)
 - `categorial` for features with a pre-defined list of possible values (e.g. phone names)
 - `string` for features with an open class of discrete values (e.g. words)
 - `vectors` like floats but as vectors of floats (e.g. MFCC data)

The data description consists of a parenthesized list of feature
descriptions. Each feature description consists of the feature
name and its type (and/or possible values). Feature names, by
convention, should be feature names in the sense of the features (and
pathnames) used throughout the utterance structures in the
Edinburgh Speech Tools. The expected way to use models
generated from feature sets in the Edinburgh Speech Tools is
to apply them to items. In that sense, having a feature name be
a feature (or relative pathname) of an item will avoid
the extra step of extracting features into a separate table before
applying the model. However it should also be stated that to
wagon these names are arbitrary tokens and their semantics are irrelevant
at training time.

A typical description file would look like this; this one is
suitable for the data file given above:


    ((segment_duration float)
     ( name aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g
        hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
     ( n.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g
        hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
     ( p.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g
        hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
     (position_type 0 onset coda)
     (pos_in_syl float)
     (syl_initial 0 1)
     (syl_final 0 1)
     (R:Sylstructure.parent.R:Syllable.p.syl_break float)
     (R:Sylstructure.parent.syl_break float)
     (R:Sylstructure.parent.R:Syllable.n.syl_break float)
     (R:Sylstructure.parent.R:Syllable.p.stress 0 1)
     (R:Sylstructure.parent.stress 0 1)
     (R:Sylstructure.parent.R:Syllable.n.stress 0 1)
     )

There are also a number of special symbols that may be used in a
description file. If the type (the first token after the name) is
`ignore` the feature will be ignored in the model
building process. You may also specify features to ignore at tree
building time, but it is often convenient to explicitly ignore
feature(s) in the description file.

For open categorial features the token `_other_`
should appear as the first in the list of possible values. This
actually allows features to have a partially closed set and an open set.

A description file can't be generated fully automatically from a data set, though
an approximation is possible. In particular it is not possible
to automatically decide whether a feature is really continuous or whether its
example values merely happen to look like numbers. The script
`make_wagon_desc` takes a datafile, a file
containing only the names of the features, and the name of
the description file it will create. This is often a useful
first pass, though it almost certainly must be hand edited afterwards.
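
A naive version of such a first pass is sketched below. This is an
illustration, not the `make_wagon_desc` script itself; it simply guesses
`float` whenever every observed value parses as a number, which is
exactly the guess that may need hand editing afterwards.

    def guess_description(feature_names, data_rows):
        # data_rows: token lists, one per sample, aligned with feature_names.
        desc = []
        for i, name in enumerate(feature_names):
            values = [row[i] for row in data_rows]
            try:
                for v in values:
                    float(v)
                desc.append((name, ["float"]))            # looks numeric
            except ValueError:
                desc.append((name, sorted(set(values))))  # list the values seen
        return desc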

### Tree format {#estwagon-treeformat}

The generated tree files are written as Lisp s-expressions, as this
is by far the easiest external method to represent trees. Even if
the trees are read by something other than Lisp it is easy to
write a reader for such a format. The syntax of a tree is:


    TREE ::= LEAF | QUESTION-NODE

    QUESTION-NODE ::= "(" QUESTION YES-NODE NO-NODE ")"

    YES-NODE ::= TREE

    NO-NODE ::= TREE

    QUESTION ::= "(" FEATURENAME "is" VALUE ")" |
                 "(" FEATURENAME "=" FLOAT ")" |
                 "(" FEATURENAME "<" FLOAT ")" |
                 "(" FEATURENAME ">" FLOAT ")" |
                 "(" FEATURENAME "matches" REGEX ")" |
                 "(" FEATURENAME "in" "(" VALUE0 VALUE1 ... ")" ")"

    LEAF ::= "(" STDDEV MEAN ")" |
             "(" "(" VALUE0 PROB0 ")" "(" VALUE1 PROB1 ")" ... MOSTPROBVAL ")" |
             any other lisp s-expression

Note that not all of the question types are generated by Wagon but
they are supported by the interpreters.

The leaf nodes differ depending on the type of the predictee. For
continuous predictees (regression trees) the leaves consist of a pair
of floats, the stddev and mean. For discrete predictees
(classification trees) the leaves are a probability density function
over the members of the class, with the last member of the list being the
most probable value. Note that in both cases the last value of the
leaf list is the answer desired in many cases.
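
As an illustration of how simple an external reader can be, the sketch
below walks a tree of this form once it has been parsed into nested
Python lists. The s-expression parsing itself and the feature lookup
are assumed, and only the "is", "=", "<", ">" and "in" operators are
handled (not "matches").

    OPS = {
        "is": lambda a, b: a == b,
        "=":  lambda a, b: float(a) == float(b),
        "<":  lambda a, b: float(a) < float(b),
        ">":  lambda a, b: float(a) > float(b),
        "in": lambda a, b: a in b,
    }

    def is_question(node):
        # A question looks like [feature-name, operator, value].
        return (isinstance(node, list) and len(node) == 3
                and isinstance(node[0], str) and node[1] in OPS)

    def interpret(tree, features):
        # A question node is [question, yes-tree, no-tree]; anything else
        # is a leaf, whose last element is the answer to use.
        while len(tree) == 3 and is_question(tree[0]):
            name, op, value = tree[0]
            tree = tree[1] if OPS[op](features[name], value) else tree[2]
        return tree[-1]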


Here's a small example tree:

    NOT WRITTEN


# Functions {#estwagon-functions}

# Programs {#estwagon-programs}

The following executable programs are provided for the
building and testing of decision trees:

 - \ref wagon_manual: the core building program
 - \ref wagon_test_manual: applies a trained tree to data and tests its accuracy.
