The Tilt Intonation Model {#esttilt}
===========================

*Tilt* is a phonetic model of intonation that
represents intonation as a sequence of continuously parameterised
*events*.

The tilt library is a set of functions which analyse, synthesize and
manipulate tilt representations.

# Theoretical Overview {#tilt-overview}

The basic unit in the tilt model is the *intonational event*.
Events occur as instants with nothing between them,
as opposed to segment-based phenomena where units occur in a
contiguous sequence. The basic types of intonational event are
*pitch accents* and (following the popular
terminology) *boundary tones*. Pitch accents
(denoted by the letter a) are F0 excursions associated with
syllables which are used by the speaker to give some degree of
emphasis to a particular word or syllable. In the tilt model, boundary
tones (b) are rising F0 excursions which occur at the edges of
intonational phrases and, as well as giving the hearer a cue as to the
end of the phrase, can also signal effects such as continuation and
questioning. A combination event ab occurs when a pitch accent
and boundary tone occur so close to one another that only a single
pitch movement is observed. There are different kinds of pitch accents
and boundary tones: the choice of pitch accent and boundary tone
allows the speaker to produce different global intonational tunes
which can indicate questions, statements, moods etc. to the hearer.

\anchor tilt-f0-representation
\image html tilt-f0-representation.svg "Schematic F0 representation"
\image latex tilt-f0-representation.eps "Schematic F0 representation" width=7cm

\ref tilt-f0-representation shows a schematic representation of F0,
intonational event relation and segment relation in the Tilt
model. The linguistically relevant parts of the F0 contour, which
correspond to intonational events, are circled. The events, labelled a
for pitch accent and b for boundary, are linked to the syllable nuclei
of the syllable relation. Note that every event is linked to a
syllable, but some syllables do not have events.

Unlike traditional intonational phonology schemes \cite{ph:thesis},
\cite{tobi}, which impose a categorical classification on events, Tilt
uses a set of continuous parameters. These parameters, collectively
known as *tilt parameters*, are determined from
examination of the local shape of the event's F0 contour.

The tilt model is built on a simpler model, the rise/fall/connection (RFC) model.

In the RFC model, each event is modelled by a rise part followed by a
fall part. Each part has an amplitude and duration, and two parameters
are used to give the time position of the event in the utterance and
the F0 height of the event. \ref figure-typical-pitch-accent shows a typical
pitch accent with these parameters marked.

\anchor figure-typical-pitch-accent
\image html typical-pitch-accent.svg "Typical pitch accent"
\image latex typical-pitch-accent.eps "Typical pitch accent" width=7cm

The RFC parameters for each event in an utterance are therefore:

- rise amplitude (Hz)
- rise duration (seconds)
- fall amplitude (Hz)
- fall duration (seconds)
- position (seconds)
- F0 height (Hz)

Sometimes events don't have rise or fall parts, and in these cases the
amplitude and duration of the missing part are set to 0. The position
parameter can be specified in two ways: either as the distance from
the start of the utterance, or the distance from the start of the
vowel of the associated syllable. The latter is more linguistically
meaningful, but as vowel boundaries are not always available, the
former is often used instead.

While the RFC model can accurately describe F0 contours, the mechanism
is not ideal in that the RFC parameters for each contour are not as
easy to interpret and manipulate as one might like. For instance there
are two amplitude parameters for each event, when it would make sense
to have only one.

The *Tilt* representation helps solve these
problems by transforming the four amplitude and duration RFC
parameters into three Tilt parameters:

- amplitude (Hz): the sum of the magnitudes of the rise and fall amplitudes.
- duration (seconds): the sum of the rise and fall durations.
- tilt: a dimensionless number which expresses the overall *shape*
  of the event, independent of its amplitude or duration.

The position and F0 height parameters are the same as before.

The tilt representation is superior to the RFC representation in that
it has fewer parameters without significant loss of
accuracy. Importantly, it can be argued that the tilt parameters are
more linguistically meaningful.

In describing the tilt model, we use the term
*analysis* to describe the process of producing a
tilt representation from an F0 contour, and *synthesis*
to describe the process of producing an F0 contour from a
tilt representation.

## RFC Analysis {#esttilt-overview-rfcanalysis}

### Locating Events in the F0 contour {#esttilt-overview-rfcanalysis-locating}

The first stage in analysis is to find the intonational events in an
F0 contour. EST does not directly provide a means for doing this. In
practice this is either done by hand by a human labeller, or
automatically by the HMM auto event labeller. The current HMM event
labeller is based on the HTK system and hence can't be part of EST,
but an outline of the system follows:

The automatic event detector uses continuous density hidden Markov
models to perform a segmentation of the input utterance. A number of
units are defined and an HMM is trained for each unit on examples of that
kind from a pre-labelled training corpus using the Baum-Welch algorithm
\cite{baum:72}. Each utterance in the corpus is acoustically processed
so that it can be represented by a sequence of evenly spaced
frames. Each frame is a multi-component vector representing the
acoustic information for the time interval centred around the frame.

Recognition is performed by forming a network comprising the HMMs for
each unit in conjunction with an n-gram language model which gives the
prior probability of a sequence of n units occurring. To perform
recognition on an utterance, the network is searched using the
standard Viterbi algorithm to find the most likely path through the
network given the input sequence of acoustic vectors.

It is our intention to put a complete event labeller in EST in the future.

### Producing an RFC representation from an utterance's events and F0 contour {#ov-rfc-analysis}

An utterance's events are represented in a relation. Initially, events
are stored as regions with start and stop times as this is the most
common output format of labellers (both human and automatic).

For example, for utterance kdt_016, a set of basic event labels is as
follows (in xlabel format):

Events are labelled "a", and silences "sil". The use of the "c" label
is to allow start times which differ from the end of the previous
event. Conceptually, this can also be represented as follows:

    name:sil  start:0.0    end:0.290
    name:a    start:0.290  end:0.620
    name:a    start:0.760  end:0.960
    name:a    start:1.480  end:1.680
    name:sil  start:1.790  end:1.790

The other component for analysis is the utterance's F0 contour, which
is stored in a track. The contour must be continuous (i.e. have no
breaks), and its frames must be specified at fixed intervals. For best
performance the contour should have been smoothed.

The RFC analysis component takes the approximate labels and the
smoothed F0 contour, fits rise and fall shapes, and hence determines
an optimal set of RFC parameters for the utterance.

For each event, a peak picking algorithm decides if the event has a
rise part only, a fall part only or a rise part followed by a fall
part.

\anchor tilt-search-region
\image html tilt-search-region.svg "Tilt search region"
\image latex tilt-search-region.eps "Tilt search region" width=7cm

For each part, a search region, shown in \ref tilt-search-region,
is defined around the approximate start and end boundaries as defined
in the input label file. The search region is controlled by a number
of parameters:

- start_limit: the distance in seconds before each input start
  boundary that the start search region should begin.
- end_limit: the distance in seconds after each input end
  boundary that the end search region should end.
- range: where the start region ends and the end region begins,
  specified as a fraction of the overall label duration.

For example, a pitch accent starts at 1.45 seconds and ends at 1.75
seconds. If the start and end limit are both defined to be 0.1 seconds
and the range is 0.4 (40%), then the start region starts at 1.35
seconds and ends at 1.55, and the end region starts at 1.65 and ends
at 1.85. The matching algorithm will synthesize every possible shape
lying within this region, measure the distance between each and the
actual contour, and pick the one with the lowest distance.
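
The exhaustive matching step can be sketched as follows. This is an illustration under stated assumptions rather than the EST implementation: the contour is given as frame values, the search regions are given directly as frame index ranges, the candidate shape is a piecewise quadratic pinned to the contour at both ends, and the distance is taken to be the mean squared difference. The names `Fit`, `shape_value` and `best_fit_sketch` are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Result of the search: chosen start and end frames and their distance.
struct Fit {
    int start;
    int end;
    double error;
};

// Piecewise quadratic shape running from f_start (x = 0) to f_end (x = 1).
inline double shape_value(double x, double f_start, double f_end)
{
    const double amp = f_start - f_end;
    if (x < 0.5)
        return f_end + amp - 2.0 * amp * x * x;
    return f_end + 2.0 * amp * (1.0 - x) * (1.0 - x);
}

// Try every start frame in [s_lo, s_hi] with every end frame in
// [e_lo, e_hi] and keep the pair whose synthesized shape is closest to
// the contour (assumed distance: mean squared difference).
Fit best_fit_sketch(const std::vector<double> &f0,
                    int s_lo, int s_hi, int e_lo, int e_hi)
{
    Fit best{-1, -1, std::numeric_limits<double>::infinity()};
    const int n = static_cast<int>(f0.size());
    for (int s = s_lo; s <= s_hi; ++s) {
        for (int e = e_lo; e <= e_hi; ++e) {
            if (s < 0 || e >= n || e <= s)
                continue;  // skip candidates outside the contour
            double sum = 0.0;
            for (int i = s; i <= e; ++i) {
                const double x = static_cast<double>(i - s) / (e - s);
                const double d = f0[i] - shape_value(x, f0[s], f0[e]);
                sum += d * d;
            }
            const double err = sum / (e - s + 1);
            if (err < best.error)
                best = Fit{s, e, err};
        }
    }
    return best;
}
```

The cost grows with the product of the two region sizes, which is one reason to keep the search regions small relative to the label duration.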

The final result of the matching process is a relation of events,
each with the 6 RFC parameters described above.

The stand alone program \ref tilt-analysis_manual produces an RFC analysis
given a label file and F0 contour. The function \ref rfc_analysis takes an
F0 contour, a relation of events and a
set of options and returns the RFC parameters in the features of each
item in the relation.

## RFC to Tilt Conversion {#rfc2tilt}

The rise and fall RFC parameters can be converted to Tilt parameters
using the following equations.

*Tilt* can be measured with respect to amplitude:

\f[ tilt_{amp} = \frac{ \left | A_{rise} \right | -
                 \left | A_{fall}\right |}{
                 \left | A_{rise} \right | +
                 \left | A_{fall}\right |} \f]

and with respect to duration:

\f[ tilt_{dur} = \frac{ D_{rise} - D_{fall}}{ D_{rise} + D_{fall}} \f]

The tilt model assumes that these are strongly correlated so that an
average of the two is representative of the shape of the event:

\f[ tilt = \frac{ \left | A_{rise} \right | -
           \left | A_{fall}\right |}{
           2 \left (\left | A_{rise} \right | +
           \left | A_{fall}\right | \right )} +
           \frac{ D_{rise} - D_{fall}}{ 2 ( D_{rise} + D_{fall})} \f]

*Amplitude* is the sum of the magnitudes of the rise and fall amplitudes:

\f[ A_{event} = \left | A_{rise} \right | + \left | A_{fall} \right | \f]

*Duration* is the sum of the rise and fall durations:

\f[ D_{event} = D_{rise} + D_{fall} \f]
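
As a rough illustration, the conversion equations of this section can be written out directly. The structs and the function `rfc_to_tilt_sketch` below are hypothetical stand-ins used only for this example; they are not the EST `rfc_to_tilt` functions, which operate on features and relations. Treating a zero denominator as contributing 0 is an assumption of the sketch.

```cpp
#include <cmath>

// Hypothetical plain structs for illustration only.
struct RfcShape {
    double rise_amp;  // Hz
    double rise_dur;  // seconds
    double fall_amp;  // Hz (only the magnitude is used)
    double fall_dur;  // seconds
};

struct TiltShape {
    double amp;   // Hz
    double dur;   // seconds
    double tilt;  // dimensionless, between -1 and 1
};

// Convert the four rise/fall values to amplitude, duration and tilt.
// A zero denominator (no amplitude or no duration at all) is treated as
// contributing 0 to the tilt; this is an assumption of the sketch.
inline TiltShape rfc_to_tilt_sketch(const RfcShape &r)
{
    const double a_rise = std::fabs(r.rise_amp);
    const double a_fall = std::fabs(r.fall_amp);

    TiltShape t;
    t.amp = a_rise + a_fall;
    t.dur = r.rise_dur + r.fall_dur;

    const double tilt_amp = (t.amp > 0.0) ? (a_rise - a_fall) / t.amp : 0.0;
    const double tilt_dur =
        (t.dur > 0.0) ? (r.rise_dur - r.fall_dur) / t.dur : 0.0;

    t.tilt = 0.5 * tilt_amp + 0.5 * tilt_dur;
    return t;
}
```

For instance, a rise of 30 Hz over 0.2 seconds followed by a fall of 10 Hz over 0.1 seconds gives an amplitude of 40 Hz, a duration of 0.3 seconds and a tilt of about 0.42. An event with only a rise part has tilt 1, and one with only a fall part has tilt -1.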

There is no stand alone program to do this conversion, but the
\ref tilt-analysis_manual program can carry it out after
performing the RFC matching as described above.

The function \ref rfc_to_tilt takes a relation
containing RFC parameterised items and converts it to a relation
containing Tilt parameterised items.

Another function, also called \ref rfc_to_tilt, takes a
Features object containing the 4 rise fall parameters and writes the 3
tilt parameters into another features object. This function can be
used to do rfc_to_tilt conversion for a single event.

## Tilt to RFC Conversion {#tilt2rfc}

The Tilt parameters can be converted to RFC parameters using the
following equations.

\important Rise amplitude:
\anchor tilt-rise-amplitude
\f[
A_{rise} = \frac{A_{event} (1 + tilt)}{2}
\f]

\important Fall amplitude:
\anchor tilt-fall-amplitude
\f[
A_{fall} = \frac{A_{event} (1 - tilt)}{2}
\f]

\important Rise duration:
\anchor tilt-rise-duration
\f[
D_{rise} = \frac{D_{event} (1 + tilt)}{2}
\f]

\important Fall duration:
\anchor tilt-fall-duration
\f[
D_{fall} = \frac{D_{event} (1 - tilt)}{2}
\f]

There is no stand alone program to do this conversion, but the
\ref tilt-synthesis_manual program can carry it out as part of
generating an F0 contour.

The function \ref tilt_to_rfc takes a relation
containing Tilt parameterised items and converts it to a relation
containing RFC parameterised items.

Another function, also called \ref tilt_to_rfc, takes a
Features object containing the 3 Tilt parameters and writes the 4 rise
fall RFC parameters into another features object. This function can be
used to do tilt_to_rfc conversion for a single event.

## RFC to F0 Synthesis {#ov-rfc-to-tilt}

An F0 contour can be generated from a set of RFC parameters using the
following equations.

Events are generated as piecewise combinations of quadratic functions:

\f[
\begin{array}{ll}
f_0(t) = A_{abs} + A - 2 A \cdot (t/D)^2 & 0 < t < D/2 \\
f_0(t) = A_{abs} + 2 A \cdot (1-t/D)^2 & D/2 < t < D
\end{array}
\f]

Between events, straight lines are used:

\f[
f_0(t) = A_{abs} + A \cdot (t/D) ~~ 0 < t < D
\f]

The stand alone program \ref tilt-synthesis_manual takes an
RFC label file as input and produces an F0 file. This program can also
generate an F0 file directly from a Tilt label file.

The function \ref rfc_synthesis takes a relation
containing RFC parameterised items and produces an F0 contour in a
track.

The function \ref synthesize_rf_event takes a Features
object containing the 4 rise fall RFC parameters and generates the F0
contour for a single event.
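
The synthesis equations of this section can be evaluated with a short sketch. It reads A as the amplitude of a rise or fall part, D as its duration and A_abs as a reference F0 level; these readings are assumptions, since the symbols are not defined in the text. The function names are hypothetical and this is not the EST `rfc_synthesis` code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One rise or fall part, following the piecewise quadratic equations.
// As written, the value is a_abs + amp at t = 0 and a_abs at t = dur.
inline double quadratic_part(double t, double a_abs, double amp, double dur)
{
    const double x = t / dur;
    if (x < 0.5)
        return a_abs + amp - 2.0 * amp * x * x;
    return a_abs + 2.0 * amp * (1.0 - x) * (1.0 - x);
}

// Straight-line connection: a_abs at t = 0 and a_abs + amp at t = dur.
inline double linear_connection(double t, double a_abs, double amp, double dur)
{
    return a_abs + amp * (t / dur);
}

// Sample one quadratic part every f_shift seconds, from t = 0 up to dur.
std::vector<double> sample_part(double a_abs, double amp, double dur,
                                double f_shift)
{
    std::vector<double> f0;
    if (dur <= 0.0 || f_shift <= 0.0)
        return f0;
    // Small epsilon guards against rounding just below a whole frame count.
    const int n = static_cast<int>(std::floor(dur / f_shift + 1e-9));
    for (int i = 0; i <= n; ++i) {
        const double t = std::min(i * f_shift, dur);
        f0.push_back(quadratic_part(t, a_abs, amp, dur));
    }
    return f0;
}
```

The two quadratic halves meet at t = D/2 with the same value, A_abs + A/2, so the synthesized part is continuous at the midpoint.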

# Executable Programs

- \ref tilt-analysis_manual: Produces a Tilt or RFC analysis of an
  F0 contour, given a label file containing a set of approximate
  intonational event boundaries.
- \ref tilt-synthesis_manual: tilt_synthesis generates an F0 contour,
  given a label file containing parameterised Tilt or RFC events.
- \ref pda_manual: Generates F0 contours