ZeroSpeech 2019: TTS without T

Task and intended goal

Young children learn to talk long before they learn to read and write. They can conduct a dialogue and produce novel sentences, without being trained on an annotated corpus of speech and text or aligned phonetic symbols. Presumably, they achieve this by recoding the input speech in their own internal phonetic representations (proto-phonemes or proto-text), which encode linguistic units in a speaker-invariant representation, and use this representation to generate output speech.

Duplicating the child’s ability would be useful for the thousands of so-called low-resource languages, which lack the textual resources and/or linguistic expertise required to build a traditional speech synthesis system. The ZeroSpeech 2019 challenge addresses this problem by proposing to build a speech synthesizer without any text or phonetic labels, hence TTS without T (text-to-speech without text). We provide raw audio for the target voice(s) in an unknown language (the Voice Dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery Dataset) and align them to the voice recording in a way that works best for the purpose of synthesizing novel utterances from novel speakers (see Figure 1).

The ZeroSpeech 2019 is a continuation and a logical extension of the sub-word unit discovery track of ZeroSpeech 2017 and ZeroSpeech 2015, as it demands of participants to discover such units, and then evaluate them by assessing their performance on a novel speech synthesis task. We provide a baseline system which performs the task using two off-the-shelf components: (1) a system which discovers discrete acoustic units automatically in the spirit of Track 1 of the Zero Resource Challenges 2015 [1] and 2017 [2], and (2) a standard TTS system.

A submission to the challenge will replace at least one of these systems. Participants can construct their own end-to-end system with the joint objective of discovering sub-word units and producing a waveform from them. Participants can, alternatively, make use of one of the two baseline systems, and improve the other. The challenge is therefore therefore open to ASR-only systems which make a contribution primarily to unit discovery, focusing on improving the embedding evaluation scores (see Evaluation metrics). Vice versa, the challenge is open to TTS-only systems which concentrate primarily on improving the quality of the synthesis on the baseline sub-word units. (All submissions must be complete, however, including both resynthesized audio and the embeddings used to generate them.)


Figure 1a. General outline of the ZeroSpeech 2019 Challenge. The aim is to construct a speech synthesizer which is conditioned, not on text, but on automatically discovered discrete subword units. View this as a voice conversion autoencoder with a discrete bottleneck (the input is speech from any speaker, the hidden representation is discrete, the output is speech in a target voice). We evaluate both the quality of the resynthesized wave file and the compacity of the intermediate discrete code.

This challenge is being prepared with the intention of running as a special session in Interspeech 2019, and operating on CodaLab. As with the other two challenges, it relies exclusively on open source software and datasets. It encourages participants to submit their code or systems. It will remain open after the workshop deadline, although human evaluation of the synthesis systems will only be rerun if there are enough new registrations.


We provide datasets for two languages, one for development, one for test.

The development language is English. The test language is a surprise Austronesian language for which much fewer resources are available. Only the development language is to be used for model development and hyperparameter tuning. The exact same model and hyperparameters must be used to train on the test language. In other words, the results on the surprise language must be the output of applying exactly the same training procedure as the one applied to the development corpus; all hyperparameter selection must be done beforehand, on the development corpus, or automated and integrated into training. The goal is to build a system that generalizes out of the box, as well as possible, to new languages. Participants will treat English as if it were a low-resource language, and refrain from using preexisting resources (ASR or TTS systems pretrained on other English datasets are banned). The aim is to only use the training datasets provided for this challenge. Exploration and novel algorithms are to be rewarded: no number hacking here!

Four datasets are be provided for each language:

  • the Voice Dataset contains one or two talkers, for around 2h of speech per talker. It is intended to build an acoustic model of the target voice for speech synthesis.

  • The Unit Discovery Dataset contains read text from 100 speakers, with around 10 minutes talk from each speaker. These are intended to allow for the construction of acoustic units.

  • The Optional Parallel Dataset contains around 10 minutes of parallel spoken utterances from the Target Voice and from other speakers. This dataset is optional. It is intended to fine tune the task of voice conversion. Systems that use this dataset will be evaluated in a separate ranking.

  • The Test Dataset contains new utterances by novel speakers. For each utterance, the proposed system has to generate a transcription using discovered subword units, and resynthesize the utterance into a waveform in the target voice(s), only using the computed transcription.

The Voice Dataset in English contains two voices (one male, one female). This is intended to cross-validate the TTS part of the system on these two voices (to avoid overfitting).

# speakers

# utterances

Duration (VAD)

Development Language (English)

Train Voice Dataset

1 Male



1 Female



Train Unit Dataset


5941 [16-80 by speaker]

15h40 [2-11min by speaker]

Train Parallel Dataset

10+ *

1 male



1 female



Test Dataset


455 [6-20 by speaker]

28min [1-4min by speaker]

Test Language (surprise Austronesian)

Train Voice Dataset

1 Female



Train Unit Dataset


15340 [81-164 by speaker]


Train Parallel Dataset

15 + 1 Female



Test Dataset


405 [10-30 by speaker]

29min [1-3min by speaker]


For the Train Parallel Dataset, there are 10 new speakers reading phrases from each speaker from the Train Voice Dataset.


Figure 1b. Summary of the datasets used in the ZeroSpeech2019 Challenge. The Unit Discovery dataset provides a variety of speakers and is used for unsupervised acoustic modeling (i.e., to discover speaker-invariant, language-specific subword discrete units–symbolic embeddings). The Voice dataset is used to train the synthesizer. The Parallel dataset is for fine tuning the two subsystems. These datasets consist in raw waveforms (no text, no label beyond speaker ID).

Evaluation metrics

Participants are requested to run each test sentence in their system, and extract two type of representations. At the very end of their pipeline, participants should collect a resynthesized wav file. In the middle of the pipeline, participants should extract a sequence of vectors corresponding to the system’s embedding of the test sentence at the entry point to the synthesis component. The embedding which is passed to the decoder must be submitted. Additionally, two additional representations may be submitted for evaluation, from earlier or later steps in the system’s pipeline: see Auxiliary embeddings for more details.

The most general form for the embedding is a real valued matrix, with the rows corresponding to the time axis and the columns to the dimensions of a real valued vector. The low bit-rate constraint (see below) will militate strongly in favour of a small, finite set of quantized units. One possibility, similar to the symbolic “text” in a classical TTS system, is a one-hot representation, with each “symbol” being coded as a 1 on its own dimension. Representations are not limited to one-hot representations, as many systems (for example, end-to-end DNN systems) may wish to take useful advantage of the internal structure of the numerical embedding space (but, to reduce the bitrate, it is nevertheless advisable to quantize to a small discrete subset, of numerical “symbols”). Conceptually, a d-dimensional vector submitted as an embedding may thus be of at least two types:

  • \(d\)-dimensional 1-hot vectors (zeroes everywhere except for a 1 in one dimension) representing the :math:-d discrete “symbols” discovered by the unit discovery system

  • \(d\)-dimensional continuous valued embedding vectors, representing some transformation of the acoustic input

The file format, however, does not distinguish between these cases. The embedding file will be provided in simple text format (ASCII), one line per row (separated by newlines), each column being a numerical value, columns separated by exactly one space (see Output Preparation). The number of rows is not fixed: while you may use a fixed frame rate, no notion of “frame” is presupposed by any of the evaluations, so the length of the sequence of vectors for a given file can be whatever the encoding component of the system deems it to be. The matrix format is unaffected by this choice of representation, and the evaluation software will be applied irrespective of this distinction.

For the bitrate computation, the vectors will be processed as a single character string: a dictionary of all of the possible values of the symbolic vectors will be constructed over the entire test set, and the index to the dictionary will be used in place of the original value. This will be used to calculate the number of bits transmitted over the duration of the test set, by calculating the entropy of the symbolic vector distribution (average number of bits transmitted per vector) and multiplying by the total number of vectors. Be careful! We use the term symbolic vectors to remind you that the vectors should ideally belong to a small dictionary of possible values. We will automatically construct this dictionary using the character strings as they appear in the file. This means that very close vectors ‘1.00000000 -1.0000000’ and ‘1.00000001 -1.0000001’ will be considered distinct. (For that matter, even ‘1 1’ and ‘1.0 1.0’ will be considered different, because the character strings are different). Therefore, make sure that the numeric values of the vectors are formatted consistently, otherwise you will be penalized with a higher bitrate than you intended.

The two sets of output files (embedding and waveform) will be evaluated separately. The symbolic output will be analyzed using two metrics:

Bit rate.

Here, we assume that the entire set of audio in the test set corresponds to a sequence of vectors \(U\) of length \(P\): \(U=[s_1, ..., s_P]\). The bit rate for \(U\) is then \(B(U)=\frac{P. \sum_{i=1}^{P} p(s_i)\,log_{2}\,p(s_i)}{D}\), where \(p(s_{i})\) is the probability of symbol \(s_i\). The numerator of \(B(U)\) is \(P\) times the entropy of the symbols, which gives the ideal number of bits needed to transmit the sequence of symbols \(s_{1:P}\). In order to obtain a bitrate, we divide by \(D\), the total duration of \(U\) in seconds. Note that a fixed frame rate transcription may have a higher bitrate than a ‘textual’ representation due to the repetition of symbols across frames. For instance, the bit rate of a 5 ms framewise gold phonetic transcription is around 450 bits/sec and that of a ‘textual’ transcription is 60 bits/sec. This indicates that there is a tradeoff between the bitrate and the detail of the information provided to the TTS system, both in number of symbols and in temporal information.

Unit quality.

Since it is unknown whether the discovered representations correspond to large or small linguistic units (phone states, phonemes, features, syllables, etc.), we evaluate this with a theory-neutral score, a machine ABX score, as in the previous Zero Resource challenges [23]. For a frame-wise representation, the machine-ABX discriminability between ‘beg’ and ‘bag’ is defined as the probability that A and X are closer than B and X, where A and X are tokens of ‘beg’, and B a token of ‘bag’ (or vice versa), and X is uttered by a different speaker than A and B. The choice of the appropriate distance measure is up to the researcher. In previous challenges, we used by default the average frame-wise cosine divergence of the representations of the tokens along a DTW-realigned path. The global ABX discriminability score aggregates over the entire set of minimal pairs like ‘beg’-‘bag’ to be found in the test set. We provide in the evaluation package the option of instead using a normalized Levenshtein edit distance. We give ABX scores as error rates. The ABX score of the gold phonetic transcriptions is 0 (perfect), since A, B and X are determined to be “the same” or “different” based on the gold transcription.


Because the symbolic embeddings submitted in this challenge typically do not have durations or temporal alignment, we cannot compute the ABX scores in the same way as in the previous two challenges (which were relying on alignements to extract the features of triphones like ‘beg’ or ‘bag’ from entire utterances). Here, we systematically extracted all the triphones as waveforms and provided them as small files (in addition to the original uncut sentence). This is why the test dataset contains many small files.

Synthesis accuracy and quality.

The synthesized wave files will be evaluated by humans (native speakers of the target language) using three judgment tasks.

  1. Judges will evaluate intelligibility by transcribing orthographically the synthesized sentence. Each transcription will then be compared with the gold transcription using the edit distance, yielding a character error rate (character-wise, to take into account possible orthographic variants or phone-level susbtitutions).

  2. Judges will evaluate the speaker similarity using a subjective 1 to 5 scale. They will be presented with a sentence from the original target voice, another sentence by the source voice, and then yet another resynthesized sentence by one of the system. A response of 1 means that the resynthesized sentence has a voice very similar to the target voice, 5 that it has a voice very similar to the source.

  3. Judges will evaluate the overall quality of the synthesis on a 1 to 5 scale, yielding a Mean Opinion Score (MOS).

For the development language, we will provide all of the evaluation software (bitrate and ABX), so that participants can test for the adequacy of their full pipeline. We will not provide human evaluations, but we will provide a mock experiment reproducing the conditions of the human judgemnts, that participants can run on themselves, so that they can get a sense of what the final evaluation on the test set will look like.

For the test language, only audio will be provided (plus the bitrate evaluation software). All human and ABX evaluations will be run automatically when the participants will have submitted their annotations and wav files through the submission portal. An automatic leaderboard will then appear on the challenge’s website.

Baseline, topline and leaderboard

As with the other ZR challenges, the primary aim is to explore interesting new ideas and algorithms, not to simply push numbers at all costs. Our leaderboard will therefore reflect this fact, as TTS without T is first and foremost a problem of tradeoff between the bit rate of the acoustic units and the quality of the synthesis. At one extreme, a voice conversion algorithm would obtain excellent synthesis, but at a cost of a very high bit rate. At the other extreme, a heavily quantized intermediate representation could provide a very low bit rate, at a cost of a poor signal reconstruction. Therefore the systems will be put on a 2D graphs, with bit rate on the x axis, and synthesis quality on the y axis (see Figure 2).


Figure 2. 2D leaderboard, showing the potential results of the topline, baseline and submitted systems in terms of synthesis quality against bitrate. The quantization examples illustrate the kinds of tradeoffs that can arise between quality of synthesis and bitrate of the embedding.

A baseline system will be provided, consisting of a pipeline with a simple acoustic unit discovery system based on DPGMM, and a speech synthesizer based on Merlin. A topline system will also be provided, which will consist in an ASR system (Kaldi) piped to a trained TTS system (Merlin), both trained using supervision (the gold labels). The baseline system will be provided in a virtual machine (the same one containing the evaluation software for the dev language), together with the datasets upon registration.

The Challenge is run on Codalab. Participants should register by creating a Codalab account and then downloading the datasets and virtual machine (see Getting started). Each team should submit the symbolic embeddings on which they condition the waveform synthesis part of their model, as well as the synthesized wavs for each test sentence on both dev and test languages. Regarding the embeddings, we will have to trust the participants that this is the ONLY INFORMATION that they use to generate the waveform. (Otherwise, this challenge becomes ordinary voice transfer, not TTS.)

For reproducibility purposes (and ability to check that the model truly runs on the provided discrete embeddings), we strongly encourage submission of the code or binary executable or virtual machine for the proposed system. The leaderboard will indicate whether the submitted paper has included their code or not with an “open source/open science” tag. Because we have to run the evaluations on human annotators, each team is only allowed two submissions maximum. The evaluation will be returned to each team by email, and the leaderboard published once the deadline of paper submission is over.


This challenge has been proposed as an Interspeech special session to be held during the conference (outcome known Jan 1). Attention: due to the time needed for us to run the human evaluations on the resynthesized waveforms, we will require that these waveforms be submitted to our website through Codalab 2 weeks before the Interspeech official abstract submission deadline.


Dec 15 2018

Release of the website (English datasets and evaluation code)

Feb 4 2019

Release of the Austronesian datasets (test language) and codalab competition

March 15 2019

Wavefile Submission Deadline! (codalab)

March 25 2019

Outcome of human evaluations+ Leaderboard

March 29 2019

Abstract submission deadline

April 5 2019

Final paper submissions deadline

June 17 2019

Paper acceptance/rejection notification

July 1, 2019

Camera-ready paper due

Sept 15-19 2019

Interspeech conference.

Scientific committee

  • Laurent Besacier

    • LIG, Univ. Grenoble Alpes, France

    • Automatic speech recognition, processing low-resourced languages, acoustic modeling, speech data collection, machine-assisted language documentation

    • email: laurent.besacier at,

  • Alan W. Black

  • Ewan Dunbar

  • Emmanuel Dupoux

    • Ecole des Hautes Etudes en Sciences Sociales, Paris

    • Computational modeling of language acquisition, psycholinguistics, unsupervised learning of linguistic units

    • email: emmanuel.dupoux at,

  • Lucas Ondel

    • University of Brno,

    • Speech technology, unsupervised learning of linguistic units

    • email: iondel at

  • Sakti Sakriani

    • Nara Institute of Science and Technonology (NAIST)

    • Speech technology, low resources languages, speech translation, spoken dialog systems

    • email:ssakti at,

Challenge Organizing Committee

  • Ewan Dunbar (Organizer)

    Researcher, EHESS,

  • Emmanuel Dupoux (Coordination)

    Researcher, EHESS & Facebook,

  • Sakti Sakriani (Datasets)

    Researcher, Nara Institute of Science and Technonology (NAIST) ssakti at

  • Xuan-Nga Cao (Datasets)

    Research Engineer, ENS, Paris, ngafrance at

  • Mathieu Bernard (Website & Submission)

    Engineer, INRIA, Paris, mathieu.a.bernard at

  • Julien Karadayi (Website & Submission)

    Engineer, ENS, Paris, julien.karadayi at

  • Juan Benjumea (Datasets & Baselines)

    Engineer, ENS, Paris, jubenjum at

  • Lucas Ondel (Baselines)

    Graduate Student, University of Brno & JHU,

  • Robin Algayres (Toplines)

    Engineer, ENS, Paris, rb.algayres at


The ZeroSpeech2019 challenge is funded through Facebook AI Research and a MSR Grant. It is endorsed by SIG-UL, a joint ELRA-ISCA Special Interest Group on Under-resourced Languages.



The ZeroSpeech2019 challenge is hosted on Codalab, an open-source web-based platform for machine learning competitions.



Previous challenges

  • ZeroSpeech 2015 [1]: Versteegh, M., Thiollière, R., Schatz, T., Cao, X.N., Anguera, X., Jansen, A. & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. In INTERSPEECH-2015, (pp 3169-3173). .. _zs15-paper:

  • Versteegh, M., Anguera, X., Jansen, A. & Dupoux, E. (2016). The Zero Resource Speech Challenge 2015: Proposed Approaches and Results. In SLTU-2016 Procedia Computer Science, 81, (pp 67-72).

  • ZeroSpeech 2017 [2]: Dunbar, E., Xuan-Nga, C., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X. & Dupoux, E. (2017). The Zero Resource Speech Challenge 2017. In ASRU-2017.(pp 323-330).

Preliminary work on TTS without T

  • [3] Muthukumar, P. K., & Black, A. W. (2014). Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. In Proceedings of ICASSP, (pp. 2594-2598).

  • [4] Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stüker, S., Godard, P., Müller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L. & Dupoux, E. (2018). Linguistic unit discovery from multimodal inputs in unwritten languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop. In ICASSP-2018.

Unit Discovery

  • [5] Lee, C., & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers Volume 1 (pp. 40-49).

  • [6] Ondel, L., Burget, L., & Černocký, J. (2016). Variational inference for acoustic unit discovery. Procedia Computer Science, 81, 80-86.

  • [7]: Badino, L., Canevari, C., Fadiga, L., & Metta, G. (2014). An auto-encoder based approach to unsupervised learning of subword units. in ICASSP.

  • [8] Myrman, A.F., & Salvi, G. (2017). Partitioning of Posteriorgrams Using Siamese Models for Unsupervised Acoustic Modelling. International Workshop on Grounding Language Understanding, ISCA.pp 27-31.

  • [9] Heck, M., Sakti, S., & Nakamura, S. (2016). Unsupervised Linear Discriminant Analysis for Supporting DPGMM Clustering in the Zero Resource Scenario. Procedia Computer Science, 81, 73–79.

  • [10] Jansen, A., Thomas, S., & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. In ICASSP (pp. 8091-8095).

  • Synnaeve, G., Schatz, T & Dupoux, E. (2014). [Phonetics embedding learning with side information][ref06]. in IEEE:SLT.

  • Renshaw, D., Kamper, H., Jansen, A., & Goldwater, S. (2015). A Comparison of Neural Network Methods for Unsupervised Representation Learning on the Zero Resource Speech Challenge. In Sixteenth Annual Conference of the International Speech Communication Association.

  • Tsuchiya, T., Tawara, N., Ogawa, T., and Kobayashi, T. (2018), Speaker Invariant Feature Extraction for Zero-Resource Languages with Adversarial Learning, in ICASSP (pp.2381–2385).


  • Merlin [11]. Wu, Z., Watts, O., & King, S. (2016). Merlin: An open source neural network speech synthesis system. Proceedings ISCA Speech Synthesis Workshop, 9, 202-207.

  • Wavenet [12]. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N. Senior, A.W., Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In SSW, p. 125.

  • SampleRN [13]: Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. (2016). SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837.

  • Tacotron 2 [14]: Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen et al. (2017). Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions.” arXiv preprint arXiv:1712.05884.

  • DeepVoice 3 [15]: Ping, Wei, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller (2018). Deep voice 3: Scaling text-to-speech with convolutional sequence learning.

  • VoiceLoop [16]: Taigman, Y., Wolf, L., Polyak, A., & Nachmani, E. (2018). VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. ICLR2018.

  • Transformer TTS [17]: Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., & Zhou, M. (2018). Close to Human Quality TTS with Transformer. arXiv preprint arXiv:1809.08895.

Speech Chain and Cycle GANS

  • [18] Tjandra, Andros, Sakriani Sakti, and Satoshi Nakamura. “Listening while speaking: Speech chain by deep learning.” In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE, pp. 301-308. IEEE, 2017. Tjandra, Andros, Sakriani Sakti, and Satoshi Nakamura. “Machine Speech Chain with One-shot Speaker Adaptation.” arXiv preprint arXiv:1803.10525 (2018).

  • [19] Kaneko and Kameoka, H. (2017). Parallel-data-free voice conversion using cycle-consistent adversarial networks, arXiv preprint arXiv:1711.11293. Fang, F., Yamagishi, J., Echizen, I. and Lorenzo-Trueba, J., (2018). High-quality nonparallel voice conversion based on cycle-consistent adversarial network. arXiv preprint arXiv:1804.00425.

Voice Conversion

  • [20] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, (2016). Voice conversion from non-parallel corpora using variational auto-encoder,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–6.

  • [21] Chou, Ju-chieh, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee. “Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations.” arXiv preprint arXiv:1804.02812 (2018).

  • [22] Y. Gao, R. Singh, and B. Raj, “Voice impersonation using generative adversarial networks”,[arXiv preprint arXiv:1802.06840][ref_gao], 2018.

Evaluation metrics

  • [23] Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H., & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline. In Interspeech (pp. 1781–1785). Schatz, Thomas. ABX-discriminability measures and applications.. PhD diss., Paris 6, 2016.