I definitely like taking my time, which is why it has taken me this long – almost a week actually, as I’ve just realized – to sum up my impressions from last Friday’s #AIIC talk on Automated Speech Translation, aptly titled “2020: A Speech Odyssey”. But give me tea, time, and a good soundtrack, and I will eventually make myself sit down to write.
The first part, presented by Jan Niehues, focused on the approaches to automated speech translation – the basic technology and the main algorithms involved – before moving on to the current challenges and a very practical demonstration by the second guest speaker, William Lewis.
The basic technology is very straightforward, in principle at least.
First, you have ASR, automatic speech recognition. Then comes machine translation (MT), and then the resulting text is converted to speech (TTS, Text-to-Speech). The technology is there, it is constantly being improved, and systems combining several of these components already exist. But it’s not all settled yet.
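For readers who like to see things spelled out, here is a minimal sketch of what such a cascaded pipeline looks like in code. The three “engine” functions are placeholders I have made up purely for illustration – not any particular vendor’s API:

```python
# A toy cascade: ASR -> MT -> TTS. Each "engine" below is a stand-in
# returning a placeholder value; a real system would call actual speech
# recognition, translation, and synthesis services at these points.

def recognize_speech(audio: bytes) -> str:
    # Hypothetical ASR step: audio in, source-language text out.
    return "placeholder transcript"

def translate_text(text: str, src: str, tgt: str) -> str:
    # Hypothetical MT step: source text in, target-language text out.
    return f"[{tgt}] placeholder translation of: {text}"

def synthesize_speech(text: str, language: str) -> bytes:
    # Hypothetical TTS step: target text in, audio out.
    return text.encode("utf-8")  # pretend these bytes are audio

def speech_to_speech(audio: bytes, src: str = "en", tgt: str = "ru") -> bytes:
    transcript = recognize_speech(audio)                  # 1. ASR
    translation = translate_text(transcript, src, tgt)    # 2. MT
    return synthesize_speech(translation, language=tgt)   # 3. TTS
```

The interesting part is not the three calls themselves, but everything the rest of this post is about: what happens when the transcript is full of hesitations, when the punctuation is missing, and when you cannot afford to wait for the end of the sentence.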
The challenges are there, and they are many.
Semantics, punctuation, and, quite frankly, the very way we humans speak were cited time and time again as the main challenges holding back ASR & MT progress and performance. We hesitate, we use all sorts of uhms and ers, we repeat ourselves and then take our words back, we make endless corrections, and we use phrases like “you know”, making it nearly impossible for the machine to figure out what to do with them – and with us.
We are, it turns out, not exactly machine friendly.
And then there is the whole matter of context, and the impossible conundrum: wait as long as possible to gather more context, since context improves both speech recognition and MT, or generate the translation as soon as possible, since low latency is key to user experience.
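One common way of making that trade-off explicit – at least in the research on simultaneous MT that I’m aware of, not something discussed in detail at the talk – is a fixed-delay (“wait-k”) streaming policy: hold the output back until k source words have arrived, then emit roughly one target word for each new source word. The sketch below is only a toy, and translate_prefix is a made-up stand-in for a real incremental MT system:

```python
from typing import Iterable, Iterator, List

def translate_prefix(source_words: List[str], emitted: int) -> str:
    # Stand-in for incremental MT: a real system would translate the
    # prefix seen so far and return the next not-yet-emitted target word.
    return f"<word {emitted + 1} | context: {len(source_words)} source words>"

def wait_k_stream(source_stream: Iterable[str], k: int = 3) -> Iterator[str]:
    seen: List[str] = []
    emitted = 0
    for word in source_stream:
        seen.append(word)
        if len(seen) >= k:                      # enough context gathered
            yield translate_prefix(seen, emitted)
            emitted += 1
    while emitted < len(seen):                  # flush once the speaker stops
        yield translate_prefix(seen, emitted)
        emitted += 1

for chunk in wait_k_stream("people are very disfluent amazingly so".split(), k=3):
    print(chunk)
```

A small k means low latency but little context; a large k means better context but an audience staring at a silent screen – which is exactly the dilemma interpreters resolve by instinct.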
In a way, I almost felt sorry for the poor little AI machines, out there on their own, facing problems very similar to the ones we – the “human interpreters” – have to face ourselves.
And yes, it gives me no pleasure whatsoever to have to insert the word “human” before “interpreters”.
“We have all had many occasions to laugh at MT. But let’s be honest. There have also been moments when we were impressed,” they said at the start of the talk. And yes, in that particular instance – when we were shown an automatic subtitles platform – I was very much impressed.
True, the subtitles were not always ideal – far from it for Russian. But I was nonetheless surprised and, perhaps against my own will, impressed. I was impressed by the quality of the ASR, by the fact that the system didn’t get it altogether wrong all of the time, and by some of the examples of events where this technology had already been tested out. Wine classes were definitely high up on the list of events that surprised me: having worked in the field, I would never have judged that vocabulary an easy one for any ASR or MT system to handle.
It was a revelation.
Such systems do work. Not always, but still, progress is evident.
True, there are many challenges, a significant portion of them linked, once again, to the fact that human speech is, well, human. To quote the speaker, “People are very disfluent. Amazingly so. Our minds are very good at filtering these disfluencies – but we wouldn’t know what to do with them if we saw them in a transcript.” Which is why these automated subtitles are often so awkward to read. Nor do we use punctuation when we speak. And that does not help those poor machines either.
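Just to show how crude the obvious fix is, here is a toy filter that strips a handful of common English fillers from a transcript. Real systems model disfluencies far more carefully than a regular expression ever could; this is only meant to illustrate why raw transcripts read so awkwardly even after a bit of cleanup:

```python
import re

# Toy disfluency filter: removes a few common English fillers.
# A real system models disfluencies statistically; this is purely illustrative.
FILLERS = re.compile(r"\b(um+|uh+|er+|you know|I mean)\b,?\s*", re.IGNORECASE)

def clean_transcript(text: str) -> str:
    cleaned = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(clean_transcript("So, um, we were, you know, er, going to start earlier"))
# -> "So, we were, going to start earlier" – cleaner, but the stray comma
#    and the missing sentence punctuation still make it awkward to read.
```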
Another challenge arises from the very nature of normal human communication – and dialogue: “People want to be able to interrupt. To talk freely. And then the machine needs to honour that.” But it rarely can. Which is why it still has a long road ahead of it. It needs to learn the pragmatics of real-life conversation, and what interrupting and turn-taking mean.
Other common issues include gender mismatches (for instance, when going from English into Russian), register, including politeness (particularly important for languages like Japanese and Korean, but also for Russian and French, where you need to differentiate between “tu” and “vous”), and intonation. These things are hard to model, and they are just some of the issues still faced by ASR and the MT that follows it.
But, once again, on some level these systems do work, and humans have proven to be “surprisingly adaptive” when it comes to bearing with those mistakes, mismatches, and mistranslations. Which essentially means that the bar for human interpretation needs to be raised even higher.
As our moderator and colleague Monika Kokoszycka summed it up, things are changing and evolving whether we want it or not. And it probably makes more sense to at least follow these changes, so as to understand them and deal with them better.
On a personal note, I am not fully convinced that the technology is entirely there yet, nor do I think the time has come to talk about a “division of labour” between AI and “human interpreters”. But memory aids are already out there, as are those subs.
I am all for reading. And research. And being prepared and informed. Which is why I am hoping to attend the next event in the series, and why I am so grateful to our colleagues from AIIC UK & Ireland for organizing them in the first place.
And here, once again, is some Bowie to go with. There, at least, you can’t go wrong.
Featured image from this BBC article on the mystery of that legendary film.