Khalil Sima'an: a translation machine that really understands language


How do you translate the Dutch word gezellig into English? Cosy is probably the first thing that comes to mind but it doesn’t really convey the full meaning. And take the Italian word culaccino, for example, which refers to the mark that a cold glass leaves on a table. Or the Portuguese word saudade, which can best be described as a combination of melancholy and yearning.

These are words which only exist in a specific language and which, as a result, are extremely hard to translate. People who have these words in their mother tongue understand what they mean because of the experiences they have had in that language. If you are Dutch you will, for example, have come across countless situations that were described as gezellig. 

‘Understanding a language is a process that is embedded in a whole host of other experiences that we have as humans,’ says Khalil Sima'an, leader of the Statistical Language Processing and Learning lab.

Fascination with language

His fascination with language started right after he finished his studies in Computer Science at the University of Amsterdam back in the early nineties. ‘Take ambiguity, for example. Relatively simple messages can be full of implicit meanings. "Enjoy that cookie", for example, may sound like a neutral message but you can also say it in a way that makes the other person feel guilty. This can result in miscommunication.

We humans are generally pretty good at reading between the lines and understanding the implicit message of particular words.' But when you’re “translating” (compiling) computer programs to machine code, these kinds of ambiguities are out of the question.

Sima'an: ‘If it’s not completely clear what is meant by a particular programming construct, then the program does not compile correctly. How is it then that, when it comes to language, we humans know which meaning is the most plausible?'

The answer lies in our experiences. Sima'an gives an example. ‘Take a bilingual four-year-old, for example. They have no problems translating for their grandma even though they haven’t specifically learnt to do so. This is because the child relates language to the experiences that they have had in their life and has learnt the words that relate to them in two languages.' And that is precisely where translation machines like Google Translate still fall short.

Sima'an: ‘Google Translate is based on a formula that learns from examples. In other words, Google Translate can translate because it has seen millions of examples of translated texts before. But if the text departs from the examples to a substantial degree, Google Translate can get confused, which can sometimes produce some very odd translations.' 
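The idea of learning from examples, and of stumbling when a text departs from them, can be illustrated with a deliberately tiny sketch. This is not how Google Translate works internally; it is a toy word-by-word translator, and the example sentences, function names and the `<word?>` marker are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made example pairs (Dutch -> English).
# Assumption: each pair happens to align word by word, which real text rarely does.
EXAMPLES = [
    ("het werk is klaar", "the work is done"),
    ("het huis is groot", "the house is big"),
    ("werk in uitvoering", "work in progress"),
]


def learn(examples):
    """Count which target word appears at the same position as each source word."""
    counts = defaultdict(Counter)
    for source, target in examples:
        for s_word, t_word in zip(source.split(), target.split()):
            counts[s_word][t_word] += 1
    # Keep the most frequently seen translation for each source word.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def translate(sentence, table):
    """Translate word by word; words never seen in the examples are only marked."""
    return " ".join(table.get(w, f"<{w}?>") for w in sentence.split())


table = learn(EXAMPLES)

# A new sentence built from familiar words works fine.
print(translate("het huis is klaar", table))

# A sentence that departs from the examples leaves the toy translator stuck.
print(translate("een gezellig huis", table))
```

The first sentence recombines words the toy has already seen; the second contains gezellig, which appears in none of the examples, so the sketch can only flag it rather than translate it.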

Under the hashtag #badtranslation, people share examples of badly translated international menus or information boards. For example, a board which should have said something along the lines of ‘work in progress’, said ‘execution in progress’.

A translation machine that really understands language

The reason why Google Translate sometimes totally misses the mark is that the translation machine doesn’t really understand what it is translating. ‘Google Translate isn’t a translator, but it’s pretty good at imitating one.' For example, at the moment Google Translate doesn’t take into account the non-textual context in which a sentence will be used. And that’s what Sima’an and his colleagues want to work on over the next few years.

‘Language evolves; it’s certainly not clear cut. People don't stick to grammatical rules; they use language in their own way, depending on the culture and environment in which they live or in which they have grown up. If a translation machine doesn’t really understand language, in terms of the situation or environment in which language is used, it will continue to make mistakes. And we humans will constantly have to feed translation machines with examples of translated texts.’

Sima'an and his colleagues want the translation machines of the future not only to imitate translators but also to have the translation capabilities of a bilingual child. ‘In other words, I’m working on the basic principles for a future machine which can translate without having seen millions of examples of translated texts.’

But how do you ensure that a robot really understands what certain linguistic expressions mean? You have to expose it to as many human experiences as possible. 

‘Essentially, machines have to become a bit like a person and operate as much as possible among people. Because language is an expression of who we are as people.'

You can expose machines to one form of human experience with the help of images and movies. ‘Take romantic love, for example. You can let a robot experience this by showing it a film or stage performance of Shakespeare’s Romeo and Juliet. That way the robot will learn that the concept of romantic love is often associated with things like blushing, certain glances or certain expressions.’

To enable further research in this field in the years ahead, Sima'an has joined forces with his colleagues in Computer Vision.

‘I’m busy setting up a major project whereby we show computers movies and related text on a daily basis. That way, the computer learns to relate language expressions to the entities, events and properties found in images in the movie.’

Future scenarios

According to Sima'an, the fact that machines can’t yet relate language to experience is a major scientific limitation. This is something that he hopes we will be able to overcome over the coming decades.

He gives an example: ‘Take the chatbots that you get on the line when you call a large company, for example. You talk to them but they either don’t understand everything you say or they remain uncertain about its meaning. It’s no wonder that many people get frustrated by this. But what if these machines were so good that you felt as though you had a person on the line who really understands you? That would make all the difference.’

Working in LAB42

What does Sima'an expect to gain from working in LAB42? ‘Soon there’ll be lots of scientists all working on different aspects of AI together in one building. I hope this will help us make faster progress and that, together, we can develop new and improved language models. 

It would also be great if the links we have with companies that are interested in language processing and language machines become closer as a result. 

As things stand, we scientists are approached by companies that want better chatbots and translation machines, for example, but wouldn’t it be even better if we just met each other at the coffee machine and found that we could be of use to each other?’

About Khalil Sima'an

  • Leader of the Statistical Language Processing and Learning lab at the Institute for Logic, Language and Computation;
  • More information on Khalil’s personal page on the UvA site.