- Posted by Stephen Whiteley
- On 02/05/2011
- Tags: dotcom Willy Wonka, Google Translate, privacy
The inner workings of Google Translate are shrouded in a characteristic mix of Willy Wonka bonhomie and mildly Orwellian obfuscation. Understandably, Google is keen to keep the details of its revolutionary approach to machine translation under wraps, and shares only enough information to give a vague idea of what it is up to.
This video is representative of their approach: fun, cute and colourful, it is only after it has finished that you realise you haven’t actually learned that much.
The key concept here is, of course, ‘statistical machine translation’.
One of the most widely used machine translation tools is SYSTRAN, which was developed during the Cold War and is still used by Yahoo’s Babel Fish, amongst others. SYSTRAN follows a rule-based logic, which is theoretically similar to the way a human learns a language: the program is loaded with basic linguistic rules and vocabularies, which it then uses to translate texts.
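To give a flavour of what "rule-based" means in practice, here is a deliberately tiny sketch in Python. It is not how SYSTRAN is actually implemented: the lexicon, the part-of-speech tags and the single reordering rule are all invented for illustration, and gender agreement is simply hard-coded into the word list.

```python
# Hypothetical hand-written lexicon: English word -> (Spanish word, part of speech).
# Every entry is an assumption made up for this example.
LEXICON = {
    "the": ("la", "DET"),
    "red": ("roja", "ADJ"),
    "white": ("blanca", "ADJ"),
    "house": ("casa", "NOUN"),
}


def translate_rule_based(sentence):
    """Translate English to Spanish using a lexicon and one reordering rule."""
    # Look up each word; unknown words pass through untouched, tagged "UNK".
    tagged = [LEXICON.get(word, (word, "UNK")) for word in sentence.lower().split()]
    output = []
    i = 0
    while i < len(tagged):
        # Rule: English "ADJ NOUN" becomes Spanish "NOUN ADJ".
        if (
            i + 1 < len(tagged)
            and tagged[i][1] == "ADJ"
            and tagged[i + 1][1] == "NOUN"
        ):
            output.extend([tagged[i + 1][0], tagged[i][0]])
            i += 2
        else:
            output.append(tagged[i][0])
            i += 1
    return " ".join(output)


print(translate_rule_based("the red house"))
```

Every new word and every new grammatical quirk means another entry or another rule written by hand, which is the "lot of work by linguists" that a rules-based system demands.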
When Google moved away from SYSTRAN in 2007, research scientist Franz Och explained:
“Most state-of-the-art commercial machine translation systems in use today have been developed using a rules-based approach and require a lot of work by linguists to define vocabularies and grammars. Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model.”
Or, as the video puts it: “Once the computer finds a pattern, it can use this pattern to translate similar texts in the future. When you repeat this process billions of times you end up with billions of patterns and one very smart computer program.”
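For the curious, the "finding patterns" idea can be sketched in a few lines of Python. What follows is IBM Model 1, a classic textbook word-translation model, and not Google's actual system, whose details are not public. The three-sentence corpus and the function names `train_model1` and `best_translation` are invented for illustration.

```python
from collections import defaultdict

# Invented toy corpus of aligned (Spanish, English) sentence pairs.
PARALLEL_CORPUS = [
    ("la casa", "the house"),
    ("la mesa", "the table"),
    ("una mesa", "a table"),
]


def train_model1(corpus, iterations=20):
    """Estimate P(english_word | spanish_word) with IBM Model 1 (EM)."""
    pairs = [(src.split(), tgt.split()) for src, tgt in corpus]
    # Start every co-occurring word pair on an equal footing. Any constant
    # works, because the first pass below renormalises the values.
    t = {(sw, tw): 1.0 for src, tgt in pairs for sw in src for tw in tgt}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each English word among the Spanish words in the
        # same sentence, in proportion to the current probabilities.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    share = t[(sw, tw)] / norm
                    count[(sw, tw)] += share
                    total[sw] += share
        # M-step: turn the shared-out counts back into probabilities.
        t = {(sw, tw): c / total[sw] for (sw, tw), c in count.items()}
    return t


def best_translation(t, source_word):
    """Return the English word with the highest probability for source_word."""
    candidates = {tw: p for (sw, tw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


model = train_model1(PARALLEL_CORPUS)
print(" ".join(best_translation(model, w) for w in "una casa".split()))
```

Nobody tells the program that "casa" means "house". It works that out because "la" also turns up alongside "mesa", which leaves "house" as the best explanation for "casa". Note too that "una casa" never appears in the corpus, yet the model can still have a go at it. Scale the three sentences up to billions of words and you have the gist of the approach Och describes.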
As with N-Grams, it is the staggering volume of data at Google’s disposal which grabs the attention. A significant step forward for Google Translate was the integration of reams of UN documentation, a ready-made six-language corpus. Unlike with N-Grams, however, the source of the rest of the data it uses remains worryingly unclear. Here are two extracts from Google’s Terms of Service:
“… By submitting, posting or displaying the content you give Google a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive license to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute any Content which you submit, post or display on or through, the Services.”
“… You agree that this license includes a right for Google to make such Content available to other companies, organizations or individuals with whom Google has relationships for the provision of syndicated services, and to use such Content in connection with the provision of those services.”
It is hard to establish what the implications of all this are, not least because Google won’t divulge what use it makes of “a perpetual, irrevocable” right to do whatever it wants with anything you do online. What is clear, however, is that Google is much more than a sort of dotcom Willy Wonka.