Educational CyberPlayGround ®

Language Data, Computer & Keyboard Language

First of all, we have Steve Jobs to thank for the fonts and look of type on the personal computer, which supports the look of language.


Since its inception, the Internet's domain-name system has made a point to accommodate only English-language characters. That provision has helped streamline the engineering of the Web, but according to delegates at a recent United Nations summit, it has left speakers of Russian, Arabic, Lao and the like out in the cold. At the meeting, held in Athens, speakers argued that the Web's apparent love of English has marginalized many surfers in developing nations. "I think the digital divide is not as important as the linguistic divide," said Adama Samasskou, president of the African Academy of Languages. Help for non-English speakers may be on the way: Although it took years of work, Web browsers including Mozilla's Firefox and Microsoft's Internet Explorer now support characters from other languages.
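One way software can accept names written in other scripts while the underlying name system still carries only ASCII is to convert each name to an ASCII-compatible form before lookup. The following is a minimal sketch using Python's built-in "idna" codec. The domain bücher.example is a made-up illustration, and the source does not say which mechanism the browsers named above rely on.

```python
# Hypothetical example name; "example" is a reserved, non-resolving domain.
name = "bücher.example"

# Convert the name to its ASCII-compatible form, then back again.
ascii_form = name.encode("idna")
print(ascii_form)                 # b'xn--bcher-kva.example'
print(ascii_form.decode("idna"))  # bücher.example
```

The user sees and types the accented form; only the ASCII form travels through the name system.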

World Creoles

Expert Jeff Allen - Haitian Creole Language Technologies - Language Data Distribution

dotSUB - Any Video Any Language

Multilingual Translation System Receives Over 2 Million Euro in EU Funding
All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by University of Gothenburg, Sweden, receives more than 2 million euro in project support from the EU to create a reliable translation tool that covers a majority of the EU languages.
'It has so far been impossible to produce a translation tool that covers entire languages,' says Aarne Ranta, professor at the Department of Computer Science and Engineering at the University of Gothenburg, Sweden.
Google Translate is a widely used translation programme that gradually improves the quality of its translations through machine learning: the system learns from its own mistakes via feedback, but tries to do without explicit grammatical rules.
In contrast, MOLTO is being developed in the opposite direction, meaning it begins with precision and grammar, while wide coverage comes later. We wanted to work with a translation technique that is so accurate that people who produce texts can use our translations directly. We have now started to move from precision to increased coverage, meaning that we have started to add more languages to the tool and database.
Professor Ranta is the coordinator of the MOLTO (Multilingual On-Line Translation) project, which includes three universities and two companies. The project is to receive 25 million SEK (2.375 million euro) in EU funding over three years. The grant falls in the Machine Translation category, and one requirement has been that the system be developed to include a majority of the EU's official languages.
The technique used in MOLTO is based on type theory, just like the technique used by Professor Thierry Coquand when introducing mathematical formulas into computer software. In Coquand's project, type theory serves as a bridge between programming language and mathematics, while in MOLTO it is used to bridge natural languages. The advantage of type theory is that each 'type' expresses content in a language-independent manner. This feature is used in speech technology to transfer meaning from one human language to another.
It is time-consuming to implement the system. First, all words needed for the field of application must be inserted in the language database. Each word is then provided with a type that indicates all possible meanings of the word. Finally, the grammar needs to be defined. At this point, the system needs to be told all the possible combinations of different types, which alternative expressions there are, in which forms the words can occur and how they should be ordered.
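The steps above (a word list, a language-independent meaning, and per-language rules for word forms and order) can be sketched in a few lines of Python. This is a toy illustration only, not the MOLTO software: the class Modified, the tables LEXICON and ADJECTIVE_FIRST, and the function linearize are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass

# Toy illustration only; every name here is hypothetical, not part of MOLTO.

@dataclass(frozen=True)
class Modified:
    """Language-independent meaning: a noun qualified by an adjective."""
    adjective: str
    noun: str

# Step 1: the words needed for the field of application, per language.
LEXICON = {
    "english": {"red": "red", "wine": "wine"},
    "french": {"red": "rouge", "wine": "vin"},
}

# Step 2: a grammar rule per language, here only adjective/noun order.
ADJECTIVE_FIRST = {"english": True, "french": False}

def linearize(meaning, language):
    """Render the shared meaning as a phrase in the requested language."""
    words = LEXICON[language]
    adjective, noun = words[meaning.adjective], words[meaning.noun]
    if ADJECTIVE_FIRST[language]:
        return f"{adjective} {noun}"
    return f"{noun} {adjective}"

meaning = Modified("red", "wine")
print(linearize(meaning, "english"))  # red wine
print(linearize(meaning, "french"))   # vin rouge
```

Adding a language means adding a lexicon entry and an ordering rule; the meaning itself does not change, which is the property the article attributes to types.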
The database containing the grammar is called 'resource grammar', and the idea is to make it very easy for a user to extend the grammatical content and add new words. One of the main ideas of the project is that it is open source, meaning that the software shall be accessible to all.
'The purpose of the EU grant is to enable us to use the MOLTO technology to create a system that can be used for translation on the Internet', says Ranta. 'The plan is that producers of web pages should be able to freely download the tool and translate texts into several languages simultaneously. Although the technology does exist already, it is quite cumbersome to use unless you are a computer scientist. In a nutshell, the EU gives us money to modify the tool and make it user friendly for a large number of users.'
The project aims at developing the system to suit different areas of applications. One area is translation of patent descriptions. Ultimately, people around the world should be able to take advantage of new technology immediately without having to master the language in which the patent description is written. A large number of translators have long had to be engaged in connection with new patents. Another sub-project aims at meeting the needs of mathematicians for a precise terminology for translation of mathematical teaching material, and then there is one sub-project that concerns descriptions of cultural heritage and museum objects, with a goal that anybody should be able to access these descriptions regardless of native tongue.

Bridging the Web's 'Linguistic Divide' From Igloo to the Internet
First Nations gain entree to electronic age
by David Akin - The Hamilton Spectator

[ ... English is the lingua franca of the world's software developers and hardware manufacturers. The core code that runs most of the world's computing devices was written in English, then translated into the ones and zeroes that machines can understand.
Which means wherever you want to go today using your computer, you will likely need to be able to speak and understand English. In Canada, of course, no manufacturer would be so brazen as to make something that could operate in only one of our official languages. Yet, just a decade ago, a French-speaking Quebecois living in Chicoutimi had to use the English accentless alphabet when sending e-mail to another French speaker in Trois Rivieres because the only e-mail programs in existence were written by English-speaking -- usually American -- developers who never thought about incorporating communication capabilities for those who use other alphabets.


Today, though, most popular software can represent French characters. But translating a software product from English to French is not as simple as running sub-titles through a movie or re-publishing a book. That's because the basic input device for a computer -- the keyboard -- has been designed and built for people who use the English alphabet. The French alphabet, of course, includes more possibilities than the English. There is c and then there is ç, for instance. Or e and é and even è.
Still, French characters, based as they are on the Latin alphabet, were close enough to the basic English alphabet that inclusion in new international standards was easy and quick.
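The accentless workaround described earlier can be illustrated with a minimal Python sketch. It assumes 7-bit ASCII stands in for the "English accentless alphabet" of early e-mail programs; the function name strip_accents and the sample message are hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Drop combining accents so only unaccented base letters remain."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

message = "Ça va très bien à Trois-Rivières"  # hypothetical sample text

# An ASCII-only program cannot carry the message as written.
try:
    message.encode("ascii")
except UnicodeEncodeError:
    print("ASCII cannot represent this message as written")

# What the sender had to type instead.
print(strip_accents(message))  # Ca va tres bien a Trois-Rivieres
```

The stripped text is readable, but the spelling is no longer correct French, which is the loss the article describes.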
But those who use an alphabet that doesn't rely on Latin letters -- Arabs, Greeks, Russians, and Chinese, to name a few -- can still come across Internet documents and software programs that require not only knowledge of a language they don't know but also an alphabet they've never used.
When Western Internet enthusiasts rave about the ability of telecommunications to unite the world in one global village, people of many non-Western cultures fail to see why they should rejoice in a communications system that marginalizes their language by forcing them into a homogenous English-only global village.
As a result, the rather narrow, technical issue of incorporating new computer characters into the machine language computers can understand has become a highly politicized issue in Canada and around the world.


Now, the push is on to bring the world's and Canada's aboriginal cultures into the electronic age, taking what are, in many cases, societies that were marginalized by an aggressive, dominant white culture during pre-industrial and industrial times, and giving them a prominent, participatory role in the new post-industrial digital age.
"It's a form of democratization. It allows smaller groups a voice at a lot of different levels," said educational consultant Dirk Vermeulen.
Vermeulen, who lives in Beamsville and works out of an office in the back of a native art gallery in Jordan, has developed curriculum and curriculum materials for Arctic boards of education since the 1970s.
And, just as southern Canadian boards of education are trying to put more computers in the classroom, so too are Arctic boards. Most computers, though, cannot support the phonetic syllabic characters used to represent Inuktitut in written language.
"We said, well, hold on, if you're going to allow computers into these schools, we have to make sure they'll work not just in English but also in Inuktitut and in French, so we went to work at that point to try and establish the ability of computers to be able to handle those various scripts.


"We quickly found that a lot of other native groups across Canada that were using syllabics were doing the same thing, but that none of the data was interchangeable. Everybody had their own method and their own solution to the problem," Vermeulen said.
In 1992, Industry Canada, at the urging of Canadian aboriginal groups, called on Vermeulen and others to form the Canadian Aboriginal Syllabics Encoding Committee (CASEC), to come up with a proposed standard for including Canadian aboriginal syllabics in computer character sets that could be adopted by the International Organization for Standardization, or ISO.
"The native cultures, at this point, are very ready to take control as to where their languages or culture is going," said Vermeulen in a recent telephone interview.
Through the Canadian Standards Association, Vermeulen's committee submitted that standard June 10 to the ISO. The ISO's global membership has voted in favour of the new standard three times since then. The fourth and final vote on the standard is expected some time in the spring.
If the ISO agrees to include Canadian aboriginal syllabics in the standard, computer manufacturers from California to Singapore will begin making computers that support that language.
"It doesn't mean they have to make fonts for it, but what it does mean is that if you buy a font, any computer that you have you will be able to process syllabics without any problem," said Michael Everson.
Everson, born in Arizona but now living in Ireland, is one of Vermeulen's colleagues on CASEC.
The language standard used by computers is known as the Universal Multiple-Octet Coded Character Set.
This set contains 64,000 characters that a computer can be made to understand. So far, though, just 29,000 characters have been assigned a spot in that set.
Those characters include, for instance, the English alphabet -- in both capital and small letters -- as well as special characters such as tildes ( ~ ) or curly brackets { }.
The characters that have already been incorporated in the approved set also include many characters from Japanese, Chinese, Korean, Arabic, Hebrew and East Indian alphabets.
The ISO may also soon consider proposals to include important historical alphabets such as ancient Egyptian hieroglyphics as part of the approved coded character set.
The computer character set is crucial if people who use writing systems different from the English alphabet are to communicate in their own language using modern telecommunications technologies.
"It equalizes a lot of situations," Vermeulen said. "I think that's very useful and very good. I really stand behind that. What's interesting in many ways is that the native cultures are at this point very ready to take control as to where their cultures are going and where their languages are going."
Setting a standard for which languages computer products will support is not, to be clear, a matter of translation.
A computer that supports different character sets cannot translate between languages.
In other words, if an English-speaker types in the word 'Igloo', it does not show up on the computer screen of an Inuktitut-speaker in the Canadian Syllabic characters for igloo.
What does happen is that when an English-speaker types i-g-l-o-o, the computer is programmed to understand that English word in its hexadecimal numeric language as 0069 0067 006C 006F 006F and act upon that word.
The proposed new standard would see computer manufacturers assign the hexadecimal string 1403 14A1 14D7 to the Inuktitut syllabic symbols for igloo or house.
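The two hexadecimal strings quoted above can be checked with a short Python sketch. The helper name code_points is hypothetical; the numbers are the ones given in the article, and the character names are looked up from Python's own Unicode tables.

```python
import unicodedata

def code_points(text):
    """Return the hexadecimal code point of each character, space-separated."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

english = "igloo"
syllabics = "\u1403\u14A1\u14D7"  # the three code points quoted in the article

print(code_points(english))    # 0069 0067 006C 006F 006F
print(code_points(syllabics))  # 1403 14A1 14D7

# Look up the standard's name for each syllabic character.
for ch in syllabics:
    print(f"{ord(ch):04X}", unicodedata.name(ch))
```

Neither string is a translation of the other. Each is only the sequence of numbers that stands for one script's symbols, which is the distinction the article draws between a character set and a translation system.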


The proposed new standard would be an enabling tool, allowing people to use their own writing systems in digital communications.
"We've been trying to allow the language room to be used in a variety of situations, including offices and governmental situations and whatever else, in order to broaden that base of the use of the language," said Vermeulen.
"I think Nunavut is a big deal," said Everson in a telephone interview from his office in Dublin. Dublin is the home base for Everson Gunn Teoranta, his firm that 'localizes' or re-writes computer software in minority languages such as Gaelic.
"Nunavut is really remarkable and amazing and it's going to change things. These people are getting their own state," Everson said.
"The fact that they're getting their own state is giving them the impetus to make some amazing technological jumps."
New communications technologies also give the newly empowered state of Nunavut the ability to better control and direct the education of its young people, Vermeulen said.
"While there are a lot of pressures on the language from the English and the French media in Canada, the larger (aboriginal) groups are able to actually take advantage of the various media and promote their language.
"We hope that by including the writing systems into the modern technologies and into the modern standard it will do two things. One is that it'll allow people to use these technologies to promote their own language in whatever way they feel fit.
"The second thing is that it provides international recognition for those writing systems. In doing so, nobody can deny them the right to exist. That's a very important issue politically," Vermeulen said.
Nunavut comes into being April 1, 1999, when the Northwest Territories is divided, roughly along the tree line, into Nunavut and a western territory.
Communication technologies could play an important role in Nunavut's development if only because it, like Canada, must meet the challenges of serving a tiny population spread over a wide area.
Nunavut encompasses an area more than five times the size of Germany, yet it has just 20 kilometres of roads. Its 26 settlements are spread across three time zones. CASEC expects the ISO will formally adopt Canadian Syllabics into the standard some time in the spring.
The hard work, though, has just begun as Inuktitut speakers take English versions of popular software packages and re-write them using the complex and different Inuktitut grammar, syntax, and alphabet.
The Baffin Divisional Board of Education is already localizing Macintosh operating system 7.5 to be able to use Canadian syllabic characters.
"I've looked at the grammar of this and it is a language from hell," said Everson, sizing up the job of turning Apple's elegant English computer code into the phonetic symbols of written Inuktitut.
"I don't know how this poor woman is doing the translations of this technical vocabulary into this amazing language. It's a wonderful, wonderful language, but it is not like English, I'll tell you that."
CASEC estimates that there are about 200,000 people in Canada's north who use the syllabics system to express themselves in written form. Most of those people are Cree and some Dene people who live in Canada's eastern Arctic.
Ironically, the language of those Arctic dwellers had no written form until Methodist missionaries visited them in the 1830s. Now, just 160 years after the language first found its way onto parchment, it is being digitized.
The Methodist missionaries took the oral culture of the Cree and Dene and imposed a written vocabulary using French shorthand symbols. Since those first early efforts, syllabic character shapes have been added to the 'alphabet' while existing ones have been modified. The approved computer standard set already incorporates the syllabic characters for several Algonkian and Athapaskan languages.
"We find a lot of these languages actually strengthening," Vermeulen said.
"The existence of the phonetic syllabic characters is credited with helping to sustain and strengthen native culture, by making it easy for users to read, write and publish in their own language."


The Canadian Standards Association submission to the International Organization for Standardization is titled Proposed pDAM for Unified Canadian Aboriginal Syllabics. You can find it at:

Universal Declaration of Linguistic Rights -- a statement that argues for the protection and encouragement of minority languages is at