Language Data, Computer & Keyboard Language
First of all, we have Steve Jobs to thank for the fonts and the look of type on the personal computer, which shape how language appears on screen.
21ST CENTURY LINGUISTIC RIGHTS
Since its inception, the Internet's domain-name system has accommodated only English-language characters. That provision has helped streamline the engineering of the Web, but according to delegates at a recent United Nations summit, it has left speakers of Russian, Arabic, Lao and the like out in the cold. At the meeting, held in Athens, speakers argued that the Web's apparent love of English has marginalized many surfers in developing nations. "I think the digital divide is not as important as the linguistic divide," said Adama Samassékou, president of the African Academy of Languages. Help for non-English speakers may be on the way: although it took years of work, Web browsers including Mozilla's Firefox and Microsoft's Internet Explorer now support characters from other languages.
Expert Jeff Allen - Haitian Creole Language Technologies - Language Data Distribution
dotSUB - Any Video Any Language
Multilingual Translation System Receives Over 2 Million Euro in EU Funding
All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by the University of Gothenburg, Sweden, receives more than 2 million euro in project support from the EU to create a reliable translation tool that covers a majority of the EU languages.
'It has so far been impossible to produce a translation tool that covers entire languages,' says Aarne Ranta, professor at the Department of Computer Science and Engineering at the University of Gothenburg, Sweden.
Google Translate is a widely used translation programme that gradually improves the quality of its translations through machine learning: the system learns from its own mistakes via feedback, but tries to do without explicit grammatical rules.
MOLTO, in contrast, is being developed in the opposite direction: it begins with precision and grammar, while wide coverage comes later. 'We wanted to work with a translation technique that is so accurate that people who produce texts can use our translations directly. We have now started to move from precision to increased coverage, meaning that we have started to add more languages to the tool and database,' says Ranta.
Professor Ranta is the coordinator of the MOLTO (Multilingual On-Line Translation) project, which includes three universities and two companies. The project is to receive 25 million SEK (2.375 million euro) in EU funding over three years. The grant falls in the Machine Translation category, and one requirement has been that the system be developed to include a majority of the EU's official languages.
The technique used in MOLTO is based on type theory, just like the technique used by Professor Thierry Coquand when introducing mathematical formulas into computer software. In Coquand's project, type theory serves as a bridge between programming language and mathematics, while in MOLTO it is used to bridge natural languages. The advantage of type theory is that each 'type' expresses content in a language-independent manner. This feature is used in language technology to transfer meaning from one human language to another.
It is time-consuming to implement the system. First, all words needed for the field of application must be inserted into the language database. Each word is then provided with a type that indicates all possible meanings of the word. Finally, the grammar needs to be defined: the system needs to be told all the possible combinations of different types, which alternative expressions there are, in which forms the words can occur and how they should be ordered.
The database containing the grammar is called a 'resource grammar', and the idea is to make it very easy for a user to extend the grammatical content and add new words. One of the main ideas of the project is that it is open source, meaning that the software shall be accessible to all.
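To make the idea concrete, here is a minimal sketch in Python of the interlingua approach described above: a language-independent abstract type carries the meaning, and one linearization rule per language turns it into text. The class names, tiny lexicon and sentence patterns are illustrative assumptions, not the project's actual resource grammar or grammar formalism.

    # A toy interlingua: abstract syntax carries the meaning, and one
    # linearization function per language produces the concrete sentence.
    # Everything here is an illustrative assumption, not MOLTO's own grammar.
    from dataclasses import dataclass

    @dataclass
    class Item:
        kind: str                     # abstract concept, e.g. "painting"

    @dataclass
    class Description:                # abstract type: "this <item> is from <year>"
        item: Item
        year: int

    LEXICON = {                       # concrete words per language
        "en": {"painting": "painting", "vase": "vase"},
        "sv": {"painting": "målning", "vase": "vas"},
    }

    def linearize(d: Description, lang: str) -> str:
        word = LEXICON[lang][d.item.kind]
        if lang == "en":
            return f"This {word} is from {d.year}."
        if lang == "sv":
            return f"Denna {word} är från {d.year}."
        raise ValueError(f"no concrete grammar for {lang}")

    d = Description(Item("painting"), 1654)
    print(linearize(d, "en"))         # This painting is from 1654.
    print(linearize(d, "sv"))         # Denna målning är från 1654.

Extending coverage then means adding lexicon entries and linearization rules, which is exactly the kind of work the resource grammar is meant to make easy.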
'The purpose of the EU grant is to enable us to use the MOLTO technology to create a system that can be used for translation on the Internet,' says Ranta. 'The plan is that producers of web pages should be able to freely download the tool and translate texts into several languages simultaneously. Although the technology does exist already, it is quite cumbersome to use unless you are a computer scientist. In a nutshell, the EU gives us money to modify the tool and make it user friendly for a large number of users.'
The project aims at developing the system to suit different areas of application. One area is translation of patent descriptions. Ultimately, people around the world should be able to take advantage of new technology immediately without having to master the language in which the patent description is written; a large number of translators have long had to be engaged in connection with new patents. Another sub-project aims at meeting the needs of mathematicians for a precise terminology for the translation of mathematical teaching material, and there is also a sub-project that concerns descriptions of cultural heritage and museum objects, with the goal that anybody should be able to access these descriptions regardless of native tongue.
Bridging the Web's 'Linguistic Divide'
From Igloo to the Internet
First Nations gain entrée to electronic age
by David Akin - The Hamilton Spectator
http://www.southam.com/calgaryherald/cgi/newsnow.pl?nkey=ch&file=/business/Technology/970922/t0922mt10.html
[ ... English is the lingua franca of the world's software developers and hardware manufacturers. The core code that runs most of the world's computing devices was written in English, then translated into the ones and zeroes that machines can understand.
Which means wherever you want to go today using your computer, you will likely need to be able to speak and understand English. In Canada, of course, no manufacturer would be so brazen as to make something that could operate in only one of our official languages. Yet, just a decade ago, a French-speaking Québécois living in Chicoutimi had to use the accentless English alphabet when sending e-mail to another French speaker in Trois-Rivières, because the only e-mail programs in existence were written by English-speaking -- usually American -- developers who never thought about incorporating communication capabilities for those who use other alphabets.
FRENCH REPRESENTATION
Today, though, most popular software can represent French characters. But translating a software product from English to French is not as simple as running sub-titles through a movie or re-publishing a book. That's because the basic input device for a computer -- the keyboard -- has been designed and built for people who use the English alphabet. The French alphabet, of course, includes more possibilities than the English. There is c and then there is ç, for instance. Or e and é, and even è and ê.
Still, French characters, based as they are on the Latin alphabet, were close enough to the basic English alphabet that inclusion in new international standards was easy and quick.
But those who use an alphabet that doesn't rely on Latin letters -- Arabs, Greeks, Russians, and Chinese, to name a few -- can still come across Internet documents and software programs that require not only knowledge of a language they don't know but also an alphabet they've never used.
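A rough way to see the difference is to compare code points and legacy encodings. The short Python snippet below is an illustration added here, not part of the original article, and the example characters are my own choice: a French accented letter fits in the old single-byte Latin-1 character set, while a Cyrillic letter does not and needs a larger, universal set.

    # Illustration: a plain Latin letter, a French accented letter, and a
    # Cyrillic letter. The first two fit in single-byte Latin-1; the third
    # can only be carried by a larger character set such as Unicode.
    for ch in ("c", "ç", "Я"):
        print(ch, f"U+{ord(ch):04X}")          # Unicode code point
        try:
            print("  Latin-1 byte:", ch.encode("latin-1"))
        except UnicodeEncodeError:
            print("  not representable in Latin-1")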
When Western Internet enthusiasts rave about the ability of telecommunications to unite the world in one
global village, people of many non-Western cultures fail to see why they should rejoice in a communications
system that marginalizes their language by forcing them into a homogeneous English-only global village.
As a result, the rather narrow, technical issue of incorporating new computer characters into the machine
language computers can understand has become a highly politicized issue in Canada and around the world.
PUSH IS ON
Now, the push is on to bring the world's and Canada's aboriginal cultures into the electronic age, taking what are, in many cases, societies that were marginalized by an aggressive, dominant white culture during pre-industrial and industrial times, and giving them a prominent, participatory role in the new post-industrial digital age.
"It's a form of democratization. It allows smaller groups a voice at a lot of different
levels," said educational consultant Dirk Vermeulen.
Vermeulen, who lives in Beamsville and works out of an office in the back of a native art gallery in Jordan,
has developed curriculum and curriculum materials for Arctic boards of education since the 1970s.
And, just as southern Canadian boards of education are trying to put more computers in the classroom, so too are Arctic boards. Most computers, though, cannot support the phonetic syllabic characters used to represent Inuktitut in written language.
"We said, well, hold on, if you're going to allow computers into these schools, we have to make
sure
they'll work not just in English but also in Inuktitut and in French, so we went to work at that point
to
try and establish the ability of computers to be able to handle those various scripts.
NOTHING INTERCHANGEABLE
"We quickly found that a lot of other native groups across Canada that were using syllabics were doing
the same thing, but that none of the data was interchangeable. Everybody had their own method and their own
solution to the problem," Vermeulen said.
In 1992, Industry Canada, at the urging of Canadian aboriginal groups, called on Vermeulen and others to
form the Canadian Aboriginal Syllabics Encoding Committee, to come up with a proposed standard for including
Canadian aboriginal syllabics into computer character sets that could be adopted by the International
Organization for Standardization or ISO.
"The native cultures, at this point, are
very ready to take control as to where their languages or culture is going," said
Vermeulen in a recent telephone interview.
Through the Canadian Standards Association, Vermeulen's committee submitted that standard June 10 to the
ISO. The ISO's global membership has voted in favour of the new standard three times since then. The
fourth and final vote on the standard is expected some time in the spring.
If the ISO agrees to include Canadian aboriginal syllabics in the standard, computer
manufacturers from California to Singapore will begin making computers that support that language.
"It doesn't mean they have to make fonts for it, but what it does mean is that if you buy a font,
any
computer that you have you will be able to process syllabics without any problem," said Michael Everson
said.
Everson, born in Arizona but now living in Ireland, is one of Vermeulen's colleague's on
CASEC.
The language standard used by computers is known as the Universal Multiple-Octet Coded Character Set. This set contains 64,000 characters that a computer can be made to understand. So far, though, just 29,000 characters have been assigned a spot in that set. Those characters include, for instance, the English alphabet -- in both capital and small letters -- as well as special characters such as tildes ( ~ ) or curly brackets { }. The characters that have already been incorporated in the approved set also include many characters from Japanese, Chinese, Korean, Arabic, Hebrew and East Indian alphabets.
The ISO may also soon consider proposals to include important historical alphabets such as ancient
Egyptian hieroglyphics as part of the approved coded character set.
The computer character set is crucial if people who use writing systems different from the English alphabet
are to communicate in their own language using modern telecommunications technologies.
"It equalizes a lot of situations," Vermeulen said. "I think that's very useful and very
good. I really stand behind that. What's interesting in many ways is that the native cultures are at
this
point very ready to take control as to where their cultures are going and where their languages are
going."
Setting a standard for which languages computer products will support is not, to be clear, a matter of translation.
A computer that supports different character sets cannot translate between languages.
In other words, if an English-speaker types in the word 'Igloo', it does not show up on the computer
screen of an Inuktitut-speaker in the Canadian Syllabic characters for igloo.
What does happen is that when an English-speaker types i-g-l-o-o, the computer is programmed to
understand that English word in its hexadecimal numeric language as 0069 0067 006C 006F 006F and act upon
that word.
The proposed new standard would see computer manufacturers assign the hexadecimal string 1403 14A1 14D7 to
the Inuktitut syllabic symbols for igloo or house.
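Those numbers can be checked directly. The Python snippet below is an added illustration, not part of the original article; it prints the code points behind the Latin spelling and builds the syllabic string from the hexadecimal values quoted above.

    # The Latin word "igloo" as code points, and the syllabic string assembled
    # from the hexadecimal values quoted in the article (1403 14A1 14D7).
    word = "igloo"
    print(" ".join(f"{ord(c):04X}" for c in word))        # 0069 0067 006C 006F 006F

    syllabics = "".join(chr(cp) for cp in (0x1403, 0x14A1, 0x14D7))
    print(syllabics, " ".join(f"{ord(c):04X}" for c in syllabics))

Either way, the computer is only storing agreed-upon numbers for characters; no translation between English and Inuktitut takes place.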
ENABLING TOOL
The proposed new standard would be an enabling tool, allowing people to use their own writing systems in
digital communications.
"We've been trying to allow the language room to be used in a variety of situations, including
offices and governmental situations and whatever else, in order to broaden that base of the use of the
language," said Vermeulen.
"I think Nunavut is a big deal," said Everson in a telephone interview from his
office in Dublin. Dublin is the home base for Everson Gunn Teoranta, his firm that 'localizes' or
re-writes computer software in minority languages such as Gaelic.
"Nunavut is really remarkable and amazing and it's going to change things. These
people are getting their own state," Everson said.
"The fact that they're getting their own state is giving them the impetus to make some amazing
technological jumps."
New communications technologies also give the newly empowered state of Nunavut the means to better control and direct the education of its young people, Vermeulen said.
"While there are a lot of pressures on the language from the English and the French media in Canada,
the
larger (aboriginal) groups are able to actually take advantage of the various media and
promote their language.
"We hope that by including the writing systems into the modern technologies and into the modern
standard
it will do two things. One is that it'll allow people to use these technologies to promote their own
language in whatever way they feel fit.
"The second thing is that it provides international recognition for those writing
systems. In doing so, nobody can deny them the right to exist. That's a very important issue
politically," Vermeulen said.
Nunavut comes into being April 1, 1999, when the Northwest Territories is divided, roughly along the tree
line, into Nunavut and a western territory.
Communication technologies could play an important role in Nunavut's development if only because it, like Canada, must meet the challenges of serving a tiny population spread over a wide area.
Nunavut encompasses an area more than five times the size of Germany, yet it has just 20 kilometres of roads. Its 26 settlements are spread across three time zones. CASEC expects the ISO will formally adopt Canadian Syllabics into the standard some time in the spring.
The hard work, though, has just begun as Inuktitut speakers take English versions of popular software packages and re-write them using the complex and different Inuktitut grammar, syntax, and alphabet.
The Baffin Divisional Board of Education is already localizing Macintosh operating system 7.5 to be
able to use Canadian syllabic characters.
"I've looked at the grammar of this and
it
is a language from hell," said Everson, sizing up the job of turning Apple's elegant
English
computer code into the phonetic symbols of written Inuktitut.
"I don't know how
this
poor woman is doing the translations of this technical vocabulary into this amazing language. It's a
wonderful, wonderful language, but it is not like English, I'll tell you that."
CASEC estimates that there are about 200,000 people in Canada's north who use the syllabics system to express themselves in written form. Most of those people are Cree and some Dene people who live in Canada's eastern Arctic.
Ironically, the language of those Arctic dwellers had no written form until Methodist missionaries visited them in the 1830s. Now, just 160 years after the language first found its way onto parchment, it is being digitized.
The Methodist missionaries took the oral culture of the Cree and Dene and imposed a written vocabulary using
French shorthand symbols. Since those first early efforts, syllabic character shapes have been added to the
'alphabet' while existing ones have been modified. The approved computer standard set already
incorporates the syllabic characters for several Algonkian and Athapaskan languages.
"We find a lot of these languages actually strengthening," Vermeulen.
"The existence of the phonetic syllabic characters is credited with helping to sustain and strengthen
native culture, by making it easy for users to read, write and publish in their own language."
RELATED WEB SITES
The Canadian Standards Association submission to the International Organization for Standardization is titled Proposed pDAM for Unified Canadian Aboriginal Syllabics. You can find it at: http://www.evertype.com/standards/sl/n1441-en.html
Universal Declaration of Linguistic Rights -- a statement that argues for the protection and encouragement of minority languages -- is at www.indigo.ie/egt/udhr/udlr-en.html