# This document contains text in Perl "POD" format.
# Use a POD viewer like perldoc or perlman to render it.

=head1 NAME

Locale::Maketext::TPJ13 -- article about software localization

=head1 SYNOPSIS

  # This is an article, not a module.

=head1 DESCRIPTION

The following article by Sean M. Burke and Jordan Lachler
first appeared in I<The Perl Journal> #13
and is copyright 1999 The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal.  This document may be
distributed under the same terms as Perl itself.

=head1 Localization and Perl: gettext breaks, Maketext fixes

by Sean M. Burke and Jordan Lachler

This article points out cases where gettext (a common system for
localizing software interfaces -- i.e., making them work in the user's
language of choice) fails because of basic differences between human
languages.  This article then describes Maketext, a new system capable
of correctly treating these differences.

=head2 A Localization Horror Story: It Could Happen To You

=over

"There are a number of languages spoken by human beings in this
world."

-- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
Identification of Languages"

=back

Imagine that your task for the day is to localize a piece of software
-- and luckily for you, the only output the program emits is two
messages, like this:

  I scanned 12 directories.

  Your query matched 10 files in 4 directories.

So how hard could that be?  You look at the code that
produces the first item, and it reads:

  printf("I scanned %g directories.",
         $directory_count);

You think about that, and realize that it doesn't even work right for
English, as it can produce this output:

  I scanned 1 directories.

So you rewrite it to read:

  printf("I scanned %g %s.",
         $directory_count,
         $directory_count == 1 ?
           "directory" : "directories",
  );

...which does the Right Thing.  (In case you don't recall, "%g" is for
locale-specific number interpolation, and "%s" is for string
interpolation.)

But you still have to localize it for all the languages you're
producing this software for, so you pull Locale::gettext off of CPAN
so you can access the C<gettext> C functions you've heard are standard
for localization tasks.

And you write:

  printf(gettext("I scanned %g %s."),
         $dir_scan_count,
         $dir_scan_count == 1 ?
           gettext("directory") : gettext("directories"),
  );

But you then read in the gettext manual (Drepper, Miller, and Pinard 1995)
that this is not a good idea, since how a single word like "directory"
or "directories" is translated may depend on context -- and this is
true, since in a case language like German or Russian, you may need
these words with a different case ending in the first instance (where the
word is the object of a verb) than in the second instance, which you haven't even
gotten to yet (where the word is the object of a preposition, "in %g
directories") -- assuming these keep the same syntax when translated
into those languages.

So, on the advice of the gettext manual, you rewrite:

  printf( $dir_scan_count == 1 ?
           gettext("I scanned %g directory.") :
           gettext("I scanned %g directories."),
         $dir_scan_count );

So, you email your various translators (the boss decides that the
languages du jour are Chinese, Arabic, Russian, and Italian, so you
have one translator for each), asking for translations for "I scanned
%g directory." and "I scanned %g directories.".
When they reply,
you'll put that in the lexicons for gettext to use when it localizes
your software, so that when the user is running under the "zh"
(Chinese) locale, gettext("I scanned %g directory.") will return the
appropriate Chinese text, with a "%g" in there where printf can then
interpolate $dir_scan_count.

Your Chinese translator emails right back -- he says both of these
phrases translate to the same thing in Chinese, because, in linguistic
jargon, Chinese "doesn't have number as a grammatical category" --
whereas English does.  That is, English has grammatical rules that
refer to "number", i.e., whether something is grammatically singular
or plural; and one of these rules is the one that forces nouns to take
a plural suffix (generally "s") when in a plural context, as they are when
they follow a number other than "one" (including, oddly enough, "zero").
Chinese has no such rules, and so has just the one phrase where English
has two.  But, no problem, you can have this one Chinese phrase appear
as the translation for the two English phrases in the "zh" gettext
lexicon for your program.

Emboldened by this, you dive into the second phrase that your software
needs to output: "Your query matched 10 files in 4 directories.".  You notice
that if you want to treat phrases as indivisible, as the gettext
manual wisely advises, you need four cases now, instead of two, to
cover the permutations of singular and plural on the two items,
$directory_count and $file_count.  So you try this:

  printf( $file_count == 1 ?
    ( $directory_count == 1 ?
     gettext("Your query matched %g file in %g directory.") :
     gettext("Your query matched %g file in %g directories.") ) :
    ( $directory_count == 1 ?
     gettext("Your query matched %g files in %g directory.") :
     gettext("Your query matched %g files in %g directories.") ),
   $file_count, $directory_count,
  );

(The case of "1 file in 2 [or more] directories" could, I suppose,
occur in the case of symlinking or something of the sort.)

It occurs to you that this is not the prettiest code you've ever
written, but this seems the way to go.  You mail off to the
translators asking for translations for these four cases.  The
Chinese guy replies with the one phrase that these all translate to in
Chinese, and that phrase has two "%g"s in it, as it should -- but
there's a problem.  He translates it word-for-word back: "In %g
directories contains %g files match your query."  The %g
slots are in an order reverse to what they are in English.  You wonder
how you'll get gettext to handle that.

But you put it aside for the moment, and optimistically hope that the
other translators won't have this problem, and that their languages
will be better behaved -- i.e., that they will be just like English.

But the Arabic translator is the next to write back.  First off, your
code for "I scanned %g directory." or "I scanned %g directories."
assumes there's only singular or plural.  But, to use linguistic
jargon again, Arabic has grammatical number, like English (but unlike
Chinese), but it's a three-term category: singular, dual, and plural.
In other words, the way you say "directory" depends on whether there's
one directory, or I<two> of them, or I<more than two> of them.  Your
test of C<($directory_count == 1)> no longer does the job.
And it means
that where English's grammatical category of number necessitates
only the two permutations of the first sentence based on "directory
[singular]" and "directories [plural]", Arabic has three -- and,
worse, in the second sentence ("Your query matched %g file in %g
directory."), where English has four, Arabic has nine.  You sense
an unwelcome, exponential trend taking shape.

Your Italian translator emails you back and says that "I scanned 0
directories" (a possible English output of your program) is stilted,
and if you think that's fine English, that's your problem, but that
I<just will not do> in the language of Dante.  He insists that where
$directory_count is 0, your program should produce the Italian text
for "I I<didn't> scan I<any> directories.".  And ditto for "I didn't
match any files in any directories", although he says the last part
about "in any directories" should probably just be left off.

You wonder how you'll get gettext to handle this; to accommodate the
ways Arabic, Chinese, and Italian deal with numbers in just these few
very simple phrases, you need to write code that will ask gettext for
different queries depending on whether the numerical values in
question are 1, 2, more than 2, or in some cases 0, and you still haven't
figured out the problem with the different word order in Chinese.

Then your Russian translator calls on the phone, to I<personally> tell
you the bad news about how really unpleasant your life is about to
become:

Russian, like German or Latin, is an inflectional language; that is, nouns
and adjectives have to take endings that depend on their case
(i.e., nominative, accusative, genitive, etc...) -- which is roughly a matter of
what role they have in the syntax of the sentence --
as well as on the grammatical gender (i.e., masculine, feminine, neuter)
and number (i.e., singular or plural) of the noun, as well as on the
declension class of the noun.  But unlike with most other inflected languages,
putting a number-phrase (like "ten" or "forty-three", or their Arabic
numeral equivalents) in front of a noun in Russian can change the case and
number that noun is, and therefore the endings you have to put on it.

He elaborates:  In "I scanned %g directories", you'd I<expect>
"directories" to be in the accusative case (since it is the direct
object in the sentence) and the plural number,
except where $directory_count is 1, then you'd expect the singular, of
course.  Just like Latin or German.  I<But!>  Where $directory_count %
10 is 1 ("%" for modulo, remember), assuming $directory_count is an
integer, and except where $directory_count % 100 is 11, "directories"
is forced to become grammatically singular, which means it gets the
ending for the accusative singular...  You begin to visualize the code
it'd take to test for the problem so far, I<and still work for Chinese
and Arabic and Italian>, and how many gettext items that'd take, but
he keeps going...  But where $directory_count % 10 is 2, 3, or 4
(except where $directory_count % 100 is 12, 13, or 14), the word for
"directories" is forced to be genitive singular -- which means another
ending... The room begins to spin around you, slowly at first...  But
with I<all other> integer values, since "directory" is an inanimate
noun, when preceded by a number and in the nominative or accusative
cases (as it is here, just your luck!), it does stay plural, but it is
forced into the genitive case -- yet another ending...
And
you never hear him get to the part about how you're going to run into
similar (but maybe subtly different) problems with other Slavic
languages like Polish, because the floor comes up to meet you, and you
fade into unconsciousness.


The above cautionary tale relates how an attempt at localization can
lead from programmer consternation, to program obfuscation, to a need
for sedation.  But careful evaluation shows that your choice of tools
merely needed further consideration.

=head2 The Linguistic View

=over

"It is more complicated than you think."

-- The Eighth Networking Truth, from RFC 1925

=back

The field of Linguistics has expended a great deal of effort over the
past century trying to find grammatical patterns which hold across
languages; it's been a constant process
of people making generalizations that should apply to all languages,
only to find out that, all too often, these generalizations fail --
sometimes failing for just a few languages, sometimes whole classes of
languages, and sometimes nearly every language in the world except
English.  Broad statistical trends are evident in what the "average
language" is like as far as what its rules can look like, must look
like, and cannot look like.  But the "average language" is just as
unreal a concept as the "average person" -- it runs up against the
fact that no language (or person) is, in fact, average.  The wisdom of past
experience leads us to believe that any given language can do whatever
it wants, in any order, with appeal to any kind of grammatical
categories it wants -- case, number, tense, real or metaphoric
characteristics of the things that words refer to, arbitrary or
predictable classifications of words based on what endings or prefixes
they can take, degree or means of certainty about the truth of
statements expressed, and so on, ad infinitum.

Mercifully, most localization tasks are a matter of finding ways to
translate whole phrases, generally sentences, where the context is
relatively set, and where the only variation in content is I<usually>
in a number being expressed -- as in the example sentences above.
Translating specific, fully-formed sentences is, in practice, fairly
foolproof -- which is good, because that's what's in the phrasebooks
that so many tourists rely on.  Now, a given phrase (whether in a
phrasebook or in a gettext lexicon) in one language I<might> have a
greater or lesser applicability than that phrase's translation into
another language -- for example, strictly speaking, in Arabic, the
"your" in "Your query matched..." would take a different form
depending on whether the user is male or female; so the Arabic
translation "your[feminine] query" is applicable in fewer cases than
the corresponding English phrase, which doesn't distinguish the user's
gender.  (In practice, it's not feasible to have a program know the
user's gender, so the masculine "you" in Arabic is usually used, by
default.)

But in general, such surprises are rare when entire sentences are
being translated, especially when the functional context is restricted
to that of a computer interacting with a user either to convey a fact
or to prompt for a piece of information.  So, for purposes of
localization, translation by phrase (generally by sentence) is both the
simplest and the least problematic.

=head2 Breaking gettext

=over

"It Has To Work."

-- First Networking Truth, RFC 1925

=back

Consider that sentences in a tourist phrasebook are of two types: ones
like "How do I get to the marketplace?"
that don't have any blanks to
fill in, and ones like "How much do these ___ cost?", where there's
one or more blanks to fill in (and these are usually linked to a
list of words that you can put in that blank: "fish", "potatoes",
"tomatoes", etc.)  The ones with no blanks are no problem, but the
fill-in-the-blank ones may not be really straightforward.  If it's a
Swahili phrasebook, for example, the authors probably didn't bother to
tell you the complicated ways that the verb "cost" changes its
inflectional prefix depending on the noun you're putting in the blank.
The trader in the marketplace will still understand what you're saying if
you say "how much do these potatoes cost?" with the wrong
inflectional prefix on "cost".  After all, I<you> can't speak proper Swahili,
I<you're> just a tourist.  But while tourists can be stupid, computers
are supposed to be smart; the computer should be able to fill in the
blank, and still have the results be grammatical.

In other words, a phrasebook entry takes some values as parameters
(the things that you fill in the blank or blanks), and provides a value
based on these parameters, where the way you get that final value from
the given values can, properly speaking, involve an arbitrarily
complex series of operations.  (In the case of Chinese, it'd be not at
all complex, at least in cases like the examples at the beginning of
this article; whereas in the case of Russian it'd be a rather complex
series of operations.  And in some languages, the
complexity could be spread around differently: while the act of
putting a number-expression in front of a noun phrase might not be
complex by itself, it may change how you have to, for example, inflect
a verb elsewhere in the sentence.  This is what in syntax is called
"long-distance dependencies".)
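
As a rough illustration of a phrasebook entry that takes parameters
and computes its result, here is a minimal sketch in which the same
entry is written once per language.  The hash name and the C<zh_gloss>
key are hypothetical, and the second entry uses the translator's
word-for-word English gloss from the horror story as a stand-in for
real Chinese text.

```perl
use strict;
use warnings;

# Hypothetical mini-phrasebook: one function per language for one phrase.
my %matched_files_in_dirs = (
    en => sub {
        my ($files, $dirs) = @_;
        # English needs a singular/plural choice for each noun.
        return sprintf "Your query matched %g %s in %g %s.",
            $files, ($files == 1 ? 'file'      : 'files'),
            $dirs,  ($dirs  == 1 ? 'directory' : 'directories');
    },
    zh_gloss => sub {
        my ($files, $dirs) = @_;
        # No grammatical number, and the two slots come in reverse order.
        return "In $dirs directories contains $files files match your query.";
    },
);

print $matched_files_in_dirs{$_}->(10, 4), "\n" for qw(en zh_gloss);
```

The caller passes the same two values to either entry; how many forms
exist and where the numbers land is each entry's own business.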

This talk of parameters and arbitrary complexity is just another way
to say that an entry in a phrasebook is what in a programming language
would be called a "function".  Just so you don't miss it, this is the
crux of this article: I<A phrase is a function; a phrasebook is a
bunch of functions.>

The reason that using gettext runs into walls (as in the above
second-person horror story) is that you're trying to use a string (or
worse, a choice among a bunch of strings) to do what you really need a
function for -- which is futile.  Performing (s)printf interpolation
on the strings which you get back from gettext does allow you to do I<some>
common things passably well... sometimes... sort of; but, to paraphrase
what some people say about C<csh> script programming, "it fools you
into thinking you can use it for real things, but you can't, and you
don't discover this until you've already spent too much time trying,
and by then it's too late."

=head2 Replacing gettext

So, what needs to replace gettext is a system that supports lexicons
of functions instead of lexicons of strings.  An entry in a lexicon
from such a system should I<not> look like this:

  "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

[\xE9 is e-acute in Latin-1.  Some pod renderers would
scream if I used the actual character here. -- SB]

but instead like this, bearing in mind that this is just a first stab:

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = sprintf("%g %s", $files,
      $files == 1 ? 'fichier' : 'fichiers');
    $dirs = sprintf("%g %s", $dirs,
      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

Now, there's no particularly obvious way to store anything but strings
in a gettext lexicon; so it looks like we just have to start over and
make something better, from scratch.  I call my shot at a
gettext-replacement system "Maketext", or, in CPAN terms,
Locale::Maketext.

When designing Maketext, I chose to plan its main features in terms of
"buzzword compliance".  And here are the buzzwords:

=head2 Buzzwords: Abstraction and Encapsulation

The complexity of the language you're trying to output a phrase in is
entirely abstracted inside (and encapsulated within) the Maketext module
for that interface.  When you call:

  print $lang->maketext("You have [quant,_1,piece] of new mail.",
                       scalar(@messages));

you don't know (and in fact can't easily find out) whether this will
involve lots of figuring, as in Russian (if $lang is a handle to the
Russian module), or relatively little, as in Chinese.  That kind of
abstraction and encapsulation may encourage other pleasant buzzwords
like modularization and stratification, depending on what design
decisions you make.

=head2 Buzzword: Isomorphism

"Isomorphism" means "having the same structure or form"; in discussions
of program design, the word takes on the special, specific meaning that
your implementation of a solution to a problem I<has the same
structure> as, say, an informal verbal description of the solution, or
maybe of the problem itself.  Isomorphism is, all things considered,
a good thing -- it's what problem-solving (and solution-implementing)
should look like.

What's wrong with the gettext-using code like this...

  printf( $file_count == 1 ?
    ( $directory_count == 1 ?
     "Your query matched %g file in %g directory." :
     "Your query matched %g file in %g directories." ) :
    ( $directory_count == 1 ?
     "Your query matched %g files in %g directory." :
     "Your query matched %g files in %g directories." ),
   $file_count, $directory_count,
  );

is first off that it's not well abstracted -- these ways of testing
for grammatical number (as in the expressions like C<foo == 1 ?
singular_form : plural_form>) should be abstracted to each language
module, since how you get grammatical number is language-specific.

But second off, it's not isomorphic -- the "solution" (i.e., the
phrasebook entries) for Chinese maps from these four English phrases to
the one Chinese phrase that fits for all of them.  In other words, the
informal solution would be "The way to say what you want in Chinese is
with the one phrase 'For your question, in Y directories you would
find X files'" -- and so the implemented solution should be,
isomorphically, just a straightforward way to spit out that one
phrase, with numerals properly interpolated.  It shouldn't have to map
from the complexity of other languages to the simplicity of this one.

=head2 Buzzword: Inheritance

There's a great deal of reuse possible for sharing of phrases between
modules for related dialects, or for sharing of auxiliary functions
between related languages.  (By "auxiliary functions", I mean
functions that don't produce phrase-text, but which, say, return an
answer to "does this number require a plural noun after it?".  Such
auxiliary functions would be used in the internal logic of functions
that actually do produce phrase-text.)

In the case of sharing phrases, consider that you have an interface
already localized for American English (probably by having been
written with that as the native locale, but that's incidental).
Localizing it for UK English should, in practical terms, be just a
matter of running it past a British person with the instructions to
indicate what few phrases would benefit from a change in spelling or
possibly minor rewording.  In that case, you should be able to put in
the UK English localization module I<only> those phrases that are
UK-specific, and for all the rest, I<inherit> from the American
English module.  (And I expect this same situation would apply with
Brazilian and Continental Portuguese, possibly with some I<very>
closely related languages like Czech and Slovak, and possibly with the
slightly different "versions" of written Mandarin Chinese, as I hear exist in
Taiwan and mainland China.)

As to sharing of auxiliary functions, consider the problem of Russian
numbers from the beginning of this article; obviously, you'd want to
write only once the hairy code that, given a numeric value, would
return some specification of which case and number a given quantified
noun should use.  But suppose that you discover, while localizing an
interface for, say, Ukrainian (a Slavic language related to Russian,
spoken by several million people, many of whom would be relieved to
find that your Web site's or software's interface is available in
their language), that the rules in Ukrainian are the same as in Russian
for quantification, and probably for many other grammatical functions.
While there may well be no phrases in common between Russian and
Ukrainian, you could still choose to have the Ukrainian module inherit
from the Russian module, just for the sake of inheriting all the
various grammatical methods.  Or, probably better organizationally,
you could move those functions to a module called C<_E_Slavic> or
something, which Russian and Ukrainian could inherit useful functions
from, but which would (presumably) provide no lexicon.
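
A minimal sketch of that arrangement, using ordinary Perl method
inheritance, might look like the following.  The package names, the
C<number_category> method, and its return values are all hypothetical;
the rules inside it are the ones the Russian translator recited in the
horror story, and sharing them with Ukrainian is the supposition made
above, not an established fact.

```perl
use strict;
use warnings;

# Hypothetical shared base class: grammar helpers only, no lexicon.
package My::L10N::_E_Slavic;

sub new { my $class = shift; return bless {}, $class }

# Says which noun form a number calls for when the sentence wants an
# accusative noun, per the rules given earlier in the article.
sub number_category {
    my ($self, $n) = @_;
    my ($mod10, $mod100) = ($n % 10, $n % 100);
    return 'accusative_singular' if $mod10 == 1 && $mod100 != 11;
    return 'genitive_singular'
        if $mod10 >= 2 && $mod10 <= 4 && !($mod100 >= 12 && $mod100 <= 14);
    return 'genitive_plural';
}

# Hypothetical language modules: each inherits the helper unchanged.
package My::L10N::ru;
our @ISA = ('My::L10N::_E_Slavic');

package My::L10N::uk;
our @ISA = ('My::L10N::_E_Slavic');

package main;

my $uk = My::L10N::uk->new;
print "$_ => ", $uk->number_category($_), "\n" for 1, 3, 11, 13, 21, 25;
```

Each language module would then add only its own lexicon, and could
override C<number_category> if its rules turn out to differ.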

=head2 Buzzword: Concision

Okay, concision isn't a buzzword.  But it should be, so I decree that
as a new buzzword, "concision" means that simple common things should
be expressible in very few lines (or maybe even just a few characters)
of code -- call it a special case of "making simple things easy and
hard things possible", and see also the role it played in the
MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].

Consider our first stab at an entry in our "phrasebook of functions":

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = sprintf("%g %s", $files,
      $files == 1 ? 'fichier' : 'fichiers');
    $dirs = sprintf("%g %s", $dirs,
      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

You may sense that a lexicon (to use a non-committal catch-all term for a
collection of things you know how to say, regardless of whether they're
phrases or words) consisting of functions I<expressed> as above would
make for rather long-winded and repetitive code -- even if you wisely
rewrote this to have quantification (as we call adding a number
expression to a noun phrase) be a function called like:

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = quant($files, "fichier");
    $dirs =  quant($dirs,  "r\xE9pertoire");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

And you may also sense that you do not want to bother your translators
with having to write Perl code -- you'd much rather that they spend
their I<very costly time> on just translation.  And this is to say
nothing of the near impossibility of finding a commercial translator
who would know even simple Perl.
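
The article does not show the body of that C<quant> function, so the
following is only a guess at its simplest form: it keeps the first
stab's C<== 1> test and assumes the plural is the singular plus "s"
unless a third argument supplies an irregular plural.  The optional
third argument is an addition made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical quant(): a number followed by the matching noun form.
# The "== 1" rule is carried over from the first stab above; it is not
# a claim about how any particular language treats 0.
sub quant {
    my ($num, $singular, $plural) = @_;
    $plural = $singular . 's' unless defined $plural;
    return sprintf "%g %s", $num, ($num == 1 ? $singular : $plural);
}

sub I_found_X1_files_in_X2_directories {
    my ($files, $dirs) = @_;
    return "J'ai trouv\xE9 " . quant($files, 'fichier')
         . ' dans ' . quant($dirs, "r\xE9pertoire") . '.';
}

binmode STDOUT, ':encoding(UTF-8)';
print I_found_X1_files_in_X2_directories(1, 3), "\n";
```

Even with the helper factored out, each lexicon entry is still a
hand-written sub, which is the long-windedness the shorthand described
next is meant to remove.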

In a first-hack implementation of Maketext, each language-module's
lexicon looked like this:

 %Lexicon = (
   "I found %g files in %g directories"
   => sub {
      my( $files, $dirs ) = @_[0,1];
      $files = quant($files, "fichier");
      $dirs =  quant($dirs,  "r\xE9pertoire");
      return "J'ai trouv\xE9 $files dans $dirs.";
    },
  ... and so on with other phrase => sub mappings ...
 );

but I immediately went looking for some more concise way to basically
denote the same phrase-function -- a way that would also serve to
concisely denote I<most> phrase-functions in the lexicon for I<most>
languages.  After much time and even some actual thought, I decided on
this system:

* Where a value in a %Lexicon hash is a contentful string instead of
an anonymous sub (or, conceivably, a coderef), it would be interpreted
as a sort of shorthand expression of what the sub does.  When accessed
for the first time in a session, it is parsed, turned into Perl code,
and then eval'd into an anonymous sub; then that sub replaces the
original string in that lexicon.  (That way, the work of parsing and
evaling the shorthand form for a given phrase is done no more than
once per session.)

* Calls to C<maketext> (as Maketext's main function is called) happen
thru a "language session handle", notionally very much like an IO
handle, in that you open one at the start of the session, and use it
for "sending signals" to an object in order to have it return the text
you want.
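
For readers who want to see a language session handle in action, here
is a small sketch against the Locale::Maketext module as it ships with
Perl.  The C<BogoQuery::L10N> class names are hypothetical, and the
C<_AUTO> lexicon entry (which lets an unlisted key serve as its own
shorthand) is a convenience chosen for this example; the bracket
shorthand in the phrase is the notation the article goes on to
explain.

```perl
use strict;
use warnings;

# Hypothetical project base class.  Locale::Maketext supplies
# get_handle, maketext, and a default quant method.
package BogoQuery::L10N;
use parent 'Locale::Maketext';

# Hypothetical English language module.  With _AUTO set, a key that is
# not in the lexicon is compiled from its own text.
package BogoQuery::L10N::en;
use parent -norequire, 'BogoQuery::L10N';
our %Lexicon = ( _AUTO => 1 );

package main;

# Open the handle once, then send it phrases for the rest of the session.
my $lang = BogoQuery::L10N->get_handle('en')
    or die "No language handle available";

print $lang->maketext("You have [quant,_1,piece] of new mail.", $_), "\n"
    for 1, 3;
```

A real project would normally keep each language class in its own file
and list translated phrases in C<%Lexicon> rather than relying on
C<_AUTO> alone.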

So, this:

  $lang->maketext("You have [quant,_1,piece] of new mail.",
                 scalar(@messages));

basically means this: look in the lexicon for $lang (which may inherit
from any number of other lexicons), and find the function that we
happen to associate with the string "You have [quant,_1,piece] of new
mail" (which is, and should be, a functioning "shorthand" for this
function in the native locale -- English in this case).  If you find
such a function, call it with $lang as its first parameter (as if it
were a method), and then a copy of scalar(@messages) as its second,
and then return that value.  If that function was found, but was in
string shorthand instead of being a fully specified function, parse it
and make it into a function before calling it the first time.

* The shorthand uses code in brackets to indicate method calls that
should be performed.  A full explanation is not in order here, but a
few examples will suffice:

  "You have [quant,_1,piece] of new mail."

The above code is shorthand for, and will be interpreted as,
this:

  sub {
    my $handle = $_[0];
    my(@params) = @_;
    return join '',
      "You have ",
      $handle->quant($params[1], 'piece'),
      " of new mail.";
  }

where "quant" is the name of a method you're using to quantify the
noun "piece" with the number $params[1].

A string with no brackety calls, like this:

  "Your search expression was malformed."

is somewhat of a degenerate case, and just gets turned into:

  sub { return "Your search expression was malformed." }

However, not everything you can write in Perl code can be written in
the above shorthand system -- not by a long shot.  For example, consider
the Italian translator from the beginning of this article, who wanted
the Italian for "I didn't find any files" as a special case, instead
of "I found 0 files".
That couldn't be specified (at least not easily
or simply) in our shorthand system, and it would have to be written
out in full, like this:

  sub {  # pretend the English strings are in Italian
    my($handle, $files, $dirs) = @_[0,1,2];
    return "I didn't find any files" unless $files;
    return join '',
      "I found ",
      $handle->quant($files, 'file'),
      " in ",
      $handle->quant($dirs,  'directory'),
      ".";
  }

Next to a lexicon full of shorthand code, that sort of sticks out like a
sore thumb -- but this I<is> a special case, after all; and at least
it's possible, if not as concise as usual.

As to how you'd implement the Russian example from the beginning of
the article, well, There's More Than One Way To Do It, but it could be
something like this (using English words for Russian, just so you know
what's going on):

  "I [quant,_1,directory,accusative] scanned."

This shifts the burden of complexity off to the quant method.  That
method's parameters are: the numeric value it's going to use to
quantify something; the Russian word it's going to quantify; and the
parameter "accusative", which you're using to mean that this
sentence's syntax wants a noun in the accusative case there, although
that quantification method may have to overrule, for grammatical
reasons you may recall from the beginning of this article.

Now, the Russian quant method here is responsible not only for
implementing the strange logic necessary for figuring out how Russian
number-phrases impose case and number on their noun-phrases, but also
for inflecting the Russian word for "directory".
How that inflection
is to be carried out is no small issue, and among the solutions I've
seen, some (like variations on a simple lookup in a hash where all
possible forms are provided for all necessary words) are
straightforward but I<can> become cumbersome when you need to inflect
more than a few dozen words; and other solutions (like using
algorithms to model the inflections, storing only root forms and
irregularities) I<can> involve more overhead than is justifiable for
all but the largest lexicons.

Mercifully, this design decision becomes crucial only in the hairiest
of inflected languages, of which Russian is by no means the I<worst> case
scenario, but is worse than most.  Most languages have simpler
inflection systems; for example, in English or Swahili, there are
generally no more than two possible inflected forms for a given noun
("error/errors"; "kosa/makosa"), and the
rules for producing these forms are fairly simple -- or at least,
simple rules can be formulated that work for most words, and you can
then treat the exceptions as just "irregular", at least relative to
your ad hoc rules.  A simpler inflection system (simpler rules, fewer
forms) means that design decisions are less crucial to maintaining
sanity, whereas the same decisions could incur
overhead-versus-scalability problems in languages like Russian.  It
may I<also> be likely that code (possibly in Perl, as with
Lingua::EN::Inflect, for English nouns) has already
been written for the language in question, whether simple or complex.

Moreover, a third possibility may even be simpler than anything
discussed above: "Just require that all possible (or at least
applicable) forms be provided in the call to the given language's quant
method, as in:"

  "I found [quant,_1,file,files]."

That way, quant just has to choose which form it needs, without having
to look up or generate anything. While possibly not optimal for
Russian, this should work well for most other languages, where
quantification is not as complicated an operation.

=head2 The Devil in the Details

There's plenty more to Maketext than described above -- for example,
there are the details of how language tags ("en-US", "i-pwn", "fi",
etc.) or locale IDs ("en_US") interact with actual module naming
("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there are the
details of how to record (and possibly negotiate) what character
encoding Maketext will return text in (UTF8? Latin-1? KOI8?). There's
the interesting fact that Maketext is for localization, but nowhere
actually has a "C<use locale;>" anywhere in it. For the curious,
there are the somewhat frightening details of how I actually
implement something like data inheritance so that searches across
modules' %Lexicon hashes can parallel how Perl implements method
inheritance.

And, most importantly, there are all the practical details of how to
actually go about deriving from Maketext so you can use it for your
interfaces, and the various tools and conventions for starting out and
maintaining individual language modules.

That is all covered in the documentation for Locale::Maketext and the
modules that come with it, available on CPAN. After having read this
article, which covers the why's of Maketext, the documentation,
which covers the how's of it, should be quite straightforward.

=head2 The Proof in the Pudding: Localizing Web Sites

Maketext and gettext have a notable difference: gettext is in C,
accessible through C library calls, whereas Maketext is in Perl, and
really can't work without a Perl interpreter (although I suppose
something like it could be written for C).
Accidents of history (and
not necessarily lucky ones) have made C++ the most common language for
the implementation of applications like word processors, Web browsers,
and even many in-house applications like custom query systems. Current
conditions make it somewhat unlikely that the next one of any of these
kinds of applications will be written in Perl, albeit clearly more for
reasons of custom and inertia than out of consideration of what is the
right tool for the job.

However, other accidents of history have made Perl a well-accepted
language for design of server-side programs (generally in CGI form)
for Web site interfaces. Localization of static pages in Web sites is
trivial, feasible either with simple language-negotiation features in
servers like Apache, or with some kind of server-side inclusions of
language-appropriate text into layout templates. However, I think
that the localization of Perl-based search systems (or other kinds of
dynamic content) in Web sites, be they public or access-restricted,
is where Maketext will see the greatest use.

I presume that it would be only the exceptional Web site that gets
localized for English I<and> Chinese I<and> Italian I<and> Arabic
I<and> Russian, to recall the languages from the beginning of this
article -- to say nothing of German, Spanish, French, Japanese,
Finnish, and Hindi, to name a few languages that benefit from large
numbers of programmers or Web viewers or both.

However, the ever-increasing internationalization of the Web (whether
measured in terms of amount of content, of numbers of content writers
or programmers, or of size of content audiences) makes it increasingly
likely that the interface to the average Web-based dynamic content
service will be localized for two or maybe three languages.
It is my
hope that Maketext will make that task as simple as possible, and will
remove previous barriers to localization for languages dissimilar to
English.

  __END__

Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics
from Northwestern University; he specializes in language technology.
Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of
Linguistics at the University of New Mexico; he specializes in
morphology and pedagogy of North American native languages.

=head2 References

Alvestrand, Harald Tveit. 1995. I<RFC 1766: Tags for the
Identification of Languages.>
C<ftp://ftp.isi.edu/in-notes/rfc1766.txt>
[Now see RFC 3066.]

Callon, Ross, editor. 1996. I<RFC 1925: The Twelve
Networking Truths.>
C<ftp://ftp.isi.edu/in-notes/rfc1925.txt>

Drepper, Ulrich, Peter Miller,
and FranE<ccedil>ois Pinard. 1995-2001. GNU
C<gettext>. Available in C<ftp://prep.ai.mit.edu/pub/gnu/>, with
extensive docs in the distribution tarball. [Since
I wrote this article in 1998, I now see that the
gettext docs are now trying more to come to terms with
plurality. Whether useful conclusions have come from it
is another question altogether. -- SMB, May 2001]

Forbes, Nevill. 1964. I<Russian Grammar.> Third Edition, revised
by J. C. Dumbreck. Oxford University Press.

=cut

#End