Letter to Commissioner, Kannada and Culture, GoKI on Unicode Kannada fonts by Beluru Sudarshana

From: Beluru Sudarshana, Journalist, No. 917A (1044), "Krishna", Upper Floor, 1st F Main Road, 2nd Phase, Girinagar, Bengaluru 560085. Phone: 9741976789. E-mail: beluru@gmail.com

Most urgent letter. 4 February 2013

To: The Commissioner, Department of Kannada and Culture, Kannada Bhavana, Jayachamarajendra Road (J C Road), Bengaluru 560001

Subject: Request not to release the Unicode fonts developed by Maruthi Software Solutions, and to suspend immediately the tender awarded to the said firm

Respected Sir,

You placed your trust in me and invited me as a special invitee to the meeting of the Kannada Software Development Committee held on 17.1.2013. I took an active part in that meeting. I thank you and the members of the committee for giving me the opportunity to participate in a meeting of this distinguished committee, which includes senior Kannada writers and Kannada software experts.

Nevertheless, I wish to state humbly that I was both shocked and saddened on seeing the Kannada Unicode fonts that Maruthi Software Solutions demonstrated at this meeting. I brought the following points, which shocked me, to the attention of the meeting:

1) It is my firm opinion that none of the several Unicode fonts developed by Maruthi Software Solutions conforms to the geometric principles of Kannada typesetting, nor are they well proportioned. Moreover, at first glance it appears to me that these fonts do not follow the rules of typography. It may also be noted that Dr. Chandrashekhara Kambara and the artist Ra. Suri, who were present at the meeting that day, expressed several objections on this matter.

2) From your office files and from the minutes of the meeting, I learned that these Unicode fonts had earlier been reviewed and approved by Dr. Anantha Koppar and Dr. U. B. Pavanaja. However, my first impression is that this font approval process was not carried out according to technical standards appropriate to the present day.

3) Designing Kannada letterforms is not merely a matter of technology. Important fonts that Kannadigas will use widely in the coming decades must be designed as an expression of the cultural identity of the Kannada language. So questions as important as the technology arise here: Do these fonts carry the fragrance of Kannada? Do they carry forward the history and tradition of Kannada typesetting and typography with responsibility and distinction? We must find answers to these questions too. But to my limited knowledge, these fonts have not been reviewed against the background of letter culture and tradition.

Besides having brought these points to the attention of the meeting, I hereby bring them once more to your notice and wish to share some further points on the matter:

1) The rules of the font validation process (the process of approving a font) should have been framed at the very time Maruthi Software Solutions was developing the fonts. This is the first major lapse I observed in this process. Dr. Anantha Koppar and Dr. U. B. Pavanaja could have supplied this technical process. Regrettably, that did not happen. These fonts were produced as part of a project costing Rs. 56.00 lakh, on the premise that they would be used widely online and in the printing sector. Approving them without examining them against any rigorous, internationally recognised standards is a very serious lapse. Consequently, the font approval document now submitted is entirely non-technical and does not follow global norms. As a font-making firm, Maruthi is bound to follow all these standards (even if the department did not state them explicitly or in writing); this is a basic requirement of the font industry and also a responsibility to the community.
2) When I put a request on this matter to Dr. Anantha Koppar and Dr. Pavanaja, they replied; I have attached their reply to this letter (Annexure 14). As stated in that letter, I too took part in the technical review meeting held in your office on 29.1.2013. I did not find that any scientific procedures or community-based tests had been carried out for font validation. My opinion above therefore remains unchanged. Dr. Pavanaja has stated in his own letter that he did not test the font. He has also stated that, as regards the beauty of the letterforms, only Dr. Chandrashekhara Kambara's suggestions were followed. To my limited knowledge, Dr. Chandrashekhara Kambara is a senior writer among us with a deep aesthetic sensibility for the beauty of letters. But the beauty of letterforms is not limited to a first-glance, visual appreciation; the flaws of every single letter must be examined thoroughly and the scientific principles of artistry applied. That is exactly why font / typography expert artists and other experts are needed here.

3) Although I am only a journalist, through my own effort I have read and learned about font culture, typography, Unicode fonts, glyphs, letter design, the application of geometry in typography, and related matters. I have myself opened up fonts at the source level and studied them. For the 'Kanaja' Internet Kannada knowledge-base project, I held discussions with several font-making experts with a view to producing a 'Kanaja' Unicode font. In addition, with the help of Dr. U. B. Pavanaja and several others, I myself compiled standards for producing Unicode Kannada fonts (so that private font makers could produce the 'Kanaja' font) and sent them to about nine private font makers.
4) It is my humble opinion that although both Dr. Anantha Koppar and Dr. U. B. Pavanaja are knowledgeable about Kannada typography, neither has a background of practical experience in typography (I am not referring to academic qualifications here). It is my firm opinion that help is needed here from experts outside the membership of the Kannada Software Development Committee. For example, the Government of Karnataka should draw on the expertise of eminent persons such as Shri K. P. Rao, the father of Kannada computing, who first designed the Kannada computer keyboard and set a model for all Indian languages. In addition, people such as Dr. C. S. Yogananda can also be engaged for this work: he has followed with interest the works of the early twentieth century, when Kannada typesetting began, has digitised such old works and built a Unicode word-search facility for them, and has developed soft OCR (optical character recognition) software.

Your department has entrusted three more software projects to the same Maruthi Software Solutions. Spending a total of Rs. 56.00 lakh on all of these without framing and approving any technical and validation standards in writing seems to me a misuse of public money. For example, the tender records Rs. 18/19 lakh for developing Braille software. But the said firm has no experience whatsoever in this area, nor have I seen any visually impaired experts involved in the process. Shri T. S. Shridhara is a visually impaired young man who, living in the village of Abasi in Soraba, successfully carried out the Kannada development of the eSpeak software, which synthesises and reads Kannada Unicode text (Annexure 7). Shri Justin is a senior member of the Samarthanam organisation. If young, genuinely Kannadiga minds such as these are not involved in this special work, and the institutional experience of organisations such as the National Association for Blind of India (NABI) is not drawn upon, there is no doubt that such software will fail even if it is released. I have not written these names to recommend any individual; I have named them only by way of example.

It is a legal and social obligation that fonts and other software developed with such large sums of government money be released as free software (open-source software). I found no definite position in this font-making process on releasing the software under the General Public License (GPL) version 3.0 or the Open Font License (whichever is most suitable for public use and development). (Annexure 4)

It must be thoroughly examined whether the said fonts meet all the needs of the Kannada print medium.
Otherwise, our publishing industry will reject these fonts outright. These fonts must be engineered with technical precision so that they can be used in the application software used in publishing: proprietary software such as Adobe InDesign, Illustrator, Photoshop, CorelDRAW and MS Office, and free (open-source) software such as Inkscape, OpenOffice, GIMP and Scribus. I did not find that the experts had cited any standards on this point. Although it is true that these applications have Unicode rendering problems, those problems are gradually being resolved. At the 17 January meeting itself, Dr. U. B. Pavanaja personally pointed out to me that Microsoft's Tunga font renders correctly in InDesign 6.0, while the font developed by Maruthi does not render. Approving fonts without even using them is truly a highly irregular step. Whenever an item is developed for the government, approval must be given only after checking whether or not it works properly. But here no such appropriate standards were followed; moreover, no expert used and examined the fonts. I therefore humbly submit that releasing, for public use or even for trial use, fonts that have not been examined against technical, international standards, that contribute nothing to the Kannada lettering tradition, and that have fundamentally never undergone a usage test would not be consistent with public discipline.

Against this background, when I contacted Shri Ananda, owner of Cyberscape, the firm that developed the Akruti Kannada software, he came forward to provide the following services free of charge:

• Providing all the expertise needed for Kannada Unicode font validation.

• All the Unicode fonts of the Akruti software.

• Unicode-converted versions of the Akruti fonts.

• 'Akruti Vistar', a Kannada software package that runs on the Windows operating system and includes keyboard drivers, converters and a spell-check framework.

10) In addition, Shri Lohit, who has already released the 'Pada' Kannada Unicode software free of charge, has at my request come forward to offer this software (which uses the Unicode fonts already in use) freely for public use through the government, with his own slightly modified licensing (Annexure 12). Please take note of this as well. Shri Anand has come forward to give all of these to Kannadigas in OTF format, and his letter is attached for your information (Annexure 6). When such generous font-making experts exist, it cannot be proper to carry out work for Kannada without even taking them into consideration. (This software, too, must be approved only in accordance with the standards I have suggested.)

11) Regarding font validation standards, I have attached to this letter the standards I framed while I was with the 'Kanaja' project, together with printed booklets of the most important parts of those standards. I framed all of these while in contact with experts such as Dr. Pavanaja, Dr. Yogananda, Shri Ananda and Shri Cooper of Pune. Since the central government has issued an official notification adopting Unicode as the standard, it is fitting that the Government of Karnataka has recently issued a notification on the use of Unicode. I congratulate the government on this. I would like to mention here that 'Kanaja' too had pressed for this notification. Even so, not following the Unicode standards when approving Kannada Unicode fonts is not a proper course. The list of these standards is as follows:

No. 1. OpenType font standards (Annexure 1)
No. 2. Unicode standards (Annexure 2)
No. 3. Standards for Kannada framed by Microsoft (Annexure 3)

Against this background, I request the following of you:

1) To validate / approve the Kannada fonts developed by Maruthi Software Solutions, a committee of experts proficient in the Kannada font field should be constituted, and the framing of suitable standards for font approval should also be entrusted to this committee. I personally suggest the following names (the department may take the final decision on this):

1) Shri K. P. Rao, Manipal (a brief profile of him is attached; Annexure 5)
2) Dr. C. S. Yogananda, Head, Department of Mathematics, SJCE, Mysuru
3) Dr. Chandranatha Acharya, senior artist, Bengaluru
4) Shri Vinay Saynekar, typography expert, Mumbai

2) The font converter software developed by Maruthi must likewise be approved only after being subjected to rigorous technical standards. (At the meeting I attended, when demonstrating the converter, they offered the excuse that they did not have other fonts with them. But a software demonstration at a meeting leaves no room for any such excuse. They must install the fonts and show the capability of their software by actual test.) The special features that are absent from the free / open converters already publicly available should be listed and reviewed, and the software approved only if there is a genuine need for it. Converters are already available in the Kuvempu, Pada, Baraha and Nudi software. There must be a strict examination of whether this software offers anything beyond what those provide.

3) In addition, the Unicode fonts, the software and the converters should be put out for testing among the free and open-source software communities working for Kannada and in Kannada, including Kannada Wikipedia, and released for public use only after they have passed all the standards there. Until then, no payment should be made to the software makers. In sum, the essence of my three demands above is that the fonts be approved only after taking all these aspects into account: technical testing, artistry, discipline and beauty, and the character of the cultural tradition.

4) Because of inadequate or non-existent standards, contemporary conditions that were not taken into account, and non-existent validation rules, this tender should be kept in abeyance / suspended immediately. It should also be considered at once whether software of this kind is already available to society freely / free of charge. What the standards for developing this software ought to be should be documented precisely. If all these steps are not followed, the tender for developing this software will have no scientific basis.

5) A round-table meeting should be convened on the overall status of the recommendations made by the Kannada Software Development Committee. All technologists should be invited to it: software makers who already provide software free of charge, private makers, font experts, book publishers, experienced DTP practitioners, people with mobile expertise, experts in the free operating system Linux/GNU, and Shri Om Shivaprakash, a young software expert active in this field. This meeting should be conducted with a focus on technology. To this same meeting should be invited those knowledgeable about fonts and culture: Dr. K. P. Rao, Dr. Yogananda, Shri Sheshadri Vasu of 'Baraha', Shri Lohit, the enthusiastic young man who developed the free 'Pada' software, Shri Vinay Saynekar and others.
Experts from the government's e-Governance Department, the Indian Institute of Science and C-DAC may also attend this meeting. It will take at least another year for Unicode use to become universal in application software. That is not a task within our control; but the government can hold consultations with such international private companies and can establish communication with those who develop free software. This too will benefit the meeting. Representatives of such international application developers should also be invited to the meeting.

6) Before this meeting is convened, a white paper (a status report will also do) on the current state of Kannada software should be prepared by experts other than the members of the software committee and circulated for discussion ahead of the meeting.

7) The Kuvempu software was also developed by Hampi Kannada University with Government of Karnataka funds (I understand that this same Maruthi Software Solutions developed it). Earlier, the Nudi software too had received government assistance. (It is reported that Nudi version 5.0 was released on 29 January by the Hon'ble Chief Minister himself and will be available free of charge on the Government of Karnataka website. In other words, the government itself has endorsed the use of the Nudi software. In that case, will the software developed by the Department of Kannada and Culture also be used?) Now the Department of Kannada and Culture too has granted Rs. 56 lakh. For the government to release money through different agencies for the same kind of work in this way is a highly improper course. In the Kanaja project too (even before I joined that project), the Karnataka Knowledge Commission had included a font-making component (unaware of the very existence of the Kannada Software Development Committee). That is why, although I was obliged to set the font development process in motion, I held the process back during the first year. In the second financial year I organised preliminary consultation meetings with the main aim of understanding the font-making process. But there too I argued several times that it is not right for the government to have the same work done through different departments in this way. For example, this same Maruthi firm had given a quotation of Rs. 62,500 for making the 'Kanaja' font. On what basis did the same firm give the Department of Kannada and Culture a quotation of Rs. 10.75 lakh? Therefore a stop should be put to different agencies making and using software simultaneously and unilaterally with government money, and a unified agenda should be followed only under the joint supervision of the Department of Kannada and Culture and the e-Governance Department.

I earnestly request you to kindly consider this letter seriously and agree to my demands. Without the firm backbone of letter culture and tradition, the mask of technology is worth nothing. If the Department of Kannada and Culture, which was established precisely to work for the public good, does not embed the fundamental aims of Kannada culture in font making, that amounts to neglecting the department's own agenda. I have written this letter with the good intention that Kannada font making should not proceed through a merely nominal approval and thereby do injustice to Kannadigas. As I took part only in the final stage of this process, I am recording my views late. On this subject I will give all the help I can, at any time, for you to hold discussions with any experts. There is certainly no shortage of expert Kannadigas, young minds and unprejudiced personalities interested in developing Kannada software. Nor should there be any haste in work of this kind, which will directly affect the print medium for decades. Implementation of the recommendations of the Kannada Software Development Committee may have been delayed; but this delay need not, and must not, become the basis for a hasty approval of software.

Meanwhile, I have met Maruthi and Dr. U. B. Pavanaja and discussed this matter. They have acknowledged that the fonts have not been validated. Maruthi did not indicate its agreement with several of the objections I raised. However, Dr. Pavanaja has stated that he gave several suggestions at various stages of the font development but that they were not documented.
Before this meeting, Dr. Pavanaja had sent a list of some standards. It is given in Annexure 14. But the fact is that none of these covers the cultural, indigenous and traditional requirements I have raised; nor do they cover a technical validation process.

Yours faithfully,

(Beluru Sudarshana)

Annexures attached (No., annexure number, subject):

1. Annexure 1: OpenType font standards
2. Annexure 2: Unicode standards
3. Annexure 3: Standards for Kannada framed by Microsoft
4. Annexure 4: GPL and OPL licence templates
5. Annexure 5: Profile of K. P. Rao
6. Annexure 6: Letter from Anand, Cyberscape
7. Annexure 7: Profile of T. S. Shridhara
8. Annexure 8: Sample Telugu Unicode font validation kit
9. Annexure 9: Central government circular on the Unicode standard
10. Annexure 10: Copy of the e-mail I wrote earlier, before the Kanaja font was made, to the implementing agency of the Kanaja project, stating that it is not right for different agencies to do the same work
11. Annexure 11: Samples of the Unicode and other fonts in the Akruti software developed by Cyberscape, which the firm has come forward to provide free of charge for public use under OFL licensing
12. Annexure 12: Letter from Shri Lohit, who has come forward to give the 'Pada' software free of charge for public use through the Government of Karnataka
13. Annexure 13: Report of the statement by Shri Mukhyamantri Chandru, Chairman of the Kannada Development Authority, on the release of Nudi version 5.0
14. Annexure 14: Copies of the correspondence with Dr. Pavanaja and Shri Anantha Koppar

A copy of this letter has been submitted to the following, for information and response:

1) Dr. Chidanandagowda, Chairman, Kannada Software Development Committee (selected annexures only; the remaining annexures are available online and will be printed and supplied only if needed. Reducing the printing of paper is the sole reason for not printing them.)
2) Dr. Chandrashekhara Kambara, distinguished member of the Kannada Software Development Committee and senior writer campaigning for Kannada software (selected annexures only; the remaining annexures are available online and will be printed and supplied only if needed. Reducing the printing of paper is the sole reason for not printing them.)
3) The Principal Secretary, e-Governance Department, Government of Karnataka (complete copy, as is)
4) The Secretary, Department of Kannada and Culture (complete copy, as is)

The Unicode Standard Version 6.2 – Core Specification

To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.

The authors and publisher have taken care in the preparation of this specification, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided.

Copyright © 1991–2012 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html.

The Unicode Standard / the Unicode Consortium ; edited by Julie D. Allen ... [et al.]. Version 6.2. Includes bibliographical references and index. ISBN 978-1-936213-07-8 (http://www.unicode.org/versions/Unicode6.2.0/) 1. Unicode (Computer character set) I. Allen, Julie D. II. Unicode Consortium. QA268.U545 2012

ISBN 978-1-936213-07-8
Published in Mountain View, CA
September 2012

Chapter 3

Conformance

This chapter defines conformance to the Unicode Standard in terms of the principles and encoding architecture it embodies. The first section defines the format for referencing the Unicode Standard and Unicode properties. The second section consists of the conformance clauses, followed by sections that define more precisely the technical terms used in those clauses. The remaining sections contain the formal algorithms that are part of conformance and referenced by the conformance clause. Additional definitions and algorithms that are part of this standard can be found in the Unicode Standard Annexes listed at the end of Section 3.2, Conformance Requirements.

In this chapter, conformance clauses are identified with the letter C. Definitions are identified with the letter D. Bulleted items are explanatory comments regarding definitions or subclauses.

A number of clauses and definitions have been updated from their wording in prior versions of the Unicode Standard. A detailed listing of these changes since Version 5.0, as well as a listing of any new definitions added, is available in Section D.2, Clause and Definition Updates.

For information on implementing best practices, see Chapter 5, Implementation Guidelines.

3.1 Versions of the Unicode Standard

For most character encodings, the character repertoire is fixed (and often small). Once the repertoire is decided upon, it is never changed. Addition of a new abstract character to a given repertoire creates a new repertoire, which will be treated either as an update of the existing character encoding or as a completely new character encoding.

For the Unicode Standard, by contrast, the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is a potential candidate to be encoded, regardless of whether the character is currently known.

Each new version of the Unicode Standard supersedes the previous one, but implementations (and, more significantly, data) are not updated instantly. In general, major and minor version changes include new characters, which do not create particular problems with old data. The Unicode Technical Committee will neither remove nor move characters. Characters may be deprecated, but this does not remove them from the standard or from existing data. The code point for a deprecated character will never be reassigned to a different character, but the use of a deprecated character is strongly discouraged. These rules make the encoded characters of a new version backward-compatible with previous versions.

Implementations should be prepared to be forward-compatible with respect to Unicode versions. That is, they should accept text that may be expressed in future versions of this standard, recognizing that new characters may be assigned in those versions. Thus they should handle incoming unassigned code points as they do unsupported characters. (See Section 5.3, Unknown and Missing Characters.)
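The forward-compatibility advice above can be illustrated with a minimal Python sketch. It is not part of the standard: the helper names `is_unassigned` and `display_form` are hypothetical, and showing U+FFFD as a fallback is only one possible way of treating an unassigned code point "as an unsupported character". The check relies on the standard library's `unicodedata.category`, which reports `Cn` for code points that the interpreter's own Unicode database does not assign.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, used here only for display


def is_unassigned(ch: str) -> bool:
    """True if this interpreter's Unicode database has no assignment for ch."""
    return unicodedata.category(ch) == "Cn"


def display_form(text: str) -> str:
    """Return a copy for display with unassigned code points shown as U+FFFD.

    The input string is not modified, so the original code points survive
    storage and interchange even if a later Unicode version assigns them.
    """
    return "".join(REPLACEMENT if is_unassigned(ch) else ch for ch in text)


# KANNADA LETTER KA followed by U+0378, a gap in the Greek block that is
# assumed here to be unassigned in the interpreter's database.
incoming = "\u0c95\u0378"
shown = display_form(incoming)
print(len(incoming), [f"U+{ord(c):04X}" for c in shown])
```

The design choice in this sketch is to accept the text and degrade only its presentation; rejecting or deleting the unknown code point would lose data that a newer implementation could interpret.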


A version change may also involve changes to the properties of existing characters. When this situation occurs, modifications are made to the Unicode Character Database and a new update version is issued for the standard. Changes to the data files may alter program behavior that depends on them. However, such changes to properties and to data files are never made lightly. They are made only after careful deliberation by the Unicode Technical Committee has determined that there is an error, inconsistency, or other serious problem in the property assignments.

Stability

Each version of the Unicode Standard, once published, is absolutely stable and will never change. Implementations or specifications that refer to a specific version of the Unicode Standard can rely upon this stability. When implementations or specifications are upgraded to a future version of the Unicode Standard, then changes to them may be necessary. Note that even errata and corrigenda do not formally change the text of a published version; see “Errata and Corrigenda” later in this section.

Some features of the Unicode Standard are guaranteed to be stable across versions. These include the names and code positions of characters, their decompositions, and several other character properties for which stability is important to implementations. See also “Stability of Properties” in Section 3.5, Properties.

The formal statement of such stability guarantees is contained in the policies on character encoding stability found on the Unicode Web site. See the subsection “Policies” in Section B.6, Other Unicode Online Resources. See the discussion of backward compatibility in section 2.5 of Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax,” and the subsection “Interacting with Downlevel Systems” in Section 5.3, Unknown and Missing Characters.
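As a small illustration of the stable features listed above (names, code positions and decompositions), the following Python sketch looks up two Kannada characters with the standard library's `unicodedata` module. The helper name `describe` is hypothetical. The values shown are those of the Unicode Character Database, and the stability policies imply they should be the same whichever Unicode version the interpreter ships.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect properties that the stability policies keep fixed across versions."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "decomposition": unicodedata.decomposition(ch),  # "" if there is none
    }


ka = describe("\u0c95")       # KANNADA LETTER KA, no decomposition
sign_ii = describe("\u0cc0")  # KANNADA VOWEL SIGN II, decomposes to two signs

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)
print(ka)
print(sign_ii)
```

A font validation kit could use the same lookups to confirm that a font's character map refers to the intended code points by name rather than by hard-coded numbers.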

Version Numbering

Version numbers for the Unicode Standard consist of three fields, denoting the major version, the minor version, and the update version, respectively. For example, “Unicode 5.2.0” indicates major version 5 of the Unicode Standard, minor version 2 of Unicode 5, and update version 0 of minor version Unicode 5.2.

Additional information on the current and past versions of the Unicode Standard can be found on the Unicode Web site. See the subsection “Versions” in Section B.6, Other Unicode Online Resources. The online document contains the precise list of contributing files from the Unicode Character Database and the Unicode Standard Annexes, which are formally part of each version of the Unicode Standard.

Major and Minor Versions. Major and minor versions have significant additions to the standard, including, but not limited to, additions to the repertoire of encoded characters. Both are published as an updated core specification, together with associated updates to Unicode Standard Annexes and the Unicode Character Database. Such versions consolidate all errata and corrigenda and supersede any prior documentation for major, minor, or update versions. A major version typically is of more importance to implementations; however, even update versions may be important to particular companies or other organizations. Major and minor versions are often synchronization points with related standards, such as with ISO/IEC 10646.

Prior to Version 5.2, minor versions of the standard were published as online amendments expressed as textual changes to the previous version, rather than as fully consolidated new editions of the core specification.


Update Version. An update version represents relatively small changes to the standard, typically updates to the data files of the Unicode Character Database. An update version never involves any additions to the character repertoire. These versions are published as modifications to the data files, and, on occasion, include documentation of small updates for selected errata or corrigenda. Formally, each new version of the Unicode Standard supersedes all earlier versions. However, because of the differences in the way versions are documented, update versions generally do not obsolete the documentation of the immediately prior version of the standard.
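The three-field version scheme described above can be sketched in Python as follows. The names `UnicodeVersion`, `parse_version` and `may_add_characters` are hypothetical, chosen for illustration only. The last helper encodes just the rule stated in the text, that an update version never adds to the character repertoire while major and minor versions may.

```python
import re
from typing import NamedTuple


class UnicodeVersion(NamedTuple):
    major: int
    minor: int
    update: int


_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_version(text: str) -> UnicodeVersion:
    """Extract the major.minor.update fields from a string like 'Unicode 5.2.0'."""
    match = _VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"no three-field version number in {text!r}")
    return UnicodeVersion(*(int(part) for part in match.groups()))


def may_add_characters(old: UnicodeVersion, new: UnicodeVersion) -> bool:
    """True if moving from old to new crosses a major or minor version boundary.

    A change in the update field alone never adds to the repertoire.
    """
    return (new.major, new.minor) > (old.major, old.minor)


v = parse_version("Unicode 5.2.0")
print(v.major, v.minor, v.update)
```

For example, moving from 6.1.0 to 6.2.0 crosses a minor-version boundary and may bring new characters, whereas moving from 3.1.0 to 3.1.1 changes only the update field.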

Errata and Corrigenda From time to time it may be necessary to publish errata or corrigenda to the Unicode Standard. Such errata and corrigenda will be published on the Unicode Web site. See Section B.6, Other Unicode Online Resources, for information on how to report errors in the standard. Errata. Errata correct errors in the text or other informative material, such as the representative glyphs in the code charts. See the subsection “Updates and Errata” in Section B.6, Other Unicode Online Resources. Whenever a new major or minor version of the standard is published, all errata up to that point are incorporated into the core specification, code charts, or other components of the standard. Corrigenda. Occasionally errors may be important enough that a corrigendum is issued prior to the next version of the Unicode Standard. Such a corrigendum does not change the contents of the previous version. Instead, it provides a mechanism for an implementation, protocol, or other standard to cite the previous version of the Unicode Standard with the corrigendum applied. If a citation does not specifically mention the corrigendum, the corrigendum does not apply. For more information on citing corrigenda, see “Versions” in Section B.6, Other Unicode Online Resources.

References to the Unicode Standard The documents associated with the major, minor, and update versions are called the major reference, minor reference, and update reference, respectively. For example, consider Unicode Version 3.1.1. The major reference for that version is The Unicode Standard, Version 3.0 (ISBN 0-201-61633-5). The minor reference is Unicode Standard Annex #27, “The Unicode Standard, Version 3.1.” The update reference is Unicode Version 3.1.1. The exact list of contributory files, Unicode Standard Annexes, and Unicode Character Database files can be found at Enumerated Version 3.1.1. The reference for this version, Version 6.2.0, of the Unicode Standard, is

The Unicode Consortium. The Unicode Standard, Version 6.2.0, defined by: The Unicode Standard, Version 6.2 (Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-07-8)

References to an update (or minor version prior to Version 5.2.0) include a reference to both the major version and the documents modifying it. For the standard citation format for other versions of the Unicode Standard, see “Versions” in Section B.6, Other Unicode Online Resources.

Precision in Version Citation Because Unicode has an open repertoire with relatively frequent updates, it is important not to over-specify the version number. Wherever the precise behavior of all Unicode characters needs to be cited, the full three-field version number should be used, as in the first


Conformance

example below. However, trailing zeros are often omitted, as in the second example. In such a case, writing 3.1 is in all respects equivalent to writing 3.1.0.

1. The Unicode Standard, Version 3.1.1
2. The Unicode Standard, Version 3.1
3. The Unicode Standard, Version 3.0 or later
4. The Unicode Standard

Where some basic level of content is all that is important, phrasing such as in the third example can be used. Where the important information is simply the overall architecture and semantics of the Unicode Standard, the version can be omitted entirely, as in example 4.
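The trailing-zero rule can be made concrete with a small parser. The following is a minimal Python sketch, not part of the standard. The function names `parse_unicode_version` and `is_version_or_later` are hypothetical, and treating a bare major number such as "5" as 5.0.0 is an assumption that extends the rule stated above for 3.1 and 3.1.0.

```python
from typing import Tuple


def parse_unicode_version(text: str) -> Tuple[int, int, int]:
    """Return (major, minor, update), padding omitted trailing fields with 0."""
    fields = text.strip().split(".")
    if not 1 <= len(fields) <= 3 or not all(f.isdigit() for f in fields):
        raise ValueError(f"not a Unicode version number: {text!r}")
    numbers = [int(f) for f in fields]
    numbers += [0] * (3 - len(numbers))  # "3.1" is treated exactly like "3.1.0"
    return numbers[0], numbers[1], numbers[2]


def is_version_or_later(actual: str, minimum: str) -> bool:
    """True if `actual` satisfies a citation of the form 'Version X or later'."""
    return parse_unicode_version(actual) >= parse_unicode_version(minimum)
```

Comparing the padded tuples keeps "3.1" and "3.1.0" interchangeable while still distinguishing 3.1.1 from 3.1.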

References to Unicode Character Properties Properties and property values have defined names and abbreviations, such as

Property: General_Category (gc)
Property Value: Uppercase_Letter (Lu)

To reference a given property and property value, these aliases are used, as in this example:

The property value Uppercase_Letter from the General_Category property, as specified in Version 6.2.0 of the Unicode Standard.

Then cite that version of the standard, using the standard citation format that is provided for each version of the Unicode Standard. When referencing multi-word properties or property values, it is permissible to omit the underscores in these aliases or to replace them by spaces. When referencing a Unicode character property, it is customary to prepend the word “Unicode” to the name of the property, unless it is clear from context that the Unicode Standard is the source of the specification.
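As an illustration of such a reference, the sketch below builds the citation sentence from a runtime lookup. It assumes Python's standard `unicodedata` module, whose data corresponds to whatever Unicode version the interpreter ships (reported by `unicodedata.unidata_version`), which is not necessarily 6.2.0. The helper name `cite_general_category` and the three-entry alias table are hypothetical and deliberately incomplete.

```python
import unicodedata

# Short alias -> long alias for a few General_Category values (partial table,
# included only to keep the example small).
GC_LONG_ALIASES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Nd": "Decimal_Number",
}


def cite_general_category(ch: str) -> str:
    """Build a property reference in the citation style described above."""
    short = unicodedata.category(ch)           # e.g. "Lu"
    long_name = GC_LONG_ALIASES.get(short, short)
    return (
        f"The property value {long_name} from the General_Category property, "
        f"as specified in Version {unicodedata.unidata_version} "
        "of the Unicode Standard."
    )
```

The version in the sentence comes from the runtime rather than being hard-coded, so the citation always names the data that was actually consulted.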

References to Unicode Algorithms A reference to a Unicode algorithm must specify the name of the algorithm or its abbreviation, followed by the version of the Unicode Standard, as in this example: The Unicode Bidirectional Algorithm, as specified in Version 6.2.0 of the Unicode Standard. See Unicode Standard Annex #9, “Unicode Bidirectional Algorithm,” (http://www.unicode.org/reports/tr9/tr9-25.html) Where algorithms allow tailoring, the reference must state whether any such tailorings were applied or are applicable. For algorithms contained in a Unicode Standard Annex, the document itself and its location on the Unicode Web site may be cited as the location of the specification. When referencing a Unicode algorithm it is customary to prepend the word “Unicode” to the name of the algorithm, unless it is clear from the context that the Unicode Standard is the source of the specification. Omitting a version number when referencing a Unicode algorithm may be appropriate when such a reference is meant as a generic reference to the overall algorithm. Such a generic reference may also be employed in the sense of latest available version of the algorithm. However, for specific and detailed conformance claims for Unicode algorithms,


generic references are generally not sufficient, and a full version number must accompany the reference.

3.2 Conformance Requirements This section presents the clauses specifying the formal conformance requirements for processes implementing Version 6.2 of the Unicode Standard. In addition to this core specification, the Unicode Standard, Version 6.2.0, includes a number of Unicode Standard Annexes (UAXes) and the Unicode Character Database. At the end of this section there is a list of those annexes that are considered an integral part of the Unicode Standard, Version 6.2.0, and therefore covered by these conformance requirements. The Unicode Character Database contains an extensive specification of normative and informative character properties completing the formal definition of the Unicode Standard. See Chapter 4, Character Properties, for more information. Not all conformance requirements are relevant to all implementations at all times because implementations may not support the particular characters or operations for which a given conformance requirement may be relevant. See Section 2.14, Conforming to the Unicode Standard, for more information. In this section, conformance clauses are identified with the letter C.

Code Points Unassigned to Abstract Characters C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. • The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form. They are unassigned to any abstract character. C2 A process shall not interpret a noncharacter code point as an abstract character. • The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly. C3 A process shall not interpret an unassigned code point as an abstract character. • This clause does not preclude the assignment of certain generic semantics to unassigned code points (for example, rendering with a glyph to indicate the position within a character block) that allow for graceful behavior in the presence of code points that are outside a supported subset. • Unassigned code points may have default property values. (See D26.) • Code points whose use has not yet been designated may be assigned to abstract characters in future versions of the standard. Because of this fact, due care in the handling of generic semantics for such code points is likely to provide better robustness for implementations that may encounter data based on future versions of the standard.
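Clauses C1 through C3 amount to a guard that a process can apply before interpreting a code point. The following is a rough Python sketch under stated assumptions: the function names are hypothetical, and "unassigned" is approximated by General_Category `Cn` as reported by the interpreter's bundled Unicode data, which may be newer than Version 6.2.

```python
import unicodedata


def is_surrogate(cp: int) -> bool:
    """High- and low-surrogate code points, U+D800..U+DFFF (clause C1)."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of every plane (clause C2)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def may_interpret_as_character(cp: int) -> bool:
    """False for code points that must not be interpreted as abstract characters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("value lies outside the Unicode codespace")
    if is_surrogate(cp) or is_noncharacter(cp):
        return False
    # General_Category Cn marks code points unassigned in the runtime's data (C3).
    return unicodedata.category(chr(cp)) != "Cn"
```

Private-use code points pass this guard; their semantics are simply outside the scope of the standard.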

Interpretation Interpretation of characters is the key conformance requirement for the Unicode Standard, as it is for any coded character set standard. In legacy character set standards, the single


conformance requirement is generally stated in terms of the interpretation of bit patterns used as characters. Conforming to a particular standard requires interpreting bit patterns used as characters according to the list of character names and the glyphs shown in the associated code table that form the bulk of that standard. Interpretation of characters is a more complex issue for the Unicode Standard. It includes the core issue of interpreting code points used as characters according to the names and representative glyphs shown in the code charts, of course. However, the Unicode Standard also specifies character properties, behavior, and interactions between characters. Such information about characters is considered an integral part of the “character semantics established by this standard.” Information about the properties, behavior, and interactions between Unicode characters is provided in the Unicode Character Database and in the Unicode Standard Annexes. Additional information can be found throughout the other chapters of this core specification for the Unicode Standard. However, because of the need to keep extended discussions of scripts, sets of symbols, and other characters readable, material in other chapters is not always labeled as to its normative or informative status. In general, supplementary semantic information about a character is considered normative when it contributes directly to the identification of the character or its behavior. Additional information provided about the history of scripts, the languages which use particular characters, and so forth, is merely informative. 
Thus, for example, the rules about Devanagari rendering specified in Section 9.1, Devanagari, or the rules about Arabic character shaping specified in Section 8.2, Arabic, are normative: they spell out important details about how those characters behave in conjunction with each other that is necessary for proper and complete interpretation of the respective Unicode characters covered in each section. C4 A process shall interpret a coded character sequence according to the character semantics established by this standard, if that process does interpret that coded character sequence. • This restriction does not preclude internal transformations that are never visible external to the process. C5 A process shall not assume that it is required to interpret any particular coded character sequence. • Processes that interpret only a subset of Unicode characters are allowed; there is no blanket requirement to interpret all Unicode characters. • Any means for specifying a subset of characters that a process can interpret is outside the scope of this standard. • The semantics of a private-use code point is outside the scope of this standard. • Although these clauses are not intended to preclude enumerations or specifications of the characters that a process or system is able to interpret, they do separate supported subset enumerations from the question of conformance. In actuality, any system may occasionally receive an unfamiliar character code that it is unable to interpret. C6 A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct. • The implications of this conformance clause are twofold. First, a process is never required to give different interpretations to two different, but canonical-equivalent character sequences. Second, no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences.


• Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them. • Even processes that normally do not distinguish between canonical-equivalent character sequences can have reasonable exception behavior. Some examples of this behavior include graceful fallback processing by processes unable to support correct positioning of nonspacing marks; “Show Hidden Text” modes that reveal memory representation structure; and the choice of ignoring collating behavior of combining character sequences that are not part of the repertoire of a specified language (see Section 5.12, Strategies for Handling Nonspacing Marks).
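A process that wants to honor C6 can compare strings by canonical equivalence rather than by raw code points. The sketch below is one way to do that in Python, assuming the standard `unicodedata` module; the function name `canonically_equivalent` is hypothetical. Two strings are canonical equivalents when their NFD forms are identical.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

For example, U+00C5 and the sequence <U+0041, U+030A> compare equal here, while the ligature U+FB01 and the two letters "fi" do not, because that pair is only compatibility-equivalent.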

Modification C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences. • Replacement of a character sequence by a compatibility-equivalent sequence does modify the interpretation of the text. • Replacement or deletion of a character sequence that the process cannot or does not interpret does modify the interpretation of the text. • Changing the bit or byte ordering of a character sequence when transforming it between different machine architectures does not modify the interpretation of the text. • Changing a valid coded character sequence from one Unicode character encoding form to another does not modify the interpretation of the text. • Changing the byte serialization of a code unit sequence from one Unicode character encoding scheme to another does not modify the interpretation of the text. • If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character. If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text. In general, a noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point. • Note that security problems can result if noncharacter code points are removed from text received from external sources. For more information, see Section 16.7, Noncharacters, and Unicode Technical Report #36, “Unicode Security Considerations.” • All processes and higher-level protocols are required to abide by conformance clause C7 at a minimum. 
However, higher-level protocols may define additional equivalences that do not constitute modifications under that protocol. For example, a higher-level protocol may allow a sequence of spaces to be replaced by a single space. • There are important security issues associated with the correct interpretation and display of text. For more information, see Unicode Technical Report #36, “Unicode Security Considerations.”


Character Encoding Forms C8 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence. • The specification of the code unit sequences for UTF-8 is given in D92. • The specification of the code unit sequences for UTF-16 is given in D91. • The specification of the code unit sequences for UTF-32 is given in D90. C9 When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences. • The definition of each Unicode character encoding form specifies the ill-formed code unit sequences in the character encoding form. For example, the definition of UTF-8 (D92) specifies that code unit sequences such as <C0 AF> are ill-formed. C10 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. • For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx₂ as an illegally terminated code unit sequence, for example by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character. • Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. 
For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89. • Utility programs are not prevented from operating on “mangled” text. For example, a UTF-8 file could have had CRLF sequences introduced at every 80 bytes by a bad mailer program. This could result in some UTF-8 byte sequences being interrupted by CRLFs, producing illegal byte sequences. This mangled text is no longer UTF-8. It is permissible for a conformant program to repair such text, recognizing that the mangled text was originally well-formed UTF-8 byte sequences. However, such repair of mangled data is a special case, and it must not be used in circumstances where it would cause security problems. There are important security issues associated with encoding conversion, especially with the conversion of malformed text. For more information, see Unicode Technical Report #36, “Unicode Security Considerations.”
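The behavior required by C9 and C10 can be observed with Python's built-in UTF-8 codec, used here purely as an example of one implementation; the helper names are hypothetical. In strict mode the codec signals an error for ill-formed input, and with the `replace` error handler it substitutes U+FFFD, which corresponds to two of the options listed under C10.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if `data` decodes as UTF-8 without any error (strict mode)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def decode_utf8_with_markers(data: bytes) -> str:
    """Decode, representing each ill-formed subsequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def can_emit_utf8(text: str) -> bool:
    """True if `text` can be serialized as well-formed UTF-8 (clause C9)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False  # e.g. the string contains a lone surrogate code point
    return True
```

Note that the ill-formed sequence <C0 AF> is never decoded as "/", even in the lenient mode; decoding it that way would be the kind of interpretation C10 forbids.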

Character Encoding Schemes C11 When a process interprets a byte sequence which purports to be in a Unicode character encoding scheme, it shall interpret that byte sequence according to the byte order and specifications for the use of the byte order mark established by this standard for that character encoding scheme.


• Machine architectures differ in ordering in terms of whether the most significant byte or the least significant byte comes first. These sequences are known as “big-endian” and “little-endian” orders, respectively. • For example, when using UTF-16LE, pairs of bytes are interpreted as UTF-16 code units using the little-endian byte order convention, and any initial <FF FE> sequence is interpreted as U+FEFF zero width no-break space (part of the text), rather than as a byte order mark (not part of the text). (See D97.)
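The UTF-16LE example can be reproduced with Python's codecs, again as an illustration of one implementation rather than a normative reference; the two function names are hypothetical. The `utf-16-le` codec names the byte order explicitly and therefore keeps an initial <FF FE> as U+FEFF in the text, whereas the `utf-16` codec treats it as a byte order mark and removes it.

```python
def decode_utf16le(data: bytes) -> str:
    """UTF-16LE encoding scheme: byte order is fixed, so no BOM is consumed."""
    return data.decode("utf-16-le")


def decode_utf16(data: bytes) -> str:
    """UTF-16 encoding scheme: an initial BOM selects the byte order and is dropped."""
    return data.decode("utf-16")


# The bytes FF FE followed by 41 00 (the letter "A" in little-endian order).
SAMPLE = bytes([0xFF, 0xFE, 0x41, 0x00])
```

Decoding `SAMPLE` with `decode_utf16le` yields two code points, U+FEFF and U+0041; decoding it with `decode_utf16` yields only U+0041.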

Bidirectional Text C12 A process that displays text containing supported right-to-left characters or embedding codes shall display all visible representations of characters (excluding format characters) in the same order as if the Bidirectional Algorithm had been applied to the text, unless tailored by a higher-level protocol as permitted by the specification. • The Bidirectional Algorithm is specified in Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”

Normalization Forms C13 A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in Section 3.11, Normalization Forms. C14 A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in Section 3.11, Normalization Forms. C15 A process that purports to transform text into a Normalization Form must be able to produce the results of the conformance test specified in Unicode Standard Annex #15, “Unicode Normalization Forms.” • This means that when a process uses the input specified in the conformance test, its output must match the expected output of the test.
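The conformance test mentioned in C15 is distributed as a data file of semicolon-separated columns. The sketch below checks a single line of that shape against Python's `unicodedata.normalize`. The column invariants are the ones described in the header of the NormalizationTest.txt data file and should be verified against the copy for the version under test; the function names and the sample lines in the usage note are illustrative assumptions.

```python
import unicodedata


def _field_to_text(field: str) -> str:
    """Turn a field such as '0044 0307' into the corresponding string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_normalization_test_line(line: str) -> bool:
    """Check one 'c1;c2;c3;c4;c5;' line against the runtime's normalizer."""
    data = line.split("#", 1)[0]  # drop any trailing comment
    c1, c2, c3, c4, c5 = (_field_to_text(f) for f in data.split(";")[:5])
    columns = (c1, c2, c3, c4, c5)

    def form(name: str, text: str) -> str:
        return unicodedata.normalize(name, text)

    return (
        all(form("NFC", c) == c2 for c in (c1, c2, c3))
        and all(form("NFC", c) == c4 for c in (c4, c5))
        and all(form("NFD", c) == c3 for c in (c1, c2, c3))
        and all(form("NFD", c) == c5 for c in (c4, c5))
        and all(form("NFKC", c) == c4 for c in columns)
        and all(form("NFKD", c) == c5 for c in columns)
    )
```

A line such as `1E0A;1E0A;0044 0307;1E0A;0044 0307;` should pass, and a line whose second column is not the NFC form of the first should fail.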

Normative References C16 Normative references to the Unicode Standard itself, to property aliases, to property value aliases, or to Unicode algorithms shall follow the formats specified in Section 3.1, Versions of the Unicode Standard. C17 Higher-level protocols shall not make normative references to provisional properties. • Higher-level protocols may make normative references to informative properties.

Unicode Algorithms C18 If a process purports to implement a Unicode algorithm, it shall conform to the specification of that algorithm in the standard, including any tailoring by a higher-level protocol as permitted by the specification. • The term Unicode algorithm is defined at D17. • An implementation claiming conformance to a Unicode algorithm need only guarantee that it produces the same results as those specified in the logical description of the process; it is not required to follow the actual described procedure in detail. This allows room for alternative strategies and optimizations in implementation.


C19 The specification of an algorithm may prohibit or limit tailoring by a higher-level protocol. If a process that purports to implement a Unicode algorithm applies a tailoring, that fact must be disclosed. • For example, the algorithms for normalization and canonical ordering are not tailorable. The Bidirectional Algorithm allows some tailoring by higher-level protocols. The Unicode Default Case algorithms may be tailored without limitation.

Default Casing Algorithms C20 An implementation that purports to support Default Case Conversion, Default Case Detection, or Default Caseless Matching shall do so in accordance with the definitions and specifications in Section 3.13, Default Case Algorithms. • A conformant implementation may perform casing operations that are different from the default algorithms, perhaps tailored to a particular orthography, so long as the fact that a tailoring is applied is disclosed.
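To make the distinction between default and tailored casing concrete, the sketch below contrasts the two. It assumes that Python's `str.casefold` performs full case folding according to the interpreter's bundled Unicode data. The function `turkic_caseless_match` is a hypothetical and simplified tailoring that covers only the dotted and dotless i, and its docstring carries the disclosure that C20 asks for.

```python
def default_caseless_match(x: str, y: str) -> bool:
    """Compare two strings by their default case foldings."""
    return x.casefold() == y.casefold()


# Tailored mappings applied before folding: I -> dotless i, I-with-dot -> i.
_TURKIC_PREFOLD = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkic_caseless_match(x: str, y: str) -> bool:
    """TAILORED matching (disclosed): Turkic handling of dotted/dotless i."""
    fold = lambda s: s.translate(_TURKIC_PREFOLD).casefold()
    return fold(x) == fold(y)
```

Under the default behavior "I" matches "i"; under the tailoring "I" matches the dotless "ı" instead, which is why the fact of tailoring has to be disclosed.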

Unicode Standard Annexes The following standard annexes are approved and considered part of Version 6.2 of the Unicode Standard. These annexes may contain either normative or informative material, or both. Any reference to Version 6.2 of the standard automatically includes these standard annexes.
• UAX #9: Unicode Bidirectional Algorithm, Version 6.2.0
• UAX #11: East Asian Width, Version 6.2.0
• UAX #14: Unicode Line Breaking Algorithm, Version 6.2.0
• UAX #15: Unicode Normalization Forms, Version 6.2.0
• UAX #24: Unicode Script Property, Version 6.2.0
• UAX #29: Unicode Text Segmentation, Version 6.2.0
• UAX #31: Unicode Identifier and Pattern Syntax, Version 6.2.0
• UAX #34: Unicode Named Character Sequences, Version 6.2.0
• UAX #38: Unicode Han Database (Unihan), Version 6.2.0
• UAX #41: Common References for Unicode Standard Annexes, Version 6.2.0
• UAX #42: Unicode Character Database in XML, Version 6.2.0
• UAX #44: Unicode Character Database, Version 6.2.0
• UAX #45: U-Source Ideographs
Conformance to the Unicode Standard requires conformance to the specifications contained in these annexes, as detailed in the conformance clauses listed earlier in this section.


3.3 Semantics

Definitions This and the following sections more precisely define the terms that are used in the conformance clauses. A small number of definitions have been updated from their wording in Version 5.0 of the Unicode Standard. A detailed listing of these changes, as well as a listing of any new definitions added since Version 5.0, is available in Section D.2, Clause and Definition Updates.

Character Identity and Semantics D1 Normative behavior: The normative behaviors of the Unicode Standard consist of the following list or any other behaviors specified in the conformance clauses: • Character combination • Canonical decomposition • Compatibility decomposition • Canonical ordering behavior • Bidirectional behavior, as specified in the Unicode Bidirectional Algorithm (see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”) • Conjoining jamo behavior, as specified in Section 3.12, Conjoining Jamo Behavior • Variation selection, as specified in Section 16.4, Variation Selectors • Normalization, as specified in Section 3.11, Normalization Forms • Default casing, as specified in Section 3.13, Default Case Algorithms D2 Character identity: The identity of a character is established by its character name and representative glyph in the code charts. • A character may have a broader range of use than the most literal interpretation of its name might indicate; the coded representation, name, and representative glyph need to be assessed in context when establishing the identity of a character. For example, U+002E full stop can represent a sentence period, an abbreviation period, a decimal number separator in English, a thousands number separator in German, and so on. The character name itself is unique, but may be misleading. See “Character Names” in Section 17.1, Character Names List. • Consistency with the representative glyph does not require that the images be identical or even graphically similar; rather, it means that both images are generally recognized to be representations of the same character. Representing the character U+0061 latin small letter a by the glyph “X” would violate its character identity. D3 Character semantics: The semantics of a character are determined by its identity, normative properties, and behavior. • Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. 
However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics.


• The character combination properties and the canonical ordering behavior cannot be overridden by higher-level protocols. The purpose of this constraint is to guarantee that the order of combining marks in text and the results of normalization are predictable.

D4 Character name: A unique string used to identify each abstract character encoded in the standard. • The character names in the Unicode Standard match those of the English edition of ISO/IEC 10646. • Character names are immutable and cannot be overridden; they are stable identifiers. For more information, see Section 4.8, Name. • The name of a Unicode character is also formally a character property in the Unicode Character Database. Its long property alias is “Name” and its short property alias is “na”. Its value is the unique string label associated with the encoded character. • The detailed specification of the Unicode character names, including rules for derivation of some ranges of characters, is given in Section 4.8, Name. That section also describes the relationship between the normative value of the Name property and the contents of the corresponding data field in UnicodeData.txt in the Unicode Character Database. D5 Character name alias: An additional unique string identifier, other than the character name, associated with an encoded character in the standard. • Character name aliases are assigned when there is a serious clerical defect with a character name, such that the character name itself may be misleading regarding the identity of the character. A character name alias constitutes an alternate identifier for the character. • Character name aliases are also assigned to provide string identifiers for control codes and to recognize widely used alternative names and abbreviations for control codes, format characters and other special-use characters. • Character name aliases are unique within the common namespace shared by character names, character name aliases, and named character sequences. • More than one character name alias may be assigned to a given Unicode character. 
For example, the control code U+000D is given a character name alias for its ISO 6429 control function as carriage return, but is also given a character name alias for its widely used abbreviation “CR”. • Character name aliases are a formal, normative part of the standard and should be distinguished from the informative, editorial aliases provided in the code charts. See Section 17.1, Character Names List, for the notational conventions used to distinguish the two. D6 Namespace: A set of names together with name matching rules, so that all names are distinct under the matching rules. • Within a given namespace all names must be unique, although the same name may be used with a different meaning in a different namespace. • Character names, character name aliases, and named character sequences share a single namespace in the Unicode Standard.
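The relationship between character names (D4) and character name aliases (D5) can be explored with Python's `unicodedata` module, assuming an interpreter recent enough that `unicodedata.lookup` resolves name aliases (documented as supported since Python 3.3). The wrapper names below are hypothetical.

```python
import unicodedata
from typing import Optional


def character_name(ch: str) -> Optional[str]:
    """Return the Name property value, or None when the character has no name."""
    return unicodedata.name(ch, None)


def resolve_name_or_alias(identifier: str) -> str:
    """Resolve a character name or a character name alias to its character."""
    return unicodedata.lookup(identifier)
```

Control codes such as U+000D have no character name, so `character_name("\r")` returns `None`, yet the alias "CARRIAGE RETURN" still resolves to that code point. U+01A2 shows the correction case: its immutable name is LATIN CAPITAL LETTER OI, and the alias LATIN CAPITAL LETTER GHA identifies the same character.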


3.4 Characters and Encoding

D7 Abstract character: A unit of information used for the organization, control, or representation of textual data. • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats. • An abstract character has no concrete form and should not be confused with a glyph. • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme. • The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters. • Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences. D8 Abstract character sequence: An ordered sequence of one or more abstract characters. D9 Unicode codespace: A range of integers from 0 to 10FFFF₁₆. • This particular range is defined for the codespace in the Unicode Standard. Other character encoding standards may use other codespaces. D10 Code point: Any value in the Unicode codespace. • A code point is also known as a code position. • See D77 for the definition of code unit. D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved. • See Table 2-3 for a summary of the meaning and use of each class. • For Noncharacter, see also D14 Noncharacter. • For Reserved, see also D15 Reserved code point. • For Private-Use, see also D49 Private-use code point. • For Surrogate, see also D71 High-surrogate code point and D73 Low-surrogate code point. D11 Encoded character: An association (or mapping) between an abstract character and a code point. • An encoded character is also referred to as a coded character. 
• While an encoded character is formally defined in terms of the mapping between an abstract character and a code point, informally it can be thought of as an abstract character taken together with its assigned code point. • Occasionally, for compatibility with other standards, a single abstract character may correspond to more than one code point—for example, “Å” corresponds both to U+00C5 Å latin capital letter a with ring above and to U+212B Å angstrom sign.


• A single abstract character may also be represented by a sequence of code points; for example, latin capital letter g with acute may be represented by the sequence <U+0047 latin capital letter g, U+0301 combining acute accent>, rather than being mapped to a single code point.

D12 Coded character sequence: An ordered sequence of one or more code points. • A coded character sequence is also known as a coded character representation. • Normally a coded character sequence consists of a sequence of encoded characters, but it may also include noncharacters or reserved code points. • Internally, a process may choose to make use of noncharacter code points in its coded character sequences. However, such noncharacter code points may not be interpreted as abstract characters (see conformance clause C2). Their removal by a conformant process constitutes modification of interpretation of the coded character sequence (see conformance clause C7). • Reserved code points are included in coded character sequences, so that the conformance requirements regarding interpretation and modification are properly defined when a Unicode-conformant implementation encounters coded character sequences produced under a future version of the standard. Unless specified otherwise for clarity, in the text of the Unicode Standard the term character alone designates an encoded character. Similarly, the term character sequence alone designates a coded character sequence. D13 Deprecated character: A coded character whose use is strongly discouraged. • Deprecated characters are retained in the standard indefinitely, but should not be used. They are retained in the standard so that previously conforming data stay conformant in future versions of the standard. • Deprecated characters typically consist of characters with significant architectural problems, or ones which cause implementation problems. Some examples of characters deprecated on these grounds include tag characters (see Section 16.9, Deprecated Tag Characters) and the alternate format characters (see Section 16.3, Deprecated Format Characters). • Deprecated characters are explicitly indicated in the Unicode Code Charts. 
They are also given an explicit property value of Deprecated=True in the Unicode Character Database. • Deprecated characters should not be confused with obsolete characters, which are historical. Obsolete characters do not occur in modern text, but they are not deprecated; their use is not discouraged.

D14 Noncharacter: A code point that is permanently reserved for internal use and that should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆, that is, 0 to 16 decimal) and the values U+FDD0..U+FDEF.

• For more information, see Section 16.7, Noncharacters.
• These code points are permanently reserved as noncharacters.

D15 Reserved code point: Any code point of the Unicode Standard that is reserved for future assignment. Also known as an unassigned code point.

• Surrogate code points and noncharacters are considered assigned code points, but not assigned characters.
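The noncharacter ranges given in D14 translate directly into a membership test. This is a minimal sketch; the function names are illustrative and not part of the standard.

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp is one of the noncharacter code points described in D14."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The contiguous range U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF: the last two code points of each of the 17 planes.
    return (cp & 0xFFFF) in (0xFFFE, 0xFFFF)

def noncharacter_count() -> int:
    """Count noncharacters by brute force over the whole codespace."""
    return sum(is_noncharacter(cp) for cp in range(0x110000))
```

The count comes to 32 code points in U+FDD0..U+FDEF plus 2 per plane across 17 planes, or 66 in total.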

The Unicode Standard, Version 6.2

3.4 Characters and Encoding

• For a summary classification of reserved and other types of code points, see Table 2-3.

In general, a conforming process may indicate the presence of a code point whose use has not been designated (for example, by showing a missing glyph in rendering or by signaling an appropriate error in a streaming protocol), even though it is forbidden by the standard from interpreting that code point as an abstract character.

D16 Higher-level protocol: Any agreement on the interpretation of Unicode characters that extends beyond the scope of this standard.

• Such an agreement need not be formally announced in data; it may be implicit in the context.
• The specification of some Unicode algorithms may limit the scope of what a conformant higher-level protocol may do.

D17 Unicode algorithm: The logical description of a process used to achieve a specified result involving Unicode characters.

• This definition, as used in the Unicode Standard and other publications of the Unicode Consortium, is intentionally broad so as to allow precise logical description of required results, without constraining implementations to follow the precise steps of that logical description.

D18 Named Unicode algorithm: A Unicode algorithm that is specified in the Unicode Standard or in other standards published by the Unicode Consortium and that is given an explicit name for ease of reference.

• Named Unicode algorithms are cited in titlecase in the Unicode Standard. Table 3-1 lists the named Unicode algorithms and indicates the locations of their specifications. Details regarding conformance to these algorithms and any restrictions they place on the scope of allowable tailoring by higher-level protocols can be found in the specifications. In some cases, a named Unicode algorithm is provided for information only. When externally referenced, a named Unicode algorithm may be prefixed with the qualifier “Unicode” to make the connection of the algorithm to the Unicode Standard and other Unicode specifications clear. Thus, for example, the Bidirectional Algorithm is generally referred to by its full name, “Unicode Bidirectional Algorithm.” As much as is practical, the titles of Unicode Standard Annexes which define Unicode algorithms consist of the name of the Unicode algorithm they specify. In a few cases, named Unicode algorithms are also widely known by their acronyms, and those acronyms are also listed in Table 3-1.

Table 3-1. Named Unicode Algorithms

Name                                              Description
Canonical Ordering                                Section 3.11
Canonical Composition                             Section 3.11
Normalization                                     Section 3.11
Hangul Syllable Composition                       Section 3.12
Hangul Syllable Decomposition                     Section 3.12
Hangul Syllable Name Generation                   Section 3.12
Default Case Conversion                           Section 3.13
Default Case Detection                            Section 3.13
Default Caseless Matching                         Section 3.13
Bidirectional Algorithm (UBA)                     UAX #9
Line Breaking Algorithm                           UAX #14
Character Segmentation                            UAX #29
Word Segmentation                                 UAX #29
Sentence Segmentation                             UAX #29
Hangul Syllable Boundary Determination            UAX #29
Default Identifier Determination                  UAX #31
Alternative Identifier Determination              UAX #31
Pattern Syntax Determination                      UAX #31
Identifier Normalization                          UAX #31
Identifier Case Folding                           UAX #31
Standard Compression Scheme for Unicode (SCSU)    UTS #6
Unicode Collation Algorithm (UCA)                 UTS #10
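Two of the algorithms named in Table 3-1, Normalization and Default Caseless Matching, are exposed by many runtime libraries. The sketch below assumes that Python's str.casefold stands in for the standard's case folding operation and that unicodedata.normalize implements the normalization forms; the function names are illustrative.

```python
import unicodedata

def caseless_match(a: str, b: str) -> bool:
    """Compare two strings after case folding only."""
    return a.casefold() == b.casefold()

def canonical_caseless_match(a: str, b: str) -> bool:
    """Compare after case folding, normalizing to NFD before and after."""
    def nfd(s: str) -> str:
        return unicodedata.normalize("NFD", s)
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

composed = "\u00C5ngstr\u00F6m"      # precomposed A WITH RING, O WITH DIAERESIS
decomposed = "a\u030Angstro\u0308m"  # lowercase, combining marks spelled out
```

Plain case folding treats composed and decomposed as different strings; the normalizing variant treats them as equal, which is why the two algorithms are often used together.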

3.5 Properties The Unicode Standard specifies many different types of character properties. This section provides the basic definitions related to character properties. The actual values of Unicode character properties are specified in the Unicode Character Database. See Section 4.1, Unicode Character Database, for an overview of those data files. Chapter 4, Character Properties, contains more detailed descriptions of some particular, important character properties. Additional properties that are specific to particular characters (such as the definition and use of the right-to-left override character or zero width space) are discussed in the relevant sections of this standard. The interpretation of some properties (such as the case of a character) is independent of context, whereas the interpretation of other properties (such as directionality) is applicable to a character sequence as a whole, rather than to the individual characters that compose the sequence.

Types of Properties D19 Property: A named attribute of an entity in the Unicode Standard, associated with a defined set of values. • The lists of code point and encoded character properties for the Unicode Standard are documented in Unicode Standard Annex #44, “Unicode Character Database,” and in Unicode Standard Annex #38, “Unicode Han Database (Unihan).” • The file PropertyAliases.txt in the Unicode Character Database provides a machine-readable list of the non-Unihan properties and their names. D20 Code point property: A property of code points. • Code point properties refer to attributes of code points per se, based on architectural considerations of this standard, irrespective of any particular encoded character. • Thus the Surrogate property and the Noncharacter property are code point properties. D21 Abstract character property: A property of abstract characters.


• Abstract character properties refer to attributes of abstract characters per se, based on their independent existence as elements of writing systems or other notational systems, irrespective of their encoding in the Unicode Standard. • Thus the Alphabetic property, the Punctuation property, the Hex_Digit property, the Numeric_Value property, and so on are properties of abstract characters and are associated with those characters whether encoded in the Unicode Standard or in any other character encoding—or even prior to their being encoded in any character encoding standard. D22 Encoded character property: A property of encoded characters in the Unicode Standard. • For each encoded character property there is a mapping from every code point to some value in the set of values associated with that property. Encoded character properties are defined this way to facilitate the implementation of character property APIs based on the Unicode Character Database. Typically, an API will take a property and a code point as input, and will return a value for that property as output, interpreting it as the “character property” for the “character” encoded at that code point. However, to be useful, such APIs must return meaningful values for unassigned code points, as well as for encoded characters. In some instances an encoded character property in the Unicode Standard is exactly equivalent to a code point property. For example, the Pattern_Syntax property simply defines a range of code points that are reserved for pattern syntax. (See Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax.”) In other instances, an encoded character property directly reflects an abstract character property, but extends the domain of the property to include all code points, including unassigned code points. 
For Boolean properties, such as the Hex_Digit property, typically an encoded character property will be true for the encoded characters with that abstract character property and will be false for all other code points, including unassigned code points, noncharacters, private-use characters, and encoded characters for which the abstract character property is inapplicable or irrelevant. However, in many instances, an encoded character property is semantically complex and may telescope together values associated with a number of abstract character properties and/or code point properties. The General_Category property is an example—it contains values associated with several abstract character properties (such as Letter, Punctuation, and Symbol) as well as code point properties (such as \p{gc=Cs} for the Surrogate code point property). In the text of this standard the terms “Unicode character property,” “character property,” and “property” without qualifier generally refer to an encoded character property, unless otherwise indicated. A list of the encoded character properties formally considered to be a part of the Unicode Standard can be found in PropertyAliases.txt in the Unicode Character Database. See also “Property Aliases” later in this section.
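The property API shape described under D22 (a property and a code point in, a value out, for every code point) can be illustrated as follows. This is a sketch assuming Python's unicodedata module; the wrapper names are hypothetical.

```python
import unicodedata

def general_category(cp: int) -> str:
    """Return the General_Category value for any code point, assigned or not."""
    return unicodedata.category(chr(cp))

def character_name(cp: int) -> str:
    """Return the Name value, using an empty string where no name is defined."""
    return unicodedata.name(chr(cp), "")

for cp in (0x0041, 0xD800, 0xE000, 0xFFFF):
    print(f"U+{cp:04X}", general_category(cp), repr(character_name(cp)))
```

The same call answers for a letter (Lu), a surrogate code point (Cs), a private-use code point (Co), and a noncharacter (Cn), showing how General_Category combines abstract character properties with code point properties.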

Property Values D23 Property value: One of the set of values associated with an encoded character property. • For example, the East_Asian_Width [EAW] property has the possible values “Narrow”, “Neutral”, “Wide”, “Ambiguous”, and “Unassigned”.


A list of the values associated with encoded character properties in the Unicode Standard can be found in PropertyValueAliases.txt in the Unicode Character Database. See also “Property Aliases” later in this section. D24 Explicit property value: A value for an encoded character property that is explicitly associated with a code point in one of the data files of the Unicode Character Database. D25 Implicit property value: A value for an encoded character property that is given by a generic rule or by an “otherwise” clause in one of the data files of the Unicode Character Database. • Implicit property values are used to avoid having to explicitly list values for more than 1 million code points (most of them unassigned) for every property.

Default Property Values To work properly in implementations, unassigned code points must be given default property values as if they were characters, because various algorithms require property values to be assigned to every code point before they can function at all. Default property values are not uniform across all unassigned code points, because certain ranges of code points need different values for particular properties to maximize compatibility with expected future assignments. This means that some encoded character properties have multiple default values. For example, the Bidi_Class property defines a range of unassigned code points as having the “R” value, another range of unassigned code points as having the “AL” value, and the otherwise case as having the “L” value. For information on the default values for each encoded character property, see its description in the Unicode Character Database. Default property values for unassigned code points are normative. They should not be changed by implementations to other values. Default property values are also provided for private-use characters. Because the interpretation of private-use characters is subject to private agreement between the parties which exchange them, most default property values for those characters are overridable by higher-level protocols, to match the agreed-upon semantics for the characters. There are important exceptions for a few properties and Unicode algorithms. See Section 16.5, Private-Use Characters. D26 Default property value: The value (or in some cases small set of values) of a property associated with unassigned code points or with encoded characters for which the property is irrelevant. • For example, for most Boolean properties, “false” is the default property value. In such cases, the default property value used for unassigned code points may be the same value that is used for many assigned characters as well. 
• Some properties, particularly enumerated properties, specify a particular, unique value as their default value. For example, “XX” is the default property value for the Line_Break property.
• A default property value is typically defined implicitly, to avoid having to repeat long lists of unassigned code points.
• In the case of some properties with arbitrary string values, the default property value is an implied null value. For example, the fact that there is no Unicode character name for unassigned code points is equivalent to saying that the default property value for the Name property for an unassigned code point is a null string.
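The Bidi_Class example above, with ranges of unassigned code points defaulting to “R” or “AL” and everything else to “L”, can be sketched as a range table with an "otherwise" clause. The two ranges below are an assumed, incomplete excerpt chosen for illustration; the Unicode Character Database is the authority for the actual default ranges, and the function names are hypothetical.

```python
import unicodedata

# Illustrative excerpt only (assumption): real default ranges come from the UCD.
DEFAULT_BIDI_RANGES = [
    (0x0590, 0x05FF, "R"),   # unassigned code points in the Hebrew block
    (0x0600, 0x07BF, "AL"),  # unassigned code points in Arabic-related blocks
]

def default_bidi_class(cp: int) -> str:
    """Implicit value: the first matching range, otherwise 'L'."""
    for first, last, value in DEFAULT_BIDI_RANGES:
        if first <= cp <= last:
            return value
    return "L"

def bidi_class(cp: int) -> str:
    """Explicit value if the library has one, else the implicit default."""
    # unicodedata.bidirectional returns '' when no value is defined.
    explicit = unicodedata.bidirectional(chr(cp))
    return explicit if explicit else default_bidi_class(cp)
```

Storing only the exceptional ranges plus one fallback mirrors how implicit property values avoid listing every unassigned code point.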

Classification of Properties by Their Values

D27 Enumerated property: A property with a small set of named values.

• As characters are added to the Unicode Standard, the set of values may need to be extended in the future, but enumerated properties have a relatively fixed set of possible values.

D28 Closed enumeration: An enumerated property for which the set of values is closed and will not be extended for future versions of the Unicode Standard.

• The General_Category and Bidi_Class properties are the only closed enumerations, except for the Boolean properties.

D29 Boolean property: A closed enumerated property whose set of values is limited to “true” and “false”.

• The presence or absence of the property is the essential information.

D30 Numeric property: A property whose value is a number that can take on any integer or real value.

• An example is the Numeric_Value property. There is no implied limit to the number of possible distinct values for the property, except the limitations on representing integers or real numbers in computers. D31 String-valued property: A property whose value is a string. • The Canonical_Decomposition property is a string-valued property. D32 Catalog property: A property that is an enumerated property, typically unrelated to an algorithm, that may be extended in each successive version of the Unicode Standard. • Examples are the Age, Block, and Script properties. Additional new values for the set of enumerated values for these properties may be added each time the standard is revised. A new value for Age is added for each new Unicode version, a new value for Block is added for each new block added to the standard, and a new value for Script is added for each new script added to the standard. Most properties have a single value associated with each code point. However, some properties may instead associate a set of multiple different values with each code point. See Section 5.7.6, Properties Whose Values Are Sets of Values, in Unicode Standard Annex #44, “Unicode Character Database.”
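The value classes defined in D27 through D31 can be seen side by side by querying one property of each kind. A minimal sketch, assuming Python's unicodedata module; the dictionary keys are descriptive labels, not formal property names.

```python
import unicodedata

samples = {
    # Enumerated: one of a small set of named values.
    "enumerated": unicodedata.category("A"),
    # Boolean: unicodedata.mirrored returns 1 or 0 for Bidi_Mirrored.
    "boolean": bool(unicodedata.mirrored("(")),
    # Numeric: VULGAR FRACTION ONE HALF has the numeric value 0.5.
    "numeric": unicodedata.numeric("\u00BD"),
    # String-valued: the decomposition of LATIN CAPITAL LETTER G WITH ACUTE,
    # reported by this module as space-separated hexadecimal code points.
    "string": unicodedata.decomposition("\u01F4"),
}

for kind, value in samples.items():
    print(kind, repr(value))
```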

Property Status

Each Unicode character property has one of several different statuses: normative, informative, contributory, or provisional. Each of these statuses is formally defined below, with some explanation and examples. In addition, normative properties can be subclassified, based on whether or not they can be overridden by conformant higher-level protocols.

The full list of currently defined Unicode character properties is provided in Unicode Standard Annex #44, “Unicode Character Database” and in Unicode Standard Annex #38, “Unicode Han Database (Unihan).” The tables of properties in those documents specify the status of each property explicitly. The data file PropertyAliases.txt provides a machine-readable listing of the character properties, except for those associated with the Unicode Han Database. The long alias for each property in PropertyAliases.txt also serves as the formal name of that property. In case of any discrepancy between the listing in PropertyAliases.txt and the listing in Unicode Standard Annex #44 or any other text of the Unicode Standard, the listing in PropertyAliases.txt should be taken as definitive. The tag for each Unihan-related character property documented in Unicode Standard Annex #38 serves as the formal name of that property.

D33 Normative property: A Unicode character property used in the specification of the standard.

Specification that a character property is normative means that implementations which claim conformance to a particular version of the Unicode Standard and which make use of that particular property must follow the specifications of the standard for that property for the implementation to be conformant. For example, the Bidi_Class property is required for conformance whenever rendering text that requires bidirectional layout, such as Arabic or Hebrew.

Whenever a normative process depends on a property in a specified way, that property is designated as normative.

The fact that a given Unicode character property is normative does not mean that the values of the property will never change for particular characters. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes. See also “Stability of Properties” later in this section.

Some of the normative Unicode algorithms depend critically on particular property values for their behavior. Normalization, for example, defines an aspect of textual interoperability that many applications rely on to be absolutely stable. As a result, some of the normative properties disallow any kind of overriding by higher-level protocols.
Thus the decomposition of Unicode characters is both normative and not overridable; no higher-level protocol may override these values, because to do so would result in non-interoperable results for the normalization of Unicode text. Other normative properties, such as case mapping, are overridable by higher-level protocols, because their intent is to provide a common basis for behavior. Nevertheless, they may require tailoring for particular local cultural conventions or particular implementations. D34 Overridable property: A normative property whose values may be overridden by conformant higher-level protocols. • For example, the Canonical_Decomposition property is not overridable. The Uppercase property can be overridden. Some important normative character properties of the Unicode Standard are listed in Table 3-2, with an indication of which sections in the standard provide a general description of the properties and their use. Other normative properties are documented in the Unicode Character Database. In all cases, the Unicode Character Database provides the definitive list of character properties and the exact list of property value assignments for each version of the standard.

Table 3-2. Normative Character Properties

Property                          Description
Bidi_Class (directionality)       UAX #9 and Section 4.4
Bidi_Mirrored                     UAX #9 and Section 4.7
Block                             Section 17.1
Canonical_Combining_Class         Section 3.11 and Section 4.3
Case-related properties           Section 3.13, Section 4.2, and UAX #44
Composition_Exclusion             Section 3.11
Decomposition_Mapping             Section 3.7 and Section 3.11
Default_Ignorable_Code_Point      Section 5.21
Deprecated                        Section 3.1
General_Category                  Section 4.5
Hangul_Syllable_Type              Section 3.12 and UAX #29
Joining_Type and Joining_Group    Section 8.2
Name                              Section 4.8
Noncharacter_Code_Point           Section 16.7
Numeric_Value                     Section 4.6
White_Space                       UAX #44
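The distinction drawn in D34 can be made concrete with case mapping, which a higher-level protocol may tailor. The sketch below is an assumption-laden illustration: turkish_upper is a hypothetical tailoring for the Turkish convention that lowercase i uppercases to I WITH DOT ABOVE, and Python's str.upper is taken as the untailored default.

```python
def default_upper(text: str) -> str:
    """Untailored uppercase mapping as provided by the runtime."""
    return text.upper()

def turkish_upper(text: str) -> str:
    """Hypothetical tailoring: map 'i' to U+0130 before default uppercasing."""
    # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is already uppercase,
    # so the subsequent default mapping leaves it unchanged.
    return text.replace("i", "\u0130").upper()

print(default_upper("istanbul"), turkish_upper("istanbul"))
```

No comparable tailoring is permitted for canonical decomposition: a protocol that altered it would produce normalized text that other implementations could not match.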

D35 Informative property: A Unicode character property whose values are provided for information only. A conformant implementation of the Unicode Standard is free to use or change informative property values as it may require, while remaining conformant to the standard. An implementer always has the option of establishing a protocol to convey the fact that informative properties are being used in distinct ways. Informative properties capture expert implementation experience. When an informative property is explicitly specified in the Unicode Character Database, its use is strongly recommended for implementations to encourage comparable behavior between implementations. Note that it is possible for an informative property in one version of the Unicode Standard to become a normative property in a subsequent version of the standard if its use starts to acquire conformance implications in some part of the standard. Table 3-3 provides a partial list of the more important informative character properties. For a complete listing, see the Unicode Character Database.

Table 3-3. Informative Character Properties

Property                     Description
Dash                         Section 6.2 and Table 6-3
East_Asian_Width             Section 12.4 and UAX #11
Letter-related properties    Section 4.10
Line_Break                   Section 16.1, Section 16.2, and UAX #14
Mathematical                 Section 15.5
Script                       UAX #24
Space                        Section 6.2 and Table 6-2
Unicode_1_Name               Section 4.9

D35a Contributory property: A simple property defined merely to make the statement of a rule defining a derived property more compact or general.

Contributory properties typically consist of short lists of exceptional characters which are used as part of the definition of a more generic normative or informative property. In most cases, such properties are given names starting with “Other”, as Other_Alphabetic or Other_Default_Ignorable_Code_Point. Contributory properties are not themselves subject to stability guarantees, but they are sometimes specified in order to make it easier to state the definition of a derived property which itself is subject to a stability guarantee, such as the derived, normative identifier-related properties, XID_Start and XID_Continue. The complete list of contributory properties is documented in Unicode Standard Annex #44, “Unicode Character Database.”

D36 Provisional property: A Unicode character property whose values are unapproved and tentative, and which may be incomplete or otherwise not in a usable state.

• Provisional properties may be removed from future versions of the standard, without prior notice. Some of the information provided about characters in the Unicode Character Database constitutes provisional data. This data may capture partial or preliminary information. It may contain errors or omissions, or otherwise not be ready for systematic use; however, it is included in the data files for distribution partly to encourage review and improvement of the information. For example, a number of the tags in the Unihan database file (Unihan.zip) provide provisional property values of various sorts about Han characters. The data files of the Unicode Character Database may also contain various annotations and comments about characters, and those annotations and comments should be considered provisional. Implementations should not attempt to parse annotations and comments out of the data files and treat them as informative character properties per se. Section 4.12, Characters with Unusual Properties, provides additional lists of Unicode characters with unusual behavior, including many format controls discussed in detail elsewhere in the standard. Although in many instances those characters and their behavior have normative implications, the particular subclassification provided in Table 4-13 does not directly correspond to any formal definition of Unicode character properties. Therefore that subclassification itself should also be considered provisional and potentially subject to change.

Context Dependence D37 Context-dependent property: A property that applies to a code point in the context of a longer code point sequence. • For example, the lowercase mapping of a Greek sigma depends on the context of the surrounding characters. D38 Context-independent property: A property that is not context dependent; it applies to a code point in isolation.
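The Greek sigma case mentioned under D37 can be observed in a runtime that implements the contextual rule. This sketch assumes CPython's str.lower, which applies the final-sigma condition; other runtimes may behave differently.

```python
# GREEK CAPITAL LETTER SIGMA (U+03A3) lowercases differently by position.
word = "\u039F\u0394\u039F\u03A3"   # capital omicron, delta, omicron, sigma
isolated = "\u03A3"                 # the same sigma with no context

lowered_word = word.lower()         # sigma at the end of a word
lowered_isolated = isolated.lower() # sigma on its own

print(lowered_word, lowered_isolated)
```

At the end of the word the result is the final form U+03C2, while the isolated letter maps to U+03C3, so the mapping cannot be expressed as a function of the single code point alone.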

Stability of Properties D39 Stable transformation: A transformation T on a property P is stable with respect to an algorithm A if the result of the algorithm on the transformed property A(T(P)) is the same as the original result A(P) for all code points. D40 Stable property: A property is stable with respect to a particular algorithm or process as long as possible changes in the assignment of property values are restricted in such a manner that the result of the algorithm on the property continues to be the same as the original result for all previously assigned code points. • As new characters are assigned to previously unassigned code points, the replacement of any default values for these code points with actual property values must maintain stability. D41 Fixed property: A property whose values (other than a default value), once associated with a specific code point, are fixed and will not be changed, except to correct obvious or clerical errors.


• For a fixed property, any default values can be replaced without restriction by actual property values as new characters are assigned to previously unassigned code points. Examples of fixed properties include Age and Hangul_Syllable_Type. • Designating a property as fixed does not imply stability or immutability (see “Stability” in Section 3.1, Versions of the Unicode Standard). While the age of a character, for example, is established by the version of the Unicode Standard to which it was added, errors in the published listing of the property value could be corrected. For some other properties, even the correction of such errors is prohibited by explicit guarantees of property stability. D42 Immutable property: A fixed property that is also subject to a stability guarantee preventing any change in the published listing of property values other than assignment of new values to formerly unassigned code points. • An immutable property is trivially stable with respect to all algorithms. • An example of an immutable property is the Unicode character name itself. Because character names are values of an immutable property, misspellings and incorrect names will never be corrected clerically. Any errata will be noted in a comment in the character names list and, where needed, an informative character name alias will be provided. • When an encoded character property representing a code point property is immutable, none of its values can ever change. This follows from the fact that the code points themselves do not change, and the status of the property is unaffected by whether a particular abstract character is encoded at a code point later. An example of such a property is the Pattern_Syntax property; all values of that property are unchangeable for all code points, forever. 
• In the more typical case of an immutable property, the values for existing encoded characters cannot change, but when a new character is encoded, the formerly unassigned code point changes from having a default value for the property to having one of its nondefault values. Once that nondefault value is published, it can no longer be changed. D43 Stabilized property: A property that is neither extended to new characters nor maintained in any other manner, but that is retained in the Unicode Character Database. • A stabilized property is also a fixed property. D44 Deprecated property: A property whose use by implementations is discouraged. • One of the reasons a property may be deprecated is because a different combination of properties better expresses the intended semantics. • Where sufficiently widespread legacy support exists for the deprecated property, not all implementations may be able to discontinue the use of the deprecated property. In such a case, a deprecated property may be extended to new characters so as to maintain it in a usable and consistent state. Informative or normative properties in the standard will not be removed even when they are supplanted by other properties or are no longer useful. However, they may be stabilized and/or deprecated. The complete list of stability policies which affect character properties, their values, and their aliases, is available online. See the subsection “Policies” in Section B.6, Other Unicode Online Resources.
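The consequence of D42 for character names, that a misspelled name stays in place and a name alias carries the correction, can be checked against a well-known case. This is a sketch assuming Python 3.3 or later, where unicodedata.lookup also accepts name aliases.

```python
import unicodedata

ch = "\uFE18"

# The formal name was published with a transposition ("BRAKCET") and,
# being immutable, has never been changed.
formal_name = unicodedata.name(ch)

# The corrected spelling exists only as an alias for the same code point.
corrected_alias = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
resolved = unicodedata.lookup(corrected_alias)

print(formal_name)
```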


Simple and Derived Properties

D45 Simple property: A Unicode character property whose values are specified directly in the Unicode Character Database (or elsewhere in the standard) and whose values cannot be derived from other simple properties.

D46 Derived property: A Unicode character property whose values are algorithmically derived from some combination of simple properties.

The Unicode Character Database lists a number of derived properties explicitly. Even though these values can be derived, they are provided as lists because the derivation may not be trivial and because explicit lists are easier to understand, reference, and implement. Good examples of derived properties include the ID_Start and ID_Continue properties, which can be used to specify a formal identifier syntax for Unicode characters. The details of how derived properties are computed can be found in the documentation for the Unicode Character Database.
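To show what "derived from simple properties" means in practice, the sketch below approximates ID_Start from General_Category alone. It is deliberately incomplete: the full derivation in the Unicode Character Database also draws on contributory properties and exclusions, which are omitted here, and the function name is hypothetical.

```python
import unicodedata

# Letter categories plus letter numbers (assumed core of the derivation).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def approx_id_start(ch: str) -> bool:
    """Approximate ID_Start using only the General_Category value."""
    return unicodedata.category(ch) in ID_START_CATEGORIES

for ch in ("A", "\u2167", "1", "_"):
    print(f"U+{ord(ch):04X}", approx_id_start(ch))
```

Because the real derivation has more inputs than this, implementations generally read the explicit derived lists rather than recompute them.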

Property Aliases

To enable normative references to Unicode character properties, formal aliases for properties and for property values are defined as part of the Unicode Character Database.

D47 Property alias: A unique identifier for a particular Unicode character property.

• The identifiers used for property aliases contain only ASCII alphanumeric characters or the underscore character.
• Short and long forms for each property alias are defined. The short forms are typically just two or three characters long to facilitate their use as attributes for tags in markup languages. For example, “General_Category” is the long form and “gc” is the short form of the property alias for the General Category property. The long form serves as the formal name for the character property.
• Property aliases are defined in the file PropertyAliases.txt in the Unicode Character Database, which lists all of the non-Unihan properties that are part of each version of the standard. The Unihan properties are listed in Unicode Standard Annex #38, “Unicode Han Database (Unihan).”
• Property aliases of normative properties are themselves normative.

D48 Property value alias: A unique identifier for a particular enumerated value for a particular Unicode character property.

• The identifiers used for property value aliases contain only ASCII alphanumeric characters or the underscore character, or have the special value “n/a”.
• Short and long forms for property value aliases are defined. For example, “Currency_Symbol” is the long form and “Sc” is the short form of the property value alias for the currency symbol value of the General Category property.
• Property value aliases are defined in the file PropertyValueAliases.txt in the Unicode Character Database.
• Property value aliases are unique identifiers only in the context of the particular property with which they are associated. The same identifier string might be associated with an entirely different value for a different property. The combination of a property alias and a property value alias is, however, guaranteed to be unique.
• Property value aliases referring to values of normative properties are themselves normative.


The property aliases and property value aliases can be used, for example, in XML formats of property data, for regular-expression property tests, and in other programmatic textual descriptions of Unicode property data. Thus “gc=Lu” is a formal way of specifying that the General Category of a character (using the property alias “gc”) has the value of being an uppercase letter (using the property value alias “Lu”).
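An expression such as “gc=Lu” can be evaluated by resolving the aliases and then querying the property. The sketch below hard-codes only the aliases mentioned in this section; a fuller implementation would load PropertyAliases.txt and PropertyValueAliases.txt. The table and function names are hypothetical.

```python
import unicodedata

# Alias excerpts taken from the examples in the text (short form -> long form).
PROPERTY_ALIASES = {"gc": "General_Category"}
GC_VALUE_ALIASES = {"Lu": "Uppercase_Letter", "Sc": "Currency_Symbol"}
GC_LONG_TO_SHORT = {long: short for short, long in GC_VALUE_ALIASES.items()}

def matches(expression: str, ch: str) -> bool:
    """Test a character against a 'property=value' expression."""
    prop, sep, value = expression.partition("=")
    if not sep:
        raise ValueError("expected property=value")
    # Accept either the short or the long property alias.
    if PROPERTY_ALIASES.get(prop, prop) != "General_Category":
        raise ValueError(f"unsupported property: {prop}")
    # unicodedata.category reports short value aliases, so map long to short.
    short_value = GC_LONG_TO_SHORT.get(value, value)
    return unicodedata.category(ch) == short_value
```

With these tables, "gc=Lu" and "General_Category=Uppercase_Letter" select the same characters.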

Private Use

D49 Private-use code point: Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD.

• Private-use code points are considered to be assigned characters, but the abstract characters associated with them have no interpretation specified by this standard. They can be given any interpretation by conformant processes.
• Private-use code points are given default property values, but these default values are overridable by higher-level protocols that give those private-use code points a specific interpretation. See Section 16.5, Private-Use Characters.
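The three ranges in D49 can be checked with a simple range test. The following is a minimal sketch; the names `PRIVATE_USE_RANGES` and `is_private_use` are made up for illustration. It cross-checks the result against the General Category value Co reported by Python's `unicodedata` module.

```python
import unicodedata

# Ranges copied from definition D49 (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in one of the D49 ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


# Cross-check a few code points against gc=Co.
for cp in (0xE000, 0xF8FF, 0xF0000, 0x10FFFD, 0x0041, 0xFFFFE):
    by_range = is_private_use(cp)
    by_category = unicodedata.category(chr(cp)) == "Co"
    print(f"U+{cp:04X}: range test={by_range}, gc=Co={by_category}")
```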

3.6 Combination

Combining Character Sequences

D50 Graphic character: A character with the General Category of Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).

• Graphic characters specifically exclude the line and paragraph separators (Zl, Zp), as well as the characters with the General Category of Other (Cn, Cs, Cc, Cf).
• The interpretation of private-use characters (Co) as graphic characters or not is determined by the implementation.
• For more information, see Chapter 2, General Structure, especially Section 2.4, Code Points and Characters, and Table 2-3.

D51 Base character: Any graphic character except for those with the General Category of Combining Mark (M).

• Most Unicode characters are base characters. In terms of General Category values, a base character is any code point that has one of the following categories: Letter (L), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
• Base characters do not include control characters or format controls.
• Base characters are independent graphic characters, but this does not preclude the presentation of base characters from adopting different contextual forms or participating in ligatures.
• The interpretation of private-use characters (Co) as base characters or not is determined by the implementation. However, the default interpretation of private-use characters should be as base characters, in the absence of other information.

D51a Extended base: Any base character, or any standard Korean syllable block.


Conformance • This term is defined to take into account the fact that sequences of Korean conjoining jamo characters behave as if they were a single Hangul syllable character, so that the entire sequence of jamos constitutes a base. • For the definition of standard Korean syllable block, see D134 in Section 3.12, Conjoining Jamo Behavior.

D52 Combining character: A character with the General Category of Combining Mark (M). • Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me). • All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class. • The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation. • These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. • The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width nonjoiner. The combining character is said to apply to that base character. • There may be no such base character, such as when a combining character is at the start of text or follows a control or format character—for example, a carriage return, tab, or right-left mark. In such cases, the combining characters are called isolated combining characters. • With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character. • The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle. D53 Nonspacing mark: A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me). 
• The position of a nonspacing mark in presentation depends on its base character. It generally does not consume space along the visual baseline in and of itself. • Such characters may be large enough to affect the placement of their base character relative to preceding and succeeding base characters. For example, a circumflex applied to an “i” may affect spacing (“î”), as might the character U+20DD combining enclosing circle. D54 Enclosing mark: A nonspacing mark with the General Category of Enclosing Mark (Me). • Enclosing marks are a subclass of nonspacing marks that surround a base character, rather than merely being placed over, under, or through it.
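The relationship between the General Category of combining marks (Mn, Mc, Me) and the canonical combining class can be inspected directly. The sketch below, assuming Python's standard `unicodedata` module, shows a nonspacing mark with a non-zero class alongside a spacing mark and an enclosing mark whose class is zero. The helper name `describe` is illustrative.

```python
import unicodedata


def describe(cp: int) -> tuple[str, int]:
    """Return (General_Category, canonical combining class) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.combining(ch)


samples = {
    0x0301: "combining acute accent",
    0x093F: "devanagari vowel sign i",
    0x20DD: "combining enclosing circle",
    0x0061: "latin small letter a",
}
for cp, name in samples.items():
    gc, ccc = describe(cp)
    is_combining = gc.startswith("M")  # D52: General Category M
    print(f"U+{cp:04X} {name}: gc={gc}, ccc={ccc}, combining={is_combining}")
```

U+093F and U+20DD are combining characters with a zero combining class, which is the case the D52 note calls out.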


D55 Spacing mark: A combining character that is not a nonspacing mark.

• Examples include U+093F devanagari vowel sign i. In general, the behavior of spacing marks does not differ greatly from that of base characters.
• Spacing marks such as U+0BCA tamil vowel sign o may be rendered on both sides of a base character, but are not enclosing marks.

D56 Combining character sequence: A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character, zero width joiner, or zero width non-joiner; or a sequence of one or more characters where each is a combining character, zero width joiner, or zero width non-joiner.

• When identifying a combining character sequence in Unicode text, the definition of the combining character sequence is applied maximally. For example, in the sequence <c, dot-below, caron, acute, a>, the entire sequence <c, dot-below, caron, acute> is identified as the combining character sequence, rather than the alternative of identifying <c, dot-below> as a combining character sequence followed by a separate (defective) combining character sequence <caron, acute>.

D56a Extended combining character sequence: A maximal character sequence consisting of either an extended base followed by a sequence of one or more characters where each is a combining character, zero width joiner, or zero width non-joiner; or a sequence of one or more characters where each is a combining character, zero width joiner, or zero width non-joiner.

• Combining character sequence is commonly abbreviated as CCS, and extended combining character sequence is commonly abbreviated as ECCS.

D57 Defective combining character sequence: A combining character sequence that does not start with a base character.

• Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character. Such sequences are defective from the point of view of handling of combining marks, but are not ill-formed. (See D84.)
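The maximal matching described in D56 and D57 can be sketched as a single pass over the text. This is a simplified illustration, not a conformant implementation: it treats private-use characters as non-base, ignores Korean syllable blocks (D56a), and the function names `split_runs` and `is_defective` are assumptions made for the example.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"


def is_extender(ch: str) -> bool:
    """Combining character (gc=M), zero width joiner, or zero width non-joiner."""
    return unicodedata.category(ch).startswith("M") or ch in (ZWJ, ZWNJ)


def is_base(ch: str) -> bool:
    """D51: a graphic character that is not a combining mark."""
    gc = unicodedata.category(ch)
    return gc[0] in "LNPS" or gc == "Zs"


def split_runs(text: str) -> list[str]:
    """Split text so that each base absorbs all marks that follow it (maximal match)."""
    runs: list[str] = []
    for ch in text:
        prev = runs[-1][-1] if runs else ""
        if is_extender(ch) and prev and (is_extender(prev) or is_base(prev)):
            runs[-1] += ch      # extend the current sequence
        else:
            runs.append(ch)     # start a new run
    return runs


def is_defective(run: str) -> bool:
    """D57: a combining character sequence that does not start with a base."""
    return is_extender(run[0])


# <c, dot-below, caron, acute, a> from the example in D56
example = "c\u0323\u030c\u0301a"
print([[f"U+{ord(c):04X}" for c in run] for run in split_runs(example)])
# A mark after a control character forms a defective sequence.
print([is_defective(run) for run in split_runs("\n\u0301")])
```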

Grapheme Clusters D58 Grapheme base: A character with the property Grapheme_Base, or any standard Korean syllable block. • Characters with the property Grapheme_Base include all base characters (with the exception of U+FF9E..U+FF9F) plus most spacing marks. • The concept of a grapheme base is introduced to simplify discussion of the graphical application of nonspacing marks to other elements of text. A grapheme base may consist of a spacing (combining) mark, which distinguishes it from a base character per se. A grapheme base may also itself consist of a sequence of characters, in the case of the standard Korean syllable block. • For the definition of standard Korean syllable block, see D134 in Section 3.12, Conjoining Jamo Behavior. D59 Grapheme extender: A character with the property Grapheme_Extend.


• Grapheme extender characters consist of all nonspacing marks, zero width joiner, zero width non-joiner, U+FF9E, U+FF9F, and a small number of spacing marks.
• A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character.
• zero width joiner and zero width non-joiner are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
• The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
• The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.

D60 Grapheme cluster: The text between grapheme cluster boundaries as specified by Unicode Standard Annex #29, “Unicode Text Segmentation.” • This definition of “grapheme cluster” is generic. The specification of grapheme cluster boundary segmentation in UAX #29 includes two alternatives, for “extended grapheme clusters” and for “legacy grapheme clusters.” Furthermore, the segmentation algorithm in UAX #29 is tailorable. • The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it. • A grapheme cluster is similar, but not identical to a combining character sequence. A combining character sequence starts with a base character and extends across any subsequent sequence of combining marks, nonspacing or spacing. A combining character sequence is most directly relevant to processing issues related to normalization, comparison, and searching. • A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching. • For many processes, a grapheme cluster behaves as if it were a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties. For example, <x, macron> behaves in line breaking or bidirectional layout as if it were the character x. D61 Extended grapheme cluster: The text between extended grapheme cluster boundaries as specified by Unicode Standard Annex #29, “Unicode Text Segmentation.” • Extended grapheme clusters are defined in a parallel manner to legacy grapheme clusters, but also include sequences of spacing marks. 
• Grapheme clusters and extended grapheme clusters may not have any particular linguistic significance, but are used to break up a string of text into units for processing. • Grapheme clusters and extended grapheme clusters may be adjusted for particular processing requirements, by tailoring the rules for grapheme cluster segmentation specified in Unicode Standard Annex #29, “Unicode Text Segmentation.”


Application of Combining Marks A number of principles in the Unicode Standard relate to the application of combining marks. These principles are listed in this section, with an indication of which are considered to be normative and which are considered to be guidelines. In particular, guidelines for rendering of combining marks in conjunction with other characters should be considered as appropriate for defining default rendering behavior, in the absence of more specific information about rendering. It is often the case that combining marks in complex scripts or even particular, general-use nonspacing marks will have rendering requirements that depart significantly from the general guidelines. Rendering processes should, as appropriate, make use of available information about specific typographic practices and conventions so as to produce best rendering of text. To help in the clarification of the principles regarding the application of combining marks, a distinction is made between dependence and graphical application. D61a Dependence: A combining mark is said to depend on its associated base character. • The associated base character is the base character in the combining character sequence that a combining mark is part of. • A combining mark in a defective combining character sequence has no associated base character and thus cannot be said to depend on any particular base character. This is one of the reasons why fallback processing is required for defective combining character sequences. • Dependence concerns all combining marks, including spacing marks and combining marks that have no visible display. D61b Graphical application: A nonspacing mark is said to apply to its associated grapheme base. • The associated grapheme base is the grapheme base in the grapheme cluster that a nonspacing mark is part of. 
• A nonspacing mark in a defective combining character sequence is not part of a grapheme cluster and is subject to the same kinds of fallback processing as for any defective combining character sequence. • Graphic application concerns visual rendering issues and thus is an issue for nonspacing marks that have visible glyphs. Those glyphs interact, in rendering, with their grapheme base. Throughout the text of the standard, whenever the situation is clear, discussion of combining marks often simply talks about combining marks “applying” to their base. In the prototypical case of a nonspacing accent mark applying to a single base character letter, this simplification is not problematical, because the nonspacing mark both depends (notionally) on its base character and simultaneously applies (graphically) to its grapheme base, affecting its display. The finer distinctions are needed when dealing with the edge cases, such as combining marks that have no display glyph, graphical application of nonspacing marks to Korean syllables, and the behavior of spacing combining marks. The distinction made here between notional dependence and graphical application does not preclude spacing marks or even sequences of base characters from having effects on neighboring characters in rendering. Thus spacing forms of dependent vowels (matras) in Indic scripts may trigger particular kinds of conjunct formation or may be repositioned in ways that influence the rendering of other characters. (See Chapter 9, South Asian Scripts-I, for many examples.) Similarly, sequences of base characters may form ligatures in rendering. (See “Cursive Connection and Ligatures” in Section 16.2, Layout Controls.)


The following listing specifies the principles regarding application of combining marks. Many of these principles are illustrated in Section 2.11, Combining Characters, and Section 7.9, Combining Marks.

P1 [Normative] Combining character order: Combining characters follow the base character on which they depend.

• This principle follows from the definition of a combining character sequence.
• Thus the character sequence <U+0061 “a” latin small letter a, U+0308 “◌̈” combining diaeresis, U+0075 “u” latin small letter u> is unambiguously interpreted (and displayed) as “äu”, not “aü”. See Figure 2-18.

P2 [Guideline] Inside-out application. Nonspacing marks with the same combining class are generally positioned graphically outward from the grapheme base to which they apply.

• The most numerous and important instances of this principle involve nonspacing marks applied either directly above or below a grapheme base. See Figure 2-21.
• In a sequence of two nonspacing marks above a grapheme base, the first nonspacing mark is placed directly above the grapheme base, and the second is then placed above the first nonspacing mark.
• In a sequence of two nonspacing marks below a grapheme base, the first nonspacing mark is placed directly below the grapheme base, and the second is then placed below the first nonspacing mark.
• This rendering behavior for nonspacing marks can be generalized to sequences of any length, although practical considerations usually limit such sequences to no more than two or three marks above and/or below a grapheme base.
• The principle of inside-out application is also referred to as default stacking behavior for nonspacing marks.

P3 [Guideline] Side-by-side application. Notwithstanding the principle of inside-out application, some specific nonspacing marks may override the default stacking behavior and are positioned side-by-side over (or under) a grapheme base, rather than stacking vertically.

• Such side-by-side positioning may reflect language-specific orthographic rules, such as for Vietnamese diacritics and tone marks or for polytonic Greek breathing and accent marks. See Table 2-6.
• When positioned side-by-side, the visual rendering order of a sequence of nonspacing marks reflects the dominant order of the script with which they are used. Thus, in Greek, the first nonspacing mark in such a sequence will be positioned to the left side above a grapheme base, and the second to the right side above the grapheme base. In Hebrew, the opposite positioning is used for side-by-side placement.

P4 [Guideline] Traditional typographical behavior will sometimes override the default placement or rendering of nonspacing marks.

• Because of typographical conflict with the descender of a base character, a combining comma below placed on a lowercase “g” is traditionally rendered as if it were an inverted comma above. See Figure 7-1.


• Because of typographical conflict with the ascender of a base character, a combining háček (caron) is traditionally rendered as an apostrophe when placed, for example, on a lowercase “d”. See Figure 7-1.
• The relative placement of vowel marks in Arabic cannot be predicted by default stacking behavior alone, but depends on traditional rules of Arabic typography. See Figure 8-5.

P5 [Normative] Nondistinct order. Nonspacing marks with different, non-zero combining classes may occur in different orders without affecting either the visual display of a combining character sequence or the interpretation of that sequence.

• For example, if one nonspacing mark occurs above a grapheme base and another nonspacing mark occurs below it, they will have distinct combining classes. The order in which they occur in the combining character sequence does not matter for the display or interpretation of the resulting grapheme cluster.
• The introduction of the combining class for characters and its use in canonical ordering in the standard is to precisely define canonical equivalence and thereby clarify exactly which such alternate sequences must be considered as identical for display and interpretation. See Figure 2-24.
• In cases of nondistinct order, the order of combining marks has no linguistic significance. The order does not reflect how “closely bound” they are to the base. After canonical reordering, the order may no longer reflect the typed-in sequence. Rendering systems should be prepared to deal with common typed-in sequences and with canonically reordered sequences. See Table 5-3.
• Inserting a combining grapheme joiner between two combining marks with nondistinct order prevents their canonical reordering. For more information, see “Combining Grapheme Joiner” in Section 16.2, Layout Controls.

P6 [Guideline] Enclosing marks surround their grapheme base and any intervening nonspacing marks.

• This implies that enclosing marks successively surround previous enclosing marks. See Figure 3-1.

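Figure 3-1. Enclosing Marks

[Figure not reproduced: it shows a base character (09A4) combined with the marks 20DE, 0308, and 20DD, and the resulting rendered form.]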

• Dynamic application of enclosing marks, particularly sequences of enclosing marks, is beyond the capability of most fonts and simple rendering processes. It is not unexpected to find fallback rendering in cases such as that illustrated in Figure 3-1.

P7 [Guideline] Double diacritic nonspacing marks, such as U+0360 combining double tilde, apply to their grapheme base, but are intended to be rendered with glyphs that encompass a following grapheme base as well.

• Because such double diacritic display spans combinations of elements that would otherwise be considered grapheme clusters, the support of double diacritics in rendering may involve special handling for cursor placement and text selection. See Figure 7-8 for an example.


P8 [Guideline] When double diacritic nonspacing marks interact with normal nonspacing marks in a grapheme cluster, they “float” to the outermost layer of the stack of rendered marks (either above or below).

• This behavior can be conceived of as a kind of looser binding of such double diacritics to their bases. In effect, all other nonspacing marks are applied first, and then the double diacritic will span the resulting stacks. See Figure 7-9 for an example.
• Double diacritic nonspacing marks are also given a very high combining class, so that in canonical order they appear at or near the end of any combining character sequence. Figure 7-10 shows an example of the use of CGJ to block this reordering.
• The interaction of enclosing marks and double diacritics is not well defined graphically. Many fonts and rendering processes may not be able to handle combinations of these marks. It is not recommended to use combinations of these together in the same grapheme cluster.

P9 [Guideline] When a nonspacing mark is applied to the letters i and j or any other character with the Soft_Dotted property, the inherent dot on the base character is suppressed in display.

• See Figure 7-2 for an example.
• For languages such as Lithuanian, in which both a dot and an accent must be displayed, use U+0307 combining dot above. For guidelines in handling this situation in case mapping, see Section 5.18, Case Mappings.

Combining Marks and Korean Syllables. When a grapheme cluster comprises a Korean syllable, a combining mark applies to that entire syllable. For example, in the following sequence the grave is applied to the entire Korean syllable, not just to the last jamo:

U+1100 ᄀ choseong kiyeok + U+1161 ᅡ jungseong a + U+0300 ◌̀ grave → 가̀

If the combining mark in question is an enclosing combining mark, then it would enclose the entire Korean syllable, rather than the last jamo in it:

U+1100 ᄀ choseong kiyeok + U+1161 ᅡ jungseong a + U+20DD ◌⃝ enclosing circle → 가⃝

This treatment of the application of combining marks with respect to Korean syllables follows from the implications of canonical equivalence.
It should be noted, however, that older implementations may have supported the application of an enclosing combining mark to an entire Indic consonant conjunct or to a sequence of grapheme clusters linked together by combining grapheme joiners. Such an approach has a number of technical problems and leads to interoperability defects, so it is strongly recommended that implementations do not follow it. For more information on the recommended use of the combining grapheme joiner, see the subsection “Combining Grapheme Joiner” in Section 16.2, Layout Controls. For more discussion regarding the application of combining marks in general, see Section 7.9, Combining Marks.
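The canonical equivalence that underlies the Korean syllable behavior can be observed with a normalizer. The sketch below, assuming Python's standard `unicodedata` module, composes the two jamos from the example into a single precomposed syllable and shows that a following grave accent stays attached after the whole syllable. The variable names are illustrative.

```python
import unicodedata

jamo = "\u1100\u1161"   # choseong kiyeok + jungseong a
grave = "\u0300"        # combining grave accent

# The jamo sequence is canonically equivalent to one precomposed syllable.
syllable = unicodedata.normalize("NFC", jamo)
print(f"composed syllable: U+{ord(syllable):04X}")

# With a combining mark appended, composition yields syllable + mark,
# so the mark follows (and applies to) the entire syllable.
composed = unicodedata.normalize("NFC", jamo + grave)
print([f"U+{ord(c):04X}" for c in composed])

# Decomposing again restores the original jamo sequence plus the mark.
decomposed = unicodedata.normalize("NFD", composed)
print(decomposed == jamo + grave)
```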


3.7 Decomposition

3.7 Decomposition D62 Decomposition mapping: A mapping from a character to a sequence of one or more characters that is a canonical or compatibility equivalent, and that is listed in the character names list or described in Section 3.12, Conjoining Jamo Behavior. • Each character has at most one decomposition mapping. The mappings in Section 3.12, Conjoining Jamo Behavior, are canonical mappings. The mappings in the character names list are identified as either canonical or compatibility mappings (see Section 17.1, Character Names List). D63 Decomposable character: A character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the Unicode Character Database, and those described in Section 3.12, Conjoining Jamo Behavior. • A decomposable character is also referred to as a precomposed character or composite character. • The decomposition mappings from the Unicode Character Database are also given in Section 17.1, Character Names List. D64 Decomposition: A sequence of one or more characters that is equivalent to a decomposable character. A full decomposition of a character sequence results from decomposing each of the characters in the sequence until no characters can be further decomposed.

Compatibility Decomposition

D65 Compatibility decomposition: The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Section 3.12, Conjoining Jamo Behavior, until no characters can be further decomposed, and then reordering nonspacing marks according to Section 3.11, Normalization Forms.

• The decomposition mappings from the Unicode Character Database are also given in Section 17.1, Character Names List.
• Some compatibility decompositions remove formatting information.

D66 Compatibility decomposable character: A character whose compatibility decomposition is not identical to its canonical decomposition. It may also be known as a compatibility precomposed character or a compatibility composite character.

• For example, U+00B5 micro sign has no canonical decomposition mapping, so its canonical decomposition is the same as the character itself. It has a compatibility decomposition to U+03BC greek small letter mu. Because micro sign has a compatibility decomposition that is not equal to its canonical decomposition, it is a compatibility decomposable character. • For example, U+03D3 greek upsilon with acute and hook symbol canonically decomposes to the sequence <U+03D2 greek upsilon with hook symbol, U+0301 combining acute accent>. That sequence has a compatibility decomposition of <U+03A5 greek capital letter upsilon, U+0301 combining acute accent>. Because greek upsilon with acute and hook symbol has a compatibility decomposition that is not equal to its canonical decomposition, it is a compatibility decomposable character.
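The two examples for D66 can be reproduced with a normalizer, since the NFD and NFKD forms are the full canonical and compatibility decompositions respectively. This is a minimal sketch using Python's standard `unicodedata` module; the helper names are assumptions made for illustration.

```python
import unicodedata


def canonical_decomposition(s: str) -> str:
    """Full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s)


def compatibility_decomposition(s: str) -> str:
    """Full compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", s)


def is_compatibility_decomposable(ch: str) -> bool:
    """D66: compatibility decomposition differs from canonical decomposition."""
    return canonical_decomposition(ch) != compatibility_decomposition(ch)


def hexes(s: str) -> list[str]:
    return [f"U+{ord(c):04X}" for c in s]


for ch in ("\u00b5", "\u03d3", "\u00e0"):
    print(
        hexes(ch),
        "canonical:", hexes(canonical_decomposition(ch)),
        "compatibility:", hexes(compatibility_decomposition(ch)),
        "compat-decomposable:", is_compatibility_decomposable(ch),
    )
```

U+00E0 is included as a contrast: it is decomposable, but both decompositions agree, so it does not meet D66.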


• This term should not be confused with the term “compatibility character,” which is discussed in Section 2.3, Compatibility Characters.
• Many compatibility decomposable characters are included in the Unicode Standard solely to represent distinctions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data or other special circumstances.
• Some widely used and indispensable characters, such as NBSP, are compatibility decomposable characters for historical reasons. Their use is not discouraged.
• A large number of compatibility decomposable characters are used in phonetic and mathematical notation, where their use is not discouraged.
• For historical reasons, some characters that might have been given a compatibility decomposition were not, in fact, decomposed. The Normalization Stability Policy prohibits adding decompositions for such cases in the future, so that normalization forms will stay stable. See the subsection “Policies” in Section B.6, Other Unicode Online Resources.
• Replacing a compatibility decomposable character by its compatibility decomposition may lose round-trip convertibility with a base standard.

D67 Compatibility equivalent: Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.

Canonical Decomposition

D68 Canonical decomposition: The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Section 3.12, Conjoining Jamo Behavior, until no characters can be further decomposed, and then reordering nonspacing marks according to Section 3.11, Normalization Forms.

• The decomposition mappings from the Unicode Character Database are also printed in Section 17.1, Character Names List.
• A canonical decomposition does not remove formatting information.

D69 Canonical decomposable character: A character that is not identical to its canonical decomposition. It may also be known as a canonical precomposed character or a canonical composite character.

• For example, U+00E0 latin small letter a with grave is a canonical decomposable character because its canonical decomposition is to the sequence <U+0061 latin small letter a, U+0300 combining grave accent>. U+212A kelvin sign is a canonical decomposable character because its canonical decomposition is to U+004B latin capital letter k.

D70 Canonical equivalent: Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.

• For example, the sequences <o, combining-diaeresis> and <ö> are canonical equivalents. Canonical equivalence is a Unicode property. It should not be confused with language-specific collation or matching, which may add other equivalencies. For example, in Swedish, ö is treated as a completely different letter from o and is collated after z. In German, ö is weakly equivalent to oe and is collated with oe. In English, ö is just an o with a diacritic that indicates that it is pronounced separately from the previous letter (as in coöperate) and is collated with o.
• By definition, all canonical-equivalent sequences are also compatibility-equivalent sequences.

For information on the use of decomposition in normalization, see Section 3.11, Normalization Forms.
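Definition D70 translates directly into a comparison of full canonical decompositions. The sketch below assumes Python's standard `unicodedata` module and uses NFD as the full canonical decomposition; the function name `canonically_equivalent` is illustrative. It also shows the effect of combining classes on mark order: marks with different non-zero classes may be swapped, while marks of the same class may not.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """D70: equal full canonical decompositions."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# <o, combining diaeresis> versus precomposed ö
print(canonically_equivalent("o\u0308", "\u00f6"))
# kelvin sign versus latin capital letter k
print(canonically_equivalent("\u212a", "K"))
# micro sign versus greek small letter mu: only compatibility equivalent
print(canonically_equivalent("\u00b5", "\u03bc"))
# dot below (class 220) and acute (class 230) in either order
print(canonically_equivalent("a\u0323\u0301", "a\u0301\u0323"))
# acute and diaeresis share class 230, so their order is significant
print(canonically_equivalent("a\u0301\u0308", "a\u0308\u0301"))
```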

3.8 Surrogates

D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.

D72 High-surrogate code unit: A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair.

D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.

D74 Low-surrogate code unit: A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair.

• High-surrogate and low-surrogate code points are designated only for that use.
• High-surrogate and low-surrogate code units are used only in the context of the UTF-16 character encoding form.

D75 Surrogate pair: A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.

• Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode Encoding Forms.)

• Isolated surrogate code units have no interpretation on their own. Certain other isolated code units in other encoding forms also have no interpretation on their own. For example, the isolated byte 80₁₆ has no interpretation in UTF-8; it can be used only as part of a multibyte sequence. (See Table 3-7.)
• Sometimes high-surrogate code units are referred to as leading surrogates. Low-surrogate code units are then referred to as trailing surrogates. This is analogous to usage in UTF-8, which has leading bytes and trailing bytes.
• For more information, see Section 16.6, Surrogates Area, and Section 5.4, Handling Surrogate Pairs in UTF-16.
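To make the surrogate ranges concrete, here is a small sketch of how a supplementary code point maps to a surrogate pair and back. The arithmetic is the usual UTF-16 mapping, which this passage refers to but does not spell out, and the function names are assumptions for illustration. The result is cross-checked against Python's built-in `utf-16-be` codec.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Map a supplementary code point (U+10000..U+10FFFF) to (high, low) code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> DC00..DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recover the code point from a well-formed surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high-surrogate followed by a low-surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x1F600
high, low = to_surrogate_pair(cp)
print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Cross-check against the built-in UTF-16 (big-endian) encoder.
encoded = chr(cp).encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```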

3.9 Unicode Encoding Forms

The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. The size of the code unit is specified for each encoding form. This section presents the formal definition of each of these encoding forms.

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
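The scalar value ranges in D76 amount to a two-range test. The sketch below (function names are illustrative) compares that test with the behavior of Python's UTF-8 encoder, which in its default strict mode refuses to encode surrogate code points.

```python
def is_scalar_value(cp: int) -> bool:
    """D76: any code point except the surrogates U+D800..U+DFFF."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def utf8_encodable(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x0000, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
    print(f"U+{cp:04X}: scalar={is_scalar_value(cp)}, encodable={utf8_encodable(cp)}")
```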

The Unicode Standard, Version 6.2

Conformance

D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. • Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. • A code unit is also referred to as a code value in the information industry. • In the Unicode Standard, specific values of some code units cannot be used to represent an encoded character in isolation. This restriction applies to isolated surrogate code units in UTF-16 and to the bytes 80–FF in UTF-8. Similar restrictions apply for the implementations of other character encoding standards; for example, the bytes 81–9F, E0–FC in SJIS (Shift-JIS) cannot represent an encoded character by themselves. • For information on use of wchar_t or other programming language types to represent Unicode code units, see “ANSI/ISO C wchar_t” in Section 5.2, Programming Languages and Data Types. D78 Code unit sequence: An ordered sequence of one or more code units. • When the code unit is an 8-bit unit, a code unit sequence may also be referred to as a byte sequence. • A code unit sequence may consist of a single code unit. • In the context of programming languages, the value of a string data type basically consists of a code unit sequence. Informally, a code unit sequence is itself just referred to as a string, and a byte sequence is referred to as a byte string. Care must be taken in making this terminological equivalence, however, because the formally defined concept of a string may have additional requirements or complications in programming languages. For example, a string is defined as a pointer to char in the C language and is conventionally terminated with a NULL character. 
In object-oriented languages, a string is a complex object, with associated methods, and its value may or may not consist of merely a code unit sequence. • Depending on the structure of a character encoding standard, it may be necessary to use a code unit sequence (of more than one unit) to represent a single encoded character. For example, the code unit in SJIS is a byte: encoded characters such as “a” can be represented with a single byte in SJIS, whereas ideographs require a sequence of two code units. The Unicode Standard also makes use of code unit sequences whose length is greater than one code unit. D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. • For historical reasons, the Unicode encoding forms are also referred to as Unicode (or UCS) transformation formats (UTF). That term is actually ambiguous between its usage for encoding forms and encoding schemes. • The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one. This property guarantees that a reverse mapping can always be derived. Given the mapping of any Unicode scalar value to a particular code unit sequence for a given encoding form, one can derive the original Unicode scalar value unambiguously from that code unit sequence.

• The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is not onto. In other words, for any given encoding form, there exist code unit sequences that have no associated Unicode scalar value. • To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values. D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form. • In the rawest form, Unicode strings may be implemented simply as arrays of the appropriate integral data type, consisting of a sequence of code units lined up one immediately after the other. • A single Unicode string must contain only code units from a single Unicode encoding form. It is not permissible to mix forms within a string. D81 Unicode 8-bit string: A Unicode string containing only UTF-8 code units. D82 Unicode 16-bit string: A Unicode string containing only UTF-16 code units. D83 Unicode 32-bit string: A Unicode string containing only UTF-32 code units. D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form. • Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed. • UTF-8 has some strong constraints on the possible byte ranges for leading and trailing bytes. A violation of those constraints would produce a code unit sequence that could not be mapped to a Unicode scalar value, resulting in an ill-formed code unit sequence. 
D84a Ill-formed code unit subsequence: A non-empty subsequence of a Unicode code unit sequence X which does not contain any code units which also belong to any minimal well-formed subsequence of X.

• In other words, an ill-formed code unit subsequence cannot overlap with a minimal well-formed subsequence.

D85 Well-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it does follow the specification of that Unicode encoding form.

D85a Minimal well-formed code unit subsequence: A well-formed Unicode code unit sequence that maps to a single Unicode scalar value.

• For UTF-8, see the specification in D92 and Table 3-7.
• For UTF-16, see the specification in D91.
• For UTF-32, see the specification in D90.

A well-formed Unicode code unit sequence can be partitioned into one or more minimal well-formed code unit sequences for the given Unicode encoding form. Any Unicode code unit sequence can be partitioned into subsequences that are either well-formed or ill-formed. The sequence as a whole is well-formed if and only if it contains no ill-formed subsequence. The sequence as a whole is ill-formed if and only if it contains at least one ill-formed subsequence.

D86 Well-formed UTF-8 code unit sequence: A well-formed Unicode code unit sequence of UTF-8 code units.

• The UTF-8 code unit sequence <41 C3 B1 42> is well-formed, because it can be partitioned into subsequences, all of which match the specification for UTF-8 in Table 3-7. It consists of the following minimal well-formed code unit subsequences: <41>, <C3 B1>, and <42>.
• The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed, because it contains one ill-formed subsequence. There is no subsequence for the C2 byte which matches the specification for UTF-8 in Table 3-7. The code unit sequence is partitioned into one minimal well-formed code unit subsequence, <41>, followed by one ill-formed code unit subsequence, <C2>, followed by two minimal well-formed code unit subsequences, <C3 B1> and <42>.
• In isolation, the UTF-8 code unit sequence <C2 C3> would be ill-formed, but in the context of the UTF-8 code unit sequence <41 C2 C3 B1 42>, <C2 C3> does not constitute an ill-formed code unit subsequence, because the C3 byte is actually the first byte of the minimal well-formed UTF-8 code unit subsequence <C3 B1>. Ill-formed code unit subsequences do not overlap with minimal well-formed code unit subsequences.

D87 Well-formed UTF-16 code unit sequence: A well-formed Unicode code unit sequence of UTF-16 code units.

D88 Well-formed UTF-32 code unit sequence: A well-formed Unicode code unit sequence of UTF-32 code units.

D89 In a Unicode encoding form: A Unicode string is said to be in a particular Unicode encoding form if and only if it consists of a well-formed Unicode code unit sequence of that Unicode encoding form.

• A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be in UTF-8.
Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8 string for short.
• A Unicode string consisting of a well-formed UTF-16 code unit sequence is said to be in UTF-16. Such a Unicode string is referred to as a valid UTF-16 string, or a UTF-16 string for short.
• A Unicode string consisting of a well-formed UTF-32 code unit sequence is said to be in UTF-32. Such a Unicode string is referred to as a valid UTF-32 string, or a UTF-32 string for short.

Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.

• For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is.

• As another example, the code unit sequence <C0 80 61 F3> is a Unicode 8-bit string, but does not consist of a well-formed UTF-8 code unit sequence. That code unit sequence could not result from the specification of the UTF-8 encoding form and is thus ill-formed. (The same code unit sequence could, of course, be well-formed in the context of some other character encoding standard using 8-bit code units, such as ISO/IEC 8859-1, or vendor code pages.) If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence. If a process which verifies that a Unicode string is in a Unicode encoding form encounters an ill-formed code unit subsequence in that string, then it must not identify that string as being in that Unicode encoding form. A process which interprets a Unicode string must not interpret any ill-formed code unit subsequences in the string as characters. (See conformance clause C10.) Furthermore, such a process must not treat any adjacent well-formed code unit sequences as being part of those ill-formed code unit sequences. Table 3-4 gives examples that summarize the three Unicode encoding forms.

Table 3-4. Examples of Unicode Encoding Forms

Code Point   Encoding Form   Code Unit Sequence
U+004D       UTF-32          0000004D
             UTF-16          004D
             UTF-8           4D
U+0430       UTF-32          00000430
             UTF-16          0430
             UTF-8           D0 B0
U+4E8C       UTF-32          00004E8C
             UTF-16          4E8C
             UTF-8           E4 BA 8C
U+10302      UTF-32          00010302
             UTF-16          D800 DF02
             UTF-8           F0 90 8C 82
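The rows of Table 3-4 can be reproduced with Python's built-in codecs. This is a sketch only: the helper name `encoding_forms` is an assumption, and the big-endian codecs are used simply so that the hexadecimal dump of the bytes reads the same as the code unit values.

```python
def encoding_forms(cp: int) -> tuple[str, str, str]:
    """Return the (UTF-32, UTF-16, UTF-8) code unit sequences for one scalar value as hex text."""
    ch = chr(cp)
    # Group the hex digits by code unit size: 4 bytes, 2 bytes, 1 byte.
    utf32 = ch.encode("utf-32-be").hex(" ", 4).upper()
    utf16 = ch.encode("utf-16-be").hex(" ", 2).upper()
    utf8 = ch.encode("utf-8").hex(" ", 1).upper()
    return utf32, utf16, utf8


for cp in (0x004D, 0x0430, 0x4E8C, 0x10302):
    print(f"U+{cp:04X}", *encoding_forms(cp), sep=" | ")
```

The grouping argument of `bytes.hex` requires Python 3.8 or later, and the `tuple[...]` annotation requires Python 3.9 or later.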

UTF-32

D90 UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.

• In UTF-32, the code point sequence <004D, 0430, 4E8C, 10302> is represented as <0000004D 00000430 00004E8C 00010302>.
• Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0000D800₁₆..0000DFFF₁₆ are ill-formed.
• Any UTF-32 code unit greater than 0010FFFF₁₆ is ill-formed.

For a discussion of the relationship between UTF-32 and the UCS-4 encoding form defined in ISO/IEC 10646, see Section C.2, Encoding Forms in ISO/IEC 10646.

UTF-16

D91 UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.

• In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302.
• Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range D800₁₆..DFFF₁₆ are ill-formed.
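The rule that isolated surrogate code units are ill-formed can be sketched as a small checker over a list of 16-bit code units. The function name `is_well_formed_utf16` is a hypothetical helper, not something the standard defines. The usage lines revisit the concatenation example from D89, where two Unicode 16-bit strings that are not in UTF-16 combine into one that is.

```python
def is_well_formed_utf16(units: list[int]) -> bool:
    """Check that every surrogate code unit is part of a high/low surrogate pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is isolated.
            return False
        i += 1
    return True


left = [0x004D, 0xD800]
right = [0xDF02, 0x004D]
print(is_well_formed_utf16(left), is_well_formed_utf16(right), is_well_formed_utf16(left + right))
```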

Table 3-5 specifies the bit distribution for the UTF-16 encoding form. Note that for Unicode scalar values equal to or greater than U+10000, UTF-16 uses surrogate pairs. Calculation of the surrogate pair values involves subtraction of 10000₁₆, to account for the starting offset to the scalar value. ISO/IEC 10646 specifies an equivalent UTF-16 encoding form. For details, see Section C.3, UTF-8 and UTF-16.

Table 3-5. UTF-16 Bit Distribution

Scalar Value                 UTF-16
xxxxxxxxxxxxxxxx             xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx     110110wwwwxxxxxx 110111xxxxxxxxxx

Note: wwww = uuuuu - 1
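The bit distribution in Table 3-5 amounts to subtracting 10000₁₆ and splitting the remaining 20 bits into two 10-bit halves. A minimal sketch follows; the function names are illustrative assumptions.

```python
def to_surrogate_pair(scalar: int) -> tuple[int, int]:
    """Map a scalar value in U+10000..U+10FFFF to (high, low) surrogate code units."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value does not need a surrogate pair")
    v = scalar - 0x10000           # 20-bit value; this is where wwww = uuuuu - 1 comes from
    high = 0xD800 | (v >> 10)      # 110110 followed by the top 10 bits
    low = 0xDC00 | (v & 0x3FF)     # 110111 followed by the bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse mapping: recover the scalar value from a surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For U+10302 the subtraction leaves 0302₁₆, so the high half is zero and the pair is <D800 DF02>, matching Table 3-4.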

UTF-8 D92 UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-6 and Table 3-7. • In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is represented as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where <4D> corresponds to U+004D, <D0 B0> corresponds to U+0430, <E4 BA 8C> corresponds to U+4E8C, and <F0 90 8C 82> corresponds to U+10302. • Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed. • Before the Unicode Standard, Version 3.1, the problematic “non-shortest form” byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are ill-formed, because they are not allowed by Table 3-7. • Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed. Table 3-6 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of Unicode scalar values corresponding to one-, two-, three-, and four-byte sequences. For a discussion of the difference in the formulation of UTF-8 in ISO/IEC 10646, see Section C.3, UTF-8 and UTF-16. Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is well-formed in that position. Any byte value outside of the ranges listed is ill-formed. For example: • The byte sequence <C0 AF> is ill-formed, because C0 is not well-formed in the “First Byte” column.

Table 3-6. UTF-8 Bit Distribution

Scalar Value                  First Byte   Second Byte   Third Byte   Fourth Byte
00000000 0xxxxxxx             0xxxxxxx
00000yyy yyxxxxxx             110yyyyy     10xxxxxx
zzzzyyyy yyxxxxxx             1110zzzz     10yyyyyy      10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx    11110uuu     10uuzzzz      10yyyyyy     10xxxxxx

• The byte sequence <E0 9F 80> is ill-formed, because in the row where E0 is well-formed as a first byte, 9F is not well-formed as a second byte. • The byte sequence <F4 80 83 92> is well-formed, because every byte in that sequence matches a byte range in a row of the table (the last row).

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF       80..BF
U+0800..U+0FFF        E0           A0..BF        80..BF
U+1000..U+CFFF        E1..EC       80..BF        80..BF
U+D000..U+D7FF        ED           80..9F        80..BF
U+E000..U+FFFF        EE..EF       80..BF        80..BF
U+10000..U+3FFFF      F0           90..BF        80..BF       80..BF
U+40000..U+FFFFF      F1..F3       80..BF        80..BF       80..BF
U+100000..U+10FFFF    F4           80..8F        80..BF       80..BF

In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence. As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
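Table 3-7 is compact enough to be transcribed as data and used directly as a well-formedness check. The sketch below is one possible reading of the table, not an optimized validator; `TABLE_3_7`, `match_at`, and `is_well_formed_utf8` are names chosen here for illustration.

```python
# Each row: (first-byte range, list of trailing-byte ranges), transcribed from Table 3-7.
TAIL = (0x80, 0xBF)
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [TAIL]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), TAIL]),
    ((0xE1, 0xEC), [TAIL, TAIL]),
    ((0xED, 0xED), [(0x80, 0x9F), TAIL]),
    ((0xEE, 0xEF), [TAIL, TAIL]),
    ((0xF0, 0xF0), [(0x90, 0xBF), TAIL, TAIL]),
    ((0xF1, 0xF3), [TAIL, TAIL, TAIL]),
    ((0xF4, 0xF4), [(0x80, 0x8F), TAIL, TAIL]),
]


def match_at(data: bytes, i: int) -> int:
    """Return the length of the minimal well-formed subsequence starting at i, or 0 if none."""
    for (lo, hi), trail in TABLE_3_7:
        if lo <= data[i] <= hi:
            end = i + 1 + len(trail)
            if end > len(data):
                return 0  # sequence is truncated
            for b, (tlo, thi) in zip(data[i + 1:end], trail):
                if not tlo <= b <= thi:
                    return 0
            return end - i
    return 0  # byte cannot start a sequence (80..BF, C0..C1, F5..FF)


def is_well_formed_utf8(data: bytes) -> bool:
    """True if data partitions entirely into minimal well-formed subsequences."""
    i = 0
    while i < len(data):
        n = match_at(data, i)
        if n == 0:
            return False
        i += n
    return True
```

With this transcription, `bytes.fromhex("F4808392")` is accepted through the last row, while `bytes.fromhex("C0AF")` and `bytes.fromhex("E09F80")` are rejected, in line with the examples given for the table.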

Encoding Form Conversion

D93 Encoding form conversion: A conversion defined directly between the code unit sequences of one Unicode encoding form and the code unit sequences of another Unicode encoding form.

• In implementations of the Unicode Standard, a typical API will logically convert the input code unit sequence into Unicode scalar values (code points) and then convert those Unicode scalar values into the output code unit sequence. Proper analysis of the encoding forms makes it possible to convert the code units directly, thereby obtaining the same results but with a more efficient process. • A conformant encoding form conversion will treat any ill-formed code unit sequence as an error condition. (See conformance clause C10.) This guarantees that it will neither interpret nor emit an ill-formed code unit sequence. Any implementation of encoding form conversion must take this requirement into account, because an encoding form conversion implicitly involves a verification that the Unicode strings being converted do, in fact, contain well-formed code unit sequences.

Constraints on Conversion Processes The requirement not to interpret any ill-formed code unit subsequences in a string as characters (see conformance clause C10) has important consequences for conversion processes. Such processes may, for example, interpret UTF-8 code unit sequences as Unicode character sequences. If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence. If an implementation of a UTF-8 conversion process stops at the first error encountered, without reporting the end of any ill-formed UTF-8 code unit subsequence, then the requirement makes little practical difference. However, the requirement does introduce a significant constraint if the UTF-8 converter continues past the point of a detected error, perhaps by substituting one or more U+FFFD replacement characters for the uninterpretable, ill-formed UTF-8 code unit subsequence. For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because either of those outputs would be the result of misinterpreting a well-formed subsequence as being part of the ill-formed subsequence. The expected return value for such a process would instead be <U+FFFD, U+0041, U+0042>. For a UTF-8 conversion process to consume valid successor bytes is not only non-conformant, but also leaves the converter open to security exploits. See Unicode Technical Report #36, “Unicode Security Considerations.” Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. 
An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only formal requirement mandated by Unicode conformance for a converter is that the <41> be processed and correctly interpreted as <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence. Best Practices for Using U+FFFD. When using U+FFFD to replace ill-formed subsequences encountered during conversion, there are various logically possible approaches to associate U+FFFD with all or part of an ill-formed subsequence. To promote interoperability in the implementation of conversion processes, the Unicode Standard recommends a particular best practice. The following definitions simplify the discussion of this best practice: D93a Unconvertible offset: An offset in a code unit sequence for which no code unit subsequence starting at that offset is well-formed. D93b Maximal subpart of an ill-formed subsequence: The longest code unit subsequence starting at an unconvertible offset that is either: a. the initial subsequence of a well-formed code unit sequence, or b. a subsequence of length one. • The term maximal subpart of an ill-formed subsequence can be abbreviated to maximal subpart when it is clear in context that the subsequence in question is ill-formed.

• This definition can be trivially applied to the UTF-32 or UTF-16 encoding forms, but is primarily of interest when converting UTF-8 strings. • For example, in the ill-formed UTF-8 sequence <41 C0 AF 41 F4 80 80 41>, there are two ill-formed subsequences: <C0 AF> and <F4 80 80>, each separated by <41>, which is well-formed. Applying the definition of maximal subparts for these ill-formed subsequences, in the first case <C0> is a maximal subpart, because that byte value can never be the first byte of a well-formed UTF-8 sequence. In the second subsequence, <F4 80 80> is a maximal subpart, because up to that point all three bytes match the specification for UTF-8. It is only when followed by <41> that the sequence of <F4 80 80> can be determined to be ill-formed, because the specification requires a following byte in the range 80..BF, instead. • Another example illustrates the application of the concept of maximal subpart for UTF-8 continuation bytes outside the allowable ranges defined in Table 3-7. The UTF-8 sequence <41 E0 9F 80 41> is ill-formed, because <9F> is not an allowed second byte of a UTF-8 sequence commencing with <E0>. In this case, there is an unconvertible offset at <E0> and the maximal subpart at that offset is also <E0>. The subsequence <E0 9F> cannot be a maximal subpart, because it is not an initial subsequence of any well-formed UTF-8 code unit sequence. Using the definition for maximal subpart, the best practice can be stated simply as: Whenever an unconvertible offset is reached during conversion of a code unit sequence: 1. The maximal subpart at that offset should be replaced by a single U+FFFD. 2. The conversion should proceed at the offset immediately after the maximal subpart. This sounds complicated, but it reflects the way optimized conversion processes are typically constructed, particularly for UTF-8. 
A sequence of code units will be processed up to the point where the sequence either can be unambiguously interpreted as a particular Unicode code point or where the converter recognizes that the code units collected so far constitute an ill-formed subsequence. At that point, the converter can emit a single U+FFFD for the collected (but ill-formed) code unit(s) and move on, without having to further accumulate state. The maximal subpart could be the start of a well-formed sequence, except that the sequence lacks the proper continuation. Alternatively, the converter may have found a continuation code unit or some other code unit which cannot be the start of a well-formed sequence. To illustrate this policy, consider the ill-formed UTF-8 sequence <61 F1 80 80 E1 80 C2 62 80 63 80 BF 64>. Possible alternative approaches for a UTF-8 converter using U+FFFD are illustrated in Table 3-8.

Table 3-8. Use of U+FFFD in UTF-8 Conversion

     61     F1 80 80 E1 80 C2                62     80     63     80 BF        64
1    0061   FFFD                             0062   FFFD   0063   FFFD         0064
2    0061   FFFD FFFD FFFD                   0062   FFFD   0063   FFFD FFFD    0064
3    0061   FFFD FFFD FFFD FFFD FFFD FFFD    0062   FFFD   0063   FFFD FFFD    0064

The recommended conversion policy would have the outcome shown in Row 2 of Table 3-8, rather than Row 1 or Row 3. For example, a UTF-8 converter would detect that <F1 80 80> constituted a maximal subpart of the ill-formed subsequence as soon as it encountered the subsequent code unit <E1>, so at that point, it would emit a single U+FFFD and then continue attempting to convert from the <E1> code unit, and so forth to the end of the code unit sequence to convert. The UTF-8 converter would detect that the code unit <80> in the sequence <62 80 63> is not well-formed, and would replace it by U+FFFD. Neither of the code units <80> or <BF> in the sequence <63 80 BF 64> is the start of a potentially well-formed sequence; therefore each of them is separately replaced by U+FFFD. For a discussion of the generalization of this approach for conversion of other character sets to Unicode, see Section 5.22, Best Practice for U+FFFD Substitution.
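The maximal-subpart policy can be sketched as a small decoder that replaces each maximal subpart with one U+FFFD and resumes immediately after it. This is an illustrative sketch under the definitions D93a and D93b, not a reference implementation; `ROWS`, `scan`, and `decode_with_replacement` are hypothetical names, and `ROWS` is a transcription of the byte ranges in Table 3-7.

```python
TAIL = (0x80, 0xBF)
# (first-byte range, trailing-byte ranges), following Table 3-7.
ROWS = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [TAIL]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), TAIL]),
    ((0xE1, 0xEC), [TAIL, TAIL]),
    ((0xED, 0xED), [(0x80, 0x9F), TAIL]),
    ((0xEE, 0xEF), [TAIL, TAIL]),
    ((0xF0, 0xF0), [(0x90, 0xBF), TAIL, TAIL]),
    ((0xF1, 0xF3), [TAIL, TAIL, TAIL]),
    ((0xF4, 0xF4), [(0x80, 0x8F), TAIL, TAIL]),
]


def scan(data: bytes, i: int) -> tuple[int, bool]:
    """Return (length, complete) for the longest valid prefix of a sequence at offset i.

    complete is True for a minimal well-formed subsequence; otherwise the
    length is that of the maximal subpart (at least one byte).
    """
    for (lo, hi), trail in ROWS:
        if lo <= data[i] <= hi:
            n = 1
            for tlo, thi in trail:
                if i + n < len(data) and tlo <= data[i + n] <= thi:
                    n += 1
                else:
                    break
            return n, n == 1 + len(trail)
    return 1, False  # byte can never start a well-formed sequence


def decode_with_replacement(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per maximal subpart of an ill-formed subsequence."""
    out = []
    i = 0
    while i < len(data):
        n, complete = scan(data, i)
        out.append(data[i:i + n].decode("utf-8") if complete else "\uFFFD")
        i += n
    return "".join(out)


sample = bytes.fromhex("61F18080E180C262806380BF64")
print([f"{ord(c):04X}" for c in decode_with_replacement(sample)])
```

Because the scanner stops at the first byte that does not fit the current row, it never consumes a byte that could begin a well-formed subsequence, which is the conformance requirement discussed under Constraints on Conversion Processes.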

3.10 Unicode Encoding Schemes

D94 Unicode encoding scheme: A specified byte serialization for a Unicode encoding form, including the specification of the handling of a byte order mark (BOM), if allowed.

• For historical reasons, the Unicode encoding schemes are also referred to as Unicode (or UCS) transformation formats (UTF). That term is, however, ambiguous between its usage for encoding forms and encoding schemes. The Unicode Standard supports seven encoding schemes. This section presents the formal definition of each of these encoding schemes. D95 UTF-8 encoding scheme: The Unicode encoding scheme that serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself. • In the UTF-8 encoding scheme, the UTF-8 code unit sequence <4D D0 B0 E4 BA 8C F0 90 8C 82> is serialized as <4D D0 B0 E4 BA 8C F0 90 8C 82>. • Because the UTF-8 encoding form already deals in ordered byte sequences, the UTF-8 encoding scheme is trivial. The byte ordering is already obvious and completely defined by the UTF-8 code unit sequence itself. The UTF-8 encoding scheme is defined merely for completeness of the Unicode character encoding model. • While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme. D96 UTF-16BE encoding scheme: The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in big-endian format. • In UTF-16BE, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is serialized as <00 4D 04 30 4E 8C D8 00 DF 02>. • In UTF-16BE, an initial byte sequence <FE FF> is interpreted as U+FEFF zero width no-break space. 
D97 UTF-16LE encoding scheme: The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in little-endian format. • In UTF-16LE, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is serialized as <4D 00 30 04 8C 4E 00 D8 02 DF>.

• In UTF-16LE, an initial byte sequence <FF FE> is interpreted as U+FEFF zero width no-break space.

D98 UTF-16 encoding scheme: The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format.

• In the UTF-16 encoding scheme, the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02> is serialized as <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> or <FF FE 4D 00 30 04 8C 4E 00 D8 02 DF> or <00 4D 04 30 4E 8C D8 00 DF 02>.
• In the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark; it is used to distinguish between the two byte orders. An initial byte sequence <FE FF> indicates big-endian order, and an initial byte sequence <FF FE> indicates little-endian order. The BOM is not considered part of the content of the text.
• The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

Table 3-9 gives examples that summarize the three Unicode encoding schemes for the UTF-16 encoding form.

Table 3-9. Summary of UTF-16BE, UTF-16LE, and UTF-16

Code Unit Sequence   Encoding Scheme   Byte Sequence(s)
004D                 UTF-16BE          00 4D
                     UTF-16LE          4D 00
                     UTF-16            FE FF 00 4D
                                       FF FE 4D 00
                                       00 4D
0430                 UTF-16BE          04 30
                     UTF-16LE          30 04
                     UTF-16            FE FF 04 30
                                       FF FE 30 04
                                       04 30
4E8C                 UTF-16BE          4E 8C
                     UTF-16LE          8C 4E
                     UTF-16            FE FF 4E 8C
                                       FF FE 8C 4E
                                       4E 8C
D800 DF02            UTF-16BE          D8 00 DF 02
                     UTF-16LE          00 D8 02 DF
                     UTF-16            FE FF D8 00 DF 02
                                       FF FE 00 D8 02 DF
                                       D8 00 DF 02
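The three UTF-16 encoding schemes differ only in byte order and BOM handling, which the following sketch makes explicit. The helper names `serialize_utf16` and `parse_utf16_scheme` are assumptions made for illustration. The parser applies the rule stated in D98: a leading <FE FF> or <FF FE> selects the byte order and is dropped from the content, and big-endian is assumed when there is no BOM and no higher-level protocol.

```python
def serialize_utf16(units: list[int], scheme: str = "UTF-16BE", bom: bool = False) -> bytes:
    """Serialize 16-bit code units as bytes.

    scheme is "UTF-16BE" or "UTF-16LE". With bom=True a byte order mark is
    prepended, giving one of the BOM-prefixed serializations allowed by the
    UTF-16 encoding scheme.
    """
    order = "big" if scheme == "UTF-16BE" else "little"
    prefix = (0xFEFF).to_bytes(2, order) if bom else b""
    return prefix + b"".join(u.to_bytes(2, order) for u in units)


def parse_utf16_scheme(data: bytes) -> list[int]:
    """Read bytes in the UTF-16 encoding scheme and return the code units."""
    if data[:2] == b"\xFE\xFF":
        order, body = "big", data[2:]
    elif data[:2] == b"\xFF\xFE":
        order, body = "little", data[2:]
    else:
        order, body = "big", data  # no BOM: big-endian by default
    return [int.from_bytes(body[i:i + 2], order) for i in range(0, len(body), 2)]


units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]
print(serialize_utf16(units, "UTF-16LE", bom=True).hex(" ").upper())
```

The parser does not validate surrogate pairing or odd byte counts; it only shows how the byte order is chosen.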

D99 UTF-32BE encoding scheme: The Unicode encoding scheme that serializes a UTF-32 code unit sequence as a byte sequence in big-endian format. • In UTF-32BE, the UTF-32 code unit sequence <0000004D 00000430 00004E8C 00010302> is serialized as <00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02>. • In UTF-32BE, an initial byte sequence <00 00 FE FF> is interpreted as U+FEFF zero width no-break space. D100 UTF-32LE encoding scheme: The Unicode encoding scheme that serializes a UTF-32 code unit sequence as a byte sequence in little-endian format.

• In UTF-32LE, the UTF-32 code unit sequence <0000004D 00000430 00004E8C 00010302> is serialized as <4D 00 00 00 30 04 00 00 8C 4E 00 00 02 03 01 00>.
• In UTF-32LE, an initial byte sequence <FF FE 00 00> is interpreted as U+FEFF zero width no-break space.

D101 UTF-32 encoding scheme: The Unicode encoding scheme that serializes a UTF-32 code unit sequence as a byte sequence in either big-endian or little-endian format.

• In the UTF-32 encoding scheme, the UTF-32 code unit sequence <0000004D 00000430 00004E8C 00010302> is serialized as <00 00 FE FF 00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02> or <FF FE 00 00 4D 00 00 00 30 04 00 00 8C 4E 00 00 02 03 01 00> or <00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02>.
• In the UTF-32 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark; it is used to distinguish between the two byte orders. An initial byte sequence <00 00 FE FF> indicates big-endian order, and an initial byte sequence <FF FE 00 00> indicates little-endian order. The BOM is not considered part of the content of the text.
• The UTF-32 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-32 encoding scheme is big-endian.

Table 3-10 gives examples that summarize the three Unicode encoding schemes for the UTF-32 encoding form.

Table 3-10. Summary of UTF-32BE, UTF-32LE, and UTF-32

Code Unit Sequence   Encoding Scheme   Byte Sequence(s)
0000004D             UTF-32BE          00 00 00 4D
                     UTF-32LE          4D 00 00 00
                     UTF-32            00 00 FE FF 00 00 00 4D
                                       FF FE 00 00 4D 00 00 00
                                       00 00 00 4D
00000430             UTF-32BE          00 00 04 30
                     UTF-32LE          30 04 00 00
                     UTF-32            00 00 FE FF 00 00 04 30
                                       FF FE 00 00 30 04 00 00
                                       00 00 04 30
00004E8C             UTF-32BE          00 00 4E 8C
                     UTF-32LE          8C 4E 00 00
                     UTF-32            00 00 FE FF 00 00 4E 8C
                                       FF FE 00 00 8C 4E 00 00
                                       00 00 4E 8C
00010302             UTF-32BE          00 01 03 02
                     UTF-32LE          02 03 01 00
                     UTF-32            00 00 FE FF 00 01 03 02
                                       FF FE 00 00 02 03 01 00
                                       00 01 03 02

The terms UTF-8, UTF-16, and UTF-32, when used unqualified, are ambiguous between their sense as Unicode encoding forms or Unicode encoding schemes. For UTF-8, this ambiguity is usually innocuous, because the UTF-8 encoding scheme is trivially derived from the byte sequences defined for the UTF-8 encoding form. However, for UTF-16 and UTF-32, the ambiguity is more problematical. As encoding forms, UTF-16 and UTF-32 refer to code units in memory; there is no associated byte orientation, and a BOM is never used. As encoding schemes, UTF-16 and UTF-32 refer to serialized bytes, as for streaming data or in files; they may have either byte orientation, and a BOM may be present.

When the usage of the short terms “UTF-16” or “UTF-32” might be misinterpreted, and where a distinction between their use as referring to Unicode encoding forms or to Unicode encoding schemes is important, the full terms, as defined in this chapter of the Unicode Standard, should be used. For example, use UTF-16 encoding form or UTF-16 encoding scheme. These terms may also be abbreviated to UTF-16 CEF or UTF-16 CES, respectively.

When converting between different encoding schemes, extreme care must be taken in handling any initial byte order marks. For example, if one converted a UTF-16 byte serialization with an initial byte order mark to a UTF-8 byte serialization, thereby converting the byte order mark to <EF BB BF> in the UTF-8 form, the <EF BB BF> would now be ambiguous as to its status as a byte order mark (from its source) or as an initial zero width no-break space. If the UTF-8 byte serialization were then converted to UTF-16BE and the initial <EF BB BF> were converted to <FE FF>, the interpretation of the U+FEFF character would have been modified by the conversion. This would be nonconformant behavior according to conformance clause C7, because the change between byte serializations would have resulted in modification of the interpretation of the text. This is one reason why the use of the initial byte sequence <EF BB BF> as a signature on UTF-8 byte sequences is not recommended by the Unicode Standard.
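The ambiguity of an initial U+FEFF can be observed with Python's standard codecs, assuming CPython's behavior for the `utf-8`, `utf-8-sig`, `utf-16`, and `utf-16-be` codec names. The sketch only demonstrates that the same leading bytes are kept as a character by one codec and dropped as a signature by another; it does not decide which interpretation is right for a given data stream.

```python
# <EF BB BF> followed by "M" in UTF-8.
utf8_bytes = b"\xEF\xBB\xBF\x4D"
kept = utf8_bytes.decode("utf-8")          # U+FEFF stays in the text as a character
stripped = utf8_bytes.decode("utf-8-sig")  # leading <EF BB BF> is treated as a signature

# <FE FF> followed by "M" in big-endian 16-bit units.
utf16_bytes = b"\xFE\xFF\x00\x4D"
as_be = utf16_bytes.decode("utf-16-be")    # UTF-16BE: initial U+FEFF is content
as_scheme = utf16_bytes.decode("utf-16")   # UTF-16 scheme: initial U+FEFF is a BOM

for label, text in [("utf-8", kept), ("utf-8-sig", stripped),
                    ("utf-16-be", as_be), ("utf-16", as_scheme)]:
    print(label, [f"U+{ord(c):04X}" for c in text])
```

A conversion pipeline therefore needs to carry the knowledge of which scheme the bytes were in, since the bytes alone do not say whether the initial U+FEFF is a byte order mark or a zero width no-break space.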

3.11 Normalization Forms The concepts of canonical equivalent (D70) or compatibility equivalent (D67) characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which are compared directly for identity. This section provides the formal definitions of the four Unicode Normalization Forms. It defines the Canonical Ordering Algorithm and the Canonical Composition Algorithm which are used to convert Unicode strings to one of the Unicode Normalization Forms for comparison. It also formally defines Unicode Combining Classes—values assigned to all Unicode characters and used by the Canonical Ordering Algorithm. Note: In versions of the Unicode Standard up to Version 5.1.0, the Unicode Normalization Forms and the Canonical Composition Algorithm were defined in Unicode Standard Annex #15, “Unicode Normalization Forms.” Those definitions have now been consolidated in this chapter, for clarity of exposition of the normative definitions and algorithms involved in Unicode normalization. However, because implementation of Unicode normalization is quite complex, implementers are still advised to fully consult Unicode Standard Annex #15, “Unicode Normalization Forms,” which contains more detailed explanations, examples, and implementation strategies. Unicode normalization should be carefully distinguished from Unicode collation. Both processes involve comparison of Unicode strings. However, the point of Unicode normalization is to make a determination of canonical (or compatibility) equivalence or nonequivalence of strings—it does not provide any rank-ordering information about those strings. Unicode collation, on the other hand, is designed to provide orderable weights or “keys” for strings; those keys can then be used to sort strings into ordered lists. 
Unicode normalization is not tailorable; normalization equivalence relationships between strings are exact and unchangeable. Unicode collation, on the other hand, is designed to be tailorable to allow many kinds of localized and other specialized orderings of strings. For more information, see Unicode Technical Standard #10, “Unicode Collation Algorithm.”
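As a small illustration of normalization used purely for equivalence testing (not ordering), the following Java sketch compares strings after normalizing both sides with the JDK's `java.text.Normalizer`. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class EquivalenceSketch {
    // Two strings are canonically equivalent if their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // Two strings are compatibility equivalent if their NFKD forms are identical.
    static boolean compatibilityEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKD));
    }

    public static void main(String[] args) {
        // U+00E9 versus <U+0065, U+0301>
        System.out.println(canonicallyEquivalent("\u00E9", "e\u0301"));
        // U+FB01 (fi ligature) versus "fi"
        System.out.println(canonicallyEquivalent("\uFB01", "fi"));
        System.out.println(compatibilityEquivalent("\uFB01", "fi"));
    }
}
```

The result is a yes/no answer only. Sorting the same strings would require collation keys, for example from `java.text.Collator`, which is a separate and tailorable process.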


Conformance

D102 [Moved to Section 3.6, Combination and renumbered as D61a.] D103 [Moved to Section 3.6, Combination and renumbered as D61b.]

Normalization Stability A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard. In order to ensure this stability, there are strong constraints on changes of any character properties that are involved in the specification of normalization—in particular, the combining class and the decomposition of characters. The details of those constraints are spelled out in the Normalization Stability Policy. See the subsection “Policies” in Section B.6, Other Unicode Online Resources. The requirement for stability of normalization also constrains what kinds of characters can be encoded in future versions of the standard. For an extended discussion of this topic, see Section 3, Versioning and Stability, in Unicode Standard Annex #15, “Unicode Normalization Forms.”

Combining Classes

Each character in the Unicode Standard has a combining class associated with it. The combining class is a numerical value used by the Canonical Ordering Algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not. Canonical equivalence is the criterion used to determine whether two character sequences are considered identical for interpretation.

D104 Combining class: A numeric value in the range 0..254 given to each Unicode code point, formally defined as the property Canonical_Combining_Class.
• The combining class for each encoded character in the standard is specified in the file UnicodeData.txt in the Unicode Character Database. Any code point not listed in that data file defaults to \p{Canonical_Combining_Class = 0} (or \p{ccc = 0} for short).
• An extracted listing of combining classes, sorted by numeric value, is provided in the file DerivedCombiningClass.txt in the Unicode Character Database.
• Only combining marks have a combining class other than zero. Almost all combining marks with a class other than zero are also nonspacing marks, with a few exceptions. Also, not all nonspacing marks have a non-zero combining class. Thus, while the correlation between ^\p{ccc=0} and \p{gc=Mn} is close, it is not exact, and implementations should not depend on the two concepts being identical.

D105 Fixed position class: A subset of the range of numeric values for combining classes: specifically, any value in the range 10..199.
• Fixed position classes are assigned to a small number of Hebrew, Arabic, Syriac, Telugu, Thai, Lao, and Tibetan combining marks whose positions were conceived of as occurring in a fixed position with respect to their grapheme base, regardless of any other combining mark that might also apply to the grapheme base.
• Not all Arabic vowel points or Indic matras are given fixed position classes.
The existence of fixed position classes in the standard is an historical artifact of an earlier stage in its development, prior to the formal standardization of the Unicode Normalization Forms.


D106 Typographic interaction: Graphical application of one nonspacing mark in a position relative to a grapheme base that is already occupied by another nonspacing mark, so that some rendering adjustment must be done (such as default stacking or side-by-side placement) to avoid illegible overprinting or crashing of glyphs.

The assignment of combining class values for Unicode characters was originally done with the goal in mind of defining distinct numeric values for each group of nonspacing marks that would typographically interact. Thus all generic nonspacing marks placed above the base character are given the same value, \p{ccc=230}, while all generic nonspacing marks placed below are given the value \p{ccc=220}. Nonspacing marks that tend to sit on one “shoulder” or another of a grapheme base, or that may actually be attached to the grapheme base itself when applied, have their own combining classes.

The design of canonical ordering generally assures that:
• When two combining characters C1 and C2 do typographically interact, the sequence C1 + C2 is not canonically equivalent to C2 + C1.
• When two combining characters C1 and C2 do not typographically interact, the sequence C1 + C2 is canonically equivalent to C2 + C1.

This is roughly correct for the normal cases of detached, generic nonspacing marks placed above and below base letters. However, the ramifications of complex rendering for many scripts ensure that there are always some edge cases involving typographic interaction between combining marks of distinct combining classes. This has turned out to be particularly true for some of the fixed position classes for Hebrew and Arabic, for which a distinct combining class is no guarantee that there will be no typographic interaction for rendering. Because of these considerations, particular combining class values should be taken only as a guideline regarding issues of typographic interaction of combining marks.
The only normative use of combining class values is as input to the Canonical Ordering Algorithm, where they are used to normatively distinguish between sequences of combining marks that are canonically equivalent and those that are not.

Specification of Unicode Normalization Forms The specification of Unicode Normalization Forms applies to all Unicode coded character sequences (D12). For clarity of exposition in the definitions and rules specified here, the terms “character” and “character sequence” are used, but coded character sequences refer also to sequences containing noncharacters or reserved code points. Unicode Normalization Forms are specified for all Unicode code points, and not just for ordinary, assigned graphic characters.

Starters

D107 Starter: Any code point (assigned or not) with combining class of zero (ccc=0).
• Note that ccc=0 is the default value for the Canonical_Combining_Class property, so that all reserved code points are Starters by definition. Noncharacters are also Starters by definition. All control characters, format characters, and private-use characters are also Starters.
• Private agreements cannot override the value of the Canonical_Combining_Class property for private-use characters.

Among the graphic characters, all those with General_Category values other than gc=M are Starters. Some combining marks have ccc=0 and thus are also Starters. Combining marks with ccc other than 0 are not Starters. Table 3-11 summarizes the relationship between types of combining marks and their status as Starters.

Table 3-11. Combining Marks and Starter Status

Description   gc   ccc   Starter
Nonspacing    Mn   0     Yes
Nonspacing    Mn   >0    No
Spacing       Mc   0     Yes
Spacing       Mc   >0    No
Enclosing     Me   0     Yes

The term Starter refers, in concept, to the starting character of a combining character sequence (D56), because all combining character sequences except defective combining character sequences (D57) commence with a ccc=0 character—in other words, they start with a Starter. However, because the specification of Unicode Normalization Forms must apply to all possible coded character sequences, and not just to typical combining character sequences, the behavior of a code point for Unicode Normalization Forms is specified entirely in terms of its status as a Starter or a non-starter, together with its Decomposition_Mapping value.

Canonical Ordering Algorithm D108 Reorderable pair: Two adjacent characters A and B in a coded character sequence <A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0. D109 Canonical Ordering Algorithm: In a decomposed character sequence D, exchange the positions of the characters in each Reorderable Pair until the sequence contains no more Reorderable Pairs. • In effect, the Canonical Ordering Algorithm is a local bubble sort that guarantees that a Canonical Decomposition or a Compatibility Decomposition will contain no subsequences in which a combining mark is followed directly by another combining mark that has a lower, non-zero combining class. • Canonical ordering is defined in terms of application of the Canonical Ordering Algorithm to an entire decomposed sequence. For example, canonical decomposition of the sequence <U+1E0B latin small letter d with dot above, U+0323 combining dot below> would result in the sequence <U+0064 latin small letter d, U+0307 combining dot above, U+0323 combining dot below>, a sequence which is not yet in canonical order. Most decompositions for Unicode strings are already in canonical order. Table 3-12 gives some examples of sequences of characters, showing which of them constitute a Reorderable Pair and the reasons for that determination. Except for the base character “a”, the other characters in the example table are combining marks; character names are abbreviated in the Sequence column to make the examples clearer.

Table 3-12. Reorderable Pairs

Sequence              Combining Classes   Reorderable?   Reason
<a, acute>            0, 230              No             ccc(A)=0
<acute, a>            230, 0              No             ccc(B)=0
<diaeresis, acute>    230, 230            No             ccc(A)=ccc(B)
<cedilla, acute>      202, 230            No             ccc(A)<ccc(B)
<acute, cedilla>      230, 202            Yes            ccc(A)>ccc(B)
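The Reorderable Pair test (D108) and the Canonical Ordering Algorithm (D109) can be sketched in Java as a bubble sort over code points. This is an illustrative sketch only: the class name is hypothetical, and the `ccc` method is a hand-written lookup covering just the five marks used in these examples, whereas a real implementation would take Canonical_Combining_Class values from UnicodeData.txt.

```java
final class CanonicalOrderingSketch {
    // Hypothetical mini-table: only the marks used in this example are listed.
    static int ccc(int cp) {
        switch (cp) {
            case 0x0327: return 202; // combining cedilla
            case 0x0323: return 220; // combining dot below
            case 0x0301:             // combining acute accent
            case 0x0307:             // combining dot above
            case 0x0308: return 230; // combining diaeresis
            default:     return 0;   // everything else is treated as a Starter here
        }
    }

    // D108: <A, B> is a Reorderable Pair iff ccc(A) > ccc(B) > 0.
    static boolean isReorderablePair(int a, int b) {
        return ccc(a) > ccc(b) && ccc(b) > 0;
    }

    // D109: exchange Reorderable Pairs until none remain.
    static int[] canonicalOrder(int[] decomposed) {
        int[] s = decomposed.clone();
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int i = 0; i + 1 < s.length; i++) {
                if (isReorderablePair(s[i], s[i + 1])) {
                    int tmp = s[i];
                    s[i] = s[i + 1];
                    s[i + 1] = tmp;
                    swapped = true;
                }
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // <d, dot above, dot below> is reordered to <d, dot below, dot above>.
        int[] ordered = canonicalOrder(new int[] {0x0064, 0x0307, 0x0323});
        for (int cp : ordered) System.out.printf("U+%04X%n", cp);
    }
}
```

Because a Starter (ccc=0) can never be part of a Reorderable Pair, marks are only ever exchanged with other marks attached to the same base, which is why the sort is local.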


Canonical Composition Algorithm

D110 Singleton decomposition: A canonical decomposition mapping from a character to a different single character.
• The default value for the Decomposition_Mapping property for a code point (including any private-use character, any noncharacter, and any unassigned code point) is the code point itself. This default value does not count as a singleton decomposition, because it does not map a character to a different character. Private agreements cannot override the decomposition mapping for private-use characters.
• Example: U+2126 ohm sign has a singleton decomposition to U+03A9 greek capital letter omega.
• A character with a singleton decomposition is often referred to simply as a singleton for short.

D110a Expanding canonical decomposition: A canonical decomposition mapping from a character to a sequence of more than one character.

D110b Starter decomposition: An expanding canonical decomposition for which both the character being mapped and the first character of the resulting sequence are Starters.
• Definitions D110a and D110b are introduced to simplify the following definition of non-starter decomposition and make it more precise.

D111 Non-starter decomposition: An expanding canonical decomposition which is not a starter decomposition.
• Example: U+0344 combining greek dialytika tonos has an expanding canonical decomposition to the sequence <U+0308 combining diaeresis, U+0301 combining acute accent>. U+0344 is a non-starter, and the first character in its decomposition is a non-starter. Therefore, on two counts, U+0344 has a non-starter decomposition.
• Example: U+0F73 tibetan vowel sign ii has an expanding canonical decomposition to the sequence <U+0F71 tibetan vowel sign aa, U+0F72 tibetan vowel sign i>. The first character in that sequence is a non-starter. Therefore U+0F73 has a non-starter decomposition, even though U+0F73 is a Starter.
• As of the current version of the standard, there are no instances of the third possible situation: a non-starter character with an expanding canonical decomposition to a sequence whose first character is a Starter.

D112 Composition exclusion: A Canonical Decomposable Character (D69) which has the property value Composition_Exclusion=True.
• The list of Composition Exclusions is provided in CompositionExclusions.txt in the Unicode Character Database.

D113 Full composition exclusion: A Canonical Decomposable Character which has the property value Full_Composition_Exclusion=True.
• Full composition exclusions consist of the entire list of composition exclusions plus all characters with singleton decompositions or with non-starter decompositions.
• For convenience in implementation of Unicode normalization, the derived property Full_Composition_Exclusion is computed, and all characters with the property value Full_Composition_Exclusion=True are listed in DerivedNormalizationProps.txt in the Unicode Character Database.


D114 Primary composite: A Canonical Decomposable Character (D69) which is not a Full Composition Exclusion.
• For any given version of the Unicode Standard, the list of primary composites can be computed by extracting all canonical decomposable characters from UnicodeData.txt in the Unicode Character Database, adding the list of precomposed Hangul syllables (D132), and subtracting the list of Full Composition Exclusions.

D115 Blocked: Let A and C be two characters in a coded character sequence <A, ... C>. C is blocked from A if and only if ccc(A)=0 and there exists some character B between A and C in the coded character sequence, i.e., <A, ... B, ... C>, and either ccc(B)=0 or ccc(B) >= ccc(C).
• Because the Canonical Composition Algorithm operates on a string which is already in canonical order, testing whether a character is blocked requires looking only at the immediately preceding character in the string.

D116 Non-blocked pair: A pair of characters <A, ... C> in a coded character sequence, in which C is not blocked from A.
• It is important for proper implementation of the Canonical Composition Algorithm to be aware that a Non-blocked Pair need not be contiguous.

D117 Canonical Composition Algorithm: Starting from the second character in the coded character sequence (of a Canonical Decomposition or Compatibility Decomposition) and proceeding sequentially to the final character, perform the following steps:

R1 Seek back (left) in the coded character sequence from the character C to find the last Starter L preceding C in the character sequence.

R2 If there is such an L, and C is not blocked from L, and there exists a Primary Composite P which is canonically equivalent to the sequence <L, C>, then replace L by P in the sequence and delete C from the sequence.

• When the algorithm completes, all Non-blocked Pairs canonically equivalent to a Primary Composite will have been systematically replaced by those Primary Composites.
• The replacement of the Starter L in R2 requires continuing to check the succeeding characters until the character at that position is no longer part of any Non-blocked Pair that can be replaced by a Primary Composite. For example, consider the following hypothetical coded character sequence: <U+007A z, U+0335 short stroke overlay, U+0327 cedilla, U+0324 diaeresis below, U+0301 acute>. None of the first three combining marks forms a Primary Composite with the letter z. However, the fourth combining mark in the sequence, acute, does form a Primary Composite with z, and it is not Blocked from the z. Therefore, R2 mandates the replacement of the sequence <U+007A z, ... U+0301 acute> with <U+017A z-acute, ...>, even though there are three other combining marks intervening in the sequence.
• The character C in R1 is not necessarily a non-starter. It is necessary to check all characters in the sequence, because there are sequences <L, C> where both L and C are Starters, yet there is a Primary Composite P which is canonically equivalent to that sequence. For example, Indic two-part vowels often have canonical decompositions into sequences of two spacing vowel signs, each of which has Canonical_Combining_Class=0 and which is thus a Starter by definition. Nevertheless, such a decomposed sequence has an equivalent Primary Composite.
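The effects of R1 and R2 can be observed with the JDK's `java.text.Normalizer`, which applies canonical composition when producing NFC. The sketch below does not implement the algorithm itself. It only exercises a library implementation on three inputs: the z sequence from the text, a blocked case, and a Kannada two-part vowel. The last two inputs, like the class name, are chosen here for illustration and do not come from the standard.

```java
import java.text.Normalizer;

final class CompositionSketch {
    // NFC = canonical decomposition followed by canonical composition.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Render a string as a list of code points for inspection.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Non-contiguous Non-blocked Pair: acute composes with z across three marks.
        System.out.println(hex(nfc("z\u0335\u0327\u0324\u0301")));
        // Blocked: diaeresis has the same combining class as acute, so acute cannot reach z.
        System.out.println(hex(nfc("z\u0308\u0301")));
        // Both parts are Starters, yet they compose: <U+0CC6, U+0CC2> becomes U+0CCA.
        System.out.println(hex(nfc("\u0C95\u0CC6\u0CC2")));
    }
}
```

In the second input no Primary Composite exists for z with diaeresis, and the diaeresis (ccc=230) blocks the following acute (ccc=230), so the sequence is left unchanged.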


Definition of Normalization Forms The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences. Each is then differentiated based on whether it employs a Canonical Decomposition or a Compatibility Decomposition. D118 Normalization Form D (NFD): The Canonical Decomposition of a coded character sequence. D119 Normalization Form KD (NFKD): The Compatibility Decomposition of a coded character sequence. D120 Normalization Form C (NFC): The Canonical Composition of the Canonical Decomposition of a coded character sequence. D121 Normalization Form KC (NFKC): The Canonical Composition of the Compatibility Decomposition of a coded character sequence. Logically, to get the NFD or NFKD (maximally decomposed) normalization form for a Unicode string, one first computes the full decomposition of that string and then applies the Canonical Ordering Algorithm to it. Logically, to get the NFC or NFKC (maximally composed) normalization form for a Unicode string, one first computes the NFD or NFKD normalization form for that string, and then applies the Canonical Composition Algorithm to it.
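To make the four forms concrete, the following Java sketch normalizes two sample characters with `java.text.Normalizer`. The class name and the choice of sample characters (U+212B angstrom sign and U+FB01 latin small ligature fi) are assumptions made for this illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

final class NormalizationFormsSketch {
    static String normalize(String s, Form form) {
        return Normalizer.normalize(s, form);
    }

    // Render a string as a list of code points for inspection.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {"\u212B", "\uFB01"};
        for (String s : samples) {
            for (Form form : Form.values()) {
                System.out.println(hex(s) + " " + form + " " + hex(normalize(s, form)));
            }
        }
    }
}
```

U+212B has a singleton canonical decomposition, so it changes under all four forms and never recomposes to itself. U+FB01 has only a compatibility decomposition, so NFD and NFC leave it alone while NFKD and NFKC map it to "fi".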

3.12 Conjoining Jamo Behavior The Unicode Standard contains both a large set of precomposed modern Hangul syllables and a set of conjoining Hangul jamo, which can be used to encode archaic Korean syllable blocks as well as modern Korean syllable blocks. This section describes how to • Determine the canonical decomposition of precomposed Hangul syllables. • Compose jamo characters into precomposed Hangul syllables. • Algorithmically determine the names of precomposed Hangul syllables. For more information, see the “Hangul Syllables” and “Hangul Jamo” subsections in Section 12.6, Hangul. Hangul syllables are a special case of grapheme clusters. For the algorithm to determine syllable boundaries in a sequence of conjoining jamo characters, see Section 8, “Hangul Syllable Boundary Determination” in Unicode Standard Annex #29, “Unicode Text Segmentation.”

Definitions The following definitions use the Hangul_Syllable_Type property, which is defined in the UCD file HangulSyllableType.txt. D122 Leading consonant: A character with the Hangul_Syllable_Type property value Leading_Jamo. Abbreviated as L. • When not occurring in clusters, the term leading consonant is equivalent to syllable-initial character. D123 Choseong: A sequence of one or more leading consonants. • In Modern Korean, a choseong consists of a single jamo. In Old Korean, a sequence of more than one leading consonant may occur.

• Equivalent to syllable-initial cluster.

D124 Choseong filler: U+115F hangul choseong filler. Abbreviated as Lf. • A choseong filler stands in for a missing choseong to make a well-formed Korean syllable. D125 Vowel: A character with the Hangul_Syllable_Type property value Vowel_Jamo. Abbreviated as V. • When not occurring in clusters, the term vowel is equivalent to syllable-peak character. D126 Jungseong: A sequence of one or more vowels. • In Modern Korean, a jungseong consists of a single jamo. In Old Korean, a sequence of more than one vowel may occur. • Equivalent to syllable-peak cluster. D127 Jungseong filler: U+1160 hangul jungseong filler. Abbreviated as Vf. • A jungseong filler stands in for a missing jungseong to make a well-formed Korean syllable. D128 Trailing consonant: A character with the Hangul_Syllable_Type property value Trailing_Jamo. Abbreviated as T. • When not occurring in clusters, the term trailing consonant is equivalent to syllable-final character. D129 Jongseong: A sequence of one or more trailing consonants. • In Modern Korean, a jongseong consists of a single jamo. In Old Korean, a sequence of more than one trailing consonant may occur. • Equivalent to syllable-final cluster. D130 LV_Syllable: A character with Hangul_Syllable_Type property value LV_Syllable. Abbreviated as LV. • An LV_Syllable has a canonical decomposition to a sequence of the form <L, V>. D131 LVT_Syllable: A character with Hangul_Syllable_Type property value LVT_Syllable. Abbreviated as LVT. • An LVT_Syllable has a canonical decomposition to a sequence of the form <LV, T>. D132 Precomposed Hangul syllable: A character that is either an LV_Syllable or an LVT_Syllable. D133 Syllable block: A sequence of Korean characters that should be grouped into a single square cell for display. • This is different from a precomposed Hangul syllable and is meant to include sequences needed for the representation of Old Korean syllables. • A syllable block may contain a precomposed Hangul syllable plus other characters. 
D134 Standard Korean syllable block: A sequence of one or more L followed by a sequence of one or more V and a sequence of zero or more T, or any other sequence that is canonically equivalent. • All precomposed Hangul syllables, which have the form LV or LVT, are standard Korean syllable blocks.


• Alternatively, a standard Korean syllable block may be expressed as a sequence of a choseong and a jungseong, optionally followed by a jongseong. • A choseong filler may substitute for a missing leading consonant, and a jungseong filler may substitute for a missing vowel. • This definition is used in Unicode Standard Annex #29, “Unicode Text Segmentation,” as part of the algorithm for determining syllable boundaries in a sequence of conjoining jamo characters.

Hangul Syllable Decomposition

The following algorithm specifies how to take a precomposed Hangul syllable s and arithmetically derive its full canonical decomposition d. This normative mapping for precomposed Hangul syllables is referenced by D68, Canonical decomposition, in Section 3.7, Decomposition.

This algorithm, as well as the other Hangul-related algorithms defined in the following text, is first specified in pseudo-code. Then each is exemplified, showing its application to a particular Hangul character or sequence. The Hangul characters used in those examples are shown in Table 3-13. Finally, each algorithm is further exemplified with an implementation as a Java method at the end of this section.

Table 3-13. Hangul Characters Used in Examples

Code Point   Glyph   Character Name                 Jamo Short Name
U+D4DB       풛      hangul syllable pwilh          n/a
U+1111       ᄑ      hangul choseong phieuph        P
U+1171       ᅱ      hangul jungseong wi            WI
U+11B6       ᆶ      hangul jongseong rieul-hieuh   LH

Common Constants. Define the following constants:

    SBase = AC00₁₆
    LBase = 1100₁₆
    VBase = 1161₁₆
    TBase = 11A7₁₆
    LCount = 19
    VCount = 21
    TCount = 28
    NCount = 588 (VCount * TCount)
    SCount = 11172 (LCount * NCount)

TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 11A8₁₆ + 1) + 1. NCount is thus the number of precomposed Hangul syllables starting with the same leading consonant, counting both the LV_Syllables and the LVT_Syllables for each possible trailing consonant. SCount is the total number of precomposed Hangul syllables.

Syllable Index. First compute the index of the precomposed Hangul syllable s:

    SIndex = s - SBase


Arithmetic Decomposition Mapping. If the precomposed Hangul syllable s with the index SIndex (defined above) has the Hangul_Syllable_Type value LV, then it has a canonical decomposition mapping into a sequence of an L jamo and a V jamo, <LPart, VPart>:

    LIndex = SIndex div NCount
    VIndex = (SIndex mod NCount) div TCount
    LPart = LBase + LIndex
    VPart = VBase + VIndex

If the precomposed Hangul syllable s with the index SIndex (defined above) has the Hangul_Syllable_Type value LVT, then it has a canonical decomposition mapping into a sequence of an LV_Syllable and a T jamo, <LVPart, TPart>:

    LVIndex = (SIndex div TCount) * TCount
    TIndex = SIndex mod TCount
    LVPart = SBase + LVIndex
    TPart = TBase + TIndex

In this specification, the “div” operator refers to integer division (rounded down). The “mod” operator refers to the modulo operation, equivalent to the integer remainder for positive numbers. The canonical decomposition mappings calculated this way are equivalent to the values of the Unicode character property Decomposition_Mapping (dm), for each precomposed Hangul syllable.

Full Canonical Decomposition. The full canonical decomposition for a Unicode character is defined as the recursive application of canonical decomposition mappings. The canonical decomposition mapping of an LVT_Syllable contains an LVPart which itself is a precomposed Hangul syllable and thus must be further decomposed. However, it is simplest to unwind the recursion and directly calculate the resulting <LPart, VPart, TPart> sequence instead. For full canonical decomposition of a precomposed Hangul syllable, compute the indices and components as follows:

    LIndex = SIndex div NCount
    VIndex = (SIndex mod NCount) div TCount
    TIndex = SIndex mod TCount
    LPart = LBase + LIndex
    VPart = VBase + VIndex
    TPart = TBase + TIndex    if TIndex > 0

If TIndex = 0, then there is no trailing consonant, so map the precomposed Hangul syllable s to its full decomposition d = <LPart, VPart>. Otherwise, there is a trailing consonant, so map s to its full decomposition d = <LPart, VPart, TPart>.

Example. For the precomposed Hangul syllable U+D4DB, compute the indices and components:

    SIndex = 10459
    LIndex = 17
    VIndex = 16
    TIndex = 15
    LPart = LBase + 17 = 1111₁₆
    VPart = VBase + 16 = 1171₁₆
    TPart = TBase + 15 = 11B6₁₆

Then map the precomposed syllable to the calculated sequence of components, which constitute its full canonical decomposition: U+D4DB → <U+1111, U+1171, U+11B6>


Note that the canonical decomposition mapping for U+D4DB would be <U+D4CC, U+11B6>, but in computing the full canonical decomposition, that sequence would only be an intermediate step.
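The relationship between the one-step canonical decomposition mapping and the full canonical decomposition can be sketched in Java as follows. The class and method names are hypothetical, and the recursive formulation is used here only to mirror the definition.

```java
final class HangulMappingSketch {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
            TCount = 28, NCount = 588, SCount = 11172;

    // One-step canonical decomposition mapping: <L, V> for an LV_Syllable,
    // <LV, T> for an LVT_Syllable, and the code point itself otherwise.
    static int[] decompositionMapping(int cp) {
        int sIndex = cp - SBase;
        if (sIndex < 0 || sIndex >= SCount) {
            return new int[] {cp};
        }
        int tIndex = sIndex % TCount;
        if (tIndex == 0) {
            return new int[] {LBase + sIndex / NCount, VBase + (sIndex % NCount) / TCount};
        }
        return new int[] {SBase + (sIndex / TCount) * TCount, TBase + tIndex};
    }

    // Full canonical decomposition: apply the mapping recursively.
    static String fullDecomposition(int cp) {
        int[] mapping = decompositionMapping(cp);
        if (mapping.length == 1) {
            return new String(Character.toChars(cp));
        }
        StringBuilder sb = new StringBuilder();
        for (int part : mapping) {
            sb.append(fullDecomposition(part));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] step = decompositionMapping(0xD4DB);
        System.out.printf("U+%04X U+%04X%n", step[0], step[1]);
        fullDecomposition(0xD4DB).codePoints()
                .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

For U+D4DB the first step yields the intermediate pair <U+D4CC, U+11B6>, and the recursion then expands U+D4CC into <U+1111, U+1171>.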

Hangul Syllable Composition

The following algorithm specifies how to take a canonically decomposed sequence of Hangul jamo characters d and arithmetically derive its mapping to an equivalent precomposed Hangul syllable s. This normative mapping can be used to calculate the Primary Composite for a sequence of Hangul jamo characters, as specified in D117, Canonical Composition Algorithm, in Section 3.11, Normalization Forms. Strictly speaking, this algorithm is simply the inverse of the full canonical decomposition mappings specified by the Hangul Syllable Decomposition Algorithm. However, it is useful to have a summary specification of that inverse mapping as a separate algorithm, for convenience in implementation.

Note that the presence of any non-jamo starter or any combining character between two of the jamos in the sequence d would constitute a blocking context, and would prevent canonical composition. See D115, Blocked, in Section 3.11, Normalization Forms.

Arithmetic Primary Composite Mapping. Given a Hangul jamo sequence <LPart, VPart>, where the LPart is in the range U+1100..U+1112, and where the VPart is in the range U+1161..U+1175, compute the indices and syllable mapping:

    LIndex = LPart - LBase
    VIndex = VPart - VBase
    LVIndex = LIndex * NCount + VIndex * TCount
    s = SBase + LVIndex

Given a Hangul jamo sequence <LPart, VPart, TPart>, where the LPart is in the range U+1100..U+1112, where the VPart is in the range U+1161..U+1175, and where the TPart is in the range U+11A8..U+11C2, compute the indices and syllable mapping:

    LIndex = LPart - LBase
    VIndex = VPart - VBase
    TIndex = TPart - TBase
    LVIndex = LIndex * NCount + VIndex * TCount
    s = SBase + LVIndex + TIndex

The mappings just specified deal with canonically decomposed sequences of Hangul jamo characters. However, for completeness, the following mapping is also defined to deal with cases in which Hangul data is not canonically decomposed. Given a sequence <LVPart, TPart>, where the LVPart is a precomposed Hangul syllable of Hangul_Syllable_Type LV, and where the TPart is in the range U+11A8..U+11C2, compute the index and syllable mapping:

    TIndex = TPart - TBase
    s = LVPart + TIndex

Example. For the canonically decomposed Hangul jamo sequence <U+1111, U+1171, U+11B6>, compute the indices and syllable mapping:

    LIndex = 17
    VIndex = 16
    TIndex = 15
    LVIndex = 17 * 588 + 16 * 28 = 9996 + 448 = 10444
    s = AC00₁₆ + 10444 + 15 = D4DB₁₆

Then map the Hangul jamo sequence to this precomposed Hangul syllable as its Primary Composite: <U+1111, U+1171, U+11B6> → U+D4DB
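The two pairwise mappings, <LPart, VPart> and <LVPart, TPart>, can be sketched in Java with explicit range checks. The class and method names are hypothetical, and returning -1 for "no composite" is a convention chosen for this sketch, not something the standard specifies.

```java
final class HangulComposeSketch {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
            LCount = 19, VCount = 21, TCount = 28,
            NCount = VCount * TCount,  // 588
            SCount = LCount * NCount;  // 11172

    // <LPart, VPart> to an LV syllable, or -1 if either jamo is out of range.
    static int composeLV(int l, int v) {
        int lIndex = l - LBase;
        int vIndex = v - VBase;
        if (lIndex < 0 || lIndex >= LCount || vIndex < 0 || vIndex >= VCount) {
            return -1;
        }
        return SBase + lIndex * NCount + vIndex * TCount;
    }

    // <LVPart, TPart> to an LVT syllable, or -1 if lv is not an LV syllable
    // or t is not a trailing consonant in U+11A8..U+11C2.
    static int composeLVT(int lv, int t) {
        int sIndex = lv - SBase;
        int tIndex = t - TBase;
        boolean isLV = sIndex >= 0 && sIndex < SCount && sIndex % TCount == 0;
        if (!isLV || tIndex <= 0 || tIndex >= TCount) {
            return -1;
        }
        return lv + tIndex;
    }

    public static void main(String[] args) {
        int lv = composeLV(0x1111, 0x1171);
        System.out.printf("U+%04X%n", lv);
        System.out.printf("U+%04X%n", composeLVT(lv, 0x11B6));
    }
}
```

Composing in two steps gives the same result as the three-part formula, because LVIndex is always a multiple of TCount and TIndex is added on top of it.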


Hangul Syllable Name Generation

The Unicode character names for precomposed Hangul syllables are derived algorithmically from the Jamo_Short_Name property values for each of the Hangul jamo characters in the full canonical decomposition of that syllable. That derivation is specified here.

Full Canonical Decomposition. First construct the full canonical decomposition d for the precomposed Hangul syllable s, as specified by the Hangul Syllable Decomposition Algorithm:

    s → d = <LPart, VPart, (TPart)>

Jamo Short Name Mapping. For each part of the full canonical decomposition d, look up the Jamo_Short_Name property value, as specified in Jamo.txt in the Unicode Character Database. If there is no TPart in the full canonical decomposition, then the third value is set to be a null string:

    JSNL = Jamo_Short_Name(LPart)
    JSNV = Jamo_Short_Name(VPart)
    JSNT = Jamo_Short_Name(TPart)    if TPart exists, else ""

Name Concatenation. The Unicode character name for s is then constructed by starting with the constant string “HANGUL SYLLABLE” and then concatenating each of the three Jamo short name values, in order:

    Name = "HANGUL SYLLABLE " + JSNL + JSNV + JSNT

Example. For the precomposed Hangul syllable U+D4DB, construct the full canonical decomposition: U+D4DB → <U+1111, U+1171, U+11B6>

Look up the Jamo_Short_Name values for each of the Hangul jamo in the canonical decomposition: JSNL = Jamo_Short_Name(U+1111) = "P" JSNV = Jamo_Short_Name(U+1171) = "WI" JSNT = Jamo_Short_Name(U+11B6) = "LH"

Concatenate the pieces: Name = "HANGUL SYLLABLE " + "P" + "WI" + "LH" = "HANGUL SYLLABLE PWILH"

Sample Code for Hangul Algorithms

This section provides sample Java code illustrating the three Hangul-related algorithms.

Common Constants. This code snippet defines the common constants used in the methods that follow.

    static final int
        SBase = 0xAC00,
        LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
        LCount = 19, VCount = 21, TCount = 28,
        NCount = VCount * TCount,   // 588
        SCount = LCount * NCount;   // 11172

Hangul Decomposition. The Hangul Decomposition Algorithm as specified above directly decomposes precomposed Hangul syllable characters into a sequence of either two or three Hangul jamo characters. The sample method here does precisely that:

    public static String decomposeHangul(char s) {
        int SIndex = s - SBase;
        // Not a precomposed Hangul syllable: return the character unchanged
        if (SIndex < 0 || SIndex >= SCount) {
            return String.valueOf(s);
        }
        StringBuffer result = new StringBuffer();
        int L = LBase + SIndex / NCount;
        int V = VBase + (SIndex % NCount) / TCount;
        int T = TBase + SIndex % TCount;
        result.append((char)L);
        result.append((char)V);
        // T == TBase means there is no trailing consonant
        if (T != TBase) result.append((char)T);
        return result.toString();
    }

The Hangul Decomposition Algorithm could also be expressed equivalently as a recursion of binary decompositions, as is the case for other non-Hangul characters. All LVT syllables would decompose into an LV syllable plus a T jamo. The LV syllables themselves would in turn decompose into an L jamo plus a V jamo. This approach can be used to produce somewhat more compact code than what is illustrated in this sample method.

Hangul Composition. An important feature of Hangul composition is that whenever the source string is not in Normalization Form D or Normalization Form KD, it is not enough to detect only character sequences of the form <L, V> and <L, V, T>. It is also necessary to catch the sequences of the form <LV, T>. To guarantee uniqueness, such sequences must also be composed. This extra processing is illustrated in step 2 of the sample method defined here.

    public static String composeHangul(String source) {
        int len = source.length();
        if (len == 0) return "";
        StringBuffer result = new StringBuffer();
        char last = source.charAt(0);            // copy first char
        result.append(last);

        for (int i = 1; i < len; ++i) {
            char ch = source.charAt(i);

            // 1. check to see if two current characters are L and V
            int LIndex = last - LBase;
            if (0 <= LIndex && LIndex < LCount) {
                int VIndex = ch - VBase;
                if (0 <= VIndex && VIndex < VCount) {
                    // make syllable of form LV
                    last = (char)(SBase + (LIndex * VCount + VIndex) * TCount);
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // 2. check to see if two current characters are LV and T
            int SIndex = last - SBase;
            if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
                int TIndex = ch - TBase;
                if (0 < TIndex && TIndex < TCount) {
                    // make syllable of form LVT
                    last += TIndex;
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // if neither case was true, just add the character
            last = ch;
            result.append(ch);
        }
        return result.toString();
    }
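One way to sanity-check an implementation of the decomposition and composition methods is to compare against the JDK's own normalizer, since canonical decomposition and composition of Hangul syllables are part of NFD and NFC. The sketch below is an assumption-labeled illustration: the class name HangulNormalizerCheck and its helper methods are hypothetical, while java.text.Normalizer is the standard JDK class.

```java
import java.text.Normalizer;

// Hypothetical cross-check class; only java.text.Normalizer is a real JDK API here.
final class HangulNormalizerCheck {
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Precomposed syllable decomposes to <L, V, T>
        System.out.println(nfd("\uD4DB").equals("\u1111\u1171\u11B6"));
        // <L, V, T> composes back to the syllable
        System.out.println(nfc("\u1111\u1171\u11B6").equals("\uD4DB"));
        // The non-canonical <LV, T> form also composes to the same syllable
        System.out.println(nfc("\uD4CC\u11B6").equals("\uD4DB"));
    }
}
```

The third check corresponds to step 2 of the sample composition method: input that mixes a precomposed LV syllable with a trailing jamo still has to end up as a single LVT syllable.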

Hangul Character Name Generation. Hangul decomposition is also used when generating the names for precomposed Hangul syllables. This is apparent in the following sample method for constructing a Hangul syllable name. The content of the three tables used in this method can be derived from the data file Jamo.txt in the Unicode Character Database.

    public static String getHangulName(char s) {
        int SIndex = s - SBase;
        if (0 > SIndex || SIndex >= SCount) {
            throw new IllegalArgumentException("Not a Hangul Syllable: " + s);
        }
        int LIndex = SIndex / NCount;
        int VIndex = (SIndex % NCount) / TCount;
        int TIndex = SIndex % TCount;
        // TIndex 0 selects the empty string: no trailing consonant
        return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex]
            + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex];
    }

    static private String[] JAMO_L_TABLE = {
        "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
        "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
    };

    static private String[] JAMO_V_TABLE = {
        "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
        "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
        "YU", "EU", "YI", "I"
    };

    static private String[] JAMO_T_TABLE = {
        "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
        "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
        "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
    };

Additional Transformations for Hangul Jamo. Additional transformations can be performed on sequences of Hangul jamo for various purposes. For example, to regularize sequences of Hangul jamo into standard Korean syllable blocks, the choseong or jungseong fillers can be inserted, as described in Unicode Standard Annex #29, “Unicode Text Segmentation.”

For keyboard input, additional compositions may be performed. For example, a sequence of trailing consonants k_f + s_f may be combined into a single, complex jamo ks_f. In addition, some Hangul input methods do not require a distinction on input between initial and final consonants, and may instead change between them on the basis of context. For example, in the keyboard sequence m_i + e_m + n_i + s_i + a_m, the consonant n_i would be reinterpreted as n_f, because there is no possible syllable nsa. This results in the two syllables men and sa.

3.13 Default Case Algorithms

This section specifies the default algorithms for case conversion, case detection, and caseless matching. For information about the data sources for case mapping, see Section 4.2, Case. For a general discussion of case mapping operations, see Section 5.18, Case Mappings. All of these specifications are logical specifications. Particular implementations can optimize the processes as long as they provide the same results.

Tailoring. The default casing operations are intended for use in the absence of tailoring for particular languages and environments. Where a particular environment requires tailoring of casing operations to produce correct results, use of such tailoring does not violate conformance to the standard.

Data that assist the implementation of certain tailorings are published in SpecialCasing.txt in the Unicode Character Database. Most notably, these include:
• Casing rules for the Turkish dotted capital I and dotless small i.
• Casing rules for the retention of dots over i for Lithuanian letters with additional accents.

Examples of case tailorings which are not covered by data in SpecialCasing.txt include:
• Titlecasing of IJ at the start of words in Dutch
• Removal of accents when uppercasing letters in Greek
• Titlecasing of second or subsequent letters in words in orthographies that include caseless letters such as apostrophes
• Uppercasing of U+00DF “ß” latin small letter sharp s to U+1E9E latin capital letter sharp s

The preferred mechanism for defining tailored casing operations is the Unicode Common Locale Data Repository (CLDR), where tailorings such as these can be specified on a per-language basis, as needed.

Tailorings of case operations may or may not be desired, depending on the nature of the implementation in question. For more about complications in case mapping, see the discussion in Section 5.18, Case Mappings.
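The Turkish tailoring can be observed in Java, whose String case methods accept a locale. The following is a minimal sketch under stated assumptions: the class TurkishCasingSketch and its method names are hypothetical, and the behavior shown is that of the JDK's String.toLowerCase(Locale) and String.toUpperCase(Locale), not a normative statement about the standard.

```java
import java.util.Locale;

// Hypothetical demo class; String.toLowerCase/toUpperCase(Locale) are real JDK methods.
final class TurkishCasingSketch {
    /** Lowercases with the tailoring for the given BCP 47 language tag. */
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    /** Uppercases with the tailoring for the given BCP 47 language tag. */
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    /** Lowercases without any language tailoring (root locale). */
    static String defaultLower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(defaultLower("I").equals("i"));       // untailored: I -> i
        System.out.println(lower("I", "tr").equals("\u0131"));   // Turkish: I -> dotless i
        System.out.println(upper("i", "tr").equals("\u0130"));   // Turkish: i -> dotted capital I
    }
}
```

Passing Locale.ROOT (or an explicit language) rather than relying on the platform default locale keeps the choice between default and tailored casing visible at the call site.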

Definitions

The full case mappings for Unicode characters are obtained by using the mappings from SpecialCasing.txt plus the mappings from UnicodeData.txt, excluding any of the latter mappings that would conflict. Any character that does not have a mapping in these files is considered to map to itself. The full case mappings of a character C are referred to as Lowercase_Mapping(C), Titlecase_Mapping(C), and Uppercase_Mapping(C). The full case folding of a character C is referred to as Case_Folding(C).

Detection of case and case mapping requires more than just the General_Category values (Lu, Lt, Ll). The following definitions are used:

D135 A character C is defined to be cased if and only if C has the Lowercase or Uppercase property or has a General_Category value of Titlecase_Letter.
• The Uppercase and Lowercase property values are specified in the data file DerivedCoreProperties.txt in the Unicode Character Database. The derived property Cased is also listed in DerivedCoreProperties.txt.

D136 A character C is defined to be case-ignorable if C has the value MidLetter or the value MidNumLet for the Word_Break property or its General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk).
• The Word_Break property is defined in the data file WordBreakProperty.txt in the Unicode Character Database.
• The derived property Case_Ignorable is listed in the data file DerivedCoreProperties.txt in the Unicode Character Database.
• The Case_Ignorable property is defined for use in the context specifications of Table 3-14. It is a narrow-use property, and is not intended for use in other contexts. The more broadly applicable string casing function, isCased(X), is defined in D143.

D137 Case-ignorable sequence: A sequence of zero or more case-ignorable characters.

D138 A character C is in a particular casing context for context-dependent matching if and only if it matches the corresponding specification in Table 3-14.

Table 3-14. Context Specification for Casing

Context: Final_Sigma
Description: C is preceded by a sequence consisting of a cased letter and then zero or more case-ignorable characters, and C is not followed by a sequence consisting of zero or more case-ignorable characters and then a cased letter.
Regular expressions:
  Before C: \p{cased} (\p{case-ignorable})*
  After C: ! ( (\p{case-ignorable})* \p{cased} )

Context: After_Soft_Dotted
Description: There is a Soft_Dotted character before C, with no intervening character of combining class 0 or 230 (Above).
Regular expressions:
  Before C: [\p{Soft_Dotted}] ([^\p{ccc=230} \p{ccc=0}])*

Context: More_Above
Description: C is followed by a character of combining class 230 (Above) with no intervening character of combining class 0 or 230 (Above).
Regular expressions:
  After C: [^\p{ccc=230}\p{ccc=0}]* [\p{ccc=230}]

Context: Before_Dot
Description: C is followed by combining dot above (U+0307). Any sequence of characters with a combining class that is neither 0 nor 230 may intervene between the current character and the combining dot above.
Regular expressions:
  After C: ([^\p{ccc=230} \p{ccc=0}])* [\u0307]

Context: After_I
Description: There is an uppercase I before C, and there is no intervening combining character class 230 (Above) or 0.
Regular expressions:
  Before C: [I] ([^\p{ccc=230} \p{ccc=0}])*

In Table 3-14, a description of each context is followed by the equivalent regular expression(s) describing the context before C, the context after C, or both. The regular expressions use the syntax of Unicode Technical Standard #18, “Unicode Regular Expressions,” with one addition: “!” means that the expression does not match. All of the regular expressions are case-sensitive.

The regular-expression operator * in Table 3-14 is “possessive,” consuming as many characters as possible, with no backup. This is significant in the case of Final_Sigma, because the sets of case-ignorable and cased characters are not disjoint: for example, they both contain U+0345 ypogegrammeni. Thus, the Before condition is not satisfied if C is preceded by only U+0345, but would be satisfied by the sequence <capital-alpha, ypogegrammeni>. Similarly, the After condition is satisfied if C is only followed by ypogegrammeni, but would not be satisfied by the sequence <ypogegrammeni, capital-alpha>.

Copyright © 1991–2012 Unicode, Inc.

Default Case Conversion

The following rules specify the default case conversion operations for Unicode strings. These rules use the full case conversion operations, Uppercase_Mapping(C), Lowercase_Mapping(C), and Titlecase_Mapping(C), as well as the context-dependent mappings based on the casing context, as specified in Table 3-14.

For a string X:

R1 toUppercase(X): Map each character C in X to Uppercase_Mapping(C).

R2 toLowercase(X): Map each character C in X to Lowercase_Mapping(C).

R3 toTitlecase(X): Find the word boundaries in X according to Unicode Standard Annex #29, “Unicode Text Segmentation.” For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).

The default case conversion operations may be tailored for specific requirements. A common variant, for example, is to make use of simple case conversion, rather than full case conversion. Language- or locale-specific tailorings of these rules may also be used.
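The difference between full and simple case conversion, and the effect of the Final_Sigma context, can be illustrated in Java. This is a hedged sketch: the class DefaultCaseConversionSketch and its method names are hypothetical, it assumes the JDK's String.toUpperCase(Locale.ROOT) and String.toLowerCase(Locale.ROOT) behave like untailored full case conversion for these inputs, and it uses Character.toUpperCase(int) as a stand-in for a simple one-to-one mapping.

```java
import java.util.Locale;

// Hypothetical demo class built on real JDK methods.
final class DefaultCaseConversionSketch {
    /** Full uppercase conversion without language tailoring. */
    static String toUppercase(String x) {
        return x.toUpperCase(Locale.ROOT);
    }

    /** Full lowercase conversion without language tailoring. */
    static String toLowercase(String x) {
        return x.toLowerCase(Locale.ROOT);
    }

    /** Simple (one code point to one code point) uppercase conversion. */
    static String simpleUppercase(String x) {
        StringBuilder sb = new StringBuilder();
        x.codePoints().forEach(cp -> sb.appendCodePoint(Character.toUpperCase(cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Full mapping may change length: sharp s becomes "SS"
        System.out.println(toUppercase("stra\u00DFe"));
        // Simple mapping leaves sharp s untouched
        System.out.println(simpleUppercase("stra\u00DFe"));
        // Capital sigma at the end of a word lowercases to final sigma (U+03C2)
        System.out.println(toLowercase("\u039F\u0394\u039F\u03A3"));
    }
}
```

The simple variant never changes the number of code points, which is why some implementations prefer it where string length must be preserved.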

Default Case Folding

Case folding is related to case conversion. However, the main purpose of case folding is to contribute to caseless matching of strings, whereas the main purpose of case conversion is to put strings into a particular cased form.

Unicode Default Case Folding is built on the toLowercase(X) transform, with some adaptations specifically for caseless matching. Context-dependent mappings based on the casing context are not used.

Default Case Folding does not preserve normalization forms. A string in a particular Unicode normalization form may not be in that normalization form after it has been casefolded.

Default Case Folding is based on the full case conversion operation, Lowercase_Mapping, which includes conversions to lowercase forms that may change string length, but is adapted specifically for caseless matching. In particular, any two strings which are considered to be case variants of each other under any of the full case conversions, toUppercase(X), toLowercase(X), or toTitlecase(X) will fold to the same string by the toCasefold(X) operation:

R4 toCasefold(X): Map each character C in X to Case_Folding(C).
• Case_Folding(C) uses the mappings with the status field value “C” or “F” in the data file CaseFolding.txt in the Unicode Character Database.

A modified form of Default Case Folding is designed for best behavior when doing caseless matching of strings interpreted as identifiers. This folding is based on Case_Folding(C), but also removes any characters which have the Unicode property value Default_Ignorable_Code_Point=True. It also maps characters to their NFKC equivalent sequences. Once the mapping for a string is complete, the resulting string is then normalized to NFC. That last normalization step simplifies the statement of the use of this folding for caseless matching.

R5 toNFKC_Casefold(X): Map each character C in X to NFKC_Casefold(C) and then normalize the resulting string to NFC.
• The mapping NFKC_Casefold (short alias NFKC_CF) is specified in the data file DerivedNormalizationProps.txt in the Unicode Character Database.
• The derived binary property Changes_When_NFKC_Casefolded is also listed in the data file DerivedNormalizationProps.txt in the Unicode Character Database.

For more information on the use of NFKC_Casefold and caseless matching for identifiers, see Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax.”

Default Case Detection

The casing status of a string can be determined by using the casing operations defined earlier. The following definitions provide a specification. They assume that X and Y are strings. In the following, functional names beginning with “is” are binary functions which take the string X and return true when the string as a whole matches the given casing status. For example, isLowercase(X) would be true if the string X as a whole is lowercase. In contrast, the Unicode character properties such as Lowercase are properties of individual characters.

For each definition, there is also a related Unicode character property which has a name beginning with “Changes_When_”. That property indicates whether each character is affected by a particular casing operation; it can be used to optimize implementations of Default Case Detection for strings.

When case conversion is applied to a string that is decomposed (or more precisely, normalized to NFD), applying the case conversion character by character does not affect the normalization status of the string. Therefore, these definitions are specified in terms of Normalization Form NFD. To make the definitions easier to read, they adopt the convention that the string Y equals toNFD(X).

D139 isLowercase(X): isLowercase(X) is true when toLowercase(Y) = Y.
• For example, isLowercase(“combining mark”) is true, and isLowercase(“Combining mark”) is false.
• The derived binary property Changes_When_Lowercased is listed in the data file DerivedCoreProperties.txt in the Unicode Character Database.

D140 isUppercase(X): isUppercase(X) is true when toUppercase(Y) = Y.
• For example, isUppercase(“COMBINING MARK”) is true, and isUppercase(“Combining mark”) is false.
• The derived binary property Changes_When_Uppercased is listed in the data file DerivedCoreProperties.txt in the Unicode Character Database.

D141 isTitlecase(X): isTitlecase(X) is true when toTitlecase(Y) = Y.
• For example, isTitlecase(“Combining Mark”) is true, and isTitlecase(“Combining mark”) is false.
• The derived binary property Changes_When_Titlecased is listed in the data file DerivedCoreProperties.txt in the Unicode Character Database.

D142 isCasefolded(X): isCasefolded(X) is true when toCasefold(Y) = Y.
• For example, isCasefolded(“heiss”) is true, and isCasefolded(“heiß”) is false.
• The derived binary property Changes_When_Casefolded is listed in the data file DerivedCoreProperties.txt in the Unicode Character Database.

Uncased characters do not affect the results of casing detection operations such as the string function isLowercase(X). Thus a space or a number added to a string does not affect the results.

The examples in Table 3-15 show that these conditions are not mutually exclusive. “A2” is both uppercase and titlecase; “3” is uncased, so it is simultaneously lowercase, uppercase, and titlecase.

Table 3-15. Case Detection Examples

Case        Letter   Name         Alphanumeric   Digit
Lowercase   a        john smith   a2             3
Uppercase   A        JOHN SMITH   A2             3
Titlecase   A        John Smith   A2             3

Only when a string, such as “123”, contains no cased letters will all three conditions (isLowercase, isUppercase, and isTitlecase) evaluate as true. This combination of conditions can be used to check for the presence of cased letters, using the following definition:

D143 isCased(X): isCased(X) is true when isLowercase(X) is false, or isUppercase(X) is false, or isTitlecase(X) is false.
• Any string X for which isCased(X) is true contains at least one character that has a case mapping other than to itself.
• For example, isCased(“123”) is false because all the characters in “123” have case mappings to themselves, while isCased(“abc”) and isCased(“A12”) are both true.
• The derived binary property Changes_When_Casemapped is listed in the data file DerivedCoreProperties.txt in the Unicode Character Database.

To find out whether a string contains only lowercase letters, implementations need to test for (isLowercase(X) and isCased(X)).
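The case detection definitions translate fairly directly into Java. The sketch below is illustrative only: the class CaseDetectionSketch is hypothetical, it assumes the JDK's root-locale String case methods approximate toLowercase and toUppercase, and because the JDK has no string-level titlecase operation it approximates isCased using only the lowercase and uppercase tests (an assumption that drops the isTitlecase term of D143).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch; isTitlecase is deliberately omitted (no JDK string titlecasing).
final class CaseDetectionSketch {
    /** Y = toNFD(X), as in the definitions. */
    static String nfd(String x) {
        return Normalizer.normalize(x, Normalizer.Form.NFD);
    }

    /** D139: true when lowercasing the NFD form changes nothing. */
    static boolean isLowercase(String x) {
        String y = nfd(x);
        return y.toLowerCase(Locale.ROOT).equals(y);
    }

    /** D140: true when uppercasing the NFD form changes nothing. */
    static boolean isUppercase(String x) {
        String y = nfd(x);
        return y.toUpperCase(Locale.ROOT).equals(y);
    }

    /** Approximation of D143 without the isTitlecase term. */
    static boolean isCased(String x) {
        return !isLowercase(x) || !isUppercase(x);
    }

    /** Lowercase and actually containing at least one cased character. */
    static boolean isLowercaseAndCased(String x) {
        return isLowercase(x) && isCased(x);
    }

    public static void main(String[] args) {
        System.out.println(isLowercase("combining mark"));  // true
        System.out.println(isLowercase("Combining mark"));  // false
        System.out.println(isCased("123"));                 // false
        System.out.println(isLowercaseAndCased("a2"));      // true
    }
}
```

The combined test in isLowercaseAndCased is what distinguishes a string like “a2” from an uncased string like “3”, which also passes isLowercase on its own.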

Default Caseless Matching

Default caseless matching is the process of comparing two strings for case-insensitive equality. The definitions of Unicode Default Caseless Matching build on the definitions of Unicode Default Case Folding.

Default Caseless Matching uses full case folding:

D144 A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

When comparing strings for case-insensitive equality, the strings should also be normalized for most correct results. For example, the case folding of U+00C5 Å latin capital letter a with ring above is U+00E5 å latin small letter a with ring above, whereas the case folding of the sequence <U+0041 “A” latin capital letter a, U+030A combining ring above> is the sequence <U+0061 “a” latin small letter a, U+030A combining ring above>. Simply doing a binary comparison of the results of case folding both strings will not catch the fact that the resulting case-folded strings are canonical-equivalent sequences. In principle, normalization needs to be done after case folding, because case folding does not preserve the normalized form of strings in all instances. This requirement for normalization is covered in the following definition for canonical caseless matching:

D145 A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

The invocations of canonical decomposition (NFD normalization) before case folding in D145 are to catch very infrequent edge cases. Normalization is not required before case folding, except for the character U+0345 combining greek ypogegrammeni and any characters that have it as part of their canonical decomposition, such as U+1FC3 ῃ greek small letter eta with ypogegrammeni. In practice, optimized versions of canonical caseless matching can catch these special cases, thereby avoiding an extra normalization step for each comparison.

In some instances, implementers may wish to ignore compatibility differences between characters when comparing strings for case-insensitive equality. The correct way to do this makes use of the following definition for compatibility caseless matching:

D146 A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))

Compatibility caseless matching requires an extra cycle of case folding and normalization for each string compared, because the NFKD normalization of a compatibility character such as U+3392 square mhz may result in a sequence of alphabetic characters which must again be case folded (and normalized) to be compared correctly.

Caseless matching for identifiers can be simplified and optimized by using the NFKC_Casefold mapping. That mapping incorporates internally the derived results of iterated case folding and NFKD normalization. It also maps away characters with the property value Default_Ignorable_Code_Point=True, which should not make a difference when comparing identifiers.

The following defines identifier caseless matching:

D147 A string X is an identifier caseless match for a string Y if and only if:
toNFKC_Casefold(NFD(X)) = toNFKC_Casefold(NFD(Y))
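The shape of D144 and D145 can be sketched in Java. The JDK does not expose a Case_Folding operation, so this example substitutes an approximation (uppercase, then lowercase, in the root locale); that substitution is an assumption made for illustration and is not the CaseFolding.txt mapping. The class CaselessMatchSketch and its method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch; approxCasefold is a stand-in, NOT the CaseFolding.txt mapping.
final class CaselessMatchSketch {
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Rough approximation of toCasefold(X): uppercase, then lowercase. */
    static String approxCasefold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    /** Shape of D144: compare folded strings directly. */
    static boolean caselessMatch(String x, String y) {
        return approxCasefold(x).equals(approxCasefold(y));
    }

    /** Shape of D145: NFD before and after folding. */
    static boolean canonicalCaselessMatch(String x, String y) {
        return nfd(approxCasefold(nfd(x))).equals(nfd(approxCasefold(nfd(y))));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C5";   // A with ring above
        String decomposed = "A\u030A";   // A + combining ring above
        System.out.println(caselessMatch(precomposed, decomposed));          // false
        System.out.println(canonicalCaselessMatch(precomposed, decomposed)); // true
    }
}
```

The two printed results reproduce the point made about U+00C5: folding alone leaves canonically equivalent strings different at the binary level, and the surrounding NFD steps remove that difference.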


Developing OpenType Fonts for Kannada Script

Introduction

Microsoft Typography
September 2008

Please note: This document reflects the changes made in 2005 for recommendations for Indic-script OpenType font and shaping-engine implementations. While Indic fonts made according to the earlier recommendations will still function properly in new versions of Uniscribe, font developers may wish to update their fonts, particularly if they wish to avoid certain limitations of the earlier implementation.

This document presents information that will help font developers create or support OpenType fonts for Kannada script languages covered by the Unicode Standard. The Kannada script is used to write the Kannada language and is closely related to the Telugu script.

Contents
• Introduction
• Shaping Engine
• Features
• Appendices

Introduction

This document targets developers implementing Indic shaping behavior compatible with the Microsoft OpenType specification for Indic scripts. It contains information about terminology, font features and behavior of the Indic shaping engine with regard to the Kannada script. While it does not contain instructions for creating Kannada fonts, it will help font developers understand how the Indic shaping engine processes Indic text. In addition, registered features of the Kannada script are defined and illustrated with examples.

The new Indic shaping engine allows for variations in typographic conventions, giving a font developer control over shaping by the choice of designation of glyphs to certain OpenType features. For example, the location where the reph and pre-pended matra are re-ordered within a syllable cluster is affected by the presence of a half form. See illustrations below.

Glossary

The following terms are useful for understanding the layout features and script rules discussed in this document.

Above-base form of consonants - A variant form of a consonant that appears above the base glyph.

Akhand ligatures - Required consonant ligatures that may appear anywhere in the syllable, and may or may not involve the base glyph. Akhand ligatures have the highest priority and are formed first; some languages include them in their alphabets. Akhand ligatures may be displayed in either half- or full-form.

Base glyph - The only consonant or consonant conjunct in the syllable that is written in its "full" (nominal) form. In Kannada, the last consonant of the syllable (except for syllables ending with letter "Ra") usually forms the base glyph. In "degenerate" syllables that have no vowel (last letter of a word), the last consonant in halant form serves as the base consonant and is mapped as the base glyph. Layout operations are defined in terms of a base glyph, not a base character, since the base can often be a ligature.

Below-base form of consonants - A variant form of a consonant that appears below the base glyph. In the glyph sequence, the below-base form comes after the consonant(s) that form the base glyph. Below-base forms are represented by a nonspacing mark glyph.

Cluster - A group of characters that form an integral unit in Indic scripts, often a syllable.

Consonant - Each represents a single consonant sound. Consonants may exist in different contextual forms and have an inherent vowel (usually, the short vowel "a"). For example, "Ka" and "Ta", rather than just "K" or "T."

Consonant conjuncts (aka "conjuncts") - Ligatures of two or more consonants. Consonant conjuncts may have both full and half forms, or only full forms.

Halant (Virama) - The character used after a consonant to "strip" it of its inherent vowel. A Halant follows all but the last consonant in every Kannada syllable. NOTE: A syllable containing halant characters may be shaped with no visible halant signs by using different consonant forms or conjuncts instead.

Halant form of consonants - The form produced by adding the halant (virama) to the nominal shape. The Halant form is used in syllables that have no vowel or as the half form when no distinct shape for the half form exists.

Half form of consonants (pre-base form) - A variant form of consonants which appear to the left of the base consonant, if they do not participate in a ligature. Consonants in their half form precede the ones forming the base glyph. Some Indic scripts, like Devanagari, have distinctly shaped half forms for most of the consonants. If no distinct shape exists, the full form will display with an explicit Virama (same shape as the halant form).

Kannada syllable - Effective orthographic "unit" of Kannada writing systems. Syllables are composed of consonant letters, independent vowels, and dependent vowels. In a text sequence, these characters are stored in phonetic order (although they may not be represented in phonetic order when displayed). Once a syllable is shaped, it is indivisible. The cursor cannot be positioned within the syllable. Transformations discussed in this document do not cross syllable boundaries.

Matra (Dependent Vowel) - Used to represent a vowel sound that is not inherent to the consonant. Dependent vowels are referred to as "matras" in Sanskrit. They are always depicted in combination with a single consonant, or with a consonant cluster. The greatest variation among different Indian scripts is found in the rules for attaching dependent vowels to base characters.

New shaping behavior - Shaping behavior defined in this version of the Indic OpenType Font Specification. Information in this document relates primarily to the new implementation model. Old behavior may be mentioned in comments about compatibility.

Nukta - A combining character that alters the way a preceding consonant (or matra) is pronounced.

Old shaping behavior - Shaping behavior defined in previous versions of the Indic OpenType Font Specification.

OpenType layout engine - Library responsible for executing OpenType layout features in a font. In the Microsoft text formatting stack, it is named OTLS (OpenType layout services).

OpenType tag - 4-byte identifier for script, language system or feature in the font.

Post-base form of consonants - A variant form of a consonant that appears to the right of the base glyph. A consonant that takes a post-base form is preceded by the consonant(s) forming the base glyph plus a halant (virama). Post-base forms are usually spacing glyphs.

Pre-base form of consonants - A variant form of a consonant that appears to the left of the base glyph. Note that most pre-base consonant forms are logically as well as visually before the base consonant. Half forms are examples of this kind of pre-base form. In some scripts, though, a pre-base Ra may logically follow the base consonant (that is, it follows it phonetically and in the character sequence of the text), even though it is presented visually before the base. The shaping engine detects such cases dynamically using the <pref> feature and re-orders the pre-base-form glyph as needed.

Reph - The above-base form of the letter "Ra", when "Ra" is the first consonant in the syllable and is not the base consonant.

Shaping Engine - Code responsible for shaping input, classified to a particular script.

Split Matra - A matra that is decomposed into pieces for rendering. Usually the different pieces appear in different positions relative to the base. For instance, part of the matra may be placed at the beginning of the cluster and another part at the end of the cluster.

Syllable - A single unit of Indic text processing. Shaping of Indic text is performed independently for each syllable. The process of identifying the boundaries of each syllable is described below.

Vattu - A below-base form of a consonant.

Example in Devanagari script 1. Pre-base form 2. The base consonant 3. Above-base form (reph) 4. Post-base (matra) 5. Below-base form (vattu/rakaar)

Shaping Engine

• Analyze the text
• Reorder characters
• Shape glyph sequences (GSUB processing)
• Position glyph sequences (GPOS processing)
• Base elements
• Invalid combining marks
• Use of ZWJ, ZWNJ and NBSP

The Indic shaping engine processes Kannada text in stages. The stages are:
1. Analyze the text sequence, breaking it into syllable clusters
2. Reorder the characters as necessary
3. Apply OpenType GSUB font features to get the correct glyph shape
4. Apply OpenType GPOS features to position glyphs or marks

The descriptions which follow will help font developers understand the rationale for the Kannada feature encoding model, and help application developers better understand how layout clients can divide responsibilities with operating system functions.

Analyze the text

Character properties

The shaping engine divides the text into syllable clusters and identifies character properties. Character properties are used in parsing syllables and identifying their parts, in determining proper character or glyph reordering, and in OpenType feature application. Properties for each character are divided into two types: static properties and dynamic properties.

Static properties define basic characteristics that do not change from font to font: character type (consonant, matra, vedic sign, etc.) or type of matra reordering. They differ from script to script, but cannot be controlled by the font developer.

Dynamic properties are font dependent and are retrieved by the shaping engine as the font is loaded. These properties affect shaping and reordering behavior.

*Note: in old shaping-engine implementations, all consonant properties were static: consonants were assumed to have particular conjoining forms. In the new implementation model, consonant conjoining behavior is a dynamic property.

Retrieving dynamic character properties from Indic fonts

Fonts define dynamic properties for consonants through implementing standard features. Consonant types (and corresponding feature tags) that the shaping engine reads from the font are:

• Reph <rphf>
• Half forms <half>
• Pre-base-reordering forms of Ra/Rra <pref>
• Below-base forms <blwf>
• Post-base forms <pstf>

Each of the features above is applied together with the <locl> feature to input sequences consisting of two characters: for <rphf> and <half>, features are applied to Consonant + Halant combinations; for <pref>, <blwf> and <pstf>, features are applied to Halant + Consonant combinations. This is done for each consonant. If these two glyphs form a ligature, with no additional glyphs in context, this means the consonant has the corresponding form. For instance, if a substitution occurs when the <half> and <locl> features are applied to a sequence Da + Halant, then Da is classified as having a half form.

Note that a font may be implemented to re-order a Ra to pre-base position only in certain syllables and display it as a below-base or post-base form otherwise. This means that the Pre-base-form classification is not mutually exclusive with either Below-base-form or Post-base-form classifications. However, all classifications are determined as described above using context-free substitutions.

Font-dependent character classification only defines consonant types. Reordering positions, however, are fixed for each character class.

*Note: for fonts that support the old implementation, all features are applied to Consonant + Halant sequences.
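The probing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TOY_GSUB` dictionary is a hypothetical stand-in for a font's context-free GSUB substitutions, and the glyph names and the `classify_consonant` helper are invented for the example. A real shaping engine would query the loaded font instead.

```python
# Hypothetical stand-in for a font's GSUB data: maps (feature tag, glyph pair)
# to the ligature glyph produced. A real engine would query the font instead.
HALANT = "halant"

TOY_GSUB = {
    ("rphf", ("ra", HALANT)): "reph",
    ("blwf", (HALANT, "ka")): "ka.below",
    ("blwf", (HALANT, "ra")): "ra.below",
}

# Features probed with Consonant + Halant input versus Halant + Consonant input.
CONSONANT_FIRST = ("rphf", "half")
HALANT_FIRST = ("pref", "blwf", "pstf")


def classify_consonant(consonant, gsub=TOY_GSUB):
    """Return the set of feature tags for which `consonant` has a form."""
    forms = set()
    for tag in CONSONANT_FIRST:
        # The two glyphs must ligate with no additional context.
        if (tag, (consonant, HALANT)) in gsub:
            forms.add(tag)
    for tag in HALANT_FIRST:
        if (tag, (HALANT, consonant)) in gsub:
            forms.add(tag)
    return forms
```

Because each feature is probed independently, a consonant can end up with several classifications at once, which matches the remark that the pre-base classification is not mutually exclusive with the below-base or post-base ones.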

Indic input processing

The following steps should be repeated while there are characters left in the input sequence. All shaping operations are done on a syllable-by-syllable basis, independent of other characters.

Find next syllable in the input

The engine should find the character sequence matching one of the patterns below:

Consonant syllable:
{C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>} + C+[N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]

Vowel-based syllable:
[Ra+H]+V+[N]+[<[<ZWJ|ZWNJ>]+H+C|ZWJ+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

Stand Alone cluster (at the start of the word only):
#[Ra+H]+NBSP+[N]+[<[<ZWJ|ZWNJ>]+H+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]

Where:

{}      zero or more occurrences
[]      optional occurrence
<|>     'one of'
()      one or two occurrences
C       consonant
V       independent vowel
N       nukta
H       halant/virama
ZWNJ    zero width non-joiner
ZWJ     zero width joiner
M       matra (up to one of each type: pre-, above-, below- or post-base)
SM      syllable modifier signs
VD      vedic
A       anudatta (U+0952)
NBSP    NO-BREAK SPACE
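As a rough illustration of the consonant-syllable pattern, the sketch below maps each character to a one-letter class and matches the pattern with a regular expression. The class letters, the `char_class` and `next_consonant_syllable` helpers, and the simplified code point ranges are assumptions made for this example; the limit of one matra per positional type is not enforced, and the vowel-based and stand-alone patterns are not covered.

```python
import re

ZWNJ, ZWJ = "\u200c", "\u200d"


def char_class(ch):
    """Map a character to a one-letter class (simplified Kannada ranges)."""
    cp = ord(ch)
    if 0x0C95 <= cp <= 0x0CB9 or cp == 0x0CDE:
        return "C"  # consonant
    if cp == 0x0CBC:
        return "N"  # nukta
    if cp == 0x0CCD:
        return "H"  # halant/virama
    if 0x0CBE <= cp <= 0x0CCC or cp in (0x0CD5, 0x0CD6):
        return "M"  # matra
    if cp in (0x0C82, 0x0C83):
        return "S"  # syllable modifier sign (SM)
    if ch in (ZWNJ, ZWJ):
        return "J"  # either joiner
    if cp == 0x0952:
        return "A"  # anudatta
    if cp == 0x0951:
        return "D"  # vedic sign (VD)
    return "X"      # anything else


# {C+[N]+<H+[J]|J+H>} + C+[N]+[A] + [<H+[J] | {M}+[N]+[H]>] + [SM] + [(VD)]
CONSONANT_SYLLABLE = re.compile(
    r"(?:CN?(?:HJ?|JH))*CN?A?(?:HJ?|M*N?H?)?S?D{0,2}"
)


def next_consonant_syllable(text, start=0):
    """Return the consonant syllable beginning at `start`, or '' if none."""
    classes = "".join(char_class(c) for c in text[start:])
    m = CONSONANT_SYLLABLE.match(classes)
    return text[start:start + m.end()] if m else ""
```

For example, Ka + Halant + Ssa followed by Ma is split after Ssa, because Ma is not preceded by a halant and therefore starts a new syllable.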

Identify key positions inside syllable

Syllable structure consists of the following parts:

Reph + HalfConsonant(s) + MainConsonant(s) + BelowBaseConsonant(s) + PostBaseConsonant(s) + PreBaseReorderingRa + MatrasAndSigns

The consonant parts include all associated halants and nuktas. (For example, an instance of BelowBaseConsonant consists of a sequence of Halant + Below-base-forming Consonant.) All parts are optional, except the main consonant.

All parts are shown in the order they would occur within a syllable, with one qualification: depending on the font implementation, PreBaseReorderingRa may occur before all BelowBaseConsonants, after BelowBaseConsonants and before PostBaseConsonants, or after PostBaseConsonants. Also, a font may be implemented to re-order a Ra to pre-base position only in certain syllables and display it as a below-base or post-base form otherwise. Thus, final determination of whether an occurrence of Ra in a specific syllable can be treated as a pre-base reordering Ra can be made only after the <pref> feature has been applied to that syllable.

There could be several main consonants in the case where more than one consonant doesn't have a half-, below-base, post-base or pre-base form. In the case of a cluster where the first consonant does not have a half form, the shaping engine will recognize it as the 1st 'full form' and go on to identify the 2nd full form consonant, if there is one. This information will then be used to determine the reordering behavior of the reph or any matras, vowel modifiers or stress marks.

All other elements are classified by their position relative to the base: pre-base (half forms and reordering pre-base Ra forms), below-base, above-base and post-base.

Indic clusters are subject to the following constraints:

• Only one reph is allowed per syllable.
• Only one pre-base reordering Ra is allowed per syllable.
• A nukta can be placed on a consonant, matra or independent vowel. It cannot be placed on a pre-composed nukta character.
• One matra from each positioning class is permitted (exception in the Kannada script). A composite matra is treated as belonging to all the classes from which its components belong.
• One syllable modifier sign is allowed per cluster.
• Vedic signs are combining marks (used for Sanskrit) that should be included in all Indic scripts.
• Danda and Double Danda are punctuation marks that should be included in all Indic scripts.

Reorder characters

Once the Indic shaping engine has analyzed the cluster as described above, it creates and manages a buffer of appropriately reordered elements (glyphs) representing the cluster, according to several rules (described below). The OpenType lookups in an Indic font must be written to match glyph sequences after re-ordering has occurred. OpenType fonts should not have substitutions that attempt to perform the re-ordering. If a font developer attempted to encode such reordering information in an OpenType font, they would need to add a huge number of many-to-many glyph mappings to cover the general algorithms that a shaping engine will use.

1. Find base consonant: The shaping engine finds the base consonant of the syllable, using the following algorithm: starting from the end of the syllable, move backwards until a consonant is found that does not have a below-base or post-base form (post-base forms have to follow below-base forms), or that is not a pre-base reordering Ra, or until the first consonant is reached. The consonant stopped at will be the base.
   o If the syllable starts with Ra + Halant (in a script that has Reph) and has more than one consonant, Ra is excluded from candidates for base consonants.

2. Decompose and reorder Matras: Each matra and any syllable modifier sign in the cluster are moved to the appropriate position relative to the consonant(s) in the cluster. The shaping engine decomposes two- or three-part matras into their constituent parts before any repositioning. Matra characters are classified by which consonant in a conjunct they have affinity for and are reordered to the following positions:
   o Before first half form in the syllable
   o After subjoined consonants
   o After post-form consonant
   o After main consonant (for above marks)

3. Reorder marks to canonical order: Adjacent nukta and halant or nukta and vedic sign are always repositioned if necessary, so that the nukta is first.

4. Final reordering: After the localized forms and basic shaping forms GSUB features have been applied (see below), the shaping engine performs some final glyph reordering before applying all the remaining font features to the entire cluster.

   o Reorder matras: If a pre-base matra character had been reordered before applying basic features, the glyph can be moved closer to the main consonant based on whether half-forms had been formed. The actual position for the matra is defined as 'after last standalone halant glyph, after initial matra position and before the main consonant'. If ZWJ or ZWNJ follows this halant, the position is moved after it.

   o Reorder reph: Reph's original position is always at the beginning of the syllable (i.e. it is not reordered at the character reordering stage). However, it will be reordered according to the basic-forms shaping results. Possible positions for reph, depending on the script, are: after main, before post-base consonant forms, and after post-base consonant forms.

      1. If reph should be positioned after post-base consonant forms, proceed to step 5.
      2. If the reph repositioning class is not after post-base: the target position is after the first explicit halant glyph between the first post-reph consonant and the last main consonant. If ZWJ or ZWNJ follows this halant, the position is moved after it. If such a position is found, this is the target position. Otherwise, proceed to the next step. Note: in old-implementation fonts, where classifications were fixed in the shaping engine, there was no case where the reph position would be found in this step.
      3. If reph should be repositioned after the main consonant: find the first consonant not ligated with main, or find the first consonant that is not a potential pre-base reordering Ra.
      4. If reph should be positioned before post-base consonant, find the first post-base classified consonant not ligated with main. If no consonant is found, the target position should be before the first matra, syllable modifier sign or vedic sign.
      5. If no consonant is found in steps 3 or 4, move reph to a position immediately before the first post-base matra, syllable modifier sign or vedic sign that has a reordering class after the intended reph position. For example, if the reordering position for reph is post-main, it will skip above-base matras that also have a post-main position.
      6. Otherwise, reorder reph to the end of the syllable.

   o Reorder pre-base reordering consonants: If a pre-base reordering consonant is found, reorder it according to the following rules:

      1. Only reorder a glyph produced by substitution during application of the <pref> feature. (Note that a font may shape a Ra consonant with the <pref> feature generally but block it in certain contexts.)
      2. Try to find a target position the same way as for a pre-base matra. If it is found, reorder the pre-base consonant glyph.
      3. If the position is not found, reorder immediately before the main consonant.
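The base-consonant search from the first reordering step can be sketched as follows. This is a minimal illustration, not the engine's actual code: the glyph names, the `CONSONANTS` set and the `has_subjoined_form` predicate are hypothetical, and the predicate folds "has a below-base or post-base form, or is a pre-base reordering Ra" into a single test.

```python
# Hypothetical glyph names used only for this sketch.
CONSONANTS = {"ka", "ssa", "ra", "ya", "ta"}


def find_base(glyphs, has_subjoined_form, script_has_reph=True):
    """Return the index of the base consonant in `glyphs`, or None.

    `has_subjoined_form(glyph)` should return True when the consonant has a
    below-base or post-base form or is a pre-base reordering Ra.
    """
    consonants = [i for i, g in enumerate(glyphs) if g in CONSONANTS]
    if not consonants:
        return None
    candidates = consonants
    # A leading Ra + Halant is a reph candidate and cannot be the base
    # when the syllable has more than one consonant.
    if (script_has_reph and len(consonants) > 1
            and glyphs[:2] == ["ra", "halant"]):
        candidates = consonants[1:]
    # Walk backwards from the end of the syllable.
    for i in reversed(candidates):
        if not has_subjoined_form(glyphs[i]):
            return i
    # Every candidate can take a subjoined form: stop at the first one.
    return candidates[0]
```

When every consonant in the font has a below-base form, the search always runs back to the first eligible consonant, so the base ends up being the first consonant after any reph.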

Character reordering classes for Kannada:

Characters                  Reorder Class
0CB0                        After Postscript
0CBF, 0CC6, 0CCC            BeforeSubscript
0CBE                        BeforeSubscript
0CE2, 0CE3                  BeforeSubscript
0CC1, 0CC2                  BeforeSubscript
0CC3, 0CC4, 0CD5, 0CD6      AfterSubscript

Shape glyph sequences (GSUB processing)

All characters from a string are first mapped to their nominal glyphs using the cmap lookup. The shaping engine then proceeds to shape (substitute) the glyphs using GSUB lookups. The features for localized forms and basic shaping forms are applied one at a time to the cluster or a relevant portion of the cluster. The results after basic shaping forms features have been applied impact the final syllable analysis in terms of final designation of Ra as a pre-base reordering form and final reordering positions for reph and matras. Next, the features for presentation forms are applied to the entire cluster simultaneously.

Note: since the presentation form features are applied simultaneously over the entire cluster, several features are operationally equivalent to a single feature. Multiple features are provided as an aid for font developers to organize the lookups they implement.

Note: final reordering occurs after features for basic shaping forms have been applied and before features for presentation forms are applied. Font developers must consider the effects of initial reordering (before any features are applied) and final reordering (after basic shaping forms features have been applied) when they create GSUB feature and lookup tables.

These predefined features are described and illustrated in the Features section and are applied in the order below.

Shaping features:

Localized forms
a. Apply feature 'locl' to select language-specific forms.

Basic shaping forms
b. Apply feature 'nukt' to substitute nukta forms of consonants.
c. Apply feature 'akhn' to substitute required akhand ligatures, or to substitute forms that take precedence over forms produced by features applied later.
d. Apply feature 'rphf' to substitute reph glyph (above-base form of 'Ra').
e. Apply feature 'pref' to substitute pre-base forms.
f. Apply feature 'blwf' to substitute below-base forms.
g. Apply feature 'half' to substitute half forms of pre-base consonants.
h. Apply feature 'pstf' to substitute post-base forms of consonants.
i. Apply feature 'cjct' to substitute conjunct forms. (This is needed particularly for ligature conjunct forms when the pre-base consonant does not have a half form.)

Presentation forms
j. Apply feature 'pres' to substitute pre-base consonant conjuncts and pre-base matra conjuncts (i.e. consonant and matra conjuncts to the left of the base glyph).
k. Apply feature 'abvs' to substitute above-base matra conjuncts, reph conjuncts, above-base vowel modifiers and above-base stress and tone marks.
l. Apply feature 'blws' to substitute below-base consonant conjuncts, below-base matra conjuncts, below-base vowel modifier forms and below-base stress and tone mark forms.
m. Apply feature 'psts' to substitute post-base consonant conjuncts, post-base matra conjuncts and post-base vowel modifiers.
n. Apply feature 'haln' to substitute the halant form of base (or conjunct base) glyph in syllables ending with a halant.
o. Apply feature 'calt' to substitute the contextual alternate of a consonant.
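The ordering of the GSUB stage can be written down as plain data, which is sometimes a convenient way to keep the sequence straight. The sketch below is only an illustration: the `shaping_plan` function and the step labels are invented for this example and do not correspond to any particular engine's API.

```python
# Feature tags in the order given by the list above.
LOCALIZED = ["locl"]
BASIC = ["nukt", "akhn", "rphf", "pref", "blwf", "half", "pstf", "cjct"]
# 'calt' is discretionary; the others are mandatory presentation forms.
PRESENTATION = ["pres", "abvs", "blws", "psts", "haln", "calt"]


def shaping_plan():
    """Return the GSUB stage as a list of (step kind, feature tags) pairs."""
    # Localized and basic shaping features are applied one at a time.
    plan = [("one-at-a-time", [tag]) for tag in LOCALIZED + BASIC]
    # Final reordering happens between basic and presentation features.
    plan.append(("final-reordering", []))
    # Presentation features are applied together over the whole cluster.
    plan.append(("simultaneous", list(PRESENTATION)))
    return plan
```

Keeping final reordering as its own step in the plan makes it explicit that presentation lookups see glyphs in their reordered positions.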

Position glyph sequences (GPOS processing) The shaping engine next processes the GPOS (glyph positioning) table, applying features concerned with positioning. All features are applied simultaneously to the entire cluster. The font developer must consider the effects of re-ordering when creating the GPOS feature and lookup tables (i.e., the glyphs will be in the order they were in after the GSUB presentation forms features were applied).

Positioning features:

Kerning
a. Apply feature 'kern' to adjust distances (e.g., to provide kerning between post- or pre-base elements and the base glyph).
b. Apply feature 'dist' to adjust distances. (NOTE: the feature 'dist' can be used in the same way as the 'kern' feature. The advantage of using the 'dist' feature is that it does not rely on the application to enable kerning. Therefore, if you want to make sure certain spacing adjustments will always be displayed, you should use the 'dist' feature.)

Above-base marks
c. Apply feature 'abvm' to position above-base forms, vowel modifiers and/or stress/tone marks (on base glyph or post-base matra).

Below-base marks
d. Apply feature 'blwm' to position below-base forms, vowel modifiers and/or stress/tone marks.

Base elements

Commonly, a feature is required for dealing with the base glyph and one of the post-base, pre-base, above-base or below-base elements. Since it is not possible to reorder ALL of these elements next to the base glyph, we need to skip over the elements "in the middle" (reordering-wise). The solution is to assign different mark attachment classes to different elements of the syllable and positional forms, and in any given lookup work with one mark type only. For example, in above-base substitutions we need only consider above-base elements most of the time.

Generally, it is good practice to label as "mark" any glyphs that are denoted as combining marks in the Unicode Standard as well as below-base/above-base forms of consonants. Then, different attachment classes should be assigned to different marks depending on their position with respect to the base.

For example, after the shaping engine has re-ordered elements within the cluster, matras will always occur before a syllable modifier sign such as the candrabindu. In an actual sequence, though, potentially some other mark glyph, such as nukta, may occur between the matra and the candrabindu. Thus, when processing the matra and candrabindu, you may need to allow for the possibility that some other mark glyph(s) may occur between them. Using lookup flags, you can specify that a lookup should process only a certain class of marks, such as 'above-base marks', and ignore all other marks. In that way, a match will occur whether or not a mark from another class is present. Otherwise, the lookup would fail to apply.
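The effect of restricting a lookup to one mark class can be illustrated with a small matching routine. This is a sketch under assumptions: the glyph names, the class names "top" and "bottom", and the `find_match` helper are all hypothetical, and real lookups are resolved by the OpenType layout engine rather than by code like this.

```python
def find_match(glyphs, pattern, mark_class_of, only_class=None):
    """Return indices in `glyphs` matching `pattern`, or None.

    Marks whose class differs from `only_class` are skipped during matching.
    With only_class=None every glyph is considered (process all marks).
    """
    def is_visible(glyph):
        cls = mark_class_of.get(glyph)  # None means "not a mark"
        return cls is None or only_class is None or cls == only_class

    visible = [i for i, g in enumerate(glyphs) if is_visible(g)]
    for start in range(len(visible) - len(pattern) + 1):
        window = visible[start:start + len(pattern)]
        if [glyphs[i] for i in window] == pattern:
            return window
    return None


# Hypothetical cluster: a nukta sits between the matra and the candrabindu.
CLUSTER = ["ka", "matra_i", "nukta", "candrabindu"]
MARKS = {"matra_i": "top", "nukta": "bottom", "candrabindu": "top"}
```

With `only_class="top"` the pattern `["matra_i", "candrabindu"]` matches at indices 1 and 3 because the nukta is skipped; with all marks processed the intervening nukta prevents the match.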

Using Microsoft VOLT, you can assign glyphs to attachment classes. In the example below this 'abvm' feature was set to process only TopMarks, therefore the presence of another mark class would be ignored. If Process ALL was used and another mark glyph followed the matra, this positioning lookup would fail to apply. This example comes from the Devanagari font Mangal.

Invalid combining marks Combining marks and signs that do not occur in conjunction with a valid base are considered invalid. Shaping engine implementations may adopt different strategies for how invalid marks are handled. For example, a shaping engine implementation might treat an invalid mark as a separate cluster and display the stand-alone mark positioned on some default base glyph, such as a dotted circle. (See Fallback Rendering in section 5.13 of the Unicode Standard 4.0.) Shaping engine implementations may vary somewhat with regard to what sequences are or are not considered valid. For instance, some implementations may impose a limit of at most one above-base vowel mark while others may not. To allow for shaping engine implementations that expect to position an invalid mark on a dotted circle, it is recommended that a Kannada OT font contain a glyph for the dotted circle character, U+25CC. If this character is not supported in the font, such implementations will display invalid signs on the missing glyph shape (white box).

In addition to the 'dotted circle', other Unicode code points that are recommended for inclusion in any Kannada font are the ZWNJ (zero width non-joiner; U+200C), the ZWJ (zero width joiner; U+200D) and the ZWSP (zero width space; U+200B). For more information see the Suggested glyphs section of the OpenType Font Development document.
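A quick way to confirm which code point is which, and to check a font's coverage, is sketched below. In the Unicode character database U+200C is ZERO WIDTH NON-JOINER and U+200D is ZERO WIDTH JOINER. The `missing_recommended` helper and its `cmap_codepoints` argument (a set of code points the font maps) are assumptions for this example; how that set is obtained from a real font is left out.

```python
import unicodedata

# Recommended code points and their Unicode character names.
RECOMMENDED = {
    0x25CC: "DOTTED CIRCLE",
    0x200C: "ZERO WIDTH NON-JOINER",
    0x200D: "ZERO WIDTH JOINER",
    0x200B: "ZERO WIDTH SPACE",
}


def missing_recommended(cmap_codepoints):
    """Return the recommended code points absent from the font's cmap."""
    return sorted(cp for cp in RECOMMENDED if cp not in cmap_codepoints)


def describe(cp):
    """Format a code point as 'U+XXXX NAME' using the Unicode database."""
    return "U+%04X %s" % (cp, unicodedata.name(chr(cp)))
```

For instance, `missing_recommended({0x25CC, 0x200D})` reports that the zero width space and the zero width non-joiner are not covered.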

Effect of ZWJ, ZWNJ and NBSP on Consonant Shaping

Unicode defines specific behaviors for ZWJ and ZWNJ in relation to Indic scripts. The Indic-specific behavior retains the general behavior that ZWJ requests connection between text elements while ZWNJ inhibits connection between text elements.

1. The main intent of using ZWJ in this context is to prevent a ligature-conjunct from forming (and in Devanagari or Gujarati, to request a half form, below-base form or post-base form instead). The Indic engine does not need to take any action to prevent ligature-conjunct formation: the presence of ZWJ will prevent GSUB substitution lookups from matching the input glyph sequence. If the first consonant does not have a half form, an overt-halant form should result, which would also happen with no particular action by the engine.

2. A secondary intent of using ZWJ in this context is to prevent the display of reph in the case that the first consonant is RA. If a cluster begins with RA H (halant) ZWJ, the engine must ensure that the 'rphf' feature is not applied, and that re-ordering for reph does not take place. Note that use of either joiner in this context should prevent formation and re-ordering of reph when RA is the first consonant.

3. The main intent of using ZWNJ is to prevent conjunct ligature or half forms from forming, and to display an explicit halant form instead. The shaping engine must take specific actions to prevent half forms for a sequence of Consonant + Halant + ZWNJ.

The following example illustrates these behaviors:

Just as the ZWJ can be used to display a half form in isolation, it can also be used to display a mark, sub- or post-base form in isolation. Unlike the stand-alone half form, however, sequences to display them must begin with a no-break space (NBSP). This is because mark glyphs must combine with a base glyph: to appear in isolation, a NBSP must be provided as a base. In the illustration below the I-matra is displayed without the dotted circle by using the NBSP. The combination of NBSP and ZWJ is used to display the below-base form of Ra in isolation.

Features

The features listed below have been defined to create the basic forms for the languages that are supported on Kannada systems. Regardless of the model an application chooses for supporting layout of complex scripts, the shaping engine requires a fixed order for executing features within a run of text to consistently obtain the proper basic form.

The features of the basic shaping forms are applied one at a time to the cluster or portion of the cluster. The result impacts the analysis in terms of the conjoining behavior and final reordering. The features of the presentation forms are applied next, to the entire cluster simultaneously. Mandatory features must always be applied; the discretionary presentation-forms features listed should be applied by default, but can be suppressed by a client (normally at the discretion of the user).

The order of the lookups within each feature is also very important. For more information on lookups and defining features in OpenType fonts, see the Encoding section of the OpenType Font Development document.

OpenType features used for the Kannada script, applied in the following order:

Feature    Feature function                  Layout operation

Localized forms:
locl       Localization form substitution    GSUB

Basic shaping forms:
nukt       Nukta form substitution           GSUB
akhn       Akhand ligature substitution      GSUB
rphf       Reph form substitution            GSUB
pref       Pre-base form substitution        GSUB
blwf       Below-base form substitution      GSUB
half       Half-form substitution            GSUB
pstf       Post-base form substitution       GSUB
cjct       Conjunct form substitution        GSUB

Mandatory presentation forms:
pres       Pre-base substitution             GSUB
abvs       Above-base substitution           GSUB
blws       Below-base substitution           GSUB
psts       Post-base substitution            GSUB
haln       Halant form substitution          GSUB

Discretionary presentation forms:
calt       Contextual alternates             GSUB

Positioning features:
kern       Kerning                           GPOS
dist       Distances                         GPOS
abvm       Above-base mark positioning       GPOS
blwm       Below-base mark positioning       GPOS

[GSUB = glyph substitution, GPOS = glyph positioning]

Feature examples Many of the registered features described and illustrated in this document are based on the Microsoft OpenType font Tunga. 'Tunga' contains layout information and glyphs to support all of the required features for the Kannada script and language systems supported. The illustrations in the following examples show the result of that particular feature being applied. Features must be written to match glyph sequences after re-ordering has occurred. Note that the input context for a feature may be the result of a previous feature having already been applied.

Localized forms

Feature Tag: "locl" This feature is used in association with OpenType language system tags to trigger lookups that will select alternate glyphs needed for language-specific typographic conventions. The 'locl' feature should not be used in association with the default language system, but only with other language system tags. See the Appendix of this document for language system tags associated with the Kannada script.

Basic shaping forms Nukta

Feature Tag: "nukt" The nukta alters the way a preceding consonant or vowel is pronounced. The most common nukta forms have been defined as separate characters in Unicode with their own code points. All consonants, as well as akhand forms, should have an associated nukta form. Note: rather than using substitution, nukta forms can also be created by positioning the nukta as a below-base mark on the base glyph using the 'blwm' positioning feature. The input context for the nukt feature always consists of the full form of the consonant. The half form of nukta consonants will be substituted using the half feature.

Akhand

Feature Tag: "akhn" An akhand is a required consonant ligature that may appear anywhere in the syllable, and may or may not involve the base glyph. Akhand ligatures have the highest priority and are formed first; some languages include them in their alphabets. There are 2 Akhand ligatures in Kannada. The input context for the akhand feature always consists of the full form of the consonant. The half forms of Akhand ligatures will be called later in the half feature. Because the akhand feature is applied early in the sequence of features and is applied over the entire cluster, it can also be used to create certain forms that must take priority in particular contexts over forms that would be created during subsequent feature application.

Using the 'akhn' feature; Ka + halant + Ssa is substituted with the KaSsa ligature:

Ja + halant + Nya is substituted with the JaNya Ligature using 'akhn':

Reph

Feature Tag: "rphf" Applying this feature substitutes the Reph glyph. If the first consonant of the cluster consists of the full form of Ra + Halant, this feature substitutes the combining-mark form of Reph. In addition, the position of the Reph glyph is adjusted with the 'abvm' GPOS feature. The input context for the Reph feature always consists of the full form of Ra + Halant.

The 'rphf' feature substitutes the mark glyph form of Ra and is positioned after the final base or belowbase glyph:

Pre-base form of consonant

Feature Tag: "pref" This feature substitutes the pre-base forms of Consonants.

Below form of consonant

Feature Tag: "blwf" This feature substitutes the below-base forms of Consonants that follow the base consonant. All characters encoded in Unicode v3.0 for Kannada have a below-base form. If a ligature is required between the below-base glyph and the preceding consonant, it will be handled by the feature 'blws' (below-base substitutions). Example 1 - Halant plus Ra (preceded by a base consonant) will be substituted by the below-base Ra:

Example 2- In 'Ba + halant + Ka', the below-base form of Ka will be substituted using the blwf feature:

Half form of consonant

Feature Tag: "half" Applying this feature substitutes half forms: forms of consonants used in the pre-base position. Consonants that have a half form should be listed in the 'half' feature. Some scripts, like Devanagari, have distinctly shaped half forms for most of the consonants. However, if a consonant does not have a distinct shape for the half form and does not form any ligature, it will be displayed with an explicit Virama (same shape as the halant form). Note: the result of listing a consonant in the half feature (whether it has a true half form or not) will affect the re-ordering (and positioning) of the reph and pre-pended matras. See illustration in the Introduction section of this document. This feature is applied to all consonants preceding the 'main' consonant. Note: while Kannada typically does not use half forms, this feature is made available for typographic preference.

Post-base form of consonant

Feature Tag: "pstf" The 'pstf' feature can be applied to substitute the post-base form of a consonant.

Conjunct forms

Feature Tag: "cjct" Apply feature 'cjct' to substitute conjunct forms where the first consonant in the consonant-cluster pair does not have a half form. This feature allows for control over reordering of reph and pre-pended matras in case of consonants that do not take half forms yet do form conjunct ligatures in combination with certain following consonants.

Presentation forms After the glyphs have been reordered, the presentation lookups are applied to provide the best typographic rendering of the text. The features of the presentation forms are applied to the entire cluster simultaneously, executing lookups within each feature in the order that they are specified in the font. The abvs, blws, psts and haln features are all mandatory for software implementations: they are required for correct script behaviour and none should ever be treated as discretionary. Because of this and because they are all applied simultaneously over entire clusters, they are not functionally different: a set of lookups could be divided between these features or grouped together under one of them with no difference in effect. These multiple features are provided, however, as an aid to the font developer for organizing lookups based on the combinations of glyphs they apply to. There are no specific requirements on how each should be used; the examples provided below illustrate typical usage, however.

Pre-base substitutions

Feature Tag: "pres" This feature is used to substitute pre-base consonant conjuncts made with half forms, the type most common in Devanagari. The resulting conjunct can be in full or half form. This feature can also be used to select variant forms of Matras, or pre-composed ligatures of Matras with certain bases.

Above-base substitutions

Feature Tag: "abvs" This feature is used for glyph substitutions involving above-base marks. Such substitutions might be used to select contextual forms of marks, to create mark-mark ligatures, or to create mark-base ligatures. Specific context-dependent forms of below-base consonants are handled by this lookup as well.

Example 1 - 'abvs' ligature substitution; Ka + vowel I substituted with pre-composed ligature:

Example 2 - 'abvs' ligature substitution; Ta + vowel I substituted with pre-composed ligature:

Example 3 - 'abvs' contextual substitution; variation of vowel E is substituted when preceded by Nya:

Example 4 - 'abvs' contextual substitution; Using MS Volt, different consonant forms are substituted when followed by certain above base matras:

Below-base substitutions

Feature Tag: "blws" This feature is used for glyph substitutions involving below-base marks or consonants. Such substitutions can be used to create conjuncts of base glyphs with below-base consonants, below mark ligatures or below mark-base ligatures. Specific context-dependent forms are handled by this lookup as well.

Example 1- below-base conjunct substitution; KaSsa + below-base Nna:

Example 2- below-base consonant conjuncts; below-base Ta + below-base Ya form a conjunct:

Example 3 - 'blws' contextual substitution; using MS Volt, different below-base consonant forms are substituted when preceded by certain below-base consonants:

Post-base substitutions

Feature Tag: "psts" This feature is used to substitute post-base consonants or matras. Such substitutions can be used to create conjuncts of base glyphs with post-base consonants or post-base matra ligatures. It can also be used to specify contextual alternates of post-base forms.

Example 1- contextual 'psts' substitution; used to select alternate form of vowel Uu, when preceded by Pa.

Example 2 - contextual 'psts' substitution; using MS Volt, variant shapes of matras are substituted based on the context:

Halant form of consonants

Feature Tag: "haln" This feature is used to substitute a pre-composed halant form of a base (or conjunct base) glyph in syllables ending with a halant. (Rather than using substitution, halant forms can also be created by positioning the halant as a below-base mark on the base glyph using the 'blwm' positioning feature.) This feature is applied only on the base glyph if the syllable ends with a halant, or in the case of non-final consonants that do not take a half form and do not form a conjunct ligature with the following consonant.

Example 1 - 'haln' feature used to substitute halant form of base glyph:

Example 2 - 'haln' feature used to substitute halant form of conjunct base glyph:

Contextual Alternates

Feature Tag: "calt" Unlike the previous presentation lookups, the 'calt' feature is optional and is used to substitute discretionary contextual alternates. It is important to note that an application may allow users to turn off this feature, so it should not be used for any obligatory Kannada typography.

Positioning features Distances

Feature Tag: "dist" This feature covers positioning lookups that adjust distances between glyphs, such as kerning between pre- and post-base elements and the base glyph. Note: the feature 'dist' can be used in the same way as the 'kern' feature. The advantage of using the 'dist' feature is that it does not rely on the application to enable kerning. Example 1 - 'dist' feature created using MS Volt; pair adjustments made based on context:

Above-base marks

Feature Tag: "abvm"

This feature positions all above-base marks on the base glyph or the post-base matra. The best method for encoding this feature in an OpenType font is to use a chaining context positioning lookup that triggers mark-to-base and mark-to-mark attachments for above-base marks.

The 'abvm' feature shown in MS Volt, using 'Pair Adjustment' to move the positions of above-base marks over bases:

Below-base marks

Feature Tag: "blwm"

This feature positions all below-base marks on the base glyph. The best method for encoding this feature in an OpenType font is to use a chaining context positioning lookup that triggers mark-to-base and mark-to-mark attachments for below-base marks.
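As a rough illustration of what a mark attachment computes: a mark is placed so that its own anchor point lands on the matching anchor of the glyph it attaches to. For mark-to-base that glyph is the base; for mark-to-mark it is a previously placed mark. The Python sketch below is a simplification that ignores advance widths, and the coordinates and the names `Point` and `attach` are hypothetical, not taken from any font or from MS Volt.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """A position in font units (values used below are made up)."""
    x: int
    y: int


def attach(target_origin: Point, target_anchor: Point, mark_anchor: Point) -> Point:
    """Return the origin at which to draw a mark.

    target_origin: where the glyph being attached to is drawn
    target_anchor: attachment point on that glyph, in its own coordinates
    mark_anchor:   attachment point on the mark, in the mark's coordinates

    The mark is shifted so that both anchor points coincide.
    """
    return Point(
        target_origin.x + target_anchor.x - mark_anchor.x,
        target_origin.y + target_anchor.y - mark_anchor.y,
    )


# Mark-to-base: a below-base mark attached to a base glyph drawn at (0, 0).
base_origin = Point(0, 0)
base_below_anchor = Point(300, -50)   # hypothetical anchor under the base
mark1_anchor = Point(120, 0)          # hypothetical anchor on the first mark
mark1_origin = attach(base_origin, base_below_anchor, mark1_anchor)

# Mark-to-mark: a second mark attached to an anchor on the first mark.
mark1_below_anchor = Point(120, -200)  # hypothetical anchor under the first mark
mark2_anchor = Point(100, 0)
mark2_origin = attach(mark1_origin, mark1_below_anchor, mark2_anchor)

print(mark1_origin, mark2_origin)
```

Because the second mark is positioned relative to the first, moving the first mark's anchor on the base moves the whole stack, which is the practical benefit of chaining mark-to-base and mark-to-mark attachments.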

The 'blwm' feature shown in MS Volt, using 'Anchor Attachment' to position below-base marks:

Examples of Kannada syllables

Complex Kannada syllable formation is possible using the wide range of features available in OpenType. The following examples show how the shaping engine applies the OpenType features, one at a time, to the input string. These combinations do not necessarily represent actual syllables or words, but are meant to illustrate the various OpenType features in a Kannada font.
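The idea of applying features one at a time can be sketched with a toy model in Python. This is not how a real shaping engine is implemented: the glyph names (`ka`, `halant`, `iMatra`, `ka.haln` and so on), the rule tables, and the functions `apply_lookup` and `shape` are all hypothetical, and only the three substitution features described above ('psts', 'haln', 'calt') are modelled.

```python
from typing import Dict, FrozenSet, List, Tuple

Rules = Dict[Tuple[str, ...], str]


def apply_lookup(glyphs: List[str], rules: Rules) -> List[str]:
    """Scan left to right; replace any matching glyph sequence with one glyph."""
    out: List[str] = []
    i = 0
    while i < len(glyphs):
        for pattern, replacement in rules.items():
            if tuple(glyphs[i:i + len(pattern)]) == pattern:
                out.append(replacement)
                i += len(pattern)
                break
        else:
            # No rule matched at this position: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out


# Toy rule tables, listed in the order the features are applied.
FEATURES: List[Tuple[str, Rules]] = [
    ("psts", {("ka", "iMatra"): "ka_iMatra"}),   # post-base matra ligature
    ("haln", {("ka", "halant"): "ka.haln"}),     # pre-composed halant form
    ("calt", {("ka.haln",): "ka.haln.alt"}),     # discretionary alternate
]


def shape(glyphs: List[str],
          disabled: FrozenSet[str] = frozenset()) -> List[Tuple[str, List[str]]]:
    """Apply each enabled feature in turn and record the glyph run after each."""
    trace = []
    for tag, rules in FEATURES:
        if tag in disabled:
            continue
        glyphs = apply_lookup(glyphs, rules)
        trace.append((tag, list(glyphs)))
    return trace


for tag, run in shape(["ka", "halant"]):
    print(tag, run)
```

Calling `shape(["ka", "halant"], disabled=frozenset({"calt"}))` skips the last step, which mirrors the point made about 'calt': since an application may turn it off, nothing obligatory should depend on it.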

Appendices

Appendix A: Writing System Tags

Features are encoded according to both a designated script and language system. Currently most shaping engine implementations only support the "default" language system for each script. However, font developers may want to build language specific features which are supported in other applications and will be supported in future Microsoft OpenType implementations.

NOTE: It is strongly recommended to include the "dflt" language tag in all OpenType fonts because it defines the basic script handling for a font. The "dflt" language system is used as the default if no other language specific features are defined, or if the application does not support that particular language. If the "dflt" tag is not present for the script being used, the font may not work in some applications.

The following tables list the registered tag names for script and language systems. Note that the new Indic shaping implementation uses the "knd2" script tag (old-behavior implementations used "knda").

Registered tags for the Kannada script

Script tag             Script
"knd2"                 Kannada

Registered tags for Kannada language systems

Language system tag    Language
"dflt"                 *default script handling
"KAN "                 Kannada

Note: both the script and language tags are case sensitive (script tags should be lowercase, language tags are all caps) and must contain four characters (i.e. you must add a space to the three-character language tags).
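The padding and case rules in the note can be checked mechanically. The following Python sketch is only an illustration; the helper names `pad_tag`, `looks_like_script_tag` and `looks_like_language_tag` are hypothetical, and treating "dflt" as the one lowercase exception among language system tags is an assumption based on the tables above.

```python
def pad_tag(name: str) -> str:
    """Pad a tag name with trailing spaces to exactly four characters."""
    if not 1 <= len(name) <= 4 or not name.isascii():
        raise ValueError(f"not a valid tag name: {name!r}")
    return name.ljust(4)


def looks_like_script_tag(tag: str) -> bool:
    """Script tags are four characters and lowercase, e.g. "knd2"."""
    return len(tag) == 4 and tag == tag.lower()


def looks_like_language_tag(tag: str) -> bool:
    """Language system tags are four characters and all caps, e.g. "KAN ".

    Assumption: "dflt" is accepted as the default language system tag.
    """
    return len(tag) == 4 and (tag == "dflt" or tag == tag.upper())


kannada_language = pad_tag("KAN")
print(repr(kannada_language))
print(looks_like_script_tag("knd2"), looks_like_language_tag(kannada_language))
```

Comparing tags as exact four-character strings, trailing space included, avoids the common mistake of looking up "KAN" and silently failing to match "KAN ".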

Source: http://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htm

GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Copyright © 2007 Free Software Foundation, Inc. <http://fsf.org/>

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble The GNU General Public License is a free, copyleft license for software and other kinds of works. The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others. For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. 
For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow.

TERMS AND CONDITIONS

0. Definitions. “This License” refers to version 3 of the GNU General Public License. “Copyright” also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. “The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations. To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a “modified version” of the earlier work or a work “based on” the earlier work. A “covered work” means either the unmodified Program or a work based on the Program. To “propagate” a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To “convey” a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays “Appropriate Legal Notices” to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion.

1. Source Code. The “source code” for a work means the preferred form of the work for making modifications to it. “Object code” means any non-source form of a work. A “Standard Interface” means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The “System Libraries” of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A “Major Component”, in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The “Corresponding Source” for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work.

2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary.

3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures.

4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any nonpermissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.

5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

a) The work must carry prominent notices stating that you modified it, and giving a relevant date.

b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.

c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.

d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so.

A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an “aggregate” if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.

6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:

a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange.

b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge.

c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b.

d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements.

e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d.

A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A “User Product” is either (1) a “consumer product”, which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For
a particular product received by a particular user, “normally used” refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. “Installation Information” for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. 
Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying.

7. Additional Terms. “Additional permissions” are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:

a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or

b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or

c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or

d) Limiting the use for publicity purposes of names of licensors or authors of the material; or

e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or

f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors.

All other non-permissive additional terms are considered “further restrictions” within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way.

8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10.

9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so.

10. Automatic Licensing of Downstream Recipients.

Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An “entity transaction” is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.

11. Patents. A “contributor” is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's “contributor version”. A contributor's “essential patent claims” are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, “control” includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a “patent license” is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To “grant” such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. 
“Knowingly relying” means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it.

A patent license is “discriminatory” if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law.

12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program.

13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such.

14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License “or any later version” applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.

15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

17. Interpretation of Sections 15 and 16. If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the “copyright” line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year> <name of author>

    This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode:

    <program> Copyright (C) <year> <name of author>
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an “about box”. You should also get your employer (if you work as a programmer) or school, if any, to sign a “copyright disclaimer” for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see <http://www.gnu.org/licenses/>. The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read <http://www.gnu.org/philosophy/why-not-lgpl.html>.

SIL OPEN FONT LICENSE Version 1.1 - 26 February 2007

PREAMBLE The goals of the Open Font License (OFL) are to stimulate worldwide development of collaborative font projects, to support the font creation efforts of academic and linguistic communities, and to provide a free and open framework in which fonts may be shared and improved in partnership with others. The OFL allows the licensed fonts to be used, studied, modified and redistributed freely as long as they are not sold by themselves. The fonts, including any derivative works, can be bundled, embedded, redistributed and/or sold with any software provided that any reserved names are not used by derivative works. The fonts and derivatives, however, cannot be released under any other type of license. The requirement for fonts to remain under this license does not apply to any document created using the fonts or their derivatives.

DEFINITIONS
"Font Software" refers to the set of files released by the Copyright Holder(s) under this license and clearly marked as such. This may include source files, build scripts and documentation.
"Reserved Font Name" refers to any names specified as such after the copyright statement(s).
"Original Version" refers to the collection of Font Software components as distributed by the Copyright Holder(s).
"Modified Version" refers to any derivative made by adding to, deleting, or substituting -- in part or in whole -- any of the components of the Original Version, by changing formats or by porting the Font Software to a new environment.
"Author" refers to any designer, engineer, programmer, technical writer or other person who contributed to the Font Software.

PERMISSION & CONDITIONS
Permission is hereby granted, free of charge, to any person obtaining a copy of the Font Software, to use, study, copy, merge, embed, modify, redistribute, and sell modified and unmodified copies of the Font Software, subject to the following conditions:

1) Neither the Font Software nor any of its individual components, in Original or Modified Versions, may be sold by itself.
2) Original or Modified Versions of the Font Software may be bundled, redistributed and/or sold with any software, provided that each copy contains the above copyright notice and this license. These can be included either as stand-alone text files, human-readable headers or in the appropriate machine-readable metadata fields within text or binary files as long as those fields can be easily viewed by the user.
3) No Modified Version of the Font Software may use the Reserved Font Name(s) unless explicit written permission is granted by the corresponding Copyright Holder. This restriction only applies to the primary font name as presented to the users.
4) The name(s) of the Copyright Holder(s) or the Author(s) of the Font Software shall not be used to promote, endorse or advertise any Modified Version, except to acknowledge the contribution(s) of the Copyright Holder(s) and the Author(s) or with their explicit written permission.
5) The Font Software, modified or unmodified, in part or in whole, must be distributed entirely under this license, and must not be distributed under any other license. The requirement for fonts to remain under this license does not apply to any document created using the Font Software.

TERMINATION

This license becomes null and void if any of the above conditions are not met.

DISCLAIMER THE FONT SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF COPYRIGHT, PATENT, TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, INCLUDING ANY GENERAL, SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF THE USE OR INABILITY TO USE THE FONT SOFTWARE OR FROM OTHER DEALINGS IN THE FONT SOFTWARE.

Shri K P Rao

A brief introduction

Professional life
K.P. Rao (Kinnikambala Padmanabha Rao) was born in Kinnikambala, a small village near Mangalore. After taking a science degree from Mysore University, he began his career with a part-time job composing type at a printing press. He later joined the prestigious Tata Press, where he worked hard to introduce photocomposing. Through this kind of sustained effort and exploration he rose to the position of director at the Monotype company.

He has made invaluable contributions of his own to modern printing. He prepared a Kannada phototype with a keyboard (a blind keyboard) that was very complex for its time. For typesetting in modern printing, he implemented, through programming, a system for rendering letters on laser typesetter machines. He developed a system for printing Kannada script on computer dot matrix printers. Using the technologies available at each stage, he created fonts for Devanagari, Kannada, Tulu and Telugu. He taught computers and modern printing technology for some time at the Manipal engineering college, and has also served as a consultant to Quark Express of Chandigarh and to Adobe Systems.

Pioneering, many-sided research and invention in modern Indian-language computing
At a time when hardly anyone in our country had given much thought to implementing modern printing technology, to photocomposing Indian scripts, or to fine printing, Shri K.P. Rao, one of our own, was first among those who reflected deeply on all of this, explored new approaches, undertook many new experiments at the national level, and succeeded in them.

He has made an excellent contribution of his own to the technology of rendering Indian-language scripts on computers. This is a matter of pride for Karnataka and for Kannadigas. He is also responsible for many firsts in Kannada computing. Using the limited technology of the time, as early as 1988 he prepared 'Sediyapu' (a DOS editor), the first Kannada software usable on personal computers, and gave it free of charge for all Kannadigas to use. Sediyapu was the first DOS-based word processor to come into use for preparing letters and articles in Kannada on a computer.

Using the technologies available at each stage, he built beautiful computer fonts. The credit for designing a simple and logical keyboard layout for typing such Kannada fonts easily on a computer goes to Shri K.P. Rao. This same keyboard layout is today recognised by the Government of Karnataka as the official layout for Kannada. Shri K.P. Rao, a figure in Kannada and information technology, currently teaches 'Scientific Communication' to students as an honorary professor at the Manipal Institute of Communication.

A logical, easy keyboard layout
For typing Kannada script, only typewriter-style layouts of various kinds used to be available. Learning to type with them was difficult. K.P. Rao devised an entirely new keyboard layout for Kannada based on a new logic: pressing K on the existing English keyboard produces the Kannada letter 'ka', typing K and then A produces 'kaa', and pressing K and i produces 'ki'. He worked out a new method of producing all Kannada conjuncts (ottakshara) and consonant-vowel syllables (gunitakshara) using only the existing 26 English keys.

English vowels and consonants are mapped, in order, to the Kannada vowels and consonants on the basis of their pronunciation. This is why the layout came to be called the 'phonetic' layout. Because the method is sound-based and logical, typing Kannada places no great burden on the computer user's memory. The user only needs to remember a simple rule: a consonant plus a vowel forms a consonant-vowel syllable, and, with the link key, one consonant plus another consonant forms a conjunct. Since the layout is sound-based, it is easy to learn fast Kannada typing on the existing English keyboard.

Consider an example. On a Kannada typewriter, producing the syllable 'yo' requires six keystrokes! In computer software that follows the typewriter layout, at least three keys must be pressed. In K.P. Rao's layout, pressing just two keys, ya (y) and o (O), produces 'yo'. The technique of producing syllables and conjuncts from the two keys pressed first occurred to Shri K.P. Rao. Because the logic of script composition is the same for all other Indian languages, the layout also upholds the phonetic principle from a linguistic point of view.

Because of this distinctive feature, K.P. Rao's layout has been adopted in computer software for other Indian languages as well and has become popular. It is also the official keyboard layout of Kannada and of the Government of Karnataka. The Kannada Ganaka Parishat, which altered this same layout slightly in the name of improvement, was instrumental in the Government of Karnataka adopting it as the 'Standard Kannada Keyboard Layout'.
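The composition rule described in this note (a consonant key followed by a vowel key yields a syllable; a link key joins two consonants into a conjunct) can be illustrated with a minimal Python sketch. The key assignments here are assumptions made for the example, not the published layout: lowercase `k` and `y` stand for the consonant keys, `A`, `i` and `O` for vowel keys, and `f` is a hypothetical link key.

```python
# Minimal sketch of phonetic key composition for Kannada.
# All key assignments below are illustrative assumptions, not the
# official layout; only a handful of letters are covered.

VIRAMA = "\u0CCD"  # KANNADA SIGN VIRAMA: suppresses the inherent vowel

# A consonant key emits the consonant with its inherent vowel 'a'.
CONSONANTS = {
    "k": "\u0C95",  # KANNADA LETTER KA
    "y": "\u0CAF",  # KANNADA LETTER YA
}

# A vowel key typed right after a consonant emits a dependent vowel sign.
VOWEL_SIGNS = {
    "A": "\u0CBE",  # KANNADA VOWEL SIGN AA
    "i": "\u0CBF",  # KANNADA VOWEL SIGN I
    "O": "\u0CCB",  # KANNADA VOWEL SIGN OO
}

LINK_KEY = "f"  # hypothetical link key used to build conjuncts


def compose(keys: str) -> str:
    """Turn a sequence of key presses into Kannada Unicode text."""
    out = []
    after_consonant = False
    for key in keys:
        if key in CONSONANTS:
            out.append(CONSONANTS[key])
            after_consonant = True
        elif key in VOWEL_SIGNS and after_consonant:
            # consonant + vowel -> consonant-vowel syllable
            out.append(VOWEL_SIGNS[key])
            after_consonant = False
        elif key == LINK_KEY and after_consonant:
            # consonant + link + consonant -> conjunct
            out.append(VIRAMA)
            after_consonant = False
        else:
            raise ValueError(f"unsupported key sequence at {key!r}")
    return "".join(out)


if __name__ == "__main__":
    for sequence in ("k", "kA", "ki", "yO", "kfk"):
        print(sequence, "->", compose(sequence))
```

With these assumed keys, two key presses are enough for a consonant-vowel syllable, which is the point the 'yo' example makes. A real layout would also need independent vowels, the remaining consonants, and digits.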

Shreedhar T S

About eSpeak
I came to know about the eSpeak text-to-speech engine from a friend in Tamilnadu. I had seen eSpeak TTS supporting the Hindi language in the screen readers SAFA and NVDA. My friend told me about eSpeak Tamil and suggested that I work on eSpeak Kannada. I contacted the eSpeak developer, Jonathan Duddington. He mapped all the Unicode letters and their pronunciation in eSpeak. I worked with him to improve the support for the Kannada language in eSpeak. I made some changes with his guidance, and I gave feedback about some phoneme changes which affect other languages too. After 2-3 months of work, Kannada was included in the official release of espeak-1.45.04. I am still giving feedback and making small changes. Some of my friends helped me in this.

Skills with computers and software
I can use and customize screen reading software and TTS as per the requirements of the users. I have a basic knowledge of the functions and fundamentals of two major and widely used screen readers, NVDA (NonVisual Desktop Access) and JAWS (Job Access With Speech). The former is a free and open source screen reader, and the latter is a commercial one. I know how to write scripts for the JAWS screen reader to extend its support to applications that no screen reader supports, and to applications in which screen reader users cannot access some functions. I know the usage of some screen readers for mobile phones based on the Symbian operating system, their customization, and how to enhance their accessibility in other applications too.

Besides that, I have experience of testing software for accessibility. I have used some audio games (games which are based on audio feedback and designed for visually impaired users) and tested them for their usability. Currently I am testing and giving feedback on a learning tool from an NGO. All this I do as a volunteer, keeping in view the development of the visually impaired community.

I am learning C, C++, and HTML on my own by reading tutorials, and with the help of some of my friends. I can write some basic programs in C and C++. I am interested in getting coaching in advanced programming concepts and in other programming languages like VB, Java, Python etc., but my screen readers do not support the Java and Python compilers. Some more accessibility work needs to be done with the screen readers to support these compilers. Besides all this, I know the basic computer concepts of MS Office.

E-Mail: tss.abs@gmail.com Mobile: 9980989171 Blog: shreeword.blogspot.com

Text-to-speech goes Kannada

The software is developed by a visually challenged computer programmer who wanted to help others like him

Niranjan.Kaggere@timesgroup.com

For thousands of visually challenged people across the state, a 25-year-old visually challenged computer programmer, Sridhar, has come as a blessing in disguise. Not only has he set up an NGO to help other visually challenged people, but also developed a technology to take computers a step closer to the visually impaired. He has developed a Kannada version of the text-to-speech technology through which any visually challenged person can read, write and work on computers.

The Knowledge Commission of the state government has hosted his software, named e-speak in Kannada, on its Kannada Wikipedia website to facilitate other visually challenged people to use it free of cost. The software will convert the Kannada script on the Internet and on computers into sound and read it out for users.

“I have been working on the development of this software for the last 2-3 years. I think it will be a boon for all the visually challenged people,” Sridhar told Bangalore Mirror.

Born in the remote hamlet of Abashi in Soraba taluk of Shimoga district, Sridhar obtained a diploma in computer applications for visually challenged from a special polytechnic in Mysore. “Then itself I had decided to make a career in software programming and learnt basic programming on my own. All through my computer education, I used English e-speak software which is useful only to those who know English. Those who do not know English were deprived of a great knowledge pool,” he said.

Determined to help fellow visually challenged people, he approached Jonathan Duddington who first developed text-to-speech software in English. “I emailed him and explained my wish of developing the same software in Kannada. While he had included several other languages of the world, he found Kannada to be difficult. He sent me all the basics of the module. I helped him add the dictionary and pronunciation parts of the software, including the correct form of phonemes and other grammatical and linguistic aspects,” he said.

Sridhar tested it with his friends. “It turned out to be a great success among the visually challenged as they could work on computers. Several of them started writing articles and blogs besides reading letters and e-papers online. Then I decided to make this software available to all visually challenged people of the state. The Knowledge Commission came to my rescue by hosting the software on its Kannada wikipedia website Kanaja,” he said.

The software is compatible with both Windows and Linux operating systems and can be used with any other screen reading application. As the software has included speech synthesis markup language (SSML), the pronunciation can be corrected at any time. As the voice is still in robotic form, Sridhar is working to make it more natural and also try to include the application in mobile phone handsets as well.

Source: Bangalore Mirror, dated 15th July 2011
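As a rough illustration of how the text-to-speech engine described in this clipping can be driven from a script, the Python sketch below builds an `espeak` command line for Kannada text wrapped in SSML. It assumes that the `espeak` program is installed and on the PATH and that the Kannada voice is selected with the language code `kn`; the helper names and the output file name are made up for the example.

```python
import subprocess
from typing import List, Optional
from xml.sax.saxutils import escape


def build_ssml(text: str) -> str:
    """Wrap plain text in a minimal SSML <speak> element, escaping &, < and >."""
    return f"<speak>{escape(text)}</speak>"


def espeak_command(text: str, voice: str = "kn",
                   wav_path: Optional[str] = None) -> List[str]:
    """Build the argument list for espeak (assumed to be installed).

    -v selects the voice, -m tells espeak the input contains SSML markup,
    -w writes the audio to a WAV file instead of playing it.
    """
    cmd = ["espeak", "-v", voice, "-m"]
    if wav_path is not None:
        cmd += ["-w", wav_path]
    cmd.append(build_ssml(text))
    return cmd


def speak(text: str, **kwargs) -> None:
    """Run espeak; raises CalledProcessError if the command fails."""
    subprocess.run(espeak_command(text, **kwargs), check=True)


if __name__ == "__main__":
    # Prints the command without running it; "hello.wav" is a placeholder name.
    print(espeak_command("\u0C95\u0CA8\u0CCD\u0CA8\u0CA1", wav_path="hello.wav"))
```

Building the argument list separately from running it keeps the sketch testable on machines where the engine is not installed.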

Beluru Sudarshana <beluru@gmail.com>

Font Development Meeting : My personal observations
Beluru Sudarshana <projectmanager@kanaja.in>

Fri, Nov 25, 2011 at 7:06 PM
To: "Prof. S Rajagopalan" <raj@iiitb.ac.in>, bharathwaasi bharathwaasi <bharathwaasi@gmail.com>, Ramesh S <yesperla@gmail.com>

Dear Prof. Rajagopalan Sir

It was indeed a positive development that we could discuss the Kanaja Font Development issues with Dr. Chidananda Gowda, Chairperson of KSDC, on 22.11.2011. The operative parts of the meeting were as follows:
1) Kanaja project will carry on the Font Development process, irrespective of other similar efforts by any other Government funded projects.
2) KKC will coordinate in organising a meeting of the stakeholders on issuing a GoK notification on Unicode Font Development.

These are quite good decisions. Anyhow, I would like to reiterate my earlier view that since Unicode Font Development is a similar process to be undertaken by KSDC (through whichever Department is not our concern), the process should be similar and in tune with any other GoK process. Though we have funds, and an independent decision making body, I strongly feel that the process of this task should be clearly in tune with GoK norms. I am stressing this point, as the source of the funds, the task, and the service providers are one and the same. In this scenario, the process cannot be different from one another. As an active campaigner for developing Kannada in IT, I have no hidden agenda, or any ill motive in saying this. I am only submitting my personal views. I had expressed this in the meeting, just before your arrival.

Regards
Beluru Sudarshana

--
Beluru Sudarshana
Consultant Coordinator, IIIT-B
'Kanaja' Internet Kannada Knowledge Base, www.kanaja.in (a project of the Karnataka Knowledge Commission)
Address: International Institute of Information Technology Bangalore, No. 26/C, Electronics City, Hosur Road, Bangalore - 560100
Phone: 9741976789

Beluru Sudarshana <belurusudarshana@gmail.com>

URGENT : Request from Beluru Sudarshana
Beluru Sudarshana <beluru@mitramaadhyama.co.in>

Fri, Jan 25, 2013 at 1:29 PM

To: anand@cyberscapeindia.com

Dear Shri Anand
CYBERSCAPE MULTIMEDIA LIMITED
Bangalore

Dear Shri Anand,

Namaskara.

• Please refer to our earlier communication w.r.t. Development of Unicode Kannada Fonts, when I was the Consultant Coordinator for the Kanaja Kannada Jnanakosha Project (www.kanaja.in).
• I was invited as a special invitee to the Kannada Software Development Committee meeting, which was held to finally approve the fonts developed by Maruti Software Solutions, Hassan. I was shown the fonts on screen and also in print.
• Going through the fonts, I strongly felt that they were not developed taking the cultural heritage of Kannada fonts into consideration. Also, I could not find any technically sound font validation procedures applied to the presently developed fonts. Still, I am requesting Dr. U B Pavanaja and Dr. Ananth Koppar to provide more details on the procedures adopted for the validation / approval.

With these issues in the background, I request you to kindly consider the following requests:

• Can you still offer the skins of all the fonts (I mean glyphs) developed by CYBERSCAPE to stitch them on to UNICODE, so that there would be ready fonts in UNICODE, for FREE? If so, the fonts will be mostly under GPL v3. (You can specify the exact nature of the open domain.) I request you to kindly accept this request and share the glyphs.
• If you are accepting, I request you to kindly send me a soft copy of the catalogue of fonts.
• If you have developed any UNICODE KANNADA TOOL / SOFTWARE, will you be ready to share it with the people of Karnataka for free usage? (You can also specify the usage details.)

I request you to kindly respond to my requests at the earliest so that I can include your offers to the Kannada people in my letter, which I plan to submit on Monday, 28.1.2013.

Regards
Beluru Sudarshana

--
Beluru Sudarshana
www.mitramaadhyama.co.in
Mobile: 9741976789

DISCLAIMER: This email is intended only for those listed as recipients. This mail can be circulated only for the intended official use and shall not be sent to anybody for any other reason. Please excuse me and inform me of the lapse if I have wrongly sent the mail to you.


URGENT : Request from Beluru Sudarshana
Anand S K <anand@cyberscapeindia.com>
To: Beluru Sudarshana <beluru@mitramaadhyama.co.in>

Fri, Jan 25, 2013 at 3:40 PM

Dear Shri Sudarshana,

It is nice to know you have been invited to approve the fonts developed for the Karnataka Government. For the Kanaja Project, we had offered our font expertise and list of the non-unicode and unicode Kannada fonts already developed by us. I am attaching the soft copies of our font catalog for your information.

It is important to have a proper font validation and testing process to ensure the quality and aesthetics of the fonts since it has such an important bearing on the culture and heritage of our language. In order to uphold the cultural legacy and our glorious traditions, I would like to most humbly submit the following contributions from our company to the Open Source movement without any financial expectations:

1. Any expertise or help towards formulating a rigorous font testing and validation process.
2. Our entire library of Unicode Kannada fonts (already available).
3. Conversion of our non-unicode fonts to Unicode.
4. Our Akruti Vistaar suite of software and tools on the Windows platform consisting of Keyboard Drivers, Conversion utilities and the spelling checker framework.

We can work out the right mechanism to release these substantial intellectual property developments, which we have accumulated over decades of work in Indian language software in general and Kannada software development in particular. They could be released under an appropriate licensing system like the OFL (Open Font License), which is very similar to the GPL: it acknowledges the original contributors by giving due credit, while allowing future contributors to adapt and modify the work for emerging platforms and scenarios.

Please convey this to the Kannada Software Development Committee and ensure that all the Kannada language loving people benefit from our expertise and hard work in this area.

Regards,
Anand S.K.
Managing Director
Cyberscape Multimedia Ltd., Bangalore
Mob: 9341245270

2 attachments Knd unicode font sample.pdf 277K Akruti Kannada.pdf 349K


Requesting to share PADA freely with Kannada people through GoK
Lohith DS <lohith.ds@gmail.com>
To: Beluru Sudarshana <beluru@mitramaadhyama.co.in>

Mon, Jan 28, 2013 at 11:30 AM

Dear Beluru Sudarshana, Sir,

I agree to distribute the Pada Software to all Kannada people worldwide, through GoK. The software is already available from http://www.pada.pro for everyone. For the existing license please refer to http://www.pada.pro/licence/ It is as follows:

***********************************************************
Copyright: Copyright © 2012. Lohit D Shivamurthy, Pada Software.

Pada Software may be freely used for personal purposes, Organizations running on Donations etc. Commercial users please Contact-Us. Pada Software may be copied and distributed to others, as long as no fee is charged for that purpose. The Software (or any of its tools) may not be resold, sub-licensed, rented, transferred or otherwise made available for use. The Software may not be offered for free downloading from websites other than Pada.pro.

Disclaimer: Pada software is provided on an “AS IS” basis, without warranty of any kind. The entire risk as to the quality and performance of the Pada software is borne by you. Should the Pada software prove defective, you and not the owner assume the entire cost of any service and repair.
***********************************************************

I am not yet fully aware of the terms and conditions of the OFL license; I can go through it if required. Please feel free to ask for any further clarification(s), if required.

Thanks and regards
Lohit

New Nudi 5.0 software to be released on Jan. 29

Bangalore, Jan. 28: The Kannada Development Authority will release the new Nudi 5.0 software in Bangalore on Tuesday, Jan. 29. It includes 11 new fonts that can be converted easily to Unicode and read smoothly on mobile phones.

Reading Kannada on some websites and on mobile phones is still difficult for many people. Some websites still use antiquated fonts. Because these fonts do not follow the Unicode standard, Kannada cannot be read on all platforms and all operating systems. The Nudi 5.0 software developed by the Kannada Ganaka Parishat will remove this shortcoming, says Mukhyamantri Chandru, chairman of the Kannada Development Authority.

At a press conference held on Saturday, he explained that Chief Minister Jagadish Shettar will release the new software at a function at Vidhana Soudha on Tuesday. With this software, old fonts can be converted to Unicode. It also has many other new features. Chandru said the software has been developed with the main aim of putting Kannada to use in government work.

After the release, the new software can be downloaded free of charge from the official website of the Government of Karnataka.

Read more at: http://kannada.oneindia.in/news/2013/01/28/districts-kda-releasenew-software-nudi-5-unicode-071141.html


Validation of Kannada Unicode fonts developed by Maruti Software Solutions: Request for more information
Pavanaja U B <pavanaja@gmail.com>
To: Beluru Sudarshana <beluru@mitramaadhyama.co.in>

Tue, Jan 29, 2013 at 9:51 AM

Cc: "Anant R. Koppar" <Anant.Koppar@ktwo.co.in>

Dear Sudarshan,

Some of the major metrics considered:

• Glyphs
  o It should have all the Kannada characters as specified in Unicode
  o It should be able to display all possible combinations of consonants, vowels and vowel signs for Kannada
  o There should be proper spacings in all fronts
  o Should have aesthetic look and feel – all the shapes were finalised by Dr Chandrashekhara Kambara. As he is a Jnanapitha Awardee and an authority on Kannada language, I did not interfere in his decision
  o Combination glyphs for many special cases should be present
  o All curves should be properly closed. There should not be any open curves/lines – this can be checked using a font editing tool like FontLab
• OTF features
  o All important OTF definitions needed for Indic must be present
  o OTF features like GSUB, GPOS, BASE, GDEF should be present and properly depict Kannada
  o Since the majority of users use Windows, the font should have TTF outlines with OTLS tables added to them
  o The font should work in all applications which support Indic Unicode OTF, like Notepad, MS Office, Paint, browsers, DTP packages which support Indic, etc.
  o Language should be specified as Kannada

All the drawings are with Maruthi Thamthrasmha. I don’t have a license for Fontlab. They developed the fonts using Fontlab. Hence all checking and demos can be done only in their presence. They are coming to Kannada Bhavana this afternoon. I request you to come there at around 3:30 so that you can go through all the drawings and Fontlab.

Thanks and regards,
Pavanaja
Dr. Pavanaja U.B.
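One of the checks listed above, the presence of the GSUB, GPOS, GDEF and BASE tables, can be made without a font editor. The Python sketch below reads the table directory at the start of a single TrueType/OpenType font file and reports which of those tables are absent. It is only a presence check under stated assumptions: it does not validate the contents of any table, it does not handle font collections (.ttc), and the function names and the file name in the usage line are hypothetical.

```python
import struct
from typing import List, Sequence

# Layout tables named in the checklist above.
REQUIRED_LAYOUT_TABLES = ("GSUB", "GPOS", "GDEF", "BASE")


def list_sfnt_tables(data: bytes) -> List[str]:
    """Return the table tags from the directory of a single sfnt font.

    Layout per the OpenType table directory: a 12-byte header
    (sfntVersion, numTables, searchRange, entrySelector, rangeShift)
    followed by numTables records of 16 bytes each
    (tag, checksum, offset, length), all big-endian.
    """
    if len(data) < 12:
        raise ValueError("file too short to contain a font header")
    _version, num_tables = struct.unpack(">4sH", data[:6])
    tags = []
    for index in range(num_tables):
        start = 12 + 16 * index
        record = data[start:start + 16]
        if len(record) < 16:
            raise ValueError("truncated table directory")
        tag, _checksum, _offset, _length = struct.unpack(">4sIII", record)
        tags.append(tag.decode("latin-1"))
    return tags


def missing_layout_tables(
    data: bytes, required: Sequence[str] = REQUIRED_LAYOUT_TABLES
) -> List[str]:
    """List the required tables that the font's directory does not mention."""
    present = set(list_sfnt_tables(data))
    return [tag for tag in required if tag not in present]


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python check_tables.py SomeKannadaFont.ttf
    with open(sys.argv[1], "rb") as handle:
        print("missing:", missing_layout_tables(handle.read()))
```

An empty result only shows that the tables exist. Whether they shape Kannada text correctly still has to be tested by rendering real text in the target applications.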

From: Anant R. Koppar [mailto:Anant.Koppar@ktwo.co.in]
Sent: 28 January 2013 10:33
To: Beluru Sudarshana; Anant R. Koppar; Pavanaja Bellippady
Subject: RE: Validation of Kannada Unicode fonts developed by Maruti Software Solutions: Request for more information

Dear Sri. Sudarshan,

Thanks for your email and feedback. I would request Dr. Pavanaja to discuss with you and provide you with some details.

Warm Regards,
Anant R Koppar, Ph D | Chairman & CEO | KTwo Technology Solutions | A Last Metre ConnectivityTM Solutions Company | K-Plex, #376, 1st Main, Sri Jagajyoti Layout, Nagadevanahalli, Jnanabharati Post, Bangalore - 560 056, India.

From: belurusudarshana@gmail.com [mailto:belurusudarshana@gmail.com]
Sent: Sunday, January 27, 2013 1:52 PM
To: Anant R. Koppar; Pavanaja Bellippady
Subject: Validation of Kannada Unicode fonts developed by Maruti Software Solutions: Request for more information

��ಯ �ಾ|| ಅನಂತ �ೊಪ�ರ್ ಮತು� �ಾ|| ಯು � ಪವನಜ��ೆ ನಮ�ಾ�ರಗಳ�. ಕನ�ಡ ತಂ�ಾ�ಂಶ ಅ�ವೃ�� ಸ��ಯ ಸ�ೆಯ�� (��ಾಂಕ ೧೭.೧.೨೦೧೩) ಇ�ಾ�ೆಯ ಆ�ಾ�ನದಂ�ೆ �ಾಗವ��ದ ನನ�ೆ ನನ� ಅ��ಾ�ಯಗಳನು� ವ�ಕ�ಪ�ಸಲು ಅವ�ಾಶ �ೕ�ದ��ಾ�� ತಮ�ೆ ವಂದ�ೆಗಳ�. ಸ�ೆಯ �ೊ�ೆಯ�� ಾವ� - �ಾ|| �ೊಪ�ರ್) ) - �ಾವ� ರೂ��ರುವ ಕಡತವನು� ಗಮ�ಸಲು ��. ಆದ�ೆ �ಾನು ನಂತರ ಕನ�ಡ ಮತು� ಸಂಸ�� ಕ�ೇ�ಯ�� ಾ��ಾಗ, �ಾ�ಬ�ರೂ ಸ� �ಾ�ರುವ �ಾಂಟ್ ಅನು�ೕದ�ಾ �ಾಗದಪತ�ವನು� �ೋ��ರು�ಾ��ೆ. ಈ ಅನು�ೕದ�ೆಯನು� �ಾವ� �ಾವ �ಾನದಂಡಗಳ ಅನುಗುಣ�ಾ� �ಾ��ೕ� ಎಂದು ��ದ�ೆ ತುಂ�ಾ ಉಪಯುಕ��ಾಗುತ��ೆ. ಏ�ೆಂದ�ೆ, �ಾನು ಇ�ೕ ಪ��ಯ �ೊ�ೆಯ ಹಂತದ�� ಾಗವ��ದ�ರೂ, ನನ�ೆ ಈ ಪ��ಯು ಕನ�ಡದ ಅಕಷ್ರಗಳ (ಮುಖ��ಾ� �ೈಪ್�ೆ�ಂಗ್) �ಾಂಸ��ಕ ಚಹ�ೆಗಳನು� ಗಮ��ಲ� �ಾಗೂ �ಾಂಟ್ಗಳನು� �ಾ�ಪಕ�ಾ� ಬಳ� ಪ�ೕ�� ೋ�ಲ� ಎಂಬ �ಾವ ಬಂ��ೆ. ಇದು ತಮ� �ೕ�ೆ �ಾವ��ೇ ಅನು�ಾನಗಳ�ಾ�ಗ�ೕ ವ�ಕ�ಪ�ಸಲು ಅಲ�; ಆದ�ೆ ಈ ಬ�ೆ� �ೆ��ನ �ಾ�� ಯುವ ಪ�ಯತ��ಾ��ೆ. ಆದ��ಂದ �ಾವ� ದಯ�ಾ� �ಾಂಟ್ ಅನು�ೕದ�ೆ ಪ��ಯನು� �ೇ�ೆ �ೈ�ೊಂ�� ಮತು� �ಾ�ಾ�ವ �ಾನದಂಡಗಳ ಆ�ಾರದ�� ಈ �ಾಂಟ್ಗಳನು� ಅನು�ೕ�� ಎಂದು ಸಂ�ಪ��ಾ� ಆದ�ೆ ಹಂತಹಂತದ ಕ�ಮಗಳನು� �ವ�� ಸ�ೇ�ಾ� �ನಂ��ೊಳ��ೆ�ೕ�ೆ. ��ಾ�ಸ�ಂದ �ೇಳ�ರು ಸುದಶ�ನ -�ೇಳ�ರು ಸುದಶ�ನ

www.mitramaadhyama.co.in ಕರ�ಾ�: ೯೭೪೧೯೭೬೭೮೯

KINIGE TEST KIT
Test Telugu Unicode Font

Version draft 0.5

Kinige Digital Technologies Pvt. Ltd (http://kinige.com)

TABLE OF CONTENTS

Goals and Non goals
Installing the new font
  Installing on Windows - 1
  Installing on Windows - 2
  Installing font on Windows - 3
  Installing font on Linux
Testing the New Font
  Glyphs to be tested for a new Telugu font can be divided into following categories.
  అక్షరాలు (axaralu)
  గుణింతాలు (gunintalu)
  ద్విత్వ అక్షరాలు (dvitva axaralu)
  సంయుక్త అక్షరాలు (Samyukta axaralu)
  సంశ్లేష అక్షరాలు (Samslesha axaralu)
  Klishta padalu (క్లిష్ట పదాలు)
  In combination of other scripts.
  Punctuation Marks and Special Characters
Testing with different applications
  Checking with MS Word or Wordpad
  Checking with Notepad
  Testing with Internet Explorer
  Open given sample HTML file.
  Testing with Firefox
  Testing on Chrome Browser
  Testing on Linux Gedit
Documenting font quality

GOALS AND NON GOALS

Goals: Kinige Test Kit (KTK) is aimed at helping new Telugu Unicode font developers test their fonts and make sure things are in good shape. It provides sample files for Notepad, Word and browsers, and thus saves font developers time in testing. By enumerating the minimum required Telugu letter combinations and by listing the glyphs that commonly go wrong, KTK aims to improve the quality of fonts before their first public release.

Non goals: KTK is not aimed at teaching how to develop new Telugu Unicode fonts.

Disclaimer: Kinige believes KTK will help font developers, but does not guarantee zero bugs after a font passes all the scenarios mentioned in this kit.

Community: http://groups.google.com/group/kinigetk
Using the above group, one can participate in improving, discussing and reviewing this document.

License:

Kinige Test Kit by Kinige Digital Technologies Pvt. Ltd. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Based on a work at kinige.com.

Examples: The examples in this document use the Vajram.ttf font, which can be downloaded from http://kinige.com/fonts/vajram/

Latest Version: The latest version of this document can be found at http://kinige.com/fonts/ktk

INSTALLING THE NEW FONT

First things first. Let us see how to install a new font on Windows and Linux. Make sure you have downloaded the new font onto the test machine.

INSTALLING ON WINDOWS - 1

Install the new font (for example vajram.ttf).

INSTALLING ON WINDOWS - 2

Double-click the font file (vajram.ttf). You will see an Install button at the top left. Click it to install the font.

INSTALLING FONT ON WINDOWS - 3

Go to the folder where the font was copied on your machine. Right-click the font file and choose the Install option.

The new font is now installed on the machine.

Note: If the font already exists, it will ask you to replace the old one. Click yes to continue installation.

INSTALLING FONT ON LINUX

Simply double-click the font file and click the Install Font button at the bottom-right corner, as shown in the figure below.

TESTING THE NEW FONT

Glyphs to be tested for a new Telugu font can be divided into the following categories.

1. Telugu alphabets (తెలుగు అక్షరాలు)
2. Gunintalu (గుణింతాలు)
3. Dvitvaksharalu (ద్విత్వాక్షరాలు)
4. Samyuktaksharalu (సంయుక్తాక్షరాలు)
5. Samshleshaksharalu (సంశ్లేషాక్షరాలు)
6. Klishta padaalu (క్లిష్ట పదాలు)
7. Combine Telugu, English, Hindi, Bangla in different positions
8. Punctuation marks and special characters
9. Combining Telugu glyphs with RTL (right-to-left) glyphs (e.g. Arabic, Urdu, Hebrew)

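అక్షరాలు (AXARALU)

అ ఆ ఇ ఈ ఉ ఊ ఋ ౠ ఎ ఏ ఐ ఒ ఓ ఔ అం అః
క ఖ గ ఘ ఙ చ ఛ జ ఝ ఞ ట ఠ డ ఢ ణ త థ ద ధ న ప ఫ బ భ మ య ర ల వ శ ష స హ ళ ఱ

గుణింతాలు (GUNINTALU)

క కా కి కీ కు కూ కృ కౄ కె కే కై కొ కో కౌ కం కః
ఖ ఖా ఖి ఖీ ఖు ఖూ ఖృ ఖౄ ఖె ఖే ఖై ఖొ ఖో ఖౌ ఖం ఖః
గ గా గి గీ గు గూ గృ గౄ గె గే గై గొ గో గౌ గం గః
ఘ ఘా ఘి ఘీ ఘు ఘూ ఘృ ఘౄ ఘె ఘే ఘై ఘొ ఘో ఘౌ ఘం ఘః
చ చా చి చీ చు చూ చృ చౄ చె చే చై చొ చో చౌ చం చః
ఛ ఛా ఛి ఛీ ఛు ఛూ ఛృ ఛౄ ఛె ఛే ఛై ఛొ ఛో ఛౌ ఛం ఛః
జ జా జి జీ జు జూ జృ జౄ జె జే జై జొ జో జౌ జం జః
ఝ ఝా ఝి ఝీ ఝు ఝూ ఝృ ఝౄ ఝె ఝే ఝై ఝొ ఝో ఝౌ ఝం ఝః
ట టా టి టీ టు టూ టృ టౄ టె టే టై టొ టో టౌ టం టః
ఠ ఠా ఠి ఠీ ఠు ఠూ ఠృ ఠౄ ఠె ఠే ఠై ఠొ ఠో ఠౌ ఠం ఠః
డ డా డి డీ డు డూ డృ డౄ డె డే డై డొ డో డౌ డం డః
ఢ ఢా ఢి ఢీ ఢు ఢూ ఢృ ఢౄ ఢె ఢే ఢై ఢొ ఢో ఢౌ ఢం ఢః
ణ ణా ణి ణీ ణు ణూ ణృ ణౄ ణె ణే ణై ణొ ణో ణౌ ణం ణః
త తా తి తీ తు తూ తృ తౄ తె తే తై తొ తో తౌ తం తః
థ థా థి థీ థు థూ థృ థౄ థె థే థై థొ థో థౌ థం థః
ద దా ది దీ దు దూ దృ దౄ దె దే దై దొ దో దౌ దం దః
ధ ధా ధి ధీ ధు ధూ ధృ ధౄ ధె ధే ధై ధొ ధో ధౌ ధం ధః
న నా ని నీ ను నూ నృ నౄ నె నే నై నొ నో నౌ నం నః
ప పా పి పీ పు పూ పృ పౄ పె పే పై పొ పో పౌ పం పః
ఫ ఫా ఫి ఫీ ఫు ఫూ ఫృ ఫౄ ఫె ఫే ఫై ఫొ ఫో ఫౌ ఫం ఫః
బ బా బి బీ బు బూ బృ బౄ బె బే బై బొ బో బౌ బం బః
భ భా భి భీ భు భూ భృ భౄ భె భే భై భొ భో భౌ భం భః
మ మా మి మీ ము మూ మృ మౄ మె మే మై మొ మో మౌ మం మః
య యా యి యీ యు యూ యృ యౄ యె యే యై యొ యో యౌ యం యః
ర రా రి రీ రు రూ రృ రౄ రె రే రై రొ రో రౌ రం రః
ల లా లి లీ లు లూ లృ లౄ లె లే లై లొ లో లౌ లం లః
వ వా వి వీ వు వూ వృ వౄ వె వే వై వొ వో వౌ వం వః
శ శా శి శీ శు శూ శృ శౄ శె శే శై శొ శో శౌ శం శః
ష షా షి షీ షు షూ షృ షౄ షె షే షై షొ షో షౌ షం షః
స సా సి సీ సు సూ సృ సౄ సె సే సై సొ సో సౌ సం సః
హ హా హి హీ హు హూ హృ హౄ హె హే హై హొ హో హౌ హం హః

[The original lists one further row whose base letter did not survive extraction.]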
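Each gunintam row is a consonant followed, in turn, by every dependent vowel sign plus the anusvara and visarga, so the rows can be generated instead of typed. A minimal sketch, assuming the 16-form row layout used in this section (the function name is hypothetical and not part of the kit):

```python
# Dependent signs in gunintam order. The empty string is the bare consonant;
# U+0C02 is the anusvara and U+0C03 the visarga.
VOWEL_SIGNS = [
    "",        # a
    "\u0C3E",  # aa
    "\u0C3F",  # i
    "\u0C40",  # ii
    "\u0C41",  # u
    "\u0C42",  # uu
    "\u0C43",  # vocalic r
    "\u0C44",  # vocalic rr
    "\u0C46",  # e
    "\u0C47",  # ee
    "\u0C48",  # ai
    "\u0C4A",  # o
    "\u0C4B",  # oo
    "\u0C4C",  # au
    "\u0C02",  # anusvara
    "\u0C03",  # visarga
]


def gunintam(consonant):
    """Return the 16 gunintam forms of a single Telugu consonant."""
    return [consonant + sign for sign in VOWEL_SIGNS]


if __name__ == "__main__":
    print(" ".join(gunintam("\u0C15")))  # the row for KA
```

Generating the rows this way guards against typing slips in a hand-built sample file, but the output still has to be inspected visually in the font under test.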

ద్విత్వ అక్షరాలు (DVITVA AXARALU)

A consonant that carries its own subscript (vattu) form, as in క్క.

[The sample word list is not recoverable: the subscript forms were lost in extraction.]

సంయుక్త అక్షరాలు (SAMYUKTA AXARALU)

A consonant that carries the subscript (vattu) form of a different consonant, as in క్త.

[The sample word list is not recoverable: the subscript forms were lost in extraction.]

సంశ్లేష అక్షరాలు (SAMSLESHA AXARALU)

A consonant that carries two subscript (vattu) forms, as in స్త్ర.

[The sample word list is not recoverable: the subscript forms were lost in extraction.]
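In Unicode, all three kinds of conjunct are encoded the same way: the consonants are joined by the Telugu virama (U+0C4D), and the font's layout rules turn the later consonants into subscript forms. The sketch below, with hypothetical function names, enumerates test strings on that basis. It is exhaustive rather than linguistic, so many of the generated pairs never occur in real words; it is meant for spotting rendering failures, not as a replacement for the kit's word lists.

```python
VIRAMA = "\u0C4D"  # TELUGU SIGN VIRAMA

# Telugu consonants KA..HA. U+0C29 is unassigned, and U+0C34 (LLLA) is left
# out here because it is not part of the alphabet listed in this kit.
CONSONANTS = [
    chr(cp) for cp in range(0x0C15, 0x0C3A) if cp not in (0x0C29, 0x0C34)
]


def conjunct(*consonants):
    """Join consonants with virama: a base letter followed by subscript forms."""
    return VIRAMA.join(consonants)


def dvitva_forms():
    """Every consonant doubled with itself."""
    return [conjunct(c, c) for c in CONSONANTS]


def samyukta_forms():
    """Every ordered pair of two different consonants."""
    return [conjunct(a, b) for a in CONSONANTS for b in CONSONANTS if a != b]


if __name__ == "__main__":
    print(" ".join(dvitva_forms()))
    # A three-consonant (samslesha) example: SA + TA + RA.
    print(conjunct("\u0C38", "\u0C24", "\u0C30"))
```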

KLISHTA PADALU (క్లిష్ట పదాలు)

అంతఃపురం, ఉషఃకిరణాలు, వాఙ్మయం, యజ్ఞం, ప్రాతఃకాలం, జ్ఞానం, ప్రజ్ఞానం, దుఃఖం, ఆజ్ఞ, అజ్ఞాతం, తపఃఫలం, జ్ఞాపకం

IN COMBINATION WITH OTHER SCRIPTS

सदस्य को टिप्पणी करने का अवसर मिले। All is well. سب ٹھیک ہے সব ভাল হয় كل شيء حسنا

[The Telugu text interleaved with this sample was lost in extraction.]

PUNCTUATION MARKS AND SPECIAL CHARACTERS

! @ # $ % ^ & * () _ + = / \ ‘ “ ? “ ‘

TESTING WITH DIFFERENT APPLICATIONS

CHECKING WITH MS WORD OR WORDPAD

1. Open “sample.docx”.
2. Select all the text in the document (Ctrl+A should do).
3. Change the font to the new font (see the image below).
4. Note: If you don't see your font in the list, it is not installed. Install it again.
5. Verify that each and every glyph in this document looks good.
6. Verify that each and every glyph looks as expected, that is, that no glyph is being drawn from a default fallback font.

CHECKING WITH NOTEPAD

Open “SampleTestPages.txt” in Notepad.

Change the default font in Notepad (as shown in the image below). First go to Format > Fonts.

Select Fonts and you will get the following screen.

Select the new font from the menu and click OK.

The font is now changed. Verify that all glyphs shown in the sample file are displayed correctly, and that they come from the new font, not from a default fallback font.
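Spotting fallback glyphs by eye is error-prone, so part of this check can be done mechanically: any character in the sample file that the new font does not map at all is guaranteed to be drawn from a fallback font. A minimal sketch, assuming the set of code points the font maps has already been read by some font tool (the function name is hypothetical):

```python
def uncovered_characters(text, font_codepoints):
    """Return the characters of `text` that the font does not map.

    text: contents of the sample file.
    font_codepoints: set of integer code points present in the font.
    Whitespace is ignored; each character is reported once, in the order
    it first appears.
    """
    missing = []
    for ch in text:
        if ch.isspace():
            continue
        if ord(ch) not in font_codepoints and ch not in missing:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    # Hypothetical font that maps only KA (U+0C15) and the AA sign (U+0C3E).
    font = {0x0C15, 0x0C3E}
    print(uncovered_characters("\u0C15\u0C3E \u0C17", font))
```

An empty result only shows that every character has some glyph in the font. Conjuncts and vowel-sign combinations can still render incorrectly, so the visual checks in this section remain necessary.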

TESTING WITH INTERNET EXPLORER

Open the given sample HTML file.

To change the font in Internet Explorer, go to the Tools menu at the top right corner of Internet Explorer.

Choose “Internet Options”.

Select the Fonts button.

In the Fonts dialog, change the Language script dropdown from Latin based to Telugu.

Then change the font from the default font to the new font, and verify that all the glyphs given in the sample file are displayed correctly.

TESTING WITH FIREFOX

Open the sample HTML file in Firefox and change the font to the new font by following the instructions below.

Go to the Tools menu, select Options, then choose the Content icon. There you will find that the default font is Gautami, with font size 16.

Change the font to the new font. The screen will then appear as shown below.

TESTING ON CHROME BROWSER

Changing the font in Chrome:

Go to Options > Under the Hood and select “Customize font”.

By default you will see the Gautami font, with font size 16.

Change the font from Gautami to the new font. Change the encoding to UTF-8.

Now refresh the screen to view the Telugu alphabets in Chrome in the new font. Open the sample.html file and verify that all glyphs are displayed correctly.
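The browser tests all depend on a sample HTML page saved as UTF-8. As an alternative to changing each browser's default font, the page itself can request the font under test through CSS. The sketch below builds such a page; it is an illustration rather than the kit's own sample.html, and the family name "Vajram" in the usage example is an assumption based on the file name vajram.ttf.

```python
import html


def build_sample_html(font_family, sample_text):
    """Return a UTF-8 test page that asks the browser for `font_family`.

    The family name goes into a CSS string, so names containing quotes,
    angle brackets or backslashes are rejected instead of being escaped.
    """
    if any(ch in font_family for ch in '"<>\\'):
        raise ValueError("unsupported character in font family name")
    body = html.escape(sample_text)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="te">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        "<title>Font test page</title>\n"
        f'<style>body {{ font-family: "{font_family}"; font-size: 16px; }}</style>\n'
        "</head>\n"
        f"<body><p>{body}</p></body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    page = build_sample_html("Vajram", "\u0C15 \u0C15\u0C3E \u0C15\u0C3F")
    with open("sample.html", "w", encoding="utf-8") as fh:
        fh.write(page)
```

A browser that cannot find the requested family silently falls back to another font, so this approach does not remove the need to confirm that the glyphs shown really come from the new font.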

TESTING ON LINUX GEDIT

To change the font in gedit, click on Preferences:

Click on the Fonts and Colors tab.

Set the editor font to the new font:

That's all! Now verify that the sample.txt file looks good in gedit.

DOCUMENTING FONT QUALITY

Use the given MegaCheckList.docx (or pdf) file in KTK to document the quality of your font.

KANNADA

AkrutiKndAnanthaNormal
AkrutiKndManiDemi
AkrutiKndRadheBold
AkrutiKndKapilaNormal
AkrutiKndPunarvasu

KANNADA - UNICODE

AkUKndCML01Normal, AkUKndCML02Normal, AkUKndCML03Normal, AkUKndCML04Normal, AkUKndCML05Normal
AkUKndCML06Normal, AkUKndCML07Normal, AkUKndCML08Normal, AkUKndCML09Normal, AkUKndCML10Normal
AkUKndCML11Normal, AkUKndCML12Normal, AkUKndCML13Normal, AkUKndCML14Normal, AkUKndCML15Normal
AkUKndCML16Normal, AkUKndCML17Normal, AkUKndCML18Normal, AkUKndCML19Normal, AkUKndCML20Normal
AkUKndCML21Normal, AkUKndCML22Normal, AkUKndCML23Normal, AkUKndCML24Normal, AkUKndCML25Normal
AkUKndCML26Normal, AkUKndCML27Normal, AkUKndCML28Normal, AkUKndCML29Normal, AkUKndCML30Normal
AkUKndCML31Normal, AkUKndCML32Normal, AkUKndCML33Normal, AkUKndCML34Normal, AkUKndCML35Normal
AkUKndCML36Normal, AkUKndCML37Normal, AkUKndCML38Normal, AkUKndCML39Normal, AkUKndCML40Normal
AkUKndCML41Normal, AkUKndCML42Normal, AkUKndCML43Normal, AkUKndCML44Normal, AkUKndCML45Normal
AkUKndCML46Normal, AkUKndCML47Normal, AkUKndCML48Normal, AkUKndCML49Normal, AkUKndCML50Normal