Linux Learning: Python's Unicode HOWTO

https://docs.python.org/3/howto/unicode.html
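The original ASCII code covered only the values 0 to 127 (7 bits), which left no room for accented letters or other special characters.
Personal computers in the 1980s were almost all 8-bit machines, so they could handle ASCII and still had the values 128 to 255 to spare, but different machines assigned different characters to those values.
Unicode originally used 16 bits in place of the old 8, giving 2^16 (65,536) values, with the goal of covering every language. That turned out not to be enough, so the code space was later extended to 0 through 1,114,111 (0x10FFFF in base 16).
For each character, Unicode defines a code point, an integer value that identifies the character. It is written in the form U+12CA, which denotes the character whose value is 0x12ca.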
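
As a small illustration of the code point notation and the upper limit of the code space, here is a minimal sketch. The helper name u_notation and the constant MAX_CODE_POINT are made up for this example; only ord() and chr() are Python built-ins.

```python
def u_notation(ch: str) -> str:
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(u_notation("\u12ca"))   # the character whose code point is 0x12ca
print(u_notation("A"))

# The largest valid code point is 0x10FFFF (1,114,111).
MAX_CODE_POINT = 0x10FFFF
last_char = chr(MAX_CODE_POINT)        # still inside the Unicode range

try:
    chr(MAX_CODE_POINT + 1)            # one past the end of the range
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(out_of_range_rejected)
```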
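An encoding is the set of rules for turning a Unicode string into a sequence of bytes. UTF-8 is one of the most common encodings: UTF stands for Unicode Transformation Format, and the 8 means that 8-bit values are used in the encoding. The rules are: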

If the code point is < 128, it’s represented by the corresponding byte value.
If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
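
The two rules above can be checked with a short sketch. The sample characters are arbitrary picks for this example, one for each possible UTF-8 length.

```python
# One sample character for each UTF-8 length (1 to 4 bytes).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {hex_bytes}")

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# For code points >= 128, every byte of the sequence is in the range 128..255.
multibyte_all_high = all(
    b >= 128 for ch in samples[1:] for b in ch.encode("utf-8")
)
```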

Python's default source-code encoding is UTF-8. It can be changed with a special comment on the first or second line of the file:

# -*- coding: <encoding name> -*-

Python supports Unicode characters in identifiers such as variable names.
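
A minimal sketch of non-ASCII identifiers, assuming the file is saved as UTF-8. The variable names are invented for this example.

```python
# Non-ASCII letters are allowed in identifiers.
半徑 = 2.0          # "radius"
π = 3.14159
面積 = π * 半徑 ** 2  # "area"

print(面積)
```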
To keep Python source code ASCII-only, use the escape sequences \u, \U, or \N{character name} in string literals:

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'

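In addition, the bytes method decode() produces a Unicode string (str). It takes an encoding argument and an errors argument that selects how invalid input is handled (see the link at the top).
Python 3.2 comes with roughly 100 different encodings.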
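
To make the errors argument of decode() concrete, here is a short sketch. The input bytes are a made-up example of invalid UTF-8; the handler names "strict", "replace", "ignore", and "backslashreplace" are standard ones.

```python
raw = b"\x80abc"   # 0x80 cannot start a UTF-8 sequence

try:
    raw.decode("utf-8")            # errors="strict" is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Other handlers return a string instead of raising.
replaced = raw.decode("utf-8", errors="replace")           # U+FFFD in place of the bad byte
ignored = raw.decode("utf-8", errors="ignore")             # bad byte dropped
escaped = raw.decode("utf-8", errors="backslashreplace")   # bad byte shown as an escape

print(strict_failed, ascii(replaced), ascii(ignored), ascii(escaped))
```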
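For single characters, chr(int) returns the one-character Unicode string for a code point, and ord(str) returns the code point of a one-character string.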
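
A quick sketch of the chr()/ord() pair, using the Greek capital delta from the escape examples. The lookup through unicodedata.name() is an extra added here for illustration.

```python
import unicodedata

delta = chr(0x0394)                  # code point -> one-character string
code_point = ord(delta)              # one-character string -> code point
char_name = unicodedata.name(delta)  # official Unicode character name

print(hex(code_point), char_name)
```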
To convert between strings and byte sequences, use bytes.decode() and str.encode().
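
The round trip between str and bytes might look like the sketch below. The sample string is arbitrary, and the ASCII calls are only there to show what happens when a character has no representation in the target encoding.

```python
u = chr(40960) + "abcd" + chr(1972)   # two non-ASCII characters around "abcd"

utf8_bytes = u.encode("utf-8")            # str -> bytes
round_trip = utf8_bytes.decode("utf-8")   # bytes -> str

try:
    u.encode("ascii")                     # strict by default
    ascii_strict_failed = False
except UnicodeEncodeError:
    ascii_strict_failed = True

ascii_ignored = u.encode("ascii", errors="ignore")
ascii_replaced = u.encode("ascii", errors="replace")
ascii_xml = u.encode("ascii", errors="xmlcharrefreplace")

print(utf8_bytes, ascii_ignored, ascii_replaced, ascii_xml)
```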

The most important tip is:

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
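
One way to picture this tip is a function that decodes at the boundary, works on str, and encodes on the way out. This is only a sketch: the function name shout and the sample input are invented for the example.

```python
def shout(raw_input: bytes, encoding: str = "utf-8") -> bytes:
    """Upper-case some encoded text, working on str internally."""
    text = raw_input.decode(encoding)   # 1. decode as soon as the data arrives
    result = text.upper()               # 2. all processing happens on str
    return result.encode(encoding)      # 3. encode only at the very end


data = "caf\u00e9".encode("utf-8")

via_str = shout(data)
via_bytes = data.upper()   # bytes.upper() only touches ASCII letters

print(via_str, via_bytes)
```

Working on the raw bytes leaves the accented letter in lower case, because the byte-level method knows nothing about the encoding; the str version upper-cases it correctly.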

Wednesday, December 20, 2017
