Unicode in Flask
================

Flask, like Jinja2 and Werkzeug, is totally Unicode based when it comes to
text.  The same is true of the majority of web-related Python libraries
that deal with text.  If you are not yet familiar with Unicode, you should
probably read `The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets
<http://www.joelonsoftware.com/articles/Unicode.html>`_.  This part of the
documentation only covers the very basics so that you have a pleasant
experience with Unicode-related things.
								
Automatic Conversion
--------------------

Flask makes a few assumptions about your application (which you can change,
of course) that give you basic and painless Unicode support:
								
-   the encoding for text on your website is UTF-8
-   internally you will always use Unicode exclusively for text, except
    for literal strings that contain only ASCII code points.
-   encoding and decoding happen whenever you are talking over a protocol
    that requires bytes to be transmitted.
								
So what does this mean to you?
								
HTTP is based on bytes.  This is true not only of the protocol itself but
also of the system used to address documents on servers (so-called URIs or
URLs).  However, HTML, which is usually transmitted on top of HTTP,
supports a large variety of character sets, and which one is used is
transmitted in an HTTP header.  To keep this from becoming too complex,
Flask simply assumes that if you are sending Unicode out you want it to be
UTF-8 encoded.  Flask will do the encoding and set the appropriate headers
for you.
						
					
						
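What that automatic step amounts to can be sketched with the standard
library alone.  The helper ``build_response`` below is a hypothetical name
used for illustration and is not Flask's or Werkzeug's actual
implementation; it only shows the idea of encoding Unicode text to UTF-8
bytes and announcing the charset in the ``Content-Type`` header.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only: build_response is a hypothetical helper,
# not part of Flask or Werkzeug.

def build_response(text, charset='utf-8'):
    """Encode Unicode text and return (headers, body_bytes)."""
    body = text.encode(charset)
    headers = [
        # The charset parameter tells the client how to decode the body.
        ('Content-Type', 'text/html; charset=' + charset),
        # Content-Length counts bytes, not characters.
        ('Content-Length', str(len(body))),
    ]
    return headers, body


headers, body = build_response(u'Hänsel und Gretel')
```

Note that the length header is computed from the encoded bytes, which is
one reason the encoding has to happen before the headers are finalized.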
							
							
								
The same is true if you are talking to databases with the help of
SQLAlchemy or a similar ORM system.  Some databases have a protocol that
already transmits Unicode, and if they do not, SQLAlchemy or your other ORM
should take care of the conversion.
								
The Golden Rule
---------------

So the rule of thumb: if you are not dealing with binary data, work with
Unicode.  What does working with Unicode in Python 2.x mean?
								
-   as long as you are using ASCII code points only (basically numbers,
    some special characters, and Latin letters without umlauts or anything
    fancy) you can use regular string literals (``'Hello World'``).
-   if you need anything other than ASCII in a string, you have to mark
    this string as a Unicode string by prefixing it with a lowercase ``u``
    (like ``u'Hänsel und Gretel'``).
-   if you are using non-ASCII characters in your Python files, you have
    to tell Python which encoding your file uses.  Again, I recommend
    UTF-8 for this purpose.  To tell the interpreter your encoding, you can
    put ``# -*- coding: utf-8 -*-`` into the first or second line of
    your Python source file.
-   Jinja is configured to decode the template files from UTF-8.  So make
    sure to tell your editor to save the file as UTF-8 there as well.
						
					
						
							
								
									
										
										
										
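The rules above can be put together in a small source file.  This is a
minimal sketch; the variable names are made up for illustration, and the
``u`` prefix is also accepted by Python 3.3 and later, where it has no
effect because all string literals are already Unicode.

```python
# -*- coding: utf-8 -*-
# The line above tells Python 2 that this source file is UTF-8 encoded.
# Python 3 already assumes UTF-8, so the cookie is harmless there.

ascii_only = 'Hello World'        # plain literal: ASCII code points only
text = u'Hänsel und Gretel'       # u prefix: marked as Unicode text

# Encoding turns text into bytes; "ä" takes two bytes in UTF-8.
data = text.encode('utf-8')

print(len(text))   # number of characters
print(len(data))   # number of bytes
```

The two lengths differ by one (17 characters, 18 bytes), which is a quick
way to see that text and its encoded form are not the same thing.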
							
							
								
Encoding and Decoding Yourself
------------------------------

If you are talking with a filesystem or something that is not really based
on Unicode, you will have to make sure that you decode properly when working
with a Unicode interface.  For example, if you want to load a file from the
filesystem and embed it into a Jinja2 template, you will have to decode it
from the encoding of that file.  Here the old problem that text files do
not specify their encoding comes into play.  So do yourself a favour and
limit yourself to UTF-8 for text files as well.
						
					
						
							
								
									
										
										
										
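To see why the missing encoding information matters, here is a small
sketch of what can go wrong when bytes are decoded with the wrong charset.
The scenario (a file saved as Latin-1 by some other tool) and the variable
names are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
text = u'Hänsel und Gretel'

# Pretend these bytes came from a file saved as Latin-1 by another tool.
latin1_bytes = text.encode('latin-1')

try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # The Latin-1 byte for "ä" (0xE4) followed by "n" is not valid UTF-8.
    decoded_ok = False

# Decoding with the charset that was really used recovers the text.
recovered = latin1_bytes.decode('latin-1')
```

An exception is actually the friendlier outcome.  Decoding UTF-8 bytes as
Latin-1 does not fail at all and silently produces garbled characters
instead, which is another reason to stick to one encoding everywhere.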
							
							
								
To load such a file as Unicode, you can use the built-in
:meth:`str.decode` method::

    def read_file(filename, charset='utf-8'):
        # Open in binary mode so the raw bytes arrive untouched.
        with open(filename, 'rb') as f:
            return f.read().decode(charset)
								
To go from Unicode into a specific charset such as UTF-8, you can use the
:meth:`unicode.encode` method::

    def write_file(filename, contents, charset='utf-8'):
        # Open in binary mode because encode() returns bytes.
        with open(filename, 'wb') as f:
            f.write(contents.encode(charset))
						
					
						
							
								
									
										
										
										
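As an alternative sketch, ``io.open`` from the standard library (available
since Python 2.6, and the same function as the built-in ``open`` in
Python 3) accepts an ``encoding`` argument and does the decoding and
encoding for you.  The names ``read_text`` and ``write_text`` are made up
for this illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(filename, charset='utf-8'):
    # io.open decodes while reading and returns Unicode text.
    with io.open(filename, 'r', encoding=charset) as f:
        return f.read()


def write_text(filename, contents, charset='utf-8'):
    # io.open encodes Unicode text to bytes while writing.
    with io.open(filename, 'w', encoding=charset) as f:
        f.write(contents)


# Round trip through a temporary file.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    write_text(path, u'Hänsel und Gretel\n')
    round_tripped = read_text(path)
finally:
    os.remove(path)
```

With this approach the rest of the application only ever sees Unicode
text, and the charset is named in exactly one place per call.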
							
							
								
Configuring Editors
-------------------

Most editors save as UTF-8 by default nowadays, but in case your editor is
not configured to do this, you have to change it.  Here are some common ways
to set your editor to store files as UTF-8:
								
-   Vim: put ``set enc=utf-8`` into your ``.vimrc`` file.

-   Emacs: either use an encoding cookie or put this into your ``.emacs``
    file::

        (prefer-coding-system 'utf-8)
        (setq default-buffer-file-coding-system 'utf-8)

-   Notepad++:

    1. Go to *Settings -> Preferences ...*
    2. Select the "New Document/Default Directory" tab
    3. Select "UTF-8 without BOM" as encoding

    It is also recommended to use the Unix newline format; you can select
    it in the same panel, but this is not a requirement.