Looking back into history the Internet has gone through many phases. Back in the 1990s there was a lot of mis-direction among the changing character sets. These basically outlined a library of symbols which would be translated into certain characters of languages all around the world. It is an important topic to understand how we have gotten into the trends of modern web design.
I want to go into a bit of depth on points of Unicode character encoding and HTML entities. I don’t mean for this to be a full master list. But to act as more of an introduction into a topic which many designers are not familiar with. Granted the UTF-8 charset has allowed for an almost unanimous configuration of every language on the face of the Earth.
Popular Character Sets
It is notable how the majority of websites these days use a form of Unicode. Many web servers are configured to use either ISO-8859-1 or UTF-8. But there are major differences between these two character encoding libraries and the amount of data held within each one. I’d recommend this character set list which glosses over some of the more archaic models from the past.
One of the biggest reasons we do not use these older sets is because of greater support from modern libraries. Languages such as Chinese, Thai, Japanese, and Korean may all be rolled into one package. This means you can mix various languages into the same document without any problem. This is why web documents have advanced to using a more uniform character set, rather than individual solutions based on native languages.
The Basics of Unicode
Typically UTF-8 is the character encoding for documents of all types. Plain text files will often save as ISO or ASCII encoding, but Unicode has offered something much more special. Beyond just the languages it also includes references for many punctuation characters, mathematical equations, and other symbols. There is almost no limitation when we are considering the massive collection of Unicode solutions.
But the idea of UTF-8 documents has grown substantially in recent years. Sometimes I will read about developers questions the differences between UTF-8 and UTF-16. It often changes the formatting of characters which can affect software development a lot heavier than websites. Building content online should be made as simple as possible.
The only major benefit to using UTF-16 would be generating smaller file sizes. But unless you are writing lengthy articles in various languages this should not be a problem. I am frequently visiting websites in other languages and the server load times are more than bearable. UTF-8 for Unicode web development is the easiest solution for getting some content working and running online properly.
HTML Entities
Entities are another important part of the character set encoding. Basically you can write codes into the HTML which display symbols or icons within the text. These are often called Numeric Character References(NCR) since each entity is given a specific ID number, along with a shorter alias.
As an example the copyright(©) symbol can be generated with the code © or the code ©. These both produce the same results in unicode webpages and the entities will display properly among all browsers. These present an important step forward and they dynamically changed the way developers code websites. Each of the numeric values is built on top of the character set, and the libraries have been used for many years.
Charset in CSS
One other typical anomaly which I run into is related to CSS character sets. It is very simple to update your own charset in the HTML, either as an attribute or using the tag. However in CSS it is often assumed based on the document type itself. However there is a small bit of code you may include at the beginning of the file to update the charset.
@charset "UTF-8";
This line of code should be easily recognized if you are familiar with CSS. But remember that it is not necessary unless you are using characters beyond the ASCII library. It is worth delving into this thread on Stack Overflow to learn a bit more about CSS character sets. But it can be important if you are including more dynamic UTF-8 Unicode characters directly into the document.
Tips for HTML5 Encoding
Going forward I would recommend all web developers should setup their main HTML pages with a default character encoding. It isn’t bad to leave this process up to the server, but it can be a lot better if you plan ahead. As I mentioned earlier there is a meta tag designed specifically for character encoding. Set this up in your document header, or in a heading theme template such as with WordPress to include it onto all pages.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The other related method is to include a bit of code for http-equiv=”Content-Type”. You can read a bit more from this great article The Road to HTML5: Character Encoding. It’s a fairly lengthy article but the information is priceless. I have been so impressed with all of the standards coming out around HTML5. Character sets may initially appear to not be worth very much research. But understanding the basics can really help to wrap your mind around other more complicated topics which you may learn down the road.
Related Posts
Conclusion
I do hope this article may help some designers and web developers out there. Getting started building your own websites can be a very exciting challenge. But even after coding for years it is common to find topics which you are not familiar with. I am still constantly educating myself and researching on new ideas! It is a fun process and also very rewarding. But although character sets have been standardized and much easier to use, it is still a fun look back at the history of getting to where we are today. If you have similar ideas or questions feel free to share with us in the discussion area.
Read More at The Importance of Language and Character Set in Web Design