In this guide, you will learn how to load text data with different encodings using the Python programming language, with a practical example to illustrate the process. Links to the example dataset are provided, and you are encouraged to replicate this example. An additional practice example is suggested at the end of this guide. This example assumes that the data file is stored in the working directory being used by Python.
Technically, everything in a computer is stored in the form of binary numbers or a sequence of 0s and 1s; in other words, computers do not store letters, numbers, or other characters directly. Hence, to handle text data in computers, we need a mapping between the characters used by humans and the binary numbers that computers can “understand.” Such mappings are called encodings. For example, the letter “A” is normally encoded as 65 in decimal or 01000001 in binary.
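The mapping for the letter “A” can be checked directly in Python. The following is a minimal sketch using only built-in functions; the variable names are illustrative and not part of the original script file.

```python
letter = "A"

# ord() returns the number that the character is mapped to.
code_point = ord(letter)

# format(..., "08b") writes that number as eight binary digits.
binary = format(code_point, "08b")

# encode() applies an encoding and returns the raw bytes that would be stored.
ascii_bytes = letter.encode("ascii")

print(code_point)   # 65
print(binary)       # 01000001
print(ascii_bytes)  # b'A'

# chr() goes the other way, from the number back to the character.
print(chr(65))      # A
```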
This example shows how to read a text file using different encodings, with data on roughly 17,000 ISIS-related tweets from more than 100 Twitter users from all over the world. This dataset has been very helpful in developing effective counter-messaging measures against terrorism worldwide. A necessary precondition of using these data for any useful analysis is to load their text content correctly.
This example uses a subset of data from the 2016 How ISIS Uses Twitter dataset (https://www.kaggle.com/fifthtribe/how-isis-uses-twitter/home). The data were collected by a digital agency, Fifth Tribe, and are released under the CC0: Public Domain license through the platform Kaggle. The variable we examine is
There are 17,410 tweets (rows) in the dataset, posted between January 6, 2015, and May 13, 2016, before and after the November 2015 Paris Attacks. At least two languages—English and Arabic—are used in the tweets, making these data appropriate for demonstrating encodings.
Python is a free, open-source programming language. Python does not operate with pull-down menus. Rather, you must submit lines of code that execute functions and operations built into Python. It is best to save your code in a simple text file that Python users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with Python, we suggest you start with the introduction manual located at https://wiki.python.org/moin/BeginnersGuide. While many computer systems come with a plain Python installation, we recommend installing the distribution made by Anaconda (https://www.anaconda.com/download/), as it contains many commonly used packages. This software guide uses this distribution and will prompt you to install any package used here that is not included in the Anaconda distribution.
For this example, we need the package “pandas” for loading data. If you are using the Anaconda distribution of Python, then the package should be installed already. With the package installed, we can load it as:
import pandas as pd
First, we will try to load the text data using the ASCII encoding (assuming the data file is already saved in your working directory):
dataset = pd.read_csv("dataset-twitter-2016-subset1.csv", encoding="ASCII")
The read_csv() function reads in the text file specified by its first input. The encoding of the file is specified by the “encoding” parameter, which is set to “ASCII” in the code above. When the above code is executed, you will likely get an error, which we discuss in the next section.
Now we try to load the data using the UTF-8 encoding:
dataset = pd.read_csv("dataset-twitter-2016-subset1.csv", encoding="UTF-8")
Note how we specify the encoding “UTF-8” here using the parameter (encoding="UTF-8"). In fact, read_csv() uses UTF-8 by default in Python 3, so it is not necessary to specify this input if we want to use the UTF-8 encoding.
Check the data size:
And check a particular tweet:
Here, we arbitrarily picked the index 22, which corresponds to the 23rd tweet in the dataset (in Python, indexing starts from 0).
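The two checks described above can be sketched as follows. This is an illustration under stated assumptions rather than the code from the script file: it builds a tiny stand-in CSV file instead of using the real dataset, and the column name “tweets” and the sample rows are hypothetical. With the real data, the same two operations would be applied to the loaded dataset, using index 22 instead of 2.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real file: three short rows in a column
# assumed to be called "tweets" (the real column name may differ).
sample_rows = ["first tweet", "second tweet", "تغريدة ثالثة"]
sample = pd.DataFrame({"tweets": sample_rows})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample-tweets.csv")
    sample.to_csv(path, index=False, encoding="utf-8")
    dataset = pd.read_csv(path, encoding="utf-8")

# Check the data size: shape is a (rows, columns) tuple.
n_rows, n_cols = dataset.shape
print(n_rows, n_cols)

# Check a particular tweet: position 2 is the third row,
# because positions start at 0.
third_tweet = dataset["tweets"].iloc[2]
print(third_tweet)
```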
For each command, Python returns its output immediately. When we read the data using the ASCII encoding, we get an error message like “'ascii' codec can't decode byte 0xd8 in position 66: ordinal not in range(128).” This happens because Python encounters a byte that has no meaning in ASCII, which only defines the values 0 to 127. That is to be expected, since there are Arabic characters in the tweets and ASCII does not cover Arabic characters. The loading process stops as soon as Python finds a byte that it cannot decode, so the load fails and no data are read in at all.
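The same kind of failure can be reproduced without the data file. The sketch below uses a short Arabic word chosen purely for illustration (it is not taken from the dataset), encodes it as UTF-8 bytes, and then tries to decode those bytes as ASCII.

```python
# An illustrative Arabic word (not taken from the dataset), stored as UTF-8 bytes.
raw_bytes = "دولة".encode("utf-8")

# ASCII only defines byte values below 128; none of these bytes qualify.
print([hex(b) for b in raw_bytes])

try:
    raw_bytes.decode("ascii")
except UnicodeDecodeError as error:
    # Decoding stops at the first byte that ASCII cannot interpret.
    message = str(error)
    print(message)

# Decoding the same bytes with the encoding they were written in succeeds.
decoded = raw_bytes.decode("utf-8")
print(decoded)
```

Here the first byte is 0xd8, the same value reported in the error message above, because many Arabic letters are stored in UTF-8 as two bytes starting with 0xd8 or 0xd9.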
The file is read in successfully with UTF-8 because UTF-8 is the correct encoding. The dataset loaded this way has 17,410 tweets, confirming that the whole file has been read in. The 23rd tweet in the dataset does indeed contain Arabic characters, and characters like these are what cause the attempt to load the file as ASCII to fail.
You can download this sample dataset and see whether you can reproduce the results presented here. Then, try loading the data with any other encoding such as UTF-32.
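As a possible starting point for this exercise, the sketch below decodes the same bytes with several encodings and records what happens for each. It works on a short illustrative string rather than the data file, and the variable names are hypothetical; with the real data, the encoding names would be passed to the “encoding” parameter of read_csv() instead.

```python
# Illustrative sample text (not taken from the dataset), stored as UTF-8 bytes.
original = "ISIS دولة"
raw_bytes = original.encode("utf-8")

results = {}
for encoding in ["utf-8", "latin-1", "utf-32"]:
    try:
        results[encoding] = raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        # Record a failure as None instead of stopping the loop.
        results[encoding] = None

for encoding, text in results.items():
    print(encoding, "->", repr(text))
```

In this sketch, UTF-32 fails with an error, while Latin-1 decodes without any error but returns garbled text. A load that completes is therefore not necessarily a correct one, which is why it is worth inspecting a few rows after loading.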