Python:Data Analytics and Visualization

Interacting with data in text format

Text is a simple and widely used medium for exchanging information. A quote attributed to Doug McIlroy puts it this way: Write programs to handle text streams, because that is a universal interface.

In this section, we will read data from and write data to text files.

Reading data from text format

Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.

Pandas provides a number of functions for reading data from a text file into a DataFrame object. The simplest one is the read_csv() function. Let's start with a small example file:

$ cat example_data/ex_06-01.txt
Name,age,major_id,sex,hometown
Nam,7,1,male,hcm
Mai,11,1,female,hcm
Lan,25,3,female,hn
Hung,42,3,male,tn
Nghia,26,3,male,dn
Vinh,39,3,male,vl
Hong,28,4,female,dn

Tip

cat is a Unix shell command that prints the content of a file to the screen.

In the example file above, the columns are separated by commas and the first row is a header row containing the column names. To read the data file into a DataFrame object, we type the following command:

>>> import pandas as pd
>>> df_ex1 = pd.read_csv('example_data/ex_06-01.txt')
>>> df_ex1
 Name age major_id sex hometown
0 Nam 7 1 male hcm
1 Mai 11 1 female hcm
2 Lan 25 3 female hn
3 Hung 42 3 male tn
4 Nghia 26 3 male dn
5 Vinh 39 3 male vl
6 Hong 28 4 female dn

We see that the read_csv() function uses a comma as the default delimiter between columns and automatically uses the first row as the column header. To change this behavior, we can use the sep parameter to specify a different separator and set header=None when the file has no header row.

See the following example, in which the columns are separated by tab characters and there is no header row:

$ cat example_data/ex_06-02.txt
Nam 7 1 male hcm
Mai 11 1 female hcm
Lan 25 3 female hn
Hung 42 3 male tn
Nghia 26 3 male dn
Vinh 39 3 male vl
Hong 28 4 female dn

>>> df_ex2 = pd.read_csv('example_data/ex_06-02.txt',
 sep = '\t', header=None)
>>> df_ex2
 0 1 2 3 4
0 Nam 7 1 male hcm
1 Mai 11 1 female hcm
2 Lan 25 3 female hn
3 Hung 42 3 male tn
4 Nghia 26 3 male dn
5 Vinh 39 3 male vl
6 Hong 28 4 female dn

We can also choose a specific row as the header row by setting header to the index of that row. Similarly, when we want to use a column of the data file as the row index of the DataFrame, we set index_col to the name or index of that column. We again use the second data file, example_data/ex_06-02.txt, to illustrate this:

>>> df_ex3 = pd.read_csv('example_data/ex_06-02.txt',
 sep = '\t', header=None,
 index_col=0)
>>> df_ex3
 1 2 3 4
0
Nam 7 1 male hcm
Mai 11 1 female hcm
Lan 25 3 female hn
Hung 42 3 male tn
Nghia 26 3 male dn
Vinh 39 3 male vl
Hong 28 4 female dn
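The two options described above can also be combined with a file that does have a header row. The following is a minimal sketch that uses a made-up in-memory file (the title line and the variable names are assumptions for illustration only): header=1 selects the second line as the header, and index_col='Name' selects a column by name.

```python
import io
import pandas as pd

# Hypothetical in-memory file: a title line precedes the real header row.
raw = (
    "student export\n"
    "Name,age,major_id,sex,hometown\n"
    "Nam,7,1,male,hcm\n"
    "Mai,11,1,female,hcm\n"
)

# header=1: the line at index 1 holds the column names; earlier lines are skipped.
# index_col="Name": use the Name column as the row labels of the DataFrame.
df = pd.read_csv(io.StringIO(raw), header=1, index_col="Name")
print(df)
```

Selecting the index by name rather than by position keeps the call readable when the column order of the file changes.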

Apart from those parameters, we still have a lot of useful ones that can help us load data files into Pandas objects more effectively. The following table shows some common parameters:
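As an illustration, the sketch below combines four read_csv() parameters that are often useful: names, usecols, skiprows, and nrows. The data is a made-up in-memory sample and the column names are assumptions, since the example file itself has no header.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample without a header row.
raw = (
    "Nam\t7\t1\tmale\thcm\n"
    "Mai\t11\t1\tfemale\thcm\n"
    "Lan\t25\t3\tfemale\thn\n"
    "Hung\t42\t3\tmale\ttn\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["name", "age", "major_id", "sex", "hometown"],  # assumed column names
    usecols=["name", "age", "hometown"],  # keep only these columns
    skiprows=1,  # skip the first line of the file
    nrows=2,     # read at most two data rows
)
print(df)
```

Here the first line is skipped and only the next two rows are read, so the result contains the rows for Mai and Lan with three columns.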

Besides the read_csv() function, we also have some other parsing functions in Pandas:
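Two such functions are read_table(), which works like read_csv() but uses a tab as the default delimiter, and read_fwf(), which reads fixed-width columns. The sketch below uses made-up in-memory data, and the column positions passed to colspecs are assumptions that match this sample only.

```python
import io
import pandas as pd

# read_table: like read_csv, but the default separator is a tab character.
tab_text = "Nam\t7\thcm\nMai\t11\thcm\n"
df_tab = pd.read_table(io.StringIO(tab_text), header=None)

# read_fwf: each column occupies a fixed range of character positions.
fwf_text = (
    "Nam    7 hcm\n"
    "Mai   11 hcm\n"
)
# colspecs lists half-open (start, end) character ranges, one per column.
df_fwf = pd.read_fwf(
    io.StringIO(fwf_text),
    colspecs=[(0, 5), (5, 8), (9, 12)],
    header=None,
)

print(df_tab)
print(df_fwf)
```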

In some situations, these functions cannot parse a data file automatically, for example when rows have different numbers of fields. In that case, we can open the file ourselves and iterate over a reader object provided by the csv module in the standard library:

$ cat example_data/ex_06-03.txt
Nam 7 1 male hcm
Mai 11 1 female hcm
Lan 25 3 female hn
Hung 42 3 male tn single
Nghia 26 3 male dn single
Vinh 39 3 male vl
Hong 28 4 female dn

>>> import csv
>>> # newline='' lets the csv module handle line endings itself
>>> with open('example_data/ex_06-03.txt', newline='') as f:
...     r = csv.reader(f, delimiter='\t')
...     for line in r:
...         print(line)
...
['Nam', '7', '1', 'male', 'hcm']
['Mai', '11', '1', 'female', 'hcm']
['Lan', '25', '3', 'female', 'hn']
['Hung', '42', '3', 'male', 'tn', 'single']
['Nghia', '26', '3', 'male', 'dn', 'single']
['Vinh', '39', '3', 'male', 'vl']
['Hong', '28', '4', 'female', 'dn']
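Once the rows have been read manually, they can still be turned into a DataFrame. The following is a rough sketch of one way to do this, assuming that shorter rows should be padded with missing values; the column names, including status for the optional sixth field, are hypothetical.

```python
import csv
import io
import pandas as pd

# Hypothetical sample with a ragged second row, mirroring the file above.
raw = "Nam\t7\t1\tmale\thcm\nHung\t42\t3\tmale\ttn\tsingle\n"

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# Pad every row to the length of the longest row so all rows align.
width = max(len(row) for row in rows)
padded = [row + [None] * (width - len(row)) for row in rows]

# Assumed column names; "status" stands for the optional extra field.
columns = ["name", "age", "major_id", "sex", "hometown", "status"]
df = pd.DataFrame(padded, columns=columns[:width])

# csv.reader yields strings only, so numeric columns need converting.
df["age"] = df["age"].astype(int)
print(df)
```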

Writing data to text format

We have seen how to load data from a text file into a Pandas data structure. Now, we will learn how to export data from a Pandas object to a text file. As the counterpart of the read_csv() function, Pandas provides the to_csv() method. Let's look at an example:

>>> df_ex3.to_csv('example_data/ex_06-02.out', sep = ';')
 

The result will look like this:

$ cat example_data/ex_06-02.out
0;1;2;3;4
Nam;7;1;male;hcm
Mai;11;1;female;hcm
Lan;25;3;female;hn
Hung;42;3;male;tn
Nghia;26;3;male;dn
Vinh;39;3;male;vl
Hong;28;4;female;dn
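A file written with a custom separator can be read back by passing the same sep value to read_csv(). The sketch below shows such a round trip on a small made-up DataFrame, using an in-memory buffer instead of a file on disk; the data and names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical data with the name column used as the row index.
df = pd.DataFrame(
    {"name": ["Nam", "Mai"], "age": [7, 11], "hometown": ["hcm", "hcm"]}
).set_index("name")

# Write to an in-memory text buffer with ';' as the separator.
buf = io.StringIO()
df.to_csv(buf, sep=";")

# Rewind the buffer and read it back with the same separator,
# restoring the first column as the index.
buf.seek(0)
restored = pd.read_csv(buf, sep=";", index_col=0)
print(restored)
```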
 

If we want to omit the header line or the index column when writing out the data, we can set the header and index parameters to False:

>>> import sys
>>> df_ex3.to_csv(sys.stdout, sep='\t',
 header=False, index=False)
7 1 male hcm
11 1 female hcm
25 3 female hn
42 3 male tn
26 3 male dn
39 3 male vl
28 4 female dn

We can also write a subset of the columns of the DataFrame to the file by specifying them in the columns parameter:

>>> df_ex3.to_csv(sys.stdout, columns=[3,1,4],
 header=False, sep='\t')
Nam male 7 hcm
Mai female 11 hcm
Lan female 25 hn
Hung male 42 tn
Nghia male 26 dn
Vinh male 39 vl
Hong female 28 dn

With Series objects, we can use the same method to write data to text files, with mostly the same parameters as above.
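As a small sketch of this, the example below writes a made-up Series to a tab-separated string. When no path is given, to_csv() returns the text instead of writing a file; the values and labels here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical Series of ages labelled by name.
ages = pd.Series([7, 11, 25], index=["Nam", "Mai", "Lan"], name="age")

# With no path argument, to_csv returns the CSV text as a string.
# header=False omits the line that would contain the Series name.
text = ages.to_csv(sep="\t", header=False)
print(text)
```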