2.2. Data Preprocessing
So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild we must extract messy data stored in arbitrary formats, and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting. This section, while no substitute for a proper pandas tutorial, will give you a crash course on some of the most common routines.
2.2.1. Reading the Dataset
Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. Here, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics”. To demonstrate how to load CSV files with pandas, we create the CSV file ../data/house_tiny.csv below. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price).
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
# Write a tiny dataset by hand; NA marks missing entries
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
Now let’s import pandas and load the dataset with read_csv.
import pandas as pd
data = pd.read_csv(data_file)
print(data)
   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000
2.2.2. Data Preparation
In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing (iloc).
You might have noticed that pandas replaced all CSV entries with value NA with a special NaN (not a number) value. This can also happen whenever an entry is empty, e.g., “3,,,270000”. These are called missing values and they are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values, while deletion simply discards either those rows or those columns that contain missing values.
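If we opted for deletion instead, pandas offers dropna; a minimal sketch on the dataset above, shown purely for illustration and not used in the rest of this section (note how aggressive row-wise deletion is here: only the fully observed third home survives):
# Deletion instead of imputation (illustration only; not used below)
rows_kept = data.dropna()        # drop every row containing a missing value
cols_kept = data.dropna(axis=1)  # drop every column containing a missing value
print(rows_kept)                 # keeps only row 2 (4, Slate, 178100)
print(cols_kept)                 # keeps only the Price column, which has no NaN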
Here are some common imputation heuristics. For categorical input fields, we can treat NaN as a category. Since the RoofType column takes values Slate and NaN, pandas can convert this column into two columns, RoofType_Slate and RoofType_nan. A row whose roof type is Slate will set the values of RoofType_Slate and RoofType_nan to 1 and 0, respectively. The converse holds for a row with a missing RoofType value.
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1
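As noted earlier, columns can also be selected by name rather than by position; a minimal sketch equivalent to the iloc call above, using this dataset's column names:
# Equivalent selection by column name rather than by integer position
inputs_by_name = data[['NumRooms', 'RoofType']]
targets_by_name = data['Price']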
For missing numerical values, one common heuristic is to replace the NaN entries with the mean value of the corresponding column.
inputs = inputs.fillna(inputs.mean())
print(inputs)
   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1
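The mean is sensitive to outliers, so a slightly more robust heuristic imputes the column-wise median instead. A minimal sketch, rebuilding the raw inputs first since inputs was already imputed above (on this tiny dataset the result happens to coincide with mean imputation, because the median of the observed NumRooms values 2 and 4 is also 3.0):
# Median imputation as an alternative to the mean; rebuild the raw inputs
# because `inputs` was already filled in above
inputs_alt = pd.get_dummies(data.iloc[:, 0:2], dummy_na=True)
inputs_alt = inputs_alt.fillna(inputs_alt.median())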
2.2.3. Conversion to the Tensor Format
Now that all the entries in inputs and targets are numerical, we can load them into a tensor (recall Section 2.1).
In PyTorch:
import torch

X, y = torch.tensor(inputs.values), torch.tensor(targets.values)
X, y
(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
In MXNet:
from mxnet import np

X, y = np.array(inputs.values), np.array(targets.values)
X, y
(array([[3., 0., 1.],
        [2., 0., 1.],
        [4., 1., 0.],
        [3., 0., 1.]], dtype=float64),
 array([127500, 106000, 178100, 140000], dtype=int64))
In JAX:
from jax import numpy as jnp

X, y = jnp.array(inputs.values), jnp.array(targets.values)
X, y
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
(Array([[3., 0., 1.],
        [2., 0., 1.],
        [4., 1., 0.],
        [3., 0., 1.]], dtype=float32),
 Array([127500, 106000, 178100, 140000], dtype=int32))
In TensorFlow:
import tensorflow as tf

X, y = tf.constant(inputs.values), tf.constant(targets.values)
X, y
(<tf.Tensor: shape=(4, 3), dtype=float64, numpy=
 array([[3., 0., 1.],
        [2., 0., 1.],
        [4., 1., 0.],
        [3., 0., 1.]])>,
 <tf.Tensor: shape=(4,), dtype=int64, numpy=array([127500, 106000, 178100, 140000])>)
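Note that recent pandas releases have get_dummies produce boolean rather than 0/1 integer columns, in which case the direct .values conversions above can fail or yield unexpected dtypes. Converting explicitly through NumPy is a safe workaround; a sketch, shown for the PyTorch case:
# Explicit dtype conversion, robust to boolean dummy columns in newer pandas
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))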
2.2.4. Discussion
You now know how to partition data columns, impute missing values, and load pandas data into tensors. In Section 5.7, you will pick up some more data processing skills. While this crash course kept things simple, data processing can get hairy. For example, rather than arriving in a single CSV file, our dataset might be spread across multiple files extracted from a relational database. For instance, in an e-commerce application, customer addresses might live in one table and purchase data in another; a sketch of joining such tables follows below. Moreover, practitioners face myriad data types beyond categorical and numeric, including text strings, images, audio data, and point clouds. Oftentimes, advanced tools and efficient algorithms are required to prevent data processing from becoming the biggest bottleneck in the machine learning pipeline. These problems will arise when we get to computer vision and natural language processing. Finally, we must pay attention to data quality. Real-world datasets are often plagued by outliers, faulty measurements from sensors, and recording errors, which must be addressed before feeding the data into any model. Data visualization tools such as seaborn, Bokeh, or matplotlib can help you to manually inspect the data and develop intuitions about what problems you may need to address.
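To make the multi-table point concrete, here is a minimal sketch of joining two such tables with pandas; the table contents and column names are hypothetical, invented purely for illustration:
# Hypothetical customer and purchase tables joined on a shared key
customers = pd.DataFrame({'customer_id': [1, 2],
                          'address': ['12 Elm St', '34 Oak Ave']})
purchases = pd.DataFrame({'customer_id': [1, 1, 2],
                          'amount': [19.99, 5.49, 42.00]})
combined = customers.merge(purchases, on='customer_id', how='inner')
print(combined)  # one row per purchase, with the address attached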
2.2.5. Exercises
1. Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository, and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
2. Try out indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.
3. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What changes if you try it out on a server?
4. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?
5. What alternatives to pandas can you think of? How about loading NumPy tensors from a file? Check out Pillow, the Python Imaging Library.