Chapter 1. Preparing the Data

In this chapter, we will cover the basic tasks of reading, storing, and cleaning data using Python and OpenRefine. It contains the following recipes:

  • Reading and writing CSV/TSV files with Python
  • Reading and writing JSON files with Python
  • Reading and writing Excel files with Python
  • Reading and writing XML files with Python
  • Retrieving HTML pages with pandas
  • Storing and retrieving data from a relational database
  • Storing and retrieving data from MongoDB
  • Opening and transforming data with OpenRefine
  • Exploring the data with OpenRefine
  • Removing duplicates
  • Using regular expressions and GREL to clean up the data
  • Imputing missing observations
  • Normalizing and standardizing features
  • Binning the observations
  • Encoding categorical variables