PostgreSQL Data Cleaning vs. Python Data Cleaning
Introduction
During analysis, you rarely—if ever—get to go directly from evaluating data to transforming and analyzing it. Sometimes, to properly evaluate your data, you may need to do some pre-cleaning before you get to the main data cleaning—and that’s a lot of cleaning! To accomplish all this work, you may use Excel, R, or Python, but are these the best tools for your PostgreSQL data cleaning tasks?
In this blog post, I explore some classic data cleaning scenarios and show how you can perform them directly within your database using TimescaleDB and PostgreSQL, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL can't replace these tools entirely. However, they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.
Cleaning is a critical part of the analysis process. In my experience, it can be the most grueling! By cleaning data directly within my database, I can perform many of my cleaning tasks one time rather than repetitively within a script. This saves me considerable time in the long run.
How PostgreSQL Data Cleaning Fits in the Data Analysis Process
I began this series of posts on data analysis by presenting the following summary of the analysis process:
The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, these being excellent options for analysis.
However, once I was introduced to PostgreSQL and TimescaleDB, I found how efficient and fast it was to do my data munging directly within my database. In my previous post, I focused on showing data evaluation techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, data cleaning. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.
As I mentioned in my last post, my first job out of college was at an energy and sustainability solutions company that focused on monitoring utility usage—such as electricity, water, sewage, you name it—to determine how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.
Throughout this job, I got the chance to use many popular data analysis tools including Excel, R, and Python. But once I tried using a database to perform my data munging tasks - specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database.
Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script.
This meant that every time I ran my Python script, I had to wait for my machine to spend computational time setting up and cleaning my data, so I lost time with every run of the script. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script for them or pass it along for them to run. This extra computational time could add up depending on the project.
Instead, with PostgreSQL data cleaning, I can write a query to do this cleaning once and then store the results in a table. I wouldn’t need to spend time cleaning and transforming data again and again with a Python script, I could just set up the cleaning process in my database and call it a day! Once I started to make PostgreSQL data cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and simply focus on jumping straight into modeling my data.
To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our Slack channel, where you can ask the Timescale community specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!
Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! The easiest way to get started is by signing up for a free 30-day trial of Timescale (if you prefer self-hosting, you can always install and manage TimescaleDB on your own PostgreSQL instances). Learn more by following one of our many tutorials.
Now, before we dip into things and get our data, as Outkast best put it, “So fresh, So clean,” I want to quickly cover the data set I'll be using. In addition, I want to note that all the code I show will assume you have some basic knowledge of SQL. If you aren't familiar with SQL, don’t worry! In my last post, I included a section on SQL basics.
About the Sample Dataset
In my experience within the data science realm, I have done most of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose is dependent on the initial state of your data and how easy it is to evaluate. For the data set I'll use today, I'd likely do some initial cleaning before evaluation and then clean again after. Let me show you why.
I got the following IoT data set from Kaggle, where a very generous individual shared the energy consumption readings from their apartment in San Jose, CA, recorded in 15-minute increments. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:
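(Since the Kaggle download is just a CSV and doesn't ship with a formal schema, here's a sketch of the staging table I loaded it into, reconstructed from the columns and types referenced later in this post; the exact types are my assumptions.)

CREATE TABLE energy_usage_staging (
	type text,
	date date,
	start_time time,
	end_time time,
	usage float4,
	units text,
	cost text,
	notes text
);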
and appears like this…
type | date | start_time | end_time | usage | units | cost | notes |
---|---|---|---|---|---|---|---|
Electric usage | 2016-10-22 | 00:00:00 | 00:14:00 | 0.01 | kWh | $0.00 | |
Electric usage | 2016-10-22 | 00:15:00 | 00:29:00 | 0.01 | kWh | $0.00 | |
Electric usage | 2016-10-22 | 00:30:00 | 00:44:00 | 0.01 | kWh | $0.00 | |
Electric usage | 2016-10-22 | 00:45:00 | 00:59:00 | 0.01 | kWh | $0.00 | |
Electric usage | 2016-10-22 | 01:00:00 | 01:14:00 | 0.01 | kWh | $0.00 | |
Electric usage | 2016-10-22 | 01:15:00 | 01:29:00 | 0.01 | kWh | $0.00 | |
Electric usage | 2016-10-22 | 01:30:00 | 01:44:00 | 0.01 | kWh | $0.00 | |
Electric usage | 2016-10-22 | 01:45:00 | 01:59:00 | 0.01 | kWh | $0.00 |
In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:
- The cost column is stored as a text data type, which will cause some issues.
- The date and time columns are split apart, which could cause problems if I want to create plots over time or perform any type of modeling based on time.
- I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household).
In order to fix all of these things and get more valuable data evaluation and analysis, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!
PostgreSQL Data Cleaning
Here's what I've done in the past while working in data science. While these examples aren't exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.
Please feel free to explore these various techniques and skip around if you need! There's a lot here, and I designed it to be a helpful glossary of tools that you could use as needed.
The techniques I will cover include:
- Correcting structural issues
- Creating or generating relevant data
- Adding data to a hypertable
- Renaming values or columns
- Filling in missing data
Note on cleaning approach:
There are many ways that I could approach PostgreSQL data cleaning. I could create a table then ALTER it as I clean, I could create multiple tables as I add or change data, or I could work with VIEWs. Depending on the size of my data, any of these approaches could make sense. However, they will have different computational consequences.
You may have noticed above that my raw data table was called energy_usage_staging. This is because I decided that, given the state of my raw data, it's better to place the raw data in a staging table, clean it using VIEWs, and then insert it into a more usable table as part of my cleaning process.
This move from the raw table to the usable table could happen even before the evaluation step of analysis. Sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned, and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL VIEWs, something I will talk about later.
Generally, if you're dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post, I'll show how to build up clean data using VIEWs along with additional tables. This method of cleaning is more efficient. It also sets up our next blog post about data transformation, which includes the use of scripts in PostgreSQL.
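For a sense of what the costlier alternative looks like, altering the staging table in place would go something like this (a sketch, not the route I take below):

--a sketch of the in-place approach: add a column, then rewrite every row to fill it
ALTER TABLE energy_usage_staging
	ADD COLUMN time timestamp;
UPDATE energy_usage_staging
	SET time = date + start_time;

On a large table, that UPDATE rewrites every row, which is exactly the kind of cost the staging-table-plus-VIEW approach avoids.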
Correcting structural issues
Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that we have the date and time columns separated, and cost is recorded as a text data type. I need to convert my separated date and time columns to a timestamp and the cost column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.
TimescaleDB hypertables and why timestamp is important
For those of you not familiar with the structure of TimescaleDB hypertables, they're the basis of how we efficiently query and manipulate time-series data. Timescale hypertables are partitioned based on time, specifically by the time column you specify when you create the table.
The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some chunk based on a time range. We then use these time chunks to help query the rows so that you can get more efficient querying and data manipulation based on time. This image represents the difference between a normal table and our special hypertables.
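As a quick preview of what this looks like in practice (a sketch; the energy_usage hypertable itself is created later in this post, and the one-week chunk interval shown here simply makes TimescaleDB's default explicit):

--a sketch: partition a table into chunks on its time column
SELECT create_hypertable('energy_usage', 'time', chunk_time_interval => INTERVAL '1 week');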
Changing date-time structure
Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time-based queries, I want to restructure the energy_usage_staging table's date and time columns. I could use the date column for my hypertable partitioning; however, I would have limited control over manipulating my data based on time. It is more flexible and space-efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!
Looking back at the table structure, I should be able to get a usable timestamp value from the date and start_time columns, as the end_time really doesn’t give me that much useful information. Thus, I essentially want to combine these two columns to form a new timestamp column. Let’s see how I can do that using SQL. Spoiler alert: it's as simple as an algebraic statement. How cool is that?!
PostgreSQL code:
For my PostgreSQL data cleaning code, I can create the column without inserting it into the database. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables just yet.
Let’s first compare the original columns with our newly generated column. For this query, I simply add the two columns together. The AS keyword allows me to rename the column to whatever I would like, in this case, time.
--add the date column to the start_time column
SELECT date, start_time, (date + start_time) AS time
FROM energy_usage_staging eus;
Results:
date | start_time | time |
---|---|---|
2016-10-22 | 00:00:00 | 2016-10-22 00:00:00.000 |
2016-10-22 | 00:15:00 | 2016-10-22 00:15:00.000 |
2016-10-22 | 00:30:00 | 2016-10-22 00:30:00.000 |
2016-10-22 | 00:45:00 | 2016-10-22 00:45:00.000 |
2016-10-22 | 01:00:00 | 2016-10-22 01:00:00.000 |
2016-10-22 | 01:15:00 | 2016-10-22 01:15:00.000 |
Python code:
In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I would have to concatenate the two columns along with a defined space, then convert that column to datetime.
energy_stage_df['time'] = pd.to_datetime(energy_stage_df['date'] + ' ' + energy_stage_df['start_time'])
print(energy_stage_df[['date', 'start_time', 'time']])
Changing column data types
Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the TO_NUMBER() function.
The format of the function is as follows: TO_NUMBER(‘text’, ‘format’). The ‘format’ input is a PostgreSQL-specific string that you build depending on what type of text you want to convert. In our case, we have a $ symbol followed by a numeric value in the format 0.00. For the format string, I decided to use ‘L9G999D99’. The L lets PostgreSQL know there is a money symbol at the beginning of the text, the 9s let the system know I have numeric values, the G accounts for a comma as a group separator, and the D stands for a decimal point.
This format caps the conversion at values less than or equal to ‘$9,999.99’, which is more than enough given that the cost column has no values greater than 0.65. If you were planning to convert a column with larger numeric values, you would want to account for that by adding more Gs and 9s. For example, say you have a cost column with text values like ‘$1,672,278.23’; then you would want to format the string like this: ‘L9G999G999D99’.
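If you want to sanity-check a format string before running it over a whole column, you can test it against a literal first (a sketch, assuming the same locale settings that make the queries below work):

--a sketch: testing the larger format string against a single literal value
SELECT TO_NUMBER('$1,672,278.23', 'L9G999G999D99');
--expected result: 1672278.23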
PostgreSQL code:
--create a new column called cost_new with the to_number() function
SELECT cost, TO_NUMBER("cost", 'L9G999D99') AS cost_new
FROM energy_usage_staging eus
ORDER BY cost_new DESC
Results:
cost | cost_new |
---|---|
$0.65 | 0.65 |
$0.65 | 0.65 |
$0.65 | 0.65 |
$0.57 | 0.57 |
$0.46 | 0.46 |
$0.46 | 0.46 |
$0.46 | 0.46 |
$0.46 | 0.46 |
Python code:
For Python, I used a lambda function that systematically replaces all the ‘$’ signs with empty strings. This can be fairly inefficient.
energy_stage_df['cost_new'] = pd.to_numeric(energy_stage_df.cost.apply(lambda x: x.replace('$','')))
print(energy_stage_df[['cost', 'cost_new']])
Creating a VIEW
Now that I know how to convert my columns, I can combine the two queries and create a VIEW of my new restructured table. A VIEW is a PostgreSQL object that allows you to define a query and then call it by name as if it were a table within your database. I can use the following query to generate the data I want and then create a VIEW that I can query as if it were a table.
PostgreSQL code:
-- query the right data that I want
SELECT type,
(date + start_time) AS time,
"usage",
units,
TO_NUMBER("cost", 'L9G999D99') AS cost,
notes
FROM energy_usage_staging
Results:
type | time | usage | units | cost | notes |
---|---|---|---|---|---|
Electric usage | 2016-10-22 00:00:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 00:15:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 00:30:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 00:45:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 01:00:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 01:15:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 01:30:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 01:45:00.000 | 0.01 | kWh | 0.00 | |
Electric usage | 2016-10-22 02:00:00.000 | 0.02 | kWh | 0.00 | |
Electric usage | 2016-10-22 02:15:00.000 | 0.02 | kWh | 0.00 |
I decided to call my VIEW energy_view. Now, when I want to do further cleaning, I can just specify its name in the FROM statement.
--create view from the query above
CREATE VIEW energy_view AS
SELECT type,
(date + start_time) AS time,
"usage",
units,
TO_NUMBER("cost", 'L9G999D99') AS cost,
notes
FROM energy_usage_staging
Python code:
energy_df = energy_stage_df[['type','time','usage','units','cost_new','notes']]
energy_df.rename(columns={'cost_new':'cost'}, inplace = True)
print(energy_df.head(20))
It is important to note that with PostgreSQL VIEWs, the underlying data has to be recalculated every time you query them. This is why we want to insert our VIEW data into a hypertable once we have the data set up just right. You can think of VIEWs as a shorthand version of the CTE WITH AS statement I discussed in my last post.
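To make that comparison concrete, here's roughly the same restructuring written as a CTE (a sketch): the WITH block only exists for the duration of this single query, whereas the VIEW's definition is saved and can be referenced by name again and again.

--a sketch: the same restructuring written as a one-off CTE instead of a VIEW
WITH energy_cte AS (
	SELECT type,
		(date + start_time) AS time,
		"usage",
		units,
		TO_NUMBER("cost", 'L9G999D99') AS cost,
		notes
	FROM energy_usage_staging
)
SELECT *
FROM energy_cte
LIMIT 10;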
We are now one step closer to cleaner data!
Creating or generating relevant data
With some quick investigation, we can see that the notes column is blank for this data set. To check this, I just need to include a WHERE clause and specify where notes are not equal to an empty string.
PostgreSQL data cleaning code—detecting blank notes:
SELECT *
FROM energy_view ew
-- where notes are not equal to an empty string
WHERE notes!='';
The results come out empty.
Python code:
print(energy_df[energy_df['notes'].notnull()])
Since the notes are blank, I would like to replace the column with various sets of additional information that I could use later on during modeling. One thing I would like to add in particular is a column that specifies the day of the week. To do this, I can use the EXTRACT() command. The EXTRACT() command is a PostgreSQL date/time function that allows you to pull out various date/time elements. For our column, PostgreSQL has the specification DOW (day of week), which maps Sunday to 0 through Saturday to 6.
PostgreSQL code:
--extract day-of-week from date column and cast the output to an int
SELECT *,
EXTRACT(DOW FROM time)::int AS day_of_week
FROM energy_view ew
Results:
type | time | usage | units | cost | notes | day_of_week |
---|---|---|---|---|---|---|
Electric usage | 2016-10-22 00:00:00.000 | 0.01 | kWh | 0.00 | | 6 |
Electric usage | 2016-10-22 00:15:00.000 | 0.01 | kWh | 0.00 | | 6 |
Electric usage | 2016-10-22 00:30:00.000 | 0.01 | kWh | 0.00 | | 6 |
Electric usage | 2016-10-22 00:45:00.000 | 0.01 | kWh | 0.00 | | 6 |
Electric usage | 2016-10-22 01:00:00.000 | 0.01 | kWh | 0.00 | | 6 |
Electric usage | 2016-10-22 01:15:00.000 | 0.01 | kWh | 0.00 | | 6 |
Python code:
energy_df['day_of_week'] = energy_df['time'].dt.dayofweek
Additionally, we may want to add another column that specifies whether a day falls on a weekend or a weekday. To do this, I create a boolean column, where true represents a weekend and false represents a weekday. I then apply a CASE statement. With this command, I can specify “when-then” statements (similar to “if-then” statements in coding), where I can say WHEN a day_of_week value is IN the set (0,6) THEN the output should be true, ELSE the value should be false.
PostgreSQL code:
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week,
--use the case statement to make a column true when records fall on a weekend aka 0 and 6
CASE WHEN (EXTRACT(DOW FROM time)::int) IN (0,6) then true
ELSE false
END AS is_weekend
FROM energy_view ew
Results:
type | time | usage | units | cost | day_of_week | is_weekend |
---|---|---|---|---|---|---|
Electric usage | 2016-10-22 00:00:00.000 | 0.01 | kWh | 0.00 | 6 | true |
Electric usage | 2016-10-22 00:15:00.000 | 0.01 | kWh | 0.00 | 6 | true |
Electric usage | 2016-10-22 00:30:00.000 | 0.01 | kWh | 0.00 | 6 | true |
Electric usage | 2016-10-22 00:45:00.000 | 0.01 | kWh | 0.00 | 6 | true |
Electric usage | 2016-10-22 01:00:00.000 | 0.01 | kWh | 0.00 | 6 | true |
Fun fact: you can do the same query without a CASE statement; however, it only works for binary columns.
--another method to create a binary column
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week,
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend
FROM energy_view ew
Python code:
Notice that in Python, the weekends are represented by numbers 5 and 6 vs. the PostgreSQL weekend values 0 and 6.
energy_df['is_weekend'] = np.where(energy_df['day_of_week'].isin([5,6]), 1, 0)
print(energy_df.head(20))
And maybe things then start getting really crazy. Maybe you want to add more parameters!
Let’s consider holidays. Now, you may be asking, “Why in the world would we do that?!” but often, people have time off during some of the holidays within the US. Since this individual lives within the US, they likely have at least some of the holidays off. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I create another boolean column identifying when a federal holiday occurs.
I can accomplish this using TimescaleDB’s time_bucket() function. The time_bucket() function is one of the functions I discussed in detail in my previous post. Essentially, I need to use this function to make sure all time values within a single day get accounted for. Without the time_bucket() function, I would only see changes to the row associated with the 12 a.m. time period.
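To illustrate with an arbitrary timestamp (a sketch): time_bucket() snaps every reading within a day back to that day's start, which is what lets each 15-minute row match a date in the holidays table.

--a sketch: every timestamp within a day buckets to that day's midnight
SELECT time_bucket('1 day', TIMESTAMP '2016-12-25 13:15:00');
--expected result: 2016-12-25 00:00:00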
PostgreSQL code:
After I create a holiday table, I can then use the data from it within my query. I also decided to use the non-CASE syntax for this query. Note that you can use either!
--create table for the holidays
CREATE TABLE holidays (
date date)
--insert the holidays into table
INSERT INTO holidays
VALUES ('2016-11-11'),
('2016-11-24'),
('2016-12-24'),
('2016-12-25'),
('2016-12-26'),
('2017-01-01'),
('2017-01-02'),
('2017-01-16'),
('2017-02-20'),
('2017-05-29'),
('2017-07-04'),
('2017-09-04'),
('2017-10-9'),
('2017-11-10'),
('2017-11-23'),
('2017-11-24'),
('2017-12-24'),
('2017-12-25'),
('2018-01-01'),
('2018-01-15'),
('2018-02-19'),
('2018-05-28'),
('2018-07-4'),
('2018-09-03'),
('2018-10-8')
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week,
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
-- I can then select the data from the holidays table directly within my IN statement
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew
Results:
type | time | usage | units | cost | day_of_week | is_weekend | is_holiday |
---|---|---|---|---|---|---|---|
Electric usage | 2016-10-22 00:00:00.000 | 0.01 | kWh | 0.00 | 6 | true | false |
Electric usage | 2016-10-22 00:15:00.000 | 0.01 | kWh | 0.00 | 6 | true | false |
Electric usage | 2016-10-22 00:30:00.000 | 0.01 | kWh | 0.00 | 6 | true | false |
Electric usage | 2016-10-22 00:45:00.000 | 0.01 | kWh | 0.00 | 6 | true | false |
Electric usage | 2016-10-22 01:00:00.000 | 0.01 | kWh | 0.00 | 6 | true | false |
Electric usage | 2016-10-22 01:15:00.000 | 0.01 | kWh | 0.00 | 6 | true | false |
Python code:
holidays = ['2016-11-11', '2016-11-24', '2016-12-24', '2016-12-25', '2016-12-26', '2017-01-01', '2017-01-02', '2017-01-16', '2017-02-20', '2017-05-29', '2017-07-04', '2017-09-04', '2017-10-9', '2017-11-10', '2017-11-23', '2017-11-24', '2017-12-24', '2017-12-25', '2018-01-01', '2018-01-15', '2018-02-19', '2018-05-28', '2018-07-4', '2018-09-03', '2018-10-8']
energy_df['is_holiday'] = np.where(energy_df['time'].dt.normalize().isin(pd.to_datetime(holidays)), 1, 0)
print(energy_df.head(20))
At this point, I’m going to save this expanded table into another VIEW so that I can call the data without writing out the query.
PostgreSQL code:
--create another view with the data from our first round of cleaning
CREATE VIEW energy_view_exp AS
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week,
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
time_bucket('1 day', time) IN (select date from holidays) AS is_holiday
FROM energy_view ew
You may ask, “Why did you create these as boolean columns?!” A very fair question! You see, I may want to use these columns for filtering during analysis, something I commonly do during my own analysis process. In PostgreSQL, you can use boolean columns to filter things super easily. For example, say that I want to use my table query so far and show only the data that occurs over the weekend AND on a holiday. I can do this simply by adding a WHERE statement along with the specified columns.
PostgreSQL code:
--if you use binary columns, then you can filter with a simple WHERE statement
SELECT *
FROM energy_view_exp
WHERE is_weekend AND is_holiday
Results:
type | time | usage | units | cost | day_of_week | is_weekend | is_holiday |
---|---|---|---|---|---|---|---|
Electric usage | 2016-12-24 00:00:00.000 | 0.34 | kWh | 0.06 | 6 | true | true |
Electric usage | 2016-12-24 00:15:00.000 | 0.34 | kWh | 0.06 | 6 | true | true |
Electric usage | 2016-12-24 00:30:00.000 | 0.34 | kWh | 0.06 | 6 | true | true |
Electric usage | 2016-12-24 00:45:00.000 | 0.34 | kWh | 0.06 | 6 | true | true |
Electric usage | 2016-12-24 01:00:00.000 | 0.34 | kWh | 0.06 | 6 | true | true |
Electric usage | 2016-12-24 01:15:00.000 | 0.34 | kWh | 0.06 | 6 | true | true |
Python code:
print(energy_df[(energy_df['is_weekend']==1) & (energy_df['is_holiday']==1)].head(10))
Adding data to a hypertable
Now I have new columns ready to go, and I know how I would like my table to be structured, so I can create a new hypertable and insert my cleaned data. In my own analysis with this data set, I might have run the cleaning up to this point BEFORE evaluating my data so that the evaluation step would be more meaningful. What’s great is that you can use any of these techniques for general cleaning, whether that happens before or after evaluation.
PostgreSQL:
CREATE TABLE energy_usage (
type text,
time timestamptz,
usage float4,
units text,
cost float4,
day_of_week int,
is_weekend bool,
is_holiday bool
)
--command to create a hypertable
SELECT create_hypertable('energy_usage', 'time')
INSERT INTO energy_usage
SELECT *
FROM energy_view_exp
Note that if you had data continually coming in, you could create a script within your database that automatically makes these changes when importing your data. That way, you can have cleaned data ready to go in your database rather than processing and cleaning the data in your scripts every time you want to perform analysis.
I'll discuss this in detail in my next post. Make sure to stay tuned if you want to know how to create scripts and keep data automatically updated!
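As a rough sketch of the idea (the WHERE clause and any scheduling around it are my assumptions here, not the approach from the next post), the core of such a script is just an INSERT … SELECT that picks up rows newer than what's already in the hypertable:

--a rough sketch: copy only newly staged rows into the hypertable
INSERT INTO energy_usage
SELECT *
FROM energy_view_exp
WHERE time > (SELECT COALESCE(max(time), '-infinity') FROM energy_usage);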
Renaming values
Another valuable technique for cleaning data is being able to rename various items or remap categorical values. The importance of this skill is amplified by the popularity of this Python data analysis question on Stack Overflow, which asks, “How do I change a single index value in a pandas dataframe?” Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values can be fairly simple.
You can rename specific values within a table “on the fly” by using PostgreSQL’s CASE statement within the SELECT query. Let’s say I don’t like Sunday being represented by a 0 in the day_of_week column and would prefer it to be a 7. I can do this with the following query.
PostgreSQL code:
SELECT type, time, usage, cost, is_weekend,
-- you can use case to recode column values
CASE WHEN day_of_week = 0 THEN 7
ELSE day_of_week
END
FROM energy_usage
Python code:
Caveat: this code would make Monday = 7 because pandas' day-of-week numbering sets Monday to 0 and Sunday to 6. But this is how you would update one value within a column. You likely would not want to perform this exact action; I just wanted to show the Python equivalent for reference.
energy_df.loc[energy_df['day_of_week'] == 0, 'day_of_week'] = 7
print(energy_df.head(250))
Now, let’s say that I wanted to use the names of the days of the week instead of showing numeric values. For this example, I want to ditch the CASE statement and create a mapping table. When you need to change various values, it will likely be more efficient to create a mapping table and then join it to this table using the JOIN command.
PostgreSQL:
--first I need to create the table
CREATE TABLE day_of_week_mapping (
day_of_week_int int,
day_of_week_name text
)
--then I want to add data to my table
INSERT INTO day_of_week_mapping
VALUES (0, 'Sunday'),
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday')
--then I can join this table to my cleaning table to remap the days of the week
SELECT type, time, usage, units, cost, dowm.day_of_week_name, is_weekend
FROM energy_usage eu
LEFT JOIN day_of_week_mapping dowm ON dowm.day_of_week_int = eu.day_of_week
Results:
type | time | usage | units | cost | day_of_week_name | is_weekend |
---|---|---|---|---|---|---|
Electric usage | 2018-07-22 00:45:00.000 | 0.1 | kWh | 0.03 | Sunday | true |
Electric usage | 2018-07-22 00:30:00.000 | 0.1 | kWh | 0.03 | Sunday | true |
Electric usage | 2018-07-22 00:15:00.000 | 0.1 | kWh | 0.03 | Sunday | true |
Electric usage | 2018-07-22 00:00:00.000 | 0.1 | kWh | 0.03 | Sunday | true |
Electric usage | 2018-02-11 23:00:00.000 | 0.04 | kWh | 0.01 | Sunday | true |
Python:
In this case, Python has a similar mapping function.
energy_df['day_of_week_name'] = energy_df['day_of_week'].map({0 : 'Sunday', 1 : 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'})
print(energy_df.head(20))
Hopefully, one of these techniques will be useful for you as you approach data renaming!
Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as AS (I couldn’t not use such a ridiculous statement 😂). When you use the SELECT statement, you can rename your columns like so:
PostgreSQL code:
SELECT type AS usage_type,
time AS time_stamp,
usage,
units,
cost AS dollar_amount
FROM energy_view_exp
LIMIT 20;
Results:
usage_type | time_stamp | usage | units | dollar_amount |
---|---|---|---|---|
Electric usage | 2016-10-22 00:00:00.000 | 0.01 | kWh | 0.00 |
Electric usage | 2016-10-22 00:15:00.000 | 0.01 | kWh | 0.00 |
Electric usage | 2016-10-22 00:30:00.000 | 0.01 | kWh | 0.00 |
Electric usage | 2016-10-22 00:45:00.000 | 0.01 | kWh | 0.00 |
Python code:
Comparatively, renaming columns in Python can be a huge pain. This is an area where SQL is not only faster, but also just more elegant in its code.
energy_df.rename(columns={'type':'usage_type', 'time':'time_stamp', 'cost':'dollar_amount'}, inplace=True)
print(energy_df[['usage_type','time_stamp','usage','units','dollar_amount']].head(20))
Filling in missing data
Another common problem in the PostgreSQL data cleaning process is having missing data. For the dataset we are using, there are no obviously missing data points. However, it's very possible that with evaluation, we could find missing hourly data from a power outage or some other phenomenon.
This is where the gap-filling functions TimescaleDB offers could come in handy. When using algorithms, missing data can often have significant negative impacts on the accuracy or dependability of the model. Sometimes, you can navigate this problem by filling in missing data with reasonable estimates and TimescaleDB actually has built-in functions to help you do this.
For example, let’s say that you are modeling energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values.
As an example, I created some data. I called this table energy_data, and it is missing both time and energy readings for the timestamps between 7:45 a.m. and 11:30 a.m.
time | energy |
---|---|
2021-01-01 07:00:00.000 | 0 |
2021-01-01 07:15:00.000 | 0.1 |
2021-01-01 07:30:00.000 | 0.1 |
2021-01-01 07:45:00.000 | 0.2 |
2021-01-01 11:30:00.000 | 0.04 |
2021-01-01 11:45:00.000 | 0.04 |
2021-01-01 12:00:00.000 | 0.03 |
2021-01-01 12:15:00.000 | 0.02 |
2021-01-01 12:30:00.000 | 0.03 |
2021-01-01 12:45:00.000 | 0.02 |
2021-01-01 13:00:00.000 | 0.03 |
I can use TimescaleDB’s gap-filling hyperfunctions to fill in these missing values. The interpolate() function is another one of TimescaleDB’s hyperfunctions. It creates data points that follow a linear approximation given the data points before and after the missing range of data. Alternatively, you could use the locf() hyperfunction, which carries the last recorded value forward to fill in the gap (note that locf stands for last-one-carried-forward). Both of these functions must be used in conjunction with the time_bucket_gapfill() function.
PostgreSQL code:
SELECT
--here I specified that the data should increment by 15 mins
time_bucket_gapfill('15 min', time) AS timestamp,
interpolate(avg(energy)),
locf(avg(energy))
FROM energy_data
--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement
WHERE energy IS NOT NULL AND time > '2021-01-01 07:00:00.000' AND time < '2021-01-01 13:00:00.000'
GROUP BY timestamp
ORDER BY timestamp;
Results:
timestamp | interpolate | locf |
---|---|---|
2021-01-01 07:00:00.000 | 0.1 | 0.10000000000000000000 |
2021-01-01 07:30:00.000 | 0.15 | 0.15000000000000000000 |
2021-01-01 08:00:00.000 | 0.13625 | 0.15000000000000000000 |
2021-01-01 08:30:00.000 | 0.1225 | 0.15000000000000000000 |
2021-01-01 09:00:00.000 | 0.10875 | 0.15000000000000000000 |
2021-01-01 09:30:00.000 | 0.095 | 0.15000000000000000000 |
2021-01-01 10:00:00.000 | 0.08125 | 0.15000000000000000000 |
2021-01-01 10:30:00.000 | 0.0675 | 0.15000000000000000000 |
2021-01-01 11:00:00.000 | 0.05375 | 0.15000000000000000000 |
2021-01-01 11:30:00.000 | 0.04 | 0.04000000000000000000 |
2021-01-01 12:00:00.000 | 0.025 | 0.02500000000000000000 |
2021-01-01 12:30:00.000 | 0.025 | 0.02500000000000000000 |
Python code:
energy_test_df['time'] = pd.to_datetime(energy_test_df['time'])
energy_test_df_locf = energy_test_df.set_index('time').resample('15 min').fillna(method='ffill').reset_index()
energy_test_df = energy_test_df.set_index('time').resample('15 min').interpolate().reset_index()
energy_test_df['locf'] = energy_test_df_locf['energy']
print(energy_test_df)
Bonus:
The following query shows how I could ignore the missing data. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a WHERE clause to specify the times I’d like to ignore (the second query).
SELECT *
FROM energy_data
WHERE energy IS NOT NULL
SELECT *
FROM energy_data
WHERE time <= '2021-01-01 07:45:00.000' OR time >= '2021-01-01 11:30:00.000'
PostgreSQL Data Cleaning Wrap-Up
After reading through these various techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL data cleaning and TimescaleDB data cleaning provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you're looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB.
In my next posts, I'll discuss techniques for transforming data using PostgreSQL and TimescaleDB. I'll then use everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).
If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members.
If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started.
Until next time!
Functionality Glossary:
- Adding columns together
- TO_NUMBER()
- VIEW
- WHERE
- EXTRACT()
- CASE
- time_bucket()
- JOIN
- AS
- CREATE TABLE
- create_hypertable()
- INSERT INTO
- time_bucket_gapfill()