Nov 28, 2024
Posted by Miranda Auhl
As I started writing this post, I realized that to properly show how to evaluate, clean, and transform data in the database (also known as data munging), I needed to focus on each step individually. This blog post will show you exactly how to use TimescaleDB and PostgreSQL to perform your data evaluation tasks that you may have previously done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/evaluation tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.
You may be asking yourself, “What exactly do you mean by evaluating the data?” When I talk about evaluating the data, I mean really understanding the data set you are working with.
If - in a theoretical world - I could grab a beer with my data set and talk to it about everything, that is what I would do during the evaluating step of my data analysis process. Before beginning analysis, I want to know every column, every general trend, every connection between tables, etc. To do this, I have to sit down and run query after query to get a solid picture of my data.
If you remember, in my last post, I summarized the analysis process as the “data analysis lifecycle” with the following steps: Evaluate, Clean, Transform, and Model.
As a data analyst, I found that all the tasks I performed could be grouped into these four categories, with evaluating the data as the first and I feel the most crucial step in the process.
My first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of usage - such as electricity, water, sewage, you name it - to figure out how buildings could be more efficient. They would place sensors on whatever medium you wanted to monitor to help you figure out what initiatives your group could take to be more sustainable and ultimately save costs. My role at this company was to perform data analysis and business intelligence tasks.
Throughout my time in this job, I got the chance to use many popular tools to evaluate my data, including Excel, R, Python, and heck, even Minitab. But once I tried using a database - and specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward evaluating work could be when done directly in a database. Tasks that took me a while to hunt down online and figure out how to accomplish with pandas could be done intuitively through SQL. Plus, the database queries were just as fast, if not faster, than my other code most of the time.
Now, while I would love to show you a one-to-one comparison of my SQL code against each of these popular tools, that’s not practical. Besides, no one wants to read three examples of the same thing in a row! Thus, for comparison purposes in this blog post, I will directly show TimescaleDB and PostgreSQL functionality against Python code. Keep in mind that almost all code will likely be comparable to your Excel and R code. However, if you have any questions, feel free to hop on and join our Slack channel, where you can ask the Timescale community, or me, specifics on TimescaleDB or PostgreSQL functionality 😊. I’d love to hear from you!
Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! If so, you can sign up for a free 30-day trial or install and manage TimescaleDB on your own PostgreSQL instances. (You can also learn more by following one of our many tutorials.)
But enough of an intro, let’s get into the good stuff!
PostgreSQL is a database platform that uses SQL syntax to interact with the data inside it. TimescaleDB is an extension that is applied to a PostgreSQL database. To unlock the potential of PostgreSQL and TimescaleDB, you have to use SQL. So, before we jump into things, I wanted to give a basic SQL syntax refresher. If you are familiar with SQL, please feel free to skip this section!
For those of you who are newer to SQL (short for structured query language), it is the language many relational databases, including PostgreSQL, use to query data. Like pandas’ DataFrames or Excel’s spreadsheets, data queried with SQL is structured as a table with columns and rows.
The basics of a SQL SELECT command can be broken down like this 👇
SELECT --columns, functions, aggregates, expressions that describe what you want to be shown in the results
FROM --if selecting data from a table in your DB, you must define the table name here
JOIN --join another table to the FROM statement table
ON --the column(s) on which the joined tables share values
WHERE --filters results to rows where a column or expression meets some condition
GROUP BY --if your SELECT statement contains an aggregate, or if you want to group values on a column/expression, you must list the non-aggregated columns here
HAVING --similar to WHERE, this keyword helps to filter results based upon columns or expressions specifically used with a GROUP BY query
ORDER BY --allows you to specify the order in which your data is displayed
LIMIT --lets you specify the number of rows you want displayed in the output
You can think of your queries as SELECTing data FROM your tables within your database. You can JOIN multiple tables together and specify WHERE your data needs to be filtered or what it should be GROUPed BY. Do you see what I did there 😋?
This is the beauty of SQL; these keywords' names were chosen to make your queries intuitive. Thankfully, most of PostgreSQL and SQL functionality follows this same easy-to-read pattern. I have had the opportunity to teach myself many programming languages throughout my career, and SQL is by far the easiest to read, write, and construct. This intuitive nature is another excellent reason why data munging in PostgreSQL and TimescaleDB can be so efficient when compared to other methods.
Note that this list of keywords includes most of the ones you will need to start selecting data with SQL; however, it is not exhaustive. You will not need to use all these phrases for every query, but you will likely need at least SELECT and FROM. The queries in this blog post will always include these two keywords.
Additionally, the order of these keywords is specific. When building your queries, you need to follow the order that I used above. For any additional PostgreSQL commands you wish to use, you will have to research where they fit in the order hierarchy and follow that accordingly.
Seeing a list of commands may be somewhat helpful but is likely not enough to solidify understanding if you are like me. So let’s look at some examples!
Let’s say that I have a table in my PostgreSQL database called energy_usage. This table contains three columns: time, which contains timestamp values; energy, which contains numeric values; and notes, which contains string values. As you may be able to imagine, every row of data in my energy_usage table will contain:
- time: a timestamp value saying when the reading was collected
- energy: a numeric value representing how much energy was used since the last reading
- notes: a string value giving additional context to each reading
If I wanted to look at all the data within the table, I could use the following SQL query
SELECT time, energy, notes -- I list my columns here
FROM energy_usage; -- I list my table here and end the query with a semicolon
Alternatively, SQL has a shorthand for ‘include all columns’: the * operator. So I could select all the data using this query as well,
SELECT *
FROM energy_usage;
What if I want to select the data and order it by the time column so that the earliest readings are first and the latest are last? All I need to do is include the ORDER BY statement and then specify the time column along with the specification ASC to let the database know I want the data in ascending order.
SELECT *
FROM energy_usage
ORDER BY time ASC; -- first I list my time column, then I specify either DESC or ASC
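To tie the keyword order together, here is one more sketch against the same hypothetical energy_usage table. It borrows the standard SUM() aggregate (which we will revisit later in this post) purely to show where GROUP BY and HAVING sit relative to the other keywords:
-- total energy per notes value, only counting readings after a chosen date
SELECT notes, SUM(energy) AS total_energy
FROM energy_usage
WHERE time >= '2020-01-01'   -- filter individual rows before aggregating
GROUP BY notes               -- one output row per distinct notes value
HAVING SUM(energy) > 0       -- filter whole groups after aggregating
ORDER BY total_energy DESC   -- largest totals first
LIMIT 5;                     -- only show the top five groups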
Hopefully, you can start to see the pattern and feel more comfortable with SQL syntax. I will be showing a lot more code snippets throughout the post, so hang tight if you still need more examples!
So now that we have a little refresher on SQL basics, let’s jump into how you can use this language along with TimescaleDB and PostgreSQL functionality to do your data evaluating tasks!
Earlier I talked about my first job as a data analyst for an IoT sustainability company. Because of this job, I tend to love IoT data sets and couldn’t pass up the chance to explore this IoT dataset from Kaggle to show how to perform data munging tasks in PostgreSQL and TimescaleDB.
The data set contains two tables, one specifying energy consumption for a single home in Houston, Texas (called power_usage), and the other documenting weather conditions (called weather). This data is actually the same data set that I used in my previous post, so bonus points if you caught that 😊!
This data was recorded from January 2016 to December 2020. While looking at this data set - and all time-series data sets - we must consider any outside influences that could affect the data. The most obvious factor that impacts the analysis of this dataset is the COVID-19 pandemic, which affected much of 2020. Thankfully, we will see that the individual recording this data included some notes to help categorize days affected by the pandemic. As we go through this blog series, we will see patterns associated with the data collected during the COVID-19 pandemic, so definitely keep this fact in the back of your mind as we perform various data munging and analysis steps!
Here is an image explaining the two tables, their column names in red and corresponding data types in blue.
As we work through this blog post, we will use the evaluating techniques available within PostgreSQL and TimescaleDB to understand these two tables inside and out.
As we discussed before, the first step in the data analysis lifecycle - and arguably the most critical step - is to evaluate the data. I will go through how I would approach evaluating this IoT energy data, showing most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, they will cover many of the evaluating steps you perform during your analysis, helping to make your evaluating tasks more efficient by using PostgreSQL and TimescaleDB.
The techniques that I will cover include looking at the raw data, identifying categorical columns, sorting on multiple columns, aggregating data, finding abnormalities, calculating summary statistics, and building histograms.
Let’s start with the most simple evaluating task, looking at the raw data.
As we learned in the SQL refresher above, we can quickly pull all the data within a table by using the SELECT statement with the * operator. Since I have two tables within my database, I will query both tables' information by running a query for each.
PostgreSQL code:
-- select all the data from my power_usage table
SELECT *
FROM power_usage pu;
-- selects all the data from my weather table
SELECT *
FROM weather w;
But what if I don’t necessarily need to query all my data? Since all the data is housed in the database, if I want to get a feel for the data and the column values, I could just look at a snapshot of the raw data.
While conducting analysis in Python, I often would just print a handful of rows of data to get a feel for the values. We can do this in PostgreSQL by including the LIMIT command within our query. To show the first 20 rows of data in my tables, I can do the following:
PostgreSQL code:
-- select all the data from my power_usage table
SELECT *
FROM power_usage pu
LIMIT 20; -- specify 20 because I only want to see 20 rows of data
-- selects all the data from my weather table
SELECT *
FROM weather w
LIMIT 20;
Results: Some of the rows for each table
startdate | value_kwh | day_of_week | notes |
---|---|---|---|
2016-01-06 01:00:00 | 1 | 2 | weekday |
2016-01-06 02:00:00 | 1 | 2 | weekday |
2016-01-06 03:00:00 | 1 | 2 | weekday |
2016-01-06 04:00:00 | 1 | 2 | weekday |
2016-01-06 05:00:00 | 0 | 2 | weekday |
2016-01-06 06:00:00 | 0 | 2 | weekday |
date | day | temp_max | temp_avg | temp_min | dew_max | dew_avg | dew_min | hum_max | hum_avg | hum_min | wind_max | wind_avg | wind_min | press_max | press_avg | press_min | precipit | day_of_week |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2016-01-06 | 1 | 85 | 75 | 68 | 74 | 71 | 66 | 100 | 89 | 65 | 21 | 10 | 0 | 30 | 30 | 30 | 0 | 2 |
2016-02-06 | 2 | 76 | 71 | 66 | 74 | 70 | 66 | 100 | 97 | 89 | 18 | 8 | 0 | 30 | 30 | 30 | 4 | 5 |
2016-02-07 | 2 | 95 | 86 | 76 | 76 | 73 | 69 | 94 | 67 | 43 | 12 | 6 | 0 | 30 | 30 | 30 | 0 | 6 |
2016-02-08 | 2 | 97 | 87 | 77 | 77 | 74 | 71 | 94 | 66 | 43 | 15 | 5 | 0 | 30 | 30 | 30 | 0 | 0 |
2016-02-09 | 2 | 95 | 85 | 77 | 75 | 74 | 70 | 90 | 70 | 51 | 16 | 7 | 0 | 30 | 30 | 30 | 0 | 1 |
2016-02-10 | 2 | 86 | 74 | 65 | 64 | 61 | 58 | 90 | 66 | 40 | 8 | 4 | 0 | 30 | 30 | 30 | 0 | 2 |
2016-03-06 | 3 | 79 | 72 | 68 | 72 | 70 | 68 | 100 | 94 | 72 | 18 | 5 | 0 | 30 | 30 | 30 | 3 | 6 |
Python code:
In this first Python code snippet, I show the modules I needed to import and the connection code that I would have to run to access the data from my database and import it into a pandas DataFrame.
One of the challenges I faced while data munging in Python was the need to run through the entire script again and again when evaluating, cleaning, and transforming the data. This initial data pulling process usually takes a good bit of time, so it was often frustrating to run through it repetitively. I also would have to call print() anytime I wanted to quickly glance at an array, DataFrame, or element. These kinds of extra tasks in Python can be time-consuming, especially if you end up at the modeling stage of the analysis lifecycle with only a subset of the original data! All this to say: for the other Python snippets in this post, I will not include this connection code, but keep in mind that it still runs in the background.
Additionally, because I have my data housed in a TimescaleDB instance, I still need to use the SELECT statement to query the data from the database and read it into Python. If you use a relational database - which I explained is very beneficial to analysis in my previous post - you will have to use some SQL.
import psycopg2
import pandas as pd
import configparser
import numpy as np
import tempfile
import matplotlib.pyplot as plt
## use config file for database connection information
config = configparser.ConfigParser()
config.read('env.ini')
## establish connection
conn = psycopg2.connect(database=config.get('USERINFO', 'DB_NAME'),
host=config.get('USERINFO', 'HOST'),
user=config.get('USERINFO', 'USER'),
password=config.get('USERINFO', 'PASS'),
port=config.get('USERINFO', 'PORT'))
## define the queries for copying data out of our database (using format to copy queries)
query_weather = "select * from weather"
query_power = "select * from power_usage"
## define function to copy the data to a csv
def copy_from_db(query, cur):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pd.read_csv(tmpfile)
        return df
## create cursor to use in function above and place data into a file
cursor = conn.cursor()
weather_df = copy_from_db(query_weather, cursor)
power_df = copy_from_db(query_power, cursor)
cursor.close()
conn.close()
print(weather_df.head(20))
print(power_df.head(20))
Next, I think it is essential to understand any “categorical” columns - columns with a finite set of values - that I might have. This is useful in analysis because categorical data can give insight into natural groupings that often occur within a dataset. For example, I would assume that energy usage for many people is different on a weekday vs. a weekend. We can’t verify this without knowing the categorical possibilities and seeing how each could impact the data trend.
First, I want to look at my tables and the data types used for each column. Looking at the available columns in each table, I can make an educated guess that the day_of_week, notes, and day columns will be categorical. Let’s find out if they indeed are and how many different values exist in each.
To find all the distinct values within a column (or between multiple columns), you can use the DISTINCT keyword after SELECT in your query statement. This can be useful for several data munging tasks, such as identifying categories - which I need to do - or finding unique sets of data.
Since I want to look at the unique values within each column individually, I will run a query for each separately. If I were to run a query like this 👇
SELECT DISTINCT day_of_week, notes
FROM power_usage pu;
I would get data like this
day_of_week | notes |
---|---|
3 | vacation |
3 | weekday |
1 | weekday |
1 | vacation |
2 | vacation |
4 | vacation |
The output data would show unique pairs of day_of_week and notes values within the table. This is why I need to include a single column in each statement, so that I only see that individual column’s unique values and not the unique sets of values.
For these queries, I am also going to include the ORDER BY command to show the values of each column in ascending order.
PostgreSQL code:
-- selecting distinct values in the ‘day_of_week’ column within my power_usage table
SELECT DISTINCT day_of_week
FROM power_usage pu
ORDER BY day_of_week ASC;
-- selecting distinct values in the ‘notes’ column within my power_usage table
SELECT DISTINCT notes
FROM power_usage pu
ORDER BY notes ASC;
-- selecting distinct values in the ‘day’ column within my weather table
SELECT DISTINCT "day"
FROM weather w
ORDER BY "day" ASC;
-- selecting distinct values in the ‘day_of_week’ column within my weather table
SELECT DISTINCT day_of_week
FROM weather w
ORDER BY day_of_week ASC;
Results:
Notice that the recorder of this data included a COVID-19 lockdown category (“COVID_lockdown”) in their notes column. As mentioned above, this note could be essential for finding and understanding patterns in this family's energy usage.
day_of_week |
---|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
notes |
---|
COVID_lockdown |
vacation |
weekday |
weekend |
(Only some of the values shown for day)
day |
---|
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
Python code:
In my Python code, notice that I need to print anything that I want to quickly observe. I have found this to be the quickest solution, even when compared to using the Python console in debug mode.
p_day_of_the_week = power_df['day_of_week'].unique()
p_notes = power_df['notes'].unique()
w_day = weather_df['day'].unique()
w_day_of_the_week = weather_df['day_of_week'].unique()
print(sorted(p_day_of_the_week), sorted(p_notes), sorted(w_day), sorted(w_day_of_the_week))
What if I want to evaluate my tables based on how specific columns were sorted? One of the top questions asked on StackOverflow for Python data analysis is "How to sort a dataframe in python pandas by two or more columns?". Once again, we can do this intuitively through SQL.
One of the things I'm interested in identifying is how bad weather impacts energy usage. To do this, I have to think about indicators that typically signal bad weather, which include high precipitation, high wind speed, and low pressure. To identify days with this pattern in my PostgreSQL weather table, I need to use the ORDER BY keyword, then call out each column in the order I want things sorted, specifying the DESC and ASC attributes as needed.
PostgreSQL code:
-- sort weather data by precipitation desc first, wind_avg desc second, and pressure asc third
SELECT "date", precipit, wind_avg, press_avg
FROM weather w
ORDER BY precipit DESC, wind_avg DESC, press_avg ASC;
Results:
date | precipit | wind_avg | press_avg |
---|---|---|---|
2017-08-27 | 13 | 15 | 30 |
2017-08-28 | 11 | 24 | 30 |
2019-09-20 | 9 | 9 | 30 |
2017-08-08 | 6 | 5 | 30 |
2017-08-29 | 5 | 22 | 30 |
2018-08-12 | 5 | 12 | 30 |
2016-02-06 | 4 | 8 | 30 |
2018-05-07 | 4 | 7 | 30 |
2019-10-05 | 3 | 9 | 30 |
2018-03-29 | 3 | 8 | 30 |
2016-03-06 | 3 | 5 | 30 |
2018-06-19 | 2 | 12 | 30 |
2019-08-05 | 2 | 11 | 30 |
2019-10-30 | 2 | 11 | 30 |
Python code:
I have often found the different pandas or Python functions to be harder to know off the top of my head. With how popular the StackOverflow question is, I can imagine that many of you also had to refer to Google for how to do this initially.
sorted_weather = weather_df[['date', 'precipit', 'wind_avg', 'press_avg']].sort_values(['precipit', 'wind_avg', 'press_avg'], ascending=[False, False, True])
print(sorted_weather)
Finding the sum of energy usage from data that records energy per hour can be instrumental in understanding data patterns. This concept boils down to performing a type of aggregation over a particular column. Between PostgreSQL and TimescaleDB, we have access to almost every type of aggregation function we could need. I will show some of these operators in this blog series, but I strongly encourage all of you to look up more for your own use!
From the categorical section earlier, I mentioned that I suspect people could have different energy behavior patterns on weekdays vs. weekends, particularly in a single-family home in the US. Given my data set, I’m curious about this hypothesis and want to find the cumulative energy consumption across each day of the week.
To do so, I need to sum all the kWh data (value_kwh) in the power table, then group this data by the day of the week (day_of_week). In order to sum my data in PostgreSQL, I will use the SUM() function. Because this is an aggregation function, I will have to include something that tells the database what to sum over. Since I want to know the sum of energy over each type of day, I can specify that the sum should be grouped by the day_of_week column using the GROUP BY keyword. I also added the ORDER BY keyword so that we could look at the weekly summed usage in order of the day.
PostgreSQL code:
-- first I select the day_of_week col, then I define SUM(value_kwh) to get the sum of the value_kwh col
SELECT day_of_week, SUM(value_kwh) -- sum the value_kwh column
FROM power_usage pu
GROUP BY day_of_week -- group by the day_of_week col
ORDER BY day_of_week ASC; -- decided to order data by the day_of_week asc
Results:
After some quick investigation, I found that the value 0 in the day_of_week column represents a Monday, so my hypothesis may just be right.
day_of_week | sum |
---|---|
0 | 3849 |
1 | 3959 |
2 | 3947 |
3 | 4094 |
4 | 3987 |
5 | 4169 |
6 | 4311 |
Python code:
Something to note about the pandas groupby() function is that the group-by column in the DataFrame will become the index column in the resulting aggregated DataFrame. This can add some extra work later on.
day_agg_power = power_df.groupby('day_of_week').agg({'value_kwh' : 'sum'})
print(day_agg_power)
Clean data is fundamental in producing accurate analysis, and abnormalities/errors can be a huge roadblock to clean data. An essential part of evaluating data is finding abnormalities to determine if an error caused them. No data set is perfect, so it is vital to hunt down any possible errors in preparation for the cleaning stage of our analysis. Let's look at one example of how to uncover issues in a dataset using our example energy data.
After looking at the raw data in my power_usage table, I found that the notes and day_of_week columns should be the same for each hour across a single day (there are 24 hourly readings each day, and each hour is supposed to have the same notes value). In my experience with data analysis, I have found that notes which need to be recorded granularly often have mistakes within them. Because of this, I wanted to investigate whether or not this pattern was consistent across all of the data.
To check this hypothesis, I can use the TimescaleDB time_bucket() function, PostgreSQL’s GROUP BY keyword, and CTEs (common table expressions). While the GROUP BY keyword is likely familiar to you by now, CTEs and the time_bucket() function may not be. So, before I show the query, let’s dive into these two features.
Time bucket function
The time_bucket() function allows you to take a timestamp column like startdate in the power_usage table and “bucket” the time based on the interval of your choice. For example, startdate is a timestamp column that shows values for each hour in a day. You could use the time_bucket() function on this column to “bucket” the hourly data into daily data.
Here is an image that shows how rows of the startdate column are bucketed into one aggregate row with time_bucket(‘1 day’, startdate).
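Since that image isn't reproduced here, here is a minimal sketch of the same idea using a literal timestamp (my own illustration):
-- time_bucket() truncates a timestamp down to the start of its bucket
SELECT time_bucket(INTERVAL '1 day', TIMESTAMP '2016-01-06 04:00:00');
-- returns 2016-01-06 00:00:00; every hourly reading from that day
-- maps to this same daily value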
After using the time_bucket() function in my query, I will have one unique “date” value for any data recorded over a single day. Since notes and day_of_week should also be unique over each day, if I group by these columns, I should get a single set of (date, day_of_week, notes) values.
Notice that to use GROUP BY in this scenario, I just list the columns I want to group on. Also, notice that I added AS after my time_bucket() function; this keyword allows you to "rename" columns. In the results, look for the day column, as this comes directly from my rename.
PostgreSQL code:
-- select the date through time_bucket and get unique values for each
-- (date, day_of_week, notes) set
SELECT
time_bucket(interval '1 day', startdate ) AS day,
day_of_week,
notes
FROM power_usage pu
GROUP BY day, day_of_week, notes;
Results: Some of the rows
day | day_of_week | notes |
---|---|---|
2017-01-19 00:00:00 | 3 | weekday |
2016-10-06 00:00:00 | 3 | weekday |
2017-06-04 00:00:00 | 6 | weekend |
2019-01-03 00:00:00 | 3 | weekday |
2017-10-01 00:00:00 | 6 | weekend |
2019-11-27 00:00:00 | 2 | weekday |
2017-06-15 00:00:00 | 3 | weekday |
2016-11-16 00:00:00 | 2 | weekday |
2017-05-18 00:00:00 | 3 | weekday |
2018-07-17 00:00:00 | 1 | weekday |
2020-03-06 00:00:00 | 4 | weekday |
2018-10-14 00:00:00 | 6 | weekend |
Python code:
In my Python code, I cannot just manipulate the table to print results; I actually have to create another column in the DataFrame.
day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
power_df.insert(0, 'date_day', day_col)
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
print(power_unique)
Now that we understand the time_bucket() function a little better, let's look at CTEs and how they help me use this bucketed data to find any errors within the notes column.
CTEs or common table expressions
Getting unique sets of data only solves half of my problem. Now I want to verify if each day is truly mapped to a single day_of_week and notes pair. This is where CTEs come in handy. With CTEs, you can build a query based on the results of others.
CTEs use the following format 👇
WITH query_1 AS (
SELECT -- columns expressions
FROM table_name
)
SELECT --column expressions
FROM query_1;
WITH and AS allow you to define the first query; then, in the second SELECT statement, you can call the results from the first query as if it were another table in the database.
To check that each day was “mapped” to a single day_of_week and notes pair, I need to aggregate the queried time_bucket() table above based upon the date column using another PostgreSQL aggregation function, COUNT(). I am doing this because each day should only have one unique day_of_week and notes pair. If the count results in two or more, this implies that one day contains multiple day_of_week and notes pairs and thus is showing abnormal data.
Additionally, I will add a HAVING statement into my query so that the output only displays rows where the COUNT(day) is greater than one. I will also throw in an ORDER BY statement in case we have many different values greater than 1.
PostgreSQL code:
WITH power_unique AS (
-- query from above, get unique set of (date, day_of_week, notes)
SELECT
time_bucket(INTERVAL '1 day', startdate ) AS day,
day_of_week,
notes
FROM power_usage pu
GROUP BY day, day_of_week, notes
)
-- calls data from the query above, using the COUNT() agg function
SELECT day, COUNT(day)
FROM power_unique
GROUP BY day
HAVING COUNT(day) > 1
ORDER BY COUNT(day) DESC;
Results:
day | count |
---|---|
2017-12-27 00:00:00 | 2 |
2020-01-03 00:00:00 | 2 |
2018-06-02 00:00:00 | 2 |
2019-06-03 00:00:00 | 2 |
2020-07-01 00:00:00 | 2 |
2016-07-21 00:00:00 | 2 |
Python code:
Because of the count aggregation, I needed to rename the column in my agg_power_unique DataFrame so that I could then sort the values.
day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
## If you ran the previous code snippet, this next line will error since you already ran it
power_df.insert(0, 'date_day', day_col)
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
agg_power_unique = power_unique.groupby('date_day').agg({'date_day' : 'count'})
agg_power_unique = agg_power_unique.rename(columns={'date_day': 'count'})
print(agg_power_unique.loc[agg_power_unique['count'] > 1].sort_values('count', ascending=False))
This query reveals that I indeed have a couple of data points that seem suspicious. Specifically, the dates [2017-12-27, 2020-01-03, 2018-06-02, 2019-06-03, 2020-07-01, 2016-07-21]. I will demonstrate how to fix these date issues in a later blog post about Cleaning techniques.
This example only shows one set of functions which helped me identify abnormal data through grouping and aggregation. You can use many other PostgreSQL and TimescaleDB functions to find other abnormalities in your data, like utilizing TimescaleDB’s approx_percentile() function (introduced next) to find outliers in numeric columns by playing around with interquartile range calculations.
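As a quick sketch of that idea (my own illustration, borrowing the hyperfunctions introduced in the next section), you could compute the approximate quartiles of the daily sums once and then flag any day that falls outside 1.5 times the IQR:
WITH power_usage_sum AS (
    -- bucket the hourly readings into daily sums, as shown later in this post
    SELECT
        time_bucket(INTERVAL '1 day', startdate) AS day,
        SUM(value_kwh) AS sum_kwh
    FROM power_usage
    GROUP BY day
),
bounds AS (
    -- approximate 25th and 75th percentiles of the daily sums
    SELECT
        approx_percentile(0.25, percentile_agg(sum_kwh)) AS q1,
        approx_percentile(0.75, percentile_agg(sum_kwh)) AS q3
    FROM power_usage_sum
)
SELECT day, sum_kwh
FROM power_usage_sum, bounds
WHERE sum_kwh < q1 - 1.5 * (q3 - q1)
   OR sum_kwh > q3 + 1.5 * (q3 - q1)
ORDER BY sum_kwh DESC;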
Arguably, one of the more critical aspects of evaluating your data is understanding the general trends. To do this, you need to get basic statistics on your data using functions like mean, interquartile range, maximum values, and others. TimescaleDB has created many optimized hyperfunctions to perform these very tasks.
To calculate these values, I am going to introduce the following TimescaleDB functions: `approx_percentile`, `min_val`, `max_val`, `mean`, `num_vals`, `percentile_agg` (aggregate), and `tdigest` (aggregate).
These hyperfunctions fall under the TimescaleDB category of two-step aggregation. Timescale designed each function to either be an aggregate or accessor function (I noted which ones above were aggregate functions). In two-step aggregation, the more programmatically taxing aggregate function is calculated first, then the accessor function is applied to it after.
For specifics on how two-step aggregation works and why we use this convention, check out David Kohn’s blog series on our hyperfunctions and two-step aggregation.
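As a minimal sketch of that two-step pattern (my own illustration against the power_usage table; the full statistics query for this post appears a bit further down), the aggregate step can be computed once per group in a CTE and then read with several accessors:
WITH agg AS (
    -- aggregate step: build one percentile sketch per day of the week
    SELECT day_of_week, percentile_agg(value_kwh) AS pct_agg
    FROM power_usage
    GROUP BY day_of_week
)
-- accessor step: read several statistics out of each stored sketch
SELECT
    day_of_week,
    mean(pct_agg) AS avg_kwh,
    approx_percentile(0.5, pct_agg) AS median_kwh,
    num_vals(pct_agg) AS reading_count
FROM agg
ORDER BY day_of_week;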
I definitely want to understand the basic trends within the power_usage table for my data set. If I plan to do any type of modeling to predict future usage trends, I need to know some basic information about what this home’s usage looks like daily.
To understand the daily power usage data distribution, I’ll need to aggregate the energy usage per day. To do this, I can use the time_bucket() function I mentioned above, along with the SUM() operator.
-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT
time_bucket(INTERVAL '1 day', startdate ) AS day,
SUM(value_kwh)
FROM power_usage pu
GROUP BY day;
I then want to find the 1st, 10th, 25th, 75th, 90th, and 99th percentiles, the median or 50th percentile, mean, minimum value, maximum value, number of readings in the table, and interquartile range of this data. Creating the query with a CTE simplifies the process by calculating the daily sums once and reusing them multiple times.
PostgreSQL:
WITH power_usage_sum AS (
-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT
time_bucket(INTERVAL '1 day', startdate ) AS day,
SUM(value_kwh) AS sum_kwh
FROM power_usage pu
GROUP BY day
)
-- using two-step aggregation functions to find stats
SELECT approx_percentile(0.01,percentile_agg(sum_kwh)) AS "1p",
approx_percentile(0.10,percentile_agg(sum_kwh)) AS "10p",
approx_percentile(0.25,percentile_agg(sum_kwh)) AS "25p",
approx_percentile(0.5,percentile_agg(sum_kwh)) AS "50p",
approx_percentile(0.75,percentile_agg(sum_kwh)) AS "75p",
approx_percentile(0.90,percentile_agg(sum_kwh)) AS "90p",
approx_percentile(0.99,percentile_agg(sum_kwh)) AS "99p",
min_val(tdigest(100, sum_kwh)),
max_val(tdigest(100, sum_kwh)),
mean(percentile_agg(sum_kwh)),
num_vals(percentile_agg(sum_kwh)),
-- you can use subtraction to create an output for the IQR
approx_percentile(0.75,percentile_agg(sum_kwh)) - approx_percentile(0.25,percentile_agg(sum_kwh)) AS iqr
FROM power_usage_sum pus;
Results:
1p | 10p | 25p | 50p | 75p | 90p | 99p | min_val | max_val | mean | num_vals | iqr |
---|---|---|---|---|---|---|---|---|---|---|---|
0.0 | 4.0028 | 6.9936 | 16.0066 | 28.9914 | 38.9781 | 56.9971 | 0.0 | 73.0 | 18.9025 | 1498.0 | 21.9978 |
Python:
Something that really stumped me when initially writing this code snippet was that I had to use astype(float) on my value_kwh column to use describe(). I have probably spent the combined time of a day over my life trying to deal with value types being incompatible with certain functions. This is another reason why I enjoy data munging with the intuitive functionality of PostgreSQL and TimescaleDB; these types of problems just happen less often. And let me tell you, the faster and more painless data munging is, the happier I am!
agg_power = power_df.groupby('date_day').agg({'value_kwh' : 'sum'})
# need to make the value_kwh column the right data type
agg_power.value_kwh = agg_power.value_kwh.astype(float)
describe = agg_power.value_kwh.describe()
percentiles = agg_power.value_kwh.quantile([.01, .1, .9, .99])
q75, q25 = np.percentile(agg_power['value_kwh'], [75 ,25])
iqr = q75 - q25
print(describe, percentiles, iqr)
Another technique you may want to use for assessing the distribution of data in a column is a histogram. Generally, creating an image is where Python and other tools shine. However, I often need to glance at a histogram to check for any blatant anomalies when evaluating data. While this one technique in TimescaleDB may not be as simple as the Python solution, I can still do this directly in my database, which can be convenient.
To create a histogram in the database, we will need to use the TimescaleDB histogram() function, along with unnest(), generate_series(), repeat(), and CTEs.
The histogram() function takes in the column you want to analyze and produces an array object which contains the frequency values across the number of buckets plus two (one additional bucket for values below the lowest bound and one for values above the highest). You can then use PostgreSQL’s unnest() function to break up the array into a single column with rows equal to two plus the number of specified buckets.
Once you have a column with bucket frequencies, you can then create a histogram “image” using the PostgreSQL repeat() function. The first time I saw someone use the repeat() function in this way was in Haki Benita’s blog post, which I recommend reading if you are interested in learning more PostgreSQL analytical techniques. The repeat() function essentially creates a string that repeats chosen characters a specified number of times. To use the histogram frequency values, you just pass the unnested histogram value in as the repeat count.
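For example, a standalone call (not tied to our data set) looks like this:
-- repeat() builds a string by repeating the given text the given number of times
SELECT repeat('■', 5); -- returns '■■■■■'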
Additionally, I find it useful to know the approximate starting values for each bucket in the histogram. This gives me a better picture of which values fall into which bucket. To approximate the bin values, I use the PostgreSQL generate_series() function along with some algebra:
(generate_series(-1, [number_of_buckets]) * [max_val - min_val]::float/[number_of_buckets]::float) + [min_val]
When I put all these techniques together, I can build a histogram with the following query.
PostgreSQL:
WITH power_usage_sum AS (
-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT
time_bucket(INTERVAL '1 day', startdate ) AS day,
SUM(value_kwh) AS sum_kwh
FROM power_usage pu
GROUP BY day
),
histogram AS (
-- I input the column = sum_kwh, the min value = 0, max value = 73, and number of buckets = 30
SELECT histogram(sum_kwh, 0, 73, 30)
FROM power_usage_sum w
)
SELECT
-- I use unnest to create the first column
unnest(histogram) AS count,
-- I use my approximate bucket values function
(generate_series(-1, 30) * 73::float/30::float) + 0 AS approx_bucket_start_val,
-- I then use the repeat function to display the frequency
repeat('■', unnest(histogram)) AS frequency
FROM histogram;
Results:
count | approx_bucket_start_val | frequency |
---|---|---|
0 | -2.43 | |
83 | 0.0 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
104 | 2.43 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
207 | 4.87 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
105 | 7.3 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
150 | 9.73 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
76 | 12.17 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
105 | 14.6 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
62 | 17.03 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
48 | 19.47 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
77 | 21.9 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
35 | 24.33 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
83 | 26.77 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
42 | 29.2 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
72 | 31.63 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
46 | 34.07 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
51 | 36.5 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
39 | 38.93 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
32 | 41.37 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ |
24 | 43.8 | ■■■■■■■■■■■■■■■■■■■■■■■■ |
16 | 46.23 | ■■■■■■■■■■■■■■■■ |
17 | 48.67 | ■■■■■■■■■■■■■■■■■ |
4 | 51.1 | ■■■■ |
3 | 53.53 | ■■■ |
5 | 55.97 | ■■■■■ |
4 | 58.4 | ■■■■ |
5 | 60.83 | ■■■■■ |
1 | 63.27 | ■ |
0 | 65.7 | |
1 | 68.13 | ■ |
0 | 70.57 | |
1 | 73.0 | ■ |
Python:
Admittedly, the Python code here is simpler and relatively painless. I wanted to show this comparison to provide an option for displaying a histogram directly in your database vs. having to pull the data into a pandas DataFrame and then displaying it. Doing the histogram in the database just helps me keep my focus while evaluating the data.
plt.hist(agg_power.value_kwh, bins=30)
plt.show()
Hopefully, after reading through these various evaluating techniques, you feel more comfortable with exploring some of the possibilities that PostgreSQL and TimescaleDB provide. Evaluating data directly in the database often saved me time without sacrificing any functionality. If you are looking to save time and effort while evaluating your data for analysis, definitely consider using PostgreSQL and TimescaleDB.
In my next posts, I will go over techniques to clean and transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting deep-dive data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).
If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).
If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started.
Until next time!
Functionality Glossary
SELECT
FROM
ORDER BY
DESC
ASC
LIMIT
DISTINCT
GROUP BY
SUM()
time_bucket(<time_interval>, <time_col>)
WITH
AS
COUNT()
approx_percentile()
min_val()
max_val()
mean()
num_vals()
percentile_agg() [aggregate]
tdigest() [aggregate]
histogram()
unnest()
generate_series()
repeat()