Dec 04, 2024
Posted by
Prashant Sridharan
Public datasets can help us gain insight into our business and our world. Combining public datasets, either together or with our own data, often requires a series of steps to clean up (or “normalize”) the data. This blog post walks you through some data normalization techniques.
I enjoy using public datasets, combining them with metrics and business data, and seeing if there are any correlations. It’s kind of a personal hobby. I wrote a blog post about analyzing COVID-19 data along that vein.
My projects usually aim to combine data from one source with data from another source to learn something interesting. However, one problem you are likely to encounter is normalizing your data.
Data normalization is the process of organizing data to reduce redundancy and improve data integrity. It typically involves dividing a database into smaller, related tables and defining relationships between them, following rules known as normal forms to help ensure consistent and efficient data storage and retrieval.
Normalizing your data can be especially challenging in the realm of publicly available, open datasets. In my latest project, I wanted to analyze two public datasets: the San Francisco police incident database and the National Oceanic and Atmospheric Administration (NOAA) historic weather database (for the San Francisco area). There were two challenges:
incident_date
and weather_date
respectively. In the police database, dates are formatted as yyyy-mm-dd
, while in the weather database, it’s mm/dd/yyyy
.Let’s walk through how to solve both (all too common) issues.
The data normalization technique you use to clean data depends on how frequently you want to update your dataset and run your analysis.
If you’re looking to download your data once before importing and using it, then you can use anything from Excel (for smaller datasets) to Unix tools like awk
.
First, download your data in CSV format (step 0). In Excel, open your CSV file, highlight a column, and in the Ribbon (top-nav menu), select the ‘Number Formats’ drop-down (it usually starts by saying “General”), and choose the option. Select ‘Custom’ and in the ‘Type:’ field, enter your preferred date format.
In this case, we want to reformat the weather data to match the police dataset. Thus, we want to reformat the date field to yyyy-mm-dd
, which we can enter directly. Press ‘OK’ and your column is now formatted however you’d like.
While Excel works for moderately sized datasets, if you have a large dataset, datasets that you want to update frequently, or data that you retrieve programmatically at periodic intervals, you need something that you can run on the command line and/or automate.
For this, we’ll use awk
, a programming language designed for manipulating text-based data. awk
is ideal for formatting dates in CSV files. This is the awk
command for reformatting a date with mm/dd/yyyy
into yyyy-mm-dd
format:
awk -F, '{split($7,a,"/"); $7=a[3]"-"a[1]"-"a[2]}1' OFS=, Weather_SFO.csv
Let’s dissect this command:
-F
is the field separator. In this case, we have a comma-separated file, so the parameter for this argument is a comma (i.e., a comma separates each of our fields).split
: In our original comma-separated file, the 7th item in each row is the date we are trying to reformat. Therefore, we’ll split the 7th ($7
) field into strings, based on the forward slash character and store each part in an array called a
.a[3]
), first (a[1]
), and second (a[2]
) elements in the array, separated by dashes.OFS
), use commas to separate them.Weather_SFO.csv
.Once you run the awk
command, you’ll see the output in your Terminal, which you can then pipe into another file.
When we use public data, we often see “gaps” or times when the dataset doesn’t show any reported values for a given area, time, or other variable. This can be due to faulty sensors, systems maintenance or outage, human error in creating the tables, or a slew of other reasons.
For example, in the San Francisco weather data we’ve been using as an example in this blog post, we see a NULL in the second row (mid-temperature on January 6, 2018):
station | name | latitude | longitude | elevation | weather_date | precipitation | temp_max | temp_min | temp_mid
------------+--------------------------------+----------+-----------+-----------+--------------+---------------+----------+----------+----------
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-05 | 7.9 | 18.3 | 6.1 | 11.1
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-06 | 0 | 13.9 | 7.8 | [null]
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-07 | 0 | 11.7 | 8.3 | 10.6
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-08 | 93.5 | 11.1 | 8.9 | 11.1
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-09 | 9.9 | 11.7 | 10 | 10
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-10 | 0.3 | 10 | 8.3 | 10
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-11 | 0.3 | 13.3 | 8.3 | 11.7
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-12 | 0 | 16.7 | 3.9 | 8.9
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-13 | 0 | 17.2 | 4.4 | 8.3
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 2.4 | 2018-01-14 | 0 | 18.3 | 4.4 | 12.8
(10 rows)
If you’re trying to visualize data, this presents a problem: we need values for each row and can’t have instances where there isn’t any data to visualize. We see this with monitoring metrics as well; to view systems performance, we need to plot values at each desired interval, or our trend lines and graphs look faulty.
Fortunately, this is an area where TimescaleDB excels (no pun intended), with capabilities like gap filling and time_bucket—a more powerful version of SQL date_trunc. By gap-filling the information in our weather dataset, we fill in every NULL value with the last (typically most recent) value for that sensor by date. We call this “last observation carried forward” or locf
.
To do so, we use the Timescale time_bucket_gapfill
function:
SELECT
time_bucket_gapfill('1 day', weather_date) AS date,
name, latitude, longitude,
locf(avg(temp_mid), treat_null_as_missing:=true)
FROM weather
WHERE weather_date >= '2018-01-05' AND
weather_date <= '2018-01-30' AND
name = 'SAN FRANCISCO OCEANSIDE, CA US'
GROUP BY date, name, latitude, longitude
ORDER BY date
LIMIT 10;
You can read more about time_bucket_gapfill
in our documentation and blog post.
Now, when we run the query, we can see that our NULL value in the second row is replaced with the temperature value from the previous day:
date | name | latitude | longitude | locf
-----------+--------------------------------+----------+-----------+---------------------
2018-01-05 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 11.1000000000000000
2018-01-06 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 11.1000000000000000
2018-01-07 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 10.6000000000000000
2018-01-08 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 11.1000000000000000
2018-01-09 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 10.0000000000000000
2018-01-10 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 10.0000000000000000
2018-01-11 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 11.7000000000000000
2018-01-12 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 8.9000000000000000
2018-01-13 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 8.3000000000000000
2018-01-14 | SAN FRANCISCO OCEANSIDE, CA US | 37.728 | -122.5052 | 12.8000000000000000
(10 rows)
You can also use averages or your own computation to perform your gap-filling operation. Consult the Timescale documentation for more ideas based on your use case.
Public datasets are amazing. Whether you combine them with other public datasets or with your own proprietary data, they give you a wealth of resources to query, analyze, and use to better understand your world (whatever that means to you).
Hopefully, this post has given you some ideas and ways to overcome a few data normalization obstacles, so you can power through your data preparation and get to the best part: analysis, visualization, and sharing your findings with others.
And if you need a high-performance database to power those, one of the benefits of using a PostgreSQL time-series database like TimescaleDB is the ability to combine analytics and metrics (this is inherently time-series data, as we care about the specific time we collected each value) with other relational data.
For example, you may want to combine historic weather data with business performance to identify correlations that could predict future business results. You’d need a high-performance time-series database that also supports the traditional relational data for your business (for example, products, customers, and orders). Voilà Timescale! Combining your external metadata with your time-series data is just a SQL statement. Want to try it for yourself? Create a free Timescale account today.