Tools for Working With Time-Series Analysis in Python
How do you monitor your distributed systems, the cryptocurrency market, or changes in energy consumption? You look at the data and analyze it. To be more precise, you perform time-series analysis, a method that helps you better understand time-series data like that in the examples above.
Time-series data tracks changes over time, whether at regular or irregular intervals, providing insight into how and why things change. To explore this data, you'll need appropriate tools and methodologies.
Fortunately, Python's broad library and tool ecosystem have contributed to its popularity as a programming language for analyzing this data. It provides a robust framework for businesses to analyze data and make data-driven decisions.
In this blog post, we will explore the top tools for analyzing time-series data in Python. As true data enthusiasts passionate about understanding data, we like to connect Timescale and Python using psycopg2, which allows you to execute raw SQL queries and, when you pass values as query parameters, helps prevent common attacks such as SQL injection.
To learn more about how to connect Timescale and Python, check our Python quick start guide, where you will also learn how to create a Python virtual environment.
While we're aware that Timescale cannot replace all Python tools, our built-in hyperfunctions can definitely simplify and boost the efficiency of your data analysis. Sign up for a free 30-day Timescale trial if you want to try a next-level time-series analysis experience firsthand, and don't forget to check out this post for a side-by-side comparison between Timescale and Python.
But for now, let's take a step back and examine why you should use Python for time-series data analysis.
Why You Might Use Python for Time-Series Data
Python is usually recognized as the preferred programming language for data analysis due to its versatility and wide reach. It is a simple language to understand and utilize, making it suitable for both novice and expert programmers.
Python's prominence in data analysis has resulted in the creation of many online resources and forums. These resources provide assistance and direction when dealing with various forms of data. They include a variety of platforms, tutorials, and online courses that help users learn and stay up-to-date with the latest methods.
Python is a cross-platform interpreted language, meaning you can run Python code on various operating systems, including Windows, macOS, and Linux. This adaptability is helpful while working in a variety of data analysis situations.
Python's open-source nature enables continual advancement. It ensures that Python stays current on the latest developments and breakthroughs in data analysis. Python offers many modules when it comes to working with time-series data.
With the help of these modules, you can perform various types of time-series analysis, such as curve fitting, forecasting, and classification.
Python's robust ecosystem of machine learning packages makes it a strong contender for time-series prediction and anomaly detection. The scikit-learn library is known for its extensive collection of machine-learning algorithms.
It includes several ensemble approaches and regression models, along with utilities that suit time-series work, such as time-ordered cross-validation splits. Such tools enable data scientists to build accurate forecasting models and detect anomalies in time-series data.
Python's extensive library ecosystem contains methods for extracting features from time-series data. These features record different data properties over time, such as the maximum, minimum, and median values. These characteristics can help find the patterns and trends in the time-series data.
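As a minimal sketch of what such features look like, the snippet below computes the maximum, minimum, and median for two short series with pandas. The column names (id, time, value) and the numbers are made up for illustration; dedicated feature extraction libraries compute far more features than these three.

```python
import pandas as pd

# Hypothetical example data: two short series, identified by the "id" column
df = pd.DataFrame({
    "id": ["a"] * 5 + ["b"] * 5,
    "time": list(range(5)) * 2,
    "value": [1.0, 3.0, 2.0, 5.0, 4.0, 10.0, 8.0, 9.0, 7.0, 6.0],
})

# One row of summary features per series
features = df.groupby("id")["value"].agg(["max", "min", "median"])
print(features)
```

Each row of the resulting table describes one series, which is the usual shape for feeding extracted features into a machine-learning model.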
It's a fascinating world that requires the correct tool to begin. But, with so many alternatives, how does one select the best tool for the job? Let's start with the general-purpose tools for working with data in Python and then move on to time-series-specific libraries.
General Tools for Working With Data in Python
Python offers a robust ecosystem of analysis, manipulation, and data visualization tools. This makes it an excellent choice for data scientists and analysts. Among these tools, data professionals regularly rely on three fundamental libraries for numerical and time-series analysis: NumPy, pandas, and Matplotlib. These libraries provide a solid foundation for running various mathematical and statistical procedures on your data.
They allow users to handle arrays and matrices. You can also execute complicated mathematical operations like Fourier transforms, linear algebra, and statistical evaluations.
These three packages also provide a foundation for more advanced tools and frameworks like scikit-learn, which can be used for more complex machine-learning applications. Let's look at the top tools for working with data in Python.
NumPy
NumPy is a popular Python package for numerical and scientific computations. It is the de facto standard in Python for numerical data and the foundation for several other numerical and time-series libraries, either directly or indirectly via pandas.
NumPy offers an advanced interface for rapid numerical processing, which makes it an indispensable tool for data scientists and analysts dealing with massive amounts of numerical information. It also includes several scientific and statistical functions, such as linear algebra routines that can be used to fit a linear regression.
The following code snippet in Python shows the relationship between time (months) and the number of passengers using the Air Passengers dataset.
import numpy as np
import pandas as pd
# Load the AirPassengers dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/air
data = pd.read_csv(url)
# Extract the time (months) and number of passengers
time = np.arange(len(data)) + 1
passengers = data['Passengers'].values
# Adding a column of ones for the intercept term
X = np.column_stack((np.ones_like(time), time))
# Calculating the coefficients using linear algebra
coefficients = np.linalg.inv(X.T @ X) @ X.T @ passengers
# Extracting the intercept and slope
intercept = coefficients[0]
slope = coefficients[1]
# Print the results
print("Intercept:", intercept)
print("Slope:", slope)
Intercept: 87.65277777777774
Slope: 2.6571839080459783
The results show a positive association between time and the number of passengers. The intercept of roughly 87.65 is the fitted value when time equals zero. Because the time index in the code starts at 1, that point lies one step before the first observation, so the fitted line starts at about 90 passengers in the first month.
The slope of about 2.66 implies that the number of passengers grows by approximately 2.66 per month on average. From these findings, we can conclude that time and the number of passengers have a positive linear relationship: the number of passengers tends to rise as time passes.
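The example above solves the normal equations with an explicit matrix inverse. As a hedged cross-check, the sketch below fits the same kind of model on synthetic data (a made-up noisy trend, not the Air Passengers dataset) and compares the explicit inverse with np.linalg.lstsq, a least-squares solver that avoids forming the inverse.

```python
import numpy as np

# Hypothetical data: a noisy linear trend over 48 monthly steps
rng = np.random.default_rng(0)
time = np.arange(1, 49)
values = 100.0 + 2.5 * time + rng.normal(scale=5.0, size=time.size)

# Design matrix with an intercept column, as in the example above
X = np.column_stack((np.ones(time.size), time))

# Normal equations with an explicit inverse
beta_inverse = np.linalg.inv(X.T @ X) @ X.T @ values

# Least-squares solver, which does not form the inverse explicitly
beta_lstsq, *_ = np.linalg.lstsq(X, values, rcond=None)

print("Explicit inverse:", beta_inverse)
print("lstsq:           ", beta_lstsq)
```

Both approaches should agree to within floating-point error on a small, well-conditioned problem like this one. The solver is generally the safer choice when the design matrix has many or nearly collinear columns.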
Furthermore, NumPy can also be used in predictive modeling and machine learning applications. Due to its quick calculations, NumPy has become a fundamental tool for many data experts, allowing them to handle and analyze complex numerical data efficiently.
Pandas
Pandas is a practical data analysis and manipulation toolkit. It includes fundamental data structures, most notably the DataFrame, for representing and working with statistical and numerical information. Pandas handles various types of data, including time-series data.
The ability to manage and reshape structured data is one of pandas' primary strengths. The DataFrame, a two-dimensional table-like data structure, makes indexing, sorting, rearranging, and aggregating data simple. It helps you obtain essential insights by providing versatile and straightforward methods to slice and split data.
Pandas acts as the foundation for several time-series-specific libraries that provide extra features. These libraries use pandas' underlying data structures, such as the DataFrame, to analyze time series. Pandas itself supports irregular time-series handling, time-based indexing, resampling, and rolling window calculations.
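Time-based indexing and resampling can be sketched in a few lines. The readings below are invented for illustration (28 values taken every six hours); the snippet selects one day with a partial date string and then aggregates the series into daily means.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken every 6 hours for one week in March 2022
index = pd.date_range("2022-03-01", periods=28, freq=pd.Timedelta(hours=6))
readings = pd.Series(np.arange(28, dtype=float), index=index)

# Time-based indexing: select a single day with a partial date string
march_2 = readings.loc["2022-03-02"]
print(march_2)

# Resampling: aggregate the 6-hourly readings into daily means
daily_mean = readings.resample("D").mean()
print(daily_mean)
```

Because the index is a DatetimeIndex, pandas knows which rows belong to which day, so no manual grouping by date is needed.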
To show how the rolling window function works in pandas, we've generated a sample DataFrame with a time-series index covering every day of March 2022. The 'value' column contains random integers.
import pandas as pd
import numpy as np
# Create a sample DataFrame with a time series index
data = {
'date': pd.date_range(start='2022-03-01', end='2022-03-31'),
'value': np.random.randint(1, 100, size=31)
}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
# Calculate the rolling mean with a window size of 7 days
rolling_mean = df['value'].rolling(window=7).mean()
# Calculate the rolling sum with a window size of 14 days
rolling_sum = df['value'].rolling(window=14).sum()
# Calculate the rolling standard deviation with a window size of 30 days
rolling_std = df['value'].rolling(window=30).std()
# Print the original DataFrame and the rolling calculations
print("Original DataFrame:")
print(df)
print("\nRolling Mean (window size=7):")
print(rolling_mean)
print("\nRolling Sum (window size=14):")
print(rolling_sum)
print("\nRolling Standard Deviation (window size=30):")
print(rolling_std)
After creating the DataFrame, we use the rolling() function on the value column to calculate rolling windows: the rolling mean with a 7-day window, the rolling sum with a 14-day window, and the rolling standard deviation with a 30-day window.
The rolling computations are saved in distinct Series objects (rolling_mean, rolling_sum, and rolling_std), which can then be used for additional analysis or visualization. Finally, we print the original DataFrame and the computed rolling mean, sum, and standard deviation.
Results:
The output displays the original DataFrame and the computed rolling mean, sum, and standard deviation.
Rolling Mean (window size=7): The rolling mean values for each date over seven days. Because there aren't enough preceding values to calculate the mean, the first six values are NaN (Not a Number). From the seventh date onwards, the rolling mean is the average of the value column over the current date and the six dates before it.
Rolling Sum (window size=14): The rolling sum values for each date over 14 days. As with the rolling mean, the first 13 values are NaN, and from the 14th date onwards the rolling sum adds up the value column over the most recent 14 days.
Rolling Standard Deviation (window size=30): The rolling standard deviation values for each date using a 30-day window. Again, the first 29 values are NaN, and from the 30th date onwards the rolling standard deviation is computed from the value column over the most recent 30 days.
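The NaN counts described above can be checked directly. The sketch below uses a simple made-up series (the values 1 to 31) on the same March 2022 index, counts the NaN entries for each window size, and then shows the min_periods parameter, which lets pandas return a result from a partially filled window.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in series on the same daily index as the example above
index = pd.date_range(start="2022-03-01", end="2022-03-31")
values = pd.Series(np.arange(1, 32, dtype=float), index=index)

# With the default settings, the first (window - 1) results are NaN
nan_counts = {}
for window in (7, 14, 30):
    rolled = values.rolling(window=window).mean()
    nan_counts[window] = int(rolled.isna().sum())
print(nan_counts)

# min_periods=1 computes a mean as soon as one value is available
partial = values.rolling(window=7, min_periods=1).mean()
print(partial.head(7))
```

Whether partial windows are acceptable depends on the analysis: they remove the leading NaN values, but the early results are averages over fewer observations.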
Pandas also allows you to import and clean data. It includes a comprehensive range of tools and techniques for data processing, missing data management, combining datasets, and more. Pandas also works well alongside other data analysis and visualization tools. This makes it a vital tool in the data analysis environment.
Matplotlib
Matplotlib, a popular Python plotting library, offers numerous data visualization options. It enables users to build both basic exploratory plots and complex scientific visualizations. With Matplotlib, you can be creative while still displaying your data effectively.
The available plots include bar charts, histograms, and line graphs. Matplotlib also excels at time-series data, since visualizing time-dependent data is an essential part of time-series analysis.
What sets Matplotlib apart is its extensive customization options. You have full control over plot features, allowing you to fine-tune axes, titles, colors, fonts, and labels. This flexibility empowers you to craft visually stunning and intuitive plots that effectively convey your message.
Whether you need to perform fundamental exploratory analysis or create sophisticated visualizations for time-series purposes, Matplotlib is your go-to tool. It combines power, versatility, and aesthetics to elevate your data visualization game.
To show how you can work with Matplotlib to visualize time-series data and uncover valuable insights, we plotted the number of passengers over time. The resulting time-series graph shows how the values change over the years.
This pattern indicates a positive trend in passenger traffic, expected as air travel becomes more accessible and popular. The plot starts with relatively few passengers in the early years and shows a consistent upward trend.
There are also seasonal fluctuations, with peaks occurring around the middle of each year. These peaks represent the high travel demand during vacation seasons or other factors influencing air travel.
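A plot of this kind can be sketched as follows. This is not the original plotting code: the series is a synthetic stand-in with an upward trend and a mid-year peak, and the output file name passengers.png is an arbitrary choice. The Agg backend is selected only so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to show a window instead
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data: linear trend plus a yearly cycle peaking in July
months = pd.date_range("2015-01-01", periods=60, freq="MS")
t = np.arange(60)
passengers = 100 + 2.5 * t + 20 * np.sin(2 * np.pi * (t - 3) / 12)

# Line plot with a title, axis labels, and a legend
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(months, passengers, label="Passengers (synthetic)")
ax.set_title("Monthly passengers over time")
ax.set_xlabel("Month")
ax.set_ylabel("Number of passengers")
ax.legend()
fig.tight_layout()
fig.savefig("passengers.png")  # hypothetical output file name
```

Passing the dates as the x values lets Matplotlib pick sensible date ticks, which is usually all a first exploratory time-series plot needs.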
Time-Series Libraries and Tools in Python
Tsfresh
Tsfresh is a Python time-series feature extraction package that automates the extraction of a significant amount of time-series features. It enables analysts and data scientists to swiftly and effectively extract meaningful insights from time-series data.
Tsfresh can calculate hundreds of time-series features out of the box. These include statistical attributes, spectral characteristics, and additional features that capture various aspects of time-series data. It offers several customizable settings that let users tailor feature extraction to their needs.
Tsfresh is intended to operate with multiple time-series datasets, including those with missing values, irregular sampling rates, and other frequent problems in time-series research. It also offers several integration options with major time-series libraries, such as pandas and scikit-learn.
To show how it works in Python, we generated a 100-point synthetic time-series dataset. The dataset contains a sine wave with added noise to simulate a real-world signal. The generated data is then used to construct a DataFrame with two columns: the datetime index representing the time points and the time-series data.
In this example, we add a column called id to the DataFrame and use the index values as identifiers. After running feature extraction with Tsfresh, we obtain a DataFrame with dimensions [5 rows x 788 columns], meaning Tsfresh has calculated 788 features for the provided time-series dataset. Each row represents a different time series, and each column holds a specific Tsfresh feature.
Sktime
Sktime is an open-source Python machine-learning package specializing in time-series data. It is designed to be interoperable with scikit-learn, which means that when dealing with time-series data, users can take advantage of scikit-learn's algorithms and evaluation metrics.
Sktime offers a wide range of features for time-series analysis tasks. These tasks include the following:
- Regression
- Clustering
- Forecasting
- Transformations
One of the key advantages of sktime is its flexibility to handle many forms of time-series data. It provides a variety of feature extraction methods to help you extract relevant insights from time-series data. These features include Fourier transforms and autocorrelation.
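To make these two feature types concrete, here is a minimal sketch in plain NumPy. It does not use sktime's own transformers; the helper autocorr and the synthetic series (a noisy sine wave with a period of 12 steps) are assumptions made for illustration.

```python
import numpy as np

# Synthetic series: a sine wave with a period of 12 steps plus noise
n = 120
t = np.arange(n)
rng = np.random.default_rng(1)
series = np.sin(2 * np.pi * t / 12) + 0.2 * rng.normal(size=n)

# Fourier feature: the dominant period, read from the largest spectral peak
centered = series - series.mean()
spectrum = np.abs(np.fft.rfft(centered))
freqs = np.fft.rfftfreq(n, d=1.0)
dominant_freq = freqs[1:][np.argmax(spectrum[1:])]  # skip the zero frequency
dominant_period = 1.0 / dominant_freq

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive lag (hypothetical helper)."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Autocorrelation features at half a cycle and at a full cycle
acf_6 = autocorr(series, 6)
acf_12 = autocorr(series, 12)

print("Dominant period:", dominant_period)
print("Autocorrelation at lag 6:", acf_6)
print("Autocorrelation at lag 12:", acf_12)
```

For a series with a 12-step cycle, the autocorrelation should be strongly positive at lag 12 and strongly negative at lag 6, which is the kind of signature a downstream model can learn from.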
Sktime contains standard statistical methods and cutting-edge deep-learning models for time-series analysis. This allows users to apply advanced techniques and algorithms to gain insights and create accurate forecasts from time-series data. It also features cross-validation approaches that account for temporal dependencies in time-series data and can be used to estimate the performance of time-series models.
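The idea behind temporal cross-validation is that training data must always come before the test data. The sketch below illustrates it with a hand-written expanding-window splitter in NumPy; the function expanding_window_splits is hypothetical and is not sktime's API. Each fold is scored with a naive forecast that repeats the last training value.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs; training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = np.arange(test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx

# Synthetic random-walk series
rng = np.random.default_rng(2)
series = np.cumsum(rng.normal(size=60))

splits = list(expanding_window_splits(len(series), n_splits=3, test_size=10))

# Score a naive forecast (repeat the last training value) on each fold
errors = []
for train_idx, test_idx in splits:
    forecast = series[train_idx][-1]
    errors.append(float(np.mean(np.abs(series[test_idx] - forecast))))

print("Mean absolute error per fold:", errors)
```

Unlike shuffled k-fold cross-validation, no fold ever trains on observations that come after the ones it is tested on, so the error estimate is not inflated by information from the future.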
To show how it works in Python, we generated a synthetic dataset and fitted an ARIMA model to it. The following plot shows the training, test, and predicted values over time.
In this case, since the synthetic data is randomly generated, there may not be any specific trends or seasonality present. The slight increase at the start and subsequent decrease are due to the data's randomness and the ARIMA model's modeling assumptions. The behavior can vary depending on the specific dataset and model parameters used.
AutoTS
AutoTS is a Python-based open-source automated time-series framework that aims to produce "high-accuracy forecasts at scale." It is built on the renowned scikit-learn package and offers a variety of automated machine learning (AutoML) capabilities. It can help users make accurate time-series predictions quickly and easily.
To determine the optimal model for a specific time-series dataset, AutoTS can automatically develop and compare different time-series models. These models include ARIMA, exponential smoothing, and Facebook Prophet models. It can also increase accuracy by optimizing hyperparameters and selecting features for every model.
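The core idea of automated model comparison can be sketched without any forecasting library: fit several candidate models on a training window, score each one on a holdout window, and keep the best. The three baseline forecasters below are deliberately simple stand-ins written for this illustration; they are not AutoTS's API or its model list.

```python
import numpy as np

def naive(train, horizon):
    """Repeat the last observed value."""
    return np.repeat(train[-1], horizon)

def mean_forecast(train, horizon):
    """Repeat the training mean."""
    return np.repeat(train.mean(), horizon)

def seasonal_naive(train, horizon, period=12):
    """Repeat the last full seasonal cycle."""
    last_cycle = train[-period:]
    return np.array([last_cycle[i % period] for i in range(horizon)])

# Synthetic monthly series with a yearly cycle
rng = np.random.default_rng(3)
t = np.arange(96)
series = 50 + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=96)

# Hold out the last 12 points for evaluation
train, test = series[:-12], series[-12:]

candidates = {
    "naive": naive,
    "mean": mean_forecast,
    "seasonal_naive": seasonal_naive,
}

# Mean absolute error of each candidate on the holdout window
scores = {
    name: float(np.mean(np.abs(fn(train, len(test)) - test)))
    for name, fn in candidates.items()
}
best = min(scores, key=scores.get)

print(scores)
print("Best model:", best)
```

An automated framework applies the same loop to a much larger pool of models and settings, often with several validation windows instead of a single holdout.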
AutoTS's capacity to scale to large datasets is one of its primary strengths. It parallelizes model training and assessment using distributed computing frameworks such as Dask and Ray, allowing it to handle large datasets and sophisticated models.
All of AutoTS's features can also be integrated into an AutoML pipeline, which can automatically determine the best-fitting model for a given time-series dataset. You can save significant time and effort by automating the entire time-series modeling process. For more information about AutoTS features, you can visit GitHub.
Prophet
Prophet is a time-series forecasting library built by Facebook's Core Data Science team. It is intended to be user-friendly and can be used from both R and Python. Prophet excels at analyzing time series with seasonal patterns and holiday effects.
Prophet uses a decomposable time-series model with three primary components: trend, seasonality, and holidays. The trend component represents non-periodic changes in the time series, whereas the seasonality component models periodic changes. The holiday component captures how holidays and other special events affect the time series.
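The additive idea behind such a model can be illustrated with ordinary least squares in NumPy. This is a simplified sketch, not Prophet's API or its fitting procedure: the data is synthetic, the trend is a straight line, the seasonality is a single weekly sine and cosine pair, and the holiday dates are invented.

```python
import numpy as np

# Synthetic daily data: trend + weekly seasonality + holiday effect + noise
n_days = 140
t = np.arange(n_days)
rng = np.random.default_rng(4)

holiday = np.zeros(n_days)
holiday[[30, 75, 120]] = 1.0  # hypothetical holiday dates

y = (
    20 + 0.1 * t                      # trend
    + 3 * np.sin(2 * np.pi * t / 7)   # weekly seasonality
    + 8 * holiday                     # holiday effect
    + rng.normal(scale=0.5, size=n_days)
)

# Design matrix: intercept, linear trend, weekly Fourier pair, holiday flag
X = np.column_stack([
    np.ones(n_days),
    t,
    np.sin(2 * np.pi * t / 7),
    np.cos(2 * np.pi * t / 7),
    holiday,
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Split the fitted values back into the three components
trend_fit = X[:, :2] @ coef[:2]
season_fit = X[:, 2:4] @ coef[2:4]
holiday_fit = X[:, 4] * coef[4]

print("Estimated trend slope:", coef[1])
print("Estimated holiday effect:", coef[4])
```

Because the model is additive, the three fitted components sum to the full fitted series, and each one can be inspected or plotted on its own.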
Prophet can automatically handle missing data, outliers, and shifts in trends. It can also model multiple seasonal periods and include extra regressors that may improve time-series predictions.
Prophet has a simple interface for creating forecasts and can also provide visualizations of the forecasted data. It is well-documented and has an extensive and active user community.
Boost Your Time-Series Data Analysis With Timescale
Time-series analysis is one of the most challenging problems in data science, and your analysis in Python can benefit from Timescale's efficiency boost.
If you're just getting started and want to connect your Python code and your PostgreSQL (or Timescale) database, here's How to Use Psycopg2: The PostgreSQL Adapter for Python.
By connecting Python and Timescale, you'll get a flexible and configurable way to do complex analytics like machine learning, anomaly detection, and forecasting. This is a great option for organizations that deploy advanced time-series analysis in their apps.
You might also want to have a look at the Timescale native hyperfunctions, which can handle some of the analysis you might normally use Python for while keeping your data in the database. Network round trips are expensive, so this is a very efficient way to work.
If you want to analyze your time-series data faster and better using Timescale and Python, sign up for a 30-day free trial of Timescale, where you will get all the PostgreSQL goodness raised to another level in a cloud environment. And remember, you can also run TimescaleDB locally for testing and development.