Jun 20, 2024

Guide to Time-Series Analysis in Python

Posted by

Anber Arif

One of the numerous ways software engineers add value to an organization is by performing time-series analysis. This powerful technique extracts valuable insights from temporal data by analyzing time-based patterns and making predictions from them.

This blog post will delve into the world of time-series analysis using Python, often considered the go-to programming language for data analysis. Python offers a rich library and tools ecosystem, making it an ideal choice for working with time-series data.

However, using Python with a robust time-series database like Timescale can speed up and simplify your data analysis. See our Python quick start to leverage Timescale’s fast queries, performance, and features, or keep reading for more info and a step-by-step guide.

Now, back to Python.

An Example of Time-Series Analysis With Python

Python has quickly emerged as a preferred tool for data analysis due to its simplicity, versatility, and vast community support. With its intuitive syntax and extensive library ecosystem, this elegant programming language allows you to tackle complex problems efficiently.

Whether you are building a data-intensive application or working with an experienced data scientist, Python provides a robust platform for exploring, visualizing, and modeling time-dependent data.

Let's see how Python can empower your work with time-series data. Consider the following example code snippet that loads a time-series dataset using pandas and plots it using Matplotlib:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Generate random time-series data
np.random.seed(42)
dates = pd.date_range(start='2022-01-01', periods=100, freq='D')
values = np.random.randn(100).cumsum()

# Create a DataFrame from the generated data
data = pd.DataFrame({'date': dates, 'value': values})

# Set the 'date' column as the index
data.set_index('date', inplace=True)

# Plot the time-series data
plt.plot(data.index, data['value'])
plt.xlabel('Time')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')
plt.show()

This example consists of random data generated by NumPy’s random number generator. The dataset consists of 100 dates, starting from January 1, 2022, and corresponding random values. The data is converted into a Pandas DataFrame, and the 'date' column is set as the index. Finally, the time-series data is plotted using Matplotlib, displaying the variation of the 'value' over time.


Why Use Python for Time-Series Data Analysis?

Python brings a host of benefits to the table regarding time-series analysis:

  • It is a user-friendly language.
  • It is widely available in the open-source world.
  • It has extensive library support.
  • It makes it easy to reuse existing code.

Let’s dig into these advantages.

Python is easy to use

Python is known for its simplicity and user-friendliness. Its intuitive syntax makes it easy to learn, even for beginners. The clean structure of Python code promotes efficient coding practices, allowing you to focus on analyzing time-series data rather than grappling with complex programming concepts.

Python is open source

One of Python's great advantages is that it's an open-source language. This means it is freely available to use and is continuously improved and supported by a vibrant community of developers. The open-source nature of Python enables data scientists to access a wealth of resources, tools, and libraries for analyzing time-series data without incurring additional costs.

Python offers extensive library support

Python offers extensive library support for time-series analysis. Libraries such as pandas, NumPy, statsmodels, and scikit-learn provide functions and tools suited to the unique challenges of working with time-dependent data, from date-aware data structures to statistical models. They simplify complex operations, allowing you to focus on extracting meaningful insights rather than reinventing the wheel.

Python facilitates code reusability

Thanks to its longevity and widespread adoption, Python has a vast codebase that data scientists and application developers can leverage for their time-series analysis needs.

The Python community has already implemented many common tasks, such as data loading, cleaning, transformation, and visualization. Building upon existing code and solutions can save time and effort and accelerate the analysis process.

Plotting Data Using Pyplot

Plotting time-series data is an essential step in visualizing patterns, trends, and anomalies. Python provides the Matplotlib library, which includes the Pyplot module for creating various types of plots, including line plots, scatter plots, and histograms.

To illustrate this, let's create a random dataset and plot it using Pyplot:

import numpy as np
import pandas as pd  # Needed for pd.date_range below
import matplotlib.pyplot as plt

# Generate random time-series data
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', periods=100)
values = np.random.randn(100).cumsum()

# Plot the time-series data
plt.plot(dates, values)
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')
plt.show()

Time-Series Analysis Tasks in Python

Time-series analysis involves examining historical data to uncover patterns, trends, and other valuable insights. It is a crucial step in understanding the behavior of time-dependent data and making predictions for the future. Time-series analysis encompasses numerous techniques, such as trend analysis, seasonality detection, forecasting, and anomaly detection.

Editor’s Note: If you want to learn how to use Timescale to analyze historical data for speedy client-facing analytics dashboards, read Octave’s story (whose team built most of their backend software using Python).



In Python, various techniques are available to analyze data for trends and patterns. These techniques enable data scientists and developers to gain valuable insights into the underlying characteristics of their datasets. Understanding data trends and patterns is crucial for making informed decisions and predictions based on the available information.

Stationarity

Stationarity refers to a key concept in time-series analysis where the statistical properties of a dataset, such as mean and variance, remain constant over time. In Python, testing for stationarity involves methods like the Augmented Dickey-Fuller (ADF) test, Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, and visual inspection of time series plots. Here's an example of how to test for stationarity in Python using the ADF test:

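from statsmodels.tsa.stattools import adfuller

# Assuming 'data' is the DataFrame from the first example, with a 'value' column
# adfuller expects a one-dimensional series, so pass the column rather than the DataFrame
result = adfuller(data['value'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])  # A small p-value (e.g., below 0.05) suggests the series is stationary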

Seasonality

Seasonality pertains to recurring patterns or fluctuations in a time series that occur at regular intervals. It introduces predictable variations in the data over specific periods. Distinguishing between seasonality and stationarity is essential, as they represent different aspects of time series behavior.

Testing for seasonality in Python can be accomplished through decomposition analysis and autocorrelation function (ACF) plots. One example of testing for seasonality involves decomposing the time series and analyzing the seasonal component visually.

Autocorrelation and partial autocorrelation

Autocorrelation measures the relationship between a variable's current and past values at different time lags. On the other hand, partial autocorrelation quantifies the direct relationship between a variable's current value and its past values, excluding the influence of intermediate-lagged variables.

In Python, testing for autocorrelation and partial autocorrelation often involves plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF) and observing the patterns. Here's an example of how to visualize autocorrelation and partial autocorrelation in Python:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Assuming 'data' is the time series data
plot_acf(data)
plot_pacf(data)
plt.show()
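
If you want the numbers rather than a plot, pandas can compute the autocorrelation at a single lag. The sketch below assumes a synthetic sine wave with a period of 20 samples plus a little noise, so the correlation should be strongly positive at lag 20 and strongly negative at lag 10 (half a period).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
t = np.arange(200)

# Sine wave with a 20-sample period and small Gaussian noise (illustrative data).
series = pd.Series(np.sin(2 * np.pi * t / 20) + rng.normal(0, 0.1, 200))

# Series.autocorr correlates the series with a copy of itself shifted by `lag`.
for lag in (1, 10, 20):
    print(f"lag {lag:>2}: {series.autocorr(lag=lag):+.2f}")
```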

Predicting future values based on historical data

Python offers a variety of libraries and techniques for time-series forecasting, and one popular method is the autoregressive integrated moving average (ARIMA) model. ARIMA is a powerful and widely used approach that combines the three following components to capture the patterns and trends in time-series data:

1. Autoregression (AR)

2. Differencing (I)

3. Moving Average (MA)

You can use the ARIMA class from statsmodels.tsa.arima.model to apply the ARIMA model in Python. This class allows you to specify the order of the AR, I, and MA components and fit the model to your historical data. Once the model is fitted, you can forecast future values by calling the predict method with the start and end of the forecast period, or the forecast method with a number of steps.

Types of forecasting models

In time-series analysis, various forecasting models are available to predict future values based on historical data. Each model has its own strengths, limitations, and suitability for different types of time-series data. Let's explore some common types of forecasting models:

Moving Average (MA)

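In its simplest form, a moving average forecasts future values from the average of the most recent observations. It smooths out short-term fluctuations and helps identify underlying trends in the data. Note that this smoothing is distinct from the MA component of ARMA and ARIMA models, which describes each value as a function of past forecast errors.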

You can implement the moving average model using the rolling function in pandas, which calculates the mean over a specified window of past observations. Here's a simplified example:

import pandas as pd

# Assuming 'data' is the time series data
window_size = 3
moving_avg = data.rolling(window=window_size).mean()
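
To see what the rolling mean produces, here is a small worked example on six made-up values. The first two results are NaN because a window of three is not yet full.

```python
import pandas as pd

# Illustrative data: six evenly spaced observations.
data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Each output value is the mean of the current and two previous observations.
moving_avg = data.rolling(window=3).mean()
print(moving_avg.tolist())  # [nan, nan, 2.0, 3.0, 4.0, 5.0]

# A naive one-step-ahead forecast is the last available moving average.
naive_forecast = moving_avg.iloc[-1]
```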

Autoregressive (AR)

The autoregressive model predicts future values using past observations and a linear regression equation. It assumes that the future values depend on the previous values with a lag.

You can implement the autoregressive model using the AutoReg class from the statsmodels library, which fits an autoregressive model to the time-series data. (AutoReg replaces the older AR class, which has been deprecated.) Here's an example:

from statsmodels.tsa.ar_model import AutoReg

# Assuming 'data' is the time series data
# AutoReg needs an explicit lag order
model = AutoReg(data, lags=1)  # Replace 1 with an appropriate number of lags
ar_model = model.fit()
predictions = ar_model.predict(start=len(data), end=len(data) + n - 1)  # Replace n with the number of future values to predict
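
To show the "linear regression on lagged values" idea without any library machinery, the sketch below simulates a first-order autoregressive process and recovers its coefficient by least squares. It uses plain NumPy rather than statsmodels, and the coefficient 0.7, the sample size, and the seed are assumptions for the simulation only.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = 0.7  # assumed coefficient used to simulate the data
n = 2000

# Simulate x[t] = phi * x[t-1] + noise.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

# Regress x[t] on x[t-1] with no intercept (the simulated mean is zero).
lagged, current = x[:-1], x[1:]
phi_hat = float(lagged @ current / (lagged @ lagged))

# One-step-ahead forecast from the last observed value.
next_value = phi_hat * x[-1]
print(f"estimated phi: {phi_hat:.3f}")
```

The estimate should land close to the simulated coefficient; a library model such as AutoReg adds an intercept, higher lag orders, and proper inference on top of this idea.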

Autoregressive Moving Average (ARMA)

The ARMA model combines the autoregressive and moving average models, making predictions based on past observations and the average of past errors.

You can implement the autoregressive moving average model using the ARIMA class from the statsmodels library with the differencing order d set to 0, which reduces ARIMA to ARMA. Here's an example:

from statsmodels.tsa.arima.model import ARIMA

# Assuming 'data' is the time series data
model = ARIMA(data, order=(p, 0, q))  # Replace p and q with appropriate values; d=0 gives an ARMA model
arma_model = model.fit()
predictions = arma_model.predict(start=len(data), end=len(data) + n - 1)  # Replace n with the number of future values to predict

Autoregressive Integrated Moving Average (ARIMA)

The ARIMA model extends the ARMA model by incorporating differencing to make the time series stationary. It is suitable for non-stationary data with trends. For data with seasonal patterns, the seasonal variant (SARIMA) described below is usually a better fit.

The autoregressive integrated moving average model can also be implemented using the ARIMA class from the statsmodels library. Here's an example:

from statsmodels.tsa.arima.model import ARIMA

# Assuming 'data' is the time series data
model = ARIMA(data, order=(p, d, q))  # Replace p, d, and q with appropriate values
arima_model = model.fit()
# This ARIMA class predicts on the original scale of the data, so no 'typ' argument is needed
predictions = arima_model.predict(start=len(data), end=len(data) + n - 1)  # Replace n with the number of future values to predict

Exponential Smoothing

Exponential smoothing models apply weights to past observations, giving more importance to recent values. Different variations of exponential smoothing, such as Simple Exponential Smoothing (SES), Holt's Linear Exponential Smoothing, and Holt-Winters Exponential Smoothing, accommodate different patterns in the data.

You can implement exponential smoothing models using the SimpleExpSmoothing, ExponentialSmoothing, and Holt classes from the statsmodels library. Here's an example of Simple Exponential Smoothing (SES):


from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Assuming 'data' is the time series data
model = SimpleExpSmoothing(data)
ses_model = model.fit()
predictions = ses_model.forecast(steps=n)  # Replace n with the number of future values to forecast
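
The recursion behind simple exponential smoothing is easy to inspect by hand. The sketch below swaps in the pandas ewm method instead of statsmodels, purely to expose the arithmetic: with adjust=False, each smoothed value is alpha times the new observation plus (1 - alpha) times the previous smoothed value. The three data points and alpha = 0.5 are made up.

```python
import pandas as pd

data = pd.Series([10.0, 20.0, 30.0])  # illustrative observations
alpha = 0.5                            # assumed smoothing factor

# s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]
smoothed = data.ewm(alpha=alpha, adjust=False).mean()
print(smoothed.tolist())  # [10.0, 15.0, 22.5]

# Simple exponential smoothing forecasts a flat line at the last smoothed value.
flat_forecast = smoothed.iloc[-1]
```

A larger alpha tracks recent values more closely, while a smaller alpha produces a smoother, slower-moving series.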

Seasonal ARIMA (SARIMA)

SARIMA is an extension of the ARIMA model that accounts for seasonal patterns in the data. It includes additional terms to capture seasonality, making it suitable for time-series data with recurring patterns.

The seasonal ARIMA model can be implemented similarly to the ARIMA model but with additional seasonal parameters. Here's a simplified example:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Assuming 'data' is the time series data
model = SARIMAX(data, order=(p, d, q), seasonal_order=(P, D, Q, s))  # Replace p, d, q, P, D, Q, and s with appropriate values
sarima_model = model.fit()
predictions = sarima_model.forecast(steps=n)  # Replace n with the number of future values to forecast


These are just a few examples of forecasting models commonly used in time-series analysis. To explore more details about each model, including their mathematical formulations, strengths, limitations, and suitable use cases, read our blog post, What Is Time-Series Forecasting?

Extracting useful features for machine learning/deep learning algorithms

Extracting meaningful features from time-series data is also crucial for training machine learning or deep learning models. These features serve as inputs to the models and help capture relevant patterns and characteristics in the data. Python provides various techniques and libraries to extract useful features for time-series analysis.

One common approach is to apply feature engineering techniques to transform raw time-series data into informative features. Let's consider an example of training a machine learning model to detect potential risk for a heart attack based on heart rate data. Time-series analysis techniques can help us extract meaningful insights from the heart rate data that we can use as inputs to the model.

For instance, we can calculate statistical measures such as the mean and standard deviation of the heart rate. These measures provide information about the central tendency and variability of the heart rate data, respectively.

Additionally, we can compute other features like autocorrelation, which measures the correlation between the heart rate values at different time lags. By feeding these features into a machine learning model, we give it compact, informative inputs for predicting the risk level for potential heart attacks.
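
The features described above can be sketched with pandas alone. The heart rate series below is synthetic (a slow oscillation around 70 beats per minute plus noise), and the one-hour window is an assumption; a real pipeline would choose windows based on the problem at hand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
index = pd.date_range("2024-01-01", periods=600, freq="min")

# Synthetic per-minute heart rate readings over ten hours.
heart_rate = pd.Series(
    70 + 5 * np.sin(np.arange(600) / 30) + rng.normal(0, 2, 600),
    index=index,
)

# One row of features per hour: central tendency, variability, lag-1 autocorrelation.
hourly = heart_rate.resample("1h")
features = pd.DataFrame({
    "mean": hourly.mean(),
    "std": hourly.std(),
    "autocorr_lag1": hourly.apply(lambda window: window.autocorr(lag=1)),
})
print(features.round(2))
```

Each row of features could then serve as one input sample for a model, alongside whatever label the task provides.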

Data cleaning

Data cleaning plays a crucial role in time-series analysis as it ensures the accuracy and reliability of the data used for further analysis and modeling. In Python, you can leverage various techniques and libraries to clean time-series data and handle common issues, such as missing values, outliers, or inconsistencies.

Data cleaning in time-series analysis typically involves the following steps:

  1. Handling missing values: Missing values can occur in time-series data for various reasons, such as sensor failures, data transmission issues, or human errors. Python provides libraries like pandas that offer methods to handle missing values, such as interpolation, forward filling, backward filling, or dropping rows with missing values.
  2. Outlier detection and treatment: Outliers are extreme values that deviate significantly from the normal patterns in the time series. Identifying and handling outliers is vital to avoid distortions in the analysis. Python libraries like pandas, NumPy, or scikit-learn provide statistical and machine-learning-based techniques for detecting and handling outliers.
  3. Dealing with inconsistent or incorrect data: Time-series data may sometimes contain inconsistent or erroneous values, such as inconsistent units, invalid data types, or data entry errors. Python offers functionalities to clean and correct such data inconsistencies, including data type conversion, data normalization, or applying business rules to identify and correct erroneous data.
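
The first two steps above can be sketched with pandas as follows. The readings are made up, and the outlier rule (more than five median absolute deviations from the median) is one assumed choice among many.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2024-01-01", periods=7, freq="D")

# Illustrative readings with one missing value and one obvious spike.
raw = pd.Series([20.0, 21.0, np.nan, 23.0, 500.0, 25.0, 26.0], index=index)

# Step 1: fill the missing reading by linear interpolation.
filled = raw.interpolate(method="linear")

# Step 2: flag outliers by their distance from the median, measured in MADs.
median = filled.median()
mad = (filled - median).abs().median()
is_outlier = (filled - median).abs() > 5 * mad  # the factor 5 is an assumed threshold

# Blank out the flagged points and interpolate across them.
cleaned = filled.mask(is_outlier).interpolate(method="linear")
print(cleaned.tolist())  # [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0]
```

Median-based statistics are used here because a single extreme value would distort a mean and standard deviation computed on so few points.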


By performing these data cleaning steps, you can ensure the quality and reliability of the time-series data for accurate and meaningful analysis. Another way to do this is by using Timescale. 😎 In one of our previous blog posts, we made a side-by-side comparison between Timescale (built on PostgreSQL) and Python for data cleaning.

Challenges in Working With Time-Series Data in Python

Still, working with time-series data in Python can pose some challenges, especially when dealing with large datasets. In this section, we will explore two common challenges and discuss strategies for overcoming them.

Loading data quickly and efficiently

Efficient loading of large time-series datasets is crucial for smooth data analysis. Python provides libraries like pandas and NumPy that offer efficient data structures and tools for handling time-series data.

Consider using pandas' read_csv function with optimized parameters to load data quickly. For example, specifying the appropriate data types for each column can speed up the loading process and reduce memory use. Compressing CSV files with gzip reduces file size, and a columnar format like Parquet can further improve loading performance.
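
As a small illustration of specifying data types up front, the sketch below reads a CSV from an in-memory string. The column names and values are hypothetical stand-ins for a large file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a path to a file.
csv_text = (
    "time,sensor_id,value\n"
    "2024-01-01 00:00:00,a,1.5\n"
    "2024-01-01 00:01:00,b,2.5\n"
)

data = pd.read_csv(
    io.StringIO(csv_text),
    # Compact types: repeated labels as categories, readings as 32-bit floats.
    dtype={"sensor_id": "category", "value": "float32"},
    # Read only the columns the analysis needs.
    usecols=["time", "sensor_id", "value"],
)
print(data.dtypes)
```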

Another approach to enhance data loading speed is to leverage parallel processing, where the work is split into parts that run at the same time and the results are combined in the final step. On a single machine, this means spreading the work across multiple processes or cores. Libraries like Dask and Apache Spark go further and provide distributed computing capabilities, allowing you to distribute the load across multiple machines, thus accelerating the data loading and analysis process.

Handling large datasets

When dealing with gigabytes (or terabytes) of time-series data, parsing and processing all of it can become challenging. In such cases, a strategy to handle large datasets efficiently is essential.

Here, you can also use parallel processing. By splitting the analysis across multiple processes or machines and combining the results afterward, you can distribute the workload and process large datasets faster.

Another option is to leverage distributed computing frameworks like Apache Spark. Spark allows you to spread the data processing across a cluster of machines, enabling efficient handling of large-scale time-series datasets. Spark's parallel processing capabilities and built-in data processing functions make it a powerful tool for managing and analyzing big time-series data.
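
Before reaching for a cluster, it is sometimes enough to process a file in pieces on one machine. This is a different technique from the Spark approach above: the sketch uses the chunksize parameter of pandas' read_csv to stream a CSV and combine partial results, which is the same split-then-combine idea on a small scale. The data is generated in memory purely for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV standing in for a file too large to load at once.
csv_text = "value\n" + "\n".join(str(i) for i in range(10_000))

total, count = 0.0, 0

# With chunksize set, read_csv yields DataFrames of up to 1,000 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1_000):
    total += chunk["value"].sum()
    count += len(chunk)

# Combine the partial sums into the overall mean.
mean_value = total / count
print(mean_value)  # 4999.5
```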

Working With Time Series in Python

Working with time-series data in Python involves several key steps, from choosing the right time-series library to loading and analyzing the data. Let’s explore the essential aspects of working with time series in Python, such as selecting a time-series library, utilizing the core library pandas for data loading, analysis, and visualization, and exploring some more specialized libraries for advanced time-series tasks.

Choosing a time-series library

Python provides various libraries tailored for time-series analysis. The core library for time-series analysis in Python is pandas. Pandas provides efficient data structures and functions to handle time series effectively. It allows you to load data from diverse sources, such as CSV files and databases like Timescale.

With pandas, you can perform basic analysis and visualization of time-series data. The central data structure in pandas is the DataFrame, which serves as the primary unit for representing time-series data.

Using pandas, you can load time-series data from various sources with ease. Functions like read_csv() and read_sql() enable you to load data into a DataFrame for further analysis. This flexibility allows you to work with data from different formats and platforms.

Pandas provides a rich set of functionalities for analyzing and visualizing time-series data. You can perform various operations, including data aggregation, filtering, and computing summary statistics. Additionally, pandas integrates well with visualization libraries like Matplotlib and Seaborn, allowing you to create insightful plots and charts to explore patterns and trends in the data.

Here's an example that demonstrates the steps of loading and working with time-series data using pandas in Python:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Load time-series Data
dates = pd.date_range(start='2023-01-01', periods=100)
values = np.sin(np.linspace(0, 2*np.pi, 100))
data = pd.DataFrame({'Date': dates, 'Value': values})

# Step 2: Perform Data Analysis
# Calculate summary statistics
summary_stats = data.describe()

# Filter data based on specific conditions
filtered_data = data[data['Value'] > 0]

# Resample data to a different frequency
resampled_data = data.resample('1W', on='Date').sum()

# Step 3: Visualize time-series Data
plt.plot(data['Date'], data['Value'])
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')
plt.show()

This code generates a time-series dataset with dates and sine wave values. It performs data analysis tasks such as calculating summary statistics, filtering data based on conditions, and resampling the data to a different frequency. Finally, it visualizes the time-series data by plotting the values against the dates.

You can refer to the official pandas documentation to delve deeper into pandas and its functionalities.

In addition to pandas, there are specialized libraries that can enhance your time-series analysis capabilities:

  1. sktime: sktime is a library that provides a unified interface for training multiple time-series models and connects to related libraries, enabling advanced modeling and analysis of time-series data.
  2. pmdarima: pmdarima is a library used to fit ARIMA (AutoRegressive Integrated Moving Average) models, a popular time-series forecasting and analysis technique, including automatic selection of the model order.
  3. tsfresh: tsfresh is a library specifically designed for feature extraction from time-series data. It provides various algorithms and techniques for extracting meaningful features that can be used in machine learning and predictive modeling.

By leveraging these libraries, you can efficiently work with time-series data, perform advanced analysis, and extract valuable insights.

Obtain and store time-series data

Before diving into time-series analysis, defining the sources from which you'll gather your data is essential. Consider the following factors:

  • Determine the purpose of your analysis and identify the specific data requirements.
  • Explore available data sources such as public repositories, databases, APIs, or data collected from your applications.
  • Ensure data integrity and accuracy by choosing reliable and reputable sources.
  • Consider the frequency at which you'll update your data to ensure it remains up-to-date.

Depending on your specific requirements, there are several possibilities for obtaining and storing time-series data. Here are a few common approaches:

Loading existing datasets: If you have time-series data stored in CSV or other flat file formats, you can use a library like pandas to load and manipulate the data, or load the files into Timescale and query them from there. pandas provides flexible functions to read data from files and perform various operations on it. You can find more information about loading data into Timescale in the Timescale documentation.

Here's an example of how to load time-series data from a CSV file using pandas:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('path/to/your/file.csv')

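In the example above, you need to provide the file path of your CSV file as an argument to the read_csv() function. Pandas will parse the file and create a DataFrame containing the time-series data. Note that date columns are read as plain strings unless you ask pandas to parse them, for example with the parse_dates parameter.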
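
For time-series work, it usually helps to parse the timestamp column and use it as the index while loading. The sketch below assumes a hypothetical file with date and temperature columns, read from an in-memory string so it runs as is.

```python
import io
import pandas as pd

# Hypothetical file contents; replace io.StringIO(...) with a real path.
csv_text = "date,temperature\n2024-01-01,3.2\n2024-01-02,4.1\n2024-01-03,2.8\n"

# parse_dates converts the column to timestamps; index_col makes it the index.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# A datetime index enables date-based selection.
print(data.loc["2024-01-02", "temperature"])
```

With a datetime index in place, operations such as resample and date-range slicing work without further conversion.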

Obtaining data from public or private APIs: Many organizations provide APIs that allow access to their time-series data. For example, weather data APIs offer historical and real-time weather information. You can use libraries like requests or specialized Python packages to interact with these APIs and retrieve the desired time-series data.

Writing dynamically from your own apps: If your own applications generate time-series data, you can write code to capture and store it dynamically. For example, you can track user login activity or user purchase activity on a website and store it in a database or file.

Load and analyze time-series data in Python

To load and analyze time-series data in Python, you can utilize various libraries and formats based on your specific requirements. One popular choice is using pandas, a powerful data manipulation library that provides a convenient way to load, transform, and analyze time-series data.

I’ve already created the table Weather in my database with two columns: date and temperature.

You can follow the steps below to load Timescale data into pandas and perform time-series feature extraction using tsfresh.

  1. Install the required libraries, such as psycopg2, pandas, tsfresh.
  2. Import the necessary modules as shown below:

import psycopg2
from psycopg2 import sql
import pandas as pd
from tsfresh import extract_features

3. Establish a connection to your Timescale instance:

con = psycopg2.connect(
    host='your_host',
    port='your_port',
    database='your_database',
    user='your_username',
    password='your_password'
)

4. Create a cursor object associated with your established connection to Timescale. It allows you to interact with the database and execute SQL queries.

cursor = con.cursor()

5. Create an SQL query object using the sql.SQL class from the psycopg2 module. Passing the limit as a query parameter through the %s placeholder, rather than formatting it into the string, lets you construct SQL queries safely.

LIMIT = 10
query = sql.SQL("SELECT * FROM Weather LIMIT %s")
cursor.execute(query, (LIMIT,))

The SELECT * FROM Weather query is a simple SQL query that retrieves all rows and columns from the "Weather" table. However, instead of fetching all records from the table, the result set is limited to a maximum of 10 records using the LIMIT clause.

After creating the query object, the line cursor.execute(query) executes the SQL query using the object cursor that you previously created. It sends the query to the Timescale database for execution.

6. Fetch all the rows that are available from the query execution.

# Fetch the results
results = cursor.fetchall()

7. You can iterate over each row in the results list and print the row.

# Do something with the results
for row in results:
    print(row)

8. Extract data from the results list, which contains rows fetched from the database, and store it in two separate lists: dates and values.

dates = []
values = []

for date, value in results:
    dates.append(date)
    values.append(value)

The for loop iterates over each row in the results list, where each row is represented as a tuple containing two elements: date and value. By using the syntax for date, value in results, you can directly unpack the tuple into two separate variables: date and value.

Inside the loop, the date and value variables are appended to their respective lists, dates and values, using the append() method. These lists allow you to store the values separately for further processing or analysis.
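
As an alternative to the loop, the fetched rows can be handed to pandas directly, since a list of tuples maps naturally onto DataFrame rows. The rows below are hypothetical stand-ins for what cursor.fetchall() might return for a table with a date and a temperature column.

```python
import datetime as dt
import pandas as pd

# Hypothetical rows in the shape cursor.fetchall() returns: a list of tuples.
results = [
    (dt.date(2024, 1, 1), 3.2),
    (dt.date(2024, 1, 2), 4.1),
    (dt.date(2024, 1, 3), 2.8),
]

# Name the columns in the same order as the SELECT output.
data_df = pd.DataFrame(results, columns=["date", "temperature"])

# Convert the date objects to pandas timestamps for time-series operations.
data_df["date"] = pd.to_datetime(data_df["date"])
print(data_df)
```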

9. Create a pandas DataFrame from the extracted data and perform feature extraction using the extract_features function from the tsfresh library.


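# Create a pandas DataFrame from the extracted data
data_df = pd.DataFrame({'date': dates, 'temperature': values})
# tsfresh groups rows by an id column; all rows share one id because this is a single series
data_df['id'] = ""
# Perform feature extraction, ordering the observations by date
extracted_features = extract_features(data_df, column_id='id', column_sort='date')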

10. Iterate over the columns of the extracted_features DataFrame and print information about each column and its corresponding data. For instance, you can examine the names and values of five of the columns.

columns = extracted_features.columns

for column in columns[5:10]:
    print(f"Column: {column}")
    print(f"Data: {extracted_features[column].iloc[0]}")  # .iloc[0] selects the first row by position
    print()

This example demonstrates how to load time-series data from Timescale into pandas and extract features using the tsfresh library. Timescale provides efficient storage and retrieval of time-series data, optimized for time-based queries. It offers advantages such as:

  1. Hypertables: Timescale introduces the concept of hypertables, partitioned tables that automatically divide data into smaller chunks based on time intervals. This partitioning allows for parallelism and optimized query execution. In the above example, if the "Weather" table is created as a hypertable, loading and retrieving its time-series data can benefit from this partitioning.
  2. Time-Series Functions: Timescale provides a rich set of time-series functions and extensions for advanced analytics on time-based data. In the above code, Timescale stores and serves the rows, while the extract_features function from the tsfresh library performs the feature extraction in Python on the fetched data.
  3. Scalability: We designed Timescale to scale effortlessly as the time-series data grows. It can handle massive amounts of data while maintaining high performance, ensuring that your time-series analysis remains efficient despite increasing data volumes.

Unleash the Power of Timescale for Time-Series Data

Timescale offers powerful time-series-specific SQL functions and extensions that enable efficient data operations, such as querying, filtering, and aggregating time-series data. Timescale is optimized for handling large-scale time-series data, providing efficient storage, compression, and indexing techniques.

It also offers built-in functionalities that can replace or complement Python libraries for data cleaning and preprocessing. So, are you ready to supercharge your time-series data management?

With Timescale, you can execute Python code directly in the database, leveraging popular data packages like tsfresh for feature extraction and analysis. Create a free Timescale account today.

Further reading

If you're using Python and PostgreSQL for time-series analysis, what's the best adapter? Psycopg2 or psycopg3? We tested both to assess their performance. Also, check out the best tools to work with time series and Python.

Originally posted

Mar 08, 2024

Last updated

Jun 20, 2024
