Building a Multivariate Time Series Model using IBM SPSS Modeler in IBM Cloud Pak for Data

Andrew Widjaja
Aug 3, 2021

Trisentosa Wisesa, Data Engineer (Internship — University of Iowa), with Andrew Widjaja, Technical Leader IBM Software — Sinergi Wahana Gemilang

Building a multivariate time series model can be challenging, even for a seasoned data analyst. In this article, we will discuss the process of creating, training, and deploying a multivariate time series model using IBM SPSS Modeler as part of IBM Cloud Pak for Data.

Dataset

We use a modified stock_exchange_customized.csv dataset, originally a Kaggle dataset created by Cody.

Illustration-1: stock_exchange_customized.csv visualized in Watson Studio

The dataset consists of 14 fields. Field 1 is the date, and the other 13 fields hold each company's closing stock price in USD for that day. This project will try to predict the closing USD price for each company for a chosen number of future days.
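For readers who want to explore the same file outside SPSS Modeler, a minimal pandas sketch like the following inspects the structure described above. The file name follows the article; the exact column names are assumptions based on the index symbols that appear later in the scoring payload.

import pandas as pd

df = pd.read_csv("stock_exchange_customized.csv", parse_dates=["Date"])

print(df.shape)    # expect (rows, 14): one date field plus 13 closing-price fields
print(df.dtypes)   # Date should be datetime64, the remaining fields numeric
print(df.head())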

Steps

Project Setup

First, create a new project in Watson Studio, import the dataset as a data asset, then create a new modeler flow. The following illustrates a preview of the flow we are going to create.

Illustration-2: A complete flow preview

Understanding the Dataset

We can visualize our dataset directly from the modeler flow. First, take a Data Asset node from the Import tab and assign the CSV file as that node's data asset.

Illustration-3: Import data asset node configuration

Expand the output tab and take the Data Audit and Table nodes. Connect these nodes to the Data Asset node, then right-click each node and press Run. When finished, you'll get notifications to preview the results.

Illustration-4: Data audit result, 1 of 2.
Illustration-5: Data audit result, 2 of 2.

With this node, we can see the trend and basic statistics for each field. In the second part, we observe that no field in this dataset has empty, blank, or null values, which simplifies data preparation. If your dataset happens to have empty values, you can handle them later, either with a Filler node or directly from the Time Series node.
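As a rough code analogue of that audit, the same missing-value check can be done in pandas (continuing from the earlier sketch); the fill step is shown only to illustrate what a Filler node would do if gaps existed.

print(df.isna().sum())   # count of missing values per field; all zeros for this dataset

df = df.sort_values("Date")
df = df.ffill()          # forward-fill in time order, only relevant when gaps exist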

Preparing the Data

This step is optional. Select the Filter node from Field Operations. For this demonstration, we filter out some fields to minimize CPU usage. Open the node and choose Add Columns.

Illustration-6: Filter node configuration

From Field Operations, take the Type node and connect it to the Partition node. Open the node and choose Read Values. Next, change the role of each company's field from Input to Target, since their values are what we want to predict, and leave everything else the same.

Illustration-7: Type node configuration — Defining target and input

Select the Partition node from Field Operations to split the dataset into training and test data, and connect it to the asset node. Open the node and set the training and testing split. By default, it splits the data 50-50; in this example, we use an 80-20 training-test split. Save the configuration.
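Outside Modeler, the equivalent of this step for time series data is a chronological split rather than a random one, so the test period always follows the training period. A sketch, continuing from the earlier snippet:

split_idx = int(len(df) * 0.8)                        # 80-20 training-test split
train, test = df.iloc[:split_idx], df.iloc[split_idx:]
print(len(train), len(test))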

Illustration-8: Partition node configures train — test split

Training the Model

Select the Time Series node from the Modeling tab and connect it to the Filter node to train a time series model. The first thing we have to do is declare the interval between data points. Open the Time Series node, and in the Observations and Time Interval section, set the date field to Date and the interval to Days.

Illustration-9: Time series node date and interval configuration

Next, we also want to specify how far into the future we want our model to predict. Expand the model options and tick the forecast checkbox. For this demonstration, we chose 7, meaning we predict up to 7 days beyond the last observed data point, since the interval is days. Remember that forecasting more days requires significantly more computation; a high number can sometimes cause an error in the deployment stage later.

Illustration-10: Time series node forecast

Save the configuration and run the node. When it finishes, a new yellow model node should appear.

Optionally, to increase performance and avoid exceeding the threshold of your machine learning instance, change the maximum number of models displayed in the output to 5 in the same model options tab (by default, it is 10).

Illustration-11: Change the number of models to be displayed to increase performance

Optionally, in the build options, you can also specify which method you want for your model. A stationary series is one whose statistical properties do not change over time, that is, it has no trend or seasonal pattern; ARIMA is generally preferred for such data, while exponential smoothing tends to work well when the data does follow a trend or seasonal pattern. Since we are not sure whether the closing stock prices are stationary, we can choose the Expert Modeler to pick the best-fitting method for our model automatically.
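As a rough open-source analogue of what the Expert Modeler does for a single field, the sketch below fits both an ARIMA and an exponential smoothing model on the HSI series, keeps whichever fits better, and forecasts 7 days ahead. The ARIMA order and trend setting are illustrative assumptions, not what SPSS Modeler actually selects.

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# daily series for one target field, forward-filled over non-trading days
series = train.set_index("Date")["HSI"].asfreq("D").ffill()

arima_fit = ARIMA(series, order=(1, 1, 1)).fit()
es_fit = ExponentialSmoothing(series, trend="add").fit()

# keep the model with the lower AIC and forecast 7 days ahead
best = arima_fit if arima_fit.aic < es_fit.aic else es_fit
print(best.forecast(7))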

Illustration-12: Choose the build options that fit your dataset

Evaluating the Model

Right-click on the newly created node and choose Preview to see which model was used for each target field, as well as the R squared, RMSE, and MSE. With time series, you'll also usually get a stationary R squared value, which ranges from negative infinity to 1.

Stationary R squared is preferable to R squared when a model is considered to have a trend or seasonal pattern. Negative values mean the model performs worse than the baseline model, and positive values mean it performs better. In our case, the closing stock prices show a trend, but not a strong one.

Finally, since each field is independent, the RMSE and MSE values differ from field to field. Across the board, the predictions deviate by roughly 1-2% from the actual values.

Illustration-13: Model summary preview
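If you want to reproduce these metrics yourself, a sketch like the following computes MSE, RMSE, and R squared for one field, reusing test and the fitted model from the earlier snippets and comparing the 7-day forecast against the first 7 test values.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

actual = test["HSI"].to_numpy()[:7]
predicted = best.forecast(7).to_numpy()

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2_score(actual, predicted):.3f}")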

From the output tab, connect a Table node, an Analysis node, and a Time Plot node to the new model node.

From the Table node, you'll see that for each target field, the tool creates three new fields with the following prefixes:

  1. $TS: generated model data
  2. $TSLCI: lower confidence interval of generated model data
  3. $TSUCI: upper confidence interval of generated model data
Illustration-14: Output table for model prediction result

For the Time Plot, specify the fields whose predictions and actual values you want to compare. For this example, we compare the actual HSI values with our model's HSI predictions.

Illustration-15: Time plot for HSI configuration
Illustration-16: Time plot for HSI actual vs HSI prediction
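If you export the Table node's output to a CSV file, a similar plot with confidence bands can be drawn outside Modeler. This is only a sketch: the export file name is hypothetical, and the column names assume the prefixes above applied to the HSI field.

import pandas as pd
import matplotlib.pyplot as plt

out = pd.read_csv("model_output.csv", parse_dates=["Date"])  # hypothetical export

plt.plot(out["Date"], out["HSI"], label="HSI actual")
plt.plot(out["Date"], out["$TS-HSI"], label="HSI predicted")
plt.fill_between(out["Date"], out["$TSLCI-HSI"], out["$TSUCI-HSI"],
                 alpha=0.2, label="confidence interval")
plt.legend()
plt.show()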

Finally, run the Analysis node. Running this node compares your model's predictions against the actual results from the test data and gives a detailed summary. The analysis is only available for supervised models, i.e. models with target fields, and the node returns an evaluation for each target field.

Illustration-17: Analysis node output result

Test and Deploy the Model

Save the model by right-clicking the output Table node and choosing Save branch as a model.

Illustration-18: Saving the output table as a model

After saving, the model appears in your project's assets. Open the model and choose Promote to Deployment.

Illustration-19: Promote the model to space

Add the model to an existing space or create a new one by choosing New Space.

After the model has been promoted, go to your deployment space. Deploy the model by hovering over your promoted model and clicking the rocket icon.

Illustration-20: Choose the deploy option from your deployment space

Choose the online option for the deployment and give your deployment a name.

Illustration-21: Model deployment options

Wait until the deployment has finished, then open it from the Deployments tab. Go to the Test tab to test the model's predictions.

Now, fill in the date in YYYY-mm-dd format; put 2021-05-28 in the date field, as it is our last observed data point (other valid date values will most likely pick up from the last observed data and output the same results). For the other fields, fill in the last observed value for each company. The companies you filtered out earlier can be left at any value.

Illustration-22: Deployed model — prediction test result

The result returns an array of predictions for each target field. The length of the array equals the forecast length you specified, in this case 7. To use the model in a notebook, use the endpoint provided in your deployment.

Illustration-23: The deployment API reference scoring endpoint

Create a new notebook in your project's assets, and paste the scoring endpoint from your deployed model's API reference. Before doing this, make sure you have your IBM Cloud API key ready.

Example of the payload scoring inputs:

{"input_data": [{"fields": ["Date", "HSI", "NYA", "IXIC", "000001.SS", "N225", "N100", "399001.SZ", "GSPTSE", "NSEI", "GDAXI", "SSMI", "TWII", "J203.JO"], "values": [["2021-05-28", "3786.1733208", "16555.66016", "13748.74023", "0", "0", "0", "0", "0", "0", "0", "0", "0", "4728.8401566"]]}]}

Illustration-24: Testing your model in the notebook (running Python 3.7), 1 of 2.
Illustration-25: Testing your model in the notebook (running Python 3.7), 2 of 2.
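Below is a minimal sketch of calling the deployment from the notebook, following the usual IBM Cloud pattern of exchanging the API key for an IAM bearer token and POSTing the payload to the scoring endpoint. The API key and scoring URL are placeholders; copy the real endpoint from your deployment's API reference (some endpoints also expect a version query parameter, which the API reference shows).

import requests

API_KEY = "<your IBM Cloud API key>"
SCORING_URL = "<scoring endpoint copied from the deployment API reference>"

# exchange the API key for an IAM access token
token_resp = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    data={"apikey": API_KEY,
          "grant_type": "urn:ibm:params:oauth:grant-type:apikey"},
)
mltoken = token_resp.json()["access_token"]

payload_scoring = {"input_data": [{
    "fields": ["Date", "HSI", "NYA", "IXIC", "000001.SS", "N225", "N100",
               "399001.SZ", "GSPTSE", "NSEI", "GDAXI", "SSMI", "TWII", "J203.JO"],
    "values": [["2021-05-28", "3786.1733208", "16555.66016", "13748.74023",
                "0", "0", "0", "0", "0", "0", "0", "0", "0", "4728.8401566"]],
}]}

response = requests.post(
    SCORING_URL,
    json=payload_scoring,
    headers={"Authorization": "Bearer " + mltoken},
)
print(response.json())   # each target field returns an array of 7 predictions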

To briefly summarize, we have created a multivariate time series model using SPSS Modeler. We have explored the general flow of SPSS Modeler, trained the model, deployed it, and called the model's API from a notebook.

In this project, we only explored the flow for a time series model. However, many other modeling nodes are available, such as Auto Classifier, Regression, Random Forest, and C&R Tree.

Why IBM Cloud Pak for Data?

IBM Cloud Pak for Data is a cloud-native Data and AI solution that enables you to work quickly and efficiently. Your data is useless if you can’t trust it or access it.

Cloud Pak for Data also enables all of your data users to collaborate from a single, unified interface that supports many services that are designed to work together.

Cloud Pak for Data is your go-to platform for data governance, analytics, and collaboration. The platform helps you spend less time looking for your data and more time working with it. With Cloud Pak for Data, we can:

  1. Modernize the data ecosystem
  2. Derive insight from the data
  3. Infuse the business with AI
  4. And much more.

Cloud Pak for Data runs on top of Red Hat OpenShift, which means you can run it on-premises, on a private cloud cluster, or on any public cloud infrastructure that supports Red Hat OpenShift. A production-level cluster has three master + infrastructure nodes and three or more worker nodes. Using dedicated worker nodes means that resources on those nodes are used only for application workloads, improving the cluster's performance.

Illustration-26: The typical topology of a production-level cluster

Modular platform: The platform consists of a lightweight installation called the Cloud Pak for Data control plane. The control plane provides a command-line interface, an administration interface, a services catalog, and the central user experience.

Illustration-27: IBM Cloud Pak for Data control plane

Common core services: The common core services provide data source connections, deployment management, job management, notifications, projects, and search.

Illustration-28: IBM Cloud Pak for Data common core services

Integrated Data and AI services: The services catalog includes broad offerings from IBM and third-party vendors. The catalog contains AI, Analytics, Dashboards, Data governance, Data sources, Developer tools, Industry Solutions, and Storage.

Illustration-29: IBM Cloud Pak for Data integrated Data and AI services

Why IBM SPSS Modeler from Cloud Pak for Data?

SPSS Modeler is a tool that accelerates building predictive models, including the accompanying analytical and statistical tasks. With SPSS Modeler, users can skip learning machine learning frameworks programmatically while still getting most of the hands-on experience of data visualization and machine learning, such as visualizing the data, refining the data, and training and analyzing the model.

Are you interested in using IBM Cloud Pak for Data? You can start exploring IBM Cloud Pak for Data here.
