Building a Multivariate Time Series Model using IBM SPSS Modeler in IBM Cloud Pak for Data
Trisentosa Wisesa, Data Engineer (Internship — University of Iowa), with Andrew Widjaja, Technical Leader IBM Software — Sinergi Wahana Gemilang
Building a multivariate time series can be challenging, even for a seasoned data analyst. In this article, we will discuss the process of creating, training, and deploying a multivariate time series model using IBM SPSS Modeler as part of IBM Cloud Pak for Data.
Dataset
We use a modified stock_exchange_customized.csv dataset, originally a Kaggle dataset created by Cody.
The dataset consists of 14 fields: field 1 is the date, and the other 13 fields are each company’s closing stock price in USD for that day. This project tries to predict the closing USD price for each company for a given number of future days.
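Outside SPSS Modeler, the same structure can be inspected with a few lines of Python. Below is a minimal sketch using the standard csv module; the sample rows are made up to mirror the file's shape, not taken from the real dataset:

```python
import csv
import io

# Hypothetical two-row sample mirroring the structure of
# stock_exchange_customized.csv: a Date field plus 13 closing prices.
# The numeric values below are illustrative only.
sample = """Date,HSI,NYA,IXIC,000001.SS,N225,N100,399001.SZ,GSPTSE,NSEI,GDAXI,SSMI,TWII,J203.JO
2021-05-27,29113.2,16521.8,13736.3,3608.9,28549.0,1249.5,14857.0,19745.1,15337.9,15406.7,11277.1,16419.9,67233.1
2021-05-28,29124.4,16555.7,13748.7,3600.8,29149.4,1256.3,14979.2,19852.2,15435.7,15519.9,11290.2,16870.9,67855.4"""

rows = list(csv.DictReader(io.StringIO(sample)))
n_fields = len(rows[0])            # 14: the date plus 13 company prices
dates = [r["Date"] for r in rows]  # one observation per trading day
```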
Steps
Project Setup
First, create a new project in Watson Studio, import the dataset into the asset, then create a new modeler flow instance. The following illustrates the preview of the flow we are going to create.
Understanding the Dataset
We can visualize our dataset directly from the modeler flow. First, take the Data Asset node from the Import tab and assign the CSV file to that node.
Expand the Outputs tab and take the Data Audit and Table nodes. Connect these nodes to the Data Asset node, then right-click each node and press Run. When finished, you’ll get notifications to preview the results.
With these nodes, we can see the trend and basic statistical values for each field. In the second part, we observe that there are no empty, blank, or null values for any field in this dataset, which simplifies data preparation. If your dataset does have empty values, you can handle them later either with a Filler node or directly from the Time Series node.
Preparing the Data
This step is optional. Select the Filter node from Field Operations. For this demonstration, we filter out some fields to minimize CPU usage. Open the node and choose Add Columns.
From Field Operations, take the Type node and connect it to the Partition node. Open the node and choose Read Values. Next, change the role of each company’s field from Input to Target, since their values are what we want to predict, and leave everything else the same.
Select the Partition node from Field Operations to split the dataset into training and test data, and connect it to the asset node. Open the node and set the training and testing split. By default, it splits the data 50–50; in this example, we use an 80–20 training/test split. Save the configuration.
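In SPSS Modeler the Partition node handles this for you; for intuition, the idea behind an 80–20 split for time series data can be sketched in plain Python. Note that for time series the split must be chronological, so the test set lies in the "future" relative to the training set:

```python
def time_series_split(rows, train_frac=0.8):
    """Split ordered records chronologically: for time series, the test
    set must come after the training set rather than being sampled at
    random, or the model would be evaluated on data it 'saw' in time."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# 100 ordered observations -> first 80 for training, last 20 for testing
train, test = time_series_split(list(range(100)))
```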
Training the Model
Select the Time Series node from the Modeling tab and connect it to the Filter node to train a time series model. The first thing to do is declare the interval between each data point. Open the Time Series node, and in the Observations and Time Interval section, change the field to Date and the interval to Days.
Next, we specify how far into the future we want our model to predict. Expand the model options and tick the Forecast checkbox. For this demonstration, we chose 7, so the model predicts up to 7 days beyond the last observed data point, since the interval is days. Remember that forecasting more days requires significantly more computation, and a high number can sometimes cause an error in the deployment stage later.
Save the configurations and run the node. After finishing, a new yellow node for the model should appear.
Optionally, to increase performance and avoid exceeding the threshold of your machine learning instance, change the maximum number of models displayed in the output to 5 in the same model options tab (by default, it is 10).
Optionally, in Build Options, you can also specify which method you want your model to use. A series is stationary when its statistical properties, such as mean and variance, do not change over time, i.e., it has no trend or seasonality. ARIMA models suit data that is stationary or can be made stationary through differencing, while exponential smoothing is well suited to data with trend or seasonal patterns. Since we are not sure how the closing stock prices behave, we choose the Expert Modeler, which automatically selects the best-fitting method for each series.
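For intuition about what exponential smoothing does, here is the core recurrence in plain Python. This is only the simplest variant; the Expert Modeler considers much richer models (trend and seasonal components, ARIMA orders) automatically:

```python
def simple_exp_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the newest observation and the previous smoothed value,
    so older observations decay geometrically in influence."""
    smoothed = [float(series[0])]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

s = simple_exp_smoothing([10, 12, 11, 13, 12])
# -> [10.0, 11.0, 11.0, 12.0, 12.0]
```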
Evaluating the Model
Right-click the newly created model node and choose Preview to see which model was used for each target field, along with its R squared, RMSE, and MSE. With time series, you’ll also usually get a stationary R squared value, which ranges from negative infinity to 1.
Stationary R squared is preferable to R squared when a model is considered to have a trend or seasonal pattern. A negative value means the model performs worse than the baseline model, and a positive value means it performs better. In our case, a trend in the closing stock prices exists, but it is not strong.
Finally, since each field is modeled independently, the RMSE and MSE values differ from field to field. Across the board, the predictions deviate from the actual values by roughly 1–2%.
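The metrics SPSS Modeler reports are standard and easy to verify by hand. A self-contained sketch of MSE, RMSE, and R squared in plain Python, with illustrative values that are not taken from the model output:

```python
import math

def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data itself."""
    return math.sqrt(mse(actual, predicted))

def r_squared(actual, predicted):
    """R squared: 1 minus residual variance over total variance, so a
    negative value means the model is worse than predicting the mean."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Illustrative close-price values only
actual = [100.0, 102.0, 101.0, 103.0]
predicted = [101.0, 101.0, 102.0, 104.0]
```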
Select the Table node, Analysis node, and Time Plot node from the Output tab and connect them to the new model node.
From the Table node, you’ll see that for each target field, the tool creates three new fields with these prefixes:
- $TS: generated model data
- $TSLCI: lower confidence interval of generated model data
- $TSUCI: upper confidence interval of generated model data
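A quick sanity check on those generated fields can be sketched in Python. The exact field names below (e.g. "$TS-HSI") are an assumption based on the prefixes above, and the values are illustrative:

```python
# Hypothetical scored rows for one target field (HSI): the generated
# value $TS-HSI with its lower ($TSLCI) and upper ($TSUCI) bounds.
scored = [
    {"$TS-HSI": 29100.0, "$TSLCI-HSI": 28800.0, "$TSUCI-HSI": 29400.0},
    {"$TS-HSI": 29150.0, "$TSLCI-HSI": 28840.0, "$TSUCI-HSI": 29460.0},
]

# Each generated value should lie inside its own confidence interval.
within_bounds = all(
    r["$TSLCI-HSI"] <= r["$TS-HSI"] <= r["$TSUCI-HSI"] for r in scored
)
```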
For the Time Plot node, specify the fields whose predicted and actual values you want to compare. For this example, we compare the actual HSI with our model’s prediction of HSI.
Finally, run the Analysis node. Running this node compares your model’s predictions against the actual results from the test data and gives a detailed summary. The analysis is only available for supervised models, that is, models with target fields, and the node returns an evaluation for each field.
Test and Deploy the Model
Save the model by right-clicking the output Table node and choosing Save branch as model.
After saving, the model appears in your project’s assets. Open the model and choose Promote to Deployment.
Add the model to an existing space or create a new one by choosing New Space.
After it has been promoted, go to your deployment space. Deploy the model by hovering over your promoted model and clicking the rocket icon.
Choose the online option for the deployment and name your deployment.
Wait until the deployment has finished, then open it from the Deployments tab. Go to the Test tab to test our model’s predictions.
Now, fill in the date in YYYY-mm-dd format. Put 2021-05-28 in the Date field, as it is our last observed data point (other valid date values will most likely still pick up from the last observed data and output the same results). For the other fields, fill in the last observed value for each company. The companies you filtered out can be set to any value.
The result returns an array of predictions whose length matches the forecast horizon you specified, in this case 7. To use the model in a notebook, use the endpoint provided in your deployment.
Create a new notebook in your project, and paste the scoring endpoint from your deployed model’s API reference. Before doing this, make sure you have your IBM Cloud API key ready.
Example of the payload scoring input:

{"input_data": [{
  "fields": ["Date", "HSI", "NYA", "IXIC", "000001.SS", "N225", "N100", "399001.SZ", "GSPTSE", "NSEI", "GDAXI", "SSMI", "TWII", "J203.JO"],
  "values": [["2021-05-28", "3786.1733208", "16555.66016", "13748.74023", "0", "0", "0", "0", "0", "0", "0", "0", "0", "4728.8401566"]]
}]}
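In the notebook, the scoring call can be sketched with only the Python standard library. The IAM token exchange follows IBM Cloud's documented apikey grant type; the scoring endpoint URL is a placeholder you copy from your own deployment:

```python
import json
import urllib.parse
import urllib.request

IAM_URL = "https://iam.cloud.ibm.com/identity/token"

def get_iam_token(api_key):
    """Exchange an IBM Cloud API key for a short-lived IAM bearer token."""
    data = urllib.parse.urlencode({
        "apikey": api_key,
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
    }).encode()
    req = urllib.request.Request(IAM_URL, data=data, headers={
        "Content-Type": "application/x-www-form-urlencoded"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]

def score(scoring_url, token, payload):
    """POST the scoring payload to the deployment endpoint as JSON."""
    req = urllib.request.Request(
        scoring_url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The same 14-field payload shown above, as a Python dict.
payload_scoring = {
    "input_data": [{
        "fields": ["Date", "HSI", "NYA", "IXIC", "000001.SS", "N225",
                   "N100", "399001.SZ", "GSPTSE", "NSEI", "GDAXI",
                   "SSMI", "TWII", "J203.JO"],
        "values": [["2021-05-28", "3786.1733208", "16555.66016",
                    "13748.74023", "0", "0", "0", "0", "0", "0",
                    "0", "0", "0", "4728.8401566"]],
    }]
}

# Usage (requires a live deployment; both arguments are placeholders):
# token = get_iam_token("<your IBM Cloud API key>")
# print(score("<your scoring endpoint URL>", token, payload_scoring))
```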
To briefly summarize, we have created a multivariate time series model using SPSS Modeler. We explored the general flow of SPSS Modeler, trained the model, deployed it, and called the model’s API from a notebook.
In this project, we only explored the flow for a time series model, but many other model types are available, such as the Auto Classifier, Regression, Random Forest, and C&R Tree nodes.
Why IBM Cloud Pak for Data?
IBM Cloud Pak for Data is a cloud-native Data and AI solution that enables you to work quickly and efficiently. Your data is useless if you can’t trust it or access it.
Cloud Pak for Data also enables all of your data users to collaborate from a single, unified interface that supports many services that are designed to work together.
Cloud Pak for Data is your go-to platform for data governance, analytics, and collaboration. The platform helps you spend less time looking for your data and more time working with it. With Cloud Pak for Data, we can do:
- Modernize the data ecosystem
- Derive insights from the data
- Infuse the business with AI
- And so much more.
Cloud Pak for Data runs on top of Red Hat OpenShift, which means you can run it on-premises, on a private cloud cluster, or on any public cloud infrastructure that supports Red Hat OpenShift. A production-level cluster has three master + infrastructure nodes and three or more worker nodes. Using dedicated worker nodes means that resources on those nodes are used only for application workloads, improving the cluster’s performance.
Modular platform: The platform consists of a lightweight installation called the Cloud Pak for Data control plane. The control plane provides a command-line interface, an administration interface, a services catalog, and the central user experience.
Common core services: The common core services provide data source connections, deployment management, job management, notifications, projects, and search.
Integrated Data and AI services: The services catalog includes broad offerings from IBM and third-party vendors. The catalog contains AI, Analytics, Dashboards, Data governance, Data sources, Developer tools, Industry Solutions, and Storage.
Why IBM SPSS Modeler from Cloud Pak for Data?
SPSS Modeler is a tool that accelerates building predictive models, including the surrounding analytical and statistical tasks. With SPSS Modeler, users can skip learning machine learning frameworks programmatically while still keeping most of the hands-on data science experience, such as visualizing the data, refining the data, and training and analyzing the model.
Are you interested in using IBM Cloud Pak for Data? You can start exploring IBM Cloud Pak for Data here.
References
IBM, 2021, “Nodes Palette (SPSS Modeler)”
IBM, 2021, “Forecasting with the Time Series Node”
IBM, 2021, “IBM Cloud Pak for Data Overview”
IBM, 2021, “IBM Cloud Pak for Data Getting Started”
IBM, 2021, “IBM Cloud Pak for Data Architecture”