Patrick Kelly, Appledore Research Group
Virtualization of the network brings monumental challenges in managing cloud services. The challenge is embodied in the on-demand nature of cloud services and in the scaling, healing, and optimization of the underlying infrastructure. One of the areas moving to the forefront to help human operators manage this complex task is machine learning. Machine learning, or artificial intelligence itself, is not new. IBM's Deep Blue defeated world chess champion Garry Kasparov in 1997. More recently, in 2016, AlphaGo, a program designed by Google DeepMind, defeated Lee Sedol, a world champion at the ancient Chinese game of Go, which has more possible board positions than there are atoms in the universe!
Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases.
In assuring services in telecommunications networks, early techniques relied on hand-written rules or signature patterns to determine the root cause of a problem.
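A minimal sketch of that rule-based approach is shown below; the alarm names and rules are purely hypothetical and serve only to illustrate how fixed signatures map symptoms to a cause.

```python
# Minimal sketch of a traditional rule-based root-cause approach.
# Alarm names and rules are hypothetical, for illustration only.

RULES = [
    # (set of symptom alarms, inferred root cause)
    ({"LINK_DOWN", "BGP_PEER_LOSS"}, "Fibre cut on core link"),
    ({"HIGH_CPU", "PACKET_DROPS"}, "Overloaded routing engine"),
    ({"FAN_FAILURE", "TEMP_HIGH"}, "Cooling fault in chassis"),
]

def root_cause(active_alarms):
    """Return the first rule whose signature is fully present in the alarms."""
    for signature, cause in RULES:
        if signature <= set(active_alarms):
            return cause
    return "Unknown - escalate to operator"

print(root_cause(["LINK_DOWN", "BGP_PEER_LOSS", "TEMP_HIGH"]))
```

The weakness of this approach is visible in the sketch itself: every rule assumes a fixed relationship between symptoms and resources.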
Virtualization makes this old root-cause technology obsolete because resources and workloads move around dynamically; we no longer have fixed network and compute resources.
Machine learning combines big data and predictive modeling to anticipate the future state of the network and avoid service-impacting events.
Supervised learning trains a model on known input and output data so that it can predict future outputs.
Unsupervised learning finds hidden patterns or intrinsic structures in input data.
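The distinction can be made concrete with a short sketch. The data below is synthetic and the feature names (traffic load, VNF instances, host utilisation profiles) are assumptions chosen for illustration, not taken from any real dataset.

```python
# Minimal sketch contrasting supervised and unsupervised learning on
# synthetic network telemetry (illustrative data, not a real dataset).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised: learn a known input -> output mapping.
# Inputs: [traffic load, active VNF instances]; output: observed CPU utilisation.
X = rng.uniform(0, 1, size=(200, 2))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, 200)
model = LinearRegression().fit(X, y)
print("Predicted CPU utilisation:", model.predict([[0.8, 0.5]]))

# Unsupervised: find structure in unlabelled data.
# Cluster hosts by utilisation profile without telling the algorithm
# what the groups mean.
profiles = np.vstack([rng.normal(0.2, 0.05, (50, 2)),   # lightly loaded hosts
                      rng.normal(0.8, 0.05, (50, 2))])  # heavily loaded hosts
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print("Cluster labels for first five hosts:", clusters[:5])
```

In assurance terms, the supervised model predicts a known KPI from labelled history, while the unsupervised model surfaces groupings or anomalies that no one has labelled in advance.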
IBM and Google have pioneered the field for commercial applications. Options also exist in the open-source community, and niche suppliers such as MathWorks are active in the field. IBM is utilizing Watson in different industries to harness both structured and unstructured data. Watson allows the user to develop a model; the system then streams data through candidate algorithms to determine which yields the highest score and reliability. As more data is ingested, a different algorithm may be selected to improve the correlation.
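Watson's internals are not public, but the behaviour described above, scoring several candidate algorithms and keeping the best as data accumulates, can be approximated with standard cross-validation. The sketch below uses synthetic data and arbitrarily chosen candidate models; it is an analogue of the idea, not Watson's actual mechanism.

```python
# Rough analogue of scoring candidate algorithms and keeping the best,
# in the spirit of the model-selection behaviour described above.
# Data is synthetic; candidate models are arbitrary examples.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                  # e.g. KPI features per service
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # e.g. "degraded" vs "healthy"

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "random_forest": RandomForestClassifier(n_estimators=100),
}

scores = {name: cross_val_score(est, X, y, cv=5).mean()
          for name, est in candidates.items()}
print(scores)
print("Best algorithm so far:", max(scores, key=scores.get))
```

Re-running this selection as new samples arrive captures the "another algorithm may be selected" behaviour: the winner can change as the data grows.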
Appledore Research Group is embarking on research into machine learning and how it can be applied to the management of virtualization technologies such as NFV and SDN. If you have use cases you would like to share, or innovative solutions to help CSPs in this field, feel free to contact me directly.
Use Case: Google Data Centre Reduces Cooling Cost by 40%
Large-scale commercial and industrial systems like data centres consume a lot of energy. Applying DeepMind’s machine learning to Google data centres has reduced the amount of energy Google uses for cooling by up to 40 percent. Dynamic environments like data centres are difficult to operate optimally for several reasons:
- The equipment, how we operate that equipment, and the environment interact with each other in complex, nonlinear ways. Traditional formula-based engineering and human intuition often do not capture these interactions.
- The system cannot adapt quickly to internal or external changes (like the weather). This is because it is difficult to come up with rules and heuristics for every operating scenario.
- Each data centre has a unique architecture and environment. A custom-tuned model for one system may not be applicable to another. Therefore, a general intelligence framework is needed to understand the data centre’s interactions.
Using a system of neural networks trained on different operating scenarios and parameters within Google data centres, DeepMind created a more efficient and adaptive framework to understand data centre dynamics and optimize efficiency.
Google accomplished this by taking the historical data that had already been collected by thousands of sensors within the data centre (temperatures, power, pump speeds, setpoints, and so on) and using it to train an ensemble of deep neural networks. Since the objective was to improve data centre energy efficiency, Google trained the neural networks to predict the average future PUE (Power Usage Effectiveness), defined as the ratio of total building energy usage to IT energy usage. Google then trained two additional ensembles of deep neural networks to predict the future temperature and pressure of the data centre over the next hour. These predictions are used to simulate the recommended actions from the PUE model and ensure that the data centre does not exceed any operating constraints.
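DeepMind has not published its model architecture or feature set, so the sketch below is only a toy illustration of the general idea: an ensemble of small neural networks trained on sensor snapshots to predict PUE, averaged at prediction time. The features, data, and network sizes are assumptions made up for the example.

```python
# Toy sketch of training a small ensemble of neural networks to predict PUE
# from sensor readings, in the spirit of the approach described above.
# Features, data, and network sizes are illustrative, not DeepMind's.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

# Synthetic sensor snapshots: [IT load kW, outside temp C, pump speed %, setpoint C]
X = rng.uniform([500, 5, 40, 18], [1500, 35, 100, 27], size=(1000, 4))
# Synthetic target: PUE rises with outside temperature and pump speed (made up).
pue = 1.1 + 0.004 * X[:, 1] + 0.001 * X[:, 2] + rng.normal(0, 0.01, 1000)

# The "ensemble" here is simply several networks with different random seeds,
# averaged at prediction time.
ensemble = [MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                         random_state=seed).fit(X, pue)
            for seed in range(5)]

snapshot = np.array([[900.0, 28.0, 75.0, 22.0]])
predictions = [m.predict(snapshot)[0] for m in ensemble]
print("Predicted PUE for this snapshot:", float(np.mean(predictions)))
```

In the real system, predictions like this are what allow candidate control actions to be evaluated before they are applied to the plant.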
Google tested the model by deploying it in a live data centre. A typical day of testing included periods when the machine learning recommendations were turned on and periods when they were turned off, allowing a direct comparison.
The machine learning system could consistently achieve a 40 percent reduction in the amount of energy used for cooling, which equates to a 15 percent reduction in overall PUE overhead after accounting for electrical losses and other non-cooling inefficiencies. It also produced the lowest PUE the site had ever seen.
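The relationship between the two figures can be made concrete with purely illustrative numbers; the actual breakdown of Google's overhead is not public, so the split below (cooling at a bit over a third of the non-IT overhead) is an assumption chosen only to show how a 40 percent cooling cut can translate into roughly a 15 percent reduction in total PUE overhead.

```python
# Purely illustrative arithmetic linking a cooling-energy cut to PUE overhead.
# The breakdown below is assumed, not Google's actual numbers.
it_load = 1000.0        # kW of IT equipment
cooling = 150.0         # kW, assumed cooling share of overhead
other_overhead = 250.0  # kW, electrical losses and other non-cooling overhead

pue_before = (it_load + cooling + other_overhead) / it_load
cooling_after = cooling * (1 - 0.40)  # 40% cooling reduction
pue_after = (it_load + cooling_after + other_overhead) / it_load

overhead_before = pue_before - 1.0
overhead_after = pue_after - 1.0
print(f"PUE before: {pue_before:.3f}, after: {pue_after:.3f}")
print(f"Overhead reduction: {(1 - overhead_after / overhead_before) * 100:.0f}%")
```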