As the proverb has it: “If you want to travel fast, travel alone. If you want to travel far, take a friend.” The automation journey is too far, and too important, for CSPs to tackle alone, argues Grant Lenahan.

Automation is Complex, Layered, & Costly. It’s a Journey.

Automation in telecom holds wonderful promise, but it is also complex and costly. Many of tomorrow’s network technologies and services depend on automation: not just the obvious datacenter workloads, but dynamic technologies like 5G and network slices. Automation is also essential for practical, agile innovation. And in telecom the variety of things to be automated is vast: unlike traditional IT workloads in datacenters, telecom demands proximity[1], specialized infrastructure configurations, and the complex configuration of hundreds of thousands, or even millions, of paying customers and customer services on that infrastructure.

Automation is therefore a journey. There are endless tasks to be automated in the evolving network and its many services, and each will benefit from near-endless improvement and refinement. In the end, few, if any, CSPs can afford to shoulder this investment single-handedly. Moreover, even if they could, why would they, when lower-cost, more complete alternatives are possible?

Telecom has been on an automation journey for over a decade, driven by SDN and NFV (using those terms broadly and conceptually), which automate many actions that previously demanded manual configuration. Yet automation goes well beyond the lifecycle of a datacenter workload, or automatically computing and configuring a data path through an L1/L2/L3 network.

Telecom has consistently underestimated this Journey

The industry’s tumultuous path through NFV is illustrative. The NFV white paper, crafted and announced at the SDN World Congress in 2012, was ahead of its time (for telecom, anyway) and laid out many laudable principles of automation, flexibility, infrastructure uniformity, and more. It anticipated automation, and it anticipated pools of capacity that could become spares or capacity expansion for ANY cloudified network function. But then reality hit, and initial implementations were largely “fat VMs” created from brittle images, with deterministic configuration and lifecycle management… to the extent that such LCM even existed. Without belaboring the details, the NFV journey was a microcosm of the industry’s larger journey. In fact, since roughly 2016, Appledore Research has been noting the myriad places where automation is possible and desirable – and has consistently emphasized that, for long-term efficiency and maintainability, the “how” is often more important than the “what”.

Today, many CSPs are finally transitioning to truly automated, cloud-native management of workloads. But they have only begun the transition to fully automated lifecycles for composite network functions made up of those workloads, for network services, and for customer services (often “instances”, or semi-custom composites of services). And the automation of datacenter infrastructure, what we call “DCI” in our forward-looking taxonomy, remains weak. Kubernetes is useful, but it is not a sufficient answer: Google may have given away K8s, but it did not give away all the keys to its kingdom. Two simple examples illustrate the persistent gap between basic datacenter automation and what telecom demands: first, datacenter infrastructure must be application-aware; second, thousands of tiny edge facilities demand complex proximity and placement algorithms. The DCI/infrastructure discussion is sufficiently complex that we point readers to our upcoming blog on precisely this topic.
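To make the placement point concrete, here is a minimal, purely illustrative sketch (not any vendor’s algorithm) of the kind of decision a telecom scheduler must make beyond vanilla resource bin-packing: weighing latency to the service area, specialized hardware, and headroom across many candidate edge sites. All names, fields and weights below are hypothetical.

```python
# Hypothetical illustration only: a toy proximity/placement scorer for edge workloads.
# Generic schedulers optimize for CPU/memory fit; telecom placement must also respect
# latency bounds to the service area, specialized hardware, and remaining headroom.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EdgeSite:
    name: str
    latency_ms: float        # estimated RTT from this site to the target service area
    has_accelerator: bool    # e.g. SR-IOV NIC or other required hardware
    free_cpu: float          # schedulable vCPUs remaining
    free_mem_gb: float       # schedulable memory remaining

@dataclass
class Workload:
    name: str
    max_latency_ms: float    # service-level latency bound
    needs_accelerator: bool
    cpu: float
    mem_gb: float

def place(workload: Workload, sites: List[EdgeSite]) -> Optional[EdgeSite]:
    """Return the best feasible site for the workload, or None if none qualifies."""
    def feasible(s: EdgeSite) -> bool:
        return (s.latency_ms <= workload.max_latency_ms
                and (s.has_accelerator or not workload.needs_accelerator)
                and s.free_cpu >= workload.cpu
                and s.free_mem_gb >= workload.mem_gb)

    def score(s: EdgeSite) -> float:
        # Prefer low latency, with a small bonus for leftover headroom.
        # The weight is arbitrary and exists only for illustration.
        headroom = (s.free_cpu - workload.cpu) + (s.free_mem_gb - workload.mem_gb)
        return -s.latency_ms + 0.01 * headroom

    candidates = [s for s in sites if feasible(s)]
    return max(candidates, key=score) if candidates else None

if __name__ == "__main__":
    sites = [
        EdgeSite("metro-hub-1", 9.0, True, 64, 256),
        EdgeSite("rural-cabinet-7", 2.5, True, 8, 32),
        EdgeSite("regional-dc-3", 22.0, True, 512, 2048),
    ]
    upf = Workload("upf-slice-a", max_latency_ms=10.0, needs_accelerator=True, cpu=6, mem_gb=24)
    chosen = place(upf, sites)
    print(chosen.name if chosen else "no feasible site")
```

Even this toy version shows why generic scheduling is not enough: the large regional datacenter is excluded outright because it cannot meet the latency bound, and the “best” site changes as constraints and weights change across thousands of sites.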

We (telecom) are even earlier along the path toward using advanced analytics – especially ML, and later AI – to discover problems, predict events, and generally improve the logic driving those lifecycle events. We have only begun to scratch the surface of what is possible – and because it is possible, it is competitively necessary. A key lesson is that the platform is a tiny portion of the analytics challenge: building filters, training algorithms, and learning the data and its causal relationships are the real task – one being solved (or, many times, not) hundreds of times over across the globe, on hundreds of data sets, each much smaller than it could be if pooled.
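As a simple illustration of where the effort actually sits, consider the kind of anomaly filter that sits at the bottom of these analytics pipelines. The model code below is a generic scikit-learn sketch, not any operator’s production system, and the KPI values are invented; the point is that the model itself is a handful of lines, while choosing, cleaning and labeling the telemetry, and validating causation, is the bulk of the work each CSP currently repeats alone.

```python
# Generic, hypothetical sketch: flagging anomalous KPI samples with scikit-learn.
# The model is trivial to stand up; curating the telemetry that feeds it
# (the "features" below) is where most of the real analytics effort goes.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=7)

# Stand-in for curated per-cell telemetry: [PRB utilization %, drop rate %, avg latency ms].
normal = rng.normal(loc=[55.0, 0.4, 12.0], scale=[10.0, 0.1, 2.0], size=(1000, 3))
suspect = np.array([[97.0, 2.5, 45.0],     # congested, degraded cell
                    [52.0, 0.3, 11.0]])    # ordinary-looking sample

model = IsolationForest(contamination=0.01, random_state=7).fit(normal)

# predict() returns -1 for anomalies and 1 for inliers.
for row, label in zip(suspect, model.predict(suspect)):
    print(row, "anomaly" if label == -1 else "normal")
```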

This enormous undertaking must be shared. Scale wins.

“We must all hang together, or … we shall all hang separately.” – Benjamin Franklin, July 2, 1776

Appledore Research has argued for years that collaboration – shared learning and shared investment – is necessary. We believe that collaboration and sharing ought to begin with the fundamental building blocks that are unlikely to differentiate any CSP, except possibly negatively, by their omission. Examples are service and CNF models, ML methods, possibly anonymized data (“those with the biggest corpus of data to mine, win”), and other low-level capabilities. These are analogous to the libraries that form the basis of open source across the broader world of IT and technology. While many CSPs think they can go it alone, Exhibit A is one of the best-resourced CSPs in the world: AT&T. AT&T recently transferred its Network Cloud to Microsoft Azure, which aims to productize it, invest in it, and spread that investment across myriad telcos around the world. But we are getting ahead of ourselves. Suffice it to say this is the model used by every large industry, which relies on specialty suppliers to aggregate investment in component parts, from tires for autos to productivity software used in offices.

Build it or buy it.

We may be moving toward such models for telecom operational automation. Without judging their virtues or vices, I will point out that serious efforts are coming from three very different perspectives.

  1. First, we have the mighty public cloud firms. Microsoft’s Azure for Operators is, from what we have seen, the poster child for building telecom-specific capabilities, both in its technical underpinnings and, especially, in automation across all layers, from the lowest levels up to actual network services (e.g. EPC, 5G core, …). All the public cloud leaders have massive investments in cloud-native automation, and on top of this, Azure emphasizes the significant investments it has made in automating myriad lifecycle processes. These investments have been guided by the needs and desires of its own telecom acquisitions, in everything from automating application-aware infrastructure, to harnessing advanced on-platform analytics to inform intelligent automation, to service and customer provisioning automation at the NF/application layer. All of it rests on the massive investment in and build-out of Azure itself – the foundation of Azure for Operators. A CSP could replicate all of this, given enough time and money. But the reality is that there is insufficient time or money – and even AT&T appears to have realized that.
  2. Second, we have the David to the hyperscalers’ Goliath: start-up ISVs. One notable entrant is Robin.io, now owned by Rakuten. While it concentrates only on the lower layers (DCI, workload placement and optimization), Robin.io is investing to automate many tasks that are more complex than past efforts – including the application-aware configuration of infrastructure, an otherwise manual task and a major performance drain. Another disruptor, Elisa’s FRINX engine, encourages the sharing of all users’ models and adapters in one common library that may be accessed by all.
  3. Finally, we have the most traditional, establishment approach one can think of: the NEPs that often deliver extensive swaths of the network, from radio to core to management and BSS – including Ericsson, Nokia, and Huawei, plus large specialists. Their advantage: they have already built all those specific equipment models, interfaces, and related artifacts – admittedly with a strong preference for their own kit. And yet this is a very viable path, especially for smaller xSPs.

One size does not fit all

No one approach is better than the others. Rather, each lends itself to different circumstances. NEPs have an advantage for those with very limited resources. Public cloud providers can deliver a full stack, at edges, in private edges, and sometimes, as in the case of Azure, right up to a particular suite of applications. Firms like Robin.io provide a more DIY path, but one with a subset of automation delivered – and maintained – for the common benefit.

None of these solves the end-to-end, top-to-bottom challenge; nor should it. But they all represent a different approach – one that recognizes that SPs have common needs, and that solving those needs individually delivers no differentiation – only pain and high costs.

So, let’s get back to where we began: the idea of shared models, shared data corpora, shared methods, shared analytics, and shared “good ideas”. In the broader tech industry, thousands of basic libraries are built, maintained, improved, and shared on GitHub and similar platforms. By contrast, in the core of network operations, telecom has never been comfortable sharing what it regards as its own property: equipment and service models that incorporate extensive operational triggers, algorithms, pattern-detection methods, and more. Maybe we need to recognize that SPs compete on things like customer experience, just as auto OEMs compete with well-designed cars, not on the shared nuts, bolts, and steel from which cars are fabricated.

We believe that high levels of automation are critical to quality, to agility, to innovation, and to cost. We believe that more automation is likely better. And we believe that minimizing our investment while maximizing our returns is a good idea. So let’s find ways to collaborate on those nuts and bolts – and compete on the services.

 

[1] This means far more than “edges near the workload”.  For example, a router out in suburban or rural access plant cannot be remoted to some datacenter – not even a shared remote edge.
