Lessons Learned – Managing your Critical IT Infrastructure during a Pandemic

Worldwide Craziness

The novel coronavirus has already devastated the global economy. Historically, most business continuity plans for data centers have been based on local scenarios, where “acts of God” wreak havoc in one place. Rarely had anyone considered that the one place might be all of Earth.

A Change in IT Mindset

The pandemic is not, at least not yet, the equivalent of a worldwide hurricane. Today, the world’s data centers are, for the most part, functional. Modern enterprise data centers have already been designed to operate with as few as three full-time staff members onsite.

You don’t have to look far to see how the global COVID-19 pandemic has fundamentally upended IT. As organizations in all sectors have rapidly emptied their offices and sent their employees home to comply with ever more expansive shelter-in-place and quarantine mandates, replicating the full breadth of services remotely has been IT’s biggest priority.

All of this is nothing short of a remote collaboration revolution. It is already rewriting how work gets done — and how technology gets supported — when direct access to traditional, physical infrastructure is no longer a given.

But this is merely one aspect of IT. As we begin to digest how these changes will shape technology best practices, both during the current crisis and well into the future, we can’t afford to ignore the often unseen underpinnings of IT infrastructure that don’t have the luxury of working remotely.

Not an Option

Put simply, mission-critical facilities like data centers can’t be relocated into employees’ home offices. While transferring end-user productivity out of a traditional office context is a fairly straightforward process, the same can’t be said for the highly specialized workloads that can only be managed within the framework of a data center. Beyond the uniquely visible and non-transferable capabilities of the facilities themselves (grid access, raw compute power, failover, security, and so on), there is the genuine accountability associated with the sheer volume and type of workloads managed within them.

Regulatory constraints around how incalculably vital data must be managed and protected throughout all phases of its lifecycle add even more complexity to data center protocols during a pandemic.

So while you can’t simply abandon your data center in the same manner as your end users have cleared out their offices, you can (and must) understand how to rebalance your provision of data center services in light of how the pandemic continues to evolve. And you must do so while continuing to keep the lights on for stakeholders who need uninterrupted access to data center services now more than ever.

Against this backdrop, if you haven’t already examined your data center management strategy through a COVID-19 lens, now is the time to do so. As with anything related to the data center, however, this will be a complex, multifaceted process. Position yourself to navigate it by looking at it through the following contexts.

  • Capacity Management

    The historically unpredictable global business environment is putting unprecedented pressure on capacity management, with businesses barely able to forecast demand — or, in many cases, keep up with it. Global internet traffic is trending upward, with several exchanges routinely reaching record throughput as entire economies and workforces adjust to the new lockdown paradigm. Some organizations facing spiking demand have no choice but to move services out of their own data centers and lean more heavily on vendors. This makes absolute sense in an unpredictable landscape where scale needs to be implemented without delay. Still, it doesn’t make everyday issues like bandwidth, power, CPU, memory, and disk space disappear. Instead, it shifts the burden onto these external providers and their specific infrastructure. IT leadership must adapt these partnerships to keep pace because, if vendors don’t stay ahead of the curve, IT may find itself unable to serve the business adequately.

  • Connectivity

    The old advice to avoid putting all your eggs in one basket has never been more valid than it is now. This issue relates directly to capacity management, and, as the crisis deepens, the strain on all aspects of infrastructure will only increase. Diversify your upstream providers as much as possible to mitigate the risks associated with any one of them being compromised by pandemic-related resourcing constraints. This minimizes the potential for back-end interruptions to reach your customers. Leverage third-party user reviews and analyst resources to better assess and compare vendors, match provider capabilities to fast-changing business needs, and position yourself to make best-of-breed decisions faster.

  • Disaster Recovery

    The uptick in mission-critical services being deployed off-premises doesn’t only impact day-to-day service delivery and the service level agreements (SLAs) that set expectations and confirm accountabilities. It also has significant implications for disaster recovery (DR) planning and implementation, because it shifts a fair degree of risk over to the third-party providers now responsible for delivering these services. DR plans must be updated to reflect this new world of vendor-distributed work, and vendors must be integral to this process to ensure they are in a position to fulfill all requirements.

  • Security

    Cybercriminals have never missed an opportunity to take advantage of periods of uncertainty to ply their evil trade, and the COVID-19 pandemic is no exception. As more organizations move their services to centralized locations, bad actors suddenly have significantly more — and better defined — higher-value targets. From a cybercriminal’s perspective, why attack one company and net only one victim when you can strike a mission-critical data center and compromise many victims? This sobering reality reinforces the need to nail down end-to-end security protocols with all vendors, including, but not limited to, encryption, authentication, and onsite access control. Reaffirming your cybersecurity skills inventory — and closing any gaps with targeted training — should also be prioritized.

  • Colocation

    If you are either using or responsible for colocated resources or infrastructure, you must take immediate steps to reduce physical risks at all levels, including:

    • Focus on disease control and disinfection throughout the facility.
    • Enforce monitoring — including temperature checks — at tightly controlled entries, and turn away anyone exhibiting symptoms to avoid compromising the facility itself.
    • Reduce the number of people onsite, especially unknowns and other individuals not considered essential to the business.
    • Consider extending shift lengths from eight to 12 hours and moving to a two-shift schedule, if local labor laws will accommodate.
    • Take specific steps to protect technical staff with skills required to maintain data center uptime, including sequestering them in a third, unscheduled shift, and holding them in reserve in case primary staff exhibit symptoms.
    • Incorporate in-person monitoring of tasks during shift rotations to ensure continuity of operations. Implement contactless handovers to minimize transmission risk during these critical periods.
    • Assign activities and technical resources to single buildings and prevent them from moving to other buildings within a larger campus.
    • Prioritize the implementation of “smart hands” services to ensure trained, known resources handle tasks requiring onsite engagement.
    • Leverage guidance from local and regional health authorities to ensure nothing is missed, including physical traffic control methods in shared areas to support social distancing.

Focus on the Opportunity

Not everything about the current pandemic should incite fear — all significant disruptions offer opportunities to rethink how data center operations are planned, managed, and evolved over time. The possibilities can be game-changing, but only if you take the time to get out of firefighting mode and zero in on what your strategy should look like once COVID-19 is firmly behind us.

For example, as more data physically moves offsite toward data centers, hardware GPUs can be leveraged for compute-intensive artificial intelligence, machine learning, and related data analysis applications. Recognize that data has gravity and tends to pull surrounding apps with it. Position yourself to sell compute capacity to meet these shifting demands.

Don’t Reinvent the Wheel

As the pandemic continues to play out, expect the value of traditional data center best practice to be reinforced. This isn’t so much a time to rip apart and rebuild as it is to validate what you’ve been doing all along and double down on it.

Start by ensuring your basics are sound and that your existing slate of products and services is reliable, secure, and well-communicated to your stakeholders. The sudden increase in demand for data center services and capacity may be unique in history, but stakeholders will depend on you having a firm foundation. By taking the time to reaffirm that this is indeed the case, you’re in a much better position to scale and meet this demand.

Learn from Experience

As unique as this experience seems to us all, recognize that we’ve been through this before — including the SARS, H1N1, and Ebola outbreaks in 2003, 2009, and 2014, respectively. Refer back to any documentation you may have from those periods to inform your thinking and responses for the current pandemic, but bear in mind that the impact in those previous cases was significantly smaller, and we “returned to normal” much more quickly.

This time out, the impact is unprecedented, and the future timeline won’t be resolving itself anytime soon. Expect it to take far longer than initially expected to return to anything remotely approaching “normal,” and, even then, expect the very definition of the word to evolve.

Many economic, technological, and social changes will indeed be permanent, which means your go-forward strategy to manage data center resources should not be to overutilize what you’ve got and hope to ride out the storm. Instead, now is the time to scale your investments in critical infrastructure and prepare for a changing world after that. This strategy will maximize your business continuity and minimize the risks associated with navigating these strange times.

Until next time, Rob.

My thoughts on the Future of the Cloud

Many people in IT consider containers, a technology used to isolate applications within their own environment, to be the future.

However, serverless geeks think that containers will gradually fade away. They will exist as a low-level implementation detail bubbling below the surface, but most software developers will not have to deal with them directly. It may seem premature to declare victory for serverless just yet, but there are enough positive signs already. Forward-thinking organizations like iRobot, Coca-Cola, Thomson Reuters, and Autodesk are experimenting with and adopting serverless technologies. All major and minor cloud providers, including players like Azure, AWS, GCP, IBM, Oracle, and Pivotal, are working on serverless offerings. If you want to learn more, take a quick look at this link: https://docs.microsoft.com/en-us/archive/blogs/wincat/validating-hybrid-cloud-scenarios-in-the-server-2012-technology-adoption-program-tap.

Together with the major players, a whole ecosystem of startups is emerging. These startups attempt to solve problems around deployment and observability, provide new security solutions, and help enterprises evolve their systems and architectures to take advantage of serverless. This isn’t, of course, to mention a vibrant community of enthusiasts who contribute to serverless open source projects, evangelize at conferences and online, and promote ideas within their organizations.

It would be great to close the book now and declare victory for the serverless camp, but the reality is different. There are challenges that the community and vendors are yet to solve. These challenges are both cultural and technological: there’s tribal friction within the tech community, inertia to adoption within organizations, and issues around some of the technology itself. Also make sure that you are properly certified if you are running cloud-based services; ISO/IEC 27017, which covers information security controls for cloud services, is the standard to look at.

Confusion and the Cloud

While adoption of serverless is growing, more work needs to be done by the serverless community to communicate what this technology is all about. The community needs to bring more people in and explain how serverless adds value. It’s inarguable that there are good questions from members of the tech community. These can range from trivial disagreements over “serverless” as a name to more philosophical arguments about fit, use case, and lock-in. This is a perfectly normal example of past successes (with other technologies) breeding inertia to change.

This isn’t to say that those who have objections are wrong. Serverless in its current incarnation isn’t suitable in all cases. There are limitations on how long functions can run, tooling is immature, and monitoring distributed applications made up of many functions and cloud services can be difficult (although some progress is being made to address this).
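
To make the run-time limit a little more concrete, here is a minimal Python sketch of one common workaround: do as much of a batch as fits inside a time budget, then hand the remainder to a follow-up invocation. The names `process_within_budget`, `handle`, and `budget_seconds` are my own illustrative assumptions, not part of any provider's API, and real platforms set their own limits.

```python
import time


def process_within_budget(items, handle, budget_seconds, clock=time.monotonic):
    """Process items until the time budget is spent.

    Returns (results, remaining): the outputs produced so far and the
    items that still need a follow-up invocation.
    """
    items = list(items)
    # Hypothetical budget: pick something safely below your platform's limit.
    deadline = clock() + budget_seconds
    results = []
    for index, item in enumerate(items):
        if clock() >= deadline:
            # Out of time: stop cleanly and report what is left over.
            return results, items[index:]
        results.append(handle(item))
    return results, []
```

In practice, whatever comes back as `remaining` would be queued for the next invocation, so no single function call has to outlive the platform's timeout.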

There’s also a need for a robust set of example patterns and architectures. After all, the best way to convince someone of the merit of technology is to build something with it and then show them how it was done.

Confusingly, there is a tendency by some vendors to label their offerings as serverless when they aren’t. This makes it look like they are jumping on the bandwagon rather than thoughtfully building services that adhere to serverless principles. Some of the bigger cloud vendors are guilty of this and, unfortunately, it muddies people’s understanding of the technology.

Go Big or Go Home

At the very large end of the scale, companies like Netflix and Uber are building their own internal serverless-like platforms. But unless you are the size of Netflix or Uber, building your own Function as a Service (FaaS) platform from scratch is a terrible idea. Think of it this way: it’s like building a toaster yourself rather than buying a commoditized, off-the-shelf product. Interestingly, Google recently released a product called Knative. This product, based on the open source Kubernetes container orchestration software, is designed to help build, deploy, and manage serverless workloads on your own servers.

For example, Google’s Bret McGowen, at Serverlessconf San Francisco ’18, gave an example of a real-life customer scenario out on an oil rig in the middle of an ocean with poor Internet connectivity. The customer needed to perform computation on terabytes of telemetry data, but uploading it to a cloud platform over a connection equivalent to a 3G modem wasn’t feasible. “They cannot use cloud and it’s totally unfair to say ‘sorry buddy, hosted functions-as-a-service or bust.’ Their developers deserve to have the same serverless experience as the rest of us” was Bret’s explanation of why, in this case, running Knative locally on the oil rig made sense.

He is, of course, correct. Having a serverless system running in your own environment, when you cannot use a cloud platform, is better than nothing. However, for most of us, serverless solutions like Google Cloud Functions, Azure Functions, or AWS Lambda offer a far lower barrier to entry and remove many administrative headaches. It’s fair to say that most companies should look at serverless solutions like Lambda first and, if those don’t satisfy requirements, look at alternatives like Knative and containers second.
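
To give a feel for how low that barrier is, here is a minimal sketch of a hosted function in Python. It follows the `(event, context)` handler shape that AWS Lambda uses for Python functions; the `name` field in the event and the greeting it returns are assumptions I made up purely for illustration.

```python
import json


def handler(event, context):
    """Return a small JSON greeting; the platform handles servers and scaling."""
    # "name" is a hypothetical field in the incoming event, not a platform convention.
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

That is the whole deployable unit: no server to patch and no container to build. You can also call `handler({"name": "Rob"}, None)` locally to try it out before uploading anything.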

The Future…in my humble opinion

It’s likely that some of the major limitations with serverless functions are going to be solved in the coming years, if not months. Cloud vendors will allow functions to run for longer, support more languages, and allow deeper customizations. A lot of work is being done by cloud vendors to allow developers to bring their own containers to a hosted environment and then have those containers seamlessly managed by the platform alongside regular functions.

In the end, “do you have a choice?” “No, none, whatsoever” was Bret’s succinct, brutal answer at the conference. Existing limitations will be solved, and serverless compute technologies will herald the rise of new, emerging architectural patterns and practices. We are yet to see what these are, but this is the future and it is unavoidable.

Cloud computing is where we are, and where the world is going for the next decade or two. After that, probably something new will come along.

But the reasons for going to cloud computing in general and the inevitable wind-down of on-premises to niche special functions are now pretty obvious.

  • Security – Big cloud operators have FAR more security people and capacity than even a big enterprise, and your own disgruntled employees don’t have the keys to the servers.
  • Cost-effectiveness – Economies of scale. The rule of big numbers.
  • Zero capital outlay – reduced costs.
  • For software developers, no more software piracy. That’s a big saving on the cost of developing software, especially for sales in certain countries.
  • Compliance – So much easier if your cloud vendor is fully certified, so you only have to worry about your part of the puzzle.
  • Energy efficiency – Big, well-designed datacentres use a LOT less global resources.

My next post in this series will be on “The Past and On-prem and the Cloud?”

Until next time, Rob

Infrastructure: from your Enterprise to the Starship Enterprise: Building the right Playground – Part 1

Now, if you know me or have ever met me in any way, you know that I am a big Trekkie. The Star Trek series was very defining for my life in many ways, from the original series to Star Trek: The Next Generation to DS9, Enterprise, and of course Voyager. So recently, I decided to write this blog series over the coming weeks in a context that many of us can understand. I know that Star Trek is the love of many IT pros. And so we begin this series on infrastructure.