At Your Service: Best Practices in Site Reliability Engineering (SRE)
Thanks to advances in cloud technology, it’s now possible for organizations to build almost any software solution themselves using microservices from the big public cloud providers: Amazon, Microsoft, and Google. While building can be empowering, it may not always be the right decision. Just because you can doesn’t mean you should. Often when companies evaluate build vs. buy decisions, they overlook some significant costs, specifically the costs of operating and supporting the services they build.
This article is part three of a new Signiant blog series discussing key service components that are core to cloud-native SaaS offerings — Customer Support, Customer Success, and Site Reliability Engineering.
Signiant is especially committed to a high standard of service, delivered through these three teams, which work alongside Signiant product development and our customers.
Let’s explore best practices in Site Reliability Engineering (SRE).
SRE at Signiant
At the highest level, Site Reliability Engineering (SRE) brings together two crucial elements of software engineering: development and operations. As such, the SRE methodology combines the entire software development, deployment, and monitoring lifecycle and ensures 24×7 availability for customers.
Ideally, SRE is a cross-functional practice where developers are actively involved in releasing and monitoring the software they write. “It used to be that developers would write code; testers would test the code; and operations would deploy and monitor the code,” says Signiant’s Director of SRE, Kevin Haggerty, explaining why this new methodology matters. “Things have changed such that developers and testers are much more involved in every aspect of the process, including the deployment and monitoring of code they’ve written.”
So how does this affect the work SRE does? With the infrastructure and code tightly coupled, developers understand their service best, since they designed and wrote it. It is therefore far more efficient to involve them early in problem solving when an issue is discovered.
Real-time Updates, Non-Stop Monitoring
The SRE benefits to customers fall into two categories: product updates and monitoring. Signiant’s SaaS customers automatically receive product updates, with no downtime or maintenance windows needed. That customer benefit comes from a finely tuned system developed by our SRE team. At Signiant, this takes the form of an elaborate system that supports developer productivity and rapid release cycles, which keep all Signiant products in step with the demands of the industry and the individual customer. It also allows us to release new updates quickly and roll them back just as quickly if there’s a problem.
SRE takes responsibility for everything working properly for customers at all times, which translates into monitoring a large number of systems and automated processes. Signiant’s SRE team built up operational excellence over several years as we transitioned to SaaS, and the finely tuned processes by which they now work have helped the company handle the dynamism of the media and entertainment (M&E) industry.
“We have hundreds of advanced checks running 24×7 telling us all the different ways things could be inoperative,” says Kevin. “In addition, we leverage machine learning and anomaly detection to quickly highlight anything that looks out of the ordinary in our complex system. Whether it’s Amazon, Azure, Google Cloud, or any of our own stuff, if something goes wrong, we can usually fix it before it impacts customers, and a lot of it is automated.”
For instance, in November 2020, AWS suffered a service incident in its US-EAST-1 region, which left a lot of Amazon cloud customers scrambling. Signiant, and by extension our customers, was able to dodge the impact of this problem. How? Signiant can detect problems in a given region and automatically route traffic away from that region. For the parts where this failover doesn’t happen automatically, SRE has worked to make our monitoring capabilities even more robust, so that Signiant can rapidly detect a problem and react to it accordingly.
“If Amazon is having a problem in Virginia, for instance, our system will automatically failover to another region. We’ve invested heavily in a multi-region system, which is a lot of work but is incredibly important for our customers,” Kevin explains.
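The idea of detecting a problem in one region and routing traffic elsewhere can be sketched in a few lines. This is a minimal illustration only, not Signiant’s implementation: the RegionHealth class, the FAILURE_THRESHOLD value, and the region names are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RegionHealth:
    """Tracks how many health checks in a row have failed for one region."""
    name: str
    consecutive_failures: int = 0


# Hypothetical threshold: failed checks in a row before a region is avoided.
FAILURE_THRESHOLD = 3


def record_check(region: RegionHealth, ok: bool) -> None:
    """Update a region's failure streak from one health-check result."""
    region.consecutive_failures = 0 if ok else region.consecutive_failures + 1


def is_healthy(region: RegionHealth) -> bool:
    return region.consecutive_failures < FAILURE_THRESHOLD


def pick_region(regions: list[RegionHealth]) -> str:
    """Return the first healthy region, in order of preference."""
    for region in regions:
        if is_healthy(region):
            return region.name
    raise RuntimeError("no healthy region available")


# Example region names are illustrative only.
regions = [RegionHealth("us-east-1"), RegionHealth("us-west-2")]
for _ in range(3):
    record_check(regions[0], ok=False)  # simulate three failed checks
print(pick_region(regions))  # prints "us-west-2"
```

Requiring several consecutive failures before routing away is one common way to avoid reacting to a single dropped check; a real system would likely weigh many more signals.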
Because of the amount of high-value content moved by Signiant software, we take security protocols very seriously. We adhere to the MPAA best practices and guidelines for content security and are audited according to the DPP’s broadcast and production security standards. This rigor has led to Signiant being awarded the DPP’s “Committed to Security” certification — one of only 16 companies recognized in both categories.
These accolades and certifications would not be possible without the SRE team, which is a core part of Signiant’s defense-in-depth security practices, designing and providing software packages, audit tools, and scanning tools to ensure we are as secure as we possibly can be.
Signiant runs a commercial network security intrusion tool and performs ongoing scans to ensure there are no misconfigurations or other known vulnerabilities we need to be aware of. We also monitor for unusual error conditions on our customer portals. For example, we have occasionally seen a large number of suspicious errors from a specific customer’s Media Shuttle portal. After investigating one such incident, we found that the customer was running a security scan of their own, but we treat all suspicious activity the same just to make sure.
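To make the idea of flagging an unusual burst of errors concrete, here is a small sketch that compares the latest error count against recent history. The article does not say which detection method Signiant uses; the z-score test, the is_anomalous function, and the sample numbers below are assumptions made purely for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history: list[int], current: int, z_threshold: float = 3.0) -> bool:
    """Flag `current` if it is more than `z_threshold` sample standard
    deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu  # flat history: any increase stands out
    return (current - mu) / sigma > z_threshold


# Hypothetical errors-per-minute counts for one portal.
errors_per_minute = [2, 3, 1, 2, 4, 3, 2, 3]
print(is_anomalous(errors_per_minute, 3))   # prints False
print(is_anomalous(errors_per_minute, 40))  # prints True
```

A flagged spike is only a prompt to investigate. As the scan example above shows, the cause may turn out to be benign.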
For SRE, the safety of your assets is the chief priority.
The Future of SRE
As our industry, our customers, and their challenges all evolve, SRE strategies must evolve with them; otherwise, the consistency and support that customers expect will quickly dissipate. Signiant has seen this firsthand since we introduced the first true cloud-native SaaS solution into the media supply chain in 2012 with the launch of Media Shuttle. Since then, we’ve continued to innovate, introducing two additional SaaS products, Flight and Jet, both built on the same SaaS platform and with the same cloud-native approach. With these new additions to the Signiant slate, SRE has continued to develop as well.
Just over the past year, in fact, our SRE team has undertaken an exciting new initiative to continually improve services. In the past, Signiant’s control plane software ran with a primary region and a Disaster Recovery (DR) backup region on hot standby, ready to be turned on should the primary region suffer an outage. In collaboration with our engineering development teams, Signiant now runs much of this software in multiple regions simultaneously, with every region handling traffic at the same time, in what we’ll call a live-live setup.
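The difference between the two models can be shown with a toy router. This is a sketch under stated assumptions, not a description of Signiant’s control plane: the function names, the round-robin policy, and the region names are all hypothetical.

```python
def standby_route(primary: str, standby: str, healthy: set[str]) -> str:
    """Primary/DR model: the standby serves traffic only if the primary is down."""
    if primary in healthy:
        return primary
    if standby in healthy:
        return standby
    raise RuntimeError("no healthy region available")


def live_live_routes(regions: list[str], healthy: set[str], n_requests: int) -> list[str]:
    """Live-live model: spread requests round-robin over every healthy region."""
    active = [r for r in regions if r in healthy]
    if not active:
        raise RuntimeError("no healthy region available")
    return [active[i % len(active)] for i in range(n_requests)]


regions = ["region-a", "region-b"]  # placeholder names
all_up = {"region-a", "region-b"}

print(standby_route("region-a", "region-b", all_up))
# prints region-a (region-b sits idle)
print(live_live_routes(regions, all_up, 4))
# prints ['region-a', 'region-b', 'region-a', 'region-b']
print(live_live_routes(regions, {"region-b"}, 2))
# prints ['region-b', 'region-b']
```

In the live-live case, losing a region simply shrinks the pool that is already serving traffic, so nothing has to be switched on during an outage.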
It’s tweaks and expansions like these that make Signiant’s SaaS solutions so reliable and so trusted across the media industry. For SaaS products to serve the needs of their users, a vendor must be attentive and constantly willing to change and update its approach. You should be able to sleep soundly, knowing that Site Reliability Engineering teams are striving to ensure the fluidity, security, and strength of your business.
With Signiant, you can.