Anonymized Data: The Story About Everyone by No One
Signiant’s Metadata Everywhere series focuses on how our systems interact with and utilize the metadata associated with media assets. In this piece, we look at the concept of anonymized data: what it is, how we collect, process, and use it, and at the end of the day, how it helps our customers.
The Signiant SaaS platform collects and processes customer data in two distinct but related forms: customer-specific data and anonymized data. As the name implies, anonymized data goes through an automated conversion process in which all confidential information is removed. Although it is derived from customer-specific data, the value of anonymized data lies in the fact that individual data points are removed and replaced by a big-picture collective view.
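As a minimal sketch of what that conversion step might look like (the field names, event shape, and salted-hash scheme below are illustrative assumptions, not Signiant’s actual pipeline):

```python
import hashlib

# Fields that identify a specific customer and must never leave the tenant boundary.
CONFIDENTIAL_FIELDS = {"customer_id", "user_email", "file_name", "source_ip"}

def anonymize(record: dict, salt: str) -> dict:
    """Strip or irreversibly hash confidential fields, keeping only
    the aggregate-friendly measurements (sizes, durations, regions)."""
    clean = {k: v for k, v in record.items() if k not in CONFIDENTIAL_FIELDS}
    # A salted one-way hash lets us count distinct tenants without knowing who they are.
    clean["tenant_bucket"] = hashlib.sha256(
        (salt + str(record["customer_id"])).encode()
    ).hexdigest()[:8]
    return clean

event = {
    "customer_id": "acme-media",           # hypothetical example values
    "user_email": "editor@example.com",
    "file_name": "final_cut_v3.mxf",
    "source_ip": "203.0.113.7",
    "file_size_bytes": 48_000_000_000,
    "transfer_seconds": 410,
    "region_pair": ("us-east", "eu-west"),
}
print(anonymize(event, salt="rotating-secret"))
```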
Customer-specific data provided by Signiant software is used primarily for chain of custody, proving who did what and when to any given transferred file throughout its lifecycle. There’s a log of everybody who’s ever been granted access to that file, what changes were made, etc. This information is confidential to each customer, and Signiant takes privacy and security very seriously. At no point is any customer-specific information exposed beyond users explicitly authorized by the customer.
Anonymized data, on the other hand, is used to generate a broad-based analysis of what many customers are doing without exposing details relating to any individual customer. Through this process, we can identify patterns and trends regarding how our software is being used and under what conditions. This information helps us improve our service, prioritize new features, and understand media industry dynamics. Ultimately, anonymized data doesn’t provide a view of any specific customer but rather deep insight into our entire customer base. Our software collects and analyzes this data automatically and continuously, aided by machine learning.
The Multi-tenant SaaS Advantage
The benefits of anonymized data analysis can only be realized with a multi-tenant SaaS architecture like Signiant’s. Because our control plane handles metadata in a central cloud location, we can easily extract and process massive amounts of information. These economies of scale also allow us to provide customer-specific data much more cost-effectively than would be possible with a single-tenant architecture. Furthermore, the whole notion of anonymizing and mining data is only possible if information about thousands of customers is aggregated in a single place, which is exactly what a multi-tenant SaaS platform does.
Critical Mass
Critical mass is another key factor. The more data you have, the more actionable information you can extract from it. The broader the view of what people and systems are doing, the better we can tune our software to optimize various parameters. With only a few users on a multi-tenant SaaS system, you can’t extract much useful information. But when you have a million people from 50,000 companies using your system every day, new worlds open up.
Media Industry Specific
What makes anonymized data uniquely powerful for Signiant is that it is specific to media and entertainment companies. Our industry operates as a vast ecosystem of interconnected supply chains, so Signiant customers have shared interests and direct interactions with one another. Using AI and machine learning techniques on anonymized data, our products tune themselves specifically for the customer community that we serve. The constant flow of new data circulating through our system, layered on top of the historical data we have already collected, continuously and automatically improves our service. We’re confident that no other company can match our level of insight into the file transfer needs of media and entertainment companies.
How It Works
The anonymized data life cycle consists of distinct phases, ebbs, and flows. Data flows into the system, gets collected and analyzed, conclusions are extrapolated and saved, and then the anonymized data itself is discarded. As more data comes in, the knowledge pool grows bigger and more powerful.
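A toy sketch of that life cycle, under the assumption that the saved “conclusions” are simple running aggregates (the record fields here are illustrative):

```python
def process_batch(anonymized_records: list[dict], knowledge: dict) -> dict:
    """Fold a batch of anonymized transfer records into the long-lived
    knowledge pool, then throw the batch itself away."""
    for r in anonymized_records:
        bps = r["file_size_bytes"] / r["transfer_seconds"]
        # Only the extrapolated conclusions are saved...
        knowledge["transfers"] = knowledge.get("transfers", 0) + 1
        knowledge["total_bps"] = knowledge.get("total_bps", 0.0) + bps
    knowledge["mean_throughput_bps"] = knowledge["total_bps"] / knowledge["transfers"]
    # ...and the data itself gets thrown out.
    anonymized_records.clear()
    return knowledge
```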
System optimization based on anonymized data is analogous to air traffic control. The system is designed to make better decisions about how we transfer files based on a “weather report” (i.e., current and forecasted network conditions), incorporating what it has learned from anonymized data. If you understand where the jet stream is, you can fly planes in it when they’re going west to east and out of it when going east to west. We can forecast and detect changes in wind speed, direction, and other weather conditions to make mid-course corrections in real time. This real-time data is used alongside anonymized data, which is more historical and analogous to long-term climate conditions.
Historical data also allows us to tune performance. It can tell us we need a new runway at a busy airport because flight volume is increasing by a certain percentage over a given period. If we see a steady increase in passengers flying between two distant locations, we may need to fly bigger planes on that route, or it may be more effective to send smaller planes more frequently, depending on the distance and the airport capacity at either end. It could also tell us that we need to decrease traffic on another route, or that there’s less traffic on Saturday than on Monday. Our platform can look at the history of everything and extrapolate useful data that ultimately helps our customers perform faster, more efficient, and more secure file transfers.
Examples of specific data-centric capabilities of the Signiant SaaS platform include:
File Transfer Metadata
As the name implies, file transfer metadata is information about a file or set of files transferred by a user or system to another user or system. It includes information about the computer that generated the file, the server it was uploaded to, and the network conditions under which it was transferred (bandwidth, delay, etc.). It also includes the settings used to perform the transfer and the results achieved.
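A minimal sketch of the shape such a record might take (all field names here are illustrative assumptions, not Signiant’s schema):

```python
from dataclasses import dataclass

@dataclass
class TransferMetadata:
    """Illustrative shape of a file transfer metadata record."""
    # Where the file came from and where it went.
    source_host: str
    destination_host: str
    # Network conditions observed during the transfer.
    round_trip_ms: float      # delay between the endpoints
    bandwidth_mbps: float     # available bandwidth on the path
    packet_loss_pct: float
    # Settings used and results achieved.
    parallel_streams: int
    file_size_bytes: int
    transfer_seconds: float

    @property
    def effective_throughput_mbps(self) -> float:
        return self.file_size_bytes * 8 / 1e6 / self.transfer_seconds
```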
In the lens-to-screen ecosystem, there are different file sizes and formats at every stage, all the way to the point where content is delivered to the consumer. At each stage, there are all sorts of ways in which people and systems interact with the files, and the nature of the files changes as they move through the supply chain. A green screen camera capture for a major blockbuster, for example, moves along with thousands of other shots, processes, visual effects, and more that are pieced into a final cut delivered to different display screens. From upstream content creation down through the final content distribution phase, we touch all those files along the way and ask a multitude of different questions.
Understanding what has occurred is valuable. We don’t necessarily know all the forthcoming questions when collecting the data, but the more data we collect, the more questions we can answer. Using anonymized data techniques, we can look across the mass of productions and ask: Which file types are used most often? What do we need to optimize for? The more data you keep, the more you learn and the more you can do.
Intelligent Transport
Our newest products use anonymized data to optimize file transfer speed. The main network information we collect includes the round-trip time between endpoints, which is a good indicator of distance but may also be affected by congestion, and the available bandwidth between those locations. Think of Google Maps finding the fastest route to a destination, but instead of a car with a single occupant, it’s an airplane full of people and their luggage. We use anonymized data to process factors like getting from point A to point B: directions, speed limits, distances, current traffic conditions, and ultimately the quickest way to arrive. Then we factor in air travel: how many passengers are aboard and how much their luggage weighs, all transported across very long distances at very high speeds. We track speed and delivery and seek to mitigate or avoid network congestion. The data tells us things like whether it may be faster to put fewer people into more planes, to use a bigger plane, or to add a runway. Anonymized data allows us to constantly optimize system performance.
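One classic calculation this kind of network tuning rests on is the bandwidth-delay product, which says how much data must be “in the air” to keep a path full. The stream-count heuristic below is an illustrative sketch, not Signiant’s actual algorithm:

```python
def bandwidth_delay_product_bytes(bandwidth_mbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a path fully utilized."""
    return (bandwidth_mbps * 1e6 / 8) * (rtt_ms / 1000)

def suggest_streams(bandwidth_mbps: float, rtt_ms: float,
                    per_stream_window_bytes: int = 4 * 1024 * 1024) -> int:
    """Rough heuristic: use enough parallel streams that their combined
    windows cover the bandwidth-delay product of the path."""
    bdp = bandwidth_delay_product_bytes(bandwidth_mbps, rtt_ms)
    return max(1, round(bdp / per_stream_window_bytes))

# A 1 Gbps transatlantic path with an 80 ms round-trip time:
print(suggest_streams(1000, 80))  # BDP = 10 MB -> 2 streams with 4 MB windows
```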
Storage Optimization
People use many different storage types with different performance characteristics, and storage is often a limiting factor: how fast a system can read and write for long periods of time is always a moving target. There’s no point in flying a thousand people to a terminal that can only handle 10 people per minute; you’re better off sending a hundred planes of 10 people each, one per minute. If we know the storage parameters, we know what the limiting factor in the pipeline is, and then we know how to interact with it. That’s a simplistic view of the frequently complex and constantly shifting storage equation.
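In pipeline terms, the achievable end-to-end rate is bounded by the slowest stage. A toy illustration (the numbers are made up):

```python
def pipeline_bottleneck_mbps(read_mbps: float, network_mbps: float,
                             write_mbps: float) -> float:
    """End-to-end throughput can't exceed the slowest stage: source storage
    read, the network path, or destination storage write."""
    return min(read_mbps, network_mbps, write_mbps)

# A 10 Gbps network is wasted if the destination disk only writes at 400 Mbps.
print(pipeline_bottleneck_mbps(read_mbps=2000, network_mbps=10000, write_mbps=400))
# -> 400.0; pushing data any faster just queues it in front of the disk.
```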
Developing Other Features
Signiant-specific information such as portal types and feature usage plays into optimizing how we develop our service offerings going forward. Anonymized data tells us how many people are using different features so we can optimize features that most people use most of the time. For example, we have three types of portals: send, share, and submit. We know from anonymized data that the vast majority of our customers use share portals, so we continue to invest most heavily in optimizing the experience for that portal type.
Instead of writing algorithms to answer questions, we use artificial intelligence and machine learning, which use the answers to questions to generate algorithms. It’s similar to the way machine learning automated speech-to-text: programmers used massive amounts of recorded speech and corresponding transcribed text to develop a way to convert recordings the system had never seen before into text. We use anonymized data in a similar fashion to identify macro trends that reveal new ways to benefit our customers. It makes our software perform better, both in the real-time decisions it makes and in how we enhance and develop our product. Customers don’t necessarily realize that anonymized data was behind it; they just see things getting better.
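As a hedged illustration of answers generating algorithms, the sketch below fits a simple regression that predicts transfer throughput from historical anonymized observations rather than hand-coding the rule; the features, numbers, and use of scikit-learn are all assumptions made for illustration:

```python
from sklearn.linear_model import LinearRegression

# Historical anonymized observations: (rtt_ms, bandwidth_mbps, parallel_streams)
X = [[80, 1000, 2], [80, 1000, 8], [10, 1000, 2], [150, 500, 4]]
y = [300, 850, 900, 220]  # achieved throughput in Mbps (made-up numbers)

# The "algorithm" for predicting throughput is learned from past answers,
# not written by hand.
model = LinearRegression().fit(X, y)
print(model.predict([[40, 1000, 4]]))  # estimate for an unseen condition
```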
With a multi-tenant SaaS platform, our developers can push new software multiple times every single day. And beyond the software development process, our software updates the decisions it makes on its own, through machine learning: it looks at the data and improves itself without a human overseeing it. The same data also gives our people the answers they need to decide how to enhance the software themselves. The more raw material you have, the more opportunities you have to extract value from it.
Serving Our Customers Better
In the end, anonymized data helps Signiant serve customers better by optimizing the software in real time for better performance and better experiences, and through long-term enhancements based on the available data. The rate at which things get better compounds over time: more anonymized data enables smarter tools in a virtuous circle of continuous optimization that just keeps going around and around. Anonymized information never leaves the company and isn’t shared directly with customers. But it does help us transfer your files faster, more efficiently, and more securely, and it accelerates the pace of innovation in people’s ability to produce better content at a lower cost.