As with any company concerned with the security of its customers’ data, StreamSets ensures that data is reliably backed up and can be recovered in the event of a disaster.
Security is commonly described as the so-called “CIA” triad: confidentiality, integrity, and availability. Confidentiality ensures that data is seen only by the parties who are authorized to see it, and usually involves encryption.
Integrity means making sure that the data can’t be inappropriately altered, whether by a malicious actor or accidentally. It also relies on cryptography, using digital signatures or “hashes,” which are like fingerprints that show that the file you’re looking at now is the same one that existed previously.
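To make the fingerprint idea concrete, here is a minimal sketch in Python. It assumes SHA-256 as the hash function and uses made-up sample strings; it illustrates the general technique and is not a description of how StreamSets itself computes or stores hashes.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical sample content, standing in for a stored file.
original = b"pipeline configuration, version 1"
recorded = fingerprint(original)  # fingerprint saved when the file was stored

# Later, recompute the fingerprint and compare it with the recorded one.
unchanged = b"pipeline configuration, version 1"
altered = b"pipeline configuration, version 2"

matches_unchanged = fingerprint(unchanged) == recorded
matches_altered = fingerprint(altered) == recorded

print("unchanged copy matches:", matches_unchanged)
print("altered copy matches:", matches_altered)
```

Even a one-character change produces a different fingerprint, which is what makes a hash useful for detecting alteration.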
Availability is where data backup and recovery are essential: your data should also be protected from loss and, in the event of a disruption, should either be quickly restored or, if necessary, available to you via some replacement for the failed system.
The StreamSets platform lives in the cloud and is backed up there as well. The StreamSets Control Hub service that a data engineering user logs into is hosted in a particular region on a cloud provider’s service (whether Google Cloud, AWS, or Azure), but all of the user’s transactions and associated data are continually replicated, within that provider’s infrastructure, to a second backup site. That replication means that in the event of the loss of the primary service (e.g., if AWS suffered the loss of the hosting data center due to an earthquake or fire), service can be restored from the secondary backup in a matter of minutes.
Recovery time objective (RTO) and recovery point objective (RPO) are targets: respectively, the time it will take to restore operations and how much of a gap in data might be seen because of the outage. If, in a hypothetical situation, a StreamSets user’s primary data center went down, their data would have been streaming over to the secondary site up until the outage and would be available as soon as the service could be restored on the secondary site. While StreamSets targets an RTO and RPO of no more than four hours, the fact that we replicate the data continually ensures that the RPO will actually be far shorter than that.
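The two measures can be worked through with simple timestamp arithmetic. The sketch below is illustrative only: the function name and the sample timestamps are hypothetical, and the only figure taken from the text is the four-hour target.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_replicated, outage_start, service_restored):
    """Return (data_gap, downtime) for a single outage.

    data_gap is the window of changes not yet copied to the secondary
    site when the outage began; the RPO is the target for this value.
    downtime is how long the service was unavailable; the RTO is the
    target for this value.
    """
    data_gap = outage_start - last_replicated
    downtime = service_restored - outage_start
    return data_gap, downtime


# Target stated in the text: no more than four hours for both.
TARGET = timedelta(hours=4)

# Hypothetical outage, with replication running up until a few seconds before it.
outage_start = datetime(2024, 1, 1, 12, 0, 0)
last_replicated = outage_start - timedelta(seconds=5)
service_restored = outage_start + timedelta(minutes=20)

data_gap, downtime = recovery_metrics(last_replicated, outage_start, service_restored)
within_target = data_gap <= TARGET and downtime <= TARGET

print("data gap:", data_gap)
print("downtime:", downtime)
print("within four-hour targets:", within_target)
```

With continual replication, the data gap is bounded by the replication lag rather than by the interval between scheduled backups, which is why it can be far smaller than the stated target.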
As an analogy, imagine you’re a middle school math teacher grading exams. You have a full box of ungraded exams and an empty box to put the graded exams in. StreamSets’ approach to replication means that each time you complete an exam and set it in the “Done” box, a copy is sent to a second box in another location of your choice. If a fire breaks out and you’ve got to abandon your work, it will take a little while to set you up again at that other location (RTO), but very nearly all the work you’d previously done will be there waiting for you (RPO). Note: affected data would only include information describing your pipeline configurations since, with StreamSets’ architecture, all your pipeline data resides in your own cloud or on-premises environment.
StreamSets’ backup strategy ensures that customer configurations are continually replicated and rapidly restored if a data center fails. The StreamSets architecture, in which customer pipeline data is never provided to StreamSets, means that the customer is free to make their own decisions on how to ensure their own data’s availability.