The Ideal Archive

The Ideal Digital Archive

This post explores an ideal archive solution that balances redundancy, metadata, encryption, and cost. The goal is a solution that can handle complex data requirements and large storage volumes while remaining cost-effective.



The archival system has a lot to do. The control center (the user interface) simplifies data management, even when the data is offline or reachable only over a slow connection. It provides different relocation methods to improve access, security, and/or reliability.




The ideal solution should bring together the following:

  • Web-based UI to manage folder hierarchies, file attributes, and search.
  • Spanning across storage systems to combine their speed, cost efficiency, and flexibility:
    • Free cloud storage systems such as Google Drive and OneDrive.
    • Low-cost storage systems such as Backblaze and Wasabi.
    • USB and other direct-attached storage systems.
    • Cloud storage systems such as S3, Google Cloud Storage, Azure Storage, and some backup systems.
  • File encryption on the client rather than volume-based encryption. This introduces a range of key management challenges.

The following high-level features should cover everything needed from any archiving solution. More details are provided on other pages to describe their meaning and variations. Please comment on key requirements for your perfect archiving solution that aren't covered anywhere on this blog.

Redundancy

  • Control the number of copies.
  • Control independence in redundant copies (physical location, vendor, etc.).
  • Price scales according to the total storage volume.
  • Supports local hard disks for speed and $/GB
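
As a rough illustration of the first two points, the check below tests a file's copies against a redundancy policy. It is only a sketch: the `Copy` and `RedundancyPolicy` types, their fields, and the vendor and location names are hypothetical, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of a file (all values used below are illustrative)."""
    vendor: str     # e.g. "backblaze" or "local-usb"
    location: str   # e.g. "us-west" or "home-office"


@dataclass(frozen=True)
class RedundancyPolicy:
    min_copies: int      # how many copies must exist
    min_vendors: int     # how many distinct vendors must hold a copy
    min_locations: int   # how many distinct physical locations


def policy_violations(copies, policy):
    """Return a list of readable problems; an empty list means the policy is met."""
    problems = []
    if len(copies) < policy.min_copies:
        problems.append(f"need {policy.min_copies} copies, have {len(copies)}")
    vendors = {c.vendor for c in copies}
    if len(vendors) < policy.min_vendors:
        problems.append(f"need {policy.min_vendors} vendors, have {len(vendors)}")
    locations = {c.location for c in copies}
    if len(locations) < policy.min_locations:
        problems.append(f"need {policy.min_locations} locations, have {len(locations)}")
    return problems


if __name__ == "__main__":
    policy = RedundancyPolicy(min_copies=3, min_vendors=2, min_locations=2)
    copies = [Copy("local-usb", "home-office"), Copy("backblaze", "us-west")]
    print(policy_violations(copies, policy))
```

Counting distinct vendors and locations separately is one simple way to express "independence"; a real system would probably need finer distinctions, such as two accounts with the same vendor.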

Encryption

  • Encryption on the client (before upload)
  • Files are encrypted with separate keys.
  • Encryption can be controlled; data is protected for privacy by default.
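
One possible way to give every file its own key without storing thousands of keys is to derive them from a single master key. The sketch below does this with HKDF (RFC 5869) built from Python's standard `hmac` and `hashlib` modules. Deriving keys rather than generating and wrapping random ones is an assumption made for illustration, and `derive_file_key`, the `purpose` labels, and the `archive-sketch/` prefix are hypothetical names. The code only derives keys; the encryption itself would need a vetted cipher library.

```python
import hashlib
import hmac
import secrets


def hkdf_sha256(key_material: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF with SHA-256: extract a pseudorandom key, then expand it to `length` bytes."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output is too long for HKDF")
    if not salt:
        salt = b"\x00" * hash_len  # RFC 5869 default when no salt is given
    prk = hmac.new(salt, key_material, hashlib.sha256).digest()  # extract step
    output = b""
    block = b""
    counter = 1
    while len(output) < length:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        output += block
        counter += 1
    return output[:length]


def derive_file_key(master_key: bytes, file_id: str, purpose: str) -> bytes:
    """Derive a 32-byte key bound to one file and one purpose (hypothetical scheme)."""
    info = b"archive-sketch/" + purpose.encode("utf-8") + b"/" + file_id.encode("utf-8")
    return hkdf_sha256(master_key, info)


if __name__ == "__main__":
    master_key = secrets.token_bytes(32)  # in practice this must be stored safely
    data_key = derive_file_key(master_key, "photos/2019/img001.jpg", "data")
    meta_key = derive_file_key(master_key, "photos/2019/img001.jpg", "metadata")
    print(data_key != meta_key)
```

The `purpose` label also makes it possible to use a different key for a file's metadata than for its contents. The trade-off is that losing the master key loses every file, which is one of the key management challenges mentioned earlier.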

Standards

  • Where possible, rely on common tools or libraries to avoid lock-in to a particular technology/vendor.
  • Otherwise, rely on clearly described formats and protocols, preferably commonly used open standards.
  • At a minimum, data from the resulting archive should be accessible with a simple text editor and basic scripting.
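
To make the last point concrete, here is a minimal sketch of a manifest stored as JSON Lines: one JSON object per line, readable in any text editor and easy to process with basic scripting. The field names (`path`, `size`, `sha256`, `copies`) are assumptions chosen for illustration, not an established format.

```python
import hashlib
import json


def manifest_line(path: str, content: bytes, copies: list) -> str:
    """Build one manifest line describing a file and where its copies live."""
    record = {
        "path": path,
        "size": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),  # lets any tool verify the file
        "copies": copies,
    }
    return json.dumps(record, sort_keys=True)


def parse_manifest(text: str) -> list:
    """Read a manifest back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    lines = [
        manifest_line("notes/todo.txt", b"buy disks\n", ["local-usb", "backblaze"]),
        manifest_line("photos/img001.jpg", b"\xff\xd8\xff", ["backblaze"]),
    ]
    manifest = "\n".join(lines) + "\n"
    print(manifest)
    for record in parse_manifest(manifest):
        print(record["path"], record["size"])
```

Because each line stands alone, a manifest like this can be appended to, searched with ordinary text tools, or partially recovered if the file is truncated.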

Metadata

  • File discovery and navigation using attributes and tags
  • Metadata can be accessed independently of data.
  • Metadata on encrypted files is encrypted with a different key.
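
The sketch below shows what "metadata accessed independently of data" could look like in practice: a search over metadata records alone, with no need to fetch or decrypt the files themselves. The record layout (`path`, `tags`, `attributes`) and the `find_files` helper are hypothetical.

```python
def find_files(index, required_tags=(), **attributes):
    """Return paths of records that carry all required tags and matching attributes."""
    wanted = set(required_tags)
    matches = []
    for record in index:
        if not wanted <= set(record.get("tags", [])):
            continue  # at least one required tag is missing
        attrs = record.get("attributes", {})
        if all(attrs.get(key) == value for key, value in attributes.items()):
            matches.append(record["path"])
    return matches


if __name__ == "__main__":
    # Illustrative metadata index; the files themselves may be offline or encrypted.
    index = [
        {"path": "photos/2019/img001.jpg", "tags": ["family", "holiday"],
         "attributes": {"year": 2019}},
        {"path": "photos/2020/img107.jpg", "tags": ["family"],
         "attributes": {"year": 2020}},
        {"path": "docs/tax/2019.pdf", "tags": ["tax"],
         "attributes": {"year": 2019}},
    ]
    print(find_files(index, required_tags=["family"]))
    print(find_files(index, year=2019))
    print(find_files(index, required_tags=["family"], year=2019))
```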

Flexible Operations

  • Embeddable, so it can run without a dedicated server.
  • Asynchronous communication, so distributed systems can operate at different times.
  • Queue-based communication, so distributed systems can operate without direct connection.
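
Queue-based communication does not have to mean a message broker. As a sketch, a plain directory can act as the queue: each message is a small file, so the directory can sit on a USB disk or a synced folder and be processed whenever the other side comes online. The `enqueue` and `drain` helpers and the message fields are hypothetical.

```python
import json
import os
import tempfile
import time
import uuid
from pathlib import Path


def enqueue(spool: Path, message: dict) -> Path:
    """Write a message as its own file; the final rename makes it appear all at once."""
    spool.mkdir(parents=True, exist_ok=True)
    name = f"{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    tmp = spool / (name + ".tmp")  # readers ignore files that do not end in .json
    tmp.write_text(json.dumps(message), encoding="utf-8")
    final = spool / name
    os.replace(tmp, final)
    return final


def drain(spool: Path) -> list:
    """Read and delete all pending messages in file-name order (roughly arrival order)."""
    messages = []
    for path in sorted(spool.glob("*.json")):
        messages.append(json.loads(path.read_text(encoding="utf-8")))
        path.unlink()
    return messages


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        spool = Path(tmp_dir) / "outbox"
        enqueue(spool, {"action": "upload", "path": "photos/img001.jpg"})
        enqueue(spool, {"action": "verify", "path": "docs/tax/2019.pdf"})
        for message in drain(spool):
            print(message["action"], message["path"])
```

This sketch assumes a single consumer per directory; several consumers sharing one spool would need some form of claiming or locking.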

Process/Workflow/Collaboration

  • Extensible rule system to recognize patterns in files and folders and apply attributes accordingly.
  • Data collection rules to ask users to choose attributes for file groups before importing the data.
  • Approval-based workflows for folder groups, especially a workflow to support the archival (deletion) process. This can be used to ensure sufficient metadata is collected for files before they are encrypted and locally deleted.
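
A rule system of the kind described in the first point could start out as simply as a list of path patterns, each paired with the attributes to apply. The sketch below uses shell-style patterns from Python's standard `fnmatch` module; the rules and attribute names are invented for illustration.

```python
from fnmatch import fnmatchcase

# Each rule pairs a shell-style pattern with the attributes it applies.
# Note that "*" in fnmatch patterns also matches "/" characters.
RULES = [
    ("*.jpg", {"type": "photo"}),
    ("invoices/*", {"category": "finance", "retention_years": 7}),
    ("*/tax/*.pdf", {"category": "tax"}),
]


def attributes_for(path, rules=RULES):
    """Collect attributes from every matching rule; later rules override earlier ones."""
    attrs = {}
    for pattern, rule_attrs in rules:
        if fnmatchcase(path, pattern):
            attrs.update(rule_attrs)
    return attrs


if __name__ == "__main__":
    for path in ["photos/2019/img001.jpg", "invoices/2020/acme.pdf", "notes.txt"]:
        print(path, attributes_for(path))
```

Keeping rules as plain data makes the system extensible: new patterns can be added, or loaded from a text file, without changing the matching code.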

Other thoughts on ideal backup strategies:
  • https://www.backblaze.com/blog/the-3-2-1-backup-strategy/
