Data Den is a service for preserving electronic data generated from research activities. It is a low-cost, highly durable storage system and is the largest storage system operated by ARC.
Data Den is a disk-caching, tape-backed archive optimized for data that is not regularly accessed for extended periods of time (weeks to years). Data Den is not meant to replace active storage services like Turbo and Locker. It is best used for data that is set aside and is not being used at all, but still needs to be stored. Data Den is best accessed through the Globus data management sharing system to move data into and out of the service. Data Den is only available in a replicated format.
Data Den can be part of a well-organized data management plan providing international data sharing, encryption, and data durability.
Data Den service rates are $39.96 per replicated terabyte (TB) per year ($3.33/TB/month).
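As a rough budgeting aid, the rate above can be turned into a yearly and monthly estimate for a given volume. This is only a sketch: the 25 TB volume size is a hypothetical example, and the variable names are illustrative.

```shell
# Published rate: $39.96 per replicated TB per year
rate_per_tb_year=39.96
# Hypothetical volume size in TB (storage is requested in 1TB increments)
volume_tb=25

# The shell only does integer math, so use awk for the decimal arithmetic
annual_cost=$(awk -v tb="$volume_tb" -v rate="$rate_per_tb_year" 'BEGIN { printf "%.2f", tb * rate }')
monthly_cost=$(awk -v annual="$annual_cost" 'BEGIN { printf "%.2f", annual / 12 }')

echo "$volume_tb TB: \$$annual_cost per year, about \$$monthly_cost per month"
```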
Requesting or Modifying a Data Den Storage Volume
Visit the ITS Service Request System (SRS).
Please include the following:
- Amount of storage needed (in 1TB increments)
- MCommunity Group name (group members will receive service-related notifications and can request service changes)
- Numeric group ID of the group that will have access to files at the top-level directory.
- Numeric user ID of the person who will administer the top-level directory and grant access to other users.
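If you are unsure of the numeric IDs, the standard `id` command reports them for the account you are logged in as. This is a minimal sketch; it assumes you run it on a system that uses the same accounts and groups as the storage service, and the variable names are illustrative.

```shell
# Numeric user ID of the current account
my_uid=$(id -u)
echo "numeric user ID: $my_uid"

# Numeric ID of the current account's primary group
my_gid=$(id -g)
echo "primary group ID: $my_gid"

# Full summary, including every group the account belongs to with its numeric ID
id
```

The group that should own the volume may not be your primary group, so check the full `id` output for the group you intend to use.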
Globus Server Endpoint
Data Den supports the use of Globus servers to provide high performance transfers, data sharing and access to Data Den from off campus. To access Data Den via Globus, request your volume be added to Globus.
Because of the design of Data Den, projects will often need to be bundled to form larger single-file archives. The most common tool to do this with is tar. Tar can optionally compress the data as well, but compression can take much longer.
The following command will bundle all files in directory, store them in the file bundle.tar.bz2, and compress it with bzip2. It will also create a small text file, bundle.index.txt, that can be stored with it to quickly reference what files are in the bundle.
tar -cvjf bundle.tar.bz2 directory | tee bundle.index.txt
To extract the bundle:
tar -xvjf bundle.tar.bz2
Omit -j to save time compressing, and omit -v to not print the bundle progress as it runs. Note that without -v, tar prints no file names, so the index file will be empty.
Compressing an archive can be accelerated on multi-core systems using lbzip2. The following will work on all ARC systems:
tar --use-compress-program=lbzip2 -cvf bundle.tar.bz2 directory | tee bundle.index.txt
To extract the bundle:
tar --use-compress-program=lbzip2 -xvf bundle.tar.bz2
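Before uploading a bundle, it can be worth confirming that the index file matches what the archive actually contains. The sketch below assumes GNU tar, which prints the same member names for -cv and -t. It builds a tiny stand-in directory in a scratch location, and all file names in it are hypothetical. It skips compression so it runs anywhere; add -j to both tar commands for a bzip2 bundle.

```shell
# Work in a scratch directory with a tiny stand-in for a real project
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p directory/results
echo "sample data" > directory/results/run1.txt
echo "more data" > directory/notes.txt

# Bundle without compression and record the index of member names
tar -cvf bundle.tar directory | tee bundle.index.txt

# List the archive without extracting it and compare the listing to the index
if tar -tf bundle.tar | diff - bundle.index.txt > /dev/null; then
    bundle_status="index matches archive"
else
    bundle_status="index differs from archive"
fi
echo "$bundle_status"
```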
Storage Resource Software
If you are unsure which of our storage services should be used to host your data, we have written some software that you can download and execute to analyze your files to understand how much of your data is stored in large files, how much of your data has been accessed recently, and the distribution of file sizes and access times. Use the Storage Guidance Tool to find file statistics and get a storage recommendation.
This software doesn’t examine the contents of any data files; it merely scans file attributes. It also does not store any file names after searching through the filesystem.
If you have any questions on this software, please send us an email at email@example.com with your inquiry. If you are unsure about any of the recommendations the tool sends you, contact us at firstname.lastname@example.org with your inquiry.
Small File Limitation and Optimal Size
Data Den’s underlying technology does not work well with small files. Because of this design, each 1TB of Data Den capacity allows only 10,000 files, and only files of 100 MByte or larger are migrated. The optimal file size ranges from 10 to 200 GBytes.
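To judge whether a directory should be bundled first, one option is to count how many of its files fall below the 100 MByte migration threshold. This is a sketch only: it assumes 100 MByte means 100 × 1024 × 1024 bytes, and the scratch directory and file names are hypothetical stand-ins for a real project.

```shell
# Scratch tree standing in for a real project directory
workdir=$(mktemp -d)
mkdir -p "$workdir/directory"
printf 'small' > "$workdir/directory/a.txt"
printf 'also small' > "$workdir/directory/b.txt"

# Migration threshold in bytes (assumed binary megabytes)
threshold=$((100 * 1024 * 1024))

# Count all regular files, then those under the threshold;
# the "c" suffix makes find compare exact byte counts without rounding
total_files=$(find "$workdir/directory" -type f | wc -l)
small_files=$(find "$workdir/directory" -type f -size -"${threshold}c" | wc -l)

echo "$small_files of $total_files files are under 100 MByte"
```

If most files are below the threshold, or the total count approaches 10,000 per TB, bundling with tar before uploading is likely the better choice.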
Maximum File Size
The maximum file size is 8 TByte, but files should optimally not be larger than 1 TByte. Larger archives can be split before uploading to Data Den with the split -b 200G filename command.
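A split archive has to be reassembled after it is retrieved. The following sketch shows the round trip on a small stand-in file. The piece size of 300k and the `bundle.tar.part-` prefix are illustrative assumptions chosen so the example runs quickly; for a real archive, use -b 200G as described above.

```shell
# Scratch directory with a 1 MiB stand-in for a large archive
workdir=$(mktemp -d)
cd "$workdir"
head -c 1048576 /dev/urandom > bundle.tar

# Split into fixed-size pieces named bundle.tar.part-aa, -ab, ...
split -b 300k bundle.tar bundle.tar.part-

# After retrieval, concatenate the pieces; the shell expands the glob in name order
cat bundle.tar.part-* > restored.tar

# Confirm the reassembled file is byte-for-byte identical to the original
if cmp -s bundle.tar restored.tar; then
    split_status="identical"
else
    split_status="different"
fi
echo "$split_status"
```

Keeping a checksum of the original archive alongside the pieces allows the same comparison to be made later, when the original is no longer on disk.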
Sensitive Data — ePHI/HIPAA/ITAR/EAR/CUI
Data Den does not currently support PHI or other sensitive data types. It is scheduled to be reviewed for support at a later date.
Abuse of Data Den, whether intentional or not, may result in performance or access being limited to preserve performance and access for other users. Abuse generally takes the form of excessive recalls of data that is better suited to active storage. If this happens, staff will contact the affected users to engineer more appropriate solutions.
Frequently Asked Questions
Q: How do I check Data Den space and file usage?
A: Contact email@example.com
Q: How can I use Data Den in my data management plan?
A: Data Den can provide low cost storage and data sharing with Globus to allow access to select data on Data Den anywhere in the world on demand.
Q: Why can’t Data Den be accessible directly from hosts or my personal machine?
A: Data Den can appear as a local disk, but this is not currently allowed because it presents many opportunities to disrupt use of the service by other users. This will be reevaluated in the future.
Q: Can Data Den encryption be used with my restricted data?
A: Always refer to the Sensitive Data Guide as the definitive source for allowed data types. Data Den is slated to support sensitive data of some types but is not currently approved. If your data type is not in the Guide, contact us to review.
Q: Why can’t we use Data Den as general purpose storage?
A: Data Den’s design is inherently linear and cannot respond to user requests quickly, the way general purpose storage does.
Q: I need to retrieve a lot of files (more than 50) from Data Den. Can you help me?
A: Yes, contact firstname.lastname@example.org with the list of files you want to recall, and we will optimize the recall from tape, speeding access.
Q: I use Locker heavily and want to automate moving data as it ages to my Data Den space. Can you help me?
A: Yes. Data Den and Locker are inherently connected. Any Locker volume can potentially have a custom data migration policy automating data flow to Data Den. Contact email@example.com to set up a custom migration policy.
Data Den consists of two IBM Spectrum Archive LTFSEE clusters. Each cluster is connected to a corresponding Locker cluster. Each LTFSEE cluster consists of an IBM TS4500 tape library with 20PB of uncompressed capacity and 9 Linear Tape Open (LTO) 8 drives. Each drive provides up to 360 MByte/s of uncompressed performance. Each library can also accept additional frames and drives for new tape technology, increasing capacity and performance.
Each cluster holds a copy of all data, and the clusters are separated by miles, protecting the data against fire and other disasters. Each cluster has 2 data movers. The purpose of these data movers is to apply policies to specific exports in Locker. As data are written to these exports, the data mover moves the largest and oldest files to tape in the TS4500 library, encrypting them in the process. The data mover leaves a 1 MByte stub, preserving the file layout and allowing the first 1 MByte of data to be read without involving the tape. When a migrated file is read, the data mover instructs the TS4500 to recall the data from tape and place it back on the faster disk of Locker, transparently to the user.
Thus Data Den is able to ingest new data quickly to disk and, over time, send it to tape as it ages and goes unused. If data is recalled, it remains on disk until it is migrated again by policy.