Data Den is a service for preserving electronic data generated from research activities. It is a low-cost, highly durable storage system and is the largest storage system operated by ARC.
Storing of sensitive data (including HIPAA and FERPA) is now supported (visit the Sensitive Data Guide for full details). This service is part of the U-M Research Computing Package (UMRCP) that provides storage allocations to researchers. Most researchers will not have to pay for Data Den.
Data Den is a disk-caching, tape-backed archive optimized for data that is not regularly accessed for extended periods of time (weeks to years). Data Den can hold research data for many years, which helps meet grant requirements, particularly those of NIH and NSF.
Data Den is not meant to replace active storage services like Turbo and Locker. It is best used for data that has been set aside and is no longer in use, but still needs to be stored. The best way to move data into and out of Data Den is the Globus data management and sharing system. Data Den is only available in a replicated format.
Data Den can be part of a well-organized data management plan providing international data sharing, encryption, and data durability.
Data Den Rates
- 100TB of Data Den storage is available as part of the no-cost U-M Research Computing Package (UMRCP). Additional allocation is available for a fee as noted below.
- Data Den service rates are $20.04 per replicated terabyte (TB) per year ($1.67/TB/month).
- Starting July 1, 2023, the new LSA Research Storage Funding Model will provide unlimited Data Den Research Archive Storage. Please see the LSA Research Storage Portal to learn more.
- Not sure if this is the right storage service for you? Try the ITS Data Storage Finder to find storage solutions to meet your needs.
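As a rough illustration of the rates listed above, the sketch below estimates the yearly charge for an allocation. The function name dataden_annual_cost is hypothetical. It assumes the first 100TB are covered by UMRCP and every additional terabyte is billed at $20.04 per year; LSA researchers and other funding arrangements may differ.

```shell
# Hypothetical helper (not an ARC tool): estimate the yearly Data Den charge.
# Assumes the first 100 TB are free under UMRCP and the remainder costs
# $20.04 per replicated TB per year, as listed above.
dataden_annual_cost() {
    awk -v tb="$1" 'BEGIN {
        free_tb = 100
        rate = 20.04
        billable = (tb > free_tb) ? tb - free_tb : 0
        printf "%.2f\n", billable * rate
    }'
}

dataden_annual_cost 150   # 50 billable TB at $20.04 each
```

For a 150TB allocation this prints 1002.00, which is 50 billable terabytes at $20.04 each.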
Getting Started – UPDATED 2023
- Project admins can use the Portal to update user access lists
- In Globus (Globus.org), use the collection UMich ARC Non-Sensitive Data Den Volume Collection; your path will be /<volume>
- For Active Archive users who have archives mounted on ARC cluster login nodes, the path will be /nfs/dataden/<volume>
Requesting or Modifying a Data Den Storage Volume
Most researchers should start with the U-M Research Computing Package rather than these instructions.
Visit the ITS Service Request System (SRS).
Please include the following:
- Amount of storage needed (in 1TB increments)
- MCommunity Group name (group members will receive service-related notifications, and can request service changes)
- Numeric group ID of the group that will have access to files at the top level directory.
- Numeric user ID of person who will administer the top level directory and grant access to other users.
Globus Server Endpoint
Data Den supports the use of Globus servers to provide high performance transfers, data sharing and access to Data Den from off campus. To access Data Den via Globus, request that your volume be added to Globus. The Globus collection for non-sensitive data is UMich ARC Non-Sensitive Data Den Volume Collection.
Because of the design of Data Den, projects often need to be bundled into larger single-file archives. The most common tool for this is tar. Tar can optionally compress the data as well, though this can take much longer. We also recommend archivetar, which is available on the ARC HPC clusters and as a container for Linux systems.
On the ARC HPC systems, archivetar will do the needed tar bundling and upload to Data Den. The following commands will sort data into tar files of 100GB each and upload them to the selected folder on Data Den using Globus. You can also use tar, zip or other commands manually.
cd <folder to archive>
module load archivetar
archivetar --prefix my-archive --dryrun
archivetar --prefix my-archive --destination-dir /<dataden-volume>/<folder>
The following tar command will bundle all files in directory, store them in the file bundle.tar.bz2, and compress it with bzip2. It will also create a small text file, bundle.index.txt, that can be stored with the bundle to quickly reference what files are in it.
tar -cvjf bundle.tar.bz2 directory | tee bundle.index.txt
To extract the bundle:
tar -xvjf bundle.tar.bz2
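Omit -j to save time by skipping compression, and omit -v to not print the bundle progress as it runs. Note that the index file is captured from the -v output, so it will be empty if -v is omitted.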
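Before deleting the original files, it can be worth confirming that the bundle lists the same entries as the index. The sketch below assumes GNU tar, which prints the -v listing to standard output when the archive is written to a file. The function name verify_bundle_index is hypothetical.

```shell
# Hypothetical helper: check that an index file matches the archive contents.
# Assumes GNU tar and an index created with "tar -cvjf ... | tee <index>".
verify_bundle_index() {
    bundle=$1   # e.g. bundle.tar.bz2
    index=$2    # e.g. bundle.index.txt
    # -t lists the archive members without extracting them
    if tar -tjf "$bundle" | diff - "$index" > /dev/null; then
        echo "index matches archive"
    else
        echo "index differs from archive" >&2
        return 1
    fi
}
```

Running verify_bundle_index bundle.tar.bz2 bundle.index.txt reads the whole archive, so on large bundles it takes about as long as an extraction.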
Compressing an archive can be accelerated on multi-core systems using lbzip2. The following will work on all ARC systems:
tar --use-compress-program=lbzip2 -cvf bundle.tar.bz2 directory | tee bundle.index.txt
To extract the bundle:
tar --use-compress-program=lbzip2 -xvf bundle.tar.bz2
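On Linux systems outside ARC, lbzip2 may not be installed. A minimal sketch of falling back to plain bzip2, which reads and writes the same format, might look like the following. The function names pick_bzip2 and bundle_dir are hypothetical.

```shell
# Hypothetical helpers: prefer lbzip2 (multi-threaded) and fall back to bzip2.
pick_bzip2() {
    if command -v lbzip2 > /dev/null 2>&1; then
        echo lbzip2
    else
        echo bzip2
    fi
}

# Usage: bundle_dir <directory> <name>
# Writes <name>.tar.bz2 and <name>.index.txt in the current directory.
bundle_dir() {
    tar --use-compress-program="$(pick_bzip2)" -cvf "$2.tar.bz2" "$1" | tee "$2.index.txt"
}
```

For example, bundle_dir directory bundle would produce bundle.tar.bz2 and bundle.index.txt using whichever compressor is available.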
Storage Resource Software
If you are unsure which of our storage services should be used to host your data, we have written some software that you can download and execute to analyze your files to understand how much of your data is stored in large files, how much of your data has been accessed recently, and the distribution of file sizes and access times. Use the Storage Guidance Tool to find file statistics and get a storage recommendation.
This software doesn’t examine the contents of any data files; it merely scans file attributes. It also does not store any file names after searching through the filesystem.
If you have any questions on this software, send an email to email@example.com. If you are unsure about any of the recommendations the tool sends you, contact us at firstname.lastname@example.org.
Small File Limitation and Optimal Size
Data Den’s underlying technology does not work well with small files. Because of this design, each 1TB of Data Den capacity allows only 10,000 files, and only files of 100 MByte or larger are migrated. The optimal file size ranges from 10 to 200 GBytes.
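To gauge how well a directory fits these limits before uploading, something like the following can count the files below the migration threshold and compute the file allowance for an allocation. Both function names are hypothetical, and treating 100 MByte as 100,000,000 bytes is an assumption.

```shell
# Hypothetical helper: count files under a directory that are smaller than a
# threshold (default 100,000,000 bytes; decimal units are an assumption).
count_small_files() {
    dir=$1
    threshold_bytes=${2:-100000000}
    # With the "c" suffix find compares exact byte counts; "-N" means fewer than N
    find "$dir" -type f -size "-${threshold_bytes}c" | awk 'END { print NR }'
}

# File allowance implied by "10,000 files per 1TB of capacity".
file_allowance() {
    echo $(( $1 * 10000 ))
}
```

If count_small_files reports a large number for a project, bundling those files with tar before upload keeps the file count within the allowance.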
Maximum File Size
The maximum file size is 8 TByte, but files should ideally be no larger than 1 TByte. Larger archives can be split before uploading to Data Den with the split -b 200G filename command.
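A split archive has to be rejoined in order after it is retrieved. The sketch below assumes GNU coreutils (split, cat, sha256sum) and records a checksum before splitting so that the rejoined file can be verified. The function names and the .part. suffix are illustrative choices, not ARC conventions.

```shell
# Hypothetical helper: record a checksum, then split into fixed-size pieces
# named <file>.part.aa, <file>.part.ab, ...
split_with_checksum() {
    file=$1
    size=${2:-200G}
    sha256sum "$file" > "$file.sha256" &&
    split -b "$size" "$file" "$file.part."
}

# Hypothetical helper: rejoin the pieces and verify against the checksum.
# The shell expands the glob in sorted order, which matches split's suffixes.
reassemble() {
    file=$1
    cat "$file".part.* > "$file" &&
    sha256sum -c "$file.sha256"
}
```

The pieces and the .sha256 file would be uploaded together; after retrieval, reassemble is run in the directory that holds them.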
Sensitive Data — ePHI/HIPAA/ITAR/EAR/CUI
Data Den now supports sensitive data, including HIPAA data, PHI, and other sensitive data types. Refer to the Sensitive Data Guide for full details.
Abuse of Data Den, whether intentional or not, may result in performance or access being limited in order to preserve performance and access for other users. Abuse generally takes the form of excessive recalls of data better suited to active storage. If this happens, staff will contact the affected users to engineer more appropriate solutions.
Frequently Asked Questions
Q: How do I check Data Den space and file usage?
A: Contact email@example.com
Q: I’m getting Disk Quota Exceeded when uploading multiple TB of data, but I’m well below my quota. Why?
A: This means you overran the allocated cache space for your volume. This space size is estimated for how much new data would be ingested in 48 hours. Leave the transfer running and contact firstname.lastname@example.org. We will increase the cache space for the ingest and Globus will automatically resume the transfer where it left off.
Q: How can I use Data Den in my data management plan?
A: Data Den can provide low-cost storage, and data sharing with Globus allows on-demand access to select data on Data Den from anywhere in the world. Most researchers can get Data Den at no cost through UMRCP.
Q: Why can’t Data Den be accessible directly from hosts or my personal machine?
A: Data Den can appear as a local disk, but this is not currently allowed in most cases because it presents many opportunities to disrupt use of the service by other users. Anyone who has more than 100TB of data and wishes to automate their data management may be interested in Data Den Active Archive and should contact email@example.com.
Q: Can Data Den encryption be used with my restricted data?
A: Always refer to the Sensitive Data Guide as the definitive source for allowed data types. Data Den now supports sensitive data.
Q: Why can’t we use Data Den as general purpose storage?
A: Data Den’s design is inherently linear and cannot respond to user requests as quickly as general purpose storage does. Those with massive data volumes (100TB+) may find Active Archive a good solution to control storage costs while keeping a portion of their data active and fast.
Q: I need to retrieve a lot (more than 50) of files from Data Den. Can you help me?
A: Yes, contact firstname.lastname@example.org with the list of files you want to recall, and we will optimize the recall from tape, speeding access.
Q: I use Locker heavily and want to automate moving data as it ages to my Data Den space. Can you help me?
A: Yes. Data Den and Locker are inherently connected. Any Locker volume can potentially have a custom data migration policy automating data flow to Data Den via Active Archive. Contact email@example.com to set up a custom migration policy.
Data Den consists of two IBM Spectrum Archive LTFSEE clusters. Each cluster is connected to a corresponding Locker cluster. Each LTFSEE cluster consists of an IBM TS4500 tape library with 20PB of uncompressed capacity and 9 Linear Tape Open (LTO) 8 drives. Each drive provides up to 360 MByte/s of uncompressed performance. Each library can also accept additional frames and drives for new tape technology, increasing capacity and performance.
Each cluster holds a copy of all data, and the clusters are separated by miles, protecting the data against fire and other disasters. Each cluster has 2 data movers, whose purpose is to apply policies to specific exports in Locker. As data are written to these exports, the data mover moves the largest and oldest files to tape in the TS4500 library, encrypting them in the process. The data mover leaves a 1 MByte stub, preserving the file layout and allowing the first 1 MByte of data to be read without involving the tape. When a migrated file is read, the data mover instructs the TS4500 to recall the data from tape and place it back on the faster disk of Locker, transparently to the user.
Thus Data Den is able to ingest new data quickly to disk and, over time, send it to tape as it ages and goes unused. Recalled data remains on disk until policy migrates it again.
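For a sense of scale, the per-drive rate quoted above gives a lower bound on how long a recall takes once the tape is mounted and positioned. This back-of-the-envelope helper (hypothetical name recall_minutes) assumes a single drive streaming at the full 360 MByte/s, decimal units, and no mount or seek time, so real recalls will be slower.

```shell
# Hypothetical helper: minimum streaming time in minutes for a file of the
# given size in GBytes, assuming one LTO-8 drive at 360 MByte/s and
# 1 GByte = 1000 MByte. Mount and seek time are ignored.
recall_minutes() {
    awk -v gb="$1" 'BEGIN { printf "%.1f\n", gb * 1000 / 360 / 60 }'
}

recall_minutes 200    # a file at the top of the optimal size range
recall_minutes 1000   # a 1 TByte file
```

Under these assumptions a 200 GByte file streams in about 9.3 minutes and a 1 TByte file in about 46.3 minutes, which is one reason files larger than 1 TByte are discouraged.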