Broadly speaking, data management refers to the practices that researchers put in place to keep their data secure, organized, and well-documented. Although data management practices are most relevant during the data collection and analysis portions of the research lifecycle, it is important to think about your project's data management needs during the planning stages. Management practices are much easier for both individuals and teams to implement if they have been thought through and discussed beforehand.
The boxes below contain information and resources on some of the most important areas to consider for your own work.
Use the table below to compare data storage options available to UK researchers. You can download the table as a spreadsheet using the link below.
Storage | SharePoint and OneDrive | LabArchives | LCC/MCC Scratch Space Storage | Network Attached Storage (Gemini) Paid Condo Storage | UKY Tape Archive | OURRstore Tape Archival | OSN (Open Storage Network) | Google Drive | Dropbox | Box | AWS Glacier |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Purpose | General document and data storage | Electronic research notebook for collaboration, notes, and data management | Mounted to the HPC cluster; short-term scratch space for projects that do not need to keep data long-term | Stand-alone on-site disk storage for researchers needing terabytes of space | On-site magnetic tape storage for long-term data preservation or creating a backup | Magnetic tape storage in two copies, with one copy kept at the University of Oklahoma and the other sent to the researcher | Stand-alone object store for large amounts of data | General document and data storage | General document and data storage | General document and data storage | Object storage for large amounts of infrequently accessed data |
Can data be made accessible to non-UK researchers? | Yes | Yes | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
Quota (individual) | 5TB to 25TB | Unlimited number of notebooks; individual attached files must be under 4GB | 25TB | Minimum purchase quota is 100TB | Based on quota purchased | Based on number of tapes purchased | 10TB | 15GB | 2GB (free version) | 10GB (free version) | Based on quota purchased |
Graphical user interface (GUI) for editing data | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes | No |
Cost (per TB) | Free for UK researchers | Free for UK researchers | Free (see quotas) | Paid service (condo) | Paid service | $12/TB for 2 copies | Free | Free | Free | Free | Paid |
Good | Syncs easily with UK devices through OneDrive | Good for setting up workflows and protocols for research teams and organizing multiple team members' data | Convenient for HPC users | Highly cost-effective option for large amounts of data | On-site storage option for preserving data | Cheap option for backing up massive quantities of data | Easy to get access (with a short proposal); available to researchers at other institutions | Smooth, intuitive web GUI; familiar to most researchers | Smooth, intuitive web GUI; familiar to most researchers | Smooth, intuitive web GUI; familiar to most researchers | Relatively inexpensive storage for large amounts of data |
Bad | Slow data transfer speeds | Intended for notebook attachments rather than general data storage; maximum file size for upload is 16GB | Short-term only | Difficult to learn for new users; lacks reliable backups of data | Limited support; not ideal for actively managed data | Initial set-up (tape purchase, tutorial session, etc.) can be confusing | Storage is available only for the duration of the active proposal, not long-term | Low amount of free storage | Low amount of free storage | Low amount of free storage | Users must purchase on their own; egress charges are high |
Where to Learn More | UK ITS | UK OVPR | UK CCS | UK CCS | UK ITS | OURRstore | NSF | UK ITS (set up your UK Google Workspace) | Dropbox | Box | AWS |
This table was adapted from one made by Mami Hayashida.
For document storage options at UK using Microsoft services, see this guide from UK ITS.
For more detailed information on these storage options, see this guide from the Center for Computational Science.
Accidents and unfortunate events happen, but they don't need to ruin your research. Taking proactive steps to keep your data safe is a crucial part of any project.
The 3-2-1 rule is a guideline for data security that recommends maintaining 3 copies of your data on 2 different forms of media, with 1 copy stored off-site. The scope of your work should determine the degree of intensity of your backup system. For a class project, storing a copy of your files in a cloud service such as OneDrive will likely be sufficient, but grant-funded research should adhere to the 3-2-1 rule.
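To make this concrete, here is a minimal Python sketch of a 3-2-1 workflow that copies a working file to an external drive and a cloud-synced folder. All paths and file names are hypothetical placeholders; adapt them to your own storage locations.

```python
import shutil
from pathlib import Path

# Hypothetical locations: adjust these to your own setup.
working_copy = Path("C:/projects/study/data/survey_results.csv")   # copy 1: working copy
external_drive = Path("E:/backups/study")                          # copy 2: second medium (external drive)
cloud_folder = Path("C:/Users/researcher/OneDrive/backups/study")  # copy 3: off-site (synced to the cloud)

for destination in (external_drive, cloud_folder):
    destination.mkdir(parents=True, exist_ok=True)              # create the backup folder if needed
    shutil.copy2(working_copy, destination / working_copy.name)  # copy2 also preserves timestamps
```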
When making a plan for backing up your data, consider how often backups will be made, whether they will happen automatically or require manual steps, where each copy will be stored, how long backups need to be kept, and how you will periodically test that your files can actually be restored.
There are a variety of options for backing up your data depending on your needs, including external hard drives, cloud storage services such as OneDrive, university-managed storage (see the comparison table above), and tape archives.
These considerations are adapted from "Chapter 8: Storage and Backups" of Research Data Management for Researchers by Kristin Briney.
When developing data storage procedures, researchers collecting data from human subjects should consult the University of Kentucky Institutional Review Board's Confidentiality and Data Security Guidelines for Electronic Data to ensure they are in compliance. Other important UK and legal policy information regarding data security, privacy, and confidentiality can be found in the IRB Survival Handbook.
Depending on how much data you are generating or collecting, the combination of your data, code, and outside sources can quickly become unmanageable as your research progresses. A key file might get lost in a folder with hundreds of unrelated items, you might forget what the variable names you chose actually mean, or you might need to reproduce a particular series of steps for data analysis but aren't sure what you did the first time. Following best practices in organizing and naming your files and folders can make it easier for you and your collaborators to find the files you need quickly.
Consistent file naming conventions make it easier to locate specific files quickly. The table below shows some best practices in file naming.
Practice | Reasoning | Example |
--- | --- | --- |
Start file names with the date written in YYYYMMDD format | Dates at the start of file names make them easy to sort chronologically. | Use 20230614 to represent June 14, 2023 |
For version numbers, include leading zeros | Without leading zeros, operating systems will not sort files numerically. Computers sort file_v1.txt, file_v10.txt, and file_v2.txt in that order. | Name files file_v01.txt, file_v02.txt, and file_v10.txt so they sort in numerical order (see the sketch after this table) |
Separate terms using underscores or camel case rather than spaces or special characters | Spaces and special characters like / : \ ) ( # % * ? " \| are not always read correctly by operating systems or programs. | Use water_quality_data.txt or waterQualityData.txt rather than water quality data.txt |
Keep file names descriptive but brief | Lengthy file names become difficult for humans to read and may result in file paths that are too long for operating systems to process. Using abbreviations can be helpful, but be sure to document their meaning. | Use 20230614_mwd_data_v01.txt rather than 20230614_MetropolitanWaterDistrict_data_v01.txt |
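The sorting problem described above is easy to see, and avoid, in code. The sketch below is a minimal example of building names that follow these conventions using Python's date formatting and zero-padded version numbers; the project abbreviation is a hypothetical placeholder.

```python
from datetime import date

def make_file_name(project: str, version: int, extension: str = "txt") -> str:
    """Build a name like 20230614_mwd_data_v01.txt (project abbreviation is hypothetical)."""
    return f"{date.today():%Y%m%d}_{project}_data_v{version:02d}.{extension}"

# Without leading zeros, alphabetical sorting puts v10 before v2:
print(sorted(["data_v1.txt", "data_v10.txt", "data_v2.txt"]))
# ['data_v1.txt', 'data_v10.txt', 'data_v2.txt']

# With leading zeros, the sorted order matches the numeric order:
print(sorted(["data_v01.txt", "data_v02.txt", "data_v10.txt"]))
# ['data_v01.txt', 'data_v02.txt', 'data_v10.txt']
```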
It is much easier to develop a file naming structure before data collection and analysis begins rather than attempting to rename your files in the middle of your work. Kristin Briney has created a File Naming Convention Worksheet that can help you consider your needs and develop an appropriate naming convention.
It's best practice to keep data files simple to make them easier to analyze, but that also means important information about the data may not be readily available. A data dictionary is a table or document that describes the variables used in a dataset. For each variable, a data dictionary might include the variable's name, a plain-language definition, the data type, units of measurement, and the range of allowed values.
Use the link below to download a worksheet (developed by Kristin Briney) for brainstorming a data dictionary for one of your datasets.
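If you prefer to keep a data dictionary alongside your data as a plain file, a simple CSV works well. The sketch below is a minimal example whose columns mirror the elements listed above; the variables describe a hypothetical water-sampling dataset.

```python
import csv

# Hypothetical variables from a water-sampling dataset.
data_dictionary = [
    {"variable": "sample_id", "definition": "Unique identifier for each water sample",
     "type": "string", "units": "", "allowed_values": "S0001-S9999"},
    {"variable": "collected_on", "definition": "Date the sample was collected",
     "type": "date", "units": "YYYYMMDD", "allowed_values": "valid dates"},
    {"variable": "ph", "definition": "Acidity of the sample",
     "type": "float", "units": "pH", "allowed_values": "0-14"},
]

# Write the dictionary next to the data so it travels with the dataset.
with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=data_dictionary[0].keys())
    writer.writeheader()
    writer.writerows(data_dictionary)
```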
Sub-folders (or sub-directories) can keep you from confusing similarly named files by separating different phases of your work. Consider the different categories of the files that you are working with to determine how best to sort them into different folders. One common division is to maintain different folders for raw data, cleaned data, and data visualizations (such as charts and tables).
When creating sub-folders, be careful not to create too many divisions between your files. A project folder containing dozens of sub-folders that only contain two or three files each will not be very easy to navigate. Additionally, nesting too many folders inside one another can create lengthy file paths which may prevent you from opening your files.
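One way to apply the raw/cleaned/visualizations division described above is to create the folders programmatically at the start of each project, so every project follows the same layout. A minimal sketch, where the project and folder names are just one reasonable choice:

```python
from pathlib import Path

project = Path("water_quality_study")  # hypothetical project name

# One common division: raw data, cleaned data, and visualizations,
# plus a place for documentation.
for sub_folder in ("data_raw", "data_cleaned", "figures", "docs"):
    (project / sub_folder).mkdir(parents=True, exist_ok=True)
```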
A README file is a single file that describes your data and its technical, interpretive, and analytical requirements. It's a form of documentation, usually a simple text file, that serves as a guide to your data. README files are considered a general best practice for data management.
At a minimum, it is a good idea to create a README file for your project data as a whole, but if you are working with a large amount of data, consider using them in your sub-folders as well to describe portions of your work.
Cornell University's Research Data Services Group has created useful guidelines for writing README style metadata, which can be helpful in creating README files at various levels of granularity in your project folder.
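A README doesn't require any special tooling; a short text file is enough. The sketch below writes a minimal skeleton whose headings are one reasonable starting point, not a fixed standard; adapt them to the Cornell guidelines linked above.

```python
from pathlib import Path

# A bare-bones README skeleton; fill in each heading by hand.
readme_template = """\
Project title:
Principal investigator / contact:
Date range of data collection:

Description of the data:

File list and naming convention:

Software needed to open or analyze the files:

Known issues or limitations:
"""

Path("README.txt").write_text(readme_template)
```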
Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage data. Metadata is key to ensuring that data will survive and continue to be accessible in the future. Typically, it is helpful to use a defined metadata standard to describe your research.
A few popular standards can be found on this page, but there may be others that are more specific to your research.
Some basic metadata elements include a title, the name of the creator, dates of creation and modification, a description of the content, subject keywords, and a unique identifier.
While you can always choose to create your own metadata based on what is most relevant to your project, it can also be helpful to use an existing metadata standard. Because standards are used by a very large community of practitioners, implementing them in your own work can help make your data more discoverable once you share it.
Below are some common metadata standards. If you are planning to deposit your final dataset in a repository, check to see if that repository prefers a particular standard.
Darwin Core
The Darwin Core is a body of standards intended to facilitate the sharing of information about biological diversity.
Data Documentation Initiative (DDI)
The Data Documentation Initiative (DDI) is an international standard for describing statistical and social science data.
Dublin Core (DC)
The Dublin Core is a simple, easy-to-use vocabulary of fifteen properties for use in resource description (see the sketch after this list).
Federal Geographic Data Committee (FGDC)
The FGDC maintains the Content Standard for Digital Geospatial Metadata, the content standard for describing digital geospatial data.
Integrated Taxonomic Information System (ITIS)
ITIS provides authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world.
Text Encoding Initiative (TEI)
The TEI is a standard for the representation of texts in digital form.
Visual Resources Association Core (VRA)
The VRA Core is a standard for the description of works of visual culture and the images that document them.
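To make this concrete, here is a sketch of what a small Dublin Core record might look like for a dataset, represented as a simple Python dictionary. The field names are genuine Dublin Core properties, but all of the values (including the DOI) are hypothetical placeholders; repositories usually collect these fields through their own forms.

```python
# A hypothetical dataset described with a subset of Dublin Core's fifteen properties.
dublin_core_record = {
    "title": "Metropolitan Water District Sampling Data, 2023",
    "creator": "Doe, Jane",
    "date": "2023-06-14",
    "description": "Water quality measurements collected at 40 sites.",
    "subject": "water quality; environmental monitoring",
    "identifier": "https://doi.org/10.xxxx/xxxxx",  # placeholder DOI
    "format": "text/csv",
    "rights": "CC BY 4.0",
}
```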
As you transform your data to produce new insights, it is important to document your process so that other researchers (including yourself in the future) can understand the steps you took, evaluate them, and perhaps make changes to produce different results. This page offers guidance on incorporating documentation practices into your habitual processes.
Some tools are better equipped than others to make documenting analysis a natural part of your workflow. Writing scripts in a programming language such as R or Python to transform your data means that the code in your script naturally serves as a step-by-step record of everything you did (and can be further improved with comments in your code). If you are analyzing data in a spreadsheet program like Excel, however, it can be more difficult to keep track of your transformation steps, especially for complicated processes.
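For example, a short, commented script makes every transformation step explicit and repeatable. The sketch below assumes the pandas library; the file names, folder names, and column names are hypothetical, and the cleaning steps are purely illustrative.

```python
import pandas as pd

# Step 1: load the raw data (never edit the raw file itself).
raw = pd.read_csv("data_raw/20230614_mwd_data_v01.csv")

# Step 2: drop rows with no recorded measurement.
cleaned = raw.dropna(subset=["measurement"])

# Step 3: standardize the site-name column to lowercase.
cleaned["site"] = cleaned["site"].str.lower()

# Step 4: save the cleaned copy to a separate folder, with a new version number.
cleaned.to_csv("data_cleaned/20230614_mwd_data_v02.csv", index=False)
```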
Version control refers to the process of keeping track of what changes have been made to your files. It can be particularly important when working as part of a team.
At a minimum, save a copy of your original, unedited data. Perform your analysis on a separate copy of the file so that you can return to the original if need be. Consider saving copies of your transformed data at other important checkpoints, such as the completion of your cleaning process.
If you anticipate saving separate versions of your data at many different points, consider using a version control system such as Git. Instructions for installing Git and an explanation of a sample workflow can be found in this tutorial on GitHub for Version Control.
Even if you are not working in a discipline that uses research notebooks in a formalized manner, they can still provide a useful means of documenting elements of your research process that may be important in the future, such as decisions made during data collection and cleaning, the reasoning behind your choice of methods, unexpected results, and ideas you want to revisit later.
Your notebook can be as simple as a document where you regularly add date-stamped entries with any thoughts on your process. It can be useful later for recalling a specific choice that you haven't documented elsewhere, providing a narrative writeup of your process in an article, or reflecting on your research practices.
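If you keep your notebook as a plain text file, even the date-stamping can be automated. A minimal sketch, where the notebook file name and the sample entry are hypothetical:

```python
from datetime import date

def add_entry(text: str, notebook: str = "research_notebook.txt") -> None:
    """Append a date-stamped entry to a plain-text notebook."""
    with open(notebook, "a") as f:
        f.write(f"\n{date.today():%Y-%m-%d}\n{text}\n")

add_entry("Re-ran the cleaning script after fixing the site-name typo; results unchanged.")
```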