Archiving
Preparing for the future
Objectives
Learn how to backup data and code so others can understand and your future self can understand.
Documenting and backing up code
Documenting and backing up code and data are crucial practices in genomics, data science, and other fields that involve programming or handling large amounts of data. Here’s why:
Traceability and reproducibility: Clear documentation ensures that every piece of code, data, or computational process is traceable and reproducible. This is especially important when you want to replicate an analysis or experiment, debug an issue, or understand the progression and changes in the code or data over time.
Knowledge sharing and collaboration: Documentation allows others to understand your work, making it easier for collaboration. It also enables knowledge transfer when team members change over time. Code without documentation is often useless to others and sometimes even to the person who originally wrote it after some time has passed.
Professionalism and quality assurance: Good documentation is a sign of professionalism and maturity in software development. It shows that the code or data has been well thought out and tested, and it allows for quality assurance processes such as code reviews and audits.
Disaster recovery: Backups are essential for disaster recovery. If a server fails, or data gets corrupted, having a backup means you can restore the code or data and continue with minimal disruption. Backups also protect against accidental deletions and malicious actions.
Here’s how to document and backup effectively:
Comment your code: Write clear comments in your code explaining what specific sections or lines of code are doing. Wherever possible, use self-explanatory variable and function names.
Write documentation: Use readme files, wikis, or similar resources to provide high-level documentation. This can include how to use and install the software, the structure of the code or data, known issues, and more. Tools like Doxygen or Javadoc can be helpful.
Use version control systems: Tools like Git can track changes in code and help document the history and reasoning behind those changes. Git, in combination with platforms like GitHub, also allows you to back up your code online and collaborate with others.
Automate backups: Use automated tools to regularly back up your code and data. This could be a simple cron job that copies files to a different location, or more sophisticated solutions that can handle large datasets, such as AWS S3, Google Cloud Storage, or other data backup services.
Implement a testing framework: This will serve as an additional form of documentation, explaining how the code is supposed to behave, and it provides a way to verify that changes to the code have not broken anything.
Use data versioning tools: In case of data science projects, tools like DVC can be used to track changes in data and models over time, similar to how Git is used for code.
Documenting and backing up code and data are crucial practices in software development, data science, and other fields that involve programming or handling large amounts of data. Here’s why:
Remember, good documentation and regular backups are not afterthoughts or luxuries—they’re essential components of professional, reliable, and resilient projects.
Archiving a repo with Zenodo
Archiving your GitHub repository with Zenodo allows you to have a digital object identifier (DOI) for your repo, which is useful for citing your software in academic papers. Here’s a step-by-step guide:
Create a Zenodo account. You’ll need to visit Zenodo’s homepage (https://zenodo.org/) and sign up for an account if you don’t already have one.
Link GitHub with Zenodo. After you’ve set up your Zenodo account, you can navigate to the “GitHub” section in the dashboard and link your GitHub account to Zenodo.
Authorize Zenodo on GitHub. Zenodo will request access to your GitHub repositories. You can either allow Zenodo to access all your repositories or only select ones. After you’ve made your choice, click “Authorize Zenodo.”
Select the repository. Back in Zenodo, you should see a list of all your GitHub repositories. You can choose to archive any repository by toggling the switch to “on” next to the repository name. Zenodo will now create a “webhook” for this repository, which will notify Zenodo whenever there is a new release of the repository on GitHub.
Create a new release on GitHub. If you haven’t done so already, you should create a new release of your repository on GitHub. Go to your repository page, click “releases” then “create a new release”. Tag version with the version number (for example “v1.0”) and add some description about this release. Zenodo will be notified once the release is published.
Check Zenodo for the deposit. After creating a new release, go back to Zenodo. You should see the new release in the “Upload” section. Zenodo automatically fills in some of the metadata for you, such as the title and authors, but you can edit this if needed.
Publish the archive on Zenodo. Finally, you can publish the archive on Zenodo. Make sure all the information is correct, and then click the “Publish” button. Your repository is now archived, and you will have a DOI that you can use to cite your software.
Remember that Zenodo will archive a new version of your repository every time you create a new release on GitHub, so it’s important to create meaningful and well-documented releases. Zenodo also allows you to link different versions of your software together, so that it’s clear how the software has changed over time.