Validating data#

Getting your diagnostics data#

Go to your fork of your diagnostics repository, and follow the instructions in the README.md file Get the data section.

  • Your data directory should now a directory called group-0? where ? is a number from 0 through 2. This directory in turn contains 10 subdirectories of form sub-0?, where ? is a number between 1 and 10. Each of these directories contain a func directory, which each contain two .nii.gz files. These are the FMRI data files. There are matching .tsv files that contain the event onset data for the task during each scanning run. You will also see a file of form group-0?/hash_list.txt.

  • Now do git status. You will see that directory of the files you have just unpackaged show up in Git’s listing of untracked files.

  • Next put the data/group-0?/hash_list.txt file into Git version control, so you are keeping a record of what the data hashes ought to be. To do this, make a new branch, maybe called add-hashes, checkout that branch, and then run git add data/group-0?/hash_list.txt Run git status to check that you did add the file to the staging area. Commit your change, push up your branch and make a Pull Request (PR) to the main repo. Someone should merge this. As it is simple, that person could be you.

  • Now have a look at data/group-0?/hash_list.txt. For each of the files, hash_list.txt has a line with the SHA1 hash for that file, and the filename, separated by a space;

  • You want to be able to confirm that your data has not been overwritten or corrupted since you downloaded it. To do this, you need to calculate the current hash for each of the unpacked .nii.gz and .tsv files and compare it to the hash value in hash_list.txt;

  • Now run python3 scripts/validate_data.py. When you first run this file, it will fail;

  • In due course, you will edit scripts/validate_data.py in your text editor to fix. See below.

Some code you will need#

Reading bytes from a file#

Imagine we wanted to read in the byte-by-byte contents of a file.

We start with the Path object — see pathlib.

from pathlib import Path

# A picture (in fact, the logo for the textbook)
pth = Path('images') / 'reggie.png'
pth
PosixPath('images/reggie.png')

Here we read a sequence of bytes from the file, using the read_bytes method of the Path object. We could instead have used the read_text method to read text characters from the file. In that case Python interprets the file contents as text. See the pathlib page for detail.

reggie_bytes = pth.read_bytes()

# Show the first 10 bytes of the file.
reggie_bytes[:10]
b'\x89PNG\r\n\x1a\n\x00\x00'

Calculating a hash from the bytes#

You have seen hashes in Curious Git.

A hash is a signature for a file. Every unique sequence of bytes has a (near-as-dammit) unique hash signature.

We are going to use the SHA1 hash. Hash algorithms are in the hashlib standard Python module:

import hashlib

Here’s the SHA1 hash for the reggie.png file:

hashlib.sha1(reggie_bytes).hexdigest()
'c7b76e3629dd88ebd70aad86180a62648d6386be'

If you are on Mac, or have the command installed on Linux, you can see whether the command line version of this calculation agrees with your Python calculation:

%%bash
shasum images/reggie.png
c7b76e3629dd88ebd70aad86180a62648d6386be  images/reggie.png

Crashing out with an error#

Sometimes your code will discover unexpected and horrible things, and you will want to crash out of the code with an informative error. You can crash out with an error using raise, and some type of error, and a message. For example:

even_no = 3
if (even_no % 2) != 0:   # Oh no, it's not an even number
    raise ValueError(f'Oh no, {even_no} is not an even number')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 3
      1 even_no = 3
      2 if (even_no % 2) != 0:   # Oh no, it's not an even number
----> 3     raise ValueError(f'Oh no, {even_no} is not an even number')

ValueError: Oh no, 3 is not an even number

On to the validation#

Now run python3 scripts/validate_data.py. It will fail. Use the code suggestions above to edit the validate_data.py script and fix it, so it correctly checks all the hashes of the listed data files.