# Validating data
## Getting your diagnostics data
Go to your fork of your diagnostics repository, and follow the instructions in the *Get the data* section of the `README.md` file.

Your `data` directory should now contain a directory called `group-0?`, where `?` is a number from 0 through 2. This directory in turn contains 10 subdirectories of form `sub-0?`, where `?` is a number between 1 and 10. Each of these directories contains a `func` directory, which in turn contains two `.nii.gz` files. These are the FMRI data files. There are matching `.tsv` files that contain the event onset data for the task during each scanning run. You will also see a file of form `group-0?/hash_list.txt`.
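To check what you have, you can list the unpacked files with `pathlib`. This is a quick sketch, assuming (hypothetically) that your group directory is `data/group-00`; replace that with your own group number:

```python
from pathlib import Path

# Hypothetical group directory; use your own group number.
group_dir = Path('data') / 'group-00'

# List the FMRI data files and the matching event onset files.
for fn in sorted(group_dir.glob('sub-*/func/*')):
    print(fn)
```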
Now do `git status`. You will see that the directory of files you have just unpacked shows up in Git's listing of untracked files.

Next, put the `data/group-0?/hash_list.txt` file into Git version control, so you are keeping a record of what the data hashes ought to be. To do this, make a new branch, maybe called `add-hashes`, check out that branch, and then run `git add data/group-0?/hash_list.txt`. Run `git status` to check that you did add the file to the staging area. Commit your change, push up your branch, and make a Pull Request (PR) to the main repo. Someone should merge this. As it is simple, that person could be you.
Now have a look at `data/group-0?/hash_list.txt`. For each of the data files, `hash_list.txt` has a line with the SHA1 hash for that file and the filename, separated by a space.

You want to be able to confirm that your data has not been overwritten or corrupted since you downloaded it. To do this, you need to calculate the current hash for each of the unpacked `.nii.gz` and `.tsv` files, and compare it to the hash value in `hash_list.txt`.
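For example, you can break each line of the hash list into its hash and filename parts with the `split` method of strings. A minimal sketch, again assuming a hypothetical `group-00` directory:

```python
from pathlib import Path

# Hypothetical group directory; use your own group number.
hash_pth = Path('data') / 'group-00' / 'hash_list.txt'

# Each line is "<expected SHA1 hash> <filename>".
for line in hash_pth.read_text().splitlines():
    expected_hash, filename = line.split()
    print(expected_hash, filename)
```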
Now run `python3 scripts/validate_data.py`. When you first run this file, it will fail.

In due course, you will edit `scripts/validate_data.py` in your text editor to fix it. See below.
## Some code you will need
### Reading bytes from a file
Imagine we wanted to read in the byte-by-byte contents of a file.
We start with the `Path` object — see pathlib.
```python
from pathlib import Path

# A picture (in fact, the logo for the textbook)
pth = Path('images') / 'reggie.png'
pth
```

```
PosixPath('images/reggie.png')
```
Here we read a sequence of bytes from the file, using the `read_bytes` method of the `Path` object. We could instead have used the `read_text` method to read text characters from the file. In that case Python interprets the file contents as text. See the pathlib page for detail.
```python
reggie_bytes = pth.read_bytes()
# Show the first 10 bytes of the file.
reggie_bytes[:10]
```

```
b'\x89PNG\r\n\x1a\n\x00\x00'
```
### Calculating a hash from the bytes
You have seen hashes in Curious Git.
A hash is a signature for a file. Every unique sequence of bytes has a (near-as-dammit) unique hash signature.
We are going to use the SHA1 hash. Hash algorithms are in the `hashlib` standard Python module:
```python
import hashlib
```
Here's the SHA1 hash for the `reggie.png` file:
```python
hashlib.sha1(reggie_bytes).hexdigest()
```

```
'c7b76e3629dd88ebd70aad86180a62648d6386be'
```
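The hash acts as a signature in the sense that any change to the file contents gives a different hash. As a small illustration (not part of the original exercise), replacing just the first byte gives a completely different digest:

```python
# Replace the first byte and re-hash; the digest will no longer
# match the value above.
changed_bytes = b'X' + reggie_bytes[1:]
hashlib.sha1(changed_bytes).hexdigest()
```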
If you are on Mac, or have the `shasum` command installed on Linux, you can check whether the command line version of this calculation agrees with your Python calculation:
```
%%bash
shasum images/reggie.png
```

```
c7b76e3629dd88ebd70aad86180a62648d6386be  images/reggie.png
```
### Crashing out with an error
Sometimes your code will discover unexpected and horrible things, and you will want to crash out of the code with an informative error. You can crash out with an error using `raise`, giving some type of error and a message. For example:
```python
even_no = 3
if (even_no % 2) != 0:  # Oh no, it's not an even number
    raise ValueError(f'Oh no, {even_no} is not an even number')
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 3
      1 even_no = 3
      2 if (even_no % 2) != 0:  # Oh no, it's not an even number
----> 3     raise ValueError(f'Oh no, {even_no} is not an even number')

ValueError: Oh no, 3 is not an even number
```
## On to the validation
Now run `python3 scripts/validate_data.py`. It will fail. Use the code suggestions above to edit the `validate_data.py` script and fix it, so it correctly checks all the hashes of the listed data files.
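To get you started, here is one possible shape for such a script. This is a sketch, not the actual contents of `scripts/validate_data.py`: the real script may be organized differently, and the `validate_hashes` function name, the `group-00` directory, and the assumption that the filenames in `hash_list.txt` are relative to the directory containing it are all illustrative guesses:

```python
""" Sketch of a hash-validation script.

A hypothetical outline, not the actual scripts/validate_data.py.
"""

import hashlib
from pathlib import Path


def validate_hashes(hash_pth):
    """ Check each file listed in `hash_pth` against its recorded hash. """
    # Assume filenames are given relative to the hash list's directory.
    data_dir = hash_pth.parent
    for line in hash_pth.read_text().splitlines():
        expected_hash, filename = line.split()
        # Calculate the current SHA1 hash of the file contents.
        file_bytes = (data_dir / filename).read_bytes()
        actual_hash = hashlib.sha1(file_bytes).hexdigest()
        if actual_hash != expected_hash:
            # Crash out with an informative error.
            raise ValueError(f'Hash for {filename} does not match')


# Hypothetical group directory; use your own group number.
validate_hashes(Path('data') / 'group-00' / 'hash_list.txt')
```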