Align file names to 96-well plate
This page illustrates how to create a table listing sample names with paths to their corresponding raw sequence data files for sequence data from the VBC NGS facility.
This is a difficult problem to solve reliably because the formats keep changing, so user discretion is essential.
Sequence file names
When we submit 96-well plates for sequencing to the VBC NGS facility, we typically submit something like an Excel sheet giving rows/columns of the plate and the biological sample that should be in each well. What we get back from the NGS facility is a mass of bam or fastq files that look something like this:
343561_TTATACGCGAGCCGTCTGTT_S22_R1_001.fastq.gz
Older samples might look something like this:
H3H7YDRXY_1#144456_ACTCGCTACGTCTAAT.bam
How is one meant to determine which sample in the original plate each file is meant to correspond to?
What do the file names mean?
Unfortunately, the naming system of files has changed through time, so they might not always look like the ones above. In this case at least you can see:
343561is the facility sample number. This usually indicates and ID for the plate. You can use this to track down the facility’s data on the sequencing run (in this case, for example: https://ngs.vbcf.ac.at/forskalle3/samples/343561). Confusingly, the facility also has a ‘request number’, which looks very similar.TTATACGCGAGCCGTCTGTTgives the adapter index sequence for this sample. This comprises two 8-or-more nucleotide sequences that together give a unique identifier for the row/column position in a 96-well plate. There may be multiple combinations for separate plates so that these can be run on a single flow cell.S22gives a number from 1 to 96 indicating the position in the well.R1indicates that these are paired reads, and that this is the file for mate-pair 1. Data for mate pair 2 would be found in a file with a similar name, but withR2in this position.001: What does this mean? Only the Gods know.
On the older file:
H3H7YDRXYis the flow cell on which sequences were run._1is some kind of subset of the data on that flow cell. For example, 1 and 2 here might indicate either end of paired-end data. Note that sometimes your data might be combined with someone else’s data, so you might have 3 and 4, or some other complicated combination of data. It’s best to ask Almudena or Viktoria if you aren’t sure.144456is the facility sample number.ACTCGCTACGTCTAATgives the adapter index sequence for this sample.
As you can see, only the information the two formats have in common are the sample ID (plate) and adapter sequences, so we will use this information to piece together which sample is which. Note that other formats may, or may yet exist!
Which adapters do I have?
For data created from 2023 or later you probably have the Nextera Dual XT adapters, which are four sets of 96 unique pairs of adapters. You can view them here.
Before 2023 there was little consistency, and they could be one of many adapter sets available. The full cornucopia of adapter sets available at the NGS facility is here, in a format that could politely be called “a data-science nightmare”. Likely candidates are one of the eight “Nordborg Nextera INDEX” sets. It may be best to ask within the lab in this case.
How can you determine which of the index sets was used? The first place to look is Forskalle. Search (the bar is in the top right) for the sample number of your plate, then look under “Adapter” under “Sample library options” in the upper left box, and the index set information should be displayed. For example, sample 144456 shows “Nordborg Nextera INDEX set 2…” in this box.
If in doubt, ask Viktoria to add the indices to the Nordborg-group NGS submission sheet on google drive.
Work out which file is which
From the previous section it is clear that if you know the adapter
sequence in the filename and which adapter set was used, you should be
able to work out which file corresponds to which plate position. Doing
this alignment is fairly tedious, and there is no point in people
replicating the task, so the function align_fastq_with_plate_positions
can done this automatically. This function requires lists of input fastq
files and a sample sheet giving the details of each sample, and returns
a copy of the sample sheet with the paths to corresponding fastq files
as additional columns
It is important to note that this is not a foolproof procedure, and you need to check the output.
List the fastq files
Using Python import the modules needed and make a list of filenames. In
this fictional example, imagine you have a folder of pairs of fastq
files corresponding to forward and reverse read pairs for each sample.
Files with mate pair 1 and 2 are labelled R1 and R2 respectively.
Globbing these files returns a python list of files names.
from glob import glob
import methlab as ml
print("Using methlab version " + ml.__version__)
# List of fastq files
mate1=glob("path/to/bam_files/*_R1_*.fastq.gz")
mate2=glob("path/to/bam_files/*_R2_*.fastq.gz")
Sequencing sample sheet
To line these filenames up with biological samples we need a sample sheet, here given as a CSV file and imported as a Pandas dataframe.
sequencing_sheet = pd.read_csv("tests/test_data/NGS_sample_sheet.csv")
Here is an example of how a sample sheet might look:
plate row col plantID index_set index1 index2
2021-007 A 1 H1.2xH2.1-1 4 TTCTCGTGCA ATACACAGAG
2021-007 B 1 H1.2xH2.1-2 4 GCCTAACGTG AGCTCTCAAG
2021-007 C 1 H1.2xH2.1-3 4 CATTCACGCT GTTGTACTCA
2021-007 D 1 H1.2xH2.1-4 4 GCCATATAAC ACACAATATC
2021-007 D 1 H1.2xH2.1-4 4 TCGTGCATTC GCGTTGGTAT
2021-007 D 1 H1.2xH2.1-4 4 AACCAGCCAC GCCACAGCAC
This file must include columns index1 and index2 that give the index
barcodes for each sample. See the previous section for where to find
these. The other columns give information about each sample. They can
include any information, as long as index1 and index2 are present.
There may not be duplicate rows of index1 and index2, which in
practical terms mean that you should look at one sequencing plate at a
time.
Be aware that sometimes index1 and index2 appear in reverse order in the file names!
Merge filenames and sample sheet
Pass the lists of file paths and the sample sheet to
align_fastq_with_plate_positions.
genotyping_sheet = ml.align_fastq_with_plate_positions(mate1, mate2, sequencing_sheet)
The function adds two new columns called fastq1 and fastq2 giving
the paths from mate1 and mate2 for each sample it found a match for.
This function uses regular expressions to look for a nucleotide sequence inside each filename and matches it to the indices in the sample sheet. This is a general, but not very robust solution. In particular, it will fail if there are multiple samples with the same adpater sequences, such as when the same index set was used for multiple plates. As such, it would be wise to only use this on one sequence plate at a time. It also wouldn’t hurt to look at the source code for this function and convince yourself of what it is doing.
Example
Sequencing plate 2024-011 was 96 samples of F8 lines from the
Intra-Swedish collection. You can find the sequencing sample sheet,
including plate positions, sample names, barcode and other information
here.
I saved this information as a CSV file on the cluster as
01_data/08_resequenced_F8s/sample_sheet.csv, and changed the “barcode”
columns to “index”.
The data were sequenced as NGS sample 343561. After unzipping the data to
/scratch-cbe/users/$(whoami)/04_resequencing/343561/, the first ten
files look like this:
343561_AACAAGTACAGGATATATCC_S87_I1_001.fastq.gz
343561_AACAAGTACAGGATATATCC_S87_I2_001.fastq.gz
343561_AACAAGTACAGGATATATCC_S87_R1_001.fastq.gz
343561_AACAAGTACAGGATATATCC_S87_R2_001.fastq.gz
343561_AACCGAGTTCGATCTCTGGA_S9_I1_001.fastq.gz
343561_AACCGAGTTCGATCTCTGGA_S9_I2_001.fastq.gz
343561_AACCGAGTTCGATCTCTGGA_S9_R1_001.fastq.gz
343561_AACCGAGTTCGATCTCTGGA_S9_R2_001.fastq.gz
343561_AACTTATCCTGTACTGGCGT_S78_I1_001.fastq.gz
343561_AACTTATCCTGTACTGGCGT_S78_I2_001.fastq.gz
Below is an example Python script to link the sample names up to the
file names. If you haven’t already, install the methlab and pandas
modules for Python using
pip install methlab pandas
Here is the Python code:
import pandas as pd
from glob import glob
import os
import methlab as ml
print("Using methlab version " + ml.__version__)
# Path to the directory containing raw fastq files
path_to_fastq = "scratchdir/04_resequencing/343561/"
if os.path.exists(path_to_fastq):
print(f"Using fastq files from {path_to_fastq}")
else:
print(f"Error: could not find the path containing fastq files at {path_to_fastq}")
print(f"Please check that you have unzipped the fastq files, and that you have the correct path.")
print(f"The path is hard coded to a directory on the CLIP cluster, so you may need to change it.")
print(f"Exiting...")
exit()
# Python lists of files for mate-pair 1 and mate-pair 2
mate1 = glob(path_to_fastq + "/*_R1_*.fastq.gz")
mate2 = glob(path_to_fastq + "/*_R2_*.fastq.gz")
# Sample sheet giving sample and index information
sequencing_sample_sheet = pd.read_csv(
"01_data/08_resequenced_F8s/sample_sheet.csv"
)
# Line up the sample names and filenames
genotyping_sample_sheet = ml.align_fastq_with_plate_positions(mate1, mate2, sequencing_sample_sheet)
genotyping_sample_sheet should now contain all the columns in the
original sequencing sample sheet, plus additional columns with paths to
the fastq files.