Exercise on one-liners

Parsing FASTA Files

Download a FASTA file with the known protein sequences of the American alligator (Alligator mississippiensis) by running the following commands at your command prompt:

wget ftp://ftp.ncbi.nih.gov/genomes/Alligator_mississippiensis/protein/protein.fa.gz
gunzip protein.fa.gz

Then try to solve the following tasks using single-line command structures.

  1. Count the number of protein sequences in the file, using a) grep or b) gawk
  2. Count the number of amino acids in the file using a) grep/wc and b) gawk
  3. Use gawk to create a file in which the amino acid sequence of each protein record is written on a single line, without line breaks. That is, each record should consist of one line with the protein sequence name, followed by one long line with the amino acid sequence.
  4. Count the number of tryptic peptides in the file, i.e. any amino acid sequence pattern ending with a K or R residue
  5. Count the number of tryptic peptides longer than 7 amino acids in the file, using the same definition of a tryptic peptide as in the previous task.
  6. Count the number of unique tryptic peptides longer than 7 amino acids in the file, i.e. remove any duplicate peptides.
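
As a starting point, the tasks above can be sketched on a tiny made-up FASTA file before running anything on the full download. The file name toy.fa and its two sequences are invented for illustration and are not part of the alligator data. The sketch uses only POSIX awk features, so gawk can be substituted for awk verbatim. It follows the simplified definition used in this exercise: a tryptic peptide is a run of residues ending in K or R, so the C-terminal remainder of each protein is not counted. Treat it as one possible approach under these assumptions, not as the reference solution.

```shell
# Work in a scratch directory so no existing files are overwritten
cd "$(mktemp -d)"

# Toy FASTA file (hypothetical sequences, for illustration only)
cat > toy.fa <<'EOF'
>prot1 hypothetical protein
MKLLLLAAAA
KGGSRTT
>prot2 hypothetical protein
MSRLLLLAAA
AKDDDDEEEE
FFRW
EOF

# Task 1: every record starts with a '>' header line
n_grep=$(grep -c '^>' toy.fa)
n_awk=$(awk '/^>/ {n++} END {print n+0}' toy.fa)

# Task 2: drop the headers, delete the newlines, count what is left
aa_grep=$(grep -v '^>' toy.fa | tr -d '\n' | wc -c)
aa_awk=$(awk '!/^>/ {n += length($0)} END {print n+0}' toy.fa)

# Task 3: print each header as-is and join its sequence lines into one line
awk '/^>/ {if (seq != "") print seq; print; seq = ""; next} {seq = seq $0} END {if (seq != "") print seq}' toy.fa > toy.linear.fa

# Task 4: grep -o prints every match on its own line; one match = one peptide
n_pep=$(grep -v '^>' toy.linear.fa | grep -o '[^KR]*[KR]' | wc -l)

# Task 5: keep only peptides longer than 7 residues
n_long=$(grep -v '^>' toy.linear.fa | grep -o '[^KR]*[KR]' | awk 'length($0) > 7' | wc -l)

# Task 6: sort -u removes duplicate peptides before counting
n_uniq=$(grep -v '^>' toy.linear.fa | grep -o '[^KR]*[KR]' | awk 'length($0) > 7' | sort -u | wc -l)

echo "records: $n_grep  residues: $aa_grep  peptides: $n_pep  long: $n_long  unique: $n_uniq"
```

The peptide tasks are run on the linearized file from task 3 on purpose: in the original file a peptide can span a line break, and a line-based tool such as grep would then cut it in two.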

Parsing field data

Download a CSV file with the subcellular localization of proteins as reported by the Human Protein Atlas (HPA) project by running the following commands at your command prompt:

wget http://www.proteinatlas.org/download/subcellular_location.csv.zip
unzip subcellular_location.csv.zip

Then try to solve the following tasks using single-line command structures.

  1. Use sed to remove the double quotation marks (") from the file
  2. Count the number of proteins reported to be present in both Cytosol and Mitochondria
  3. List the gene names of the proteins present in both Cytosol and Mitochondria
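
The same kind of dry run works for the field data. The sketch below uses a made-up file, toy.csv, whose three-column layout (gene identifier, gene name, location list) is an assumption chosen for illustration. The real subcellular_location.csv may use different columns, so inspect its header line first (for example with head -n 1) and adjust the field number accordingly. It also assumes that no quoted field contains a comma, since otherwise splitting on commas after removing the quotes would shift the columns.

```shell
# Work in a scratch directory so no existing files are overwritten
cd "$(mktemp -d)"

# Toy CSV file (hypothetical column layout and gene names)
cat > toy.csv <<'EOF'
"Gene","Gene name","Main location"
"ENSG0001","GENEA","Cytosol;Mitochondria"
"ENSG0002","GENEB","Nucleus"
"ENSG0003","GENEC","Mitochondria;Cytosol;Nucleus"
"ENSG0004","GENED","Cytosol"
EOF

# Task 1: delete every double quotation mark
sed 's/"//g' toy.csv > toy.noquote.csv

# Task 2: count lines that mention both locations, in either order
n_both=$(grep 'Cytosol' toy.noquote.csv | grep -c 'Mitochondria')

# Task 3: print the second field (the gene name in this toy layout) of those lines
names=$(awk -F, '/Cytosol/ && /Mitochondria/ {print $2}' toy.noquote.csv)

echo "proteins in both: $n_both"
echo "$names"
```

Chaining two grep calls avoids having to write a pattern for both possible orders of the two location names on a line.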