Winter 2019: Topics in Bioinformatics

Schedule: Monday and Wednesday, noon-1:20pm

Location: TTI-C Room 530 on the 5th floor, 6045 S Kenwood Ave, Chicago, IL 60637

Instructor: Jinbo Xu (jinboxu@gmail.com, office: TTI-C room 528)

Students can register for this course through the University of Chicago.

With the availability of large-scale genomic, expression and structural data, mathematics, statistics and computer science are being used extensively to understand biological data at the molecular level. This course focuses on the application of machine learning and computer algorithms to problems in molecular biology. In particular, it covers topics such as sequence alignment and homology search, RNA/protein structure prediction and biological network analysis.

Students are strongly encouraged to read the following materials before attending this class, since they will not be covered in class.

1. The Department of Energy's Primer on Molecular Genetics.

2. The Department of Energy's Overview of the Human Genome Project.

3. Hunter's molecular biology for computer scientists.

4. National New Biology Initiative: A New Biology for the 21st Century.

Syllabus

Here is a syllabus for this course. A temporary reading list is available here.

Intended Audience

Graduate students or senior undergraduate students with some background in math, computer science, statistics or biology. To finish the assignments and the final research project, students will need to do some programming in C++, Java, Matlab, Python, R or other scientific computing software.

Evaluation

There will be no examination for this course. The final grade consists of three components: homework, one final research project and attendance. For the homework assignments, you may re-implement a popular algorithm or conduct an experiment to compare several popular bioinformatics tools, and summarize your work in a technical report (around 5 pages). For the final research project, you may develop new algorithms for a specific bioinformatics problem or conduct a comprehensive review of a specific research topic, such as deep learning for protein structure prediction. You are not required to come up with extremely innovative ideas, although doing so is highly encouraged. Incremental improvement over existing algorithms is acceptable for the final research project. Please hand in a report on the final research project. All students are required to complete both the homework and the final research project. However, undergraduate students will be graded more generously. Below is the grading scheme:

a) The 5 homework assignments account for 70% of the final grade (14% each).

b) The final project accounts for 20% of the final grade.

c) Class attendance accounts for 10% of the final grade.

d) You must earn at least 60% of the available points in each of a), b) and c); otherwise you will fail the whole course.

Homework Assignments

Homework 1 (due date Jan 21, 2019)

a. Implement the local and global alignment models for pairwise DNA sequence alignment in a programming language you are comfortable with. DO NOT copy from open-source code.

b. Test your code using one of the four benchmarks (BaliBASE, OXBench, PREFAB or SMART) at http://dna.cs.byu.edu/mdsas/download.shtml, which provide the ground truth alignments. Each file in the benchmark contains one multiple sequence alignment, from which you may extract pairwise alignments. The input sequences for your program shall not contain any gaps (i.e., “-” in the ground truth alignment). Calculate the alignment accuracy (i.e., precision and recall) of your code by comparing its resultant alignments with the ground truth.

c. Compare the alignment accuracy of the local and global alignment models.

d. Study the impact of the gap penalties. You may test two different gap extension penalties (0 and -1) and three different gap open penalties (0, -5 and -10).
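The accuracy calculation in step b can be sketched as follows. This is a minimal illustration, not a required design: it assumes accuracy is measured over aligned residue pairs, and the function names `aligned_pairs` and `precision_recall` are hypothetical. Each alignment is reduced to the set of residue index pairs it places in the same column, and the predicted set is compared with the ground truth set.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs aligned in two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def precision_recall(pred_pairs, ref_pairs):
    """Precision and recall of predicted aligned pairs against the reference."""
    correct = len(pred_pairs & ref_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(ref_pairs) if ref_pairs else 0.0
    return precision, recall


# Toy example: two rows taken from a reference alignment vs. a predicted alignment.
ref = aligned_pairs("AC-GT", "ACTGT")
pred = aligned_pairs("ACG-T", "ACTGT")
print(precision_recall(pred, ref))  # (0.75, 0.75)
```

Columns in which both rows of a multiple sequence alignment contain gaps are skipped automatically. For a local alignment, the index pairs would first need to be shifted by the start positions of the aligned region in each full sequence.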

Homework 2 (due date Jan 30, 2019)

Develop scoring matrices for amino acid similarity based upon the reference alignments in the SABmark1.65 datasets, using the method by which BLOSUM is derived.

a. The SABmark benchmark contains two datasets, one for superfamily level and the other for twilight zone. Please derive two different scoring matrices, one from each dataset.

b. Compare your two scoring matrices and explain their similarities and differences.

c. Compare your two scoring matrices with the five BLOSUM matrices listed at https://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/data/. Find out which BLOSUM matrices are the closest to your scoring matrices.
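As a rough sketch of the log-odds calculation behind BLOSUM-style matrices: count residue pairs over alignment columns, convert the counts to observed pair frequencies, estimate the expected frequencies from the residue background, and take the score as twice the base-2 log-odds ratio, rounded to an integer (half-bit units). The function names are hypothetical, and the sketch deliberately omits the sequence clustering step that BLOSUM uses to down-weight similar sequences.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(alignments, gap="-"):
    """Count unordered residue pairs over every column of every alignment."""
    counts = Counter()
    for rows in alignments:
        for column in zip(*rows):
            residues = [r for r in column if r != gap]
            for a, b in combinations(residues, 2):
                counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(counts):
    """Half-bit log-odds score for every residue pair that was observed."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}  # observed pair frequencies
    p = Counter()  # background residue frequencies
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Toy alignment over a two-letter alphabet.
counts = pair_counts([["AAB", "AAB", "ABB"]])
print(log_odds_scores(counts))
```

Pairs that never occur in the reference alignments get no score here, since their log-odds ratio is undefined; how to handle them (for example with pseudocounts) is a choice left to the report.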

Homework 3 (due date Feb 10, 2019)

a. Please download the following protein sequences:

1) http://www.predictioncenter.org/casp12/target.cgi?target=T0859&view=sequence

2) http://www.predictioncenter.org/casp12/target.cgi?target=T0860&view=sequence

3) http://www.predictioncenter.org/casp12/target.cgi?target=T0861&view=sequence

4) http://www.predictioncenter.org/casp12/target.cgi?target=T0862&view=sequence

5) http://www.predictioncenter.org/casp12/target.cgi?target=T0863&view=sequence

b. Go to the NCBI BLAST web site and run Protein BLAST on each of the above sequences using different scoring matrices and word sizes.

c. In your homework report, for each sequence please study the similarity and dissimilarity of the results returned by BLAST when the scoring matrix and word size are changed and summarize your findings. In particular, please report how the E-value distribution changes with respect to the scoring matrix and word size.

Homework 4 (due date Feb 20, 2019)

1) Please download the seed (multiple sequence) alignments of the top 20 Pfam families from the Pfam web site. The top 20 families are listed at http://pfam.xfam.org/family/browse?browse=top%20twenty

2) Implement the simple algorithm covered in the lecture to build a Profile HMM from a multiple sequence alignment. You may use whatever programming language you are familiar with.

3) Test your program on the multiple sequence alignments of the top 20 Pfam families and compare your results with what you can obtain by running HMMER3 (http://hmmer.org/). HMMER3 has a program hmmbuild that can build an HMM from a multiple sequence alignment. Please compare your HMM with the output of hmmbuild in terms of the number of match states, the state transition probabilities and the emission probabilities. Please try to explain the differences, if any.

4) Please hand in your source code (with enough comments that I can understand it) and your results.
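The algorithm from the lecture is not reproduced on this page, so the following is only a sketch under stated assumptions. It treats an alignment column as a match state when fewer than half of its entries are gaps (a common textbook heuristic) and estimates match-state emission probabilities with additive pseudocounts. The function names, the 0.5 threshold and the pseudocount of 1 are all assumptions, and transition probabilities are left out.

```python
from collections import Counter


def match_columns(rows, gap="-", max_gap_fraction=0.5):
    """Indices of alignment columns treated as match states (assumed heuristic)."""
    n = len(rows)
    return [
        k for k, column in enumerate(zip(*rows))
        if sum(c == gap for c in column) / n < max_gap_fraction
    ]


def match_emissions(rows, alphabet, gap="-", pseudocount=1.0):
    """Emission distribution of each match state, with additive pseudocounts."""
    columns = list(zip(*rows))
    emissions = []
    for k in match_columns(rows, gap):
        # Only symbols from the alphabet are counted; gaps are ignored.
        counts = Counter(c for c in columns[k] if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return emissions


# Toy DNA alignment; columns 1 and 2 are too gappy to be match states.
rows = ["AC-G", "A--G", "ACTG", "T--G"]
print(match_columns(rows))  # [0, 3]
print(match_emissions(rows, "ACGT")[0])
```

The gap character in the downloaded seed alignments may differ from "-" (for instance "."), in which case the `gap` argument needs to be adjusted. The number of match states returned here is the first quantity to compare against the hmmbuild output.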

Homework 5 (due date March 2, 2019)

1) In this homework, please learn to use two protein secondary structure prediction web servers: (a) SPIDER2: http://sparks-lab.org/server/SPIDER2/; and (b) RaptorX-Property: http://raptorx.uchicago.edu/StructurePropertyPred/predict/ .

2) To test them, please download the sequence and structure files (in FASTA and PDB format respectively) of the following proteins from the Protein Data Bank website: 4ympA, 5a7dB, 5a7dL, 5aotA, 5ereA, 5fjlA, 5j4aA, 5j4aB, 5j5vA, 5j5vB, 5j5vC, 5jmbA, 5jmuA, 5kkpA and 5ko9A. In each name, the first 4 characters are the PDB ID and the last letter is the protein chain name. For example, by searching 4ymp at http://www.rcsb.org/pdb/home/home.do, you will be directed to the web page of the protein 4ymp. By clicking on the “Download” button, you will be able to download the FASTA sequence and the PDB files. Note that you only need to download the sequence file for chain A.

3) Please submit your sequences to the web servers and then calculate the secondary structure prediction accuracy (i.e., Q3 accuracy) of the returned results. To obtain the ground truth, you may use the DSSP web server or program at http://www.cmbi.ru.nl/dssp.html. You will need to write a small program to parse the output of DSSP.

4) Compare your Q3 accuracy with that listed in Table 1 of the paper at https://academic.oup.com/bib/article/doi/10.1093/bib/bbw129/2769436/Sixty-five-years-of-the-long-march-in-protein to see how different your result is from this table. Only Q3 accuracy is needed.

5) Repeat step 3) using the CASP13 targets at http://predictioncenter.org/casp13/targetlist.cgi. Most CASP13 targets still do not have publicly available native structures, so you just need to work on those targets with “PDB code” in the “description” column.

6) If you do not want to use the web servers, you can of course install their standalone packages. Generally speaking, however, installing a local copy of the packages takes much more effort.
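The Q3 calculation in step 3) can be sketched as below. The sketch assumes a commonly used reduction of the eight DSSP states to three (H, G, I to helix; E, B to strand; everything else, including a blank, to coil); other reductions exist, so state in the report which one you used. The function names are hypothetical, and parsing the DSSP output file itself is not shown.

```python
# Assumed 8-to-3 state reduction: H, G, I -> H; E, B -> E; everything else -> C.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(dssp_states):
    """Map a string of DSSP states to the three states H, E and C."""
    return "".join(DSSP_TO_3.get(s, "C") for s in dssp_states)


def q3_accuracy(predicted, dssp_states):
    """Fraction of residues whose predicted 3-state label matches DSSP."""
    truth = reduce_to_three_states(dssp_states)
    if len(predicted) != len(truth):
        raise ValueError("prediction and ground truth must have equal length")
    if not truth:
        raise ValueError("empty input")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Toy example: a blank in the DSSP string stands for a residue with no assigned state.
dssp = "HHGG EEBTS"
predicted = "HHHCCEEECC"
print(q3_accuracy(predicted, dssp))  # 0.9
```

Residues that appear in the FASTA sequence but are missing from the PDB structure have no DSSP state, so the prediction and the ground truth have to be matched residue by residue before calling a function like this.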

You can use existing libraries or Matlab or Python to implement your algorithm. However, please clearly point out your contribution in your report. If you use other bioinformatics libraries, please pay more attention to result analysis.

Research Projects

Please choose one of the following topics. You are also encouraged to propose your own topic. However, you shall not work on the same topic for both your homework assignments and your final research project. For the algorithm development projects, you do not have to achieve state-of-the-art performance, but your performance should not be much worse. For review projects, in addition to critically reviewing existing work, please also discuss possible future research directions.

1. Review algorithms for protein secondary structure prediction

2. Review deep learning algorithms for protein local structure (secondary structure, solvent accessibility and torsion angles) prediction

3. Review algorithms for protein contact prediction

4. Review deep learning methods for drug discovery

5. Review algorithms for protein structure alignment

6. Review algorithms for phylogeny tree construction

7. Review algorithms for biological network alignment

8. Review algorithms for biological network construction

9. Develop algorithms for pairwise or multiple protein-protein interaction network alignment

10. Develop an efficient algorithm to cluster a very large sequence database

11. Develop deep learning algorithms for protein secondary structure, torsion angle prediction or contact prediction

12. Develop deep learning algorithms for protein solvent accessibility and/or contact number prediction

13. Develop deep learning algorithms to predict whether a protein is a DNA- or RNA-binding protein

14. Develop deep learning algorithms to predict whether two proteins interact

The final project is due in early to mid-March. Please send me a brief abstract (one paragraph) describing what you want to work on before mid-February. If you need your final grade to graduate, please talk to me and hand in the final project earlier. If you need more time to complete the research project, please also talk to me.