- Implement a subroutine that generates random data stemming either from the same or from different distributions, either by copying code from this snippet or by implementing similar functionality yourself. Check your code in to your GitHub repository.
- Generate a test set of 10000 analytes under two different conditions, measured in 3 replicates. As you will note, the data generator produces two data frames. One contains the simulated expression values, with all sample-replicate tuples (e.g. s2_1) in columns and all analytes (e.g. a12) in rows; this data frame also has a row for sample number. The other records which population each analyte and sample combination stems from. s1 is always from population 0, while s2 is from either population 0 or 1.
- For each analyte, calculate a p-value, i.e. the probability of observing a difference at least as large as the one seen between the two conditions if they stem from the same distribution, using a t-test function, e.g. scipy.stats.ttest_ind.
- Implement a Python subroutine for estimating q-values from a set of p-values. Follow the steps of Remark B in Storey & Tibshirani. However, instead of the spline estimation in step 2, just calculate the average of the pi_0 estimates with lambda >= 0.75.
- Plot the (simulated) number of differentially expressed genes as a function of q-value threshold.
- Plot the difference between q-values and the actual fraction of null statistics (using the labels from the generator).
- (*) Generate a test set using 3 different conditions, each measured in triplicate.
- (*) Implement a 1-way ANOVA test to calculate p-values for no differences between the three conditions. Use statsmodels.
- (*) Repeat steps 5 and 6 for this more complex test set.
- Check your code in to your GitHub repository.
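The first steps of the list above can be sketched as follows. This is a simplified stand-in, not the course's data generator: the function name generate_expression, the fraction of differential analytes (frac_diff) and the size of the mean shift (effect) are assumptions made for illustration, and the sample-number row of the original generator is left out.

```python
import numpy as np
import pandas as pd
from scipy import stats


def generate_expression(n_analytes=10000, n_replicates=3, frac_diff=0.2,
                        effect=2.0, seed=0):
    """Simulate two conditions, s1 and s2 (hypothetical helper).

    s1 always comes from population 0. For each analyte, s2 comes from
    population 1 (mean shifted by `effect`) with probability `frac_diff`,
    otherwise from population 0.
    """
    rng = np.random.default_rng(seed)
    analytes = [f"a{i}" for i in range(n_analytes)]
    is_diff = rng.random(n_analytes) < frac_diff
    s1 = rng.normal(0.0, 1.0, size=(n_analytes, n_replicates))
    s2 = rng.normal(0.0, 1.0, size=(n_analytes, n_replicates))
    s2[is_diff] += effect
    cols = [f"s{s}_{r}" for s in (1, 2) for r in range(1, n_replicates + 1)]
    data = pd.DataFrame(np.hstack([s1, s2]), index=analytes, columns=cols)
    labels = pd.Series(is_diff.astype(int), index=analytes, name="s2_population")
    return data, labels


data, labels = generate_expression()
s1_cols = [c for c in data.columns if c.startswith("s1_")]
s2_cols = [c for c in data.columns if c.startswith("s2_")]

# One two-sample t-test per analyte: axis=1 compares the replicates in each row.
t_stat, p_values = stats.ttest_ind(
    data[s1_cols].to_numpy(), data[s2_cols].to_numpy(), axis=1
)
```

Keeping the population labels next to the p-values is what later makes it possible to compare estimated q-values with the actual fraction of null statistics.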
I posted my version of parts of the exercise here.
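As a rough illustration of the q-value step, here is a minimal sketch that follows the simplification asked for in the exercise: pi_0 is taken as the average of the estimates for lambda between 0.75 and 0.95 rather than from a spline fit. The function names and the simulated p-values are assumptions made for this example only.

```python
import numpy as np


def estimate_pi0(p_values, lambdas=None):
    """Average of the pi_0(lambda) estimates over lambda >= 0.75 (no spline)."""
    p = np.asarray(p_values, dtype=float)
    if lambdas is None:
        lambdas = np.linspace(0.75, 0.95, 21)
    # pi_0(lambda) = #{p > lambda} / (m * (1 - lambda))
    pi0_hat = np.array([np.mean(p > lam) / (1.0 - lam) for lam in lambdas])
    return float(min(1.0, pi0_hat.mean()))


def q_values(p_values):
    """Return (q-values in the input order, pi_0 estimate)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    pi0 = estimate_pi0(p)
    order = np.argsort(p)
    # pi_0 * m * p_(i) / i for the i-th smallest p-value
    ranked = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value and moving down.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = q_sorted
    return q, pi0


# Simulated p-values (assumption): 8000 true nulls and 2000 alternatives.
rng = np.random.default_rng(0)
p_sim = np.concatenate([rng.uniform(size=8000), rng.beta(0.1, 1.0, size=2000)])
q_sim, pi0_sim = q_values(p_sim)
for threshold in (0.01, 0.05, 0.1):
    print(threshold, int(np.sum(q_sim <= threshold)))
```

Counting the analytes with q-value below each threshold, as in the loop at the end, gives the values needed for the plot of findings against q-value threshold.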
Weightloss blood plasma set
- Download the expression data from the article “Proteomics reveals the effects of sustained weight loss on the human plasma proteome”
- Import the excel file into a pandas data frame using e.g. the command
import pandas as pd
df = pd.read_excel("MSB-12-901-s009.xlsx", skiprows=list(range(6)) + list(range(15, 18)), usecols=list(range(3, 318)), header=0, index_col=0)
- Test which analytes are significantly different between Timepoint 1 and Timepoint 7 for all the patients who were present for the full experiment series. Try to reuse as much as possible of the code from your experiments with simulated data, points 4-6. Do an implementation using (a) a regular t-test, e.g. scipy.stats.ttest_ind, and (b) a paired t-test, e.g. scipy.stats.ttest_rel. Make sure you understand the difference between the two tests, and why they give different results.
- (*) Implement an ANOVA to test for a significant difference at any of the 7 points in time. In analogy with the paired t-test in the previous step, introduce a categorical blocking variable for the patients, i.e. use an “Expression ~ C(Time) + C(Patient)” model. (Note that you will need to rename the tested gene product to a more easily parsable name at the time of testing, e.g. “Expression”.)
- (**) Implement an ANOVA to test the influence of weight loss on plasma expression. That is, “Expression ~ C(Patient) + dW”, where dW is calculated as the difference in weight of the patient since time 1, i.e. dW=0 at time 1.
- Check your code in to your GitHub repository.
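To see why the regular and the paired t-test can disagree, consider this small sketch on made-up numbers (not the plasma data): the patients differ a lot from each other at baseline, while every patient shifts by roughly the same small amount between the two time points. All values and variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_patients = 20

# Hypothetical data: large between-patient spread, small consistent shift.
baseline = rng.normal(10.0, 2.0, size=n_patients)
t1 = baseline + rng.normal(0.0, 0.2, size=n_patients)
t7 = baseline + 0.5 + rng.normal(0.0, 0.2, size=n_patients)

# The unpaired test compares the two group means against the total spread.
_, p_unpaired = stats.ttest_ind(t1, t7)
# The paired test works on the per-patient differences t1 - t7.
_, p_paired = stats.ttest_rel(t1, t7)

print(f"unpaired p = {p_unpaired:.3g}, paired p = {p_paired:.3g}")
```

The paired test removes the between-patient variation by differencing within each patient, which is the same role the C(Patient) blocking term plays in the ANOVA models.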
CHO-cell growth data
Here we have an as yet unpublished dataset on protein concentrations in the supernatant of a CHO-cell bioreactor. The concentrations have been determined by so-called iBAQ spectral counting.
- Download the data set.
- Read in the data to a pandas data frame. Tip: use a command like
df = pd.read_excel("CHOPER_IBAQ_sep.xlsx",header=0,index_col=0)
- Remove rows with NaN or 0-values.
- Test which proteins are differentially expressed between day 10 and day 30 at FDR 1% using a regular t-test, e.g. scipy.stats.ttest_ind.
- (*) Test differential expression over time for each protein using a statsmodels formula model, “Expression ~ C(Day)”.
- (**) Compare your results to a linear regression model, e.g. “Expression ~ Day”
(*) Pathology Atlas (TCGA) Data
- Download data on survival times for liver cancer patients as well as expression data from their sequenced tumors.
- For each transcript, create a survival analysis model using the lifelines package. Each transcript can be tested using, for instance, a Cox proportional hazards model. Test each transcript by itself as a potential prognostic marker for the survival of the patients.
- As before, apply multiple hypothesis corrections to your calculated p-values.
- Report the number of expression levels that you find prognostic for the survival of the patients in the cohort as a function of q-value.