Simulated data
 Implement a subroutine generating random data stemming either from the same or from different distributions, either by copying code from this snippet or by implementing similar functionality yourself. Check your code in to your GitHub repository.
 Generate a test set of 10000 analytes under two different conditions, each measured in 3 replicates. As you will note, the data generator produces two data frames. The first contains the simulated expression values, with all sample-replicate tuples (e.g. s2_1) in columns and all analytes (e.g. a12) in rows. There is also a row for sample number in this data frame. The second data frame records which population each analyte and sample combination stems from: s1 is always from population 0, while s2 is either from population 0 or 1.
 For each analyte, calculate a p-value, i.e. the probability of observing a difference at least as large as the one seen between the two conditions if both stem from the same distribution, using a t-test, e.g. the function
scipy.stats.ttest_ind
 Implement a Python subroutine for estimating q-values from a set of p-values. Follow the steps of Remark B in Storey & Tibshirani. However, instead of the spline estimation in step 2, just calculate the average of the pi_0 estimates with lambda >= 0.75.
 Plot the (simulated) number of differentially expressed genes as a function of q-value threshold.
 Plot the difference between the q-values and the actual fraction of null statistics (using the labels from the generator).
 (*) Generate a test set using 3 different conditions measured in triplicate.
 (*) Implement a one-way ANOVA test to calculate p-values for no differences between the three conditions. Use statsmodels.
 (*) Repeat steps 5 and 6 for this more complex test set.
 Check in your code to your github repository.
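The first steps of this section can be sketched as follows. The generator here is a hypothetical stand-in for the course snippet, not a copy of it: the function name `simulate`, its parameters, the effect size and the fraction of differential analytes are all assumptions made for illustration, and it returns the labels as a single series rather than a second data frame.

```python
import numpy as np
import pandas as pd
from scipy import stats


def simulate(n_analytes=10000, n_rep=3, frac_diff=0.2, effect=2.0, seed=0):
    """Hypothetical stand-in for the course's data generator."""
    rng = np.random.default_rng(seed)
    analytes = [f"a{i}" for i in range(n_analytes)]
    # 1 = s2 is drawn from a shifted population, 0 = same population as s1
    is_diff = rng.random(n_analytes) < frac_diff
    s1 = rng.normal(0.0, 1.0, size=(n_analytes, n_rep))
    s2 = rng.normal(0.0, 1.0, size=(n_analytes, n_rep)) + effect * is_diff[:, None]
    cols = [f"s1_{r}" for r in range(n_rep)] + [f"s2_{r}" for r in range(n_rep)]
    data = pd.DataFrame(np.hstack([s1, s2]), index=analytes, columns=cols)
    labels = pd.Series(is_diff.astype(int), index=analytes, name="population")
    return data, labels


data, labels = simulate()
s1_cols = [c for c in data.columns if c.startswith("s1_")]
s2_cols = [c for c in data.columns if c.startswith("s2_")]

# One two-sample t-test per analyte: axis=1 runs the test along the replicates
# of each row, so no explicit loop over analytes is needed.
t_stat, p = stats.ttest_ind(
    data[s1_cols].to_numpy(), data[s2_cols].to_numpy(), axis=1
)
p_values = pd.Series(p, index=data.index, name="p")
```

Passing `axis=1` is a design choice for speed; looping over rows and calling `scipy.stats.ttest_ind` once per analyte should give the same p-values.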
I posted my version of parts of the exercise here.
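The q-value step in the list above could look roughly like the sketch below. It assumes the simplified pi_0 estimate described there (the mean of the estimates for lambda >= 0.75, here on a grid from 0.75 to 0.95 in steps of 0.01). The function name `qvalues`, the grid and the capping of pi_0 at 1 are choices made for this illustration, not something prescribed by the exercise.

```python
import numpy as np


def qvalues(p_values, lambdas=None):
    """Estimate q-values from p-values; returns (q, pi0)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if lambdas is None:
        lambdas = np.linspace(0.75, 0.95, 21)  # assumed grid, step 0.01
    # pi_0(lambda) = #{p_j > lambda} / (m * (1 - lambda)); average over the
    # grid instead of fitting a spline.
    pi0 = np.mean([(p > lam).sum() / (m * (1.0 - lam)) for lam in lambdas])
    pi0 = min(pi0, 1.0)  # assumption: cap the estimate at 1
    # For the i-th smallest p-value the raw estimate is pi0 * m * p_(i) / i.
    order = np.argsort(p)
    raw = pi0 * m * p[order] / np.arange(1, m + 1)
    # Walk from the largest p-value downwards, keeping the running minimum,
    # so that q-values never decrease with increasing p-value.
    q_sorted = np.minimum.accumulate(raw[::-1])[::-1]
    q = np.empty(m)
    q[order] = q_sorted  # back to the input order
    return q, pi0


# Toy usage: 80% uniform null p-values mixed with 20% very small ones.
rng = np.random.default_rng(1)
toy_p = np.concatenate([rng.random(8000), rng.random(2000) * 1e-3])
toy_q, toy_pi0 = qvalues(toy_p)
n_significant = int((toy_q <= 0.01).sum())
```

For the plotting steps, `np.sort(toy_q)` on the x-axis against `np.arange(1, toy_q.size + 1)` on the y-axis gives the number of analytes called significant at each q-value threshold.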
Weight-loss blood plasma set
 Download the expression data from the article “Proteomics reveals the effects of sustained weight loss on the human plasma proteome”
 Import the Excel file into a pandas data frame using e.g. the command
import pandas as pd
df = pd.read_excel("MSB12901s009.xlsx", skiprows=list(range(6)) + list(range(15, 18)), usecols=[0] + list(range(3, 318)), header=0, index_col=0)
 Test which analytes are significantly different between Timepoint 1 and Timepoint 7 for all the patients who were present for the full experiment series. Try to reuse as much as possible of the code from your experiments with simulated data, points 4-6. Do an implementation using (a) a regular t-test, e.g.
scipy.stats.ttest_ind
and (b) a paired t-test, e.g.
scipy.stats.ttest_rel
Make sure you understand the difference between the two tests, and why they give different results.
 (*) Implement an ANOVA to test for a significant difference between any of the 7 points in time. In analogy with the paired t-test in the previous step, introduce a categorical blocking variable for the patients, i.e. use an “Expression ~ C(Time) + C(Patient)” model. (Note that you will need to rename the tested gene product to a more easily parsable name at the time of testing, e.g. “Expression”.)
 (**) Implement an ANOVA to test the influence of weight loss on plasma expression. That is, “Expression ~ C(Patient) + dW”, where dW is calculated as the change in the patient's weight since time 1, i.e. dW = 0 at time 1.
 Check in your code to your github repository.
CHO-cell growth data
Here we have an as yet unpublished dataset on protein concentrations in the supernatant of a CHO-cell bioreactor. The concentrations have been determined by so-called iBAQ spectral counting.

 Download the data set.
 Read in the data to a pandas data frame. Tip: use a command like
df = pd.read_excel("CHOPER_IBAQ_sep.xlsx",header=0,index_col=0)
 Remove rows with NaN or 0-values.
 Test which proteins are differentially expressed between day 10 and day 30 at an FDR of 1% using a regular t-test, e.g.
scipy.stats.ttest_ind
 (*) Test differential expression over time for each protein using a statsmodels model “Expression ~ C(Day)”.
 (**) Compare your results to a linear regression model, e.g. “Expression ~ Day”
(*) Pathology Atlas (TCGA) Data
 Download data on survival times for liver cancer patients as well as expression data from their sequenced tumors.
 For each transcript, create a survival analysis model using the lifelines package. Each transcript can be tested using, for instance, a Cox proportional hazards model. Test each transcript by itself as a potential prognostic marker for the survival of the patients.
 As before, apply multiple hypothesis corrections to your calculated p-values.
 Report the number of transcripts whose expression levels you find prognostic for the survival of the patients in the cohort as a function of q-value.