Thursday 11 August 2016

Database headaches for genetics data

Turns out designing a database for genetics data is quite hard because:

  1. Genetic data is very large, and variants are both common and sparse. In a whole genome (4M variants) each individual harbours thousands of variants which are only seen in that individual and close relatives (so very rare at the population level). Conversely, many variants are very common, often because what we think of as the alternative allele should actually be the reference allele. This could have been caused by a population effect: we started sequencing Europeans when we should have started with Africans (thank you Craig Venter!).
  2. Genetic data is messy. I mentioned the problem of flipped alleles; well, there is worse. Some sites are multiallelic for single nucleotide variants, and if one starts to consider indels of variable length then these sites are nearly always multiallelic. Furthermore, if you are in a repetitive region a variant can have more than one name (right normalisation seems to be the rule, but what happens in practice?). Also, sometimes at a multiallelic site, the reference allele may not even be seen!
  3. We need to query data column-wise and row-wise. Besides the basic filtering of variants, there are many biological questions which require looking at large numbers of variants shared between individuals.
  4. The distribution of variants per gene is very skewed. There are a few genes, e.g. TTN, with a huge number of variants.
  5. In biology there are always exceptions to the rules, or anything is possible: one variant belonging to many genes, etc.
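To make the row-wise versus column-wise point concrete, here is a minimal in-memory sketch of a sparse store that only records the (individual, variant) pairs actually observed, indexed in both directions. The class name, method names and the "chrom-pos-ref-alt" variant identifiers are hypothetical, chosen only for illustration; a real system would sit on disk and need to deal with the synonym and multiallelic issues listed above.

```python
from collections import defaultdict


class SparseGenotypeStore:
    """Toy store of carrier relationships, indexed both ways (hypothetical design)."""

    def __init__(self):
        self.by_individual = defaultdict(set)  # individual -> variants carried
        self.by_variant = defaultdict(set)     # variant -> carriers

    def add(self, individual, variant):
        # Only observed (non-reference) calls are stored, so the data stay sparse.
        self.by_individual[individual].add(variant)
        self.by_variant[variant].add(individual)

    def variants_of(self, individual):
        """Row-wise query: all variants seen in one individual."""
        return self.by_individual.get(individual, set())

    def carriers_of(self, variant):
        """Column-wise query: all individuals carrying one variant."""
        return self.by_variant.get(variant, set())

    def shared_variants(self, individuals):
        """Variants carried by every individual in the list."""
        sets = [self.variants_of(i) for i in individuals]
        return set.intersection(*sets) if sets else set()


store = SparseGenotypeStore()
store.add("ind1", "1-12345-A-G")
store.add("ind1", "2-67890-C-T")
store.add("ind2", "1-12345-A-G")

print(store.carriers_of("1-12345-A-G"))
print(store.shared_variants(["ind1", "ind2"]))
```

Keeping two indexes doubles the storage, which is the price paid for answering both kinds of query without a full scan.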

Anyway, all the above reasons make db design and implementation difficult.

  1. Primary key design. This is difficult when a variant has many synonyms.
  2. SQL-style databases are not good at storing large many-to-many relationships. Our exome database with 1.3M variants and 5k individuals has 144M individual-to-variant relationships!

Solutions?

  1. Distributed databases which can be run on multiple nodes, like Cassandra, SOLR, ElasticSearch, BigTable.
  2. Specialised indexing of VCF with bgt or gqt.

The future:

We would like to be in a place where we do not need to worry about data formatting and the practicalities of indexing data. Ideally, I want to give a VCF file to a program and get a queryable database of variants out, so I can focus on the analysis of my data rather than the formatting.

Statistical intuition

I'm often asked how to go about developing some statistical intuition.
Not being a proper statistician might make this hard to answer because it is not intuitive to me.

Some non-intuitive statistical principles that everyone should know:


  • Winner's curse or regression to the mean
  • Multiple testing
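A small simulation may help build intuition for both principles. This is only a sketch under made-up assumptions: effects and measurement noise are standard normal, the "winners" are the top 1% on a first noisy measurement, and null p-values are drawn as uniform numbers. All names and parameter values are illustrative.

```python
import random
import statistics

rng = random.Random(2016)  # fixed seed so the sketch is reproducible


def winners_curse(n_items=10_000, top_fraction=0.01):
    """Select the top items on a noisy first measurement, then re-measure them."""
    true_effects = [rng.gauss(0, 1) for _ in range(n_items)]
    first = [t + rng.gauss(0, 1) for t in true_effects]
    second = [t + rng.gauss(0, 1) for t in true_effects]
    n_top = int(n_items * top_fraction)
    winners = sorted(range(n_items), key=lambda i: first[i], reverse=True)[:n_top]
    discovery = statistics.mean(first[i] for i in winners)
    replication = statistics.mean(second[i] for i in winners)
    return discovery, replication


def null_hits(n_tests=1000, alpha=0.05):
    """Under the null, p-values are uniform on [0, 1]; count the 'significant' ones."""
    p_values = [rng.random() for _ in range(n_tests)]
    nominal = sum(p < alpha for p in p_values)
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return nominal, bonferroni


discovery, replication = winners_curse()
nominal, bonferroni = null_hits()
print(f"winners: first measurement {discovery:.2f}, second measurement {replication:.2f}")
print(f"null tests: {nominal} pass p < 0.05, {bonferroni} pass the Bonferroni threshold")
```

The selected items look less extreme when measured again (regression to the mean), and roughly 5% of tests with no real effect still pass the nominal threshold.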



Saturday 31 January 2015

The coming of age of humanity

Here is just a thought that's crossed my mind a few times. This might not be the best place to share it, but I'd be interested to hear any views on the matter.


The collective being which is humanity has learned a lot since its birth.
A giant with a hazy memory of his early days and disturbing past still faces inner qualms.
It has grown to adulthood; its cells multiply and specialise, some faster than others; it still holds vestiges from its early childhood.


Humanity, as an entity, is a bit like a person going through different stages of life. In its infancy, humanity was uncoordinated, knowledge was lost, and many beliefs were held which were not founded on any sort of empirical evidence.
As humanity matured it went through a pubescent crisis.

I believe (albeit naively) that humanity, the collective consciousness that has evolved through the transmission of knowledge across generations by the spoken and written word, has come of age; that our tolerance, understanding, and scientific openness have reached a point where data can be put in the public space without worrying about confidentiality, judgement or reprisal.
That we become as open about our genetics and medical problems as about our thoughts, religion, sexual orientation. That these things don't become newsworthy anymore. If anything, genetics shows us that we are all exceptions: we all carry private mutations, minor alleles, strange intricacies in our DNA that distinguish us from everyone else.

What is interesting is not what makes us the same but what makes us different.

Group dynamics or group psychology

Democracy is founded on the "if everyone did this" thought exercise.

Statistics and Geometry

Correlation is a cosine: the cosine of the angle between X and Y once each has been mean-centred.

http://www.johndcook.com/blog/2010/06/17/covariance-and-law-of-cosines/

r(X,Y) = cov(X,Y) / sqrt(var(X)*var(Y))
<X,Y> = ||X|| ||Y|| cos(theta)

cos(theta) = <X,Y> / ||X|| ||Y|| = cov(X,Y) / sqrt(var(X)*var(Y)) = r(X,Y)
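The identity can be checked numerically. The sketch below computes r from the covariance and variances, and separately the cosine of the angle between the mean-centred vectors; the function names and the toy data are my own.

```python
import math


def pearson_r(x, y):
    """r(X,Y) = cov(X,Y) / sqrt(var(X) * var(Y)), using n - 1 denominators."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mx) ** 2 for a in x) / (n - 1)
    var_y = sum((b - my) ** 2 for b in y) / (n - 1)
    return cov / math.sqrt(var_x * var_y)


def cosine_of_centred(x, y):
    """cos(theta) = <X,Y> / (||X|| ||Y||) after subtracting each mean."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    norm_x = math.sqrt(sum(a * a for a in xc))
    norm_y = math.sqrt(sum(b * b for b in yc))
    return dot / (norm_x * norm_y)


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson_r(x, y), cosine_of_centred(x, y))
```

The n - 1 factors cancel in the ratio, which is why the two quantities agree.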

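Sum of squares of X is X X' (treating X as a row vector; for a column vector it is X'X).

Matrix approach to regression.

X X' B = X Y

(Here X has one row per variable and one column per observation. With the more common layout of one row per observation, the same normal equations read X'X B = X'Y.)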
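As a worked example, here is a sketch that solves the normal equations for a straight-line fit. It uses the common layout where each row of X is one observation (a 1 for the intercept, then x), so the system reads X'X B = X'Y; with variables in rows it would be written X X' B = X Y. The data and function name are made up for illustration, and the 2 x 2 system is solved by Cramer's rule to keep the code dependency-free.

```python
def fit_line(xs, ys):
    """Solve X'X B = X'Y for B = (intercept, slope), with rows of X equal to (1, x)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sum_x], [sum_x, sum_xx]] and X'Y = [sum_y, sum_xy]
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("X'X is singular: all x values are identical")
    intercept = (sum_y * sum_xx - sum_x * sum_xy) / det
    slope = (n * sum_xy - sum_x * sum_y) / det
    return intercept, slope


intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)  # the points lie exactly on y = 1 + 2x
```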


Avoiding repetition in science writing

Although English is possibly the richest language in terms of the number of words, it may not be the richest in terms of structure or conjunctive terms (terms used to string sentences together to build logical arguments). These are especially useful in science, when we build logical arguments and marry many threads of seemingly contradictory evidence.

Words like "however" and "since", which introduce nuance or causation, get used a lot in English, and even more so in scientific writing.

To avoid repetition, I've regularly looked up synonyms for these words.
I've now done this enough times that I thought I might blog about it.

Formal synonyms of suggest:

The hybrid method we advocate leverages the information available from
targeted qPCR assays ...

The hybrid method we present leverages the information available from
targeted qPCR assays ...

Formal synonyms of since:

The differential bias between cases and controls is very likely to be the result of a batch effect since the case and control DNA samples were prepared and processed in two different centers.

The differential bias between cases and controls is very likely to be the result of a batch effect as the case and control DNA samples were prepared and processed in two different centers.

The differential bias between cases and controls is very likely to be the result of a batch effect given that the case and control DNA samples were prepared and processed in two different centers.

The differential bias between cases and controls is very likely to be the result of a batch effect considering that the case and control DNA samples were prepared and processed in two different centers.

Formal synonyms of have implications:

bearing on


Formal synonyms of however:

We believe that those genes are unlikely to show association in the sample sizes currently available, in light of copy numbers greater than 2 being rare for all KIR genes.
However, given it is true that LD only accounts for presence/absence, not for copy number (i.e. the LD pattern between 0 and 1 is the same as that between 0 and 2), the reviewer is right to remark that these genes cannot be definitively excluded based on the LD frequencies obtained from the Allele Frequency Net database.

We believe that those genes are unlikely to show association in the sample sizes currently available, in light of copy numbers greater than 2 being rare for all KIR genes.
Nonetheless, given it is true that LD only accounts for presence/absence, not for copy number (i.e. the LD pattern between 0 and 1 is the same as that between 0 and 2), the reviewer is right to remark that these genes cannot be definitively excluded based on the LD frequencies obtained from the Allele Frequency Net database.

We believe that those genes are unlikely to show association in the sample sizes currently available, in light of copy numbers greater than 2 being rare for all KIR genes.
On the other hand, since it is true that LD only accounts for presence/absence, not for copy number (i.e. the LD pattern between 0 and 1 is the same as that between 0 and 2), the reviewer is right to remark that these genes cannot be definitively excluded based on the LD frequencies obtained from the Allele Frequency Net database.

Formal synonyms of due to:

This one is a bit tougher...

Thank you for noticing this omission, which is due to a typographical mistake.

Thank you for noticing this omission, which is because of a typographical mistake.

Thank you for noticing this omission, which is the result of a typographical mistake.



Population theory: within individual, between individual and between population variation

The underlying idea in clustering is to find a labelling of the data (clusters) which minimises the ratio of within to between cluster variation.
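To make the criterion concrete, here is a sketch that scores a labelling of one-dimensional data by the ratio of within-cluster to between-cluster sum of squares. The function name and the toy measurements are hypothetical, and real clustering methods optimise this kind of objective rather than just scoring given labels.

```python
from collections import defaultdict


def within_between_ratio(values, labels):
    """Ratio of within-cluster to between-cluster sum of squares (lower is better)."""
    clusters = defaultdict(list)
    for value, label in zip(values, labels):
        clusters[label].append(value)
    grand_mean = sum(values) / len(values)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        centre = sum(members) / len(members)
        within += sum((v - centre) ** 2 for v in members)
        between += len(members) * (centre - grand_mean) ** 2
    # If all cluster centres coincide there is no between-cluster variation.
    return within / between if between > 0 else float("inf")


values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
good_labels = [0, 0, 0, 1, 1, 1]  # groups the low values and the high values
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes low and high values in each cluster

good_ratio = within_between_ratio(values, good_labels)
bad_ratio = within_between_ratio(values, bad_labels)
print(good_ratio, bad_ratio)
```

The labelling that separates the two obvious groups gives a ratio close to zero, while the mixed labelling gives a ratio well above one.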


Some parallels may be drawn between analysing populations of cells within one individual and analysing populations of individuals.


Obviously there are clear differences:
In the case of cells the difference is in expression level (mRNA and protein), whereas at the individual level the difference is in sequence (DNA level).
The time scale and dynamics are very different. Gene expression variation is very flexible, whereas the mechanisms underlying DNA sequence variation are much better understood.


Some principles from analysing populations of cells are applicable to populations of individuals, although the data is longer rather than wider.


Both are snapshots of a dynamic system working on very different timescales.
In both scenarios we are trying to find latent classes/clusters which explain the variation.
Clustering methods are particularly relevant.
While density-based methods are more suited to data matrices of many rows, distance-based methods may be more suited when the number of columns is greater.
For example, the idea of a cell lineage is comparable to the lineage of an individual.
Population bottlenecks are as applicable to genetic diversity as to cellular diversity.


Identification of latent factors or discovery of events which influence the population dynamics.


Within these datasets I have identified, using various clustering approaches, populations of cells and phenotypes on those which correlate with genetic variation.
I have developed efficient clustering methods using Bayesian mixture models to identify rare clusters while accounting for large variation between samples.
I have also extended this mixture model approach to clustering genetic data, and this has led to the largest association study of genes of the KIR region with type 1 diabetes.


Individuals, like cells within an individual, can be considered independent.
There is proximal structure in the genetic code (LD) which can be exploited.


Statistical physics, Gaussian probability distribution, additivity

Is all of deterministic physics emergent behaviour from quantum physics?
Law of large numbers?

The seemingly deterministic large scale is an emergent behaviour of the probabilistic small scale.

The trajectory of single particles is probabilistic but the trajectory of the larger object is deterministic.
In the same way that in a large crowd of people the movement of a single individual is unpredictable, while the movement of the crowd as an entity is much more predictable.

In weather prediction, we can predict large-scale weather patterns, say over a large area, but we cannot predict with much certainty whether it will rain in Cambridge tomorrow.

When observing very small amounts of data, our measurements are far more uncertain.

On the large scale things tend to be normally distributed, so the emergent behaviour of the system tends to be symmetric, since most small-scale forces cancel to give rise to large-scale equilibrium.
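A quick simulation illustrates the cancellation. This is a toy sketch, not a physical model: each "particle" contributes an independent push drawn uniformly from [-1, 1], and the large-scale quantity is the average push. The function names and the numbers of particles and trials are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def net_force(n_particles):
    """Average of n independent small pushes, each uniform on [-1, 1]."""
    return sum(rng.uniform(-1, 1) for _ in range(n_particles)) / n_particles


def spread(n_particles, n_trials=500):
    """Standard deviation of the net force across repeated trials."""
    return statistics.stdev(net_force(n_particles) for _ in range(n_trials))


spread_small = spread(1)      # a single particle: highly unpredictable
spread_large = spread(1000)   # a thousand particles: pushes largely cancel
print(f"spread for 1 particle: {spread_small:.3f}")
print(f"spread for 1000 particles: {spread_large:.3f}")
```

The spread of the average shrinks roughly as one over the square root of the number of particles, which is why the aggregate looks nearly deterministic even though each component is random.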

Does non-linear imply non-additive?
Does non-linear imply interaction effects? I.e. non-marginal effects.
Marginal effects can be detected by summing/integrating over all latent variables.
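One illustrative case, which does not answer the questions above in general: in a pure interaction model the effect of x is strong for each value of a latent variable z, yet it vanishes once z is summed out. The model below (outcome equal to 1 only when x and z differ, with z equally likely to be 0 or 1) is entirely hypothetical.

```python
def expected_outcome(x, z):
    """Hypothetical model: the outcome is 1 only when x and z differ (pure interaction)."""
    return float(x != z)


# Assumed distribution of the latent variable z, independent of x.
P_Z = {0: 0.5, 1: 0.5}


def marginal_outcome(x):
    """Sum the conditional expectation over the latent variable z."""
    return sum(expected_outcome(x, z) * p for z, p in P_Z.items())


conditional_effects = {z: expected_outcome(1, z) - expected_outcome(0, z) for z in P_Z}
marginal_effect = marginal_outcome(1) - marginal_outcome(0)
print(conditional_effects)  # effect of x within each value of z
print(marginal_effect)      # effect of x after summing over z
```

Here the conditional effects are +1 and -1, and they cancel exactly in the marginal, so a purely marginal analysis would miss the dependence on x.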