Mol. Evolution- Exam 2
What was the main impetus behind the development of Bayesian analysis?
- Maximum likelihood is very computationally intense; Bayesian analysis is a faster way to apply models of evolution and statistical approaches.
- Response to the increasing intractability of maximum likelihood analyses.
- Not an optimality method.
How is the statistical calculation for the tree score different from ML analyses? Bayesian score = Prob(Hypothesis|Data); ML score L = Prob(Data|Hypothesis)
Hypothesis is the tree itself.
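For comparison, the two scores can be written side by side using Bayes' theorem. The symbols T (tree, the hypothesis) and D (data) are introduced here only for illustration; they are not from the course notes.

```latex
\begin{align}
L_{\text{ML}} &= P(D \mid T) \\
P(T \mid D) &= \frac{P(D \mid T)\, P(T)}{P(D)}
\end{align}
```

Here P(T) is the prior probability of the tree and P(T | D) is its posterior probability.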
Accuracy
True, the right answer
Precision
- How certain you are or how small is your level of uncertainty.
- Do people agree?
- Precision may be a little wide.
- How narrow is range of true/correct answer.
How do we know if our phylogenies are accurate?
- We never know 100%.
- We use simulated data sets that hopefully accurately represent genetic data.
If we can never know for sure, what two approaches are used to see if the methods we use are good at accurately determining historical phylogenetic relationships?
- Known phylogenies like bacterial sets
- Simulated data sets
- Congruence of methods
Prior probability
- Beginning guess, estimate the parameters of our model of evolution, starting point
- If I go into an analysis with some information, then I have more power to help inform my subsequent analysis
Posterior probability
- Probability of the tree after the prior probability has been updated with the data
- Becomes the prior probability of the next tree
Burn-in period
- Early part of the run, before the tree scores reach a plateau where no matter how many more times we do it the scores are maxed out; throw away the bad trees from the burn-in but save the good trees from the plateau
- Use trees as stepping stones for better estimates
MCMC
- Way to rapidly estimate statistical scores of the trees
- Shortcut to speed up estimations we have to do when there's too much (complex)
- A quick way to simulate (sample) a complex search space
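A minimal sketch of the MCMC idea, assuming a toy problem with only three candidate trees whose scores (prior times likelihood) are already known. The tree labels, the scores, and the function name `metropolis` are all made up for illustration; real programs propose small changes to the tree and its parameters instead of picking a tree at random.

```python
import random

def metropolis(scores, n_steps, seed=0):
    """Sample states in proportion to their (unnormalized) scores.

    `scores` maps a hypothetical tree label to prior * likelihood.
    Proposals pick any state uniformly, so the proposal is symmetric.
    """
    rng = random.Random(seed)
    states = list(scores)
    current = states[0]
    chain = []
    for _ in range(n_steps):
        proposal = rng.choice(states)
        # Accept with probability min(1, score(proposal) / score(current)).
        if rng.random() < scores[proposal] / scores[current]:
            current = proposal
        chain.append(current)
    return chain

# Toy scores (assumed values, not real data).
toy_scores = {"((A,B),C)": 6.0, "((A,C),B)": 3.0, "((B,C),A)": 1.0}
chain = metropolis(toy_scores, n_steps=20000)
# How often each tree was visited approximates its posterior probability.
freqs = {t: chain.count(t) / len(chain) for t in toy_scores}
```

The chain never needs the normalizing constant; only ratios of scores are used, which is why this is a shortcut when the full calculation is too complex.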
What is the final outcome of a Bayesian analysis and how is this represented in a phylogeny?
- Hundreds of thousands of trees from the plateau phase and create one big tree out of those
- Estimates faster than maximum likelihood and also gives branch values showing how often each relationship appeared
- A consensus tree created from a very large set of phylogenies saved from after the "burn-in" period with numbers above each node representing the posterior probabilities of each relationship
1. Hundreds of thousands of trees
2. Consensus tree for all trees
3. Support values
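The three steps above can be sketched roughly as follows. This assumes each sampled tree is stored as a set of clades and each clade as a frozenset of taxon names; that representation, the function name `clade_support`, and the toy samples are illustrative choices, not the output format of any particular program.

```python
from collections import Counter

def clade_support(sampled_trees, burn_in):
    """Return the fraction of post-burn-in trees that contain each clade.

    Each tree is represented (an assumption for this sketch) as a set of
    clades, and each clade as a frozenset of taxon names.
    """
    kept = sampled_trees[burn_in:]          # discard the burn-in samples
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept) for clade, n in counts.items()}

AB, CD, AC = frozenset("AB"), frozenset("CD"), frozenset("AC")
# Toy sample: the first two trees stand in for the burn-in period.
samples = [{AC}, {AC}, {AB, CD}, {AB, CD}, {AB}, {AB, CD}]
support = clade_support(samples, burn_in=2)
# Majority-rule consensus: keep clades found in more than half of the trees.
consensus = {c for c, p in support.items() if p > 0.5}
```

The values in `support` play the role of the posterior probabilities written above each node of the consensus tree.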
Bayesian analysis
Primary method of figuring out relationships.
Advantages:
- Faster than maximum likelihood
- Get result and estimate of process (probability support value)
- Bayesian analyses incorporate models of evolution in a statistical framework, but are more efficient than maximum likelihood methods
Posterior probability values
- How confident we are about the relationship in each node
- Measures precision
- If it's 100 = very high percentage they were related
Congruence
- Simply if two phylogenies match
- Precision across our results
What is the best method to find the phylogenetic relationships of a group and what does congruence have to do with this?
- Bayesian analysis
- Congruence of methods shows accurate representation/relationships of many organisms
Support measures
Estimate precision, how confident we are and how narrow our range is between the relationship
What are branch support measures testing?
Precision more than accuracy
Bremer support value
- Only for parsimony analysis
- Take every node on the tree and find very best tree
- Number represents difference between best tree and the tree that doesn't support the relationship
- Not confident in relationship if it's a small number
- Can't compare from one analysis to another
Jackknife support value
- Can be used with any analysis (originally developed for parsimony)
- Take out one taxon and redo the analysis and see if everything looks the same
- If it all looks the same it means that the species didn't have an effect and very certain of relation
- Sensitivity to the taxa that are included
- Originally in parsimony, get analysis and get best tree and then redo and take out species
Taxon sampling, what is the effect of poor taxon sampling on a phylogeny?
- Process of selecting representative taxa for a phylogenetic analysis
- Lack of information, may not be accurate, may be missing so much information we didn't see the connections, want to include every species but we can't do that for diverse samples, it can mean we are missing some part of connection between species
Bootstrap support value
- Generally applicable
- Pseudoreplicate: a replicate data set reconstituted from the original data set
- Recreate the data set by sampling characters; some are sampled multiple times and some not at all
- Some pseudoreplicates will not contain the characters that support a given relationship
- We get the best tree from thousands of trees by summarizing them
- Numbers between 50-100
- Can compare one analysis to another because it's done on 100% scale
- Widely applicable, used a lot
- Form of pseudoreplication: data points can be sampled more than once (sampling with replacement)
- Tells how consistently the data support a relationship across the entire data set
Bootstrap
1. Randomly resample characters with replacement to make a new data set the same size as the original (homology= data, point of evidence)
2. Find best topology (phylogeny) using new data set (pseudo replica)
3. Repeat (replicates), take all trees and make consensus tree
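The three bootstrap steps can be sketched as below. The alignment format (a dict of taxon name to sequence) and the function names are assumptions for illustration, and `find_best_clades` is only a placeholder for whatever tree search is used; it is not a real library function.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (characters) with replacement.

    `alignment` maps taxon name -> sequence; all sequences share one length.
    The pseudoreplicate has the same number of columns as the original.
    """
    n_chars = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(seq[i] for i in cols)
            for taxon, seq in alignment.items()}

def bootstrap_support(alignment, find_best_clades, n_reps=100, seed=0):
    """Percentage of pseudoreplicates whose best tree contains each clade.

    `find_best_clades` is a placeholder for the tree search (parsimony,
    ML, ...); it must return the clades of the best tree for a data set.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_reps):
        rep = bootstrap_replicate(alignment, rng)
        for clade in find_best_clades(rep):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: 100 * n / n_reps for clade, n in counts.items()}
```

Because the same column index can be drawn several times, some characters appear more than once in a pseudoreplicate and others not at all, which is what sampling with replacement means here.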
Posterior probability
Bayesian support measure from the last analysis
Why is the bootstrap support value most widely used?
- Widely applicable to all methodologies
- Relatively easy
- No new data
What are the bootstrap drawbacks?
1. Estimates precision, not accuracy
2. Tend to overestimate confidence
3. Assumes independence
4. Computationally intense
Supertrees
- Doesn't gather any new data
- Take analyses that have already been done and create a method to put them together
- A topology composed from different analyses, with or without some sort of formal analysis to combine them
- Allows us to combine results from incompatible data sets
- Finds out areas of consensus, what if we don't agree how to represent all species
- True or false: supertree methods ensure that we will find the true set of relationships as long as the underlying assumptions are not violated (false)
Review what is meant by a consensus approach and a total evidence approach in making phylogenies and how this relates to supertree methods.
- Consensus approach- supertree is the agreement between total analysis
- Total evidence approach- take all the data and make a tree
Why did people start creating supertrees?
1. The unwieldiness of analysis, gets harder to work with bigger data sets
2. Like to summarize what has already been done, more formal way to summarize analyses
- They were originally created as a way to combine results from separate analyses where the underlying data was not congruent enough to assemble into a single data set
What is an informal supertree?
- No objective way to put them together/analysis that occurs
- Cut and paste, kind of know from other studies how some species are related, paste it with what is known with other species/groups
- Not foolproof, gives overall picture, doesn't have second analysis
Be able to outline the two processes used to make formal supertrees
- Agreement
- Optimization via matrix representation
(formal doesn't do well when there's conflict)
Agreement
- Making a consensus tree
- Here's phylogeny 1 and phylogeny 2, stick them together if they agree
Drawback: removed from analysis because incorrect data gets piled
Optimization via matrix representation
- Second round of analysis that goes on that synthesizes the original one, make a big matrix based on the trees
(more objective way to help decide when conflict)
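One common way to "make a big matrix based on the trees" is Baum-Ragan style coding, where every clade in every source tree becomes one binary character. The notes do not say which coding the course uses, so treat this as an assumed example; the input format and the function name `matrix_representation` are also made up for illustration.

```python
def matrix_representation(source_trees):
    """Baum-Ragan style coding of source trees as a binary matrix.

    Each source tree is given (an assumption of this sketch) as a pair
    (taxa_in_tree, clades). Every clade becomes one column:
    '1' = taxon is in the clade, '0' = taxon is in the tree but not in
    the clade, '?' = taxon was not sampled in that source tree.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon in clade:
                    matrix[taxon] += "1"
                elif taxon in taxa:
                    matrix[taxon] += "0"
                else:
                    matrix[taxon] += "?"
    return matrix

# Two toy source trees with partly overlapping taxa.
tree1 = ({"A", "B", "C"}, [{"A", "B"}])
tree2 = ({"B", "C", "D"}, [{"C", "D"}])
mrp = matrix_representation([tree1, tree2])
```

The resulting matrix is then analyzed in a second round (for example with parsimony) to produce the supertree.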
What are the major criticisms of supertree methods?
Meta-analysis: an analysis of previous analyses without the original data, so there is a lot of imprecision
1. No primary data
2. No "signal enhancement"
3. Novel clades not supported in source data (ended up with new relationships)
4. Inadvertent replication of source data (opposite of signal enhancement)
What is the more recent reason why people have proposed a supertree approach to building phylogenies?
There is so much data available and it can be impossible to do single phylogenies
Disk covering method
- Estimate relationships and then create supertree
- Need to have vague idea of relationships
- Gather new data and then use supertree methods to put them all together
- Areas with overlap, better fit and less sampling
Biclique method
- Find large data sets from what is currently available and put them together in a sequence analysis that is based on the data
- Reanalyze data that is already available and put them together
- Put together matrix that represents different groups
- Helps identify good data sets
Reconstructing ancestral states
Look at the certain DNA for a certain sequence
Synapomorphy
- Single mutation that has been passed on to all the descendants
- Ex: feathers for birds
- Ex: making milk in mammals
Symplesiomorphy
- Shared ancestral characteristic inherited from a distant common ancestor; it may be lost in some descendants
- Ex: four limbs in tetrapods, where whales don't have legs
Convergence
- Similar characters derived independently
- Ex: bat, bird, and insect wings all derived independently but same function
Autapomorphy
New characteristic but only in one species, not helpful
Which of the above character patterns provides direct evidence for classification of species into higher taxa?
Synapomorphies. All the other ones are noise/problems.
Fitch optimization
- Method to guarantee the most parsimonious mapping of complex characters
- Step by step mapping to tell us where mutations happened
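A minimal sketch of the bottom-up pass of Fitch optimization for a single character, assuming a rooted binary tree written as nested tuples. The tree encoding, the function name `fitch`, and the toy states are illustrative assumptions.

```python
def fitch(tree, tip_states):
    """Fitch's bottom-up pass for one character on a rooted binary tree.

    `tree` is a nested tuple such as (("A", "B"), ("C", "D")); tips are
    strings. Returns (state set at this node, minimum number of changes).
    """
    if isinstance(tree, str):                 # tip: its observed state
        return {tip_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], tip_states)
    right_set, right_cost = fitch(tree[1], tip_states)
    shared = left_set & right_set
    if shared:                                # children agree: no change
        return shared, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

tree = (("A", "B"), ("C", "D"))
root_set, changes = fitch(tree, {"A": "G", "B": "G", "C": "T", "D": "G"})
```

In this toy case the root state set is {"G"} and one change is needed, placed on the branch leading to C.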
Dollo parsimony
- Once a trait is lost it is not re-evolved
- Dollo parsimony is best applied to the origin and evolution of complex features such as wings
- Ex: ancestors of stick insects are part of the winged insect group. The common ancestor of stick insects lost its wings, but some current stick insects have wings; they "re-evolved" them, but basically they still carried the wing genes and just turned them back on
- Ex: loss of teeth in vertebrates, teeth evolved only once at the origin of vertebrates and were then lost multiple times in turtles, birds, seahorses
Make sure you understand how mapping character traits onto a phylogeny allows us to reconstruct ancestral states
Once we map a trait on a phylogeny it allows us to see variation
How is inference of ancestral states different under a maximum likelihood assumption?
Take into account branch lengths and models of evolution, what types of mutation are more likely
Convergence
- Bird, bat, and insect wings evolved independently but all used to fly
- Complex eyes of vertebrates, cephalopods, jellyfish, and arthropods evolved separately but are associated with vision
Be able to briefly outline Shimodaira-Hasegawa (SH) test
1. Uses bootstrap procedure
- Creates spread of possibilities
- Range of trees to accept or refuse
2. Tests whether an alternative hypothesis is significantly different from the best phylogeny
SH test drawbacks
- Has to create a distribution
- Anything that is a weakness of the bootstrap will be a weakness of SH test
- Tends to overestimate
- Not the best way to select which model of evolution to use for a genetic data set
Likelihood ratio test
- Is very flexible
- Uses a likelihood score
- Is widely applicable
What are the four primary uses of LRT?
1. Different phylogenies (is one phylogeny better than the other/significantly different?)
2. Molecular clock (can we use this data to estimate divergence time? only in limited situations, in data that's evolving naturally, can't predict in the short-term but can predict in the long-term)
3. Models of evolution (is this one better than this one?)
4. Looking for signs of natural selection in protein coding genes (is natural selection working on this gene or part of this gene?, purple= purifying selection, yellow= weak positive selection, red= strong positive selection)
How does one select a model of evolution?
- Do a stepwise comparison of all the different models, which is statistically different, and determine which one to choose
- Multiple tests to find best one
- A more complex model is not automatically more accurate; extra parameters add uncertainty, so only choose it if it is significantly better
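One step of such a stepwise comparison can be sketched as a likelihood ratio test between two nested models. This is a simplified sketch: it only handles 1 or 2 extra parameters (where the chi-square tail has a simple closed form), and the log-likelihood values in the example are hypothetical, not real results.

```python
import math

def likelihood_ratio_test(lnL_simple, lnL_complex, extra_params):
    """Compare nested models; return (test statistic, p-value).

    The statistic 2 * (lnL_complex - lnL_simple) is compared to a
    chi-square distribution with `extra_params` degrees of freedom.
    Only df = 1 and df = 2 are handled here, via closed-form tails.
    """
    stat = 2.0 * (lnL_complex - lnL_simple)
    if stat <= 0:
        return stat, 1.0                      # complex model is no better
    if extra_params == 1:
        p = math.erfc(math.sqrt(stat / 2.0))
    elif extra_params == 2:
        p = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only supports 1 or 2 extra parameters")
    return stat, p

# Hypothetical log-likelihoods for two nested models (not real results).
stat, p = likelihood_ratio_test(lnL_simple=-1234.5, lnL_complex=-1230.0,
                                extra_params=1)
```

A small p-value suggests the more complex model fits significantly better; otherwise the simpler model is kept and the next pair of models is compared.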
Orthology
- Two genes that can trace their common history back to a speciation event
- Two homologous genes, their divergence can be traced back to an ancient speciation event that split the most recent common ancestor of the two species with these genes into separate branches
Paralogy
- Two genes whose common ancestor went through a gene duplication instead of a speciation event
- Two homologous genes, their divergence can be traced back to a gene duplication event that predates the most recent common ancestor of the two species in which we find the genes
Xenology
- Horizontal gene transfer event can make a gene history not match the species history
- Huge problem for bacteria phylogenies
- Occurs but is very rare in eukaryotic genes
- Two homologous genes, one of them went through a horizontal gene transfer event and is now part of the genome of an organism very distantly related to the organism that has the other gene
Which of these subclasses are of use when trying to infer phylogenetic history?
Orthologous genes
Lineage sorting
- If speciation process is short and coalescence is fast, then there is no problem
- History of alleles doesn't trace the species history
- More than one allele in a population, one will be lost because of genetic drift
What are two processes that would make lineage sorting more likely?
Rapid speciation and long coalescence time
What are the three things that can cause a gene tree to conflict with a species tree (even when both trees are reconstructed accurately)?
- Gene duplication
- Horizontal gene transfer
- Coalescence (lineage sorting)
Gene duplication
Connected with paralogy
Horizontal gene transfer (plasmid/transformation, vectors with virus, and pilus)
Connected with xenology
Strict consensus tree
- Summarizes two, three, or more trees by showing only the relationships that all of the trees agree on
- Very little resolution
Majority consensus tree
- How many trees show that relationship
- Better resolution
- D is more closely related to ABC than E
Pseudogene
Duplicated gene that no longer functions (still in the genome but is part of noncoding DNA)
Neofunctionalization
Duplicated gene now has a different function from what it did ancestrally (related to anagenesis)
Anagenesis
- Process of generating new potential and diversity within a species over time
- Change in function over time but without any genes being created
Cladogenesis
- Process of speciation events where we generate new clades from a single ancestral population carrying characteristics into lineages and new functions
- Speciation process creating new clades or new groups
Is there a single species definition that can define all species?
No because speciation is a process, not an event and different species may have different processes that help establish the separation of a population into different species
Does this mean that the concept of a species is a human idea and not a biological reality? How can we reconcile this discrepancy?
Recognize it's a process and there are slight differences in some groups compared to others
Morphological species concept
- Most widely used
- Cats are different from dogs by looking at them
- Weakness where there is not enough morphological diversity to tell them apart
Strengths:
- Simple and easy
- Don't need special equipment, just need observation skills
Weaknesses:
- Need education on terms
- Need to be careful when there's a wide range of characteristics
Biological species concept
- Groups of actually or potentially interbreeding populations which are reproductively isolated from other such groups
- Used by defining rates of gene flow
- Can they exchange genetic material and at what level? Only relates to sexually reproducing species
Strengths:
- A little bit more scientific and objective
Weaknesses:
- Gene flow isn't 0 or 100, more in the middle
- Takes a lot of time and resources to get data set
- Can't use this for asexual species
Phylogenetic species concept
- Smallest monophyletic group distinguished by a shared derived character
- Only when other two don't work
Strengths:
- Very objective methodology
Weaknesses:
- Difficulties for asexual species
- Takes time and effort but it is only one thing, not multiple
Review the Wheeler paper and his arguments for the Phylogenetic Species Concept (PSC) being the single, unifying species concepts.
Even the PSC has its own weaknesses, because it is a complex method when the simple morphological species concept can often be applied instead
What biological process does the PSC have a particular problem with?
- Asexual reproducers, ex: E. coli
- Hybrids and horizontal gene transfer: interbreeding would mess things up because trees would turn into networks