GenePlot





GenePlot Help

The basic procedure for using GenePlot on the Web is as follows:

  1. Upload a data file containing allele data for reference populations and any additional individuals you wish to assign.
  2. Choose two or more populations from the data to act as the reference populations.
  3. [Optional] Choose additional groups of individuals to assign.
  4. Run GenePlot.
  5. View results (as a graph or a table).

You can repeat the analysis for different populations from the same file without uploading the file again. To do this, change your selection of reference populations and/or groups to be assigned, or change the method to use, and click "Run GenePlot" once you have finished your selections.

Uploading Files

The app can accept Genepop format files, or the file format defined on the "Example File" tab. Your data file can include data from many populations, and you can then analyse them in different combinations. For Genepop format files, the populations will be named "Pop1", "Pop2", etc. in the order that they appear in the file.

Reference Populations

After you have uploaded a file, GenePlot will detect the populations listed in the data. You can then choose any of those populations to be reference populations. You can use GenePlot to analyse the genetic structure of the reference populations, or to assign individuals by comparing them to the reference populations.

Choose reference populations by holding down the "Ctrl" key and then using the mouse to click on multiple populations. Scroll down the list to see more populations. If you want to deselect a population, press "Ctrl" again and then click on the population to deselect it; the other populations you have chosen will still be selected.

You must choose at least two reference populations. If you choose 2 reference populations, GenePlot will display a graph with the fit for one population on the x-axis and the fit for the other population on the y-axis. If you choose more than 2 reference populations, GenePlot will carry out PCA (principal component analysis) on the fit results with respect to all the populations, and display the first two principal components as the x- and y-axes of the graph. That is, the x-axis will show the linear combination of population fits that gives the best separation of individuals. If you have chosen 4 or more reference populations, then you can choose to show the 1st and 2nd principal components or the 3rd and 4th principal components.
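
The projection step described above can be sketched as follows. This is only an illustration under stated assumptions: the fit results are taken to be a matrix with one row per individual and one column per reference population, the numbers are made up, and the function name and the use of NumPy's SVD are hypothetical choices rather than GenePlot's own implementation (which may centre or scale the fits differently).

```python
import numpy as np

def principal_components(fits, n_components=2):
    """Project an (individuals x populations) fit matrix onto its leading PCs."""
    fits = np.asarray(fits, dtype=float)
    centered = fits - fits.mean(axis=0)        # centre each population's fit column
    # SVD of the centred matrix: the rows of vt are the principal directions
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T    # coordinates of each individual
    explained = (s ** 2) / np.sum(s ** 2)      # share of variance per component
    return scores, explained[:n_components]

# Hypothetical fits for 5 individuals against 3 reference populations
example_fits = [
    [-10.0, -18.0, -20.0],
    [-11.0, -17.5, -21.0],
    [-19.0, -9.5, -18.0],
    [-18.5, -10.5, -19.0],
    [-20.0, -19.0, -9.0],
]
scores, explained = principal_components(example_fits)
```

Here `scores[:, 0]` and `scores[:, 1]` would be the x- and y-coordinates of each individual, and requesting `n_components=4` would also give the 3rd and 4th components when four or more populations are used.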

Groups to be Assigned

After selecting reference populations, you can also select groups of individuals to be assigned. GenePlot will list the population labels found in the data file, so choose labels for your individuals to be assigned that will help you to identify them. If you need to be able to distinguish between specific individuals on the graph, list them in the data file with a different population label for each individual.

You can choose as many groups to be assigned as you like.

Method

You can run GenePlot using leave-one-out, if required. Leave-one-out is strongly recommended for data containing small population samples (i.e. fewer than about 30 samples).

The principle of leave-one-out is that when we calculate the fit of a reference individual with respect to its own reference population, we leave that individual out of the population. When we calculate the fit of other individuals, we include it in its population again.

As an example, imagine that you have two reference populations, Pop A and Pop B, and individual A1 was sampled in Pop A. When you calculate the fit of individual A1 with respect to Pop A, you temporarily remove individual A1 from Pop A before estimating the allele frequencies for Pop A.

If we do not use leave-one-out, then when we calculate the fit of a reference sample, the fit will be positively biased because the individual's data is used both to characterise the population and to calculate the fit. If the reference population only has a few samples, every one of those samples can have a big effect on the estimated allele frequencies for the population. Overall, the level of separation between the populations will tend to be exaggerated, because each reference sample will have a better fit to their own population than they should.
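
The Pop A / A1 example can be sketched in a few lines. This is a simplified illustration for a single locus, assuming allele copies are tallied in a `Counter` and frequencies are estimated as posterior means under a symmetric prior; the function names, the allele values and the counts are all hypothetical, and GenePlot's own calculation is not shown here.

```python
from collections import Counter

def allele_frequencies(counts, alleles, prior=1.0):
    """Posterior-mean allele frequencies at one locus under a symmetric prior."""
    total = sum(counts[a] for a in alleles) + prior * len(alleles)
    return {a: (counts[a] + prior) / total for a in alleles}

def leave_one_out_frequencies(pop_counts, genotype, alleles, prior=1.0):
    """Estimate frequencies for a population after removing one member's genotype."""
    reduced = Counter(pop_counts)      # copy, so the full population is untouched
    reduced.subtract(genotype)         # temporarily remove the individual's two alleles
    return allele_frequencies(reduced, alleles, prior)

# Hypothetical locus: Pop A has 3 sampled individuals (6 allele copies)
pop_a = Counter({96: 3, 126: 2, 122: 1})
alleles = [96, 122, 126]
a1_genotype = (122, 126)               # individual A1 was sampled in Pop A

with_a1 = allele_frequencies(pop_a, alleles)
without_a1 = leave_one_out_frequencies(pop_a, a1_genotype, alleles)
```

In this small sample, removing A1 lowers the estimated frequency of allele 122 from 2/9 to 1/7, which shows how much one individual can influence the frequencies it is then compared against.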

Prior

If an allele was not found in the sampled data from one of the populations, that could be simply because it is rare in the population, rather than non-existent. In other words, the estimated allele frequencies for the populations are subject to sampling error.

The assignment algorithm uses a Bayesian method to infer the allele frequencies of each population, based on the sample data. It is necessary to set a prior on the allele frequencies at each locus. By setting non-zero prior frequencies for all possible alleles at that locus (i.e. any alleles found within the chosen reference populations), all of these alleles will have a non-zero estimated posterior frequency, which allows for sampling error.

There are two standard priors used for assignment. One is defined in Rannala & Mountain (1997), and takes the value 1/k for every allele at a locus, where k is the number of distinct alleles at that locus. This is the default prior. The other is defined in Baudouin & Lebrun (2001), and takes the value 1 for every allele at a locus.

The prior defined by Baudouin & Lebrun may be more suitable for small samples, especially when the true population is thought to be much larger. Small samples from large populations are subject to much greater sampling error. The Baudouin & Lebrun prior penalizes rare alleles less than the Rannala & Mountain prior does, and thus compensates more for the sampling error.
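
To see how the two prior values (1 and 1/k) treat an allele that was not sampled in a population, here is a minimal sketch. It assumes a symmetric Dirichlet prior and reports posterior-mean frequencies only; the function name and the allele counts are hypothetical, and the full assignment calculation is not reproduced.

```python
def posterior_mean_frequencies(counts, prior_value):
    """Posterior-mean allele frequencies under a symmetric Dirichlet prior.

    counts: dict mapping allele -> observed copies at one locus
    prior_value: prior parameter applied to every allele at the locus
    """
    total = sum(counts.values()) + prior_value * len(counts)
    return {a: (n + prior_value) / total for a, n in counts.items()}

# Hypothetical locus with k = 4 alleles; allele 262 was not seen in this population
counts = {234: 6, 250: 3, 258: 1, 262: 0}
k = len(counts)

flat = posterior_mean_frequencies(counts, 1.0)        # prior value 1 per allele
scaled = posterior_mean_frequencies(counts, 1.0 / k)  # prior value 1/k per allele
```

Both choices leave the unseen allele with a non-zero estimate: 1/14 under the prior value 1 and 0.25/11 under the prior value 1/k in this example.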

Results

The results of the analysis are shown as a graph (on the Graph tab) and as a table (on the Results tab).

Individuals from different populations and groups are displayed on the graph with different colours.

Individuals who have missing loci are marked on the graph with an asterisk "*" inside their symbol, and are marked in the results with "impute" status. GenePlot also reports how many individuals were excluded from the analysis because they had data at too few loci. For individuals with missing data, the results show the "raw" fit for each population based on the loci that were present for the individual, and also the final fit for all the loci, obtained via the saddlepoint method.

If you have selected two reference populations, the x-axis will show the fit of each individual with respect to one of the populations, and the y-axis will show the fit of each individual with respect to the other population. The thick diagonal line is the line of equal fit in both populations. The thinner diagonal lines on either side are the lines at which the fit for one population is 9 times larger than the fit for the other population.

On a graph for two populations, the vertical lines show the 1% and 99% quantiles for the population on the x-axis, and the horizontal lines show the 1% and 99% quantiles for the population on the y-axis. This means that an individual on the 99% quantile for Pop A has a better fit for Pop A than 99% of all theoretical individuals that could possibly come from Pop A; in other words, the individual has a very good fit to Pop A. An individual on the 1% quantile for Pop A has a worse fit for Pop A than 99% of all theoretical individuals that could possibly come from Pop A; in other words, the individual has a very poor fit to Pop A. The method for calculating these quantiles is explained briefly on the Background tab.

If you have selected more than two reference populations, GenePlot performs PCA (principal component analysis) on the fit results for all the populations, and will then display the first two principal components as the x- and y-axes. That is, the x-axis will show the linear combination of fits with respect to different populations that gives the best separation of fit results for the individuals in the data (including reference individuals and individuals to be assigned).

You can save a particular plot by right-clicking on the plot and selecting "Save Image". You will need to save the file as a .PNG file, e.g. "Results_plot.png".

Data Files

The GenePlot app can accept Genepop format files: http://genepop.curtin.edu.au/help_input.html. For Genepop format you need to specify whether you are using 2 or 3 digits per allele BEFORE choosing a file to upload. As an alternative, you can also use our GenePlot format directly.
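
The 2- versus 3-digit setting matters because a Genepop genotype is written as one run of digits that has to be cut in half. The sketch below illustrates this; `split_genotype` is a hypothetical helper written for this example, not part of GenePlot or Genepop.

```python
def split_genotype(code, digits):
    """Split a Genepop genotype code such as '096126' into its two alleles."""
    if digits not in (2, 3):
        raise ValueError("digits must be 2 or 3")
    if len(code) != 2 * digits or not code.isdigit():
        raise ValueError(f"expected {2 * digits} digits, got {code!r}")
    first, second = int(code[:digits]), int(code[digits:])
    return first, second

print(split_genotype("096126", 3))  # (96, 126)
print(split_genotype("0102", 2))    # (1, 2)
print(split_genotype("000000", 3))  # (0, 0), i.e. both alleles missing
```

Reading a 3-digit file with the 2-digit setting would raise an error here rather than silently producing wrong alleles.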

An example file in GenePlot format can be found at https://www.stat.auckland.ac.nz/~fewster/GenePlot/

Data uploaded to GenePlot has to be in CSV (comma-separated) format.

The file must include a Data section and a Locnames section (which lists the names of the loci).

Every individual in the data file must have an ID and a population label. Individuals from the reference populations will have the name of the population they were sampled from, but additional individuals that you want to assign must also be given a population label. If you want to be able to distinguish specific individuals on the GenePlot graph, give each one a different population label. You can use any alphanumeric label for each population, group or individual.

Each of the keywords "DATA", "END", and "LOCNAMES" must appear only once in the file. Please do NOT repeat them.

Genetic Data

The main table of genetic data must be preceded by a single line containing the word "DATA", in capitals. The following lines should contain the data, using one row per individual. The line after the end of the data should contain the word "END" in capitals.

Do NOT include a header row in the data table. The line after the word "DATA" should contain the first entry.

The first two columns of the data table must give the IDs and populations of the individuals. The remaining columns should contain the allele names. IDs and population names can contain letters and/or numbers. There should be 2 allele columns for every locus name listed under LOCNAMES. Every row should end with a standard carriage return.

Use "0" for missing alleles.

Population Names

You can use any text you like for the population names, but do not use quotes within a population name. For example, it is fine to enter "PopA" in the population column, but not "Pop A 'Main'".

Locus Names

The locus names must be preceded by a single line containing the word "LOCNAMES", in capitals. The next line should list the locus names as a row of strings. Do NOT have spaces between the strings, only commas, as in the example data file.

Example Data File

The following is an example data file. Note that the locus names are all on a single line, with no carriage returns between them.

DATA
"Bi01","Mahu",96,126,280,280,236,250,165,165,232,246,231,231,185,187,89,89,170,176,154,164
"Bi02","Mahu",96,126,280,280,250,262,155,155,232,232,231,233,149,185,127,127,174,174,164,166
"Bi03","Mahu",96,126,280,280,258,262,165,165,232,232,231,231,185,187,89,127,174,174,164,164
"Bi04","Mahu",96,126,280,280,238,262,155,155,232,232,231,233,149,185,127,127,174,174,164,164
"Bi05","Mahu",96,122,280,280,250,258,0,0,226,244,231,231,187,187,107,127,174,176,164,164
"Bi06","Mahu",96,96,280,280,238,262,155,155,232,232,231,231,187,187,123,127,174,174,164,164
"Bi07","BM",122,126,276,280,236,238,155,165,242,242,219,225,161,187,123,127,176,176,164,182
"Bi08","Flat",126,126,280,280,238,262,155,165,226,232,231,231,187,187,89,89,174,174,164,164
"Bi10","Flat",96,96,280,280,250,250,165,165,226,246,219,231,0,0,107,127,174,174,154,166
"Bi11","Taik",96,96,278,280,234,250,165,165,226,240,231,231,149,187,89,99,170,170,154,164
"Bi12","Taik",96,96,276,280,234,250,165,165,240,240,231,231,187,187,89,99,170,174,154,164
"Bi13","Taik",96,126,276,276,246,250,165,165,226,244,231,231,149,187,99,99,174,174,164,164
"Bi14","Taik",96,126,276,276,262,262,155,165,226,244,231,231,149,187,89,107,170,174,154,164
"Bi15","Taik",96,96,276,280,234,262,155,165,232,240,231,231,149,187,99,99,170,174,154,164
END

LOCNAMES
"D10Rat20","D11Mgh5","D15Rat77","D16Rat81","D18Rat96","D19Mit2","D20Rat46","D2Rat234","D5Rat83","D7Rat13"

The data includes 4 populations (Mahu, BM, Flat and Taik), 10 loci (2 alleles per locus) and 14 individuals.
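
If you want to check a file against this layout before uploading it, a reader along the following lines may help. It is a sketch only: `parse_geneplot` is a hypothetical function, not the app's own parser, it assumes alleles are whole numbers, and the sample text is a cut-down version of the example above with just 2 loci.

```python
import csv
import io

def parse_geneplot(text):
    """Parse GenePlot-format CSV text into (individuals, locus_names)."""
    individuals, locus_names = [], []
    section = None
    for row in csv.reader(io.StringIO(text)):
        if not row or not any(cell.strip() for cell in row):
            continue                               # skip blank lines
        keyword = row[0].strip()
        if len(row) == 1 and keyword in ("DATA", "END", "LOCNAMES"):
            section = None if keyword == "END" else keyword
            continue
        if section == "DATA":
            # first two columns are ID and population; the rest are alleles
            alleles = [int(a) for a in row[2:]]
            individuals.append(
                {"id": row[0].strip(), "pop": row[1].strip(), "alleles": alleles}
            )
        elif section == "LOCNAMES":
            locus_names = [name.strip() for name in row]
    for ind in individuals:
        alleles = ind.pop("alleles")
        if len(alleles) != 2 * len(locus_names):
            raise ValueError(f"{ind['id']}: expected 2 alleles per locus")
        # pair consecutive columns into one genotype per locus; 0 marks a missing allele
        ind["genotypes"] = list(zip(alleles[0::2], alleles[1::2]))
    return individuals, locus_names

sample = '''DATA
"Bi01","Mahu",96,126,280,280
"Bi05","Mahu",96,122,0,0
"Bi07","BM",122,126,276,280
END

LOCNAMES
"D10Rat20","D11Mgh5"
'''
individuals, locus_names = parse_geneplot(sample)
populations = sorted({ind["pop"] for ind in individuals})
```

Counting `individuals`, `locus_names` and `populations` gives a quick way to confirm that a file contains what you expect.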

Authors

GenePlot was created by Rachel Fewster and Louise McMillan.

Citations

Please cite as follows:

McMillan, L. and Fewster, R. "Visualizations for genetic assignment analyses using the saddlepoint approximation method" (2017) Biometrics.

The online publication is available here.

Background

Population genetics is the study of multiple populations, typically of a single species, and the level of migration and gene flow or separation between them. Genetic assignment is a process by which the genetic data of an individual is compared with genetic data of samples from two or more reference populations, to assess how likely it is that the individual might have come from any of those populations.

GenePlot performs assignment via the algorithm proposed by Rannala and Mountain (1997) but also improves upon that by adjusting the results for individuals with missing data so that they can be visualized on the same graph as results from individuals with complete data. This is achieved by characterizing the genetic distribution of each reference population using the saddlepoint algorithm (specifically the formula proposed by Lugannani and Rice, 1980). This characterization also enables us to calculate the quantiles of each population, to indicate the shape of the distribution and get a better feel for how well a particular individual fits within a population.

References

Lugannani, R., and Rice, S. (1980). Saddle point approximation for the distribution of the sum of independent random variables. Advances in Applied Probability 12, 475-490.

Rannala, B., and Mountain, J. L. (1997). Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences 94, 9197-9201.

Baudouin, L., and Lebrun, P. (2001). An operational Bayesian approach for the identification of sexually reproduced cross-fertilized populations using molecular markers. ISHS Acta Horticulturae 546, 81-93.

Contact Details

For more questions or additional help, please contact geneplotontheweb"at"gmail.com.