Introduction
For ages after the advent of agriculture about 10,000 years ago, plant breeding was regarded as an art rather than a science of manipulating crop species to improve their characteristics and benefit production. Breeders improved economic traits such as yield by selection and by hybridization to incorporate desirable traits from one variety into another. Their success was limited, however, because whatever variation they exploited was derived from wild relatives with scant natural variation. They practiced a form of mass selection in which plants of superior phenotype were selected and their seeds planted in the next season. Without knowledge of the genetic basis of such practices, the approach was trial and error in obtaining a superior progeny generation. This led to varieties now termed landraces – lines locally adapted to a particular region. Today plant breeding is a full-fledged science, able to create variation according to the breeder's needs and to expedite the breeding process considerably in evolving new varieties. Behind this progress lie scientific and technological interventions: the rediscovery of Mendel's laws of inheritance in garden peas in 1900, the phenomena of linkage and crossing over around 1910, the production of mutations by X-rays in the 1920s, the double-helix structure of DNA in 1953, recombinant DNA technology, transgenesis and genetically modified organisms (GMOs) in the 1970s together with cloning technology and reverse genetics, epigenetic modification through DNA methylation and histone proteins, the mapping of genes for quantitative traits with the help of markers and marker-assisted selection in the 1980s, the capability to sequence whole genomes in the 2010s, and lately CRISPR-based gene editing in 2012. Transgenesis, QTL mapping, molecular marker-assisted breeding, gene sequencing and related tools have introduced precision into the plant breeding process and have given rise to what is now termed molecular breeding.
Mendelian Genetics Era
With the advent of Mendelian genetics, the Danish botanist Wilhelm Johannsen, who coined the word gene, developed the pure-line breeding theory of generating true-breeding (homozygous) lines through repeated self-pollination. He also stressed the role of the environment in the inheritance of quantitative traits. Instead of bulking the seeds from different parent plants for mass selection and picking the best plants from the resultant crop, progeny row selection was adopted: the seeds of each single parent are sown in a separate row, and those parents whose progeny means are high are picked. The progeny mean is subject to much smaller environmental variance and helps in picking the plants with the best genotypes. The identification of superior genotypes thus becomes more rigorous and fruitful in producing better progeny. One aspect worth mentioning is the role of heritability – the fraction of the total observed variability in an economic characteristic that is attributable to genetic causes – in taking decisions on selection procedures for genetic improvement. Heritability can be estimated from observed correlations between relatives [1]. In plant genetics, however, the total observed variability is enhanced by the existence of genotype x environment interactions, giving lower heritability than would be obtained without such interactions. Stability parameters – common varietal effects across environments – and environment-specific deviations (interactions) are then used to study such problems [2, 3].
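The heritability idea can be illustrated by simulation. The sketch below assumes a purely additive trait with a chosen true heritability of 0.6 and uses simulated, not real, data; it recovers the narrow-sense heritability as the regression of offspring phenotype on the midparent phenotype.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
va, ve = 0.6, 0.4                      # assumed additive and environmental variances
h2_true = va / (va + ve)               # narrow-sense heritability = 0.6

g_f = rng.normal(0, np.sqrt(va), n)    # paternal genotypic values
g_m = rng.normal(0, np.sqrt(va), n)    # maternal genotypic values
p_f = g_f + rng.normal(0, np.sqrt(ve), n)   # parental phenotypes
p_m = g_m + rng.normal(0, np.sqrt(ve), n)

# Offspring genotype: midparent value plus Mendelian sampling (variance VA/2).
g_o = (g_f + g_m) / 2 + rng.normal(0, np.sqrt(va / 2), n)
p_o = g_o + rng.normal(0, np.sqrt(ve), n)

# Heritability estimated as the slope of offspring on midparent phenotype.
midparent = (p_f + p_m) / 2
slope = np.cov(p_o, midparent)[0, 1] / np.var(midparent)
print(f"true h2 = {h2_true:.2f}, estimated = {slope:.2f}")
```

The regression-on-midparent estimator is one of several relative-based estimators alluded to in the text; correlations between other relatives lead to analogous formulas.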
The discovery of linkage in 1910 by T.H. Morgan and of mutation induction by X-rays in 1927 by H.J. Muller enabled plant breeders to increase the diversity (variation) in their material as well as to expedite the breeding process. The former tool of linkage and crossing over became of fundamental importance in QTL mapping and marker-assisted breeding, as we will see in the sequel. The mutagenic effect of X-rays and ionizing radiation, as well as of chemical agents, opened avenues for mutation breeding as a tool for the release of hundreds of improved cultivars. In particular, semi-dwarf stature, which prevents lodging of plants in the field, was developed by mutation breeding in several crop plants such as barley, wheat, rice, and sunflower. Dr. Norman Borlaug carried out extensive experiments in Mexico, crossing stubby-stalked dwarf wheat with high-yield varieties to obtain strains that gave extremely high yields provided a sufficient dose of fertilizer was applied, the short stature enabling the plants to hold up in the field under the weight of large clusters of grain. Agronomists used the same device to breed semi-dwarf rice plants. Such dwarf varieties of wheat and rice, when planted in other countries, made a tremendous difference to crop productivity and led in due course to the green revolution. Dr. Borlaug was awarded the Nobel Peace Prize for this success in 1970. A country like India, plagued with deficit food production for ages, moved in the 1960s to a surplus state capable of exporting food thanks to the success of the green revolution there. Of course, it demanded the use of inorganic fertilizers, irrigation and pesticides; the states of Punjab, Haryana and western Uttar Pradesh, endowed with irrigation capability, contributed significantly to this revolution. However, with the excessive use of high-dose fertilizers and pesticides over time, the gains of the green revolution proved unsustainable, the strategy becoming environmentally unfriendly.
A more holistic approach to transcend the green revolution with an 'Evergreen Revolution' – a comprehensive farming systems approach that considers land, cultivar improvement, water, biodiversity, and integrated natural resource management – was later advocated [4].
The subsequent development of recombinant DNA technology suggests that if the green revolution were to occur now, the process would be very quick. A dwarfing gene identified in any model crop can be inserted into the cells of the desired crop to produce shorter plants that can carry a greater amount of grain without any concomitant lodging. This hastening of the genetic process is the hallmark of the transgenic technology achieved in the 1970s.
DNA, Clones, and Reverse Genetics
Soon after the discovery of the double-helix DNA, detailed knowledge of the genetic material started accumulating. Sensitive techniques for isolating and analyzing genetic material in the laboratory were developed around a crucial attribute of the material – the ability to replicate – as well as the universality of the genetic code. The sequence of DNA (deoxyribonucleic acid) letters (4 types of nucleotides – A, C, G, and T, representing respectively the four chemical units or bases adenine, cytosine, guanine, and thymine, their pairing being A binding to T and C to G) in the nucleus of each cell of the organism constitutes the basic genetic entity. It is not known a priori which DNA letter affects which part of the organism and in what way. However, we do know how the 4-letter alphabet of the language of DNA is transformed into the 20-letter alphabet of the language of proteins. The genetic code consists of a system of successive triplets of nucleotides along the DNA, known as codons, which code for successive amino acids of a corresponding polypeptide chain of a protein or enzyme.
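The triplet coding just described can be illustrated with a toy translation routine. The codon table below is deliberately partial – only the five codons used in the example, not the full 64-codon genetic code.

```python
# Partial codon table: a handful of real codon assignments for illustration.
CODON_TABLE = {
    "ATG": "Met", "TTT": "Phe", "GGC": "Gly", "AAA": "Lys", "TAA": "Stop",
}

def translate(dna):
    """Read successive triplets (codons) of a coding strand into amino acids."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "Stop":            # a stop codon terminates the polypeptide chain
            break
        protein.append(aa)
    return "-".join(protein)

print(translate("ATGTTTGGCAAATAA"))   # Met-Phe-Gly-Lys
```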
Cloning means copying a given gene – a segment of DNA – usually by putting it into a laboratory strain of the E. coli bacterium so that, as the bacteria multiply, so do the copies of the gene. When viruses are mixed with bacteria grown on a petri dish of nutrient agar, the areas where viruses have killed the bacteria appear as clearings in the bacterial lawn – killing zones known as plaques. The plaques contain millions of virus particles and therefore millions of copies of the original DNA fragment of the gene. E. coli resists virus attack by enzymatically cutting the DNA of the invading virus into small pieces. The enzymes responsible, called restriction enzymes, are a basic tool of genetic analysis. Cutting yields pieces of DNA with single-stranded ends protruding from the DNA duplex. These ends are complementary and can be joined by another enzyme, ligase, exploiting the pairing of the nucleotide bases. Such molecular tools are used in the laboratory to develop what is known as a genomic DNA library. There is another type of DNA library, the complementary DNA library. In this case we use another type of genetic material found in cells, messenger RNA (mRNA), which carries out the genome's orders to make proteins. However, mRNA is by its nature transitory and unstable. Using an enzyme called reverse transcriptase, RNA can be copied into a stable form of DNA known as complementary DNA (cDNA). The library is developed by isolating the mRNAs at work in a tissue, converting them into cDNA fragments, and inserting these fragments into plasmids – small rings of DNA that carry instructions for a bacterium. When bacteria take up the plasmids and multiply, millions of copies of the cDNA are produced. Because each bacterium contains a different segment of cDNA, when it replicates, i.e. divides into daughter cells, both the mother and daughter cells contain the same DNA fragment. Of the two libraries, the cDNA library exploits the way Nature's copy editor turns the whole genetic code into a much smaller set of mRNAs representing only the subset of genes – the coding genes – required for a specific cell or tissue type. This helps in gene hunting.
Techniques used to manipulate genetic material in the test tube lead to another phenomenon known as reverse genetics. In classical genetics we observe the phenotype and infer the genotype from the results of mating two individuals differing in their genotypes; we conclude that the observed difference between them was due to a pre-existing mutation, as one can see in the results of Mendel's experiments. In reverse genetics we take a fragment of DNA, whose role in the life of the organism may be known or unknown, and mutate it in the test tube. After reintroducing it into the cell, where it gets integrated into the chromosomes, we can observe the consequence, if any, for the phenotype of the organism. That is, we go from genotype to phenotype, the reverse of the process used in classical genetics.
Transgenic Technology
It is a derivative of recombinant DNA technology that gave birth to plant genetic engineering: creating plants with desired characteristics by inserting useful genes from a wide range of living sources, not just from within the crop species or from closely related plants. It is a man-made technique but based on principles followed in nature – the ingenious genetic engineering of the soil bacterium Agrobacterium tumefaciens, which injects its own DNA into a plant and integrates it with the plant's, causing crown gall disease. This gave researchers the clue to artificially inserting a desired gene into Agrobacterium's plasmid for transfer to the plant cell. When this genetically modified bacterium infects a host plant, it inserts the chosen gene into the plant's chromosomes, and the plant is thereafter called a genetically modified organism. The technique was refined in the 1980s by the invention of the "gene gun", in which the desired gene is affixed to tiny gold or tungsten pellets that are fired carefully, like bullets, into the cell. By 1990 scientists had succeeded in using the gun to shoot new genes into corn, and genetically modified corn was born – the first GM crop.
This technology provides the means for identifying and isolating genes conferring specific characteristics in one kind of organism and for moving copies of those genes into a quite different organism, which will then also have those characteristics. It has enabled plant breeders to generate more useful and productive crop varieties, and in a much shorter time than the cumbersome traditional cross-pollination and selection techniques. Genetically modified crops like corn, soybean, rapeseed, cotton and rice are now planted on about 170 million hectares globally.
In the case of Bt cotton, the cloned Bt (Bacillus thuringiensis) gene transferred into cotton plants by genetic engineering makes the plants produce their own biocides, killing the caterpillars of the lepidopteran insects that damage the crop through bollworm attack. It is a chemical protection of the crop, with the plant cells as the delivery system. While the quantity of insecticide sprayed in the traditional approach is considerably reduced – lowering input costs to the farmer and protecting the ecosystem – the strategy might in principle create problems in the internal machinery of the plant itself. The experimental evidence, however, is to the contrary.
A transgene is a segment of DNA containing a gene sequence that has been isolated from one organism and is introduced into a different organism. It is an assembly of three parts: a promoter, an exon, and a stop sequence. The promoter is a regulatory sequence that determines where and when the transgene is active. The exon is a protein-coding sequence, usually derived from the cDNA for the protein of interest (vide the cDNA library discussed in the previous section). All three parts are typically combined in a bacterial plasmid, the coding sequence being chosen from transgenes with previously known functions.
The dichotomy of GM and non-GM crops is somewhat artificial. All improved varieties are genetically modified; only the methods of obtaining them differ. As against such man-made genetic modification in domesticated crops, there is also genetic modification in wild populations by natural selection, following the principles put forth by Charles Darwin. This leads to the evolution of varieties where the selection is of the stabilizing type, favoring phenotypes near the mean of the population. In the man-made case of so-called artificial selection, extreme phenotypes (e.g. increased yield) are favored, which might involve some loss of fitness. Artificial selection, practiced by breeders since the advent of agriculture about 10,000 years ago, when genetic principles were not even known, could produce modifications of the desired type. It is said that Darwin got the clue to his theory of evolution by natural selection from the results of artificial selection practiced on domesticated species. It was later, when Mendel gave the laws of heredity, that natural selection got a genetic basis: operating on genetic variation created by mutation and recombination. Genetic modification is therefore at the root of the whole process, whether natural or man-made.
CRISPR-based Gene Editing Technology
CRISPR/Cas9 is a system consisting of a CRISPR (clustered regularly interspaced short palindromic repeats) molecule and an enzyme, Cas9. The former can be programmed to target a specific section of DNA by loading it with a matching RNA sequence (a single guide RNA, sgRNA), and the latter functions as a powerful pair of molecular scissors that cuts the matched section of DNA. In the bacterial system, the repeat sequences of 29 nucleotides are separated by various 32-nucleotide spacer sequences. Soon after cleavage of the targeted sequence, the cell can either repair the break on its own – non-homologous end joining (NHEJ) – or scientists can patch in a corrected sequence – homology-directed repair (HDR). If done in germ cells, the changes are passed on to future generations. This very recent biotechnological tool is revolutionizing plant breeding by modifying targeted DNA sequences within plant genomes, particularly in crops like rice and wheat. It is much like the 'cut' and 'paste' functions of a word processor.
It is significant to note, however, that this system was derived from a naturally occurring defense mechanism, first observed in a cup of yoghurt, by which the bacterium Streptococcus thermophilus defends itself from repeated viral infections – a type of acquired immunity. After a viral invasion is repelled, the bacterial DNA keeps a genetic record of the infecting viruses as short repeated sections of DNA with short spacer segments in between, the spacers being snippets of the repelled viruses' genes. When the same virus attempts to infect the bacterium again, its DNA gravitates towards the matching section on the bacterial genome and binds to it. That summons the powerful Cas9 enzyme of the cell to snip the virus out, leaving the bacterium free from infection. Researchers then realized that this bacterial trick could be used to cut not only viral DNA but any DNA sequence in any organism, at one or more specifically selected genes, by altering the guide RNAs in combination with the Cas9 enzyme to match the targeted gene or genes. The sgRNA is part of a longer RNA molecule that forms a ribonucleoprotein with Cas9, positioning the enzyme at the correct place on the target DNA for cleavage.
In plant breeding, this technology enables scientists to edit the genomes of superior varieties to produce new varieties in a single generation, irrespective of the existing variability and without the need to select favorable combinations of alleles. Such an approach, however, requires knowledge of the nucleotide sequence and function of the targeted genome, so as to be able to design the appropriate sgRNA and predict the editing outcome. It has been applied in rice by generating mutations at target sites at nearly 100 % efficiency, and a CRISPR/Cas9-mutagenized rice line with enhanced blast resistance was recently released [5]. In wheat the technology has not been as successful: the first CRISPR/Cas9-mutagenized wheat plants had an efficiency of only 5 % [6]. The capability of this technology to target multiple sites can nevertheless be useful in wheat, it being a polyploid crop (having more than two sets of chromosomes). Scientists involved in climate change studies recently used genome editing to enhance drought tolerance in maize by editing a previously unidentified promoter to increase expression of the ARGOS8 gene, which down-regulates the growth-inhibiting hormone ethylene, enhancing plant growth and yield under drought stress [7]. In tomato, flowering time can be manipulated using CRISPR/Cas9 to generate early-yielding varieties by disrupting the flower-repressing gene SP5G [8].
Like GMOs, CRISPRized crops face sociopolitical challenges such as government regulation, public acceptance, and adoption by producers such as small farmers. However, the advantages of genome editing over conventional and earlier transgenic approaches – low cost, ease of use, the absence of transgenes permanently introduced into crop germplasm, and a high level of multiplexing (editing of multiple targets) – can lead to its wide adoption in the near future for increased crop production.
Molecular Markers and Linkage Maps
Soon after the introduction of technology for genotyping molecular markers – the so-called chip technology – plant breeding methods got a big impetus: precision in the breeding process increased by incorporating marker information into the existing approaches of selection and cross breeding. This involves three components, (a) molecular markers and linkage maps, (b) mapping of QTLs, and (c) marker-assisted plant breeding, discussed in this and the following sections.
The role of markers, however, was implicit in earlier studies on quantitative genetics, going back to the work of K. Sax, who investigated the existence of linkage between the polygenes of a quantitative trait, such as seed weight, and a Mendelian gene, such as seed color [9]. He crossed a strain of dwarf beans, Phaseolus vulgaris, having large colored seeds with another whose seeds were small and white. While seed size showed itself to be a continuously variable character, the pigmentation proved to be a single-gene difference (P-p), the F2 giving a ratio of 3 colored to 1 white-seeded plant. By means of F3 progeny, the colored F2 plants were further classified into homozygotes (PP) and heterozygotes (Pp). The average bean weights in the three classes of F2 plants were PP (30.7), Pp (28.3) and pp (26.4). Their standard errors showed the differences in seed weight to be statistically significant. Clearly the average weight is associated with the number of P alleles present, viz. 2, 1, 0. The pigmentation here is thus synonymous with a marker associated with a quantitative trait. Such a marker can be followed through the generations and can serve as a tag for the quantitative trait, provided it is linked with it. This aspect has become of crucial importance to plant geneticists and plant breeders for improving economic traits.
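Sax's comparison of genotype classes can be mimicked with a small simulation. The within-class standard deviation (2.0) and class size (80 plants) below are assumptions for illustration; only the class means come from the text. A one-way ANOVA then shows how a marker-trait association is detected:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(7)
# Mean seed weights reported for the three F2 marker classes;
# within-class SD and sample size are hypothetical.
means = {"PP": 30.7, "Pp": 28.3, "pp": 26.4}
groups = {g: rng.normal(m, 2.0, 80) for g, m in means.items()}

f, p = f_oneway(groups["PP"], groups["Pp"], groups["pp"])
print(f"F = {f:.1f}, p = {p:.2g}")  # a small p-value flags marker-trait association
```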
Quantitative traits such as plant yield, flowering time, pest resistance etc. are complex in nature, being controlled by several genes and affected by environmental factors. Quantitative genetics, in contrast to Mendelian genetics, has developed around such traits with a heavy dose of statistical input [2, 3]. A quantitative trait locus (QTL) is a segment of DNA whose effects may be either small or large, at least in comparison with the environmental modifications. As mentioned earlier, the methodology of quantitative genetics has been considerably modified by the introduction of marker information via chip technology. There are several ways of getting such information, as for instance the pigmentation in the study of K. Sax [9]. Broadly there are three categories of markers: morphological, such as visible traits; biochemical, such as allozymes and blood groups; and molecular, which are at the DNA level. The molecular markers can be listed as:
Restriction fragment length polymorphism (RFLP)
Random amplified polymorphic DNA (RAPD)
Amplified fragment length polymorphism (AFLP)
Variable number of tandem repeats (VNTR) - that consist of microsatellites (short sequences) termed as short tandem repeats (STR) or simple sequence repeats (SSR) and mini-satellites (long sequences)
Single nucleotide polymorphism (SNP).
Of course the whole DNA sequence is itself the ultimate marker in the process of marker development. All these help in identifying a QTL by looking for association between the trait and one or several specific markers [10–13]. They are like signposts or tags. For instance, suppose you go to a new city to locate the house of a friend whose address you do not know, but you do know that the house is in the vicinity of a petrol pump with a known address. Your success in the search will depend on the closeness, including the direction, of the petrol pump to the house. Without such signpost information you would face the cumbersome task of knocking on the door of every house and enquiring whether your friend lives there.
In addition to the above types of markers there are also functional markers, which are superior to the random DNA markers mentioned above in that they are located within the specific gene regions delimited by QTLs and are therefore completely linked with the QTL alleles. They are derived from functionally characterized sequence motifs affecting phenotypic variation.
The first problem in QTL mapping is to construct a linkage map that indicates the positions of, and relative genetic distances between, the chosen markers along each of the chromosomes. The map distance is based on the total number of crossovers between two markers, whereas the physical distance between them is in terms of nucleotide base pairs (bp). A centimorgan (cM), corresponding to a crossover frequency of 1 %, can span 10 kb to 1,000 kb and varies across species. Linkage maps for several crop species like rice, wheat and maize have been constructed and are used for QTL mapping.
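The relation between map distance and recombination fraction can be sketched with the Haldane mapping function, which assumes no crossover interference (an assumption, since the text does not commit to a particular mapping function):

```python
import math

def haldane_cM(r):
    """Map distance in centimorgans from recombination fraction r (no interference)."""
    return -50.0 * math.log(1.0 - 2.0 * r)

def haldane_r(d_cM):
    """Recombination fraction from map distance in cM (inverse Haldane function)."""
    return 0.5 * (1.0 - math.exp(-d_cM / 50.0))

# For tightly linked loci the two scales nearly coincide: 1 % recombination ~ 1 cM,
# while distant loci approach the free-recombination limit r = 0.5.
print(round(haldane_cM(0.01), 3))   # ~1.01 cM
print(round(haldane_r(100.0), 3))   # ~0.432
```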
Since the marker genotypes can be followed in their inheritance through generations, they can, as stated above, serve as molecular tags for following the QTL provided they are linked with it. This requires detecting the marker-QTL linkage and, if it is established, estimating the QTL map position on the chromosome along with the effect sizes of the QTLs. How these problems are approached depends on the kind of experimental populations available in plant breeding investigations. In crops practicing self-fertilization, populations are derived from a cross between two pure-breeding parents, homozygous at all the loci controlling variation in the trait. Such F1 hybrids are selfed to produce segregating F2 populations, whereas backcross (BC) populations are derived by crossing the F1 hybrid to one of the parents, usually the recessive one. Inbreeding from individual F2 plants can lead to recombinant inbred (RI) lines, which consist of a series of homozygous lines, each containing a unique combination of chromosomal segments from the two original parents. It takes around six to eight generations to achieve such populations. In species amenable to tissue culture, such as rice, barley and wheat, plants can be regenerated by inducing chromosome doubling from pollen grains, which leads to doubled haploid (DH) populations. Both RI and DH populations are true-breeding lines that can be multiplied and reproduced without any segregation and therefore provide permanent resources for QTL mapping. In cross-pollinating species, on the other hand, such simple designs are not possible due to the lack of inbreeding; mapping populations are usually derived from a cross between a heterozygous parent and a haploid or homozygous parent, depending on the plant breeding need.
Mapping of Quantitative Trait Loci (QTL)
The detection of marker-QTL linkage is based on a statistical test of a null hypothesis (H0) against an alternative hypothesis (H1). The null hypothesis postulates that there is no QTL in the vicinity of the chosen marker, with its known location on a given chromosome, and hence that no linkage exists between them. This can happen in several ways: the QTL is not on the same chromosome as the marker, or it is on the same chromosome but recombines with it at meiosis with probability ½. If we reject this hypothesis – declaring linkage when in fact no QTL is present – we commit an error termed a false positive. If, on the other hand, we accept the null hypothesis – declaring no linkage when in fact a QTL is present – we commit the error of missing the QTL, termed a false negative. These errors are known respectively as Type I and Type II errors in the statistical literature on hypothesis testing. Including the two other possibilities of true positive and true negative, the four possibilities are:
Reject H0 when H0 is true – false positive (type I error)
Accept H0 when H0 is true – true negative
Reject H0 when H0 is false – true positive
Accept H0 when H0 is false – false negative (type II error)
In statistical testing our strategy is to minimize the probability of missing the QTL for a fixed low probability of a false positive, usually kept at the 5 % level. When H0 is false and the alternative H1 is true, implying that a QTL is present, the probability of detecting it is maximized. This probability is the power of the test and can be increased by increasing the sample size. It may be noted that the probabilities of the events concerned can only be determined by postulating which hypothesis is true. In general the test statistic is derived by a likelihood-ratio criterion. In genetic applications this statistic is expressed as a LOD score, which is approximately related to a chi-squared distribution.
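The LOD-chi-square relation can be made concrete: the likelihood-ratio statistic equals 2 ln(10) times the LOD score, so a pointwise p-value follows from the chi-square distribution (1 degree of freedom is assumed here, as for a single QTL effect in a backcross):

```python
import math
from scipy.stats import chi2

def lod_to_pvalue(lod, df=1):
    """Pointwise p-value for a LOD score: LR = 2 ln(10) x LOD is ~ chi-square(df)."""
    lr = 2.0 * math.log(10.0) * lod
    return chi2.sf(lr, df)

# The classical threshold LOD = 3 gives a small pointwise p-value, but not small
# enough once many map positions are tested across a whole genome.
print(f"{lod_to_pvalue(3.0):.1e}")
```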
Broadly there are two approaches to QTL mapping: (a) candidate-gene mapping and (b) genome-wide association study (GWAS). In the former, a specific genomic region on a given chromosome is chosen and searched for the QTL with the help of markers known to be located in that region. Tests for the presence or absence of the QTL are conducted at several map positions in this region, say every 1 cM, with the help of LOD scores. Map positions showing significant LOD scores are deemed to contain a QTL, and among these the one with the maximum LOD score is taken to indicate the position of the QTL. However, the distribution of the maximum LOD score is not simply chi-square, owing to the non-independence of the successive tests, particularly in a dense-marker linkage map. In GWAS, on the other hand, all marker positions on all the chromosomes are tested for the presence or absence of a QTL. This requires a genome-wide threshold for judging significance: with larger genomes more tests are performed, increasing the probability that a fixed LOD threshold will be exceeded. An experiment-wise significance level of 5 % means that the probability of obtaining a LOD score above the threshold somewhere on the whole genome just by chance is 5 %. The genome-wide threshold thus depends on the number and length of the chromosomes as well as on the number of markers on them. When few markers are tested per chromosome – the so-called sparse-map case – a lower threshold is needed at the same genome-wide significance level than when many markers are tested per chromosome – the so-called dense-map case. An exercise in determining LOD significance thresholds in experimental plant populations was attempted using large-scale simulations [14].
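A minimal sketch of such a simulation, under the simplifying assumption of a sparse map with independent marker tests (real simulations model linked markers), estimates the 5 % genome-wide LOD threshold as the 95th percentile of the per-genome maximum LOD under the null:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim, n_markers = 2000, 100   # assumed: 2000 null genomes, 100 ~independent tests

# Under H0 each pointwise likelihood-ratio statistic is ~ chi-square(1);
# convert to LOD and record the genome-wide maximum for each simulated genome.
lr = rng.chisquare(1, size=(n_sim, n_markers))
lod = lr / (2 * np.log(10))
threshold = np.quantile(lod.max(axis=1), 0.95)
print(f"5% genome-wide LOD threshold ~ {threshold:.2f}")
```

With more markers (a denser map) the maximum grows and the threshold rises, which is the dependence on marker number described above.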
Taking the genetic basis into account, there are three major methods of QTL mapping applicable to plant populations: (a) single marker analysis, (b) simple interval mapping (SIM), and (c) composite interval mapping (CIM). We consider them below for a backcross population segregating for the quantitative trait under study as well as the chosen markers.
Single Marker Analysis
This is the simplest situation, wherein for all the sampled plants from the population observations are recorded for the trait under study and each plant is genotyped for the marker. The data can be analysed by a t-test or by an analysis of variance (ANOVA). With one QTL locus and one marker locus there are four marker-QTL genotypes whose frequencies depend on the recombination probability between the two loci. Since the marginal frequency of each of the two possible marker genotypes is one-half, the frequency of the QTL genotypes conditional on the marker genotype can be worked out. The expected difference between the observed trait means of the backcross population in the two marker groups can be obtained in terms of the recombination probability and the genetic effects of each QTL, summed over all QTLs. The two are confounded, so the null hypothesis, being composite, can mean either that there is no linkage between the QTL and marker loci or that the QTL genetic effects are zero. The method is inefficient because we cannot determine whether a significant marker effect is due to one QTL or several, nor whether the effect is due to distantly linked QTLs with large effects or closely linked QTLs with small effects.
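A minimal illustration of single marker analysis on a simulated backcross (the recombination fraction and QTL effect below are assumed values, not from the text) shows both the t-test and the confounding of effect size with linkage:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n = 200                      # backcross plants
r = 0.1                      # assumed marker-QTL recombination fraction
a = 1.0                      # assumed QTL substitution effect

marker = rng.integers(0, 2, n)               # two marker classes, frequency 1/2 each
recomb = rng.random(n) < r                   # crossover between marker and QTL?
qtl = np.where(recomb, 1 - marker, marker)   # QTL follows the marker unless recombined
y = a * qtl + rng.normal(0, 1, n)            # trait value with environmental noise

t, p = ttest_ind(y[marker == 1], y[marker == 0])
print(f"t = {t:.2f}, p = {p:.2g}")
# The expected class difference is a(1 - 2r): effect size and linkage are
# confounded, which is exactly the inefficiency noted in the text.
```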
Simple Interval Mapping (SIM)
The most popular method is simple interval mapping (SIM) [15]. It involves forming intervals by pairing adjacent markers and treating them as a single unit of analysis for detection and estimation purposes. It is based on the joint frequencies of a pair of adjacent markers and a putative QTL flanked by the two markers. Suppose markers A and B are linked with recombination fraction r and QTL Q is located between them, with recombination fraction r1 from A and r2 from B. Then r = r1 + r2 - 2 r1 r2, approximated as r1 + r2 on the assumption of no interference and r so small that double crossovers can be neglected. In the classical backcross with three loci, each with two alleles, A-a, B-b, and Q-q, the expected frequencies of the eight marker-QTL genotypes can be used to obtain the conditional probabilities of the QTL genotypes given the marker genotypes. By setting up a linear regression model between the trait (Y) and an indicator variable (X) taking the value 1 if the QTL genotype is QQ and -1 if it is Qq, one can estimate the regression coefficient that defines the allelic substitution effect of this QTL. In such a model the QTL genotype of a given individual is unknown; X is then a random indicator variable with the conditional probabilities of obtaining QQ or Qq at the QTL. The observed value is thus modelled as a mixture distribution with the conditional probabilities as mixing proportions. We therefore have a situation often referred to as linear regression with missing data, and estimation involves the use of the EM algorithm.
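The mixture model at the heart of SIM can be sketched as follows. The conditional probabilities of the QTL genotypes given the flanking markers are here replaced by assumed values (0.95 or 0.05) rather than derived from the three-locus frequencies, and the genotypic means and standard deviation are hypothetical. The EM iteration then estimates the two genotypic means and the common variance from the observed mixture:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
mu_QQ, mu_Qq, sigma = 10.0, 8.0, 1.0   # assumed genotypic means and common SD

# Conditional probability of QQ given the flanking-marker genotype; assumed
# values standing in for the probabilities tabulated from the backcross design.
p = rng.choice([0.95, 0.05], size=n)
z = rng.random(n) < p                   # true but unobserved QTL genotype
y = np.where(z, mu_QQ, mu_Qq) + rng.normal(0, sigma, n)

# EM: E-step computes each plant's posterior probability of being QQ;
# M-step re-estimates the two genotypic means and the common variance.
m1, m2, s = y.mean() + 1, y.mean() - 1, y.std()
for _ in range(50):
    d1 = p * np.exp(-0.5 * ((y - m1) / s) ** 2)
    d2 = (1 - p) * np.exp(-0.5 * ((y - m2) / s) ** 2)
    w = d1 / (d1 + d2)                                          # E-step
    m1 = (w * y).sum() / w.sum()                                # M-step
    m2 = ((1 - w) * y).sum() / (1 - w).sum()
    s = np.sqrt((w * (y - m1) ** 2 + (1 - w) * (y - m2) ** 2).mean())

print(f"estimated means {m1:.2f}, {m2:.2f}; substitution effect {m1 - m2:.2f}")
```

The difference of the two estimated means corresponds to the allelic substitution effect estimated by the regression formulation in the text.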
By assuming that the character is normally distributed within each of the eight marker-QTL classes with equal variance σ², one can set up a likelihood function in terms of the unknown parameters and construct a log likelihood ratio for testing the hypothesis that no QTL is located in the interval. The log likelihoods are evaluated at the maximum likelihood estimates of the genotypic values of the two QTL genotypes, the variance σ² and the recombination fraction r1 between marker A and the putative QTL, obtained by iterative procedures based on the EM algorithm. The resulting statistic is distributed as χ² with 1 d.f. The associated LOD score for interval mapping is then (½ log10 e) times the likelihood ratio statistic, i.e. LOD = LR/(2 ln 10) ≈ LR/4.61.
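The conversion from the likelihood ratio statistic to the LOD score is a one-line rescaling; the function name below is illustrative:

```python
import math

def lr_to_lod(lr):
    # LOD = (1/2) * log10(e) * LR, equivalently LR / (2 ln 10) ~ LR / 4.61
    return 0.5 * math.log10(math.e) * lr
```

For example, a likelihood ratio of 2 ln 10 (about 4.61) corresponds to a LOD score of exactly 1, i.e. odds of 10:1 in favor of a QTL in the interval.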
This statistic is evaluated at regularly spaced points, say every 1 or 2 cM, covering the interval as a function of the presumed QTL position. Repeating this procedure for each interval along the chromosome and plotting the LOD score against position gives a QTL likelihood map that presents the evidence for a QTL at any position in the genome. A putative QTL is inferred wherever the LOD score exceeds a chosen threshold T, and the maximum of the LOD score function in the map gives estimates of the QTL position and gene effects. Interval mapping of QTLs is widely used in practice, with the analysis typically performed in the software MAPMAKER/QTL. The estimates of QTL effect and location are asymptotically unbiased if there is only one QTL on a chromosome. If there are two or more QTLs on the chromosome, however, the test statistic is affected by the other QTLs linked to the one under test, which can bias the estimates of effect and location. Moreover, regions containing no QTL can show a significant peak if several QTLs lie in neighboring regions, a situation known as the ghost gene phenomenon. This defect of SIM can be overcome by adopting composite interval mapping (CIM), discussed in the next section.
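A minimal sketch of the EM-based mixture fit at a single putative position is given below; the function names are hypothetical and this is not the MAPMAKER/QTL implementation. A real scan would repeat this fit at each grid point in each interval and plot the resulting LOD curve:

```python
import math

# Bare EM sketch of SIM at ONE putative QTL position in a backcross.
# p[j] is the conditional probability that individual j has genotype QQ
# given its flanking-marker genotypes; y[j] is the trait value.
def normal_pdf(x, mu, s2):
    return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def sim_lod(y, p, iters=100):
    n = len(y)
    ybar = sum(y) / n
    mu1, mu0 = max(y), min(y)                       # crude starting values
    s2 = sum((yj - ybar) ** 2 for yj in y) / n
    for _ in range(iters):
        # E-step: posterior probability that individual j has genotype QQ
        w = []
        for yj, pj in zip(y, p):
            f1 = pj * normal_pdf(yj, mu1, s2)
            f0 = (1 - pj) * normal_pdf(yj, mu0, s2)
            w.append(f1 / (f1 + f0))
        # M-step: re-estimate the genotypic means and the common variance
        mu1 = sum(wj * yj for wj, yj in zip(w, y)) / sum(w)
        mu0 = sum((1 - wj) * yj for wj, yj in zip(w, y)) / (n - sum(w))
        s2 = sum(wj * (yj - mu1) ** 2 + (1 - wj) * (yj - mu0) ** 2
                 for wj, yj in zip(w, y)) / n
    # likelihood ratio against the single-normal null, converted to LOD
    ll1 = sum(math.log(pj * normal_pdf(yj, mu1, s2) +
                       (1 - pj) * normal_pdf(yj, mu0, s2))
              for yj, pj in zip(y, p))
    s2_null = sum((yj - ybar) ** 2 for yj in y) / n
    ll0 = sum(math.log(normal_pdf(yj, ybar, s2_null)) for yj in y)
    return (ll1 - ll0) / math.log(10)
```

When the trait values separate cleanly along the marker-implied QTL probabilities, the mixture model fits far better than a single normal and the LOD score is large.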
Composite Interval Mapping (CIM)
Although SIM is the QTL mapping method most widely and advantageously used in practice, it ignores the fact that most quantitative traits are influenced by numerous QTLs. This is overcome either by adopting a model of Multiple QTL Mapping (MQM) or by combining SIM with multiple linear regression, a procedure known as composite interval mapping (CIM) [16]. Consider a segment of chromosome between markers i and (i+1) in a backcross progeny and set up the same type of linear model as in the section on SIM, with X replaced by Xi and b by bi, and with an added sum of terms bkXk over the markers other than the i-th, where bk is the partial regression coefficient of the trait value on marker k and Xk is a dummy variable taking the value 1 if marker k has genotype AA and 0 if Aa. The maximum likelihood procedure is adopted to derive the relative position of the QTL as well as the likelihood ratio test statistic, and hence the LOD score, for the hypothesis under test [17]. The regression coefficient under test is a partial regression coefficient conditional on the other partial regression coefficients in the model; the hypothesis is thus composite, hence the name composite interval mapping (CIM). It may be noted that the markers in the CIM model can control the residual genetic background only when they are linked to QTLs. In practice, CIM is implemented using an iterative EM algorithm. For each position of the QTL, the iteration starts with the
E-step: obtaining the probability that Xi = 1, i.e. that the QTL genotype is QQ, and then performing the
M-step: estimating bi, B and σ² for the next round of iteration, where B is the vector of maximum likelihood estimates of the intercept and the partial regression coefficients for all markers except i and (i+1), and σ² is the variance of the error term.
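The regression part of the CIM model can be illustrated by building a design matrix with an intercept, the putative-QTL indicator Xi and one cofactor marker Xk, then solving the normal equations. Here Xi is treated as known for simplicity, whereas the full method replaces it by its conditional expectation inside the EM loop; all names and data are illustrative:

```python
# Illustrative least-squares fit of the CIM-style model
#   y = b0 + bi*Xi + bk*Xk + error
# for a backcross, solving the normal equations (X'X) b = X'y directly.
def solve_least_squares(X, y):
    """Gaussian elimination with partial pivoting on the normal equations."""
    p = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(p)]
         for i in range(p)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(p)]
    for i in range(p):                        # forward elimination
        piv = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, p):
            f = A[r][i] / A[i][i]
            for c in range(i, p):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    out = [0.0] * p
    for i in reversed(range(p)):              # back substitution
        out[i] = (b[i] - sum(A[i][c] * out[c]
                             for c in range(i + 1, p))) / A[i][i]
    return out

# rows: [1 (intercept), Xi (putative QTL indicator), Xk (cofactor marker)]
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
y = [12.0, 11.0, 10.5, 9.5, 12.2, 9.6]
b0, bi, bk = solve_least_squares(X, y)
```

The fitted bi is the partial regression coefficient for the putative QTL conditional on the cofactor marker, which is exactly the quantity whose significance CIM tests.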
The advantage of CIM over SIM can be seen in the results of mapping body weight QTLs on mouse chromosome X from a backcross population [18]. The CIM analysis achieved much better resolution than SIM.