A single nucleotide polymorphism assay sheds light on the extent and distribution of genetic diversity, population structure and functional basis of key traits in cultivated north American cannabis

Background The taxonomic classification of Cannabis genus has been delineated through three main types: sativa (tall and less branched plant with long and narrow leaves), indica (short and highly branched plant with broader leaves) and ruderalis (heirloom type with short stature, less branching and small thick leaves). While still under discussion, particularly whether the genus is polytypic or monotypic, this broad classification reflects putative geographical origins of each group and putative chemotype and pharmacologic effect. Methods Here we describe a thorough investigation of cannabis accessions using a set of 23 highly informative and polymorphic SNP (Single Nucleotide Polymorphism) markers associated with important traits such as cannabinoid and terpenoid expression as well as fibre and resin production. The assay offers insight into cannabis population structure, phylogenetic relationship, population genetics and correlation to secondary metabolite concentrations. We demonstrate the utility of the assay for rapid, repeatable and cost-efficient genotyping of commercial and industrial cannabis accessions for use in product traceability, breeding programs, regulatory compliance and consumer education. Results We identified 5 clusters in the sample set, including industrial hemp (K5) and resin hemp, which likely underwent a bottleneck to stabilize cannabidiolic acid (CBDA) accumulation (K2, Type II & III). Tetrahydrocannabinolic acid (THCA) resin (Type I) makes up the other three clusters with terpinolene (K4 - colloquial “sativa” or “Narrow Leaflet Drug” (NLD), myrcene/pinene (K1) and myrcene/limonene/linalool (K3 - colloquial “indica”, “Broad Leaflet Drug” (BLD), which also putatively harbour an active version of the cannabichrometic acid Synthase gene (CBCAS). Conclusion The final chemical compositions of cannabis products have key traits related to their genetic identities. Our analyses in the context of the NCBI Cannabis sativa Annotation Release 100 allows for hypothesis testing with regards to secondary metabolite production. Genetic markers related to secondary metabolite production will be important in many sectors of the cannabis marketplace. For example, markers related to THC production will be important for adaptable and compliant large-scale seed production under the new US Domestic Hemp Production Program.


Background
Cannabis, an annual and dioecious member of the family Cannabaceae, is an economically important genus providing protein-and oil-rich seeds, long and short fibres for industrial applications (construction materials, textiles, paper, etc.), and a wide diversity of secondary metabolites found predominantly as terpenoids and cannabinoids (Lynch et al. 2016;McPartland 2018;Onofri and Mandolino 2017). In fact, the cannabis plant can produce over 150 unique terpenoids and roughly 100 unique cannabinoids, with a subset showing bona fide therapeutic utility (Hanuš et al. 2016;Booth and Bohlmann 2019). However, despite the large diversity in secondary metabolite profiles across thousands of cultivars, the stratification into drug-type cannabis or fibre-type cannabis hinges on the dry weight concentration of a single cannabinoid, Δ 9 -tetrahydrocannabinol (THC). This approach which prevails today in the USDA interim regulations, employs a THC concentration of 0.3% as the threshold separating hemp and drug-type cultivars, with concentrations below 0.3% defined as hemp (Dolgin 2019). Other jurisdictions outside North America have adopted higher thresholds, for example in Switzerland where concentration under 1% total THC is considered a compliant hemp crop. Sadly, despite human cultivation for over 6000 years in varying climates worldwide (Clarke and Merlin 2013), its evolution, taxonomic classification, and phylogenetic connections remain poorly understood. These deficiencies stem from limited research, irregular breeding efforts, unorganized selection, ex situ conservation, and government restrictions, which ultimately resulted in the high heterozygosity observed within cannabis genomes today (e.g. Rahn et al. 2016;McPartland 2018).
Although a subject of ongoing debate, taxonomic classification of the Cannabis genus has been delineated through three main types: 1) sativa (tall and less branched plant with long and narrow leaves), 2) indica (short and highly branched plant with broader leaves) and 3) ruderalis (heirloom with short stature, less branching and small thick leaves). While still under debate, particularly whether the genus is polytypic or monotypic, this broad classification reflects the putative geographical origins of each group (Clarke and Merlin 2013;Lynch et al. 2016, Schwabe andMcGlaughlin 2019). Consequently, there is currently no structured horticultural registration system available for cannabis cultivars (varieties), instead these are often awarded the epithet "strains", which are likely the outcome of extensive hybridization and subsequent rehybridization from their original botanical descriptors (Henry 2015).
The recent legalization of drug type cannabis for commercial production and recreational use in Canada, several US States, and other countries worldwide, as well a hemp under the US 2018 Farm Bill has created renewed scientific interest in developing a robust empirical classification system for cannabis. To that end, a particular focus has been placed on secondary metabolite expression with a clear separation based on CBD (cannabidiol): THC ratios. Differences in CBD:THC ratios delineate three class types: type-I (ratio < 0.5), type-II (ratio 0.5-3.0) and type-III (ratio > 3.0) (Elzinga et al. 2015), Interestingly, a genetic basis for these types can be determined by polymorphism at the CBDAS and THCAS genes on Chromosome 9 (Laverty et al. 2019). In addition, double recessives at this locus give rise to type-IV (cannabigerolic acid, CBGA accumulators; de Meijer and Hammond 2005) whereas type-V plants are free of cannabinoids which may have resulted from functionally ablative mutations founds within the upstream components of the cannabinoid synthase pathway (de Meijer et al. 2009). More recently, the addition of terpenoids as potential chemotaxonomic markers have emerged as a preferred model compared to cannabinoids alone (e.g. Lewis et al. 2018). Linking chemotype to genotype has also enabled deeper insight into a novel consumercentric classification based on genetic markers associated with chemical expression (e.g. Orser and Henry 2019). Recently, others have proposed targeted markers for the identification of fiber and resin cannabis (e.g. Cascini et al. 2019;Hilyard et al. 2019) as well as molecular sexing tools to differentiate feminized from regular seed stock (Toth et al. 2020).
In addition to paving the way for empirical taxonomic classification, genetic information can provide insight into the extent and distribution of genetic variability, population structure, phylogenetic relationships, as well as providing the essential tools required to perform marker assisted selection in order to improved homozygosity and trait stability. In addition, genetic information can faithfully identify matching multilocus genotypes across disparate accessions. This feature may be particularly useful in seed-to-sale tracking as it provides an irrefutable identity for each individual accession and paves the way for cannabis variety registration and protection.
Here we describe a thorough investigation of cannabis accessions using a set of 23 highly informative and polymorphic SNP markers associated with important traits such as cannabinoid and terpenoid expression (Henry 2017;Henry et al. 2018;Orser and Henry 2019). We extend the scope of sampling to 420 accessions from Licenced Cultivators in Saskatchewan, Manitoba, British Columbia, Canada as well as Nevada, USA. We validate the use of these 23 SNP markers to assess population structure, phylogenetic relationship, population genetics, and correlation to secondary metabolite concentrations, and demonstrate the utility of this assay for rapid, repeatable and cost efficient genotyping of commercial and industrial cannabis accessions for use in product traceability, breeding programs, and consumer education.

Sample collection
Samples were collected reflecting the diversity of cannabis germplasm available in North America, with samples from industrial hemp lines (type-III), resin hemp (type-II and type-III) and THC drug-type (type-I) cannabis. The sampling strategy did not follow any particular selection criteria as samples were randomly chosen from several licenced cultivators who graciously agreed to have their accessions analysed in this study (Supplementary Table  S3). Given the sensitivity of our genotyping approach, a small 2mm 2 segment of leaf tissue was collected at each facility and was sufficient to yield adequate DNA for downstream genotyping.

DNA isolation procedure
Prior to performing the DNA extraction protocol, and in order to obtain high molecular weight DNA, plant tissue samples were lyophilised by allowing to air dry for 24-48 h at room temperature and in the presence of silica desiccant. Lyophilised plant tissue was homogenised in a 1.5 ml microcentrifuge tube with a reusable pestle. Homogenised material was then treated following the Sbeadex® plant mini kit protocol (LGC Biosearch Technologies, Beverley, MA) following the manufacturer's instructions. After the addition of 90 μL Lysis buffer PN, samples were incubated at 65°C for > 10 min. The samples were then centrifuged at 2500 x g for 10 min to pellet the debris. 50 μL of the supernatant in this tube, referred to as the lysate, was then transferred to a clean 1.5 ml microcentrifuge tube with 120 μL Binding buffer PN and 10 μL Sbeadex® particle suspension and incubated at room temperature for 4 min. The tube was then brought into contact with a magnet for roughly 1 min until magnetic particles form a pellet. The supernatant was then discarded, and the pellet was then subjected to three consecutive wash steps. The washed beads were then eluted with 70 μL Elution buffer PN and incubated at 55°C for 3 min prior to bringing the tubes in contact with the magnet. 50 μL of the eluate was then transferred to a new tube which contain high purity plant DNA.

Endpoint PCR genotyping using custom KASP assays
Twenty-three optimized assay mixes, each specific to single nucleotide polymorphisms (SNP) previously identified as associated with phylogeny and chemotypic expression were screened in the sample set (Henry 2015(Henry , 2017Henry et al. 2018). These assays consist of two competitive, allele-specific forward primers and one common reverse primer (KASP; LGC Biosearch Technologies, Beverley, MA). Each forward primer incorporates an additional tail sequence that corresponds with one of two universal FRET (fluorescent resonance energy transfer) cassettes present in the KASP Master mix which contains the two FRET cassettes (FAM and HEX), ROX™ passive reference dye, Taq polymerase, free nucleotides and MgCl 2 in an optimised buffer solution.
The genotypes were generated using an Eco RT (Illumina, San Diego, CA), a CFX 96 (Biorad, Hercules, CA) and an Intelliqube array tape platform (LGC Biosearch Technologies, Beverley, MA) with multiple blind replicates across platforms to ensure cross system repeatability. Genotypes were called using the Kluster Caller software and manually verified using the SNPviewer software (LGC Biosearch Technologies, Beverley, MA).

Functional basis of 23 SNP
Given the recent release of the 10-chromosome map of the cannabis genome (Grassa et al. 2018;Laverty et al. 2019), metabolomic and proteomic insight (Jenkins and Orsburn 2019a, 2019b) as well as a fully annotated version of the cannabis genome resulting from the completion of the NCBI Cannabis sativa Annotation Release 100 (Jenkins and Orsburn 2019c), we set out to characterise the functional basis of the SNPs used in the study. The previously designed targets developed using Cansat 3 (van Bakel et al. 2011) were subjected to a BLASTn search (Altschul et al. 1990) constrained to the taxa cannabis using the NCBI online interface (https://blast.ncbi. nlm.nih.gov) accessed October 31, 2019. The location of the 10-chromosome map as well as the putative functional gene in which the 23 SNP are found were recorded.

Statistical analyses of genotypic data
Multilocus genotypes were formatted as a table (comma separated file) of genotypes with individuals as rows and markers as columns. As the total dataset of 681 plant DNA samples contained some missing data, we culled all missing data out and undertook the following analyses on 420 samples with complete genotype information across all markers. Metadata, including individual and population names, were separated from the genotype data and imported into the flexible statistical environment of R (R Core Team 2018) requiring the following packages, ape (Paradis and Schliep 2018), pegas (Paradis 2010), poppr (Kamvar et al. 2014), adegenet (Jombart 2008) and hierfstat (Goudet and Jombart 2015).
Briefly, the read.loci function was used to import the allelic data into the R environment as a data frame which was then converted to a genind object using the df2genind command. Individual and population (variety identity) were also incorporated into the genind object to allow for population level calculations to shed light on the stability of claimed variety names and to assess the level of genetic diversity within and between these hypothesized groups. Clonal lines were identified using mlg and mlg.id functions, which determines the number and identity of mutilocus genotypes. Basic population genetics metrics, particularly expected heterozygosity was calculated for each population and individual using the poppr function.
To shed light on the underlying relationships between our diverse sample set, a dissimilarity matrix or Hamming distance between multilocus genotypes was calculated using the bitwise.dist function and was visualized using a phylogenetic tree using the nj function. Principal component analyses (PCA) were undertaken to provide an independent line of evidence of the genetic affinities between accessions using the dudi.pca function. Broad signals of population genetic structure were investigated using discriminant analyses of principal components (DAPC; Jombart 2008). The optimal number of clusters was determined using the find.cluster function followed by the dapc function using said clusters as the most likely observed structure. The DAPC was visualized using the scatter function. A minimum spanning tree calculated from the squared distance between individual was plotted to shed light on the phylogenetic relationship of each inferred cluster. Lastly, the inferred clusters were applied as the population factor and the genetic differentiation between populations (variety names) as well as for the inferred clusters were calculated using the pairwise.fst function. Diversity indices for varieties representing putative seed lines for which at least three individuals were available in the dataset were also assessed using the locus_ table function where variety names were used as population indicator.

Statistical analyses of chemotypic data
A subset of 120 samples from Nevada were also analyzed by various LC / MS combination at 9 cannabinoid and 17 terpenes, following the methods described by Orser et al. (2018). Since the genetic panel was developed to find the most informative genetic markers associated with chemotypic expression, we grouped individuals according to the clusters from the DAPC and visualized the chemotype variation using side by side boxplots of the top cannabinoid and monoterpenes. Similarly, R was used to read the chemotypic data using the read.table function. The boxplot function was used to plot the top cannabinoid and terpenes expressed in each cluster.

Extent and distribution of genetic diversity and population structure in modern Cannabis
The 23 SNP panel used in this study was selected to represent a broad coverage of the cannabis genome and individual SNPs were found to be located on all cannabis linkage groups with the exception of chromosome 8 ( Table 1). As such, levels of polymorphism varied widely between SNPs, from fixed mitochondrial alleles that allow for the discrimination between fibre-type and resin-type cannabis (Figs. 1, 2 and 3), to highly variable nuclear markers. Of note, two resin-type landrace varieties from Kyrgyzstan and Egypt were the exception to the rule, both displaying the fibre-type mitochondrial haplotype while expressing THC as the main cannabinoid.
Heterozygosity at the nuclear markers ranged from 0.03 to 0.50 (Table 1, Supplementary Table S1). Three markers targeting the THCAS gene cluster offered strong discrimination of major cannabis groups, associated with the two major pentyl cannabinoids THC and CBD. In particular, the SW6 and VSSL_BtBD markers were fixed for one allele in all CBD expressing varieties (fibre and resin-types), while being fixed for other allele or heterozygote in all THC expressing varieties. In addition, the SVIP14 locus was also strongly associated with cannabinoid expression data ( Table 2).
The DAPC exercise clustered cannabis varieties into five groups (Figs. 1, 2 and 3), which was mostly congruent with the independent neighbor joining tree (Fig. 1). European Hemp (K5; 15 individuals, C. s. ruderalis, typically fibre or grain cultivars, often non-photoperiodic) was clearly distinct from all drug-type cannabis accessions, including high CBD resin expressing accessions. Interestingly resin (drug)-type cannabis consisted of four main genetic clusters, K1 and 3 (156, 118 individuals, myrcene/pinene, myrcene/limonene/linalool dominant respectively) which can be considered having a C. s. indica phenotype and perceived effect, while K4 (84 individuals, terpinolene) contain mainly accessions of  Showing the relative location of each individual sample in two dimensional space, overlaid by a minimum spanning tree calculated from the squared distance between individual to represent the phylogenetic relationship between inferred clusters. K5, hemp or "ruderalis" appears ancestral and the most differentiated group, followed by K4, terpinolene dominant resin accessions. The genetic distance between groups (Fst) is indicated on the respective branches of the minimum spanning tree  (Fig. 4). One known first generation hybrid ("S2") between an autoflowering male "Darryl" and a CBD resin-type named "Intergallactic Princess" (not sampled here) was found to be assigned to both K2 and K5 in a 40:60 proportion skewed towards the father's origin (Fig. 3). Other possible F1 hybrids were detected between K1 and K3 as well as possibly mis-assigned THC resin individuals into the K2 cluster (Figs. 1, 2, 3).

Diversity within seed lines and inferred clusters
Twenty of the 23 markers were found to deviate from Hardy-Weinberg equilibrium (HWE) in at least one of the 71 populations/seed lines (Supplementary Figure S1), which was not surprising in itself, given the domestication history and strong selective forces for chemical expression in modern North American commercial cannabis cultivars. Of interest when repeated in the larger clusters determined using DAPC, a total of four markers were found to not deviate from HWE (Supplementary Figure S2). The average heterozygosity within seed lines (putative populations) was 0.33, which was considered much higher than what was to be expected in any other major stable commercial crops. Interestingly, the most homozygous line, with heterozygosity of 0.09 was the Canadian fiber/grain cultivar "X59" (Supplementary Material  Table S1, Table S2). Several drug cultivars, including "Pink Kush", "Punch Breath", "Durga Matta II CBD", "Durga Matta", "Cotton Candy", "Chem4OG", "33rd Degree" and "ASD" all from known seed banks displayed relatively good stability with heterozygosities below 0.2. Another metric of interest is the index of association (Ia; Brown et al. 1980). This index brings an additional insight as a tool to quantify the reshuffling of alleles that occurs in sexually outcrossing species. A deviation from zero (typical of clonal population) indicates increased genetic distance between two individuals from the same seed line. Once again "X59" displayed the least distance between individuals indicating a possible strong selection for stable traits in the cannabinoid, fiber and grain expression pathways, and thus a good homogeneous production. For drug-type varieties, the three "Durga Matta II CBD" accessions, which were vegetative cuttings from the same mother plants were as expected confirmed to be identical clones.
On the other end of the spectrum, several drugtype cultivars had very large Ia, which may indicate mislabelling of individual plants or tremendous outcrossing, a syndrome of using F1 hybrids, which appears quite common in the industry to date.

Association between genetic clusters and chemotypic expression
Looking through a broader lens at the 5 clusters into which the 420 samples segregate one can clearly see a strong differentiation between fiber and resin-type cannabis (Figs. 1, 2, 3, Table 1). One can infer strong selective pressure against THCA expression in K2 (CBD resin type) and K5 (Industrial hemp). Individuals in these clusters, while expressing similar chemotypes, likely underwent a bottleneck for CBDA expression, while displaying large Ia values, likely indicative of the polyphyletic and broad origins of the samples at hand for both the resin and fiber-type cannabis. While no chemotypic data was available for the fiber-type cultivars from K5, a subsample of 118 resin-type cultivars with chemotypic data, particularly for major cannabinoid and terpenoid expression demonstrate that (K2 CBD resin type) also consistently expressed p-cymene more so than other resin-type accessions (Fig. 4, Table 1). Among the THC expressing resin-type cluster, K4, the terpinolene dominant group also appeared to accumulate more CBGA and less CBCA than K1-3 (Fig. 4, Table 1). The latter appear to be well warranted sub-clades given the higher level of diversity observed in the aggregate K1-3 compared to each cluster individually.

Discussion
The cannabis (2n = 2x = 20) draft genome has a haploid genomic sequence of over 876 Mb -1000 Mb (Laverty et al. 2019;McKernan et al. 2020) and transcriptome of at least 30,000 genes (van Bakel et al. 2011;Jenkins and Orsburn 2019a, 2019b, 2019c. The genome displays a large amount of polymorphism with a single nucleotide polymorphism (SNP) present in one-in-a-hundred to one-in-fifty base pairs (McKernan et al. 2020). The phylogenetic relationship and basis for the intra-genus classification have typically recognized a broad structure with divergence between fiber-type hemp and drug/resin types cannabis (Sawler et al. 2015;Dufresnes et al. 2017).
In the present study, we looked deeper into the extent and distribution of genetic diversity in modern commercial cannabis using a novel targeted genetic assay. While often debated in the literature and confused by lore, our data support a strong historical and genomewide division between fiber and resin-type cannabis. McPartland et al. (2018) suggested that hemp (C. s. ruderalis) is the ancestral group and originated in Europe about 19.7 M years ago. A combination of genetic drift and selection then likely contributed to the observed differentiation between fiber and resin cultivars . The introgression of an active CBDAS into resin-type cannabis likely occurred over the past decade since the advent of medical and recreational cannabis legislation in Europe and North America. Of interest high CBD and balanced (Type II) accessions were found to cluster into the three resin groups identified here, suggesting a polyphyletic origin of high CBD resin-type cannabis. It is assumed from mapping populations that the active form of CBDAS and THCAS are at different loci on Chromosome 9, 8 cM apart in a linked tandem repeat region nestled in a complex array of transposable elements (Weiblen et al., 2015), making the characterization of this region quite complex. Further whole genome sequencing data, particularly using long reads, has enabled deeper insight into the structure of the cannabinoid cassette and demonstrates that the inactive CBDAS gene is in close linkage to the active THCAS (McKernan et al. 2020).
In addition to cannabinoid expression, another marker linked to xylan polysaccharide metabolism (SVIP14; 1-4 Beta Xylanase) was found to contribute to the separation between resin and fiber types which may play a role in fiber quality, given its putative function of breaking down the major constituent of cell walls. Such a marker may provide a possible avenue for the development of multi-purpose resin/fiber cultivars.
Integrative analyses revealed a co-expression network of genes involved in the biosynthesis of both cannabinoids and terpenoids from common precursors (Zager et al. 2019). As such, we searched for signals underlying the resin-type cannabis clusters differentiated by the dominant terpenes expression, often under the control of two dozen terpene synthase genes (TPS; Allen et al. 2019). While we did not find specific TPS linked markers, we found that a number of SNPs falling in uncharacterized regions of the current C. sativa genome were associated with the differentiation between terpene groups in the resin accessions sampled here. Two markers in particular showed strong differentiation between terpinolene dominant ("sativa") and the other myrcene and limonene dominant accessions ("indica"), in particular VSSL_digi2, located in an Oglucosyltransferase Rumi analogue involved in ribosome biogenesis and SVIP16 a protein kinase possibly involved in developmental and defense-related processes.
Additionally, the chemical data available in the study supported the assertion by others (McKernan et al. 2020) that the presence/absence of a CBCAS gene in resin-type cannabis may be responsible for the "leaky" expression of THCA even in cultivars that do not contain an active copy of THCAS. As such, selection against the presence of the CBCAS may provide a possible avenue towards the development of high resin cultivars that are compliant with the current USDA / Health Canada domestic hemp production programs.

Conclusion
We present a targeted genetic assay and algorithms related to sub-genus classification in cannabis. We demonstrate the use and repeatability of the assay to tease fiber-from resin-type cannabis as well as derive possible chemotype classes within resin-type cannabis. We demonstrate some of the utility of the assay as it related to breeding compliant cannabis and in providing a rapid means to individually identify cannabis accessions and to derive an individual fingerprint that may be used in seed-to-sale tracking and traceability endeavours. The population level data demonstrates that most resin-type varieties exhibited high heterozygosity and as such should be considered unstable at this stage. The use of our array or similar technologies will help in reducing heterozygosity and improving on the stability of trait expression in a similar manner as has been achieved in a fiber-type cultivar sampled here, with low heterozygosity and stable trait expression in large seed batches.