Relationship among subjective responses, flavor, and chemical composition across more than 800 commercial cannabis varieties

Background Widespread commercialization of cannabis has led to the introduction of brand names based on users' subjective experience of psychological effects and flavours, but this process has occurred in the absence of agreed standards. The objective of this work was to leverage information extracted from large databases to evaluate the consistency and validity of these subjective reports, and to determine their correlation with the reported cultivars and with estimates of their chemical composition (delta-9-THC, CBD, terpenes). Methods We analyzed a large publicly available dataset extracted from Leafly.com, where users freely reported their experiences with cannabis cultivars, including different subjective effects and flavour associations. This analysis was complemented with information on the chemical composition of a subset of the cultivars extracted from Psilabs.org. The structure of this dataset was investigated using network analysis applied to the pairwise similarities between reported subjective effects and/or chemical compositions. Random forest classifiers were used to evaluate whether reports of flavours and subjective effects could identify the labelled species of each cultivar. We applied Natural Language Processing (NLP) tools to free narratives written by the users to validate the subjective effect and flavour tags. Finally, we explored the relationship between terpenoid content, cannabinoid composition and subjective reports in a subset of the cultivars. Results Machine learning classifiers distinguished between the species tags "Cannabis sativa" and "Cannabis indica" based on the reported flavours (0.828 ± 0.002, p < 0.001) and effects (0.9965 ± 0.0002, p < 0.001).
A significant relationship between terpene and cannabinoid content was suggested by positive correlations between subjective effect and flavour tags (p < 0.05, false discovery rate (FDR) corrected); these correlations clustered the reported effects into three groups representing unpleasant, stimulant and soothing effects. The use of predefined tags was validated by applying latent semantic analysis tools to unstructured written reviews, which also provided breed-specific topics consistent with the purported subjective effects. Terpene profiles matched the perceptual characterizations made by the users, particularly in the terpene-flavours graph (Q = 0.324). Conclusions Our work represents the first data-driven synthesis of self-reported and chemical information for a large number of cannabis cultivars. Since terpene content is robustly inherited and less influenced by environmental factors, flavour perception could represent a reliable marker to indirectly characterize the psychoactive effects of cannabis. Our novel methodology helps meet demands for reliable cultivar characterization in the context of an ever-growing market for medicinal and recreational cannabis.


Data sets
It is controversial to use the term "strain" to refer to commercially available cannabis varieties. Some authors prefer to use it (Gilbert and DiVerdi, 2018), while others explicitly discuss and explore the predictive value of strain names (Jikomes and Zoorob, 2018). Lewis and Russo explored the chemical coherence among individual samples of different varieties and defined the chemovar as the preferred nomenclature (Lewis et al., 2018). Given that the chemovar concept could not be applied to the Leafly data (since we did not have the THC:CBD content of individual samples), we replaced the term "strain" with "commercial variety" or "cultivar", following (Lewis et al., 2018).
Concerning the species label, in spite of the recent proposal to eliminate this classification (Piomelli and Russo, 2016), cannabis species have historically been reported to present different botanical characteristics, such as height and leaf width (Bonini et al., 2018; Lamarck, 1785). Moreover, some contemporary authors have reported chemical characteristics specific to each species (Hillig, 2004; Hillig and Mahlberg, 2004). Although we do not claim that this is the case for commercially available cultivars, we propose that some of the underlying differences between these categories could be conserved over time. Since we do not have any particular preference with respect to the category names, we adopted the "indica" and "sativa" labels because they were the ones provided by the Leafly website as well as by its users.

Leafly.com users' database by August 2018
The total number of reports by August 2018 was N reports = 100,901, with N = 983 corresponding to anonymous users. The remaining 99,918 reports corresponded to N users = 43,925, and 57.4% of these users completed only one report. Table 1 details the percentage of users with more than one, two and three reports. The last two columns show the mean proportion of reports per cultivar contributed by each user, and the total reports per cultivar. It should be noted that, for all partitions, individual users typically contributed about 1% of the reports of any specific cultivar.

Supplementary Figure 1: Reports per subject per cultivar
Supplementary Figure 1 shows, for each cultivar (N total = 887), the maximum fraction of its reports contributed by a single user. In the best case, no cultivar would be dominated by a single reporter and the area under the curve would be 0; in the worst case, all reports for each cultivar would come from a single user, yielding an area of 1. The data from Leafly.com by August 2018 had an area under the curve of 0.05. As shown in Supplementary Figure 1, only 10% of the studied cultivars had more than 10% of their reports given by a single user, and only 0.6% of varieties had more than 20% of their reports given by a single user, suggesting that bias from single-user reports could affect only a small fraction of the whole ensemble available at the time the data was retrieved.
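The concentration measure behind Supplementary Figure 1 can be sketched as follows (all counts here are hypothetical, generated only to illustrate the computation; this is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
max_fraction = []
for _ in range(100):                            # hypothetical: 100 cultivars
    n_users = rng.integers(2, 50)               # users reporting this cultivar
    counts = rng.integers(1, 10, size=n_users)  # reports per user
    # largest fraction of this cultivar's reports given by a single user
    max_fraction.append(counts.max() / counts.sum())

curve = np.sort(max_fraction)   # sorted concentration curve over cultivars
area = curve.mean()             # average height ~ area under the unit-interval curve
```

A low area indicates that few cultivars are dominated by a single reporter, matching the 0.05 value reported for the Leafly data.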

Networks and modularity
A network is a representation of the interactions (edges) between certain objects (nodes). In this work, we represented the cultivars as networks (one node per cultivar), with edges weighted by the value of the non-parametric Spearman rank correlation coefficient between the effect / flavour vectors of the two connected cultivars.
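A minimal sketch of this construction (hypothetical tag-frequency data; SciPy and NetworkX are assumptions, not necessarily the authors' toolchain):

```python
import numpy as np
import networkx as nx
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical data: rows = cultivars, columns = effect/flavour tag frequencies
freq = rng.random((5, 12))
names = [f"cultivar_{i}" for i in range(5)]

G = nx.Graph()
G.add_nodes_from(names)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        # edge weight = Spearman rank correlation between tag vectors
        rho, _ = spearmanr(freq[i], freq[j])
        G.add_edge(names[i], names[j], weight=rho)
```

In practice one would typically threshold or keep only positive correlations before community detection.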
A common strategy to study the structure of networks is to analyze groups of nodes more densely connected among themselves than with the rest of the network, commonly called the community or modular structure of the network. Given a partition of the network into communities, the modularity (Q) is defined as the fraction of internal connections ($e_{ii}$) minus the expected value of the same quantity in a network with the same community division but with random connections between nodes (Dorogovtsev and Mendes, 2004):

$Q = \sum_i \left( e_{ii} - a_i^2 \right)$  (Eq. 1)

where $e_{ij}$ represents the fraction of edges between modules $i$ and $j$ (including $i = j$) and $a_i = \sum_j e_{ij}$. The final modularity Q of a network is computed using the partition that maximizes Eq. 1; it is at most 1, with larger values indicating stronger community structure. As described by Clauset and Newman (Clauset et al., 2004), this can be expressed in terms of the adjacency matrix $A_{ij}$ as:

$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$  (Eq. 2)

where $A_{ij}$ represents the weight of the edge between nodes $i$ and $j$, $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges attached to node $i$, $c_i$ is the community to which node $i$ is assigned, the function $\delta(c_i, c_j)$ is 1 when $c_i = c_j$ and zero otherwise, and $m = \frac{1}{2} \sum_{i,j} A_{ij}$.
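As an illustrative check (a sketch, not the authors' code; the toy graph and its partition are assumptions), the adjacency-matrix form of the modularity can be evaluated directly and compared against NetworkX's implementation:

```python
import networkx as nx

G = nx.barbell_graph(4, 0)      # two 4-cliques joined by a single edge
parts = [set(range(4)), set(range(4, 8))]
community = {n: (0 if n < 4 else 1) for n in G}

A = nx.to_numpy_array(G)        # adjacency matrix A_ij
k = A.sum(axis=1)               # node degrees k_i
m = A.sum() / 2.0               # total edge weight

# Q = (1/2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j)
Q = 0.0
for i in G:
    for j in G:
        if community[i] == community[j]:
            Q += A[i, j] - k[i] * k[j] / (2 * m)
Q /= 2 * m
```

The direct sum agrees with `nx.algorithms.community.modularity` on the same partition.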
Computing the modularity for all possible partitions of the network is usually very costly. Because of this, various algorithms (also known as "heuristics") have been developed to estimate the best partition of a network. The most frequently followed strategies can be classified as either agglomerative or divisive. Agglomerative algorithms take isolated nodes as starting points (each representing an independent community) and, through a procedure of step-by-step merging, recalculate Q until it is optimized. Conversely, divisive strategies start with the whole network identified as a single community and apply a process of step-by-step pruning to optimize Q.
In this work we used the Louvain agglomerative algorithm as implemented in Gephi (Blondel et al., 2008; Lambiotte et al., 2008). The algorithm proceeds in two iterative steps. First, the algorithm starts from all the isolated nodes, each considered as a single community. Then, for each node, it considers the neighbouring nodes and evaluates the gain in Q obtained by moving the node into each neighbour's community; the node is placed in the community yielding the largest gain, and if no gain is possible the node remains in its original community. This step continues until no movement of a node between communities improves the modularity. The order in which the nodes are considered by the algorithm does not have an impact on the final communities that are discovered (Blondel et al., 2008), hence we started from a random selection. The second step consists in building a new network whose nodes are the communities determined in the previous step. For this purpose, the weights of the edges between the new merged nodes are given by the sum of the weights of the links between the corresponding communities, while links inside a community are represented by a self-link with weight equal to the sum of all the internal links (Supplementary Figure 3, A). The Gephi implementation of the algorithm is based on a generalization of this process (Lambiotte et al., 2008) that allows the inclusion of a resolution parameter (γ) to determine the hierarchical structure of the network. The algorithm was applied to maximize the above-mentioned Newman's modularity with a resolution parameter γ = 1.
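The same algorithm family is also available in NetworkX (`louvain_communities`, NetworkX ≥ 2.8); a minimal sketch on a toy graph with an obvious two-community structure (the paper itself used Gephi, so this only illustrates the procedure):

```python
import networkx as nx

G = nx.barbell_graph(5, 0)   # two 5-cliques joined by a single edge

# Louvain community detection with resolution gamma = 1
communities = nx.community.louvain_communities(G, resolution=1.0, seed=0)
Q = nx.community.modularity(G, communities)
```

On this graph the algorithm recovers the two cliques as communities with a clearly positive modularity.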

Random Forest
In this work we used the random forest algorithm to classify species tags given the reported frequencies of effects and flavours. Here we give a conceptual explanation of this method -for a more extensive explanation, we recommend (Hastie, 2009;James et al., 2013).
The random forest algorithm is based on a simple and intuitive method (the decision tree) combined with bootstrap aggregation of multiple models. Tree-based methods segment the prediction space based on consecutive decisions over many predictors or features. Supplementary Figure 3 (panel B) shows an example where predictor variables x and y can be separated by different cuts ("splits") in the plane to predict the class of a new and unseen sample (e.g., the white dot in Supplementary Figure 3, panel B). This procedure can be represented graphically as a tree (Supplementary Figure 3, B, right). It should be noted that when a large number of features are taken into consideration, the order in which the variables are split may affect the result, since starting from a noisy variable will negatively impact the quality of all ensuing splits. To overcome this limitation, the random forest algorithm trains several trees, each on a random subset of the variables or features, and then aggregates the decisions of all individual models. When a new sample is evaluated by the classifier, the class is determined by the vote of the individual trees.
In order to avoid overfitting, we divided our dataset into 5 equal parts and used 4 parts to train the model and the remaining part for testing (5-fold stratified cross-validation).
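A sketch of this evaluation scheme with scikit-learn on synthetic data (the features and labels here are stand-ins, not the Leafly tag frequencies):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for the effect/flavour frequency matrix and binary species labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # 5-fold stratified CV
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
mean_auc = scores.mean()
```

Stratification keeps the class proportions of the full dataset in each of the five folds, which matters when the species labels are imbalanced.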

Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a natural language processing tool that determines the similarity between documents from the co-occurrence patterns of the words that compose them (Landauer et al., 1998). The general idea underlying the LSA algorithm is to represent individual terms as vectors in a semantic space of reduced rank.
As described in the main manuscript text, the first step consists in the construction of an adequate terms (w) × documents (j) matrix M. The second step is to decompose this matrix by means of the Singular Value Decomposition (Huang and Narendra, 2008) as $M = U S W^T$, where U contains the eigenvectors of $M M^T$, S is a diagonal matrix containing the ordered singular values of M, and W contains the eigenvectors of $M^T M$ (see Supplementary Figure 3, C). To reduce the impact of low and noninformative correlations in the data, we reconstructed the original matrix using the 50 largest values of S. Finally, to find the topics in the data we conducted a principal component analysis (PCA) of the reconstructed matrix. PCA is based on the Singular Value Decomposition and its objective is to reorganize the data by identifying the orthogonal directions in which its projection yields the maximum variance (Bertoldo et al., 2004; De Lathauwer et al., 2000; Friston et al., 1996; Gao et al., 2009; Naganawa et al., 2005; Supekar et al., 2008; Tipping and Bishop, 1999). In our case, we retained the first five principal directions, leading to the identification of the five most relevant topics in the data.
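The pipeline can be sketched with NumPy on a toy term-document matrix (the shapes and the rank-2 truncation are illustrative stand-ins for the 50-component truncation described above):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((30, 10))                       # terms x documents (toy sizes)

# Singular Value Decomposition: M = U S W^T
U, s, Wt = np.linalg.svd(M, full_matrices=False)

# Low-rank reconstruction (the paper keeps the 50 largest singular values;
# rank 2 is used here only because the toy matrix is small)
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Wt[:k, :]

# PCA of the reconstructed matrix: centre the columns, then SVD again;
# the rows of `components` are the orthogonal directions of maximum variance
centred = M_k - M_k.mean(axis=0)
_, _, components = np.linalg.svd(centred, full_matrices=False)
topics = components[:5]                        # first five principal directions
```

The truncation discards small singular values, which is what suppresses the low, noninformative correlations mentioned above.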

Word2vec
To validate the LSA results we used a Word2Vec (Rong, 2014) model pre-trained on the Google News corpus (https://news.google.com/), comprising a vocabulary of 3 × 10^6 unique words and a total of 10^10 words for training, using the skip-gram neural network architecture with dimension 300 (a common choice in the related literature). After training, this model can represent any word in the corpus as a 300-dimensional vector by assigning to each unique word in the vocabulary its corresponding vector in the embedding. Vectors corresponding to words with similar semantic content in the corpus are located close to one another in the embedding; thus, it is possible to determine the semantic distance between words by measuring the cosine of the angle between their respective vectors (Supplementary Figure 3, D).

Supplementary Figure 3: A. Steps followed by the Louvain algorithm, adapted from (Blondel et al., 2008). B. Example of a partition of feature space by a decision tree. Left: splits in feature space for two features. Right: the decision tree associated with the splits presented in the left panel. C. Schematic representation of LSA and the factorization by Singular Value Decomposition. D. Three example words in a space defined by a word2vec model; the semantic distance between words can be computed as the cosine distance between the associated vectors.
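The cosine-distance comparison can be illustrated with toy 300-dimensional vectors (a real word2vec model would supply the embeddings; the word names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
v_sweet, v_noise, v_engine = rng.normal(size=(3, 300))
v_sugary = 0.9 * v_sweet + 0.1 * v_noise   # deliberately similar to v_sweet

def cosine_distance(a, b):
    """1 - cos(angle) between two embedding vectors."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

d_close = cosine_distance(v_sweet, v_sugary)   # semantically close pair
d_far = cosine_distance(v_sweet, v_engine)     # unrelated pair
```

Smaller distances correspond to smaller angles between vectors, i.e. more similar semantic content.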