Following our first hypothesis, we applied supervised and unsupervised machine learning algorithms to the subjective effect tags and recovered a clustering of cultivars into similar breeds. In addition, metrics of cultivar similarity based on self-reported effects allowed machine learning classification by species tag into Cannabis “sativa” and Cannabis “indica”. The network of cultivars linked by effect similarity is represented in Fig. 1a using the ForceAtlas 2 layout, which increases the proximity of nodes with strong connections. The Louvain algorithm produced a partition with modularity Q = 0.264 and a total of 18 modules, of which the largest five contained ≈ 98% of all cultivars (see Supplementary information 5). The network color-coded by species tags showed a clear separation of “indicas” and “sativas”, with cultivars labeled as “hybrids” located in between. Module 1 contained most of the “sativa” cultivars, while “indicas” and “hybrids” appeared distributed across the other modules.
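To make the reported modularity values concrete, the following minimal sketch computes Newman's modularity Q for a given partition of a weighted, undirected network. It is an illustration only, not the implementation used in this work: the function name `modularity` and the input layout (an edge list plus a node-to-module dictionary) are assumptions, link weights are assumed to be non-negative, and the Louvain search for the partition that maximises Q is not shown.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of a weighted, undirected graph.

    edges: iterable of (u, v, weight), each undirected link listed once,
           with non-negative weights and no self-loops (assumption).
    community: dict mapping each node to its module label.
    """
    total_weight = 0.0             # m: sum of all link weights
    strength = defaultdict(float)  # k_i: summed weight of links at node i
    internal = defaultdict(float)  # summed weight of links inside each module
    for u, v, w in edges:
        total_weight += w
        strength[u] += w
        strength[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w

    # Total strength of the nodes belonging to each module.
    module_strength = defaultdict(float)
    for node, k in strength.items():
        module_strength[community[node]] += k

    # Q = sum over modules of (internal fraction - expected fraction).
    two_m = 2.0 * total_weight
    return sum(
        internal[c] / total_weight - (module_strength[c] / two_m) ** 2
        for c in module_strength
    )
```

Values of Q close to 0 indicate that the partition captures little more structure than expected by chance, while larger values indicate densely connected modules with comparatively weak links between them.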
Strains with names indicative of particular flavours clustered together in this network. Sub-panels I-VI (Fig. 1b) zoom into different regions of the network, showing that cultivars with similar naming conventions were strongly connected in the effect similarity graph. This was the case for lemons and diesels (I), skunks (II), grapes, cherries and berries (III), pineapples, oranges and strawberries (IV), fruits, cheeses and mangos (V), and blueberries (VI). We also described these groups by their general category, e.g. “lemons”, “grapefruits”, “strawberries” were labeled “fruits”. This grouping suggests the presence of correlations between effect and flavour tags, a possibility which is explored in the following sections.
Using the effect tag frequency vectors E(s) as features in a random forest classifier trained to distinguish “indica” from “sativa” species tags resulted in a highly accurate classification (Fig. 1c), with <AUC> = 0.9965 ± 0.0002 (mean ± standard deviation [STD], p < 0.001).
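For readers less familiar with the AUC, the sketch below computes the area under the ROC curve directly from its pairwise definition: the probability that a randomly chosen positive item receives a higher classifier score than a randomly chosen negative one. The function name `roc_auc` and the label coding (1 for one species tag, 0 for the other) are assumptions made for illustration; the random forest itself and the resampling that yields the mean ± STD are not reproduced here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (e.g. "sativa"), 0 for the negative class.
    scores: classifier score per item; higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one item of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to chance-level ranking and an AUC of 1 to perfect separation of the two species tags.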
Flavour tags were also capable of characterizing commercial cultivars in terms of the given species tag. Figure 2a shows the network constructed using flavour similarity to weight the links between cultivars, i.e. the correlation between the F(s) vectors. Application of the Louvain algorithm yielded Q = 0.221 and a total of 19 modules, with the four largest containing ≈ 98% of all cultivars (see Supplementary information 5). In this case, modules composed predominantly of a single species tag were no longer clearly visible; however, a gradient of species tags (from “indicas” to “hybrids” to “sativas”) could be observed from top to bottom.
As observed for the reported effects, cultivars with similar naming conventions were grouped together in the flavour network; moreover, their grouping was related to the flavours represented in their names (Fig. 2b). For instance, blueberries were grouped together and close to a cluster of grapes (I), fruit and cheese cultivars were in the same subpanel (II), fruit-related cultivars (pineapple, tangerine, citrus, orange) were grouped together (III), as well as skunks and diesels (IV), mangos and strawberries (V), with lemons appearing cohesively clustered together (VI). In this case, we must consider the possibility that the cultivar names biased the reported flavour tags.
Interestingly, when using the flavour tag frequencies as features in a random forest classifier trained to distinguish “indicas” from “sativas”, we also obtained a highly accurate classification (Fig. 2c), with <AUC> = 0.828 ± 0.002 (mean ± STD, p < 0.001).
According to our next hypothesis, we evaluated the correlations between effect and flavour tags across cultivars to establish a relationship between them. The results are shown in Fig. 3. We found significant (p < 0.05, FDR-corrected) negative and positive effect-flavour correlations. Figure 3a shows negative correlations, i.e. inverse relationships between the frequency of the reported effect and flavour tags, while Fig. 3b illustrates positive correlations. The frequency of unpleasant subjective effects, such as “anxious”, “dizzy”, “headache” and “paranoid”, correlated negatively with the frequency of almost all flavour tags, meaning that users tended to avoid the use of flavour tags when describing unpleasant experiences. In addition, we correlated cannabinoid content and reported effects for the 183 “strains” with cannabinoid content available from PSI Labs and did not find an association between negative effects and THC content in this sample (see Supplementary Fig. 5). The scarcity of flavour tags in reports of unpleasant experiences could be explained by considering that negative subjective experiences may outweigh flavour or scent perception. This result also suggests that in these specific experiences the appreciation of aromatic and/or flavour variables is undermined by the overwhelming subjective effects. In these cases, flavours cannot explain unpleasant effects. Further inspection of Fig. 3a and b reveals that certain flavours, such as “berry”, “blueberry”, “earthy”, “pungent” and “woody”, were negatively correlated with subjective stimulant effects, such as “creative” and “energetic”, and at the same time presented positive correlations with soothing effects such as “relaxed” and “sleepy”. Other flavours, such as “citrus”, “lime”, “tar”, “nutty”, “pineapple” and “tropical” presented the opposite behaviour, i.e. they correlated negatively with soothing effects (“relaxed”, “sleepy”) and positively with stimulant effects (“creative”, “energetic”).
The fact that the aforementioned flavours presented inverse correlation patterns with respect to opposite psychoactive effects adds support to the validity of this analysis.
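Because every effect tag is tested against every flavour tag, the significance threshold has to account for multiple comparisons. The sketch below illustrates one common way of controlling the false discovery rate, the Benjamini-Hochberg step-up procedure. This section does not state which FDR method was applied, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg` and the toy p-values in the usage note are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True if it survives FDR control at `alpha`.

    Benjamini-Hochberg step-up rule: sort the m p-values, find the largest
    rank k with p_(k) <= alpha * k / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes

    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant
```

For example, with the invented p-values `[0.001, 0.04, 0.03, 0.5]` only the first correlation would be retained at α = 0.05, although three of the four uncorrected p-values fall below 0.05.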
Next, we performed a hierarchical clustering of the effects and flavours according to their correlations (Fig. 3c). The main cluster separated unwanted effects from the rest. The remaining clusters of subjective effects were divided into three categories: soothing (“relaxed”, “sleepy”), stimulant (“euphoric”, “creative”, “energetic”, “talkative”) and other miscellaneous effects commonly associated with cannabis use (“hungry”, “giggly”, “happy”, “dry mouth”, “dry eyes”, “tingly” and “aroused”). It is important to note that this hierarchy emerged from considering effect-flavour correlations only. Consistently, flavours were clustered according to their negative correlations (“pungent”, “earthy”, “woody”, “berry”, “blueberry”) and their positive correlations (“citrus”, “tropical”, “orange”, “pineapple”, “lemon”, “lime”).
Next, we tested our third hypothesis by objectively analysing the unstructured written reports with LSA and using this information to correlate cultivars and detect recurrent topics, which allowed us to relate the reports with the subjective effect tags. We found that the information contained in the self-reported tags was consistent with the free narratives provided by the users. Unstructured written reports can provide complementary information, since users are not limited by predefined sets of effect and flavour tags. We constructed a network in which nodes represented cultivars and links were weighted by their semantic similarity, measured by the correlation between the columns of the rank-reduced term-document matrix \( {\hat{A}}_{50} \) (see the “Natural language processing of written unstructured reports” section in the Methods). The resulting networks are shown in Fig. 4a. Applying the Louvain algorithm yielded Q = 0.058, with a total of 15 modules, the largest 4 containing ≈ 98% of all cultivars (see Supplementary information 5). Module distribution was bimodal, i.e. when compared in terms of unstructured written reports, most cultivars fell into one of two categories. When comparing the modular decomposition with the species tag distribution, we found a clear division in terms of “indicas” and “sativas”, with “hybrids” in between. This division paralleled the two main modules. Module 1 was composed predominantly of “sativas” and “hybrids”, while module 2 was composed of “indicas” and “hybrids”.
Next, we investigated the most frequently used terms in the reports of all the cultivars taken together, and of “indicas” and “sativas” considered separately. Figure 4b presents word cloud representations of the 40 most common terms for cultivars. The most common terms related to the subjective perceptual and bodily effects (terms like “amaze”, “strong”, “felt”, “favourite”, “body”), therapeutic effects and/or medical conditions (“pain”, “anxiety”, “relax”, “help”, “relief”, “focus”) and emotions (“euphoric”, “anxiety”, “happy”, “confusion”). It is important to note that, due to limitations in the amount of available data, this analysis used single term representations (1-grams); therefore, words used in positive or negative contexts could not be differentiated, e.g. the term “anxiety” could appear in “This helped calm my anxiety” or in “This caused me anxiety” without distinction. Half of the most representative words were common to both “indicas” and “sativas”, such as “anxiety”, “amaze” and “effect”. The main difference between species tags emerged after excluding terms common to both, resulting in words such as “focus”, “euphoric”, “energetic” for “sativas”, and “insomnia”, “enjoy”, “flavour” for “indicas”. A detailed analysis of the main 5 components by species can be found in Supplementary Information 4 (see Supplementary Fig. 7).
To relate the free narrative reports to the subjective effect tags, we investigated two cultivars with a large number of reports: Super Lemon Haze (“sativa”, N = 1373, most frequently reported tags: “happy”, “energetic”, “uplifted”) and Blueberry (“indica”, N = 1456, most frequently reported tags: “relaxed”, “happy”, “euphoric”). We first applied PCA to the corresponding rank-reduced term-document frequency matrices to obtain the main topics for each “strain”. The word clouds with the highest-ranking terms for the first 5 principal components of each cultivar are presented in Fig. 5a. The variance explained by the first 5 components was 21% for Super Lemon Haze and also 21% for Blueberry. Next, we computed the semantic distance between the most frequent effect tags of each cultivar and the top 40 words in each of the principal components. The objective of this analysis was to evaluate whether the unstructured written reports reflected the choice of predefined tags made by the users. As shown in Fig. 5b, the most frequently reported effect tags for each cultivar showed a prominent projection into all the components, as compared to randomly chosen words. This suggests that users selected predefined tags consistently with the contents of their written reports.
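For orientation, rank reduction in LSA is conventionally obtained by truncating the singular value decomposition of the term-document matrix. Assuming this is the procedure behind \( {\hat{A}}_{50} \) (the Methods remain the authoritative description), and introducing the symbols A, U, Σ and V here purely for illustration, the construction can be written as:

```latex
A = U \Sigma V^{\top},
\qquad
\hat{A}_{50} = U_{50}\, \Sigma_{50}\, V_{50}^{\top}
```

Here \( U_{50} \) and \( V_{50} \) keep only the singular vectors associated with the 50 largest singular values collected in \( \Sigma_{50} \). Under this reading, the weight of the link between two cultivars is the correlation between the two corresponding columns of \( {\hat{A}}_{50} \).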
Terpene and cannabinoid content
Finally, in order to test our last hypothesis, we investigated the relationship between the user reports and the molecular composition of the cultivars. For this purpose, we accessed publicly available data of cannabinoid content provided in the work of Jikomes and Zoorob (Jikomes and Zoorob 2018), as well as assays of cannabinoid and terpene content from the PSI Labs website.
The first dataset contains information on THC and CBD content for all 887 cultivars studied in this work. The relationship between the content of both active cannabinoids is plotted in Fig. 6a, left panel. As reported by Jikomes and Zoorob, the cultivars fell into three general chemotypes based on their THC:CBD ratios (Jikomes and Zoorob 2018), consistent with previous findings (Hazekamp et al. 2016; Hillig and Mahlberg 2004). Most of the investigated cultivars fell into chemotype I (Chemotype I: 94.6%, Chemotype II: 4.8%, Chemotype III: 0.6%), indicating high THC:CBD ratios. This was replicated using the cannabinoid content data obtained from PSI Labs (N = 433 individual flower samples corresponding to 183 different cultivars), as shown in Fig. 6a, right panel. Again, the majority of the assays corresponded to chemotype I (Chemotype I: 90.3%, Chemotype II: 6%, Chemotype III: 3.7%).
Figure 6b shows the compiled data for 10 cannabinoids and 26 terpenes across multiple samples of a cultivar included in the PSI Labs dataset. While some terpenes appeared to be robustly detected in the “strain”, the relatively large spread indicated a considerable level of variability.
Next, we addressed in more detail the association between cannabinoid content, terpene content, flavours, effects, and cannabis species tag. For this purpose, each of the 183 cultivars in the PSI Labs dataset was described by a cannabinoid and terpene vector. We computed the Spearman correlation between these vectors to weight the links connecting the nodes that represented the cultivars. This resulted in cannabinoid and terpene similarity networks, which are shown in Fig. 7a and b, respectively. The network on the left panel of Fig. 7a is color-coded based on the application of the Louvain algorithm (Q = 0.041) to the cannabinoid similarity network, yielding a total of 8 modules, with the largest 3 representing ≈ 94% of the cultivars. This modular structure paralleled the classification into the three chemotypes.
The network on the right is color-coded based on cannabis species tag: the first and largest module contained cultivars belonging to all species tags (similar to chemotype I); another module, situated in the middle, presented a more balanced proportion of species tags, but also contained a smaller proportion of cultivars (similar to chemotype II), and the remaining module was composed mostly of “hybrids” (as in chemotype III). Since this classification used more information than the THC:CBD ratios, it corresponds to a multi-dimensional analogue of the standard chemotype characterization.
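The link weights of these similarity networks are Spearman correlations between composition vectors, i.e. Pearson correlations computed on ranks rather than on raw concentrations. A minimal sketch of this computation is given below; the function names are assumptions, the handling of ties by average ranks is the standard convention rather than a detail confirmed by this section, and the toy vectors in the usage note are invented.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, two cultivars whose terpenes are ordered identically by abundance obtain a correlation of 1 even if their absolute concentrations differ, e.g. `spearman([1, 2, 3, 4], [10, 20, 30, 400])`.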
Figure 7b shows the network obtained by correlating cultivars by their terpene vectors. The network on the left is color-coded based on the results of the Louvain algorithm (Q = 0.245), yielding only two modules. The network on the right is color-coded based on cannabis species tags. Since there are multiple terpenes in cannabis, without equivalents of main active cannabinoids such as THC and CBD, the chemical description in terms of terpenes is necessarily multi-dimensional. As with the semantic analysis of written reports, the association of cultivars based on the terpene profiles was bimodal and without a clear differentiation in terms of species tags.
Finally, we explored how effects and flavours were related based on the terpene content of the cultivars (Fig. 7c). We generated a terpene vector associated with each effect and flavour tag by averaging the terpene content across all the cultivars for which that tag was reported. The left panel of Fig. 7c shows how flavour tags (nodes) relate in terms of the correlation of their associated terpene vectors (weighted links). Modularity analysis (Q = 0.324) yielded a module comprising intense and pungent flavours (“skunk”, “diesel”, “chemical”, “pungent”) combined with citric flavours (“lemon”, “orange”, “lime”, “citrus”), a second module containing sweet and fruity flavours (“mango”, “strawberry”, “sweet”, “fruit”, “grape”), and a third module with a mixture of salty and sweet flavours (“cheese”, “butter”, “vanilla”, “pepper”) (see Supplementary information 5). Modularity analysis (Q = 0.194) of the network of effect tags associated by terpene similarity (Fig. 7c, right panel) yielded three modules resembling the clustering of effects presented in Fig. 3c, where we found groups consisting of subjective unwanted effects, stimulant effects and soothing effects, with an intermediate group associated with miscellaneous effects of smoked cannabis. Module 1 contained mostly stimulant effects (“energetic”, “euphoric”, “creative”, “talkative”, among others), module 2 contained soothing effects (“sleepy”, “relaxed”), and module 3 contained unwanted effects such as “headache”, “dizzy”, “paranoid” (with the exception of “anxious”, which was included in module 2). The fact that the network of effects associated by terpene content similarity reflected the hierarchical clustering of effects obtained from flavour association (Fig. 3c) reinforces the link between flavours and the psychoactive effects of cannabis.
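The construction of a terpene vector per tag can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: the function name `tag_terpene_vector` and the dictionary layout are hypothetical, and the average is unweighted, whereas the analysis may have weighted cultivars differently (e.g. by tag frequency).

```python
def tag_terpene_vector(tag, cultivar_tags, cultivar_terpenes):
    """Average the terpene vectors of all cultivars for which `tag` was reported.

    cultivar_tags: dict mapping cultivar name -> set of reported tags.
    cultivar_terpenes: dict mapping cultivar name -> list of terpene
        concentrations, in the same terpene order for every cultivar.
    Cultivars without terpene data are skipped (assumption).
    """
    vectors = [
        cultivar_terpenes[name]
        for name, tags in cultivar_tags.items()
        if tag in tags and name in cultivar_terpenes
    ]
    if not vectors:
        raise ValueError(f"no cultivar with terpene data reports tag {tag!r}")
    n = len(vectors)
    # zip(*vectors) iterates terpene by terpene across the selected cultivars.
    return [sum(column) / n for column in zip(*vectors)]
```

The resulting vectors, one per flavour or effect tag, can then be correlated pairwise to weight the links of the tag networks shown in Fig. 7c.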