Training Tissue-Specific Gene Embeddings on GTEx Data

The gene embedding training code is available from https://github.com/nanxstats/exp2vec.

The Shiny app is at https://nanx.shinyapps.io/exp2vec/, with code available from https://github.com/nanxstats/exp2vec-shiny.

Background

We can often observe a striking parallel between natural languages and the human genome when count data is derived from them: words counted across documents look a lot like genes counted across samples. This makes it straightforward to model each with the latest statistical or machine learning approaches developed for the other field, and to borrow ideas between them.

For example, suppose we treat each gene as a “term” and say each sample is a “document”. In that case, we can use word embedding methods in NLP like word2vec or GloVe to learn low-dimensional vector representations for genes, given the gene expression count data for a collection of samples.

Traditional methods might assume two genes are similar if they have highly correlated expression profiles or overlaps in labels. In contrast, we assume that two genes are only similar if their expression context is similar, i.e., the groups of genes that frequently express together with them are similar. This is not unlike the distributional hypothesis that words occurring in similar contexts tend to have similar meanings.

Our approach would only require the expression data without side information such as functional annotations. It could also help us avoid common challenges when measuring correlations between variables in a high-dimensional space. We can then use the learned representations to measure the similarities between genes, discover linear algebraic structure of genes, and augment the input for downstream tasks.

Preprocess GTEx data

The GTEx project uses RNA sequencing to characterize tissue-specific gene expression patterns. First, I downloaded the open access read count data from the GTEx Portal. The expression data is encoded in gct (tsv) format, with metadata such as the tissue type presented in a separate tsv file.

After a few basic data cleaning steps using my package grex, I created a count matrix for pancreas tissue. The data contains 328 samples and 25321 genes, and the expression counts have the following quantiles:

quantile(dtm_tissue)
#>    0%      25%      50%      75%     100%
#>     0        1      107      939 11344034
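
As a rough illustration of this step, here is a minimal base R sketch that reads a GCT v1.2 file (a version line, a dimensions line, then a tab-delimited table whose first two columns are Name and Description) into a samples-by-genes matrix and keeps one tissue. The helper `read_gct_counts()`, the toy file contents, and the named vector `sample_tissue` (standing in for the separate metadata file) are all made up for illustration. This is not the cleaning pipeline used for the post, which relied on grex.

```r
# Read a GCT v1.2 file into a samples x genes count matrix.
# `file` may be a path or a connection. Hypothetical helper, not part of grex.
read_gct_counts <- function(file) {
  # Skip the version line and the dimensions line; line 3 is the header
  gct <- read.delim(file, skip = 2, check.names = FALSE, stringsAsFactors = FALSE)
  # Drop the Name and Description columns, keep the per-sample counts
  counts <- as.matrix(gct[, -(1:2), drop = FALSE])
  rownames(counts) <- gct$Description
  # Transpose so that samples are rows ("documents") and genes are columns ("terms")
  t(counts)
}

# Toy GCT content with 2 genes and 3 samples (made-up values)
gct_text <- c(
  "#1.2",
  "2\t3",
  "Name\tDescription\tS1\tS2\tS3",
  "ENSG01\tGENE_A\t5\t0\t12",
  "ENSG02\tGENE_B\t1\t7\t0"
)
dtm <- read_gct_counts(textConnection(gct_text))

# Hypothetical sample-to-tissue lookup standing in for the metadata file
sample_tissue <- c(S1 = "Pancreas", S2 = "Liver", S3 = "Pancreas")
dtm_tissue <- dtm[sample_tissue[rownames(dtm)] == "Pancreas", , drop = FALSE]
dtm_tissue
```

With real data, the Description column can contain duplicated gene symbols, so an extra deduplication or ID mapping step would be needed before training.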

Convert DTM into TCM

The tricky part is getting a term-co-occurrence matrix (TCM) from the document-term matrix (DTM) above because we need a TCM as the input for gene embedding training.

Unlike a text corpus, gene expression data has no raw sequence in which to measure co-occurrence directly with sliding windows. Therefore, I took a quick and dirty approach: convert the document-term matrix \(A\) to a term-co-occurrence matrix by computing the cross-product \(A^T A\), with a correction for the self-co-occurrence counts on the diagonal. This is inspired by quanteda:::fcm.dfm().

dtm2tcm <- function(x) {
  # Equivalent to t(x) %*% x
  y <- Matrix::crossprod(x)
  # Correct self-cooccurrence
  Matrix::diag(y) <- (Matrix::diag(y) - Matrix::colSums(x)) / 2L
  y
}
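
To see what the diagonal correction does, here is a toy check on a 2-sample, 3-gene matrix (the sample and gene names are made up). Off the diagonal, \(A^T A\) sums the products of two genes' counts across samples. On the diagonal, the same product would pair every count with itself, so the correction replaces \(\sum x^2\) with \(\sum x(x-1)/2\), the number of distinct pairs within each sample.

```r
library(Matrix)

# Same function as above, repeated so this example runs on its own
dtm2tcm <- function(x) {
  y <- Matrix::crossprod(x)
  Matrix::diag(y) <- (Matrix::diag(y) - Matrix::colSums(x)) / 2L
  y
}

# Toy DTM: rows are samples, columns are genes (hypothetical names and counts)
dtm_toy <- Matrix(
  c(2, 0, 1,
    1, 3, 0),
  nrow = 2, byrow = TRUE, sparse = TRUE,
  dimnames = list(c("S1", "S2"), c("G1", "G2", "G3"))
)

# Convert to a base matrix for easy inspection
tcm_toy <- as.matrix(dtm2tcm(dtm_toy))
tcm_toy
```

For instance, the G1-G2 entry is 2·0 + 1·3 = 3, and the G2 diagonal entry is (9 − 3) / 2 = 3, which matches the three distinct pairs among a count of 3 in sample S2.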

Training GloVe model

My favorite interpretation of word2vec and GloVe is that they implicitly or explicitly factorize a word-context pointwise mutual information (PMI) matrix with a shift. In our example, if we set the embedding dimensionality to 50, this means factorizing the gene-context matrix \(M_{25321\times 25321}\) into a gene embedding matrix \(G_{25321 \times 50}\) and a context embedding matrix \(C_{50 \times 25321}\) such that \(M \approx G \cdot C\). We then use \(G\) to compute gene similarities and ruthlessly ignore \(C\). 🤔

In my experiment, I used text2vec to train a GloVe model with the embedding dimensionality set to 100. The other parameters were: the maximum number of co-occurrences used in the weighting function \(x_{\max}\) = 100; learning rate = 0.05; number of iterations = 50. The embedding dimensionality is the key parameter here. It can be tuned and typically ranges from 50 to 300, in steps of 50.
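
For reference, this is where \(x_{\max}\) enters the GloVe objective as given in the original GloVe paper. The notation is chosen for this post rather than taken from text2vec: \(X\) is the TCM, \(g_i\) is the \(i\)-th row of \(G\), \(c_j\) is the \(j\)-th column of \(C\), and \(b_i\), \(\tilde{b}_j\) are bias terms.

\[
J = \sum_{i,j:\, X_{ij} > 0} f(X_{ij}) \left( g_i^\top c_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
\]

The exponent \(\alpha\) is 0.75 in the GloVe paper; the value used in this experiment is not stated above, so treat it as an assumption. With \(x_{\max} = 100\), every gene pair whose co-occurrence count reaches 100 receives full weight, and rarer pairs are down-weighted.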

On reproducibility, note that we can only get reproducible embeddings in text2vec by setting the random seed AND not using parallelization. For training speed, I chose to use parallelization in this example. Therefore, the embeddings we got reflect only one possible outcome. More generally, the stability of embeddings between runs is an intriguing open problem.

Low-rank projection

After getting the gene embedding, I projected it onto a 2D plane using t-SNE to see how well the embedding aligns with our perceived reality. I also ran k-means on the embedding to get 15 clusters and colored the projected points by cluster to make the patterns easier to distinguish.
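
A minimal sketch of this step could look like the following. It assumes the Rtsne package for t-SNE (the post does not say which implementation was used) and runs on a random toy matrix standing in for the learned embedding, so the resulting plot carries no biological meaning. The perplexity and `nstart` values are arbitrary choices for illustration.

```r
set.seed(42)

# Toy stand-in for the learned embedding: 300 "genes" x 20 dimensions (random data)
emb <- matrix(
  rnorm(300 * 20),
  nrow = 300,
  dimnames = list(paste0("GENE", seq_len(300)), NULL)
)

# k-means on the full embedding (not on the 2D projection), 15 clusters
km <- kmeans(emb, centers = 15, nstart = 5)

# Project to 2D with t-SNE and color the points by cluster,
# only if the Rtsne package is installed
if (requireNamespace("Rtsne", quietly = TRUE)) {
  proj <- Rtsne::Rtsne(emb, dims = 2, perplexity = 30, check_duplicates = FALSE)$Y
  plot(proj, col = km$cluster, pch = 20, xlab = "t-SNE 1", ylab = "t-SNE 2")
}

# Genes in one cluster can then be listed by name
head(names(km$cluster)[km$cluster == 1])
```

Clustering in the original embedding space and using t-SNE only for display avoids reading too much into distances on the 2D plot, which t-SNE does not preserve globally.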

Projected gene embedding using t-SNE

The dark blue cluster at the top-left corner looks unusual. So let’s check what genes are in it:

#>   [1] "AKAP17A"      "AMIGO3"       "ASMT"         "ASMTL"
#>   [5] "BPY2"         "BPY2B"        "CD99"         "CD99P1"
#>   [9] "CDY1"         "CDY1B"        "CDY2B"        "CRLF2"
#>  [13] "CSF2RA"       "CT45A2"       "CT45A6"       "CT45A7"
#>  [17] "CT45A8"       "CT45A9"       "CT47A1"       "CT47A10"
#>  [21] "CT47A11"      "CT47A12"      "CT47A2"       "CT47A3"
#>  [25] "CT47A4"       "CT47A5"       "CT47A6"       "CT47A7"
#>  [29] "CT47A8"       "CT47A9"       "DAZ3"         "DAZ4"
#>  [33] "DCANP1"       "DEFB104B"     "DEFB106A"     "DEFB106B"
#>  [37] "DEFB107A"     "DEFB113"      "DEFB130A"     "DEFB130B"
#>  [41] "DEFT1P2"      "DHRSX"        "FAM197Y2"     "FAM197Y3"
#>  [45] "FAM197Y4"     "FAM197Y6"     "FAM197Y7"     "FAM197Y8"
#>  [49] "GAGE12C"      "GAGE12D"      "GAGE12E"      "GAGE2C"
#>  [53] "GAGE4"        "GTPBP6"       "HIST1H2AM"    "IL3RA"
#>  [57] "IL9R"         "KRTAP21-3"    "KRTAP22-2"    "KRTAP25-1"
#>  [61] "LINC00102"    "LINC00106"    "LINC00280"    "LINC00317"
#>  [65] "LINC00395"    "LINC00459"    "LINC00583"    "LINC00587"
#>  [69] "LINC00972"    "LINC01074"    "LINC01162"    "LINC01216"
#>  [73] "LINC01256"    "LINC01392"    "LINC01661"    "LINC01680"
#>  [77] "LINC01683"    "LINC01737"    "LINC01745"    "LINC01809"
#>  [81] "LINC01921"    "LINC01957"    "LINC02025"    "LINC02098"
#>  [85] "LINC02162"    "LINC02167"    "LINC02439"    "LINC02472"
#>  [89] "LINC02545"    "LINC02676"    "LOC101927908" "LOC101928775"
#>  [93] "LOC101929148" "LOC102723655" "LOC102724452" "LOC102725532"
#>  [97] "LOC105370586" "LOC105373524" "LOC105378311" "MBD3L2"
#> [101] "MBD3L4"       "MIR101-1"     "MIR101-2"     "MIR103A1"
#> [105] "MIR103A2"     "MIR105-1"     "MIR105-2"     "MIR1185-1"
#> [109] "MIR1200"      "MIR1202"      "MIR1205"      "MIR1206"
#> [113] "MIR1207"      "MIR1208"      "MIR124-2"     "MIR1243"
#> [117] "MIR1246"      "MIR1248"      "MIR1251"      "MIR1252"
#> [121] "MIR1255B1"    "MIR1255B2"    "MIR1258"      "MIR125B2"
#> [125] "MIR1261"      "MIR1263"      "MIR1264"      "MIR1265"
#> [129] "MIR1267"      "MIR1268B"     "MIR1269A"     "MIR1277"
#> [133] "MIR1278"      "MIR128-2"     "MIR1287"      "MIR1289-2"
#> [137] "MIR1291"      "MIR1293"      "MIR1295A"     "MIR1297"
#> [141] "MIR1298"      "MIR1302-1"    "MIR1302-10"   "MIR1302-4"
#> [145] "MIR1302-6"    "MIR1302-7"    "MIR1305"      "MIR1307"
#> [149] "MIR1321"      "MIR1323"      "MIR133A2"     "MIR139"
#> [153] "MIR1468"      "MIR1471"      "MIR147A"      "MIR148A"
#> [157] "MIR151A"      "MIR1537"      "MIR15A"       "MIR16-1"
#> [161] "MIR16-2"      "MIR1827"      "MIR184"       "MIR187"
#> [165] "MIR188"       "MIR190A"      "MIR1911"      "MIR196A1"
#> [169] "MIR1972-2"    "MIR1976"      "MIR203A"      "MIR204"
#> [173] "MIR205"       "MIR2052"      "MIR2054"      "MIR206"
#> [177] "MIR2113"      "MIR2114"      "MIR2115"      "MIR222"
#> [181] "MIR2278"      "MIR26A1"      "MIR297"       "MIR302A"
#> [185] "MIR302B"      "MIR302E"      "MIR302F"      "MIR3065"
#> [189] "MIR30A"       "MIR30B"       "MIR30C1"      "MIR30D"
#> [193] "MIR31"        "MIR3116-1"    "MIR3118-1"    "MIR3118-2"
#> [197] "MIR3118-3"    "MIR3118-4"    "MIR3121"      "MIR3122"
#> [201] "MIR3123"      "MIR3126"      "MIR3128"      "MIR3129"
#> [205] "MIR3134"      "MIR3137"      "MIR3141"      "MIR3144"
#> [209] "MIR3147"      "MIR3152"      "MIR3154"      "MIR3155A"
#> [213] "MIR3156-1"    "MIR3156-2"    "MIR3156-3"    "MIR3160-1"
#> [217] "MIR3165"      "MIR3166"      "MIR3167"      "MIR3169"
#> [221] "MIR3170"      "MIR3171"      "MIR3173"      "MIR3175"
#> [225] "MIR3179-2"    "MIR3179-3"    "MIR3179-4"    "MIR3190"
#> [229] "MIR3193"      "MIR3194"      "MIR3196"      "MIR3197"
#> [233] "MIR3200"      "MIR3201"      "MIR320B1"     "MIR320C2"
#> [237] "MIR320D1"     "MIR320D2"     "MIR329-1"     "MIR329-2"
#> [241] "MIR33A"       "MIR34C"       "MIR3606"      "MIR361"
#> [245] "MIR362"       "MIR3651"      "MIR3655"      "MIR3660"
#> [249] "MIR3661"      "MIR3667"      "MIR3668"      "MIR3670-3"
#> [253] "MIR3670-4"    "MIR3672"      "MIR3675"      "MIR3679"
#> [257] "MIR3681"      "MIR3683"      "MIR3686"      "MIR3688-1"
#> [261] "MIR3689A"     "MIR3689C"     "MIR3689D1"    "MIR3689E"
#> [265] "MIR3690"      "MIR3692"      "MIR3713"      "MIR374A"
#> [269] "MIR374B"      "MIR376C"      "MIR378A"      "MIR378B"
#> [273] "MIR378C"      "MIR378D1"     "MIR378E"      "MIR378G"
#> [277] "MIR380"       "MIR381"       "MIR382"       "MIR383"
#> [281] "MIR384"       "MIR3910-1"    "MIR3914-1"    "MIR3915"
#> [285] "MIR3917"      "MIR3919"      "MIR3920"      "MIR3921"
#> [289] "MIR3922"      "MIR3923"      "MIR3924"      "MIR3926-1"
#> [293] "MIR3929"      "MIR3937"      "MIR3938"      "MIR3974"
#> [297] "MIR3975"      "MIR3976"      "MIR422A"      "MIR424"
#> [301] "MIR4251"      "MIR4252"      "MIR4255"      "MIR4260"
#> [305] "MIR4262"      "MIR4266"      "MIR4267"      "MIR4268"
#> [309] "MIR4270"      "MIR4272"      "MIR4274"      "MIR4275"
#> [313] "MIR4278"      "MIR4279"      "MIR4280"      "MIR4281"
#> [317] "MIR4282"      "MIR4283-1"    "MIR4283-2"    "MIR4287"
#> [321] "MIR4289"      "MIR4290"      "MIR4291"      "MIR4293"
#> [325] "MIR4294"      "MIR4295"      "MIR4298"      "MIR4299"
#> [329] "MIR4300"      "MIR4301"      "MIR4302"      "MIR4303"
#> [333] "MIR4309"      "MIR4310"      "MIR4316"      "MIR4317"
#> [337] "MIR4318"      "MIR4320"      "MIR4321"      "MIR4325"
#> [341] "MIR4327"      "MIR4328"      "MIR4330"      "MIR4421"
#> [345] "MIR4422"      "MIR4423"      "MIR4425"      "MIR4427"
#> [349] "MIR4429"      "MIR4430"      "MIR4433B"     "MIR4434"
#> [353] "MIR4435-2"    "MIR4436B1"    "MIR4436B2"    "MIR4437"
#> [357] "MIR4438"      "MIR4439"      "MIR4442"      "MIR4445"
#> [361] "MIR4446"      "MIR4447"      "MIR4451"      "MIR4452"
#> [365] "MIR4454"      "MIR4455"      "MIR4456"      "MIR4457"
#> [369] "MIR4462"      "MIR4465"      "MIR4466"      "MIR4469"
#> [373] "MIR4470"      "MIR4471"      "MIR4472-2"    "MIR4473"
#> [377] "MIR4474"      "MIR4475"      "MIR4476"      "MIR4477A"
#> [381] "MIR4477B"     "MIR4478"      "MIR448"       "MIR4481"
#> [385] "MIR4482"      "MIR4486"      "MIR4487"      "MIR4490"
#> [389] "MIR4491"      "MIR4493"      "MIR4494"      "MIR4495"
#> [393] "MIR4496"      "MIR4499"      "MIR449C"      "MIR4500"
#> [397] "MIR4501"      "MIR4503"      "MIR4507"      "MIR4509-1"
#> [401] "MIR4509-2"    "MIR4509-3"    "MIR450A1"     "MIR450A2"
#> [405] "MIR450B"      "MIR4510"      "MIR4514"      "MIR4520-1"
#> [409] "MIR4521"      "MIR4524B"     "MIR4527"      "MIR4528"
#> [413] "MIR4529"      "MIR4531"      "MIR4533"      "MIR4535"
#> [417] "MIR4539"      "MIR4540"      "MIR4643"      "MIR4650-1"
#> [421] "MIR4650-2"    "MIR4654"      "MIR4658"      "MIR4659A"
#> [425] "MIR466"       "MIR4660"      "MIR4661"      "MIR4662B"
#> [429] "MIR4663"      "MIR4665"      "MIR4666B"     "MIR4669"
#> [433] "MIR4675"      "MIR4681"      "MIR4686"      "MIR4694"
#> [437] "MIR4696"      "MIR4699"      "MIR4704"      "MIR4705"
#> [441] "MIR4710"      "MIR4711"      "MIR4713"      "MIR4715"
#> [445] "MIR4716"      "MIR4718"      "MIR4719"      "MIR4727"
#> [449] "MIR4729"      "MIR4731"      "MIR4733"      "MIR4735"
#> [453] "MIR4736"      "MIR4739"      "MIR4743"      "MIR4756"
#> [457] "MIR4759"      "MIR4760"      "MIR4762"      "MIR4765"
#> [461] "MIR4769"      "MIR4770"      "MIR4771-1"    "MIR4771-2"
#> [465] "MIR4774"      "MIR4780"      "MIR4781"      "MIR4789"
#> [469] "MIR4790"      "MIR4791"      "MIR4793"      "MIR4797"
#> [473] "MIR4798"      "MIR4799"      "MIR4801"      "MIR4803"
#> [477] "MIR4804"      "MIR486-1"     "MIR487A"      "MIR488"
#> [481] "MIR494"       "MIR495"       "MIR499A"      "MIR5007"
#> [485] "MIR500A"      "MIR501"       "MIR5011"      "MIR502"
#> [489] "MIR505"       "MIR507"       "MIR508"       "MIR5089"
#> [493] "MIR509-1"     "MIR509-2"     "MIR509-3"     "MIR5092"
#> [497] "MIR510"       "MIR5100"      "MIR511"       "MIR512-1"
#> [501] "MIR512-2"     "MIR513A1"     "MIR513A2"     "MIR513B"
#> [505] "MIR513C"      "MIR514A1"     "MIR514A2"     "MIR514A3"
#> [509] "MIR514B"      "MIR515-1"     "MIR515-2"     "MIR516A1"
#> [513] "MIR516A2"     "MIR516B1"     "MIR516B2"     "MIR517A"
#> [517] "MIR517B"      "MIR5186"      "MIR518A1"     "MIR518A2"
#> [521] "MIR518C"      "MIR518D"      "MIR5197"      "MIR519B"
#> [525] "MIR519C"      "MIR519D"      "MIR520B"      "MIR520C"
#> [529] "MIR520D"      "MIR520F"      "MIR520G"      "MIR520H"
#> [533] "MIR521-1"     "MIR525"       "MIR526A1"     "MIR526B"
#> [537] "MIR539"       "MIR548A1"     "MIR548A2"     "MIR548AB"
#> [541] "MIR548AD"     "MIR548AE1"    "MIR548AE2"    "MIR548AG1"
#> [545] "MIR548AI"     "MIR548AJ1"    "MIR548AJ2"    "MIR548AL"
#> [549] "MIR548AO"     "MIR548AP"     "MIR548AQ"     "MIR548AS"
#> [553] "MIR548AU"     "MIR548AV"     "MIR548AX"     "MIR548BA"
#> [557] "MIR548F1"     "MIR548F4"     "MIR548F5"     "MIR548G"
#> [561] "MIR548H2"     "MIR548H3"     "MIR548H4"     "MIR548H5"
#> [565] "MIR548I4"     "MIR548M"      "MIR548O2"     "MIR548P"
#> [569] "MIR548S"      "MIR548T"      "MIR548U"      "MIR548W"
#> [573] "MIR548X"      "MIR548X2"     "MIR548Y"      "MIR555"
#> [577] "MIR557"       "MIR5571"      "MIR5579"      "MIR5580"
#> [581] "MIR5582"      "MIR5583-1"    "MIR5583-2"    "MIR5584"
#> [585] "MIR5586"      "MIR5590"      "MIR5591"      "MIR5680"
#> [589] "MIR5681A"     "MIR5682"      "MIR5687"      "MIR5688"
#> [593] "MIR5689"      "MIR5691"      "MIR5692A1"    "MIR5692A2"
#> [597] "MIR5692C1"    "MIR5693"      "MIR5694"      "MIR5696"
#> [601] "MIR5697"      "MIR5700"      "MIR5701-1"    "MIR5701-2"
#> [605] "MIR5701-3"    "MIR5702"      "MIR5703"      "MIR5704"
#> [609] "MIR5705"      "MIR5708"      "MIR572"       "MIR582"
#> [613] "MIR583"       "MIR586"       "MIR588"       "MIR6068"
#> [617] "MIR6072"      "MIR6077"      "MIR6078"      "MIR6079"
#> [621] "MIR6082"      "MIR6083"      "MIR6086"      "MIR6089"
#> [625] "MIR609"       "MIR610"       "MIR6130"      "MIR6134"
#> [629] "MIR617"       "MIR620"       "MIR625"       "MIR629"
#> [633] "MIR632"       "MIR642A"      "MIR6499"      "MIR650"
#> [637] "MIR6504"      "MIR6508"      "MIR651"       "MIR6511A2"
#> [641] "MIR6511A3"    "MIR6511A4"    "MIR652"       "MIR663A"
#> [645] "MIR663B"      "MIR664A"      "MIR665"       "MIR668"
#> [649] "MIR6715A"     "MIR6718"      "MIR6724-1"    "MIR6724-3"
#> [653] "MIR6731"      "MIR6744"      "MIR6754"      "MIR676"
#> [657] "MIR6760"      "MIR6769B"     "MIR6770-1"    "MIR6771"
#> [661] "MIR6779"      "MIR6783"      "MIR6788"      "MIR6813"
#> [665] "MIR6815"      "MIR6827"      "MIR6828"      "MIR6836"
#> [669] "MIR6838"      "MIR6841"      "MIR6844"      "MIR6853"
#> [673] "MIR6854"      "MIR6856"      "MIR6859-1"    "MIR6859-3"
#> [677] "MIR6861"      "MIR6862-1"    "MIR6866"      "MIR6868"
#> [681] "MIR6874"      "MIR6876"      "MIR6881"      "MIR6882"
#> [685] "MIR7-2"       "MIR711"       "MIR7151"      "MIR7153"
#> [689] "MIR7154"      "MIR7156"      "MIR7157"      "MIR7158"
#> [693] "MIR7159"      "MIR7160"      "MIR7162"      "MIR718"
#> [697] "MIR759"       "MIR761"       "MIR764"       "MIR766"
#> [701] "MIR767"       "MIR7846"      "MIR7847"      "MIR7852"
#> [705] "MIR7853"      "MIR7854"      "MIR7856"      "MIR7973-1"
#> [709] "MIR7973-2"    "MIR7976"      "MIR7977"      "MIR7978"
#> [713] "MIR8052"      "MIR8053"      "MIR8056"      "MIR8058"
#> [717] "MIR8059"      "MIR8062"      "MIR8065"      "MIR8067"
#> [721] "MIR8068"      "MIR8069-2"    "MIR8071-1"    "MIR8071-2"
#> [725] "MIR8074"      "MIR8077"      "MIR8081"      "MIR8082"
#> [729] "MIR8084"      "MIR8087"      "MIR8088"      "MIR8485"
#> [733] "MIR873"       "MIR876"       "MIR885"       "MIR887"
#> [737] "MIR888"       "MIR890"       "MIR891A"      "MIR892A"
#> [741] "MIR892B"      "MIR892C"      "MIR9-1"       "MIR9-3"
#> [745] "MIR920"       "MIR921"       "MIR924"       "MIR936"
#> [749] "MIR941-2"     "MIR941-3"     "MIR941-4"     "MIR941-5"
#> [753] "MIR944"       "MIR99A"       "MIRLET7A2"    "MIRLET7F2"
#> [757] "OPN1MW2"      "OPN1MW3"      "OR10R2"       "OR13C3"
#> [761] "OR4C45"       "P2RY8"        "PLCXD1"       "PPP2R3B"
#> [765] "PRR20A"       "PRR20B"       "PRR20C"       "PRR20D"
#> [769] "PRR20E"       "PRY2"         "RBMY1A1"      "RBMY1B"
#> [773] "RBMY1D"       "RBMY1F"       "RMRP"         "RN7SK"
#> [777] "RNA5S1"       "RNA5S10"      "RNA5S11"      "RNA5S12"
#> [781] "RNA5S13"      "RNA5S14"      "RNA5S15"      "RNA5S16"
#> [785] "RNA5S17"      "RNA5S2"       "RNA5S3"       "RNA5S4"
#> [789] "RNA5S5"       "RNA5S6"       "RNA5S7"       "RNA5S8"
#> [793] "RNU1-3"       "RNU105C"      "RNU11"        "RNU2-1"
#> [797] "SCEL-AS1"     "SHOX"         "SLC25A6"      "SNORA11C"
#> [801] "SNORA30B"     "SNORA35"      "SNORA36A"     "SNORA50A"
#> [805] "SNORA51"      "SNORA58B"     "SNORA69"      "SNORA75B"
#> [809] "SNORA84"      "SNORD109B"    "SNORD112"     "SNORD113-1"
#> [813] "SNORD113-5"   "SNORD113-6"   "SNORD113-7"   "SNORD113-8"
#> [817] "SNORD114-11"  "SNORD114-15"  "SNORD114-16"  "SNORD114-17"
#> [821] "SNORD114-19"  "SNORD114-20"  "SNORD114-22"  "SNORD114-24"
#> [825] "SNORD114-28"  "SNORD114-29"  "SNORD114-3"   "SNORD114-30"
#> [829] "SNORD114-5"   "SNORD114-6"   "SNORD114-9"   "SNORD115-1"
#> [833] "SNORD115-10"  "SNORD115-14"  "SNORD115-16"  "SNORD115-17"
#> [837] "SNORD115-18"  "SNORD115-19"  "SNORD115-2"   "SNORD115-23"
#> [841] "SNORD115-24"  "SNORD115-25"  "SNORD115-28"  "SNORD115-29"
#> [845] "SNORD115-3"   "SNORD115-30"  "SNORD115-32"  "SNORD115-34"
#> [849] "SNORD115-36"  "SNORD115-37"  "SNORD115-38"  "SNORD115-39"
#> [853] "SNORD115-41"  "SNORD115-42"  "SNORD115-44"  "SNORD115-46"
#> [857] "SNORD115-48"  "SNORD115-5"   "SNORD115-9"   "SNORD116-11"
#> [861] "SNORD116-22"  "SNORD116-29"  "SNORD116-5"   "SNORD121B"
#> [865] "SNORD15A"     "SNORD18B"     "SNORD28B"     "SNORD31B"
#> [869] "SNORD32B"     "SNORD36C"     "SNORD38C"     "SNORD38D"
#> [873] "SNORD42B"     "SNORD43"      "SNORD56"      "SNORD56B"
#> [877] "SNORD60"      "SNORD61"      "SNORD65B"     "SNORD65C"
#> [881] "SNORD7"       "SNORD74B"     "SNORD77B"     "SNORD82"
#> [885] "SNORD98"      "SPATA31A5"    "SPDYE17"      "SPRY3"
#> [889] "TRIM49D1"     "TRIM49D2"     "TRPC7-AS2"    "TSPY3"
#> [893] "TSPY4"        "TTTY1"        "TTTY17A"      "TTTY17B"
#> [897] "TTTY17C"      "TTTY19"       "TTTY1B"       "TTTY20"
#> [901] "TTTY21"       "TTTY22"       "TTTY23"       "TTTY3"
#> [905] "TTTY3B"       "TTTY5"        "TTTY6B"       "TTTY7B"
#> [909] "UGT2A2"       "USP17L12"     "USP17L13"     "USP17L15"
#> [913] "USP17L18"     "USP17L19"     "USP17L20"     "USP17L21"
#> [917] "USP17L22"     "USP17L24"     "USP17L25"     "USP17L26"
#> [921] "USP17L27"     "USP17L28"     "USP17L29"     "USP17L30"
#> [925] "USP17L6P"     "USP17L9P"     "VAMP7"        "WASIR1"
#> [929] "XKRY"         "ZBED1"        "ZBTB9"

The cluster contains mostly RNA genes, with prefixes such as LINC, MIR, RNA, SNOR, and TTTY. To a degree, this makes sense to me: genes involved in similar functions could share similar expression contexts and thus be considered similar.

Gene neighbors

As a direct application of the gene embedding, we can find the “most similar” genes for each gene in terms of expression context. For example, by querying four genes of interest (EGFR, TP53, PTEN, and KRAS), we get:

term  n1       n2       n3      n4      n5
EGFR  CRCP     HERPUD2  CCT6A   STK17A  VKORC1L1
TP53  TXNDC17  VAMP2    TOP3A   DVL2    TMEM102
PTEN  CCSER2   ZNF33A   BMPR1A  CSTF2T  EIF4EBP2
KRAS  SINHCAF  GOLT1B   IPO8    ETNK1   FGFR1OP2
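
A neighbor lookup like the one above can be sketched in base R, assuming cosine similarity as the measure (the post does not state which similarity was used). The function name `nearest_genes()` and the tiny 2D embedding are hypothetical and only there to make the example checkable by hand.

```r
# Return the k genes most similar to `gene` by cosine similarity.
# `emb` is a numeric matrix with one named row per gene.
nearest_genes <- function(emb, gene, k = 5) {
  # Scale every row to unit length so dot products equal cosine similarities
  unit <- emb / sqrt(rowSums(emb^2))
  sims <- drop(unit %*% unit[gene, ])
  # Exclude the query gene itself, which always has similarity 1
  sims <- sims[names(sims) != gene]
  head(sort(sims, decreasing = TRUE), k)
}

# Toy embedding: B points almost the same way as A, C is orthogonal, D is opposite
emb <- rbind(
  A = c(1, 0),
  B = c(0.9, 0.1),
  C = c(0, 1),
  D = c(-1, 0)
)

nearest_genes(emb, "A", k = 2)
```

Here the two nearest neighbors of A are B, then C, while D ranks last because it points in the opposite direction.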

Gene analogies

Another application of gene embedding is exploring the linear algebraic structure of genes. For example, in word embeddings, we can ask:

Berlin - Germany = [ ? ] - France

where the word vectors would give us “Paris” as the answer. As a random example, here we try to find the following tissue-specific “gene analogies” with the gene embedding:

BRCA1 - BRCA2 = [ ? ] - TP53

The motivation behind this question: BRCA1 and BRCA2 genes play a role in cancer by working together in a common pathway of genome protection. However, the two corresponding proteins work at different stages in DNA damage response and DNA repair. Faulty BRCA genes are associated with an increased risk of developing breast, ovarian, and prostate cancer. On the other side of the equation, TP53 is another well-known tumor suppressor gene. By searching for a gene analogy to BRCA1 − BRCA2 for TP53, we might find key genes that work in conjunction with TP53.

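# Query vector for the analogy: BRCA1 - BRCA2 + TP53
gene_unknown <-
  word_vectors["BRCA1", , drop = FALSE] -
  word_vectors["BRCA2", , drop = FALSE] +
  word_vectors["TP53" , , drop = FALSE]
# Top five genes ranked by similarity to gene_unknown
#>  UBE2O  LLGL2   SOX9   HID1   GGA3
#> 0.9427 0.9420 0.9412 0.9396 0.9390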

Such capabilities could have potential applications in screening new leads for mechanisms like synthetic lethality.
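
To try the analogy arithmetic without a trained model, here is a small base R sketch on hand-made vectors that reuses the Berlin/Germany/France example. It assumes cosine similarity for the ranking and excludes the three query terms from the candidates. Both are common conventions rather than something the post specifies, and the function name `analogy()` and the vectors are hypothetical.

```r
# Solve "a - b = [ ? ] - c" by ranking rows of `emb` against a - b + c
analogy <- function(emb, a, b, c, k = 1) {
  query <- emb[a, ] - emb[b, ] + emb[c, ]
  # Cosine similarity between every row of emb and the query vector
  sims <- drop(emb %*% query) / (sqrt(rowSums(emb^2)) * sqrt(sum(query^2)))
  # Drop the query terms so they cannot be returned as the answer
  sims <- sims[!names(sims) %in% c(a, b, c)]
  head(sort(sims, decreasing = TRUE), k)
}

# Hand-made 3D vectors: dimension 1 = "German", 2 = "French", 3 = "capital"
emb <- rbind(
  Germany = c(1, 0, 0),
  Berlin  = c(1, 0, 1),
  France  = c(0, 1, 0),
  Paris   = c(0, 1, 1),
  Rome    = c(0.5, 0.5, -1)
)

analogy(emb, "Berlin", "Germany", "France")
```

In this toy setup, Berlin − Germany + France equals the Paris vector exactly, so Paris is returned with similarity 1. With a real gene embedding, the match is only approximate, as the similarity scores above show.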

Explore results interactively

Last but not least, I built a Shiny app: https://nanx.shinyapps.io/exp2vec/. You can use it to explore the gene neighbors and gene analogies.