(from @philippbayer)
The classic approach is to use something like BLAST to compare a sequence against known sequences, but this has many drawbacks. For starters, plant databases lean very heavily towards Arabidopsis thaliana rather than more widely grown plants such as maize or wheat.
People get around this by searching for protein domains with Hidden Markov Models, but that doesn't go very far either: the domains have to be described first, and many are very generic. Can we classify protein/gene sequences using RNNs/CNNs? Here's an example where someone tried
There are also graph databases which link genes with similar genes, the literature, and protein domains (I'm a bit involved in KNetMiner); summarising that graph would also be useful.
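As a rough sketch of what feeding protein sequences to a CNN or RNN involves, the snippet below one-hot encodes an amino-acid string into a fixed-size matrix. The function name `one_hot_protein`, the `max_len` padding length and the choice to turn unknown residues into all-zero rows are my assumptions for illustration, not taken from any particular tool.

```python
# The 20 standard amino acids, in a fixed (arbitrary) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_protein(seq, max_len=8):
    """Encode a protein sequence as a max_len x 20 matrix of 0/1 values.

    Sequences longer than max_len are truncated, shorter ones are padded
    with all-zero rows. Residues outside the 20 standard amino acids
    (e.g. 'X') also become all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq.upper()[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


encoded = one_hot_protein("MKVXL")
print(len(encoded), len(encoded[0]))  # rows x columns
print([sum(row) for row in encoded])  # 1 for known residues, 0 otherwise
```

A matrix like this (one row per residue) is the kind of input a 1D convolution or a recurrent layer would slide over; the function label to predict would come from whatever annotation you trust.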
2. Functional region prediction - given a genome assembly, can we find genes, can we find functional elements?
The majority of a plant genome assembly is retrotransposons/repeats (up to 80%). That's not useful for us; we want to know where the genes are. Currently this is solved by training Hidden Markov Models on known genes and then comparing the predictions with alignments of expressed genes.
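To give a feel for the HMM idea, here is a toy two-state model (gene vs. non-coding) decoded with the Viterbi algorithm. Everything in it is an assumption for illustration: the states, the made-up probabilities (GC-rich "genes", AT-rich background) and the function name `viterbi`. Real gene predictors use far more states and trained parameters, and this sketch only accepts uppercase A/C/G/T.

```python
import math

STATES = ("N", "G")  # N = non-coding, G = gene (toy labels)

# All probabilities below are made up for illustration, not trained.
START = {"N": 0.9, "G": 0.1}
TRANS = {
    "N": {"N": 0.9, "G": 0.1},
    "G": {"N": 0.1, "G": 0.9},
}
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # AT-rich background
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # GC-rich "genes"
}


def viterbi(seq):
    """Return the most likely state path for seq as a string of N/G."""
    if not seq:
        return ""
    # scores[s] = best log-probability of any path ending in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Pick the previous state that gives the best score for s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("ATATATGCGCGCGCATATAT"))
```

The printed string has one label per base, so runs of `G` mark the stretch the toy model considers gene-like.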
The problems are that you won't see rarely expressed genes, it all takes a long time, and it's easy to miss genes or to misassemble them (splitting a longer gene into 'sub' genes, etc.). I am not aware of any classifier that takes a genome assembly and finds genes, but there are some which find smaller elements such as regulatory regions.
- DeepSEA is a good example of finding regulatory regions.
- Played around a bit with dna2vec here
- IMHO classifying regions into genes, regulatory elements, and pseudogenes all together would be amazing.
- You can classify genomic regions by whether they have undergone a selective sweep and, if so, which class of sweep.
- Wonderful example using a CNN
It takes pictures of genomic read alignments against a reference and calls genomic variants from there, but IMHO it didn't really improve on the accuracy we had using regular text-based comparisons.
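To make the "pictures of read alignments" idea a bit more concrete, here's a toy sketch that turns a reference and a few already-aligned reads into a small matrix a CNN could consume. The encoding (0 = not covered, 1 = match, 2 = mismatch), the `(start, sequence)` read format and the function name `pileup_matrix` are my assumptions for illustration. It handles no indels, and a real encoding would likely also carry things like base and mapping qualities.

```python
def pileup_matrix(reference, reads):
    """Build a reads x reference-length matrix of small integer codes.

    reads is a list of (start, sequence) tuples, where start is the
    0-based reference position of the read's first base (gapless
    alignment assumed).
    Codes: 0 = position not covered, 1 = match, 2 = mismatch.
    """
    matrix = []
    for start, seq in reads:
        row = [0] * len(reference)
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                row[pos] = 1 if base == reference[pos] else 2
        matrix.append(row)
    return matrix


reference = "ACGTACGT"
reads = [(0, "ACGT"), (2, "GTTC"), (4, "ACGT")]
for row in pileup_matrix(reference, reads):
    print(row)
```

Each row is one read, each column one reference position; a column where mismatches pile up across many reads is the visual pattern a variant caller would be looking for.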
These are the areas I work with; there is so much more out there now!