These solutions demonstrate that compressing GO terms improves accuracy and may even boost efficiency (Wang et al., 2015; Yu et al., 2017e; Zhao et al., 2019a). This work was financially supported by Natural Science Foundation of China (61872300), Fundamental Research Funds for the Central Universities (XDJK2019B024 and XDJK2020B028), Natural Science Foundation of CQ CSTC (cstc2018-jcyjAX0228), and King Abdullah University of Science and Technology, under award number FCC/1/1976-19-01. The GO annotations are usually encoded by a gene-term association matrix (an n × m matrix Y for n genes with respect to m GO terms). Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. Many models use the hierarchical inter-relations between GO terms and show that the appropriate use of these inter-relations can improve gene function prediction (Tao et al., 2007; Done et al., 2010; Yu et al., 2015b).

Early gene function prediction solutions simply utilized this annotation information (Schwikowski et al., 2000; Hvidsten et al., 2001; Raychaudhuri et al., 2002; Schug et al., 2002; Troyanskaya et al., 2003; Karaoz et al., 2004), and converted the problem into a plain binary (or multi-class) classification task (Hua and Sun, 2001; Lanckriet et al., 2003; Leslie et al., 2004). Our preliminary studies (Yu et al., 2017a, 2018b; Fu et al., 2018; Wang et al., 2019) show that using GO appropriately can boost the prediction of lncRNA-disease associations, and GO has some overlaps with Disease Ontology (Schriml et al., 2011), which also adopts a DAG to hierarchically organize disease terms. AUC sweeps a range of thresholds to plot the receiver operating characteristic (ROC) curve of each GO term, and then averages the area under the curve across these terms.
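The term-wise AUC described above can be sketched as follows. This is only a minimal illustration, not the evaluation code of any cited method: the function names `roc_auc` and `macro_auc` are hypothetical, scores and labels are assumed to be plain lists (genes in rows, GO terms in columns), and terms without both positive and negative genes are skipped because their ROC curve is undefined.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve of one GO term, by sweeping thresholds.

    scores: predicted scores, one per gene; labels: 1/0 ground truth.
    Returns None when the term has no positive or no negative gene.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        return None
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    # Trapezoidal rule over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def macro_auc(score_matrix, label_matrix):
    """Average the per-term AUC over all terms where it is defined."""
    n_terms = len(label_matrix[0])
    aucs = []
    for j in range(n_terms):
        auc = roc_auc([row[j] for row in score_matrix],
                      [row[j] for row in label_matrix])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)
```

Sweeping only the distinct score values keeps tied scores on the same ROC point, so the trapezoidal area agrees with the usual rank-based definition of AUC.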

Others attempted to use the inter-relationships among GO terms, and introduced a variety of solutions based on multi-label learning.

Multiple evaluation metrics can be adopted to quantify the results of gene function prediction. Advances in bio-technology make it possible to perform high-throughput experiments, which yield diverse functional information about gene products, at decreasing costs. Measures of the similarity between genes can be extended from taxonomic similarity measures between GO terms. Yu et al. (2015b) utilized the hierarchical and flat inter-relations among terms to predict additional annotations of partially annotated genes. These GO terms are hierarchically connected with different types of directed edges. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. This DAG can be generated from the ontology file with simple scripts (e.g., in Matlab, R, or Python). Peng et al. (2016) developed a web tool called InteGO2 to select the most appropriate measure from a set of measures using a voting method, or to integrate measures via a meta-heuristic search method. Youngs et al. (2014) proposed two algorithms: selection of negatives through observed bias (SNOB) and negative examples from topic likelihood (NETL).
For example, different species have different distributions of GO annotations; zebrafish is heavily studied in terms of developmental biology and embryogenesis, while rat is the standard model for toxicology (Dessimoz and Škunca, 2017). The True Path Rule is one of the most important rules in GO (Blake, 2013), and should be respected in gene function prediction. Despite much progress, the intrinsic complexity of GO-based gene function prediction, the evolution of GO and the importance of reliable GO annotations for various domains mean that there are still interesting and challenging research directions, which deserve further efforts. Buza (2008) estimated the annotation quality with respect to terms in BPO via a rank of evidence codes. If an n × d matrix X stores the numeric features of these genes, then the function prediction task can be seen as a classification task that makes use of Y and the input patterns X to train a model, which can predict the association probabilities between these (or new) genes and GO terms.
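To make the True Path Rule concrete on the gene-term association matrix Y, the following minimal sketch propagates every positive annotation to all ancestor terms. The term names, the `parents` dictionary and both function names are hypothetical and only serve the illustration; Y is assumed to be a list of 0/1 rows whose columns follow the order of `terms`.

```python
def ancestors(term, parents):
    """Collect all ancestors of a term in a DAG given child -> parents lists."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def propagate_true_path(Y, terms, parents):
    """Return a copy of Y in which every annotated term also marks its ancestors."""
    index = {t: j for j, t in enumerate(terms)}
    anc = {t: ancestors(t, parents) for t in terms}
    closed = []
    for row in Y:
        new_row = list(row)
        for j, t in enumerate(terms):
            if row[j] == 1:
                for a in anc[t]:
                    new_row[index[a]] = 1
        closed.append(new_row)
    return closed


# Toy DAG (assumed for illustration): C has two parents, A and B.
terms = ["root", "A", "B", "C"]
parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"]}
Y = [[0, 0, 0, 1],   # gene annotated only with C
     [0, 1, 0, 0],   # gene annotated only with A
     [0, 0, 0, 0]]   # unannotated gene
Y_closed = propagate_true_path(Y, terms, parents)
```

In this toy case the first gene ends up annotated with C, A, B and root, while the unannotated gene stays empty; a classifier trained on the closed matrix sees labels that are consistent with the hierarchy.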

Categories of computational methods that combat one or two of these issues are on the right side of Figure 3. Given the incomplete functional knowledge of genes, we have to admit that existing gene function prediction solutions are still no substitute for wet-lab experiments. Northwestern Polytechnical University, China. Similarly, each negative annotation indicates the gene product does not perform the function described by this term.

Clark and Radivojac (2011) investigated the quality of NAS and IEA annotations, and found that IEA annotations were much more reliable than NAS ones in the MFO branch. Conversely, if this gene does not have the function described by t, then it should not be annotated with any of t's descendant terms either. Yu et al. (2017d) proposed ProCMF to explore the latent relationships between genes and GO terms by matrix factorization. where m(θ) is the number of genes that have at least one predicted score no smaller than the threshold θ. TP_i counts the number of true positive predictions, FP_i is the number of false positive predictions and FN_i counts the number of false negative predictions for gene i. S_min utilizes information theoretic analogs based on the GO hierarchy to evaluate the minimum semantic distance between the predictions and ground-truths across all possible thresholds (Jiang et al., 2014). Yu et al. (2010) introduced an R package called GOSemSim to efficiently compute the semantic similarity between individual GO terms, sets of GO terms, genes or gene clusters. In addition, the area under the precision-recall curve (AUPRC) is also widely used as an evaluation metric. *Correspondence: Maozu Guo; Guoxian Yu. Yu et al. (2015d) introduced a downward Random Walks model (dRW), which performed random walks on the GO hierarchy while taking the terms annotated to a gene as the initial nodes.
Evaluation metrics for multi-label learning are also used to quantify the performance of gene function prediction, such as MicroAvgF1, MacroAvgF1, RankingLoss, Coverage, and AvgPrecision. We can see that research interest in this topic is increasing.
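The threshold-based, gene-centric F-measure built from the per-gene true positives, false positives and false negatives described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than an official scoring script: it follows the common CAFA-style convention in which precision is averaged over genes with at least one prediction at the current threshold and recall is averaged over all genes; the function name `fmax` and the default threshold grid are assumptions.

```python
def fmax(scores, truth, thresholds=None):
    """Maximum F-measure over thresholds (gene-centric sketch).

    scores: list of per-gene score rows; truth: list of per-gene 0/1 rows.
    Precision is averaged over genes that have at least one score >= threshold,
    recall is averaged over all genes.
    """
    n = len(truth)
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # assumed grid 0.01..1.00
    best = 0.0
    for theta in thresholds:
        precisions = []
        recall_sum = 0.0
        for s_row, y_row in zip(scores, truth):
            predicted = [s >= theta for s in s_row]
            tp = sum(1 for p, y in zip(predicted, y_row) if p and y == 1)
            fp = sum(1 for p, y in zip(predicted, y_row) if p and y == 0)
            fn = sum(1 for p, y in zip(predicted, y_row) if not p and y == 1)
            if tp + fp > 0:            # gene has at least one prediction
                precisions.append(tp / (tp + fp))
            if tp + fn > 0:            # gene has at least one true term
                recall_sum += tp / (tp + fn)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / n
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

Because the precision denominator shrinks as the threshold rises, high thresholds are not penalized for genes that receive no prediction at all, which is why recall is still averaged over every gene.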

First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction.

Last but not least, besides proteins, other gene products like miRNAs and lncRNAs also play important roles in many life processes and have associations with different complex diseases (Lu et al., 2008; Chen et al., 2012; Deng et al., 2019; Zou et al., 2019). (2017e) adopted a hashing technique that preserved the graph structure from Liu et al. Two species with high homology have a large number of homologous genes, which should share similar (or even identical) GO annotations (Schnoes et al., 2013). Third, multi-omics data can reflect gene function from different aspects and they complement each other. clusDCA individually performed a random walk on the GO DAG and on the biological networks to capture information about the underlying structure, then obtained two updated adjacency matrices.

Based on the adopted techniques, existing solutions can be divided into two types: (i) matrix factorization-based and (ii) hash coding-based techniques. Among these paradigms, Gene Ontology (GO) (Ashburner et al., 2000) and MIPS Functional Catalog (FunCat) (Ruepp et al., 2004) are the most often used. Then, label propagation on the graph identifies the negative examples. The main challenges of gene function prediction are: (i) GO annotations that are incomplete, sparse, shallow, and imbalanced within and between species; (ii) massive structurally organized GO terms; and (iii) increasing relevant and irrelevant multi-type biological data. For example, GO has been used to find functional similarities in genes that are overexpressed or underexpressed in diseases (Chen et al., 2013), and our empirical results showed that the exclusion of GO annotations of genes significantly compromised the precision of an lncRNA-disease association prediction (Yu et al., 2017a; Fu et al., 2018). The inter-relations between GO terms can be measured from different viewpoints (Teng et al., 2013; Peng et al., 2018), and can be roughly grouped into two categories, flat and hierarchical.
Therefore, we give a comprehensive review of GO-based gene function prediction methods (categorized in Figure 3).

Next, they measured the semantic similarity between genes by l1-norm regularized sparse representation on the weighted gene-term association matrix, and took advantage of annotations of semantic neighbors to identify noisy annotations of a gene. Categories of solutions that use different inter-relations between GO terms. (2012) presumed that negative examples of a target term came from the genes which were not annotated with sibling terms of that term. They are regularly updated and archived. However, no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. Among them, BMA provides a good balance between the maximum and average measure, since the latter two measures are inherently influenced by the number of terms being combined (Pesquita et al., 2009). After that, SimNet applied these weights to fuse the networks into a composite network, and then performed random walks on the composite network to make a prediction. Tao et al. (2007) quantified the semantic similarity between genes by combining the hierarchical relationships between terms and known GO annotations of genes, and then used a k nearest neighbor (kNN) classifier with the semantic similarity to predict unknown annotations of genes. The results obtained in the history-to-recent evaluation are generally better than those obtained by the dataset partition evaluation. This DAG encodes domain knowledge of biology.
Some efforts have been made to combine GO and heterogeneous proteomics/genomics data (Cho et al., 2016; Yu et al., 2016a, 2017d), but they often suffer from a large number of GO terms. Therefore, it is interesting to leverage the shared GO structure and complementary annotations of genes for cross-species gene function prediction. The flat inter-relations simply consider the co-occurrence of two GO terms annotated to the same genes, without explicitly using the hierarchical structure between the terms. Our survey reviews the literature of ongoing studies of gene function prediction using GO, with the aim of expediting research into reliable gene function prediction. Given the complexity of gene function prediction, these metrics aim to evaluate the performance from different aspects (Radivojac et al., 2013; Jiang et al., 2016). To achieve low storage and fast retrieval, hashing has been widely used in big data applications (Wang et al., 2016; Liu et al., 2019). Although several semantic similarity-based solutions make specific use of the GO hierarchy, GO annotations (Tao et al., 2007; Done et al., 2010; Xu et al., 2013; Yu et al., 2015b,d) and additional data sources (Peng et al., 2018; Yu et al., 2020b) to obtain an improved performance, they are mostly based on the assumption of complete annotations. (2016) used hash tables to store essential information learned from the GO DAG and to efficiently compute the semantic similarity of genes. HPHash improved the prediction accuracy, and can be used as a plugin to boost the BLAST-based gene function prediction (Zhang et al., 1997; You et al., 2018).
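As an illustration of the flat inter-relations mentioned above, the sketch below scores two GO terms by how often they are annotated to the same genes, ignoring the hierarchy altogether. The Jaccard ratio used here is only one possible co-occurrence measure and is an assumption of this example, as is the function name `term_cooccurrence`; the methods cited in this review may weight co-occurrence differently.

```python
def term_cooccurrence(Y):
    """Flat term-term similarity from a 0/1 gene-term matrix Y (genes in rows).

    Entry [s][t] is |genes(s) & genes(t)| / |genes(s) | genes(t)|,
    and 0.0 when neither term annotates any gene.
    """
    n_terms = len(Y[0])
    gene_sets = [{i for i, row in enumerate(Y) if row[j] == 1}
                 for j in range(n_terms)]
    sim = [[0.0] * n_terms for _ in range(n_terms)]
    for s in range(n_terms):
        for t in range(n_terms):
            union = gene_sets[s] | gene_sets[t]
            if union:
                sim[s][t] = len(gene_sets[s] & gene_sets[t]) / len(union)
    return sim
```

Because such a measure depends only on the observed annotations, it inherits their incompleteness: two terms that are rarely annotated together so far receive a low score even if they are close in the DAG.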
These methods are summarized in Table 2. Yu et al. (2017c) introduced a more advanced and adaptive approach (NoGOA), which used evidence codes of annotations to differentially weight annotations and sparse representation to quantify the similarity between genes to identify noisy annotations. The negative examples selected by ALBias can boost the performance of gene function predictions. Therefore, we first review the basic workflow of gene function prediction, introduce the True Path Rule and evidence codes from GO, and then present the widely-used evaluation metrics for gene function prediction. To take advantage of information about features of genes and the available-but-scanty negative examples, Fu et al. where par(t) denotes the parent term of term t, gpar(t) is the grandparent term of t, and uncle(t) is the uncle (parent's sibling) term of t. p(t|par(t)) is the conditional probability that a gene is annotated with t given this gene is already annotated with par(t). Evidence suggests that using the inter-relations between GO terms can boost the performance of gene function prediction (Tao et al., 2007; Pandey et al., 2009; Done et al., 2010). Next, it defined two smoothness terms on these two low-rank matrices with respect to the gene-gene interactions and the structural relationships between terms, thus guiding the optimization of low-rank matrices.
Next, NMFGO used the low-rank matrices to explicitly calculate the semantic similarity between genes. To consider GO, Mitrofanova et al. Valentini (2011) and Cesa-Bianchi et al. Three issues in gene function prediction (left), and categorization of existing computational solutions based on GO (right). Finally, it reconstructed the association matrix using the two optimized low-rank matrices to predict gene functions. Graph-based measures organize terms annotated to a gene by a subgraph of the DAG and then use graph comparison techniques to quantify the similarity between genes, e.g., simGIC (Pesquita et al., 2008) and SORA (Teng et al., 2013). To address that, Lu et al. In section 3, we categorize the existing GO-based gene function prediction methods. (2007) proposed to apply evidence codes as an indicator of the reliability of annotations, and found that the annotations supported by experimental and author statement evidence are more reliable than others. There are three main differences between the two ways. Irrespective of the target task, these solutions generally focus on using the co-occurrence of GO terms annotated to the same genes.
Done et al. (2010) introduced a method called NtN, which applies singular value decomposition (SVD) (Golub and Reinsch, 1971) on the gene-term association matrix, whose entries are weighted by the term frequency-inverse document frequency and GO hierarchy; thus, the semantic relationships between genes and between terms were explored and the missing associations between genes and terms were completed. 11:400. doi: 10.3389/fgene.2020.00400. Gene function prediction methods mainly utilize the structure of GO and biological features (including nucleotide/amino acids sequences, gene expression, and interaction data, etc.) Each of these methods is detailed in the following sections. The evaluation protocol for gene function prediction is generally performed in one of two ways. The collected GO annotations are still quite incomplete, imbalanced, and rather shallow (Rhee et al., 2008; Thomas et al., 2012; Dessimoz and Škunca, 2017). The GO consortium (Ashburner et al., 2000) independently or collaboratively annotates genes with GO terms from model organisms (or species) of wide interest among biologists, such as Homo sapiens, Mus musculus, Arabidopsis thaliana, and so on. For example, Tian et al. Obviously, these solutions have some overlaps with the ones introduced in the previous subsections. Semantic similarity-based methods typically use the semantic similarity to select the neighborhood genes and predict the annotations of a gene based on annotations of those neighborhood genes.
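The neighborhood-based prediction step described above can be sketched as follows. The sketch assumes that a gene-gene semantic similarity matrix `sim` has already been computed by some measure (how it is computed is outside this example), and the function name `knn_predict` and the similarity-weighted voting scheme are illustrative assumptions rather than the procedure of any specific cited method.

```python
def knn_predict(sim, Y, gene, k):
    """Score GO terms for one gene from its k most similar genes.

    sim: square gene-gene similarity matrix (assumed precomputed);
    Y: 0/1 gene-term annotation matrix; gene: row index of the query gene.
    Each term score is the similarity-weighted fraction of neighbors
    annotated with that term.
    """
    candidates = [j for j in range(len(Y)) if j != gene]
    neighbors = sorted(candidates, key=lambda j: sim[gene][j], reverse=True)[:k]
    total = sum(sim[gene][j] for j in neighbors)
    n_terms = len(Y[0])
    if total == 0:
        return [0.0] * n_terms  # no informative neighbor
    return [sum(sim[gene][j] * Y[j][t] for j in neighbors) / total
            for t in range(n_terms)]


# Toy data (assumed for illustration): 4 genes, 3 GO terms.
sim = [[1.0, 0.8, 0.2, 0.1],
       [0.8, 1.0, 0.3, 0.2],
       [0.2, 0.3, 1.0, 0.5],
       [0.1, 0.2, 0.5, 1.0]]
Y = [[0, 0, 0],
     [1, 1, 0],
     [1, 0, 1],
     [0, 0, 1]]
term_scores = knn_predict(sim, Y, gene=0, k=2)
```

With k = 2 the query gene borrows annotations from genes 1 and 2 only, so a term shared by both neighbors receives the highest score; the resulting scores can then be thresholded or ranked like any other prediction.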
Second, from the prediction results, the history-to-recent way evaluates the fixed, recent annotations and, thus, it does not have a variance. Kulmanov et al. (2017) developed a deep learning-based method (DeepGO) to predict gene function from sequences. This problem is also found in multi-label learning (Pillai et al., 2013). Keywords: gene ontology, gene function prediction, functional genomics, directed acyclic graph, inter-relationships, semantic similarity. Citation: Zhao Y, Wang J, Chen J, Zhang X, Guo M and Yu G (2020) A Literature Review of Gene Function Prediction by Modeling Gene Ontology. The proportion of negative annotations is much smaller than that of positive ones, because a negative result may be due to inadequate experimental conditions and is often deemed less useful and publishable than a positive annotation. (2012) used the empirical co-occurrence of two GO terms annotated to the same genes to predict new annotations of genes, and Yu et al. ZOMF did not need to threshold the reconstructed association probability matrix, and the compressed zero-one labels had a more intuitive explanation than compressed labels. As a result, they are generally less accurate than more advanced solutions (Tao et al., 2007; Pandey et al., 2009; Done et al., 2010; Liu et al., 2016), which take into account the various inter-relations among GO terms.
To replenish the missing annotations of partially annotated genes, Yu et al. Exemplar solutions based on compressing GO terms. The dataset partition evaluation is influenced by the proportion of training and testing sets; a higher proportion of training sets generally gives better results. After that, clusDCA optimized a relational matrix between low-dimensional matrices to explore the latent relations, and to predict the associations between genes and GO terms. As the need for human knowledge (i.e., GO and its annotations) in artificial intelligence for biology increases, we believe the study of GO for gene function prediction and for other biomedical data mining tasks will grow quickly. Mazandu et al. (2015) introduced a portable Python application called A-DaGO-Fun, which assembled diverse semantic measures and biological applications using these measures.