Journal of Nanjing Forestry University (Natural Sciences Edition) ›› 2019, Vol. 43 ›› Issue (02): 93-99. doi: 10.3969/j.issn.1000-2006.201806029

• Research Article •

findDEG: a software package for differential gene expression analysis of transcriptome data

WU Jiyan, YAO Dan, WU Hainan, TONG Chunfa*

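  1. (College of Forestry, Nanjing Forestry University, Nanjing 210037, Jiangsu, China)
  • Publication date: 2019-03-30 Release date: 2019-03-30
  • Funding:
    Received: 2018-06-25 Revised: 2018-11-25
    Funding: National Natural Science Foundation of China (31270706, 31870654); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
    First author: WU Jiyan (wjy7815@163.com).
    *Corresponding author: TONG Chunfa (tongchf@njfu.edu.cn), professor, ORCID (0000-0001-9795-211x).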

findDEG: an integrated software package for differential gene expression analysis with RNA sequencing data

WU Jiyan,YAO Dan,WU Hainan,TONG Chunfa*   

  1. (College of Forestry, Nanjing Forestry University, Nanjing 210037, China)
  • Online: 2019-03-30 Published: 2019-03-30

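Abstract: 【Objective】With the continuing development of second-generation sequencing, transcriptome sequencing has been widely applied to differential gene expression analysis and gene annotation in many species. The many existing programs for differential gene expression analysis involve numerous complicated steps, and results differ considerably between methods, which makes it difficult for researchers to analyze real data. To simplify the process, an integrated software package was developed from existing software. 【Method】For three relatively mature differential gene expression workflows, Trinity, TopHat+Cufflinks and HISAT2+StringTie, several transcriptome analysis programs were combined into one integrated package. The design takes into account whether the study organism has a reference genome sequence, whether samples are replicated, whether sequencing is single-end or paired-end, which method is used to compute gene expression levels, and which significance test is used for differential expression. 【Result】A package named findDEG was developed in Perl for differential gene expression analysis of transcriptome sequencing data. The package has three modules: Trinity, TopHat+Cufflinks and HISAT2+StringTie. The Trinity module offers three methods for computing transcript expression levels and four significance tests for differentially expressed genes; the TopHat+Cufflinks module lets users choose the new or the old Cufflinks analysis scheme; the HISAT2+StringTie module has a single analysis scheme. The package can be downloaded and used freely at http://www.bioseqdata.com/findDEG/findDEG.htm. Transcriptome data from Populus simonii under normal and drought-stress conditions were analyzed with the new and old Cufflinks schemes and with one Trinity method combination. The two Cufflinks schemes identified 53 and 33 differentially expressed genes, respectively, 25 of which were shared; the Trinity method identified as many as 1,641 differentially expressed genes, of which 14 and 3 were shared with the two Cufflinks schemes, respectively. 【Conclusion】findDEG offers more than ten differential gene expression analysis schemes and runs the whole computation in one step, which avoids entering parameters and handling intermediate results between stages and makes it convenient to use.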

Abstract: 【Objective】With the rapid development of next-generation sequencing technology, transcriptome sequencing (RNA-seq) is widely used for differential gene expression analysis and gene annotation in many species. A variety of software packages for RNA-seq data analysis are available. In practice, however, the analysis involves several complicated steps and many parameters, and different methods can give markedly different results, which makes it difficult for most researchers to perform such an analysis accurately. The aim was to simplify this process by building an integrated package from existing software. 【Method】Based on three established workflows, Trinity, TopHat+Cufflinks and HISAT2+StringTie, an integrated package was built for RNA-seq data analysis that covers different methods for computing gene expression abundance and for testing the significance of differential gene expression. The design also considers whether a reference genome is available, whether the samples have replicates, and whether the reads are paired-end or single-end. 【Result】An integrated software package called findDEG was developed in Perl for differential gene expression analysis. The software consists of three modules: Trinity, TopHat+Cufflinks, and HISAT2+StringTie. The Trinity module provides three methods for calculating transcript expression abundance and four methods for testing differentially expressed genes. The TopHat+Cufflinks module allows users to choose either the new or the old version of the Cufflinks workflow, whereas the HISAT2+StringTie module offers a single analysis strategy. The software is freely available at http://www.bioseqdata.com/findDEG/findDEG.htm. Using three analytical strategies, namely the new and old Cufflinks workflows and one Trinity method combination, we analyzed RNA-seq data from Populus simonii under normal and drought stress conditions. The new and old versions of Cufflinks identified 53 and 33 differentially expressed genes, respectively, with 25 genes in common. Trinity detected as many as 1,641 differentially expressed genes, of which 14 and 3 were also found by the new and old versions of Cufflinks, respectively. 【Conclusion】The newly developed software findDEG provides more than ten strategies for differential gene expression analysis of RNA-seq data and runs the whole analysis in a single step, so users do not need to enter parameters or handle intermediate results manually between stages.
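The overlap figures reported in the Result section can be put on a common scale with a Jaccard index (shared genes divided by the union of the two lists). The Python sketch below is only an illustration and is not part of findDEG, which is written in Perl. The function names `jaccard_from_counts` and `overlap_summary` and the gene IDs in the small example are hypothetical; the counts are those stated in the abstract. It assumes that the gene lists being compared use a common gene identifier.

```python
"""Overlap between differentially expressed gene (DEG) lists from two workflows.

Illustrative sketch only: function names and gene IDs are hypothetical and
do not come from findDEG.
"""


def jaccard_from_counts(n_a: int, n_b: int, n_shared: int) -> float:
    """Jaccard index from two list sizes and the size of their intersection."""
    if n_shared > min(n_a, n_b):
        raise ValueError("shared count cannot exceed either list size")
    union = n_a + n_b - n_shared
    return n_shared / union if union else 0.0


def overlap_summary(degs_a, degs_b) -> dict:
    """Shared and workflow-specific genes for two DEG lists, plus Jaccard index."""
    a, b = set(degs_a), set(degs_b)
    shared = a & b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "jaccard": jaccard_from_counts(len(a), len(b), len(shared)),
    }


if __name__ == "__main__":
    # (size of list A, size of list B, shared genes) as reported in the abstract.
    reported = {
        "Cufflinks new vs. Cufflinks old": (53, 33, 25),
        "Trinity vs. Cufflinks new": (1641, 53, 14),
        "Trinity vs. Cufflinks old": (1641, 33, 3),
    }
    for label, (n_a, n_b, n_shared) in reported.items():
        index = jaccard_from_counts(n_a, n_b, n_shared)
        print(f"{label}: Jaccard = {index:.3f}")

    # Hypothetical gene IDs, used only to show the set-based variant.
    summary = overlap_summary(["g1", "g2", "g3"], ["g2", "g3", "g4"])
    print(summary)
```

With the reported counts, the two Cufflinks workflows share 25 of 61 distinct genes (an index of about 0.41), while each Trinity comparison gives an index below 0.01. This is one way to quantify how strongly the choice of workflow affects the resulting gene list.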

CLC number: