基因组注释（3）——ncRNA注释

非编码RNA(non-coding RNA，ncRNA)指不编码蛋白质的RNA，包括rRNA、tRNA、snRNA、snoRNA 和 microRNA 等多种已知功能的 RNA，和未知功能的RNA。tRNA预测可以使用经典的tRNAscan-SE，其他类型的RNA都可以用Infernal+Rfam数据库方式预测。

1. tRNAscan-SE

tRNAscan-SE的安装需要依赖Infernal软件，因此可以用conda直接安装tRNAscan-SE顺带解决依赖问题：

1	conda install -c bioconda trnascan-se

写一个脚本运行tRNAscan-SE：

#!/bin/bash
#SBATCH -n 50
#SBATCH -t 7200

tRNAscan-SE --thread 50 -E -I -m tRNA_luobuma.stats -o tRNA_luobuma.out -f tRNA_luobuma.ss /public/home/wlxie/NextPolish/luobuma_rundir/genome.nextpolish.fasta

参数解释：

-E 搜寻真核生物tRNA

-I 使用Infernal软件进行搜索

-m 保存结果统计文件

-o 输出tRNA预测结果

-f 保存tRNA的二级结构

也可以直接使用-j参数保存gff3格式，-b参数保存bed格式，详情可以见tRNAscan-SE -h

生成的二级结构结果文件如下：

str一行记录的二级结构信息，每个<>是互相配对的，代表在二级结构中这两个碱基连在一起。可以通过其他软件（如VARNA）绘制成图。

输出的out文件也推荐转成gff文件方便在基因组上可视化，因为我这里只是粗略统计一下tRNA数量，所以只看stat文件就可以了：

可以看到第一轮预测出526个tRNA，通过Infernal验证的有479个。

2. Infernal + Rfam

Infernal（INFERence of RNA ALignment）是Sanger实验室开发的ncRNA预测软件，他们建立了1600多个RNA家族，每个家族建立了一致性二级结构和协方差模型，也就是Rfam数据库。总体的注释思路是基因组与 Rfam数据库进行比对，Rfam是一个RNA分类数据库，其比对方法是调用软件Infernal中的程序cmscan，将提交的序列在Rfam.cm数据库中进行检索，从而得到其比对的结果。

cmscan（search sequence(s) against a covariance model database, 针对协方差模型数据库的序列搜索），主要参考官方手册Userguide.pdf (eddylab.org)中的Searching the Rfam CM database with a query sequence步骤。

前面已经安装了Infernal，这里需要再下载一个Rfam数据库。

# 下载Rfam数据库，注意两个文件版本必须一致
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin

# 解压建库
gunzip Rfam.cm.gz
cmpress Rfam.cm

官方手册这里使用默认参数运行cmscan，稍微注意下-Z，以下是官方对-Z参数的定义：

-Z Calculate E-values as if the search space size was megabases (Mb). Without the
use of this option, the search space size changes for each query sequence, it is
defined as the length of the current query sequence times 2 (because both strands
of the sequence will be searched) times the number of CMs in .

Z值代表搜索数据库的大小（database size），是和E-Values计算相关的，在默认情况下，每一个query sequence的-Z参数值是不同的，等于query sequence本身的碱基数*2*CM数据库中模型的数量，只有E-Values小于10的hits会被报道。

在作者的原文中，可以找到这么一句话：

To manually set the database size used in the E-value calculation to megabases when running cmsearch or cmscan on the command line, use the -Z option. It makes sense to do this if, for example, a large sequence file has been split up into many smaller files, and searches have been performed in parallel on a compute cluster, with the results combined. In that scenario, if is set as the total number of models used times the total number of nucleotides in all sequence files times two (for both strands), then the combined results should have appropriate E-values. That is, the expectation is that in the collection of all hits between all sequences and models there will be about 1 hit with an E-value of 1 or below by chance (not due to homology), about 10 with an E-value of 10 or below by chance, etc.

Non-coding RNA analysis using the Rfam database - PMC (nih.gov)

也就是说一个大的序列文件被分割成数个文件，并在一个集群上并行搜索，最终将结果文件整合的时候，设置Z值为使用的CM模型数量*所有序列文件的核苷酸总数*2，合并的结果会有一个适当的E-value值。

前面建库的时候可以看到CM数据库中模型的数量：

CM模型的数量不可以直接用cat Rfam.cm | grep "ACC" | wc -l这种方式查询，你会发现这种只搜索ACC或者NAME字段查找到的数量是实际数量的两倍（包括了HMM filter）。

Z值= 基因组核苷酸数*2*数据库中模型数量/1000000

1	esl-seqstat my-genome.fa # HMMER插件，统计基因组大小，计算Z值用

因此这里的Z值计算如下：
$$
Z = 22308888634108/1000000=1896982.90
$$
但是很神奇的是，在Rfam的帮助文档中给了一个对古菌Methanobrevibacter ruminantium注释ncRNA的例子，其中也用了Rfam数据库的，计算Z值时没有乘以CM模型的数量，我也有点疑惑，以下是链接地址。

Genome annotation — Rfam Help documentation

For the purposes of Infernal, the total database size is the number of nucleotides that will be searched, in units of megabases (Mb, millions of nucleotides). So, it is the total number of nucleotides in all sequences that make up the genome, multiplied by two (because both strands will be searched), and divided by 1,000,000 (to convert to millions of nucleotides).

就 Infernal 而言，数据库总大小是将要搜索的核苷酸数量，以兆碱基（Mb，百万核苷酸）为单位。因此，它是构成基因组的所有序列中的核苷酸总数乘以2（因为将搜索两条链），然后除以 1,000,000（转换为Mb）。

两种方法计算Z值相差4000多倍，我无法断定哪种是正确的，后续有更深的理解再更新（也许是版本问题？）。

这里我按照文章作者给的默认参数对基因组进行注释（也就是没有指定Z值，每条序列的Z值是变动的）。

运行脚本如下：

#!/bin/bash
#SBATCH -n 50
#SBATCH -t 7200

cmscan --cut_ga --rfam --nohmmonly --tblout luobuma.tblout --fmt 2 -o luobuma.out --clanin Rfam.clanin Rfam.cm /public/home/wlxie/NextPolish/luobuma_rundir/genome.nextpolish.fasta

参数解释：

–cut_ga 指定Rfam GA阈值，决定哪些hits可以报告。有多种标准，可以见Glossary — Rfam Help documentation

–fram 以快速模式运行

–nohmmonly 决定所有模型都是CM模型（非HMM模型）

–tblout 表格形式输出结果

–fmt 2 输出格式2，包括overlapping hit的注释

-o 标准输出文件

–clanin Rfam.clanin文件的位置，该文件记录哪些模型属于同一家族

最终获得luobuma.out的标准输出文件和整理成表格的luobuma.tblout文件，这里整理一下表格文件：

27列对应预测ncRNA类型和信息；前后都有#键注释的行，除此之外每一行是预测的ncRNA具体内容

需要注意olp这一列，在infernal 1.1.4版本这一列有以下四个值：

* indicates this hit does not overlap with any other reported hits 这条序列与其他已报道的序列之间无重叠区域（保留）

ˆ indicates that this hit does overlap with at least one other hit, but none of the hits that overlap with it have a lower E-value (occur above it in the hit list) 这条序列与至少一条已报道的序列之间有重叠区域，但是这条序列的E-value最低（保留）

$ indicates that this hit does overlap with at least one other hit that does have a lower E-value (occurs above it in the hit list) but none of those higher scoring hits have ˆ in this column 这条序列与至少一条已报道序列之间有重叠区域，且其他序列E-value更低但不是最低的（过滤，1.1.2版本的infernal中没有）

= indicates that this hit does overlap with at least one other hit that has a lower E-value (occurs above it in the hit list) and does itself have a ˆ in this column 这条序列与至少一条已报道序列之间有重叠区域，且其他序列的E-value值更低且是最低的（过滤）

我们只关注预测结果中准确度最高的ncRNA，记录种类和长度，因此可以写个python脚本处理并统计数据。

因为这里输出的列表没有具体给出分类，因此对数据处理前要去Rfam官网的Entry type search栏下找到每个accession号对应的ncRNA类型Rfam: Search Rfam：

我这里只关心上面红框中的ncRNA（根据情况自己选择），勾选之后点击submit：

拉到最底下看到官网提供未格式化的列表，点击show，将显示的列表复制粘贴到一个新建的accession.txt文档，接下来就是对tblout文件和accession.txt文件处理：

'''
Infernal预测ncRNA结果文件统计脚本
2023.3.22
'''
# 结果文件数据过滤，获取每条预测ncRNA的accession号和长度
loci_length = []
accession = []
with open('./luobuma.tblout', 'r') as input:
    for i in input.readlines():
        if i.find('#') != -1 or i.find('=') != -1 or i.find('$') != -1:
            continue
        else:
            lst = i.strip().split()
            if len(lst) < 1:
                continue
            length = abs(int(lst[10]) - int(lst[9]))
            loci_length.append(length)
            accession.append(lst[2])
len_sum = 0
for i in loci_length:
    len_sum += i

# 处理accession.txt,提取accession号和ncRNA类型的关系
accession_num = []
dicts = {}
with open('./accession.txt', 'r') as ac:
    for i in ac.readlines():
        m = i.strip().split('\t')		# 以制表符分割
        accession_num.append(m[0])
        nc_type = m[2].split(';')[1].strip()	# 获取第三列第二个分号处的ncRNA类型
        if m[0] not in dicts:
            dicts[m[0]] = nc_type

# 统计结果文件accession号对应的ncRNA类型数量
mi = s = sn = lnc = t = r = other = 0
mi_len = s_len = sn_len = lnc_len = t_len= r_len = other_len =0
for i in range(len(accession)):
    item = accession[i]
    if item in dicts:
        if dicts[item] == 'miRNA':
            mi += 1
            mi_len += int(loci_length[i])
        elif dicts[item] == 'sRNA':
            s += 1
            s_len += int(loci_length[i])
        elif dicts[item] == 'snRNA':
            sn += 1
            sn_len += int(loci_length[i])
        elif dicts[item] == 'lncRNA':
            lnc += 1
            lnc_len += int(loci_length[i])
        elif dicts[item] == 'tRNA':
            t += 1
            t_len += int(loci_length[i])
        else:
            r += 1
            r_len += int(loci_length[i])
    else:
        other += 1
        other_len += int(loci_length[i])

outputlst = [('miRNA', mi, mi_len), ('sRNA', s, s_len), ('snRNA', sn, sn_len), ('lncRNA', lnc, lnc_len), ('tRNA', t, t_len), ('rRNA', r, r_len), ('others', other, other_len), ('total', len(accession), len_sum)]
# 输出结果
with open('./res.xls', 'w') as output:
    output.write('Type\tCopy Number\tTotal length(bp)\n')
    for i in outputlst:
        type = str(i[0])
        number = str(i[1])
        length = str(i[2])
        output.write(type + '\t' + number + '\t' + length + '\n')