抱歉,您的浏览器无法访问本站
本页面需要浏览器支持(启用)JavaScript
了解详情 >

经过前面的全基因组特征调查(survey)后,我们发现这是一个复杂基因组,杂合度较高,可以以二代+三代测序技术相结合的策略进行全基因组组装,还可以以Hi-C(高通量染色体捕获技术,High-through chromosome conformation capture)技术进行辅助组装。这里我用华大开发的二代测序组装工具SOAPdenovo,用二代测序数据对进行初步基因组组装。

1 安装SOAPdenovo 2.0

github上这个软件的版本是2.0,网址点击这里

软件下载安装过程非常顺利,如果有报错无法解决的话可以在Issue里向作者反馈。

1
2
3
4
# 集群如果无法登录github,下载源码包,通过xftp传到集群
tar -zxvf SOAPdenovo2
cd SOAPdenovo2-r242
make

编译之后可以看到有如下几个文件

SOAPdenovo-127mer和SOAPdenovo-63mer是用于组装的两个程序,分别代表支持的最大k-mer为127和63,用法上是完全相同的。

example.config是配置文件,组装之前我们要设置其中的参数内容;README.md是帮助文件,详细记录了各项参数的作用和设置方法。这个后面会讲到。

2 kmergenie计算最佳k值

现在组装基因组的算法主要有三种:De Bruijn graph,Overlap-Layout-Consensus和String Graph。SOAPdenovo软件组装基因组用的是De Bruijn graph算法,简单理解是通过将reads打断成k-mer后,利用k-mer之间的重复部分构建图,得到最优化路径从而拼接contig。要具体了解什么是De Bruijn graph,可以参考这一篇博文

不同k-mer值构建的De Bruijn graph不一样,会导致组装质量的差异,因此我们需要选择一个最佳的组装k-mer大小(尽管可以用默认值23直接组装,但是效果不一定是最好的)。

kmergenie软件和之前的Jellyfish类似,都可以用于统计k-mer数量,kmergenie最大优点是可以对预设的多个k-mer进行分析,找到最佳的k-mer。点击这里进入Kmergenie官网,下载最新版本的软件。

注意下这个软件安装需要python > 2.7,并且需要安装R和zlib

1
2
3
4
5
6
7
8
tar -zxvf kmergenie-1.7051.tar.gz
cd kmergenie-1.7051
make
python setup.py install --user # 安装到用户环境中,不报错说明可以使用

vim file.txt # 将两个fq文件路径写进去,一行一个

/public/home/wlxie/biosoft/kmergenie-1.7051/kmergenie file.txt -o ./kmergenie_res -l 15 -k 65 -s 5 -t 30 --diploid # 运行kmergenie
  • -o # 输出文件位置和名称
  • -l # 设定的最小k值
  • -k # 设定的最大k值
  • -s # 最小k值到最大k值,每次增加的间隔(根据需要设定间隔大小)
  • -t # 运行的线程(CPU核)数
  • --diploid # 二倍体模式,前面我们已经用jellyfish确认过这是个复杂基因组。默认是单倍体模式

其原理就是设置不同k值进行基因组大小预估,将组装的基因组最大的k值作为最佳k值。

最终会给出kmergenie_res为前缀的一系列报告,生成的.histo文件还可以用来上一篇笔记中的GenomeScope分析,这里我们只需要看总结的html文件。

确定组装的最佳k值为51

3 SOAPdenovo2组装contigs/scaffolds

复制一份example.config配置文件,重命名为run_config,修改部分参数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#maximal read length
#全局配置参数,只要高于这个参数的序列都会被截取到这个长度
max_rd_len=150
#文库配置以[LIB]开头
[LIB]
#average insert size
#文库插入片段的平均长度,在实际设置时,可以参考文库size分布图,取峰值(默认200)
avg_ins=200
#if sequence needs to be reversed
#是否需要将序列反向互补,对于pair-end数据,不需要反向互补,设置为0;对于mate-pair数据,需要反向互补,设置为1
reverse_seq=0
#in which part(s) the reads are used
#1表示只组装contig,2表示只组装scaffold,3表示同时组装contig和scaffold,4表示只补gap
asm_flags=3
#use only first 100 bps of each read
#序列长度阈值,作用和max_rd_len相同,大于该长度的序列会被切除到该长度
rd_len_cutoff=150
#in which order the reads are used while scaffolding
#设置不同文库数据的优先级顺序,取值范围为整数,rank值相同的多个文库,在组装scaffold时,会同时使用。
rank=1
#cutoff of pair number for a reliable connection (at least 3 for short insert size)
#contig或者scaffold之前的最小overlap个数,对于pair-end数据,默认值为3;对于mate-paird数据,默认值为5
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
#比对长度的最小阈值,对于pair-end数据,默认值为32;对于mate-pair数据,默认值为35
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
#过滤后的双端测序数据文件路径,q为fastq格式,f为fasta格式,b为bam格式
q1=/public/home/wlxie/luobuma/luobuma/sample_1_rawdata/Second-generation_sequencing/20211106-BaiYiHuiNeng01/01.rawFq/00.mergeRawFq/1/clean_data/1_r aw_1_val_1.fq
q2=/public/home/wlxie/luobuma/luobuma/sample_1_rawdata/Second-generation_sequencing/20211106-BaiYiHuiNeng01/01.rawFq/00.mergeRawFq/1/clean_data/1_r aw_2_val_2.fq

SOAPdenovo有6个子命令pregraph、sparse_pregraph、contig、map、scaff和all,前5个命令对应5个组装步骤,第一和第二是两种不同构图方式,all命令一次执行所有步骤,用all命令比较省事儿。

SOAPdenovo命令还有一些参数用于调整,参数参考

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-s  # 配置文件
-o # 输出文件的前缀
-K # 输入的K-mer值大小,默认值23
-p # 程序运行时设定的线程数,默认值8
-R # 利用read鉴别短的重复序列,默认值不进行此操作
-d # 去除频数不大于该值的k-mer,默认值为0
-D # 去除频数不大于该值的由k-mer连接的边,默认值为1,即该边上每个点的频数都小于等于1时才去除
-M # 连接contig时合并相似序列的等级,默认值为1,最大值3。
-F # 利用read对scaffold中的gap进行填补,默认不执行
-u # 构建scaffold前不屏蔽高覆盖度的contig,这里高频率覆盖度指平均contig覆盖深度的2倍。默认屏蔽
-G # 估计gap的大小和实际补gap的大小的差异,默认值为50bp。
-L # 用于构建scaffold的contig的最短长度,默认为:Kmer参数值 ×2
-k # map步骤中kmer的大小,默认是和K一样的kmer大小
-N # 基因组大小
-V # 输出可视化的组装信息

运行组装命令

1
/public/home/wlxie/biosoft/SOAPdenovo2-r242/SOAPdenovo-63mer all -s /public/home/wlxie/biosoft/SOAPdenovo2-r242/run_config -K 51 -R -V -o A_venetum -p 30

程序运行了3个小时,结束后生成了以下文件

4 组装结果解读

组装结果文件其实只有两个,分别以**.contig结尾和.scafseq结尾**。因为我是在集群上运行的,slurm-11168.out是集群的输出日志文件,记录了详细的组装过程和结果。

最终得到935861个contigs,总长度295302126 bp,平均长度315 bp,最长的长度38673 bp,contig N50是532 bp,contig N90是103 bp;scaffold个数77918,总长度192992858 bp,平均长度2476 bp,最长的长度108587 bp,scaffold N50是3385 bp,scaffold N90是130 bp。(可以做一个统计表)

从组装的contig覆盖深度和数量还可以做一个柱状图,理论上来说是和前面k-mer分布图呈现一样的趋势,也就是一个主峰和一个杂峰,两个图相互印证目标基因组是个复杂基因组。

其他的结果文件在github上有解释,我先直接复制过来,以后用到再翻译翻译……

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
1. Output files from the command "pregraph"
a. *.kmerFreq
Each row shows the number of Kmers with a frequency equals the row number. Note that those peaks of frequencies which are the integral multiple of 63 are due to the data structure.
b. *.edge
Each record gives the information of an edge in the pre-graph: length, Kmers on both ends, average kmer coverage, whether it's reverse-complementarily identical and the sequence.
c. *.markOnEdge & *.path
These two files are for using reads to solve small repeats.
e. *.preArc
Connections between edges which are established by the read paths.
f. *.vertex
Kmers at the ends of edges.
g. *.preGraphBasic
Some basic information about the pre-graph: number of vertex, K value, number of edges, maximum read length etc.

2. Output files from the command "contig"
a. *.contig
Contig information: corresponding edge index, length, kmer coverage, whether it's tip and the sequence. Either a contig or its reverse complementry counterpart is included. Each reverse complementary contig index is indicated in the *.ContigIndex file.
b. *.Arc
Arcs coming out of each edge and their corresponding coverage by reads
c. *.updated.edge
Some information for each edge in graph: length, Kmers at both ends, index difference between the reverse-complementary edge and this one.
d. *.ContigIndex
Each record gives information about each contig in the *.contig: it's edge index, length, the index difference between its reverse-complementary counterpart and itself.

3. Output files from the command "map"
a. *.peGrads
Information for each clone library: insert-size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually for scaffolding tuning.
b. *.readOnContig
Reads' locations on contigs. Here contigs are referred by their edge index. Howerver about half of them are not listed in the *.contig file for their reverse-complementary counterparts are included already.
c. *.readInGap
This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds if "-F" is set.

4. Output files from the command "scaff"
a. *.newContigIndex
Contigs are sorted according their length before scaffolding. Their new index are listed in this file. This is useful if one wants to corresponds contigs in *.contig with those in *.links.
b. *.links
Links between contigs which are established by read pairs. New index are used.
c. *.scaf_gap
Contigs in gaps found by contig graph outputted by the contiging procedure. Here new index are used.
d. *.scaf
Contigs for each scaffold: contig index (concordant to index in *.contig), approximate start position on scaffold, orientation, contig length, and its links to others contigs.
e. *.gapSeq
Gap sequences between contigs.
f. *.scafSeq
Sequences of each scaffolds.
g. *.contigPosInscaff
Contigs' positions in each scaffold.
h. *.bubbleInScaff
Contigs that form bubble structures in scaffolds. Every two contigs form a bubble and the contig with higher coverage will be kept in scaffold.
i. *.scafStatistics
Statistic information of final scaffold and contig.

欢迎小伙伴们留言评论~