抱歉,您的浏览器无法访问本站
本页面需要浏览器支持(启用)JavaScript
了解详情 >

串联重复序列注释这篇笔记里记录了如何用TRF软件进行TR预测,这款软件可以使用-m参数得到屏蔽后的序列,当时没写如何把Hard masking结果转换成Soft masking,这里就补个档。这两种屏蔽方式的结果文件是可以相互转换的。

1. Hardmasking转Softmasking

因为每个人的基因组文件可能各不相同,有的序列大小写混杂,有的60个核苷酸或者80个核苷酸换一行,为了方便起见,首先将基因组文件以及hard masking之后的序列文件转换成以下fasta格式:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# transform.py 
# 每条序列两行数据形式存储,所有序列用大写字母

def transform(input_file, output_file):
sequence = []
with open(input_file, 'r') as input_f:
for line in input_f:
if line.startswith('>'):
if sequence:
sequence = "".join(sequence).replace('\n', '').upper() # 合并分行的序列以及转换成大写字母
with open(output_file, 'a') as new:
new.write(sequence + '\n' + line)
sequence = []
else:
with open(output_file, 'a') as new: # 第一条序列的处理方式
new.write(line)
else:
sequence.append(line)
sequence = "".join(sequence).replace('\n', '') # 处理最后一条序列处理方式
with open(output_file, 'a') as new:
new.write(sequence)

transform('genome.fa', 'standard_genome.fa')

NCI官网有fasta标准格式要求,这个要求还是相当宽泛的,但是不方便我们做转换,所以还是有必要统一一下。

FASTA Format for Nucleotide Sequences (nih.gov)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
# hard2soft.py
# hardmasking文件为input_file,想要得到的softmasking文件为output_file,基因组文件为reference_file

def hard2soft(input_file, output_file, reference_file):
reference_sequences = {} # 序列名称和对应的序列储存在字典中
with open(reference_file, 'r') as ref:
sequence_name = ''
sequence = ''
for line in ref:
line = line.strip()
if line.startswith('>'):
if sequence_name:
reference_sequences[sequence_name] = sequence
sequence_name = line[1:]
sequence = ''
else:
sequence += line
if sequence_name: # 处理最后一条序列
reference_sequences[sequence_name] = sequence

with open(input_file, 'r') as input_f, open(output_file, 'w') as output_f:
sequence_name = ''
for line in input_f:
line = line.strip()
if line.startswith('>'):
sequence_name = line[1:]
output_f.write(line + '\n')
else:
sequence = line
if sequence_name in reference_sequences:
reference_sequence = reference_sequences[sequence_name]
replaced_sequence = ''
for i in range(len(sequence)):
if sequence[i] == 'N':
if reference_sequence[i] == 'N': # 如果基因组中原本就有N(填补gap产生的),则无需转换
replaced_sequence += 'N'
else:
replaced_sequence += reference_sequence[i].lower()
else:
replaced_sequence += sequence[i]
output_f.write(replaced_sequence + '\n')
else:
output_f.write(sequence + '\n')

hard2soft('hardmask.fa', 'softmask.fa', 'standard_genome.fa')

需要注意,如果基因组中原本就有填补gap产生的N,就不需要转换,直接用N表示。

2. Softmasking转Hardmasking

Softmasking转Hardmasking的情况就要简单多了,直接搜序列中的小写字母再用N替代即可:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# soft2hard.py
# softmasking文件为input_file,hardmasking文件为output_file

def soft2hard(input_file, output_file):
with open(input_file, 'r') as input_f, open(output_file, 'w') as output_f:
for line in input_f:
line = line.strip()
if line.startswith('>'):
output_f.write(line + '\n')
else:
sequence = line
replaced_sequence = ''
for i in range(len(sequence)):
if sequence[i].islower():
replaced_sequence += 'N'
else:
replaced_sequence += sequence[i]
output_f.write(replaced_sequence + '\n')

soft2hard('softmask.txt', 'hardmask.txt')

不管用哪种方式屏蔽重复序列,都是为了下游分析服务的,没必要死磕用什么软件才是最优解……能做出符合自己预期的结果就行O(∩_∩)O

顺便一提,UCSC Genome Browser Group提供的生物分析套件和工具的源码,其中也有用TRF获得softmasking结果以及后续分析等一系列的流程,感兴趣可以看github仓库上的perl源码:

kent/src/hg/utils/automation/doSimpleRepeat.pl at 307976d1f4c1ecbc73a55dea1f6348c19c1336b8 · ucscGenomeBrowser/kent (github.com)

欢迎小伙伴们留言评论~