oriTfinder
Origin of transfer (oriT)
  The conjugative region of a self-transmissible mobile genetic element (MGE) typically consists of four modules: an origin of transfer site (oriT), a relaxase gene, a gene encoding the type IV coupling protein (T4CP) and a gene cluster for the bacterial type IV secretion system (T4SS). In general, single-stranded DNA (ssDNA) conjugation is initiated when the relaxase binds and cleaves the oriT region [Llosa et al (2002) Mol Microbiol]. After rolling-circle replication, the ssDNA is recruited by the T4CP and transferred from the donor cell into the recipient cell via the T4SS. In addition, non-conjugative MGEs carrying oriT sequences can be mobilized by conjugative elements (Ramsay et al., Current Opinion in Microbiology, 2017).

  The oriT region, typically tens to hundreds of base pairs long, contains a conserved nick region (nic) and pairs of inverted repeats (IRs) (De La Cruz et al., FEMS Microbiol Rev, 2009). The nic site is the position that the relaxase specifically recognizes and cleaves [de la Cruz, et al (2010) FEMS Microbiol Rev].
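  To make the notion of an inverted repeat concrete, the following is a minimal illustrative sketch in Python. It is not the method used by oriTfinder (which relies on Vmatch); it is a naive exact-match scan, and the function names, the arm length and the maximum spacer are assumptions chosen only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_inverted_repeats(seq: str, arm_len: int = 6, max_spacer: int = 10):
    """Naive scan for exact inverted repeats (hypothetical helper).

    Returns tuples (left_start, right_start, left_arm) where the right arm
    is the reverse complement of the left arm and the two arms are
    separated by at most max_spacer bases.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_len + 1):
        left_arm = seq[start:start + arm_len]
        target = reverse_complement(left_arm)
        for spacer in range(max_spacer + 1):
            right_start = start + arm_len + spacer
            right_end = right_start + arm_len
            if right_end > len(seq):
                break
            if seq[right_start:right_end] == target:
                hits.append((start, right_start, left_arm))
    return hits


# Toy sequence: GCTAGG ... CCTAGC are reverse complements of each other.
toy = "AAGCTAGGTTTCCTAGCAA"
print(find_inverted_repeats(toy))
```

  Real oriT IRs are often imperfect, so a production search would need to tolerate mismatches, which this sketch does not attempt.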

Figure 1. Four modules of the typical self-transmissible MGE: origin of transfer site (oriT), relaxase gene, gene encoding type IV coupling protein (T4CP) and gene cluster for bacterial type IV secretion system (T4SS).

oriTfinder: a web-based tool for the identification of origin of transfers in DNA sequences of bacterial mobile genetic elements
  oriTfinder was designed to facilitate rapid detection of oriTs in bacterial MGE sequences, especially in antimicrobial resistance plasmids and virulence plasmids. The backend database, oriTDB, was first built from sets of known oriTs of MGEs, including 996 oriTs in 985 natural plasmids and 74 oriTs in 74 ICEs. oriTfinder was therefore developed to capture three typical features of the oriTDB-archived oriT regions: (i) flanked by relaxase genes; (ii) containing a conserved nic site; (iii) containing IRs. By integrating a profile HMM-based relaxase gene search module with a BLAST-based oriT sequence search module, oriTfinder rapidly predicts and visualizes oriT regions. It also detects the conserved nick region with MEME/MAST and putative IRs with Vmatch. In addition, oriTfinder can predict the T4CP and T4SS genes encoded within the given MGE.

  To evaluate the algorithm, 43 experimentally supported transferable plasmids were used as the positive dataset; the oriT regions of these plasmids had not yet been identified or collected in oriTDB. Fifty non-transferable plasmids served as the negative dataset. On these benchmark datasets, oriTfinder achieved a sensitivity of 88.4%, a specificity of 100.0% and a precision of 100.0%.
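  As a quick check of how these three figures relate to the benchmark sizes, here is a small Python sketch using the standard definitions of sensitivity, specificity and precision. The dataset sizes (43 and 50) come from the text; the count of 38 true positives is not stated in the text and is an assumption inferred from 88.4% of 43.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, precision) from confusion counts."""
    sensitivity = tp / (tp + fn)   # fraction of positives that were detected
    specificity = tn / (tn + fp)   # fraction of negatives correctly rejected
    precision = tp / (tp + fp)     # fraction of predictions that were correct
    return sensitivity, specificity, precision


# 43 transferable and 50 non-transferable plasmids (from the text).
# tp=38 is an inferred assumption; fp=0 follows from 100% specificity.
sens, spec, prec = classification_metrics(tp=38, fn=5, tn=50, fp=0)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} precision={prec:.1%}")
```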


Figure 2. The prediction strategy used by oriTfinder to identify oriT regions.

Detection of putative oriT regions by DNA sequence similarity using the BLASTn-based H-value.
  To examine the degree of nucleotide-level sequence similarity between the oriTDB-collected oriT sequences and the submitted MGE sequence, oriTfinder employed an NCBI BLASTn-based H-value with a slight modification. For each query, the H-value was calculated as follows:
H = i × (lm / ls)
where i was the BLASTn identity of the region with the highest bit score, expressed as a fraction between 0 and 1, lm was the length of the highest-scoring matching sequence (including gaps) and ls was the length of the oriT region with the highest bit score. If there were no matching sequences with a BLASTn E-value < 0.01, the H-value assigned to that query sequence was defined as zero. The H-value therefore lies in the interval H ∈ [0,1]. In this study, a strict H-value cut-off of ≥ 0.49 was used to identify significant sequence similarity; this corresponds, for example, to 70% identity combined with a matching-length ratio of 70% (0.70 × 0.70 = 0.49).
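  The calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the H-value is the product of the identity fraction and the ratio lm / ls, which is consistent with the 70% × 70% = 0.49 example in the text. The function and parameter names are hypothetical and are not part of oriTfinder.

```python
def h_value(identity: float, match_len: int, orit_len: int,
            evalue: float, evalue_cutoff: float = 0.01) -> float:
    """Return H = i * (lm / ls), or 0.0 if the best hit fails the E-value filter.

    identity  : BLASTn identity of the best hit as a fraction in [0, 1]
    match_len : alignment length of the best hit, gaps included (lm)
    orit_len  : length of the matched oriTDB oriT region (ls)
    """
    if evalue >= evalue_cutoff:
        return 0.0
    return identity * (match_len / orit_len)


def is_significant(h: float, cutoff: float = 0.49, tol: float = 1e-9) -> bool:
    """Apply the cut-off with a small tolerance for floating-point rounding."""
    return h >= cutoff - tol


# Boundary case from the text: 70% identity over 70% of the oriT length.
h = h_value(identity=0.70, match_len=70, orit_len=100, evalue=1e-20)
print(round(h, 2), is_significant(h))

# A hit that fails the E-value filter scores zero regardless of identity.
print(h_value(identity=0.95, match_len=90, orit_len=100, evalue=0.5))
```

  The tolerance is a design choice of this sketch: in binary floating point, 0.7 * 0.7 evaluates to slightly less than 0.49, so a bare comparison would reject the boundary case that the text describes as passing.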

Detection of putative virulence factors (VFs) and acquired antibiotic resistance determinants (ARs) by protein sequence similarity using the BLASTp-based Ha-value.
  To examine the degree of amino acid-level sequence similarity between each query protein and the oriTfinder-collected VFs and ARs, the NCBI BLASTp-derived Ha-value was employed. For each query, the Ha-value was calculated as follows:
Ha = i × (lm / lq)
where i was the BLASTp identity of the region with the highest bit score, expressed as a fraction between 0 and 1, lm was the length of the highest-scoring matching sequence (including gaps) and lq was the query length. If there were no matching sequences with a BLASTp E-value < 0.01, the Ha-value assigned to that query sequence was defined as zero. The Ha-value therefore lies in the interval Ha ∈ [0,1]. Here, a strict Ha-value cut-off of ≥ 0.64 was used to identify significant sequence similarity; this corresponds, for example, to 80% identity combined with a matching-length ratio of 80% (0.80 × 0.80 = 0.64).
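  When many proteins are screened at once, the Ha-value has to be taken from the best-scoring hit of each query. The Python sketch below illustrates one way to do that, assuming Ha is the identity fraction multiplied by lm / lq, as implied by the 80% × 80% = 0.64 example. The Hit record and the ha_values function are hypothetical names, and the hit list is made-up toy data. BLAST reports identity as a percentage, so it is assumed here to have been divided by 100 beforehand.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One BLASTp hit (hypothetical record; fields filled from parsed output)."""
    query_id: str
    identity: float   # fraction in [0, 1]
    align_len: int    # lm, alignment length including gaps
    query_len: int    # lq, full length of the query protein
    evalue: float
    bitscore: float


def ha_values(hits: Iterable[Hit], evalue_cutoff: float = 0.01) -> Dict[str, float]:
    """Return the Ha-value per query, using its highest bit-score hit.

    Queries whose hits all fail the E-value filter receive 0.0.
    """
    best: Dict[str, Hit] = {}
    scores: Dict[str, float] = {}
    for hit in hits:
        scores.setdefault(hit.query_id, 0.0)
        if hit.evalue >= evalue_cutoff:
            continue  # ignore hits that fail the E-value filter
        current = best.get(hit.query_id)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query_id] = hit
    for query_id, hit in best.items():
        scores[query_id] = hit.identity * hit.align_len / hit.query_len
    return scores


# Toy data for illustration only.
toy_hits = [
    Hit("orf1", 0.80, 240, 300, 1e-50, 410.0),  # best hit for orf1
    Hit("orf1", 0.95, 100, 300, 1e-20, 150.0),  # lower bit score, not used
    Hit("orf2", 0.90, 50, 200, 0.5, 20.0),      # fails the E-value filter
]
scores = ha_values(toy_hits)
significant = [q for q, s in scores.items() if s >= 0.64 - 1e-9]
print(scores, significant)
```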