Methods – Matt McGuffie – Bioinformatics 2019


     I first constructed a manually curated list of over 1,200 features commonly found in engineered plasmids and built a local BLAST database from them. I then wrote a Python script that takes an input DNA sequence or FASTA file and BLASTs the entire sequence against the curated database. Many of the hits are redundant, given the nature of the manually curated list and of engineered parts in general, so I filtered the results to produce a non-redundant set of annotations. During filtering, I also identified fragments of these engineered parts, which I define as hits with at least 90% identity but less than 95% match length (thresholds determined empirically to give accurate results).
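A minimal sketch of the fragment rule described above. It is illustrative only and makes several assumptions that are not stated in the text: that "match length" means alignment length divided by the full length of the database feature, that hits at or above the 95% cutoff count as full features, and that hits below 90% identity are discarded. The names `Hit` and `classify_hit` are hypothetical. The fields resemble what BLAST tabular output can report (`pident`, `length`, `slen`), but the script's actual output format is not specified here.

```python
from typing import NamedTuple

# Thresholds taken from the text above.
MIN_IDENTITY = 90.0          # percent identity
FULL_MATCH_FRACTION = 0.95   # fraction of the feature covered by the alignment


class Hit(NamedTuple):
    """One BLAST hit against the feature database (hypothetical container)."""
    feature: str       # name of the matched feature
    pident: float      # percent identity of the alignment
    align_len: int     # alignment length in bases
    feature_len: int   # full length of the feature in the database


def classify_hit(hit: Hit) -> str:
    """Label a hit as 'feature', 'fragment', or 'discard'.

    Assumption: match length is align_len / feature_len, and hits below
    the identity threshold are dropped.
    """
    if hit.pident < MIN_IDENTITY:
        return "discard"
    match_fraction = hit.align_len / hit.feature_len
    if match_fraction >= FULL_MATCH_FRACTION:
        return "feature"
    return "fragment"
```

For example, a hit covering all 861 bases of an 861-base feature at 99% identity would be labelled a feature, while a 400-base alignment to the same feature at 95% identity would be labelled a fragment.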

     Using a database of nearly 20,000 engineered plasmid sequences provided by Addgene, I then applied this pipeline to quantify the amount of “junk” these plasmids contain. I analyzed only CDSs, because their start and stop positions are easy to determine. Other sequences, such as ncRNA-based origins and promoters, have boundaries that are much harder to define: they often contain UP elements or other features that make the start and stop positions ambiguous.
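One way the "junk" content of a plasmid could be quantified is sketched below. The text does not give the exact metric, so this assumes junk is measured as the fraction of plasmid bases covered by CDS fragments, with overlapping fragments merged so that no base is counted twice. The function names are hypothetical, intervals are taken to be 0-based and half-open, and hits that wrap around the origin of a circular plasmid are not handled.

```python
def covered_bases(intervals):
    """Count bases covered by at least one (start, end) interval.

    Intervals are 0-based and half-open; overlapping or touching
    intervals are merged before counting.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval, if any, and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps or touches the current merged interval: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def junk_fraction(fragment_intervals, plasmid_length):
    """Fraction of the plasmid covered by CDS fragments (assumed metric)."""
    return covered_bases(fragment_intervals) / plasmid_length
```

With fragments at (0, 100), (50, 150), and (300, 400) on a 1,000-base plasmid, the first two merge into a single 150-base span, so 250 bases are covered in total.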


# Standard library
import glob
import json
import subprocess
import time
from tempfile import NamedTemporaryFile

# Biopython
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord
# Bio.Alphabet was removed in Biopython 1.78; this import requires an older release.
from Bio.Alphabet import generic_dna

# Plotting
import plotly
import plotly.offline as py
import plotly.graph_objs as go