I first constructed a manually curated list of over 1,200 features commonly found in engineered plasmids and built a local BLAST database from them. I then wrote a Python script that takes an input DNA sequence or FASTA file and BLASTs the entire sequence against the curated database. Many of the hits are redundant, given the nature of the manually curated list and of engineered parts in general, so I filtered the results to produce non-redundant annotations. During the filtering process, I also located fragments of these engineered parts, which I define as hits with at least 90% identity but less than 95% match length (these values were determined empirically to give accurate results).
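The fragment definition above can be sketched as a small classifier over BLAST hits. This is a minimal illustration rather than the script actually used: it assumes tabular BLAST output with the four columns `sseqid pident length slen`, it assumes "match length" means alignment length as a percentage of the full feature length, and it assumes that hits at or above 95% match length count as full features. The names `Hit`, `parse_hit`, and `classify` are hypothetical.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    feature: str   # name of the database feature that was hit
    pident: float  # percent identity of the alignment
    length: int    # alignment length
    slen: int      # full length of the database feature


MIN_IDENTITY = 90.0   # threshold from the text: at least 90% identity
FULL_COVERAGE = 95.0  # threshold from the text: below 95% match length is a fragment


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated line with columns: sseqid pident length slen (assumed layout)."""
    feature, pident, length, slen = line.rstrip("\n").split("\t")
    return Hit(feature, float(pident), int(length), int(slen))


def classify(hit: Hit) -> Optional[str]:
    """Return 'full', 'fragment', or None for hits below the identity threshold."""
    if hit.pident < MIN_IDENTITY:
        return None
    # Assumption: match length is measured relative to the whole feature.
    coverage = 100.0 * hit.length / hit.slen
    return "fragment" if coverage < FULL_COVERAGE else "full"
```

For example, `classify(Hit("AmpR", 92.0, 400, 861))` covers about 46% of the feature and would be labeled a fragment under these assumptions.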
I then applied this pipeline to a database of nearly 20,000 engineered plasmid sequences provided by Addgene, quantifying the amount of “junk” these plasmids contain. I analyzed only CDSs, because their start and stop positions are easy to identify. Other sequences, such as ncRNA-based origins and promoters, have boundaries that are much harder to define: they often contain UP elements or other features that make the start and stop positions ambiguous.
import glob
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import json
import subprocess
from Bio.SeqFeature import SeqFeature, FeatureLocation
# Note: Bio.Alphabet was removed in Biopython 1.78, so this import needs an older release.
from Bio.Alphabet import generic_dna
import time
from tempfile import NamedTemporaryFile
import plotly
plotly.offline.init_notebook_mode()
import plotly.offline as py
import plotly.graph_objs as go