ChimeraBuster  

Introduction

ChimeraBuster is an application developed specifically for finding putative chimeric sequences in clone libraries with high coverage. Chimeras are formed during PCR amplification and are sequences composed of two or more fragments derived from other sequences (the parent sequences). Thus, for a chimera to form, both parent sequences must be present during the amplification process. Existing chimera-detection programs query external sequence databases to search for parent sequences. However, for a large number of sequences it is difficult to conclude with high probability that they originated from distinct parental sequences, because no sufficiently close relatives are present in databases such as GenBank or RDP.

Rationale

Briefly, the main assumption of ChimeraBuster is that a well-sampled clone library that contains a chimera should also contain its parental sequences. If all parents are represented in the clone library, then putative chimeras can be identified.

The Algorithm Description

ChimeraBuster uses the two most variable 20-bp regions at each end of the molecule as in silico probes with adjustable specificity cutoffs. The program flags all sequences in which each probe matches two or more distinct sequences with >1% sequence difference. Thus, three sequences are identified, of which two are parental and one is the potential chimera. Of these three, the sequence with the lowest incidence in the clone library is identified as the most likely chimera, because chimeras form at later stages of amplification, when the parental sequences are already abundant. ChimeraBuster uses only the sequences provided by the user; it does not query external databases.
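The probe matching described above can be illustrated with a minimal Python sketch. This is not ChimeraBuster's actual code: the function names, the probe coordinates, the `max_probe_mismatch` cutoff and the assumption that all sequences are already aligned to the same length are hypothetical choices made for illustration only.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def percent_difference(a, b):
    """Percentage of differing positions between two aligned sequences."""
    return 100.0 * mismatches(a, b) / len(a)


def find_candidate_trios(seqs, left, right,
                         max_probe_mismatch=0, min_parent_diff=1.0):
    """Return (chimera, left_parent, right_parent) name triples.

    seqs        -- dict mapping name -> aligned sequence (assumed equal length)
    left, right -- (start, end) coordinates of the two probe regions
    A sequence is flagged when its left probe matches one sequence and its
    right probe matches another, and those two differ by > min_parent_diff %.
    """
    trios = []
    for name, seq in seqs.items():
        lprobe = seq[left[0]:left[1]]
        rprobe = seq[right[0]:right[1]]
        others = [n for n in seqs if n != name]
        lhits = [n for n in others
                 if mismatches(seqs[n][left[0]:left[1]], lprobe) <= max_probe_mismatch]
        rhits = [n for n in others
                 if mismatches(seqs[n][right[0]:right[1]], rprobe) <= max_probe_mismatch]
        for lp in lhits:
            for rp in rhits:
                if lp != rp and percent_difference(seqs[lp], seqs[rp]) > min_parent_diff:
                    trios.append((name, lp, rp))
    return trios


def least_abundant(trio, counts):
    """Pick the member of a trio with the lowest incidence in the library."""
    return min(trio, key=lambda n: counts[n])
```

In this sketch the probes are passed as explicit coordinates (20-bp regions in the text above, shorter ones in a toy example), and `least_abundant` mirrors the rule that the rarest of the three sequences is the most likely chimera.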

The algorithm

Take all sequences in the dataset and group them into clusters (defined below).

Generate a consensus sequence (defined below) for each cluster.

Take two highly conserved regions in different parts of the sequence.

For each conserved region, compare the consensus sequences; if they match toward the ends of the sequence, report the clusters as candidates for further analysis.

Dissolve the clusters that were reported as candidates for further analysis and compare their individual sequences using the same methodology. Report to the scientist any sequences that are very similar, i.e. for which there is reasonable suspicion that they are chimeras.

Example of a report
difference 0 parentDist=16
chimera r164c2(size=2 in r164c2, 479 bases, 6BDA6613 checksum.)
rparent srb6XA04(size=1 in srb6XA04, 479 bases, C895546 checksum.)
lparent r56c3(size=3 in r56c3, 479 bases, 20093F59 checksum.)
*****************************++++++++++++++.................
............................................................
............................................................
............................................................
...........

Here, difference is the difference between the chimera (r164c2) and the join of the left parent (r56c3) and the right parent (srb6XA04).

The graphical representation (*, +, .) informs the user about possible positions of the truncation point between the parents. The first conserved region lies before the first character, and the second conserved region lies after the last character. A star (*) marks a position where the left parent can be truncated and the right parent appended without changing the total difference from the chimera. A plus (+) marks a position where the difference would increase by one, and a dot (.) marks all other nucleotide positions.
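A track of this kind can be reproduced with the following Python sketch. It assumes that the three sequences are aligned to the same length and that the difference is a simple count of mismatching positions. The function name `breakpoint_track` and its `start`/`end` parameters (the cut positions between the two conserved regions) are hypothetical and not part of ChimeraBuster.

```python
def breakpoint_track(chimera, lparent, rparent, start, end):
    """Return (best_difference, track) for cut positions start..end inclusive.

    For each cut position, the left parent is truncated there and the right
    parent is appended; the joined sequence is then compared to the chimera.
    '*' marks cuts that give the minimal difference, '+' cuts that give the
    minimal difference plus one, and '.' all other cuts.
    """
    def diff_at(cut):
        joined = lparent[:cut] + rparent[cut:]
        return sum(x != y for x, y in zip(joined, chimera))

    diffs = [diff_at(cut) for cut in range(start, end + 1)]
    best = min(diffs)
    symbols = {0: "*", 1: "+"}
    track = "".join(symbols.get(d - best, ".") for d in diffs)
    return best, track


if __name__ == "__main__":
    left = "AAAAAAAA"
    right = "AAAACGTT"
    chim = "AAAAAGTT"  # left[:5] + right[5:]
    print(breakpoint_track(chim, left, right, 0, 8))
```

Under these assumptions, the first value returned corresponds to the "difference" field of the report, and the track string corresponds to the rows of *, + and . characters.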

In this example, the chimera may have been formed by truncating the left parent close to the first conserved region.

NOTE: Only expert testing can give more conclusive results. This application should be used only as a method of reducing the chimera search space.

Definitions

Clusters are sets of sequences. For each sequence within a cluster, there is at least one other sequence in the same cluster whose difference from it is less than the MAX_DIFF value.
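One way to build clusters that satisfy this definition is single-linkage grouping, sketched below in Python. The sketch assumes aligned sequences of equal length and a mismatch count as the distance. The helper names are hypothetical, and sequences with no close neighbour are kept as clusters of size one, which is an assumption rather than something the definition states.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_sequences(seqs, max_diff):
    """Group sequence names so that linked members differ by < max_diff.

    seqs -- dict mapping name -> aligned sequence
    Returns a sorted list of sorted name lists.
    """
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(seqs[a], seqs[b]) < max_diff:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())
```

Note that two members of the same cluster may differ by MAX_DIFF or more, as long as a chain of closer sequences connects them.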

The cluster consensus sequence is a join of all sequences in the cluster. Where a position varies, the consensus takes a value that occurs in at least one of the constituent sequences.
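A consensus with this property can be sketched in Python as follows. Taking the most frequent base in each column is an assumption made for illustration; the definition above only requires that the chosen value occurs in at least one constituent. The function name `consensus` is hypothetical, and the sequences are assumed to be aligned to the same length.

```python
from collections import Counter


def consensus(sequences):
    """Return a column-wise majority consensus of equal-length sequences.

    Every character of the result occurs in at least one input sequence
    at that position.
    """
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in the column
        for column in zip(*sequences)
    )
```

For example, `consensus(["ACGT", "ACGA", "ACTA"])` yields a sequence in which the two variable positions take the base seen in two of the three inputs.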
