
1 BioJava Core API

2 Java for Bioinformatics? Cross platform means develop on one platform deploy on any. Widely accepted industry standard. Lots of support libraries for modern technologies (XML, WebServices, JDBC). Scales well from small to industrial strength enterprise sized programs.

3 Java for Bioinformatics?
Object Oriented. Rapid development due to:
- Very strict types
- Simple clear syntax
- Exception handling and recovery
- Cross platform
- Extensive class library
- Code reuse

4 What is BioJava? A collection of Java objects that represent and manipulate biological data Not a program, rather a programming library Open source (LGPL) open for all development, even commercial. Not ‘sticky’ or ‘viral’.

5 What is BioJava? Collection of objects to assist bioinformatics research Started at EBI/Sanger in 1998 by Matthew Pocock and Thomas Down 25+ developers have contributed (5 core)

6 What is BioJava? BioJava has acquired 1100+ classes, 130,000+ lines of code. Uses CVS version control, JUnit testing and ANT builds. It now has a fairly stable API. 76 packages!

7 Where is BioJava
- Home Page: www.biojava.org
- BioJava in Anger: http://www.biojava.org/docs/bj_in_anger/
- Mailing Lists: biojava-l@biojava.org, biojava-dev@biojava.org
- Nightly Builds: http://www.derkholm.net/autobuild/

8 Obtaining BioJava
Download
- http://www.biojava.org/download/
- Get binaries, source and docs
biojava-live (requires cvs)
- cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biojava login
- Password is ‘cvs’
- cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biojava checkout biojava-live
- cvs update -Pd

9 Compiling biojava-live
Requires the ANT build tool
- http://jakarta.apache.org/ant/
The ANT tool will use build.xml to
- Arrange source code
- Compile source
- Make jar file
- Make Java docs
- Build demos
- Build and Run tests
- Change to biojava-live; type ant
Unit testing requires JUnit
- http://junit.sourceforge.net/

10 Setting up BioJava
Put the following JAR files on your class path:
- biojava.jar
- bytecode-0.92.jar
- commons-cli.jar
- commons-collections-2.1.jar
- commons-dbcp-1.1.jar
- commons-pool-1.1.jar

11 Object-Oriented Patterns and BioJava Design

12 BioJava Design
Uses some reasonably “advanced” concepts
- Design by Interface
- Protected or Private constructors
- Factory classes and Methods
- Flyweight/Singleton objects

13 Interfaces Hide Implementation In BioJava there are several implementations of the Distribution interface. Any can be legally returned by a method that returns a Distribution (the returning method may even return different ones depending on the situation). Any can be legally used as an argument to a method that requires a Distribution. All are guaranteed to contain a minimal set of common methods.

14 Flyweight and Singleton Objects A Singleton is a class with only one instance and a single access point. A Singleton needs a private constructor and may be static (e.g. AlphabetManager). A Flyweight object uses sharing to support large numbers of fine-grained objects efficiently. For example, in BioJava there is only ever one instance of the DNA Symbol “A”. A sequence of A’s is really just a list of references to that one object.
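The Flyweight idea can be sketched in plain Java. This is only an illustration: `Base` is a hypothetical class, not a BioJava type, and the pool is limited to the four DNA bases.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the Flyweight idea; Base is hypothetical, not a BioJava class.
final class Base {
    // One shared instance per token, created once when the class loads.
    private static final Map<Character, Base> POOL = new HashMap<>();
    static {
        for (char c : "acgt".toCharArray()) {
            POOL.put(c, new Base(c));
        }
    }

    private final char token;

    // Private constructor: callers cannot create extra instances.
    private Base(char token) {
        this.token = token;
    }

    // Static access point that always hands back the shared instance.
    static Base of(char token) {
        Base b = POOL.get(token);
        if (b == null) {
            throw new IllegalArgumentException("Not a DNA base: " + token);
        }
        return b;
    }

    char token() {
        return token;
    }

    // A "sequence" is an array of references to the four shared objects.
    static Base[] sequence(String s) {
        Base[] out = new Base[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = of(s.charAt(i));
        }
        return out;
    }

    public static void main(String[] args) {
        Base[] seq = sequence("aacgt");
        System.out.println(seq[0] == seq[1]);        // both a's are the same object
        System.out.println(Base.of('a') == seq[0]);  // and the same as a fresh lookup
    }
}
```

However long the sequence, only four `Base` objects exist; the array stores references to them.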

15 Factory and Static methods
Sometimes it is useful to prevent a user from directly constructing an object via a constructor.
- If the construction is complex.
- If the choice of the optimal implementation is best left to the API developer.
- If important resources are best protected from end users e.g. Singletons/Flyweights.
Rather than instantiating the object via its constructor, a static method or Factory object is used.
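A minimal sketch of a factory method hiding the choice of implementation behind an interface. All names here (`Weights`, `WeightsFactory`, `Uniform`, `Table`) are hypothetical and are not part of the BioJava Distribution API.

```java
// Hypothetical names throughout; this is not the BioJava Distribution API.
interface Weights {
    double weightOf(char symbol);
}

final class WeightsFactory {
    private WeightsFactory() { }   // static methods only, never instantiated

    // The caller asks for "some Weights"; which class comes back is the factory's choice.
    static Weights create(String alphabet, double[] weights) {
        if (weights == null) {
            return new Uniform(alphabet);
        }
        return new Table(alphabet, weights);
    }

    // Same weight for every symbol of the alphabet; stores no table.
    private static final class Uniform implements Weights {
        private final String alphabet;

        Uniform(String alphabet) {
            this.alphabet = alphabet;
        }

        public double weightOf(char symbol) {
            return alphabet.indexOf(symbol) >= 0 ? 1.0 / alphabet.length() : 0.0;
        }
    }

    // Explicit per-symbol weights, stored in alphabet order.
    private static final class Table implements Weights {
        private final String alphabet;
        private final double[] weights;

        Table(String alphabet, double[] weights) {
            this.alphabet = alphabet;
            this.weights = weights.clone();
        }

        public double weightOf(char symbol) {
            int i = alphabet.indexOf(symbol);
            return i >= 0 ? weights[i] : 0.0;
        }
    }

    public static void main(String[] args) {
        Weights u = create("acgt", null);
        Weights t = create("acgt", new double[] {0.3, 0.2, 0.2, 0.3});
        System.out.println(u.weightOf('a') + " " + t.weightOf('a'));
    }
}
```

Because both implementation classes are private, client code can only ever hold a `Weights` reference, which leaves the API developer free to swap implementations later.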

16 Examples
Static method:
- FiniteAlphabet dna = DNATools.getDNA();
Static field:
- DistributionFactory df = DistributionFactory.DEFAULT;
Factory method:
- Distribution d = df.createDistribution(dna);

17 Two Levels of BioJava
Macro type programming
- Tools classes (SeqIOTools, DistributionTools etc).
- Static methods for common tasks.
Full programming
- Lots of customizations and ‘plug and play’ possible.
- More exposure to the sharp edges of the API. Less documentation.

18 Alphabets, Symbols and Sequences

19 Symbols In BioJava the DNA residue “A” is an object. In Bioperl “A” would be a String. The “A” object is part of the sequence not the sequence. “A” from DNA is not equal to “A” from RNA or “A” from Protein.

20 Why not Strings? DNA A != RNA A != Protein A, but for Strings “A”.equals(“A”) is always true. The DNA Alphabet also contains the ambiguity symbols K, Y, W, S, R, M, B, D, H, V, N.

21 Why not Strings?
Object Y contains C and T. The String “Y” doesn’t contain anything.
Translation HashMaps with Strings are flawed.
- BioJava GGN translates to GLY
- String GGN maps to null
A fully redundant String to String HashMap translation table requires 4096 keys!
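The difference can be sketched in plain Java. This is an illustration only: `AmbiguousCodon` is a hypothetical class, and both tables are deliberately partial (a few IUPAC codes and eight codons of the standard genetic code).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java sketch; the class is hypothetical and the tables are partial.
final class AmbiguousCodon {
    // IUPAC code -> the concrete bases it stands for (only the codes used here).
    private static final Map<Character, String> BASES = new HashMap<>();
    // A fragment of the standard genetic code, written with DNA letters.
    private static final Map<String, String> CODE = new HashMap<>();
    static {
        BASES.put('A', "A");
        BASES.put('C', "C");
        BASES.put('G', "G");
        BASES.put('T', "T");
        BASES.put('Y', "CT");
        BASES.put('N', "ACGT");
        for (String c : new String[] {"GGA", "GGC", "GGG", "GGT"}) {
            CODE.put(c, "Gly");
        }
        for (String c : new String[] {"AGA", "AGG"}) {
            CODE.put(c, "Arg");
        }
        for (String c : new String[] {"AGC", "AGT"}) {
            CODE.put(c, "Ser");
        }
    }

    // Naive String lookup: an ambiguous codon is simply a missing key.
    static String lookup(String codon) {
        return CODE.get(codon);
    }

    // Expand each position into its concrete bases and collect every amino acid reached.
    // Codons missing from the partial table are skipped.
    static Set<String> translate(String codon) {
        Set<String> result = new TreeSet<>();
        for (char a : BASES.get(codon.charAt(0)).toCharArray()) {
            for (char b : BASES.get(codon.charAt(1)).toCharArray()) {
                for (char c : BASES.get(codon.charAt(2)).toCharArray()) {
                    String aa = CODE.get("" + a + b + c);
                    if (aa != null) {
                        result.add(aa);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lookup("GGN"));      // null
        System.out.println(translate("GGN"));   // [Gly]
        System.out.println(translate("AGN"));   // [Arg, Ser]
    }
}
```

Treating an ambiguity code as a set of bases needs only the 64 unambiguous codons in the table, where the String approach would need an entry for every ambiguous spelling.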

22 Symbols are Canonical
DNATools.a() == DNATools.a();
- There is only one instance of ‘a’
DNATools.a().equals(DNATools.a());
ProteinTools.a() != DNATools.a();
Even on remote JVMs!
- During serialization Alphabet indexing is transient and ‘reconnected’ via readResolve() methods.
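How readResolve() can keep an object canonical across serialization, as a plain-Java sketch. `CanonicalBase` is a hypothetical class; BioJava's own Symbol implementations are not shown here and may differ in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class; illustrates the readResolve() mechanism only.
final class CanonicalBase implements Serializable {
    private static final long serialVersionUID = 1L;

    static final CanonicalBase A = new CanonicalBase('a');
    static final CanonicalBase C = new CanonicalBase('c');
    static final CanonicalBase G = new CanonicalBase('g');
    static final CanonicalBase T = new CanonicalBase('t');

    private final char token;

    private CanonicalBase(char token) {
        this.token = token;
    }

    static CanonicalBase of(char token) {
        switch (token) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default: throw new IllegalArgumentException("Not a DNA base: " + token);
        }
    }

    // Called by Java serialization after a copy has been read; returning the
    // shared instance instead of the copy keeps == valid on the receiving side.
    private Object readResolve() {
        return of(token);
    }

    // Serialize to bytes and read back, as would happen when crossing JVMs.
    static CanonicalBase roundTrip(CanonicalBase in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream is =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CanonicalBase) is.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(A) == A);
    }
}
```

Without the readResolve() method, the deserialized object would be a distinct copy and the identity comparison would fail.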

23 Alphabets
A set of Symbols
Alphabets can be infinite
- DoubleAlphabet, IntegerAlphabet
Some Alphabets have a Finite number of Symbols
- DNA, RNA etc
Alphabet and FiniteAlphabet interfaces

24 org.biojava.bio.symbol.Alphabet
- boolean contains(Symbol s): Returns whether or not this Alphabet contains the symbol.
- List getAlphabets(): Return an ordered List of the alphabets which make up a compound alphabet.
- Symbol getAmbiguity(java.util.Set syms): Get a symbol that represents the set of symbols in syms.
- Symbol getGapSymbol(): Get the 'gap' ambiguity symbol that is most appropriate for this alphabet.
- String getName(): Get the name of the alphabet.
- Symbol getSymbol(java.util.List rl): Get a symbol from the Alphabet which corresponds to the specified ordered list of symbols.
- SymbolTokenization getTokenization(java.lang.String name): Get a SymbolTokenization by name.
- void validate(Symbol s): Throws a precanned IllegalSymbolException if the symbol is not contained within this Alphabet.

25 org.biojava.bio.symbol.FiniteAlphabet
In addition to the previous methods:
- void addSymbol(Symbol s): Adds a symbol to this Alphabet.
- Iterator iterator(): Retrieve an Iterator over the Symbols in this Alphabet.
- void removeSymbol(Symbol s): Remove a symbol from this alphabet.
- int size(): The number of symbols in the alphabet.

26 The Default Alphabets
- DNA (a,c,g,t)
- RNA (a,c,g,u)
- PROTEIN (all amino acids including ‘Sel’)
- PROTEIN-TERM (all PROTEIN plus “*”)
- STRUCTURE (PDB structure symbols)
- Alphabet of all integers (Infinite Alphabet)
  - Can generate SubIntegerAlphabets
- Alphabet of all doubles (Infinite Alphabet)

27 Getting the common Alphabets
    import org.biojava.bio.symbol.*;
    import java.util.*;
    import org.biojava.bio.seq.*;

    public class AlphabetExample {
        public static void main(String[] args) {
            Alphabet dna, rna, prot;
            //get the DNA alphabet by name
            dna = AlphabetManager.alphabetForName("DNA");
            //get the RNA alphabet by name
            rna = AlphabetManager.alphabetForName("RNA");
            //get the Protein alphabet by name
            prot = AlphabetManager.alphabetForName("PROTEIN");
            //get the protein alphabet that includes the * termination Symbol
            prot = AlphabetManager.alphabetForName("PROTEIN-TERM");
            //get those same Alphabets from the Tools classes
            dna = DNATools.getDNA();
            rna = RNATools.getRNA();
            prot = ProteinTools.getAlphabet();
            //or the one with the * symbol
            prot = ProteinTools.getTAlphabet();
        }
    }

28 SymbolLists are made of Symbols
org.biojava.bio.symbol.SymbolList
A sequence of Symbols from the same Alphabet.
Uses biological coordinates from 1 to length
- cf. String from 0 to length-1
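The coordinate difference, sketched on a plain String. `BioCoords` is a hypothetical helper written for this illustration; it only mimics the 1-based, end-inclusive convention.

```java
// Hypothetical helper, not a BioJava class: 1-based, end-inclusive access to a String.
final class BioCoords {
    private BioCoords() { }

    // Symbol at biological position 1..length.
    static char symbolAt(String seq, int index) {
        return seq.charAt(index - 1);
    }

    // Region from start to end, both inclusive, counting from 1.
    static String subStr(String seq, int start, int end) {
        return seq.substring(start - 1, end);
    }

    public static void main(String[] args) {
        String seq = "atcggtcggctta";
        System.out.println(symbolAt(seq, 1));    // first symbol
        System.out.println(subStr(seq, 3, 8));   // positions 3 through 8
    }
}
```

Only the start index shifts by one: Java's exclusive end index already equals the inclusive 1-based end.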

29 Doesn’t this waste memory? A SymbolList is not really a List of Symbol Objects but rather a List of Object references. Still a bit heavier than a char[] but not serious.
[Figure: the sequence AACGTGGGTTCCAACT drawn as references to the four shared Symbols A, C, G, T.]

30 The Bigger Picture
[Figure: the sequence AACGTGGGTTCCAACT, the Symbols A, C, G, T, and the AlphabetManager with its “DNA” and “Protein” alphabets.]

31 The SymbolList interface
- void edit(Edit edit): Apply an edit to the SymbolList as specified by the edit object.
- Alphabet getAlphabet(): The alphabet that this SymbolList is over.
- Iterator iterator(): An Iterator over all Symbols in this SymbolList.
- int length(): The number of symbols in this SymbolList.
- String seqString(): Stringify this symbol list.
- SymbolList subList(int start, int end): Return a new SymbolList for the symbols start to end inclusive.
- String subStr(int start, int end): Return a region of this symbol list as a String.
- Symbol symbolAt(int index): Return the symbol at index, counting from 1.
- List toList(): Returns a List of symbols.

32 String to SymbolList
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class StringToSymbolList {
        public static void main(String[] args) {
            try {
                //create a DNA SymbolList from a String
                SymbolList dna = DNATools.createDNA("atcggtcggctta");
                //create a RNA SymbolList from a String
                SymbolList rna = RNATools.createRNA("auugccuacauaggc");
                //create a Protein SymbolList from a String
                SymbolList aa = ProteinTools.createProtein("AGFAVENDSA");
            } catch (IllegalSymbolException ex) {
                //this will happen if you use a character in one of your strings that is
                //not an accepted IUB character for that Alphabet.
                ex.printStackTrace();
            }
        }
    }

33 SymbolList to String
    import org.biojava.bio.symbol.*;

    public class SymbolListToString {
        public static void main(String[] args) {
            SymbolList sl = null;
            //code here to instantiate sl (it must not be left null)
            //convert sl into a String
            String s = sl.seqString();
        }
    }

34 The Sequence Interface
A Sequence is a SymbolList with more information. In addition to Annotatable and SymbolList:
- String getName(): The name of this sequence.
- String getURN(): A Uniform Resource Identifier (URI) which identifies the sequence represented by this object.
Also implements FeatureHolder which allows addition of Feature Objects.

35 Quickly generate a Sequence
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class StringToSequence {
        public static void main(String[] args) {
            try {
                //create a DNA sequence with the name dna_1
                Sequence dna = DNATools.createDNASequence("atgctg", "dna_1");
                //create an RNA sequence with the name rna_1
                Sequence rna = RNATools.createRNASequence("augcug", "rna_1");
                //create a Protein sequence with the name prot_1
                Sequence prot = ProteinTools.createProteinSequence("AFHS", "prot_1");
            } catch (IllegalSymbolException ex) {
                //an exception is thrown if you use a non IUB symbol
                ex.printStackTrace();
            }
        }
    }

36 More Complex Symbols and Alphabets

37 Ambiguity Symbols Ambiguous or fuzzy data is a fact of life, especially with sequencing. DNA traces can contain symbols such as n, r, w, v, h, k, y etc. In BioJava the DNA symbols a, c, g, t are AtomicSymbols. Ambiguous symbols like y are BasisSymbols.

38 BasisSymbols A BasisSymbol may be represented as a list of one or more Symbols. BasisSymbol extends Symbol. Ambiguity Symbols are always BasisSymbols.
- getSymbols(): The list of symbols that this symbol is composed from.

39 AtomicSymbols AtomicSymbols are not ambiguous. They cannot be further divided into Symbols that are valid members of the parent Alphabet. In the case of compound Alphabets they can be divided into valid Symbols from component Alphabets.

40 AtomicSymbols The AtomicSymbol interface extends BasisSymbol but adds no new methods only behaviour contracts. AtomicSymbol instances guarantee that getMatches() returns an Alphabet containing just that Symbol and each element of the List returned by getSymbols() is also atomic.

41 Atomic and Basis
[Figure: the sequence AATW over the “DNA” alphabet from the AlphabetManager; A and T are AtomicSymbols, W is a BasisSymbol.]

42 Translating Ambiguity
BioJava handles translation of ambiguity very smoothly.
- DNA ‘n’ = [a,c,g,t]
- Transcribes to RNA ‘n’ = [a,c,g,u]
- ggn translates to Gly
- agn translates to [Ser, Arg]
- Most protein ambiguities have no ‘token’ and are printed as ‘X’

43 CrossProduct Alphabets
A CrossProductAlphabet is a combination of two or more Alphabets. Any type of CrossProductAlphabet is possible:
- Dimers (DNA x DNA)
- Codon (DNA x DNA x DNA)
- Conditional ((DNA x DNA) x DNA)
- Mixed ((DNA x DNA x DNA) x PROTEIN)
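What a cross product of alphabets contains can be sketched with plain Strings. `CrossProductSketch` is a hypothetical helper; it enumerates symbols eagerly, which says nothing about how BioJava builds its CrossProductAlphabets internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each alphabet is given as a String of one-letter symbols.
final class CrossProductSketch {
    private CrossProductSketch() { }

    // Every combination of one symbol from each component alphabet, in order.
    static List<String> crossProduct(String... alphabets) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (String alphabet : alphabets) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (char c : alphabet.toCharArray()) {
                    next.add(prefix + c);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(crossProduct("acgt", "acgt").size());          // dimers
        System.out.println(crossProduct("acgt", "acgt", "acgt").size());  // codons
    }
}
```

The size of the product is the product of the component sizes, so DNA x DNA has 16 symbols and DNA x DNA x DNA has 64.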

44 Finite and Compound Alphas
[Figure: the sequence [AAC][GTG]GGTTCCAACT. A, C, G, T are DNA AtomicSymbols; ACA and GTG are (DNA x DNA x DNA) AtomicSymbols; GNG is a (DNA x DNA x DNA) BasisSymbol.]

45 What are they good for?
- Codon Symbols (DNA x DNA x DNA).
- Many analysis Classes such as Count and Distribution use Symbol as an argument. A hexamer can be an AtomicSymbol.
- Phred is DNA x Integer.
- 1st and higher order Markov Models use CrossProductAlphabets.

46 How do I make a CrossProductAlphabet?
    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class CrossProduct {
        public static void main(String[] args) {
            //make a CrossProductAlphabet from a List
            List l = Collections.nCopies(3, DNATools.getDNA());
            Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);
            //get the same Alphabet by name
            Alphabet codon2 =
                AlphabetManager.generateCrossProductAlphaFromName("(DNA x DNA x DNA)");
            //show that the two Alphabets are canonical
            System.out.println(codon == codon2);
        }
    }

47 Making Triplet Views on a SymbolList
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class CodonView {
        public static void main(String[] args) {
            try {
                //make a DNA SymbolList
                SymbolList dna = DNATools.createDNA("atgcccgcgtaa");
                System.out.println("Length of dna " + dna.length());
                //get a Codon View (window size of three)
                SymbolList codons = SymbolListViews.windowedSymbolList(dna, 3);
                System.out.println("Length of codons " + codons.length());
                //get a Triplet View
                SymbolList triplets = SymbolListViews.orderNSymbolList(dna, 3);
                System.out.println("Length of triplets " + triplets.length());
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }
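The two views differ in how the window advances. The plain-Java sketch below assumes the windowed view uses non-overlapping windows and the order-N view slides one symbol at a time; `TripletViews` is a hypothetical helper working on Strings, not the BioJava implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper contrasting non-overlapping and sliding windows on a String.
final class TripletViews {
    private TripletViews() { }

    // Non-overlapping windows: positions 1-3, 4-6, and so on.
    static List<String> windowed(String seq, int size) {
        if (seq.length() % size != 0) {
            throw new IllegalArgumentException("Length must be a multiple of " + size);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= seq.length(); i += size) {
            out.add(seq.substring(i, i + size));
        }
        return out;
    }

    // Overlapping windows that advance one symbol at a time.
    static List<String> orderN(String seq, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= seq.length(); i++) {
            out.add(seq.substring(i, i + size));
        }
        return out;
    }

    public static void main(String[] args) {
        String dna = "atgcccgcgtaa";
        System.out.println(windowed(dna, 3));   // 12 / 3 = 4 codons
        System.out.println(orderN(dna, 3));     // 12 - 3 + 1 = 10 triplets
    }
}
```

The windowed form suits reading-frame work such as translation; the sliding form suits k-mer statistics and higher-order models.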

48 Getting a Symbol for a Codon
    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class MakeATG {
        public static void main(String[] args) {
            //make a CrossProductAlphabet from a List
            List l = Collections.nCopies(3, DNATools.getDNA());
            Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);
            //get the codon made of atg
            List syms = new ArrayList(3);
            syms.add(DNATools.a());
            syms.add(DNATools.t());
            syms.add(DNATools.g());
            Symbol atg = null;
            try {
                atg = codon.getSymbol(syms);
            } catch (IllegalSymbolException ex) {
                //used Symbol from Alphabet that is not a component of codon
                ex.printStackTrace();
            }
            System.out.println("Name of atg: " + atg.getName());
        }
    }

49 Breaking a Codon into its Parts
    import java.util.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class BreakingComponents {
        public static void main(String[] args) {
            //make the 'codon' alphabet
            List l = Collections.nCopies(3, DNATools.getDNA());
            Alphabet alpha = AlphabetManager.getCrossProductAlphabet(l);
            //get the first symbol in the alphabet
            Iterator iter = ((FiniteAlphabet)alpha).iterator();
            AtomicSymbol codon = (AtomicSymbol)iter.next();
            System.out.print(codon.getName() + " is made of: ");
            //break it into a list of its components
            List symbols = codon.getSymbols();
            for (int i = 0; i < symbols.size(); i++) {
                if (i != 0) System.out.print(", ");
                Symbol sym = (Symbol)symbols.get(i);
                System.out.print(sym.getName());
            }
        }
    }

50 Basic Sequence Operations

51 Getting a section of a SymbolList
symbolAt(int i)
- Returns a Symbol
subList(int min, int max)
- Returns a SymbolList
subStr(int min, int max)
- Returns the subsection tokenized to a String

52 Transcription
In BioJava DNA sequences and RNA sequences are from different Alphabets. To convert between them:
    //make a DNA SymbolList
    SymbolList dna = DNATools.createDNA("atgccgaatcgtaa");
    //convert it to RNA
    SymbolList rna = DNATools.toRNA(dna);
    //just to prove it worked
    System.out.println(rna.seqString()); //augccgaaucguaa
    //biological transcription (i.e. copy and reverse strand)
    rna = DNATools.transcribeToRNA(dna); //5' atgccgaatcgtaa 3'
    System.out.println(rna.seqString()); //5' uuacgauucggcau 3'

53 Reverse Complement
    import org.biojava.bio.symbol.*;
    import org.biojava.bio.seq.*;

    public class ReverseComplement {
        public static void main(String[] args) throws Exception {
            SymbolList forward = DNATools.createDNA("atcgctagcgatcg");
            //two step
            SymbolList reverse = SymbolListViews.reverse(forward);
            SymbolList revc1 = DNATools.complement(reverse);
            //one step
            SymbolList revc2 = DNATools.reverseComplement(forward);
            //test for equivalence
            System.out.println(revc1.equals(revc2));
        }
    }
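For comparison, the same operations on plain Strings. `RevCompSketch` is a hypothetical helper that handles only the four unambiguous lower-case bases; it shows what the results look like, not how BioJava computes them.

```java
// Hypothetical String-based helper; accepts only a, c, g, t.
final class RevCompSketch {
    private RevCompSketch() { }

    static char complement(char base) {
        switch (base) {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default: throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    // Walk the strand backwards, complementing each base.
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    // Simple alphabet conversion: the same strand written with u instead of t.
    static String toRna(String dna) {
        return dna.replace('t', 'u');
    }

    public static void main(String[] args) {
        String dna = "atgccgaatcgtaa";
        System.out.println(toRna(dna));                     // augccgaaucguaa
        System.out.println(toRna(reverseComplement(dna)));  // uuacgauucggcau
    }
}
```

The String version silently depends on case and rejects ambiguity codes, which is the kind of fragility the Symbol-based API is meant to avoid.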

54 Translation RNATools contains the “Universal” RNA to Protein TranslationTable. Standard procedure is transcribe DNA to RNA and then translate.

55 Translation Example
    import org.biojava.bio.symbol.*;
    import org.biojava.bio.seq.*;

    public class Translate {
        public static void main(String[] args) {
            try {
                //create a DNA SymbolList
                SymbolList symL = DNATools.createDNA("atggccattgaatga");
                //convert to RNA
                symL = DNATools.toRNA(symL);
                //translate to protein
                symL = RNATools.translate(symL);
                //prove that it worked
                System.out.println(symL.seqString());
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

56 Sequence I/O

57 Don’t ever write another Parser
If you can avoid it! BioJava supports
- Genbank, GenPept, RefSeq, EMBL, SwissProt, PDB, Fasta, ABI, LocusLink, Unigene (requires Java 1.4)
- GAME, AGAVE
- Blast, Fasta, HMMER (models and results), BlastXML, MEME, Phred
- OBDA, BioIndex, BioSQL, DAS, GFF, XFF
- Ensembl (with biojava-ensembl package)
StAX / Tag value
RMI and Serialization

58 Simple I/O
Most of BioJava’s simpler I/O operations are conveniently wrapped up behind static methods from the SeqIOTools class. SeqIOTools can read and write:
- Fasta (protein or DNA)
- EMBL
- GenBank (flat file and XML)
- SwissProt
- GenPept
- MSF (protein or DNA)
- Fasta Alignments

59 SeqIOTools Reader Methods
    SequenceIterator i = SeqIOTools.readGenbank(br);
    SequenceIterator i = SeqIOTools.readGenpept(br);
    SequenceIterator i = SeqIOTools.readSwissprot(br);
    SequenceIterator i = SeqIOTools.readEmbl(br);
    etc…
    SequenceIterator i = (SequenceIterator) SeqIOTools.fileToBiojava("fasta", "dna", br);
    Alignment a = (Alignment) SeqIOTools.fileToBiojava("MSF", "rna", br);

60 Features, Locations, Annotations

61 Features and Annotations Sequence data often comes with added information about the various properties of the sequence (Genbank, SwissProt etc). BioJava divides this information into global properties (Annotations) and Localized properties (Features).

62 Annotatable
Annotatable is a “mix-in” interface that indicates the implementing object contains an Annotation object. It defines one method.
- Annotation getAnnotation();

63 Annotations
org.biojava.bio.Annotation
Annotations are used for Global properties: Species, Accession Number, xrefs, date, publication. They are key-value maps. Key and Value are objects but are almost always Strings.
Annotation.EMPTY_ANNOTATION
- static convenience instance
- good place holder, avoids null pointer exceptions
- immutable

64 Annotation API
- Map asMap(): Return a map that contains the same key/values as this Annotation.
- boolean containsProperty(java.lang.Object key): Returns whether the property is defined.
- Object getProperty(java.lang.Object key): Retrieve the value of a property by key.
- Set keys(): Get a set of key objects.
- void removeProperty(java.lang.Object key): Delete a property.
- void setProperty(java.lang.Object key, java.lang.Object value): Set the value of a property.

65 FeatureHolder FeatureHolder is another “mix-in” interface which allows the implementing object to hold Features. Sequence implements FeatureHolder. Features are created by FeatureHolders. FeatureHolders can be filtered.

66 FeatureHolder methods
- boolean containsFeature(Feature f): Check if the feature is present in this holder.
- int countFeatures(): Count how many features are contained.
- Feature createFeature(Feature.Template ft): Create a new Feature, and add it to this FeatureHolder.
- Iterator features(): Iterate over the features in no well defined order.
- FeatureHolder filter(FeatureFilter filter): Query this set of features using a supplied FeatureFilter.
- FeatureHolder filter(FeatureFilter fc, boolean recurse): Return a new FeatureHolder that contains all of the children of this one that passed the filter fc.
- FeatureFilter getSchema(): Return a schema-filter for this FeatureHolder.
- void removeFeature(Feature f): Remove a feature from this FeatureHolder.

67 Features are Annotatable
Features implement Annotatable
- Can hold an annotation
- Global annotations of a Feature
/note: /db_xref: etc

68 Features may be nested
Features implement FeatureHolder!
- Therefore Features may hold nested Features
- c.f. The AWT Menu is a MenuItem
- e.g. A gene has exons and introns
- Filtering can be recursive
- A Feature cannot hold itself (directly or indirectly)

69 Location API Locations are objects that specify a minimum and maximum bound on a region of sequence. Contains some useful methods, particularly getMin() and getMax(). Many methods have been deprecated and are now delegated to LocationTools. LocationTools is the best place to get new instances of a Location. Implementations include PointLocation, RangeLocation, CircularLocation and CompoundLocation.

70 LocationTools
- static boolean areEqual(Location locA, Location locB): Return whether two locations are equal.
- static boolean contains(Location locA, Location locB): Return true iff all indices in locB are also contained by locA.
- static Location flip(Location loc, int len): Flips a location relative to a length.
- static Location intersection(Location locA, Location locB): Return the intersection of two locations.
- static CircularLocation makeCircularLocation(int min, int max, int seqLength): A simple method to generate a RangeLocation wrapped in a CircularLocation.
- static Location makeLocation(int min, int max): Return a contiguous Location from min to max.
- static boolean overlaps(Location locA, Location locB): Determines whether the locations overlap or not.
- static Location subtract(Location x, Location y): Subtract one location from another.
- static Location union(java.util.Collection locs): The n-way union of a Collection of locations.
- static Location union(Location locA, Location locB): Return the union of two locations.
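The arithmetic behind a contiguous range location can be sketched in plain Java. `RangeSketch` is a hypothetical class using 1-based inclusive bounds; it covers only simple ranges and is not the BioJava Location implementation.

```java
// Hypothetical class: a contiguous range with 1-based, inclusive bounds.
final class RangeSketch {
    final int min;
    final int max;

    RangeSketch(int min, int max) {
        if (min < 1 || min > max) {
            throw new IllegalArgumentException("Need 1 <= min <= max");
        }
        this.min = min;
        this.max = max;
    }

    // True iff every index of other also lies in this range.
    boolean contains(RangeSketch other) {
        return min <= other.min && other.max <= max;
    }

    // True iff the two ranges share at least one index.
    boolean overlaps(RangeSketch other) {
        return min <= other.max && other.min <= max;
    }

    // The shared indices, or null when the ranges do not overlap.
    RangeSketch intersection(RangeSketch other) {
        if (!overlaps(other)) {
            return null;
        }
        return new RangeSketch(Math.max(min, other.min), Math.min(max, other.max));
    }

    // The part of a String that this range covers.
    String symbols(String seq) {
        return seq.substring(min - 1, max);
    }

    @Override
    public String toString() {
        return min + ".." + max;
    }

    public static void main(String[] args) {
        RangeSketch loc = new RangeSketch(3, 8);
        System.out.println(loc + " covers " + loc.symbols("gcagcuaggcggaaggagc"));
        System.out.println(loc.intersection(new RangeSketch(6, 12)));
    }
}
```

Returning null for an empty intersection is a simplification made for this sketch.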

71 Location Example
    import org.biojava.bio.symbol.*;
    import org.biojava.bio.seq.*;

    public class SpecifyRange {
        public static void main(String[] args) {
            try {
                //make a RangeLocation specifying the residues 3-8
                Location loc = LocationTools.makeLocation(3,8);
                //print the location
                System.out.println("Location: " + loc.toString());
                //make a SymbolList
                SymbolList sl = RNATools.createRNA("gcagcuaggcggaaggagc");
                System.out.println("SymbolList: " + sl.seqString());
                //get the SymbolList specified by the Location
                SymbolList sym = loc.symbols(sl);
                System.out.println("Symbols specified by Location: " + sym.seqString());
            } catch (IllegalSymbolException ex) {
                //illegal symbol used to make sl
                ex.printStackTrace();
            }
        }
    }

72 Filtering Features FeatureHolders have a filter method that accepts a FeatureFilter as an argument. Features that are accepted by the FeatureFilter are returned as a new FeatureHolder. Filtering may be done recursively so that nested Features are subjected to the same FeatureFilter.

73 FeatureFilters
FeatureFilter is an interface that specifies one method.
- boolean accept(Feature f)
There are 26 implementations of FeatureFilter in BioJava available as inner classes of the FeatureFilter interface. Most commonly used are ByType, BySource, StrandFilter, OverlapsLocation, ContainedByLocation. Also boolean logic filters: And, Or, Not.
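A single-method filter with boolean combinators and optional recursion can be sketched in plain Java. Everything in `FilterSketch` (the `Feature` stand-in, `Filter`, and the helper methods) is hypothetical and only mirrors the shape of the idea, not the BioJava classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are BioJava classes.
final class FilterSketch {
    private FilterSketch() { }

    // Minimal stand-in for a feature: a type, a span, and nested children.
    static final class Feature {
        final String type;
        final int min;
        final int max;
        final List<Feature> children = new ArrayList<>();

        Feature(String type, int min, int max) {
            this.type = type;
            this.min = min;
            this.max = max;
        }

        Feature add(Feature child) {
            children.add(child);
            return this;
        }
    }

    // Single-method contract, in the spirit of accept(Feature f).
    interface Filter {
        boolean accept(Feature f);
    }

    static Filter byType(String type) {
        return f -> f.type.equals(type);
    }

    static Filter overlaps(int min, int max) {
        return f -> f.min <= max && min <= f.max;
    }

    static Filter and(Filter a, Filter b) {
        return f -> a.accept(f) && b.accept(f);
    }

    static Filter not(Filter a) {
        return f -> !a.accept(f);
    }

    // Collect accepted features; with recurse set, nested features are tested too.
    static List<Feature> filter(List<Feature> features, Filter criteria, boolean recurse) {
        List<Feature> out = new ArrayList<>();
        for (Feature f : features) {
            if (criteria.accept(f)) {
                out.add(f);
            }
            if (recurse) {
                out.addAll(filter(f.children, criteria, true));
            }
        }
        return out;
    }

    // One gene holding two exons and an intron.
    static List<Feature> example() {
        Feature gene = new Feature("gene", 1, 900)
            .add(new Feature("exon", 1, 200))
            .add(new Feature("intron", 201, 500))
            .add(new Feature("exon", 501, 900));
        List<Feature> top = new ArrayList<>();
        top.add(gene);
        return top;
    }

    public static void main(String[] args) {
        System.out.println(filter(example(), byType("exon"), true).size());
        System.out.println(filter(example(), byType("exon"), false).size());
    }
}
```

Because the exons are nested inside the gene, a non-recursive query for exons at the top level finds none.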

74 Analysis and Distributions

75 Distributions and Counts The Distribution and Count interfaces are from the org.biojava.bio.dist package. Counts are maps from AtomicSymbols to counts. Distributions are maps from Symbols to frequencies.

76 Distributions
Distributions are central to analysis
- Map Symbols to Frequencies
- Can be trained or weights can be set
- Used heavily in dp (dynamic programming) package.
  - HMM transitions and emissions
Many implementations, frequently used are:
- SimpleDistribution
- OrderNDistribution
- UniformDistribution

77 Distribution API
- Alphabet getAlphabet(): The alphabet from which this spectrum emits symbols.
- Distribution getNullModel(): Retrieve the null model Distribution that this Distribution recognizes.
- double getWeight(Symbol s): Return the probability that Symbol s is emitted by this spectrum.
- void registerWithTrainer(DistributionTrainerContext dtc): Register this distribution with a training context.
- Symbol sampleSymbol(): Sample a symbol from this state's probability distribution.
- void setNullModel(Distribution nullDist): Set the null model Distribution that this Distribution recognizes.
- void setWeight(Symbol s, double w): Set the probability or odds that Symbol s is emitted by this state.

78 DistributionFactory
Generally a Distribution is created using a DistributionFactory. The DistributionFactory interface exposes a static field called DEFAULT that holds a default DistributionFactory implementation.
    DistributionFactory df = DistributionFactory.DEFAULT;
    Distribution d = df.createDistribution(dna.getAlphabet());

79 Distribution Training
Distributions can be trained on observed sequences using a DistributionTrainerContext. One or more Distributions can be registered with the DTC.
    //register the Distributions with the trainer
    dtc.registerDistribution(dnaDist);

80 DistributionTrainerContext
A DistributionTrainer is assigned to each registered Distribution by the DTC. If unusual training behaviour is required you can register your own DistributionTrainer at the same time. The DTC can also add pseudocounts if needed. Ambiguities are automagically handled.
- Counts are split according to the null model.
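One way to picture the splitting of ambiguous counts, as a plain-Java sketch. It assumes a uniform null model, so an ambiguity code contributes an equal share to each base it covers; `CountSketch` is hypothetical and handles only the codes a, c, g, t, y, s and n.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch assuming a uniform null model; not the BioJava trainer.
final class CountSketch {
    private CountSketch() { }

    // The bases each supported code stands for.
    static String expand(char code) {
        switch (code) {
            case 'a': return "a";
            case 'c': return "c";
            case 'g': return "g";
            case 't': return "t";
            case 'y': return "ct";
            case 's': return "cg";
            case 'n': return "acgt";
            default: throw new IllegalArgumentException("Unsupported code: " + code);
        }
    }

    // Tally a sequence; an ambiguous symbol adds an equal fraction to each base it covers.
    static Map<Character, Double> count(String seq) {
        Map<Character, Double> counts = new LinkedHashMap<>();
        for (char b : "acgt".toCharArray()) {
            counts.put(b, 0.0);
        }
        for (char code : seq.toCharArray()) {
            String bases = expand(code);
            for (char b : bases.toCharArray()) {
                counts.merge(b, 1.0 / bases.length(), Double::sum);
            }
        }
        return counts;
    }

    // Normalize the counts into frequencies that sum to one.
    static Map<Character, Double> train(String seq) {
        Map<Character, Double> freqs = count(seq);
        for (Map.Entry<Character, Double> e : freqs.entrySet()) {
            e.setValue(e.getValue() / seq.length());
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(count("atcgctagcgtyagcntatsggca"));
        System.out.println(train("atcgctagcgtyagcntatsggca"));
    }
}
```

With a non-uniform null model the shares would be weighted rather than equal; that case is not shown.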

81 Training Example
    //make a DNA SymbolList
    SymbolList dna = DNATools.createDNA("atcgctagcgtyagcntatsggca");
    //get a DistributionTrainerContext
    DistributionTrainerContext dtc = new SimpleDistributionTrainerContext();
    //make the Distribution
    Distribution dnaDist =
        DistributionFactory.DEFAULT.createDistribution(dna.getAlphabet());
    //register the Distribution with the trainer
    dtc.registerDistribution(dnaDist);
    for (int j = 1; j <= dna.length(); j++) {
        dtc.addCount(dnaDist, dna.symbolAt(j), 1.0);
    }
    //train the Distribution
    dtc.train();

82 setWeight() Example
    FiniteAlphabet a = DNATools.getDNA();
    Distribution d = DistributionFactory.DEFAULT.createDistribution(a);
    //set the weight of each symbol
    d.setWeight(DNATools.a(), 0.3);
    d.setWeight(DNATools.c(), 0.2);
    d.setWeight(DNATools.g(), 0.2);
    d.setWeight(DNATools.t(), 0.3);

83 DistributionTools
DistributionTools holds static methods for creating and manipulating Distributions. Tasks include:
- Equal emission spectra?
- Shannon Entropy, information, KL Distance.
- Generate biased sequences.
- Make a Distribution[] from an Alignment (each Distribution represents one position in an Alignment).
- Average two or more Distributions.
- Randomize a Distribution.
- Make a Distribution from a Count.
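Two of the listed quantities, Shannon entropy and KL distance, computed on plain arrays of weights. `EntropySketch` is a hypothetical helper that shows the formulas in bits; it does not call DistributionTools.

```java
// Hypothetical helper; weights are plain arrays that are expected to sum to one.
final class EntropySketch {
    private EntropySketch() { }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Shannon entropy in bits; zero weights contribute nothing.
    static double shannonEntropy(double[] p) {
        double h = 0.0;
        for (double w : p) {
            if (w > 0.0) {
                h -= w * log2(w);
            }
        }
        return h;
    }

    // KL divergence D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0.
    static double klDivergence(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * log2(p[i] / q[i]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        double[] biased = {0.3, 0.2, 0.2, 0.3};   // a, c, g, t weights
        System.out.println(shannonEntropy(uniform));
        System.out.println(shannonEntropy(biased));
        System.out.println(klDivergence(biased, uniform));
    }
}
```

For a four-symbol alphabet the entropy peaks at 2 bits for the uniform case, and the KL distance from the uniform distribution is exactly the entropy shortfall from that maximum.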

84 Serialization of Distributions
Distributions are Serializable
- Write to and Read from Binary
- RMI
XMLDistributionWriter
- Write any Distribution to a stream in XML format.
XMLDistributionReader
- SAXParser
- Read any Distribution from an XML stream

85 XML Output

86 What Else??
- Dynamic Programming (HMMs)
- Bibliography
- Alignments
- Blast and Fasta parsing

87 What Else??
- BioSQL support
- GUI components
- Chromatograms
- Molecular Biology (pI, mass, restriction enzymes)
- Molecular Structure

