The (new) Table Browser

Talk Outline
Table Browser History
New Table Browser Features
New Table Browser Implementation
–all.joiner & .as files
–Overall control and data flow
–Joining and intersection modules
Limits and future directions

Table Browser History
Goal: annotations over a particular region of the genome in text rather than graphic format.
Krish did the first successful implementation: separated tables into positional and non-positional, merged chrN_ tables, split off hgFind.
Angie added sequence output, filters, intersections, and many help pages.
These versions of the table browser were called hgText.

Why a New Table Browser
hgText is powerful, but much of its power is not obvious on the first page.
In hgText the association between tracks and tables was not clear.
There was no way to join fields across related tables.

New Table Browser
Flip to demoing the new table browser online.
–Show overall controls
–Demo getting genome position, common name, and review status for refSeq on ENCODE
–Demo getting alt-splice variants with knownCanonical and knownIsoforms
–Demo custom track created from filtered cpgIslands (>= 500 bases, >= 0.9 Exp/Obs)
–Intersect custom fat cpg track with most conserved, requiring 75% overlap, output as custom track
–Intersect conserved fat cpg with exoniphy, requiring <= 5% overlap, output as hyperlink (custom track output crashes!)

New Table Browser Implementation
Built using:
–AutoSql .as files to describe table fields
–all.joiner file to describe table relationships
–.bed based intersection and sequence output code from old table browser
–About 8000 lines of new C code in 19 .c files in src/hg/hgTables

Data Flow
Each region (piece of a chromosome) is processed separately.
The filter is turned into a SQL where clause.
Field-oriented output, especially selected tables, is handled by one branch of code:
–SQL rows -> joining routines -> output
GFF, Custom Track, Sequence, Hyperlink, and Summary Stats outputs are handled by a branch of code that turns things into BED format internally:
–SQL rows -> BED -> intersecting -> output
Ultimately, fields and BEDs need to be merged so that joining and intersecting can happen at the same time.

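The "filter is turned into a SQL where clause" step can be sketched roughly as below. This is a minimal illustration, not the hgTables code: struct filterTerm and regionWhereClause are hypothetical names, the chrom/chromStart/chromEnd columns are assumed to follow the usual BED-style layout, and filter values are assumed to have been validated by the caller before they reach this function.

```c
#include <stdio.h>

/* Hypothetical filter term: one "field op value" comparison. */
struct filterTerm
    {
    const char *field;   /* column name, e.g. "length" */
    const char *op;      /* SQL comparison operator, e.g. ">=" */
    const char *value;   /* literal, assumed already validated by the caller */
    };

/* Write a where clause selecting rows that overlap the half-open region
 * [start, end) on chrom and that pass every filter term.
 * Returns the clause length, or -1 if buf is too small. */
int regionWhereClause(char *buf, size_t bufSize, const char *chrom,
                      int start, int end,
                      const struct filterTerm *terms, int termCount)
{
int i;
int used = snprintf(buf, bufSize,
    "chrom = '%s' and chromStart < %d and chromEnd > %d", chrom, end, start);
if (used < 0 || (size_t)used >= bufSize)
    return -1;
for (i = 0; i < termCount; ++i)
    {
    size_t room = bufSize - (size_t)used;
    int n = snprintf(buf + used, room, " and %s %s %s",
                     terms[i].field, terms[i].op, terms[i].value);
    if (n < 0 || (size_t)n >= room)
        return -1;
    used += n;
    }
return used;
}
```

Called once per region, this yields one query per piece of chromosome, which matches the "each region processed separately" flow. The field names used to exercise it (length, obsExp) are only illustrative.
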
Joining Code
Use all.joiner to find the route from the primary table to the other tables in the join.
Construct a SQL query for each table that applies table filters and region, and includes key fields even if they are not part of the final output.
Construct a row object (array of lists) for each row returned from the primary table.
Construct a hash keyed by the joining field of the primary table, with row objects as values.
Execute the SQL query for the next table, and when keys match add info to the row object.
Repeat with third and subsequent tables, if any.

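The hash-and-row-object steps above can be sketched as follows. This is an assumption-laden illustration in plain C, not the hgTables implementation: every name here (joinedRow, rowHash, addPrimaryRow, joinSecondaryRow) is hypothetical, the hash is a simple fixed-size chained table, and freeing memory is left out for brevity.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 4    /* output fields per row (assumed small) */
#define BUCKETS 64      /* fixed bucket count, enough for a sketch */

/* One value in a field's list. */
struct valNode { struct valNode *next; char *val; };

/* Row object: an array of lists, one list per output field. */
struct joinedRow
    {
    struct joinedRow *nextInBucket;
    char *key;                          /* joining field of primary table */
    struct valNode *fields[MAX_FIELDS];
    };

struct rowHash { struct joinedRow *buckets[BUCKETS]; };

static void *mustAlloc(size_t size)
{
void *p = calloc(1, size);
if (p == NULL)
    abort();
return p;
}

static char *copyString(const char *s)
{
size_t n = strlen(s) + 1;
return memcpy(mustAlloc(n), s, n);
}

static unsigned hashString(const char *s)
{
unsigned h = 5381;
while (*s != '\0')
    h = h * 33 + (unsigned char)*s++;
return h;
}

/* Append val to the end of the list for one field of a row. */
static void appendValue(struct joinedRow *row, int fieldIx, const char *val)
{
struct valNode *node = mustAlloc(sizeof *node);
struct valNode **tail = &row->fields[fieldIx];
node->val = copyString(val);
while (*tail != NULL)
    tail = &(*tail)->next;
*tail = node;
}

/* Create a row object for a primary-table row and store it under key. */
struct joinedRow *addPrimaryRow(struct rowHash *hash, const char *key,
                                int fieldIx, const char *val)
{
struct joinedRow *row = mustAlloc(sizeof *row);
unsigned b = hashString(key) % BUCKETS;
row->key = copyString(key);
row->nextInBucket = hash->buckets[b];
hash->buckets[b] = row;
appendValue(row, fieldIx, val);
return row;
}

/* For a row from a later table, add val to every row object whose key
 * matches. Returns the number of row objects updated. */
int joinSecondaryRow(struct rowHash *hash, const char *key,
                     int fieldIx, const char *val)
{
int matches = 0;
struct joinedRow *row = hash->buckets[hashString(key) % BUCKETS];
for (; row != NULL; row = row->nextInBucket)
    if (strcmp(row->key, key) == 0)
        {
        appendValue(row, fieldIx, val);
        ++matches;
        }
return matches;
}
```

Because each field holds a list, a one-to-many match adds values to the existing row object instead of creating extra output rows.
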
Limits/Features of Joining Code
Unless a filter is applied, non-positional tables will be scanned completely. This takes 3 minutes for gbCdnaInfo. (Hint: add filter type=mRNA.)
Joining code is only applied to field-oriented output.
Will handle joins across split tables.
Can chop off prefixes and suffixes on a key field before joining if specified in all.joiner. (Needed for chopping off the version number in some Ensembl tables, for instance.)
Avoids combinatorial explosion of output rows by allowing fields to contain lists.

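Chopping a key before joining might look something like the sketch below. The function name chopKey and its two parameters are hypothetical and do not reproduce the all.joiner syntax; the semantics chosen here (drop everything through the first occurrence of one marker, and everything from the first occurrence of another) are an assumption for illustration.

```c
#include <string.h>

/* Copy key into out after optionally removing a prefix and a suffix.
 * stripThrough: if non-empty and found, drop everything up to and
 *               including its first occurrence.
 * stripFrom:    if non-empty and found in what remains, drop everything
 *               from its first occurrence onward.
 * Returns the length of the chopped key, or -1 if out is too small. */
int chopKey(const char *key, const char *stripThrough, const char *stripFrom,
            char *out, size_t outSize)
{
const char *start = key;
const char *end;
size_t len;
if (stripThrough != NULL && *stripThrough != '\0')
    {
    const char *hit = strstr(key, stripThrough);
    if (hit != NULL)
        start = hit + strlen(stripThrough);
    }
end = start + strlen(start);
if (stripFrom != NULL && *stripFrom != '\0')
    {
    const char *hit = strstr(start, stripFrom);
    if (hit != NULL)
        end = hit;
    }
len = (size_t)(end - start);
if (len >= outSize)
    return -1;
memcpy(out, start, len);
out[len] = '\0';
return (int)len;
}
```

With stripFrom set to ".", a versioned identifier and its unversioned form produce the same key, so rows from the two tables can meet in the hash. Keys that contain neither marker pass through unchanged.
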
Intersecting Code
Primarily inherited from hgText.
Uses hTableInfo (a call in hg/lib/hdb.c), which reports which fields in the database store chromosome, start, end, etc.
Analyzes hTableInfo to figure out how many fields are in the corresponding BED structure, and how to query the database and massage output to get a BED.
Converts the second table in the intersection into a bitmap.
Counts the number of bases in the bitmap that intersect each BED item in the first table. (For pure bitwise operations, converts the first table to a bitmap too.)

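The bitmap approach can be illustrated with a short sketch. It assumes half-open [start, end) coordinates relative to the region being processed; struct range and the three functions are hypothetical names and do not come from the hgText or hgTables source.

```c
#include <stdlib.h>

/* Half-open interval [start, end) in region-relative coordinates. */
struct range { int start; int end; };

/* Trim a range so that it lies within [0, regionSize). */
static void clampRange(struct range *r, int regionSize)
{
if (r->start < 0)
    r->start = 0;
if (r->end > regionSize)
    r->end = regionSize;
}

/* Build a bitmap with one bit per base, set where any range from the
 * second table covers that base. Caller frees the result.
 * Returns NULL if regionSize is not positive or allocation fails. */
unsigned char *bitmapFromRanges(const struct range *ranges, int count,
                                int regionSize)
{
unsigned char *bits;
int i, pos;
if (regionSize <= 0)
    return NULL;
bits = calloc(((size_t)regionSize + 7) / 8, 1);
if (bits == NULL)
    return NULL;
for (i = 0; i < count; ++i)
    {
    struct range r = ranges[i];
    clampRange(&r, regionSize);
    for (pos = r.start; pos < r.end; ++pos)
        bits[pos >> 3] |= (unsigned char)(1u << (pos & 7));
    }
return bits;
}

/* Count the bases of one first-table item that are set in the bitmap. */
int countOverlap(const unsigned char *bits, int regionSize, struct range item)
{
int pos, covered = 0;
clampRange(&item, regionSize);
for (pos = item.start; pos < item.end; ++pos)
    if (bits[pos >> 3] & (1u << (pos & 7)))
        ++covered;
return covered;
}

/* True if covered bases make up at least minPercent of the item. */
int meetsMinOverlap(int covered, struct range item, int minPercent)
{
long long size = (long long)item.end - item.start;
return size > 0 && (long long)covered * 100 >= (long long)minPercent * size;
}
```

Overlapping items in the second table set the same bits, so each base is counted once however many items cover it. A threshold such as "requiring 75% overlap" then reduces to one comparison per item in the first table.
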
Limits and Features of Intersections
Not applied to field or MAF output.
Information is lost in converting to BED.
Does allow intersection code for sequence, GFF, custom track, BED, statistics, and hyperlinks output to go through the same path.

Future Directions
Make a combined BED/Row structure to bring together intersections and joining.
Polish sequence output in some places.
Get .as file info for all tables.
Encourage people to pay a little more attention to database concerns as well as genome browser concerns when designing tables.
See if we can phase out split tables by tuning MySQL aggressively.