NATO Consultation, Command and Control Agency

Presentation transcript:

NATO Consultation, Command and Control Agency COMMUNICATIONS & INFORMATION SYSTEMS Decreasing “Bit Pollution” through “Sequence Reduction” Dr. Davras Yavuz yavuz@nc3a.nato.int NATO UNCLASSIFIED NATO UNCLASSIFIED

You will find this presentation and the accompanying paper at www.nc3a.info/MCC2006, where both can be viewed and/or downloaded (the four other NC3A presentations can also be found at the same URL).

Terminology
- “Sequence Reduction”: originates with Peribit (~2000); the founder’s Ph.D. on genome mapping (Biomedical Informatics, Stanford University) uses the term “Molecular Sequence Reduction” (MSR).
- “Bit Pollution”: link/network pollution caused by the repetition of redundant digital sequences over transmission media (especially significant for mobile/deployed networks/links).
- Other related terms: WAN Optimizer, Application Accelerator/Optimizer or Application Controller-Optimizer, Performance Enhancement Proxies (PEP), WAN Expanders, latency (= delay) removers/compensators/mitigators, etc.
- This is a new and dynamic field: many terms will continue to appear and coalesce; some will catch on, others will disappear.

Terminology
- “Next Generation Compression”, “Bit Pollution Reduction”, “Sequence Reduction” (the latter from Peribit/Dr. Amit Singh)
- WAN Expander (WX), WAN Optimizer, WAN Optimization Controller (WOC) (Juniper/Peribit)
- Application Accelerator/Optimizer/Controller-Optimizer
- Latency Remover/Optimizer (“latency” can be replaced by “delay”), especially for networks with SATCOM links
- In general: use of a-priori knowledge of the data comms protocols required by the application to optimize the data input/output
- Combinations of the above
- Unfortunately, all present implementations are “proprietary”; it is unrealistic to expect “standards” soon, as the technology is too new and lucrative

Why “Bit Pollution”?
- Most of us deal daily with various electronic files/information. Taking MS Office as an example: Word, PPT, Excel, Project, HTML, Access files, and/or many other electronic files, databases, forms, etc.
- On many occasions we make small changes and send them back and/or forward to others.
- Repetitive traffic over communication links can, in general, be classified broadly into 3 categories:
  1) Application & protocol overheads
  2) Commonly used words, phrases, strings, objects (logos, images, audio clips, etc.)
  3) Process flows (database updates/views, forms, templates, etc. going back and forth)

SEQUENCE REDUCTION: Next Generation Compression - Examples
- 256 Kbps satellite link
- 20 Mbyte PPT file (48 slides) sent for the 1st time: ~12 minutes (700 secs)
- 6 of the slides modified, file size change <0.5 Mbytes
- Modified file sent 6 hours later: ~8 secs
- Same modified file sent 24 hours later: ~18 secs
- Sent 7 days later: ~24 secs
- Original file sent 7 days later: ~14 secs
- Similar results for Word, Excel files and web pages
- Less but still significant improvement for PDF files
- Smallest improvement for zipped files (reduction by ~2.5 to 3)
- The amount of “new” files sent between repetitions and the SR RAM/HD capacities have a strong effect on the duration of repeat transmissions (dynamic library updates)
- Above results are based on Peribit SRs: German MOD, Syracuse University “Real World” Labs (Network Computing, Nov 2004) and NC3A. GE MOD results are based on operational traffic, the others on test traffic.
- Ref [6] of the paper: “Record for throughput was ~60Mbps through a T1. It came about when copying 1.5GB file twice!”
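
The timings above can be turned into apparent throughput with a little arithmetic. The sketch below is only illustrative: it assumes decimal megabytes and ignores protocol overhead, and the function names are hypothetical.

```python
LINK_BPS = 256_000            # 256 Kbps satellite link (from the slide)
FILE_BYTES = 20 * 1_000_000   # 20 Mbytes, assuming decimal megabytes


def raw_transfer_seconds(size_bytes, link_bps):
    """Time to push size_bytes through the link with no overhead and no reduction."""
    return size_bytes * 8 / link_bps


def effective_bps(size_bytes, seconds):
    """Apparent throughput seen by the user for a transfer that took `seconds`."""
    return size_bytes * 8 / seconds


# Observed durations quoted on the slide, in seconds.
observed = {
    "first send": 700,
    "6 hours later": 8,
    "24 hours later": 18,
    "7 days later": 24,
}

print(f"ideal first send: {raw_transfer_seconds(FILE_BYTES, LINK_BPS):.0f} s")
for label, secs in observed.items():
    mbps = effective_bps(FILE_BYTES, secs) / 1e6
    print(f"{label:>15}: {mbps:6.2f} Mbps apparent, {700 / secs:5.1f}x the first send")
```

The ideal first-send time comes out somewhat below the observed ~700 secs, which is consistent with protocol overhead on the link; the repeat transfers appear to run far above the physical link rate because most of the file is replaced by references to data already seen.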

Mobile/Tactical Comms Divergence
- Fixed communications (WANs with all users/nodes fixed): with the fiber-optic/photonic revolution, essentially unlimited capacity is now possible/available if/when a cable can be installed.
- Mobile comms (networks with mobile/deployable users): no technological revolution similar to the photonic one is foreseen; radio propagation will be the limiting factor.
- Mainstay will be radio: tactical LOS tens/hundreds of Kbps, BLOS (rough terrain, long distances) a few Kbps.
- Star-wars scenarios: moving laser beams???
- LEO satellites will provide some 100s of Kbps, at a cost.
- Divergence will continue.
- Another factor: input into the five senses is ~100 Shannon/entropy bps; allowing x 10 for transmission redundancy gives 1 Kbps.
- Basic issue for mobile/deployed communications, e.g. when at least one end of a communications link is moving, and/or some users/nodes of a communications network are moving. Deployed: move and then set up communications. On-the-move: communicate while moving.
- Therefore: we must treat mobile/tactical comms differently.

Deployable, Mobile, On-the-Move Communications
- At least one end of a link moving/deployed
- Networks which have nodes/users moving/deployed
- Such links/networks are essential for survivability and rapid reaction, and will be taking on increasingly critical tasks
- Present approach: use applications developed for fixed links/networks for deployed/mobile units
- We must consider the very different characteristics of such networks when choosing applications
- Can we measure “information”, so that we can determine the performance of links/networks in terms of “information” transported, not just bits/bytes?

Can we measure “information”? Yes we can!
- Shannon defined the concept of “Entropy”, a logarithmic measure, in the 1940s (while working on cryptography); it has stood the test of time.
- Hartley (1928) was the first to propose a logarithmic measure of information (base 10), but Shannon developed the idea into a complete, structured “theory of information & communication”.
- Shannon preferred log2 and called the unit “bits”. Base e is also sometimes used (Napiers/nats), as is base 10 (Hartleys/dits/decs).
- The smaller the probability of occurrence of an event, the higher the “information delivered” when it occurs.
- There is another “information theory” that you might hear about, Kolmogorov-Chaitin (K-C) theory, that some mathematicians still refer to (not in the mainstream).
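
The logarithmic measure described above can be sketched in a few lines. This is an illustrative example only; the function name `self_information` is hypothetical, and the three bases correspond to bits, nats and hartleys.

```python
import math


def self_information(p, base=2.0):
    """Information delivered by an event of probability p (0 < p <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)


# Rarer events deliver more information; the unit depends only on the log base.
for p in (0.5, 0.1, 0.001):
    print(
        f"p={p}: {self_information(p):.3f} bits, "
        f"{self_information(p, math.e):.3f} nats, "
        f"{self_information(p, 10):.3f} hartleys"
    )
```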

[Figure: discrete, countable symbol sets {Si} and {Rj}. C. E. Shannon (BSTJ 1948)]

Entropy (H) in the case of two possibilities/events/symbols
- Probability of one = p, of the other q = 1 - p
- H = -(p log2 p + q log2 q)
- [Figure: H plotted versus p]
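
As a small worked example of the two-symbol formula above, the sketch below evaluates H for a few values of p. The function name is hypothetical, and the convention that 0 log 0 = 0 is assumed at the end points.

```python
import math


def binary_entropy(p):
    """H = -(p log2 p + q log2 q) with q = 1 - p, in bits."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))


# H is symmetric about p = 0.5, where it reaches its maximum.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p={p:<4}  H={binary_entropy(p):.3f} bits")
```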

Let us take a “Natural Language”: English as an example
- English has 26 letters (characters); with space as a delimiter, a TOTAL of 27 characters (symbols)
- One could include punctuation, special characters, etc.; for example, we could use the full 256 ASCII symbol set. The methodology is the same.
- Extension to other natural languages is readily made
- Extension to images is also possible (same methodology)

Structure of a “Natural Language” - English
- Defined by many characteristics: grammar, semantics, etymology, usage, historical developments, ...
- Until the early 70s there was substantial belief that “Natural Languages” and “computer programming languages” (finite automata instructions) had similarities; Noam Chomsky’s work (Professor at MIT) completely destroyed those expectations
- Natural languages can be studied through probabilistic (Markov) models
- This was Shannon’s approach (1940s, no computers: Bell Labs staff flipped through many pages of books to get the probabilities). He was actually working on cryptography and made important contributions in that area also.

Various Markov model examples belong here; they are skipped for continuity and may be found at the end.

Zipf’s Law (“Principle of Least Effort”)
- George Kingsley Zipf, Professor of Linguistics, Harvard (1902 – 1950)
- If the “words” in a language are ordered (“ranked”) from the most frequently used down, the probability Pn of the nth word in this list is Pn ≈ 0.1 / n
- This implies a maximum vocabulary size of 12366 words, since the probabilities must sum to 1 and Σ 1/n is not finite when summed from 1 to ∞
- For details of the above see DY, IEEE Transactions on Information Theory, September 1974
- Words could be roots, lexical, types. 12366 is sufficiently large to model all languages (Shannon had 8727, which is wrong; however, the correction of the error makes his results even more meaningful)
- There are many other applications of “Zipf’s Law”, e.g. populations of cities in a country, company sizes, ...; if interested, just make a Google/Internet search
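
The vocabulary limit quoted above follows from requiring the Zipf probabilities to sum to no more than 1. The sketch below checks this numerically; the function name is hypothetical and the constant 0.1 is the one given on the slide.

```python
def zipf_max_vocabulary(c=0.1):
    """Largest N for which the sum of c/n over n = 1..N does not exceed 1."""
    total = 0.0
    n = 0
    # Keep adding ranks while the next word still fits under a total probability of 1.
    while total + c / (n + 1) <= 1.0:
        n += 1
        total += c / n
    return n, total


size, mass = zipf_max_vocabulary()
print(f"maximum vocabulary: {size} words, total probability {mass:.6f}")
```

Because the harmonic series diverges, some cut-off is unavoidable; with the constant 0.1 it falls where the partial sum of 1/n first approaches 10.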

[Figure: Zipf’s Law (Principle of Least Effort), ~million words, various texts. From “Symbols, Signals & Noise”, J. R. Pierce] Many such analyses have been made (all issues of TIME Magazine, New York Times, ..., Shakespeare's works, etc.) and they all give similar results. Just search Google for “Zipf’s Law”.

Entropy bits/character - English. Amazingly, it turns out to be about the same for most “Natural Languages” for which the analysis has been done (Arabic, French, German, Hebrew, Latin, Spanish, Turkish, ...). These languages also follow Zipf’s Law.

Entropy of Natural Languages
- Between 1 and 2 bits per letter/character; 1.5 bits per letter is commonly used
- English has ~4.5 letters per word on average
- 4.5 x 1.5 = 6.75, or ~7 bits per word on average
- Normal speech: 1 - 2 words per second
- Hence information per second ~5 bits
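
The arithmetic above can be laid out explicitly. The sketch below simply multiplies the figures from the slide across their stated ranges; the function names are hypothetical. The low end of the range (1 bit per letter, 1 word per second) lands near the ~5 bits per second order of magnitude quoted above.

```python
LETTERS_PER_WORD = 4.5  # average for English, from the slide


def bits_per_word(bits_per_letter, letters_per_word=LETTERS_PER_WORD):
    """Average information per word."""
    return bits_per_letter * letters_per_word


def bits_per_second(bits_per_letter, letters_per_word, words_per_second):
    """Average information rate of speech."""
    return bits_per_word(bits_per_letter, letters_per_word) * words_per_second


for bpl in (1.0, 1.5, 2.0):          # entropy range: 1 to 2 bits per letter
    for wps in (1.0, 2.0):           # normal speech: 1 to 2 words per second
        rate = bits_per_second(bpl, LETTERS_PER_WORD, wps)
        print(f"{bpl} bits/letter, {wps} words/s -> {rate:.2f} bits/s")
```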

Extension to Images
- Same concept and definitions: letters are replaced by pixels/groups of pixels, etc.; words could be analogous to sets of pixels, objects
- The numbers are much larger. E.g. a 400 x 600 = 240000 pixel image, with each pixel capable of taking on one of 16 brightness levels, gives 16^240000 possible images
- Assume all these images are equally likely (*): the probability of one of these images is 1/16^240000 and the information provided by that image is 240000 log2 16 = 0.96 x 10^6 bits
- A real image contains much less “information”, since adjacent/nearby pixels are not independent of each other
- Movies: only small/incremental changes from frame to frame
(*) The “equally likely” assumption is clearly not realistic
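
The image figure above can be checked directly. This is a minimal sketch under the same “equally likely” assumption as the slide; the function name is hypothetical.

```python
import math


def uniform_image_bits(width, height, levels):
    """Information in one image if every pixel pattern is equally likely."""
    # log2(levels ** (width * height)) = width * height * log2(levels)
    return width * height * math.log2(levels)


bits = uniform_image_bits(400, 600, 16)
print(f"{bits:.0f} bits = {bits / 1e6:.2f} x 10^6 bits")
```

Each 16-level pixel contributes 4 bits, so this is an upper bound; real images, whose neighbouring pixels are correlated, carry less.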

Speech Coding
- ~5 b/s is the irreducible information content; multiply by 10 to introduce redundancy. Therefore we should be able to communicate speech “information” at ~50 bps
- Examples of speech coding we use: 64000 bps, 32000 bps PC, 16000 bps CVSD, 2400 bps LPC, MELP, 1200, 600 bps MELP
- All of the above are “waveform” codecs; they also convey “non-measurable” (intangible) information
- Speech codec technology based on recognition at the transmitter and synthesis at the receiver could conceivably go lower than 600 bps, but would not contain the intangible component!

A QUICK REFRESHER ON CONVENTIONAL COMPRESSION: may be found at the end

SEQUENCE REDUCTION: Next Generation Compression
- Dictionary based; implements a learning algorithm
- Dynamically learns the “language” of the communications traffic and translates it into “short-hand”
- Continuously updates/improves its “knowledge” of the link “language”
- Frequent patterns move up in the dictionary; infrequent patterns move down and can eventually age out
- No fixed packet or window boundaries, unlike e.g. LZ, which generally uses a 2048 byte window
- Once a pattern is learned and put in the dictionary, it will be compressed wherever it appears
- Data compression is based on previously seen data
- Performance improves with time as “learning” increases: very quickly at first (10 - 20 minutes) and then slowly
- When a new application comes in, SR adapts to its “language”
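
The idea of a dictionary that persists across transmissions can be illustrated with a toy sketch. It is not the Peribit algorithm, which is proprietary: the fixed 64-byte chunks, the `ToyReducer` class and the assumed 4-byte reference size are all hypothetical choices made only for illustration, and there is no ranking or ageing of patterns.

```python
import random

CHUNK = 64  # hypothetical fixed chunk size; real products match variable-length patterns


class ToyReducer:
    """One end of a link. Sender and receiver each hold an instance kept in step."""

    def __init__(self):
        self.chunks = []  # reference id -> chunk bytes
        self.index = {}   # chunk bytes -> reference id

    def _learn(self, chunk):
        self.index[chunk] = len(self.chunks)
        self.chunks.append(chunk)

    def encode(self, data):
        tokens = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            if chunk in self.index:
                tokens.append(("ref", self.index[chunk]))  # seen before: send a reference
            else:
                tokens.append(("raw", chunk))              # new: send it and learn it
                self._learn(chunk)
        return tokens

    def decode(self, tokens):
        parts = []
        for kind, value in tokens:
            if kind == "ref":
                parts.append(self.chunks[value])
            else:
                parts.append(value)
                self._learn(value)  # mirror the sender's dictionary update
        return b"".join(parts)


def wire_size(tokens):
    """Bytes on the link, assuming 4 bytes per reference and 1 framing byte per raw chunk."""
    return sum(4 if kind == "ref" else 1 + len(value) for kind, value in tokens)


rng = random.Random(1)
document = bytes(rng.getrandbits(8) for _ in range(6400))  # incompressible test data
sender, receiver = ToyReducer(), ToyReducer()

first = sender.encode(document)
restored_first = receiver.decode(first)
second = sender.encode(document)   # same file sent again later
restored_second = receiver.decode(second)
print("first send :", wire_size(first), "bytes")
print("second send:", wire_size(second), "bytes")
```

The second transmission consists only of references, however long ago the first one happened, which is the behaviour that distinguishes this approach from a fixed-window compressor.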

MOLECULAR SEQUENCE REDUCTION: relative positioning of statistical and substitutional compression algorithms (from Peribit, A. P. Singh)

“Molecular Sequence Reduction” (www.Peribit.com)

MSR – Technology
- Real time, high speed, low latency
- Continuously learns and updates dictionary
- Transparently operates on all traffic (optimized for IP)
- Eliminates patterns of any size, anywhere in stream
- Patent-pending technology
- Origins in DNA pattern matching
- 3 or 4 conflicting goals:
  - High speed: Cisco works well at <256K links, but as bandwidth increases, performance decreases
  - Patterns spread across large distances: key for data reduction
  - Latency: looking at a broad range of data can create latency in the process of compression
  - IP layer: benefits all applications, rather than working at the application layer and making compression work for only one application
  - Dictionary: auto-populates and doesn’t age

MSR – Molecular Sequence Reduction: “Next-gen dictionary-based compression” (www.peribit.com)

Government/Military use examples
- Many thousands of units in use in the USA (mostly corporate, but also government agencies)
- GE MOD has been using Peribit SRs for ~2 years:
  - INMARSAT German Navy WAN (encrypted)
  - Links to GE Navy ships in/around South Africa
  - Satellite links to GE units in Afghanistan
  - Plans for some 64 Kbps landlines
  - GE MOD total: 300+ units
- Also other nations, some with initial trials

GE Armed Forces Results: reduction rates observed (reduced by the % amount given)
Columns: Traffic type, Version 3.0, V 4.02, V 5.0
HTTP: 30 %, 40 %, 46 %
MAIL: 61 %, 67 %
NetBios: 59 %, 62 %
CIFS: 92 %
FTP: 69 %, 73 %
TELNET: 65 %, 93 %
CIFS: Common Internet File System, “Microsoft's way of doing network file sharing”. All MS operating systems have had some form of CIFS networking available or built in, and there are implementations of CIFS for most major non-MS operating systems. CIFS allows the sharing of directories, files, printers, and other cool computer stuff across a network.

[Figure: CIFS (Common Internet File System) traffic. From German MOD]

[Figure: Startup behavior example. From German MOD]

[Figure: From German MOD]

[Figure: From German MOD]

[Figure: From Peribit.com (not GE MOD data)]

NC3A – WAN (NL – BE): Peribit (screen capture)
- Real-life test results with typical IP traffic between NC3A-NL and NC3A-BE: impact of lossless data compression on TCP/IP traffic across a 2048 kbps terrestrial link
- Compared: no data compression and no reduction, versus with data compression and reduction
- Data reduction: 64.34 %
- Effective WAN capacity increased by a factor of 2.80
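
The two headline numbers above are related: if a fraction of the bytes is removed, the same link carries proportionally more user data. A minimal sketch of that relation follows; the function name is hypothetical.

```python
def capacity_factor(reduction_percent):
    """Effective capacity multiplier for a given data reduction percentage."""
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 100.0 / (100.0 - reduction_percent)


# 64.34 % is the NC3A figure above; the others are example values from the
# German MOD results quoted earlier in the presentation.
for pct in (30, 46, 64.34, 92):
    print(f"{pct:>6} % reduction -> x{capacity_factor(pct):.2f} effective capacity")
```

The relation is strongly non-linear: going from 46 % to 92 % reduction raises the multiplier from under 2 to more than 12.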

[Figure: Real-life test results with typical IP traffic between NC3A-NL and NC3A-BE. Impact of lossless data compression on TCP/IP traffic across a 2048 kbps terrestrial link]

Peribit Sequence Reducers (www.peribit.com)

NC3A TEST RESULT SUMMARY: Expand Model 4800 “WAN Link Accelerators”, 512 kbps satellite link, multiplexed TCP/IP. Cases shown: link with SCPS-TP acceleration; link with application accelerator & IP data compressor; un-accelerated link

NC3A TEST RESULT SUMMARY: 512 kbps satellite link, multiplexed TCP/IP. Cases shown: link with SCPS-TP acceleration; link with application accelerator & IP data compressor; un-accelerated link

512 Kbps satellite link, 10 multiplexed TCP/IP sessions. Cases shown: link with SCPS-TP acceleration; link with application accelerator & IP data compressor; un-accelerated link

Packeteer

Industry
- New area, but with many and an increasing number of companies: Peribit.com (now Juniper Networks), Expand.com (Expand Networks), Packeteer.com, Riverbed.com, Silver-peak.com, ...
- National authorities (e.g. USA & GE) are also working with industry to incorporate SR/WX technology into national crypto devices

SEQUENCE REDUCTION: Next Generation Compression - Summary (1)
- WANs will form the backbone of Network Enabled Operation
- This technology provides significant improvements in capacity
- Dictionary based; implements a learning algorithm
- Dynamically learns the “language” of the communications traffic and translates it into “short-hand”
- Continuously updates/improves its “knowledge” of the link “language”
- Frequent patterns move up in the dictionary; infrequent patterns move down and can eventually age out
- No fixed packet or window boundaries, unlike conventional compression, which operates over 1-2 Kbytes
- Once a pattern is learned and put in the dictionary, it will be compressed wherever it appears
- Data compression is based on previously seen data
- Performance improves with time as “learning” increases: very quickly at first (10 - 20 minutes) and then slowly
- When a new application comes in, SR adapts to its “language”

SEQUENCE REDUCTION: Next Generation Compression - Summary (2)
- Significant advantages for WANs where capacity is an issue (i.e. deployed/mobile/tactical)
- Removes redundant/repetitive transmissions
- Packet-flow acceleration (latency removal) can easily be added
- Quality of Service & Policy Based Multipath can also be implemented
- Does not impact security implementations (cryptos between SRs)
- However: presently available from only a few sources, each with its own “proprietary” technology
- Useful for NNEC and GIG implementations in the coming years
- The proprietary nature of the product is an issue that needs to be considered

Conclusions
- Shannon Information Theory provides tools for measuring “information” as “Entropy”
- It has formed the basis for most of the coding and data transmission/detection results since the 1950s
- The DNA/genome mapping process has also apparently benefited from it: in the 90s the estimate for the human genome was 20-30 years; it took 2-3 years with the computational developments of the late 90s
- A new form of compression, “Sequence Reduction”, provides significant reductions by reducing redundancies in transmitted data
- It will provide important advantages for mobile/deployable/moving WAN link applications

Questions / Comments. This presentation & associated paper can be found at www.nc3a.info/MCC2006

NC3A
NC3A Brussels: Visiting address: Bâtiment Z, Avenue du Bourget 140, B-1110 Brussels. Telephone +32 (0)2 7074111, Fax +32 (0)2 7078770. Postal address: NATO C3 Agency, Boulevard Leopold III, B-1110 Brussels, Belgium
NC3A The Hague: Oude Waalsdorperweg 61, 2597 AK The Hague. Telephone +31 (0)70 3743000, Fax +31 (0)70 3743239. Postal address: NATO C3 Agency, P.O. Box 174, 2501 CD The Hague, The Netherlands

Markov model examples

Zeroth approximation to English (zero memory)
[Zero order Markov: equally likely letters, 27 numbers]
AZEWRTZYNSADXESYJRQY_WGECIJJ_OB _KRBQPOZB_YMBUAWVLBTQCNIKFMP_KMVUUGBSAXHLHSIE_MAULEXJ_NATSKI
All logs base 2
Entropy = Σ p_i log (1/p_i) for i = 1 to 27 = log 27 = 4.75 bits / letter (or symbol)
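
The entropy sum above is easy to evaluate numerically. The sketch below is illustrative only: the function names and the sample string are hypothetical, and a short sample gives only a rough estimate of letter probabilities.

```python
import math
from collections import Counter


def entropy(probs):
    """Entropy in bits: sum of p * log2(1/p) over non-zero probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)


def letter_entropy(text):
    """Zero-memory entropy estimated from the letter frequencies of text."""
    counts = Counter(text)
    total = sum(counts.values())
    return entropy(n / total for n in counts.values())


# 27 equally likely symbols (26 letters plus space) give log2(27) bits per symbol.
uniform = [1.0 / 27] * 27
print(f"equally likely symbols: {entropy(uniform):.2f} bits / symbol")

# Unequal letter frequencies can only lower the figure.
sample = "THE_QUICK_BROWN_FOX_JUMPS_OVER_THE_LAZY_DOG"
print(f"sample text estimate  : {letter_entropy(sample):.2f} bits / symbol")
```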

First approximation to English (zero memory)
[Zero order Markov: letter probabilities, 27 numbers]
AI_NGAE__ITF__NR_ASAEV_OIE_BAINTHHHYROO_POER_SETRYGAIETRWCO__ EHDUARU_ EU_C_FT_NSREM_DIY_EESE_ F_O_SRIS_R __UNNASHOR_CIE_AT_XEOIT_UTKLOOUL_E
Entropy = Σ p_i log (1/p_i) for i = 1 to 27 = ~4 bits / letter

Second approximation to English (memory)
[First order Markov: e.g. prob(a|a), prob(b|a), prob(c|a), ...; 27 x 27 = 729 numbers, some zero]
URTESHETHING_AD_E AT_FOULE_ ITHALIORT_WACT_D_STE_MINTSAN_OLINS__TWID_OULY_TE_THIGHE_CO_YS_TH_HR_ UPAVIDE_PAD_CTAVED_QUES_E
Entropy = Σ p_{i,k} log (1/p_{i|k}), summed over the 729 (= 27 x 27) letter pairs, = ~3.3 bits / letter
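
The first order Markov entropy above weights the information of each conditional probability by the joint probability of the pair. A minimal sketch of that estimate from adjacent character pairs is shown below; the function name is hypothetical, and a realistic estimate would need a far larger text than these toy strings.

```python
import math
from collections import Counter


def conditional_entropy(text):
    """H(next | previous) in bits, estimated from adjacent character pairs."""
    pairs = Counter(zip(text, text[1:]))  # counts of (previous k, next i)
    previous = Counter(text[:-1])         # counts of k as a preceding character
    total = sum(pairs.values())
    h = 0.0
    for (k, i), n in pairs.items():
        p_joint = n / total               # p(i, k)
        p_cond = n / previous[k]          # p(i | k)
        h += p_joint * math.log2(1.0 / p_cond)
    return h


# A strictly alternating text is fully predictable from the previous letter.
print(conditional_entropy("ABABABAB"))
# A less regular text leaves some uncertainty about the next letter.
print(conditional_entropy("AABB"))
```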

Third approximation to English (memory)
[Second order Markov: e.g. prob(a|aa), prob(a|ab), prob(a|ac), ..., prob(z|zy), prob(z|zz); 27 x 27 x 27 = 19683 numbers, ~75% zero] (Shannon calls these “di-gram probabilities”)
IANKS _CAN_OU_ANG_RLER_THATTED _OF_TO_SHOR_OF_TO_HAVEMEM_A_I_MAND_AND_BUT_WHISSITABLY_THERVEREER_EIGHTS_TAKILLIS_TA_KIND_AL
Entropy: ~3 bits / letter

Third approximation to French
JOU_MOUPLAS_DE_MONNERNAISSAINS_DEME_US_VREH_BRETU_DE_TOUCHEUR_DIMMERE_LLES_MAR_ELAME_RE_A_VER_IL_DOUVENTS_SO_FUITE
(N. Abramson, “Information Theory & Coding”)

Third approximation to ????
ET_LIGERCUM_SITECI_LIBEMUS_ACERELEN_TE_VICAESCERUM_PE_NON_SUM_MINUS_UTERNE_UT_IN_ARION_POPOMIN_SE_INQUENEQUE_IRA
(N. Abramson, “Information Theory & Coding”)

WE COULD CONTINUE THIS WITH CONDITIONAL PROBABILITIES GIVEN TRIPLETS (tri-grams), QUADRUPLETS (tetra-grams), ... n-grams, etc. (i.e. mth ORDER MARKOV SOURCES, m ≥ 3). HOWEVER, THIS BECOMES IMPRACTICAL AS THE NUMBER OF JOINT PROBABILITIES BECOMES TOO LARGE, SO SHANNON JUMPED TO MARKOV SOURCES WITH WORDS AS SYMBOLS: the symbol set is no longer 27 characters, but thousands of words. However, the m = 1, 2 Markov model gives much better results than n-gram analysis as “n” is increased.

Fourth approximation to English
[Zero order Markov with words: e.g. probability of words, zero memory]
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE …
Entropy = ~2.2 bits / letter (using Zipf’s Law) (Shannon 1948)

Fifth approximation to English (memory)
[First order Markov with words: e.g. Probability (word_i | word_j)]
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN…
(Shannon 1948)

Fifth approximation to Turkish (memory)
[First order Markov with words: e.g. Probability (word_i | word_j)]
BIR ANLATTIKLARINA GÜLMECE YAZDI YAPITLARININ ŞARAP BİÇİMLERİ BELA GÖRÜNÜMÜ GİBİ AMA BİR ETMEK YOK TUTULDU GELEN GİDEN YER KALMADI ...

A QUICK REFRESHER ON CONVENTIONAL COMPRESSION

Conventional Compression
- Lossy compression: the output is not necessarily a copy of the input. Most audio, image and video compression algorithms are “lossy”; our ears and eyes have resolution thresholds.
- Loss-less compression: data integrity is essential in digital data communications, so network compression must be “loss-less”.
- Two basic approaches: statistical compression algorithms and substitutional compression algorithms

Statistical compression
- Probabilities of characters in the input data are calculated (or given); frequently occurring characters are encoded into fewer bits [e.g. Huffman code, Morse code]
- For example, e, t, a are frequently occurring characters in English whereas x, z are very infrequent. However, ASCII encodes all with 8 bits; a statistical coding technique would, for example, encode “e” with 3 bits and “z” with 10 bits, etc.
- Static coding: once the coding is determined in accordance with the probabilities of occurrence, it does not change
- Dynamic coding: the coding changes with “context”. For example, the occurrence of “q” in English increases the probability of occurrence of “u” to 1; similarly, the occurrence of “th” significantly increases the probability of occurrence of “e”, etc.
- As the amount of “historical context” information increases, “dynamic coding” techniques can approach the “Shannon limit”; however, computational requirements increase exponentially, making them impractical for real-time/on-line applications
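
Static statistical coding can be illustrated with a small Huffman code builder. This is a sketch only: the function name and the made-up letter counts are hypothetical, and it returns code lengths rather than the bit patterns themselves.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a static Huffman code built from text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap items: (weight, unique tie-breaker, {symbol: depth in the code tree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Made-up counts: frequent letters "e", "t", "a" and rare letters "x", "z".
text = "e" * 50 + "t" * 30 + "a" * 15 + "x" * 3 + "z" * 2
lengths = huffman_code_lengths(text)
coded_bits = sum(lengths[ch] for ch in text)
print(lengths)
print(f"{coded_bits} bits with Huffman coding versus {8 * len(text)} bits at 8 bits per character")
```

Frequent symbols end up near the root of the tree and receive short codes, which is the effect described above for “e” versus “z”.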

Substitutional compression
- Identifies repeated strings of characters (the longer the better) and replaces them with reference identifiers or tokens (the shorter the better). At the receiver the tokens are de-referenced and the reverse substitution is performed.
- Essentially a form of “pattern recognition” and classification
- Pattern detection/recognition is generally much faster than the computations needed for dynamic coding algorithms
- Most network compression techniques in use today use substitutional compression
- Compression techniques can also be combined, for example substitution-based compression followed by static coding, etc.

“Substitution” based compression is the basis of almost all network compression implementations
- Principle of all: replace repeated patterns with shorter tokens
- Different techniques exist for detecting/encoding repeated patterns; two basic approaches:
  - Lempel-Ziv (LZ) “stateless” window compression, e.g. v.42bis, fax compression, LZS (STAC)
  - Predictor compression: tries to predict the next input byte. The matching algorithm looks for the most recent match of any pattern rather than the best and longest match; this gives higher speed but misses many significant pattern repetitions and therefore gives lower data reduction (not much used)

Lempel-Ziv (LZ) “stateless” window compression
- Published in 1977 (hence LZ77)
- Basis of ~all loss-less data compression implementations today
- Repeated “strings” are replaced by “pointers” to the previous location where the string occurred
- A buffer or “window” is required for the “historical” information to be available for reference, typically 1000 - 2000 bytes (mostly 2048 bytes)
- All previous data outside the buffer/window is lost or “forgotten”, hence the name “stateless” or memory-less
- Can find and compress only patterns that are repeated within the window; repetitions separated by more than the window size are ignored
- Poor scalability: a large window size is required for compression efficiency, but this increases pattern search computation significantly
- Good for “file compression” type applications
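
The window limitation listed above can be observed with Python's standard `zlib` module, used here only as a convenient LZ77-family compressor. Note the assumption: zlib (DEFLATE) uses a window of up to 32 KiB rather than the 2048 bytes quoted above, so the block sizes below are chosen around that figure. The helper names are hypothetical.

```python
import random
import zlib

rng = random.Random(2006)  # fixed seed so the run is repeatable


def random_block(n):
    """n bytes of pseudo-random data, which is incompressible on its own."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def compressed_ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


near = random_block(10_000) * 2  # the repeat lies 10 000 bytes back: inside the window
far = random_block(40_000) * 2   # the repeat lies 40 000 bytes back: outside the window

print(f"repeat inside window : ratio {compressed_ratio(near):.3f}")
print(f"repeat outside window: ratio {compressed_ratio(far):.3f}")
```

Both inputs are exactly 50 % redundant, but only the repetition that falls inside the window is found; the other is transmitted as if it were new data.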


Nov 1978, University of Pennsylvania, Museum Hall: banquet in honor of Claude E. Shannon receiving the H. Pender award (Prof. F. Haber & DY)