Plain Text ASCII (American Standard Code for Information Interchange) - basic English alphabet character encoding UTF-8 (Universal Character Set Transformation.


1 Plain Text
ASCII (American Standard Code for Information Interchange)
- basic English alphabet character encoding
UTF-8 (Universal Character Set Transformation Format, 8-bit)
- variable width encoding
- compatible with ASCII (first 128 characters)
- can represent every Unicode character
- increasingly being used as the default character encoding
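
The points on this slide can be checked with a minimal sketch that uses only Python's built-in `str.encode`. The sample characters are chosen here for illustration and are not from the slides; the variable names are likewise assumptions.

```python
# Sketch: how ASCII and UTF-8 relate, using only built-in str/bytes methods.
# The sample characters are illustrative choices, not taken from the slides.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

# UTF-8 is variable width: each character takes between 1 and 4 bytes.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]
for ch, width in zip(samples, utf8_widths):
    print(f"U+{ord(ch):04X}: {width} byte(s) in UTF-8")

# For text limited to the first 128 characters, ASCII and UTF-8 bytes match.
text = "Plain Text"
same_bytes = text.encode("ascii") == text.encode("utf-8")

# Characters outside the first 128 cannot be encoded as ASCII at all.
try:
    "\u00e9".encode("ascii")
    ascii_can_encode_accent = True
except UnicodeEncodeError:
    ascii_can_encode_accent = False

print(same_bytes, ascii_can_encode_accent)
```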

2 Plain Text… example
Remember, I am not recording the vision of a madman. The sun does not more certainly shine in the heavens than that which I now affirm is true. Some miracle might have produced it, yet the stages of the discovery were distinct and probable. After days and nights of incredible labour and fatigue, I succeeded in discovering the cause of generation and life; nay, more, I became myself capable of bestowing animation upon lifeless matter. The astonishment which I had at first experienced on this discovery soon gave place to delight and rapture. After so much time spent in painful labour, to arrive at once at the summit of my desires was the most gratifying consummation of my toils. But this discovery was so great and overwhelming that all the steps by which I had been progressively led to it were obliterated, and I beheld only the result.
From: Project Gutenberg's Frankenstein, by Mary Wollstonecraft Shelley (plain text version)

3 Plain Text… typical questions
linguistic analysis:
- analysing word frequencies
- finding and analysing phrases
- creating indexes and word lists
- listing every instance of each principal word with its immediate context (concordance)
- analysing a writer's style: word, sentence, and paragraph length
text analytics:
- named entity recognition
- recognition of pattern-identified entities
- sentiment analysis
- …
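
Two of the linguistic questions above, word frequencies and a concordance, can be sketched with the standard library. The tokenizer (lowercase runs of letters and apostrophes) and the function names are simplifying assumptions made here, not a method described in the slides; the sample text is two sentences from the Frankenstein passage on slide 2.

```python
import re
from collections import Counter

# Two sentences from the plain-text example on slide 2.
text = (
    "After so much time spent in painful labour, to arrive at once at the "
    "summit of my desires was the most gratifying consummation of my toils. "
    "But this discovery was so great and overwhelming that all the steps by "
    "which I had been progressively led to it were obliterated, and I beheld "
    "only the result."
)

def tokenize(s):
    """Assumed tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", s.lower())

def word_frequencies(s):
    """Count how often each token occurs."""
    return Counter(tokenize(s))

def concordance(s, keyword, width=2):
    """List each occurrence of keyword with `width` words of context per side."""
    words = tokenize(s)
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits

freqs = word_frequencies(text)
print(freqs.most_common(3))
for left, word, right in concordance(text, "discovery"):
    print(f"{left} [{word}] {right}")
```

A real analysis would need a more careful tokenizer (hyphens, numbers, accented letters), which this sketch leaves out.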

4 Rich Text … text with mark up
I continued walking in this manner for some time, endeavouring by bodily exercise to ease the load that weighed upon my mind. I traversed the streets without any clear conception of where I was or what I was doing. My heart palpitated in the sickness of fear, and I hurried on with irregular steps, not daring to look about me:
Like one who, on a lonely road,
Doth walk in fear and dread,
And, having once turned round, walks on,
And turns no more his head;
Because he knows a frightful fiend
Doth close behind him tread.
[Coleridge's "Ancient Mariner."]
From: Project Gutenberg's Frankenstein, by Mary Wollstonecraft Shelley (html version)

5 Rich Text … typical questions
retrieving information from rich text
- recovering plain text from rich text
- generating an outline, a table of contents, or an index from rich text
- retrieving document metadata
merging information into marked-up text
- mail merge to personalize a stock letter
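
Two of these tasks, recovering plain text and a mail merge, can be sketched with Python's standard `html.parser` and `string.Template`. The HTML fragment, the stock letter, and the recipient records below are hypothetical examples written for this sketch; the slides do not show the actual markup or any letter.

```python
from html.parser import HTMLParser
from string import Template

class TextExtractor(HTMLParser):
    """Collects character data and ignores the tags around it."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Join the pieces, then collapse runs of whitespace.
        return " ".join("".join(self.parts).split())

# Hypothetical markup around one sentence of the slide 4 passage.
html_fragment = (
    "<p>My heart palpitated in the <i>sickness of fear</i>, and I hurried on "
    "with irregular steps.</p>"
)
extractor = TextExtractor()
extractor.feed(html_fragment)
extractor.close()
plain = extractor.text()
print(plain)

# Mail merge: fill a stock letter from per-recipient records (made-up data).
letter = Template("Dear $name, your copy of $title is ready.")
recipients = [
    {"name": "Ada", "title": "Frankenstein"},
    {"name": "Grace", "title": "The Rime of the Ancient Mariner"},
]
merged = [letter.substitute(record) for record in recipients]
for line in merged:
    print(line)
```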

6 Data stored in delimited format
Values in a line are separated by a delimiter character.
space delimited files (cereal.txt data file)
100%_Bran N C 70 4 1 130 10 5 6 3 280 25 1 0.33 11
100%_Natural_Bran Q C 120 3 5 15 2 8 8 3 135 0 1 -1 16
CSV – comma separated values
100%_Bran,N,C,70,4,1,130,10,5,6,3,280,25,1,0.33,11,
100%_Natural_Bran,Q,C,120,3,5,15,2,8,8,3,135,0,1,-1,16,
TSV – tab separated values
100%_Bran	N	C	70	4	1	130	10	5	6	3	280	25	1	0.33	11
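
A minimal sketch of reading the same record in all three delimited forms with Python's standard `csv` module. The helper name `parse_line` is an assumption, as is the decision to drop the empty field produced by the trailing comma in the CSV rows; the TSV line is rebuilt here from the space-delimited one.

```python
import csv
import io

# The first cereal record from the slide, in each delimited form.
space_line = "100%_Bran N C 70 4 1 130 10 5 6 3 280 25 1 0.33 11"
csv_line = "100%_Bran,N,C,70,4,1,130,10,5,6,3,280,25,1,0.33,11,"
tsv_line = space_line.replace(" ", "\t")  # rebuilt with real tab characters

def parse_line(line, delimiter):
    """Split one delimited line into a list of string values."""
    row = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    # The CSV rows on the slide end with a comma, which yields an empty
    # final field; dropping it is a choice made for this sketch.
    if row and row[-1] == "":
        row = row[:-1]
    return row

rows = [
    parse_line(space_line, " "),
    parse_line(csv_line, ","),
    parse_line(tsv_line, "\t"),
]
all_equal = rows[0] == rows[1] == rows[2]
print(len(rows[0]), all_equal)
```

All values come back as strings; converting the numeric columns is a separate step.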

7 Data stored in delimited format
CSV formatted files:
- a widely used format for moving tabular data between programs that use proprietary formats, e.g. from a database to a spreadsheet
1. data in plain text
2. each line contains values for the same fields
3. values in a line are separated by commas
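
Property 2, that each line holds values for the same fields, can be checked mechanically before data is handed to another program. The sketch below uses the two CSV rows shown in the slides and simply compares field counts; the variable names are assumptions.

```python
import csv
import io

# The two CSV rows shown in the slides, as they would appear in a file.
data = (
    "100%_Bran,N,C,70,4,1,130,10,5,6,3,280,25,1,0.33,11,\n"
    "100%_Natural_Bran,Q,C,120,3,5,15,2,8,8,3,135,0,1,-1,16,\n"
)

rows = list(csv.reader(io.StringIO(data)))

# Every row should have the same number of fields. The trailing comma on
# each line counts as one extra (empty) field.
field_counts = {len(row) for row in rows}
consistent = len(field_counts) == 1
print(field_counts, consistent)
```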

8 Data stored in delimited format
CSV – comma separated values
100%_Bran,N,C,70,4,1,130,10,5,6,3,280,25,1,0.33,11,
100%_Natural_Bran,Q,C,120,3,5,15,2,8,8,3,135,0,1,-1,16,

9 Data stored in delimited format
- Name: name of cereal
- mfr: manufacturer of cereal, where A = American Home Food Products; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
- type: cold or hot
- calories: calories per serving
- shelf: display shelf (1, 2, or 3, counting from the floor)
- weight: weight in ounces of one serving
- cups: number of cups in one serving
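
The manufacturer codes above can be turned into a lookup table. This sketch assumes the first four columns of each record are name, mfr, type, and calories, which matches the order of the field list and the sample rows but is not stated outright in the slides; the remaining columns are left alone because their order is not given. The function name `describe` is also an assumption.

```python
# Manufacturer codes as listed on the slide.
MANUFACTURERS = {
    "A": "American Home Food Products",
    "G": "General Mills",
    "K": "Kelloggs",
    "N": "Nabisco",
    "P": "Post",
    "Q": "Quaker Oats",
    "R": "Ralston Purina",
}

def describe(line):
    """Decode the first four fields of a space delimited cereal record.

    Assumes the column order name, mfr, type, calories (not confirmed
    by the slides).
    """
    name, mfr, _cereal_type, calories = line.split()[:4]
    return {
        "name": name.replace("_", " "),  # underscores stand in for spaces
        "manufacturer": MANUFACTURERS.get(mfr, "unknown"),
        "calories": int(calories),
    }

record = describe("100%_Bran N C 70 4 1 130 10 5 6 3 280 25 1 0.33 11")
print(record)
```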

