LIS650 lecture 1: Major HTML. Thomas Krichel, 2004-10-02
structure It's not just about HTML –web –web server –markup –XML –HTML Fairly general but abstract Probably the toughest lecture in the course
literature I work from the text of the official standard at http://www.w3.org/TR/html4/ To work with it faster, I made a copy at http://wotan.liu.edu/~krichel/html4/ You can work from any HTML book. The W3C is the standard making body for the Web. Anything that they say is the standard. But some people don't behave according to the standard.
The world wide web The World Wide Web (Web) is a network of information resources. The Web relies on three mechanisms to make these resources readily available to the widest possible audience: –A uniform naming scheme for locating resources on the Web (i.e. URIs). –Protocols, for access to named resources over the Internet (e.g., http). –Hypertext, for easy navigation among resources (e.g., HTML).
URI introduction Every resource available on the Web -- HTML document, image, video clip, program, etc. -- has an address that may be encoded by a Uniform Resource Identifier, or "URI". URIs typically consist of three pieces: –The name of the mechanism used to access the resource or otherwise resolve it. –The name of the machine hosting the resource. –The name of the resource itself, given as a path.
example URI http://openlib.org/home/krichel This URI may be read as follows: There is a document available via the HTTP protocol, residing on the site openlib.org, accessible via the path "/home/krichel". mailto:krichel@openlib.org This URI may be read as follows: There is an email user krichel in the domain openlib.org to whom email may be sent.
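The three pieces of the first example URI can be pulled apart with the urllib.parse module from Python's standard library. This is only a sketch for illustration; the variable names are assumptions and not part of the lecture.

```python
from urllib.parse import urlsplit

# Split the example URI from the slide into its three pieces.
parts = urlsplit("http://openlib.org/home/krichel")

mechanism = parts.scheme   # the mechanism used to access the resource
machine = parts.netloc     # the machine hosting the resource
path = parts.path          # the name of the resource, given as a path

print(mechanism, machine, path)
```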
Internet application protocols On the Internet, machines use different application-level protocols to do things. Common protocols include –http –dns –telnet –smtp –ssh –ftp All of the ones cited are client/server protocols –client issues a request –server gives a response Each of them uses a different port. A port is a number that tells the machine what to do with the incoming stream of data.
http The web operates mostly on http, the hypertext transfer protocol. The client software is run on the local PC that you are using, called –a web browser (not politically correct) –a user agent (that's better) Our server is a piece of hardware called wotan.liu.edu, wotan for short –It runs the Debian GNU/Linux operating system on an Intel architecture. –It provides http daemon software that serves http requests. The particular software is called Apache.
main features of http http is insecure: the contents of http transactions (requests/responses) can be observed. http is stateless: each transaction is self-contained and has no relationship to the previous one. http has a limited vocabulary of requests and responses. It is no good, say, for operating a machine remotely. We therefore cannot use it to issue commands to the server or to manage our files there.
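To make the "limited vocabulary" concrete, here is roughly what a user agent sends when it asks for a page. This is a minimal sketch, assuming a GET request under HTTP/1.1 for the hypothetical path /~user/file; the function name is an assumption, and nothing is sent over the network.

```python
def build_get_request(host: str, path: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request (illustrative only)."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # HTTP/1.1 requires the Host header
        "Connection: close",     # ask the server to close after this response
        "",                      # an empty line ends the headers
        "",
    ]
    return "\r\n".join(lines)

request = build_get_request("wotan.liu.edu", "/~user/file")
print(request)
```

Everything the server needs is in this one message. That is what stateless means: the next request has to carry all it needs again.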
working with a remote machine There are two traditional ways to work with a remote machine –issue commands to it; this used to be done with telnet –transfer files to and from it; this used to be done with ftp Telnet and ftp servers are not available on wotan.liu.edu. Telnet and ftp do not encrypt the communication stream. Therefore they are not secure.
communication with wotan The protocol that we use for communicating with the server is the secure shell, ssh for short. It is based on public-key cryptography. There are two PC programs commonly used as ssh clients –putty for issuing commands –winscp for file transfer. winscp is the one we will use. It offers a range of other facilities besides file transfer. Mac users should investigate a program called fugu.
registration time As part of the course, you are being provided with web space on the server wotan.liu.edu, at the URL http://wotan.liu.edu/~username where username is a user name that you will choose now. It is my intention to maintain this web space for you into the foreseeable future. You should also choose a password, now. I will now register you.
free software I maintain the wotan.liu.edu server but you can build your own server if –you have Internet access –you have an old PC to spare All the server software, as well as putty and winscp, is free and open-source. It is one of my fundamental beliefs that free information should run on free software. The library community can learn a hell of a lot from the free software community. See my talk at http://openlib.org/home/krichel/presentations/new_york_2003-11-07.ppt
installing winscp http://winscp.sourceforge.net/eng/download.php has –an installation package, for use if you have administrator rights on the machine where you are installing –an application, for use otherwise, i.e. to just download and run the application At installation time, when/if asked about the default interface, I suggest you use the Windows Explorer style, rather than the default Norton Commander style. You can change that later, so no panic.
other stuff: installing user agents Download and install a recent version of at least two browsers. I suggest –Mozilla Firefox at http://www.mozilla.org/products/firefox/ –Netscape Navigator at http://channels.netscape.com/ns/browsers/download.jsp –Opera at http://www.opera.com
open a wotan session –start winscp –the host name is wotan.liu.edu –give your user name –click on save; this will save the session –after ok you will be led to the list of saved sessions –double click to open the session Note: –you can save the password as part of the session –it is risky to do that in a public classroom
initial remote files on wotan a set of files starting with a dot. –These are places where Linux Masters exert their black magic. –Leave them alone. a directory called public_html –This is the place where web masters exert their magic. –You can go into that directory to see the files that you have on your web site at the moment. –There should be two files: empty.html and validated.html
public_html Imagine your user name is user and you have a file called file in public_html. The web server will map requests to http://wotan.liu.edu/~user/file to show the file public_html/file. Here user stands for your user id, file is the file name, and / is the directory separator. If the name of the file ends with .html or .htm, the web browser will be told that the file is an HTML file. It will be rendered accordingly by the browser.
index.html The web server on wotan will map requests to http://wotan.liu.edu/~user to show the file public_html/index.html If this file is not there, the server will prepare an HTML document from the list of files that it finds in the directory and send it to the user agent. Once you have a file index.html, the web user can no longer see that list of the files in your directory.
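The mapping from request to file can be written down as a small function. This is a simplified sketch of the rule on these two slides, not the web server's actual code; the function name and the ~user notation for the home directory are assumptions, and the fallback to a generated file listing is left out.

```python
def map_request_path(url_path: str) -> str:
    """Map a URL path such as /~user/file to a file under public_html."""
    if not url_path.startswith("/~"):
        raise ValueError("only /~user/... paths are handled in this sketch")
    # Everything up to the first / after the tilde is the user name.
    user, _, rest = url_path[2:].partition("/")
    if rest == "":
        # No file named in the request: fall back to index.html.
        rest = "index.html"
    return f"~{user}/public_html/{rest}"

print(map_request_path("/~user/file"))
print(map_request_path("/~user"))
```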
HTML and XHTML HTML, the hypertext markup language, is a markup language that is widely used on the Web. The latest, and probably last, version of HTML is at http://www.w3.org/TR/html4/ The W3C, the standards-making body for the Web, has issued XHTML, a replacement for HTML that is compatible with XML. We will work with XHTML.
SGML HTML XML You will probably have come across these terms. SGML was developed first. HTML and XML are developed from SGML in different ways. –HTML is an SGML application, defined by an SGML DTD –XML is a simplified subset of SGML One common thing here is the ML. It stands for Markup Language. Markup is everything in a document that is not content. (something to scratch your head about)
procedural/descriptive Markup can be given in two ways 1: Procedural –Codes identify point size, style, font, etc. –Usually only understood by defining tool –Example: Microsoft Word 2: Descriptive –Describes purpose of text within the document –Chapter head, Paragraph, Section Head, TOC –Structure and Style are kept separate –Example: LaTeX, SGML
SGML Standard Generalized Markup Language Descriptive approach with three separate layers –structure: types of information in document –content: the information itself –style: defines how to typeset the document Developed for the publishing industry by a group around Goldfarb. So complicated that no software implements it fully. But an important idea that remains of it is the document type definition.
Document Type Definition (DTD) The DTD is a non-SGML language that describes SGML. Describes information the document handles, e.g. –title –chapter Relationships between fields e.g. –a chapter contains sections Consistency and logical structure
XML Since SGML is so complicated, it is not good for use on the Web. So the W3C has issued XML, the eXtensible markup language. Every XML document is SGML, but not the opposite. Thus XML is like SGML but with many features removed.
XML elements XML is based on elements. There are basically three ways of writing an element. The first way is to write <name/> Here name is the name of the element. Example: –<bang/> Such an element is called an empty element. Here its name is bang.
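non-empty elements If name is the name of the element, you can give the element contents by writing <name>contents</name>. Examples (the element names are only illustrations): –<greeting>bonjour</greeting> –<greeting>здравствуйте</greeting> –<sentence>She says <greeting>hello</greeting> to you.</sentence> In fact <name/> is just a shortcut for <name></name>.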
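attributes to elements Elements can have attributes. Here is an element with two attributes: <name attribute_name_one="value_one" attribute_name_two="value_two"/> Here attribute_name_one and attribute_name_two are attribute names and value_one and value_two are attribute values. The element itself is empty. Example (the names are only illustrations): <greeting language="french">bonjour</greeting>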
more on attributes An element cannot have two attributes with the same name. Attribute values are simple strings. You cannot have an element inside an attribute value. Attribute names are separated from their values by the = sign. Attribute values can be enclosed in single or double quotes. It does not matter. Double quotes are more common, so I suggest you use those.
XML document An XML document is a piece of data that is written in XML. But sometimes the author of a document makes a mistake, and the XML is in fact wrong in some way. If there is no mistake, the document is called well-formed. If a document is not well-formed, it really is not an XML document.
some rules for well-formedness There must be exactly one outermost element in the document. –It is called the root element. –It may be preceded by a prolog (stuff before the root element). –All other elements are contained in the root. –Whitespace that surrounds the root element is ignored. All elements must be properly nested. You can only close the outer element after all inner elements are closed. Examples –<a><b></a></b> is not well-formed –<a><b></b></a> is well-formed
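These rules can be tried out with any XML parser. The sketch below uses xml.etree.ElementTree from Python's standard library; the helper name is_well_formed and the element names a and b are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as XML, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b></b></a>"))    # properly nested
print(is_well_formed("<a><b></a></b>"))    # outer element closed before the inner one
print(is_well_formed("<a/><b/>"))          # no single root element
print(is_well_formed('<a x="1" x="2"/>'))  # the same attribute name used twice
```

Only the first of the four strings is accepted. A document that fails here is, in the sense of the previous slide, not an XML document at all.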
other stuff: comments In an XML document, you can make comments about your code. These are notes to yourself. Comments start with <!-- Comments end with --> Example: <!-- this is a comment --> Comments cannot be nested. They can appear between other markup, but not inside a tag.
other stuff: XML declaration The XML declaration is a special line that says that what follows is XML and gives some very basic information about that XML. It is trendy to use it. It is optional, but if it is there it has to be on the first line. You will need to have an XML declaration if your character encoding is not UTF-8. We will come back to this point later.
other stuff: XML declaration Normally the XML declaration looks like <?xml version="1.0" encoding="encoding"?> where encoding is the character encoding. By default, the character encoding is UTF-8, so if you use that, you do not need to mention it. There is now a version 1.1 of XML around, but –it is not widely deployed –it is not much different from version 1.0
other stuff: document type declaration XML documents, like any SGML documents, accept document type declarations. A document type declaration tells us something about the vocabulary of elements and attributes used in the document. It should appear before the root element, after the XML declaration, if you have one. It takes a form such as <!DOCTYPE root_name PUBLIC "public_identifier" "URL"> We will come back to the document type declaration later.
HTML HyperText Markup Language HTML is an SGML DTD –Head, Title, Body, Paragraph, etc. –Headings, Bold, Italic, etc. –Table, List, Image, etc. –Links to other documents –Forms –and many others
HTML history HTML was a very bare-bones language when first invented by Tim Berners-Lee. It did not describe pages with much visual appeal. In the 90s, successful browsers invented extensions that aimed to stretch the visual boundaries of HTML. Some of these extensions found their way into the official HTML spec issued by the W3C. Later the W3C developed style sheets as a way to accommodate display requirements without having to extend HTML.
HTML versions HTML 4.01 is the last version of HTML. This version has several DTDs, including –the loose DTD –the strict DTD I only cover the elements of the strict DTD. The loose DTD has more elements, but all the functionality of these elements is best done with style sheets. Thus, the pages created with HTML only will look rather boring. But we do cover style sheets later.
XHTML XHTML is HTML written in an XML syntax. Every XHTML document has to be well-formed XML. HTML documents that are not XHTML can violate some well-formedness constraints, including –HTML element names are not case sensitive. –some HTML elements do not need closing. –there is no need for a single root element in an HTML document.
XHTML: pain without gain? In this course we study XHTML. When I say HTML in the following, I mean XHTML. Reasons to study XHTML rather than HTML –syntactic rules of XML are easier to understand. –any tool that can work with XML can be applied to XHTML, but can not be applied to HTML. –in general XML documents are more computer understandable. This is crucial in the age of the search engine.
Example HTML snippet <a href="http://openlib.org/home/krichel">Thomas Krichel</a> –the whole thing is an <a> element. It creates an anchor. (I use <> to surround element names.) –href is an attribute name –http://openlib.org/home/krichel is the value of the "href" attribute (I surround attribute names with straight quotes) –'Thomas Krichel' is character data.
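The same dissection can be done by a parser. A minimal sketch, assuming the snippet on this slide is the anchor element written out in the string below; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

snippet = '<a href="http://openlib.org/home/krichel">Thomas Krichel</a>'
element = ET.fromstring(snippet)

print(element.tag)     # the element name
print(element.attrib)  # attribute names mapped to attribute values
print(element.text)    # the character data
```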
Characters: concept A character set combines two things –Character repertoire: a set of characters, e.g. "A" –Character code positions: a number defined for each character in the repertoire. A character encoding is a way to encode the code positions in bytes. To correctly display a document, the user agent needs to know both!
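The difference between a code position and an encoding shows up as soon as a character leaves ASCII. The sketch below takes the letter e with acute accent as an assumed example and encodes it in two different ways.

```python
character = "\u00e9"  # e with acute accent

code_position = ord(character)                 # its number in the Unicode character set
utf8_bytes = character.encode("utf-8")         # that number encoded as UTF-8
latin1_bytes = character.encode("iso-8859-1")  # the same number encoded as ISO-8859-1

print(code_position, utf8_bytes, latin1_bytes)

# A user agent that assumes the wrong encoding shows the wrong characters.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# For plain ASCII characters the two encodings give the same bytes.
print("A".encode("ascii") == "A".encode("utf-8"))
```

One code position, two different byte sequences: this is why the user agent has to be told the encoding.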
playing safe with characters Only use the characters on the US keyboard, don't insert symbols. Save as ASCII or UTF-8. All ASCII files are also UTF-8 files. Never save as "Unicode" within MS Notepad. If you encounter a character that is not on your keyboard, use an SGML entity. The SGML entity is the last special SGML thing that we have to study.
SGML entities SGML entities are a way to represent non-ASCII characters when only ASCII input is possible. Codes can be of the form &code; –Ex. &eacute; inserts an e with acute accent. –this is called a character entity –Codes are often abbreviations of the character names Codes can be in hex form –Ex. &#x26; inserts an ampersand –this is called a numeric entity
XHTML entities They are officially defined in three files that are maintained by the W3C –http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent –http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent –http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent A sample line looks like <!ENTITY eacute "&#233;"> <!-- latin small letter e with acute, U+00E9 ISOlat1 --> –<!ENTITY is DTD speak for defining an entity –it is followed by the character form and the numeric form of the entity –the rest of the line is a comment, of course
entities used in XML There are three that you need to know and use. –&lt; stands for < –&gt; stands for > –&amp; stands for & Every time you want to insert <, > or & in the documents, you have to use the entities instead. Examples: –&lt;email@example.com&gt; –je suis Fran&ccedil;ais –Marks &amp; Spencers
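Replacing the three special characters by hand is error-prone, so most languages have a helper for it. A short sketch using the html module from Python's standard library; the input strings are illustrative.

```python
from html import escape, unescape

# Going from text to markup: the special characters become entities.
escaped_ampersand = escape("Marks & Spencers")
escaped_brackets = escape("<krichel>", quote=False)
print(escaped_ampersand)
print(escaped_brackets)

# Going from markup back to text: character and numeric entities are resolved.
plain_named = unescape("je suis Fran&ccedil;ais")
plain_numeric = unescape("&#x26;")
print(plain_named)
print(plain_numeric)
```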
another look at empty.html <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
empty.html dissected The <!DOCTYPE ...> line is an SGML document type declaration. It says that the document contains XHTML of the strict flavor. The document type declaration is the only thing that we have in the prolog. We could have placed an XML declaration before it but chose not to do so. <html> is the root element. It contains some other elements. Some of these we discuss now, others later.
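The nesting described here can be inspected with a parser. The skeleton below is an assumption made for illustration: the real empty.html on the server may differ in detail, and the title text is made up. Only the document type declaration and the meta element are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Assumed skeleton, not a copy of the file on the server.
skeleton = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>an example title</title>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
  </head>
  <body>
  </body>
</html>"""

root = ET.fromstring(skeleton)
child_names = [child.tag for child in root]
title_text = root.find("head/title").text

print(root.tag)      # the root element
print(child_names)   # its children, in document order
print(title_text)    # the contents of the title element
```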
the <html> element It is the root element of an XHTML document. It has required children <head> and <body>. It has two optional attributes –the "dir" attribute says in which direction the contents are rendered. The classic value is "ltr"; "rtl" is also valid. –the "lang" attribute says in which language the contents are. Use ISO 639 codes, e.g. lang="en-us" –these two attributes are known as the internationalization (i18n) attributes. Example: <html lang="en-us">…</html>
the <title> element This is a required child of <head>. It defines the title of the document. It takes the i18n attributes. Example: <title>A fine limerick</title> for a page that starts "There was a young friar named Tuck". The title must not contain other HTML tags.
usability concerns with <title> The title is used by the user agent in a special manner –as bookmark default title –as the title for a window in which the user agent runs Google uses the title as anchor text to your web page. –It is a crucial ad for your page –Google may truncate the title. Bad ideas for titles –section 1 –home page
the <body> element This encloses the contents of the page as opposed to its header. Validation requires one and only one body. It takes the i18n attributes, as well as some others. These fall into another group of attributes that we call core attributes. We will study those core attributes now.
core attributes: "id" This attribute assigns a name to an element. This name must be unique in a document. In the <body> element, this requirement is superfluous, of course. The "id" attribute has several roles in HTML, including –As a style sheet selector –As a target anchor for hypertext links
core attributes: "class" The "class" attribute is a friend of the "id" attribute. It assigns one or more class names to an element. Class names are separated by white space. The element may be said to belong to these classes. A class name may be shared by several elements. The "class" attribute has several roles in HTML, but it is most useful as a style sheet selector, when you want to assign style information to a set of elements.
Example for "class" and "id" There was a young man from Peru Whose limericks stopped at line two. OK, that's a stupid limerick. Let us look at another There was a young man from Japan Whose limericks would never scan And when they asked why He said it is because I Try to put as many words into the last line as I possibly can.
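Here is a sketch of how "id" and "class" might be attached to limericks like these, and how a program can use them to pick elements out. The choice of <div>, the attribute values and the helper names are all assumptions made for illustration. Class names are split on white space, as HTML 4 specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element choice and attribute values are assumptions.
page = """<body>
  <div id="peru" class="limerick short">There was a young man from Peru</div>
  <div id="japan" class="limerick">There was a young man from Japan</div>
</body>"""

body = ET.fromstring(page)

def elements_in_class(root, class_name):
    """Return the elements whose "class" attribute lists class_name."""
    return [el for el in root.iter()
            if class_name in el.get("class", "").split()]

def element_with_id(root, wanted_id):
    """Return the element whose "id" attribute equals wanted_id, or None."""
    for el in root.iter():
        if el.get("id") == wanted_id:
            return el
    return None

print([el.get("id") for el in elements_in_class(body, "limerick")])
print(element_with_id(body, "japan").text)
```

A class can pick out several elements at once, while an id picks out at most one. That is the difference a style sheet selector relies on.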
core attributes: "title" The "title" attribute sets a title for the element. There is no prescribed way in which the title is rendered by a user agent. Sometimes it is shown as a tool tip, i.e. something that flashes up when the mouse is rolled over the element. Example (the title value is only an illustration): <a href="http://openlib.org/home/krichel" title="home page of Thomas Krichel">Thomas Krichel</a>
core attributes: "style" Use the "style" attribute to give style information to a particular element. This will be discussed further when we do the style sheets. Usually there are better ways to attach style information than writing it onto every element. It is better to place the elements into a class by giving them the same "class" attribute, and then give style sheet information for the class. See validated.html for an example.
summary: core attributes To summarize, we have a group of core attributes. These attributes can be used with almost all elements. There are other attributes that can be almost universally used, called "event attributes". They have to do with scripting and are therefore not studied in this course.
block-level vs text-level elements Block-level elements contain data that is aligned vertically by visual user agents. Text-level elements are aligned horizontally by visual user agents. The reasons behind this distinction –Block-level elements can contain other block-level elements and text-level elements. –Text-level elements cannot contain block-level elements. –Visual user agents start a new line at the beginning of block-level elements. –Multidirectional text would be impossible without it.
the <div> and <p> elements The <div> element allows you to set arbitrary block-level divisions in your document. It takes the core attributes. RULE: put all your contents that are vertically aligned into a <div>. The <p> element is like <div> but it signals the start and end of a paragraph.
the <br/> element <br/> is used to create a line break. Note its emptiness! It has the "clear" attribute that can take the values "none", "left", "right" and "all". This prevents textual contents from floating around other content. The "clear" attribute is deprecated, i.e. it is allowed in the loose DTD but not the strict one.
The <span> element This is another element for arbitrary divisions, but it operates on inline content. This is content that is put in lines horizontally, rather than block-level content, which is stacked vertically. It admits the core attributes. Put things that belong together in a line into a <span>.
example A worse poet however was <span>J</span>enny. Her limericks weren't worth a <span>P</span>enny Though the invention was <span>s</span>ound She always <span>f</span>ound That, whenever she tried to write any She always had one line to <span>m</span>any.
abstraction ends here Up until now, we have done a lot of abstract elements and attributes that do not achieve much visual impact. Instead, they –point the style sheet to where things are –create a semantic design We will now turn to more physical descriptions.
try it out –right-click empty.html in your winscp window. You will see the option to duplicate the file. –duplicate it, say, to tryout.html by entering the new name. –right-click tryout.html and choose edit. –open a user agent to http://wotan.liu.edu/~user/tryout.html where user is your user name. You should be able to see your changes, as last saved.
the <a> element I –<a> opens a hyperlink –the contents of the element are the anchor text; they are limited to text-level contents –the "href" attribute has the target URL –the "hreflang" attribute has the language of the target –the "type" attribute gives the MIME type of the target Some other attributes for which we have no use –coords –shape –accesskey –tabindex And of course, <a> takes the core attributes.
the <a> element II It takes the "rel" attribute to specify the relationship between the current document and the link target, as well as the "rev" attribute to specify the reverse. –This is not currently well supported by the browsers. –I will come back to these relational attributes when discussing the <link> tag. Ex: <a href="…" rel="…">a nice man</a>.
linking within a document If the "id" attribute of an element in a document at the URL URL is set to id, you can make the element the target of a link. You use the URL URL#id for this purpose. If the document linked to is the current document, you don't need to give its URL; #id is enough. Example: <a href="http://openlib.org/home/krichel#joke">joke</a> links to the element with id "joke" in Thomas Krichel's homepage.
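The part after the # can be separated from the rest of the URL, and a bare #id can be resolved against the current document. A small sketch with Python's urllib.parse; the id "joke" and the homepage URL are the ones from the slide, the variable names are illustrative.

```python
from urllib.parse import urldefrag, urljoin

# Separate the document URL from the id of the target element.
document_url, target_id = urldefrag("http://openlib.org/home/krichel#joke")
print(document_url, target_id)

# A link to "#joke" inside a document is resolved against that document's URL.
resolved = urljoin("http://openlib.org/home/krichel", "#joke")
print(resolved)
```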
the <img> element I <img/> makes an image. –the "src" attribute says where the image is –the "alt" attribute gives a text to show for user agents that do not display images. It may be shown by the user agent as the user highlights the image. It is limited to 1024 characters. –the "longdesc" attribute gives the URL of a longer description, for cases where "alt" is too short. Example: <img src="URL" alt="text"/>
the <img> element II –the "width" attribute gives the user agent a suggestion for the width of the image. –the "height" attribute gives the user agent a suggestion for the height of the image. Both can be expressed –in pixels, as a number –as a percentage of the current display width <img> of course supports the core attributes.
HTML checking validated.html has some additional code (as compared to empty.html) that we can now understand. <img style="border: 0pt" src="http://wotan.liu.edu/valid-xhtml10.png" alt="Valid XHTML 1.0!" height="31" width="88" /> Click on the icon to check your code. That's cool!
header elements Headers <h1> to <h6> –Simple form of text formatting –Vary text size based on the header's level. –Actual size of the text of a header element is selected by the browser. –Results can vary significantly between user agents. All take the core attributes.
the <hr/> element <hr/> creates a horizontal rule. It admits the core attributes. Other attributes have been deprecated, i.e. they are allowed in the loose DTD but not the strict one.
contents-based style elements –<abbr> encloses abbreviations –<acronym> encloses acronyms –<cite> encloses citations –<code> encloses computer code snippets –<dfn> encloses things being defined –<em> encloses emphasized text –<kbd> encloses text typed on a keyboard –<samp> encloses literal samples –<strong> encloses strong text –<var> encloses variables All admit the core attributes.
physical style elements –<b> encloses bold contents –<big> encloses big contents –<small> encloses small contents –<i> encloses italics contents –<sub> encloses subscripted contents –<sup> encloses superscripted contents –<tt> encloses typewriter-style contents All admit the core attributes.
the <pre> element <pre> encloses contents that are to be rendered with the characters and line breaks just like in the source text. Markup is still allowed, but elements that do spacing should not be used, obviously. It takes the core attributes and a "width" attribute setting the number of characters per line.
the <blockquote> and <q> elements –<blockquote> quotes a paragraph –<q> makes a short quote inside a paragraph Both take a "cite" attribute whose value is the URL of the source of the quote. They also take the core attributes.
list elements <ol> creates an ordered list –<li> encloses each item <ul> creates an unordered list –<li> encloses each item <dl> encloses a definition list –<dt> encloses the term that is being defined –<dd> encloses the definition All take the core attributes and the i18n attributes.
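Lists are a natural thing to generate from data rather than type by hand. A minimal sketch that builds an ordered or unordered list with xml.etree.ElementTree; the function name make_list and the sample items are assumptions.

```python
import xml.etree.ElementTree as ET

def make_list(items, ordered=False):
    """Return the markup of an <ol> or <ul> with one <li> per item."""
    list_element = ET.Element("ol" if ordered else "ul")
    for item in items:
        list_item = ET.SubElement(list_element, "li")
        list_item.text = item  # special characters are escaped on output
    return ET.tostring(list_element, encoding="unicode")

print(make_list(["web", "markup", "XML"]))
print(make_list(["first", "second"], ordered=True))
```

Because the items are set as text and not pasted into a string, a character such as & in an item comes out as &amp; in the markup.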
http://openlib.org/home/krichel Thank you for your attention!