Web Science

The web is a phenomenon; nothing like it has happened before. It has changed the way that many people live their lives. Other inventions have had as big an impact, but none has matched the web's rate of adoption. This provides great opportunity, but with it comes responsibility. The web should benefit the human race, but in order to ensure that it does, we must first understand it.

The web is the largest human construct in history, and it is transforming society. In order to fully understand it, engineer its future and ensure its social benefit, we need to study Web Science.

The Internet and the Web

Many people use the words Internet and Web interchangeably, but they are fundamentally different. The web (short for World Wide Web) is an infrastructure of distributed information combined with software that uses networks as a vehicle to exchange information. The Internet is that vehicle. The word Internet is a contraction of the phrase “inter-connected networks”; that is, the Internet is a large network made up of many others. This provides the backbone for the exchange of information in the form of web pages or email.

We access the web using a web browser (Firefox, Google Chrome, Internet Explorer (no sniggering)). On the face of things a browser is a simple piece of software, but this is because it hides a lot of what it does for us. Let’s look at this in more detail.

URI

A uniform resource identifier (URI) is a string of characters used to identify a web resource. Such identification enables interaction with representations of the web resource over a network (typically the World Wide Web) using specific protocols. Each URI is defined by a scheme, which specifies a concrete syntax and the associated protocols.
People can very easily confuse URIs and URLs. Here is a quote from RFC 3986, which Tim Berners-Lee co-authored, that distinguishes them:

“A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").”

A URN (Uniform Resource Name) can be likened to a person’s name: it identifies them, but not where you can find them. Their address is like a URL: it tells you where you can find that person.
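The distinction drawn in the quote above can be sketched in a few lines of Python. This is only an illustration: the classify function and its simple rule (a scheme plus a network location makes a URL, the urn scheme makes a URN) are assumptions made for this example, not an official test, and the example addresses are made up.

```python
from urllib.parse import urlparse


def classify(uri: str) -> str:
    """Label a URI as a URL, a URN, or a plain URI (illustrative rule only)."""
    parts = urlparse(uri)
    if parts.scheme == "urn":
        # A name: identifies the resource but not where to find it.
        return "URN"
    if parts.scheme and parts.netloc:
        # An access mechanism (scheme) plus a network location (host).
        return "URL"
    return "URI"


examples = [
    "http://kgv.edu.hk/index.html",
    "ftp://ftp.example.org/pub/readme.txt",
    "urn:isbn:0451450523",
]
for uri in examples:
    print(classify(uri), uri)
```

All three strings are URIs; the first two also say how and where to fetch the resource, which is what makes them URLs.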
Conclusions

1. A URL is a type of URI, but that doesn't mean all URIs are URLs.
2. The part that makes a URI a URL is the inclusion of the "access mechanism", or "network location", e.g. http:// or ftp://.
3. The URN is the "globally unique" part of the identification; it's a unique name.

URL

When surfing the web, we might talk about visiting a website, as if we go there. In truth, the web page is brought to us. We request a resource by entering a URL in the address bar (if we search and then click a result, the URL ends up there for us). This resource is normally stored on another computer, typically a web server, that might be somewhere else across the world. The URL uniquely identifies the page you have requested, and the server responds to the request by delivering the resource (the HTML, CSS, JavaScript or other scripts that are used to create the page). It is the browser’s job to interpret this code and then render the web page in the browser window as the multimedia experience we have come to know and rely on.

The paragraph above again belies the complexity of what goes on. Before a web server can be contacted, the browser must “translate” the URL into something that machines can understand. It does this by making use of the Domain Name System (DNS).

The Domain Name System

As mentioned previously, when we want to access a resource on the web, we use a URL. The URL contains a hostname like kgv.edu.hk. As humans, we like hostnames as they are easy to remember. Behind the scenes, a hostname is translated to an IP address, which is much easier for a computer to use. In IPv4, an IP address is made up of four numbers between 0 and 255 separated by dots, for example 172.0.0.212. As we know, computers like to use binary, and each of these numbers is represented by 8 bits. This means the whole address is 32 bits, giving us 2^32 possible unique IP addresses. This number is around 4 billion.
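The arithmetic above can be checked with a short Python sketch. The function name ipv4_to_int is made up for this example; it packs the four 8-bit numbers of a dotted address into a single 32-bit value.

```python
def ipv4_to_int(address: str) -> int:
    """Pack the four 0-255 numbers of a dotted IPv4 address into 32 bits."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        # Shift what we have 8 bits to the left and drop the next number in.
        value = (value << 8) | octet
    return value


as_int = ipv4_to_int("172.0.0.212")
print(format(as_int, "032b"))  # the address written as 32 binary digits
print(2 ** 32)                 # 4294967296 possible addresses
```

Four numbers of 8 bits each give 32 bits, and 2 to the power 32 is 4,294,967,296, which is where the figure of around 4 billion comes from.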
The translation from hostname to IP address is carried out by the Domain Name System (actually a distributed network of servers). Previously, the Stanford Research Institute took responsibility for a single file known as a host table. As new hostnames were established, the institute would add them to the table, usually twice a week. System administrators would access this table and update their own name servers to enable them to resolve hostnames to IP addresses. This became impractical as the network grew. We now have a distributed database, and no one organisation is responsible for updating the hostname/IP mappings.

The browser contacts a DNS server with a hostname to look up. The server tries to find that hostname in its database and, if it finds it, returns the corresponding IP address. If it is not found, the server contacts another DNS server to see if it can resolve the name. This continues for a reasonable amount of time; if the hostname still cannot be resolved to an IP address, an error is returned and the browser reports that the server could not be found. (This is not the same as the dreaded 404 error, which is sent by a web server that was reached but does not have the requested page.) If another server can provide the IP details, the requesting DNS server will store the information in its cache for a certain amount of time in case the same request is made again (the time it is stored is called the Time To Live).

Evolution of the Web

As the need to create a new version of the IP protocol shows (more IP addresses were required because of the explosion of web pages and of devices accessing them), the web has “grown up” in the last few years. It has evolved considerably since HTTP was presented to the public in 1991 (more on that later).

The early web was a dull place for web designers: it was text based, with underlined hyperlinks scattered throughout. Mosaic was one of the first browsers to allow images to be displayed within the text, and it supported web forms and GIFs. The designs were pretty boring as bandwidth was limited and most websites were designed by computer programmers, so they lacked artistic design and originality. These websites were monotonous and very basic; however, there were only a few hundred on the web at the time.

The war between browsers started in 1995 and lasted for the next three years. This is when the trends for animated GIFs and blinking text started. The leading browser in the mid 90s was Netscape, but it was soon overtaken by IE. This is when page layouts started to become more intricate, as frames and tables took the scene. User buttons started to emerge, and animated GIFs were used to make these more attractive, along with the first uses of JavaScript.

The last two years of the 20th century marked the beginnings of more serious website design. This is when enhanced tools for web development, such as GoLive and Dreamweaver, started to be seen in the industry. People could finally have better access to the creation of web pages. Web designers were starting to be in demand as more and more businesses wanted to create websites to increase sales.
This is when web designers were confronted with the reality of keeping up with the fast-evolving trends and computer technology, if they wanted to be competitive. Flash technology was starting; however, the websites still looked pretty boring and were mostly based on sliced images and HTML tables.
From the turn of the century until 2004, the Internet experienced a rise in standards. Designs without tables started to gain ground and CSS technology was also evolving. The next three years were the years of Web 2.0. This is when websites started to move towards the community and become more user-friendly. Websites were now using bolder typography and attractive colours, as well as offering faster loading times. They developed numerous functionalities that users could easily access. This is when widgets started to be used everywhere on blogs and websites in order to link with social networks, and the use of feeds started to become popular.

Since 2008, trends have moved towards the use of mobile devices, with the popular iPhones and iPads. Web designers now have to create websites that adapt to smaller screens, with applications to go with them, so the future is focusing more on simplified websites for mobile devices.

So as you can see, in its short lifetime the web has evolved quickly and is far removed from the text-based collection of hundreds of pages that existed in the early 1990s. There are now billions of web pages, so many in fact that web crawlers cannot index them all due to the storage requirements and the rate of expansion of the web. You can read more about the evolution of the web here.

Web Standards and Protocols

The rise of the web is also due in part to the standards that were created to ensure that any device implementing them could access the web and transfer and share information. We will look at these in detail in the following section.

HTTP

The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text. HTTP is the protocol used to exchange or transfer hypertext.

HTTP functions as a request-response protocol in the client-server computing model. A web browser, for example, may be the client, and an application running on a computer hosting a web site may be the server. The client submits an HTTP request message to the server. The server, which provides resources such as HTML files and other content, or performs other functions on behalf of the client, returns a response message to the client. The response contains completion status information about the request and may also contain requested content in its message body.

HTTP uses a sequence of request-response transactions. The client makes requests of the server. The HTTP server listens for requests at a specific port (usually port 80). Upon receiving a request, the server will respond with a status (usually 200 OK) and a message.
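To make the request and response messages concrete, here is a minimal Python sketch that builds the text of a request and picks apart the text of a response. Nothing is sent over a network: the function names are made up for this example and sample_response is a hand-written message, so treat it as an illustration of the message layout rather than a working client.

```python
def build_request(host: str, path: str = "/") -> str:
    """Return the text of a simple HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, resource, version
        f"Host: {host}",         # which site on the server we want
        "Connection: close",
        "",                      # a blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw: str):
    """Split a response into status code, reason, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


# A hand-written response of the kind a server might send back.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 20\r\n"
    "\r\n"
    "<h1>Hello, web!</h1>"
)

print(build_request("kgv.edu.hk", "/index.html"))
print(parse_response(sample_response))
```

The status line carries the completion status, the headers describe the content, and everything after the blank line is the message body.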
The body of the response message is typically the resource requested by the client, although on occasion it might be an error message or other information.

HTTP is stateless. This means that each new request from the client to the server is completely unrelated to those that have gone before; each request-response pair is independent. This keeps things simple for the server, as it does not need to allocate storage for conversations in progress. If a client disconnects in mid-transaction, no part of the system needs to clean up the present state of the server. Of course this has its disadvantages: each transaction needs to include additional information, and this information needs to be interpreted by the server.

HTTPS

Hypertext Transfer Protocol Secure (HTTPS) is similar to HTTP, except that the protocol is used for secure communication over a computer network. It is widely implemented on the Internet. Strictly speaking, it is not a single protocol but the product of layering HTTP on top of the SSL/TLS (Secure Sockets Layer/Transport Layer Security) protocol, adding the capabilities of SSL/TLS to HTTP communications.

The security of the protocol is therefore provided by the underlying TLS, which uses long-term public and private keys to exchange a short-term session key that encrypts the data flow between client and server (explanation below). The encryption is bi-directional, which protects against eavesdropping and tampering with the contents of the communication. The other level of security that is provided is authentication. In popular deployment on the internet, HTTPS provides authentication of the web site and the associated web server that the client is communicating with, thus protecting against man-in-the-middle attacks. The two levels of security combined provide a reasonable guarantee that the client is communicating with precisely the site that it intended to, and that the contents of the communication cannot be read by any third party.

Historically, HTTPS was used primarily for payment transactions on the web and for sensitive transactions in corporate information systems. More recently, HTTPS has become widely used for protecting page authenticity on all manner of websites, securing accounts and keeping user communications, identity and web browsing private.

In order to benefit from the security provided by the protocol, a site must be hosted completely over HTTPS. There cannot be a mix of HTTPS and HTTP, or the user becomes susceptible to attacks and surveillance. For example, scripts loaded insecurely on an HTTPS page would leave the user open to attack. If we set up a site such that the log-in page (or any single sensitive page) was loaded over HTTPS whilst the rest of the site was loaded over HTTP, this would also expose the user to attacks. Any time a site is accessed with HTTP rather than HTTPS, the user is exposed.
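The problem of mixing HTTP and HTTPS on one page can be illustrated with a small Python sketch that scans a page for resources loaded over plain HTTP. The class name, the function name and the sample page are all made up for this example, and real browsers apply more detailed rules than this simple check.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collect resources that an HTTPS page would load over plain HTTP."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            return  # an ordinary link only navigates; it loads nothing here
        for name, value in attrs:
            if name in ("src", "href") and value and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_mixed_content(html: str):
    finder = MixedContentFinder()
    finder.feed(html)
    return finder.insecure


# A made-up page, imagined as being served over HTTPS.
page = """
<html><head>
<script src="http://cdn.example.com/lib.js"></script>
<link rel="stylesheet" href="https://example.com/site.css">
</head><body>
<img src="https://example.com/logo.png">
<a href="http://example.org/">a plain link</a>
</body></html>
"""

print(find_mixed_content(page))
```

Here only the script is reported, because it is the one resource that would be fetched without encryption.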
HTML

[To be added]

CSS

Cascading Style Sheets (CSS) are used to define the presentation and layout of a document written in a mark-up language. Whilst the most popular application of CSS is to style HTML and XHTML web pages, it can also be applied to any kind of XML document.

CSS was created to de-couple the data and presentation within HTML. CSS can specify elements such as layout, colours and fonts. This separation leads to more flexibility and control over the presentation characteristics of an entire site, as multiple pages can share the same formatting, and it reduces redundancy by taking away the need to include presentation information within the HTML.

CSS solved a big problem for developers. HTML was never intended to contain tags for formatting a document. HTML was intended to define the content of a document, such as:

    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
As the HTML language evolved and formatting tags were added (in specification 3.2, to be exact), the problems for developers began. Development of large websites, where font and colour information was added to every single page, became a long and tiresome process and, by taking more time, eventually cost more.

The separation of data and presentation also allows the same mark-up page to be presented in different styles for different contexts, such as on-screen, in print or even read out by a text-to-speech reader.

The word cascading refers to the priority scheme that is used to determine which style rules apply if more than one rule matches a particular element. In this cascade, priorities or weights are calculated and assigned to rules, so that the results are predictable. You can view the application of different stylesheets to the same content on the CSS Zen Garden site. If you would like to learn some CSS you could try W3Schools or Codecademy.

XML

eXtensible Markup Language (XML), like HTML, is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The design goals of XML emphasise simplicity, generality and usability over the internet. It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.

XML allows separation of data and presentation, where HTML combines the two. If you need to display dynamic data in an HTML document, it will take a lot of work to edit the HTML each time the data changes. By using XML, the data is stored in separate files. This means that HTML/CSS can be used to concentrate on the display and layout, and any changes in the underlying data will not require any changes in the HTML. A few lines of JavaScript can be used to read an external XML file and update the content of the web page.
XML simplifies data sharing. There are many different computer systems and databases that contain data in incompatible formats. XML is stored in plain text format and is thus both a software-independent and hardware-independent way of storing data. This makes it much easier to create data that can be shared by different applications. For example, the Facebook website and the Facebook app would be able to access the same data were it stored in XML. The transfer of data between incompatible systems is one of the most time-consuming tasks for web developers. XML reduces this complexity.
As mentioned previously, XML is a markup language much like HTML, but there are some very clear differences. HTML is designed with the display of data in mind, whereas XML was designed to transport data, with a focus on what the data is; it is for carrying information. Therefore, XML is not a replacement for HTML. XML doesn’t actually do anything. It is simply a method for structuring, storing and transporting information. The following example is a note to Tove, from Jani, stored as XML:

    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

The tags you see here are not defined in any XML standard. The tags are “invented” by the author of the document. This is because XML does not have any pre-defined tags. HTML documents can only use tags that are pre-defined, whereas XML allows authors to define the tags that they need to create their document structure.

XSLT

XSL stands for eXtensible Stylesheet Language; it is a stylesheet language for XML documents. XSLT stands for XSL Transformations. It can be used to transform XML documents into other XML documents, or into a format that can be understood by a browser, like HTML or XHTML. Normally XSLT does this by transforming each XML element into an HTML element. XSLT can add or remove elements and attributes to or from the output file. Elements can be rearranged and sorted, tests can be performed, decisions can be made about which elements to hide or display, and a lot more.

This transformation process does not alter the original file; a new document is created, based on the content of an existing one. Typically, input documents are XML files, but anything from which the processor can build an XQuery and XPath Data Model can be used, for example relational database tables.

XSLT uses XPath to find information in an XML document. XPath is used to navigate through elements and attributes in XML documents. In the transformation process, XSLT uses XPath to define parts of the source document that should match one or more predefined templates. When a match is found, XSLT will transform the matching part of the source document into the result document.

A common way to describe the transformation process is to say that XSLT transforms an XML source tree into an XML result tree. The XSLT processor takes one or more XML source documents, plus one or more XSLT stylesheets, and processes them to produce an output document.
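The idea of turning a source tree into a separate result tree can be sketched in Python. Note that this is not XSLT: Python's standard library has no XSLT processor, so the sketch below uses the xml.etree.ElementTree module to do by hand roughly what a stylesheet would describe. The function name note_to_html and the layout of the output are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""


def note_to_html(xml_text: str) -> str:
    """Build a new HTML tree from the note; the source is left untouched."""
    source = ET.fromstring(xml_text)   # the source tree
    html = ET.Element("html")          # the result tree, built from scratch
    body = ET.SubElement(html, "body")
    # findtext takes a simple path, much as XSLT uses XPath to select nodes.
    ET.SubElement(body, "h2").text = source.findtext("heading")
    ET.SubElement(body, "p").text = (
        f"To {source.findtext('to')}, from {source.findtext('from')}: "
        f"{source.findtext('body')}"
    )
    return ET.tostring(html, encoding="unicode")


print(note_to_html(NOTE_XML))
```

As with XSLT, the original document is only read; the output is a new document built from its content.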
XML and XSLT Exercise

Click on the link to download the XML CD catalogue: View "cdcatalog.xml". Open it in your browser and notice that the browser tells you that it does not have any style information associated with it, so it displays the document as it is.

Create the XSLT document for cdcatalog in Notepad or an equivalent text editor, saving it as an XSL file named cdcatalog.xsl:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <html>
          <body>
            <h2>My CD Collection</h2>
            <table border="1">
              <tr bgcolor="#9acd32">
                <th>Title</th>
                <th>Artist</th>
              </tr>
              <xsl:for-each select="catalog/cd">
                <tr>
                  <td><xsl:value-of select="title"/></td>
                  <td><xsl:value-of select="artist"/></td>
                </tr>
              </xsl:for-each>
            </table>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>
Now, we need to create a link between the XML and XSL files. Add the second line shown here (the xml-stylesheet instruction) to the XML file like so:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="cdcatalog.xsl"?>
    <catalog>
      <cd>
        <title>Empire Burlesque</title>

Now try opening the XML file in your browser; you should see the data nicely formatted in a table.

JavaScript

JavaScript is a dynamic programming language. It is a scripting language, meaning it is a lightweight language. Scripting languages tend to be used to add functionality to existing systems, and this is what JavaScript does for browsers. JavaScript can be added to static HTML pages to be executed by all types of browsers and provide dynamic content. This can be done by placing it directly inside <script></script> tags. Depending on the web developer's intent, script code may run when the user opens the web page, clicks or drags some page element with the mouse, types something on the keyboard, submits a form, or leaves the page.

It is mainly a client-side language, i.e. the script is delivered to the browser and executed on the client machine. This allows interaction with the user, control of the browser, asynchronous communication (sending information when necessary instead of on some fixed time interval) and alteration of the document content that is being displayed. JavaScript is also gaining momentum in server-side programming, game development and the creation of desktop and mobile applications, so it is becoming more than a scripting language.

JavaScript copies many names and naming conventions from Java, but the two languages are otherwise unrelated and have very different semantics. JavaScript is a multi-paradigm language in that it supports object-oriented, imperative and functional programming styles. To try out some simple JavaScript tutorials, go to W3Schools.

So for client-side interactivity we can use JavaScript to control the items mentioned above.
For our server-side interactions we can use PHP.

Protocols

In the context of the web, a protocol is a set of digital rules for the exchange of information. The web is built on these protocols, and its rise is down to the fact that there are standard protocols that allow open access to the web and all it has to offer.

TCP/IP

Two people can communicate when they have a common language. They may choose to speak English, Chinese, Spanish, German or even sign language. Computers work the same way. Transmission Control Protocol/Internet Protocol (TCP/IP) is like a language that computers speak. More specifically, TCP/IP is a set of rules that govern how computers send data to each other. Such a set of rules is called a protocol. TCP/IP is named after two protocols that work together in layers, and is therefore referred to as a protocol stack.

Why TCP/IP?

TCP/IP is a fast, scalable, robust and efficient suite of protocols. This protocol stack is the de facto protocol of the internet. But why?

There was a time when there was no need for computers to communicate. But as networks became more prevalent, a standard method for communication was required. Network administrators can choose among many different protocols, but TCP/IP is the most widely used. The world’s biggest network, the internet, uses TCP/IP. So if you want to communicate on the internet, you must use TCP/IP.

Another reason for TCP/IP's popularity is that it is compatible with virtually every computer system in the world. It is supported by most of the large software and hardware vendors and, crucially, by all of the major current operating systems.

It is often said that TCP/IP is the ‘language of the internet’. But it is also the language of many of the smaller networks that exist. The likelihood is that these smaller networks will be connected to the internet anyway, so it follows that we would want them to be able to communicate with all of the other computers on that network.

One of the reasons that TCP/IP has gained so much prominence is that it can be installed on any platform. For example, a Unix host can easily transfer data to a DOS or Windows system if TCP/IP is used. TCP/IP eliminates cross-platform boundaries.

TCP

TCP takes a message (the data that is to be sent) and splits it up into “chunks” of an agreed size (part of the protocol). These are called packets. Each packet has two components: a header and a payload. The header is added to the front of the data to allow it to reach its destination and be re-assembled there. The header contains the source IP address, the destination IP address, a packet number and other data required to route and deliver the packet.
The payload is simply the data to be delivered. Packets can be routed through the network in different ways depending on traffic. This is why the packet number is required, so that the packets can be put into the correct order upon arrival at the destination.

We can think about this in another way. Suppose you want to send a bicycle to a relative in another country. You might try to fit the whole thing in one package, but it is likely to exceed the maximum size allowed by the postal service. So you would split it into several packages that meet the restrictions on package size. The packages still require address information to reach their destination, and this information has to be added to all of the packages. You would likely add your own address to the packages in case there are any problems with the delivery. This way you could be notified or the package could be returned. You could also number the packages to allow your relative to open them in the correct order, making it easier to re-build the bicycle upon receipt.

After a header has been added to the packet, the packet needs to be encapsulated into the correct format for the network. Encapsulation is simply wrapping the packet into the correct format. For example, if you were using Ethernet, the packet would be formatted to move through an Ethernet network. The encapsulation process would be different when preparing a packet for transfer on a ring network.
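The splitting, numbering and re-ordering described above can be sketched in Python. This follows the simplified picture used in this section (one header holding addresses and a packet number); real TCP and IP headers are separate and hold more fields. The Packet class, the function names, the chunk size and the addresses are all made up for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    source: str        # header: where the packet came from
    destination: str   # header: where it is going
    number: int        # header: position of this chunk in the message
    total: int         # header: how many packets make up the message
    payload: bytes     # the data being carried


def split_message(message: bytes, source: str, destination: str, size: int = 8):
    """Cut the message into chunks and put a header on each one."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [
        Packet(source, destination, n, len(chunks), chunk)
        for n, chunk in enumerate(chunks)
    ]


def missing_numbers(packets):
    """List the packet numbers that have not arrived."""
    if not packets:
        return []
    seen = {p.number for p in packets}
    return [n for n in range(packets[0].total) if n not in seen]


def reassemble(packets) -> bytes:
    """Put the packets back in order and join their payloads."""
    ordered = sorted(packets, key=lambda p: p.number)
    return b"".join(p.payload for p in ordered)


message = b"A bicycle, sent in several parcels."
packets = split_message(message, "192.0.2.1", "198.51.100.7")
random.shuffle(packets)   # packets may arrive in any order
print(reassemble(packets) == message)
```

If missing_numbers reports a gap, only those packets need to be sent again, not the whole message.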
When the packets have been prepared for delivery, they are passed to IP to do the routing and delivery. At the recipient end, IP hands the packets to TCP, which re-assembles them and is responsible for dealing with any errors that occur, such as a missing packet. The neat thing is that the missing packet never needs to be found; the sender can simply send a copy of that packet and does not have to re-send the entire message. TCP at the recipient is also responsible for sending an acknowledgement to confirm safe receipt.

IP

As mentioned previously, TCP/IP is a protocol stack; that is, it is a combination of separate protocols. TCP sits on top of the IP portion to achieve the goal of data transmission. Having looked at TCP, we now turn to IP.

Internet Protocol is the most prominent protocol for communications on the internet. It has the task of delivering packets from the source to the destination based on the IP addresses in the packet headers. The protocol is responsible for addressing hosts and for routing packets from the source host to the destination host across one or more networks.

IP addressing is the assignment of IP addresses to host interfaces. An Internet Protocol address (IP address) is a numerical label assigned to each device (e.g., computer, printer) participating in a computer network that uses the Internet Protocol for communication. An IP address serves two principal functions: host or network interface identification and location addressing. Its role has been characterized as follows: “A name indicates what we seek. An address indicates where it is. A route indicates how to get there.”

The first widely used version of IP was IPv4, which uses 32 bits for an IP address (providing around 4.3 billion addresses). IPv4 still forms the backbone of the internet, but as its addresses have been almost exhausted, its successor, IPv6, has been defined. IPv6 uses 128 bits to define an address, giving rise to around 3.4×10^38 addresses.
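The two address-space figures can be checked directly in Python. The sketch below uses the standard ipaddress module; the helper name address_bits is made up for this example, and the two addresses are just sample values.

```python
import ipaddress


def address_bits(text: str) -> int:
    """Return how many bits the given IPv4 or IPv6 address uses."""
    return ipaddress.ip_address(text).max_prefixlen


print(address_bits("172.16.254.1"))               # an IPv4 address
print(address_bits("2001:db8:0:1234:0:567:8:1"))  # an IPv6 address

ipv4_total = 2 ** 32
ipv6_total = 2 ** 128
print(f"{ipv4_total:,}")     # 4,294,967,296
print(f"{ipv6_total:.1e}")   # 3.4e+38
```

Going from 32 to 128 bits does not make the address space four times bigger; it multiplies it by 2 to the power 96.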
IP addresses are binary numbers, but they are usually stored in text files and displayed in human-readable notations, such as 172.16.254.1 (for IPv4) and 2001:db8:0:1234:0:567:8:1 (for IPv6).

Hardware Addresses

If you go to the Windows command prompt and type ipconfig /all, you will be presented with some information about the various ways of identifying your device on the network you are connected to. These might be wireless LAN cards, Bluetooth or Ethernet. What you will see is a physical address for each of these. This physical address is permanent for each of these pieces of hardware; it is burned into them.

IP uses another protocol, the Address Resolution Protocol (ARP), to resolve logical IP addresses to physical hardware addresses. This works by gathering information when other devices make broadcasts and keeping it in a cache for a short time. If the address cannot be found in the cache, a broadcast request is sent. For example:

Broadcast packet: “Hi, I need to know the hardware address of 201.56.34.12”.

The broadcast is also sent out with the broadcaster's hardware address to expedite the return process. The reply to the broadcast would be something like “Hi, I’m 201.56.34.12. My hardware address is 55-AD-3F-C3-11-0E”.
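The look-in-the-cache-first, broadcast-if-needed behaviour can be sketched as a small Python class. This is a toy model, not a real network stack: the ArpCache class is made up for this example, the network dictionary stands in for the devices that would answer a broadcast, and the 60 second lifetime and the six-byte hardware address are illustrative values.

```python
import time


class ArpCache:
    """Toy model of resolving IP addresses to hardware addresses."""

    def __init__(self, network, ttl_seconds=60.0):
        self.network = network   # stands in for devices answering a broadcast
        self.ttl = ttl_seconds   # how long a cached answer is kept
        self.entries = {}        # ip -> (hardware address, expiry time)
        self.broadcasts = 0      # how many times we had to ask the network

    def resolve(self, ip, now=None):
        now = time.monotonic() if now is None else now
        cached = self.entries.get(ip)
        if cached and cached[1] > now:
            return cached[0]                  # answered from the cache
        self.broadcasts += 1                  # "who has this IP address?"
        hardware = self.network.get(ip)       # None if nobody replies
        if hardware is not None:
            self.entries[ip] = (hardware, now + self.ttl)
        return hardware


network = {"201.56.34.12": "55-AD-3F-C3-11-0E"}
cache = ArpCache(network)
print(cache.resolve("201.56.34.12"))  # first lookup needs a broadcast
print(cache.resolve("201.56.34.12"))  # second lookup comes from the cache
print(cache.broadcasts)
```

Keeping entries only for a short time means that a device whose address changes is not remembered wrongly for long.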
TCP/IP

Although these have been dealt with separately above, they are usually mentioned in the same breath as TCP/IP. The way it is written tells you something about the relationship: TCP rests on top of IP. TCP prepares the packets for transmission and IP gets them to the destination through the maze of inter-connected networks. Upon arrival, TCP takes over once more to re-assemble the packets into the complete message that was sent. TCP will also deal with any errors, such as incomplete or corrupted data.

FTP

FTP is a high-level protocol that is also built on TCP. The File Transfer Protocol (FTP) is a standard network protocol used to transfer computer files from one host to another over a TCP-based network, such as the Internet.

FTP is built on a client-server architecture and uses separate control and data connections between the client and the server. FTP users may authenticate themselves using a log-in, normally in the form of a username and password, but can connect anonymously if the server is configured to allow it. For secure transmission that protects the username and password and encrypts the content, FTP is often secured with SSL/TLS.

The first FTP client applications were command-line applications developed before operating systems had graphical user interfaces, and they are still shipped with most Windows, Unix and Linux operating systems. Web browsers have traditionally been capable of FTP, but only for downloading from FTP servers, and many modern browsers have since dropped FTP support. In order to upload, a dedicated FTP client is required.

FTP may run in active or passive mode, which determines how the data connection is established. In both cases, the client creates a TCP control connection from a random unprivileged port N to the FTP server command port 21. In active mode, the client starts listening for incoming data connections from the server on port N+1 (the client sends the FTP command PORT N+1 to inform the server on which port it is listening).
In situations where the client is behind a firewall and unable to accept incoming TCP connections, passive mode may be used. In this mode, the client uses the control connection to send a PASV command to the server and then receives a server IP address and server port number from the server, which the client then uses to open a data connection from an arbitrary client port to the server IP address and server port number received. FTP has been recently updated to make it compatible with IPv6. The diagrams below show the difference between active and passive connections:
Active is shown on the left, passive on the right.
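The port numbers exchanged in the two modes can be illustrated in Python. In the FTP specification, both the PORT command and the reply to PASV write an address as six numbers: the four parts of the IP address followed by the port split into two bytes (port = first byte × 256 + second byte). The function names and the sample addresses below are made up for this example, and no connection is made.

```python
import re


def parse_pasv(reply: str):
    """Extract the server address and port from a reply to PASV."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if not match:
        raise ValueError(f"not a PASV reply: {reply!r}")
    numbers = [int(n) for n in match.groups()]
    host = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]   # two bytes make one port number
    return host, port


def active_port_command(client_ip: str, port: int) -> str:
    """Build the PORT command a client sends in active mode."""
    host_part = client_ip.replace(".", ",")
    return f"PORT {host_part},{port // 256},{port % 256}"


# Passive mode: the server tells the client where to connect.
print(parse_pasv("227 Entering Passive Mode (192,0,2,10,19,137)"))

# Active mode: the client tells the server where it is listening.
print(active_port_command("192.0.2.20", 5001))
```

In passive mode the client opens the data connection to the address it was given, which is why this mode works from behind a firewall.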
Web Pages

Web pages are of course written in HTML, as discussed previously. This HTML is structured, and if you look at any HTML page you will see similarities.

The first item to appear in the source code of a web page is the doctype declaration. This provides the web browser (or other user agent) with information about the type of markup language in which the page is written, which may or may not affect the way the browser renders the content. Here is an example doctype declaration:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html> ... rest of page

The next thing to consider is the head of the page. The head of the page is contained within <head> tags and can contain lots of important information about the page. The most visible head element is the page title. The <title> element:

- defines a title in the browser toolbar
- provides a title for the page when it is added to favorites
- displays a title for the page in search-engine results
The next head element you might know about is the meta element. Metadata is data (information) about data. The <meta> tag provides metadata about the HTML document. Metadata will not be displayed on the page, but will be machine parsable. Meta elements are typically used to specify page description, keywords, author of the document, last modified, and other metadata. The metadata can be used by browsers (how to display content or reload page), search engines (keywords), or other web services. Some examples of the use of the meta element follow.
Define keywords for search engines: <meta name="keywords" content="HTML, CSS, XML, XHTML, JavaScript">
Define a description of your web page: <meta name="description" content="Free Web tutorials on HTML and CSS">
Define the author of a page: <meta name="author" content="John Smith">
The script element can also be found in the head. A script can be defined within the script tags. Alternatively, the script tag can be used to link to an external file containing the required scripts. The scripts being defined are client-side scripts, such as JavaScript. The <style> tag is used to define style information for an HTML document. Inside the <style> element you specify how HTML elements should render in a browser. Each HTML document can contain multiple <style> tags. Alternatively, style information can be placed inside a separate file called a stylesheet and linked in the head using the <link> element.
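Because the head is machine parsable, a program can read the title and meta elements directly. The following is a minimal sketch using Python's standard html.parser module; the class name, the helper read_head and the sample page are assumptions made for illustration, and a real search engine or browser would do considerably more.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collects the page title and any <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attributes:
            self.meta[attributes["name"]] = attributes.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only kept while we are inside the <title> element.
        if self._in_title:
            self.title += data


def read_head(html: str):
    parser = HeadInfoParser()
    parser.feed(html)
    return parser.title.strip(), parser.meta


# Hypothetical page reusing the meta examples from the text.
SAMPLE = """<html><head>
<title>Web tutorials</title>
<meta name="keywords" content="HTML, CSS, XML, XHTML, JavaScript">
<meta name="description" content="Free Web tutorials on HTML and CSS">
<meta name="author" content="John Smith">
</head><body><p>Hello</p></body></html>"""

if __name__ == "__main__":
    title, meta = read_head(SAMPLE)
    print(title)
    for name, content in meta.items():
        print(name, "=", content)
```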
After the head of the page comes the body. This is where the bulk of the page is contained. Everything that you can see in the browser window is contained inside this element, including paragraphs, lists, links, images, tables, and more. How the page looks will depend entirely upon the content that you decide to fill it with and whatever style information has been provided. Webpage as a Document Tree A web page could be considered as a document tree that can contain any number of branches. There are rules as to what items each branch can contain (and these are detailed in each element’s reference in the “Contains” and “Contained by” sections). To understand the concept of a document tree, it’s useful to consider a simple web page with typical content features alongside its tree view.
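As a rough sketch of such a tree view, the Python below parses a small hypothetical page with the standard html.parser module and prints each element indented under its parent. The page content, the class name and the list of void elements (elements with no closing tag) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot contain branches.
VOID_ELEMENTS = {"meta", "img", "br", "hr", "link", "input"}


class TreePrinter(HTMLParser):
    """Records one indented line per element, mirroring the document tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


def tree_view(html: str) -> str:
    printer = TreePrinter()
    printer.feed(html)
    return "\n".join(printer.lines)


# Hypothetical page with typical content features.
PAGE = """<html>
<head><meta charset="utf-8"><title>My page</title></head>
<body>
<div><h1>Welcome</h1><p>Some text with a <a href="about.html">link</a>.</p></div>
<div><ul><li>One</li><li>Two</li></ul><img src="logo.gif" alt="Logo"></div>
</body>
</html>"""

if __name__ == "__main__":
    print(tree_view(PAGE))
```

Each level of indentation in the output is one step further down a branch of the tree.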
You can see from this that the html element has two elements, head and body. The head has two branches: meta and title. The body section contains a number of divs, headers, lists, images, links and paragraphs. Different types of Web Page Personal Web Page A personal webpage is created by an individual for his/her own personal need. Personal web pages are often used solely for informative or entertainment purposes. In strictly technical terms, a site's actual home page (index page) often only contains sparse content with some interesting or catchy introductory material and serves mostly as a pointer or table of contents to the content-rich pages inside, such as résumés, family, hobbies, family genealogy, a blog, opinions, online journals and diaries or other writing, work, sound clips, movies, photos, or other interests. Many personal pages only include information of interest to friends and family of the author. Blog A blog (a truncation of the expression web log) is a discussion or informational site published on the World Wide Web and consisting of posts typically displayed in reverse chronological order (the most recent post appears first). Until recently, blogs were usually the work of a single individual occasionally of a small group, and often covered a single subject. More recently "multi-author blogs" (MABs) have developed, with posts written by large numbers of authors and professionally edited.
MABs from newspapers, other media outlets, universities, think tanks, advocacy groups and similar institutions account for an increasing quantity of blog traffic. The rise of Twitter and other "microblogging" systems helps integrate MABs and single-author blogs into societal newstreams. Blog can also be used as a verb, meaning to maintain or add content to a blog. The emergence and growth of blogs in the late 1990s coincided with the advent of web publishing tools that facilitated the posting of content by non-technical users. (Previously, a knowledge of such technologies as HTML and FTP had been required to publish content on the Web.) A majority are interactive, allowing visitors to leave comments and even message each other via GUI widgets on the blogs, and it is this interactivity that distinguishes them from other static websites.[2] In that sense, blogging can be seen as a form of social networking service. Indeed, bloggers not only produce content to post on their blogs, but also build social relations with their readers and other bloggers. Search Engine A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented as a list of results, often referred to as search engine results pages. The information may be a mix of web pages, images and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler. The page itself is usually very simple, offering the user only a text box to input a query. These queries can be as simple or as complex as the user desires, the thought being that the more specific the query is, the more relevant the results returned. Here are examples of different techniques that can be used to build search engine queries.
Web Forums An Internet forum, or message board, is an online discussion site where people can hold conversations in the form of posted messages. They differ from chat rooms in that messages are often longer than one line of text, and are at least temporarily archived. Also, depending on the access level of a user or the forum set-up, a posted message might need to be approved by a moderator before it becomes visible. Forums have a specific set of jargon associated with them; e.g. a single conversation is called a "thread", or topic. A discussion forum is hierarchical or tree-like in structure: a forum can contain a number of subforums, each of which may have several topics. Within a forum's topic, each new discussion started is called a thread, and can be replied to by as many people as so wish. Depending on the forum's settings, users can be anonymous or have to register with the forum and then subsequently log in in order to post messages. On most forums, users do not have to log in to read existing messages. Below is an example of what a forum might look like:
Static vs Dynamic Web pages A static web page (sometimes called a flat page/stationary page) is a web page that is delivered to the user exactly as stored, in contrast to dynamic web pages which are generated by a web application. Consequently a static web page displays the same information for all users. Static web pages are often HTML documents stored as files in the file system and made available by the web server over HTTP (nevertheless URLs ending with ".html" are not always static). However, loose interpretations of the term could include web pages stored in a database, and could even include pages formatted using a template and served through an application server, as long as the page served is unchanging and presented essentially as stored. Static web pages are suitable for the contents that never or rarely need to be updated. However, maintaining large numbers of static pages as files can be impractical without automated tools. Any personalization or interactivity has to run client-side, which is restricting. A server-side dynamic web page is a web page whose construction is controlled by an application server processing server-side scripts. In server-side scripting, parameters determine how the assembly of every new web page proceeds, including the setting up of more client-side processing. Server-side scripts are usually written in PHP, ASP or as Java Servlets. A client-side dynamic web page processes the web page using HTML scripting running in the browser as it loads. JavaScript and other scripting languages determine the way the HTML in the received page is parsed into the Document Object Model, or DOM, that represents the loaded web page. The same client-side techniques can then dynamically update or change the DOM in the same way. A dynamic web page is then reloaded by the user or by a computer program to change some variable content. The updating information could come from the server, or from changes made to that page's DOM. 
Using Ajax technologies, the end user gets one dynamic page managed as a single page in the web browser, while the actual web content rendered on that page can vary. The Ajax engine runs only in the browser and requests the parts of the page's DOM that need updating from an application server.
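The difference between a static page and a server-side dynamic page described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page content, the template and the function names are invented, and a real application server would sit behind a web server rather than be called directly.

```python
from datetime import datetime
from html import escape
from string import Template

# A static page is delivered exactly as stored: every user sees the same text.
STATIC_PAGE = "<html><body><h1>Welcome</h1></body></html>"

# A dynamic page is assembled on each request from a template plus parameters.
PAGE_TEMPLATE = Template(
    "<html><body><h1>Welcome, $name</h1><p>Generated at $when</p></body></html>"
)


def serve_static() -> str:
    return STATIC_PAGE


def serve_dynamic(params: dict, now: datetime) -> str:
    # Escape user-supplied values before weaving them into the HTML.
    name = escape(params.get("name", "guest"))
    return PAGE_TEMPLATE.substitute(name=name, when=now.strftime("%H:%M"))


if __name__ == "__main__":
    print(serve_static())
    print(serve_dynamic({"name": "Ada"}, datetime.now()))
```

The static function ignores who is asking, while the dynamic one builds a different document for each set of parameters, which is the role PHP, ASP or a Java Servlet plays on a real server.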
Connecting Web Pages to underlying data sources Dynamic pages often require data sources to generate appropriate content. This may be in the form of a user database that specifies something unique about the user, or the results of a user-generated query to a database containing information about any subject. You should be very familiar with the process of connecting to a database. You have linked pages up with SQL databases on several occasions. For security purposes, these databases are set up to require log-in. The web author can connect to the database using some PHP script:
<?php
// mysqli_connect.php
// Connection settings for the database.
define('DB_USER', 'root');
define('DB_PASSWORD', '');
define('DB_HOST', 'localhost');
define('DB_NAME', 'kgv');
// Open the connection, or stop with the error message if it fails.
$dbc = @mysqli_connect(DB_HOST, DB_USER, DB_PASSWORD, DB_NAME) OR die('Could not connect to MySQL: ' . mysqli_connect_error());
// Set the character set for the connection ('utf8_bin' is a collation, not a character set).
mysqli_set_charset($dbc, 'utf8');
It is good practice to keep this file outside the web folder on the server so that it cannot be accessed. A browser would not render anything on the screen should the file be accessed, but it is still in the best interests of security to follow this convention. After connecting to the database, PHP affords us the opportunity to construct queries using variables that can be user generated. PHP does this using the mysqli_query($dbc, $q) function. The function requires the database connection being queried, $dbc, and the query to be executed, $q. $q is simply a string that follows the query syntax of MySQL. Proper use might look like this:
$r = mysqli_query($dbc, $q);
The results of the query, if there are any and it is successful, are stored in $r and can then be accessed using further PHP commands. As PHP is able to generate HTML, the results can be accessed and woven into HTML to create a dynamic page. PHP and SQL are not the only way to achieve this goal; there are a few other technologies that perform similar roles. ASP.NET ASP.NET is a server-side Web application framework designed for Web development to produce dynamic Web pages. ASP.NET is a development framework for building web pages and web sites with HTML, CSS, JavaScript and server scripting. ASP.NET supports three different development models: Web Pages, MVC (Model View Controller), and Web Forms. Web Pages is one of the three programming models for creating ASP.NET web sites and web applications.
The other two programming models are Web Forms and MVC (Model, View, Controller). Web Pages is the simplest programming model for developing ASP.NET web pages. It provides an easy way to combine HTML, CSS, JavaScript and server code:
Easy to learn, understand, and use
Built around single web pages
Similar to PHP and Classic ASP
Server scripting with Visual Basic or C#
Full HTML, CSS, and JavaScript control
Web Pages is easily extendable with programmable Web Helpers, including database, video, graphics, social networking and much more. MVC is the second of the three ASP.NET programming models. MVC is a framework for building web applications using an MVC (Model View Controller) design:
The Model represents the application core (for instance a list of database records). The View displays the data (the database records). The Controller handles the input (to the database records).
The MVC model also provides full control over HTML, CSS, and JavaScript. The final ASP.NET model is Web Forms. It is the oldest ASP.NET programming model, with event-driven web pages written as a combination of HTML, server controls, and server code. Web Forms are compiled and executed on the server, which generates the HTML that displays the web pages. Web Forms comes with hundreds of different web controls and web components to build user-driven web sites with data access. Web Forms are the main building blocks for application development. Web forms are contained in files with a ".aspx" extension; these files typically contain static (X)HTML markup, as well as markup defining server-side Web Controls and User Controls where the developers place all the content for the Web page. Additionally, dynamic code which runs on the server can be placed in a page within a block <% -- dynamic code -- %>, which is similar to other Web development technologies such as PHP, JSP, and ASP. One great feature of ASP.NET is Web Services. Web services mean that you can have several pieces of your application on different servers all around the world, and the entire application will still work as a whole. Web services can even work with normal .NET Windows applications. For example, a lot of people would like to have a stock ticker on their web site, but not many people want to manually type in all changes to the prices. If one company (a stock broker) creates a web service and updates the stock prices periodically, then all of those people wanting the prices can use this web service to log in, run a function which grabs the current price for a chosen company, and return it. Web services can be used for many things: news, currency exchange, login verification and so on. ASP.NET encourages the programmer to develop applications using an event-driven GUI model, rather than in the style of conventional Web-scripting environments like ASP and PHP.
ASP.NET session state enables you to store and retrieve values for a user as the user navigates ASP.NET pages in a Web application. HTTP is a stateless protocol. This means that a Web server treats each HTTP request for a page as an independent request. The server retains no knowledge of variable values that were used during previous requests. ASP.NET session state identifies requests from the same browser during a limited time window as a session, and provides a way to persist variable values for the duration of that session. By default, ASP.NET session state is enabled for all ASP.NET applications. Unfortunately, the Internet still has bandwidth limitations and not every person is running the same web browser. These issues make it necessary to stick with HTML as our mark-up language of choice. This means that web pages won't look quite as amazing as a fully-fledged application running under Windows, but with a bit of skill and creative flair, you can make some rather amazing web applications with ASP.NET. ASP.NET processes all code on the server (in a similar way to a normal application). When the ASP.NET code has been processed, the server returns the resultant HTML to the client. Java Servlets The servlet is a Java programming language class used to extend the capabilities of a server. Although servlets can respond to any type of request, they are commonly used to extend the applications hosted by web servers, so they can be thought of as Java applets that run on servers instead of in web browsers. These kinds of servlets are the Java counterpart to other dynamic Web content technologies such as PHP and ASP.NET. Servlets are most often used to:
Process or store data that was submitted from an HTML form
Provide dynamic content such as the results of a database query
Manage state information that does not exist in the stateless HTTP protocol, such as filling the articles into the shopping cart of the appropriate customer
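The third use listed above, keeping state on top of stateless HTTP, is the same problem that ASP.NET session state solves. Below is a minimal Python sketch of the idea, not of any real servlet or ASP.NET API: the class and method names are hypothetical, the session identifier stands in for a value that would normally travel in a cookie, and session expiry is left out.

```python
import secrets


class SessionStore:
    """Maps a session identifier to the shopping cart of one customer."""

    def __init__(self):
        self._sessions = {}

    def handle_request(self, session_id, item=None):
        """Process one independent request and return (session_id, cart)."""
        if session_id not in self._sessions:
            # Unknown or missing identifier: start a new session.
            session_id = secrets.token_hex(16)
            self._sessions[session_id] = []
        cart = self._sessions[session_id]
        if item is not None:
            cart.append(item)
        return session_id, list(cart)


if __name__ == "__main__":
    store = SessionStore()
    sid, cart = store.handle_request(None, "book")   # first request, no id yet
    sid, cart = store.handle_request(sid, "pen")     # later request, same browser
    print(cart)
```

Each call is independent, just like an HTTP request; the only thing linking them is the identifier the client sends back.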
A Servlet is an object that receives a request and generates a response based on that request. The basic Servlet package defines Java objects to represent servlet requests and responses, as well as objects to reflect the servlet's configuration parameters and execution environment. The package javax.servlet.http defines HTTP-specific subclasses of the generic servlet elements, including session management objects that track multiple requests and responses between the web server and a client. Servlets may be packaged in a WAR file as a web application. The following illustration shows where Java Servlets fit into the architecture of a web application:
Common Gateway Interface (CGI) Common Gateway Interface (CGI) is a standard method used to generate dynamic content on web pages and web applications. CGI, when implemented on a web server, provides an interface between the web server and programs that generate the web content. These programs are known as CGI scripts or simply CGIs; they are usually written in a scripting language, but can be written in any programming language. Put another way, CGI is a specification for transferring information between a World Wide Web server and a CGI program. A CGI program is any program designed to accept and return data that conforms to the CGI specification. The program could be written in any programming language, including C, Perl, Java, or Visual Basic. CGI programs are the most common way for Web servers to interact dynamically with users. Many HTML pages that contain forms, for example, use a CGI program to process the form's data once it's submitted. Another increasingly common way to provide dynamic feedback for Web users is to include scripts or programs that run on the user's machine rather than the Web server. These programs can be Java applets, JavaScript, or ActiveX controls. These technologies are known collectively as client-side solutions, while the use of CGI is a server-side solution because the processing occurs on the Web server. CGI is the part of the Web server that can communicate with other programs running on the server. With CGI, the Web server can call up a program, while passing user-specific data to the program (such as what host the user is connecting from, or input the user has supplied using HTML form syntax). The program then processes that data and the server passes the program's response back to the Web browser. CGI isn't magic; it's just programming with some special types of input and a few strict rules on program output. Everything in between is just programming.
Of course, there are special techniques that are particular to CGI, but underlying it all is the simple model shown below.
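A minimal sketch of a CGI program in Python may help make the "special types of input and strict rules on output" concrete. It assumes the web server passes request data in the standard CGI environment variables QUERY_STRING and REMOTE_ADDR, and that the program must print a header block, a blank line, and then the body; the function name and the greeting are invented for illustration.

```python
import os
import sys
from html import escape
from urllib.parse import parse_qs


def build_response(environ) -> str:
    """Turn the CGI environment into a complete response (headers + body)."""
    # Input: the server passes form data and client details as environment variables.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = escape(query.get("name", ["stranger"])[0])
    host = escape(environ.get("REMOTE_ADDR", "unknown"))
    body = (
        "<html><body>"
        f"<p>Hello, {name}! You are connecting from {host}.</p>"
        "</body></html>"
    )
    # Output rule: headers first, then an empty line, then the content.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # When run by a web server, os.environ carries the request data.
    sys.stdout.write(build_response(os.environ))
```

Everything between reading the environment and writing the response is ordinary programming, which is the point made above.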
Web Browser Much of what your web browser does has already been discussed in preceding pages. However, we will summarise the main functionality here. A browser is software that is used to access the web. A browser lets you visit websites and do activities within them like log in, view multimedia, link from one site to another, visit one page from another, print, and send and receive email, among many other activities. The most common browser software titles on the market are: Microsoft Internet Explorer, Google's Chrome, Mozilla Firefox, Apple's Safari, and Opera. Browser availability depends on the operating system your computer is using (for example: Microsoft Windows, Linux, Ubuntu, Mac OS, among others). When you type a web page address such as www.allaboutcookies.org into your browser, that web page in its entirety is not actually stored on a server ready and waiting to be delivered. In fact each web page that you request is individually created in response to your request. You are actually calling up a list of requests to get content from various resource directories or servers on which the content for that page is stored. It is rather like a recipe for a cake: you have a shopping list of ingredients (requests for content) that, when combined in the correct order, bakes a cake (the web page). The page may be made up of content from different sources. Images may come from one server, text content from another, scripts such as date scripts from another and ads from another. As soon as you move to another page, the page that you have just viewed disappears. This is the dynamic nature of websites. The primary purpose of a web browser is to bring information resources to the user ("retrieval" or "fetching"), allowing them to view the information ("display", "rendering"), and then access other information ("navigation", "following links").
This process begins when the user inputs a Uniform Resource Locator (URL), for example http://en.wikipedia.org/, into the browser. The prefix of the URL, known as the URI scheme, determines how the URL will be interpreted. The most commonly used scheme is http:, which identifies a resource to be retrieved over the Hypertext Transfer Protocol (HTTP). Many browsers also support a variety of other prefixes, such as https: for HTTPS, ftp: for the File Transfer Protocol, and file: for local files. Prefixes that the web browser cannot directly handle are often handed off to another application entirely. For example, mailto: URIs are usually passed to the user's default e-mail application, and news: URIs are passed to the user's default newsgroup reader. In the case of http, https, file, and others, once the resource has been retrieved the web browser will display it. HTML and associated content (image files, formatting information such as CSS, etc.) is passed to the browser's layout engine to be transformed from markup to an interactive document, a process known as "rendering". Aside from HTML, web browsers can generally display any kind of content that can be part of a web page. Most browsers can display images, audio, video, and XML files, and often have plug-ins to support Flash applications and Java applets. Upon encountering a file of an unsupported type or a file that is set up to be downloaded rather than displayed, the browser prompts the user to save the file to disk. Information resources may contain hyperlinks to other information resources. Each link contains the URI of a resource to go to. When a link is clicked, the browser navigates to the resource indicated by the link's target URI, and the process of bringing content to the user begins again.
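The way a browser decides what to do with an address based on its prefix can be sketched with Python's standard urllib.parse module. The two lookup tables and the function name route are assumptions for illustration; they simply mirror the prefixes mentioned above rather than the behaviour of any particular browser.

```python
from urllib.parse import urlsplit

# Prefixes the browser retrieves and displays itself.
BROWSER_SCHEMES = {"http", "https", "ftp", "file"}

# Prefixes handed off to another application.
EXTERNAL_HANDLERS = {
    "mailto": "default e-mail application",
    "news": "default newsgroup reader",
}


def route(uri: str) -> str:
    """Return which program should handle the given URI."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme in BROWSER_SCHEMES:
        return "browser"
    return EXTERNAL_HANDLERS.get(scheme, "unknown")


if __name__ == "__main__":
    for uri in ("http://en.wikipedia.org/", "mailto:someone@example.com", "file:///tmp/a.html"):
        print(uri, "->", route(uri))
```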
Searching the Web Search Engine A web search engine is a site that helps you find other websites. You most probably use search engines such as Google or Yahoo on a daily basis. You enter a query (a list of keywords) that indicates what you are looking for, and the search engine provides a list of ranked links to potential sites. The search engine does not search the web itself as you type; it searches an index/database containing information about millions (more likely billions nowadays) of websites. A good search engine keeps its database up to date (more on this later) and has effective techniques for matching the keywords to the content of a web page. Most search engines compare the keywords entered by the user to a set of keywords that have been indexed about each website. Some search engines index every word on every page in their databases, although eliminating common words like “the”, “a”, “an” etc. Some search engines index only part of a page, such as the title and headings. Some of these techniques are case-sensitive and some are not. Keyword searching is a challenge to carry out effectively because natural languages such as English are ambiguous by their very nature. For example the terms hard cider, hard brick, hard exam and hard drive all use the word hard in different ways. If enough keywords are provided, the search engines will ideally prioritise the matches appropriately. Without context, however, the utility of basic keyword matching is limited. Some search engines perform concept based searches. Concept based searches attempt to figure out the context of a search. When they work well, they return links that contain content that relates to your search topic whether they contain the keywords you used or not. Concept based searches generally rely on complex linguistic theories. The basic premise is called clustering, which compares words to other words found in close proximity.
For example, if the word “heart” is used in a medical sense, it may be near words like artery, cholesterol and blood. These types of search are by no means perfect, as they are far more complicated than the standard keyword search. Even so, they have far more potential to be effective as the techniques used improve. Surface vs Deep Web This may come as a shock: Google cannot find everything. Because of the way that it indexes the web, this is not possible, for a variety of reasons (more on this later). A lot of information on the web is stored in searchable databases that produce results dynamically in response to a request. The programs that index the web for Google (spiders/web crawlers) are not capable of making these requests. Therefore, Google can only index the surface web. The Deep Web is a part of the internet not accessible to link-crawling search engines like Google. The only way a user can access this portion of the internet is by typing a directed query into a web
search form, thereby retrieving content within a database that is not linked. In layman’s terms, the only way to access the Deep Web is by conducting a search that is within a particular website. The Surface Web is the internet that can be found via link-crawling techniques; link-crawling means linked data can be found via a hyperlink from the homepage of a domain. Google can find this Surface Web data. Mike Bergman, founder of BrightPlanet and credited with coining the phrase, said that searching on the Internet today can be compared to dragging a net across the surface of the ocean: a great deal may be caught in the net, but there is a wealth of information that is deep and therefore missed. Most of the Web's information is buried far down on dynamically generated sites, and standard search engines do not find it. Traditional search engines cannot "see" or retrieve content in the deep Web—those pages do not exist until they are created dynamically as the result of a specific search. Surface Web search engines (Google/Bing/Yahoo!) can lead you to websites that have unstructured Deep Web content. Think of searching for government grants; most researchers start by searching “government grants” in Google, and find few specific listings for government grant sites that contain databases. Google will direct researchers to the website www.grants.gov, but not to specific grants within the website’s database. Today’s internet stands at an estimated 555 million domains, each containing thousands or millions of unique web pages. The Deep Web is at least 400-500 times the size of the Surface Web. As the web continues to grow, so too will the Deep Web and the value attained from the Deep Web Content. An example These two images demonstrate the differences between the Deep Web and the Surface Web. The image on the left is a search of what Google has indexed. 
The query (BrightPlanet AND “Steve Pederson” site:argusleader.com) tells Google that the only results we want are from the Argus Leader domain. The search returns zero web pages that have been indexed by Google containing both BrightPlanet AND “Steve Pederson”.
The image on the right proves that results containing both terms do exist. This search is performed using the search box provided by the Argus Leader website. The reason why this search returns results is because the search box points to the newspaper’s database, a Deep Web source. Archived content can only be accessed via the Argus Leader’s search, making that content exclusive to the
Deep Web. Google does not direct queries into any site searches, as it only finds documents via link following. The news article has fallen into the Deep Web. Interesting Facts about the Deep Web
On average, deep Web sites receive about 50% greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet search public
The deep Web is the largest growing category of new information on the Internet
Deep Web sites tend to be narrower with deeper content than conventional surface sites
Total quality content of the deep Web is at least 1,000 to 2,000 times greater than that of the surface Web
Deep Web content is highly relevant to every information need, market and domain
More than half of the deep Web content resides in topic specific databases
The Dark Web and Deep Web are not the same thing! The Dark Web refers to any web page that has been concealed to hide in plain sight or reside within a separate, but public layer of the standard internet. The internet is built around web pages that reference other web pages; if you have a destination web page which has no inbound links you have concealed that page and it cannot be found by users or search engines. One example of this would be a blog posting that has not been published yet. The blog post may exist on the public internet, but unless you know the exact URL, it will never be found. Other examples of Dark Web content and techniques include:
Search boxes that will reveal a web page or answer if a special keyword is searched. Try this by searching “distance from Sioux Falls to New York” on Google.
Sub-domain names that are never linked to; for example, “internal.brightplanet.com”
Relying on special HTTP headers to show a different version of a web page
Images that are published but never actually referenced, for example “/image/logo_back.gif”
Virtual private networks are another aspect of the Dark Web that exists within the public internet, which often requires additional software to access. TOR (The Onion Router) is a great example. Hidden within the public web is an entire network of different content which can only be accessed by using the TOR network. While personal freedom and privacy are admirable goals of the TOR network, the ability to traverse the internet with complete anonymity nurtures a platform ripe for what is considered illegal activity in some countries, including:
Controlled substance marketplaces
Armories selling all kinds of weapons
Child pornography
Unauthorized leaks of sensitive information
Money laundering
Copyright infringement
Credit Card fraud and identity theft
Users must use an anonymizer to access TOR Network/Dark Web websites. The Silk Road, an online marketplace/infamous drug bazaar on the Dark Web, is inaccessible using a normal search engine or web browser. Web Crawlers As mentioned previously, the deep web is largely caused by the way that search engines index the web. These search engines use web crawlers (also referred to as bots, spiders, ants or automatic indexers). Web crawlers systematically browse the internet for the purpose of creating an index of it. A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is performing archiving of websites it copies and saves the information as it goes. The large volume of the Web implies that the crawler can only download a limited number of the Web pages within a given time, so it needs to prioritize its downloads. The number of possible URLs generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 4 x 3 x 2 x 2 = 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content. As Edwards et al.
noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."[5] A crawler must carefully choose at each step which pages to visit next. You can find out a bit more about how Google does crawling here. Google indexes around 1 trillion unique URLs in sweeps known as ‘Google Dances’. Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second. When the Google spider looked at an HTML page, it took note of two things:
- The words within the page
- Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
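The seed list, the crawl frontier and the word-and-location index described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TOY_WEB dictionary stands in for real HTTP downloads, the example.com-style addresses and the function names crawl and build_index are made up, and the breadth-first visiting order is just one possible crawl policy, not a description of how Google's spiders actually work.

```python
from collections import deque

# Hypothetical toy "web": URL -> (page text, outgoing links). A real crawler
# would fetch pages over HTTP and parse the hyperlinks out of the HTML.
TOY_WEB = {
    "a.example/": ("The web is a library", ["a.example/books", "b.example/"]),
    "a.example/books": ("a library of books", ["a.example/"]),
    "b.example/": ("an index of the web", ["a.example/books", "c.example/"]),
    "c.example/": ("books and an index", []),
}

ARTICLES = {"a", "an", "the"}  # words left out of the index, as in the text


def crawl(seeds, max_pages=100):
    """Visit pages breadth-first from the seeds; return URLs in visit order."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # every URL ever queued, so none is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url not in TOY_WEB:    # dead link: nothing to download
            continue
        visited.append(url)
        _text, links = TOY_WEB[url]
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


def build_index(urls):
    """Map each indexed word to the (url, word position) pairs where it occurs."""
    index = {}
    for url in urls:
        text, _links = TOY_WEB[url]
        for position, word in enumerate(text.lower().split()):
            if word in ARTICLES:
                continue          # skipped, but positions still count them
            index.setdefault(word, []).append((url, position))
    return index


order = crawl(["a.example/"])
index = build_index(order)
print(order)
print(index["library"])
```

Starting from the single seed, the crawler reaches all four toy pages, and the index records that "library" occurs on two of them along with its position on each page. Swapping the deque for a priority queue would be one way to express the prioritization policies mentioned above.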
Parallel Web Crawling
A parallel crawler is a crawler that runs multiple processes in parallel. As the size of the Web grows, it becomes more difficult to retrieve the whole or a significant portion of the Web using a single process, so many search engines run multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page. Because the same URL can be found by two different crawling processes, the crawling system requires a policy for assigning the new URLs discovered during the crawl.
Issues with parallel web crawling:
- Overlap: When multiple processes run in parallel to download pages, it is possible that different processes download the same page multiple times. One process may not be aware that another process has already downloaded the page. Clearly, such multiple downloads should be minimized to save network bandwidth and increase the crawler’s effectiveness.
- Quality: Often, a crawler wants to download “important” pages first, in order to maximize the “quality” of the downloaded collection. However, in a parallel crawler, each process may not be aware of the whole image of the Web that they have collectively downloaded so far. For this reason, each process may make a crawling decision based solely on its own image of the Web (the part that it has downloaded itself) and thus make a poor crawling decision.
- Communication bandwidth: In order to prevent overlap, or to improve the quality of the downloaded pages, crawling processes need to periodically communicate to coordinate with each other. However, this communication may grow significantly as the number of crawling processes increases.
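One simple way to picture an assignment policy that avoids overlap without constant communication is to hash each URL's host name and let the hash decide which process owns it. The sketch below is an assumption made for illustration, not a policy taken from any particular search engine; the function names assign_process and route are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_process(url, num_processes):
    """Pick the crawling process responsible for a URL from its host name."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # The same host always hashes to the same process, so two processes
    # that discover the same URL agree on who should download it.
    return int.from_bytes(digest[:8], "big") % num_processes


def route(discovered_urls, num_processes):
    """Group newly discovered URLs by the process that should download them."""
    queues = {p: [] for p in range(num_processes)}
    for url in discovered_urls:
        queues[assign_process(url, num_processes)].append(url)
    return queues


found = [
    "http://example.com/a",
    "http://EXAMPLE.com/b",
    "http://example.org/",
    "http://example.net/page",
]
for process, urls in route(found, 4).items():
    print(process, urls)
```

Because ownership is computed rather than negotiated, a process only needs to forward URLs it does not own. The trade-off is that a purely hash-based split ignores geography, so it would not give the network-load benefits that a regional split could.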
Whilst implementing a parallel crawler has its challenges, there are numerous benefits if it can be done.
Benefits of parallel web crawling:
- Scalability: Due to the enormous size of the Web, it is often imperative to run a parallel crawler. A single-process crawler simply cannot achieve the required download rate in certain cases.
- Network-load dispersion: Multiple crawling processes of a parallel crawler may run at geographically distant locations, each downloading “geographically-adjacent” pages. For example, a process in Germany may download all European pages, while another one in Japan crawls all Asian pages. In this way, we can disperse the network load to multiple regions. In particular, this dispersion might be necessary when a single network cannot handle the heavy load from a large-scale crawl.
- Network-load reduction: In addition to dispersing the load, a parallel crawler may actually reduce the network load. For example, assume that a crawler in North America retrieves a page from Europe. To be downloaded by the crawler, the page first has to go through the network in Europe, then the Europe-to-North America inter-continental network and finally the network in North America. Instead, if a crawling process in Europe collects all European pages, and if another process in North America crawls all North American pages, the overall network load will be reduced, because pages go through only “local” networks. Of course, this information will likely need to be aggregated to create a central index, but the data can be compressed, or it can be compared with the existing index so that only the ‘difference’ is sent.
Meta Tags and Web Crawlers
Meta tags allow the owner of a page to specify key words and concepts under which the page will be indexed. This can be helpful, especially in cases in which the words on the page might have double or triple meanings; the meta tags can guide the search engine in choosing which of the several possible meanings for these words is correct. There is, however, a danger in over-reliance on meta tags, because a careless or unscrupulous page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page. To protect against this, spiders will correlate meta tags with page content, rejecting the meta tags that don't match the words on the page.
All of this assumes that the owner of a page actually wants it to be included in the results of a search engine's activities. Many times, the page's owner doesn't want it showing up on a major search engine, or doesn't want the activity of a spider accessing the page. Consider, for example, a game that builds new, active pages each time sections of the page are displayed or new links are followed. If a Web spider accesses one of these pages, and begins following all of the links for new pages, the game could mistake the activity for a high-speed human player and spin out of control. To avoid situations like this, the robot exclusion protocol was developed.
This protocol, implemented in the meta-tag section at the beginning of a Web page, tells a spider to leave the page alone: to neither index the words on the page nor try to follow its links.
Some Examples of Meta Tags
Meta description tag
Example: <meta name="description" content="Content excerpt or brief summary goes here." />
The description field should be a very brief summary of the page’s content (two sentences or fewer), including one or two targeted keyphrases. It’s generally advised to keep the description to 160 characters (including spaces) or fewer, as most search engines cut off any text beyond this limit.
Meta keywords tag
Example: <meta name="keywords" content="google, seo, content marketing" />
The keywords tag, now deprecated in most search engines, served as a supplement to the description tag to help crawlers correctly assess page content. The tag was deprecated because of an abuse known as keyword stuffing: placing popular topic keywords in the tag to artificially move the page up the rankings.
Meta robots tag
Example: <meta name="robots" content="index,follow" />
The robots tag allows a webmaster to prevent a search engine from indexing a page (for example, with content="noindex"), in place of an entry in the website’s robots.txt file. In the past, this tag has also been used to opt web pages out of web directories such as the Open Directory Project, as spammers frequently use the directories to coordinate their attacks.
Web Indexing in Search Engines
Web indexing (or Internet indexing) refers to various methods for indexing the contents of a website or of the Internet as a whole. Individual websites or intranets may use a back-of-the-book index, while search engines usually use keywords and metadata to provide a more useful vocabulary for Internet or onsite searching. With the increase in the number of periodicals that have articles online, web indexing is also becoming important for periodical websites.
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.
Metadata web indexing involves assigning keywords or phrases to web pages or web sites within a meta-tag field, so that the web page or web site can be retrieved with a search engine that is customized to search the keywords field. This may or may not involve using keywords restricted to a controlled vocabulary list. This method is commonly used in search engine indexing.
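To make the meta-tag field concrete, here is a minimal sketch of how an indexer might read the description, keywords and robots tags from a page using Python's standard html.parser module. The class name MetaTagParser and the helpers read_meta and may_index are hypothetical names chosen for this example, and the rule in may_index is a simplification that only looks for the widely used noindex and none values.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def read_meta(html):
    """Return a dict of meta tag names to their content strings."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


def may_index(meta):
    """Simplified check: False if the robots tag asks not to be indexed."""
    directives = {d.strip().lower() for d in meta.get("robots", "").split(",")}
    return "noindex" not in directives and "none" not in directives


SAMPLE = """<html><head>
<meta name="description" content="Content excerpt or brief summary goes here." />
<meta name="keywords" content="google, seo, content marketing" />
<meta name="robots" content="noindex,follow" />
</head><body></body></html>"""

meta = read_meta(SAMPLE)
keywords = [k.strip() for k in meta["keywords"].split(",")]
print(keywords)
print(may_index(meta))
```

A search engine customized to search the keywords field could store the keywords list against the page's URL. In this sample the robots tag says noindex, so a well-behaved indexer would skip the page.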
The web is like an ever-growing public library with billions of books and no central filing system. Search engines, as we saw earlier, essentially gather the pages during the crawl process and then create an index, so we know exactly how to look things up. Much like the index in the back of a book, the Google index includes information about words and their locations. When you search, at the most basic level, algorithms look up your search terms in the index to find the appropriate pages. The search process gets much more complex from there. When you search for “dogs” you don’t want a page with the word “dogs” on it hundreds of times. You probably want pictures, videos or a list of breeds. Google’s indexing systems note many different aspects of pages, such as when they were published, whether they contain pictures and videos, and much more. Search engines are continuing to go beyond keyword matching to better understand the people, places and things you care about.
Ensuring you appear in search results
Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link. Offer a site map to your users with links that point to the important parts of your site. If the site map has an extremely large number of links, you may want to break the site map into multiple pages. Keep the links on a given page to a reasonable number.
Create a useful, information-rich site, and write pages that clearly and accurately describe your content. Think about the words users would type to find your pages, and make sure that your site actually includes those words within it. Try to use text instead of images to display important names, content, or links. The Google crawler doesn't recognize text contained in images. If you must use images for textual content, consider using the "ALT" attribute to include a few words of descriptive text. Make sure that your <title> elements and ALT attributes are descriptive and accurate. Check for broken links and correct HTML. If you decide to use dynamic pages (i.e., the URL contains a "?" character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.
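To see why the number of URL parameters matters to a crawler, the photo gallery example from the Web Crawlers section can be enumerated directly. The parameter names, values and the gallery.example address below are made up for illustration; only the counts (four sort orders, three thumbnail sizes, two formats, one on/off switch) come from the text.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical parameter names and values for the photo gallery example.
OPTIONS = {
    "sort": ["date", "name", "size", "rating"],  # four ways to sort
    "thumb": ["small", "medium", "large"],       # three thumbnail sizes
    "format": ["jpg", "png"],                    # two file formats
    "user_content": ["on", "off"],               # user-provided content switch
}


def gallery_urls(base="http://gallery.example/photos"):
    """Return every URL that a combination of the options can produce."""
    names = list(OPTIONS)
    return [
        base + "?" + urlencode(dict(zip(names, values)))
        for values in product(*(OPTIONS[name] for name in names))
    ]


urls = gallery_urls()
print(len(urls))  # 4 * 3 * 2 * 2 = 48
print(urls[0])
```

All 48 addresses lead to the same set of photos, so every extra parameter multiplies the work a spider has to do. Adding one more two-valued option would double the count to 96.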
Some things that can cause your removal from search results:
Automatically generated (or “auto-generated”) content is content that has been generated programmatically. Often this will consist of paragraphs of random text that make no sense to the reader but which may contain search keywords.
Any links intended to manipulate PageRank (more on this later) or a site's ranking in Google search results may be considered part of a link scheme and a violation of Google’s Webmaster Guidelines. This includes any behaviour that manipulates links to your site or outgoing links from your site. The following are examples of link schemes which can negatively impact a site's ranking in search results:
- Buying or selling links that pass PageRank. This includes exchanging money for links, or posts that contain links; exchanging goods or services for links; or sending someone a “free” product in exchange for them writing about it and including a link
- Excessive link exchanges ("Link to me and I'll link to you") or partner pages exclusively for the sake of cross-linking
- Large-scale article marketing or guest posting campaigns with keyword-rich anchor text links
- Using automated programs or services to create links to your site
Cloaking refers to the practice of presenting different content or URLs to human users and search engines. Cloaking is considered a violation of Google’s Webmaster Guidelines because it provides Google's users with different results than they expected. Some examples of cloaking include:
- Serving a page of HTML text to search engines, while showing a page of images or Flash to users
- Inserting text or keywords into a page only when the User-agent requesting the page is a search engine, not a human visitor
Redirecting is the act of sending a visitor to a different URL than the one they initially requested. There are many good reasons to redirect one URL to another, such as when moving your site to a new address, or consolidating several pages into one.
However, some redirects deceive search engines or display content to human users that is different than that made available to crawlers. It's a violation of Google Webmaster Guidelines to redirect a user to a different page with the intent to display content other than what was made available to the search engine crawler. When a redirect is implemented in this way, a search engine might index the original page rather than follow the redirect, while users are taken to the redirect target. Like cloaking, this practice is deceptive because it attempts to display different content to users and to Googlebot, and can take a visitor somewhere other than where they expected to go. Some examples of sneaky redirects include:
- Search engines are shown one type of content while users are redirected to something significantly different
- Desktop users receive a normal page, while mobile users are redirected to a completely different spam domain
Hiding text or links in your content to manipulate Google’s search rankings can be seen as deceptive and is a violation of Google’s Webmaster Guidelines. Text (such as excessive keywords) can be hidden in several ways, including:
- Using white text on a white background
- Locating text behind an image
- Using CSS to position text off-screen
- Setting the font size to 0
- Hiding a link by only linking one small character (for example, a hyphen in the middle of a paragraph)
Doorway pages are typically large sets of poor-quality pages where each page is optimized for a specific keyword or phrase. In many cases, doorway pages are written to rank for a particular phrase and then funnel users to a single destination. Whether deployed across many domains or established within one domain, doorway pages tend to frustrate users. Therefore, Google frowns on practices that are designed to manipulate search engines and deceive users by directing them to sites other than the one they selected, and that provide content solely for the benefit of search engines. Google may take action on doorway sites and other sites making use of these deceptive practices, including removing these sites from Google’s index. Some examples of doorways include:
- Having multiple domain names targeted at specific regions or cities that funnel users to one page
- Templated pages made solely for affiliate linking
- Multiple pages on your site with similar content designed to rank for specific queries like city or state names
Some webmasters use content taken (“scraped”) from other, more reputable sites on the assumption that increasing the volume of pages on their site is a good long-term strategy regardless of the relevance or uniqueness of that content. Purely scraped content, even from high-quality sources, may not provide any added value to your users without additional useful services or content provided by your site; it may also constitute copyright infringement in some cases. It's worthwhile to take the time to create original content that sets your site apart. This will keep your visitors coming back and will provide more useful results for users searching on Google.
Some examples of scraping include:
- Sites that copy and republish content from other sites without adding any original content or value
- Sites that copy content from other sites, modify it slightly (for example, by substituting synonyms or using automated techniques), and republish it
- Sites that reproduce content feeds from other sites without providing some type of unique organization or benefit to the user
- Sites dedicated to embedding content such as video, images, or other media from other sites without substantial added value to the user
"Keyword stuffing" refers to the practice of loading a webpage with keywords or numbers in an attempt to manipulate a site's ranking in Google search results. Often these keywords appear in a list or group, or out of context (not as natural prose). Filling pages with keywords or numbers results in a negative user experience, and can harm your site's ranking. Focus on creating useful, information-rich content that uses keywords appropriately and in context. Examples of keyword stuffing include:
- Lists of phone numbers without substantial added value
- Blocks of text listing cities and states a webpage is trying to rank for
- Repeating the same words or phrases so often that it sounds unnatural, for example: We sell custom cigar humidors. Our custom cigar humidors are handmade. If you’re thinking of buying a custom cigar humidor, please contact our custom cigar humidor specialists at custom.cigar.humidors@example.com.
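A rough feel for how unnatural the humidor example is can be had by measuring how much of the text is taken up by one repeated phrase. Search engines do not publish how they detect keyword stuffing, so the density measure below is purely an illustrative assumption, and the function name phrase_density is made up for this sketch. The sample is a shortened copy of the example above, without the email address.

```python
import re


def phrase_density(text, phrase):
    """Return the share of words in `text` taken up by repeats of `phrase`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    # Count every position where the phrase appears as consecutive words.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits * n / len(words) if words else 0.0


SAMPLE = (
    "We sell custom cigar humidors. Our custom cigar humidors are handmade. "
    "If you're thinking of buying a custom cigar humidor, please contact our "
    "custom cigar humidor specialists."
)

density = phrase_density(SAMPLE, "custom cigar")
print(round(density, 2))
```

Here the two-word phrase accounts for 8 of the 27 words, close to a third of the passage, which is far more repetition than natural prose would usually contain.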
Distributing content or software on your website that behaves in a way other than what a user expected is a violation of Google’s Webmaster Guidelines. This includes anything that manipulates content on the page in an unexpected way, or downloads or executes files on a user’s computer without their consent. Google not only aims to give its users the most relevant search results for their queries, but also to keep them safe on the web. Some examples of malicious behaviour include:
- Changing or manipulating the location of content on a page, so that when a user thinks they’re clicking on a particular link or button the click is actually registered by a different part of the page
- Injecting new ads or pop-ups on pages, or swapping out existing ads on a webpage with different ads; or promoting or installing software that does so
- Including unwanted files in a download that a user requested
- Installing malware, trojans, spyware, ads or viruses on a user’s computer
- Changing a user’s browser homepage or search preferences without the user’s informed consent
Google's Terms of Service do not allow the sending of automated queries of any sort to its systems without express permission in advance from Google. Sending automated queries consumes resources, and the restriction includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of Google's Webmaster Guidelines and Terms of Service.
Appearing Higher up the page rankings
Research has shown that users typically do not go past the first ten results given from a search. Creating a web page that automatically rises to the top of the search rankings is actually very difficult because of the items that search engines place importance on. Let us take Google as the example here. The company's influence on the Web is undeniable. Practically every webmaster wants his or her site listed high on Google's search engine results pages (SERPs), because it almost always translates into more traffic on the corresponding Web site. In order to do this, you need to know how Google ‘ranks’ sites. This is the stage of deciding in what order ‘hits’ should be served to a user after they have made a query. We will look at PageRank in more detail later, but some of the fundamental aspects of it are as follows:
- The number of times a keyword appears will have an effect on its importance. A word that appears once is likely to matter less to a page than one that appears several times. However, web designers must be wary of ‘keyword stuffing’, as mentioned previously.
- Google thinks that the length of time a page has existed is important. New web pages spring up on a daily basis and some disappear as quickly as they were created. Google is in favour of pages with an established history.
- Each page that links to your site acts like a vote for its importance. Even better if those sites are well respected and have high PageRanks themselves. Think of it this way: if Van Gogh told you he liked your artwork, it would carry more weight than if your friend told you the same.
Of course, these principles can be abused as we saw earlier and Google keeps an eye out for any artificial manipulation of the contributory factors to PageRank.
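The idea that links act as votes, and that a vote from a highly ranked page counts for more, can be illustrated with a simplified textbook-style PageRank iteration. This is a teaching sketch under stated assumptions, not Google's actual ranking system: the damping value of 0.85, the four-page link graph and the function name pagerank are all chosen for illustration, and keyword counts and page age are left out entirely.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively score pages; `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # pages without outgoing links pass nothing on here
            # A page splits its current score evenly between its links.
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical four-page web: nobody links to B or D.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top because three pages link to it. Page A has only one incoming link, but it comes from C, so A ends up far ahead of B and D, which receive no links at all.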
What are the metrics used by the major search engines to determine the value of External Links? Today, the major search engines use many metrics to determine the value of external links. Some of these metrics include:
- The trustworthiness of the linking domain.
- The popularity of the linking page.
- The relevancy of the content between the source page and the target page.
- The anchor text used in the link.
- The number of links to the same page on the source page.
- The number of domains that link to the target page.
- The number of variations used as anchor text in links to the target page.
- The ownership relationship between the source and target domains.
In addition to these metrics, external links are important for two main reasons:
1. Popularity
Whereas traffic is a "messy" metric and difficult for search engines to measure accurately (according to Yahoo! search engineers), external links are both a more stable metric and an easier metric to measure. This is because traffic numbers are buried in private server logs while external links are publicly visible and easily stored. For this reason and others, external links are a great metric for determining the popularity of a given web page. This metric (which is roughly similar to toolbar PageRank) is combined with relevancy metrics to determine the best results for a given search query.
2. Relevancy
Links provide relevancy clues that are tremendously valuable for search engines. The anchor text used in links is usually written by humans (who can interpret web pages better than computers) and is usually highly reflective of the content of the page being linked to. Many times this will be a short phrase (e.g. "best aircraft article") or the URL of the target page (e.g. http://www.best-aircraft-articles.com). The target and source pages and domains cited in a link also provide valuable relevancy metrics for search engines. Links tend to point to related content. This helps search engines establish knowledge hubs on the Internet that they can then use to validate the importance of a given web document.
Comparing Search Engines
Yahoo!
- has been in the search game for many years
- is better than MSN but nowhere near as good as Google at determining if a link is a natural citation or not
- has a ton of internal content and a paid inclusion program, both of which give them incentive to bias search results toward commercial results
- things like cheesy off-topic reciprocal links still work great in Yahoo!
Bing
- new to the search game (Bing is the successor to MSN Search)
- is bad at determining if a link is natural or artificial in nature
- because their link analysis is weak, they place too much weight on the page content
- their poor relevancy algorithms cause a heavy bias toward commercial results
- likes bursty recent links
- new sites that are generally untrusted in other systems can rank quickly in MSN Search
- things like cheesy off-topic reciprocal links still work great in MSN Search
Google
- has been in the search game a long time, and saw the web graph when it was much cleaner than the current web graph
- is much better than the other engines at determining if a link is a true editorial citation or an artificial link
- looks for natural link growth over time
- heavily biases search results toward informational resources
- trusts old sites way too much
- a page on a site or subdomain of a site with significant age or link-related trust can rank much better than it should, even with no external citations
- they have aggressive duplicate content filters that filter out many pages with similar content
- if a page is obviously focused on a term they may filter the document out for that term, so on-page variation and link anchor text variation are important. A page with a single reference or a few references of a modifier will frequently outrank pages that are heavily focused on a search phrase containing that modifier
- crawl depth is determined not only by link quantity, but also by link quality. Excessive low-quality links may make your site less likely to be crawled deep or even included in the index
- things like cheesy off-topic reciprocal links are generally ineffective in Google when you consider the associated opportunity cost
Ask
- looks at topical communities
- due to their heavy emphasis on topical communities they are slow to rank sites until they are heavily cited from within their topical community
- due to their limited market share they probably are not worth paying much attention to unless you are in a vertical where they have a strong brand that drives significant search traffic