Data Transfer Across the Internet

This is a simple primer describing how data travels across the internet. Please note that I am not a network engineer, computer scientist, or in any other way professionally qualified to talk about computers and data transfer, so this is just my understanding. Use this information at your own risk. That said, I've been a computer user for many, many years, and over time I've familiarized myself with some basics of how data moves from one computer to another over the internet. You shouldn't rely on this if you're planning to take any exams on the subject, but if you just have a general curiosity about the process, as I do, read on.

All data transfer between computers occurs in binary form, that is, as a sequence of 0's and 1's. The physical means of transfer may be copper or aluminum wires, optical fiber, or electromagnetic radiation (usually in the microwave range). In all cases, the 0's and 1's are encoded as some physical signal. For example, if transmission occurs across a copper wire, the 0's and 1's are represented by voltage levels on the wire: some predefined voltage threshold (for example, roughly 2.5 volts for ethernet over Cat5 cable) is considered a 1, and the absence of that voltage is considered a 0. In the same way, the presence of light of a certain wavelength, above a pre-specified intensity, in a fiber cable may be considered a 1, and its absence a 0. There are further details about how long the signal (the voltage, the light, etc.) must be present before it can be called a 0 or a 1, but these are outside the scope of this discussion. To summarize: data transmission across a physical medium occurs through the modulation of a signal (light, electricity, etc.), where specific values of the signal translate to 0's and 1's.
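
To make this concrete, here's a tiny sketch of the sequence of 0's and 1's a computer would need to transmit for a short piece of text. (I'll use Python for the little sketches throughout this article; the signalling details vary by medium, but the underlying bit sequence is the same.)

    # Show the sequence of 0's and 1's that represents a short message.
    # (Illustrative only -- real links add framing, encoding and error checking.)
    message = "Hi"
    for byte in message.encode("ascii"):
        print(byte, format(byte, "08b"))
    # Output:
    # 72 01001000
    # 105 01101001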

When you send an email, or click on a link in your web browser, a series of events must occur before any information even leaves your computer. Many different pieces of software are involved in something as simple as sending an email. For example:

- Your email program formats the message according to a mail protocol such as SMTP, adding addresses, a subject line and other headers.
- The name of your mail server has to be looked up and translated into a numeric address that the network understands.
- The operating system's networking software is asked, through its programming interface, to open a connection to that server.
- The message is broken up into packets, each stamped with the sender's and recipient's addresses.
- The driver for your network card converts those packets into the electrical, optical or radio signals that actually leave the machine.

As you can see, the simple process of sending an email is actually quite complex, and involves much more software than just your email application or web browser. The operating system itself (Windows, Linux, Mac OS) provides much of the functionality needed to perform some of the tasks listed above.

The OSI Model

In order to understand the steps that data must go through before it can be transmitted, people think of the process in terms of a standardized model. This model is called the Open Systems Interconnection (OSI) model, and it divides the process into 7 stages, or "layers". The term "layer" is very useful, because it indicates that we should think of the whole process as a hierarchy. There are levels in the hierarchy, with the top level consisting of an internet-enabled application or software, such as your email client. The next layer down in the hierarchy takes data from the top layer (your email application), does something to the data, then moves it along to the next lower layer. In this way, data moves down through the layers until it reaches the lowermost layer, which is the physical device that actually moves the data, meaning the network card in your computer, or a modem.

The purpose of this layered approach is to form a pipeline that can be used by many different applications. Higher level applications such as email clients do not need to understand the nitty gritty details of networking; they just need to be able to present data in a simple, high-level format to the next layer. Note that the OSI model is a theoretical model. Not all protocols used in networking fall neatly within a given layer of the OSI model. This is especially true for the higher layers (Application, Presentation, Session), but can be true for any layer, so you'll often come across protocols that seem to span functions from multiple layers. For this reason, it's important not to take the OSI model too literally. It's an important aid in understanding how networking works, but many networking protocols do not follow the OSI model exactly.

The OSI Reference Model

The OSI model is divided into 7 layers, as described below. Generally, it's helpful to divide these 7 layers into two groups. The upper group consists of the top 3 layers: the application, presentation and session layers (layers 7, 6 and 5). This group is not much concerned with data transmission, but rather with user interaction; it includes the actual application that the user interfaces with, for example a web browser, an email client, a telnet window, or an IRC client. The lower group consists of the remaining 4 layers: transport, network, data-link and physical (layers 4 down to 1). These layers are concerned with the actual data transfer, and some parts of them may actually be implemented in hardware rather than software. I've included Layer 4 (the transport layer) in the lower group, but some people might disagree. This layer occupies a hazy borderline between the two groups, which will become clearer as we get into how it works.

Application Layer (Layer 7)

This is the top layer of the model, and includes the actual software or application that generates or requests the data. This might include your email client, your web browser, a multimedia player such as WinAmp or Quicktime, or any other application that needs to send or receive information. It's important to note that there's a difference between what's considered an application in OSI terminology, and the usual meaning of application. In regular usage, pretty much any software that you interact with is an application. This includes your web browser and your email program, but it also includes things like your text editor, or a photo-editing program. In the OSI model, the latter two would not be considered "applications".

An application, in OSI terminology, is some software that specifically understands some standard application layer protocol and uses it. For example, one application layer protocol is HTTP (Hyper Text Transfer Protocol) which is used for transferring web pages. Since your browser understands HTTP and uses it, it would be considered an application according to the OSI model. Another example of an application protocol might be SMTP (Simple Mail Transfer Protocol). So any software that understands and uses SMTP, such as your email client, would also be considered an application in the OSI model. However, a text editor such as Windows Notepad isn't an application. You might actually use the text editor to read a file on another computer over your network, but the text editor is not aware that the file is not local. It simply sees the file and opens it for editing. On the other hand, the software that actually enables your text editor to "see" the file (such as the software used for disk sharing), is an application according to the OSI model, because it uses some standard application layer protocol to initiate the process of fetching the file from the other computer.
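
As a concrete illustration, here's a minimal sketch of a program that speaks an application layer protocol. Because it understands and uses HTTP, it counts as an "application" in the OSI sense, even though it has no user interface at all. The host name is just an example.

    # A minimal HTTP client: this qualifies as an OSI "application"
    # because it speaks a standard application layer protocol (HTTP).
    import http.client

    conn = http.client.HTTPConnection("example.com", 80)    # example host
    conn.request("GET", "/")                                 # an HTTP request
    response = conn.getresponse()
    print(response.status, response.reason)                  # e.g. "200 OK"
    print(response.read()[:100])                             # first bytes of the page
    conn.close()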

Note that many protocols operate at multiple levels. For example, SMTP operates at the Application Layer, where it performs higher level functions such as formatting the email with addresses, a subject line, and other headers such as CC: (copy to) and BCC: (blind copy to). In addition, SMTP also operates at the next layer down (the Presentation Layer), where it performs tasks relevant to that layer, as described below.

Presentation Layer (Layer 6)

This layer can perform several functions, but all of them are optional, which means that some networking stacks might not even have a distinct Presentation layer. Other systems may have the Application and Presentation Layers (and sometimes the Session Layer as well) collapsed into a single layer. Typical tasks of the Presentation Layer include:

- translating data between different character encodings or formats, so that the two ends can understand each other;
- compressing data before transmission, and decompressing it on receipt;
- encrypting data before transmission, and decrypting it on receipt;
- converting application data structures into a flat stream of bytes suitable for transmission (and back again).

Other programs which operate at this level may include file system drivers, which are part of the operating system.
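
Here's a small sketch of the kinds of transformations the Presentation layer is concerned with: character encoding and compression. The calls are standard Python library functions; in a real stack these steps would be done by the networking software or by the application itself.

    # Presentation-layer style transformations: encode text as bytes, compress,
    # then reverse both steps at the "receiving" end.
    import zlib

    text = "An example message with some repetition, repetition, repetition."
    encoded = text.encode("utf-8")           # character encoding -> bytes
    compressed = zlib.compress(encoded)      # compression before transmission

    # ... bytes travel across the network ...

    received = zlib.decompress(compressed)   # decompression on receipt
    print(received.decode("utf-8") == text)  # True: the original text is recovered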

Session Layer (Layer 5)

This layer is so called because it initiates and controls sessions - that is, periods in which the computer talks to the network. Many forms of network communication happen in sessions. The nature and length of a session depends upon the application. For example, the session of a web browser might simply be the process of downloading all the content from a single web page. Note that a single web page may contain multiple types of content, including text, graphics, multimedia files such as audio/video, javascript, etc. All of these together would form one session, because they are all on a single web page, and the process of loading that web page involves loading all those types of content.

A different type of session might be when you open a Telnet or SSH connection to another computer. These are interactive sessions, where you start by typing in your userID/password to log in, then communicate with the other computer for some period of time, and finally, terminate the session. Throughout the session, you are talking to some specific computer, and it's the Session layer's job to initiate, maintain and terminate this session. The same is true for other interactive sessions, such as IRC.

The Session layer is programmatically implemented in the form of APIs (Application Programming Interfaces) or sockets. These are not really protocols; they are programming concepts. For example, your IRC program might open a socket to a certain IRC server - this is a form of session management, since all data over that socket (including all channels you are connected to on that server) is bound to that socket. The IRC program might open another socket to a different IRC server. The API concept is similar - an application that needs to send or receive data over the Internet engages the API of some networking software (usually part of the operating system) and sends or requests the data it needs. It does this when it needs to start a session, maintain it, or terminate it.
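
Below is a minimal sketch of session management through the sockets API. The server name and port are placeholders; the point is simply that opening the socket starts the session, the socket carries all the traffic for that session, and closing it ends the session.

    # Session management via the sockets API: open, use, and close a session.
    import socket

    # Opening the socket initiates the session (the server name is hypothetical).
    sock = socket.create_connection(("irc.example.net", 6667))

    # All traffic for this session flows over this one socket.
    sock.sendall(b"NICK somebody\r\n")
    sock.sendall(b"USER somebody 0 * :Some Body\r\n")
    reply = sock.recv(4096)      # read whatever the server sends back
    print(reply[:80])

    # Closing the socket terminates the session.
    sock.close()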

As mentioned previously, the Session layer might be folded into the two higher layers. In fact, many protocol stacks, including the Internet-standard TCP/IP stack, do not differentiate between layers 5, 6 and 7; the functions of the Session, Presentation and Application layers are all combined into one.

Transport Layer (Layer 4)

This is where the hard work of communications actually takes place. Packet generation takes place here, so this is where messages are packed (or unpacked, if you are at the receiving end). Long messages may be broken down into several standard sized packets. Conversely, several smaller messages may be combined into a single packet. The size and structure of a packet depends upon the particular protocol being used.

The commonest protocol used at the Transport Layer is TCP, or Transmission Control Protocol. Other protocols that might be used instead of TCP at this layer include NetBEUI (NetBIOS Extended User Interface), SPX (Sequenced Packet Exchange), or NWLink. The exact format of the packets depends on which protocol is being used.

When you are sending data, the Transport Layer converts your data into packets according to the protocol being used, breaking up large messages into smaller ones for packaging, or combining small messages into a single packet. When you are receiving data, this layer is where the incoming packets are assembled back into the message. If the message was properly received, this layer sends an acknowledgement of that; otherwise it will send a message saying that the packet was malformed, and request re-transmission.
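
Here's a toy sketch of what "breaking a long message into standard sized pieces, then reassembling it" looks like. Real transport protocols add headers, sequence numbers and checksums to each piece; this only shows the splitting and rejoining.

    # Toy segmentation and reassembly: split a long message into fixed-size
    # pieces for transmission, then put them back together on the other side.
    def segment(data: bytes, size: int) -> list[bytes]:
        return [data[i:i + size] for i in range(0, len(data), size)]

    def reassemble(pieces: list[bytes]) -> bytes:
        return b"".join(pieces)

    message = b"a long message " * 100     # pretend this is a big email
    pieces = segment(message, 536)          # 536 bytes: a classic default segment size
    print(len(pieces), "pieces")
    assert reassemble(pieces) == message    # the receiver recovers the original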

The Transport layer is where several very important functions can be implemented:

Connection Management: For applications that are connection-based (such as IRC, Telnet, etc.), the Transport layer initiates, maintains and terminates connections. This is reflected in the concept of sessions described above for the Session layer: the Session layer requests the session operations, and the Transport layer implements them.

Flow Control: The speed of communication between two computers can be adjusted by the Transport layer through Flow Control. This allows each machine to tell the other "you're going too fast for me, slow down", and enables more reliable communications without overwhelming the capacity of either machine.

Retransmission: Data is often lost during transmission. The Transport layer can implement a kind of record keeping, so the sending computer keeps the receiving computer constantly updated about which and how many packets it's sent. Conversely, the receiving computer keeps the sending computer updated about which and how many packets it's received. So if any packets are lost, they can be re-transmitted automatically based on this record keeping.

Fragmentation: Communications may involve very large amounts of data. Consider a large file transfer, or a movie or audio file. These files must be broken down into multiple packets before they are transmitted. The Transport layer fragments large amounts of data before sending it, and assembles fragments it receives into larger pieces. Fragmentation occurs at many layers in the network stack. Each layer tries to fragment data into units that can be easily handled by the layer below it. So the size of packets created by the Transport layer is based on its understanding of what the layer below it (the Network layer) can handle.

Muxing: Multiplexing simply means mixing two or more data streams together (de-multiplexing or demuxing means splitting them apart again). Many applications have more than one data stream. For example, a video player will have at least two separate data streams, one for audio and one for video. Very often it will have more than 2, since the audio stream is again split into left channel and right channel for stereo, or even more channels for Dolby or DTS sound. These different streams of data need to be transmitted together, because the audio and video have to be synchronized on the other end. The Transport layer deals with the issues of multiplexing and demultiplexing data streams from the same or different applications, which need to be sent to the same computer.

Addressing: Although addressing at the host level (specifying which computer you want to send data to, or receive data from) occurs at lower layers, there is a form of addressing that can happen at the Transport layer. This is process-level addressing, and is usually implemented through ports. A single computer might be using several Internet applications at the same time: for example, someone might be watching a movie on YouTube, be connected to an IRC server, and have a Telnet/SSH session open simultaneously. All these applications are simultaneously sending and receiving data over the Internet. In order to separate this traffic, process-level addressing can be used. Commonly, this is done by assigning a certain kind of traffic to a certain port: web traffic goes to port 80, a Telnet session operates through port 23, and so on (a small sketch of this follows the list). This level of addressing is also done by the Transport layer.
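
Here's a small sketch of process-level addressing in practice: the standard library can look up the well-known port conventionally assigned to each service, and a connection is always made to a (host, port) pair rather than to a host alone. The host name is just an example.

    # Process-level addressing: services are distinguished by port number.
    import socket

    for service in ("http", "telnet", "smtp"):
        print(service, socket.getservbyname(service))   # e.g. http 80, telnet 23, smtp 25

    # A connection is therefore addressed to a host *and* a port.
    sock = socket.create_connection(("example.com", 80))
    print(sock.getpeername())    # remote address and port
    print(sock.getsockname())    # our own address and the ephemeral port assigned to us
    sock.close()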

All of these functions are described in much more detail later. As you can see, the Transport layer is very important to the network stack, as it contains the highest level functions that are actually used for data transmission. You can say that the Transport layer is part of the "lower group" of the stack (the part that deals with data transmission rather than with the user), but it is user/application aware in many ways, as you can see if you think about the list of functions above. So the Transport layer lives in the hazy in-between territory between user interaction and the purely routine work of data transmission.

Not all Transport Layer protocols implement all the functions of the Transport layer. Which protocol is used depends on the features it offers, and how well those features meet the needs of the task. For example, the main alternative to TCP (the commonest Layer 4 protocol on the Internet) is UDP (User Datagram Protocol), another Layer 4 protocol. UDP is missing several functions that TCP has (such as requests for re-transmission of lost data). While this makes UDP less reliable than TCP, it also means that UDP has less overhead than TCP, which makes it more suitable for many types of communications.
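
In the sockets API, the choice between the two shows up as a single parameter, as in this sketch: a stream socket gives you TCP's connection and reliability machinery, while a datagram socket gives you bare UDP sends with no connection and no guarantee of delivery. The destination address here is a documentation address, not a real host.

    # TCP versus UDP at the sockets level.
    import socket

    # TCP: connection-oriented, reliable, ordered byte stream.
    tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # UDP: connectionless datagrams; each sendto() stands alone and may be lost.
    udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp_sock.sendto(b"hello", ("192.0.2.1", 9999))   # example address, fire and forget

    tcp_sock.close()
    udp_sock.close()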

Network Layer (Layer 3)

Up to this point, the higher level layers are not aware of your particular network or its characteristics. The Network Layer is where this information is added. Each packet generated by the Transport Layer is stamped with the addresses of the sender and the recipient. The addresses used at this layer are numeric IP addresses, such as "207.46.232.182" for microsoft.com. A human-readable name, such as a web site name or the "hotmail.com" part of an email address, means nothing to the machines routing information across the internet; it first has to be translated (through DNS, the Domain Name System) into an IP address, and it's that numeric address that the Network Layer places in each packet. (The truly physical addresses - the hardware addresses of individual network cards - are handled lower down, by the Data-Link layer, as described later.)

Again, there are several protocols that can be used by the Network Layer. By far the commonest on the internet is IP, or Internet Protocol; the numeric address described above (207.46.232.182) is an IP address. However, other protocols can also be used in specific places, such as IPX (Internetwork Packet eXchange), NWLink, NetBEUI, etc.

The internet is sometimes referred to as using TCP/IP, which means it commonly uses TCP for the Transport Layer, in combination with IP for the Network Layer. The layers above and below these two might be totally different, depending on the type of data, type of machine, type of physical connection, etc. But so long as the traffic can be converted to TCP/IP, it can be sent anywhere on the internet.

The Network Layer does all the routing, that is, it decides how the message should be sent. As mentioned above, internet traffic goes through a number of relays before it reaches its destination; each step is called a "hop". The Network Layer decides which route would be most efficient, based on the number of hops, the amount of traffic congestion along different pathways, etc.

Finally, the Network Layer is also used for network traffic shaping. Large chunks of data can be broken down into smaller parts if needed. Packets can be prioritized or de-prioritized, which affects the order in which they are received at the other end. The Network Layer might decide to prioritize packets containing audio or video information (for internet video or telephony applications), because such data is sensitive to delay and to packets arriving out of order. Other kinds of data which are not so sensitive (such as email) may have their packets held back until congestion eases. This simply delays the receipt of the email at the other end; it won't garble the message, as might happen if audio or video packets were delivered out of order.

Nasty telcos also use the Network Layer to throttle certain kinds of traffic. They might decide that bit-torrent traffic is using up too much of their bandwidth, so they'll throttle down (decrease the rate of transfer of) such packets, while allowing other kinds of traffic to go through at normal rates.

Data-Link Layer (Layer 2)

This is the layer at which packets are packaged up for actual transmission over a particular physical medium. As mentioned earlier, data is ultimately transmitted as a series of 0's and 1's. Depending on the physical method of transmission (electrical signals in copper wire, light pulses in optical fiber, electromagnetic wireless signals, etc.), there may be many different methods of packaging 0's and 1's into bundles for transmission. This work is done by the data-link layer.

Data is usually sent in discrete units, rather than continuously. This means that a set of data (containing error-detection information as well as the data itself) is sent, after which the device at the other end does two things: it acknowledges that it received the data set, and it does the error checking to make sure that the data was not corrupted during the transmission. If it fails to acknowledge that data was received, the data is re-sent. The same happens if the data is received, but was corrupted.

These data sets are called "frames". Data frames are different from the IP packets described earlier. The size of a data frame at the data-link layer depends upon the physical medium: it might be totally different for copper, optical fiber, or radio transmission, and it might even differ between two data-link protocols that work on the same type of medium. In other words, it depends solely on the physical means used to communicate between devices, and the particular data-link protocol they agree to use. So several IP packets might be combined into one data frame for transmission, or larger IP packets might be broken apart to fit into a data frame. It all depends on the size of the data frames, which depends on the underlying physical mode of transmission.

Each data frame contains error checking information, usually in the form of a CRC (Cyclic Redundancy Check) value calculated from the data. Typically, the Data-Link layer is itself divided into two sublayers. The higher one is the LLC (Logical Link Control) sublayer, which controls the actual network interfaces (or SAPs - Service Access Points) on the computer or other device. The lower one is the MAC (Media Access Control) sublayer, which actually controls the voltage on the network interface and produces the 0's and 1's (or controls the light signals for optical fiber, etc.).
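
Here's a toy sketch of the error-detection idea: the sender appends a CRC computed over the payload, and the receiver recomputes it to decide whether the frame arrived intact. Real data-link protocols use specific CRC polynomials and frame formats; this just uses Python's built-in CRC-32 to show the principle.

    # Toy frame with a CRC: detect corruption in transit.
    import zlib
    import struct

    def make_frame(payload: bytes) -> bytes:
        crc = zlib.crc32(payload)
        return payload + struct.pack("!I", crc)       # payload followed by 4-byte CRC

    def check_frame(frame: bytes) -> bool:
        payload, crc = frame[:-4], struct.unpack("!I", frame[-4:])[0]
        return zlib.crc32(payload) == crc

    frame = make_frame(b"some data going over the wire")
    print(check_frame(frame))                          # True: frame is intact

    corrupted = bytearray(frame)
    corrupted[3] ^= 0xFF                               # flip some bits in transit
    print(check_frame(bytes(corrupted)))               # False: corruption detected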

A single computer may have several NICs (Network Interface Cards). For example, your notebook computer might have an ethernet port for times when you are home or at an office and can connect to your home/office network with an ethernet cable. Additionally, it might have a wireless NIC so you can access wireless networks. The LLC sublayer controls all of the SAPs in your computer (in this case, your computer has two Service Access Points, one for the wired connection and one for the wireless). Each device will have its own MAC sublayer, so you'll have two MAC sublayers in this example: one producing voltages in your wired ethernet port, and one producing wireless signals for the wireless network card.

Each interface must have some unique identifier, so that any network interface can be unambiguously addressed by a network device. This is the MAC address (which is different from your internet IP address). Every network interface manufactured (whether it's a wired ethernet card, a wireless card, whatever) has a unique MAC address hard coded into it by the manufacturer. That built-in address stays with the card; short of replacing the card, the hardware address doesn't change (although many operating systems and drivers do let you override the MAC address in software). Your IP address, on the other hand, can change: your internet provider can assign you different IP addresses without any change of hardware.

Because of this reliance on MAC addresses, the Data-Link layer only works at a local network level. Each device has to be aware of the MAC addresses of the other devices, otherwise it cannot deliver frames properly. It's the next higher layer - the Network layer - that is responsible for routing beyond the local network.

Physical Layer (Layer 1)

This is the actual physical basis of the network, and consists of the wires or links connecting various computers and other network devices across the internet. This also includes electromagnetic waves used in wireless networks.

This is where the 0's and 1's are actually encoded to voltages or light pulses and transmitted. The mechanics of this layer are outside the scope of this article, but this is where things such as bit synchronization, determining the length of a signal pulse, etc. would happen.

Communication between Network Stacks

As shown above, the layers in the OSI model are numbered, starting with Layer 1 for the lowest or Physical layer, up to Layer 7 for the highest or Application layer. The OSI model also often refers to layers using "N" notation. This can get somewhat confusing, but it's important to understand, because OSI terminology is very commonly used when describing communications.

You can refer to any layer as "N". When you do, the layers above it become the N+ layers (N+1, N+2, N+3, and so on), and the layers below it become the N- layers (N-1, N-2, etc.). For example, if the Network layer is arbitrarily chosen as the N layer, then the Transport layer above it is N+1 and the Data-Link layer below it is N-1 - but you could pick any layer at all and call it the N layer.

You can therefore use this terminology to refer to the next, previous, or any other layer in the stack, without referring to a specific layer by name. For example, you could say that a device provides a service to the N+1 layer, meaning the layer above it. This doesn't specify what that N+1 layer is, because that depends on the context - that is, on which layer you were talking about in reference to the device.

Generally, communication between layers in the same network stack is called an interface. So you can say that the Transport layer interfaces with the Network layer in the same network stack. Communication between the same layer in two different network stacks is called a protocol. For example, the IP protocol communicates between the Network layers of two different network stacks. Each device has a network stack, so when you say "two different network stacks" you mean two different devices, such as when one computer is talking to another computer, or when a computer is talking to a switch. Another way of thinking of this is that interfaces are vertical communications channels, while protocols are horizontal communications channels.

Two other terms need to be clarified while we're on the topic of network stacks. For any layer N, a complete message that fulfills all requirements of the protocol used for that layer is called a Protocol Data Unit (PDU) for that layer. These terms will become much clearer when we talk about the details of the protocols and the concept of encapsulation. For now, think of it this way. Consider a protocol for Layer 4 (the Transport Layer). One protocol used at this layer is TCP, the Transmission Control Protocol. The unit of this protocol is a TCP packet, which is a packet of data plus a header that contains meta information about the data, as well as other information relevant to the TCP protocol. Therefore, the PDU of Layer 4 is the TCP packet. If some other protocol is used instead of TCP for Layer 4, then the PDU will be different. For example, if instead of TCP you use UDP for Layer 4, then the PDU of Layer 4 will be a UDP packet. Regardless of the protocol, the PDU of a layer is one complete packet which contains some data plus all header information required by the protocol used on that layer.

Data moves down the network stack, from the application down through successive layers to the physical layer, where it's transmitted to the next device on your network. This idea of data moving down the stack can be thought of as a service model - that is, each layer in the stack is providing some service(s) to the layer above it. So the PDU of some layer, when it's passed down to the layer below, becomes the Service Data Unit (SDU) of the layer below. For example, the PDU of layer 4 is passed to layer 3, where it becomes the SDU of layer 3. Layer 3 is providing some service to layer 4. This service often consists of repackaging the SDU it received from layer 4 into some different format (explained in more detail in encapsulation). When it's provided this service and repackaged the data, the repackaged data is now considered the PDU of layer 3, which will in turn be passed to layer 2 below, where it will become the SDU of layer 2.
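
The following toy sketch shows the PDU/SDU relationship in code: each layer takes the PDU handed down from above (its own SDU), sticks its own header on the front, and hands the result down as its PDU. The header contents here are made up; only the wrapping pattern is the point.

    # Toy encapsulation: each layer wraps the unit it receives with its own header.
    def transport_layer(data: bytes) -> bytes:
        return b"[TCP-HDR]" + data             # PDU of layer 4

    def network_layer(sdu: bytes) -> bytes:
        return b"[IP-HDR]" + sdu               # layer 4's PDU is layer 3's SDU; wrapping it makes layer 3's PDU

    def datalink_layer(sdu: bytes) -> bytes:
        return b"[ETH-HDR]" + sdu + b"[CRC]"   # frames also carry a trailer for error checking

    application_data = b"Hello, world"
    frame = datalink_layer(network_layer(transport_layer(application_data)))
    print(frame)
    # b'[ETH-HDR][IP-HDR][TCP-HDR]Hello, world[CRC]'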

These terms are very commonly used in networking, so it would be a good idea to understand them. The section on encapsulation will make this much clearer.

Devices

Based on the seven layers described in the OSI model, networking devices can also be classified by the layer at which they interact with the network. The reason for having a separate "Devices" section in addition to the description of the OSI model above is that the functions of the lower layers are often implemented in hardware rather than software. This is easy enough to see for the Physical Layer -- obviously, wires and cables and radio waves or microwaves are a physical medium. However, it's also useful to understand that some of the higher level functions are also done by dedicated hardware in your network card, or in switches and routers, and are therefore not done by the software on your computer or by your computer's CPU.

This is especially true for modern and/or more expensive switches used in server environments. Today, even cheap computers come with built-in gigabit ethernet ports, and these ports may well be controlled by dedicated chips that offload much of the work of packetization from the CPU, especially on better motherboards. For this reason, it's useful to consider networking from this hardware perspective, because it helps us understand the devices used in networking and choose the appropriate device for the task.

Note that just like it's simplistic to ask "which layer does this protocol operate at" for the OSI model (because most protocols can and do operate at multiple layers), it's also simplistic to ask "which layer does this device work at?" Chances are the device works at multiple layers. It's often more useful to ask "what's the highest layer at which this device works?"

Here is one simple scheme for describing layers in terms of the devices which work on them. Remember, many devices work at multiple layers.

Layer 1: Physical Layer - Devices operate on Bits

Devices which deal directly with bits (electrical pulses, light pulses, etc.) are Layer 1 devices. They work at the physical level, on the actual wires or fiber optic cables of the network.

An example of a Layer 1 device would be a repeater. Signals can only travel a certain distance before being overwhelmed by noise. For this reason, most network topologies have repeaters placed at fixed distances along the signal path. Their job is simply to detect the signal and amplify it before sending it on its way again. For example, the range of a microwave antenna used for wireless networking might be 3 miles; if you wish to provide wireless networking coverage beyond 3 miles, you need to place repeaters at 3 mile intervals. Repeaters are also used for electrical and optical signaling.

Another example of a Layer 1 device is a hub. A hub is a simple device for aggregating or dispersing signals. For example, you could connect many computers to a single hub, and then use the hub to communicate between the computers. It is simply a physical link point where a number of cables carrying their individual signals merge. There is no shaping or control of traffic, so linking too many computers to a hub may cause congestion and data collisions (when two different computers try to send a signal at exactly the same time, and the signals interfere with each other). This results in dropped data.

The higher layers in the OSI stack can deal with some of these problems. For example, if some data is lost or corrupted due to data collisions, the higher layers (in this case, the Data-Link Layer) will ask for the data to be re-transmitted. So hubs can work even with several computers talking at the same time on the network, but performance will degrade due to these congestion issues.

Layer 2: Data-Link Layer - Devices operate on Frames

These are devices that work on the data frames generated by the Data-Link Layer described above. Since the Data-Link Layer has the ability to address a specific MAC address, Layer 2 devices can sort traffic according to which MAC address it's directed to. The Data-Link Layer also has the built-in ability to do CRC checks for errors, and to request retransmission of dropped or corrupt data.

The commonest Layer 2 device is the Layer 2 switch. This is a basic switch that can be used to connect several computers into a local area network. Each computer has its own separate path to the switch, so there is a separate port on the switch for each computer. The switch receives traffic through dedicated links to each computer, sorts it out based on MAC address, and forwards traffic to the specific computer the traffic is destined for.
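
Conceptually, the heart of a Layer 2 switch is a table mapping MAC addresses to ports. The toy sketch below shows the basic learn-and-forward behaviour (learn the sender's port from each incoming frame, forward to the known port, flood to all ports when the destination is still unknown); real switches do this in dedicated hardware at wire speed.

    # Toy model of a Layer 2 switch: learn source MACs, forward by destination MAC.
    mac_table: dict[str, int] = {}            # MAC address -> port number
    ports = [1, 2, 3, 4]

    def handle_frame(src_mac: str, dst_mac: str, in_port: int) -> list[int]:
        mac_table[src_mac] = in_port           # learn where the sender lives
        if dst_mac in mac_table:
            return [mac_table[dst_mac]]        # forward only to the right port
        return [p for p in ports if p != in_port]    # unknown destination: flood

    print(handle_frame("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb", 1))   # flood: [2, 3, 4]
    print(handle_frame("bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa", 2))   # learned: [1]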

Each port is its own collision domain, meaning that there can't be collisions on a single port unless you plug two computers into the same port on the switch (which can be done, using a hub, if you were so inclined). However, all ports are in the same broadcast domain (a broadcast is a message from a computer which is not targeted to any specific computer, but to all computers on the network).

Switches are fairly complex devices, with electronics and instructions that allow them to decode data frames, read MAC addresses, etc. Layer 2 switches are mostly the older and cheaper switches these days.

Another Layer 2 device is a bridge, which is used to connect two different networks together. Since they work at the frame level, they can connect different types of networks, such as ethernet, token-ring, etc.

Layer 3: Network Layer - Devices operate on Packets or Datagrams

The network layer handles packets and does routing. That is, packets that are generated by the Transport Layer above it are address-stamped by the network layer. The network layer therefore understands internet addresses, which are usually in the IP format. It also does routing, that is, it decides when to send each packet, and through what route. It can therefore be used to address Quality of Service (QoS) issues, such as making sure that audio/video packets are sent immediately, while email and other such packets can be held for a while to avoid congestion. It decides which route to use to send the packet, based on hop number, congestion, etc.

A Layer 3 switch is essentially what is commonly referred to as a "router" or "gateway". Routing and gateway functions can be done in software as well, but when a switch does this in hardware, it's called a Layer 3 switch. Since individual packets can be inspected at this level, this is also where packet filtering can be done; hardware firewalls are one example of such filtering devices.

As such, Layer 3 switches are used where you need to connect two separate networks - two different broadcast domains. An example might be when you want to connect your home network to the internet, or anywhere else you need to connect a LAN to a WAN.

Layer 4: Transport Layer - Devices operate on Segments

Layer 4 switches operate at the level of the transport layer, which does the hard work of packetizing data. By implementing this in hardware, Layer 4 switches remove a lot of computing load from busy servers in a network.

Layer 4 switches are sometimes also called "session switches". This does not mean that they work at the Session-Layer, which is above the Transport Layer. It simply means that they are aware of sessions (without being able to manipulate them). Being aware of sessions, they can control traffic at a session level. For example, suppose someone clicks the "buy" button to buy something at Amazon's store. This is one session, meaning one instance of communication between some computer and Amazon's cluster, for the purpose of buying an item. It will involve some communication (exchange of information about the order, credit card or payment information, shipping/billing address, etc.). A Layer 4 switch can treat the entire exchange as a single session, and direct it towards a particular machine in the Amazon cluster, for load balancing. It would do this by binding that session to the IP of the particular server in the cluster for the duration of the session.
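
Here's a toy sketch of the session-sticky load balancing idea: traffic is identified by its source address and port, and every packet belonging to the same session is sent to the same back-end server. The server names are invented; real Layer 4 switches implement this in dedicated hardware.

    # Toy session-aware load balancing: pin each session to one back-end server.
    import itertools

    backends = ["server-1", "server-2", "server-3"]    # hypothetical cluster members
    next_backend = itertools.cycle(backends)            # simple round-robin for new sessions
    sessions: dict[tuple[str, int], str] = {}           # (client IP, client port) -> server

    def pick_server(client_ip: str, client_port: int) -> str:
        key = (client_ip, client_port)
        if key not in sessions:
            sessions[key] = next(next_backend)           # new session: choose a server
        return sessions[key]                             # existing session: stay sticky

    print(pick_server("198.51.100.7", 51200))   # e.g. server-1
    print(pick_server("198.51.100.7", 51200))   # same session -> same server
    print(pick_server("203.0.113.9", 40000))    # different session -> next server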

Layer 4 switches are used in very high bandwidth applications (gigabits) where you want to offload the CPU work of packetizing from the servers to the switch's custom ASICs, which can do the job very fast. The session awareness also helps with things like load balancing. Additionally, Layer 4 switches can shape traffic, route, and do all the functions of Layer 3 switches.

One advantage of thinking about devices in terms of the layers of the OSI model is that it makes it easy to understand that not all devices have to deal with all layers. This is especially important for the Internet. Consider an email that you send from the US to someone in Australia. That email will probably go through a dozen or more devices located in different cities and continents before it reaches its destination. Most of these devices are not computers, but simply switches, routers and gateways. None of them care about the content of the email; all they care about is how to get it to its destination. Because of the layering, the actual routing and destination information is contained in the lower layers, typically layers 2 and 3. Layer 4 is also important in the case of the Internet, because it often contains information that helps with data transmission (such as flow control, traffic congestion avoidance, and repeat-transmission requests in case of lost information -- all of which will be explained in more detail later).

So at most, a switch just needs to read and understand layers 1 through 4, and many switches won't even read that much - they may operate only on layers 1 through 3, for example. This simplifies the design of switches and routers, because they don't need to unravel the whole network stack; they just need to be able to operate on the lower layers.

The structure of an IP Packet

Traffic across networks is pretty much always packetized these days. This is because it's easier to do error checking at a packet level, rather than at the level of individual bits. The commonest packet type on the Internet is the IP packet.

Since the basic unit of data transmission across the Internet is the IP packet, let's examine it in a bit more detail. Remember, the Transport Layer is the first layer that generates packets out of the data - these are usually either TCP packets or UDP packets. The Network Layer then repackages the TCP packets into IP packets. Although TCP or UDP packets are generated first (since the Transport layer is higher up in the OSI stack), we're beginning our discussion with IP packets, because they are much simpler and easier to understand, and because they are the basic "currency" of the Internet.

Normally, each layer that generates packets tries to adjust the packet size to meet the requirements of the next lower layer. This is done in order to minimize fragmentation. For example, consider the case where a lower layer can handle packets with a maximum size of 1500 bytes, but the layer above it generates packets of 1600 bytes. Each of these larger packets would need to be split in two by the lower layer before it could be sent: roughly one packet of 1500 bytes and a second small packet carrying the remaining 100 bytes or so. This is inefficient, since each of these packets must then carry its own full-size header - the header of the 1500-byte packet and of the 100-byte packet are the same size. So there is a lot of overhead in fragmentation. For this reason, it's generally wise to form packets of an appropriate size that won't be fragmented much further by lower layers.
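
The overhead is easy to quantify with a small sketch, assuming the 20-byte minimum IP header described later in this article and the rule that fragment payloads are carved up in 8-byte units. The more fragments a payload needs, the more of the link is spent carrying headers rather than data.

    # How many fragments does a payload need, and how much header overhead results?
    # (Assumes a 20-byte IP header; fragment payloads come in multiples of 8 bytes.)
    def fragment_count(payload: int, mtu: int, header: int = 20) -> int:
        per_fragment = (mtu - header) // 8 * 8     # usable payload bytes per fragment
        return -(-payload // per_fragment)          # ceiling division

    for payload in (1400, 1600, 65000):
        n = fragment_count(payload, mtu=1500)
        print(payload, "bytes ->", n, "fragments,", n * 20, "bytes of header overhead")
    # 1400 bytes ->  1 fragment,   20 bytes of header overhead
    # 1600 bytes ->  2 fragments,  40 bytes of header overhead
    # 65000 bytes -> 44 fragments, 880 bytes of header overhead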

This is a good rule in principle, but it's next to impossible to implement effectively on the Internet. The reason why it's difficult is because any transmission on the Internet goes through multiple devices before it reaches its destination. These devices may include various switches, routers and gateways along the path. The problem is that not only are these devices outside your control, in most cases they are also totally unknown to you and to your computer. So you really have no way to know what packet sizes they can handle without fragmentation.

So in practice, fragmentation does indeed occur. Later on we will talk about some techniques for reducing fragmentation, but all those methods work at this layer (the Network layer) or below. Recall that the highest layer that is actually aware of the network is the Network layer, so any method of discovering the maximum allowable packet size before fragmentation occurs, for the machines along the route, operates at the Network layer. TCP packets are generated by the Transport layer, which is higher in the OSI stack and therefore unaware of network level information. (In practice TCP does negotiate a maximum segment size for each connection when it is set up, but that detail doesn't change the overall picture here.) TCP packets are usually generated at sizes determined by the task at hand, or by the application. For example, an HTTP transfer might hand over the entire text of a web page as a single piece of data. A multimedia player, on the other hand (such as a Flash player embedded in a web page), will try to break down its data into suitable "buffers" for smooth playback at the other end - it might decide to break the whole movie into a number of 6 second chunks, for example, with each chunk handed down separately. Other applications may handle their own packetization differently.

So for the purpose of this discussion, you can assume that TCP packets handed down by Layer 4 may well be fragmented by Layer 3 when it repackages them into IP packets. Some TCP packets won't be fragmented, because they were small to begin with; usually such small packets are not "data" but rather control and informational messages.

In the simplest case, when no fragmentation occurs, each packet received from the higher layer is simply encapsulated. The process of encapsulation is described in more detail later, but for now you can think of it this way: the IP packet is just the TCP packet (the data plus the TCP header) with an IP header tacked on to the front of it.

The basic structure of an IP packet is shown below:
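
        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |Version|  IHL  |Type of Service|          Total Length         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |         Identification        |Flags|      Fragment Offset    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |  Time to Live |    Protocol   |         Header Checksum       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Source Address                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      Destination Address                      |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                  Options (if any)             |    Padding    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                           Data ...                            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

(This is the standard IPv4 header layout as given in RFC 791; each row is one 32-bit word, and the data or payload follows the header.)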

Each column in the diagram represents a single bit, numbered starting from bit 0, and each row is one 32-bit word. As can be seen, the packet contains two broad sections: the data itself (or payload), and a header section consisting of metadata (information describing the data itself, addresses, etc.). The total length of an IP packet is variable. The general rules for the packet size are:

- the header is between 20 and 60 bytes long (most packets carry the minimum 20-byte header, since the optional fields are rarely used);
- the total length of the packet (header plus data) can be at most 65535 bytes;
- every host must be able to accept packets of at least 576 bytes.

Here's a more detailed description of the header fields.

Version The version of IP used to create the packet. Currently, we use IPv4, so it contains the value 4. This is a 4-bit field, so values of 0 to 15 are allowed.
Header Length Also called the Internet Header Length (IHL), it simply describes the length of the packet's header, in units of 32-bit words. This is a 4-bit field, so values of 0 to 15 are possible. Since headers must contain all the required fields, the minimum possible header length is 160 bits (5 words). So this field will contain a minimum value of 5. The maximum value is 15, which corresponds to 15 x 32 = 480 bits, or 60 bytes. That is the maximum header size possible. This field can be used as an offset to read the data. Simply multiply the value by 32, count off that many bits from the start of the packet, and start reading data.
Differentiated Services This 8-bit field was originally known as the TOS (Type of Service) field. It's basically a way for the host to express a preference for how it wants that packet handled - fast and less reliable, slower but more reliable, using a more expensive route, using a cheaper route, etc. It's not been much implemented so far, but will probably be important in IPv6 when there's a lot of real time traffic involved (audio/video stuff). The 8 bits of this field are apportioned as shown below.
  0 - 2 Precedence, or priority: 3 bits, 8 possible values from very high to very low.
3 Delay: two values possible, normal delay or low delay
4 Throughput: two values possible, normal or high throughput
5 Reliability: two values possible, normal or high reliability
6 Cost: two values possible, normal cost or minimize cost
7 Undefined
Some switches may read the information in these bits, and route traffic accordingly. Others may totally ignore this information and handle all packets in exactly the same manner. This is therefore a feature of how a particular network is implemented.
Total Length This 16-bit field defines the total length of the packet (header + data), in bytes. In theory it can hold values from 0 to 65535. The minimum legal value is 20 (a minimum header plus no data), and the absolute maximum is 65535 bytes (about 64 kilobytes). In practice packets are usually much smaller, depending upon the software and hardware generating them, but to meet the specs every host must be able to accept packets of at least 576 bytes.
Identification This 16-bit field is not always used. Its intended purpose was originally as a unique ID or identifier for different fragments of an IP packet. It's not much used for that purpose today. Some people suggest that it could be used to add information to prevent source address spoofing.
Flags This is a 3-bit field which contains flags that help manage fragmentation. Fragmentation is described in more detail in a section below, but for now, remember that IP packets can be fragmented if they are too large to go through lower layers (which may have packet size limitations imposed by their own protocol). Therefore, information is needed that helps those layers to fragment these packets. This information is provided via the Flags field and the Fragment Offset field. The 3 bits reserved for the Flags are used as follows.
  0 Reserved bit. Not used. Must be a 0.
1 two possible values, 1 means "don't fragment", which lower layers understand as "just trash this packet if it's too big to send without fragmentation". 0 means "okay to fragment".
2 two possible values: these values are actually set by the lower layers if fragmentation is needed. A value of 1 means "more fragments", meaning that this packet is a fragment of a larger packet, and more packets are to follow. A value of 0 means that this is the last fragment of the series. So if a large IP packet is fragmented into 5 smaller IP packets, the first 4 will have the "more fragments" flag set and the 5th one won't.
Fragment Offset This 13-bit field tells the host at the receiving end how to re-assemble a packet that was fragmented. The unit is a block of 8 bytes. Since it can have values from 0 to 8191 (13 bits), it can provide a maximum offset of 8191 x 8 = 65528 bytes. This is sufficient to cover the maximum possible length of an IP packet (65535 bytes minus header). This field is again used by lower layers, in case a large IP packet needs to be fragmented. Each fragment is stamped with a fragment offset, which is used to re-assemble the original packet at the other end. For example, a fragment offset of 107 means that the data contained in this fragment belongs to position 107 x 8 = 856, measured from the beginning of the original packet.
Time to Live (TTL) This 8-bit field specifies exactly what it says: how long a packet should "live". The unit is nominally seconds, so values can range up to 255 (0 is not a legal value); in practice the field works as a hop count. The purpose of this field is to prevent a packet from being forwarded around in a circle forever, since routes to hosts aren't always known exactly in advance. Every time a packet passes through a switch or router, the TTL is decremented by 1 before it's passed on. When it hits 0, the packet is discarded and no longer forwarded, and the router that discarded it sends an ICMP message back to the sender saying that the packet was discarded. This feature can be used to implement traceroute.
Protocol This 8-bit field defines the protocol used for the data or payload of the packet. It can have values from 0 to 255, each of which specifies a certain protocol. The Internet Assigned Numbers Authority maintains the list which assigns a specific number to each protocol. For example, a value of 6 means TCP (Transmission Control Protocol), the most common payload; a value of 17 means UDP; and a value of 1 means ICMP (Internet Control Message Protocol), which is used for standard housekeeping and control messages in networks.
Header Checksum This is simply a checksum calculated over the header portion of the packet, used for error checking. Note that the data is not included in calculating the header checksum; integrity of the data is the responsibility of the payload's own protocol (TCP, for example, has its own separate checksum for verifying data integrity). Each switch/router along the route recalculates the checksum for the header and compares it against the value in this field; if they don't match, the packet is discarded (any re-transmission is left to higher layer protocols such as TCP). Note that since each switch changes the header (by decrementing the TTL field), it must calculate and embed a new checksum in each packet before forwarding it.
Source Address This is usually an IPv4 address in binary format. The address field is 32 bits, and can be thought of as a series of four 8-bit fields. Each 8-bit field can contain a value from 0 to 255. Therefore, an IPv4 address of the form 202.134.227.153, for example, can be contained by first converting each of its four parts into binary, then concatenating them together. So it would be: 202 = 11001010; 134 = 10000110; 227 = 11100011; 153 = 10011001. So the value of source address would be obtained by concatenating these four binary numbers: 11001010100001101110001110011001. Note that because of NAT (Network Address Translation) the source address might not be the address of the actual host where the packet originated, but rather the address of the NAT machine. Replies will therefore be sent back to the NAT machine, which will forward them to the appropriate host.
Destination Address This is the IPv4 address of the destination machine, again in binary format as described for the Source Address field.
Options This is an optional part of the header, and is hardly ever used in IPv4. Since the minimum header length is 20 bytes (with no options) and the maximum header length is 60 bytes (determined by the maximum value possible in the header length field), the options can use up to a maximum of 40 bytes. Each option begins with a short option header (the first two bytes, whose bits are assigned as shown below), followed by that option's data.
Copied 1 bit Should options be copied into all fragments if the packet is fragmented. 1 means yes.
Class 2 bits 4 possible values, generally used to categorize the option. 0 means "control options", 2 means "debugging and measurement options", while 1 and 3 are currently not used.
Number 5 bits The option number, uniquely identifies each option contained in this part of the header. A maximum of 32 options are possible.
Length 8 bits The length of the option in bytes, including the option header fields above as well as the option's data.
Data Variable Any data used by the options. The length is variable, with the constraint that the total optional part of the header can't be more than 40 bytes. Simple options might not have any data associated with them.
Data The actual data or payload of the packet. This is not part of the header and not included in the header checksum. The data can be any of the Transport Layer protocols, as defined in the Protocol field of the header. Usually over the Internet, the data will be a Layer 4 packet, either a TCP packet or a UDP packet.

As you can see from the table, the IP packet contains all the information needed to transmit it from sender to destination. Therefore, the IP packet is the smallest unit that is independently routable end-to-end, that is, all the way from the sending computer to the receiving computer, and vice versa.
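
To tie the table together, here's a short sketch that builds a 20-byte IPv4 header with Python's struct module, then unpacks the same fields described above and verifies the header checksum (the standard ones'-complement sum of 16-bit words). The addresses, identification value and lengths are made-up example values.

    # Build and then parse a minimal 20-byte IPv4 header, verifying its checksum.
    import struct
    import socket

    def checksum(header: bytes) -> int:
        # Ones'-complement sum of all 16-bit words in the header.
        total = sum(struct.unpack("!10H", header))
        while total > 0xFFFF:
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    # Example field values: version 4, IHL 5 words, total length 40 bytes,
    # TTL 64, protocol 6 (TCP), documentation addresses as source/destination.
    ver_ihl = (4 << 4) | 5
    src = socket.inet_aton("192.0.2.1")
    dst = socket.inet_aton("198.51.100.2")
    header = struct.pack("!BBHHHBBH4s4s",
                         ver_ihl, 0, 40, 0x1234, 0, 64, 6, 0, src, dst)
    header = header[:10] + struct.pack("!H", checksum(header)) + header[12:]

    # Parse it back, the way a router or receiving host would.
    v_ihl, tos, total_len, ident, flags_frag, ttl, proto, cksum, s, d = \
        struct.unpack("!BBHHHBBH4s4s", header)
    print("version", v_ihl >> 4, "header bytes", (v_ihl & 0x0F) * 4)
    print("total length", total_len, "ttl", ttl, "protocol", proto)
    print("source", socket.inet_ntoa(s), "destination", socket.inet_ntoa(d))
    print("checksum ok:", checksum(header) == 0)   # a valid header sums to zero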

Fragmentation of IP Packets

Since IPv4 traffic can travel over a variety of networks, including both WANs and LANs (such as ethernet), the issue of packet size becomes important. When designing the protocol, the designers might have picked an arbitrarily small packet size, to make sure a packet never got fragmented because it was smaller than the MTU (Maximum Transmission Unit) of the common networking infrastructures. However, this would have been inefficient, because a smaller packet size decreases data density: a proportionately larger fraction of the bandwidth is spent simply transmitting headers.

Instead, it was decided to allow for a system of fragmentation, so that larger packets could be fragmented and re-assembled only as needed, for the particular type of network on which they would be used.

As mentioned above, packets are first generated from data by the Transport layer (as TCP or UDP packets). These packets are generated without any concern for fragmentation or MTU size (the Transport layer is not network-aware, so it doesn't even know what an MTU is), so they may well need to be fragmented by the Network layer when it repackages them into IP packets.

MTU Sizes for Common Media

    Media                     MTU (bytes)      Comments
    Ethernet v2               1500             The vast majority of Ethernet implementations are Ethernet v2.
    Ethernet 802.3            1492
    Ethernet Jumbo Frames     1500-9000
    802.11 (wireless)         2272
    802.5 (Token Ring)        4464
    FDDI                      4500
    Internet IPv4             68 (minimum)     These are minimum values; in practice path MTUs are much higher.
    Internet IPv6             1280 (minimum)

However, since the Network layer is aware of MTU restrictions, the size of the IP packets it generates can be finely tuned.

Over a LAN with a relatively homogeneous structure, fragmentation is not much of a problem. The host knows the MTU of its own interface (and possibly the MTUs of other hosts around it, through handshaking), so it can set the packet size so that it doesn't exceed the MTU. Most LANs use some standard protocol like ethernet, which has a well-defined MTU, and packet size can be based on that.

The table above shows MTU sizes for a variety of media. Consider Ethernet v2, which is the commonest Ethernet standard in use. Chances are, the computer you are using right now is connected to a LAN running Ethernet, and the LAN will probably have a router or gateway to the Internet (which could be as simple as the cable modem from your local cable company, or a DSL modem). Ethernet v2 has an MTU of 1500 bytes. If you wanted to guarantee that there would be no fragmentation at this level, you would make sure that no packet exceeded 1500 bytes. Remember, the 1500 bytes includes both the data and the IP header.
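
As a quick back-of-the-envelope sketch: with a 1500-byte Ethernet MTU, subtracting a minimum 20-byte IP header and a typical 20-byte TCP header leaves about 1460 bytes of actual payload per packet, which is why that figure shows up so often as a default segment size.

    # Payload left over per packet on a standard Ethernet link.
    MTU = 1500            # Ethernet v2
    IP_HEADER = 20        # minimum IPv4 header
    TCP_HEADER = 20       # typical TCP header without options

    payload = MTU - IP_HEADER - TCP_HEADER
    print(payload)        # 1460 bytes of data per full-sized packet
    print(payload / MTU)  # ~0.973: roughly 97% of each packet is real data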

On the Internet, a packet encounters a wide range of networks and equipment, and the MTU may vary from hop to hop. This makes it difficult to choose a packet size that is large enough to minimize overhead, yet still stays below the MTU of every hop along the route. There are procedures for discovering the MTU of all hops along the route (Path MTU Discovery) in order to optimize packet size before transmission. The basic procedure is to send a series of packets of varying sizes with the "don't fragment" flag set. If a packet is too big for a switch/router to forward without fragmentation, it will simply drop the packet and send back an ICMP "Fragmentation Needed" message, which contains its MTU value. In this way, the smallest MTU along a route can be discovered, and the packet size set accordingly.

However, many network routers do not send back these ICMP errors, sometimes through misconfiguration and sometimes deliberately, out of fear of denial-of-service attacks. So Path MTU Discovery methods are not foolproof. Since the overhead caused by fragmentation is a significant source of wasted bandwidth across the Internet, these issues remain an active area of work and research.

 

Continue on to Page 2 of this article, which talks about encapsulation and TCP packets.