Monday, December 2, 2013

Cookies vs Query parameters for Sticky Sessions

After going through the nginx-sticky-module and understanding the requirements for session based load balancing, we were convinced that the cookie based approach described earlier has several drawbacks, including:

1. The most obvious one, as stated on the nginx-sticky-module page itself, is that it requires cookies to be enabled in the browser. If cookies are disabled, the load balancing either falls back to the regular round robin mechanism or returns a Bad Gateway error, depending on the configuration.

2. This cookie based approach isn't completely transparent. To ensure persistence, both the client and the server are involved, and both are aware of the cookie being stored on the client side. If there is server side processing based simply on the presence of a cookie in the incoming request, the logic fails because the load balancer intervenes to add another cookie. So yea, not completely transparent!

3. The third and more significant drawback is the session timeout. At the application level, the session may last for a few minutes or more, and there might also be an upper limit on its duration. On the other hand, the cookie being stored for persistence has an associated lifetime, which will almost certainly differ from the actual user session, as it is usually set by the developer to the length of the longest session! As mentioned in our previous blog, this skews the load balancing, causing requests to go to the same server even after the user session has expired.

4. Another drawback of this module is that the client side gets too much control. This is a consequence of the transparency problem mentioned earlier. Since the client gets the digest of the server IP in the cookie, it can discover all the server side mappings by simply sending a cookie free request each time - the nginx load balancer would allot a different server on each such request using the default round robin approach. The concept of a reverse proxy abstracting the backend servers is somewhat lost. The client can then use this information to flood one particular server with requests, succeeding in a DoS attack!

5. HTTPS, by using TLS, encrypts all of the data, including the headers; only the hostname is left unencrypted. For instance, the first part of the URL (https://www.google.com) is still visible as the connection is built, while the second part (/herearemygetparameters/1/2/3/4) is protected by TLS. So if the application uses HTTPS and the termination proxy is located behind the load balancer, this approach fails, as the cookies become inaccessible to the load balancer.

After identifying the above shortcomings, we wanted our query parameter based approach to counter as many of them as possible, if not all.

Our approach involves the use of two modules:

1. The first module is a load balancer, similar to the cookie based nginx-sticky-module. The main idea is to fetch the query parameters from the incoming request URL and extract the session id, which is the unique identifier for a client, from those parameters. This id is then matched against our own hash map, which stores mappings from session id (key) to server id (value). If a match is found, the client's request is directed to the corresponding server. If there are no query parameters, if there is no session id among them, or if the session id does not match any entry in the hash map, a server is selected in round robin fashion.

For a new request (or otherwise), if a new session id is assigned by the server, it has to be mapped to the server id. This can be done only after the response message has been built, clearly indicating the need for a second module to process the response sent by the server.

2. The second module, a filter, parses the response body to look for a new session id. This parsing can involve either:
a. Parsing only form action URLs and examining the query parameters contained in them (to add them to the hash map). This approach is restrictive and only addresses form based requests.
b. Parsing all URLs and storing the session id information only from those which belong to the same domain.

As the number of user sessions increases, there is also a need to clean up old mappings to save space. This wasn't a necessity in the cookie based approach, as the number of mappings was equal to the number of backend servers in the system. With our approach, the number of mappings keeps growing with new user sessions, hence the filter module also has to clear mappings after a configured timeout. For the same reason, we decided to associate each new mapping with a time attribute as well.
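To make this concrete, below is a minimal, standalone C sketch of such a session-id-to-server map with timeout based expiry. This is not our actual module code: the table size, session id length, timeout value and function names are all illustrative, and hash collisions simply overwrite the older entry.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAP_SIZE 1024      /* number of buckets, illustrative */
#define SID_LEN  64        /* max session id length, an assumption */
#define TIMEOUT  1800      /* configured mapping timeout in seconds */

typedef struct {
    char   sid[SID_LEN];   /* session id extracted from the query parameters */
    int    server;         /* index of the mapped backend server */
    time_t last_seen;      /* time attribute used to expire stale mappings */
} entry_t;

static entry_t map[MAP_SIZE];

static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % MAP_SIZE;
}

/* Load balancer side: return the mapped server, or -1 so the caller
 * falls back to round robin (unknown id or expired mapping). */
int lookup_server(const char *sid)
{
    entry_t *e = &map[hash(sid)];
    if (strcmp(e->sid, sid) != 0)
        return -1;
    if (time(NULL) - e->last_seen > TIMEOUT) {
        e->sid[0] = '\0';  /* clear the stale mapping */
        return -1;
    }
    e->last_seen = time(NULL);
    return e->server;
}

/* Filter side: store the mapping when a new session id is found
 * in the response body. */
void store_mapping(const char *sid, int server)
{
    entry_t *e = &map[hash(sid)];
    snprintf(e->sid, SID_LEN, "%s", sid);
    e->server = server;
    e->last_seen = time(NULL);
}

int main(void)
{
    store_mapping("abc123", 2);                             /* filter saw a new id */
    printf("known id   -> %d\n", lookup_server("abc123"));  /* prints 2 */
    printf("unknown id -> %d\n", lookup_server("xyz"));     /* prints -1 */
    return 0;
}

In the real modules, lookup_server corresponds to what the load balancer does per request, and store_mapping to what the filter does per response.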

This alternative approach for sticky sessions counters some of the drawbacks of the cookie based approach.

- It doesn't rely on the client's browser to enable cookies.
- It is transparent on the client's side; we use cookies only to share information across modules, before and after the server's response.
- The application is in control of the timeout: requests are sent to the same server as long as the incoming request carries the same session id and the mapping hasn't timed out. The onus of the session id is on the application; if the server session has expired, the server will no longer generate a response with the same session id.
- The client can still access the session id, which is used as the unique identifier, and can flood the system by sending many requests with the same session id. But unlike with the sticky module, the client cannot control which backend server those requests are sent to.
- Lastly, since this method relies on parsing the query parameters as well as the response body, HTTPS is still a limitation if the termination proxy lies behind the load balancer, as only the hostname is unencrypted.

Monday, October 28, 2013

Sticky Sessions with Cookies


This module tracks upstream servers using cookies, enabling clients to be served by the same backend server for session persistence. The incoming request is examined for a cookie (we assume "route" to be the name of this cookie) and the client request is forwarded to the corresponding server based on the digest value in the mapping. The module can't be applied when cookies are disabled, in which case it switches back to the classic round robin load balancing mechanism available in nginx (or returns a Bad Gateway, based on the fallback attribute specified).

We decided to examine the control flow and the source code of this sticky module, hoping that it would give us a fair idea of how to implement our own module - using url rewriting for session persistence. This blog is a brief description of what we've inferred from reading the source code.

This module is added by recompiling nginx from source:

./configure ... --add-module=/absolute/path/to/nginx-sticky-module
make
make install

To use the sticky module, a "sticky" directive is specified in the upstream block, with options as indicated below:

Usage
upstream backend {
    sticky;
    server 127.0.0.1:9000;
    server 127.0.0.1:9001;
    server 127.0.0.1:9002;
}

(Note that the upstream block must be given a name - "backend" here - which is then referenced in proxy_pass.)

sticky [name=route] [domain=.foo.bar] [path=/] [expires=1h] [hash=index|md5|sha1] [no_fallback];
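
As an illustration, a configuration combining a few of these options might look like the following (the cookie name, expiry and digest type shown are just example values):

upstream backend {
    sticky name=route expires=1h hash=md5;
    server 127.0.0.1:9000;
    server 127.0.0.1:9001;
    server 127.0.0.1:9002;
}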

On encountering the sticky directive, the handler mapped to it is invoked. This handler is called ngx_http_sticky_set and is specified as part of the sticky directive definition (the enabling directive):

static ngx_command_t  ngx_http_sticky_commands[] = {

    { ngx_string("sticky"),              /* directive name */
      NGX_HTTP_UPS_CONF|NGX_CONF_ANY,    /* valid in an upstream block, any number of arguments */
      ngx_http_sticky_set,               /* handler invoked when the directive is parsed */
      0,
      0,
      NULL },

    ngx_null_command
};

The callback: ngx_http_sticky_set (registration function)
This reads and validates the arguments specified in the directive and saves them in an appropriate structure - ngx_http_sticky_srv_conf_t, a custom structure of the sticky module. These include specifications like the name of the cookie, its lifetime, domain and path, a callback based on the digest type specified in the directive, the fallback specification, and a reference that will hold the peer mappings with the digest values computed later. The validation also includes checking for unnecessary parameters and whether another upstream module has already been loaded (in which case an error is thrown). The function also sets an upstream initialization callback.

The upstream initialization function - ngx_http_init_upstream_sticky
This computes the digest for all servers based on the encoding type specified (md5, sha1, etc.), calls the round robin module (for resolving host names and allocating sockets) and sets the handler for the peer initialization function.

The peer initialization function - ngx_http_init_sticky_peer
This is invoked per request. The configuration specifications that were packaged into a peer data structure (in the registration function) are now set as part of the http request. The round robin module is invoked to determine the next peer (callback: ngx_http_get_sticky_peer). This function also performs a good chunk of the cookie operations. It checks the incoming request headers for the sticky cookie, "route". If found, it checks for a set encryption type and tries to find a matching peer. If no encoding type is specified, the cookie data is taken to be the index of the peer directly. The selected peer is saved into the peer data structure to be used later by the ngx_http_get_sticky_peer callback. If a cookie wasn't found, indicated by an index value of -1, the regular round robin load balancing mechanism is used.

ngx_http_get_sticky_peer is the callback set in the peer initialization function and is called at least once per request to select the next peer. The function examines the state of the selected peer, ensuring that it hasn't already been tried and isn't down. After these validations, it performs the core operation of assigning this peer to the upstream module and setting the cookie data. For the latter, ngx_http_sticky_misc_set_cookie() is invoked. If an existing cookie with the "route" name is found, it is overwritten to update the cookie's lifetime (the expires attribute); if not, a new cookie is created and set in the output headers.

This is how the cookie based session load balancing works.

However, this approach is not truly session based load balancing.
Consider an application whose sessions last only an hour, while the lifetime of the cookie is set to 2 days. The cookie will be set on the client side and subsequent requests will continue to be sent to the same backend server for the next 2 days, well beyond the expiry of the original session. This skews the load balancing, and it is not truly "sessional".
We plan to correct this by making use of query parameters and maintaining a mapping from the session id generated by the backend server to that server.

Wednesday, September 25, 2013

Setup Basic Load Balancing in Nginx


First we must understand what Load Balancing is.

Wikipedia's definition:

"Load balancing is a computer networking method for distributing workloads across multiple computing resources, such as computers, a computer cluster, network links, central processing units or disk drives. Load balancing aims to optimize resource use, maximize throughput, minimize response time, and avoid overload of any one of the resources. Using multiple components with load balancing instead of a single component may increase reliability through redundancy. Load balancing is usually provided by dedicated software or hardware, such as a multilayer switch or a Domain Name System server process."


In layman's terms, with respect to nginx, it is a way in which multiple servers stand ready to serve requests from clients, increasing reliability, reducing latency, etc. Having a load balancer allows us to dynamically add and remove back-end servers easily and perform many more functions, listed here:
http://en.wikipedia.org/wiki/Load_balancing_(computing)#Load_balancer_features

This diagram depicts the scenario: a load balancer in front of multiple backend servers.

Source: http://indonetworksecurity.com/

Nginx is an amazing load balancer and is used by many sites (including Wikipedia itself).

In this diagram, the load balancer can be running nginx and the servers can be running apache, boa or nginx itself.

We shall now set up a very simple but effective load balancer using nginx (as the load balancer) and any 2 backend servers (these can be any web servers).

I have assumed you followed the previous post on setting up nginx. [Build and Setup Nginx from Source]

In the conf folder there is an nginx.conf file, and this is the main configuration file for nginx.

Under the http directive, create an upstream directive:

upstream backendserver {
    server backendserver1;
    server backendserver2;
}

backendserver1/2 can be an IP or hostname of a server residing on the same system as nginx (localhost/127.0.0.1) or on another networked system.

Under the server directive inside http, create a location directive:

location / {
    proxy_pass http://backendserver;
}

Note the same name, backendserver, in proxy_pass and upstream.

Restart nginx if it is running, and a simple load balancer is now up and running with round robin distribution. (Make sure the backend servers are up as well :P)
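
Putting it all together, the relevant nginx.conf might look like this minimal sketch, where the backend addresses (127.0.0.1:8081 and 127.0.0.1:8082) are placeholders for wherever your servers actually listen:

events {}    # required section; the defaults are fine here

http {
    upstream backendserver {
        server 127.0.0.1:8081;    # first backend
        server 127.0.0.1:8082;    # second backend
    }

    server {
        listen 80;    # port the load balancer accepts requests on

        location / {
            proxy_pass http://backendserver;    # forward to the upstream group
        }
    }
}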

Advanced Configurations :

ip_hash; specifies that the group should use a load balancing method where requests are distributed between servers based on client IP addresses.

weight=n; nginx allows us to assign each server a number specifying the proportion of traffic that should be directed to it, e.g. server backendserver1 weight=3;

health_check; sends "/" requests to each server in the backend group every five seconds. If any communication error or timeout occurs, or a proxied server responds with a status code other than 2xx or 3xx, the health check fails and the server is considered unhealthy. Client requests are not passed to unhealthy servers. (Note: this directive is part of the commercial NGINX Plus.)

and many more.
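
For instance, an upstream block combining the open source options might look like this (addresses and weights are illustrative):

upstream backendserver {
    ip_hash;                           # pin each client IP to one backend
    server 127.0.0.1:8081 weight=3;    # receives roughly 3x the traffic
    server 127.0.0.1:8082;
}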

Sources :
http://en.wikipedia.org/wiki/Load_balancing_(computing)
http://wiki.nginx.org/HttpUpstreamModule
http://nginx.org/en/docs/http/ngx_http_upstream_module.html
https://www.digitalocean.com/community/articles/how-to-set-up-nginx-load-balancing

Tuesday, September 24, 2013

Build And Setup Nginx From Source

We will now build nginx from source rather than installing pre-compiled versions (:P), and host a directory.

We have to download these files for doing so.

1) latest nginx source code (http://nginx.org)

2) latest PCRE source code (www.pcre.org)

3) latest zlib source code (http://www.zlib.net)


Extract all the files and then cd into the nginx-x.x.x folder.

Note:

PATHN = "the path to folder where nginx build will be stored"
PATHP = "the path to pcre folder"
PATHZ = "the path to zlib folder"

Now run

./configure --prefix=PATHN --with-zlib=PATHZ --with-pcre=PATHP && make && make install


After some time of compiling and linking, the nginx binary will be built.


cd into PATHN (the nginx folder). We will have 4 folders, namely sbin, conf, logs, html.

sbin -> the built nginx binary
conf -> contains the configuration files
logs -> contains various logs while the server is running
html -> basic html files

We can now run the nginx binary and test at 127.0.0.1 or localhost in any browser to get the default nginx welcome page.
(nginx should be run with root privileges to bind to port 80, the default port.)
cd into the conf folder and open nginx.conf.

Go to the server part.

Change listen 80 -> listen xxxx, where xxxx stands for any valid port.

In the location block, change root html -> root "path of the folder you want to host".

Also add autoindex on; if you want to enable automatic directory indexing.
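
After these edits, the server block might look like this (the port and path are placeholders):

server {
    listen 8080;                     # any valid port

    location / {
        root /path/to/your/folder;   # the directory you want to host
        autoindex on;                # enable automatic directory listing
    }
}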

Save and close the file.


Restart nginx and check the new configuration you have set.

To restart nginx, run killall nginx and then run the nginx binary again (alternatively, nginx -s reload reloads the configuration without stopping the server).

All done :)

Sunday, September 22, 2013

Decoding the Jargon

As per the plan, we started reading up on the wiki and a number of other blogs to get a holistic picture of the system that we were planning to patch.

We were already familiar with the concept of a proxy server (a forward proxy server, to be specific) - it acts as an intermediary on the client side, fetching resources on the web and returning them to clients, which aids in implementing caching.

A reverse proxy server is its counterpart on the server side: it takes requests from clients and forwards them to the servers in the internal network. The client connects to the reverse proxy, which abstracts the origin servers and can be designed to provide security, caching or content compression, or be coupled with a load balancer.

BUT, contrary to popular belief, a reverse proxy server isn't synonymous with a load balancer!

A load balancer, as the name indicates, is a mechanism used to enhance performance and reliability through redundancy in the servers in a client-server architecture. It can be implemented in software as well as hardware. 

Our focus is on the current load balancing mechanisms supported by NGINX. 

Clearly a load balancer is a must in a web based system, as it is directly associated with throughput. NGINX provides 2 different mechanisms - round robin and IP hashing. The round robin technique distributes the workload across servers in a round robin fashion. IP hash can be used to associate a particular IP address or set of addresses with a server.

Associating IP addresses with backend servers doesn't prove to be a very efficient technique in stateful applications, where we look to associate client sessions with a single server, so that all of a client's activity and data is tracked and maintained locally by the same server instance. This concept of "sticky sessions" to introduce persistence is essential for scaling any stateful web application. IP hash fails because multiple clients may come in with the same IP address (courtesy of forward proxies).

Our search for existing implementations of this feature in NGINX led us to the nginx-sticky-module, which uses cookies to store and identify user sessions. This module is a win-win so long as cookies aren't disabled in the browser.

So now we finally had a more well defined task - to look for and implement an alternative solution to counter the limitations of the cookie based approach.

Our answer (not exactly ours) - URL rewriting. Assuming that the application appends a session id or an equivalent to the URL, our nginx module can read it and direct the request to the server to which the user session is mapped. A mapping is created when the request lacks a session id, indicating a new client session.

For now we decided to restrict ourselves to a system with a single load balancer, to avoid the challenges of sharing the mapping information.

Coming up next - setting up nginx on our machines and the much dreaded anatomy of nginx. 





Intro.

Hi! We are a team of 3 people working on a college project for a course on the Architecture of Open Source Technologies #PESIT #CS. We are extremely ambitious about implementing session based load balancing in NGINX (engine - X), a popular open source web server. A blog will help us gloat at the end of this course and might accidentally help someone somewhere. #pj

Engine - X ?

Wiki describes NGINX as an open source reverse proxy server, load balancer, HTTP cache and web server. #techjargon #kills

So in the pilot week, we decided to read up about load balancing, web servers and session based load balancing in general before getting our hands dirty with the nitty-gritty of NGINX.