Sunday, December 29, 2013

Python, scalable file uploading

Most solutions either dedicate a thread to each client (like Django) or, despite having an event loop on board, read the whole file into memory (like Tornado). That makes it impossible to handle either a large number of clients or big files. We also do not want to hand the upload over completely to some external entity, because we want to run auth checks before the upload is permitted (otherwise the server can be flooded with unauthorized ingest traffic).
I've borrowed the idea below from Anatoly Mikhailov, see his post "Nginx direct file upload without passing them through backend". Let's do this quickly.

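The prerequisite is nginx 1.5.4+ with the auth_request module (ngx_http_auth_request_module). It is not built by default, so nginx has to be configured with --with-http_auth_request_module.
Excerpt from nginx.conf: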
http {

    server {
        listen       6666;
        server_name  localhost;

        client_max_body_size       5000M;

        location /upload {
            # Ask the backend for permission before accepting the upload.
            auth_request /auth;

            # Always write the request body to a file in this directory.
            client_body_temp_path      /home/adolgarev/tmp/;
            client_body_in_file_only   on;
            client_body_buffer_size    128K;

            # Pass only the headers plus the path of the saved body file.
            proxy_pass_request_headers on;
            proxy_set_header           X-FILE $request_body_file;
            proxy_pass_request_body    off;
            proxy_set_header Content-Length "";
            proxy_redirect             off;
            proxy_pass                 http://localhost:8888/upload;
        }

        location /auth {
            proxy_pass http://localhost:8888/auth;
            proxy_pass_request_body off;
            proxy_set_header Content-Type "";
            proxy_set_header Content-Length "";
            proxy_set_header X-Original-URI $request_uri;
        }

        location / {
            root   html;
            index  index.html index.htm;
        }
    }
}
The backend:
The backend
import os
import os.path
import re

def application(env, start_response):
    # REQUEST_URI is set by uwsgi; it is not part of the WSGI spec itself.
    str_request_uri = env['REQUEST_URI']
    if str_request_uri.startswith('/auth'):
        return auth(env, start_response)
    elif str_request_uri.startswith('/upload'):
        return upload(env, start_response)
    else:
        start_response('404 NOT FOUND', [])
        return []

def auth(env, start_response):
    # nginx passes the original request URI in the X-Original-URI header.
    str_original_uri = env.get('HTTP_X_ORIGINAL_URI', '')
    match = re.match('.*access_token=(?P<access_token>[^&]*)', str_original_uri)
    if match and match.group('access_token') == 'AAA':
        start_response('200 OK', [])
    else:
        start_response('403 Forbidden', [])
    return []

def upload(env, start_response):
    # nginx has already written the body to disk; X-FILE holds its path.
    str_filename = env['HTTP_X_FILE']
    # os.rename needs the "upload" directory to exist on the same filesystem.
    os.rename(str_filename, os.path.join("upload", os.path.basename(str_filename)))
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ["Uploaded {0}".format(str_filename).encode('utf-8')]
As you can see in the example:
  1. auth() authorizes users by the access_token in the X-Original-URI header. It is called after the request headers are read but before the body upload starts.
  2. upload() receives the path to the completely uploaded file in the X-File header and processes it.
You can replace those functions with whatever logic you need.
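As one possible replacement for the regular expression in auth(), here is a minimal sketch that extracts the token with the standard library's URL parser. It assumes Python 3; the helper names extract_access_token and is_authorized and the hard-coded token set are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def extract_access_token(original_uri):
    """Return the access_token query parameter of a request URI, or None."""
    query = urlsplit(original_uri).query
    # parse_qs maps each parameter name to a list of values and
    # drops parameters with empty values by default.
    values = parse_qs(query).get('access_token')
    return values[0] if values else None

def is_authorized(original_uri, valid_tokens=frozenset(['AAA'])):
    """Hypothetical check: accept only tokens listed in valid_tokens."""
    token = extract_access_token(original_uri)
    return token is not None and token in valid_tokens
```

With this, auth() could call is_authorized(env.get('HTTP_X_ORIGINAL_URI', '')) and answer 200 or 403 accordingly. Unlike the regular expression, the parser only looks at the query string, so a path that happens to contain "access_token=" is not mistaken for a token.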
Try it out. Start the backend:
uwsgi --http :8888 --wsgi-file
Upload a file
curl --data-binary '@data_upload_body' http://localhost:6666/upload?access_token=AAA
Uploaded /home/adolgarev/tmp/0000000001
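The response shows that the file keeps the temporary name nginx gave it. If upload() should store the file under its own name instead, something like the following sketch may help. It assumes Python 3; the helper name store_upload and the use of a random UUID as the final file name are assumptions, not part of the original setup. It uses shutil.move rather than os.rename, because os.rename fails when the temp directory and the upload directory are on different filesystems.

```python
import os
import shutil
import uuid

def store_upload(tmp_path, upload_dir):
    """Move nginx's temporary body file into upload_dir under a fresh name.

    Returns the path of the stored file.
    """
    # Create the target directory on first use.
    os.makedirs(upload_dir, exist_ok=True)
    # A random name avoids clashes between uploads (an assumption of
    # this sketch; any unique naming scheme would do).
    target = os.path.join(upload_dir, uuid.uuid4().hex)
    # shutil.move renames when possible and otherwise copies and deletes.
    shutil.move(tmp_path, target)
    return target
```

Inside upload() this would be called as store_upload(env['HTTP_X_FILE'], 'upload'). The backend process needs permission to read and remove files in the nginx temp directory for the move to succeed.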

Unauthorized users will see
curl --data-binary '@data_upload_body' http://localhost:6666/upload?access_token=BBB
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>

