Got Shrinkage? Rasterio and ManyLinux Wheels in Serverless Lambda
2018-06-27
Rasterio ManyLinux wheels are super awesome! You get the power and functionality of GDAL binaries compiled and packaged up, ready to go anywhere. Exactly what you need for serverless functions (lambda). Alternatively, there is a GDAL with ManyLinux wheels experiment out there as well.
Backstory
I developed a raster image processing pipeline to convert TNRIS's huge historical archive from geotiffs into COGs served as WMS services from s3. A major hurdle in that project was getting what I was doing locally with 'gdaltindex' up into a serverless lambda function. I don't have any experience in any C languages, and compiling an independent GDAL binary to deploy with the function code was out of reach for me. If I wrote the function in python I could use Rasterio, since that package utilizes GDAL, but a standard Rasterio install doesn't stand on its own; it expects GDAL binaries to already be available on the system. The saving grace was when I came across this github issue, which revealed to me that I was not alone. Luckily, Mapbox was working on a Rasterio package utilizing ManyLinux wheels, with the GDAL binaries compiled right into the package.
pip install --pre 'rasterio[s3]>=1.0b4'
(The quotes matter; without them the shell treats the '>' as a redirect.)
I wrote my python function using this working version of Rasterio with the ManyLinux wheels and it tested perfectly. Then came deployment which, sure enough, revealed another hurdle.
When you deploy to lambda, you do so by uploading a zipfile with all the code and dependencies included. The zipfile must be < 50 MB. If it is larger, you can get away with uploading it to s3 and then pointing lambda at it, but even then the package must be < 250 MB uncompressed.
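For reference, the s3 route looks roughly like this. A minimal sketch; the bucket, key, and function names are placeholders, not from my project:
cd /path/to/function
zip -r9 ../deployment.zip .
aws s3 cp ../deployment.zip s3://my-deploy-bucket/deployment.zip
aws lambda update-function-code --function-name my-function --s3-bucket my-deploy-bucket --s3-key deployment.zip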
My package was originally: 7,716 items - 344.0 MB
...or, waaaaaay too big.
How do you shrink the deployment? Here comes Seth Fitzsimmons with a brilliant blog post about this very subject with these very dependencies. He cleverly outlined a process using 'atime' (the access-time metadata the filesystem records for every file) to track the individual files in the dependencies being used by the lambda function... from there, just delete the ones that are not being used. I followed his guidance to figure out the general steps to make it happen. I'm posting here to share the detailed commands and code I used in hopes of providing some shortcuts to others.
It was super successful, with my output deployment package shrinking down to: 2,072 items - 154.0 MB (44 MB compressed!)
How To Shrink the Package
Let's start with a directory that is ready for lambda deployment but is too large. In it is the lambda code and all the dependency packages. I'm working on Fedora 27, which has full 'atime' updates disabled by default (it uses 'relatime' instead), so the first thing I had to do was enable them. The simplest instructions I discovered for this were here. Be sure to apply 'strictatime' to the mounted volume where the python script is running:
1. Run
df -h ./lambda_function.py
to find the mounted volume where the python script is running. This will print the volume 'Filesystem'. Obviously, use the path to the location of your lambda function as opposed to my example relative path.
2. Run
findmnt /dev/mapper/fedora-home
to see that 'relatime' is enabled, which blocks atime updates. Swap out the volume 'Filesystem' with the response from step 1 if it differs.
3. Run
sudo mount -o remount,strictatime /dev/mapper/fedora-home
to disable 'relatime' by enabling 'strictatime'.
4. Run
findmnt /dev/mapper/fedora-home
again to prove 'relatime' is removed.
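If you want a quick sanity check that 'strictatime' took effect, reading a file should bump its access time (GNU stat's '%X' format prints the atime as seconds since epoch):
stat -c '%X' ./lambda_function.py
cat ./lambda_function.py > /dev/null
stat -c '%X' ./lambda_function.py
The second stat should print a later timestamp than the first.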
With both 'atime' and your python virtual environment enabled:
5. cd into the lambda function directory.
6. Run
touch start
to create an arbitrary file to compare against.
7. Run the lambda function from the directory (see the invocation sketch after this list).
8. Run
find /path/to/function/ -type f -anewer ./start > dep_whitelist.txt
to create a txt list of files with an atime later than the arbitrary 'start' file. These were the files in the dependencies that were actually used by the function when you just ran it.
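For step 7, a minimal local invocation sketch; I'm assuming the conventional handler name lambda_handler inside lambda_function.py here, so swap in your own names and a representative event payload:
python -c "import lambda_function; lambda_function.lambda_handler({}, None)"
Make sure this run exercises the same code paths the deployed function will; any file the run doesn't touch won't make the whitelist.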
9. Run a quick little python script to walk the function directory and delete all unused files (files not in dep_whitelist.txt). My script was named 'dep_cleanup.py' and sat in the lambda function directory. Be sure to explicitly include the lambda function, requirements, and whitelist files themselves in your whitelist; that is what the 'hrd_whitelist' list variable handles below. My 'dep_cleanup.py' looked like this:
import os

# get this directory
cur_dir = os.path.dirname(os.path.realpath(__file__))
print(cur_dir)

# non-whitelist files that we don't want to delete
hrd_whitelist = []
hrd_whitelist.append(cur_dir + "/dep_cleanup.py")
hrd_whitelist.append(cur_dir + "/dep_whitelist.txt")
hrd_whitelist.append(cur_dir + "/lambda_function.py")
hrd_whitelist.append(cur_dir + "/requirements.txt")
print("hardcoded to include:")
print(hrd_whitelist)

# open dep_whitelist file and merge with hardcoded list
dep_whitelist = open("dep_whitelist.txt", "r")
dep_lines = dep_whitelist.read().splitlines()
whitelist = hrd_whitelist + dep_lines

# count files deleted
counter = 0

# walk all files and folders and check if each file is in the whitelist
for (dirpath, dirnames, filenames) in os.walk(cur_dir):
    for filename in filenames:
        single_file = os.path.join(dirpath, filename)
        # if not in the whitelist then delete
        if single_file not in whitelist:
            os.remove(single_file)
            counter += 1

print(str(counter) + " files deleted")
print("that's all folks!!")
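From there, run the cleanup and re-test before zipping things up. A rough sketch; the empty-directory pruning and the re-test (reusing the hypothetical handler invocation from above) are extras on top of the original script:
python dep_cleanup.py
find . -type d -empty -delete
python -c "import lambda_function; lambda_function.lambda_handler({}, None)"
The pruning step is optional but tidy, since os.remove deletes files and leaves the emptied folders behind.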
For the actual code, project, and context I was working in, visit the github repo here: https://github.com/TNRIS/lambda-s4. The lambda function that utilizes this process lives in the `ls4-04-shp_index` directory within the repo; it contains the 'dep_cleanup.py' file.