Got Shrinkage? Rasterio and ManyLinux Wheels in Serverless Lambda
2018-06-27
Rasterio ManyLinux wheels are super awesome! You get the power and functionality of GDAL binaries compiled and packaged up, ready to go anywhere. Exactly what you need for serverless functions (lambda). Alternatively, there is a GDAL with ManyLinux wheels experiment out there as well.
Backstory
I developed a raster image processing pipeline to convert TNRIS's huge historical archive from geotiffs into COGs served as WMS services from s3. A major hurdle in that project was getting what I was doing locally with 'gdaltindex' up into a serverless lambda function. I don't have any experience in any C languages, and compiling an independent GDAL binary to deploy with the function code was out of reach for me. If I wrote the function in python I could use Rasterio, since that package utilizes GDAL, but a standard Rasterio install doesn't stand on its own; it expects GDAL binaries to already be available on the system. The saving grace was when I came across this github issue, which revealed to me that I was not alone. Luckily, Mapbox was working on a Rasterio package utilizing ManyLinux wheels, with the GDAL binaries compiled right into the package.
pip install --pre 'rasterio[s3]>=1.0b4'
(The quotes matter; without them the shell treats the '>' as a redirect.)
I wrote my python function using this working version of Rasterio with the ManyLinux wheels and it tested perfectly. Then came deployment which, sure enough, revealed another hurdle.
When you deploy to lambda, you do so by uploading a zipfile with all the code and dependencies included. The zipfile must be < 50 MB. If it is larger, you can get away with uploading it to s3 and then pointing lambda at it, but even then the package must be < 250 MB uncompressed.
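For reference, the s3 route looks roughly like this. A minimal sketch; the bucket, key, and function names are placeholders, not from my project:
cd /path/to/function
zip -r9 ../deployment.zip .
aws s3 cp ../deployment.zip s3://my-deploy-bucket/deployment.zip
aws lambda update-function-code --function-name my-function --s3-bucket my-deploy-bucket --s3-key deployment.zip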
My package was originally: 7,716 items - 344.0 MB
...or, waaaaaay too big.
How do you shrink the deployment? Here comes Seth Fitzsimmons with a brilliant blog post about this very subject with these very dependencies. He cleverly outlined a process using 'atime' (the access-time metadata the filesystem records for every file) to track the individual files in the dependencies being used by the lambda function... from there, just delete the ones that are not being used. I followed his guidance to figure out the general steps to make it happen. I'm posting here to share the detailed commands and code I used in hopes of providing some shortcuts to others.
It was super successful, with my output deployment package shrinking down to: 2,072 items - 154.0 MB (44 MB compressed!)
How To Shrink the Package
Let's start with a directory that is ready for lambda deployment but is too large. In it is the lambda code and all the dependency packages. I'm working on Fedora 27, which has full 'atime' updates disabled by default (it uses 'relatime' instead), so the first thing I had to do was enable them. The simplest instructions I discovered for this were here. Be sure to apply 'strictatime' to the mounted volume where the python script is running:
1. Run
df -h ./lambda_function.py
to find the mounted volume where the python script is running. This will print the volume 'Filesystem'. Obviously, use the path to the location of your lambda function as opposed to my example relative path.
2. Run
findmnt /dev/mapper/fedora-home
to see that 'relatime' is enabled, which blocks atime updates. Swap out the volume 'Filesystem' with the response from step 1 if it differs.
3. Run
sudo mount -o remount,strictatime /dev/mapper/fedora-home
to disable 'relatime' by enabling 'strictatime'.
4. Run
findmnt /dev/mapper/fedora-home
again to prove 'relatime' is removed.
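If you want a quick sanity check that 'strictatime' took effect, reading a file should bump its access time (GNU stat's '%X' format prints the atime as seconds since epoch):
stat -c '%X' ./lambda_function.py
cat ./lambda_function.py > /dev/null
stat -c '%X' ./lambda_function.py
The second stat should print a later timestamp than the first.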
With both 'atime' and your python virtual environment enabled:
5. cd into the lambda function directory.
6. Run
touch start
to create an arbitrary file to compare against.
7. Run the lambda function from the directory (see the invocation sketch after this list).
8. Run
find /path/to/function/ -type f -anewer ./start > dep_whitelist.txt
to create a txt list of files with an atime later than the arbitrary 'start' file. These were the files in the dependencies that were actually used by the function when you just ran it.
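For step 7, a minimal local invocation sketch; I'm assuming the conventional handler name lambda_handler inside lambda_function.py here, so swap in your own names and a representative event payload:
python -c "import lambda_function; lambda_function.lambda_handler({}, None)"
Make sure this run exercises the same code paths the deployed function will; any file the run doesn't touch won't make the whitelist.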
9. Run a quick little python script to walk the function directory and delete all unused files (files not in dep_whitelist.txt). My script was named 'dep_cleanup.py' and sat in the lambda function directory. Be sure to explicitly include the lambda function, requirements, and whitelist files themselves in your whitelist; that is what the 'hrd_whitelist' list variable handles below. My 'dep_cleanup.py' looked like this:
import os

# get this directory
cur_dir = os.path.dirname(os.path.realpath(__file__))
print(cur_dir)

# non-whitelist files that we don't want to delete
hrd_whitelist = []
hrd_whitelist.append(cur_dir + "/dep_cleanup.py")
hrd_whitelist.append(cur_dir + "/dep_whitelist.txt")
hrd_whitelist.append(cur_dir + "/lambda_function.py")
hrd_whitelist.append(cur_dir + "/requirements.txt")
print("hardcoded to include:")
print(hrd_whitelist)

# open dep_whitelist file and merge with hardcoded list
dep_whitelist = open("dep_whitelist.txt", "r")
dep_lines = dep_whitelist.read().splitlines()
whitelist = hrd_whitelist + dep_lines

# count files deleted
counter = 0

# walk all files and folders and check if each file is in the whitelist
for (dirpath, dirnames, filenames) in os.walk(cur_dir):
    for filename in filenames:
        single_file = os.path.join(dirpath, filename)
        # if not in the whitelist then delete
        if single_file not in whitelist:
            os.remove(single_file)
            counter += 1

print(str(counter) + " files deleted")
print("that's all folks!!")
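From there, run the cleanup and re-test before zipping things up. A rough sketch; the empty-directory pruning and the re-test (reusing the hypothetical handler invocation from above) are extras on top of the original script:
python dep_cleanup.py
find . -type d -empty -delete
python -c "import lambda_function; lambda_function.lambda_handler({}, None)"
The pruning step is optional but tidy, since os.remove deletes files and leaves the emptied folders behind.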
For the actual code, project, and context I was working in, visit the github repo here: https://github.com/TNRIS/lambda-s4. The lambda function that utilizes this process lives in the `ls4-04-shp_index` directory within the repo; it contains the 'dep_cleanup.py' file.