Deploying big spaCy NLP models on AWS Lambda + S3

Hey y’all!

A few weeks ago, I started using spaCy to detect locations in job descriptions. spaCy is an NLP library that lets you do pretty powerful stuff out-of-the-box and get things done fast.

Everything was working fine locally. But NoiceJobs (my project) is hosted on Heroku and uses the cheapest dynos possible, with only 0.5 GB of RAM. That’s enough for simple apps, but ML code is usually more memory- and CPU-intensive, so when I deployed the new version of the app on Heroku I’d get memory quota exceeded errors all the time.

Some AWS engineers jumped into the conversation, and after some back-and-forth we came to the conclusion that AWS could be a good solution for my problem.

I had used Flask on AWS Lambda in the past with Zappa, and I liked how easy the deployment process is and how fast you can get a small app running without too much hassle (most of the time).

Zappa lets you deploy Django or Flask apps on AWS Lambda, but I’d rather use Flask for something simple like this to keep memory usage and the bundle size as low as possible.

I made a smaller but functioning version of the code I run for my app, and you can find it in this GitHub repo:

The README has all the info on how to run it locally and deploy it, so I won’t bore you by repeating it here. If you have any questions, write to me on Twitter and I’ll try to help you.

What I’ll do instead is tell you what didn’t work and what did, so that you don’t waste your time next time.

Loading models — the wrong way (for AWS Lambda)

According to spaCy’s docs, to download one of the pre-trained models you can run python -m spacy download en_core_web_md and then

import spacy
nlp = spacy.load("en_core_web_md")

to use it. However, if you want to deploy your code somewhere, it’s better to install the models as packages.

So in my case, using Pipenv, I did it like this:

pipenv install git+

The #egg=en_core_web_md at the end is used by pip to assign an alias to the package so that later we can import it the same way as before.
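For reference, spaCy publishes each model as a release on the explosion/spacy-models GitHub repo, so the full command looks something like this (the model name and version here are an example; pick the release that matches your spaCy version):

```shell
# Install the medium English model as a regular pip package straight
# from the spacy-models GitHub releases; the #egg fragment names the
# package so it can be imported like any other dependency.
pipenv install "https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.3.1/en_core_web_md-2.3.1.tar.gz#egg=en_core_web_md"
```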

Why is this the wrong way?

It’ll work locally. And if you were to deploy this project on EC2, Lightsail, or whatever VPS you’re using, this would be the right way to do it.

But when you try to deploy it to AWS Lambda, it will fail if you’re using the medium or large models because, when uncompressed, they’re bigger than Lambda’s 250 MB limit.
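If you want to check whether your bundle fits before deploying, you can sum the uncompressed size of your dependencies directory yourself; a small stdlib sketch (the site-packages path in the comment is just an example for a typical virtualenv):

```python
import os

# Lambda's limit for the *uncompressed* deployment package, in MB
LAMBDA_LIMIT_MB = 250

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

# Example: point this at your virtualenv's site-packages, e.g.
# dir_size_mb(".venv/lib/python3.8/site-packages") < LAMBDA_LIMIT_MB
```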

The right way

Or at least the one that worked for me. I asked Chris if loading the model from S3 could be viable, and this was his answer

We can load spaCy models from disk if we pass spacy.load the path to an uncompressed model on disk. A simple version of the function I use to find locations in a string would be:

# src/nlp.py
from collections import Counter
from typing import List

import spacy

from src.manage_models import get_model_from_disk

model = 'en_core_web_md-2.3.1'
model_location = get_model_from_disk(model)
nlp = spacy.load(model_location)


def find_locations(text: str) -> List[str]:
    doc = nlp(text)
    location_labels = ['GPE', 'LOC']
    location_list = [ent.text for ent in doc.ents if ent.label_ in location_labels]
    # most_common() ranks by number of appearances; keys() would only
    # preserve first-seen order
    locations_sorted_by_num_appearances = [loc for loc, _ in Counter(location_list).most_common()]
    return locations_sorted_by_num_appearances
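One detail worth knowing here: Counter.keys() preserves first-seen (insertion) order, not frequency order, so to actually rank locations by number of appearances you want Counter.most_common(). A quick stdlib illustration with made-up city mentions:

```python
from collections import Counter

mentions = ["Paris", "London", "London", "Berlin", "London", "Berlin"]

# keys() keeps first-seen (insertion) order, NOT frequency order...
first_seen = list(Counter(mentions).keys())

# ...while most_common() ranks by number of appearances.
by_frequency = [city for city, _count in Counter(mentions).most_common()]

print(first_seen)    # ['Paris', 'London', 'Berlin']
print(by_frequency)  # ['London', 'Berlin', 'Paris']
```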

There are no transfer costs between S3 and AWS Lambda if they’re in the same region, so all I had to do was upload the model I wanted to an S3 bucket, then download it to the Lambda filesystem, unzip it there, and pass its location to spacy.load. This StackOverflow answer helped.

Then, on each call, check if the model is still on disk (it will be if the calls are close enough together in time).

These are the functions needed to do this:

# src/manage_models.py
import os
import tarfile
from pathlib import Path

import boto3

dest = '/tmp'  # the only writable directory on Lambda
s3_bucket = 'your-models-bucket'  # the S3 bucket where you uploaded the models


def get_model_from_disk(model: str, dest: str = dest) -> str:
    print(f'Getting model {model} from disk')
    filename = os.path.join(Path(dest), model)
    if not os.path.exists(filename):
        print('Not in disk, downloading from S3 bucket')
        filename = download_model_from_s3(model, dest)
    dirname = model.split('-')[0]
    model_full_path = os.path.join(filename, dirname, model)
    return model_full_path


def download_model_from_s3(model: str, dest: str) -> str:
    print(f'Downloading {model} from S3')
    filename = os.path.join(Path(dest), f'{model}.tar.gz')
    # download the compressed model from the bucket
    object_name = f'models/{model}.tar.gz'
    s3 = boto3.client('s3')
    s3.download_file(s3_bucket, object_name, filename)
    unzip_file(filename, dest)
    uncompressed_file = os.path.join(Path(dest), model)
    print(f'Downloaded to {uncompressed_file}')
    return uncompressed_file


def unzip_file(filename: str, dest: str):
    print(f'Unzipping {filename}')
    with tarfile.open(filename) as f:
        f.extractall(dest)
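The download-once-then-reuse behavior is what makes this fast on warm invocations: module-level state and files in /tmp survive between calls handled by the same Lambda container. A stdlib-only sketch of that caching pattern (fake_download is a hypothetical stand-in for the S3 fetch):

```python
import os
import tempfile

DEST = tempfile.mkdtemp()  # on Lambda this would be '/tmp'
download_count = 0

def fake_download(model: str, dest: str) -> str:
    """Hypothetical stand-in for download_model_from_s3."""
    global download_count
    download_count += 1
    path = os.path.join(dest, model)
    os.makedirs(path, exist_ok=True)
    return path

def get_model(model: str, dest: str = DEST) -> str:
    """Return the model's path, downloading only if it's not on disk yet."""
    path = os.path.join(dest, model)
    if not os.path.exists(path):
        path = fake_download(model, dest)
    return path

# Two calls in the same warm container: only the first one downloads.
get_model("en_core_web_md-2.3.1")
get_model("en_core_web_md-2.3.1")
print(download_count)  # 1
```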

There’s more code in this src/manage_models.py file that also automates downloading models from GitHub and uploading them to S3.

Check it out in the GitHub repo; it’s probably the most useful part of this post.

It also comes with a small CLI: python -m src.manage_models -d downloads the model you want from GitHub, and python -m src.manage_models -u uploads it to S3.

Deploying it

Follow the instructions in the README (it’s just setting the name of the S3 bucket you created to store the spaCy models), run zappa deploy dev, and in a couple of minutes you should get a message like this:


One more thing

I’m using this for NoiceJobs, a platform that every day finds 2k+ remote jobs and distributes them to more than 3,000 people via 60 Telegram channels and email.


If you’re looking for a remote job, you’ll probably find it useful. Check it out!

Freelance data scientist and software developer. On Twitter: @xoelipedes.
