If you've worked with Django at some point you probably had the need for some background processing of long running tasks. Chances are you've used some sort of task queue, and Celery is currently the most popular project for this sort of thing in the Python (and Django) world (but there are others).

While working on some projects that used Celery for a task queue I've gathered a number of best practices and decided to document them. Nevertheless, this is more a rant about what I think should be the proper way to do things, and about some underused features that the celery ecosystem offers.

No.1: Don't use the database as your AMQP Broker

Let me explain why I think this is wrong (aside from the limitations pointed out in the celery docs).

A database is not built for the job a proper AMQP broker like RabbitMQ is designed for. At some point it will break down, probably in production, and at a traffic level or user base that isn't even that large.

I guess the most popular reason people decide to use a database is because, well, they already have one for their web app, so why not re-use it. Setting up is a breeze and you don't need to worry about another component (like RabbitMQ).

Not so hypothetical scenario: Let's say you have 4 background workers processing the tasks you've put in the database. This means that you get 4 processes polling the database for new tasks fairly often, not to mention that each of those 4 workers can have multiple concurrent threads of its own. At some point you notice that you are falling behind on your task processing and more tasks are coming in than are being completed, so naturally you increase the number of workers doing the task processing. Suddenly your database starts falling apart due to the huge number of workers polling it for new tasks, your disk IO goes through the roof, and your webapp starts being affected by the slowdown because the workers are basically DDoS-ing the database.

This does not happen when you have a proper AMQP broker like RabbitMQ because, for one thing, the queue resides in memory so you don't hammer your disk. The consumers (the workers) do not need to resort to polling, as the broker has a way of pushing new tasks to the consumers. And if the broker does get overwhelmed for some other reason, at least it will not bring down the user-facing web app with it.

I would go as far as to say that you shouldn't use a database for a broker even in development, what with things like Docker and a ton of pre-built images that already give you RabbitMQ out of the box.
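Pointing Celery at RabbitMQ is mostly a one-line setting. Here is a minimal sketch of the relevant part of celeryconfig.py, assuming a local RabbitMQ with its stock guest account and the default virtual host. The host, port and credentials are placeholders, not values from any real deployment. Newer Celery versions spell the setting broker_url.

```python
# celeryconfig.py (sketch): broker setting only.
# Assumption: RabbitMQ runs on localhost:5672 with the default guest/guest
# account and the default "/" virtual host. Adjust all of these for your setup.
BROKER_URL = "amqp://guest:guest@localhost:5672//"
```

In development the same URL works against a RabbitMQ container that publishes port 5672 on localhost.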

No.2: Use more Queues (i.e. not just the default one)

Celery is fairly simple to set up, and it comes with a default queue in which it puts all the tasks unless you tell it otherwise. The most common thing you'll see is something like this:

@app.task()
def my_taskA(a, b, c):
    print("doing something here...")

@app.task()
def my_taskB(x, y):
    print("doing something here...")

What happens here is that both tasks will end up in the same queue (if not specified otherwise in the celeryconfig.py file). I can definitely see the appeal of doing something like this because with just one decorator you've got yourself some sweet background tasks. My concern here is that taskA and taskB might be doing totally different things, and perhaps one of them might even be much more important than the other, so why throw them both in the same basket? Even if you've got just one worker processing both tasks, suppose that at some point the unimportant taskB gets so massive in numbers that the more important taskA just can't get enough attention from the worker. At this point increasing the number of workers will probably not solve your problem, as all workers still need to process both tasks, and with taskB so great in numbers taskA still can't get the attention it deserves. Which brings us to the next point.

No.3: Use priority workers

The way to solve the issue above is to have taskA in one queue (Q1) and taskB in another (Q2), and then assign x workers to process Q1 and all the other workers to process the more intensive Q2, as it has more tasks coming in. This way you can still make sure that taskB gets enough workers, all the while maintaining a few priority workers that just need to process taskA when one comes in, without making it wait too long.

So, define your queues manually:

from kombu import Exchange, Queue

CELERY_QUEUES = (
    Queue('default', Exchange('default'), routing_key='default'),
    Queue('for_task_A', Exchange('for_task_A'), routing_key='for_task_A'),
    Queue('for_task_B', Exchange('for_task_B'), routing_key='for_task_B'),
)

And your routes that will decide which task goes where:

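# Keys must match each task's registered name, which by default includes
# the module path (e.g. 'proj.tasks.my_taskA' for a hypothetical proj/tasks.py).
CELERY_ROUTES = {
    'my_taskA': {'queue': 'for_task_A', 'routing_key': 'for_task_A'},
    'my_taskB': {'queue': 'for_task_B', 'routing_key': 'for_task_B'},
}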

Which will allow you to run workers for each task:

celery worker -E -l INFO -n workerA -Q for_task_A
celery worker -E -l INFO -n workerB -Q for_task_B
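To make the starvation argument concrete, here is a toy simulation in plain Python (no Celery involved). It assumes every task takes exactly one "tick" and that each worker pulls only from the queue it was started with. The function name and the numbers are made up for illustration.

```python
from collections import deque


def wait_for_task_a(queues, workers):
    """Return how many ticks pass before the single 'A' task starts.

    queues:  dict mapping queue name -> deque of task labels
    workers: list of queue names, one entry per worker; each worker
             takes one task per tick, from its own queue only.
    """
    tick = 0
    # Keep going while at least one worker still has something to do.
    while any(queues[name] for name in set(workers)):
        for queue_name in workers:
            if queues[queue_name]:
                task = queues[queue_name].popleft()
                if task == "A":
                    return tick
        tick += 1
    return None


# One shared queue: 1000 taskB jobs are already waiting when taskA arrives.
shared = {"default": deque(["B"] * 1000 + ["A"])}
shared_wait = wait_for_task_a(shared, workers=["default"] * 4)

# Separate queues: three workers drain taskB, one is reserved for taskA.
split = {"for_task_B": deque(["B"] * 1000), "for_task_A": deque(["A"])}
split_wait = wait_for_task_a(split, workers=["for_task_B"] * 3 + ["for_task_A"])

print(shared_wait, split_wait)  # prints: 250 0
```

With four workers on one queue, taskA sits behind 250 ticks of taskB work. With one worker reserved for its own queue it starts immediately, at the cost of taskB draining a bit more slowly.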

No.4: Use Celery's error handling mechanisms

Most tasks I've seen in the wild don't have a notion of error handling at all. If a task fails that's it, it failed. This might be fine for some use cases; however, most tasks I've seen are talking to some kind of 3rd party API and fail because of some sort of network error, or other kind of "resource availability" error. The simplest way to handle these kinds of errors is to just retry the task, because maybe the 3rd party API just had some server/network issues and will be back up shortly. Why not give it another go?

@app.task(bind=True, default_retry_delay=300, max_retries=5)
def my_task_A(self):
    try:
        print("doing stuff here...")
    except SomeNetworkException as e:
        print("maybe do some cleanup here....")
        # bind=True passes the task instance as self; retry() raises to reschedule
        raise self.retry(exc=e)

What I like to do is define per-task defaults for how long a task should wait before being retried, and how many retries are enough before finally giving up (the default_retry_delay and max_retries parameters respectively). This is the most basic form of error handling that I can think of, and yet I almost never see it used. Of course Celery offers more in terms of error handling, but I'll leave you with the Celery docs for that.
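A fixed delay is fine to start with, but when a 3rd party API is down for a while it can be gentler to wait longer on each attempt. The sketch below is not from the original example: backoff_delay is a hypothetical helper, and its base and cap values are arbitrary assumptions.

```python
import random


def backoff_delay(retries, base=60, cap=3600, jitter=True):
    """Seconds to wait before retry number `retries` (0-based).

    Doubles the delay on every attempt, caps it, and optionally
    randomizes it so many failed tasks do not retry in lockstep.
    """
    delay = min(cap, base * (2 ** retries))
    if jitter:
        delay = random.uniform(delay / 2, delay)
    return delay


# Delays without jitter for the first five retries.
print([backoff_delay(n, jitter=False) for n in range(5)])
# prints: [60, 120, 240, 480, 960]
```

Inside a bound task this could be wired in as raise self.retry(exc=e, countdown=backoff_delay(self.request.retries)), where countdown overrides default_retry_delay for that one retry.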

No.5: Use Flower

The Flower project is a wonderful tool for monitoring your Celery tasks and workers. It's web based and allows you to do stuff like see task progress and details, check worker status, bring up new workers and so forth. Check out the full list of features in the provided link.

No.6: Keep track of results only if you really need them

A task's status is the information about whether the task exited with a success or a failure. It can be useful for some kind of statistics later on. The big thing to note here is that the exit status is not the result of the job that the task was performing. That information is most likely some sort of side effect that gets written to the database (i.e. updating a user's friend list).

Most projects I've seen don't really care about keeping persistent track of a task's status after it exited yet most of them use either the default sqlite database for saving this information, or even better, they've taken the time and use their regular database (postgres or otherwise).

Why hammer your webapp's database for no reason? Use CELERY_IGNORE_RESULT = True in your celeryconfig.py and discard the results.

No.7: Don't pass Database/ORM objects to tasks

After giving this talk at a local Python meetup a few people suggested I add this to the list. What's it all about? You shouldn't pass Database objects (for instance your User model) to a background task because the serialized object might contain stale data. What you want to do is feed the task the User id and have the task ask the database for a fresh User object.
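The stale-data problem is easy to reproduce without Celery. The sketch below uses an in-memory SQLite table and two plain functions as stand-ins for tasks. All names in it are made up for the demo.

```python
import sqlite3

# In-memory stand-in for the web app's database (assumption for the demo).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.execute("INSERT INTO users (id, email) VALUES (1, 'old@example.com')")


def fetch_user(user_id):
    """Load a fresh copy of the user row as a dict."""
    row = db.execute(
        "SELECT id, email FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return {"id": row[0], "email": row[1]}


def send_email_with_object(user):
    # Anti-pattern: uses whatever was serialized when the task was queued.
    return user["email"]


def send_email_with_id(user_id):
    # Preferred: re-read the row when the task actually runs.
    return fetch_user(user_id)["email"]


# "Enqueue" time: snapshot the whole object vs. keep only the id.
queued_object = fetch_user(1)
queued_id = 1

# The user changes their email before a worker picks the task up.
db.execute("UPDATE users SET email = 'new@example.com' WHERE id = 1")

print(send_email_with_object(queued_object))  # prints: old@example.com
print(send_email_with_id(queued_id))          # prints: new@example.com
```

In a Django project the body of the real task would start with something like User.objects.get(pk=user_id) in place of fetch_user.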
