Whoa! Another Reason To Love Vim

I've been struggling with some misconfigured appliances at work for the past couple of days, and I was getting tired of manually diff-ing things. On a whim, I decided to ask Google if there was a better way. Turns out there is, and it uses what I already know and love: Vim. Here's a command that lets you diff two remote files using vimdiff:

vimdiff scp://user@host//path/to/file scp://user@otherhost//path/to/file
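
Incidentally, the same trick appears to work for comparing a local file against a remote copy, since it's Vim's netrw plugin that handles the scp:// URLs:

vimdiff /path/to/local/file scp://user@host//path/to/file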

This is going to save me so much time! I hope it is as useful to you all as it is to me.

SVN Commits By User

The other day at work, I found myself needing to see a list of Subversion commits by a specific user. I spent a few minutes looking at the svn log help, but nothing seemed to be designed to show commits by user. It took me a while to find something to do the trick, but this is it:

svn log | sed -n '/username/,/-----$/ p'
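
In case the sed magic isn't obvious: svn log entries look something like this (the revision, date, and message here are made up):

------------------------------------------------------------------------
r42 | username | 2010-03-01 10:23:11 -0700 (Mon, 01 Mar 2010) | 1 line

Fixed that one thing.
------------------------------------------------------------------------

The -n flag tells sed not to print anything by default, and the /username/,/-----$/ range with the p command prints everything from a line containing the username through the next separator line. One caveat: it matches any line containing the username, so commit messages that merely mention the username will sneak in too.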

Gotta love sed!

More django-articles Updates

I've spent a little more time lately adding new features to django-articles. There are two major additions in the latest release (2.0.0-pre2):

  • Article attachments
  • Article statuses

That's right, folks! You can finally attach files to your articles. This includes attachments to emails that you send, if you have the articles-from-email feature properly configured. To prove it, I'm going to attach a file to this article (which I'm posting via email).

Next, I've decided that it's worth allowing users to specify different statuses for their articles. One of the neat things about this feature is that if you're logged in as a superuser and you save an article with a status that is designated as "non-live", you will still be able to see it on the site. This is a way for users to preview their work before making it live. Out of the box, there are only two statuses: draft and finished. You're free to add more statuses if you feel so inclined (they're in the database, not hardcoded).
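
If you do want more statuses, something like this from a Django shell ought to do it. I'm assuming here that the model is called ArticleStatus and that it has an is_live flag--check the app's models.py if your copy differs:

from articles.models import ArticleStatus

# assumed model/field names; a non-live status behaves like "draft"
ArticleStatus.objects.create(name='in review', is_live=False)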

The article status is still separate from the "is_active" flag when saving an article. Any article that is marked as inactive will not appear on the site regardless of the article's "status".

On a slightly less impressive note (although still important), this release includes some basic unit tests. Most of the tests currently revolve around article statuses and making sure that the appropriate articles appear on the site.

Learned Something New Today

I learned something very interesting today regarding JavaScript. Back in the day, I used to put something like this in my HTML when I wanted to include some JS:

<script language="javascript">
...
</script>

Then I learned that I should be using something like this instead:

<script type="text/javascript">
...
</script>

I've been doing that for years and years now. Turns out I've been wrong all this time. Well, at least for 4 years of that time. I stumbled upon RFC 4329 today for whatever reason and noticed that it marks the text/javascript media type as obsolete. I dug into the RFC a bit and found this:

Various unregistered media types have been used in an ad-hoc fashion
to label and exchange programs written in ECMAScript and JavaScript.
These include:

   +-----------------------------------------------------+
   | text/javascript          | text/ecmascript          |
   | text/javascript1.0       | text/javascript1.1       |
   | text/javascript1.2       | text/javascript1.3       |
   | text/javascript1.4       | text/javascript1.5       |
   | text/jscript             | text/livescript          |
   | text/x-javascript        | text/x-ecmascript        |
   | application/x-javascript | application/x-ecmascript |
   | application/javascript   | application/ecmascript   |
   +-----------------------------------------------------+

Use of the "text" top-level type for this kind of content is known to
be problematic.  This document thus defines text/javascript and text/
ecmascript but marks them as "obsolete".  Use of experimental and
unregistered media types, as listed in part above, is discouraged.
The media types,

   * application/javascript
   * application/ecmascript

which are also defined in this document, are intended for common use
and should be used instead.
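
So the tags I should be using look more like this:

<script type="application/javascript">
...
</script>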

So yeah. It's time to go update all of my JavaScript stuff I guess. I thought the rest of you who are/were in the same boat as me might like to know about this...

GitHub and django-articles

Some of you who prefer to use git for your version control needs and were following the django-articles mirror on GitHub may have noticed some strange activity recently. I noticed today that the GitHub mirror was out of sync with the other mirrors, and I took a bit of time to investigate the problem.

I thought, for some reason, that I might be able to quickly and easily bring it back into sync if I just deleted the repo, recreated it, and pushed my changes to it. That didn't work. This means that all of you who were once following the project there are no longer following it, and I only realized that side effect after I had clicked the delete button. I apologize for this inconvenience.

In the end, it turned out that I had some things misconfigured with git on my box. I have resolved the problems and have brought the mirror back into sync. Please let me know if you run into any problems with it!

New Feature in django-articles: Articles From Email

One of the features that I really like about sites like posterous and tumblr is that they allow you to send email to a special email address and have it be posted as a blog article. This is a feature I've been planning to implement in django-articles pretty much since its inception way back when. I finally got around to working on it.

The latest release of django-articles allows you to configure a mailbox, either IMAP4 or POP3, to periodically check for new emails. A new management command, check_for_articles_from_email, can be used to process the messages found in the special mailbox. If any emails are found, they will be fetched, parsed, and posted based on your configuration values. Only emails whose sender matches an active user on your Django site will be turned into articles. You can configure the command to mark such articles from email as "inactive" so they don't appear on the site without moderation. The default behavior, actually, is to mark the articles inactive--you must explicitly configure django-articles to mark them active automatically if you want that behavior.
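
To handle the periodic checking, you might wire the command up to cron with an entry along these lines (the schedule and paths here are placeholders):

# check the mailbox for new articles every 10 minutes
# (paths are placeholders; adjust for your environment)
*/10 * * * * /path/to/env/bin/python /path/to/project/manage.py check_for_articles_from_email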

One of the biggest things that you should keep in mind with this new feature, though, is that it does not currently take your attachments into account. In time I plan on implementing this functionality. For now, only the plain text content of your email will be posted. Please see the project's README for more information about this new feature.

Please keep in mind that this is brand new functionality and it's not been very well tested in a wide variety of situations. Right now, it's in the "it works for me" stage. If you find problems with it, please create a ticket or update any similar existing tickets using the ticket tracker on bitbucket.org.

You can install or update django-articles using the following utilities:

  • pip install -U django-articles
  • easy_install -U django-articles
  • hg clone http://bitbucket.org/codekoala/django-articles/ or just hg pull -u if you have already cloned it
  • git clone git://github.com/codekoala/django-articles.git

Enjoy!

P.S. This article was posted via email

On Security and Python's Exec

A recent project at work has renewed my aversion to Python's exec statement--particularly when you want to use it with arbitrary, untrusted code. The project requirements necessitated the use of exec, so I got to do some interesting experiments with it. I've got a few friends who, until I slapped some sense into them, were seemingly big fans of exec (in Django projects, even...). This article is for them and others in the same boat.

Take this example:

#!/usr/bin/env python

import sys

dirname = '/usr/lib/python2.6/site-packages'

print dirname, 'in path?', (dirname in sys.path)

exec """import sys

dirname = '/usr/lib/python2.6/site-packages'
print 'In exec path?', (dirname in sys.path)

sys.path.remove(dirname)

print 'In exec path?', (dirname in sys.path)"""

print dirname, 'in path?', (dirname in sys.path)

Take a second and examine what the script is doing. Done? Great... So, the script first makes sure that a very critical directory is in my PYTHONPATH: /usr/lib/python2.6/site-packages. This is the directory where all of the awesome Python packages, like PIL, lxml, and dozens of others, reside. This is where Python will look for such packages when I try to import and use them in my programs.

Next, a little Python snippet is executed using exec. Let's say this snippet comes from an untrusted source (a visitor to your website, for example). The snippet removes that very important directory from my PYTHONPATH. It might seem like it's relatively safe to do within an exec--maybe it doesn't change the PYTHONPATH that I was using before the exec?

Wrong. The output of this script on my personal system says it all:

$ python bad.py
/usr/lib/python2.6/site-packages in path? True
In exec path? True
In exec path? False
/usr/lib/python2.6/site-packages in path? False

From this example, we learn that Python code that is executed using exec runs in the same context as the code that uses exec. This is a critical concept to learn.

Some people might say, "Oh, there's an easy way around that. Give exec its own globals dictionary to work with, and all will be well." Wrong again. Here's a modified version of the above script.

#!/usr/bin/env python

import sys

dirname = '/usr/lib/python2.6/site-packages'

print dirname, 'in path?', (dirname in sys.path)

context = {'something': 'This is a special context for the exec'}

exec """import sys

print something
dirname = '/usr/lib/python2.6/site-packages'
print 'In exec path?', (dirname in sys.path)

sys.path.remove(dirname)

print 'In exec path?', (dirname in sys.path)""" in context

print dirname, 'in path?', (dirname in sys.path)

And here's the output:

$ python also_bad.py
/usr/lib/python2.6/site-packages in path? True
This is a special context for the exec
In exec path? True
In exec path? False
/usr/lib/python2.6/site-packages in path? False

How can you get around this glaring risk in the exec statement? One possible solution is to execute the snippet in its own process. Might not be the best way to handle things. Could be the absolute worst solution. But it's a solution, and it works:

#!/usr/bin/env python

import multiprocessing
import sys

def execute_snippet(snippet):
    # runs in a child process, so the exec can only mangle the
    # child's copy of sys.path, not the parent's
    exec snippet

dirname = '/usr/lib/python2.6/site-packages'

print dirname, 'in path?', (dirname in sys.path)

snippet = """import sys

dirname = '/usr/lib/python2.6/site-packages'
print 'In exec path?', (dirname in sys.path)

sys.path.remove(dirname)

print 'In exec path?', (dirname in sys.path)"""

# execute the untrusted snippet in a separate process
proc = multiprocessing.Process(target=execute_snippet, args=(snippet,))
proc.start()
proc.join()

print dirname, 'in path?', (dirname in sys.path)

And here comes the output:

$ python better.py
/usr/lib/python2.6/site-packages in path? True
In exec path? True
In exec path? False
/usr/lib/python2.6/site-packages in path? True

So the PYTHONPATH is only affected by the sys.path.remove within the process that executes the snippet using exec. The process that spawns the subprocess is unaffected, and can continue with life, happily importing all of those wonderful packages from the site-packages directory. Yay.

With that said, exec isn't always bad. But my personal point of view is basically, "There is probably a better way." Unfortunately for me, that does not hold up in my current situation, and it might not work for your circumstances either. If no one is forcing you to use exec, you might investigate alternatives in all of that free time you've been wondering what to do with.

Python And Execution Context

I recently found myself in a situation where knowing the execution context of a function became necessary. It took me several hours to learn about this functionality, despite many cleverly-crafted Google searches. So, being the generous person I am, I want to share my findings.

My particular use case required that a function behave differently depending on whether it was called in an exec call. Specifics beyond that are not important for this article. Here's an example of how I was able to get my desired behavior.

import inspect

def is_exec():
    # grab the frame of whatever called this function
    caller = inspect.currentframe().f_back

    # frames created by exec (and by the interactive interpreter)
    # don't map back to a real module
    module = inspect.getmodule(caller)

    if module is None:
        print "I'm being run by exec!"
    else:
        print "I'm being run by %s" % module.__name__

def main():
    is_exec()

    exec "is_exec()"

if __name__ == '__main__':
    main()

The output of such a script would look like this:

$ python is_exec.py
I'm being run by __main__
I'm being run by exec!

It's also interesting to note that when you're using the Python interactive interpreter, calling the is_exec function from the code above will tell you that you are indeed using exec.
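
A quick session demonstrates this (assuming the code above lives in is_exec.py):

$ python
>>> from is_exec import is_exec
>>> is_exec()
I'm being run by exec!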

Some may argue that modifying behavior as I needed to is dirty, and that if your system requires such code, you're doing it wrong. Well, you could apply this sort of code to situations that have nothing to do with exec. Perhaps you want to determine which part of your product is using a specific function the most. Perhaps you want to get additional debugging information that isn't immediately obvious.

Just like always, I want to add the disclaimer that there may be other ways to do this--and there probably are. However, this is the way that worked for me. I'd still be interested to hear about other solutions you may have encountered for this problem.

On a side note, if you're up for some slightly advanced Python mumbo jumbo, I suggest diving into the inspect documentation.

Monitor Multiple Remote Files Using Multitail

There comes a time in each of our individual lives that we just learn to love log files. We learn to love utilities like tail and grep as we pore over countless lines of information, seeking out the stuff that really matters. We like to show off our debugging prowess as innocent bystanders look on in absolute wonderment.

While that's all fine and dandy, I'm always on the lookout for utilities to make my log monitoring less painful. A few weeks ago, my supervisor introduced me to a program that he's been using for quite some time: multitail. In essence, it's tail with some really neat features, such as the ability to:

  • "tail" multiple files (or commands, like netstat) independently in the same terminal
  • highlight text using regular expressions
  • search log messages and see only the matching lines
  • merge multiple files into one log window
  • scroll back through the history of a log file
  • apply highlighting "themes"

I've been using multitail for a couple of weeks now (it took me a while to warm up to it after my supervisor introduced it), and I'm quite satisfied with it. One thing I really, really like about multitail is that I can kinda sorta almost monitor multiple remote files. What does that mean, you ask?

Well, my development environment includes at least 5 virtual machines, each of which will be logging different but equally important information. I want to be able to "tail" a specific log file on each of the virtual machines in one window. Now, it took me a while to learn how to do this, which is why I'm sharing the information with you.

And here comes my usual disclaimer: this may not be the most efficient way to do what I want to do, but it's currently working for me. I'm open to other solutions too!

Anyway, I can run a command like the following to monitor multiple remote log files:

multitail -l 'ssh user@host1 "tail -f /path/to/log/file"' -l 'ssh user@host2 "tail -f /path/to/log/file"'

Such a command would ssh into two computers, host1 and host2, and run tail -f /path/to/log/file on each. Multitail allows you to monitor the output of both tail commands in a single window, reducing clutter on your desktop. You can also arrange the files/commands you're "tailing" into various rows and columns. I tend to have a 2x2 grid of log files when I use multitail at work.
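
For the curious, the grid layout comes from multitail's -s option, which (if I'm reading the man page right) splits the display into columns. Something like this hypothetical command should produce a 2x2 grid from four log files:

multitail -s 2 -l 'ssh user@host1 "tail -f /path/to/log"' -l 'ssh user@host2 "tail -f /path/to/log"' -l 'ssh user@host3 "tail -f /path/to/log"' -l 'ssh user@host4 "tail -f /path/to/log"'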

I've also started using multitail to monitor the access and error logs for my Django sites on WebFaction. I simply ssh into my account, run an alias for a ridiculous multitail command, and watch as both log files scroll on by.

Again, this is just another aspect of my work environment that is fun and useful to me, and I wanted to spread the joy. Multitail may or may not be a utility you like to use, but it suits my current needs and desires quite well. YMMV. And, once again, I'm always on the look-out for other tools to make my work life more interesting and productive!

Site-Wide Caching in Django

My last article about caching RSS feeds in a Django project generated a lot of interest. My original goal was to help other people who have tried to cache QuerySet objects and received a funky error message. Many of my visitors offered helpful advice in the comments, making it clear that I was going about caching my feeds the wrong way.

I knew my solution was wrong before I even produced it, but I couldn't get Django's site-wide caching middleware to work in my production environment. Site-wide caching worked wonderfully in my development environment, and I tried all sorts of things to make it work in my production setup. It wasn't until one "Jacob" offered a beautiful pearl of wisdom that things started to make more sense:

This doesn't pertain to feeds, but one rather large gotcha with the cache middleware is that any javascript you are running that plants a cookie will affect the cache key. Google analytics, for instance, has that effect. A workaround is to use a middleware to strip out the offending cookies from the request object before the cache middleware looks at it.

The minute I read that comment, I realized just how logical it was! If Google Analytics, or any other JavaScript used on my site, was setting a cookie, and it changed that cookie on each request, then the caching engine would effectively have a different page to cache for each request! Thank you so much, Jacob, for helping me get past the frustration of not having site-wide caching in my production environment.

How To Setup Site-Wide Caching

While most of this can be gleaned from the official documentation, I will repeat it here in an effort to provide a complete "HOWTO". For further information, hit up the official caching documentation.

The first step is to choose a caching backend for your project. Built-in options include:

  • memcached
  • database caching
  • filesystem caching
  • local-memory caching
  • dummy caching (for development)

To specify which backend you want to use, define the CACHE_BACKEND variable in your settings.py. The definition for each backend is different, so check out the official documentation for details.
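
For example, pointing the cache at a local memcached instance would look something like this in settings.py:

CACHE_BACKEND = 'memcached://127.0.0.1:11211/'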

Next, install a couple of middleware classes, and pay attention to where the classes are supposed to appear in the list:

  • django.middleware.cache.UpdateCacheMiddleware - This should be the first middleware class in your MIDDLEWARE_CLASSES tuple in your settings.py.
  • django.middleware.cache.FetchFromCacheMiddleware - This should be the last middleware class in your MIDDLEWARE_CLASSES tuple in your settings.py.

Finally, you must define the following variables in your settings.py file:

  • CACHE_MIDDLEWARE_SECONDS - The number of seconds each page should be cached
  • CACHE_MIDDLEWARE_KEY_PREFIX - If the cache is shared across multiple sites using the same Django installation, set this to the name of the site, or some other string that is unique to this Django instance, to prevent key collisions. Use an empty string if you don't care.
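
Putting those pieces together, the relevant chunk of settings.py might look something like this (the middle of the middleware tuple is just the usual suspects--yours will differ):

MIDDLEWARE_CLASSES = (
    'django.middleware.cache.UpdateCacheMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.middleware.cache.FetchFromCacheMiddleware',
)

CACHE_MIDDLEWARE_SECONDS = 60 * 15      # cache each page for 15 minutes
CACHE_MIDDLEWARE_KEY_PREFIX = 'mysite'  # anything unique to this instance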

If you don't use anything like Google Analytics that sets/changes cookies on each request to your site, you should have site-wide caching enabled now. If you only want pages to be cached for users who are not logged in, you may add CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True to your settings.py file--its meaning should be fairly obvious.

If, however, your site-wide caching doesn't appear to work (as it didn't for me for a long time), you can create a special middleware class to strip those dirty cookies from the request, so the caching middleware can do its work.

import re

class StripCookieMiddleware(object):
    """Ganked from http://2ze.us/Io"""

    # matches cookies whose names start with an underscore, such as
    # Google Analytics' __utma and friends
    STRIP_RE = re.compile(r'\b(_[^=]+=.+?(?:; |$))')

    def process_request(self, request):
        # strip the offending cookies before the cache middleware
        # computes its cache key from the request
        cookie = self.STRIP_RE.sub('', request.META.get('HTTP_COOKIE', ''))
        request.META['HTTP_COOKIE'] = cookie

Edit: Thanks to Tal for the regex suggestion!

Once you do that, you need only install the new middleware class. Be sure to install it somewhere between the UpdateCacheMiddleware and FetchFromCacheMiddleware classes, not first or last in the tuple. When all of that is done, your site-wide caching should really work! That is, of course, unless your offending cookies aren't matched by the STRIP_RE regular expression.
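
To be explicit about that ordering: assuming the class lives somewhere like myproject/middleware.py (a hypothetical path), the tuple ends up looking like this:

MIDDLEWARE_CLASSES = (
    'django.middleware.cache.UpdateCacheMiddleware',
    'myproject.middleware.StripCookieMiddleware',
    # ... the rest of your middleware ...
    'django.middleware.cache.FetchFromCacheMiddleware',
)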

Thanks again to Jacob and "nf", the original author of the middleware class I used to solve all of my problems! Also, I'd like to thank "JaredKuolt" for the django-staticgenerator on his github account. It made me happy for a while as I was working toward real site-wide caching.