Vim Tip: Global Delete

Today I was asked to help debug a problem with our product's patcher. All of the debug information for the entire product goes into a single log file, and some processes are quite chatty. The log file that contained the information I was interested in for the patcher problems was some 26.5MB by the time I got it.

All of the lines I was interested in were very easy to find, because they contained specific strings (yay). The problem was that they were scattered throughout the log, in between debug output for other processes. At first, I tried to just delete lines that were meaningless for me, but that got old very quickly. This is how I made my life easier using Vim.

It's possible to do a "global delete" on lines that don't contain the stuff you are interested in. The lines I wanted to see contained one of two words, but I'll just use foo and bar for this example:

:g!/\v(foo|bar)/d

This command will look for any line that does not contain foo or bar and delete it. Here's the breakdown:

  • :g - The "global" command, which runs another command on every line that matches a pattern
  • ! - Negate the match (perform the pending command on any line that does not contain the pattern)
  • /\v(foo|bar)/ - The regular expression pattern
    • \v - Use of \v means that in the pattern after it all ASCII characters except '0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning (very magic). Basically, it removes the need to escape almost everything in your regex.
    • (foo|bar) - Find either foo or (|) bar
  • d - The command to perform on matching lines, which is delete in this case

So, executing that command in the Vim window with the log file wiped out all of the lines that didn't have my magical keywords in them.
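
If you'd rather script the same filtering outside of Vim, a rough Python equivalent might look like the following. This is only a sketch of the idea, not part of the original workflow: the function name and the sample log lines are made up for illustration.

```python
import re

# Same alternation as the Vim pattern /\v(foo|bar)/, in Python regex syntax.
KEEP = re.compile(r"(foo|bar)")


def keep_matching_lines(lines):
    """Return only the lines that contain foo or bar (hypothetical helper)."""
    return [line for line in lines if KEEP.search(line)]


# Invented sample data standing in for a chatty log file.
sample = [
    "patcher: foo started",
    "updater: heartbeat",
    "patcher: bar failed",
    "network: retrying",
]

filtered = keep_matching_lines(sample)
print(filtered)
```

Like :g!/.../d, this drops every line that does not contain one of the keywords and keeps the rest in their original order.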

When I showed my co-worker how awesome Vim was, he was mildly impressed, and then he asked, "What about multiline log messages?" My particular case didn't have any multiline messages, but I wanted to figure it out anyway. I haven't been able to figure out an exact method for deleting the lines that don't match, but I have found a way to show only the lines that match:

:g!/\v^".+(foo|bar)\_.{-}^"/p

This command is pretty close to the previous one.

  • :g - Global command on lines that match a pattern
  • ! - Negate the match (seems a little backward this time)
  • /\v^".+(foo|bar)\_.{-}^"/ - The regular expression pattern
    • \v - Very magic
    • ^" - Find a line that starts with a double quote ("). Each of our individual log messages starts with a double quote that is guaranteed to be at the beginning of the line, so this is specific to our environment.
    • .+ - One or more characters between the " and foo or bar
    • (foo|bar) - Find either foo or (|) bar
    • \_.{-}^" - Non-greedy multiline match. Matches any character, including newlines (because of the \_), and continues matching until it reaches the next line that begins with a double quote (^"). Again, that double quote is specific to our environment. The {-} is what makes this a "non-greedy" match--it's like using *, but it matches as few as possible of the preceding atom.
  • p - The command to perform on matching lines, which is print in this case. This brings up a separate little window that displays each match (which is why I mentioned the negation seemed a bit backward to me). Navigation and whatnot in this window appears to be similar to less on the command line.
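
Since I never found a pure-Vim way to delete the non-matching multiline messages, here is a rough Python sketch of one way it could be done outside the editor. It assumes, as in our logs, that every message starts with a double quote at the beginning of a line, and, like the Vim pattern, it only looks for foo or bar on the first line of each message. The function names and sample data are hypothetical.

```python
import re

# Mirrors the start of the Vim pattern: ^".+(foo|bar)
FIRST_LINE = re.compile(r'".+(foo|bar)')


def split_messages(lines):
    """Group lines into messages; a new message starts at a line beginning with '"'."""
    messages = []
    for line in lines:
        if line.startswith('"') or not messages:
            messages.append([line])
        else:
            # Continuation line: attach it to the message currently being built.
            messages[-1].append(line)
    return messages


def keep_matching_messages(lines):
    """Keep every line of each message whose first line mentions foo or bar."""
    kept = []
    for message in split_messages(lines):
        if FIRST_LINE.match(message[0]):
            kept.extend(message)
    return kept


# Invented sample log with one multiline message per process.
log = [
    '"patcher: foo started',
    '  detail line 1',
    '"updater: heartbeat',
    '  more detail',
    '"patcher: bar failed',
]

kept = keep_matching_messages(log)
print("\n".join(kept))
```

Grouping the lines into messages first means the continuation lines travel with the message they belong to, which is the part that is awkward to express in a single :g command.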

And there you have it! I hope you find this information as useful as it has been for me!

Site-Wide Caching in Django

My last article about caching RSS feeds in a Django project generated a lot of interest. My original goal was to help other people who have tried to cache QuerySet objects and received a funky error message. Many of my visitors offered helpful advice in the comments, making it clear that I was going about caching my feeds the wrong way.

I knew my solution was wrong before I even produced it, but I couldn't get Django's site-wide caching middleware to work in my production environment. Site-wide caching worked wonderfully in my development environment, and I tried all sorts of things to make it work in my production setup. It wasn't until one "Jacob" offered a beautiful pearl of wisdom that things started to make more sense:

This doesn't pertain to feeds, but one rather large gotcha with the cache middleware is that any javascript you are running that plants a cookie will affect the cache key. Google analytics, for instance, has that effect. A workaround is to use a middleware to strip out the offending cookies from the request object before the cache middleware looks at it.

The minute I read that comment, I realized just how logical it was! If Google Analytics, or any other JavaScript used on my site, was setting a cookie, and it changed that cookie on each request, then the caching engine would effectively have a different page to cache for each request! Thank you so much, Jacob, for helping me get past the frustration of not having site-wide caching in my production environment.
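
To make the problem concrete, here is a toy illustration. It is not Django's real key function; it just assumes a cache key that is derived from the URL plus the Cookie header, which is roughly what happens when a response varies on cookies. The function name and cookie values are made up.

```python
import hashlib


def toy_cache_key(path, cookie_header):
    """Hypothetical cache key: a hash of the URL path plus the Cookie header.

    NOT Django's actual algorithm; only a stand-in for a key that varies on cookies.
    """
    raw = "%s|%s" % (path, cookie_header)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Same page, but the analytics cookies changed between two requests.
first = toy_cache_key("/blog/", "__utma=1.111; __utmz=1.aaa")
second = toy_cache_key("/blog/", "__utma=1.222; __utmz=1.bbb")

# With the offending cookies stripped, both requests produce the same key.
stripped_a = toy_cache_key("/blog/", "")
stripped_b = toy_cache_key("/blog/", "")

print(first == second)          # the keys differ, so the cache never gets a hit
print(stripped_a == stripped_b)  # the keys agree, so the cached page can be reused
```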

How to Set Up Site-Wide Caching

While most of this can be gleaned from the official documentation, I will repeat it here in an effort to provide a complete "HOWTO". For further information, hit up the official caching documentation.

The first step is to choose a caching backend for your project. Built-in options include Memcached, database, filesystem, local-memory, and dummy (development-only) caching.

To specify which backend you want to use, define the CACHE_BACKEND variable in your settings.py. The definition for each backend is different, so check out the official documentation for details.
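
As a rough sketch, the URI-style values below are what CACHE_BACKEND looked like in the Django releases that used this setting. The host, table name, and path are placeholders, and the exact syntax may differ in your version, so treat the official documentation as the authority.

```python
# settings.py (sketch): pick exactly one CACHE_BACKEND value.
# The host/port, table name, and directory below are placeholders.
CACHE_BACKEND = "memcached://127.0.0.1:11211/"    # Memcached
# CACHE_BACKEND = "db://my_cache_table"           # database table
# CACHE_BACKEND = "file:///var/tmp/django_cache"  # filesystem directory
# CACHE_BACKEND = "locmem://"                     # local memory, per process
# CACHE_BACKEND = "dummy://"                      # caches nothing (development)
```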

Next, install a couple of middleware classes, and pay attention to where the classes are supposed to appear in the list:

  • django.middleware.cache.UpdateCacheMiddleware - This should be the first middleware class in your MIDDLEWARE_CLASSES tuple in your settings.py.
  • django.middleware.cache.FetchFromCacheMiddleware - This should be the last middleware class in your MIDDLEWARE_CLASSES tuple in your settings.py.
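
In settings.py, that ordering might look something like this. The middleware classes in the middle are just common examples; your own list will likely differ.

```python
# settings.py (sketch): the cache middleware classes wrap everything else.
MIDDLEWARE_CLASSES = (
    "django.middleware.cache.UpdateCacheMiddleware",     # must come first
    "django.middleware.common.CommonMiddleware",
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    "django.middleware.cache.FetchFromCacheMiddleware",  # must come last
)
```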

Finally, you must define the following variables in your settings.py file:

  • CACHE_MIDDLEWARE_SECONDS - The number of seconds each page should be cached
  • CACHE_MIDDLEWARE_KEY_PREFIX - If the cache is shared across multiple sites using the same Django installation, set this to the name of the site, or some other string that is unique to this Django instance, to prevent key collisions. Use an empty string if you don't care.

If you don't use anything like Google Analytics that sets/changes cookies on each request to your site, you should have site-wide caching enabled now. If you only want pages to be cached for users who are not logged in, you may add CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True to your settings.py file--its meaning should be fairly obvious.
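
Put together, the settings described above might look like this in settings.py. The values are only examples (a ten-minute timeout and a made-up prefix), so adjust them for your own site.

```python
# settings.py (sketch): example values only.
CACHE_MIDDLEWARE_SECONDS = 600           # cache each page for ten minutes
CACHE_MIDDLEWARE_KEY_PREFIX = "mysite"   # hypothetical prefix; "" if not sharing a cache
CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True   # optional: only cache for logged-out users
```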

If, however, your site-wide caching doesn't appear to work (as it didn't for me for a long time), you can create a special middleware class to strip those dirty cookies from the request, so the caching middleware can do its work.

import re

class StripCookieMiddleware(object):
    """Ganked from http://2ze.us/Io"""

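    # Matches any cookie whose name starts with an underscore (for example,
    # Google Analytics' __utma), along with its value and trailing "; ".
    STRIP_RE = re.compile(r'\b(_[^=]+=.+?(?:; |$))')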

    def process_request(self, request):
        cookie = self.STRIP_RE.sub('', request.META.get('HTTP_COOKIE', ''))
        request.META['HTTP_COOKIE'] = cookie

Edit: Thanks to Tal for the regex suggestion!
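
If you want to see what the middleware does without spinning up Django, here is a small standalone sketch. The FakeRequest class is a hypothetical stand-in for Django's request object (only the META dictionary is modelled), and the cookie values are invented.

```python
import re


class StripCookieMiddleware(object):
    # Same regex as above: cookies whose names start with an underscore.
    STRIP_RE = re.compile(r'\b(_[^=]+=.+?(?:; |$))')

    def process_request(self, request):
        cookie = self.STRIP_RE.sub('', request.META.get('HTTP_COOKIE', ''))
        request.META['HTTP_COOKIE'] = cookie


class FakeRequest(object):
    """Hypothetical stand-in for Django's HttpRequest; only META is modelled."""

    def __init__(self, cookie_header):
        self.META = {'HTTP_COOKIE': cookie_header}


request = FakeRequest('__utma=1.2.3; sessionid=abc123; __utmz=4.5.6')
StripCookieMiddleware().process_request(request)
cleaned = request.META['HTTP_COOKIE']
print(repr(cleaned))
```

The underscore-prefixed analytics cookies are removed while sessionid survives. Note that the regex only targets names that begin with an underscore, so a cookie with an underscore in the middle of its name would be left alone.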

Once you do that, you need only install the new middleware class. Be sure to install it somewhere between the UpdateCacheMiddleware and FetchFromCacheMiddleware classes, not first or last in the tuple. When all of that is done, your site-wide caching should really work! That is, of course, unless your offending cookies are not found by that STRIP_RE regular expression.
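
For example, assuming the class lives in a module called myproject/middleware.py (a made-up path; use wherever you actually saved it), the tuple might look like this:

```python
# settings.py (sketch): "myproject.middleware" is a hypothetical module path.
MIDDLEWARE_CLASSES = (
    "django.middleware.cache.UpdateCacheMiddleware",     # still first
    "myproject.middleware.StripCookieMiddleware",        # strips cookies before the cache lookup
    "django.middleware.common.CommonMiddleware",
    "django.middleware.cache.FetchFromCacheMiddleware",  # still last
)
```

Request middleware runs from the top of the tuple down, so any position above FetchFromCacheMiddleware lets the cookies be stripped before the cache is consulted.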

Thanks again to Jacob and "nf", the original author of the middleware class I used to solve all of my problems! Also, I'd like to thank "JaredKuolt" for the django-staticgenerator on his github account. It made me happy for a while as I was working toward real site-wide caching.