Converting Mailman Archives to .mbox
Mailman is perhaps one of the most popular mailing list managers in the world, yet searching and working with its archives is painful. Internally, Mailman stores every message a list receives in a .mbox file, but for most lists that file is not exposed. Users must instead browse the archives one email at a time as HTML, or download the gzipped text archive for each month, which is stored in an odd, undocumented format.
For a recent project, I wanted to make a Mailman list’s archive searchable, since it held a great knowledge base. While many lists are public, this one contained potentially private, sensitive information about my university, so I could not simply let Google index and search it for me.
After a weekend of reverse engineering the several email formats that Mailman seems to store archives in, we now have a working, easy-to-use Mailman archive parser, creatively named mailman-downloader. The script downloads and decodes the archives, finds and restores the MIME types that Mailman scrubs, inserts MIME boundaries, and truncates those boundaries where they overflow. Authors’ email addresses are rebuilt from the obfuscated “foo at bar dot com” form.
The heart of the decoder:
[Code listing, 28 lines: not preserved in this copy.]
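To make the address-rebuilding step concrete, here is a minimal Python sketch. It is not the code from mailman-downloader: the function names (deobfuscate, deobfuscate_headers, write_mbox) and the regular expression are assumptions chosen for illustration, and it only covers the “foo at bar dot com” rewrite, not the MIME repair. It touches only the “From ” and “From:” header lines so that an ordinary “at” in the message body is left alone, then writes the result with Python's standard mailbox module.

```python
import mailbox
import re

# Assumed obfuscation shapes: "user at example dot com" or "user at example.com".
_OBFUSCATED = re.compile(
    r"(?P<user>[\w.+-]+) at (?P<domain>[\w-]+(?:(?: dot |\.)[\w-]+)+)"
)


def deobfuscate(text):
    """Rewrite 'foo at bar dot com' (or 'foo at bar.com') as 'foo@bar.com'."""
    def _fix(match):
        domain = match.group("domain").replace(" dot ", ".")
        return f"{match.group('user')}@{domain}"
    return _OBFUSCATED.sub(_fix, text)


def deobfuscate_headers(raw_message):
    """Fix only the 'From ' separator and 'From:' header; leave the body as-is."""
    head, sep, body = raw_message.partition("\n\n")
    fixed = [
        deobfuscate(line) if line.startswith(("From ", "From:")) else line
        for line in head.split("\n")
    ]
    return "\n".join(fixed) + sep + body


def write_mbox(raw_messages, path):
    """Append each cleaned message (a str) to an .mbox file at `path`."""
    box = mailbox.mbox(path)
    try:
        for raw in raw_messages:
            # Encode ourselves so non-ASCII message text is accepted.
            box.add(deobfuscate_headers(raw).encode("utf-8"))
        box.flush()
    finally:
        box.close()
```

Restricting the rewrite to header lines is a deliberate choice in this sketch: a body sentence such as “meet me at noon” would otherwise risk being turned into an address.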
The repository also includes a simple script that uploads the downloaded archives to Gmail, fully automating the creation of a Gmail-based full-text archive search. More to come on that soon.
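For a rough idea of what such an upload can look like, here is a hedged Python sketch using the standard imaplib and mailbox modules. It is an assumption about one possible approach, not the script shipped in the repository: the names connect_gmail and upload_mbox are hypothetical, and how you authenticate (for example, with an app-specific password) depends on your account settings.

```python
import imaplib
import mailbox


def connect_gmail(user, password):
    """Open an IMAP-over-SSL session to Gmail (credentials supplied by caller)."""
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    conn.login(user, password)
    return conn


def upload_mbox(conn, mbox_path, folder):
    """APPEND every message in an .mbox file to `folder`; return how many were sent.

    `conn` is any object with an imaplib-style append(mailbox, flags,
    date_time, message) method, so a stub can stand in for a real server.
    """
    box = mailbox.mbox(mbox_path, create=False)
    uploaded = 0
    try:
        for message in box:
            # No flags and no explicit date: the server assigns its own.
            status, _ = conn.append(folder, None, None, message.as_bytes())
            if status != "OK":
                raise RuntimeError(f"APPEND failed for message {uploaded}")
            uploaded += 1
    finally:
        box.close()
    return uploaded


# Example use (hypothetical credentials, path, and label):
#   conn = connect_gmail("me@example.com", "app-password")
#   upload_mbox(conn, "list-archive.mbox", "ListArchive")
#   conn.logout()
```

Because upload_mbox takes the connection as a parameter, it can be exercised against a fake object before pointing it at a real mailbox.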
Any questions or comments? Please file an issue and/or submit a pull request.
