Name

rfc822 — RFC 822 parsing library

Synopsis

#include <rfc822.h>

#include <rfc2047.h>

g++ ... -lrfc822

Description

The rfc822 library provides C++ classes for parsing E-mail headers in the RFC 822 format. This library also includes some functions to help with encoding and decoding 8-bit text, as defined by RFC 2047.

The format used by E-mail headers to encode sender and recipient information is defined by RFC 822 (and its successor, RFC 2822). The format allows the actual E-mail address and the sender/recipient name to be expressed together, for example: John Smith <jsmith@example.com>

The main purposes of the rfc822 library are to:

  1. Parse a text string containing a list of RFC 822-formatted addresses into its logical components: names and E-mail addresses.

  2. Access those individual components.

  3. Allow some limited modifications of the parsed structure, and then convert it back into a text string.

Tokenizing an E-mail header

std::string_view header;

rfc822::tokens tokens{header};

for (rfc822::token &t:tokens)
    ;

rfc822::tokens is a container of tokenized parts of E-mail addresses. It is constructed from a std::string_view that contains E-mail addresses.

Note

The underlying text string must not be destroyed as long as the rfc822::tokens object is in scope.

struct rfc822::token {
    int type;              // RFC 822 atom
    std::string_view str;  // underlying text
};

The type field contains one of the RFC 822 atoms, such as @ or ;. The str field contains the atom's text. It references a substring of the original string that was passed to the rfc822::tokens constructor. str is set only for '\0', '"', and '(' atoms; in all other cases, str is an empty string. Possible values of type:

'\0'

This is a simple atom - a sequence of non-special characters that is delimited by whitespace or special characters (see below).

'"'

This is a quoted string.

'('

This is an old-style comment. A deprecated form of E-mail addressing uses, for example, "john@example.com (John Smith)" instead of "John Smith <john@example.com>". This old-style notation defined parenthesized content as arbitrary comments. The rfc822::token with type set to '(' is created for the entire comment, including the parentheses.

Symbols: '<', '>', '@', and many others

The remaining possible values of type include all the characters in RFC 822 headers that have special significance.
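The token types described above can be illustrated with a small standalone tokenizer. This is only a sketch, not the library's implementation: sketch_token, is_special and sketch_tokenize are hypothetical names, the set of special characters is assumed to be the RFC 822 "specials", and storing a quoted string without its surrounding quotes is a choice made for this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for rfc822::token; not the library's definition.
struct sketch_token {
    int type;              // 0 = atom, '"' = quoted string, '(' = comment, else the special character
    std::string_view str;  // text for atoms, quoted strings and comments; empty otherwise
};

// Assumed set of special characters (RFC 822 "specials" without '"' and '(').
inline bool is_special(char c)
{
    return std::string_view{"<>@,;:\\.[])"}.find(c) != std::string_view::npos;
}

std::vector<sketch_token> sketch_tokenize(std::string_view s)
{
    std::vector<sketch_token> out;
    std::size_t i = 0;

    while (i < s.size())
    {
        char c = s[i];

        if (std::isspace(static_cast<unsigned char>(c)))
        {
            ++i;
            continue;
        }

        std::size_t start = i;

        if (c == '"')
        {
            // Quoted string: this sketch stores the text between the quotes.
            std::size_t begin = ++i;
            while (i < s.size() && s[i] != '"')
            {
                if (s[i] == '\\' && i + 1 < s.size())
                    ++i;  // skip the escaped character
                ++i;
            }
            out.push_back({'"', s.substr(begin, i - begin)});
            if (i < s.size())
                ++i;  // closing quote
        }
        else if (c == '(')
        {
            // Comment: one token for the whole comment, parentheses included.
            int depth = 0;
            while (i < s.size())
            {
                if (s[i] == '\\' && i + 1 < s.size())
                {
                    i += 2;
                    continue;
                }
                if (s[i] == '(')
                    ++depth;
                else if (s[i] == ')' && --depth == 0)
                {
                    ++i;
                    break;
                }
                ++i;
            }
            out.push_back({'(', s.substr(start, i - start)});
        }
        else if (is_special(c))
        {
            out.push_back({c, {}});  // special characters carry no text
            ++i;
        }
        else
        {
            // Simple atom: runs until whitespace or a special character.
            while (i < s.size() &&
                   !std::isspace(static_cast<unsigned char>(s[i])) &&
                   !is_special(s[i]) && s[i] != '"' && s[i] != '(')
                ++i;
            out.push_back({0, s.substr(start, i - start)});
        }
    }
    return out;
}
```

As with rfc822::tokens, every str in the result refers to the original text, so that text has to outlive the returned vector.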

Extracting E-mail addresses

rfc822::addresses addresses{tokens};

for (rfc822::address &a:addresses)
    ;

rfc822::addresses is a container of E-mail addresses that were parsed from a rfc822::tokens object.

struct rfc822::address {
    rfc822::tokens name;     // Name portion of an address
    rfc822::tokens address;  // E-mail address
};

The rfc822::address class has two fields: name and address. name contains the name portion of an address, which is a sequence of tokens. address contains the E-mail address itself, which is also a sequence of tokens.

For example, the following is a valid E-mail header:

To: recipient-list: tom@example.com, john@example.com;

Typically, all of this, except for "To:", gets parsed by creating a rfc822::tokens object, then a rfc822::addresses object. The "recipient-list:" and the trailing semicolon are a legacy mailing list specification that is no longer in widespread use, but must still be accounted for.

The resulting rfc822::addresses object will have four rfc822::address structures: one for "recipient-list:"; one for each address; and one for the trailing semicolon.

If address in a rfc822::address is an empty container, then this structure represents some non-address portion of the original header, such as "recipient-list:" or a semicolon. Otherwise it contains a tokenized representation of the E-mail address.

name either contains the tokenized form of a non-address portion of the original header, or a tokenized form of the recipient's name. name will be an empty container if the recipient name was not provided.

For example, for the following address: Tom Jones <tjones@example.com> - the address field contains the tokenized form of "tjones@example.com", and name contains the tokenized form of "Tom Jones".
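The name/address split described above can be sketched with a simplified standalone parser that works on plain strings instead of token sequences. It is not how the library is implemented: sketch_address, add_item and sketch_parse are hypothetical names, and comments, escapes and other RFC 822 details are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for rfc822::address, holding strings rather than tokens.
struct sketch_address {
    std::string name;     // recipient name, or a non-address fragment
    std::string address;  // empty for non-address fragments
};

static std::string_view trim(std::string_view s)
{
    const char *ws = " \t\r\n";
    auto b = s.find_first_not_of(ws);
    if (b == std::string_view::npos)
        return {};
    auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Turn one comma-delimited item into a name/address pair.
static void add_item(std::vector<sketch_address> &out, std::string_view item)
{
    item = trim(item);
    if (item.empty())
        return;

    auto lt = item.find('<');
    auto gt = item.rfind('>');

    if (lt != std::string_view::npos && gt != std::string_view::npos && lt < gt)
        out.push_back({std::string{trim(item.substr(0, lt))},
                       std::string{trim(item.substr(lt + 1, gt - lt - 1))}});
    else
        out.push_back({"", std::string{item}});  // bare address, no name
}

std::vector<sketch_address> sketch_parse(std::string_view header)
{
    std::vector<sketch_address> out;
    std::size_t start = 0;
    bool in_quote = false, in_angle = false;

    for (std::size_t i = 0; i < header.size(); ++i)
    {
        char c = header[i];

        if (c == '"')
            in_quote = !in_quote;
        else if (!in_quote && (c == '<' || c == '>'))
            in_angle = (c == '<');

        if (in_quote || in_angle)
            continue;  // delimiters only count at the top level

        if (c == ',')
        {
            add_item(out, header.substr(start, i - start));
            start = i + 1;
        }
        else if (c == ':')
        {
            // Legacy list name: kept as a non-address entry.
            out.push_back({std::string{trim(header.substr(start, i - start + 1))}, ""});
            start = i + 1;
        }
        else if (c == ';')
        {
            add_item(out, header.substr(start, i - start));
            out.push_back({";", ""});  // end of the legacy list
            start = i + 1;
        }
    }
    add_item(out, header.substr(start));
    return out;
}
```

In this sketch, an entry with an empty address plays the same role as an rfc822::address whose address container is empty: it marks a non-address portion of the header.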

Working with 8-bit MIME-encoded headers

const auto &[string, error] = rfc2047::encode(U"header", "utf-8",
    rfc2047::qp_allow_any);

The rfc2047::encode() function template and the rfc2047::decode() function object provide additional logic to encode or decode 8-bit content in 7-bit RFC 822 headers, as specified in RFC 2047.

rfc2047::encode()'s first parameter is a std::string in the character set specified by the second parameter. The third parameter is a function that returns true if the character should be encoded. The following functions are predefined:

rfc2047::qp_allow_any

All characters are allowed to be unencoded, except a small number of characters that have special meaning in RFC 2047: control characters, eight-bit characters, and several characters that would break the tokenization of the header.

rfc2047::qp_allow_comments

Parentheses and quotes are also allowed to be unencoded.

rfc2047::qp_allow_word

Allow only characters used in base64 encoded MIME entities, and a few other characters.

Instead of a single string of text, an overloaded rfc2047::encode() function template accepts a beginning and an ending iterator for a sequence of characters to be encoded.
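To make the encoding step concrete, the following standalone sketch produces a single RFC 2047 "Q" encoded word from text that is already in the named character set. It does not use the library: sketch_q_encode and sketch_must_encode are hypothetical names, the predicate is only an assumed approximation of the kind of filter described above, and the sketch does not split long input into multiple encoded words as RFC 2047 requires.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical predicate: true for bytes that must be written as =XX.
bool sketch_must_encode(unsigned char c)
{
    return c < 0x20 || c >= 0x7F || c == '=' || c == '?' || c == '_';
}

// Encode text (already in `charset`) as one "Q" encoded word.
std::string sketch_q_encode(
    std::string_view text, std::string_view charset,
    const std::function<bool(unsigned char)> &must_encode = sketch_must_encode)
{
    static const char hex[] = "0123456789ABCDEF";

    std::string out = "=?";
    out += charset;
    out += "?Q?";

    for (unsigned char c : text)
    {
        if (c == ' ')
            out += '_';  // RFC 2047 shorthand for a space
        else if (must_encode(c))
        {
            out += '=';
            out += hex[c >> 4];
            out += hex[c & 15];
        }
        else
            out += static_cast<char>(c);
    }

    out += "?=";
    return out;
}
```

Passing a different predicate as the third argument changes which characters are escaped, which is the same idea as choosing among the predefined functions listed above.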

rfc2047::decode() parses a string in RFC 2047 format. It is a somewhat complicated template that implements a callback-based parser. Consult the inline comments for a more detailed explanation of how to use it. rfc2047::decode_unicode() does the same, but it decodes to a Unicode string and ignores the character set and language of the encoded word (the character set affects the conversion to a Unicode character stream, and the language is immaterial).
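For orientation, the structure of an encoded word (character set, optional language, encoding, payload) can be sketched with a standalone decoder for the "Q" encoding. This is not the library's callback-based parser: sketch_decoded and sketch_q_decode are hypothetical names, only a single "Q" encoded word is handled, and the decoded bytes are left in the word's character set rather than converted to Unicode.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type for this sketch.
struct sketch_decoded {
    std::string charset;  // character set named in the encoded word
    std::string bytes;    // decoded content, still in that character set
};

static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decode one "Q" encoded word; returns nothing if the word is malformed.
std::optional<sketch_decoded> sketch_q_decode(std::string_view word)
{
    if (word.size() < 8 || word.substr(0, 2) != "=?" ||
        word.substr(word.size() - 2) != "?=")
        return std::nullopt;

    word = word.substr(2, word.size() - 4);  // charset, encoding, payload

    auto q = word.find('?');
    if (q == std::string_view::npos || q + 2 >= word.size() ||
        (word[q + 1] != 'Q' && word[q + 1] != 'q') || word[q + 2] != '?')
        return std::nullopt;

    sketch_decoded result;

    // An optional "*language" suffix follows the character set; drop it.
    std::string_view charset = word.substr(0, q);
    result.charset = std::string{charset.substr(0, charset.find('*'))};

    std::string_view payload = word.substr(q + 3);
    for (std::size_t i = 0; i < payload.size(); ++i)
    {
        char c = payload[i];

        if (c == '_')
            result.bytes += ' ';
        else if (c == '=')
        {
            if (i + 2 >= payload.size())
                return std::nullopt;  // truncated escape
            int hi = hex_value(payload[i + 1]);
            int lo = hex_value(payload[i + 2]);
            if (hi < 0 || lo < 0)
                return std::nullopt;  // not a hexadecimal escape
            result.bytes += static_cast<char>(hi * 16 + lo);
            i += 2;
        }
        else
            result.bytes += c;
    }
    return result;
}
```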

std::u32string ustr;

rfc822::tokens name, address;

address.unicode_address(std::back_inserter(ustr));
name.unicode_name(std::back_inserter(ustr), false);

std::string str;

address.display_address(unicode_default_chset(),
			std::back_inserter(str));
name.display_name(unicode_default_chset(),
		  std::back_inserter(str), false);

display_header_unicode("To:", "nobody@example.com",
    std::back_inserter(ustr),
    []
    {
    }
);

display_header("To", "nobody@example.com",
    unicode_default_chset(),
    std::back_inserter(str),
    []
    {
    }
);

std::vector<std::u32string> ulines;

rfc2047::wrap_header_unicode("Subject", "Hello world", 80,
    std::back_inserter(ulines)
);

std::vector<std::string> lines;

rfc2047::wrap_header("Subject:", "Hello world", 80,
    unicode_default_chset(),
    std::back_inserter(lines)
);

rfc822::address &address;

address.encode(unicode_default_chset(), std::back_inserter(str));

The rfc2047 namespace contains several functions that handle various kinds of encoding and decoding of 8-bit content in 7-bit RFC 822 headers. These functions implement the specification in RFC 2047, and related standards. These functions write their output to an output iterator that gets passed as one of the parameters. If an output iterator gets passed by value, the function returns the value of the output iterator after it has been advanced for each character written to it. If the output iterator is passed by reference, the function returns void, and the output iterator itself is modified.
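The two output iterator conventions can be illustrated with a pair of standalone helpers. They are hypothetical (sketch_write_upper and sketch_write_upper_ref are not part of the library) and only demonstrate the calling pattern: the by-value form returns the advanced iterator, the by-reference form returns void and updates the caller's iterator.

```cpp
#include <cassert>
#include <cctype>
#include <iterator>
#include <string>
#include <string_view>

// By value: write to `out` and return the iterator past the last write.
template<typename OutIter>
OutIter sketch_write_upper(std::string_view text, OutIter out)
{
    for (unsigned char c : text)
        *out++ = static_cast<char>(std::toupper(c));
    return out;
}

// By reference: nothing is returned, the caller's iterator is advanced.
template<typename OutIter>
void sketch_write_upper_ref(std::string_view text, OutIter &out)
{
    out = sketch_write_upper(text, out);
}
```

With the by-value form, calls are chained by feeding each returned iterator into the next call, for example: auto it = sketch_write_upper("to: ", std::back_inserter(s)); it = sketch_write_upper("nobody", it);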

The following functions are available:

rfc822::tokens::unicode_address, rfc822::tokens::unicode_name

Methods of the rfc822::tokens class, used to convert the parsed contents of an RFC 822 address or name into a sequence of Unicode characters. unicode_name() uses RFC 2047 to decode any RFC 2047-encoded words. unicode_address() uses IDN encoding to convert any IDN-encoded domain names into Unicode.

rfc822::tokens::display_address, rfc822::tokens::display_name

Convert an rfc822::tokens sequence containing either an IDN-encoded address or an RFC 2047-encoded name into a sequence of characters in the specified character set.

display_name()'s third parameter is a flag. A true value strips off quotes or parentheses from the display name.

display_header_unicode

This function takes the name of a header and its contents, and converts the contents into a sequence of Unicode characters. The passed in header name determines how the header gets parsed. Headers containing addresses are handled by parsing them as addresses, then converting the result into a sequence of Unicode characters using unicode_name() and unicode_address(). Other headers are parsed as unstructured headers, using RFC 2047 to decode any RFC 2047-encoded words.

The fourth parameter is an optional callback that gets invoked at every line-breaking opportunity. The callback gets invoked after writing, to the output iterator, the sequence of characters that end in a line-breaking opportunity, and before writing the first character after the potential line-break.

rfc2047::display_header

Take an arbitrary header, and convert it into a sequence of characters in the specified character set. This is, basically, the same as display_header_unicode() but the returned string is in the specified character set.

rfc2047::wrap_header_unicode

This uses display_header_unicode, but then it wraps the resulting sequence of characters into a sequence of lines, using the passed in maximum line width. The passed in output iterator iterates over a sequence of Unicode strings.

rfc2047::wrap_header

This uses wrap_header_unicode, but then it converts the resulting sequence of Unicode characters into a sequence of 8-bit strings, using the passed in character set. The passed in output iterator iterates over a sequence of 8-bit strings.
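The wrapping step can be pictured with a minimal greedy sketch that writes whole lines to an output iterator. It is an illustration under simplifying assumptions, not the library's algorithm: sketch_wrap is a hypothetical name, a space is treated as the only line-breaking opportunity, widths are counted in bytes, and a single word longer than the width is emitted on a line of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Greedy wrap: each value written to `out` is one complete line.
template<typename OutIter>
OutIter sketch_wrap(std::string_view text, std::size_t width, OutIter out)
{
    std::string line;
    std::size_t i = 0;

    while (i < text.size())
    {
        // Find the next word; a space is the only break opportunity here.
        std::size_t end = text.find(' ', i);
        if (end == std::string_view::npos)
            end = text.size();

        std::string_view word = text.substr(i, end - i);
        i = end + 1;

        if (word.empty())
            continue;  // collapse runs of spaces

        // Start a new line when the word does not fit on the current one.
        if (!line.empty() && line.size() + 1 + word.size() > width)
        {
            *out++ = line;
            line.clear();
        }
        if (!line.empty())
            line += ' ';
        line += word;
    }

    if (!line.empty())
        *out++ = line;
    return out;
}
```

As in the synopsis, the output iterator can be a std::back_inserter over a std::vector<std::string>.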

rfc2047::encode

Encode a sequence of 8-bit strings into a sequence of RFC 2047-encoded words. The passed in output iterator iterates over a sequence of RFC 2047-encoded words.

rfc822::address::encode

This method takes the name and address portion of a rfc822::address, that was encoded in the given character set, and encodes them using RFC 2047 and IDN, as appropriate. The output is written to the passed in output iterator.

Working with subjects

const auto &[str, flags]=rfc822::coresubj("Re: your message");

const auto &[str, flags]=rfc822::coresubj_nouc("Re: your message");

const auto &[str, flags]=rfc822::coresubj_keepblobs("Re: your message");

These functions take the contents of the subject header, and return the "core" subject header that's used in the specification of the IMAP THREAD function. These functions are designed to strip all subject line artifacts that might've been added in the process of forwarding or replying to a message. These functions return a tuple of a string and an int flag value. Currently, rfc822::coresubj() performs the following transformations:

Whitespace

Leading and trailing whitespace is removed. Consecutive whitespace characters are collapsed into a single whitespace character.

Re:, (fwd) [foo]

These artifacts (and several others) are removed from the subject line.

rfc822::coresubj

This is the original version of this function. It is preserved for binary compatibility with existing programs.

rfc822::coresubj_nouc

The returned string does not get converted to uppercase.

rfc822::coresubj_keepblobs

This is like rfc822::coresubj_nouc(), except that it does not remove [blob] markers from the returned subject line.

Note that these functions do NOT do MIME decoding. In order to implement IMAP THREAD, it is necessary to call something like rfc2047::decode() before calling rfc822::coresubj().

The returned flag value is a bitmask:

CORESUBJ_RE

This indicates that the original subject line starts with Re: .

CORESUBJ_FWD

This indicates that the original subject line contained a (fwd) marker.
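A much-simplified standalone sketch of this kind of subject normalization is shown below. It is not the library's algorithm: sketch_coresubj, SKETCH_RE and SKETCH_FWD are hypothetical names with assumed flag values, and only whitespace collapsing, uppercase conversion, leading "Re:" and trailing "(fwd)" are handled ([blob] markers and other artifacts are left alone).

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical flag values; the library's CORESUBJ_RE and CORESUBJ_FWD
// constants may be defined differently.
constexpr int SKETCH_RE = 1;
constexpr int SKETCH_FWD = 2;

std::pair<std::string, int> sketch_coresubj(std::string_view subject)
{
    // Collapse whitespace runs to one space and convert to uppercase.
    std::string s;
    for (unsigned char c : subject)
    {
        if (std::isspace(c))
        {
            if (!s.empty() && s.back() != ' ')
                s += ' ';
        }
        else
            s += static_cast<char>(std::toupper(c));
    }
    if (!s.empty() && s.back() == ' ')
        s.pop_back();

    const std::string_view re = "RE:", fwd = "(FWD)";
    int flags = 0;
    bool changed = true;

    // Keep stripping until neither artifact is found.
    while (changed)
    {
        changed = false;

        if (std::string_view{s}.substr(0, re.size()) == re)
        {
            s.erase(0, re.size());
            flags |= SKETCH_RE;
            changed = true;
        }
        if (s.size() >= fwd.size() &&
            std::string_view{s}.substr(s.size() - fwd.size()) == fwd)
        {
            s.erase(s.size() - fwd.size());
            flags |= SKETCH_FWD;
            changed = true;
        }

        // Trim spaces exposed by the removals above.
        while (!s.empty() && s.front() == ' ')
            s.erase(0, 1);
        while (!s.empty() && s.back() == ' ')
            s.pop_back();
    }
    return {s, flags};
}
```

The result can be unpacked the same way as in the synopsis, for example: const auto &[str, flags] = sketch_coresubj("Re: your message");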