ohcount
Parser Documentation
Author
Mitchell Foral

Overview

I will assume the reader has a decent knowledge of how Ragel works and the Ragel syntax. If not, please review the Ragel manual found at: http://research.cs.queensu.ca/~thurston/ragel/

All parsers must at least:

Additionally a parser can call the callback function for each position of entities parsed.

Take a look at 'c.rl' and even keep it open for reference when reading this document to better understand how parsers work and how to write one.

Writing a Parser

First create your parser in 'src/parsers/'. Its name should be the language you are parsing with a '.rl' extension. You will not have to manually compile any parsers, as this is done automatically for you. However, you do need to add your parser to 'hash/parsers.gperf'.

Every parser must have the following at the top:

/************************* Required for every parser *************************/
#ifndef OHCOUNT_C_PARSER_H
#define OHCOUNT_C_PARSER_H
#include "../parser_macros.h"
// the name of the language
const char *C_LANG = LANG_C;
// the language's entities
const char *c_entities[] = {
"space", "comment", "string", "number", "preproc",
"keyword", "identifier", "operator", "any"
};
// constants associated with the entities
enum {
C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};
/*****************************************************************************/

And the following at the bottom:

/************************* Required for every parser *************************/
/* Parses a string buffer with C/C++ code.
*
* @param *buffer The string to parse.
* @param length The length of the string to parse.
* @param count Integer flag specifying whether or not to count lines. If yes,
* uses the Ragel machine optimized for counting. Otherwise uses the Ragel
* machine optimized for returning entity positions.
* @param *callback Callback function. If count is set, callback is called for
* every line of code, comment, or blank with 'lcode', 'lcomment', and
* 'lblank' respectively. Otherwise callback is called for each entity found.
*/
void parse_c(char *buffer, int length, int count,
void (*callback) (const char *lang, const char *entity, int s,
int e, void *udata),
void *userdata
) {
%% write init;
cs = (count) ? c_en_c_line : c_en_c_entity;
%% write exec;
// if no newline at EOF; callback contents of last line
if (count) { process_last_line(C_LANG) }
}
#endif
/*****************************************************************************/

(Your parser will go between these two blocks.)
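To make the callback parameter of the parse function more concrete, here is a minimal sketch of a callback for counting mode. The callback signature and the 'lcode', 'lcomment', and 'lblank' names come from the block above; the LineTally struct and the tally_lines function are hypothetical names made up for this illustration and are not part of ohcount.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tally structure for this sketch; not part of ohcount. */
typedef struct {
  int code;
  int comment;
  int blank;
} LineTally;

/* Callback matching the signature expected by parse_c(). In counting mode
 * the 'entity' argument is expected to be "lcode", "lcomment", or "lblank".
 * The s and e arguments are assumed to be the start and end positions of
 * the line in the buffer; they are not needed for a simple tally. */
void tally_lines(const char *lang, const char *entity, int s, int e,
                 void *udata) {
  LineTally *tally = (LineTally *)udata;
  assert(tally != NULL);
  (void)lang; (void)s; (void)e; /* unused in this sketch */
  if (strcmp(entity, "lcode") == 0)
    tally->code++;
  else if (strcmp(entity, "lcomment") == 0)
    tally->comment++;
  else if (strcmp(entity, "lblank") == 0)
    tally->blank++;
}
```

Under these assumptions, a caller would zero-initialize a LineTally and pass it along, for example parse_c(buffer, length, 1, tally_lines, &tally), then read the three counters once the parse returns.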

The code can be found in the existing 'c.rl' parser. You will need to change:

You may be asking why you have to rename variables and functions. If variables have the same name in header files (which is what parsers are), the compiler complains. Also, when you have languages embedded inside each other, any identifiers with the same name can easily be mixed up. It is also important to prefix your Ragel definitions with your language name to avoid conflicts with other parsers.

Additional variables available to parsers are in the parser_macros.h file. Take a look at it and try to understand what the variables are used for. They will make more sense later on.

Now you can define your Ragel parser. Name your machine after your language, "write data", and include 'common.rl', a file with common Ragel definitions, actions, etc. For example:

%%{
machine c;
write data;
include "common.rl";
...
}%%

Before you begin to write patterns for each entity in your language, you need to understand how the parser should work.

Each parser has two machines: one optimized for counting lines of code, comments, and blanks; the other for identifying entity positions in the buffer.

Line Counting Machine

This machine should be written as a line-by-line parser for multiple lines. This means you match any combination of entities except a newline up until you do reach a newline. If the line contains only spaces, or nothing at all, it is blank. If the line contains spaces at first, but then a comment, or just simply a comment, the line is a comment. If the line contains anything but a comment after spaces (if there are any), it is a line of code. You will do this using a Ragel scanner. The callback function will be called for each line parsed.
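The blank/comment/code rules above can be sketched in plain C for a single line. This is only an illustration of the decision rules, not how ohcount implements them (the real parsers use a Ragel scanner). It assumes, for simplicity, that the only comment form is a '//' comment and that no multi-line state carries over between lines. LineKind and classify_line are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LINE_BLANK, LINE_COMMENT, LINE_CODE } LineKind;

/* Classifies one line, given without its trailing newline.
 * Simplifying assumption: only "//" comments exist. */
LineKind classify_line(const char *line, size_t len) {
  size_t i = 0;
  assert(line != NULL || len == 0);
  /* Skip leading spaces and tabs. */
  while (i < len && (line[i] == ' ' || line[i] == '\t'))
    i++;
  /* Nothing but whitespace (or nothing at all): blank. */
  if (i == len)
    return LINE_BLANK;
  /* First non-space content starts a comment: comment line. */
  if (i + 1 < len && line[i] == '/' && line[i + 1] == '/')
    return LINE_COMMENT;
  /* Anything else after the leading spaces: code. */
  return LINE_CODE;
}
```

Note that a line such as "int x; // note" is classified as code, since something other than a comment follows the leading spaces.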

Scanner Parser Structure

A scanner parser will look like this:

[lang]_line := |*
entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
...
entityn ${ entity = ENTITYN; } => [lang]_ccallback;
*|;

(As usual, replace [lang] with your language name.)

Each entity is the pattern for an entity to match, the last one typically being the newline entity. For each match, the 'entity' variable is set to a constant defined in the enum, and the main action ([lang]_ccallback) is called (you will need to create this action above the scanner).
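The relationship between the enum constants and the entity name array can be sketched in plain C as follows. The array and enum are copied from the required block at the top of the C parser; the helper functions c_entity_name and c_entity_index are hypothetical and exist only to show that each constant is an index into the array.

```c
#include <assert.h>
#include <string.h>

/* Entity names and constants, as in the required block of 'c.rl'. */
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

/* Hypothetical helper: maps an enum constant to its entity name. */
const char *c_entity_name(int entity) {
  assert(entity >= C_SPACE && entity <= C_ANY);
  return c_entities[entity];
}

/* Hypothetical helper: maps an entity name back to its constant,
 * or returns -1 if the name is unknown. */
int c_entity_index(const char *name) {
  int i;
  for (i = C_SPACE; i <= C_ANY; i++) {
    if (strcmp(c_entities[i], name) == 0)
      return i;
  }
  return -1;
}
```

Because the constants double as array indices, the order of the enum must match the order of the names in the array.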

When you detect whether or not a line is code or comment, you should call the appropriate @code or @comment action defined in 'common.rl' as soon as

Main Action Structure

Defining Patterns for Entities

Notes

Parsers with Embedded Languages

Entry Pattern Actions

[lang]_[elang]_entry @{ entity = CHECK_BLANK_ENTRY; } @[lang]_callback
@{ saw([elang]_LANG)} => { fcall [lang]_[elang]_line; };

This checks for a blank entry and, if there is one, counts the line as a line of parent language code. If there is not, the macro does nothing. The machine then transitions into the child language.

Outry Pattern Actions

@{ p = ts; fret; };

This sets the current Ragel parser position to the beginning of the outry so the line is counted as a line of parent language code if no child code is on the same line. The machine then transitions back into the parent language.

Entity Identifying Machine

This machine does not have to be written as a line-by-line parser. It only has to identify the positions of language entities, such as whitespace, comments, strings, etc., in sequence. As a result, it can be written much faster and more easily than a line counter. Using a scanner is most efficient. The callback function will be called for each entity parsed.

The @ls, @code, @comment, @queue, and @commit actions are completely unnecessary in this machine.

Scanner Structure

[lang]_entity := |*
entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
...
entityn ${ entity = ENTITYN; } => [lang]_ecallback;
*|;

Main Action Structure

action [lang]_ecallback {
callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te),
userdata);
}
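On the receiving side, a callback for entity mode could look like the following sketch. It uses the callback signature from the parse function shown earlier and assumes that s and e are offsets into the parsed buffer, with e pointing one past the last character of the entity (as Ragel's 'te' does). EntitySpan, SpanList, record_entity, count_entity, and MAX_SPANS are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <string.h>

#define MAX_SPANS 64 /* arbitrary limit for this sketch */

/* One recorded entity: its name and its assumed [start, end) offsets. */
typedef struct {
  const char *entity;
  int start;
  int end;
} EntitySpan;

typedef struct {
  EntitySpan spans[MAX_SPANS];
  int count;
} SpanList;

/* Callback with the signature the parse function expects. Records each
 * entity reported by the entity machine; extra entities beyond MAX_SPANS
 * are silently dropped in this sketch. */
void record_entity(const char *lang, const char *entity, int s, int e,
                   void *udata) {
  SpanList *list = (SpanList *)udata;
  assert(list != NULL);
  (void)lang; /* unused in this sketch */
  if (list->count < MAX_SPANS) {
    list->spans[list->count].entity = entity;
    list->spans[list->count].start = s;
    list->spans[list->count].end = e;
    list->count++;
  }
}

/* Counts how many recorded spans carry the given entity name. */
int count_entity(const SpanList *list, const char *name) {
  int i, n = 0;
  for (i = 0; i < list->count; i++) {
    if (strcmp(list->spans[i].entity, name) == 0)
      n++;
  }
  return n;
}
```

With count set to 0 in the parse call, a SpanList filled this way would hold the entities in the order the scanner matched them.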

Parsers for Embedded Languages

TODO:

Including Written Tests for Parsers

You should have two kinds of tests for parsers. The first is a header file that goes in the 'test/unit/parsers/' directory. The second is a pair of files: an input source file that goes in the 'test/src_dir/' directory and an expected output file that goes in the 'test/expected_dir/' directory.

The header file will need to be "#include"ed in 'test/unit/test_parsers.h'. Then add the "all_[lang]_tests()" function to the "all_parser_tests()" function.

Recompile the tests for the changes to take effect.

The other files added to the 'test/{src,expected}_dir/' directories will be automatically detected and run with the test suite.