Zachary Grafton


Intro to gumbopp

A short introduction to gumbopp, a C++ library for parsing HTML that is built on Google's gumbo C library.

Gumbopp is a simple library that wraps Google’s gumbo HTML5 parser, originally written in C, with a modern C++ interface that should integrate well with the STL. It does so while providing a complete compiler firewall between the underlying C library and the C++ interface, so code that uses gumbopp never sees the C headers. Care was taken to make all of the features of the original library available to the user of the C++ interface. The library attempts to be forward thinking and uses some features from C++17. Hopefully, as the new standard progresses and compiler support improves, so will gumbopp.
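For readers who have not met the term, a compiler firewall is usually built with the pimpl idiom. The following is only a minimal sketch of that general technique, not gumbopp's actual source: the Page class, its Title() member, and the header/source split shown in the comments are all hypothetical names chosen for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>

// ---- What would go in the public header (hypothetical page.hpp) ----
// No header from the wrapped C library is included here, so users of
// Page never compile against it.
class Page {
public:
  explicit Page(std::string title);
  ~Page();                          // defined below, where Impl is complete
  Page(Page&&) noexcept;
  Page& operator=(Page&&) noexcept;

  const std::string& Title() const;

private:
  struct Impl;                      // forward declaration only
  std::unique_ptr<Impl> impl_;      // all private state hides behind this pointer
};

// ---- What would go in the source file (hypothetical page.cpp) ----
// In a real wrapper, this is the only place the C header would be included.
struct Page::Impl {
  std::string title;                // stand-in for state owned by the C library
};

Page::Page(std::string title)
  : impl_(std::make_unique<Impl>(Impl{std::move(title)})) {}
Page::~Page() = default;
Page::Page(Page&&) noexcept = default;
Page& Page::operator=(Page&&) noexcept = default;

const std::string& Page::Title() const { return impl_->title; }
```

Because the header only forward-declares Impl, the private details can change without forcing users to recompile, which is also what tends to make binary compatibility easier to keep.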

Getting Started

Getting started is simple. Clone the git repo, fetch its submodules, then build and install with CMake:

git clone https://github.com/zacharygrafton/gumbopp
cd gumbopp
git submodule init
git submodule update
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr ..
make
make install

The installation provides CMake configuration files, so using the library from a CMake project takes only a few lines:

find_package(gumbopp REQUIRED)
include_directories(${gumbopp_INCLUDE_DIRS})

add_executable(test ${sources})
target_link_libraries(test PRIVATE ${gumbopp_LIBRARIES})

Using the API

Using the API is simple enough. Call the Parser::parse method to parse a string containing HTML, then use the Document object that is returned to find the nodes that you need. Below is an example:

Document document = Parser::parse(R"html(
<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
  </body>
</html>)html");
auto begin = document.begin();
auto end = document.end();

auto location = std::find_if(begin, end, [](const auto& n) {
  if(n.IsElement())
    return n.GetElement() == "html";
  return false;
});

if(location != end)
  std::cout << "Found <" << location->GetElement() << ">" << std::endl;

Note that in the preceding example the std::find_if is not strictly needed, since the same node can be accessed directly by calling document.GetRoot(). The API also supports iterating through the attributes that are defined on an element, like so:

Node node = *location;

for(const auto& attr : node.GetAttributes()) {
  std::cout << attr.GetName().to_string() << " = "
    << attr.GetValue().to_string() << std::endl;
}
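If you search for elements often, the std::find_if call from the first example can be wrapped in a small helper. This is a sketch rather than part of gumbopp: FindElement is a hypothetical name, and it only assumes that a node offers IsElement() and GetElement() and that GetElement() can be compared with a string literal, as in the example above. FakeNode is a stand-in type so that the snippet compiles on its own without the library.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns an iterator to the first element node whose tag equals `tag`,
// or `last` if there is none. Works with any node type that provides
// IsElement() and GetElement().
template <class Iterator>
Iterator FindElement(Iterator first, Iterator last, const char* tag) {
  return std::find_if(first, last, [tag](const auto& n) {
    // Check IsElement() first so GetElement() is only called on elements.
    return n.IsElement() && n.GetElement() == tag;
  });
}

// Hypothetical stand-in for a gumbopp node, used only to exercise the helper.
struct FakeNode {
  bool is_element;
  std::string tag;
  bool IsElement() const { return is_element; }
  const std::string& GetElement() const { return tag; }
};
```

With a parsed document, the call would then look like FindElement(document.begin(), document.end(), "title"), assuming the document's iterators behave as they do in the first example.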

The source is documented with doxygen. The documentation could probably be improved with more examples, but I believe the API is fairly easy to discover on its own.

The Next Steps

Moving forward with the library, it would be nice to have a way to search through the mini DOM in the spirit of CSS selectors. In the meantime, the library should be stable enough for everyday use. As a side note, binary compatibility should be easy enough to maintain going forward. Stay tuned for some further announcements.