Learn Protocol Buffers (Protobuf) for serializing structured data — Part 1

Neeraj Kushwaha
Sep 24, 2022

This article was originally published at https://www.learncsdesign.com

Before we talk about Protocol Buffers, let’s first understand serialization and deserialization.

Serialization — The process of converting an object into a linear sequence of bytes for either storage or transmission to another location. This process of serializing an object is also called marshaling an object.

Deserialization — The process of taking in stored information and recreating objects from it. Extracting a data structure from a series of bytes is called unmarshaling.
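
As a minimal sketch of the two definitions above, the round trip below uses Python's standard json module (JSON is one of the formats discussed later). The book dictionary is only an illustrative object, not part of any particular API.

```python
import json

# An in-memory object we want to store or send somewhere.
book = {"title": "Clean Code", "pages": 434}

# Serialization (marshaling): object -> linear sequence of bytes.
payload = json.dumps(book).encode("utf-8")

# Deserialization (unmarshaling): bytes -> object.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__)  # bytes
print(restored == book)        # True
```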

As data has evolved, so have the ways to serialize and deserialize data. Let’s talk about a few of them.

Comma Separated Values (CSV)

CSV is easy to read and parse, but it has several disadvantages. Data types have to be inferred and are not guaranteed, parsing becomes tricky when the data itself contains commas, and the header row with column names may or may not be present.
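
The short sketch below illustrates two of these CSV problems with Python's standard csv module. The sample row is hypothetical: its title contains a comma, so it has to be quoted, and the page count comes back as text.

```python
import csv
import io

# Hypothetical sample data: the title contains a comma, so it is quoted.
raw = 'title,pages\n"Hello, World",100\n'

rows = list(csv.reader(io.StringIO(raw)))
header, record = rows[0], rows[1]

# Every value is read as a string; the integer type has to be inferred by us.
pages = int(record[1])

# Splitting on commas naively breaks the quoted title into two pieces.
naive = raw.splitlines()[1].split(",")

print(record)  # ['Hello, World', '100']
print(naive)   # ['"Hello', ' World"', '100']
```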

Relational table definitions

Relational table definitions add types like

CREATE TABLE authors (
  title varchar(255),
  author varchar(80),
  isbn13 varchar(20),
  pages int
);

The advantage is that the data is fully typed and fits in a table. But the data has to be flat, it lives inside a database, and the data definition differs from one database to another.

JavaScript Object Notation (JSON)

In JSON, data can take any form, and it is a widely accepted format on the web. JSON can be read by most programming languages and is easily shared over a network. But JSON has no schema enforcement, and JSON objects are quite big because the keys are repeated in every object.

{"title":"Clean Code","author":"Robert C Martin","isbn13":"978-8131773383","pages":434}

In the JSON object, characters like {}[],: do not carry any data; they only delimit the data so that it can be decoded and structured. In the count below, the quotes and the repeated keys are treated as non-data as well.

JSON Object length: 87 bytes
Actual data length: 42 bytes
Non-data length: 45 bytes, all of it overhead during data transfer.
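
The three numbers above can be reproduced with a few lines of Python. This is only a sketch of the counting, and it assumes that "actual data" means the four values written out as text.

```python
import json

book = {
    "title": "Clean Code",
    "author": "Robert C Martin",
    "isbn13": "978-8131773383",
    "pages": 434,
}

# Compact separators reproduce the payload above (no spaces after , and :).
encoded = json.dumps(book, separators=(",", ":")).encode("utf-8")

# Assumption: "actual data" is the values only, written out as text.
data_len = sum(len(str(value).encode("utf-8")) for value in book.values())
overhead = len(encoded) - data_len

print(len(encoded), data_len, overhead)  # 87 42 45
```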

Extensible Markup Language (XML)

XML wraps each value in a named tag, much like a JSON key, but every element also needs a closing tag. Because of the closing tags, XML documents are much bigger than the equivalent JSON.

<?xml version="1.0" encoding="UTF-8" ?><root><title>Clean Code</title><author>Robert C Martin</author><isbn13>978-8131773383</isbn13><pages>434</pages></root>

XML document length: 158 bytes
Actual data length: 42 bytes
Non-data length: 116 bytes, all of it overhead during data transfer.
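
The same counting can be sketched for the XML document. The string is assembled by hand here so that it matches the example above byte for byte; a real XML library may write the declaration slightly differently.

```python
book = {
    "title": "Clean Code",
    "author": "Robert C Martin",
    "isbn13": "978-8131773383",
    "pages": 434,
}

# Each value is wrapped in an opening and a closing tag.
body = "".join(f"<{key}>{value}</{key}>" for key, value in book.items())
xml_doc = f'<?xml version="1.0" encoding="UTF-8" ?><root>{body}</root>'

xml_len = len(xml_doc.encode("utf-8"))
data_len = sum(len(str(value)) for value in book.values())
overhead = xml_len - data_len

print(xml_len, data_len, overhead)  # 158 42 116
```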

Protocol Buffers

Protocol Buffers, or Protobuf, is a data serialization format like JSON or XML. Unlike them, Protobuf is not meant for humans: the serialized data is binary and hard to read.

Protocol Buffers are Google’s language-neutral, platform-neutral extensible mechanism for serializing structured data.

Advantages

  • Data is fully typed
  • Data is encoded in a compact binary form, which means smaller payloads and less CPU spent on parsing
  • Schema (.proto file) is needed to generate code and read the data
  • Documentation can be embedded in the schema
  • Data can be read across many languages (C++, C#, Dart, Go, Java, Objective-C, JavaScript, Ruby, PHP, Kotlin & Python)
  • Schema can evolve over time with support for backward compatibility
  • It’s much smaller and faster compared to JSON & XML
  • Code can be auto-generated with a code generator

Disadvantages

  • Because the serialized data is binary, it can’t be opened and read in a text editor
  • Limited programming languages supported

A Protobuf schema is defined in a .proto text file that can be easily read and understood by humans.

In Protobuf everything is a message. A message is equivalent to a class or structure in programming languages.

The line syntax = "proto3"; tells the compiler that we are using version 3 of the Protocol Buffers language. In the message body, we define the fields associated with the message. Fields can be unsigned integers, signed integers, floats, doubles, byte arrays, strings, booleans, enums, and user-defined messages.

So let’s see what serialized data we get for the above JSON example payload. I have broken down the serialized data into multiple lines to show what encoding looks like after each line.

Serialized data length: 48 bytes
Actual data length: 42 bytes
Non-data length: only 6 bytes of overhead during data transfer.
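
To see where the 48 bytes come from, here is a hand-rolled sketch of the Protobuf wire encoding in plain Python. It does not use the official protobuf library. It assumes a message with string fields title = 1, author = 2, isbn13 = 3 and an int32 field pages = 4; these field numbers are an assumption for illustration, and the helper names are made up for this sketch.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is the tag shifted left by 3, plus the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string(field_number: int, text: str) -> bytes:
    raw = text.encode("utf-8")
    # Wire type 2 = length-delimited: key, length, then the bytes.
    return encode_key(field_number, 2) + encode_varint(len(raw)) + raw


def encode_int(field_number: int, value: int) -> bytes:
    # Wire type 0 = varint (this sketch handles non-negative values only).
    return encode_key(field_number, 0) + encode_varint(value)


# Assumed field numbers: title = 1, author = 2, isbn13 = 3, pages = 4.
parts = [
    encode_string(1, "Clean Code"),
    encode_string(2, "Robert C Martin"),
    encode_string(3, "978-8131773383"),
    encode_int(4, 434),
]

for part in parts:
    print(part.hex(" "))  # one encoded field per line

message = b"".join(parts)
print(len(message))  # 48
```

Each string field costs 2 extra bytes (key and length), and the pages field is a 1-byte key followed by a 2-byte varint. Note that field names never appear in the output.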

Share data across Programming languages

When we run the Protocol Buffers compiler on a .proto file, the compiler generates the code in the chosen language. We only need to work with the message types we have described in the .proto file, which includes getting and setting field values, serializing messages to an output stream, and parsing messages from an input stream.

Protocol Buffers Style Guide

Google provides style guidelines for designing a .proto file and we should try to adhere to them.

Standard file formatting

  • Keep the line length to 80 characters
  • Use an indent of 2 spaces
  • Prefer the use of double quotes for strings

File Structure

  • Files should be named lower_snake_case.proto
  • All files should be ordered in the following manner:
    1. License header (if applicable)
    2. File overview
    3. Syntax
    4. Package
    5. Imports (sorted)
    6. File options
    7. Everything else

Packages

  • Package names should be in lowercase. Package names should have unique names based on the project name.

Message and field names

  • Use CamelCase (with an initial capital) for message names
  • Use underscore_separated_names for field names
  • If a field name contains a number, the number should appear directly after the letter instead of after an underscore, like media_artist1 rather than media_artist_1

message MediaServerRequest {
  optional string media_name = 1;
  optional string media_artist1 = 2;
}

Repeated Fields

  • Use pluralized names for repeated fields

message MediaServerRequest {
  repeated string songs = 1;
}

Enums

  • Use CamelCase (with an initial capital) for enum type names and CAPITALS_WITH_UNDERSCORES for value names

enum RequestStatus {
  REQUEST_STATUS_DEFAULT = 0;
  REQUEST_STATUS_FIRST_VALUE = 1;
  REQUEST_STATUS_SECOND_VALUE = 2;
}

Defining A Message Type

Let’s break down our author.proto message type to see all components of it.

Scalar Field Types

  • Number: Numbers can take various forms based on the values you expect them to hold: double (64 bits), float (32 bits), int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64
  • Boolean: A boolean can hold the value true or false. It is represented as bool in Protobuf
  • String: A string represents text of arbitrary length. It is represented as string in Protobuf. A string must always contain UTF-8 encoded or 7-bit ASCII text
  • Bytes: Bytes represent an arbitrary sequence of bytes. It is represented as bytes in Protobuf

Field Tags

In Protocol Buffers, field names matter only when referring to fields in code; they are not part of the serialized data. What matters to Protobuf is the field tag. The smallest tag value is 1 and the largest is 2²⁹–1, or 536,870,911. We also cannot use the numbers 19000 through 19999, which are reserved for the Protocol Buffers implementation. Tags numbered from 1 to 15 take only 1 byte to encode, while tags numbered from 16 to 2047 take 2 bytes. So use tags 1 to 15 for frequently populated fields.
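
The 1-byte and 2-byte boundaries follow from how a field key is encoded: the tag is shifted left by 3 bits to make room for the wire type, and the result is written as a varint with 7 payload bits per byte. The sketch below works that out in plain Python; key_size and is_valid_tag are hypothetical helper names, not part of any Protobuf library.

```python
MAX_TAG = 2**29 - 1  # 536,870,911


def is_valid_tag(tag: int) -> bool:
    """Check the range rules described above."""
    if not 1 <= tag <= MAX_TAG:
        return False
    return not 19000 <= tag <= 19999  # reserved range


def key_size(tag: int) -> int:
    """Number of bytes needed to encode the key for a given tag."""
    key = tag << 3  # the low 3 bits hold the wire type
    size = 1
    while key >= 0x80:  # more than 7 bits left: another byte is needed
        key >>= 7
        size += 1
    return size


for tag in (1, 15, 16, 2047, 2048):
    print(tag, key_size(tag))
# 1 1
# 15 1
# 16 2
# 2047 2
# 2048 3
```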

Repeated Fields

To make a list or an array, we can use the concept of repeated fields. The list can take any number of elements we want, even 0.
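
On the wire, a repeated string field is simply the same tag written once per element, so an empty list produces no bytes at all. The sketch below shows this in plain Python for the songs field with tag 1. It is a simplified illustration that only handles tags up to 15 and strings shorter than 128 bytes, and encode_repeated_strings is a hypothetical helper name.

```python
from typing import Sequence


def encode_repeated_strings(tag: int, items: Sequence[str]) -> bytes:
    """Encode each element as key, length, bytes, reusing the same tag."""
    out = bytearray()
    for item in items:
        raw = item.encode("utf-8")
        if tag > 15 or len(raw) > 127:
            # Simplification: keep the key and the length to one byte each.
            raise ValueError("sketch only supports small tags and short strings")
        out.append((tag << 3) | 2)  # wire type 2 = length-delimited
        out.append(len(raw))
        out += raw
    return bytes(out)


print(encode_repeated_strings(1, []).hex(" "))           # (empty line)
print(encode_repeated_strings(1, ["a", "bc"]).hex(" "))  # 0a 01 61 0a 02 62 63
```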

Comments

To add comments to your .proto files, use C/C++ style // and /* */ syntax.

Enums

If we know in advance all the values a field can take, we can use the enum type. In proto3, the first enum value must be 0, which is also the default value.

Default Fields Values

Any field that is not given a value takes its default value.

  • bool — false
  • number — 0
  • string — empty string
  • bytes — empty bytes
  • repeated — empty list
  • enum — first value

Using Other Message Types

We can use other message types as field types. Those message types can be defined in the same .proto file.

Nesting Types

It is possible to define types within types. This helps avoid naming conflicts and keeps a type close to where it is used. If we want to reuse a nested message type outside its parent message type, we can refer to it as Parent.Type. We can nest messages as deeply as we like.

In the next post, we will talk about advanced concepts of Protocol Buffers.

