Learn Protocol Buffers (Protobuf) for serializing structured data — Part 1
This article was originally published at https://www.learncsdesign.com
Before we talk about Protocol Buffers, let’s first understand serialization and deserialization.
Serialization — The process of converting an object into a linear sequence of bytes for either storage or transmission to another location. This process of serializing an object is also called marshaling an object.
Deserialization — The process of taking in stored information and recreating objects from it. Extracting a data structure from a series of bytes is called unmarshaling.
As data has evolved, so have the ways to serialize and deserialize data. Let’s talk about a few of them.
Comma Separated Values (CSV)
CSV is easy to parse and read, but it has several disadvantages: data types have to be inferred and are not guaranteed, parsing becomes tricky when the data itself contains commas, and column names may or may not be present.
Relational table definitions
Relational table definitions add types like
CREATE TABLE authors (
  title varchar(255),
  author varchar(80),
  isbn13 varchar(20),
  pages int
);
The advantage is that the data is fully typed and fits in a table. But the data has to be flat, it has to live in a database, and the data definition syntax differs from one database to another.
JavaScript Object Notation (JSON)
In JSON, data can take any form, and it is a widely accepted format on the web. JSON can be read by most programming languages and is easily shared over a network. But JSON has no schema enforcement, and JSON objects are quite large because the keys are repeated in every object.
{"title":"Clean Code","author":"Robert C Martin","isbn13":"978-8131773383","pages":434}
In the JSON object, the key names and characters like {}[],:" do not carry any payload data; they only help the serializer format the data so that it can be decoded and structured.
JSON Object length: 87 bytes
Actual data length: 42 bytes
Non-data length: 45 bytes, all of which is overhead during data transfer.
Extensible Markup Language (XML)
XML wraps each value in named tags, much like JSON keys, but every element also needs a closing tag. Because of the closing tags, XML documents are much larger than the equivalent JSON.
<?xml version="1.0" encoding="UTF-8" ?><root><title>Clean Code</title><author>Robert C Martin</author><isbn13>978-8131773383</isbn13><pages>434</pages></root>
XML document length: 158 bytes
Actual data length: 42 bytes
Non-data length: 116 bytes, all of which is overhead during data transfer.
Protocol Buffers
Protocol Buffers, or Protobuf, is a data serialization format like JSON or XML. Unlike them, Protobuf is not meant for humans: the serialized data is a compact binary encoding that is hard for a person to read.
Protocol Buffers are Google’s language-neutral, platform-neutral extensible mechanism for serializing structured data.
Advantages
- Data is fully typed
- Data is encoded in a compact binary format, which means smaller payloads and less CPU spent on parsing
- Schema (.proto file) is needed to generate code and read the data
- Documentation can be embedded in the schema
- Data can be read across many languages (C++, C#, Dart, Go, Java, Objective-C, JavaScript, Ruby, PHP, Kotlin & Python)
- Schema can evolve over time with support for backward compatibility
- It’s much smaller and faster compared to JSON & XML
- Code can be auto-generated with a code generator
Disadvantages
- Because the data is binary-encoded, it can’t be read in a text editor
- Fewer programming languages are supported than for JSON or XML
A Protobuf schema is defined in a .proto text file that can be easily read and understood by humans.
In Protobuf everything is a message; a message is equivalent to a class or structure in programming languages.
The syntax = "proto3"; statement at the top of a .proto file tells the compiler that we are using version 3 of Protocol Buffers. In the message body, we define the fields associated with the message. Protobuf supports unsigned integers, signed integers, floats, doubles, byte arrays, strings, booleans, enums, and user-defined messages.
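As a minimal sketch of such a file, the schema below mirrors the four fields of the JSON example. The message name Book and the field numbers are assumptions chosen for illustration, not necessarily the ones used in the original author.proto.

```protobuf
// author.proto (hypothetical reconstruction; names and numbers are assumed)
syntax = "proto3";

// One record with the same four fields as the JSON example.
message Book {
  string title = 1;   // e.g. "Clean Code"
  string author = 2;  // e.g. "Robert C Martin"
  string isbn13 = 3;  // e.g. "978-8131773383"
  int32 pages = 4;    // e.g. 434
}
```

Each field has a type, a name, and a unique number; the number, not the name, identifies the field in the binary encoding.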
So let’s see what serialized data we get for the above JSON example payload. I have broken down the serialized data into multiple lines to show what encoding looks like after each line.
Serialized data length: 48 bytes
Actual data length: 42 bytes
Non-data length: only 6 bytes of overhead during data transfer.
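To see where the 48 bytes come from, here is a sketch of a schema annotated with the bytes each field would occupy for the example values. It assumes a hypothetical Book message with field numbers 1 to 4 and pages declared as int32; a different schema would give different tag bytes.

```protobuf
syntax = "proto3";

// Hypothetical message; comments show the wire bytes for the example values.
message Book {
  // 0x0A (field 1, length-delimited) + 0x0A (length 10) + 10 bytes = 12
  string title = 1;
  // 0x12 (field 2, length-delimited) + 0x0F (length 15) + 15 bytes = 17
  string author = 2;
  // 0x1A (field 3, length-delimited) + 0x0E (length 14) + 14 bytes = 16
  string isbn13 = 3;
  // 0x20 (field 4, varint) + 0xB2 0x03 (434 encoded as a varint) = 3
  int32 pages = 4;
}
// Total: 12 + 17 + 16 + 3 = 48 bytes
```

Field names never appear in the binary output; each field costs one tag byte, plus one length byte for the strings.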
Share data across Programming languages
When we run the Protocol Buffers compiler (protoc) on a .proto file, it generates code in the chosen language. We then only need to work with the message types described in the .proto file: the generated code covers getting and setting field values, serializing messages to an output stream, and parsing messages from an input stream.
Protocol Buffers Style Guide
Google provides style guidelines for designing a .proto file, and we should try to adhere to them.
Standard file formatting
- Keep the line length to 80 characters
- Use an indent of 2 spaces
- Prefer the use of double quotes for strings
File Structure
- Files should be named lower_snake_case.proto
- All files should be ordered in the following manner:
1. License header (if applicable)
2. File overview
3. Syntax
4. Package
5. Imports (sorted)
6. File options
7. Everything else
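The ordering above might look like this in practice. This is only a sketch: the package name, the Java options, and the message are hypothetical, and the file would be saved under a lower_snake_case name such as media_server_request.proto.

```protobuf
// 1. License header would go here (if applicable).

// 2. File overview: request messages for a hypothetical media service.

// 3. Syntax
syntax = "proto3";

// 4. Package
package mediaserver.v1;

// 5. Imports (sorted)
import "google/protobuf/duration.proto";
import "google/protobuf/timestamp.proto";

// 6. File options
option java_multiple_files = true;
option java_package = "com.example.mediaserver.v1";

// 7. Everything else
message MediaServerRequest {
  string media_name = 1;
  google.protobuf.Timestamp requested_at = 2;
  google.protobuf.Duration max_length = 3;
}
```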
Packages
- Package names should be lowercase and unique, based on the project name.
Message and field names
- Use CamelCase (with an initial capital) for message names
- Use underscore_separated_names for field names
- If a field name contains a number, the number should appear after the letter instead of after the underscore, for example media_artist1 rather than media_artist_1
message MediaServerRequest {
  optional string media_name = 1;
  optional string media_artist1 = 2;
}
Repeated Fields
- Use pluralized names for repeated fields
message MediaServerRequest {
  repeated string songs = 1;
}
Enums
- Use CamelCase (with an initial capital) for enum type names and CAPITALS_WITH_UNDERSCORES for value names
enum RequestStatus {
  REQUEST_STATUS_DEFAULT = 0;
  REQUEST_STATUS_FIRST_VALUE = 1;
  REQUEST_STATUS_SECOND_VALUE = 2;
}
Defining A Message Type
Let’s break down our author.proto message type to see all components of it.
Scalar Field Types
- Number: Numbers can take various forms based on the values you expect them to hold: double (64 bits), float (32 bits), int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64
- Boolean: A Boolean can hold the value true or false. It is represented as bool in Protobuf
- String: A string represents text of arbitrary length. It is represented as string in Protobuf. A string must always contain UTF-8 encoded or 7-bit ASCII text
- Bytes: Bytes represent an arbitrary sequence of bytes. It is represented as bytes in Protobuf
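A short sketch showing several of these scalar types side by side; the Track message and its field names are made up for illustration.

```protobuf
syntax = "proto3";

// Hypothetical message using a few of the scalar types.
message Track {
  string title = 1;          // UTF-8 text
  bool explicit_lyrics = 2;  // true or false
  int32 play_count = 3;      // variable-length 32-bit integer
  sint64 offset_ms = 4;      // zigzag-encoded, compact for negative values
  fixed32 checksum = 5;      // always 4 bytes on the wire
  double rating = 6;         // 64-bit floating point
  float gain_db = 7;         // 32-bit floating point
  bytes cover_art = 8;       // arbitrary sequence of bytes
}
```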
Field Tags
In Protocol Buffers, field names matter only when referring to fields in code; they are not part of the binary serialized data. What matters to Protobuf is the field tag. The smallest tag value is 1 and the largest is 2²⁹-1, or 536,870,911. The numbers 19000 through 19999 cannot be used because they are reserved for the Protocol Buffers implementation. Tags 1 to 15 take only 1 byte to encode, while tags 16 to 2047 take 2 bytes, so use tags 1 to 15 for frequently populated fields.
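As an illustration of these rules, the sketch below gives the low tags to fields that are set on most messages and a higher tag to a rarely used one. PlayEvent and its fields are hypothetical.

```protobuf
syntax = "proto3";

message PlayEvent {
  // Tags 1-15: one-byte tag, best for frequently populated fields.
  string track_id = 1;
  int64 timestamp_ms = 2;

  // Tags 16-2047: two-byte tag, fine for rarely populated fields.
  string debug_note = 16;

  // Largest allowed tag: 2^29 - 1.
  string last_field = 536870911;

  // string not_allowed = 19000;  // rejected: 19000-19999 are reserved
}
```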
Repeated Fields
To make a list or an array, we can use the concept of repeated fields. The list can take any number of elements we want, even 0.
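A minimal sketch of repeated fields, using a hypothetical Playlist message:

```protobuf
syntax = "proto3";

message Playlist {
  string name = 1;
  repeated string songs = 2;   // zero or more strings, order is preserved
  repeated int32 ratings = 3;  // repeated works with numeric types too
}
```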
Comments
To add comments to your .proto files, use C/C++ style // and /* */ syntax.
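Both comment styles in one small sketch; SearchRequest is a hypothetical name used only for illustration.

```protobuf
syntax = "proto3";

/* Block comment: describes the message as a whole.
 * SearchRequest is a made-up example. */
message SearchRequest {
  string query = 1;       // Line comment: the text to search for.
  int32 page_number = 2;  // Which page of results to return.
}
```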
Enums
If we know in advance all the values a field can take, we can leverage the enum type. In proto3, the first enum value must be 0, which is also the default value.
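A sketch of an enum and a field that uses it. Genre and Song are hypothetical names; the zero value is named UNSPECIFIED here so that an unset field is easy to tell apart from a real choice.

```protobuf
syntax = "proto3";

enum Genre {
  GENRE_UNSPECIFIED = 0;  // first value, must be 0, also the default
  GENRE_ROCK = 1;
  GENRE_JAZZ = 2;
}

message Song {
  string title = 1;
  Genre genre = 2;  // reads as GENRE_UNSPECIFIED if never set
}
```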
Default Fields Values
Any field that is not explicitly set takes its default value:
- bool — false
- number — 0
- string — empty string
- bytes — empty bytes
- repeated — empty list
- enum — first value
Using Other Message Types
We can use other message types as field types; these can be defined in the same .proto file.
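For example, assuming two hypothetical messages Artist and Album defined in the same file:

```protobuf
syntax = "proto3";

message Artist {
  string name = 1;
}

message Album {
  string title = 1;
  Artist artist = 2;           // a single Artist message
  repeated Artist guests = 3;  // message types can be repeated too
}
```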
Nesting Types
It is possible to define types within other types, which helps avoid naming conflicts and keeps a type local to where it is used. To reuse a nested message type outside its parent message type, we refer to it as Parent.Type. We can nest messages as deeply as we like.
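A small sketch of nesting and the Parent.Type reference; all message names here are hypothetical.

```protobuf
syntax = "proto3";

message SearchResponse {
  // Result is nested inside SearchResponse.
  message Result {
    string url = 1;
    string title = 2;
  }
  repeated Result results = 1;
}

message Bookmark {
  // Outside the parent, the nested type is referenced as Parent.Type.
  SearchResponse.Result saved = 1;
}
```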
In the next post, we will talk about advanced concepts of Protocol Buffers.