The Pivotal API is a good candidate for a simple brute-force approach to reverse engineering. It's known that there is a manageable, bounded set of data, and it's presumed that the API is idempotent and doesn't itself produce new information. The examples show what appears to be a graph model with nodes and links. It is not known if the API is complete - whether it exposes access to all the data available or needed - or if it's discoverable - whether all nodes in the graph can be reached from the set of known root nodes. The topology of the graph is not known, nor whether the links are bidirectional. There are practical constraints that make exploration more difficult but also make it easier to reach a halting state.
With the API Exploration tool, interactive browsing and database queries were possible. The browser allowed for quick visual comparison of the data from the API against the download center web site.
Since there was no documentation of the data content, simply guessing based on the field names was only somewhat successful. For example, was "product_files" a list of the files in the product? Surprisingly, not always. How about "file_groups" - should the link be dereferenced to get a list of groups of files? What was the set of valid "release_type" values, and was that relevant to determining if a release should be distributed? For example, "Beta Release" seems like something you wouldn't ship to customers. More surprising than the guesses of the semantics not always being right was discovering that the presence of links had no relevance to whether the resource they referred to existed.
For example, the following appears to indicate that there are resources associated with the link properties. Resolving the references led to surprising variants of "doesn't exist". Note that these endpoints were all explicitly listed as links in a successful response body.
eula_acceptance: HTTP Status 404
The resource does not exist at all.
"message": "no API endpoint matches 'products/dyadic-ekm-service-broker/releases/4186/eula_acceptance"

file_groups: HTTP Status 200
The resource 'exists' as a REST resource (a "file group" with no entries) and correctly links back to its parent.

user_groups: HTTP Status 403
The resource is not viewable. The account has the rights to view all resources in this product; it is unknown whether the restriction is specific to this resource.
"message": "user cannot view user groups for this release '4186"
Without a larger sample, it's not obvious if these cases are constant or depend on some property of the resource. Since the presence and syntax of a URI was clearly insufficient to determine its validity, every resource would require checking every link to see if it was valid and used.
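The link check just described can be sketched as a small classifier over the observed status codes. This is a minimal sketch, not the actual tooling used; the `links`/`href` response layout and the `fetch` callable are assumptions standing in for the real format and HTTP client.

```python
def classify_link(status, body):
    """Map an HTTP probe of a linked resource to a rough validity class,
    based on the 404 / 403 / empty-200 behaviors observed above."""
    if status == 404:
        return "missing"          # endpoint does not exist at all
    if status == 403:
        return "not-viewable"     # exists, but this account cannot see it
    if status == 200:
        if not body:
            return "empty"
        # present, but possibly an empty placeholder resource
        if isinstance(body, dict) and all(v in (None, [], {}) for v in body.values()):
            return "empty"
        return "valid"
    return "unknown"

def probe_all_links(resource, fetch):
    """Check every link in a resource. `fetch` is any callable returning
    (status_code, parsed_json) for a URL, e.g. backed by an HTTP client."""
    return {name: classify_link(*fetch(link["href"]))
            for name, link in resource.get("links", {}).items()}
```

Driving `probe_all_links` over a crawled corpus yields a validity map per resource, which is exactly the kind of bulk evidence needed to tell constant behavior from per-resource quirks.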
Another example is the apparent overloading of the term "Release". In the following, is the Release "Native Client 9.0.7" a different product or a newer version of "9.0.4"? Both are represented similarly in the API:

"release_type": "Maintenance Release",
"version": "Native Client 9.0.7",

"release_type": "Maintenance Release",
"version": "9.0.4",
To make sense of this, a larger sample set was needed. The solution was like a basic web spider, incorporating caching and network loop detection so it wouldn't run forever or spawn too many threads. The resulting dataset could then be directly queried and used as a meta-data cache. The binary artifacts (the file downloads themselves) were not downloaded at this point.
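The bounded crawl can be sketched as a breadth-first walk with a visited-set for loop detection and a hard cap as a halting guarantee. The `links`/`href` layout and the `fetch` callable are illustrative assumptions, not the actual Pivotal response format or the program used.

```python
from collections import deque

def crawl(roots, fetch, max_resources=10_000):
    """Exhaustively walk linked resources; returns {url: parsed_json}.
    The cache doubles as the visited set, so link cycles terminate."""
    cache = {}
    queue = deque(roots)
    while queue and len(cache) < max_resources:   # bounded halting state
        url = queue.popleft()
        if url in cache:                          # loop detection
            continue
        doc = fetch(url)
        cache[url] = doc
        for link in (doc or {}).get("links", {}).values():
            href = link.get("href")
            if href and href not in cache:
                queue.append(href)
    return cache
```

The returned dictionary is the "meta-data cache" in miniature: a point-in-time snapshot that can be bulk-loaded into a database for querying.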
Analysis and Data Modeling
With the complete (accessible) data set in a database instead of behind a REST interface, queries could be done across the corpus efficiently. There was no intent at this point to use the database in production; rather, it was for forensics and exploration, to help determine the scope of the problem, the basic 'shape' of the resources, and which features were constant and which variable. The source of the data could change at any time with no reliable notification mechanism, so queries were inherently point-in-time, with about a day's lifespan at best.
Modeling the topology of the network (resource links) was needed to clarify which relationships were extensions or components of a single entity, which were containment of possibly shared entities, and which were casual relationships. Simple statistics of unique attribute name and value distributions exposed where the same name was used in different types of entities. Co-occurrence queries validated which attributes were unique identifiers in which domain, which were optional, and the complete set of attributes for each entity. The structure of the markup was regular and self-consistent. It looked like several commonly used REST and hypermedia conventions, but it wasn't identical to any of them. It would have been useful to know which convention or framework was used for the implementation. An example is the JSON API specification (JSONAPI), which has a very similar link and attribute convention but requires attributes, such as 'type', that did not exist in this data. Other similar examples include JSON-LD and the JSON Hypertext Application Language (HAL). The answer to why many of the same names were used in different resources, but with subsets of attributes, was consistent with the "Expanded Resource" convention. Identifying those made the abstract model much simpler and allowed one to 'see through' the window of the API calls to the underlying model.
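The corpus statistics described above can be sketched as follows: collect the attribute profile of each resource type, then flag pairs where one type's attributes are a strict subset of another's - the signature of an expanded resource. Typing resources by endpoint name is an assumption for the sketch.

```python
from collections import defaultdict

def attribute_profiles(corpus):
    """corpus: iterable of (resource_type, json_dict) pairs.
    Returns the set of attribute names seen for each type."""
    profiles = defaultdict(set)
    for rtype, doc in corpus:
        profiles[rtype].update(doc.keys())
    return dict(profiles)

def expansion_candidates(profiles):
    """Yield (summary_type, expanded_type) pairs where one type's
    attribute set is a strict subset of the other's."""
    return [(a, b) for a in profiles for b in profiles
            if a != b and profiles[a] < profiles[b]]
```

Run over the crawled corpus, the subset pairs are exactly the summary/detail candidates worth inspecting by hand.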
Note that, except for "self", there is no markup, naming convention, type, or other indicator to determine if a link refers to another entity, a separate contained entity, or an expanded entity. The intent of this structure is to model one specific user interaction and presentation - one single instance of the class of Web Browser HTML Applications. Comparing individual REST requests to the equivalent HTML page shows an exact correspondence.
From the perspective of a document or markup view, the APIs present a data-centric structured view, but from the perspective of the underlying resources, the APIs expose a presentation view. There are very good practical reasons for this design, and it's a common concept to have a summary and a detail interface. But without some kind of schema, there is little clue that what you're looking at is a page layout resource that aggregates, divides, and overlays the entity resources, making it quite obscure how to query, extract, and navigate the dataset as entities. The REST API provides another method to address and query the resources once you have their unique id, by exposing several other top-level APIs corresponding to the sub-resources. However, these are the same endpoints (URIs) as in the links, and the only way to get the ids is by querying the parent resource.
In theory, denormalizing or refactoring the data structure into a resource- or document-centric model would produce a much better abstraction for resource-centric queries and document creation, largely independent of the REST abstraction or of the database, query, or programming language and tool-set. It should be a simple matter of extracting the schema from the REST model, then refactoring that into a document or object schema. In order to make use of the new model, instances of the source model need to be transformed into instances of the new data model using a transformation equivalent to the schema refactoring. Separately, these are common tasks solved by existing tools in many different domains, markup and programming languages.
Unfortunately, the REST and JSON domain doesn't have a proliferation of compatible libraries or conventions for schema and data transformations. XML tools can do much of this easily, but they lack good support for schema refactoring and offer little support for automating the creation of transformations. Modern programming languages, libraries, and IDEs have very good support for class and type refactoring, including binding to software refactoring, code generation from schema, schema generation from code, and data mapping to a variety of markup languages. There is enough overlap in functionality and formats that combining the tools from multiple domains provides more than enough functionality. In theory. With the right theory, it should be practical and useful, yet I have not discovered implementations, proposals, designs, or even discussion about how to do it, why it's easy or difficult, or that the problem or solution has been recognized at all. I assert that terminology, domain specialization, and disjoint conceptual abstractions obscure the existence of a common design - a model that could be put to good use if recognized.
The Hidden Life of Schema
I am as guilty as any of assuming that my understanding of terminology is the only one. Reluctant to use the term "Schema" because it has such a specific meaning in many domains, but lacking a better word, I resorted to actually looking up the word in a few dictionaries.

Webster's Dictionary: WD
a diagrammatic presentation; broadly: a structured framework or plan: outline
a mental codification of experience that includes a particular organized way of perceiving cognitively and responding to a complex situation or set of stimuli

Oxford Dictionary: OD
technical. A representation of a plan or theory in the form of an outline or model. 'a schema of scientific reasoning'
(in Kantian philosophy) a conception of what is common to all members of a class; a general or essential type or form.
With these concepts in mind, looking for "schema" and "schema transformations" in software domains, particularly object-oriented programming languages and IDEs, finds schema everywhere under different names and forms, some explicit and some implicit. Start with the core abstraction of object-oriented programming: a concrete class combines the abstract description of the static data model, the static interface for behavior, and the concrete implementation of both. This supplies the framework for both schema and transformation within that language.
Annotations provide meta-data that 'cross-cuts' class systems and frameworks, such that you can overlay multiple independent or inconsistent views on types, classes, and methods. Serialization, data mapping, and transformation frameworks make heavy use of a combination of the implicit schema in static class declarations and annotations to provide customization and generation of unique transformations with minimal or no change to the class itself.
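The annotation pattern can be sketched in Python terms (the article's context is Java annotations, but the idea is the same): per-field metadata overlays a serialization view on the class without changing its behavior. The field and JSON names here are illustrative.

```python
import json
from dataclasses import dataclass, field, fields

@dataclass
class Release:
    # the metadata is the "annotation": an independent serialization view
    version: str = field(metadata={"json": "version"})
    release_type: str = field(metadata={"json": "release_type"})
    eula_url: str = field(metadata={"json": "eula"})   # remapped field name

def to_json(obj):
    """Serialize using the per-field metadata instead of attribute names."""
    return json.dumps({f.metadata["json"]: getattr(obj, f.name)
                       for f in fields(obj)})
```

The class itself is ordinary; the transformation reads the overlay, so multiple inconsistent views could coexist as separate metadata keys.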
Multiple domains of programming language, markup, and data definition languages have followed a similar path: starting from a purely declarative, document-centric concept of schema, moving to a 'code first' language-centric model, and eventually introducing some form of annotation that augments the data model schema or the transformation. The ability to directly associate meta-data representing semantics with locations and fragments of implementation allows general-purpose IDEs and static generation and transformation tools to preserve the semantics and relationships between representations as a 'free ride', agnostic to specialized domain knowledge. I set out to validate that this was not only a viable theory but that it would work in practice, using commonly available tools and general knowledge.
OpenAPI, the "New WSDL"
REST-based Web Services and JSON grew from a strong "No Schema" philosophy - a reaction to, and rejection of, the complexity of the then-current XML-based service APIs. The XML service frameworks (WSDL, SOAP, JAXB, JAXP, J2EE) had themselves started humbly as a simplification of the previous generation's technologies. XMLRPC, the grandfather of SOAP, is very similar to REST in many respects. The schema for XMLRPC defines the same basic object model primitives - integers, floats, strings, lists, maps - with no concept of derived or custom types. XMLRPC, as the name implies, models Remote Procedure Calls while REST, as originally defined reststyle, models resources and representations. XMLRPC is a concrete specification of the behavior and data format. REST, on the other hand, is an Architectural Style, "RESTful" or "RESTful-style", a set of design principles and concepts without a concrete specification. So while an implementation of an XMLRPC service and a RESTful service may be equivalent in function, complexity, data model, and use, they are entirely different concepts. XMLRPC evolved along a single path into SOAP, branching into a collection of well-defined, albeit highly complex, interoperable standards. "REST", the Architectural Style, has inspired many specifications but is as ephemeral now as it ever was. Although it is the predominant style of web services today, with the term "REST" applied nearly universally to implementations, there is little consensus as to what that specifically means, and much debate as to whether a given API is really "RESTful" or not. Attempts to define standards and specifications for "REST" are oxymoronic, leading to the current state of affairs, expressed well by the phrase "The wonderful thing about standards is that there are so many of them to choose from." So while the eternal competing forces of constraints vs freedom play on, the stronger forces of usability, interoperability, and adoption led a path of rediscovery and reinvention.
The Open API Initiative was formed and agreed on a specification for RESTful APIs that is quietly and quickly gaining adoption. The Open API specification is explicitly markup- and implementation-agnostic. It is not vendor or implementation based; rather, it's derived from a collection of open source projects based on the "Swagger" SG01 specification. A compelling indicator of mind-share adoption is the number of "competing" implementations, vendors, and specifications that support import and export to the Open API format but not to each other. While Open API has the major components of a mature API specification - schema, declarative, language agnostic, implementation independent - its documentation-centric focus and compatible tools in dozens of languages attract a wide audience, even those opposed to the idea of standards and specifications, schema and constraints. OpenAPI is more of an interchange format than a normative specification. Implementations can freely produce or consume OpenAPI documents and the APIs they describe without themselves having to be based on OpenAPI.
Figure 10: The OpenAPI Ecosystem
The OpenAPI Specification is the formal document describing only the OpenAPI document format. The ecosystem of tools and features is not part of the specification, nor is their use required by it.
The agility to enter or leave this ecosystem at will allows one to combine it with other tools to create work-flows and processing pipelines otherwise impractical. The Java ecosystem has very good support for data mapping, particularly JSON-to-object mapping, and for manipulation of Java artifacts as source and dynamically via reflection and bytecode generation at runtime. The JSON ecosystem has started to enter the realm of schema processing, typically with JSON Schema JSCH1. While not nearly at the extent of the XML tools - there are very few implementations that make direct use of JSON Schema - there are many tools that can produce JSON Schema from JSON documents, produce sample JSON documents from JSON Schema, and, most interesting in this context, produce Java classes from JSON Schema and JSON Schema from Java classes.
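One leg of that tooling, inferring a schema from sample documents, can be shown as a toy sketch. Real generators (and the Java-class round trip) are far more capable; this only maps basic JSON types and is not any particular tool's algorithm.

```python
def infer_schema(sample):
    """Infer a minimal JSON Schema fragment from one sample value."""
    if isinstance(sample, bool):       # must test bool before int
        return {"type": "boolean"}
    if isinstance(sample, int):
        return {"type": "integer"}
    if isinstance(sample, float):
        return {"type": "number"}
    if isinstance(sample, str):
        return {"type": "string"}
    if isinstance(sample, list):
        items = infer_schema(sample[0]) if sample else {}
        return {"type": "array", "items": items}
    if isinstance(sample, dict):
        return {"type": "object",
                "properties": {k: infer_schema(v) for k, v in sample.items()}}
    return {}  # null or unrecognized: no constraint
```

Merging the schemas inferred from many samples (union of properties, marking the intermittent ones optional) is what makes a representative sample set so valuable.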
Combined together, along with some manual intervention to fill in the gaps and add human judgment, they form a processing path that spans software languages, markup and data formats, and behavior, along with the meta-data and schema that describe them. It is nowhere close to frictionless, lossless, or easy - but it's becoming possible. If it's shown to be useful as well, then perhaps the motivation for filling the remaining holes and smoothing the cracks will inspire people to explore the possibilities.
Standard OpenAPI Tools
Figure 11: Swagger Editor
Swagger Editor with the standard "Pet Store" API. http://editor.swagger.io
The left pane contains the editable content - the OpenAPI document in YAML format. As you edit, the right pane adjusts to a real-time documentation view composed solely from the current document. Note the native support for XML and JSON bodies. The sample shown is generated from the schema alone, although it can be augmented with sample content, an optional element of the specification for any type. Included in the editor is the "Try it out" feature, which will invoke the API as currently defined in the editor, returning the results.
Code generation in several dozen software languages for client and server code is included as part of the open source "Swagger Codegen" libraries and is directly accessible from the editor. This ranges from simple stub code, to documentation in multiple formats, to a few novel implementations that provide fully functional client and server code, dynamically generating representative requests and responses based on the OpenAPI document alone.
Figure 12: Swagger UI
The Swagger UI tool provides no editing capability; rather, it is intended for live exploration of an API. A REST endpoint that supplies a definition of its API in OpenAPI format can be opened, viewed, and invoked interactively from this tool. There is no requirement that the API be implemented in any way using OpenAPI tools; the document could simply be a hand-made document describing any API implementation that can be described, even partially, by an OpenAPI document. Swagger Editor and Swagger UI provide both sample and semantic representations of the API.
Figure 13: Restlet Studio
Restlet Studio is a proprietary application from Restlet with a free version. It uses its own proprietary formats not based on OpenAPI, but provides import and export to OpenAPI with good, though not perfect, fidelity. Restlet Studio was used during the early schema refactoring due to its support for editing the data model in a more visual fashion.
A powerful refactoring feature is the ability to extract nested anonymous JSON Schema types into named top-level types. This facilitated a very quick first pass at extracting common types from multiple endpoints. Having named top-level types instead of anonymous sub-types translated directly into named Java classes.
Reverse Engineering an Open API Document
It would have been very convenient if the Pivotal API exposed an OpenAPI interface - the tool-kits used most likely have the capability - or if an OpenAPI document had been manually created. This would have provided generation of documentation for the API, including the complete schema for the data model, as well as enabling creation of client APIs in nearly any language by the consumer.
Instead, I attempted an experiment to see how difficult it is to reverse engineer an OpenAPI document from the behavior of the API. From that, the remaining tasks would be fairly simple. To construct an OpenAPI document requires a few essential components:

A declaration of the HTTP endpoints, methods, resource URI patterns, and HTTP responses. These were all documented in sufficient detail to easily transcribe into OpenAPI format.

A JSON Schema representation of the object model for each endpoint. The specifications allow this to be omitted by simply using ANY. But the usefulness of ANY is None.

A shared 'definitions' schema to allow for reuse of schema definitions within a single document.
The representation of JSON Schema in OpenAPI is a subset of the full JSON Schema plus a few optional extensions. Otherwise, it can literally be copied and pasted into a standalone JSON Schema document, or you can reference an external JSON Schema. There are several good OpenAPI-compatible authoring and design tools available, open source and commercial. These can be used for authoring JSON Schema as well. The OpenAPI document format itself is fully supported in both JSON and YAML. This allows you to choose which format you dislike least. The transformation from JSON to YAML is fully reversible, since JSON is a subset of YAML and OpenAPI only utilizes JSON-expressible markup. Available tools do a fairly good job of this, with the exception of YAML comments and multi-line strings. The former have no JSON representation, so they are lost, and the latter have too many representations, so they get mangled. That can be worked around by adding comments later or by a little human intervention.
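To make the pieces concrete - endpoint declaration, response schema, and shared definitions - a minimal OpenAPI 2.0 (Swagger) document in YAML might look like the following. The path and type names are invented for illustration, not taken from the actual Pivotal API.

```yaml
swagger: "2.0"
info:
  title: Example Products API   # illustrative, not the real API
  version: "1.0"
paths:
  /products/{product}/releases:
    get:
      parameters:
        - {name: product, in: path, required: true, type: string}
      responses:
        "200":
          description: List of releases for a product
          schema:
            type: array
            items: {$ref: "#/definitions/Release"}
definitions:
  Release:                       # shared, reusable schema component
    type: object
    properties:
      version: {type: string}
      release_type: {type: string}
```

The same document serializes losslessly to JSON, since nothing above uses YAML-only features.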
To validate that there was no hidden dependence on a specific implementation, and that the process didn't require a great deal of software installation or expert knowledge, I picked a variety of tools for the purpose ad hoc, and generally used web sites that had online, browser-based implementations.
The Pivotal API has several useful top-level endpoints exposing different paths to the same data. To reuse the schema across endpoints and to reduce redundancy within an endpoint, the definitions feature of OpenAPI was used. This required assigning type names to every refactored schema component. Since JSON document instances have no type name information in them, every extracted type would need to be named. Using the OpenAPI editors, some amount of refactoring was possible, producing automatically generated names of dubious value, since there is no constraint that the schema for one field's value is the same as that of another field of the same name. Identifying where these duplicate schema components could be combined into one, and where they were semantically different, was aided by the prior analysis of the data set.
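The de-duplication step can be sketched as hoisting structurally identical schema components into shared definitions referenced via $ref. The naming, the hard human-judgment part, is supplied by the caller here; this is an illustration, not a tool the article used.

```python
import json

def hoist_definitions(schemas, names):
    """schemas: {field_path: schema_dict}; names: {field_path: type_name}.
    Returns (definitions, refs): structurally identical schemas share one
    definition, and every path is rewritten to a $ref pointing at it."""
    definitions, refs, seen = {}, {}, {}
    for path, schema in schemas.items():
        key = json.dumps(schema, sort_keys=True)   # structural identity
        if key not in seen:
            seen[key] = names[path]
            definitions[names[path]] = schema
        refs[path] = {"$ref": "#/definitions/" + seen[key]}
    return definitions, refs
```

Structural identity alone cannot tell same-shape-but-different-meaning components apart, which is exactly why the prior corpus analysis mattered.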
I made use of several API IDEs that were not OpenAPI-native but did provide import and export of OpenAPI. There were some issues where the import or export was not fully implemented. For example, the extended type annotations in OpenAPI were unrecognized by the tools and were either discarded or required changing to the basic JSON Schema types or the tools' proprietary types. Enumerations and typed strings were the most problematic. I have since communicated with the vendors, and some improvements have been made. I expect this to be less of an issue over time.
The availability of tools that can convert between JSON Schema and Java Classes allows
for the use of Java
IDE’s to refactor JSON Schema indirectly.
Of course, all the representations of data, schema, Java source, and API specifications were in plain text, which any text editor accommodated. The result was an interactive process of exploration, with switching between environments fairly easy. Use of scripting and work-flow automation would have improved the experience, but was not necessary.
There are multiple OpenAPI validation implementations. There is no specification of validation itself in OpenAPI, which leads to differences in results; the indication of the exact cause and location of errors varied greatly. Some tools support semantic validation as well as schema and syntax validation. The ability to directly execute the API from some tools is a major feature that allowed testing during editing and refactoring. Code generation of client-side API invocation in multiple languages provided a much stronger validation, due to the data mapping involved: an incorrect schema would usually result in an exception while mapping the response into the generated object model instances. Scripting invocation of the generated client allowed testing across a larger sample set than is easily done interactively.
Particularly useful was the ability to serialize the resulting objects back to JSON using the generated data model. The output JSON could then be compared against the raw response to validate the fidelity of the model in practice. It's not necessary, or always useful, for the results to match exactly. For example, renaming of fields, collapsing of redundant structure, and removal of unneeded elements can be the main reason for using the tools. I found it easier to first produce and validate a correct implementation, then refactor it to the desired model, especially since I didn't yet know what data was going to be needed.
Refactoring JSON Schema via Java Class Representation
Tools to convert between JSON Schema and Java classes are easily available. Typically used for data mapping and serializing Java to and from JSON, they work quite well as a schema language bridge. A Java class derived from JSON Schema preserves most of the features of JSON Schema directly as Java constructs. The specific mapping is implementation-dependent, but the concept is ubiquitous. Once in a Java class representation, the common refactoring functionality present in modern Java IDEs such as Eclipse makes changes trivial. For example, extracting anonymous types into named types in Restlet Studio resulted in a large number of synthetic class names such as "links_href_00023". Renaming a class to something more appropriate could be done in a few seconds, including updating all of the references to it. Duplicate classes of different names can be easily consolidated by a similar method. Type constraints can be applied by modifying the primitive types. For example, where enumerated values were present but the JSON to JSON Schema conversion did not recognize them, the fields were left as 'string' values. These could be replaced by Java Enum classes. Fields can be reordered or removed if unneeded.
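The enum refactoring can be shown in Python terms (the article performs it on Java classes): a field the schema tools left as a plain string is tightened to an enumerated type, so out-of-range values fail fast. The value list is drawn from values mentioned earlier in the article, not an official enumeration.

```python
from enum import Enum
from dataclasses import dataclass

class ReleaseType(Enum):
    MAINTENANCE = "Maintenance Release"
    BETA = "Beta Release"

@dataclass
class Release:
    version: str
    release_type: ReleaseType       # was: release_type: str

    @classmethod
    def from_json(cls, doc):
        # ReleaseType(...) raises ValueError on unenumerated values
        return cls(doc["version"], ReleaseType(doc["release_type"]))
```

This is the same trade the article describes: the data mapper now doubles as a validator, at the cost of breaking if the API quietly adds a new value.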
Overlapping types can sometimes be refactored into a class hierarchy or an encapsulation that removes duplication and models the intended semantics. Mistakes are immediately caught as compile errors. Since the IDE itself is not aware that the source of the classes was JSON Schema, it will not prevent you from making changes that have an ill-defined or non-existent JSON Schema representation, or one that the conversion tool does not handle well. Several iterations may be necessary to produce the desired result. Converting back from Java classes to JSON Schema preserves these changes, allowing one to merge the results back into the OpenAPI document.
Refactoring by API Composition
No amount of schema refactoring and code generation could account for the expanded entities that spanned API calls. The OpenAPI document has no native provision for composition or transformation at runtime. That required traditional programming. Once I had a working and reasonable OpenAPI document model and had validated it across a large sample set, I exited the OpenAPI ecosystem and proceeded to some simple program enhancements. The REST API was now represented as an object model, with an API object providing methods for each REST endpoint. From this basis it was simple to refactor by composition. For example, expanding a partially exposed resource into its complete form required either extracting the resource id and invoking a call on its respective endpoint method, or dereferencing its 'self' link. The latter was actually more difficult, because the semantics of the link were not part of the data model. The resource ID was not explicitly typed either, but the generated methods to retrieve a resource of a given type were modeled and provided static type validation in the form of argument arity and return type.
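Refactoring by composition can be sketched as follows: a summary resource is expanded to its full form by re-querying the top-level endpoint with the extracted id. The URI pattern and the `fetch` callable are stand-ins for the generated client API, not its actual shape.

```python
def expand_release(summary, fetch):
    """Given a partial release dict with an 'id', fetch the full form.
    `fetch` maps a URI to a parsed JSON dict (e.g. a generated client call)."""
    release_id = summary["id"]
    full = fetch("/api/v2/releases/%d" % release_id)   # hypothetical URI
    # sanity check: every summary field should reappear in the full form
    assert all(full.get(k) == v for k, v in summary.items() if k != "links")
    return full
```

Composing the generated endpoint methods this way keeps the expansion logic in ordinary code, where the OpenAPI document itself has no way to express it.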
This is a major difference from using a REST API in a web browser. The architecture and style impose a resource model and implementation that is not only presentation-oriented but also browser-navigation and user-interaction specific. This is explicitly stated as a fundamental architectural feature, highlighting the advantage of media type negotiation for user experience.
To complete the application I added a simple command line parser and serializer. This completed the experiment and validated that the process is viable and useful. This also marked the beginning: I could invoke the API, retrieve the data in multiple formats reliably, optimize and compose queries in a language of choice, and rely on the results.
Figure 14: A simple command line application
I could now begin to ask the question I started with: what is the transitive set of digital artifacts of the latest version of a software product? Left as an exercise for the reader.
Reverse Engineering, Refactoring, Schema and Data Model Transformation Cycle.
Figure 15: A Continuous Cyclical Transformation Workflow
A flow diagram of the process described. Wherever the path indicates a possible loop, an iterative process can be used. The entire work-flow itself is iterable as well. Since the work-flow is continuously connected, including the ability to generate a client or server API, any step in the process can be an entry or exit point. This implies not only that you can start at any point and complete the cycle, but that any subset of the process can be used to enable transformations between the representations of its entry and end nodes in isolation.
1. REST API to JSON Document
   Invoke the REST API, producing a JSON Document. Automated exhaustive search of API resources over hypermedia links produces a representative sample set of JSON Documents.
2. JSON to JSON Schema
   JSON samples to JSON Schema, automated generation.
3. JSON Schema to Open API Document
   Enrich the JSON Schema with REST semantics to create an Open API Document (JSON or YAML).
4. Open API IDE
   Cleanup and refactoring in Open API IDEs.
5. Open API to JSON Schema
   Extract the refactored JSON Schema from the Open API document.
6. JSON Schema to Java Class
   JSON Schema to Java Class, automated generation. Produces self-standing Java source code.
7. Java IDE
   Java IDE with rich class and source refactoring ability. Refactor the Java class structure using standard techniques; enrich with annotations. Produces refined and simplified Java classes.
8. Java Source to JSON Schema
   Automated conversion of Java source to JSON Schema.
9. Merge new JSON Schema into Open API
   Merge the refactored JSON Schema back into the Open API Document (JSON or YAML).
10. Open API IDE
    Cleanup and refactoring in Open API IDEs.
11. Open API to JSON Schema
    Extract JSON Schema from Open API.
    Iterate 6-11 as needed.
12. Open API Code Generation
    Open API code generation tools to source code, with optional customization of the code generation templates to apply reusable data mapping. Open API to complete documentation, in any format.
13. Generated Code to Application
    Enrich and refactor the generated code with data mapping annotations to produce a custom data model while preserving the REST interface. Augment with business logic to produce a custom application based on the underlying data model. The application operates on the custom data model directly; data mapping converts between data models, and the generated code invokes the REST API.
14. Output Multiple Markup Formats
    As a side effect of the Open API tools and data mapping framework, serialization to multiple formats (JSON, XML, CSV) is automated.
15. Application to Open API
    The generated code includes automated output of an OpenAPI document at runtime, incorporating the changes applied during enrichment.