The Pivotal API is a good candidate for a simple brute force approach to reverse engineering. It's known that
there is a manageable, bounded set of data, and it's presumed that the API is idempotent and doesn't itself produce
new information. The examples show what appears to be a graph model with nodes and links. It is not known whether the
API is complete (whether it exposes access to all the data available or needed) or discoverable (whether all
nodes in the graph can be reached from the set of known root nodes). The topology of the graph is not known, nor whether
the links are bidirectional. There are practical constraints that make exploration more difficult but also make it
easier to reach a halting state.
With the API Exploration tool, interactive browsing and database queries were possible. Interactive
browsing allowed quick visual comparison of the data from the API against the download center web site.
Since there was no documentation of the data content, guessing based on the field names alone was only somewhat
successful. For example, was "product_files" a list of the files in the product? Surprisingly, not always.
How about "file_groups": should the link be dereferenced to get a list of groups of files? What was the set
of valid "release_type" values, and was it relevant to determining whether a release should be distributed? For
example, "Beta Release" seems like something you wouldn't ship to customers. More surprising than discovering
that guesses at the semantics were not always right was discovering that the presence of a link had no
relevance to whether the resource it referred to existed.
For example, a typical response appears to indicate that there are resources associated with each of the link properties.
Resolving the references led to surprising variants of "doesn't exist". Note that these endpoints were all
explicitly listed as links in a successful response body.
eula_acceptance: HTTP Status 404
The resource does not exist at all.
"message": "no API endpoint matches 'products/dyadic-ekm-service-broker/releases/4186/eula_acceptance'"
file_groups: HTTP Status 200
The resource 'exists' as a REST resource (a "file group" with no entries) and correctly links back to its parent release.
user_groups: HTTP Status 403
The resource is not viewable. The account has the rights to view all resources in this product, so it is unknown whether the resource actually exists.
"message": "user cannot view user groups for this release '4186'"
Without a larger sample, it's not obvious whether these cases are constant or depend on some property of the
resource. Since the presence and syntax of a URI was clearly insufficient to determine its validity, every
resource would require checking every link to see whether it was valid and used.
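As a concrete illustration, a minimal sketch of such a link check in Java, using only the standard library. Authentication headers and error handling are omitted, and the assumption that link names map to absolute href values is mine, not the API's:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Map;

    public class LinkChecker {
        // Dereference every link of a resource and record the HTTP status observed.
        // The map is assumed to hold link-name -> absolute href pairs taken from a response body.
        static void checkLinks(Map<String, String> links) throws Exception {
            for (Map.Entry<String, String> link : links.entrySet()) {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(link.getValue()).openConnection();
                conn.setRequestMethod("GET");
                int status = conn.getResponseCode();   // 200, 403 and 404 were all observed in practice
                System.out.println(link.getKey() + " -> " + status);
                conn.disconnect();
            }
        }
    }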
Another example is the apparent overloading of the term "Release". In the following, is the Release "Native
Client 9.0.7" a different product from "9.0.4", or a newer version of it? Both are represented similarly in the API:
"release_type": "Maintenance Release",
"version": "Native Client 9.0.7",
"release_type": "Maintenance Release",
To make sense of this, a larger sample set was needed. The solution was like a basic 'Web Crawler',
incorporating caching and network loop detection so it wouldn't run forever or spawn too many threads. The
resulting dataset could then be directly queried and used as a meta-data cache. The binary resources (file
downloads) were not downloaded at this point.
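The crawler itself can be sketched in a few lines of Java. This is a simplified, single-threaded illustration rather than the actual implementation: the visited set provides the loop detection, the map acts as the meta-data cache, and the crude regular expression (assuming links appear as "href" properties) stands in for real JSON parsing and for the authentication the live API requires.

    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ApiCrawler {
        // Crude link extraction; a real implementation would parse the JSON properly.
        private static final Pattern HREF = Pattern.compile("\"href\"\\s*:\\s*\"([^\"]+)\"");

        public static Map<String, String> crawl(String rootUrl) throws Exception {
            Map<String, String> cache = new HashMap<>();    // URL -> response body (the meta-data cache)
            Set<String> visited = new HashSet<>();          // loop detection: never fetch the same URL twice
            Deque<String> queue = new ArrayDeque<>();
            queue.add(rootUrl);
            while (!queue.isEmpty()) {
                String url = queue.poll();
                if (!visited.add(url)) continue;            // already seen, skip to avoid cycles
                String body = fetch(url);
                cache.put(url, body);
                Matcher m = HREF.matcher(body);
                while (m.find()) queue.add(m.group(1));     // enqueue every discovered link
            }
            return cache;
        }

        private static String fetch(String url) throws Exception {
            try (InputStream in = new URL(url).openStream()) {
                return new String(in.readAllBytes());
            }
        }
    }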
Analysis and Data Modeling
With the complete (accessible) data set in a database instead of behind a REST interface, queries could be run
across the corpus efficiently. There was no intent at this point to use the database in production; rather, it was
for forensics and exploration, to help determine the scope of the problem, the basic 'shape' of the resources, and
which features were constant and which were variable. The source of the data could change at any time with no reliable
notification mechanism, so queries were inherently point-in-time queries with a lifespan of about a day at best.
Modeling the topology of the network (resource links) was needed to clarify which relationships were
extensions or components of a single entity, which were containment of possibly shared entities, and which were
casual relationships. Simple statistics of unique attribute name and value distributions exposed where the
same name was used in different types of entities. Co-occurrence queries validated which attributes were
unique identifiers in which domain, which were optional, and the complete set of attributes for each entity. The
structure of the markup was regular and self consistent. It looked like several
commonly used REST and hypermedia conventions, but it was not identical to any of them. It would have been useful to know
which convention or framework was used for the implementation. An example is the JSON API specification (JSONAPI),
which has a very similar link and
attribute convention but requires attributes, such as 'type', that did not exist in the data. Other similar examples include
JSON-LD JSON-LD and JSON Hypertext Application Language (HAL) HAL.
The answer to why many
of the same names were used in different resources, but with subsets of attributes, was consistent with the
"Expanded Resource" convention. Identifying those made the abstract model much simpler and allowed one to 'see
through' the window of the API calls to the underlying model.
Note that, except for "self", there is no markup, naming convention, type, or other indication to
determine whether a link refers to another entity, a separate contained entity, or an expanded entity. The intent of
this structure is to model one specific user interaction and presentation: one single instance of the class of
Web Browser HTML Applications. Comparing individual REST requests to the equivalent HTML page shows an exact correspondence.
From the perspective of a document or markup view, the APIs present a data centric structured view, but
from the perspective of the underlying resources the APIs expose a presentation view. There are very good
practical reasons for this design, and having a summary and a detail interface is a common concept. But
without some kind of schema there is little clue that what you're looking at is
a page layout resource that aggregates, divides and overlays the entity resources, making it quite obscure how
to query, extract and navigate the dataset as entities. The REST API provides another method to address and
query the resources once you have their unique ids, by exposing several other top level APIs corresponding to
the sub-resources. However, these are the same endpoints (URIs) as in the links, and the only way to get the ids is
by querying the parent resource.
In theory, denormalizing or refactoring the data structure into a resource or document model should
produce a much better abstraction for resource centric queries and document creation, largely independent of
the REST abstraction or the database, query or programming language and tool-set.
It should be a simple matter of extracting the schema from the REST model, then refactoring that into a
document or object schema. In order to make use of the new model, instances of the source model need to be
transformed into instances of the new data model using a transformation equivalent to the schema refactoring. Separately,
these are common tasks solved by existing tools in many different domains, markup and programming languages.
Unfortunately, the REST and JSON domain doesn't have a proliferation of compatible tools, standards,
libraries, or conventions for schema and data transformations. XML tools can do much of this easily but lack
good support for schema refactoring and offer little support for automating the creation of transformations.
Modern programming languages, libraries and IDEs have very good support for class and type refactoring,
including binding to software refactoring, code generation from schema, schema generation from code, and data
mapping to a variety of markup languages. There is enough overlap in functionality and formats that combining
the tools from multiple domains provides more than enough functionality. In theory. With the right theory, it
should be practical and useful, yet I have not discovered implementations, proposals, designs, or even
discussion about how to do it, why it's easy or difficult, or even that the problem or solution exists.
I assert that terminology, domain specialization, and disjoint conceptual abstractions obscure the
existence of a common design model that could be put to good use if recognized.
The Hidden Life of Schema
I am as guilty as any of assuming that my understanding of terminology is the only one. Reluctant to use
the term "Schema" because it has such a specific meaning in many domains, but lacking a better word, I
resorted to actually looking up the word in a few dictionaries.
Webster's Dictionary: WD
1. a diagrammatic presentation; broadly: a structured framework or plan: outline
2. a mental codification of experience that includes a particular organized way of perceiving
cognitively and responding to a complex situation or set of stimuli
Oxford Dictionary: OD
1. technical. A representation of a plan or theory in the form of an outline or model.
'a schema of scientific reasoning'
2. (in Kantian philosophy) a conception of what is common to all members of a
class; a general or essential type or form.
With these concepts in mind, looking for "schema" and "schema transformations" in other domains,
particularly object oriented programming languages and IDEs, finds schema everywhere under different names and
forms, some explicit and some implicit. Start with the core abstraction of object oriented programming:
a Concrete Class combines the abstract description of the static data model, the
static interface for behavior, and the concrete implementation of both. This supplies the framework for both
Schema and Transformation within that language.
Annotations provide meta-data that 'cross-cuts' class systems and
frameworks, such that you can overlay multiple independent or inconsistent views on types, classes and methods.
Serialization, data mapping and transformation frameworks make heavy use of a combination of the
implicit schema in static class declarations and annotations to provide customization and generation of unique
transformations with minimal or no change to the class itself.
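As a small illustration, a sketch of the pattern in Java using the Jackson data-mapping library. The class and field names here are mine, chosen to echo the earlier examples: the class declaration itself carries the implicit schema, while annotations overlay a serialization-specific view without changing how the rest of the program uses the class.

    import com.fasterxml.jackson.annotation.JsonIgnore;
    import com.fasterxml.jackson.annotation.JsonProperty;

    public class Release {
        @JsonProperty("release_type")   // maps the JSON field name onto a differently named Java field
        private String releaseType;

        private String version;         // mapped by the implicit schema of the class declaration

        @JsonIgnore                     // excluded from the mapped view; invisible to the framework only
        private String workingNotes;

        public String getReleaseType() { return releaseType; }
        public String getVersion() { return version; }
    }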
Multiple domains of programming language, markup and data definition languages have followed a similar
path, starting from a purely declarative, document centric concept of Schema, moving to a 'Code First', programming
language centric model, and eventually introducing some form of annotation that augments the data model schema
or the transformation. The ability to directly associate meta-data representing semantics with targeted
locations and fragments of implementation allows general purpose IDEs, static refactoring, and dynamic
generation and transformation tools to preserve the semantics and relationships between schema and transformation
as a 'free ride', agnostic to specialized domain knowledge.
I set out to validate that this was not only a viable theory but that it would work in practice, using
commonly available tools and general knowledge.
OpenAPI the “New WSDL”
REST based Web Services and JSON grew from a strong "No Schema" philosophy: a reaction to, and rejection of,
the complexity of the then-current XML based service APIs. The XML service frameworks (WSDL, SOAP, JAXB,
JAXP, J2EE) had themselves started humbly as a simplification of the previous generation's binary protocols.
XMLRPC, the grandfather of SOAP, is very similar to REST in many respects. The schema for XMLRPC defines the
same basic object model primitives: integers, floats, strings, lists and maps, with no concept of derived or custom
types. XMLRPC, as the name implies, models Remote Procedure Calls, while REST, as originally defined reststyle, models Resources and representations. XMLRPC is a concrete specification of the
behavior and data format. REST, on the other hand, is an Architectural Style, "RESTful"
or "RESTful-style", a set of design principles and concepts
without a concrete specification. So while an implementation of an XMLRPC service and a RESTful
service may be equivalent in function, complexity, data model and use, they are entirely different abstract
concepts. XMLRPC evolved along a single path into SOAP, then branched into a collection of well defined, albeit
highly complex, interoperable standards. "REST", the Architectural Style, has inspired implementations and
specifications but is as ephemeral now as it ever was. Although it is the current predominant style of web services,
with the term "REST" applied nearly universally to implementations, there is little consensus as to what
that specifically means and much debate as to whether a given API is really "RESTful" or not. Attempts to
define standards and specifications for "REST" are oxymoronic, leading to the current state of affairs expressed
well by the phrase
"The wonderful thing about standards is that there are so many of them to choose from." So while the eternal
competing forces of constraints versus freedom play on, the stronger forces of usability, interoperability and adoption led down a path of rediscovery and reinvention.
The Open API Initiative was formed and agreed on a specification for RESTful APIs that is quietly and
quickly gaining adoption. The Open API specification is explicitly markup and implementation language
agnostic. It is not vendor or implementation based; rather, it is derived from a collection of open source
projects based on the "Swagger" SG01 specification. A compelling indicator of mind-share adoption is the number of
"competing" implementations, vendors and specifications that support import and export to the Open API format but
not to each other. While Open API has the major components of a mature API specification (schema, declarative,
language agnostic, implementation independent), its documentation centric focus and the proliferation of
compatible tools in dozens of languages attract a wide audience, even those opposed to the idea of standards
and specifications, schema and constraints. OpenAPI is more of an interchange format
than a normative specification. Implementations can freely produce
or consume OpenAPI documents and the APIs they describe without themselves being based on OpenAPI.
Figure 10: The OpenAPI Ecosystem
The OpenAPI Specification is the formal document describing only the OpenAPI document format itself.
The ecosystem of tools and features is not part of the specification, nor is their functionality.
The agility to enter or leave this ecosystem at will allows one to combine it with other systems to
create work-flows and processing pipelines that would otherwise be impractical. The Java ecosystem has very good support for
data mapping, particularly JSON to Object mapping, and for manipulation of Java artifacts as source and dynamically
via reflection and bytecode generation at runtime. The JSON ecosystem has started to enter the realm of schema
processing, typically with JSON Schema JSCH1. It is nowhere near the extent of the XML tools: there are very few
implementations that make direct use of JSON Schema, but there are many tools that can produce JSON Schema
from JSON documents, produce sample JSON documents from JSON Schema, and, most interesting in this context,
produce Java Classes from JSON Schema and JSON Schema from Java Classes.
Combined, along with some manual intervention to fill in the gaps and add human judgment, these form a
processing path that spans software languages, markup and data formats, and behavior, along with the meta-data and
schema that describe them. It is nowhere close to frictionless, lossless or easy, but it is becoming possible. If it is shown to be useful as well, then
perhaps the motivation for filling the remaining holes and smoothing the cracks will inspire people to explore the
possibilities.
Standard OpenAPI Tools
Figure 11: Swagger Editor
Swagger Editor with the standard "Pet Store" API. http://editor.swagger.io
The left pane contains the editable content, the OpenAPI document in YAML format. As you edit, the
right pane adjusts to a real-time documentation view composed solely from the current document content.
Note the native support for XML and JSON bodies. The sample shown is generated from the schema alone,
although it can be augmented with sample content, an optional element of the specification for any type.
Included in the editor is the "Try it out" feature, which will invoke the API as currently defined in
the editor and return the results.
Code generation in several dozen software languages for client and server code is included as part
of the open source "Swagger Codegen" libraries and is directly accessible from the editor. This ranges from
simple stub code and documentation in multiple formats to a few novel implementations that provide fully
functional client and server code, dynamically generating representative requests and responses based on
the OpenAPI document alone.
Figure 12: Swagger UI
The Swagger UI tool provides no editing capability; rather, it is intended for live documentation and
exploration of an API. A REST endpoint that supplies a definition of its API in OpenAPI format can be
opened, viewed and invoked interactively from this tool. There is no requirement that the API be
implemented in any way using OpenAPI tools; the document could simply be a hand made static resource
describing any API implementation that can be described, even partially, by an OpenAPI document. Both
Swagger Editor and Swagger UI provide both sample and semantic representations of the model.
Figure 13: Restlet Studio
Restlet Studio is a proprietary application from Restlet with a free version. It uses its own proprietary
formats, not based on OpenAPI, but provides import and export to OpenAPI with good but not perfect
fidelity. Restlet Studio was used during the early schema refactoring due to its support for editing the
data model in a more visual fashion.
A powerful refactoring feature is the ability to extract nested anonymous JSON Schema types into
named top level types. This facilitated a very quick first pass at extracting common types from multiple
endpoints. Having named top level types instead of anonymous sub types translated directly into named Java Classes.
Reverse Engineering an Open API Document
It would have been very convenient if the Pivotal API exposed an OpenAPI interface (the tool-kits used
most likely have the capability) or if an OpenAPI document had been manually created. This would have provided generated
documentation for the API, including the complete schema for the data model, as well as enabling consumers to create
client APIs in nearly any language.
Instead, I attempted an experiment to see how difficult it is to reverse engineer an OpenAPI document
from the behavior of the API. From that, the remaining tasks would be fairly simple.
To construct an Open API document requires a few essential components.
A declaration of the HTTP endpoint, methods, resource URI pattern and HTTP response.
These were all documented in sufficient detail to easily transcribe into OpenAPI format.
A JSON Schema representation of the object model for each endpoint. The specification allows this to
be omitted by simply using ANY, but the usefulness of ANY is none.
A shared 'definitions' schema to allow for reuse of schema definitions within a single endpoint and across endpoints.
The representation of JSON Schema in OpenAPI is a subset of the full JSON Schema plus a few optional
extensions. Otherwise it can literally be copied and pasted into a standalone JSON Schema document, or you can
reference an external JSON Schema. There are several good OpenAPI compatible authoring and design tools
available, open source and commercial. These can be used for authoring JSON Schema directly.
The Open API document format itself is fully supported in both JSON and YAML. This allows you to
choose which format you dislike least. The transformation from JSON to YAML is fully reversible, since JSON is a
subset of YAML and OpenAPI only utilizes JSON expressible markup. Available tools do a fairly good job of this,
with the exception of YAML comments and multi-line strings: the former have no JSON representation and so are lost, and
the latter have too many representations and so get mangled. That can be worked around by adding comments later or by
a little human intervention.
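A sketch of that round trip, assuming the Jackson YAML data format module; the tiny document here is illustrative and not part of the Pivotal API:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.yaml.YAMLMapper;

    public class YamlJsonRoundTrip {
        public static void main(String[] args) throws Exception {
            String yaml = "swagger: '2.0'\n"
                        + "info:\n"
                        + "  title: Example   # this comment will not survive the round trip\n"
                        + "  version: '1.0'\n";
            JsonNode tree = new YAMLMapper().readTree(yaml);            // YAML -> generic tree
            String json  = new ObjectMapper().writeValueAsString(tree); // tree -> JSON
            String back  = new YAMLMapper().writeValueAsString(tree);   // tree -> YAML, comment lost
            System.out.println(json);
            System.out.println(back);
        }
    }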
To validate that there was no hidden dependence on a specific implementation, and that it didn't require a
great deal of software installation or expert knowledge, I picked a variety of tools for the purpose ad hoc
and generally used web sites that had online browser based implementations.
The Pivotal API has several useful top level endpoints exposing different paths to the same data. To reuse
the schema across endpoints, and to reduce redundancy within an endpoint, the definitions feature of OpenAPI
was used. This required assigning type names to every refactored schema component. Since JSON document
instances carry no type name information, every extracted type needed to be named. Using the
OpenAPI editors, some amount of refactoring was possible, producing automatically generated names of dubious
value, since there is no constraint that the schema for one field's value is the same as that of another field of the
same name. Identifying where these duplicate schema components could be combined into one, and where they were
semantically different, was aided by the prior analysis of the data set.
I made use of several API IDEs that were not OpenAPI native but did provide import and export of Open
API. There were some issues where the import or export was not fully implemented. For example, the
extended type annotations in OpenAPI were unrecognized by the tools and were either discarded or required changing to
the basic JSON Schema types or the tools' proprietary types. Enumerations and typed strings were the most
problematic. I have since communicated with the vendors and some improvements have been made. I expect this to be
less of an issue over time.
The availability of tools that can convert between JSON Schema and Java Classes allows for the use of Java
IDEs to refactor JSON Schema indirectly.
Of course, all the representations of data, schema, Java source and API specifications were in plain text,
which any text editor accommodated.
The result was an interactive process of exploration, with switching between the different editing
environments fairly easy and convenient. Use of scripting and work-flow automation would have improved the experience, but was
not necessary.
There are multiple OpenAPI validation implementations. There is no specification of validation itself in
OpenAPI, which leads to differences in results. The indication of the exact cause and location of an error
varied greatly between tools. Some tools support semantic validation as well as schema and syntax validation.
The ability to directly execute the API from some tools is a major feature that allowed iterative
testing during editing and refactoring.
Code generation of client side API invocations in multiple languages provided much better validation,
due to the data mapping involved. An incorrect schema would usually result in an exception when parsing the
response into the generated object model instances. Scripting the invocation of the generated application code
allowed testing across a larger sample set than is easily done interactively.
Particularly useful was the ability to serialize the resulting objects back to JSON using the generated
data model. The output JSON could then be compared against the raw response to validate the fidelity of the
model in practice. It's not necessary, or always useful, for the results to match exactly; for example, renaming
fields, collapsing redundant structure, or removing unneeded elements can be the main reason for
using the tools. I found it easier to first produce and validate a correct implementation before modifying
it to the desired model, especially since I didn't yet know what data was going to be needed.
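One way to express that fidelity check, sketched with Jackson; the model class is whatever the code generator produced, and nothing here is specific to a particular generator:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class FidelityCheck {
        // Parse the raw response into the generated model, write it back out,
        // and compare the two JSON trees structurally (field order is ignored).
        static boolean roundTrips(String rawResponse, Class<?> generatedModel, ObjectMapper mapper)
                throws Exception {
            Object model = mapper.readValue(rawResponse, generatedModel);
            String regenerated = mapper.writeValueAsString(model);
            JsonNode original = mapper.readTree(rawResponse);
            JsonNode recovered = mapper.readTree(regenerated);
            return original.equals(recovered);   // false usually means the schema missed something
        }
    }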
Refactoring JSON Schema via Java Class Representation
Tools to convert between JSON Schema and Java Classes are readily available. Typically used for data
mapping and serializing Java to and from JSON, they work quite well as a schema language conversion.
A Java Class derived from JSON Schema preserves most of the features of JSON Schema directly as Java
constructs. The specific mapping is implementation dependent, but the concept is ubiquitous. Once in a Java
Class representation, the common refactoring functionality present in modern Java IDEs such as Eclipse makes
changes trivial. For example, extracting anonymous types into named types in Restlet Studio resulted in a
large number of synthetic class names such as "links_href_00023". Renaming a class to something more
appropriate could be done in a few seconds, including updating all of the references to it. Duplicate classes
with different names can be easily consolidated by a similar method. Type constraints can be applied by
modifying the primitive types. For example, where enumerated values were present but the JSON to JSON Schema
conversion did not recognize them, the fields were left as 'string' values; these could be replaced by Java
Enum classes. Fields can be reordered or removed if unneeded.
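For example, after a rename and an 'extract enum' style refactoring, a generated class such as "links_href_00023" might end up looking like the following. This is an illustrative result rather than actual generated output, and only the two release_type values mentioned earlier in this paper are listed, since the complete value set was never documented:

    public class ReleaseSummary {                // was the synthetic name "links_href_00023"

        // Replaces the plain 'string' the schema generator produced for release_type.
        public enum ReleaseType {
            MAINTENANCE_RELEASE("Maintenance Release"),
            BETA_RELEASE("Beta Release");        // the full set of valid values is unknown

            private final String label;
            ReleaseType(String label) { this.label = label; }
            @Override public String toString() { return label; }
        }

        private ReleaseType releaseType;         // was: private String releaseType;
        private String version;

        public ReleaseType getReleaseType() { return releaseType; }
        public String getVersion() { return version; }
    }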
Overlapping types can sometimes be refactored into a class hierarchy or encapsulation to reduce
duplication and model the intended semantics. Mistakes are immediately caught as compile errors.
Since the IDE itself is not aware that the classes originated from JSON Schema, it will not
prevent you from making changes that have an ill-defined or non-existent JSON Schema representation, or one
that the conversion tool does not handle well. Several iterations may be necessary to produce the desired
result. Converting back from Java classes to JSON Schema preserves these changes, allowing one to merge the results back into the OpenAPI document.
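One way to do that conversion programmatically, using Jackson's JSON Schema module as an illustration; other generators work similarly, and the nested class here merely stands in for the refactored model:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.module.jsonSchema.JsonSchema;
    import com.fasterxml.jackson.module.jsonSchema.JsonSchemaGenerator;

    public class SchemaExport {
        // Stand-in for the refactored class; in practice this is the class edited in the IDE.
        static class ReleaseSummary {
            public String version;
            public String releaseType;
        }

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            JsonSchema schema = new JsonSchemaGenerator(mapper).generateSchema(ReleaseSummary.class);
            // The resulting schema can be merged back into the 'definitions' section of the OpenAPI document.
            System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(schema));
        }
    }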
Refactoring by API Composition
No amount of schema refactoring and code generation could account for the expanded entities that spanned
API calls. The Open API document has no native provision for composition or transformation at runtime. That
required traditional programming.
Once I had a working and reasonable OpenAPI document model, and had validated it across a large sample set, I
exited the OpenAPI ecosystem and proceeded to some simple programmatic enhancements. The REST API was now
represented as an object model, with an API object exposing a method for each REST endpoint. From this basis it was
simple to refactor by composition. For example, expanding a partially exposed resource into its complete form
required either extracting the resource id and invoking a call on its respective endpoint method, or
dereferencing its 'self' link. The latter was actually more difficult because the semantics of the link were
not part of the data model. The resource id was not explicitly typed either, but the generated methods to
retrieve a resource of a given type were modeled and provided static type validation in the form of argument
arity and return type.
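A sketch of that composition follows. The interface and class names are stand-ins for whatever the code generator produced; the point is only that expansion becomes an ordinary, statically typed method call rather than link dereferencing.

    public class ReleaseExpander {

        // Minimal stand-ins for the generated client and model classes.
        interface ProductsApi {
            Release getRelease(String productSlug, long releaseId) throws Exception;
        }
        static class ReleaseSummary { long id; String version; }
        static class Release { String version; String releaseType; }

        // The summary returned by the parent listing only carries the id;
        // the expanded resource must be fetched from its own endpoint.
        Release expand(ProductsApi api, String productSlug, ReleaseSummary summary) throws Exception {
            return api.getRelease(productSlug, summary.id);
        }
    }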
This is a major difference from using a REST API in a web browser. The architecture and style impose a
resource model and implementation that is not only presentation oriented but also specific to browser navigation and
user interaction. This is explicitly stated as a fundamental architectural feature, highlighting the
advantage of media type negotiation for user experience.
To complete the application I added a simple command line parser and serializer. This completed the
experiment and validated that the process is viable and useful. It also marked the beginning: I could
invoke the API, retrieve the data in multiple formats reliably, optimize and compose queries in a language
of choice, and rely on the results.
Figure 14: A simple command line application
I could now begin to ask the question I started with. What is the transitive set of digital artifacts
of the latest version of a software product?
Left as an exercise for the reader.
Reverse Engineering, Refactoring, Schema and Data Model Transformation Cycle.
Figure 15: A Continuous Cyclical Transformation Workflow
A flow diagram of the process described. Wherever the path indicates a possible loop, an iterative
process can be used. The entire work-flow is itself iterable as well.
Since the work-flow is continuously connected, including the ability to generate a client or server API, any step
in the process can be an entry or exit point. This implies not only that you can start at any point and
run to completion, but that any subset of the process can be used to enable transformations between the
representations of its entry and exit nodes in isolation.
1. REST API to JSON Document
Invoke the REST API, producing a JSON document.
Automated exhaustive search of API resources over hypermedia links,
producing a representative sample set of JSON documents.
2. JSON to JSON Schema
Automated generation of JSON Schema from the JSON samples.
3. JSON Schema to Open API Document
Enrich the JSON Schema with REST semantics to create an Open API document
(JSON or YAML).
4. Open API IDE
Cleanup and refactoring in Open API IDEs.
5. Open API to JSON Schema
Extract the refactored JSON Schema from the Open API document.
6. JSON Schema to Java Class
Automated generation of Java Classes from JSON Schema.
Produces self standing Java source code.
7. Java IDE with rich class and source refactoring ability
Refactor the Java Class structure using standard techniques.
Enrich with annotations.
Produces refined and simplified Java Classes.
8. Java Source to JSON Schema
Automated conversion of Java source to JSON Schema.
9. Merge new JSON Schema into Open API
Merge the refactored JSON Schema back into the Open API document (JSON or YAML).
10. Open API IDE
Cleanup and refactoring in Open API IDEs.
11. Open API to JSON Schema
Extract JSON Schema from the Open API document.
Iterate steps 6-11 as needed.
12. Open API code generation tools to source code
Optional customization of code generation templates to apply reusable data mapping annotations.
Open API to complete documentation, in any format.
13. Generated Code to Application
Enrich and refactor the generated code with data mapping annotations to produce a custom data model
while preserving the REST interface.
Augment with business logic to produce a custom application based on the underlying data model.
The application operates on the custom data model directly; data mapping converts between the data models, and the
generated code invokes the REST API.
14. Output multiple markup formats
As a side effect of the Open API tools and the data mapping framework, serialization to multiple formats
(JSON, XML, CSV) is automated; a sketch follows this list.
15. Application to Open API
The generated code includes automated output of an OpenAPI document at runtime, incorporating any changes
applied during enrichment.
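As an illustration of the multi-format output mentioned in the work-flow above, a sketch assuming the Jackson XML data format module (CSV works the same way with the CSV module and a column schema); the model class is a stand-in for the generated data model:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.xml.XmlMapper;

    public class MultiFormatOutput {
        // Stand-in for a generated model class.
        public static class Release {
            public String version = "Native Client 9.0.7";
            public String releaseType = "Maintenance Release";
        }

        public static void main(String[] args) throws Exception {
            Release release = new Release();
            System.out.println(new ObjectMapper().writeValueAsString(release)); // JSON
            System.out.println(new XmlMapper().writeValueAsString(release));    // XML from the same model
        }
    }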