Requirements

  • Available collection of OpenAPI YAML documents
  • Successful set-up of a vector database

Extending service composition with vector databases


One approach to improve the service composition pipeline is to augment the prompts passed to LLMs with additional context information. The need for such an extension stems from the reduced effectiveness that current agents exhibit when they must recall long segments of data, in this case the data describing existing services that is used for generating OpenAPI documents.

Beyond approaches to improve service discovery such as majority voting, we can take the OpenAPI document that the agent generated when converting the user description of a composite service and match it against a manually created collection of existing services in the OpenAPI format. In the optimal case, this collection contains correct and up-to-date descriptions of services that can be used for discovery.

This approach is also known as retrieval augmented generation (RAG) and aims to optimize the output of LLMs. The agent receives additional information that is fetched from a knowledge base and should be utilized when generating the response. Such a knowledge base can be represented by a vector database containing embedded documents, where, given a query, the documents with the highest relevancy are returned. As the name implies, documents are stored as vectors in such a database, i.e., sequences of floating point numbers, obtained by chunking the textual data and then embedding the individual blocks. A whole paragraph or even pages of text can be transformed into a vector, and because such transformations retain semantic similarity as perceived by humans, relevancy scoring becomes possible. In the resulting vector space, the phrases “A dog jumped over the fence.” and “There is a dog jumping over the fence!” are represented by points with a small distance to each other, whereas a larger distance can be observed between the phrases “When are we arriving at the airport?” and “My favorite color is blue.”, since the two pairs capture different degrees of semantic similarity. In the context of OpenAPI service specifications, the aim is to compare the embedding created for the generated OpenAPI document with the existing embeddings in the vector database. For example, if we ask the agent to generate an OpenAPI specification for a service returning information about requested movies and use it as input to our vector database, entries describing existing services such as The Movie Database (TMDB) should receive a significantly higher similarity score than a service for finding new recipes, e.g., Spoonacular.
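To make the notion of distance concrete, the following Python sketch embeds the four example phrases and compares their cosine similarity. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model purely for illustration; any other embedding model (e.g., text-embedding-3-large) works analogously.

from sentence_transformers import SentenceTransformer
import numpy as np

# Illustrative embedding model; any model producing sentence vectors works.
model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = [
    "A dog jumped over the fence.",
    "There is a dog jumping over the fence!",
    "When are we arriving at the airport?",
    "My favorite color is blue.",
]
vectors = model.encode(phrases)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The first pair should yield a noticeably higher similarity than the second.
print(cosine_similarity(vectors[0], vectors[1]))  # semantically similar phrases
print(cosine_similarity(vectors[2], vectors[3]))  # semantically unrelated phrases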

Matching is done by selecting the top k documents in our collection that most likely correspond to the given agent specification, i.e., the query used as input to the vector database. The result is injected into a subsequent prompt as follows:

[Task prompt]

<Agent generated OpenAPI specification>

<Retrieved collection document 1>
...
<Retrieved collection document n>

Note that retrieved documents with a similarity score below a defined threshold are discarded.

The task prompt should instruct the agent to select the document that best matches the provided agent specification; the result is then used in subsequent steps for service composition. If the matching process yields no documents above the threshold, we fall back to generating OpenAPI documents with LLMs and apply the strategies known to improve the output of this approach.
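The following Python sketch illustrates this retrieval and prompt assembly step under the assumption that the curated collection is stored in a Chroma vector database; the value of k, the similarity threshold, and the task prompt wording are illustrative placeholders rather than the actual configuration.

import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="openapi_services",
    metadata={"hnsw:space": "cosine"},  # use cosine distance for relevancy scoring
)

# Index the manually curated service descriptions (texts shortened here).
collection.add(
    ids=["tmdb", "spoonacular"],
    documents=["TMDB: endpoints for querying movie metadata ...",
               "Spoonacular: endpoints for searching recipes ..."],
)

def build_prompt(agent_spec: str, k: int = 2, similarity_threshold: float = 0.5) -> str | None:
    result = collection.query(query_texts=[agent_spec], n_results=k)
    retrieved = []
    for doc, distance in zip(result["documents"][0], result["distances"][0]):
        similarity = 1.0 - distance  # cosine distance -> similarity
        if similarity >= similarity_threshold:
            retrieved.append(doc)
    if not retrieved:
        return None  # no match above the threshold: fall back to plain LLM generation
    task_prompt = "Select the document that best matches the specification below."
    return "\n\n".join([task_prompt, agent_spec, *retrieved])

print(build_prompt("OpenAPI specification of a service returning information about requested movies"))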

Data sources for vector databases


To implement the proposed approach we require data sources from which to build our collection of service descriptions stored in a vector database; however, not every source is suitable for integration without pre-processing. Ideally, the vector database should consist of embeddings derived from OpenAPI documents that together form a corpus with the following properties:

  • The service represented by a specific document is only present once
  • The document is actively maintained, possible indicators are the
    • Author(s), e.g., if a document is released by the company offering the service and not a 3rd party
    • Last update time
  • Documents contain various endpoints with rich descriptions
  • All endpoints that are publicly available when using the service are present in the document

The degree to which these properties can be satisfied will be analyzed in subsequent sections.

Postman API Network


Postman Inc. offers the Postman API Network with what the company describes as the largest set of public APIs, comprising over 500,000 so-called collections that enable users to define APIs using the Postman collection v2 format. Users can create such collections, which are private by default but can also be made publicly available, and which are stored as GitHub repositories. To define a service, various API protocols such as REST, WebSocket, GraphQL, or gRPC can be incorporated, with the possibility of adding per-endpoint authorization methods and test cases. Furthermore, collections may contain extensive documentation describing specific endpoints or the service as a whole.

Postman furthermore offers a Transformation API that can be used to export collections, defined using the Postman collection v2 format, as OpenAPI documents.
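A hedged sketch of such an export is shown below; the endpoint path, the X-Api-Key header, and the "output" response field are assumptions about the public Postman API and should be verified against the current Postman API documentation.

import requests

POSTMAN_API_KEY = "..."      # personal Postman API key (placeholder)
COLLECTION_ID = "1234-abcd"  # hypothetical collection identifier

# Assumed endpoint for transforming a collection into an OpenAPI document.
response = requests.get(
    f"https://api.getpostman.com/collections/{COLLECTION_ID}/transformations",
    headers={"X-Api-Key": POSTMAN_API_KEY},
)
response.raise_for_status()

# The transformed OpenAPI document is assumed to be returned in the "output" field.
openapi_document = response.json()["output"]
print(openapi_document)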

> Evaluation of the Postman API Network

Examining the Postman API Network with respect to the ideal properties introduced in the section Data sources for vector databases yields the following observation: when sorting the available list of collections by the number of repository forks and limiting the set to the first thousand collections, the metadata entries show that many of the popular collections are created by the companies that provide the services. ~ Insert analysis # Top 100
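The described analysis can be sketched as follows, assuming the collection metadata has been downloaded beforehand; the file name and the field names forkCount and publisherType are hypothetical placeholders for whatever the collected metadata actually provides.

import json

# Hypothetical dump of collection metadata gathered from the Postman API Network.
with open("postman_collections_metadata.json") as f:
    collections = json.load(f)

# Sort by fork count, keep the top 1000, and count company-published entries.
top_collections = sorted(collections, key=lambda c: c.get("forkCount", 0), reverse=True)[:1000]
official = [c for c in top_collections if c.get("publisherType") == "company"]
print(f"{len(official)} of {len(top_collections)} top collections are company-published")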

Processing OpenAPI specifications


When adding OpenAPI specifications to a vector database, an embedding model needs to be used that transforms text into a vector. However, such embedding models cannot convert arbitrarily long text sequences; instead, there is an upper limit on the number of tokens that can be used for a single document. A token is the basic unit of text processed by LLMs when reading inputs or generating responses and does not necessarily correspond to a whole word; it may represent part of a word or punctuation marks. The following table shows how many tokens can be provided as input across leading embedding models from the MTEB leaderboard:

Looking at the top 1000 Postman collections by fork count and applying the conversion to OpenAPI specifications, the number of tokens for each entry can be calculated. As an example, the following two tokenizers are chosen: cl100k_base, used by, e.g., the embedding model text-embedding-3-large, and the LLaMA tokenizer used by voyage-large-2-instruct. The table below shows, for the two tokenizers, the respective statistics for unmodified OpenAPI specifications, derived using the tiktoken Python package for cl100k_base and the transformers Python package for LLaMA. Note: when running the encoder, the special token '<|fim_suffix|>' is excluded from the special tokens since it appears verbatim in the dataset.

Tokenizer   | Max tokens | Number of documents | Documents exceeding max tokens | Mean token count | Max token count
cl100k_base | 8191       | 1000                | 572                            | 136,541          | 3,980,245
LLaMA 2     | 16000      | 1000                | 491                            | 164,090          | 5,204,395

The table shows that for both tokenizers the mean and maximum token counts of OpenAPI documents are above the maximum value that can be embedded at once. Comparing the tokenizers, different token counts can be observed; this is due to the use of different rules for splitting the text into individual tokens, i.e., the same text can be broken down into tokens of different granularity. Since 57.2% of documents for the cl100k_base tokenizer and 49.1% for LLaMA 2 surpass the context size, it is important to reduce the number of tokens required. This can be achieved by taking an OpenAPI document and generating chunks with a length below the maximum token count. A document is therefore split, depending on its length, into one or more chunks that are then encoded into vectors using embedding models. Examining different strategies for chunking is important since it can have a significant impact on the performance of RAG systems [1]. The token counting used to obtain the statistics above is sketched below; in the following, approaches for generating chunks from documents are introduced.
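The sketch below counts tokens per document with both tokenizers; tiktoken provides the cl100k_base encoding, while the LLaMA tokenizer is loaded via transformers, with the concrete checkpoint name being an assumption.

import statistics
import tiktoken
from transformers import AutoTokenizer

cl100k = tiktoken.get_encoding("cl100k_base")
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed LLaMA 2 checkpoint

def count_tokens(document: str) -> tuple[int, int]:
    # disallowed_special=() lets '<|fim_suffix|>' be encoded as ordinary text
    # instead of raising an error when it occurs inside a document.
    return (len(cl100k.encode(document, disallowed_special=())),
            len(llama.encode(document)))

# Placeholder corpus; in practice these are the converted OpenAPI specifications.
documents = ["openapi: 3.0.0 ...", "openapi: 3.0.0 ..."]
counts_cl100k = [count_tokens(doc)[0] for doc in documents]
print(statistics.mean(counts_cl100k), max(counts_cl100k))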

Sequential chunking


There are a number of well-known chunking methods that are applicable to any text input. In the following, methods are examined that can be described as sequential chunking strategies, since the chunks they produce can be merged sequentially to recover the source input. The difference between the various methods lies in the strategy used to determine the breaking points at which text is split into separate chunks. Fixed size chunking is the simplest and crudest method for segmenting text into chunks: as the name implies, the input text is split into chunks of a fixed, specified number of characters. Note that, in general, different strings with the same number of characters may result in disparate numbers of tokens after embedding; however, the number of characters can be limited to a degree where the resulting token count is unlikely to exceed the context size. Recursive chunking is a different sequential approach: unlike fixed size chunking, which does not consider the structure of the text, recursive chunking splits text based on separators, e.g., the new line character (\n). The input text is split into two chunks based on the specified separators, and if any of the resulting chunks exceeds the maximum character count, chunking is applied recursively. Document based chunking splits a document based on its structure; for example, Python files can be split into a list of, e.g., classes or functions. In the following, a custom document based chunking method for OpenAPI documents is introduced. The sequential chunking methods introduced so far result in fixed sized chunks, but it is also possible to create chunks with dynamic sizes. For this, semantic chunking groups together text segments that are semantically similar. If, for example, two longer paragraphs focusing on different topics together exceed the maximum token count, semantic chunking will split the two paragraphs into separate chunks. Examples of the simplest strategies are sketched below.

~Examples
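A minimal sketch of fixed size and recursive chunking is given below; the chunk size of 2000 characters is an arbitrary example value, and the recursion over a fixed separator hierarchy is one possible variant of recursive chunking (separators are dropped in this simplified version).

def fixed_size_chunks(text: str, max_chars: int = 2000) -> list[str]:
    # Split the text into consecutive slices of at most max_chars characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def recursive_chunks(text: str, max_chars: int = 2000,
                     separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    # Keep the text as a single chunk if it is short enough.
    if len(text) <= max_chars:
        return [text]
    # Split on the highest-priority separator that occurs in the text and
    # recurse into every part that is still too long.
    for sep in separators:
        if sep in text:
            chunks: list[str] = []
            for part in text.split(sep):
                chunks.extend(recursive_chunks(part, max_chars, separators))
            return chunks
    # No separator left: fall back to fixed size chunking.
    return fixed_size_chunks(text, max_chars)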

Endpoint-based chunking with document processing


Approaches for splitting text inputs into parts of manageable size using sequential chunking methods have the advantage that they can be applied to any content. However, when looking at the chunks generated by, for example, fixed size chunking, one noticeable issue is the lack of semantic consideration in the context of OpenAPI documents. A specific endpoint of an existing service may be split in two and assigned to different chunks, meaning that neither of the two parts captures the endpoint in its entirety. Moreover, a chunk describing the majority of a certain endpoint may contain parts of another, potentially unrelated endpoint. This is especially problematic since some OpenAPI documents are created by companies that provide a multitude of distinct services while exposing the API documentation as a single collection. The issue is amplified by the observation that the textual length of endpoints specified in OpenAPI documents can vary greatly. This means that an endpoint may not only be split up and assigned to two different chunks, but may also represent a small fraction of the chunk size compared to the unrelated endpoint. When querying vector databases containing such chunked documents for specific functionality, documents that should match may be assigned low similarity scores. Depending on the implementation, documents with a similarity score lower than some threshold may be excluded, and thus the query may yield no matches despite the existence of endpoints matching the required functionality.
Using sequential chunking approaches without any processing of the OpenAPI documents also results in a larger total number of chunks, as some service specifications contain, e.g., images embedded as base64 strings when listing example results of calling an endpoint. It is therefore possible that chunks containing no meaningful information are indexed into the vector database.

Given the preceding issues, the focus is now to assess whether taking the structure of OpenAPI documents into account when creating chunks can improve the task of service composition using RAG. OpenAPI specifications are first processed to remove data that is not relevant or does not contribute significantly to capturing semantics in the context of service discovery, yielding smaller text inputs that are simpler to break up into chunks. Then, approaches for document based chunking are introduced.

Processing specifications can be done in two ways: with steps that remove data which provides no benefit in terms of service discovery, and with steps that remove data which does provide useful information but for which sufficient related data is already present when generating a chunk.

Looking at OpenAPI documents obtained from Postman, it can be observed that some entries contain base64 encoded data such as images. For example, the Twitter v2 API contains example responses for invoking an endpoint where tweets including images are listed. Since this data does not contribute towards a concise description of the service at hand, it is removed by using regular expressions to match base64 encoded content. The next modification made to documents is the removal of HTML tags, which also provide no meaningful additional information about the described service. Next, markdown notation such as ** for bold text is removed, as some OpenAPI documents make heavy use of markdown when describing service endpoints. The modifications introduced so far are not sufficient to lower token counts to a point where chunking can be applied. Therefore, to further reduce the number of tokens, URLs are removed; not all URLs are removed, however. A blacklist is used to determine whether a specific entry should be excluded; it includes the domain postman.com, which is frequently used in a Postman collection to link to different parts of the collection. Furthermore, many OpenAPI documents contain links that represent shortened URLs, such as bitly.com or t.ly, where the domain and path do not include information relevant to the service description and can thus also be removed. Since the process of service discovery typically involves finding existing services that closely match the required functionality, it is possible to omit response examples that contain the message returned when a service could not complete the request successfully. For this, we consider the input JSON OpenAPI document as a tree where the nodes are object keys. The objective is then to search within this tree for keys representing HTTP status codes, more specifically entries with keys that correspond to client and server error responses (4xx and 5xx status codes). All matches that have some parent with the key “responses” are then deleted from the document. Using the same approach, we can also remove information from successful responses (2xx status codes) that is not relevant, such as response headers. A sketch of these processing steps is given below.
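The following sketch outlines these processing steps for an OpenAPI document parsed as JSON; the regular expressions, the blacklist entries, and the handling of response headers are illustrative rather than the exact implementation.

import json
import re

URL_BLACKLIST = ("postman.com", "bitly.com", "t.ly")  # example blacklist entries

BASE64_IMAGE = re.compile(r"data:image/[^;]+;base64,[A-Za-z0-9+/=]+")
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
MARKDOWN_EMPHASIS = re.compile(r"\*\*|__")
BLACKLISTED_URL = re.compile(
    r"https?://(?:[\w.-]*\.)?(?:" + "|".join(map(re.escape, URL_BLACKLIST)) + r")\S*"
)
ERROR_STATUS = re.compile(r"^[45]\d\d$")

def clean_text(value: str) -> str:
    # Strip base64 images, HTML tags, markdown emphasis, and blacklisted URLs.
    for pattern in (BASE64_IMAGE, HTML_TAG, MARKDOWN_EMPHASIS, BLACKLISTED_URL):
        value = pattern.sub("", value)
    return value

def prune(node, inside_responses: bool = False):
    # Walk the JSON tree; below a "responses" key, drop 4xx/5xx entries and
    # response headers, and clean every string value along the way.
    if isinstance(node, dict):
        pruned = {}
        for key, value in node.items():
            key_str = str(key)
            if inside_responses and (ERROR_STATUS.match(key_str) or key_str == "headers"):
                continue  # drop client/server error responses and response headers
            pruned[key] = prune(value, inside_responses or key_str == "responses")
        return pruned
    if isinstance(node, list):
        return [prune(item, inside_responses) for item in node]
    if isinstance(node, str):
        return clean_text(node)
    return node

with open("service_openapi.json") as f:  # hypothetical converted OpenAPI document
    document = json.load(f)
cleaned_document = prune(document)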

The aforementioned approaches to reduce token counts are in a few instances not sufficient to produce chunks that do not exceed the context size of, e.g., 8191 tokens as defined by the embedding model text-embedding-3-large. Looking at the remaining documents that are still too large, it can be observed that they list a large number of examples showing how an endpoint is invoked. This can be

An example represents an airline booking service where the OpenAPI specification contains example

Generating chunks using LLMs


Continue based on [1]

  • Issues:
    • Some documents are too large to even be passed as input to LLMs (consider maximum token counts upwards of 5M)

[1] https://arxiv.org/pdf/2312.06648