RCSB PDB: Search API Documentation


    This document explains how to use the RCSB Search API. The Search API allows users to run queries across RCSB PDB Search Services and retrieve a list of relevant identifiers, such as PDB IDs, entity IDs and assembly IDs.

    The Search API is a RESTful API over HTTP with JSON payloads. It accepts HTTP GET or POST requests. Refer to the RCSB Search API Full Reference for the full API documentation.

    Introduction

    The base URI for calls to the Search API is http://search.rcsb.org/rcsbsearch/v1/query.

    The search request should be passed as a URL-encoded JSON string in the json parameter: http://search.rcsb.org/rcsbsearch/v1/query?json={search-request}. The query syntax for the {search-request} is detailed in the Query Language section of this guide. See the Build Your Search section for general information on how to construct the {search-request} body.
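
    As an illustration of the encoding step, the sketch below serializes a request object to JSON and percent-encodes it into the json parameter using only the Python standard library. The helper name build_get_url is hypothetical, and the "type" and "service" fields of the sample query node are assumptions that this section does not define; check the full reference before relying on them.

```python
import json
from urllib.parse import quote, unquote

# Base URI as given in this guide.
BASE_URL = "http://search.rcsb.org/rcsbsearch/v1/query"


def build_get_url(search_request: dict) -> str:
    """Serialize the request to compact JSON and URL-encode it into ?json=..."""
    payload = json.dumps(search_request, separators=(",", ":"))
    return f"{BASE_URL}?json={quote(payload, safe='')}"


# Hypothetical minimal request; "type" and "service" are assumed field names.
request = {"return_type": "entry", "query": {"type": "terminal", "service": "text"}}
url = build_get_url(request)

# Round trip: decoding the json parameter gives back the original request.
decoded = json.loads(unquote(url.split("?json=", 1)[1]))
print(url)
```

    The same request object could instead be sent as the body of a POST request, which avoids URL length limits for large queries.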

    Build Your Search

    A search request is a complete specification of what should be returned in a result set. The search request is represented as a JSON object. The building blocks of the request are:

    Context | Description
    return_type | Required. Specifies the type of the returned identifiers, e.g. entry, polymer entity or assembly. See the Return Type section for more information.
    query | Optional. Specifies the search expression. Can be omitted if a facets or count operation should be performed instead of ID retrieval. In this case the request must be configured via the request_options context.
    request_options | Optional. Controls various aspects of the search request, including pagination, sorting, scoring and faceting. If omitted, the default parameters for sorting, scoring and pagination are applied.
    request_info | Optional. Specifies additional information about the query, e.g. query_id. It is used internally at RCSB for logging purposes. When query_id is sent with the search request, it is included in the corresponding response object.
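
    The query context may consist of two types of clauses: terminal nodes, which hold individual search expressions, and group nodes, which logically combine terminal nodes or other group nodes.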

    The simplest query requires only the return_type parameter and the query context. If the parameters property is omitted from the query object, the query matches all documents and returns PDB IDs when the return_type property is set to "entry".
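
    A minimal sketch of such a match-all request, written as a Python dictionary, might look as follows. The node fields "type" and "service" are assumptions (this section names only return_type, query and parameters), and match_all_request is a hypothetical helper.

```python
import json


def match_all_request(return_type: str = "entry") -> dict:
    """Build a request whose query node has no "parameters" property."""
    return {
        "return_type": return_type,
        # Assumed node shape: a terminal node handled by the text service.
        # Leaving out "parameters" is what makes the query match all documents.
        "query": {"type": "terminal", "service": "text"},
    }


request = match_all_request()
print(json.dumps(request, indent=2))
```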

    Refer to the Examples section for more examples.

    Search Services

    The RCSB Search API consolidates requests to heterogeneous search services. The list of available services is below:

    Service | Description
    text | Performs linguistic searches against textual annotations associated with PDB structures.
    sequence | Performs fast sequence matching searches (BLAST-like) against nucleotide or protein sequences available in the PDB archive.
    seqmotif | Performs short motif searches against nucleotide or protein sequences available in the PDB archive.
    structure | Performs fast searches matching the 3D shape of PDB structures, e.g. assemblies or chains.
    chemical | Performs substructure searches across chemical definitions in PDB, matching a user-defined query molecule.

    Return Type

    The search can return one of the following result types:

    Type | Description
    entry | Returns a list of PDB IDs.
    assembly | Returns a list of PDB IDs appended with assembly IDs in the format [pdb_id]-[assembly_id], corresponding to biological assemblies.
    polymer_entity | Returns a list of PDB IDs appended with entity IDs in the format [pdb_id]_[entity_id], corresponding to polymeric molecular entities.
    non_polymer_entity | Returns a list of PDB IDs appended with entity IDs in the format [pdb_id]_[entity_id], corresponding to non-polymeric entities (or ligands).
    polymer_instance | Returns a list of PDB IDs appended with asym IDs in the format [pdb_id].[asym_id], corresponding to instances of certain polymeric molecular entities, also known as chains.
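
    The identifier formats listed above can be split back into their parts on the client side. The sketch below is one way to do this; split_identifier is a hypothetical helper, and the sample identifiers are illustrative only.

```python
from typing import Optional, Tuple

# Separator between the PDB ID and the suffix for each return type.
SEPARATORS = {
    "assembly": "-",
    "polymer_entity": "_",
    "non_polymer_entity": "_",
    "polymer_instance": ".",
}


def split_identifier(identifier: str, return_type: str) -> Tuple[str, Optional[str]]:
    """Split a returned identifier into (pdb_id, suffix); entries have no suffix."""
    if return_type == "entry":
        return identifier, None
    separator = SEPARATORS[return_type]  # KeyError for an unknown return type
    pdb_id, _, suffix = identifier.partition(separator)
    return pdb_id, suffix


print(split_identifier("4HHB-1", "assembly"))
print(split_identifier("4HHB.A", "polymer_instance"))
```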

    Query Language

    The Search API provides a full query DSL (domain-specific language) based on JSON to define queries.

    Basic Search

    The query language supports unstructured (basic) searches. An unstructured query searches the textual annotations associated with PDB structures when the field name is unknown. Such a query searches across all fields that are available for search and returns a hit if a match occurs in any field.

    To perform an unstructured search, send the parameters object without an explicit attribute property.
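
    A basic search request could be assembled as in the sketch below. The parameters object carries only a value; the "type" and "service" fields are assumed names for the node kind and the search service, and basic_search is a hypothetical helper. The search term is borrowed from the Examples section.

```python
import json


def basic_search(value: str, return_type: str = "entry") -> dict:
    """Build an unstructured search: parameters has a value but no attribute."""
    return {
        "return_type": return_type,
        "query": {
            "type": "terminal",   # assumed field name
            "service": "text",    # assumed field name
            "parameters": {"value": value},
        },
    }


request = basic_search("thymidine kinase")
print(json.dumps(request, indent=2))
```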

    Refer to the Examples section for more examples.

    Field Search

    A field query returns results based on the value of a specific field. To perform a field search, send the parameters object with an explicit attribute property set to a field name, the value property set to a search term, and the operator property set to a search operator.
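
    The three properties described above can be combined as in the following sketch. The attribute name exptl.method and its value are assumptions chosen for illustration (confirm them against the Search Attributes page), as are the "type" and "service" node fields; field_search is a hypothetical helper.

```python
import json


def field_search(attribute: str, operator: str, value, return_type: str = "entry") -> dict:
    """Build a field search with explicit attribute, operator and value."""
    return {
        "return_type": return_type,
        "query": {
            "type": "terminal",   # assumed field name
            "service": "text",    # assumed field name
            "parameters": {
                "attribute": attribute,
                "operator": operator,
                "value": value,
            },
        },
    }


# Assumed attribute name and value, for illustration only.
request = field_search("exptl.method", "exact_match", "X-RAY DIFFRACTION")
print(json.dumps(request, indent=2))
```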

    Refer to the Examples section for more examples.

    When using field search, you must observe the following rules:

    Boolean Expressions

    The query language supports two boolean operators: AND and OR. A boolean operator is added to a group node as the logical_operator property. Group nodes can be used to logically combine search expressions (terminal nodes) or other group nodes.
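
    The nesting of group and terminal nodes can be sketched as below. The "nodes" list, the "type" and "service" fields, the lowercase operator spelling and the attribute names are all assumptions made for illustration; terminal and group are hypothetical helpers.

```python
import json


def terminal(attribute: str, operator: str, value) -> dict:
    """A terminal node: one search expression (assumed node shape)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def group(logical_operator: str, *nodes: dict) -> dict:
    """A group node combining terminal nodes or other group nodes."""
    if logical_operator not in ("and", "or"):  # lowercase spelling is an assumption
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


# resolution < 2.5 AND (method A OR method B); attribute names are assumed.
query = group(
    "and",
    terminal("rcsb_entry_info.resolution_combined", "less", 2.5),
    group(
        "or",
        terminal("exptl.method", "exact_match", "X-RAY DIFFRACTION"),
        terminal("exptl.method", "exact_match", "ELECTRON MICROSCOPY"),
    ),
)
request = {"return_type": "entry", "query": query}
print(json.dumps(request, indent=2))
```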

    Refer to the Examples section for more examples.

    Sorting

    Sorting is determined by the sort object in the request_options context. It allows you to add one or more sorting conditions to control the order of the search result hits. The sort operation is defined on a per-field level, with a special field name reserved for sorting by score (the default).

    By default, sorting is done in descending order ("desc"). The sort can be reversed by setting the direction property to "asc". For example, the search results can be sorted by release date.
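
    A sort condition on release date might be attached to a request as in this sketch. The "sort_by" key, the list form of the sort object and the attribute name rcsb_accession_info.initial_release_date are assumptions; only sort, direction, "asc" and "desc" come from this section. with_sort is a hypothetical helper.

```python
import json


def with_sort(request: dict, sort_by: str, direction: str = "desc") -> dict:
    """Return a copy of the request with one sort condition in request_options."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    options = dict(request.get("request_options", {}))
    # "sort_by" is an assumed key name for the field to sort on.
    options["sort"] = [{"sort_by": sort_by, "direction": direction}]
    return {**request, "request_options": options}


base = {"return_type": "entry", "query": {"type": "terminal", "service": "text"}}
# Assumed attribute name for the release date; oldest entries first.
sorted_request = with_sort(base, "rcsb_accession_info.initial_release_date", "asc")
print(json.dumps(sorted_request, indent=2))
```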

    Refer to the Examples section for more examples.

    Pagination

    By default, only the first 10 hits are included in the search result list. Pagination can be configured with the start and rows parameters of the pager object in the request_options context.

    Returning all hits is generally not desirable and may cause performance issues. However, if you need to retrieve all matching hits, consider adding the return_all_hits parameter to the request_options context.
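
    Paging through a result set can be sketched as a generator that emits one request per page. The sketch assumes that start is a zero-based offset and that the total hit count is already known to the caller; page_requests is a hypothetical helper.

```python
from typing import Iterator


def page_requests(request: dict, total: int, rows: int = 10) -> Iterator[dict]:
    """Yield copies of the request, each with a pager covering one page."""
    for start in range(0, total, rows):
        options = dict(request.get("request_options", {}))
        options["pager"] = {"start": start, "rows": rows}  # start assumed zero-based
        yield {**request, "request_options": options}


base = {"return_type": "entry", "query": {"type": "terminal", "service": "text"}}
pagers = [page["request_options"]["pager"] for page in page_requests(base, total=25)]
print(pagers)
```

    With 25 hits and the default of 10 rows, this produces three requests, the last of which covers the final 5 hits.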

    Refer to the Examples section for more examples.

    Using Facets

    Faceting is the arrangement of search results into categories based on the content of requested fields. If the facets property is specified in the request_options context, the search results are presented along with numerical counts of how many matching IDs were found for each term requested in the facets. If the query context is omitted in the search request with facets specified, the response will contain only the facet counts.

    For example, a request can return the breakdown by experimental method, along with the matching IDs, for all PDB structures released after 2019-08-20.
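
    A request of this kind might be shaped as in the sketch below. The structure of each facet ("name", "aggregation_type", "attribute") and both attribute names are assumptions that this section does not confirm; consult the full reference for the exact facet schema.

```python
import json

facet_request = {
    "return_type": "entry",
    "query": {
        "type": "terminal",   # assumed field name
        "service": "text",    # assumed field name
        "parameters": {
            "attribute": "rcsb_accession_info.initial_release_date",  # assumed name
            "operator": "greater",
            "value": "2019-08-20",
        },
    },
    "request_options": {
        # Assumed facet shape: a label, an aggregation kind and the field to count.
        "facets": [
            {"name": "Methods", "aggregation_type": "terms", "attribute": "exptl.method"}
        ]
    },
}

# Dropping the query context leaves a request that asks for facet counts only.
facets_only = {key: value for key, value in facet_request.items() if key != "query"}
print(json.dumps(facet_request, indent=2))
```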

    Refer to the Examples section for more examples.

    Search Operators

    Search operators are commands that help you make your search more specific and focused. The following operators can be used to perform a field search:

    Exact Match Operators

    Exact match operators indicate that the input value should match a field value exactly (including whitespace, special characters and case).

    exact_match

    You can use the exact_match operator to find exact occurrences of the input value. Comparisons with the exact_match operator are case-sensitive.

    A single value input is required for this operator and must be a string.

    in

    The in operator allows you to specify multiple values in a single search expression. It returns results if any value in the list of input values matches. It can be used instead of multiple OR conditions. Comparisons with the in operator are case-sensitive.

    An input value is required for this operator and it must be a list of strings, numbers or dates.

    Full-Text Operators

    The full-text operators enable you to perform linguistic searches against text data by operating on words and phrases. The input text is analyzed before the search is performed. The analysis includes the following transformations:

    Standard grammar-based tokenization is used to break the input text into tokens. Refer to the Unicode Text Segmentation documentation for more information on tokenization rules.

    contains_words

    The contains_words operator performs a full-text search by operating on the words in the provided text. After the text is broken into tokens, more basic queries are constructed and OR boolean logic is used to interpret the query. For example, "actin-binding protein" will be interpreted as "actin" OR "binding" OR "protein". The search will return results if any of these tokens match. This operator can match multiple tokens in any order.

    A single value input is required for this operator and it must be a string.

    contains_phrase

    The contains_phrase operator performs a full-text search by operating on phrases. The operator requires the following criteria to be fulfilled in order to return results:

    For example, "actin-binding protein" will be interpreted as "actin" AND "binding" AND "protein" occurring in the given order.

    A single value input is required for this operator and it must be a string.
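
    The difference between contains_words and contains_phrase can be illustrated locally with a toy matcher. This is only a sketch of the semantics described above, not the service's implementation: the tokenizer is a simple regular expression rather than Unicode Text Segmentation, lowercasing is an assumption, and the phrase matcher additionally assumes that the tokens must be adjacent.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Simplified tokenizer: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def contains_words(field_text: str, query: str) -> bool:
    """OR logic: any query token present in the field is a match."""
    field_tokens = set(tokenize(field_text))
    return any(token in field_tokens for token in tokenize(query))


def contains_phrase(field_text: str, query: str) -> bool:
    """AND logic in order: all query tokens appear consecutively in the field."""
    field_tokens = tokenize(field_text)
    query_tokens = tokenize(query)
    size = len(query_tokens)
    if size == 0:
        return False
    return any(
        field_tokens[i:i + size] == query_tokens
        for i in range(len(field_tokens) - size + 1)
    )


query = "actin-binding protein"
print(contains_words("Protein binding to actin filaments", query))
print(contains_phrase("Protein binding to actin filaments", query))
print(contains_phrase("An actin-binding protein from yeast", query))
```

    The first field matches contains_words because it shares tokens with the query, but it fails contains_phrase because the tokens appear in a different order.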

    Comparison Operators

    The greater, less, greater_or_equal, less_or_equal and equals operators match fields whose values are, respectively, larger than, smaller than, larger than or equal to, smaller than or equal to, or equal to the given input value.

    A single value input is required for these operators and it must be a number or date.

    Range Operator

    The range operator can be used to match values within a provided range.

    A single value input is required for this operator and it must be an object as follows:

    
    {
        "from": [number|date],
        "include_lower": [boolean],
        "to": [number|date],
        "include_upper": [boolean]
    }
            

    By default, lower and upper bounds are excluded. They can be included by setting include_lower and include_upper to true respectively. An inclusive bound means that the boundary point itself is included in the range as well, while an exclusive bound means that the boundary point is not included in the range.
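
    The bound semantics can be checked with a small local sketch that builds the range object shown above and tests whether a value falls inside it. range_value and in_range are hypothetical helpers; the comparison runs on the client and only mirrors the inclusive and exclusive behavior described here.

```python
def range_value(lower, upper, include_lower: bool = False, include_upper: bool = False) -> dict:
    """Build the value object for the range operator; bounds are exclusive by default."""
    return {
        "from": lower,
        "include_lower": include_lower,
        "to": upper,
        "include_upper": include_upper,
    }


def in_range(x, rng: dict) -> bool:
    """Local check mirroring the inclusive/exclusive bound rules."""
    above = x >= rng["from"] if rng["include_lower"] else x > rng["from"]
    below = x <= rng["to"] if rng["include_upper"] else x < rng["to"]
    return above and below


exclusive = range_value(2.0, 2.5)
inclusive = range_value(2.0, 2.5, include_lower=True, include_upper=True)
print(in_range(2.0, exclusive), in_range(2.0, inclusive))
```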

    Refer to the Examples section for more examples.

    Exists Operator

    The exists operator is a logical operator that checks whether a given field contains any value. To be deemed non-existent, the value must be null or []. The following values indicate that the field does exist:

    The operator doesn't require a value.

    Date Math Expressions

    Comparison and range operators support date math expressions. An expression starts with an "anchor" date, which can be either now or a date string (in the applicable format) ending with ||. The anchor can be followed by a math expression, supporting + and -, e.g. "2020-06-01||-12M".
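
    Such expressions are plain strings, so they can be assembled with a small helper. The sketch below follows the anchor and offset rules described above; date_math is a hypothetical helper, and it does not validate the unit, since the caller is expected to pass one of the supported units (the only unit used here, M, is taken from the example above).

```python
def date_math(anchor: str, offset: int = 0, unit: str = "") -> str:
    """Build a date math expression from an anchor and an optional signed offset."""
    # A date-string anchor must end with "||"; the keyword now is used as-is.
    prefix = "now" if anchor == "now" else f"{anchor}||"
    if offset == 0:
        return prefix
    if not unit:
        raise ValueError("a unit is required when an offset is given")
    return f"{prefix}{offset:+d}{unit}"  # "+d" always writes the sign


print(date_math("2020-06-01", -12, "M"))  # prints 2020-06-01||-12M
print(date_math("now", 1, "M"))           # prints now+1M
```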

    The units supported are:

    Search Attributes

    The attributes available for search include the annotations described by the mmCIF dictionary, annotations coming from external resources and attributes added by RCSB PDB. Both internal additions to the mmCIF dictionary and annotations from external resources are prefixed with rcsb_.

    Refer to the Search Attributes page for a full list of the attributes that are available for text search.

    Examples

    This section demonstrates how to use the RCSB Search API to perform complex searches.

    Biological Assembly Search

    This query finds symmetric dimers having a twofold rotation with the DNA-binding domain of a heat-shock transcription factor.

    X-Ray Structures Search

    This query finds PDB structures of viral thymidine kinase with substrate/inhibitors, determined by X-ray crystallography at a resolution better than 2.5 Å.
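
    A partial sketch of such a query is shown below. It covers only three of the conditions (the protein name, the experimental method and the resolution cutoff) and leaves out the virus and substrate/inhibitor criteria. The attribute names, the "nodes" list and the "type" and "service" fields are assumptions to be checked against the Search Attributes page and the full reference.

```python
import json


def terminal(parameters: dict) -> dict:
    """Wrap a parameters object in an assumed text-service terminal node."""
    return {"type": "terminal", "service": "text", "parameters": parameters}


xray_request = {
    "return_type": "entry",
    "query": {
        "type": "group",
        "logical_operator": "and",
        "nodes": [
            # Unstructured search for the protein name.
            terminal({"value": "thymidine kinase"}),
            # Assumed attribute for the experimental method.
            terminal({
                "attribute": "exptl.method",
                "operator": "exact_match",
                "value": "X-RAY DIFFRACTION",
            }),
            # Assumed attribute for resolution; smaller values mean better resolution.
            terminal({
                "attribute": "rcsb_entry_info.resolution_combined",
                "operator": "less",
                "value": 2.5,
            }),
        ],
    },
}
print(json.dumps(xray_request, indent=2))
```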

    Protein Sequence Search

    In this example, using sequence search, we find macromolecular PDB entities that share 90% sequence identity with GTPase HRas protein from Gallus gallus (Chicken).

    3D-shape Search

    This example demonstrates how structure search can be used to find PDB structures of calmodulin with conformational changes upon Ca2+ binding. Calmodulin (CaM) protein has two homologous globular domains connected by a flexible linker. Ca2+ binding to each globular domain causes a change from a “closed” to an “open” conformation. This query finds calmodulin structures in “open” conformation.

    As a structure query input parameter we will use the crystal structure of Ca2+-loaded calmodulin (PDB entry 1CLL). This query is combined with the text search for CA chemical component ID. Note: if you leave out the query clause matching Ca2+ ions, you will also get calmodulin structures in complex with other metals (e.g. strontium in 4BW7).

    Free Ligand Search

    Ligands are considered “free ligands” when they interact non-covalently with macromolecules. This example shows how to find non-polymeric entities of the ATP molecule that occur as a “free ligand”.

    Sequence Motif Search

    A sequence motif search finds macromolecular PDB entities that contain a specific sequence motif. This example retrieves occurrences of the His2/Cys2 Zinc Finger DNA-binding domain as represented by its PROSITE signature.

    Chemical Similarity Search

    This example demonstrates how to find non-polymeric entities chemically similar to Tylenol, defined by an InChI string. Note that the parameter match_type="graph-strict" does not imply an exact structure match: the result set contains acetaminophen molecules (TYL) together with methoxy (T9V) and ethoxy (N4E) analogs.
