Compound Group Population Project



Authoritative scientific bodies have identified hundreds of groups of chemicals that share a common chemical structure - such as containing the element lead or a methylmercury cation - and have determined that every member of such a group is associated with serious health hazards. These groups are an important part of hazard list screening in tools like the Data Commons, allowing hazards to be associated with substances that haven’t been individually listed on the authoritative hazard lists.

The challenge is that there is no authoritative listing of the tens of thousands of hazardous chemicals that belong to those groups. In the Data Commons we’re building the tools to solve this problem, and we’d like the help of chemists, toxicologists, and data scientists like yourselves to complete the job. Our vision is a transparent, scientific, peer-reviewed methodology for populating compound groups, broadly used by researchers and programs in hazard screening and backed by a public repository of group definitions and members.


Compound Groups used to identify hazardous substances are not well defined. Pharos and the Data Commons rely heavily on authoritative hazard listings to identify associations between substances and human and environmental health hazards, such as cancer and aquatic toxicity. Authoritative bodies of scientists - primarily under the auspices of national and international governmental agencies, such as the US EPA and the World Health Organization - review the science and develop a consensus on these important listings, resulting in lists of carcinogens, reproductive toxicants, aquatic toxicants and others.  

In some cases, these bodies will list a series of individual compounds associated with these hazards. In other cases, however, the bodies identify whole classes of related substances. The evidence may be compelling that any compound that contains a certain element (such as lead) or is based upon a particular structure (such as lead bound to carbon chains) is likely to have the same hazard.  The Data Commons includes over 600 of these groups. 

Rarely do these agencies attempt to list all of the substances that are members of these hazardous groups, as some may contain thousands of chemicals.  The agencies generally leave it to manufacturers or the public to check if the chemicals they use have these characteristics. This might be workable with a handful of groups, but it has become overwhelming as the number of groups has grown into the hundreds.

Now that list-based hazard screening - screening chemicals against the authoritative lists to identify potential hazards - is automated in tools such as the Data Commons, Pharos, and the HPD Builder, it is particularly important to fully populate these groups with individual members. Until the chemicals in each group are identified and listed, they can find their way into our products without anyone being aware of their hazards. Furthermore, without a commonly agreed-upon listing of the chemicals in each group, different screening tools will come up with different results.


HBN coordinates the Compound Group Population Project to identify the chemicals that should be included in groups used for hazard screening. This collaborative project is coordinated through HBN’s Chemical Hazard Data Commons (hereinafter Data Commons). Together we can help manufacturers and consumers avoid hazards they may have otherwise missed. Help us close this huge gap in list screening.

We are tackling this problem through the following steps:

  1. Establish definitions of groups

  2. Develop search algorithms to apply to chemical structure databases to identify members of the groups.

  3. Populate lists of substances that are members of each group through use of these definitions and algorithm-driven searches.

  4. Establish a public registry of the group definitions and algorithms to allow others to replicate (and test) this work

  5. Establish a public registry of the individual group members

  6. Use an open, collaborative peer review process to improve these definitions and algorithms, establish credibility, and build buy-in.

  7. Publish these definitions and algorithms as an open standard

  8. Encourage use of the open standard - these definitions and algorithms - by tool developers and list users to increase consistency.

  9. Update the list regularly
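
Steps 1 to 3 can be illustrated in miniature. The sketch below is not the Project's actual algorithm. It assumes the simplest kind of group definition, "contains a given element", and assumes each candidate substance comes with a plain molecular formula string. The names `GROUP_DEFINITIONS`, `parse_elements`, and `find_members` are hypothetical and chosen only for illustration; groups defined by a substructure (such as lead bound to carbon chains) would need a cheminformatics toolkit instead.

```python
import re

# Hypothetical group definitions: group name -> element every member must contain.
GROUP_DEFINITIONS = {
    "Lead compounds": "Pb",
    "Mercury compounds": "Hg",
}

# One element symbol (capital letter plus optional lowercase letter) and its count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_elements(formula):
    """Return the set of element symbols in a simple formula such as 'C8H20Pb'.

    Assumes no parentheses, charges, or hydrate notation.
    """
    return {symbol for symbol, _count in _TOKEN.findall(formula)}


def find_members(group_name, candidates):
    """Return the names of candidates whose formula contains the group's element."""
    element = GROUP_DEFINITIONS[group_name]
    return [
        name
        for name, formula in candidates.items()
        if element in parse_elements(formula)
    ]


# A small illustrative candidate list (name -> molecular formula).
candidates = {
    "tetraethyllead": "C8H20Pb",
    "lead(II) oxide": "PbO",
    "phosphine": "PH3",
    "ethanol": "C2H6O",
}

lead_members = find_members("Lead compounds", candidates)
print(lead_members)
```

Splitting the formula into element symbols, rather than searching the raw string, keeps a search for one element from matching another whose symbol starts with the same letter (a search for "P" would otherwise match "PbO").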


We are using the Data Commons as the registry for these definitions and algorithms and to facilitate collaboration. We are developing structure based algorithms and searching PubChem and other structural databases for group members that are then added to the compound group lists in Pharos. These Pharos compound group lists define how the Pharos list screening process generates hazard listings for Pharos, the Data Commons and the other tools that use data by API from Pharos, including Portico, the HPDC’s HPD Builder, and BlueGreen’s ChemHat. 

To date, the Project has used PubChem searches to add compound group associations to existing Pharos substances as well as to find new substances with matching structures that were not previously listed in Pharos.
We are exploring additional chemical databases that may provide additional structural search options.

We generally limit database searches to substances that have a CASRN. Our assumption is that excluding substances without a CAS registration keeps out the large number of entries, at least in PubChem, that are experimental or pharmaceutical only. Such substances are unlikely to be used outside the pharmaceutical industry and would burden the database without improving its function.
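
As an illustration of this kind of scope filter, the sketch below keeps only search results that carry a well-formed CASRN. It relies on the standard CAS check-digit rule (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10). Whether the Project validates check digits is not stated here, so treat that step as an assumption; the record layout and the names `is_valid_casrn` and `with_casrn` are hypothetical.

```python
import re

# CASRN layout: 2 to 7 digits, 2 digits, and 1 check digit, separated by hyphens.
_CASRN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn):
    """Return True if the string is a well-formed CASRN with a correct check digit."""
    match = _CASRN.match(casrn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))


def with_casrn(records):
    """Keep only records that have a valid CASRN (assumed to be under the 'casrn' key)."""
    return [r for r in records if r.get("casrn") and is_valid_casrn(r["casrn"])]


# Illustrative search results; the last two are dropped by the filter.
records = [
    {"name": "lead", "casrn": "7439-92-1"},
    {"name": "tetraethyllead", "casrn": "78-00-2"},
    {"name": "unregistered candidate", "casrn": None},
    {"name": "mistyped entry", "casrn": "7439-92-2"},
]

kept = with_casrn(records)
print([r["name"] for r in kept])
```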

Effect of the Project on Hazards

Whenever we populate a new compound group, the warnings that scientific bodies have associated with that group are associated with every substance in the group. This may change a substance's hazard level for an endpoint or even its GreenScreen List Translator score. As part of the Interim Harmonization Project described below, we do not implement these changes on a rolling basis as we complete the research, but instead roll them out on a coordinated basis with CPA.
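
The inheritance described here can be sketched as a simple lookup. This is an illustration under stated assumptions, not the Pharos implementation: the group names, substance names, and hazard labels are placeholders rather than real listings, and `inherited_hazards` is a hypothetical helper.

```python
# Placeholder data: the hazards listed for each group, and each group's populated members.
GROUP_HAZARDS = {
    "Example group A": {"hazard X", "hazard Y"},
    "Example group B": {"hazard Y", "hazard Z"},
}
GROUP_MEMBERS = {
    "Example group A": {"substance 1", "substance 2"},
    "Example group B": {"substance 2", "substance 3"},
}


def inherited_hazards(group_hazards, group_members):
    """Map each substance to the union of hazards from every group it belongs to."""
    result = {}
    for group, members in group_members.items():
        for substance in members:
            result.setdefault(substance, set()).update(
                group_hazards.get(group, set())
            )
    return result


hazards_by_substance = inherited_hazards(GROUP_HAZARDS, GROUP_MEMBERS)
print(sorted(hazards_by_substance["substance 2"]))
```

A substance that belongs to more than one group picks up the hazards of all of them, which is why populating a new group can change the screening result for a substance that was already listed.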

Current Status and Access

Through the Project we have made progress on the first five steps. We have established group definitions and added tens of thousands of members to groups. The full list of Compound Groups is available, including descriptions of how each was populated, the number of members, and the number of hazards. Additional detail on each group, including a list of members, is available in individual compound group profiles in Pharos and on the Data Commons. To view a group in the Data Commons, copy the group name from the link above and search for it in the Data Commons. If you have a research or program need for the entire list of CASRN members for all groups, contact

Terms of Use:  The definitions and substance lists developed under the Project are subject to the Terms of Use of the Data Commons. Any entity which intends to use these definitions or substance lists for public or commercial use must do two things:

  • Notify HBN’s Data Commons Project of the intended use by email notification at  

  • Clearly identify on the website or other media using the data that it was sourced from the Data Commons Compound Group Population Project.

  • Required language: “The compound group definitions and data used here are provided by the Data Commons Compound Group Population Project run by the Healthy Building Network’s Chemical Hazard Data Commons.”

Provide a link as listed above to the Project home page. (

Parties are encouraged to participate in the Project to define and populate more groups and collaborate on development of an open standard for groups and a public registry of substance members.

Next steps

Interim Harmonization Project: These groups are an integral part of hazard screening for the GreenScreen List Translator (GSLT) and the Health Product Declaration (HPD). Pharos/Data Commons is one of two automation systems that provide hazard screening services for these two programs. Differences in how the automation systems populate these groups are one of several issues that can result in significant differences in their hazard screening outcomes. Clean Production Action (CPA) is facilitating a harmonization process to reconcile differences between the two systems and generate a single list of groups and members. HBN is participating in that harmonization process and contributing its work to date on this Compound Group Population Project. Use of this contribution is subject to the Terms of Use listed above.

CPA has committed to the vision of a transparent, scientific, peer-reviewed methodology for populating members of compound groups and supports the long-term project to fill the gaps described here. The official policy for using the results of this project for GreenScreen List Translation is published in the GreenScreen for Safer Chemicals Compound Group Policy on the GreenScreen Guidance and Resources page.

You can help

We need chemists who like puzzles to help us design the necessary chemical searches. If you think you can help, please review the documents below and post questions and ideas in the Data Commons Compound Group Population Discussion or contact the coordinators of this project (listed below) directly.

Compound Group - Cheminformatics project: Using cheminformatics tools to perform complex queries on large chemical datasets. This is a currently active part of the project.

Compound Group - PubChem API project: Refining compound group profiles to match the search capabilities of the PubChem API. We tried this approach previously but have dropped it for now in favor of cheminformatics tools.

Join in the discussion: This project is discussed in the Data Commons Compound Group Population Discussion.

Project coordinators

Michel Dedeo

Akos Kokai