Optimal Fitting

There is a steady stream of academic papers saying "The binary fitting of HSP data to a sphere isn't perfect. Here we show a better method." If, as the authors claim, their method is better, why don't we implement it in HSPiP? Well, let's check what methods are in HSPiP.

HSPiP's fitting techniques

The Classic Binary Fit. Here we take your set of 0 (bad) and 1 (good) values and find a best fit using the Hansen exponential penalty function for "wrong in" (bad solvents inside the sphere) and "wrong out" (good solvents outside it).
The Classic 1-6 scoring. With 1 = Very Good to 6 = Very Bad you can create a binary fit using just the 1's, the 1's and 2's and even the 1's, 2's & 3's. Sometimes the HSP value stays the same but the radius gets larger, sometimes the fit changes a lot. There is also an option to do a graded fit, prioritising the 1's but not ignoring, say, the 3's.
Genetic Algorithm. The algorithm from YAMAMOTO has a different set of penalties and priorities. Sometimes the fit looks better, sometimes worse, often it's essentially the same. It can work with both 0-1 and 1-6 scoring.
Minimum Sphere. One often finds proposals for, say, Voronoi fits to the data but you then lose the benefits of the Distance calculations. HSPiP allows a Minimum Sphere fit (the smallest sphere that includes all the good solvents) and in our experience the results are comparable to more complex geometries, while preserving the Distance metric.
We don't do ellipses. In the early versions of HSPiP we allowed elliptical fits. But they confused everyone, added no obvious value and, again, you lose the Distance metric, so they were removed.
Data fits. If you happen to have data such as solubility or relative sedimentation times, then we have four ways to fit them.
1. YAMAMOTO genetic algorithm.
2. Exponential data fit with graph.
3. Manual choice of binary split.
4. Optimal Binary (automatic calculation of best binary split).
Double Sphere. If you have a molecule (e.g. a surfactant) or a polymer (e.g. a diblock) made from different components (e.g. a hydrophobic region and a hydrophilic one) then the data can be best fitted via a double sphere.
Optional Extras. There are attempts to adjust for molar volume effects (larger molecules make worse solvents) and donor/acceptor effects.
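
As a rough illustration of the Classic Binary Fit described in the list above, the sketch below scores a trial sphere against 0/1 data. It assumes the standard HSP distance (with the factor of 4 on the δD term) and a penalty of exp(-error distance) for each wrong-in or wrong-out solvent, combined as a geometric mean. The function names and the example numbers are invented for illustration. This is not HSPiP's actual code, and it only evaluates a candidate sphere rather than searching for the best one.

```python
import math


def hsp_distance(a, b):
    """Standard HSP distance Ra between two (dD, dP, dH) triples."""
    return math.sqrt(
        4 * (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2
    )


def fit_quality(center, radius, solvents):
    """Geometric mean of per-solvent penalties; 1.0 means no solvent is misplaced.

    solvents: list of ((dD, dP, dH), score) pairs, score 1 = good, 0 = bad.
    Assumed penalty: exp(-error), where error is how far a misplaced
    solvent sits on the wrong side of the sphere boundary.
    """
    if not solvents:
        raise ValueError("need at least one scored solvent")
    log_sum = 0.0
    for hsp, score in solvents:
        ra = hsp_distance(center, hsp)
        if score == 1 and ra > radius:    # wrong out: good solvent outside
            log_sum += radius - ra
        elif score == 0 and ra < radius:  # wrong in: bad solvent inside
            log_sum += ra - radius
    return math.exp(log_sum / len(solvents))


# Invented example data, not real measurements.
trial_center = (18.0, 10.0, 8.0)
trial_radius = 5.0
scored = [
    ((17.5, 9.0, 7.0), 1),   # good, inside the trial sphere
    ((18.0, 16.0, 8.0), 1),  # good, but outside: penalised
    ((15.0, 2.0, 2.0), 0),   # bad, outside
]
print(fit_quality(trial_center, trial_radius, scored))
```

A fitting routine would vary the centre and radius to push this score towards 1.0. How ties are broken (for example, preferring a smaller radius) is a design choice that the sketch leaves open.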

We aren't closed to new ideas: we enthusiastically adopted the Optimal Binary methodology that was developed for the nanoparticle community. One method that we didn't incorporate, because we could not find a way to handle its sophisticated methodology, is fully acknowledged on the Small Solutes page.

HSPiP's fitting philosophy

What is missing from the stream of alternative approaches is a way to think about what you are doing. HSPiP has been designed to make it easy for you to interact with the data to extract the most meaning. Here are some of the ways we encourage users to interact with their fits.

Explore multiple fit options. We like to think of HSPiP as a collaboration tool - something you interact with to make sense of a complex solubility issue. So try some of the alternatives above, and some of the suggestions below.
Sphere Radius Check. Sometimes a couple of extra solvents will refine a sphere. Testing with another solvent deep inside or far outside the sphere adds no value. So HSPiP does a Sphere Radius Check to suggest solvents near the edge of the current sphere which, whether good or bad during testing, would refine the sphere's position and radius.
Grid checks. If you have a convenient good solvent inside the sphere and a bad one outside, you can easily generate a Grid combination of the two solvents, going from 90:10 in steps to 10:90 to allow you to refine the radius by finding the point where the mix goes from good to bad.
Good/Bad/Neutral. If one solvent is obviously causing a problem for the fit, you can swap its value (change a 1 to 0, or a 0 to 1) or just put a - symbol so it is ignored (without having to remove it from the dataset). Refit with the swapped and the ignored versions and see what happens. We are not saying "alter the data to get the result you want". Instead we're saying "This solvent is a problem, let's play with the data to see what's going on so we can then choose to:"
- Keep the original and accept a poor-quality fit
- Recheck the data - maybe there's been an error with that sample. Classic errors for giving false "good" solvents are (a) sample stuck onto the top of the tube and (b) a close refractive index match, each giving the false impression that it has fully dissolved. False "bad" can come from too little shaking, too little time, errors in temperatures or even use of the wrong solvent.
- Realise that there is some specific interaction with that solvent which explains the rogue result.
Try to use your fitted values! Real life often doesn't give you the time and resources to get lots of precision solubility data to ensure you have an optimal fit. Sometimes you have to create a sphere from relatively poor-quality data. Now try to use the results. If they allow you to create a formulation that works, that's great. If the suggested formulation doesn't work, that's disappointing, but it's also an extra datapoint. With that new score of 0 you can refit and get a better estimate.
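
The Sphere Radius Check and Grid ideas above can be sketched in a few lines. This is a hedged illustration, not HSPiP's implementation. It assumes the standard HSP distance, uses RED (distance divided by radius) to rank how close an untested solvent sits to the sphere edge, and assumes that the HSP of a blend is the volume-weighted average of the two solvents. The function names, the tolerance of 0.2 and the example values are all made up for the sketch.

```python
import math


def hsp_distance(a, b):
    """Standard HSP distance Ra between two (dD, dP, dH) triples."""
    return math.sqrt(
        4 * (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2
    )


def red(center, radius, hsp):
    """Relative distance: below 1 is inside the sphere, above 1 is outside."""
    return hsp_distance(center, hsp) / radius


def edge_candidates(center, radius, candidates, tolerance=0.2):
    """Names of untested solvents with RED within `tolerance` of 1.0.

    candidates: dict of name -> (dD, dP, dH). Closest to the edge first.
    """
    gaps = [
        (abs(red(center, radius, hsp) - 1.0), name)
        for name, hsp in candidates.items()
    ]
    return [name for gap, name in sorted(gaps) if gap <= tolerance]


def blend_grid(good, bad):
    """(percent good, blend HSP) for good:bad ratios from 90:10 to 10:90.

    Assumes the ratios are by volume and that blend HSP values are
    volume-weighted averages.
    """
    grid = []
    for pct_good in range(90, 0, -10):  # 90, 80, ..., 10
        f = pct_good / 100
        hsp = tuple(f * g + (1 - f) * b for g, b in zip(good, bad))
        grid.append((pct_good, hsp))
    return grid


# Invented sphere and solvents, for illustration only.
center, radius = (18.0, 10.0, 8.0), 5.0
untested = {
    "deep inside": (18.0, 10.0, 8.0),
    "on the edge": (18.0, 10.0, 13.0),
    "just outside": (18.0, 15.5, 8.0),
    "far away": (18.0, 30.0, 8.0),
}
print(edge_candidates(center, radius, untested))

# Walk the grid from mostly-good to mostly-bad and show each blend's RED.
for pct_good, hsp in blend_grid((18.0, 9.0, 7.0), (15.0, 2.0, 2.0)):
    print(pct_good, round(red(center, radius, hsp), 2))
```

Testing the blends either side of the point where RED crosses 1 is what pins down the radius. Solvents deep inside or far outside are filtered out because their result would be the same whichever way the sphere shifted slightly.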

The take home message

Solubility is complicated and HSP is, by choice, relatively simple. Measuring HSP is always a compromise between time/resources and precision. Most of us, most of the time, choose to get "good enough" HSP values that allow us to focus our resources on complicated formulation issues. So the tools in HSPiP and our philosophy of interacting with the data allow users to get good enough answers relatively quickly.

In an ideal world, all the major corporations would measure and provide the HSP of the chemicals they are selling to end users. They have the robotic resources and data analytics that would result in high-quality values. We look forward to the day when this is routine. And if these corporations find that there is a better method for fitting the data, we'd be happy to add it to HSPiP!