Posts

Showing posts from 2008

Why you can't make custom progress monitors in eclipse

I wanted to make a custom progress monitor with a temperature graph in Eclipse, but it does not seem to be possible. Here's why:
There is a ProgressProvider class that has an abstract createMonitor(Job job) method that is called in the JobManager when creating jobs. The standard concrete implementing class is the ProgressManager, shown above in a dashed box labelled "Discouraged Access". So you can't just override this class.
The JobMonitor is an inner class of ProgressManager, and it is not possible to provide an alternative. Once it has been created and passed to the Job, it has already been displayed as a Dialog, so it can't be wrapped with another Dialog.
Seems impossible, really.

Intermediate vs direct rendering

Just a small, informal diagram of how the new rendering system works, to balance out all the speculative UML sketches I made...
I have not shown the elements inside the element groups, but they are things like line elements, oval elements, and so on.

Creating new molecule editing programs with cdk

After all the work that has gone into making a flexible core for JChemPaint (the controller and renderer modules), I finally have the opportunity to use it for what I wanted: a small custom program for editing molecules from scratch and displaying the spectrum as you go.
This may not look like much, but compare to this next screenshot:
Obviously this is a fairly unlikely molecule, but (if you believe me :), the new ring has caused the program to re-predict a spectrum (using NMRShiftDb). The point of this was to teach me more about the connection between molecules and spectra.

CDK-City

On the mean streets of CDK-city, one massive tower block dominates the skyline in the renderer district - the Renderer2DModel tower...

Masak pointed me to this site by Richard Wettel, which is about a static-analysis tool for making so-called 'code cities'.
Basically, the parking lots (like SMARTSParserConstants) have lots of attributes (in fact, this class only has attributes) while skyscrapers (like in the debug package) have many more methods than attributes (most of the debug classes have no attributes except inherited ones).
There's an eclipse plugin from here called MooseBrewer, and I think there are other tools there I haven't explored. It's great to have a map to this busy code metropolis, as I wondered what my workplace looked like :)

chalky

I wanted to show something that hints at the things that the new architecture can afford us:
This is using a Java2D graphics Paint object to make it look like chalk... kind of. It's a very simplistic way of doing it: make a small image with a random number of white, gray, light gray, and black pixels, and use that as the paint.
edit: it doesn't look so good at small scales
some tweaking of stroke widths and so on is essential.
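For the curious, the chalk trick above can be sketched roughly as below. This is only a minimal guess at the approach, assuming a java.awt.TexturePaint built from a small random tile; the class name ChalkPaint, the tile size and the seed are made up for illustration and are not from the actual renderer code.

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.TexturePaint;
import java.awt.image.BufferedImage;
import java.util.Random;

// Hypothetical helper: builds a noisy "chalk" paint from a small random tile.
class ChalkPaint {
    // The four pixel colours mentioned in the post.
    static final Color[] PALETTE = {
        Color.WHITE, Color.GRAY, Color.LIGHT_GRAY, Color.BLACK
    };

    // Build a small square tile whose pixels are picked at random from the palette.
    static BufferedImage makeTile(int size, long seed) {
        BufferedImage tile = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Random random = new Random(seed);
        for (int x = 0; x < size; x++) {
            for (int y = 0; y < size; y++) {
                tile.setRGB(x, y, PALETTE[random.nextInt(PALETTE.length)].getRGB());
            }
        }
        return tile;
    }

    // Wrap the tile in a TexturePaint so it repeats across any stroked or filled shape.
    static TexturePaint makePaint(int size, long seed) {
        return new TexturePaint(makeTile(size, seed), new Rectangle(0, 0, size, size));
    }

    public static void main(String[] args) {
        BufferedImage canvas = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = canvas.createGraphics();
        g.setPaint(makePaint(8, 42L));
        g.setStroke(new BasicStroke(6f)); // a wide stroke, so the texture is visible
        g.drawLine(20, 50, 180, 50);
        g.dispose();
    }
}
```

Because the tile is drawn at a fixed pixel size, the texture does not scale with the diagram, which may be part of why it looks poor at small scales.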

Seeing Double?

Comes from too much screen-time:
although refactoring often seems like running on the spot (you get nowhere fast), things happen behind the scenes..
A short code snippet.

More Annealing

Some more detail here:
Here is a larger molecule, cembrane (pubchem link) which has 20 carbons. The run is longer, too, with 2,000 steps. It finds a spectrum with 100% match within 300 steps, in this case.
An even larger example is lanosterol (pubchem link) which has 30 carbons. This actually seems to be too large for C-NMR prediction using NMRShiftDB. It doesn't get the answer within 2,000 steps, and does not look like it is on course to do so:
The highlighted step (1928) is shown as a molecule in the box marked "final", but the score graph has levelled off by about the 400th step.

Adaptive Annealing Engine Test Screenshot

So, finally, some work that I am meant to be doing (click for bigger, as usual):

It's crude, but it is starting to work. The screenshot shows only the first 100 steps of a run, but clearly you can get the right answer (almost by chance, actually) for such a small molecule when you are essentially just permuting the atoms.
Now to test more fully...

Shadows

Not the most important use of rendering code..

:)

Symbols and Elements

I have made attempts over the years to draw triangles and circles like, for example, this:
(in fact, this was drawn by David Westhead's code). There is a small design decision to be made between representing drawing elements as basic geometry or as symbols. A symbol may be the same as a geometric element (like a circle), but it might be a combination of them. Consider the 'N' and 'C' in boxes, above. Or this image:
where the mass number and charge are separate text elements on the left, and part of the whole symbol on the right. Neither is a 'better' way of doing things; they each have their advantages - and disadvantages.
It can be a lot clearer to use symbols, as each model object (atom, bond, helix, gene, cell) has a corresponding representation, and then a diagram composes the symbols into a manageable whole. On the other hand, you can re-use elements in different combinations for diagrams. A general vector drawing package would have Line, Text, Rectangle,…

The Trouble with Tribble Visitors

So, I'm now partly sold on the power of a Visitor approach to rendering. Consider this snippet:
if (clicked) diagram.accept(new DropShadowVisitor(g, 5, -5));
else         diagram.accept(new DrawVisitor(g));
What this is doing is drawing a diagram normally unless the mouse is clicked, when it draws with a drop shadow. I see that the beauty of this is the ability to manipulate functionality as a block (just as in languages where you can pass around functions...).
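To make the idea concrete, here is a stripped-down sketch of how such visitors might be wired up. Only the names DrawVisitor and DropShadowVisitor come from the snippet above; ElementVisitor, Element, LineElement, Diagram and VisitorDemo are stand-ins invented for this example, and it uses lines rather than text so that it does not depend on fonts.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor interface with one method per element type.
interface ElementVisitor {
    void visitLine(LineElement line);
}

interface Element {
    void accept(ElementVisitor visitor);
}

class LineElement implements Element {
    final int x1, y1, x2, y2;
    LineElement(int x1, int y1, int x2, int y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }
    public void accept(ElementVisitor visitor) { visitor.visitLine(this); }
}

// The diagram just forwards the visitor to each child in turn.
class Diagram implements Element {
    final List<Element> children = new ArrayList<>();
    public void accept(ElementVisitor visitor) {
        for (Element child : children) child.accept(visitor);
    }
}

class DrawVisitor implements ElementVisitor {
    private final Graphics2D g;
    DrawVisitor(Graphics2D g) { this.g = g; }
    public void visitLine(LineElement line) {
        g.setColor(Color.BLACK);
        g.drawLine(line.x1, line.y1, line.x2, line.y2);
    }
}

// Per-element shadow: a gray copy offset by (dx, dy), then the element itself.
class DropShadowVisitor implements ElementVisitor {
    private final Graphics2D g;
    private final int dx, dy;
    DropShadowVisitor(Graphics2D g, int dx, int dy) { this.g = g; this.dx = dx; this.dy = dy; }
    public void visitLine(LineElement line) {
        g.setColor(Color.LIGHT_GRAY);
        g.drawLine(line.x1 + dx, line.y1 + dy, line.x2 + dx, line.y2 + dy);
        g.setColor(Color.BLACK);
        g.drawLine(line.x1, line.y1, line.x2, line.y2);
    }
}

class VisitorDemo {
    // Draw the same diagram with whichever visitor the 'clicked' flag selects.
    static BufferedImage render(boolean clicked) {
        BufferedImage image = new BufferedImage(80, 40, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 80, 40);
        Diagram diagram = new Diagram();
        diagram.children.add(new LineElement(10, 10, 50, 10));
        if (clicked) diagram.accept(new DropShadowVisitor(g, 5, -5));
        else         diagram.accept(new DrawVisitor(g));
        g.dispose();
        return image;
    }
}
```

The diagram itself never changes; all the behaviour that differs lives in whichever visitor gets passed in.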
However, I should point out that the approach has its tricky rapids as well as such smooth sailing. The image below is a spot-the-difference (click for bigger):

On the left is the version drawn by a naive first try at the drop visitor. Its methods look like this, the visitText(Text text) method:
g.setColor(Color.LIGHT_GRAY);
g.drawString(text.text, text.x, text.y);
g.setColor(Color.BLACK);
g.drawString(text.text, text.x + dx, text.y + dy);
The problem with this code is subtle - the elements are visited…

Arvid's Renderer Design

This is a sketch of my understanding of Arvid's renderer design:

I say 'my understanding' with good reason - a diagram can be biased, even a formal(ish) one like this, so I don't claim that this is definitive!
It is interesting, anyway, as it seems to be a combination of a Composite and a Visitor pattern. The RenderingModel implements IRenderingElement (and Iterable of IRenderingElements), which is Composite.
The Modules are Elements in the Visitor pattern, and the Elements are Visitors.
edit: What nonsense I talk! The IRenderingElements are Elements and the  IRenderingModules are Visitors. That's better.

Font Management and GlyphVectors

Fonts and text are always a pain when drawing vector graphics!
After hitting myself over the head over a stupid bug in my zoom tests (like this:)
JButton inButton = new JButton("IN");
inButton.setActionCommand("IN");
JButton outButton = new JButton("OUT");
inButton.setActionCommand("OUT"); // oops: inButton again, not outButton
I got center scaling implemented in my branch, which lets me test a different approach to managing fonts. I had tried to be clever and do something like:
// fontRenderContext can be obtained from the Graphics2D, e.g. g.getFontRenderContext()
GlyphVector glyphs = font.createGlyphVector(fontRenderContext, "N");
ArrayList<Shape> shapes = new ArrayList<Shape>();
for (int i = 0; i < glyphs.getNumGlyphs(); i++) {
    shapes.add(affineTransform.createTransformedShape(glyphs.getGlyphOutline(i)));
}
but letter shapes fill horribly with lots of missing pixels. Anyway, I eventually settled on just storing the Glyphs themselves. Less pure, but it works, and it allows the size of TextSymbols to be computed and used by the class in between drawing.
So, for fonts there had to be a way in my architecture to…

A correction, and a simpler example

There was of course a deliberate mistake in the previous post:
as the highlighting shows, the matrix for cubane numbered PVR style is neatly divided. A simpler example is cyclobutane:
which also shows the important point that, even if the sum of the row-numbers is the same as for other possible numberings, the PVR numbering is still the only ordered one.
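The cyclobutane point can be checked with a few lines of code. In the sketch below the two adjacency matrices are my own reconstruction of two numberings of a four-membered ring (they are not copied from the paper or the figures), "ordered" is taken to mean non-decreasing, and the class name RowNumbers is made up.

```java
import java.util.Arrays;

// Reads each row of a 0/1 adjacency matrix as a binary number.
class RowNumbers {
    // Numbering where atoms 1 and 2 are both bonded to atoms 3 and 4 (ring 1-3-2-4).
    static final int[][] ORDERED = {
        {0, 0, 1, 1},
        {0, 0, 1, 1},
        {1, 1, 0, 0},
        {1, 1, 0, 0}
    };

    // Numbering that simply walks around the ring (1-2-3-4).
    static final int[][] AROUND_RING = {
        {0, 1, 0, 1},
        {1, 0, 1, 0},
        {0, 1, 0, 1},
        {1, 0, 1, 0}
    };

    // Leftmost column is the most significant bit.
    static int[] rowValues(int[][] adjacency) {
        int[] values = new int[adjacency.length];
        for (int i = 0; i < adjacency.length; i++) {
            int value = 0;
            for (int bit : adjacency[i]) value = (value << 1) | bit;
            values[i] = value;
        }
        return values;
    }

    // True if the row values never decrease from top to bottom.
    static boolean isOrdered(int[] values) {
        for (int i = 1; i < values.length; i++) {
            if (values[i] < values[i - 1]) return false;
        }
        return true;
    }

    static int sum(int[] values) {
        return Arrays.stream(values).sum();
    }

    public static void main(String[] args) {
        for (int[][] matrix : new int[][][] {ORDERED, AROUND_RING}) {
            int[] rows = rowValues(matrix);
            System.out.println(Arrays.toString(rows)
                + " sum=" + sum(rows) + " ordered=" + isOrdered(rows));
        }
    }
}
```

Both numberings give rows that sum to 30 (3, 3, 12, 12 versus 5, 10, 5, 10), but only the first set is in order.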

On Canonical Numberings

So, after reading* this (2005) paper : "On Canonical Numbering of Carbon Atoms in Fullerenes : C60 Buckminsterfullerene" (link) I made some pictures to illustrate the difference between it and the numbering scheme used for SMILES (as described here). Er, which is used in the CDK.
Anyway, the point is that the scheme used by Plavšić, Vukičević, and Randić (or PVR as I will refer to them, I hope they don't mind!) numbers the atoms in a way that produces an adjacency matrix with a particular property. If you consider the rows of the matrix to be binary numbers, then the set of numbers is the smallest possible. So, for example:

The structure on the left is cubane, with its adjacency matrix on the right. The column on the far right shows the rows of the matrix in base 10. They are clearly in order. Now what happens for the SMILES? Well:
Here, the rows are neither in order (I'm not sure from their paper whether the ordering is an expected outcome for all structures, nor have …

KDTree in SymbolTree

So, the new Symbol tree base class now has a working implementation of a KDTree. (Or as close as the test can tell). What this means is that, should you ever need to edit (not just display) a very large 2D structure, you would do something like:
SymbolTree tree = createTree(); // get a tree somehow
tree.setHitDistance(minDistance); // the minimum distance from mouse to symbol
tree.useKDTree(true); // essential
tree.highlightClosestSymbol(point); // will now do a fast search for the closest
Of course, this is really only useful for a) large trees and b) when doing many "closest symbol" operations. For example, highlighting when moving the mouse.

The best situation would be to dynamically call the "useKDTree" method when the size was above some threshold (100 atoms?). The interface is the same, either way.
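For anyone wondering what the fast search amounts to, here is a minimal, self-contained 2D k-d tree with a nearest-point query and a hit distance cut-off. It is a sketch of the general technique only, not the SymbolTree code: the class KDTree2D and its methods are hypothetical, and it stores plain points rather than symbols.

```java
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone 2D k-d tree over points (not the SymbolTree implementation).
class KDTree2D {
    private static final class Node {
        final Point2D point;
        final Node left, right;
        Node(Point2D point, Node left, Node right) {
            this.point = point; this.left = left; this.right = right;
        }
    }

    private final Node root;

    KDTree2D(List<Point2D> points) {
        root = build(new ArrayList<>(points), 0);
    }

    // Split on x at even depths and on y at odd depths.
    private static double coord(Point2D p, int depth) {
        return depth % 2 == 0 ? p.getX() : p.getY();
    }

    // Sort on the current axis and recurse on either side of the median.
    private static Node build(List<Point2D> points, int depth) {
        if (points.isEmpty()) return null;
        points.sort(Comparator.comparingDouble(p -> coord(p, depth)));
        int median = points.size() / 2;
        return new Node(points.get(median),
            build(new ArrayList<>(points.subList(0, median)), depth + 1),
            build(new ArrayList<>(points.subList(median + 1, points.size())), depth + 1));
    }

    // Closest stored point to the target, or null if nothing is within hitDistance.
    Point2D closest(Point2D target, double hitDistance) {
        Point2D best = search(root, target, 0, null);
        return (best != null && best.distance(target) <= hitDistance) ? best : null;
    }

    private static Point2D search(Node node, Point2D target, int depth, Point2D best) {
        if (node == null) return best;
        if (best == null || node.point.distanceSq(target) < best.distanceSq(target)) {
            best = node.point;
        }
        double delta = coord(target, depth) - coord(node.point, depth);
        Node near = delta < 0 ? node.left : node.right;
        Node far = delta < 0 ? node.right : node.left;
        best = search(near, target, depth + 1, best);
        // Only cross the splitting line if the far side could hold something closer.
        if (delta * delta < best.distanceSq(target)) {
            best = search(far, target, depth + 1, best);
        }
        return best;
    }
}
```

Building the tree costs a sort at each level, which is why it only pays off for large structures and repeated queries such as mouse-move highlighting.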

Steps to move Bioclipse to the new CDK plugins

Thanks to egon the monolithic cdk plugin is now about 40 small plugins. Here's how to migrate...

In no particular order:

Step 1: Delete plugin org.openscience.cdk from the workspace
Step 2: Download the third-party library plugins from bioclipse2/trunk/cdk-externals/
Step 3: Download the (many) new cdk plugins from bioclipse2/trunk/cdk-externals/trunk/
Step 4: Update those plugins that complain about missing org.openscience.cdk dependency
(Step 5: Possibly change the Require-bundle for org.apache.log4j : as in this bug report)

It should now all compile...

The cdk renderer forest

CASE Diagram

Just a sketch:
Of course, if the structure source is a database, then it would make more sense to generate the spectra beforehand. So it would then be a spectrum source, I suppose.

Ultimate Generic Diagram

Sorry, I couldn't resist, but when reading this paper (see also : this) I came across this paragraph :
The compilation of metabolic data sets comprises three aspects. Each analytical data aquisition produces numerical values as estimates of metabolite concentration [...] Second, all such studies involve biological material [...] and a particular study context and this produces a wide variety of supplementary data (metadata). Finally, many studies process the data further using a variety of algorithms and then subject the data to statistical or bioinformatic analysis..
and made this (fairly foolish) diagram for myself. I then realised that the diagram covers almost anything, and could be safely incorporated into almost any talk :)

Of course, the symbols (ab)used are very vague - what should be a package is here some kind of 'context'. Who knows, I suppose there is correct UML for this, but anyway.

Oh, and I'm not mocking the original authors - I just realised that what I d…

What's on my whiteboard?

Current situation with Seneca and Medea.

This is a better picture of what the dependencies were like in bioclipse:

The dashed lines indicate some kind of vague boundary between plugin 'layers'. There's no such concept of a plugin layer, as far as I know, but still. Of course, there is the net.bioclipse.core plugin, which could be considered to be an inner core.
It may be an artificial distinction to make between the 'data' and 'analysis' layers, but basically the data plugins do the open-edit-save cycle mentioned here while the analysis plugins do more. Specifically, the seneca and medea plugins can produce spectra data from molecule data.

Classical studies : Seneca and Medea

More pictures, mainly for my own benefit. First, a possible refactoring of some bioclipse modules:


Where boxes are modules, and lines indicate dependencies. There's a lot of work in that arrow! Only the compute plugin and the seneca plugin have been ported (however incompletely) so far.

The second image shows the extension points for the putative CASE plugin. One is the judge extension point that already exists in the seneca plugin, and the other is a structure source:
This is more general than just a structure generator extension point, and could also include sources from files or databases. I think that this makes sense...

Interface relays and controller modules

Sorry, but here are some more opaque UML diagrams from the cdk.controller package. First, the Controller2DModules:

Which is very simple, although the names are a bit clumsy. I would favour using a sub-package called "module" and then having "AddAtomModule", "ChangeFormalChargeModule", and so on. Doesn't really matter, as they will likely not be used very often.
The other image is more complex; it shows the class that implements IChemModelRelay and how it interacts with the two current Swing event handlers. The circles represent interfaces. The MouseListener and MouseMotionListener are, of course, the interfaces from the java.awt package that are normally used by java programs. Now to figure out how to actually use all this!

Current Controller Architecture

One way to understand a piece of software architecture is to make a diagram out of it:

Also, it's an excuse to make another diagram... Oh, and here's the version with packages:


Reflecting on all the options

I knocked up a small program to analyse the various options that the Renderer2DModel supports. It uses reflection to get the set methods and create controls, then calls the appropriate get method when the control changes.
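The reflection part is roughly the following. This is a cut-down sketch rather than the actual program: DemoModel is a made-up stand-in for Renderer2DModel with two invented options, and OptionScanner is a hypothetical name. It assumes every setFoo method has a matching getFoo.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Made-up stand-in for a model class with bean-style options.
class DemoModel {
    private double bondWidth = 2.0;
    private boolean showAtomNumbers = false;

    public double getBondWidth() { return bondWidth; }
    public void setBondWidth(double bondWidth) { this.bondWidth = bondWidth; }
    public boolean getShowAtomNumbers() { return showAtomNumbers; }
    public void setShowAtomNumbers(boolean show) { this.showAtomNumbers = show; }
}

class OptionScanner {
    // Map option name -> setter for every public one-argument setXxx method.
    static Map<String, Method> findSetters(Class<?> type) {
        Map<String, Method> setters = new TreeMap<>();
        for (Method method : type.getMethods()) {
            String name = method.getName();
            if (name.startsWith("set") && name.length() > 3 && method.getParameterCount() == 1) {
                setters.put(name.substring(3), method);
            }
        }
        return setters;
    }

    // Call the setter, then read the value back through the matching getter,
    // which is what a control would do when its value changes.
    static Object setAndGet(Object model, String option, Object value) {
        try {
            findSetters(model.getClass()).get(option).invoke(model, value);
            return model.getClass().getMethod("get" + option).invoke(model);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not set option " + option, e);
        }
    }

    public static void main(String[] args) {
        DemoModel model = new DemoModel();
        for (Map.Entry<String, Method> entry : findSetters(DemoModel.class).entrySet()) {
            // The parameter type decides which kind of control to create.
            System.out.println(entry.getKey() + " : " + entry.getValue().getParameterTypes()[0]);
        }
        System.out.println(setAndGet(model, "BondWidth", 3.5));
    }
}
```

A scan like this finds every option that has a setter, whether or not the renderer ever reads it.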

Another aspect of the Java2DRenderer it shows is that many of the options do nothing as they are not wired in. Finally, it is clear that there are certain options that probably shouldn't be there, such as the selection box, or are quite mysterious, such as the ClipboardContent...

Patched!

Thorsten's patch here:
https://sourceforge.net/tracker/index.php?func=detail&aid=2128802&group_id=20024&atid=320024

makes things much better. I don't yet know exactly why, but better:

(ignore the "in" and "out", they were an experiment and not relevant.)

Aha! Success, thanks to Rajarshi

So:
it sort of works...

The missing piece of the puzzle was this line:

GeometryTools.scaleMolecule(molecule, getPreferredSize(), 0.9);

where getPreferredSize() is just a container call, and the 0.9 is a scale factor. This is too magical for my tastes. You apparently don't have to scale benzene, but you do have to scale pyridine!

Java2DRenderer

Okay, so this code (only valid pastebin for a month...) produces some odd results. For c1ccncc1:

whereas for c1ccccc1:

So, something may be wrong with the text scaling. In fact, the code looks like it is applying a transform to the graphics object, then scaling the font...

Alternative JCP architecture

Soooo. I have been messing around with what I've optimistically been thinking of as an 'alternative' JChemPaint renderer rather than just a botched refactoring. I started with the code by Niels Out (blog) and messed around with it a bit. Here is a diagram of part of where it is at:

This is meant to be valid UML, but the message is the same : "Overly. Complex." The basic idea is that the parent SymbolTreeFactory handles the logic of what bonds to create, and the child factory creates a tree suitable for AWT.
There's a name problem here, as "AWT" is not right for Java2D; also I'm not sure why I decided on "SymbolTree" as it's not really a tree as such. It could be, using a Composition pattern perhaps, but that seems unnecessary.
Anyway, the point of this is to make separate AWT and SWT renderers, while sharing as much code as possible. This may not be the way to do it, but it is a way to do it.

nasty pubchem example

Hmmm. Seems there are dragons lurking in the PubChem data:
(oxygens not shown). It has an 801 character smiles string! It's called "Inulin" apparently, and it's a polysaccharide:

PubChem record

Good news everybody!

With a slight change of data, and a change to the bond order settings, now 2,737 out of 5,739 match exactly. 1,114 of the rest get the right one in the top 10. That still leaves almost 2,000 - but some of these are clearly poor experimental data.

Comparing molecules and spectra

I made a tiny minimal tool for me to look at the results of spectra matching. Re-inventing several wheels, no doubt, but it does use CDK trunk rendering... sort of.

Monster package diagram

Heh:

I should point out that a lot of the nested packages are "data" packages with no java code in. There are a number of empty packages in the source tree, which are not included.

CDK, JCP, SWT, AWT...

And other TLAs (three-letter acronyms)...

So, in trying to find out what the status of the JChemPaint code is, there seems to be a lot of stuff, but not all of it works. So, in order to understand the plan, and as a suggested architecture, have this diagram:

It's not proper UML, but the package symbols are right. What this sketch suggests is that if there is graphics-independent code in the cdk.renderer package, then this package would take a molecule (or a reaction set, or a chemobject, etc) and 'render' it into graphical objects.

These would then be passed (the "renderer objects" oval) to the package (n.b.cdk.jchempaint or o.o.jchempaint) that is actually using a Graphics or GC object to draw with. Otherwise, the renderer package is really doing nothing useful.

CDK dependency analysis again

Took another look at the CDK using classcycle, but this time I used the complete package that Taverna downloaded as part of the CDK plugin (which isn't working at the moment, for some reason).
The easiest way to look at the results, really, is to export them as a CSV file, and look at it with a spreadsheet. Interestingly, the results show that it is CML that has the most layers. If I am understanding the idea of layers in classcycle correctly, it means something like the longest dependency path that ends in that class.
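If that reading of 'layer' is right, it could be computed along these lines. This is only a sketch of my interpretation (layer = length of the longest chain of dependencies below a class), not how classcycle itself does it; it assumes there are no cycles, and the class name LayerIndex and the example class names are invented.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical layer calculation over an acyclic dependency graph.
class LayerIndex {
    // A class with no dependencies is layer 0; otherwise it sits one layer
    // above its highest dependency. The memo map avoids recomputing shared nodes.
    static int layer(String cls, Map<String, List<String>> deps, Map<String, Integer> memo) {
        Integer cached = memo.get(cls);
        if (cached != null) return cached;
        int layer = 0;
        for (String dependency : deps.getOrDefault(cls, Collections.<String>emptyList())) {
            layer = Math.max(layer, 1 + layer(dependency, deps, memo));
        }
        memo.put(cls, layer);
        return layer;
    }

    public static void main(String[] args) {
        // Invented example: Writer -> Tool -> Element, and Writer -> Element directly.
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("Writer", Arrays.asList("Tool", "Element"));
        deps.put("Tool", Arrays.asList("Element"));
        Map<String, Integer> memo = new HashMap<>();
        for (String cls : Arrays.asList("Element", "Tool", "Writer")) {
            System.out.println(cls + " is in layer " + layer(cls, deps, memo));
        }
    }
}
```

Under this definition a high layer index just means a long chain of things underneath, which fits with writer classes sitting near the top.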
The top hit is the DictionaryTool from org.xmlcml.cml.tools, with a layer index of 32. The first CDK class with a high layer index is CMLWriter (29) then RSSWriter but most of the top 50 are CML classes. I guess it's not a bad thing, it's just a thing. A measurement.
edit: heh. The eclipse plugin for classcycle made this image for CDK:

Bioclipse and Taverna

After a short time with Taverna, I'm starting to see how it compares and contrasts with Bioclipse. As is my wont, here is a diagram illustrating this: What is meant here is that mostly Bioclipse is used to open-edit-save documents while Taverna is mostly used to get-process-save data. There are plugins for Bioclipse that do more processing style stuff, while you can sort of edit single documents in a workflow. Still.
In the Taverna docs it talks about using Bioclipse as a results viewer for the output of a workflow. I suppose it would be nice to do the reverse and send documents to Taverna from Bioclipse. Perhaps that's already possible; I'm still learning :)

EDIT: ola pointed out to me this page in the bioclipse wiki, although apparently it never worked right.

Taverna-cdk

The Taverna project is very interesting, in my not so humble opinion, because of the potential that workflows have. A workflow is a complete description of an experiment that can now be shared through the myexperiment site.
The central point of a scientific experiment is that it should be repeatable, by the researcher and by others. Many bioinformatics journal papers describe experiments of a sort that will not be repeatable years down the line, by anyone.
A concrete example is this paper by CH Robert and PS Ho, which describes an analysis of water bridges in proteins. A crucial line in the methods section is this: "All programs were written with FORTRAN 77 on a Silicon Graphics Iris workstation or with MATHEMATICA...on a Macintosh IIci computer."
Which is really great if, 10 years later, you want to re-run their method on more than the 100 high-resolution structures that were available at the time. Do their programs still exist? Do I have access to an Iris machine (I used the…

Classfile family trees

Looking into dependency analysis tools, these are three that I've seen:
eUML2 - a very slick tool with a free version, and a commercial version.
JDepend - a free, open source tool for numerically analysing dependencies.
classcycle - another open source (& free) tool for reporting on cycles/dependencies.
I like the last one the best, as eUML2 adds @UML annotations to your code, and JDepend's stats are a bit opaque.

Especially nice is that classcycle's report tells you the 'Layer' of a class. This is the number of other classes that it builds on top of. So, in cdk, there are various classes in layer 14 (some of the undoredo Edit classes). StabilizationCharges is in layer 13. ChemObject is layer 15!

Perhaps some sort of diagram would be in order...

Current status : Working

I've been 'roundtripping' molecule-spectra correlation from the nmrshiftdb to a database of theoretical spectra generated for PubChem data. Something like this:

Where 'experimental' is the nmrshiftdb data, and 'theoretical' is the PubChem data.
Currently, my laptop is running 7,000 or so searches for the best spectrum match, and then calculating the Tanimoto similarity of the fingerprints of the search and hit molecules. Phew!
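The similarity step is the simple part. As a sketch, assuming the fingerprints are plain java.util.BitSet objects, the Tanimoto coefficient is the number of bits set in both divided by the number set in either. The class name and the choice to return 0 for two empty fingerprints are my own.

```java
import java.util.BitSet;

// Tanimoto coefficient for two bit fingerprints.
class Tanimoto {
    static double similarity(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);                      // bits set in both fingerprints
        BitSet either = (BitSet) a.clone();
        either.or(b);                     // bits set in at least one fingerprint
        int union = either.cardinality();
        // Convention chosen here: two empty fingerprints score 0.
        return union == 0 ? 0.0 : (double) both.cardinality() / union;
    }

    public static void main(String[] args) {
        BitSet search = new BitSet();
        search.set(1); search.set(2); search.set(3);
        BitSet hit = new BitSet();
        hit.set(2); hit.set(3); hit.set(4);
        System.out.println(similarity(search, hit)); // 2 shared bits out of 4 in total
    }
}
```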

A rolling stone

So there's a plugin in bioclipse called "moss" (or, I suppose MoSS).

It has an Atom class. It has a Bond class. It has graphs for chemicals, smiles parsers, ...

Just like the CDK, in fact. Perhaps I should check one of my projects (tailor) which also has Atom, Bond, Chain, PDBReader.

You can't have too many implementations of the same thing :)

On models

So, after chatting some stuff about the PDBReader in CDK versus the Biojava PDBReader on CDK-devel (devel archive) I realised that I don't actually know how these different frameworks model biopolymers.

This is my attempt to understand them:
it shows a partial hierarchy for each framework, without the detail of which are classes, and which are interfaces. Notably, biojava has an interface and an implementing class for each of the things shown.

One thing I should point out, is that "Strand" in CDK should really (really) be called "Chain" as in Biojava and Jmol (and probably every other known framework). A kind of spelling mistake, I think. I also don't understand what the PhosporousMonomer and PhosporousPolymer are in Jmol.

Turns out I don't remember SMILES format very well

Testing out the Java2DRenderer (in the org.openscience.cdk.renderer package), I tried to make this:


Using this SMILES:

c1(c2ccc(c8ccccc8)cc2)
c(c3ccc(c9ccccc9)cc3)
c(c4ccc(c%10ccccc%10)cc4)
c(c5ccc(c%11ccccc%11)cc5)
c(c6ccc(c%12ccccc%12)cc6)
c1(c7ccc(c%13ccccc%13)cc7)
except that I didn't have the % signs, so it came out as:
This is because of multiple ring closures, of course. I should have used:
C1CCC(CC1)
C2CCC(CC2)
C13C(C3CCC(CC3)C4CCCCC4)
C(C5CCC(CC5)C6CCCCC6)
C(C7CCC(CC7)C8CCCCC8)
C(C9CCC(CC9)C10CCCCC10)
C13(C11CCC(CC11)C12CCCCC12)
which is much clearer. Oh yes.
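The underlying rule is that a bare digit in SMILES is always a single-digit ring-closure label, and a two-digit label needs the % prefix. A tiny sketch of how a reader might pull out the labels shows the difference; it is not the CDK parser, the class name is invented, and it assumes well-formed input with no bracket atoms (where digits mean other things).

```java
import java.util.ArrayList;
import java.util.List;

// Extracts ring-closure labels from a simple SMILES string (no bracket atoms).
class RingClosureLabels {
    static List<Integer> labels(String smiles) {
        List<Integer> labels = new ArrayList<>();
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (c == '%' && i + 2 < smiles.length()) {
                // %nn : the next two digits form one label
                labels.add(Integer.parseInt(smiles.substring(i + 1, i + 3)));
                i += 2;
            } else if (Character.isDigit(c)) {
                // a bare digit is a label on its own
                labels.add(c - '0');
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(labels("c%10ccccc%10")); // one ring, label 10
        System.out.println(labels("c10ccccc10"));   // labels 1 and 0, opened and closed separately
    }
}
```

Without the % signs, "10" is read as closure 1 followed by closure 0, which joins up rings that were never meant to be connected.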

Fixed a bug in PDBReader

When trying to convert PDB files to CML files (to test the CMLFileDescriber...) the underlying jumbo classes complained about adding bonds that had already been added (cdk bug 204663).

Turns out, it was a problem with the PDBReader, which wasn't handling CONECT records properly. Oddly, CONECT are fully redundant, so atom 10 and atom 50 are both "CONECT 10 50" and "CONECT 50 10".

The fix I used was to store the bonds from CONECT in a list, checking each time to see if the bond was already there. It wasn't possible to use "bondA == bondB", as that checks references, and each bond is a new object.
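One way to do this kind of check without comparing object references is to reduce each bond to an order-independent key of its two atom serial numbers. The sketch below uses a set of such keys rather than the list described above, so it is a variation on the idea and not the actual patch; the class and method names are invented.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// De-duplicates bonds that CONECT records list once in each direction.
class ConectBonds {
    // "10 50" and "50 10" both map to the key "10-50".
    static String key(int serialA, int serialB) {
        return Math.min(serialA, serialB) + "-" + Math.max(serialA, serialB);
    }

    // Collect each bond once, keeping the order in which bonds were first seen.
    static Set<String> uniqueBonds(int[][] conectPairs) {
        Set<String> bonds = new LinkedHashSet<>();
        for (int[] pair : conectPairs) {
            bonds.add(key(pair[0], pair[1]));
        }
        return bonds;
    }

    public static void main(String[] args) {
        int[][] pairs = { {10, 50}, {50, 10}, {10, 11} };
        System.out.println(uniqueBonds(pairs)); // the first two pairs collapse into one bond
    }
}
```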

I should say, though, that maybe it's not worth worrying about problems with reading PDB files (an old format) in CDK (dedicated to chemical data). PDB files are horrible to read, anyway.

How to set IOSettings on cdk Readers

So I wanted to set readConnect and useRebondTool to false on the org.openscience.cdk.io.PDBReader.

Turns out, the way to do this is:
PDBReader reader = new PDBReader(new FileReader(input));
Properties properties = new Properties();
properties.setProperty("ReadConnectSection", "false");
properties.setProperty("UseRebondTool", "false");
PropertiesListener listener = new PropertiesListener(properties);
reader.addChemObjectIOListener(listener);
reader.customizeJob();
Hmm. Not so easy. Maybe reflection in the base DefaultChemObjectReader class could allow the Readers to have set(String propertyName, String value) methods?