
Hi everyone, this is Vishav, and I am here with the fourth installment of my blog series, “Journey to GSoC”. If you haven’t read my previous post, you can read it here to catch up.

The second phase of Google Summer of Code is complete. The goals for this phase were to add buffering and caching features and to rewrite the tests with Hypothesis. Buffering is done; you can see my work in this PR. Caching is still in the discussion stage and will be done in the coming days.

During this phase I ran into many blockers, but I was able to solve most of them with guidance from my mentors. Some interesting ones are described below:

  • While rewriting the tests with Hypothesis, I found that all int keys are converted to str when they are saved in a JSONDict. If we save an int key, we then get an error when trying to access it as an int. To solve this, we decided that every collection can have a validator (or a list of validators) that is applied to its inputs, and that signac should apply the JSON validator when using synced collections, regardless of the backend.

  • While working on validation, we found that we currently only validate the keys of a dictionary. We decided to generalize this behaviour so that all input data is validated before it is inserted into the collection.
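
To make the key problem concrete, here is a minimal sketch using only the standard library. The first part shows the JSON round trip that turns an int key into a str. The second part is my own illustration of the validator idea, not signac's actual code: the names `require_str_keys` and `ValidatedDict` are hypothetical, and I am assuming that the JSON validator rejects non-string keys.

```python
import json

# JSON object keys are always strings, so an int key does not survive
# a round trip through a JSON file.
original = {1: "a"}
restored = json.loads(json.dumps(original))
# `restored` now has the key "1" instead of 1, so restored[1] raises KeyError.


def require_str_keys(data):
    """Hypothetical validator: reject non-str mapping keys, recursively."""
    if isinstance(data, dict):
        for key, value in data.items():
            if not isinstance(key, str):
                raise TypeError(
                    f"Keys must be str, got {type(key).__name__}: {key!r}"
                )
            require_str_keys(value)
    elif isinstance(data, (list, tuple)):
        for item in data:
            require_str_keys(item)


class ValidatedDict(dict):
    """Hypothetical collection that validates all input before insertion."""

    # A collection can carry one validator or several.
    _validators = (require_str_keys,)

    def __setitem__(self, key, value):
        # Validate the key and the (possibly nested) value together.
        for validator in self._validators:
            validator({key: value})
        super().__setitem__(key, value)
```

With this sketch, `ValidatedDict()["a"] = {"b": 1}` is accepted, while assigning to the key `1` or storing a nested dictionary with an int key raises a `TypeError` before anything is written.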

The buffering PR is still being reviewed, so I am focused on completing caching before the start of the third phase. There is an ongoing discussion about caching in the PR. The main questions that need to be answered are:

  • Can we handle the suspended synchronization by simply reading from/writing to an object-specific cache?
  • How should a cache be defined? Are caches always in-memory objects? If not, how do we ensure their synchronization, for instance with respect to the current project cache?
  • How do we handle multiple synced collections pointing to the same file?
  • Should we assume that all instances must point to the same cache at any given time? Can there be multiple active caches? If not, how do we prevent that?
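
To illustrate the first question, here is a rough sketch of what "suspending synchronization by reading from and writing to an object-specific cache" could look like. This is only my own toy example under simple assumptions (a single in-memory cache per object, no nesting, no concurrent writers). `BufferedJSONDict` and its `buffered()` method are hypothetical names, not the API being discussed in the PR.

```python
import json
import os
import tempfile
from contextlib import contextmanager


class BufferedJSONDict:
    """Hypothetical file-backed dict with an object-specific cache."""

    def __init__(self, filename):
        self._filename = filename
        self._cache = None  # None means every access is synchronized

    def _load(self):
        if not os.path.exists(self._filename):
            return {}
        with open(self._filename) as f:
            return json.load(f)

    def _save(self, data):
        with open(self._filename, "w") as f:
            json.dump(data, f)

    def __getitem__(self, key):
        data = self._cache if self._cache is not None else self._load()
        return data[key]

    def __setitem__(self, key, value):
        if self._cache is not None:
            self._cache[key] = value  # suspended: only memory is touched
        else:
            data = self._load()
            data[key] = value
            self._save(data)

    @contextmanager
    def buffered(self):
        """Suspend synchronization; write the cache back once on exit."""
        self._cache = self._load()
        try:
            yield self
        finally:
            cache, self._cache = self._cache, None
            self._save(cache)


# Usage: the file is only updated when the buffered block ends.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    d = BufferedJSONDict(path)
    d["a"] = 1
    with d.buffered():
        d["b"] = 2
        with open(path) as f:
            on_disk_during = json.load(f)
    with open(path) as f:
        on_disk_after = json.load(f)
```

In this sketch, two `BufferedJSONDict` objects pointing at the same file would each hold their own cache, and whichever leaves its buffered block last would overwrite the other's changes. That is exactly the situation the last two questions are about.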

This marks the end of the second phase of Google Summer of Code. For the third phase, I plan to implement different backends and lazy statepoint loading.
