This article discusses a simple experiment on classifying experimental X-ray diffraction patterns. Note that methods described here are not polished, so anyone following the steps below should double-check the results.
Working Environment: CentOS 6.7; Python 3.5.2 in Anaconda3; Theano 0.8.2; Keras, version unknown
Author: Zhiyuan Yang, zyang96@ucla.edu
Suppose initial experiment data are stored in a hdf5 file, say data.h5
. Within it lies a single 3-dimensional dataset patterns
.
In [2]: f = h5py.File('data.h5', 'r') In [3]: list(f.items()) Out[3]: [('sel_p', <HDF5 dataset "patterns": shape (30000, 260, 257), type "<f4">)]Here the 1st dimension (or dimension 0) is the number of diffraction patterns in the dataset, and the other two dimensions are height and width, respectively. The first and most tedious step is to hand-pick a few hundred of clean single-particle hits from the dataset.
Suppose about 200 single-hit patterns are selected, and are stored in a hdf5 dataset clean_single_hits
. The next step is to generate more templates of sinlge-hit patterns to compare with the rest of the whole dataset. This can be achieved with some basic python routines (note again that codes here are not optimized, and one can improve them for better performance) :
import h5py import numpy as np import scipy.ndimage.interpolation as sci import numpy.linalg as linalg from sklearn.preprocessing import scale def preprocessing(x): x[x<0] = 0 x *= 255.0/x.max() def generate_templates(pattern): '''for each pattern generate 36 templates, each from rotating the initial image about a certain degree ''' l = np.zeros([36,260,257]) for i in range(36): rotated = sci.rotate(pattern, angle=(5*i), reshape=False) l[i] = rotated return l def generate_all_templates(patterns): num = len(patterns) total = np.zeros([num, 36, 260, 257]) for i in range(num): total[i,:,:,:] = generate_templates(patterns[i]) total = total.reshape([num * 36, 260, 257]) return total def smallest_value(image, templates): '''returns the smallest L2 norm of the "difference matrix", which serves as a loose measurement of how different (or similar) the new data is to all the generated templates. ''' diffs = [] num = len(templates) for i in range(num): diff = np.subtract(image, templates[i]) diff = linalg.norm(diff) diffs.append(diff) diffs = sorted(diffs, key=float) return float(diffs[0])Use the functions above to preprocess the data, generate templates and produce a sorted list of L2 norms of new, unprocessed data. Those in the front of the list are ones that are most similar to our templates (so most likely to be also clean single hits).
Then, you can partition the new data into several groups. You can either leave them as hdf5 datasets, or you can all images into different folders. It should be something like
.\training_set .\single .\multiple .\validation_set .\single .\multiple .\data_to_predict
I suggest the latter approach, since it's easier to do real-time data augmentation in Keras this way.
Once you are done preparing data, you can start building the convolutional neural network. The one I am using here is not exactly “deep”, but it should be enough for a relatively small dataset like this. To build the network I also used Keras, which is a higher-level api built upon both Theano and Tensorflow for cleaner code structure. Tensorflow is the preferred backend since it has better multi-gpu support, but on an operating system as old as CentOS 6, it is almost impossible to get it installed.
from keras.models import Sequential from keras.preprocessing.image import ImageDataGenerator from keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D from keras.layers import Activation, Dropout, Flatten, Dense, BatchNormalization # data augmentation methods train_datagen = ImageDataGenerator( rotation_range=40, width_shift_range=0.2, height_shift_range=0.2, rescale=1./255, shear_range=0.2, zoom_range=0.4, horizontal_flip=True, fill_mode='nearest') test_datagen = ImageDataGenerator() # do nothing for validation set. # network architecture model = Sequential() model.add(Convolution2D(32, 3, 3, input_shape=(3, 256, 256))) #model.add(BatchNormalization()) model.add(Activation('relu')) model.add(Dropout(0.5)) model.add(MaxPooling2D(pool_size=(2, 2))) model.add(Convolution2D(32, 3, 3)) #model.add(BatchNormalization()) model.add(Activation('relu')) #model.add(Dropout(0.5)) model.add(MaxPooling2D(pool_size=(2, 2))) model.add(Convolution2D(64, 3, 3)) #model.add(BatchNormalization()) model.add(Activation('relu')) model.add(MaxPooling2D(pool_size=(2, 2))) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(64, 3, 3)) #model.add(BatchNormalization()) model.add(Activation('relu')) model.add(Dropout(0.5)) model.add(MaxPooling2D(pool_size=(2, 2))) model.add(Flatten()) # this converts our 3D feature maps to 1D feature vectors model.add(Dense(64)) #model.add(BatchNormalization()) model.add(Activation('relu')) model.add(Dropout(0.5)) model.add(Dense(1)) model.add(Activation('sigmoid')) model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
My approach uses images stored in different directories:
# import data train_generator = train_datagen.flow_from_directory( './training_set', target_size=(256,256), batch_size = 32, class_mode ='binary') test_generator = test_datagen.flow_from_directory( './validation_set', target_size=(256,256), batch_size = 32, class_mode='binary')Then, we can train our model and save the trained weights:
model.fit_generator( train_generator, samples_per_epoch=4137, nb_epoch=10, verbose=1, validation_data=test_generator, nb_val_samples= 1098) #callbacks=callbacks) model.save_weights('weights.h5')My particular run resulted in a validation accuracy of about 93%, which is not bad for a data set small like mine.
You can now use the model and saved weights to make predictions on new experiment data, and save the prediction labels:
# suppose y is a numpy array with dimension [number_of_samples, color_channels, height, width] # if using Tensorflow backend, it should be [number_of_samples, height, width, color_channels] y = model.predict(images, verbose=1) f = h5py.File('predictions.h5', 'w') f.create_dataset('predictions', data=y) f.close()
A hdf5 file named predictions.h5
is generated with prediction labels for the new experiment data. You can extract images
accordingly to manually check the result.
One should notice that the method here is very unsophisticated, and has several limitations such as computational power restriction and time efficiency issues. However, this is a practical method that did work for me, and should work for many single-particle diffraction pattern classification problems.
“everything that is living can be understood in terms of the jiggling and wiggling of atoms”.
and now, we want to watch atoms jiggling and wiggling.
X-rays, electrons, fluorescence light, the advances of photon sciences, together with computational modeling, are making this happen.