This shows you the differences between two versions of the page.

Next revision | Previous revision | ||

x-ray_diffraction_pattern_classification_using_convolutional_neural_networks_and_some_brutal_force [2016/08/05 09:05] zhiyuan1 created |
x-ray_diffraction_pattern_classification_using_convolutional_neural_networks_and_some_brutal_force [2016/08/05 10:41] zhiyuan1 |
||
---|---|---|---|

Line 1: | Line 1: | ||

- | This article discusses a simple experiment on classifying experimental X-ray diffraction patterns. Note that methods described here are not polished, so anyone following the steps below should double-check the results. | + | **This article discusses a simple experiment on classifying experimental X-ray diffraction patterns.** Note that methods described here are not polished, so anyone following the steps below should double-check the results. |

- | ==== Data Preparation and Templates Creation ==== | + | //Working Environment: CentOS 6.7; Python 3.5.2 in Anaconda3; Theano 0.8.2; Keras, version unknown \\ |

+ | Author: Zhiyuan Yang, zyang96@ucla.edu // | ||

+ | ==== Templates Creation and Data Preparation ==== | ||

Suppose initial experiment data are stored in a hdf5 file, say ''data.h5''. Within it lies a single 3-dimensional dataset ''patterns''. | Suppose initial experiment data are stored in a hdf5 file, say ''data.h5''. Within it lies a single 3-dimensional dataset ''patterns''. | ||

<code> | <code> | ||

- | In [2]: f = h5py.File('data.h5', 'r') \\ | + | In [2]: f = h5py.File('data.h5', 'r') |

- | In [3]: list(f.items()) \\ | + | In [3]: list(f.items()) |

- | Out[3]: [('sel_p', <HDF5 dataset "patterns": shape (4000, 260, 257), type "<f4">)]\\ | + | Out[3]: [('sel_p', <HDF5 dataset "patterns": shape (30000, 260, 257), type "<f4">)] |

</code> | </code> | ||

Here the 1st dimension (or dimension 0) is the number of diffraction patterns in the dataset, and the other two dimensions are | Here the 1st dimension (or dimension 0) is the number of diffraction patterns in the dataset, and the other two dimensions are | ||

- | height and width, respectively. | + | height and width, respectively. The first and most tedious step is to hand-pick a few hundred of clean single-particle hits from the dataset. \\ |

+ | (an unsupervised clustering approach can be adopted here, and was described in this article: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-19-17-16542 . However, with human labor it is much more accurate, and should result in higher accuracy with the deep learning approach described latter, in return.) | ||

+ | | ||

+ | Suppose about 200 single-hit patterns are selected, and are stored in a hdf5 dataset ''clean_single_hits''. The next step is to generate more templates of sinlge-hit patterns to compare with the rest of the whole dataset. This can be achieved with some basic python routines (note again that codes here are not optimized, and one can improve them for better performance) : | ||

+ | <code> | ||

+ | import h5py | ||

+ | import numpy as np | ||

+ | import scipy.ndimage.interpolation as sci | ||

+ | import numpy.linalg as linalg | ||

+ | from sklearn.preprocessing import scale | ||

+ | | ||

+ | def preprocessing(x): | ||

+ | x[x<0] = 0 | ||

+ | x *= 255.0/x.max() | ||

+ | | ||

+ | def generate_templates(pattern): | ||

+ | '''for each pattern generate 36 templates, each from | ||

+ | rotating the initial image about a certain degree | ||

+ | ''' | ||

+ | l = np.zeros([36,260,257]) | ||

+ | for i in range(36): | ||

+ | rotated = sci.rotate(pattern, angle=(5*i), reshape=False) | ||

+ | l[i] = rotated | ||

+ | return l | ||

+ | | ||

+ | def generate_all_templates(patterns): | ||

+ | num = len(patterns) | ||

+ | total = np.zeros([num, 36, 260, 257]) | ||

+ | for i in range(num): | ||

+ | total[i,:,:,:] = generate_templates(patterns[i]) | ||

+ | total = total.reshape([num * 36, 260, 257]) | ||

+ | return total | ||

+ | | ||

+ | def smallest_value(image, templates): | ||

+ | '''returns the smallest L2 norm of the "difference matrix", | ||

+ | which serves as a loose measurement of how different (or similar) | ||

+ | the new data is to all the generated templates. | ||

+ | ''' | ||

+ | diffs = [] | ||

+ | num = len(templates) | ||

+ | for i in range(num): | ||

+ | diff = np.subtract(image, templates[i]) | ||

+ | diff = linalg.norm(diff) | ||

+ | diffs.append(diff) | ||

+ | diffs = sorted(diffs, key=float) | ||

+ | return float(diffs[0]) | ||

+ | </code> | ||

+ | Use the functions above to preprocess the data, generate templates and produce a sorted list of L2 norms of new, unprocessed data. Those in the front of the list are ones that are most similar to our templates (so most likely to be also clean single hits). | ||

+ | | ||

+ | Then, you can partition the new data into several groups. You can either leave them as hdf5 datasets, or you can all images into different folders. It should be something like \\ | ||

+ | .\training_set | ||

+ | .\single | ||

+ | .\multiple | ||

+ | .\validation_set | ||

+ | .\single | ||

+ | .\multiple | ||

+ | .\data_to_predict | ||

+ | | ||

+ | **I suggest the latter approach, since it's easier to do real-time data augmentation in Keras this way.** | ||

+ | ==== Build Convolutional Neural Network Model to Train and Predict the Data ==== | ||

+ | Once you are done preparing data, you can start building the convolutional neural network. The one I am using here is | ||

+ | not exactly "deep", but it should be enough for a relatively small dataset like this. | ||

+ | To build the network I also used Keras, which is a higher-level api built upon both Theano and Tensorflow for cleaner code structure. **Tensorflow is the preferred backend since it has better multi-gpu support, but on an operating system as old as CentOS 6, it is almost impossible to get it installed.** | ||

+ | <code> | ||

+ | from keras.models import Sequential | ||

+ | from keras.preprocessing.image import ImageDataGenerator | ||

+ | from keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D | ||

+ | from keras.layers import Activation, Dropout, Flatten, Dense, BatchNormalization | ||

+ | | ||

+ | # data augmentation methods | ||

+ | train_datagen = ImageDataGenerator( | ||

+ | rotation_range=40, | ||

+ | width_shift_range=0.2, | ||

+ | height_shift_range=0.2, | ||

+ | rescale=1./255, | ||

+ | shear_range=0.2, | ||

+ | zoom_range=0.4, | ||

+ | horizontal_flip=True, | ||

+ | fill_mode='nearest') | ||

+ | | ||

+ | test_datagen = ImageDataGenerator() # do nothing for validation set. | ||

+ | # network architecture | ||

+ | model = Sequential() | ||

+ | model.add(Convolution2D(32, 3, 3, input_shape=(3, 256, 256))) | ||

+ | #model.add(BatchNormalization()) | ||

+ | model.add(Activation('relu')) | ||

+ | model.add(Dropout(0.5)) | ||

+ | model.add(MaxPooling2D(pool_size=(2, 2))) | ||

+ | | ||

+ | model.add(Convolution2D(32, 3, 3)) | ||

+ | #model.add(BatchNormalization()) | ||

+ | model.add(Activation('relu')) | ||

+ | #model.add(Dropout(0.5)) | ||

+ | model.add(MaxPooling2D(pool_size=(2, 2))) | ||

+ | | ||

+ | model.add(Convolution2D(64, 3, 3)) | ||

+ | #model.add(BatchNormalization()) | ||

+ | model.add(Activation('relu')) | ||

+ | model.add(MaxPooling2D(pool_size=(2, 2))) | ||

+ | model.add(ZeroPadding2D((1,1))) | ||

+ | | ||

+ | model.add(Convolution2D(64, 3, 3)) | ||

+ | #model.add(BatchNormalization()) | ||

+ | model.add(Activation('relu')) | ||

+ | model.add(Dropout(0.5)) | ||

+ | model.add(MaxPooling2D(pool_size=(2, 2))) | ||

+ | | ||

+ | model.add(Flatten()) # this converts our 3D feature maps to 1D feature vectors | ||

+ | model.add(Dense(64)) | ||

+ | #model.add(BatchNormalization()) | ||

+ | model.add(Activation('relu')) | ||

+ | model.add(Dropout(0.5)) | ||

+ | | ||

+ | model.add(Dense(1)) | ||

+ | model.add(Activation('sigmoid')) | ||

+ | | ||

+ | model.compile(loss='binary_crossentropy', | ||

+ | optimizer='rmsprop', | ||

+ | metrics=['accuracy']) | ||

+ | </code> | ||

+ | | ||

+ | My approach uses images stored in different directories: | ||

+ | <code> | ||

+ | # import data | ||

+ | train_generator = train_datagen.flow_from_directory( | ||

+ | './training_set', | ||

+ | target_size=(256,256), | ||

+ | batch_size = 32, | ||

+ | class_mode ='binary') | ||

+ | | ||

+ | test_generator = test_datagen.flow_from_directory( | ||

+ | './validation_set', | ||

+ | target_size=(256,256), | ||

+ | batch_size = 32, | ||

+ | class_mode='binary') | ||

+ | </code> | ||

+ | Then, we can train our model and save the trained weights: | ||

+ | <code> | ||

+ | model.fit_generator( | ||

+ | train_generator, | ||

+ | samples_per_epoch=4137, | ||

+ | nb_epoch=10, | ||

+ | verbose=1, | ||

+ | validation_data=test_generator, | ||

+ | nb_val_samples= 1098) | ||

+ | #callbacks=callbacks) | ||

+ | | ||

+ | model.save_weights('weights.h5') | ||

+ | </code> | ||

+ | My particular run resulted in a validation accuracy of about 93%, which is not bad for a data set small like mine. | ||

+ | | ||

+ | You can now use the model and saved weights to make predictions on new experiment data, and save the prediction labels: | ||

+ | <code> | ||

+ | # suppose y is a numpy array with dimension [number_of_samples, color_channels, height, width] | ||

+ | # if using Tensorflow backend, it should be [number_of_samples, height, width, color_channels] | ||

+ | y = model.predict(images, verbose=1) | ||

+ | | ||

+ | f = h5py.File('predictions.h5', 'w') | ||

+ | f.create_dataset('predictions', data=y) | ||

+ | f.close() | ||

+ | </code> | ||

+ | | ||

+ | A hdf5 file named ''predictions.h5'' is generated with prediction labels for the new experiment data. You can extract images | ||

+ | accordingly to manually check the result. | ||

+ | | ||

+ | ==== Some Thoughts ==== | ||

+ | One should notice that the method here is very unsophisticated, and has several limitations such as computational power restriction and time efficiency issues. However, this is a practical method that did work for me, and should work for many single-particle diffraction pattern classification problems. |

“everything that is living can be understood in terms of the jiggling and wiggling of atoms”.

and now, we want to watch atoms jiggling and wiggling.

X-rays, electrons, fluorescence light, the advances of photon sciences, together with computational modeling, are making this happen.