12.07.2015 Views

Crystal Fingerprinting and STM4 for USPEX output ... - Mario Valle


Crystal Fingerprinting and STM4 for USPEX output analysis
Mario Valle
USPEX 2011 Xi’an workshop – 04/08/2011
The CSCS Data Analysis and Visualization Group
mvalle@cscs.ch


CSCS computer room
CSCS usage May 2011
Prof. Michele Parrinello

Started with postprocessing
Movies for conference talks
Francesco Gervasio – ETH Zürich
Nice images for publications


Data from disjoint subfields
Macromolecules
Crystallography
CrystalFp, STM4 & USPEX - Mario Valle - 04/08/2011


Help overcome tool inflexibility
For example, there are nice crystallography programs that do not support dynamic data and do not allow customization.

STM4 is a framework for the development of unusual and enhanced techniques for chemistry visualization.


Offer a broader set of techniques

Provide enhanced techniques
Available data with the standard isosurface
Isosurface with the new volume interpolator


STM3 Gallery
STM3 modules
The LEGO DNA


USPEX discovered new materials
Boron at 1 atm: USPEX easily found the complex α-B structure...
...and also discovered the superhard boron γ-B28 phase (Nature)
Prof. A. Oganov

USPEX structure cancer problem
Different colors mean different crystal structures
[Figure panels: normal structure cluster; USPEX structure cancer. Axis: generation]


CrystalFp: the problem to solve
USPEX is a crystal structure predictor based on an evolutionary algorithm.
Each run produces hundreds of putative crystal structures...
...but many of them are equal.
So intensive manual labor is needed to prune duplicated structures.
Project: develop a (semi)automatic way to extract unique structures from USPEX outputs.

Proposed solution from High-Dim
• Compute unique coordinates
• Define distance measure
• Add grouping criteria
Space: 100-3000 dimensional
Each group describes a distinct structure


Structure “coordinates” (the structure “fingerprint” in CrystalFp)
• Set of distances for each atom in the unit cell
• Distance sets concatenated for all atoms in the structure (coordinate)
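
The distance-set idea on this slide can be sketched as follows. This is only an illustration, not the CrystalFp code: the function name `distance_set_fingerprint`, the restriction to an orthorhombic cell, and the fixed number of distances kept per atom (`n_keep`, padded with the cutoff value) are all assumptions made to keep the example short.

```python
import itertools
import math

def distance_set_fingerprint(frac_coords, cell_lengths, cutoff, n_keep):
    """Concatenate, atom by atom, the n_keep shortest interatomic distances.

    frac_coords  : list of (x, y, z) fractional coordinates
    cell_lengths : (a, b, c) of an orthorhombic cell (assumption of this sketch)
    cutoff       : distances above this value are ignored
    n_keep       : fixed number of distances kept per atom (hypothetical choice)
    """
    # Number of periodic images needed along each axis to cover the cutoff.
    reps = [int(math.ceil(cutoff / length)) for length in cell_lengths]
    shifts = list(itertools.product(*(range(-r, r + 1) for r in reps)))
    fingerprint = []
    for i, pi in enumerate(frac_coords):
        dists = []
        for j, pj in enumerate(frac_coords):
            for s in shifts:
                if i == j and s == (0, 0, 0):
                    continue  # skip the atom itself, but keep its periodic images
                d = math.sqrt(sum(((pj[k] + s[k] - pi[k]) * cell_lengths[k]) ** 2
                                  for k in range(3)))
                if d <= cutoff:
                    dists.append(d)
        dists.sort()
        # Pad or truncate so that every atom contributes n_keep values.
        fingerprint.extend((dists + [cutoff] * n_keep)[:n_keep])
    return fingerprint

# One atom in a simple cubic cell of side 1: the six nearest images are at distance 1.
print(distance_set_fingerprint([(0.0, 0.0, 0.0)], (1.0, 1.0, 1.0), 1.5, 6))
```

The per-atom sets are concatenated in the order the atoms are given, so in this sketch the result depends on atom ordering.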


Visual design and validation
Built a tool to explore algorithm choices and parameter settings.
This tool wraps the classifier library, called CrystalFp, and provides various interactive visual diagnostics to check classifier behavior.
It is built inside STM4, the molecular visualization toolkit developed at CSCS.


Workflow support
The application interface gives access to all CrystalFp algorithms and their parameters in a clear process workflow.
STM4 provided an environment that accelerated the implementation.
1. Load structures
2. Filter on energy
3. Compute fingerprints
4. Compute distances
5. Group structures

Visual diagnostics tools
Various visualization and analysis tools to check and validate the behavior of the CrystalFp algorithms.
1. 2D maps
2. Charts
3. Picking for details
4. 2D data export
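
Step 2 of the workflow (filter on energy) can be illustrated with a few lines of Python. The function name `filter_by_energy` is hypothetical; the sketch assumes the filter keeps every structure whose energy lies within a given threshold of the lowest energy found, which is one of the two energy filters listed in the CrystalFp usage text (threshold from minimum energy).

```python
def filter_by_energy(energies, threshold_from_min):
    """Return the indices of the structures within threshold_from_min of the minimum energy."""
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min <= threshold_from_min]

# Energies in eV for four made-up structures; keep those within 0.5 eV of the lowest.
energies = [-3.00, -2.80, -2.40, -2.95]
print(filter_by_energy(energies, 0.5))
```

Only the surviving structures need fingerprints and distances, which reduces the cost of the later steps.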


Visual diagnostics: distance matrix
Distances between structures
Distances ordered by group

Clustering visual diagnostic
Panels: DFS grouping; Pseudo SNN (K=5); Pseudo SNN (K=1); SNN (K=5)
• DFS: depth-first search of the neighbor nodes
• Pseudo SNN: maintain a connection between nodes only if they share at least K neighbors
• SNN: as above plus a DBSCAN pass
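
The DFS and Pseudo SNN groupings described above can be sketched on a small distance matrix. This is a minimal illustration under assumptions: two structures are taken to be neighbors when their distance is below a user threshold, the endpoints of an edge are not counted among their own shared neighbors, and all function names are hypothetical. The DBSCAN pass of the full SNN variant is not shown.

```python
def neighbor_graph(dist, threshold):
    """Adjacency sets: i and j are neighbors if dist[i][j] <= threshold."""
    n = len(dist)
    return [{j for j in range(n) if j != i and dist[i][j] <= threshold}
            for i in range(n)]

def pseudo_snn(graph, k):
    """Keep an edge only if its two endpoints share at least k neighbors."""
    return [{j for j in nbrs if len(nbrs & graph[j]) >= k} for nbrs in graph]

def dfs_groups(graph):
    """Group nodes into connected components with an iterative depth-first search."""
    seen, groups = set(), []
    for start in range(len(graph)):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        groups.append(sorted(group))
    return groups

# Five points on a line; the distance matrix holds their absolute differences.
pos = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in pos] for a in pos]
graph = neighbor_graph(dist, 2.5)
print(dfs_groups(graph))                 # plain DFS grouping
print(dfs_groups(pseudo_snn(graph, 1)))  # edges without a shared neighbor are cut
```

In this toy case the pair (3, 4) is connected but shares no third neighbor, so the Pseudo SNN pruning with K=1 splits it while the triple (0, 1, 2) survives.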


Visual diagnostics: scatterplot
The scatterplot tool in CrystalFp tries to map High-Dim space points to 2D, preserving their relative distances.
• Colored by “stress” to detect local minima traps
• Colored by group
• Diagnostic chart: distances in 2D vs. distances in High-Dim space
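
A per-point “stress” of the kind used for coloring can be computed by comparing, for each point, its 2D distances with the original High-Dim distances. The exact definition used by the scatterplot tool is not given on the slide; the normalized form below and the name `per_point_stress` are assumptions.

```python
import math

def per_point_stress(high_d, points_2d):
    """For each point i: sqrt( sum_j (d2_ij - D_ij)^2 / sum_j D_ij^2 ).

    high_d    : matrix of distances D_ij in the original space
    points_2d : list of (x, y) positions in the 2D map
    """
    n = len(points_2d)
    stress = []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if i == j:
                continue
            d2 = math.dist(points_2d[i], points_2d[j])
            num += (d2 - high_d[i][j]) ** 2
            den += high_d[i][j] ** 2
        stress.append(math.sqrt(num / den) if den > 0 else 0.0)
    return stress

# A layout that reproduces the target distances exactly has zero stress everywhere.
pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
exact = [[math.dist(p, q) for q in pts] for p in pts]
print(per_point_stress(exact, pts))
```

Points with a high value are the ones whose 2D position misrepresents their neighborhood, which is what makes them candidates for a local-minimum trap.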


A pseudo-diffraction-like method
This structure fingerprint is sampled on X to provide the coordinate values.
The fingerprint is cut at a user-defined distance to provide 100-400 coordinate values.

Final fingerprint (per atom type pair)
Compared to the previous fingerprinting method, this one is sensitive to the ordering of atoms in the structure and does not depend on the specific atomic species involved.
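
The sampling step can be illustrated with a deliberately simplified stand-in: a histogram of interatomic distances in which every distance is smeared into a Gaussian peak and the curve is sampled at regular bin centres up to the cutoff. This is not the actual CrystalFp fingerprint formula (normalization and per-pair weighting are omitted); `smeared_histogram` and its parameter names are assumptions, chosen to echo the bin-size and peak-size options in the usage text.

```python
import math

def smeared_histogram(distances, cutoff, bin_size, peak_size):
    """Sample a sum of Gaussian peaks, one per distance, at the bin centres in [0, cutoff)."""
    n_bins = int(round(cutoff / bin_size))
    centres = [(k + 0.5) * bin_size for k in range(n_bins)]
    values = [0.0] * n_bins
    for d in distances:
        if d > cutoff:
            continue  # the fingerprint is cut at the user-defined distance
        for k, x in enumerate(centres):
            values[k] += math.exp(-0.5 * ((x - d) / peak_size) ** 2)
    return values

# A 5 Å cutoff with 0.05 Å bins gives 100 coordinate values for one atom type pair.
fp = smeared_histogram([1.025, 2.4, 2.4, 3.1], cutoff=5.0, bin_size=0.05, peak_size=0.02)
print(len(fp))
```

Computing one such vector per atom type pair and concatenating them yields the coordinate vector of the structure.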


USPEX problem solved: an example
Hydrogen at 600 GPa (16 atoms)
• The USPEX run produced 1274 structures
• From these, the 794 within 0.5 eV of the lowest energy value found are selected
• Manual analysis to remove duplicated structures from this set: ~20 h of work
• Using the CrystalFp classifier: ~10 min
• In the end, only 4 unique structures were found:
  • One α-Ga type (top)
  • One Cs-IV (bottom), the ground state (i.e. the lowest-energy structure), and two closely related structures


Classifier integration in USPEX
Original USPEX: a lot of identical structures.
USPEX after the classifier integration: no more “structure cancer”!
Solved at the root!

So, where is the problem?
• Compute unique coordinates
• Define distance measure
• Add grouping criteria
Space: 100-3000 dimensional
Each group describes a distinct structure


So, where is the problem?
Space: 100-3000 dimensional

High dimensionality is not intuitive
• High-dimensional space is mostly empty
• Everything is in the corners of the hypercube


The curse of dimensionality
Roughly speaking, the higher the dimensionality, the lower the power of recognizing similar objects, because everything is at the same distance from every other point…
[Figure: distance distribution for random points in a hypercube]

But distance measures…
…could help counter this curse of dimensionality
Euclidean distance: $d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$
Minkowski distance: $d(\mathbf{x},\mathbf{y}) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
Cosine distance: $d(\mathbf{x},\mathbf{y}) = 1 - \cos\theta = 1 - \frac{\mathbf{x}\cdot\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$
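
The statement that everything ends up at about the same distance can be checked numerically. The sketch below draws uniformly random points in a unit hypercube and reports the spread of the pairwise Euclidean distances relative to their mean; the function name `relative_spread`, the point counts, and the dimensions are arbitrary choices for illustration.

```python
import math
import random
import statistics

def relative_spread(n_points, dim, seed=0):
    """Standard deviation of the pairwise distances divided by their mean."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return statistics.pstdev(dists) / statistics.mean(dists)

# The relative spread shrinks as the dimension grows: distances concentrate.
for dim in (2, 30, 300):
    print(dim, round(relative_spread(30, dim), 3))
```

As the dimension grows the mean distance increases while the spread stays roughly constant, so near and far neighbors become hard to tell apart with the Euclidean measure.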


Tried various distance measures
Euclidean

From the problem solution…
• Compute unique coordinates
• Define distance measure
• Add grouping criteria
Space: 100-3000 dimensional
Each group describes a distinct structure


… to a new paradigm
• Compute unique coordinates
• Define distance measure
Space: 100-3000 dimensional
High-Dim space tools
To look at crystal structures from a novel perspective

Unfold data to lower dimensions
Multidimensional scaling projects points from a high-dimensional space to a lower-dimensional one, preserving distances between points as faithfully as possible.
One famous test dataset (right, I said right!) contains points on a rolled sheet that forms a 3D shape called the “Swiss roll” (a superb example on the left).
[Figure panels: Sammon mapping to 2D; CCA mapping]


CrystalFp multi dim. scaling
The scatterplot tool in CrystalFp implements a Force Directed Placement multidimensional scaling algorithm (here the points are colored by energy).
Nanocluster data from Dr. Gareth Tribello (USI)
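
A force-directed placement can be pictured as a set of springs: every pair of points is pulled together when it is farther apart in 2D than in the original space and pushed apart when it is closer. The sketch below is a generic version of that idea, not the algorithm as implemented in CrystalFp; the function names, step size, and iteration count are assumptions.

```python
import math
import random

def random_layout(n, seed=0):
    """Random starting positions in the unit square."""
    rng = random.Random(seed)
    return [[rng.random(), rng.random()] for _ in range(n)]

def raw_stress(points, target):
    """Sum over pairs of (2D distance - target distance)^2."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def force_directed_mds(target, start, n_iter=500, step=0.05):
    """Move each point along the net spring force for a fixed number of iterations."""
    pts = [list(p) for p in start]  # work on a copy of the starting layout
    n = len(pts)
    for _ in range(n_iter):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pts[j][0] - pts[i][0]
                dy = pts[j][1] - pts[i][1]
                d = math.hypot(dx, dy) or 1e-9
                # Spring: attract if too far apart in 2D, repel if too close.
                f = (d - target[i][j]) / d
                forces[i][0] += f * dx
                forces[i][1] += f * dy
        for i in range(n):
            pts[i][0] += step * forces[i][0]
            pts[i][1] += step * forces[i][1]
    return pts

# Target distances taken from the corners of a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
target = [[math.dist(p, q) for q in square] for p in square]
start = random_layout(4)
final = force_directed_mds(target, start)
print(raw_stress(start, target), raw_stress(final, target))
```

Because the procedure is a local descent, it can settle in a local minimum, which is why a stress diagnostic on the resulting map is useful.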


Study of energy landscapes
A. R. Oganov and M. Valle, How to quantify energy landscapes of solids, The Journal of Chemical Physics, vol. 130, p. 104504, 2009.

Energy landscape of the Au8Pd4 system


More complex landscapes
Energy landscape for MgO with 32 atoms/cell

New quantities: quasi-entropy
For each given structure, quasi-entropy is a measure of disorder and complexity of that structure.
Si structures developing defects
S_str is better correlated to energy than Steinhardt’s Q6


Faithful representation?
Distances for three Si defect simulations without correcting for translations and rotations
Chart by Dr. Beat Sahli, Integrated Systems Lab., ETH Zürich

On average (on an interval), yes
(using CrystalFp)
(using VMD)
Nanocluster data from Dr. Gareth Tribello (USI)


(Totally) unexpected correlations
• We found unexpected correlations between distance and other physical variables
• For example, the deceptively simple H2O shows clear correlations and grouping
• This and other datasets motivated us to continue the exploration of the crystal fingerprints’ space…

“And roughly the only mechanism for suggesting questions is exploratory”
A conversation with John W. Tukey and Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler, Statistical Science, Volume 15, Number 1 (2000), 79-94


Classical data analysis
problem → data → hypothesis → analysis → conclusions
We have a model of the phenomena under study.
We use quantitative methods to prove or disprove our hypothesis (confirmatory data analysis).

Exploratory data analysis
problem → data → analysis → model → conclusions
We do not start from an established model.
We focus on the data: “Know your data”.
We try various graphical methods, looking for the (hidden) model in an exploration-driven, evolutionary way.


CrystalFp – Parametric study
[Figure panels: fingerprint cutoff; atoms per cell; general law]


Usage:
CrystalFp [options] POSCARfile [ENERGIESfile]

-v --verbose (optional argument)
    Verbose level (if no argument, defaults to 1)
-? -h --help (no argument)
    This help
-t --elements (required argument)
    List of chemical elements
-es --max-step --end-step (required argument)
    Last step to load (default: all)
-ss --start-step (required argument)
    First step to load (default: first)
-et --energy-per-structure (no argument)
    Energy from file is per structure, not per atom
-e --energy-threshold (required argument)
    Energy threshold
-r --threshold-from-min (required argument)
    Threshold from minimum energy
-c --cutoff-distance (required argument)
    Fingerprint forced cutoff distance
-n --nano-clusters --nanoclusters (no argument)
    The structures are nanoclusters, not crystals
-b --bin-size (required argument)
    Bin size for the pseudo-diffraction methods
-p --peak-size (required argument)
    Peak smearing size
...


Interesting correlations

Searching for an explanation


Distance vs. dimensionality
GaAs, 8 atoms/cell
Cutoff: 3–30 Å
Dimensionality: 180–1800

Distance decomposition
GaAs, 8 atoms/cell
Cutoff: 30 Å
Dimensionality: 1800
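
The dimensionalities quoted for GaAs are consistent with a simple count: one block of bins per atom type pair, each block covering the distance range up to the cutoff. The sketch below reproduces the 180 and 1800 figures under the assumption of a 0.05 Å bin size, a value inferred here from the numbers on the slide and not stated in the source; `fingerprint_dimension` is a hypothetical helper.

```python
def fingerprint_dimension(n_species, cutoff, bin_size):
    """Number of coordinate values: (unordered atom type pairs) x (bins up to the cutoff)."""
    n_pairs = n_species * (n_species + 1) // 2
    return n_pairs * round(cutoff / bin_size)

# GaAs has two species, hence three pairs: Ga-Ga, Ga-As, As-As.
print(fingerprint_dimension(2, 3.0, 0.05))   # cutoff 3 Å
print(fingerprint_dimension(2, 30.0, 0.05))  # cutoff 30 Å
```

Under the same assumption a two-species system such as Au8Pd4 with a 30 Å cutoff also lands at 1800 dimensions.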


But this one?
Structure: Au8Pd4
Cutoff: 30 Å
Dimension: 1800

Intrinsic dimensionality
Fingerprint space dimensionality: 100–3000
Example: Dim_space = 3, Dim_intrinsic = 2
Vastly redundant!
Theory: Dim_intrinsic = 3*N_atoms + 3
More realistic theory: Dim_intrinsic = 3*N_atoms + 3 - κ

Structure   Theor. Dim.   Intr. Dim.
Au8Pd4      39            10.85
MgNH        39            32.47
MgO         99            11.62


Constraints per atom
Dim_intrinsic = 3*N_atoms + 3 - κ = (3 - κ_A)*N_atoms + 3

Dataset                        Theor. Dim.   Intr. Dim.   κ_A
au8pd4                         39            11.94         2.25
Ca-16at-160GPa                 51            27.37         1.48
ca-16at-300GPa                 51            29.19         1.36
carbon-0GPa-8atoms-final       27            25.88         0.14
ch2-800GPa-18at                57            61.94        -0.27
gaas-8at_new                   27            26.47         0.07
h2o                            39            90.78        -4.32
H-300GPa-12at                  39             4.48         2.88
H-500GPa-16at                  51            25.78         1.58
H-500GPa-8at                   27             4.29         2.84
l4j8a                          39             2.30         3.06
l4j8                           39             2.87         3.01
mgnh-2.5eV-threshold1          39             9.67         2.44
mgnh-total4                    39            53.03        -1.17
mgo32a                         99            10.58         2.76
mgofull                        99            15.05         2.62
Na-140GPa-8at                  27            27.22        -0.03
urea-0GPa                      51            18.79         2.01
GaAs-old                       27             5.64         2.67
MgSiO3_Postperovskite_120GPa   63            56.34         0.33
GaAs_random                    27            23.45         0.44
MgNH-random                    39            63.69        -2.06

[Figure: histogram “Distribution of κ_A”; x axis κ_A, y axis frequency; the values 0.11 and 2.62 are marked]

Intrinsic dimensionality
H2O         39            80.50
But how do you explain this?
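
The last column of the table follows from the other two by solving the relation Dim_intrinsic = (3 - κ_A)*N_atoms + 3 for κ_A. The short check below recomputes it for three rows; the function name `constraints_per_atom` is hypothetical, and the atom counts are derived from the theoretical dimension 3*N_atoms + 3.

```python
def constraints_per_atom(theoretical_dim, intrinsic_dim):
    """Solve intrinsic_dim = (3 - k_A) * N + 3 for k_A, with N from theoretical_dim = 3N + 3."""
    n_atoms = (theoretical_dim - 3) // 3
    return (theoretical_dim - intrinsic_dim) / n_atoms

rows = [
    ("au8pd4", 39, 11.94),
    ("Ca-16at-160GPa", 51, 27.37),
    ("h2o", 39, 90.78),
]
for name, theor, intr in rows:
    print(f"{name:16s} {constraints_per_atom(theor, intr):6.3f}")
```

A negative value, as for h2o, means the measured intrinsic dimension exceeds the 3*N_atoms + 3 degrees of freedom, which is the anomaly the slides go on to question.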


Synthetic datasets
8 atoms with uniformly distributed random fractional coordinates in a cubic unit cell with 5 Å side
[Figure panels: distance distribution vs. embedding dimension; intrinsic dimension vs. embedding dimension]

Lessons learned
From the Modeling side
• Using known concepts in unusual contexts is a source of unexpected insights
• Discoveries happen on the boundaries of disciplines
• “Seeing is believing” and convincing. Then the domain experts become a source of ideas
From the Visual Analysis side
• Quick prototyping and experimentation capabilities are critical (that is, STM4 is a big help)
• No need for fancy visualizations. What is needed are visualizations tuned to the problem at hand
• Data management is critical to keep order in the data exploration


Going together…
http://www.cscs.ch/~mvalle/STM4
http://www.cscs.ch/~mvalle/CrystalFp

Thank you for your attention!
And don’t forget: mvalle@cscs.ch
