I'm a senior at Brown University pursuing an Sc.B. in Computer Science. My academic interests lie in applied cryptography, decentralized networks, data science, and computer graphics. In my spare time, I enjoy traveling, exploring what lies beyond the horizon, and practicing Ghanaian drumming and dancing.
Increasingly, studies across the scientific community are being uncovered as false discoveries. This phenomenon is largely due to
“p-hacking” — the process of repeatedly testing hypotheses on a dataset until an interesting
result emerges. Since hypothesis testing inherently involves a small failure probability, given enough tests performed
on a single dataset, a spurious correlation will eventually emerge. This leads not only to irreproducible claims being published in dubious news outlets (e.g., “A new study shows that drinking a glass of wine is just as good as spending an hour at the gym”) under the excuse of “p-value < 0.05!”, but also affects reputable institutions and journals, where more subtle forms of data dredging are often present.
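The effect is easy to reproduce. Below is a toy simulation (illustrative only, not drawn from any real study): it tests 50 hypotheses against pure noise, and with a 5% false-positive rate per test, a few “significant” results are expected to appear by chance alone.

```python
# A minimal simulation of "p-hacking": test many hypotheses on pure noise
# and watch "significant" results appear by chance. All names and parameters
# here are illustrative.
import random
import statistics


def two_sample_p_value(a, b, n_permutations=2000):
    """Approximate p-value via a permutation test on the difference of means."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    count = 0
    for _ in range(n_permutations):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    # The +1 smoothing avoids reporting an (impossible) p-value of exactly 0.
    return (count + 1) / (n_permutations + 1)


random.seed(0)
# 50 "variables" of pure noise measured on two groups: no real effect exists.
for hypothesis in range(50):
    group_a = [random.gauss(0, 1) for _ in range(30)]
    group_b = [random.gauss(0, 1) for _ in range(30)]
    p = two_sample_p_value(group_a, group_b)
    if p < 0.05:
        print(f"hypothesis {hypothesis}: p = {p:.3f}  <-- 'significant', yet spurious")
```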
Accurately controlling for such errors is notoriously difficult, since there is no existing means of tracking researcher bias during the testing procedure, leaving the validity of many studies up to good faith. With the massive amounts of data available to researchers today, “p-hacking” is becoming increasingly concerning, with some journals outright refusing to consider studies involving p-values unless they are first reproduced by independent researchers (a costly, and in some cases impossible, requirement).
My research has led me to develop a system called HypoCert, which provably certifies valid hypothesis-testing procedures. The system combines homomorphic encryption, secure multi-party computation, and distributed ledgers in a novel way to enforce that statistical tests are performed in a manner which prevents “p-hacking”.
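To give a flavor of why auditability helps (this is a toy, non-cryptographic illustration of the intuition, not HypoCert's actual protocol): if every test a researcher runs is committed to an append-only log, the total number of tests m becomes a matter of record rather than trust, and a standard Bonferroni correction of α/m can be enforced when judging significance.

```python
# Toy illustration (not HypoCert's protocol): an append-only, hash-chained log
# of hypothesis tests makes the number of tests auditable, so a Bonferroni
# correction can be enforced instead of trusting the researcher to report
# every test they ran.
import hashlib


class TestLedger:
    """Append-only log; each entry is hash-chained to the previous one,
    so past tests cannot be silently dropped or rewritten."""

    def __init__(self):
        self.entries = []
        self.head = b"genesis"

    def record(self, description: str, p_value: float):
        self.head = hashlib.sha256(
            self.head + f"{description}:{p_value}".encode()
        ).digest()
        self.entries.append((description, p_value, self.head.hex()))

    def significant(self, alpha: float = 0.05):
        """Judge significance at level alpha, Bonferroni-corrected by the
        total number of logged tests, which cannot be understated."""
        m = max(1, len(self.entries))
        return [(d, p) for d, p, _ in self.entries if p < alpha / m]


ledger = TestLedger()
ledger.record("wine vs. gym, endpoint 1", 0.04)   # "significant" only if untracked
ledger.record("wine vs. gym, endpoint 2", 0.30)
ledger.record("wine vs. gym, endpoint 3", 0.0004)
print(ledger.significant())  # only the third survives alpha / m = 0.05 / 3
```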
Data visualization tools such as Vizdom are highly interactive and require that results be displayed to users without breaking the interaction cycle. Unfortunately, many data mining algorithms are slow, requiring seconds to minutes to produce a result for a given user query. This leaves visualization tools with a dilemma: either break all user interaction by running the algorithms on the spot, or pre-compute the results of every possible query ahead of the exploration task (infeasible in many cases).
I worked on developing “progressive” data mining algorithms, which output a stream of incremental, useful results that converge to the exact output, while providing strong error guarantees on each intermediate result. With a progressive algorithm, rather than waiting minutes for the result of a data mining query to be displayed, a high-quality approximate output is displayed every few milliseconds, enabling true interaction between the user and the exploration task.
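A simplified sketch of the idea (a generic progressive estimator, not ProSecCo itself, and assuming the data arrives in random order): process the data in blocks, emit an updated estimate after each block, and attach a Hoeffding-style confidence bound that shrinks as more data is seen.

```python
# A generic progressive estimator (not ProSecCo itself): stream the data in
# blocks, emit a running estimate of an item's frequency after every block,
# and attach a Hoeffding error bound that tightens as more data is processed.
# Assumes the stream is a uniform random sample of the dataset.
import math
import random


def progressive_frequency(stream, target, block_size=1000, delta=0.05):
    """Yield (estimate, error_bound) pairs; the true frequency lies within
    estimate +/- error_bound with probability >= 1 - delta per emission."""
    seen = hits = block = 0
    for item in stream:
        seen += 1
        hits += (item == target)
        block += 1
        if block == block_size:
            block = 0
            # Hoeffding bound for the mean of `seen` i.i.d. {0, 1} samples.
            eps = math.sqrt(math.log(2 / delta) / (2 * seen))
            yield hits / seen, eps


random.seed(0)
data = (random.choice("abcde") for _ in range(50_000))  # true freq of "a" is 0.2
for estimate, eps in progressive_frequency(data, "a"):
    print(f"freq('a') ~= {estimate:.3f} +/- {eps:.3f}")
```

Each emitted interval is a usable answer on its own, so a front end like Vizdom can render the current estimate and its shrinking error bar instead of blocking until the exact result is ready.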
Crypto Scheme Implementations
My implementation of the Threshold-Paillier scheme, with some other nice properties.
More info on the scheme: Multiparty Computation from Threshold Homomorphic Encryption.
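For context, here is a minimal sketch of plain (non-threshold) Paillier encryption, which the threshold variant builds on by secret-sharing the decryption key among several parties so that no single party can decrypt alone. Key sizes are toy-sized for readability, and this is an illustration rather than my actual implementation.

```python
# Minimal sketch of plain (non-threshold) Paillier encryption. The threshold
# variant secret-shares the decryption exponent among parties; here a single
# key holder decrypts. Toy key sizes; real deployments use >= 2048-bit moduli.
import math
import secrets


def is_probable_prime(n, rounds=40):
    """Miller-Rabin primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True


def random_prime(bits):
    while True:
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate


def keygen(bits=256):
    p, q = random_prime(bits // 2), random_prime(bits // 2)
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse exists w.h.p. for equal-size random primes
    return n, (lam, mu)


def encrypt(n, m):
    r = secrets.randbelow(n - 1) + 1  # random blinding factor
    # With g = n + 1, the ciphertext is g^m * r^n mod n^2.
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(n, key, c):
    lam, mu = key
    # L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n.
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


n, key = keygen()
c1, c2 = encrypt(n, 20), encrypt(n, 22)
# The additive homomorphism: multiplying ciphertexts adds the plaintexts.
print(decrypt(n, key, (c1 * c2) % (n * n)))  # prints 42
```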
ProSecCo: Progressive Sequence Mining with Convergence Guarantees,
S. Servan-Schreiber, M. Riondato, and E. Zgraggen.
IEEE ICDM, 2018
Towards Quantifying Uncertainty in Data Analysis & Exploration,
Y. Chung, S. Servan-Schreiber, E. Zgraggen, and T. Kraska.
IEEE Data Engineering Bulletin, 41(3), 2018
“Have nothing in your house that you do not know to be useful or believe to be beautiful.”— William Morris, Hopes and Fears for Art, 1882.
“Art is an outlet toward regions which are not ruled by time and space.” — Marcel Duchamp.
“Let the future tell the truth, and evaluate each one according to his work and accomplishments. The present is theirs; the future, for which I have really worked, is mine.” — Nikola Tesla.
“The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken.” — Bertrand Russell, The Problems of Philosophy, 1912.