Exploratory visualization of neural networks
Created Tuesday 15 November, 2022
This blog post summarizes a summer project with The Minh Nguyen, whom I co-supervised with my postdoc mentor Simone Brugiapaglia.
See this graphic (7.2 MB GIF) for a visualization of some of our results.
Created Sunday March 11, 2018
A 42-video YouTube series by "the guy who wrote the Matlab function"
A YouTube series on Conditional Random Fields by Hugo Larochelle
One of myriad Medium lists of ML resources
One more with a link to ≥ 150 tutorials
Matrix Completion Whirlwind
Created Thursday September 14, 2017
I recently wrote some basic Python functions for a matrix completion tutorial, covering elementary theory and a couple of popular convex optimization methods for matrix completion.
The repository is available on GitHub, including the Jupyter notebook, which can be viewed on GitHub, as a gist, or below.
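For a flavour of the methods covered, here is a minimal NumPy sketch (not the notebook's actual code) of singular value thresholding, one popular convex approach to matrix completion:

```python
import numpy as np

def svt_complete(M_obs, mask, tau=5.0, step=1.2, iters=500):
    """Singular value thresholding (Cai-Candes-Shen) for matrix completion.

    M_obs : matrix with observed entries filled in (others arbitrary)
    mask  : boolean array, True where entries are observed
    """
    Y = np.zeros_like(M_obs)
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        # Soft-threshold the singular values of the dual iterate
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Dual ascent on the observed entries only
        Y += step * mask * (M_obs - X)
    return X

# Try to recover a small rank-1 matrix from roughly 60% of its entries
rng = np.random.default_rng(0)
A = np.outer(rng.standard_normal(20), rng.standard_normal(20))
mask = rng.random(A.shape) < 0.6
A_hat = svt_complete(A * mask, mask)
```

The soft-thresholding step is what promotes low rank; the dual update enforces agreement on the observed entries.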
Reflection: 2018 BC Data Science Workshop
Created Wednesday June 27, 2018
This is a summary and reflection from co-organizing the second iteration of the BC Data Workshop.
The BC Data Workshop was hosted June 4 – 8 2018 on the UBC campus, co-organized with IAM Director Brian Wetton, supported by PIMS.
The workshop featured a wide array of problem material from industry in the lower mainland, channeled as week-long data science projects led by industry mentors.
Project 1: SSR Mining
Participants analyzed over 50 GB of data for particular heavy equipment used in an ongoing mining operation to discover patterns in equipment failure for predictive maintenance strategies. While the data was very clear and well-formatted, the mining workflow was one of the more complicated elements to understand; the size of the data made the project unwieldy at times.
Project 2: St. Paul’s Hospital
Participants sifted through a sparse collection of cytokine expression data to search for patterns linking these expressions, resilience to septic shock, and genetic mutation data from the patient genomes. The group may be in the process of pursuing publication of their work.
Project 3: SNC-Lavalin
Participants used current geographic data for shipping routes in the Juan de Fuca Strait to infer past routes that ships have taken, as a means of ascertaining more detailed knowledge about ship emissions near the Lower Mainland and Gulf Islands.
Project 4: Comm100
Participants were faced with the challenge of generating sentiment analyses and automatic knowledge bases from a paucity of chat conversations provided by this IM platform.
Project 5: CloudPBX
By far the most involved industry mentor, CloudPBX led their team through an investigation of VoIP call quality, exploring novel means of assessing call quality in the modern era.
The survey responses generally indicated a positive workshop. Approximately 50% of participants responded to the survey; of these, only one individual did not like the workshop. Of those who enjoyed the workshop overall, the majority (65%) liked their project.
These responses are complicated by the fact that one mentor dropped out at the last minute: the company skipped the morning presentations and the Friday student presentations without notice, and ignored subsequent attempts to arrange meetings. In fact, Mentor Support was tied for second place with Teamwork (60%) among the most important factors affecting people’s rating of project quality, behind only Learning Opportunities (85%). Accordingly, it is not surprising that the groups who gave lower ratings to their project quality (data withheld) were also the groups who lacked adequate mentor support and/or were observed to have a less cohesive group dynamic. Some of these patterns appear when correlating the factors in project-quality ratings across participants.
Primarily, those who valued mentor support were likely also to value teamwork. Matching this with comments submitted by respondents suggests that mentor support and teamwork were key factors in determining the perceived project quality.
Similarly, those who valued Learning Opportunities were likely to value Background Knowledge (i.e., opportunities for learning problem-specific domain knowledge like Genetics for Project 2 or Traceroutes and Internet architecture for Project 5) and Technical Knowledge (i.e., opportunities to learn problem-specific mathematical and data scientific tools like Kalman filters for Project 3 or Natural Language Processing for Project 4).
It is interesting to note that two different mentalities might be visible in this plot. Because Learning Opportunities is negatively correlated with Complexity, it may mean (for example) that some entered the workshop to crunch on a hard challenge, while others entered the workshop to soak up as much new knowledge as possible.
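For the curious, the correlation computation behind observations like these is straightforward. The sketch below uses made-up binary responses (the factor names are the survey's, but the numbers are illustrative, not the actual data):

```python
import numpy as np

# Hypothetical binary responses: rows are respondents, columns mark the
# factors each respondent said affected their project-quality rating.
factors = ["Mentor Support", "Teamwork", "Learning Opportunities",
           "Background Knowledge", "Technical Knowledge", "Complexity"]
responses = np.array([
    [1, 1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
])

# Pairwise Pearson correlation between factor columns
corr = np.corrcoef(responses, rowvar=False)
```

Each off-diagonal entry of `corr` is the correlation between two factors across respondents, which is exactly what a clustered heatmap of the survey would display.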
In that sense, there was one major difference between this year’s workshop and last year’s: there was no pre-workshop this year comprised of lecture content and exercises.
Of those who wanted a pre-workshop (68% of respondents), almost 70% had expected there to be one; only 30% of respondents did not want a pre-workshop. I think that offering lecture content and exercises prior to the problem-solving session would allow for greater confidence when starting the project: those less familiar with data science tools would have a chance to pick up the thread for solving their problem.
This is especially true since another major change from last year was the inclusion of undergraduate participants, as well as participants from fields outside of mathematics. Naturally, undergraduates were less comfortable engaging in a research-type setting, especially if they didn’t already feel they had the necessary tools or adequate mentor support.
About 30% of respondents felt un- or under-prepared for the workshop. Fortunately, nearly all respondents reported that they learned a lot during the workshop, which is, for all intents and purposes, the primary goal!
For now this reflection remains incomplete - to be updated as time allows.
Reflection: 2017 BC Data Science Workshop
Created Tuesday September 12, 2017
This is a brief summary and some incomplete reflections from my time as the workshop TA for the 2017 BC Data Science Workshop.
The BC Data Workshop was hosted August 9 – 25 in downtown Vancouver at UBC Robson Square, co-organized by myself, Brian Wetton and Lee Rippon from the IAM, with support from PIMS and SFU Mathematics.
The workshop featured a wide array of material from experts across the west coast, structured into three sections: a 3-day pre-workshop intensive; a week of introductory material; and a week-long data science project led by industry mentors.
The pre-workshop was hosted by Patrick Walls and Brian Wetton. It covered software carpentry material like GitHub, introductory Python and bash, and Jupyter notebooks. It also covered elements of scientific computing and machine learning like gradient descent, principal component analysis, fast Fourier transforms, convex optimization and linear regression.
The mornings of the first week were comprised of introductory lectures by Isabell Konrad (UC Berkeley), Yinshan Zhao (BC Health) and Michael Reid (Amazon). Topics covered elements of machine learning and data science; hypothesis testing and experimental design; and modern software tools (SQL, Hadoop, Flink, Kafka, ElasticSearch, etc.). In the afternoons, participants completed mini-projects whose focus was related either to the morning material, or to tools that would be integral to a project in the second week. The mini-projects covered regression, neural networks, matrix completion, data wrangling and exploration, and distributed and parallel computing with Apache Spark and TensorFlow on GPUs.
The mornings of the second week were comprised of “advanced topics” — featuring professors from UBC or SFU, speaking on elements of their research which rely on or develop tools from data science. The rest of the time was devoted to group projects. Teams of 5 – 7 worked on projects brought by industry mentors whose solutions would feature elements of data science. This included two kinds of image processing; nonlinear function approximation for video compression bitrate analysis; deep neural networks for genetics research; and analytics from vehicle time series messages.
My role as TA
It was most rewarding to learn how to think in different modes with teams of diverse individuals working on very diverse projects. One evening, I got to help create an interpretable way to balance highly imbalanced data; the next morning I had to help design a way to train a deep model using images on a remote server. After that, I contributed to a brainstorming session on how to best cluster data that was too big for memory. In general, I got to contribute ideas for project strategies, available software tools, and design/algorithm troubleshooting.
An unintended consequence of the workshop saw me designing three of the five mini-projects in the first week.
Parallel and distributed computing with pyspark and tensorflow-gpu
Designing these consumed a lot of effort and time - too much given the other organizational duties. Consequently, there are places where the work remains unpolished. Nevertheless, designing these was a great opportunity to explore how to communicate new concepts in an immersive, interactive way.
We may have imposed too great a constraint on the time the teams had to work on their industry projects. Effectively, they received their projects Monday afternoon, attended advanced lectures through the week, and had to present on Friday afternoon. A longer duration would clearly have allowed significantly greater progress. That being said, each of the teams made an impressive amount of progress, discussed below.
About the projects
Data-driven modelling of video compression
This project had aggregate camera data and sought to use this data to determine a function that could accurately predict the bitrate for given camera settings. The group tried several convex optimization approaches, and settled on a particular flavour of random forest, boasting accuracy well above what is considered “industry standard”.
Risk-based platform for accident prevention
I helped to design a skeleton for this project, wherein the team would take raw photo data from BC Safety Authority inspections and use it to predict compliant and non-compliant objects using a combination approach of active learning and transfer learning. The training would have had to be performed by downloading batches of images from an AWS S3 bucket, since there were too many images to fit on the disk of the VM we were using. It is likely that such a model will require more complex structure than a standard image recognition ConvNet, such as topic modelling or LDA.
The team reduced the complexity of this task by classifying images of flowers using a transfer learning approach. In this approach, they used bottlenecking to generate feature vectors by running the images through the bottom several layers of a VGG16 network. Then, they trained a binary classifier on this markedly smaller feature space to discriminate between species of flowers. This particular approach has immediate generalization potential to the more complex problem of discerning compliant and non-compliant objects from inspection photos.
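The second stage of that pipeline is cheap once the features are fixed. In the sketch below, random vectors stand in for the VGG16 bottleneck features (which in the real project came from the network's convolutional layers), and a simple logistic-regression classifier is fit by gradient descent; this is illustrative, not the team's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for bottleneck features: two classes of 512-d vectors
# separated by a mean shift (real features would come from VGG16).
n, d = 200, 512
X = np.vstack([rng.standard_normal((n, d)) + 0.5,
               rng.standard_normal((n, d)) - 0.5])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic regression on the feature space, fit by gradient descent
w, b = np.zeros(d), 0.0
for _ in range(300):
    z = np.clip(X @ w + b, -30.0, 30.0)   # clip logits for stability
    p = 1.0 / (1.0 + np.exp(-z))          # predicted probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

train_acc = np.mean((p > 0.5) == y)
```

The point of bottlenecking is precisely this split: the expensive convolutional pass happens once per image, after which the classifier trains in seconds.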
Elucidating enhancer-promoter gene expression using ConvNets
This group developed a convolutional neural network whose first layer filters, after [agnostic] training, were composed of significant and known gene promoter regions. This convolutional neural network was designed to predict the efficacy of gene expression for particular enhancer-promoter pairs. This team made significant progress in the time they had, and made steps toward a reverse-complement invariant machine learning model.
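The reverse-complement invariance they worked toward has a neat expression in the one-hot encoding typically used for such models: with channels ordered A, C, G, T, taking the reverse complement of a sequence corresponds to flipping its one-hot matrix along both axes. A small illustrative sketch (not the team's code):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string with channel order A, C, G, T."""
    return np.eye(4)[[BASES.index(b) for b in seq]]

def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

seq = "ACGGT"
# Complementing swaps A<->T and C<->G, i.e. reverses the channel axis;
# reversing the sequence reverses the position axis.
flipped = one_hot(seq)[::-1, ::-1]
```

A model whose filters respect this double flip scores a sequence and its reverse complement identically, which is the biological desideratum.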
Data insights from vehicle time series messages
The goal of this project was to learn novel insights from vehicle time series messages, logged by in-car devices that record the car’s state, position, etc. This team made significant progress toward the problem of discriminating multiple drivers of the same vehicle. For this task, the team used Bayesian methods and k-means.
High-resolution shoreline data for flood protection and environmental conservation
This project sought to classify land type from photogrammetric drone image data. The data was in the format of an unstructured sparse point cloud, listing colour as well as latitude, longitude and elevation. The team took the approach of looking at local variation in elevation, as well as point colour. The team used an out-of-memory mini-batch k-means algorithm which clustered features generated from a nearest neighbours tree. My own approach to this problem can be seen here.
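The mini-batch k-means idea can be sketched in a few lines of NumPy (in the style of Sculley's algorithm; this is illustrative, not the team's code): each centre is maintained as a running mean of the points assigned to it, so only one batch need be in memory at a time.

```python
import numpy as np

def minibatch_kmeans(stream_batches, k, dim, seed=0):
    """Mini-batch k-means: each batch nudges the centres with a
    per-centre learning rate of 1/count, so the full data set never
    needs to fit in memory."""
    rng = np.random.default_rng(seed)
    centres = rng.standard_normal((k, dim))
    counts = np.zeros(k)
    for batch in stream_batches:
        # Assign each point in the batch to its nearest centre
        d2 = ((batch[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j, x in zip(labels, batch):
            counts[j] += 1
            centres[j] += (x - centres[j]) / counts[j]  # running mean
    return centres

# Stream batches drawn from two well-separated blobs
rng = np.random.default_rng(1)
def batches():
    for _ in range(50):
        yield np.vstack([rng.normal(5, 0.5, (20, 3)),
                         rng.normal(-5, 0.5, (20, 3))])

centres = minibatch_kmeans(batches(), k=2, dim=3)
```

In the actual project the batches came from disk rather than a generator of Gaussian blobs, and the features were derived from a nearest-neighbours tree, but the update rule is the same.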
Future [Mathematics] Leaders
Created Wednesday January 15, 2020
This post was written as part of the Math section for the 2019-2020 Future Science Leaders program. If you’re interested in using any of the resources below for your own outreach program, there are two caveats. You are free to use any of my materials with proper attribution. However, some of the materials below were created or co-created by Matt Coles and for these you should e-mail us to check.
Day 3 — Coding in Python
Welcome to Day 3 of the Math section of Future Science Leaders! Today, you will get to explore a series of activities that will have you coding and problem solving in Python. If you don’t have Python already installed on your laptop, no worries! Simply go to repl.it. There should be no need to make an account. We’ll be walking around to make sure you’re able to get up and running.
Before you start coding, have a look through the Introduction and Flowchart documents. These will give you some idea of where to begin, and what comes next.
If you’re new to Python or coding, or if you could use a refresher, then try exploring the Numbers worksheet next. Afterward, check out Conditional Statements, and then Loops. If you know how to program in another language, and want to see one of many things that makes Python special, you can check out Loops2.
Once you feel comfortable with the content above, you might wish to try out the Choose Your Own Adventure game. If you think you might like to build your own, then why don’t you try making your very own Choice Game?
If these challenges still aren’t hard enough for you, then you can try out some of our challenge problems. For example (in order of difficulty):
Three ultra-challenging problems (attribution unknown)
Note: to attempt the Caesar challenge, you’ll first want to “Fork” the associated FSL20d2 REPL before getting started. Feel free to ask us for help along the way! If some of the Python syntax or notation seems “weird”, that’s because we haven’t talked about it. Come ask us for some pointers!
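If you want a hint for the Caesar challenge after giving it an honest try, the core idea fits in a few lines (an illustrative outline, not the worksheet's solution):

```python
def caesar(text, shift):
    """Shift each letter by `shift` places, wrapping around the
    alphabet; non-letters pass through unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(caesar("Hello, world!", 3))   # → "Khoor, zruog!"
```

Note that decryption is just encryption with the negative shift, so one function handles both directions.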
Why wavelets are cool
Created Friday November 4, 2016
Slides from a talk in November 2016 at the Grad Seminar hosted by the UBC Mathematics Graduate Committee.
Note: This post is testing a Markdown script plugin for Google Sites using an old blog post for a Python installation tutorial that I wrote.