QCon San Francisco 2021, November 1-5, 2021 | Mind the Software Gap: How We Can Operationalize Privacy & Compliance

This presentation is now available to view on InfoQ.com

What You’ll Learn

Hear some of the ways GDPR and CCPA can influence software.
Learn about some of the practical solutions to protecting data privacy and security.

Abstract

With legislation like GDPR and CCPA, it has become newly urgent for organizations to understand internal and external data flows. In the push towards compliance, software organizations have been discovering just how difficult it is to maintain an up-to-date picture of data inventory and data flows. A major challenge is that modern software teams are developing and deploying software quickly and in decentralized ways. When each code change can cause data flow changes, building a clear, up-to-date map of data flows becomes more and more elusive. The state of the art (using human processes; catching data as it flows to untrusted locations) leaves gaps.

Understanding software behavior makes up a big part of the compliance gap--and automated techniques can help. In this talk, I discuss just what it could look like to get visibility into data flows and hint at what kinds of solutions could get us there.

Question:

You were assistant professor at Carnegie Mellon and now you're CEO of a company. What led to that transition?

Answer:

For the last 10+ years, I'd been going after this problem of how we can build tools to help software teams better understand their software. I had decided that security and privacy were going to be one of the big application domains. The whole time I really tried to keep an eye on what was the best way to make impact building software tools for security and privacy. For the first few years, it was pretty clear that academia was the best place to do this because the world didn't care so much about it yet and a lot of the ideas I was thinking about weren't ready for primetime. But in the last couple of years, it's become really clear that companies care more about it, consumers care more about it, regulators care more about it, and the tooling ecosystem has gotten to a place where we can start building tools to automate some of the software problems that people have been treating as process problems. What data are my teams using, where are they sending that data, and what are they doing with that data? For me, it wasn't so much a decision to switch between academia and running a company. Every year I ask myself, is what I'm doing right now the best way to make impact doing this? When GDPR came along, I thought about this question and concluded that building a product was the way to bring the tooling that I wanted to the biggest number of developers.

Question:

What's the goal of this talk?

Answer:

Part of it is education. I talk to a lot of companies these days about their problems with monitoring their data. The state of the art seems to be asking people what they're doing with their data and telling them to fill it into a spreadsheet. I want to reach as many people as possible and tell them it doesn't have to be this way. My goal here is to, step one, tell people, you can have nicer things than what you have now, and then, step two, lay out, here are some options about other things you can do. Hopefully, one of the conclusions people will come to is they need the kinds of tools that we're building, because I think that especially when we started working in this space, GDPR was pretty new, people were just starting to get their heads around this.

Question:

Can you give me some examples?

Answer:

There's a class of problems that I was hearing about over and over again that inspired a large part of our solution now, which is developers accidentally writing passwords into logs. When I first started hearing about this problem, I was skeptical that it was really a problem. I was like, it doesn't seem like that big of a deal. Why haven't you solved it yet? Over the last year, I’ve seen this becoming an increasingly serious problem. Because of GDPR, companies now need to notify their users to change their passwords if they discover passwords in logs, and there's been a lot more discussion around this question. It's also analogous to many other problems that might sound like bigger deals, like sending credit card numbers or health information to Twilio or Salesforce, or using data for purposes it’s not supposed to be used for. The most recent was Twitter using two-factor authentication phone numbers for advertising. I really like this Hacker News thread from a few months ago when RobinHood had the passwords-in-logs problem, or the leaked-passwords problem, and someone said, how can they possibly do this? Everyone said, here's one time I accidentally logged passwords to Apache; here’s another time passwords ended up in one of my crash dumps. There are just many ways to do it. The reason I think passwords in logs is a really good example is that it can happen at any time and passwords are hard to detect. Your password could be anything. It's really hard to pattern match: like, OK, if you see a bunch of stuff with three dashes in between, that's the password. This is something that if you're doing code review, you'll probably miss, because it's usually not like log-password, but log my user, log something that contains some part of a user in it. At Akita, we’re very focused on solving this class of problems: how do we take sensitive data and provide tools that tell you this is where it goes?
We’re building tools to support better software practices to move faster without leaking data.
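
To make the "log my user" case concrete, here is a minimal Python sketch of how a password can slip into a log without any line ever mentioning it by name, along with one common mitigation. It is an illustration only and does not represent Akita's tooling or anything shown in the talk. The `User` and `SafeUser` classes, the `capture_log` helper, and the sample values are all hypothetical.

```python
import io
import logging
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user record; the generated repr includes every field."""
    name: str
    email: str
    password: str


@dataclass
class SafeUser:
    """Same record, but the password is excluded from the generated repr."""
    name: str
    email: str
    password: str = field(repr=False)


def capture_log(obj) -> str:
    """Log obj through a standard logging handler and return the emitted text."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{type(obj).__name__}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        # Nothing on this line says "password", so a reviewer can easily miss it.
        logger.info("login attempt for %s", obj)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


leaky = capture_log(User("ada", "ada@example.com", "hunter2"))
safe = capture_log(SafeUser("ada", "ada@example.com", "hunter2"))
print(leaky, end="")  # the password appears in the log line
print(safe, end="")   # the password is omitted
```

Excluding the field from the repr only closes this one path. The email address is still logged, and the password can still reach a log through crash dumps, serializers, or request bodies, which is why pattern matching and code review alone tend to leave gaps.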

Question:

In your talk, will you be talking about the product or the practices?

Answer:

We're still in stealth mode, so we're not talking about the product publicly yet. I'm going to outline the classes of problems that you should be worried about, some ways you can mitigate those concerns today, as well as where the gaps still are. If you are interested in still filling those gaps, you should come talk to me.

Question:

What do you want someone to leave the talk with?

Answer:

I would love for someone to leave the talk having gotten a better understanding of the data flow problems around GDPR and of how to reason about software in its context. One of my goals is to also have people come away knowing what isn’t likely to work on its own. Previously most legislation around software was not so tied to the software itself. GDPR was the first legislation that said, if you use this data over there or for something else, then that's problematic. Which means you have to know that your data ended up over there or ended up being used for something else.

Question:

Do you give specific ways, specific things to look at for privacy?

Answer:

Yes. My goal is to give an idea of how to think about responsible data practice if you wanted to do right by GDPR--and by your customers. Some of the solutions are going to be difficult to implement with the tools we have today. In fact, one of the main motivations for starting Akita was that GDPR is ahead of what we are able to support technologically. There are two possible outcomes. One is that everyone just ignores it and the state of privacy is where it was before, maybe even worse. Or people start following GDPR--but we're gonna need new tools to take us there, because reading code by hand and understanding data flows by hand is just not feasible. But since we can’t do all that in a single talk, I’m planning to give the audience actionable things they can start doing today.

Speaker: Jean Yang

Founder and CEO @AkitaSoftware

Jean Yang is the founder and CEO of Akita Software, an enterprise data monitoring company. She was previously an Assistant Professor in the Computer Science Department at Carnegie Mellon University, where she led a research group working on techniques for automating software-based security and privacy. She has also worked in this space during her PhD at MIT, at Microsoft Research, and at Facebook. In 2016, the MIT Technology Review named her one of the Top 35 Innovators Under 35 for her work in this area.