These are my notes for the seminars that happen in the Theory Group at The
University of Toronto. Many thanks to Professor Allan Borodin for allowing me to
attend the Theory Group seminars and helping out.
A PDF of these notes is available at https://rishit-dagli.github.io/cs-theory-notes/main.pdf.
An online version of these notes is available at https://rishit-dagli.github.io/cs-theory-notes.
The Theory Group focuses on theory of computation. The group is interested in
using mathematical techniques to understand the nature of computation and to
design and analyze algorithms for important and fundamental problems.
The members of the theory group are all interested, in one way or another, in the limitations of computation: What problems are not feasible to solve on a computer? How can the infeasibility of a problem be used to rigorously construct secure cryptographic protocols? What problems cannot be solved faster using more machines? What are the limits to how fast a particular problem can be solved or how much space is needed to solve it? How do randomness, parallelism, the operations that are allowed, and the need for fault tolerance or security affect this?
7th October 2022
The related paper: A near-cubic lower bound for 3-query locally decodable codes from semirandom CSP refutation by Alrabiah et al. [1]. Seminar by Peter Manohar. See also [2] [3].
A code $C : \{0,1\}^k \to \{0,1\}^n$ is a $q$-locally decodable code ($q$-LDC) if one can recover any chosen bit $b_i$ of the $k$-bit message $b$ with good confidence by randomly querying the $n$-bit encoding $x$ on at most $q$ coordinates. Existing constructions of $2$-LDCs achieve blocklength $n = 2^{\Theta(k)}$, and lower bounds show that this is in fact tight. However, when $q = 3$, far less is known: the best constructions have $n = 2^{k^{o(1)}}$, while the best known lower bounds, which have stood for nearly two decades, only show a quadratic lower bound of $n \geq \tilde{\Omega}(k^2)$ on the blocklength.
In this talk, we will survey a new approach to prove lower bounds for LDCs using recent advances in refuting semirandom instances of constraint satisfaction problems. These new tools yield, in the $3$-query case, a near-cubic lower bound of $n \geq \tilde{\Omega}(k^3)$, improving on prior work by a polynomial factor in $k$.
Take codes $C : \{0,1\}^k \to \{0,1\}^n$ with messages $b \in \{0,1\}^k$ and codewords $x = C(b)$.
Codewords are read by the decoder $\mathrm{Dec}$, which on input $i \in [k]$ queries at most $q$ coordinates of $x$ and outputs a guess for $b_i$.
Ask the question: what is the best possible rate for a $q$-LDC, i.e., how small can the blocklength $n$ be as a function of $k$?
| $q$ | Lower Bound | Upper Bound |
| --- | --- | --- |
| $2$ | $2^{\Omega(k)}$ | $2^{O(k)}$ |
| $3$ | $\tilde{\Omega}(k^2)$ | $2^{k^{o(1)}}$ |
| $q$, even | $\tilde{\Omega}\big(k^{q/(q-2)}\big)$ | $2^{k^{o(1)}}$ |
| $q$, odd | $\tilde{\Omega}\big(k^{(q+1)/(q-1)}\big)$ | $2^{k^{o(1)}}$ |
Focus on the case $q = 3$, where we have gotten better bounds:
(1) $n \geq \tilde{\Omega}(k^2)$.
In [1], they show that a better lower bound can be found than these existing ones for $q = 3$:
(2) $n \geq \tilde{\Omega}(k^3)$.
The main result is that every $3$-LDC must have blocklength $n \geq \tilde{\Omega}(k^3)$ (Theorem 1).
Semi-random CSP refutation comes to our aid to prove this! The intuitive way to put this theorem is that a $q$-LDC lower bound is the same as refuting "LDC" $q$-XOR.
The idea:
a $q$-LDC lower bound is the same as refuting "LDC" $q$-XOR.
We can see that the decoder can be arbitrary, but WLOG we can assume there are $q$-uniform hypergraphs $H_1, \dots, H_k$ on vertex set $[n]$ such that for every $i \in [k]$:
each $H_i$ is a matching with $|H_i| \geq \delta n$, and $\mathrm{Dec}(i)$ picks $C \in H_i$ uniformly at random and outputs $\bigoplus_{v \in C} x_v$.
One such example is the Hadamard code:
(3) $x_a = \langle a, b \rangle \bmod 2$ for every $a \in \{0,1\}^k$, so $n = 2^k$.
Can think of this as: to decode $b_i$, the vertices $a$ and $a \oplus e_i$ are connected (they form an edge of the matching $H_i$), since $x_a \oplus x_{a \oplus e_i} = b_i$.
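To make the 2-query case concrete, here is a minimal Python sketch (my own, not from the talk) of the Hadamard code and its 2-query decoder; the function names are made up for illustration.

```python
import random

def all_vectors(k):
    """All vectors in {0,1}^k, identified with the integers 0..2^k - 1."""
    return [[(idx >> j) & 1 for j in range(k)] for idx in range(2 ** k)]

def index_of(a):
    """Position of the query point a in the codeword."""
    return sum(bit << j for j, bit in enumerate(a))

def hadamard_encode(b):
    """Encode a k-bit message b as the 2^k-bit Hadamard codeword:
    one bit <a, b> mod 2 for every a in {0,1}^k."""
    return [sum(ai * bi for ai, bi in zip(a, b)) % 2 for a in all_vectors(len(b))]

def decode_bit(x, i, k):
    """2-query local decoder for b_i: pick a random a and output
    x_a XOR x_{a + e_i}.  The pair {a, a + e_i} is one edge of the
    matching H_i from the notes."""
    a = [random.randint(0, 1) for _ in range(k)]
    a_shift = a[:]
    a_shift[i] ^= 1  # flip coordinate i, i.e. a + e_i
    return x[index_of(a)] ^ x[index_of(a_shift)]

if __name__ == "__main__":
    b = [1, 0, 1, 1]
    x = hadamard_encode(b)
    # With no corruption the decoder is always correct.
    assert all(decode_bit(x, i, len(b)) == b[i] for i in range(len(b)))
    print("decoded", [decode_bit(x, i, len(b)) for i in range(len(b))])
```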
Matching vector codes are the best known constructions for $q = 3$, achieving the subexponential blocklength in the table above.
We suppose that our code is linear and that there exist $q$-uniform hypergraphs $H_1, \dots, H_k$ on $[n]$.
We also know that: each $H_i$ is a matching with $|H_i| \geq \delta n$, and $\mathrm{Dec}(i)$ picks $C \in H_i$ uniformly at random and outputs $\bigoplus_{v \in C} x_v$.
So, we start by considering a $q$-XOR instance $\Psi_b$: the variables are $x_1, \dots, x_n$, and for each $i \in [k]$ and each $C \in H_i$ we add the constraint $\bigoplus_{v \in C} x_v = b_i$.
We can write down the maximum fraction of satisfiable constraints: $\mathrm{val}(\Psi_b) = 1$ for any $b$, since the codeword $x = C(b)$ satisfies every constraint.
It is sufficient now if we can argue that $\Psi_b$ is unsat with high probability for a random $b$ when $n$ is too small (for $q = 3$, when $n \ll \tilde{O}(k^3)$); this contradiction is exactly the lower bound.
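As a sanity check of the claim that $\mathrm{val}(\Psi_b) = 1$ for a genuine (linear) LDC, here is a small brute-force Python sketch; it uses the 2-query Hadamard code purely because it is small, and all helper names are mine.

```python
import itertools

def hadamard_hypergraphs(k):
    """The decoding matchings of the Hadamard code: vertex set {0,1}^k
    (indexed 0..2^k - 1), and H_i pairs every a with a XOR e_i."""
    n = 2 ** k
    H = [[(a, a ^ (1 << i)) for a in range(n) if a < (a ^ (1 << i))]
         for i in range(k)]
    return H, n

def ldc_xor_instance(H, b):
    """Psi_b: one constraint (C, b_i) per hyperedge C in H_i, meaning
    XOR_{v in C} x_v = b_i."""
    return [(C, b[i]) for i, H_i in enumerate(H) for C in H_i]

def val(constraints, n):
    """Maximum fraction of satisfiable constraints, by brute force over x."""
    best = 0.0
    for x in itertools.product([0, 1], repeat=n):
        sat = sum(1 for C, rhs in constraints
                  if sum(x[v] for v in C) % 2 == rhs)
        best = max(best, sat / len(constraints))
    return best

if __name__ == "__main__":
    k = 3
    H, n = hadamard_hypergraphs(k)
    for b in itertools.product([0, 1], repeat=k):
        assert val(ldc_xor_instance(H, list(b)), n) == 1.0
    print("val(Psi_b) = 1 for every b, as expected for a genuine LDC")
```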
Now we need to refute XOR; there are many ways to argue unsatisfiability of an XOR instance. One reason why we cannot use standard probabilistic approaches (e.g., a union bound over all $2^n$ assignments) here is that $\Psi_b$ only has $k$ bits of randomness, namely the random message $b$.
One way we can have some success here is to use a refutation algorithm $\mathcal{A}$: given an XOR instance $\Psi$, it outputs either "unsat" or "don't know". With this, the guarantee then would be that $\mathcal{A}$ never answers "unsat" on a satisfiable instance, which is similar to saying that if $\mathcal{A}(\Psi) = $ "unsat" then $\mathcal{A}$ refutes $\Psi$, i.e., certifies $\mathrm{val}(\Psi) < 1$. The ideal goal would be to refute fully random $q$-XOR with sufficiently many constraints (roughly $n^{q/2}$ suffices for known polynomial-time algorithms) with high probability.
However, we take a look at semi-random XOR, where the constraint hypergraph is arbitrary (worst case) but the right-hand sides are random. Our refutation algorithm and the guarantee will still be the same: $\mathcal{A}$ outputs "unsat" or "don't know", with the guarantee that an "unsat" answer certifies $\mathrm{val}(\Psi) < 1$.
So, now we generate semi-random constraints from the LDC: the hypergraph $\bigcup_i H_i$ is worst case, and the right-hand sides come from the random message. The equation we have is:
(4) $\bigoplus_{v \in C} x_v = b_i$ for every $i \in [k]$ and every $C \in H_i$.
And we also already know that each $H_i$ is a matching with $|H_i| \geq \delta n$. And, $b$ is uniformly random. So $\Psi_b$ is almost semi-random: the right-hand sides are random bits, but all constraints coming from the same $H_i$ share the same bit $b_i$.
Thus, we have shown Part 1 of the proof of Theorem 1.
The $q$-LDC XOR instance is encoded by the constraints $\bigoplus_{v \in C} x_v = b_i$ for $C \in H_i$, $i \in [k]$.
We now have the goal to argue that $\Psi_b$ is unsat with high probability for random $b$ when $n$ is too small, i.e., that the fraction of constraints satisfied by any fixed assignment $x$ is bounded away from $1$.
Here $\mathrm{val}(\Psi_b)$ is (writing $x \in \{\pm 1\}^n$ and $b \in \{\pm 1\}^k$ multiplicatively):
(5) $\mathrm{val}(\Psi_b) = \frac{1}{2} + \frac{1}{2m} \max_{x \in \{\pm 1\}^n} \sum_{i=1}^{k} b_i \sum_{C \in H_i} \prod_{v \in C} x_v$, where $m = \sum_i |H_i|$ is the number of constraints.
This makes our goal to be to certify, with high probability over $b$, that:
(6) $\max_{x \in \{\pm 1\}^n} \sum_{i=1}^{k} b_i \sum_{C \in H_i} \prod_{v \in C} x_v \leq o(m)$.
We will now try to refute $\Psi_b$. With Equation 5 and Equation 6, to refute $\Psi_b$ is like showing:
(7) an efficiently certifiable upper bound of $o(m)$ on the polynomial $f_b(x) = \sum_{i} b_i \sum_{C \in H_i} \prod_{v \in C} x_v$.
The idea is to design a matrix $A$ so that refuting $\Psi_b$ reduces to bounding the spectral norm of $A$.
As shown by Wein et al. [4], the matrix can be indexed by sets $S, T \in \binom{[n]}{\ell}$ (a level-$\ell$ "Kikuchi" matrix). Assign, for each hyperedge $C$, a matrix $A_C$ with $A_C[S, T] = 1$ exactly when $S \Delta T = C$, and let $A = \sum_{i} b_i \sum_{C \in H_i} A_C$; the test vector $z$ with $z_S = \prod_{v \in S} x_v$ is simply (a restriction of) the tensor product $x^{\otimes \ell}$.
We need to now be able to answer how to set $\ell$. First, note that
(8) $z^\top A_C z = \sum_{S, T :\, S \Delta T = C} x_S x_T = D \prod_{v \in C} x_v$,
which shows that we are actually using the symmetric difference here: $x_S x_T = \prod_{v \in S \Delta T} x_v$.
We say that $A_C[S, T] = 1$ if $S \Delta T = C$, so summing over all constraints,
(9) $z^\top A z = D \sum_{i} b_i \sum_{C \in H_i} \prod_{v \in C} x_v = D \cdot f_b(x)$.
Here $D$ is the number of pairs $(S, T)$ where $S \Delta T = C$, i.e., $D = \binom{q}{q/2}\binom{n-q}{\ell - q/2}$ for each fixed $C$ (this requires $q$ to be even).
Simplifying an earlier statement, we can also say from here that to refute $\Psi_b$ it suffices to certify that $\max_x z^\top A z \leq o(Dm)$.
For this, let $N = \binom{n}{\ell}$, so that $\max_x z^\top A z \leq N \|A\|_2$, and set $\ell$ appropriately (its value is chosen later to balance the parameters).
Note that the way we defined $A$ here, it only depends on $b$ (the hypergraphs are fixed), so we can treat $A = \sum_i b_i A_i$ with $A_i = \sum_{C \in H_i} A_C$ as a random matrix over the choice of $b$.
Also we know $\mathbb{E}_b[A] = 0$ and the summands $b_i A_i$ are independent.
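The following toy Python sketch checks the identity behind Equations 8 and 9 for a single even-size hyperedge, using the symmetric-difference definition of the Kikuchi matrix as written above (this is my reconstruction of the convention; parameter names are mine).

```python
import itertools
import random
from math import comb

def prod_over(x, S):
    """prod_{v in S} x_v for x in {-1, +1}^n."""
    p = 1
    for v in S:
        p *= x[v]
    return p

def kikuchi_matrix(n, ell, C):
    """Level-ell Kikuchi matrix of a single even-size hyperedge C:
    rows/columns are the ell-subsets S of [n], and A_C[S, T] = 1
    exactly when the symmetric difference S Delta T equals C."""
    subsets = list(itertools.combinations(range(n), ell))
    Cset = set(C)
    A = [[1 if set(S) ^ set(T) == Cset else 0 for T in subsets]
         for S in subsets]
    return subsets, A

if __name__ == "__main__":
    random.seed(1)
    n, ell, q = 6, 2, 4
    C = (0, 1, 2, 3)                       # one q-uniform hyperedge
    subsets, A = kikuchi_matrix(n, ell, C)
    # Each hyperedge is hit by D = (q choose q/2)(n-q choose ell-q/2) pairs.
    D = comb(q, q // 2) * comb(n - q, ell - q // 2)
    for _ in range(5):
        x = [random.choice([-1, 1]) for _ in range(n)]
        z = [prod_over(x, S) for S in subsets]      # z_S = prod_{v in S} x_v
        quad = sum(z[i] * A[i][j] * z[j]
                   for i in range(len(subsets))
                   for j in range(len(subsets)))
        assert quad == D * prod_over(x, C)
    print("z^T A_C z = D * prod_{v in C} x_v holds; here D =", D)
```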
We have already proven that $f_b(x) \leq \frac{N}{D} \|A\|_2$ for every $x$. It is also interesting to note that $A$ is a sum of independent mean-zero random matrices $b_i A_i$, and we still need to be able to show, with high probability over $b$, that $\|A\|_2$ is not too large.
Matrix Bernstein: with high probability over $b$,
$\|A\|_2 \lesssim \sqrt{\sigma^2 \log N} + R \log N$,
where $R = \max_i \|A_i\|_2$ is at most the maximum number of 1's in a row in any $A_i$, and $\sigma^2 = \big\| \sum_i A_i^2 \big\|_2$.
The expected number of 1's per row of $A_i$ is $\frac{|H_i| \cdot D}{N}$, which is roughly $\delta n \,(\ell/n)^{q/2}$ up to constants.
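For reference, one standard form of the matrix Bernstein inequality (this is the generic statement, as in Tropp's survey; the exact variant used in the talk may differ): if $X_1, \dots, X_k$ are independent, mean-zero, symmetric $N \times N$ random matrices with $\|X_i\|_2 \leq R$ almost surely, and $\sigma^2 = \big\| \sum_i \mathbb{E}[X_i^2] \big\|_2$, then
$\Pr\Big[ \big\| \sum_i X_i \big\|_2 \geq t \Big] \leq 2N \exp\!\Big( \frac{-t^2/2}{\sigma^2 + Rt/3} \Big)$,
so with high probability $\big\| \sum_i X_i \big\|_2 \lesssim \sigma \sqrt{\log N} + R \log N$.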
We can optimistically suppose that the maximum number of 1's in a row matches this expectation; however, this also needs the expectation itself to be reasonably large, which puts a condition on $\ell$. Then the Matrix Bernstein bound on $\|A\|_2$ is good enough, since the $R \log N$ term no longer dominates. Now take $\ell$ to balance the two terms. So, working out the parameters recovers the known lower bound of roughly $n \geq \tilde{\Omega}(k^{q/(q-2)})$ for even $q$.
Because the $H_i$ are matchings, a random row will have only a small number of 1's.
The idea now is to prune off all the bad (too-heavy) rows and columns in $A$ to get a matrix $B$ such that: $z^\top B z$ still certifies essentially the same bound on $f_b(x)$, and every row of $B$ has a small number of 1's. And now we can just use $B$ instead, which will prove the $q$-LDC lower bound for $q$ even.
Recall, the $q$-LDC XOR instance is encoded by the constraints $\bigoplus_{v \in C} x_v = b_i$ for $C \in H_i$, $i \in [k]$.
The goal is to argue that $\Psi_b$ is unsatisfiable with high probability for random $b$. And the idea is to design a matrix $A$ so that $z^\top A z$ is proportional to $f_b(x)$ and $\|A\|_2$ can be bounded.
The previous approach fails because the Kikuchi matrix from before requires $q$ to be even: the symmetric difference of two sets of the same size $\ell$ always has even size, so it can never equal an odd-size hyperedge.
One attempt is to represent rows and columns by sets of two different sizes, so that an odd symmetric difference becomes possible. However, this will only get us to the known quadratic bound of roughly $n \geq \tilde{\Omega}(k^2)$.
We need to derive more constraints: combining pairs of the original constraints gets us to many more constraints, so that each message bit is in many of them; these combinations have even arity, and they are the new constraints fed into the machinery.
The matrix is again indexed by pairs of sets $(S, T)$, now adapted to the derived constraints, and the calculation proceeds as before.
An optimistic approach is, again, to assume the maximum row weight is close to its expectation.
The row pruning trick would still work provided that no variable (or small set of variables) appears in too many of the derived constraints.
This proof for $q = 3$ is not generalizable to all odd $q$, and neither is a simple reduction to the even ($(q+1)$-LDC) case. This is particularly true because of the row pruning step.
14th October 2022
The related paper: Algorithms for the ferromagnetic Potts model on expanders by Carlson et al. [5]. Seminar by Aditya Potukuchi.
The ferromagnetic Potts model is a canonical example of a Markov random field from statistical physics that is of great probabilistic and algorithmic interest. This is a distribution over all $q$-colourings of the vertices of a graph where monochromatic edges are favored. The algorithmic problem of efficiently sampling approximately from this model is known to be #BIS-hard, and has seen a lot of recent interest. I will outline some recently developed algorithms for approximately sampling from the ferromagnetic Potts model on $d$-regular weakly expanding graphs. This is achieved by a significantly sharper analysis of standard "polymer methods" using extremal graph theory and applications of Karger's algorithm to count cuts that may be of independent interest. I will give an introduction to all the topics that are relevant to the results.
We start by defining some basic notation: for a graph $G = (V, E)$, an integer $q \geq 2$, and an inverse-temperature parameter $\beta$, the $q$-state Potts model is the distribution over colourings $\chi : V \to [q]$ with weight proportional to $e^{\beta \cdot m_G(\chi)}$, where $m_G(\chi)$ is the number of monochromatic edges under $\chi$.
Notice that for $\beta < 0$ it means that we take the antiferromagnetic case. Here we talk more about when $\beta > 0$, meaning it is ferromagnetic.
This could have quite some applications. We also know the limiting behaviour: for $\beta \to 0$ it means that we are doing a uniform $q$-colouring of $G$, and for $\beta \to -\infty$ we do a uniform proper colouring of $G$.
What we need to do is: given $G$ and $\beta$, efficiently sample a colouring from this distribution,
(10) $\mu_G(\chi) = \frac{e^{\beta \cdot m_G(\chi)}}{Z_G(q, \beta)}$.
We add the normalizing factor here, the partition function. Now we can also say,
(11) $Z_G(q, \beta) = \sum_{\chi : V \to [q]} e^{\beta \cdot m_G(\chi)}$.
The partition function of the model/distribution is very important from this point of view.
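To make the definitions concrete, here is a small brute-force Python sketch (mine, feasible only for tiny graphs) that computes $Z_G(q, \beta)$ and samples a colouring exactly.

```python
import itertools
import math
import random

def monochromatic_edges(edges, coloring):
    """m_G(chi): number of edges whose endpoints get the same colour."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])

def partition_function(n, edges, q, beta):
    """Z_G(q, beta) = sum over all q-colourings of exp(beta * m_G(chi))."""
    return sum(math.exp(beta * monochromatic_edges(edges, chi))
               for chi in itertools.product(range(q), repeat=n))

def sample_coloring(n, edges, q, beta):
    """Exact sample from mu_G by enumeration (only for tiny graphs)."""
    colorings = list(itertools.product(range(q), repeat=n))
    weights = [math.exp(beta * monochromatic_edges(edges, chi))
               for chi in colorings]
    return random.choices(colorings, weights=weights, k=1)[0]

if __name__ == "__main__":
    # A 4-cycle, q = 3 colours, ferromagnetic (beta > 0).
    n, edges = 4, [(0, 1), (1, 2), (2, 3), (3, 0)]
    q, beta = 3, 1.0
    print("Z =", partition_function(n, edges, q, beta))
    print("sample:", sample_coloring(n, edges, q, beta))
```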
Our problem is: given $G$ and $\beta$, efficiently sample a colouring from this distribution. We note two facts (exact sampling and exact counting are both intractable in general), which is why we relax the problem:
We now modify the problem as: given $G$ and $\beta$, efficiently sample approximately a colouring from this distribution.
An $\varepsilon$-approximation will have us sample from a law $\hat{\mu}$ that is close to $\mu_G$ in total variation distance, thus
(12) $d_{TV}(\hat{\mu}, \mu_G) \leq \varepsilon$.
We modify our original problem template to now be: given $G$, $\beta$ and $\varepsilon$, efficiently sample $\varepsilon$-approximately a colouring from this distribution.
A Fully Polynomial Almost Uniform Sampler (FPAUS) allows us to sample $\varepsilon$-approximately in time polynomial in $|V(G)|$ and $1/\varepsilon$.
Instead, a Fully Polynomial Time Approximation Scheme (FPTAS) gives a $(1 \pm \varepsilon)$-factor approximation to $Z_G(q, \beta)$ in time polynomial in $|V(G)|$ and $1/\varepsilon$.
We can also show, as a fact, that approximate counting (an FPTAS for $Z_G$) and approximate sampling (an FPAUS) are essentially equivalent here.
The antiferromagnetic Potts model:
(13) $\mu_G(\chi) = \frac{e^{\beta \cdot m_G(\chi)}}{Z_G(q, \beta)}$, where $\beta < 0$.
Given $G$ and $\beta$, we want to be able to give an FPAUS for this distribution. It is then equivalent to instead work on the problem: given $G$ and $\beta$, give an FPTAS for its partition function $Z_G(q, \beta)$.
From some previous work, we know that there exists a critical range of $\beta$ in which the following holds: we can say that the problem is #BIS-hard (bipartite independent sets) on bipartite graphs. Thus, doing this is at least as hard as an FPTAS for the number of independent sets in bipartite graphs. If the graph is not bipartite, this becomes an NP-hard problem.
For now, let’s consider the problem given a bipartite graph , design an FPTAS for the number of individual sets in . This accurately captures the difficulty of: the number of proper -colorings of a bipartite graph for , the number of stable matchings, the number of antichains in posets.
For our purposes we assume that $G$ is always a $d$-regular graph on $n$ vertices. Now for a set $S \subseteq V(G)$, we define its edge boundary as
$\nabla(S) = \{\, uv \in E(G) : u \in S,\ v \notin S \,\}$.
Now, $G$ is an $\alpha$-expander if for every $S$ of size at most $n/2$ we have $|\nabla(S)| \geq \alpha \, d \, |S|$. For example, we can take the discrete cube with vertex set $\{0,1\}^d$, where $uv$ is an edge if $u$ and $v$ differ in exactly 1 coordinate.
Using a simplification of Harper's theorem we can say that the discrete cube is (at least) a $\frac{1}{d}$-expander in this sense [7].
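Here is a small Python sketch (my own) that computes edge boundaries and the expansion constant by brute force, using the $|\nabla(S)| \geq \alpha d |S|$ normalization adopted above; for $Q_3$ it returns $1/3 = 1/d$, matching the Harper-type bound.

```python
import itertools

def edge_boundary(edges, S):
    """nabla(S): edges with exactly one endpoint in S."""
    S = set(S)
    return [e for e in edges if (e[0] in S) != (e[1] in S)]

def expansion(n, edges, d):
    """min over nonempty S with |S| <= n/2 of |nabla(S)| / (d |S|),
    i.e. the best alpha for which G is an alpha-expander (brute force)."""
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for S in itertools.combinations(range(n), size):
            best = min(best, len(edge_boundary(edges, S)) / (d * size))
    return best

def hypercube(dim):
    """The discrete cube Q_dim: vertices {0,...,2^dim - 1}, edges between
    vertices differing in exactly one coordinate."""
    n = 2 ** dim
    edges = [(u, u ^ (1 << j)) for u in range(n) for j in range(dim)
             if u < u ^ (1 << j)]
    return n, edges

if __name__ == "__main__":
    dim = 3
    n, edges = hypercube(dim)
    # The minimizing set is a "dictator" cut (a facet of the cube).
    print("alpha(Q_3) =", expansion(n, edges, d=dim), "vs 1/d =", 1 / dim)
```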
Theorem 2. For each $q$ and $\varepsilon > 0$ there is a $d_0$ and an $\alpha$ such that there is an FPTAS for $Z_G(q, \beta)$ where $G$ is a $d$-regular $\alpha$-expander, providing the following conditions hold:
The main result shown was:
Theorem 3. For each $q$, $\varepsilon > 0$ and $d$ large enough, there is an FPTAS for $Z_G(q, \beta)$ for the class of $d$-regular triangle-free $\alpha$-expander graphs, providing the following conditions hold:
This was previously known in some special cases.
Something to note here is that triangle-freeness should not be a necessary condition. As well, part of the result does not require expansion or even that $d$ be large.
We first write the order-disorder threshold of the ferromagnetic Potts model:
(14) $\beta_o(q, d) = \ln\!\left( \frac{q - 2}{(q-1)^{1 - 2/d} - 1} \right)$.
We want to be able to know more about how the Potts distribution looks for $\beta < \beta_o$ and for $\beta > \beta_o$.
Another result we have is:
Theorem 4. For each $q$, let $d$ be large enough and $G$ be a $d$-regular $\alpha$-expander graph on $n$ vertices; then the Potts distribution is dominated by disordered configurations for $\beta < \beta_o$ and by ordered (near-monochromatic) configurations for $\beta > \beta_o$.
The strategy we have to prove the theorem for the low-temperature regime $\beta > \beta_o$:
The motivating idea is to visualize the state for large $\beta$ (low temperature) as a ground state plus defects.
Typical Colouring = Ground State + Defects
Polymer methods are pretty useful in such cases. These were brought into the algorithmic setting in [8] and originated in statistical physics. We take a defect graph whose nodes are the defects (the "polymers"), with edges between incompatible defects.
Now, using polymer methods, the idea is to write
$Z \approx \Xi = \sum_{\Gamma} \prod_{\gamma \in \Gamma} w(\gamma)$,
where the sum runs over sets $\Gamma$ of pairwise compatible polymers and $w(\gamma)$ is the weight of polymer $\gamma$.
We now move towards the cluster expansion: a multivariate Taylor expansion of $\ln \Xi$, the logarithm of the polymer partition function above. This is an infinite sum, so convergence is not guaranteed; however, convergence can be established by verifying the Kotecký-Preiss criterion.
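For concreteness, one common statement of the Kotecký-Preiss criterion (conventions differ slightly across sources): if there are functions $a, b$ from polymers to $[0, \infty)$ such that for every polymer $\gamma$,
$\sum_{\gamma' \not\sim \gamma} |w(\gamma')| \, e^{a(\gamma') + b(\gamma')} \leq a(\gamma)$,
where $\gamma' \not\sim \gamma$ means $\gamma'$ is incompatible with $\gamma$, then the cluster expansion for $\ln \Xi$ converges absolutely, and the truncation error of the series is controlled in terms of $b$.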
We also want to answer: how many connected subsets of a given edge boundary are there in an $\alpha$-expander?
A heuristic we have is to count the number of such subsets that contain a given vertex $v$: a typical connected subgraph on $s$ vertices is tree-like, i.e., has edge boundary roughly $(d-2)s$.
Working backwards, a typical connected subgraph with edge boundary of size $b$ has roughly $b/d$ vertices. The number of such subgraphs is at most the number of connected subgraphs on roughly $b/d$ vertices containing $v$, which in turn is at most the number of trees rooted at $v$ with roughly $b/d$ vertices and maximum degree at most $d$. Thus,
Theorem 5. At most $d^{O(b/d)}$ connected subsets in an expander that contain $v$ have edge boundary of size at most $b$.
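Below is a brute-force Python sketch (mine) that counts connected subsets containing a fixed vertex by edge-boundary size, which is the kind of quantity Theorem 5 controls; it is only meant for tiny graphs.

```python
import itertools

def is_connected(vertices, edges):
    """Check connectivity of the induced subgraph on `vertices`."""
    verts = set(vertices)
    if not verts:
        return False
    frontier, seen = [next(iter(verts))], set()
    while frontier:
        u = frontier.pop()
        if u in seen:
            continue
        seen.add(u)
        frontier.extend(w for e in edges if u in e
                        for w in e if w in verts and w not in seen)
    return seen == verts

def count_by_boundary(n, edges, v, max_boundary):
    """Number of connected vertex subsets containing v whose edge
    boundary has size at most max_boundary (brute force)."""
    count = 0
    for size in range(1, n + 1):
        for S in itertools.combinations(range(n), size):
            if v not in S or not is_connected(S, edges):
                continue
            Sset = set(S)
            boundary = sum(1 for a, b in edges if (a in Sset) != (b in Sset))
            if boundary <= max_boundary:
                count += 1
    return count

if __name__ == "__main__":
    # The 3-dimensional cube again; count "defects" touching vertex 0.
    dim = 3
    n = 2 ** dim
    edges = [(u, u ^ (1 << j)) for u in range(n) for j in range(dim)
             if u < u ^ (1 << j)]
    for b in (3, 4, 6, 8):
        print("boundary <=", b, ":", count_by_boundary(n, edges, 0, b))
```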
Another question to ask is: how many $q$-colourings of an $\alpha$-expander induce at most $m$ non-monochromatic edges?
The easiest way to make few non-monochromatic edges is to colour all but roughly $m/d$ randomly chosen vertices with the same colour. Since they are few, these vertices likely form an independent set, and we can then colour them arbitrarily. There are roughly
$q \binom{n}{m/d} (q-1)^{m/d}$
ways.
Now we also need the maximum value of this kind of count over all graphs with a given number of vertices and edges and maximum degree $d$. This will always be attained when $G$ is a disjoint union of cliques $K_{d+1}$ together with a remaining part.
18th October 2022
The related paper: Adversarially Robust Learning with Tolerance by Ashtiani et al. [9]. Seminar by Hassan Ashtiani.
Characterizing the sample complexity of different machine learning tasks is one of the central questions in statistical learning theory. For example, the classic Vapnik-Chervonenkis theory characterizes the sample complexity of binary classification. Despite this early progress, the sample complexity of many important learning tasks — including density estimation and learning under adversarial perturbations — is not yet resolved. In this talk, we review the less conventional approach of using compression schemes for proving sample complexity upper bounds, with specific applications in learning under adversarial perturbations and learning Gaussian mixture models.
We start by defining some notation: a learner $\mathcal{A}$ receives i.i.d. samples from an unknown source and outputs a model.
Our goal is that, for every target, the model $\mathcal{A}$ outputs is comparable to the best model in the class, with high probability.
We take the example of density estimation; in this case the loss is the total variation distance. Now, $\mathcal{A}$ probably approximately correct (PAC) learns a class $\mathcal{F}$ with $m(\varepsilon, \delta)$ samples if for all $f \in \mathcal{F}$ and for all $\varepsilon, \delta \in (0,1)$, with a sample $S \sim f^m$:
(15) $\Pr_{S \sim f^m}\big[\, d_{TV}(\mathcal{A}(S), f) \leq \varepsilon \,\big] \geq 1 - \delta$.
Now if we take the example where $\mathcal{F}$ is the set of all Gaussians in $\mathbb{R}^d$, then the sample complexity is $\tilde{\Theta}(d^2/\varepsilon^2)$.
We will now modify the above equation to the agnostic (robust) setting. Now, $\mathcal{A}$ PAC learns $\mathcal{F}$ with $m(\varepsilon, \delta)$ samples if for all distributions $g$ (not necessarily in $\mathcal{F}$) and for all $\varepsilon, \delta$, if $S \sim g^m$ then:
(16) $\Pr_{S \sim g^m}\big[\, d_{TV}(\mathcal{A}(S), g) \leq C \cdot \mathrm{opt} + \varepsilon \,\big] \geq 1 - \delta$, where $\mathrm{opt} = \min_{f \in \mathcal{F}} d_{TV}(f, g)$.
For the example where $\mathcal{F}$ is the set of all Gaussians in $\mathbb{R}^d$, the sample complexity remains $\tilde{\Theta}(d^2/\varepsilon^2)$.
For the example of binary classification, we have a domain $X$, and a hypothesis $h$ is some model which maps from $X$ to $\{0, 1\}$. We also have a distribution $D$ over $X \times \{0,1\}$, and then we will have the loss be $\mathrm{err}_D(h) = \Pr_{(x,y) \sim D}[h(x) \neq y]$.
Now, $\mathcal{A}$ PAC learns $\mathcal{H}$ with $m(\varepsilon, \delta)$ samples if for all $D$ and for all $\varepsilon, \delta$, if $S \sim D^m$ then:
(17) $\Pr_{S \sim D^m}\big[\, \mathrm{err}_D(\mathcal{A}(S)) \leq \min_{h \in \mathcal{H}} \mathrm{err}_D(h) + \varepsilon \,\big] \geq 1 - \delta$.
Now if $\mathcal{H}$ is the set of all half-spaces in $\mathbb{R}^d$, then the sample complexity is $\Theta\!\big(\frac{d + \log(1/\delta)}{\varepsilon^2}\big)$.
For the example of binary classification with adversarial perturbations, we again have a domain $X$, and $h$ is some model which maps from $X$ to $\{0, 1\}$. We also have adversarial perturbation sets $U(x) \subseteq X$ (e.g., a ball around $x$), and then we will have the loss be the robust loss $\mathrm{err}^U_D(h) = \Pr_{(x,y) \sim D}\big[\exists z \in U(x) : h(z) \neq y\big]$.
Now, $\mathcal{A}$ PAC learns $\mathcal{H}$ with $m(\varepsilon, \delta)$ samples if for all $D$ and for all $\varepsilon, \delta$, if $S \sim D^m$ then:
(18) $\Pr_{S \sim D^m}\big[\, \mathrm{err}^U_D(\mathcal{A}(S)) \leq \min_{h \in \mathcal{H}} \mathrm{err}^U_D(h) + \varepsilon \,\big] \geq 1 - \delta$.
Now, consider two classes $\mathcal{H}_1 \subseteq \mathcal{H}_2$:
here $\mathcal{H}_2$ is richer, which can make it contain better models as well as be harder to learn. We can ask what characterizes the sample complexity of learning in each setting:
binary classification (characterized by the VC dimension), binary classification with adversarial perturbations, and density estimation (where no such characterization is known).
In the case of binary classification the VC dimension quantifies the sample complexity:
$m(\varepsilon, \delta) = \Theta\!\left( \frac{\mathrm{VC}(\mathcal{H}) + \log(1/\delta)}{\varepsilon^2} \right)$.
The upper bound here is achieved using simple ERM (empirical risk minimization), and then for uniform convergence:
(19) $\sup_{h \in \mathcal{H}} \big| \mathrm{err}_S(h) - \mathrm{err}_D(h) \big| \leq \varepsilon$ with probability at least $1 - \delta$, once $m \gtrsim \frac{\mathrm{VC}(\mathcal{H}) + \log(1/\delta)}{\varepsilon^2}$.
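As a toy illustration of the ERM upper bound, here is a Python sketch (mine, not from the talk) for the simplest halfspace class, 1-D thresholds; all names and the noise level are illustrative assumptions.

```python
import random

def erm_threshold(sample):
    """Empirical risk minimization over 1-D threshold classifiers
    h_t(x) = 1[x >= t] (a halfspace in one dimension): try a threshold
    between every pair of consecutive points and keep the best one."""
    xs = sorted({x for x, _ in sample})
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] \
                 + [xs[-1] + 1.0]
    def emp_err(t):
        return sum(1 for x, y in sample if (x >= t) != (y == 1)) / len(sample)
    return min(candidates, key=emp_err)

if __name__ == "__main__":
    random.seed(0)
    # Noisy data from a true threshold at 0.5 (labels flipped w.p. 0.1).
    def draw(m):
        data = []
        for _ in range(m):
            x = random.random()
            y = int(x >= 0.5)
            if random.random() < 0.1:
                y = 1 - y
            data.append((x, y))
        return data
    t_hat = erm_threshold(draw(200))
    print("ERM threshold:", round(t_hat, 3))  # should land near 0.5
```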
We now introduce sample compression as an alternative.
The idea is to try and answer how should we go about compressing a given training
set? In classic information theory, we would compress it into a few bits. In
the case of sample compression, we want to try to compress it into a few
samples.
If we just take the simple example of linear classification: the number of required bits is unbounded (it depends on the sample).
It has already been shown by Littlestone and Warmuth [10] that compressibility $\Rightarrow$ learnability.
It has also been shown by Moran and Yehudayoff [11] that learnability $\Rightarrow$ compressibility (by a constant-size scheme).
Sample compression can be very helpful here, and typical classifiers often admit it: they are frequently determined by a small subset of the training points (think of support vectors for a linear classifier), as in the sketch below.
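Here is a classic toy example of a sample compression scheme, written as a Python sketch (mine): axis-aligned rectangles in the realizable case compress to at most four of the positive points.

```python
def compress(sample):
    """Encoder: for realizable data labeled by an axis-aligned rectangle,
    keep at most four positive points: the ones attaining the extreme
    x- and y-coordinates among the positives."""
    pos = [p for p, label in sample if label == 1]
    if not pos:
        return []
    keep = {min(pos, key=lambda p: p[0]), max(pos, key=lambda p: p[0]),
            min(pos, key=lambda p: p[1]), max(pos, key=lambda p: p[1])}
    return list(keep)

def decompress(kept):
    """Decoder: rebuild the smallest rectangle containing the kept points
    and return the corresponding classifier."""
    if not kept:
        return lambda p: 0
    xs, ys = [p[0] for p in kept], [p[1] for p in kept]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return lambda p: int(lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y)

if __name__ == "__main__":
    # Data labeled by the (unknown) rectangle [2,5] x [1,4].
    points = [(1, 1), (2, 2), (3, 3), (5, 4), (4, 1), (6, 2), (3, 5)]
    sample = [(p, int(2 <= p[0] <= 5 and 1 <= p[1] <= 4)) for p in points]
    h = decompress(compress(sample))
    assert all(h(p) == y for p, y in sample)   # consistent on the sample
    print("compressed to", compress(sample))
```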
In the case of classification with adversarial perturbations we had the domain $X$, the perturbation sets $U(x)$, and the class $\mathcal{H}$, and then we will have the loss be the robust loss $\mathrm{err}^U_D$.
Now, $\mathcal{A}$ PAC learns $\mathcal{H}$ with $m(\varepsilon, \delta)$ samples if for all $D$ and for all $\varepsilon, \delta$, if $S \sim D^m$ then:
(20) $\Pr_{S \sim D^m}\big[\, \mathrm{err}^U_D(\mathcal{A}(S)) \leq \min_{h \in \mathcal{H}} \mathrm{err}^U_D(h) + \varepsilon \,\big] \geq 1 - \delta$.
However, one of the problems with this is whether robust ERM works for all $\mathcal{H}$. The robust ERM would not work for all $\mathcal{H}$: uniform convergence can fail, and the gap between empirical and true robust error can be unbounded.
We can say that any "proper learner" (one that outputs a hypothesis from $\mathcal{H}$) can fail.
In a compression-based method the decoder should recover the labels of the training set and their neighbors (under $U$), so one compresses the inflated set. So,
(21) the sample complexity obtained this way is $2^{O(\mathrm{VC}(\mathcal{H}))} \cdot \mathrm{poly}(1/\varepsilon, \log(1/\delta))$.
There is an exponential dependence on $\mathrm{VC}(\mathcal{H})$.
Ashtiani et al. [9] introduced tolerant adversarial learning: $\mathcal{A}$ PAC learns $\mathcal{H}$ with $m(\varepsilon, \delta)$ samples if, for all $D$ and all $\varepsilon, \delta$, if $S \sim D^m$ then
(22) $\mathrm{err}^{U}_D(\mathcal{A}(S)) \leq \min_{h \in \mathcal{H}} \mathrm{err}^{U^{1+\gamma}}_D(h) + \varepsilon$
And,
(23) this holds with probability at least $1 - \delta$ over $S \sim D^m$.
Here $U(x)$ is the perturbation set (say, a ball of radius $r$ around $x$) and $U^{1+\gamma}(x)$ is its slightly inflated version (radius $(1+\gamma) r$); the learner is only compared against the best hypothesis under the larger perturbation.
The trick is to avoid compressing an infinite set; our new goal is that the decoder should only recover labels of points in the training set (and a finite set of perturbed copies), not of the whole inflated set.
To do so we can define a noisy empirical distribution (using random perturbations inside $U^{1+\gamma}(x)$) and then use boosting to achieve a super small error with respect to this distribution.
And then, we encode the classifier using the samples used to train the weak learners, and the decoder smooths out the hypotheses.
It is interesting to think about why we need tolerance; there do exist some other ways to relax the problem and avoid the exponential dependence.
This is also observable in the density estimation example.
Gaussian mixture models are very popular in practice and are one of the most basic universal density approximators. They are also the building blocks for more sophisticated density classes, and one can think of them as multi-modal versions of Gaussians.
We say $f = \sum_{i=1}^{k} w_i \, \mathcal{N}(\mu_i, \Sigma_i)$ is a Gaussian mixture model with $k$ components in $\mathbb{R}^d$.
And we want to ask how many samples are needed to recover $f$ within total variation error $\varepsilon$.
The number of samples is $\tilde{\Theta}(k d^2 / \varepsilon^2)$.
To learn a single Gaussian in $\mathbb{R}^d$, $\tilde{\Theta}(d^2/\varepsilon^2)$ samples are sufficient (and necessary).
Now if we have a mixture of $k$ Gaussians in $\mathbb{R}^d$, then we want to know whether $\tilde{O}(k d^2/\varepsilon^2)$ samples are sufficient.
There have been some results on learning Gaussian Mixture Models.
Let us take the example of a mixture of two Gaussians (shown as a figure in the talk). For a moment, look at this as a binary classification problem: the decision boundary has a simple quadratic form!
Here "sample compression" does not make sense, as there are no "labels".
We have $\mathcal{F}$, which is a class of distributions (e.g., Gaussians), and two parties A and B. A receives samples from some $f \in \mathcal{F}$; if A sends a small number of those points (plus a few bits) to B, and B can use them to output an approximation of $f$, then we say $\mathcal{F}$ admits compression.
Theorem 7. If $\mathcal{F}$ has a compression scheme of size $(\tau, t, m)$ (send $\tau$ points and $t$ bits after seeing $m$ samples), then the sample complexity of learning $\mathcal{F}$ is $\tilde{O}\!\left( m + \frac{\tau + t}{\varepsilon^2} \right)$, where $\tilde{O}$ hides polylog factors.
Small compression schemes imply sample-efficient algorithms.
Distribution compression schemes extend to mixture classes automatically! So for the case of GMMs in $\mathbb{R}^d$ it is enough to come up with a good compression scheme for a single Gaussian.
For learning mixtures of Gaussians, encoding the center and the axes of the (covariance) ellipsoid is sufficient to recover the Gaussian $\mathcal{N}(\mu, \Sigma)$.
This admits an $\tilde{O}(d^2)$-size compression. The technical challenge is encoding the eigenvectors "accurately" using only sample points; the eigenvalues of $\Sigma$ can be spread out (the covariance can be very ill-conditioned), which is a technical challenge.
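The following Python sketch (mine) only illustrates the geometric idea of encoding a Gaussian by its center and ellipsoid axes; the actual compression scheme in the work discussed sends carefully chosen sample points plus a few extra bits, which this toy does not implement.

```python
import numpy as np

def encode(samples):
    """Illustrative encoder: summarize a Gaussian sample by d+1 points,
    the empirical center and one point along each principal axis of the
    empirical covariance.  This is only the geometric idea, not the
    real points-plus-bits scheme."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    axis_points = [mu + np.sqrt(max(v, 0.0)) * vecs[:, i]
                   for i, v in enumerate(vals)]
    return [mu] + axis_points

def decode(points):
    """Decoder: read off the center and rebuild the covariance from the
    axis points."""
    mu, axis_points = points[0], points[1:]
    cov = sum(np.outer(p - mu, p - mu) for p in axis_points)
    return mu, cov

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 3, 20000
    A = rng.normal(size=(d, d))
    true_cov = A @ A.T
    true_mu = np.arange(d, dtype=float)
    samples = rng.multivariate_normal(true_mu, true_cov, size=n)
    mu_hat, cov_hat = decode(encode(samples))
    print("mean error:", np.linalg.norm(mu_hat - true_mu))
    print("cov error :", np.linalg.norm(cov_hat - true_cov))
```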
Compression relies heavily on a few points. Target compression is quite general.
[1] Omar Alrabiah, Venkatesan Guruswami, Pravesh Kothari, and Peter Manohar. A near-cubic lower bound for 3-query locally decodable codes from semirandom CSP refutation. Technical Report TR22-101, Electronic Colloquium on Computational Complexity (ECCC), July 2022.
[2] Arnab Bhattacharyya, L. Sunil Chandran, and Suprovat Ghoshal. Combinatorial lower bounds for 3-query LDCs, 2019. URL https://arxiv.org/abs/1911.10698.
[3] Venkatesan Guruswami, Pravesh K. Kothari, and Peter Manohar. Algorithms and certificates for Boolean CSP refutation: "Smoothed is no harder than random", 2021. URL https://arxiv.org/abs/2109.04415.
[4] Alexander S. Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA, 2019. URL https://arxiv.org/abs/1904.03858.
[5] Charlie Carlson, Ewan Davies, Nicolas Fraiman, Alexandra Kolla, Aditya Potukuchi, and Corrine Yap. Algorithms for the ferromagnetic Potts model on expanders, 2022. URL https://arxiv.org/abs/2204.01923.
[6] Matthew Coulson, Ewan Davies, Alexandra Kolla, Viresh Patel, and Guus Regts. Statistical Physics Approaches to Unique Games. In Shubhangi Saraf, editor, 35th Computational Complexity Conference (CCC 2020), volume 169 of Leibniz International Proceedings in Informatics (LIPIcs), pages 13:1–13:27, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. ISBN 978-3-95977-156-6. doi: 10.4230/LIPIcs.CCC.2020.13. URL https://drops.dagstuhl.de/opus/volltexte/2020/12565.
[7] Peter Frankl and Zoltán Füredi. A short proof for a theorem of Harper about Hamming-spheres. Discrete Mathematics, 34(3):311–313, 1981.
[8] Tyler Helmuth, Will Perkins, and Guus Regts. Algorithmic Pirogov-Sinai theory. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, page 1009–1020, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367059. doi: 10.1145/3313276.3316305. URL https://doi.org/10.1145/3313276.3316305.
[9] Hassan Ashtiani, Vinayak Pathak, and Ruth Urner. Adversarially robust learning with tolerance, 2022. URL https://arxiv.org/abs/2203.00849.
[10] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. 1986.
[11] Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. Journal of the ACM (JACM), 63(3):1–10, 2016.