These are my notes for the seminars that happen in the Theory Group at The
University of Toronto. Many thanks to Professor Allan Borodin for allowing me to attend the Theory Group seminars and for his help.
A PDF of these notes is available at https://rishit-dagli.github.io/cs-theory-notes/main.pdf.
An online version of these notes is available at https://rishit-dagli.github.io/cs-theory-notes.
The Theory Group focuses on theory of computation. The group is interested in
using mathematical techniques to understand the nature of computation and to
design and analyze algorithms for important and fundamental problems.
The members of the theory group are all interested, in one way or another, in the limitations of computation: What problems are not feasible to solve on a computer? How can the infeasibility of a problem be used to rigorously construct secure cryptographic protocols? What problems cannot be solved faster using more machines? What are the limits to how fast a particular problem can be solved or how much space is needed to solve it? How do randomness, parallelism, the operations that are allowed, and the need for fault tolerance or security affect this?
7th October 2022
The related paper: Combinatorial lower bounds for 3-query LDCs by Alrabiah et al. [1]. Seminar by Peter Manohar. [2] [3]
A code $C$ is a $q$-locally decodable code ($q$-LDC) if one can recover any chosen bit ${b}_{i}$ of the $k$-bit message $b$ with good confidence by randomly querying the $n$-bit encoding $x$ on at most $q$ coordinates. Existing
constructions of $2$-LDCs
achieve blocklength $n=\mathit{exp}(O(k))$,
and lower bounds show that this is in fact tight. However, when
$q=3$, far less is known: the
best constructions have $n=\mathit{subexp}(k)$,
while the best known lower bounds, that have stood for nearly two decades, only show a quadratic
lower bound of $n\ge \mathrm{\Omega}({k}^{2})$
on the blocklength.
In this talk, we will survey a new approach to prove lower bounds for LDCs using recent advances in refuting semirandom instances of constraint satisfaction problems. These new tools yield, in the $3$-query case, a near-cubic lower bound of $n\ge \stackrel{~}{\mathrm{\Omega}}({k}^{3})$, improving on prior work by a polynomial factor in $k$.
Take codes $b\in {\{0,1\}}^{k}\to x\in {\{0,1\}}^{n}$.
The encoding $x$ is read by the decoder, which on input $i\in [k]$ outputs a guess $\widehat{{b}_{i}}\in \{0,1\}$.
Definition 1. $C$ is $(q,\delta ,\mathit{\epsilon})$-locally decodable if for any $x$ with $\Delta (x,\mathit{Enc}(b))\le \delta n$, $\mathit{Dec}^{x}(i)={b}_{i}$ w.p. $\ge \frac{1}{2}+\mathit{\epsilon}$ for any $i$.
The natural question to ask: what is the best possible rate for a $q$-LDC, as a function of $q$?
$q$ | Lower Bound | Upper Bound |
2 | ${2}^{\mathrm{\Omega}(k)}\le n$ | $n\le {2}^{k}$ |
3 | ${k}^{2}\le n$ | $n\le \mathit{exp}({k}^{o(1)})$ |
$O(1)$, even | ${k}^{\frac{q}{q-2}}\le n$ | $n\le \mathit{exp}({k}^{o(1)})$ |
$O(1)$, odd | ${k}^{\frac{q+1}{q-1}}\le n$ | $n\le \mathit{exp}({k}^{o(1)})$ |
Focusing on the case $q=3$, we have the bounds:
$$k\le n\le {2}^{k}$$ | (1) |
which have been improved to:
$${k}^{2}\le n\le \mathit{exp}(\mathit{exp}(\sqrt{\mathrm{log}k\mathrm{log}\mathrm{log}k}))$$ |
In [1], they show a better lower bound than these existing ones for $q=3$:
$${k}^{3}\le n$$ | (2) |
The main result is that:
Theorem 1. Let $C$ be a $(3,\delta ,\mathit{\epsilon})$-locally decodable code. Then $n\ge {\stackrel{~}{\mathrm{\Omega}}}_{\delta ,\mathit{\epsilon}}({k}^{3})$.
Semi-random CSP refutation comes to our aid to prove this! The intuitive way to put this theorem is that a $q$-LDC lower bound is equivalent to refuting "LDC" $q$-XOR instances.
The idea:
$q$-LDC lower bound is equivalent to refuting "LDC" $q$-XOR
The decoder can be arbitrary, but WLOG we can assume there are $q$-uniform hypergraphs ${H}_{1},{H}_{2},\cdots ,{H}_{k}$ where every ${H}_{i}$ is such that:
$${H}_{i}\subseteq \left(\genfrac{}{}{0ex}{}{[n]}{q}\right)$$ |
We can also see that:
Each ${H}_{i}$ is a matching with $\left|{H}_{i}\right|\ge \delta n$, and $\mathit{Dec}(i)$ picks $C\leftarrow {H}_{i}$ and outputs ${\sum}_{j\in C}{x}_{j}$.
One such example is the Hadamard code:
$$b\in {\{0,1\}}^{k}\mapsto f={(\langle b,v\rangle )}_{v\in {\{0,1\}}^{k}}$$ | (3) |
$${b}_{i}=f({e}_{i})=f(v)+f(v+{e}_{i})$$ |
We can think of this as $v$ and $v+{e}_{i}$ being connected.
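A minimal sketch of the Hadamard encoding and its 2-query local decoder (the function names are ours, purely illustrative; here we decode from an uncorrupted encoding):

```python
import itertools
import random

def hadamard_encode(b):
    """Encode a k-bit message b as the truth table f(v) = <b, v> mod 2."""
    return {v: sum(bi * vi for bi, vi in zip(b, v)) % 2
            for v in itertools.product((0, 1), repeat=len(b))}

def decode_bit(f, i, k):
    """2-query local decoding: b_i = f(v) + f(v + e_i) for a random v."""
    v = tuple(random.randint(0, 1) for _ in range(k))
    v_plus_ei = tuple((vj + (1 if j == i else 0)) % 2 for j, vj in enumerate(v))
    return (f[v] + f[v_plus_ei]) % 2

b = (1, 0, 1)
f = hadamard_encode(b)
assert all(decode_bit(f, i, len(b)) == b[i] for i in range(len(b)))
```

Note the blocklength is $n={2}^{k}$, matching the $q=2$ row of the table above.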
Matching vector codes are $\approx {\mathbb{Z}}_{m}^{h}$
We suppose that our code is linear and that there exist $q$-uniform hypergraphs ${H}_{1},{H}_{2},\cdots ,{H}_{k}$.
We also know that:
Each ${H}_{i}$ is a matching with $\left|{H}_{i}\right|\ge \delta n$, and $\mathit{Dec}(i)$ picks $C\leftarrow {H}_{i}$ and outputs ${\sum}_{j\in C}{x}_{j}$.
So, we start by considering a $q$-XOR instance ${\psi}_{b}$:
$$\begin{array}{rl}\text{Variables:}& {\{{x}_{j}\}}_{j\in [n]}\\ \text{Equations:}& \sum _{j\in C}{x}_{j}={b}_{i},\phantom{\rule{0.28em}{0ex}}\forall i\in [k],C\in {H}_{i}\end{array}$$ |
Writing $\mathit{val}(\psi )$ for the maximum fraction of simultaneously satisfiable constraints, we have $\mathit{val}({\psi}_{b})=1$ for any $b\in {\{0,1\}}^{k}$, since $x=\mathit{Enc}(b)$ satisfies every constraint.
It is sufficient now to argue that ${\psi}_{b}$ is unsatisfiable with high probability for a random $b$ when $n\ll {k}^{\frac{q}{q-2}}$.
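To make the instance ${\psi}_{b}$ concrete, here is a sketch built from the 2-query Hadamard code above (so $q=2$; each ${H}_{i}$ is the perfect matching of pairs $\{v,v+{e}_{i}\}$), verifying that $\mathit{val}({\psi}_{b})=1$ on the planted codeword. Function names are ours:

```python
import itertools

def hadamard_xor_instance(b):
    """Build the LDC XOR instance psi_b for the Hadamard code (q = 2).
    H_i is the perfect matching {v, v + e_i}; each constraint reads
    x_v + x_{v+e_i} = b_i over GF(2)."""
    k = len(b)
    constraints = []
    for i in range(k):
        for v in itertools.product((0, 1), repeat=k):
            if v[i] == 0:  # pair each v with v + e_i exactly once
                w = tuple((vj + (1 if j == i else 0)) % 2
                          for j, vj in enumerate(v))
                constraints.append((v, w, b[i]))
    return constraints

def frac_satisfied(constraints, x):
    sat = sum(1 for (v, w, rhs) in constraints if (x[v] + x[w]) % 2 == rhs)
    return sat / len(constraints)

b = (1, 0, 1)
x = {v: sum(bi * vi for bi, vi in zip(b, v)) % 2
     for v in itertools.product((0, 1), repeat=len(b))}  # x = Enc(b)
assert frac_satisfied(hadamard_xor_instance(b), x) == 1.0  # val(psi_b) = 1
```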
Now we need to refute XOR; there are many ways to argue unsatisfiability of an XOR instance. One reason why we cannot use standard probabilistic approaches here is that ${\psi}_{b}$ only has $k$ bits of randomness.
One way we can have some success here is to use a refutation algorithm
$$\psi \to A\to \mathit{algval}(\psi )$$ |
With this the guarantee is $\mathit{val}(\psi )\le \mathit{algval}(\psi )$, so if $\mathit{algval}(\psi )<1$ then $A$ refutes $\psi $. The ideal goal would be to refute a random $\psi $ with $m$ constraints with high probability.
However, we take a look at semi-random XOR. Our refutation algorithm and the guarantee will still be the same:
$$\psi \to A\to \mathit{algval}(\psi )$$ |
with the guarantee that $\mathit{val}(\psi )\le \mathit{algval}(\psi )$.
So, now we generate a semi-random $\psi $ with $m$ constraints:
The equation we have is:
$$\sum _{j\in C}{x}_{j}={b}_{c}$$ | (4) |
Recall that ${\psi}_{b}$ consists of the equations $\sum _{j\in C}{x}_{j}={b}_{i}$ for $i\in [k]$, $C\in {H}_{i}$; the only difference from the semi-random model is that all constraints from the same ${H}_{i}$ share the right-hand side ${b}_{i}$, so ${\psi}_{b}$ is almost semi-random.
This completes Part 1 of the proof.
Recall the $q$-LDC XOR instance ${\psi}_{b}$. Our goal is to argue that ${\psi}_{b}$ is unsatisfiable with high probability for a random $b$ when $n\ll {k}^{q/(q-2)}$. The fraction of constraints satisfied by $x\in {\{\pm 1\}}^{n}$ is $\frac{1}{2}+\frac{f(x)}{2}$.
Here $f(x)$ is:
$$f(x)=\frac{1}{m}\sum _{i}{b}_{i}\sum _{C\in {H}_{i}}\prod _{j\in C}{x}_{j}$$ | (5) |
$$m=k\cdot \mathit{\delta n}$$ |
This makes our goal to be to certify with high probability that:
$$\underset{x\in {\{\pm 1\}}^{n}}{\mathrm{max}}f(x)<1\mathit{\text{when}}n\ll {k}^{\frac{q}{q-2}}$$ | (6) |
We will now try to refute ${\psi}_{b}$. By Equation 5 and Equation 6, refuting ${\psi}_{b}$ amounts to showing:
$$\text{w.h.p.}\underset{x\in {\{\pm 1\}}^{n}}{\mathrm{max}}f(x)<1\text{ where }f(x)=\frac{1}{m}\sum _{i}{b}_{i}\sum _{C\in {H}_{i}}\prod _{j\in C}{x}_{j}$$ | (7) |
when $n\ll {k}^{\frac{q}{q-2}}$.
The idea is to design a matrix $A\in {\mathbb{R}}^{N\times N}$ so that:
$$f(x)\le \|A{\|}_{\infty \to 1}=\underset{z,w\in {\{\pm 1\}}^{N}}{\mathrm{max}}{z}^{T}Aw$$ |
As shown by Wein et al. [4], the matrix $A$ can be indexed by sets
$$S\in \left(\genfrac{}{}{0ex}{}{[n]}{l}\right)$$ |
Assign $x\mapsto y$ with ${y}_{S}:={\prod}_{j\in S}{x}_{j}$, so that ${y}^{T}Ay\propto f(x)$; this is simply the tensor product construction.
We now need to decide how to set $A(S,T)$:
$${y}^{T}Ay=\sum _{S,T}{y}_{S}{y}_{T}A(S,T)=\sum _{S,T}A(S,T)\prod _{j\in S\oplus T}{x}_{j}$$ | (8) |
which shows that what matters is the symmetric difference $S\oplus T$.
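A quick brute-force check of the identity ${y}_{S}{y}_{T}={\prod}_{j\in S\oplus T}{x}_{j}$ (our own sketch, not from the talk): since ${x}_{j}^{2}=1$, the coordinates in $S\cap T$ cancel.

```python
import itertools
import math

n, l = 6, 2
x = [1, -1, 1, 1, -1, -1]  # any fixed +-1 assignment

def y(S):
    """y_S = product of x_j over j in S."""
    return math.prod(x[j] for j in S)

for S in itertools.combinations(range(n), l):
    for T in itertools.combinations(range(n), l):
        sym_diff = set(S) ^ set(T)
        # coordinates in S ∩ T appear squared and cancel
        assert y(S) * y(T) == math.prod(x[j] for j in sym_diff)
```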
If $S\oplus T=C\in {H}_{i}$ then ${\prod}_{j\in S\oplus T}{x}_{j}={b}_{i}$ for the planted assignment, so we set $A(S,T)={b}_{i}$ if $S\oplus T=C\in {H}_{i}$.
$${y}^{T}Ay=\sum _{i=1}^{k}{b}_{i}\sum _{C\in {H}_{i}}\sum _{S\oplus T=C}\prod _{j\in C}{x}_{j}=Dmf(x)$$ | (9) |
Here $D$ is the number of pairs $(S,T)$ with $S\oplus T=C$.
Simplifying an earlier statement, define ${A}_{C}(S,T)=1$ if $S\oplus T=C$ (and $0$ otherwise), ${A}_{i}={\sum}_{C\in {H}_{i}}{A}_{C}$, and $A={\sum}_{i=1}^{k}{b}_{i}{A}_{i}$.
Setting ${y}_{S}:={\prod}_{j\in S}{x}_{j}$ gives ${y}^{T}Ay=Dmf(x)\phantom{\rule{0.28em}{0ex}}\Rightarrow \phantom{\rule{0.28em}{0ex}}Dmf(x)\le \|A{\|}_{\infty \to 1}$.
Note that $D$ as defined depends only on $\left|C\right|=q$, so we can say:
$$D=\left(\genfrac{}{}{0ex}{}{q}{\frac{q}{2}}\right)\left(\genfrac{}{}{0ex}{}{n-q}{l-\frac{q}{2}}\right)$$ |
Also, ${A}_{C}\in {\mathbb{R}}^{N\times N}$ and $N=\left(\genfrac{}{}{0ex}{}{n}{l}\right)$.
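These quantities are easy to compute, and $D$ can be brute-force checked on a tiny case (`kikuchi_params` is our own hypothetical helper name):

```python
import itertools
from math import comb

def kikuchi_params(n, q, l):
    """N = dimension of the Kikuchi matrix, D = number of ordered pairs
    (S, T) of l-sets with S xor T = C, for any fixed q-set C (q even)."""
    N = comb(n, l)
    D = comb(q, q // 2) * comb(n - q, l - q // 2)
    return N, D

# brute-force check for n = 6, q = 2, l = 2, C = {0, 1}
count = sum(1 for S in itertools.combinations(range(6), 2)
              for T in itertools.combinations(range(6), 2)
              if set(S) ^ set(T) == {0, 1})
N, D = kikuchi_params(n=6, q=2, l=2)
assert N == 15
assert D == count == 8
```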
We have already shown that $\|A{\|}_{\infty \to 1}\ge Dm\underset{x}{\mathrm{max}}f(x)=Dm=D\delta nk$.
It is also useful to note that $\|A{\|}_{\infty \to 1}\le N\|A{\|}_{2}$, and we still need to show that with high probability $\|A{\|}_{\infty \to 1}$ is not too large.
Matrix Bernstein: with high probability over the ${b}_{i}$, $\|A{\|}_{2}\le \Delta \sqrt{kl}$, where $\Delta $ is the maximum number of 1's in a row of any ${A}_{i}$.
Expected number of 1's per row is $\delta n\frac{D}{N}\sim n{(\frac{l}{n})}^{q/2}$.
We can optimistically suppose that $\Delta \sim n{(\frac{l}{n})}^{q/2}$; however, this needs $l\ge {n}^{1-2/q}$ (so that the expected number of 1's per row is at least 1).
Then $D\cdot \delta nk\le \|A{\|}_{\infty \to 1}\le N\Delta \sqrt{kl}\phantom{\rule{0.28em}{0ex}}\Rightarrow \phantom{\rule{0.28em}{0ex}}k\le l$ since $\Delta \sim \delta n\frac{D}{N}$.
Now take $l={n}^{1-2/q}\phantom{\rule{0.28em}{0ex}}\Rightarrow \phantom{\rule{0.28em}{0ex}}{k}^{q/(q-2)}\le n$.
In the worst case, though, $\Delta =\frac{2l}{q}$. Because the ${H}_{i}$ are matchings, a random row will have only $\approx \frac{\delta nD}{N}$ 1's.
The idea now is to prune off all the bad rows and columns of $A$ to get $B$ such that:
$$\|A{\|}_{\infty \to 1}\le \|B{\|}_{\infty \to 1}+o(N)$$ |
and ${\Delta}_{B}\sim \delta n{(\frac{l}{n})}^{q/2}$.
And now we can just use $B$ instead, which proves the $q$-LDC lower bound for even $q$.
Recall the $q$-LDC XOR instance ${\psi}_{b}$. The goal is to argue that ${\psi}_{b}$ is unsatisfiable with high probability for random $b$, and the idea is to design a matrix $A\in {\mathbb{R}}^{N\times N}$ so that:
$$f(x)\le \|A{\|}_{\infty \to 1}=\underset{z,w\in {\{\pm 1\}}^{N}}{\mathrm{max}}{z}^{T}Aw$$ |
The previous approach fails because the
$A$ from before
requires $q$
to be even.
One attempt is to index rows by sets with $\left|S\right|=l$ and columns by sets with $\left|T\right|=l+1$. However, this only gets us to $k\le \sqrt{n}$.
We need to derive more constraints. The instance has $\approx nk$ constraints, so each variable ${x}_{j}$ is in $\approx k$ constraints; taking XORs ${C}_{i}\oplus {C}_{j}$ of pairs of constraints sharing a variable gives $\approx n{k}^{2}$ new constraints.
The matrix $A$ is again indexed by sets $S$, now with entries $A(S,T)={b}_{i}{b}_{j}$. The calculation is now:
$$n{k}^{2}D\le \|A{\|}_{\infty \to 1}\le N\Delta \sqrt{kl}$$ |
Optimistically, $\Delta \sim nk\frac{D}{N}=nk{(\frac{l}{n})}^{2}$, which requires $l\ge \sqrt{\frac{n}{k}}$. Taking $l=\sqrt{\frac{n}{k}}$, the bound $k\le l$ from before gives $k\le \sqrt{\frac{n}{k}}\phantom{\rule{0.28em}{0ex}}\Rightarrow \phantom{\rule{0.28em}{0ex}}{k}^{3}\le n$.
The row pruning trick still works provided that any pair $\{u,v\}$ is in at most $\mathit{polylog}(n)$ constraints.
This proof for $q=3$ does not generalize to all odd $q$, and neither does a reduction to $2$-LDCs; the bottleneck is precisely the row pruning step.
14th October 2022
The related paper: Algorithms for the ferromagnetic Potts model on expanders by Carlson et al. [5]. Seminar by Aditya Potukuchi.
The ferromagnetic Potts model is a canonical example of a Markov random field from statistical physics that is of great probabilistic and algorithmic interest. This is a distribution over all $q$-colorings of the vertices of a graph where monochromatic edges are favored. The algorithmic problem of efficiently sampling approximately from this model is known to be #BIS-hard, and has seen a lot of recent interest. I will outline some recently developed algorithms for approximately sampling from the ferromagnetic Potts model on $d$-regular weakly expanding graphs. This is achieved by a significantly sharper analysis of standard "polymer methods" using extremal graph theory and applications of Karger's algorithm to count cuts that may be of independent interest. I will give an introduction to all the topics that are relevant to the results.
We start by defining some basic notation:
Notice that $\beta <0$ gives the antiferromagnetic case; here we focus on $\beta >0$, the ferromagnetic case.
This has quite a few applications. We know $p(\chi )\propto \mathit{exp}(\beta \cdot m(\chi ))$, where $m(\chi )$ is the number of monochromatic edges. For $\beta =0$ we are doing a uniform $q$-coloring of $V$; for $\beta =-\infty $ we are doing a uniform proper coloring of $G$.
What we need to do is given $G$ and $\beta $, efficiently sample a coloring from this distribution.
$$p(\chi )=\frac{\mathit{exp}(\mathit{\beta m}(\chi ))}{\sum _{\chi}\mathit{exp}(\mathit{\beta m}(\chi ))}$$ | (10) |
We add the normalizing factor here:
$$\text{Normalizing factor}=\sum _{\chi}\mathit{exp}(\beta m(\chi ))$$ |
Now we can also say,
$$\sum _{\chi}\mathit{exp}(\mathit{\beta m}(\chi ))=:{Z}_{G}(q,\beta )$$ | (11) |
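The partition function ${Z}_{G}(q,\beta )$ can be evaluated by brute force on tiny graphs (our own illustrative sketch; the sum is exponential in $n$, which is exactly why approximate sampling is the interesting problem):

```python
import itertools
import math

def potts_Z(n, edges, q, beta):
    """Brute-force Z_G(q, beta): sum over all q-colorings chi of
    exp(beta * m(chi)), where m(chi) = number of monochromatic edges."""
    total = 0.0
    for chi in itertools.product(range(q), repeat=n):
        m = sum(1 for (u, v) in edges if chi[u] == chi[v])
        total += math.exp(beta * m)
    return total

# triangle K_3: at beta = 0 every coloring has weight 1, so Z = q^n = 27
assert abs(potts_Z(3, [(0, 1), (1, 2), (0, 2)], q=3, beta=0.0) - 27.0) < 1e-9
# single edge, q = 2: two monochromatic colorings of weight e, two of weight 1
assert abs(potts_Z(2, [(0, 1)], q=2, beta=1.0) - (2 * math.e + 2)) < 1e-9
```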
The partition function of the model/distribution is central from this point of view.
Our problem: given $G$ and $\beta $, efficiently sample a colouring from this distribution. We give 2 facts:
We now modify the problem as: Given $G$
and $\beta $,
efficiently sample approximately a colouring from this distribution.
An $\mathit{\epsilon}$-approximation means sampling from a law $q$ such that $\|p-q{\|}_{\mathit{TV}}\le \mathit{\epsilon}$, where
$$\|p-q{\|}_{\mathit{TV}}:=\frac{1}{2}\sum _{\chi}|p(\chi )-q(\chi )|$$ | (12) |
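The total variation distance in Equation 12 is straightforward to compute for finite supports (a sketch; `tv_distance` is our own hypothetical name):

```python
def tv_distance(p, q):
    """Total variation distance: half the L1 distance between two
    distributions given as dicts from outcome to probability."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(chi, 0.0) - q.get(chi, 0.0)) for chi in support)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.8, "b": 0.2}
assert abs(tv_distance(p, q) - 0.3) < 1e-12
assert tv_distance(p, p) == 0.0
```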
We modify our original problem template to now be: Given
$G$ and
$\beta $, efficiently sample
$\mathit{\epsilon}$-approximately
a colouring from this distribution.
A Fully Polynomial Almost Uniform Sampler (FPAUS) allows us to sample $\mathit{\epsilon}$-approximately in $\mathit{poly}(G,\frac{1}{\mathit{\epsilon}})$ time.
A Fully Polynomial Time Approximation Scheme (FPTAS) instead gives a $1\pm \mathit{\epsilon}$-factor approximation in $\mathit{poly}(G,\frac{1}{\mathit{\epsilon}})$ time.
We can also show that $\mathit{FPTAS}\phantom{\rule{0.28em}{0ex}}\iff \phantom{\rule{0.28em}{0ex}}\mathit{FPAUS}$.
The Antiferromagnetic Potts model:
$$p(\chi )\propto \mathit{exp}(\beta \cdot m(\chi ))$$ | (13) |
where $\beta <0$
Given $G$ and
$\beta <0$, we want to be
able to give an FPAUS for this distribution. It is then equivalent to instead work on the problem:
given $G$ and
$\beta <0$, give an FPTAS for
its partition function ${Z}_{G}(q,\beta )$.
From some previous work, we know that there exists a ${\beta}_{c}$ such that:
We can say that this is #BIS-hard (counting independent sets in bipartite graphs): it is at least as hard as an FPTAS for the number of independent sets in bipartite graphs. Without the bipartiteness assumption, the problem is NP-hard.
For now, let's consider the problem: given a bipartite graph $G$, design an FPTAS for the number of independent sets in $G$. This accurately captures the difficulty of: the number of proper $q$-colorings of a bipartite graph for $q\ge 3$, the number of stable matchings, and the number of antichains in posets.
For our purposes we assume that $G$ is always a $d$-regular graph on $n$ vertices. Now for a set $S\subset V$, we define its edge boundary as:
$$\nabla (S):=\#\{\mathit{uv}\in E(G)\mid u\in S,v\notin S\}$$ |
Now, $G$ is an $\eta $-expander if for every $S\subset V$ of size at most $n/2$, we have $|\nabla (S)|\ge \eta \left|S\right|$. For example, we can take the discrete cube ${Q}_{d}$ with vertices ${\{0,1\}}^{d}$, where $\mathit{uv}$ is an edge if $u$ and $v$ differ in exactly 1 coordinate.
Using a simplification of Harper's Theorem, we can say that ${Q}_{d}$ is a $1$-expander [7].
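The expansion of the cube can be verified by exhaustive check for a small dimension (a sketch of ours; exhaustive over all subsets, so only feasible for tiny $d$):

```python
import itertools

def edge_boundary(S, edges):
    """Number of edges with exactly one endpoint in S."""
    return sum(1 for (u, v) in edges if (u in S) != (v in S))

d = 3
vertices = list(itertools.product((0, 1), repeat=d))
edges = [(u, v) for u, v in itertools.combinations(vertices, 2)
         if sum(a != b for a, b in zip(u, v)) == 1]  # differ in 1 coordinate

# eta = 1: |boundary(S)| >= |S| for every nonempty S with |S| <= n/2
n = len(vertices)
for r in range(1, n // 2 + 1):
    for S in itertools.combinations(vertices, r):
        assert edge_boundary(set(S), edges) >= len(S)
```

The bound is tight on a subcube: half of $Q_3$ (four vertices sharing a coordinate) has boundary exactly 4.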
Theorem 2. For each $\mathit{\epsilon}>0$ there are $d=d(\mathit{\epsilon})$ and $q=q(\mathit{\epsilon})$ such that there is an FPTAS for ${Z}_{G}(q,\beta )$ where $G$ is a $d$-regular $2$-expander, provided the following conditions hold:
The main result shown was:
Theorem 3. For each $\mathit{\epsilon}>0$ and $d$ large enough, there is an FPTAS for ${Z}_{G}(q,\beta )$ for the class of $d$-regular triangle-free $1$-expander graphs, provided the following conditions hold:
This was previously known for:
Something to note here is that $q\ge \mathit{poly}(d)$ should not be a necessary condition. Likewise, the case $\beta \le (1-\mathit{\epsilon}){\beta}_{0}$ should not require expansion or even $q\ge \mathit{poly}(d)$.
We first write the order-disorder threshold of the ferromagnetic Potts model
$$\begin{array}{rl}{\beta}_{0}:=& \mathrm{ln}\left(\frac{q-2}{{(q-1)}^{1-2/d}-1}\right)\\ {\beta}_{0}=& \frac{2\mathrm{ln}q}{d}\left(1+O\left(\frac{1}{q}\right)\right)\end{array}$$ | (14) |
We want to be able to know more about how the Potts distribution looks for $\beta <(1-\mathit{\epsilon}){\beta}_{0}$ and for $\beta >(1+\mathit{\epsilon}){\beta}_{0}$
Another result we have is:
Theorem 4. For each $\mathit{\epsilon}>0$, let $d$ be large enough $q\ge \mathit{poly}(d)$, and $G$ be a $d$-regular $2$-expander graph on $n$ vertices then,
The strategy we have, to prove the theorem for $\beta <(1-\mathit{\epsilon}){\beta}_{0}$:
The motivating idea is to visualize a state at low temperature ($\beta $ large) as a ground state plus defects:
Typical Colouring = Ground State + Defects
Polymer methods are pretty useful in such cases. These were first proposed in [8] and originated in statistical physics. We take $G$ to be our defect graph, and each node in it represents a defect.
Now using Polymer methods $X{\text{\u223c}}_{G}Y$
The idea is that ${Z}_{G}(q,\beta )\sim {Z}_{\mathit{red}}+{Z}_{\mathit{blue}}+\dots \phantom{\rule{0.17em}{0ex}}$ where ${Z}_{\mathit{red}}\approx {e}^{\beta nd/2}$, and
$${Z}_{\mathit{red}}{e}^{-\beta nd/2}={\sum}_{I\subset V(G)}{\prod}_{\gamma \in I}{w}_{\gamma}$$ |
where ${w}_{\gamma}$ is the weight of polymer $\gamma $.
We now move towards the cluster expansion: the multivariate Taylor expansion in the ${w}_{\gamma}$ of:
$$\mathit{ln}(\sum _{I\subset V(G)}\prod _{\gamma \in I}{w}_{\gamma})$$ |
This is an infinite sum, so convergence is not guaranteed however convergence can be
established by verifying the Kotecký-Preiss criterion.
We also want to answer how many connected subsets are there of a given edge boundary in an $\eta $-expander?
A heuristic we have is to count the number of such subsets that contain a given vertex $u$: a typical connected subgraph of size $a$ is tree-like, i.e., has edge boundary $\approx a\cdot d$.
Working backward, a typical connected subgraph with edge boundary of size $b$ has $O(b/d)$ vertices. The number of such subgraphs is $\le $ the number of connected subgraphs of size $O(b/d)$ containing $u$, which is $\le $ the number of rooted (at $u$) trees with $O(b/d)$ vertices and maximum degree at most $d$, i.e., ${d}^{O(b/d)}$. Thus,
Theorem 5. At most ${d}^{O((1+1/\eta )b/d)}$ connected subsets of an $\eta $-expander containing $u$ have edge boundary of size at most $b$.
Another question to ask is how many
$q$-colorings of
an $\eta $-expander
induce at most $k$
non-monochromatic edges?
The easiest way to make $k$ non-monochromatic edges is to color all but $k/d$ randomly chosen vertices with the same color. For $k$ small, these vertices likely form an independent set; we then color these $k/d$ vertices arbitrarily. There are
$$\left(\genfrac{}{}{0ex}{}{n}{k/d}\right){q}^{k/d+1}$$ |
ways.
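A quick numeric instance of this count (the parameter values and the helper name are ours, purely illustrative):

```python
from math import comb

def near_monochromatic_colourings(n, d, q, k):
    """C(n, k/d) * q^(k/d + 1): choose the k/d recoloured vertices,
    a colour for each of them, and the base colour."""
    t = k // d
    return comb(n, t) * q ** (t + 1)

# n = 100 vertices, d = 5, q = 3 colours, k = 10 non-monochromatic edges
assert near_monochromatic_colourings(n=100, d=5, q=3, k=10) == comb(100, 2) * 3 ** 3
```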
Theorem 6. For $\eta $-expanders and $q\ge \mathit{poly}(d)$ there are at most ${n}^{4}{q}^{O(k\u2215d)}$ possible colourings.
We also know the maximum value of ${Z}_{G}(q,\beta )$ over all graphs $G$ with $n$ vertices, $m$ edges, and max degree $d$: it is always attained when $G$ is a disjoint union of copies of ${K}_{d+1}$ and ${K}_{1}$.
18th October 2022
The related paper: Adversarially Robust Learning with Tolerance by Ashtiani et al. [9]. Seminar by Hassan Ashtiani.
Characterizing the sample complexity of different machine learning tasks is one of the central questions in statistical learning theory. For example, the classic Vapnik-Chervonenkis theory characterizes the sample complexity of binary classification. Despite this early progress, the sample complexity of many important learning tasks — including density estimation and learning under adversarial perturbations — are not yet resolved. In this talk, we review the less conventional approach of using compression schemes for proving sample complexity upper bounds, with specific applications in learning under adversarial perturbations and learning Gaussian mixture models.
We start by defining some notation:
Our goal is that for every ${D}_{Z}$, ${A}_{Z,H}(S)$ is comparable to $\mathit{OPT}$ with high probability.
We take the example of density estimation, in which case $L({D}_{Z},h)={d}_{\mathit{TV}}({D}_{Z},h)$. Now, ${A}_{Z,H}$ probably approximately correct (PAC) learns $H$ with $m(\mathit{\epsilon},\delta )$ samples if for all ${D}_{Z}$ and all $\mathit{\epsilon},\delta \in (0,1)$: if $S\sim {D}_{Z}^{m(\mathit{\epsilon},\delta )}$ then
$$\underset{S}{Pr}[L({D}_{Z},{A}_{Z,H}(S))>\mathit{\epsilon}+C\cdot \mathit{OPT}]<\delta $$ | (15) |
Now if we take the example of $C=2$, let $H$ be the set of all Gaussians in ${\mathbb{R}}^{d}$ then:
$$m(\mathit{\epsilon},\delta )=O\left(\frac{{d}^{2}+\mathrm{log}1\u2215\delta}{{\mathit{\epsilon}}^{2}}\right)$$ |
We will now modify the above equation. Now, ${A}_{Z,H}$ probably approximately correct learns $H$ with $m(\mathit{\epsilon})$ samples if for all ${D}_{Z}$ and for all $\mathit{\epsilon}\in (0,1)$. Now if $S\sim {D}_{Z}^{m(\mathit{\epsilon})}$ then:
$$\underset{S}{Pr}[L({D}_{Z},{A}_{Z,H}(S))>\mathit{\epsilon}+C\cdot \mathit{OPT}]<0.01$$ | (16) |
For the example of $C=2$, let $H$ be the set of all Gaussians in ${\mathbb{R}}^{d}$ then:
$$m(\mathit{\epsilon})=O\left(\frac{{d}^{2}}{{\mathit{\epsilon}}^{2}}\right)$$ |
For the example of binary classification, we have $Z=X\times \{0,1\}$ and $h$ is a model mapping $h:X\to \{0,1\}$. We also have $l(h,x,y)=1\{h(x)\ne y\}$, and then $L({D}_{Z},h)={E}_{(x,y)\sim {D}_{Z}}l(h,x,y)$.
Now, ${A}_{Z,H}$ probably approximately correct learns $H$ with $m(\mathit{\epsilon})$ samples if for all ${D}_{Z}$ and for all $\mathit{\epsilon}\in (0,1)$. Now if $S\sim {D}_{Z}^{m(\mathit{\epsilon})}$ then:
$$\underset{S}{Pr}[L({D}_{Z},{A}_{Z,H}(S))>\mathit{\epsilon}+C\cdot \mathit{OPT}]<0.01$$ | (17) |
Now $H$ is the set of all half spaces in ${\mathbb{R}}^{d}$ then:
$$m(\mathit{\epsilon})=O\left(\frac{d}{{\mathit{\epsilon}}^{2}}\right)$$ |
For binary classification with adversarial perturbations, we again have $Z=X\times \{0,1\}$ and $h:X\to \{0,1\}$. Now ${l}^{U}(h,x,y)$ is the loss under adversarial perturbations, and ${L}^{U}({D}_{Z},h)={E}_{(x,y)\sim {D}_{Z}}{l}^{U}(h,x,y)$.
Now, ${A}_{Z,H}$ probably approximately correct learns $H$ with $m(\mathit{\epsilon})$ samples if for all ${D}_{Z}$ and for all $\mathit{\epsilon}\in (0,1)$. Now if $S\sim {D}_{Z}^{m(\mathit{\epsilon})}$ then:
$$\underset{S}{Pr}[{L}^{U}({D}_{Z},{A}_{Z,H}(S))>\mathit{\epsilon}+\mathit{OPT}]<0.01$$ | (18) |
Now, consider the following ${H}_{1}$ and ${H}_{2}$: here ${H}_{2}$ is richer, which can make it contain better models but also make it harder to learn. We can characterize the sample complexity of learning $H$ for binary classification, binary classification with adversarial perturbations, or density estimation.
In the case of binary classification the VC-dimension quantifies complexity:
$$m(\mathit{\epsilon})=\Theta \left(\frac{VC(H)}{{\mathit{\epsilon}}^{2}}\right)$$ |
The upper bound here is achieved using simple ERM (empirical risk minimization):
$$L(S,h)=\frac{1}{\left|S\right|}\sum _{(x,y)\in S}l(h,x,y)$$ |
$$h={\mathrm{argmin}}_{h\in H}L(S,h)$$ |
And then by uniform convergence:
$$\underset{h\in H}{\mathrm{sup}}|L(S,h)-L({D}_{Z},h)|=O\left(\sqrt{\frac{VC(H)}{\left|S\right|}}\right)\text{ w.p. }>0.99$$ | (19) |
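The ERM rule above can be sketched for a finite hypothesis class (a toy illustration of ours; threshold classifiers on the line stand in for $H$):

```python
def erm(hypotheses, sample):
    """Empirical risk minimization over a finite hypothesis class:
    return the h in H with the smallest average 0/1 loss on the sample."""
    def empirical_loss(h):
        return sum(1 for (x, y) in sample if h(x) != y) / len(sample)
    return min(hypotheses, key=empirical_loss)

# toy H: threshold classifiers x -> 1[x >= t]
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]
sample = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
best = erm(H, sample)
assert all(best(x) == y for (x, y) in sample)  # threshold 0.5 fits perfectly
```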
We now introduce sample compression as an alternative.
The idea is to answer: how should we go about compressing a given training set? In classic information theory, we would compress it into a few bits; in sample compression, we instead want to compress it into a few samples.
If we just take the simple example of linear classification, the number of required bits is unbounded (it depends on the sample).
It has already been shown by Littlestone and Warmuth [10] that Compressibility $\phantom{\rule{0.28em}{0ex}}\Rightarrow \phantom{\rule{0.28em}{0ex}}$ Learnability:
$$m(\mathit{\epsilon})=\stackrel{~}{O}\left(\frac{k}{{\mathit{\epsilon}}^{2}}\right)$$ |
It has also been shown by Moran and Yehudayoff [11] that Learnability $\phantom{\rule{0.28em}{0ex}}\Rightarrow \phantom{\rule{0.28em}{0ex}}$ Compressibility with
$$k={2}^{O(VC)}$$ |
Sample Compression can be very helpful by:
Typical classifiers are often:
In the case of classification with adversarial perturbations we had ${l}^{0/1}(h,x,y)=1\{h(x)\ne y\}$ and ${l}^{U}(h,x,y)={\mathrm{sup}}_{\overline{x}\in U(x)}{l}^{0/1}(h,\overline{x},y)$, and then ${L}^{U}({D}_{Z},h)={E}_{(x,y)\sim {D}_{Z}}{l}^{U}(h,x,y)$.
Now, ${A}_{Z,H}$ probably approximately correct learns $H$ with $m(\mathit{\epsilon})$ samples if for all ${D}_{Z}$ and for all $\mathit{\epsilon}\in (0,1)$. Now if $S\sim {D}_{Z}^{m(\mathit{\epsilon})}$ then:
$$\underset{S}{Pr}[{L}^{U}({D}_{Z},{A}_{Z,H}(S))>\mathit{\epsilon}+\mathit{OPT}]<0.01$$ | (20) |
However, one problem is whether the robust ERM works for all $H$:
$${L}^{U}(S,h)=\frac{1}{\left|S\right|}\sum _{(x,y)\in S}{l}^{U}(h,x,y)$$ |
$$h={\mathrm{argmin}}_{h\in H}{L}^{U}(S,h)$$ |
The robust ERM does not work for all $H$: uniform convergence can fail, and
$${\mathrm{sup}}_{h\in H}|{L}^{U}(S,h)-{L}^{U}({D}_{Z},h)|$$ |
can be unbounded.
We can say that any "proper learner" (one that outputs from $H$) can fail.
In a compression-based method the decoder should recover the labels of the training set and their neighbors, and then compress the inflated set:
$$k={2}^{O(VC)}$$ |
So,
$${m}^{U}(\mathit{\epsilon})=O\left(\frac{{2}^{VC(H)}}{{\mathit{\epsilon}}^{2}}\right)$$ | (21) |
There is an exponential dependence on
$VC(H)$.
Ashtiani et al. [9] introduced tolerant adversarial learning: ${A}_{Z,H}$ PAC learns $H$ with $m(\mathit{\epsilon})$ samples if $\forall {D}_{Z}$, $\forall \mathit{\epsilon}\in (0,1)$, if $S\sim {D}_{Z}^{m(\mathit{\epsilon})}$ then
$$\underset{S}{Pr}[{L}^{U}({D}_{Z},{A}_{Z,H}(S))>\mathit{\epsilon}+{\mathrm{inf}}_{h\in H}{L}^{V}({D}_{Z},h)]<0.01$$ | (22) |
And,
$${m}^{U,V}(\mathit{\epsilon})=\stackrel{~}{O}\left(\frac{VC(H)\cdot d\mathrm{log}(1+\frac{1}{\gamma})}{{\mathit{\epsilon}}^{2}}\right)$$ | (23) |
The trick is to avoid compressing an infinite set and now our new
goal is that the decoder should only recover labels of things in
$U(x)$.
To do so we can define a noisy empirical distribution (using $V(x)$) and then use boosting to achieve a very small error with respect to this distribution. We then encode the classifier using the samples used to train the weak learners, and the decoder smooths out the hypotheses.
It is interesting to ask why we need tolerance: there do exist some other ways to relax the problem and avoid ${2}^{O(VC)}$.
This is also observable in the density estimation example.
Gaussian mixture Models are very popular in practice and are one of the most basic universal density approximators. These are also the building blocks for more sophisticated density classes and can think of them as multi-modal versions of Gaussians.
$$f(x)={w}_{1}N(x|{\mu}_{1},{\mathrm{\Sigma}}_{1})+{w}_{2}N(x|{\mu}_{2},{\mathrm{\Sigma}}_{2})+{w}_{3}N(x|{\mu}_{3},{\mathrm{\Sigma}}_{3})$$ |
We say $F$ is a Gaussian mixture model with $k$ components in ${\mathbb{R}}^{d}$.
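As a toy illustration of the mixture density above (one-dimensional for simplicity; `normal_pdf` and `gmm_pdf` are our own hypothetical helper names):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of a one-dimensional Gaussian N(mu, sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gmm_pdf(x, weights, mus, sigma2s):
    """Density of a k-component mixture: sum_i w_i * N(x | mu_i, sigma2_i)."""
    return sum(w * normal_pdf(x, mu, s2)
               for w, mu, s2 in zip(weights, mus, sigma2s))

# two equally weighted symmetric components: at x = 0 the mixture density
# equals either component's density there
p = gmm_pdf(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
assert abs(p - normal_pdf(0.0, 1.0, 1.0)) < 1e-12
```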
We want to ask how many samples are needed to recover $f\in F$ within ${L}_{1}$ error $\mathit{\epsilon}$; call this number of samples $m(d,k,\mathit{\epsilon})$.
To learn a single Gaussian in ${\mathbb{R}}^{d}$,
$$O\left(\frac{{d}^{2}}{{\mathit{\epsilon}}^{2}}\right)=O\left(\frac{\mathit{\#params}}{{\mathit{\epsilon}}^{2}}\right)$$ |
samples are sufficient (and necessary).
Now if we have a mixture of $k$ Gaussians in ${\mathbb{R}}^{d}$, we want to know whether
$$O\left(\frac{k{d}^{2}}{{\mathit{\epsilon}}^{2}}\right)=O\left(\frac{\mathit{\#params}}{{\mathit{\epsilon}}^{2}}\right)$$ |
samples are sufficient?
There have been some results on learning Gaussian Mixture Models.
Let us take the example of this graph. For a moment, look at this as a binary classification problem: the decision boundary has a simple quadratic form!
$$VC\text{-}\mathit{dim}=O({d}^{2})$$ |
Here "sample compression" does not make sense as there are no "labels".
We have $F$, a class of distributions (e.g., Gaussians). If A sends $t$ points out of $m$ samples and B approximates $D$, then we say $F$ admits $(t,m)$-compression.
Theorem 7. If $F$ has a compression scheme of size $(t,m)$ then sample complexity of learning $F$ is
$$\stackrel{~}{O}\left(\frac{t}{{\mathit{\epsilon}}^{2}}+m\right)$$ |
$\stackrel{~}{O}(\cdot )$ hides polylog factors.
Small compression schemes imply sample-efficient algorithms.
Theorem 8. If $F$ has a compression scheme of size $(t,m)$ then $k$ mixtures of $F$ admits $(\mathit{kt},\mathit{km})$ compression.
Distribution compression schemes extend to mixture classes automatically! So for the case
of GMMs in ${\mathbb{R}}^{d}$
it is enough to come up with a good compression scheme for a single Gaussian!
For learning mixtures of Gaussians, encoding the center and axes of the ellipsoid is sufficient to recover $N(\mu ,\mathrm{\Sigma})$.
This admits $\stackrel{~}{O}({d}^{2},\frac{1}{\mathit{\epsilon}})$ compression! The technical challenge is encoding the $d$ eigenvectors "accurately" using only ${d}^{2}$ points; in particular, $\frac{{\sigma}_{\mathit{max}}}{{\sigma}_{\mathit{min}}}$ can be large.
Compression relies heavily on a few points
Target compression is quite general
[1] Omar Alrabiah, Venkatesan Guruswami, Pravesh Kothari, and Peter Manohar. A near-cubic lower bound for 3-query locally decodable codes from semirandom CSP refutation. Technical Report TR22-101, Electronic Colloquium on Computational Complexity (ECCC), July 2022.
[2] Arnab Bhattacharyya, L. Sunil Chandran, and Suprovat Ghoshal. Combinatorial lower bounds for 3-query LDCs, 2019. URL https://arxiv.org/abs/1911.10698.
[3] Venkatesan Guruswami, Pravesh K. Kothari, and Peter Manohar. Algorithms and certificates for Boolean CSP refutation: "smoothed is no harder than random", 2021. URL https://arxiv.org/abs/2109.04415.
[4] Alexander S. Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA, 2019. URL https://arxiv.org/abs/1904.03858.
[5] Charlie Carlson, Ewan Davies, Nicolas Fraiman, Alexandra Kolla, Aditya Potukuchi, and Corrine Yap. Algorithms for the ferromagnetic Potts model on expanders, 2022. URL https://arxiv.org/abs/2204.01923.
[6] Matthew Coulson, Ewan Davies, Alexandra Kolla, Viresh Patel, and Guus Regts. Statistical Physics Approaches to Unique Games. In Shubhangi Saraf, editor, 35th Computational Complexity Conference (CCC 2020), volume 169 of Leibniz International Proceedings in Informatics (LIPIcs), pages 13:1–13:27, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. ISBN 978-3-95977-156-6. doi: 10.4230/LIPIcs.CCC.2020.13. URL https://drops.dagstuhl.de/opus/volltexte/2020/12565.
[7] Peter Frankl and Zoltán Füredi. A short proof for a theorem of Harper about Hamming spheres. Discrete Mathematics, 34(3):311–313, 1981.
[8] Tyler Helmuth, Will Perkins, and Guus Regts. Algorithmic Pirogov-Sinai theory. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 1009–1020, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367059. doi: 10.1145/3313276.3316305. URL https://doi.org/10.1145/3313276.3316305.
[9] Hassan Ashtiani, Vinayak Pathak, and Ruth Urner. Adversarially robust learning with tolerance, 2022. URL https://arxiv.org/abs/2203.00849.
[10] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. 1986.
[11] Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. Journal of the ACM (JACM), 63(3):1–10, 2016.