import torch
n = 2 * 4 * 100000
s = torch.empty(n)
a = torch.rand(1)
x = torch.rand(n)
y = torch.rand(n)
s = a * x + y
%timeit a * x + y
s
392 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
tensor([0.9778, 0.2121, 0.6137, ..., 1.3553, 1.3155, 1.4675])
You can get the address of the first element of a torch tensor with the data_ptr() method. These addresses are 64 bit unsigned integers on 64 bit systems. With these addresses and the tensors’ lengths, you can do whatever you want with the tensors’ memory. Be careful not to leak memory here: do things in place, or place the results of your computation into a tensor passed in from Python. Below, I place the result of a * x + y into the passed s tensor. Also, make sure not to get the data type wrong. In this case, I know that I am working with the FP32 type.
%%writefile ./my_lib1.c
#include<arm_neon.h>
// assumes n is a multiple of 4
void saxpy4(float* s, float a, const float* x, const float* y, int n) {
float32x4_t vs, va, vx, vy;
va = vdupq_n_f32(a);
for (int i = 0;i < n; i += 4) {
vx = vld1q_f32(x + i);
vy = vld1q_f32(y + i);
vs = vfmaq_f32(vy, va, vx);
vst1q_f32(s + i, vs);
}
}
Writing ./my_lib1.c
!gcc-13 -O3 -shared -o my_lib1.so -fPIC my_lib1.c
import ctypes
lib = ctypes.CDLL('./my_lib1.so')
lib.saxpy4.argtypes = (ctypes.c_uint64, ctypes.c_float, ctypes.c_uint64, ctypes.c_uint64, ctypes.c_int)
lib.saxpy4.restype = None
%timeit lib.saxpy4(s.data_ptr(), a.item(), x.data_ptr(), y.data_ptr(), n)
s
106 µs ± 129 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
tensor([0.9778, 0.2121, 0.6137, ..., 1.3553, 1.3155, 1.4675])
It’s roughly 4 times as fast because the SIMD vectors are 128 bits wide, which holds four FP32 values.
Multithreading is only worth it for large n. Here, I’m using just two threads. Be careful about bad memory access patterns and false sharing. That’s not a concern in this case.
%%writefile ./my_lib2.c
#include <omp.h>
#include<arm_neon.h>
// assumes n is a multiple of 8
void saxpy2x4(float* s, float a, const float* x, const float* y, int n) {
int half = n / 2;
#pragma omp parallel num_threads(2)
{
int tid = omp_get_thread_num();
// take per-thread copies: s, x, y, and n are shared across the threads,
// so modifying them inside the parallel region would be a data race
float* st = s + tid * half;
const float* xt = x + tid * half;
const float* yt = y + tid * half;
float32x4_t vs, va, vx, vy;
va = vdupq_n_f32(a);
for (int i = 0; i < half; i += 4) {
vx = vld1q_f32(xt + i);
vy = vld1q_f32(yt + i);
vs = vfmaq_f32(vy, va, vx);
vst1q_f32(st + i, vs);
}
}
}
Writing ./my_lib2.c
!gcc-13 -O3 -shared -o my_lib2.so -fPIC -fopenmp my_lib2.c
lib2 = ctypes.CDLL('./my_lib2.so')
lib2.saxpy2x4.argtypes = (ctypes.c_uint64, ctypes.c_float, ctypes.c_uint64, ctypes.c_uint64, ctypes.c_int)
lib2.saxpy2x4.restype = None
%timeit lib2.saxpy2x4(s.data_ptr(), a.item(), x.data_ptr(), y.data_ptr(), n)
s
67.2 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
tensor([0.9778, 0.2121, 0.6137, ..., 1.3553, 1.3155, 1.4675])
It’s even faster!
W = torch.rand(4,4)
h = torch.empty(4)
x = torch.rand(4)
h = torch.mv(W, x) # same as W @ x
%timeit torch.mv(W, x)
h
980 ns ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
tensor([0.9658, 0.7360, 1.3095, 1.7704])
If h = Wx, the ith entry of h is the dot product of the ith row of W and x. This extends to matrix multiplication too: C = AB, where the ith column of C is the matrix-vector product of A and the ith column of B. There are more optimizations that can be done for matmuls.
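To make the row-dot-product view concrete, here is a tiny pure-Python sketch (my own illustration, not part of the NEON demo):

```python
def matvec(W, x):
    # h[i] = dot product of row i of W with x
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

W = [[1.0, 2.0],
     [3.0, 4.0]]
x = [5.0, 6.0]
h = matvec(W, x)  # [1*5 + 2*6, 3*5 + 4*6]
```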
%%writefile ./my_lib3.c
#include<arm_neon.h>
// assumes W is 4x4 and x is 4x1
void matvec4x4(float* h, const float* W, const float* x) {
float32x4_t vw, vx;
vx = vld1q_f32(x);
vw = vld1q_f32(W);
h[0] = vaddvq_f32(vmulq_f32(vx, vw));
vw = vld1q_f32(W + 4);
h[1] = vaddvq_f32(vmulq_f32(vx, vw));
vw = vld1q_f32(W + 8);
h[2] = vaddvq_f32(vmulq_f32(vx, vw));
vw = vld1q_f32(W + 12);
h[3] = vaddvq_f32(vmulq_f32(vx, vw));
}
Writing ./my_lib3.c
!gcc-13 -O3 -shared -o my_lib3.so -fPIC -fopenmp my_lib3.c
lib3 = ctypes.CDLL('./my_lib3.so')
lib3.matvec4x4.argtypes = (ctypes.c_uint64, ctypes.c_uint64, ctypes.c_uint64)
lib3.matvec4x4.restype = None
%timeit lib3.matvec4x4(h.data_ptr(), W.data_ptr(), x.data_ptr())
h
675 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
tensor([0.9658, 0.7360, 1.3095, 1.7704])
SIMD extensions are great if your linear operators are very small.
Last Friday, there was a lecture by Robert Tarjan about sorting given some information about the sorted order. I thought it was pretty cool. I saw a lot of UCI’s computer science professors there.
I forgot some things that were mentioned, and there are things that I remember being brought up briefly
but also forget where they fit into the lecture. Here are those things:
Ellipsoid method
Splay trees
Here is my understanding of what was said:
You can represent what is known about the sorted total order as a DAG in which vertices are items and there
is an edge from item a to item b if a and b have been compared and b > a.
a ---> b
Say, n is the number of items.
If all n choose 2 comparisons have been made, you would have a complete graph
whose topological sort would always yield the true total order.
However, if no comparisons have been made yet (zero edges in this graph of n vertices), then there are n! possible orderings that topological sort could give you, only one of which is correct.
Also, here’s my topological sort implementation for the cses problem, course schedule.
my submission
Tarjan was focusing on situations where you are given the results of some comparisons prior to sorting, as an incomplete DAG:
---------------
| |
| v
a ---> b c ---> d
A solution to this is to topologically sort the DAG using a heap instead of a queue. This way, you always get the right sorted order. A big consideration, however, is the choice of heap. It had something to do with the working set of a heap, which is the maximum number of elements that are in the heap while item i is in the heap.
I forget the complexity of topological heapsort with a binary heap.
It might be better than plain heapsort depending on how informative the DAG is. Here’s the most informative DAG you can get
a ---> b ---> c ---> d
In this case, for topological heapsort, the maximum number of elements in the heap at one time would be 1. On the other hand, regular heapsort would put all elements on the heap. It’s still O(n log n), but the working sets will be bigger.
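As a concrete sketch (my own, using Python’s binary heapq, not anything shown in the lecture), topological heapsort looks like this: only items whose known predecessors are already emitted sit in the heap, so an informative DAG keeps the heap small.

```python
import heapq
from collections import defaultdict

def topo_heapsort(items, edges):
    # edges are the known comparisons: (a, b) means a < b
    succ = defaultdict(list)
    indeg = {v: 0 for v in items}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    heap = [v for v in items if indeg[v] == 0]
    heapq.heapify(heap)
    out = []
    while heap:
        v = heapq.heappop(heap)  # smallest currently-available item
        out.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                heapq.heappush(heap, w)
    return out
```

With the chain a -> b -> c -> d as edges, the heap never holds more than one element.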
Tarjan’s solution was to use a pairing heap for topological sorting. This had to do with getting a good bound on the working set. The complexity was O(n + m + log T), where n is the number of vertices, m is the number of given comparisons, and T is the number of possible total orders for the given DAG. Notice that if no comparisons are given, you get n + log(n!), which approaches n log n as n gets big.
Did you know that Tarjan was involved in making Fibonacci heaps?
A question that I asked: how many comparisons prior to sorting is sufficient for this to be the best option for sorting?
The gist of Tarjan’s answer: enough to get a reasonably informative DAG that cuts down on T. Long runs are especially helpful.
Big thanks to Tarjan for the lecture.
The idea is to use a lazy segment tree to get the sum of ranges in log N time complexity.
int A[N];
long long t[N << 2];
The standard segment tree idea is that the sum of a vertex’s segment is equal to the sum of vertex’s children’s segments. This is made clear in a segment tree’s construction.
void build(int i, int l, int r) {
if (l == r) {
t[i] = A[l];
}
else {
int m = l + (r - l) / 2;
build(i<<1, l, m);
build(i<<1|1, m + 1, r);
t[i] = t[i<<1] + t[i<<1|1];
}
}
However, an eager as opposed to lazy range update on a segment tree might affect many ranges, degrading the log N time complexity of operations. For example, adding 1 to every element in the array would require the entire tree to be calculated again.
You can think of a lazy update to a range as completed for that range, but pending for its sub-ranges. Prior to any descent into those sub-ranges, you should propagate the changes.
For example, say I add 1 to all values in the array: I can increase the segment tree’s root sum and remember that 1 was added to all values with a lazy tag. Any reads to that exact range will just return the range’s sum. But now what if I want the sum of a range within the one I just lazily updated? Look below.
Ignore the set operation for now. Notice the new lazy add member added to each vertex.
struct node {
ll sum;
ll lz_add;
node() {}
} t[N << 2];
void add(int i, int l, int r, int a, int b, ll x) {
if (a > b) {
return;
}
else if (l == a && r == b) {
t[i].lz_add += x;
t[i].sum += x * (r - l + 1);
}
else {
// 1. what should be done right here?
int m = l + (r - l) / 2;
add(i<<1, l, m, a, min(b, m), x);
add(i<<1|1, m + 1, r, max(a, m + 1), b, x);
// 2. and what should be done here?
}
}
Prior to any descent into those subranges, you have to propagate the changes you made and clear the lazy tag.
Since you will lazily update lower ranges as well, the segment tree invariant needs to be upheld.
item 2 is quite simple:
t[i] = t[i<<1] + t[i<<1|1];
I will leave item 1 to you. Look at my submission or the usaco guide if you are confused. USACO guide
The important thing to know is that if you can cut a tree into components whose size >= x, then the same cut has components with size >= x - 1.
The other thing to know is that k cuts produces k + 1 components.
Using the first fact you can do binary search on the possible x values: [1 … (n/(k+1) + 1)].
Why the upper bound of (n/(k+1) + 1)? If you have a tree with n nodes that you want to split into k + 1 components, the biggest x could be is n/(k + 1). I added the 1 to account for when n is not divisible by k + 1.
Binary search fixes the x value for you. Now you have to be able to check whether you can cut the tree into k + 1 components with size >= x.
You can do this greedily: check subtrees in increasing order of their size (smaller subtrees first), and if a subtree’s size >= x, cut it off into a new component. Depth-first search visits subtrees in this order. It’s in the name: depth first.
int check(int u, int x, int& num_components) {
int cnt = 1;
for (int v : adj[u]) {
if (v == parent[u]) continue;
cnt += check(v, x, num_components);
}
if (cnt >= x) {
++num_components;
cnt = 0; // parent can not use this subtree's node count as it goes into our new component
}
return cnt;
}
I had this idea for formulating graph coloring as an LP problem. Too bad it doesn’t work.
Here’s a connected graph of three vertices: x1, x2, and x3.
I want to find a color assignment that uses the least number of distinct colors such that no two adjacent vertices have the same color.
x1 ---- x2
\ /
\ /
\ /
x3
example.lp
Minimize
obj: + x1 + x2 + x3
Subject To
c1: + x1 - x2 >= 1
c2: + x1 - x3 >= 1
c3: + x2 - x3 >= 1
Bounds
x1 >= 0
x2 >= 0
x3 >= 0
End
It works for this connected graph, but would give a terrible answer for a graph that was just a long string of vertices:
x1 ---- x2 ---- x3 ---- x4
I did this mainly as a demo of the GNU Linear Programming Kit (GLPK). Try some linear programs for yourself.
apt install glpk-utils
glpsol -lp example.lp > example_output.txt
Welsh Powell graph coloring implementation
#include<iostream>
#include<algorithm>
#include<stdlib.h>
#define REP(i, n) for(int i = 0;i < n;++i)
#define REP2(i, s, f) for(int i = s;i <= f;++i)
#define ll long long
#define ull unsigned long long
using namespace std;
struct p {
int first, second;
bool operator<(const p& o) const {
return first < o.first;
}
};
const int N = 100;
const int M = N * (N - 1) / 3;
bool A[N][N];
int D[N]; // D[i] = degree of vertex i
p order[N];
int C[N]; // C[i] = color of vertex i
int main() {
// create random graph
srand(2024);
int a, b;
REP(i, M) {
a = rand() % N;
b = rand() % N;
while (a == b || A[a][b]) {
a = rand() % N;
b = rand() % N;
}
A[a][b] = true;
A[b][a] = true;
}
// get degree of each vertex
REP(i, N) {
REP(j, N) D[i] += A[i][j];
}
REP(i, N) {
order[i].second = i;
order[i].first = D[i];
}
// sort vertices by degree
sort(order, order + N);
int chromatic = 1;
bool adj[N];
for (int i = N - 1;i >= 0;--i) {
int u = order[i].second;
REP(j, N) adj[j] = false;
REP(v, N) {
if (u != v && A[u][v]) adj[C[v]] = true;
}
int c = 1;
while (adj[c]) ++c;
C[u] = c;
if (c > chromatic) chromatic = c;
}
REP(i, N) cout << i << ' ' << C[i] << '\n';
cout << chromatic << endl;
}
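The same greedy idea fits in a few lines of Python (my own sketch, not a port of the C++ above): color vertices in decreasing order of degree, giving each the smallest color unused by its neighbors.

```python
def welsh_powell(n, edges):
    # build adjacency sets for an undirected graph on vertices 0..n-1
    adj = [set() for _ in range(n)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    color = [0] * n  # 0 means uncolored
    for u in sorted(range(n), key=lambda v: -len(adj[v])):
        used = {color[v] for v in adj[u]}
        c = 1
        while c in used:  # smallest color not used by a neighbor
            c += 1
        color[u] = c
    return color, max(color)
```

On the triangle from the LP example it uses three colors, as it must.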
educational dynamic programming contest
coins
There are n coins, and 2^n possible flip sequences. There’s a closed-form expression for this when the probability of heads is the same for all coins. That is not the case here.
The probability of a sequence of independent flips is the product of each flip’s probability (p_i if heads and (1 - p_i) if tails).
You could sum up all of those products in which there are more heads than tails (# heads >= n/2 + 1). However, n < 3000, and there are (n choose (n/2 + 1)) of these products.
Notice that these products can share segments:
P(H1 T2 H3 T4 H5) = P(H1 T2 H3 T4) * P(H5)
P(H1 T2 H3 T4 T5) = P(H1 T2 H3 T4) * P(T5)
A very intuitive dynamic programming solution follows:
dp[i][j] = P(getting i heads in first j flips) = P(Hj) * P(getting i - 1 heads in the first j - 1 flips) + P(Tj) * P(getting i heads in the first j - 1 flips)
my submission
#include<iostream>
#include<iomanip>
using namespace std;
int n;
long double P[3000];
long double dp[3000][3000];
int main() {
cin >> n;
for (int i = 1;i < n + 1;++i) cin >> P[i];
dp[0][0] = 1;
for (int flips = 1;flips < n + 1;++flips)
dp[0][flips] = (1.0 - P[flips]) * dp[0][flips - 1];
for (int flips = 1;flips < n + 1;++flips) {
for (int heads = 1;heads <= flips;++heads) {
dp[heads][flips] = P[flips] * dp[heads - 1][flips - 1] + (1 - P[flips]) * dp[heads][flips - 1];
}
}
long double more_heads = 0;
for (int heads = n/2 + 1;heads < n + 1;++heads) more_heads += dp[heads][n];
cout << std::setprecision(9) << more_heads << endl;
}
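For a sanity check, the same recurrence fits in a few lines of Python (my own sketch; indices here are 0-based where the C++ is 1-based):

```python
def prob_more_heads(p):
    # dp[h][f] = probability of exactly h heads in the first f flips
    n = len(p)
    dp = [[0.0] * (n + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for f in range(1, n + 1):
        dp[0][f] = (1 - p[f - 1]) * dp[0][f - 1]
        for h in range(1, f + 1):
            dp[h][f] = p[f - 1] * dp[h - 1][f - 1] + (1 - p[f - 1]) * dp[h][f - 1]
    # sum over counts with more heads than tails
    return sum(dp[h][n] for h in range(n // 2 + 1, n + 1))
```

With three fair coins the answer is 0.5 by symmetry, which is a handy check.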
educational dynamic programming contest
grouping
You want to put n rabbits in groups, and you get a value a_ij for having rabbit i and rabbit j in the same group.
There are 2^n possible subsets of these rabbits.
If you have a set of rabbits A, you could perform no split and take the sum of all a_ij. But you could also split A into two disjoint sets B and C such that B union C = A; then:
best_score(A) = max(no_split_score(A), best_score(B) + best_score(C) : for all splits B and C)
#include<iostream>
#define REP(i, n) for(int i = 0;i < n;++i)
using namespace std;
int n;
int A[16][16];
long long dp[1 << 16];
int main() {
cin >> n;
REP(i, n) {
REP(j, n) {
cin >> A[i][j];
}
}
unsigned mask = 1;
for (;mask < (1<<n);++mask) {
int ffs = __builtin_ctz(mask);
dp[mask] = dp[mask^(1<<ffs)];
REP(i, n) {
if ((mask^(1<<ffs)) & (1<<i))
dp[mask] = (dp[mask] + A[ffs][i]);
}
}
//REP(i, (1<<n)) cout << i << ' ' << dp[i] << endl;
//cout << endl;
mask = 1;
for (;mask < (1<<n);++mask) {
unsigned group1 = 0;
while (group1 < mask / 2 + 1) {
if ((group1 ^ (mask - group1)) == mask)
dp[mask] = max(dp[mask], (dp[group1] + dp[mask - group1]));
++group1;
}
}
cout << dp[(1<<n)-1] << endl;
}
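Here is the same subset DP as a Python sketch (my own; `best_grouping` is a made-up name). It enumerates proper submasks with the standard `(sub - 1) & mask` trick rather than the counting loop above:

```python
def best_grouping(n, a):
    # score[mask] = sum of a[i][j] over all pairs i < j inside mask
    full = 1 << n
    score = [0] * full
    for mask in range(1, full):
        low = (mask & -mask).bit_length() - 1  # index of lowest set bit
        rest = mask ^ (1 << low)
        score[mask] = score[rest] + sum(a[low][j] for j in range(n) if rest >> j & 1)
    # dp[mask] = best over no split, or any split into two submasks
    dp = score[:]
    for mask in range(1, full):
        sub = (mask - 1) & mask
        while sub:
            dp[mask] = max(dp[mask], dp[sub] + dp[mask ^ sub])
            sub = (sub - 1) & mask
    return dp[full - 1]
```

Each split is visited twice (once from each side), which is wasteful but harmless for a sketch.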
I created this docker compose file in a folder:
~/myfolder/docker-compose.yml
version: "3"
services:
dev:
image: amd64/ubuntu
platform: linux/amd64
volumes:
- .:/root/myfolder2
working_dir: /root
cpus: 2
network_mode: host
security_opt:
- seccomp:unconfined
privileged: true
cap_add:
- ALL
In the same directory as my docker-compose.yml (~/myfolder/):
docker compose run --name=mycontainer dev bash
Then make the mounted folder writable:
sudo chmod +w ~/myfolder
In the container:
apt update
apt install build-essential vim
Now I can just build things and run in the container or cross compile for the host machine.
Useful commands:
# start the container when the docker daemon is running
docker start -i mycontainer
# stop the container
docker stop mycontainer
# opens a bash shell in the running container
docker exec -it mycontainer bash
# remove all stopped containers
docker container prune
Here’s a way to make order-1 Markov chains out of a text.
clean_book.cpp
#include<iostream>
#include<fstream>
#include<unordered_map>
#include<string>
using namespace std;
int main(int argc, char* argv[]) {
ifstream inputFile(argv[1]);
ofstream outFile(argv[2]);
string t;
string t_clean;
while (inputFile >> t) {
t_clean = string();
for (char c : t) {
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '\'' || c == '-')
t_clean += c;
}
outFile << t_clean << ' ';
if (t.back() == '.' || t.back() == '?' || t.back() == '!' || t.back() == ';' || t.back() == ':')
outFile << t.back() << ' ';
}
outFile << '\n';
}
This cleans the text for later use.
g++ clean_book.cpp -o clean_book
./clean_book dracula.txt dracula_clean.txt
I implemented the Markov chain as a map of maps. There are implementations that use less space.
I made it so that the next word chosen is the most common word to the right of the current word. So, if “hat” appears 5 times after “the” while “cat” appears only 3 times after “the”, then “hat” will follow “the” in the Markov text. A hacky solution to avoid loops is to decrease the observed frequency of a chosen successor to a word.
text:
the hat cat.
the hat. the hat. the hat. the hat.
the cat. the cat. the cat.
      the  hat  cat
the     0    5    3
hat     0    0    1
cat     0    0    0
after choosing “hat” to follow “the”, the table would look like this:
      the  hat  cat
the     0    4    3
hat     0    0    1
cat     0    0    0
Instead of decreasing the frequency by one, though, I chose to crush it (the code divides by 100, which zeroes out any count below 100), so that it won’t be chosen again until the other frequencies are exhausted too.
#include<iostream>
#include<fstream>
#include<unordered_map>
#include<string>
using namespace std;
int main() {
unordered_map<string, pair<unsigned, unordered_map<string, unsigned> > > M;
string t;
string prior;
while (cin >> t) {
if (t.back() == '.' || t.back() == '?' || t.back() == '!' || t.back() == ';' || t.back() == ':') {
prior = string();
continue;
}
++M[t].first;
++M[prior].second[t];
prior = t;
}
string s("The");
int n = 200;
for (int i = 0;i < n;++i) {
cout << s << ' ';
unordered_map<string, unsigned>:: iterator next;
unsigned cnt = 0;
for (unordered_map<string, unsigned>::iterator it = M[s].second.begin();it != M[s].second.end();++it) {
if (it->second > cnt) {
next = it;
cnt = it->second;
}
}
if (cnt == 0) break; // no recorded successor for s; next would be invalid
M[s].second[next->first] /= 100;
s = next->first;
}
cout << endl;
}
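The same idea as a Python sketch (my own, not a port of the C++ above; note that generate mutates the counts, mirroring the zeroing trick):

```python
from collections import defaultdict

def build_chain(words):
    # counts[w][nxt] = how often nxt follows w
    counts = defaultdict(lambda: defaultdict(int))
    for prior, nxt in zip(words, words[1:]):
        counts[prior][nxt] += 1
    return counts

def generate(counts, start, n):
    # greedily follow the most frequent successor, zeroing each chosen
    # count so it isn't picked again until the others run out
    out, w = [start], start
    for _ in range(n - 1):
        succ = counts[w]
        if not succ or max(succ.values()) == 0:
            break
        nxt = max(succ, key=succ.get)
        succ[nxt] = 0
        out.append(nxt)
        w = nxt
    return out
```

On the toy text above, "the" leads to "hat" first, and only to "cat" once "hat" has been zeroed.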
On Dracula: The Professor and the door and I am not be a little and we were all the Count Dracula as I could not to the room and then I have been a few minutes however I was a man who have to be no one of the same time to me and that I had been in the time and he said to see the Professor Van Helsing and in a great boxes of his face and a sort of a moment I shall be that he was not know that the window and as he had a word to do not have a good to him and so I must be done and all that we are to my own room was the night and there was in his hand and with a long and to have had to his own heart and his head and said in my dear Madam Mina Harkers Journal 30 October 5 November afternoon–I am so that it was no more than ever and it is a very very sad and when I know what he is the house in her and was to her eyes and if he has been to get the
On Ivanhoe: The Jew and the same time to the Templar and his own hand and a man of the lists and to be the Grand Master and I will be a good knight and that the Jew said the Saxon and in the knight to his head of his hand to a Saxon churls said De Bracy and of a small oratory eight days of their own safety and it is a few minutes the castle of my own and so much of this day and he had been a sort of our own share in a long and was the most reverend father said Rebecca said he was a strong and if thou hast thou art thou shalt thou wilt thou dost thou canst not to him to have been so many a Jew to do not the other hand of which he said Cedric and then said Wamba who had not be found himself to me to her to my father and as the Prior Aymer Prior of her hand the Knight of thy master and with the Black Knight said Prince John to their horses and who was not a Norman and my friend and which the very
I did not use glib for this demo, but if you do use glib and are having linking problems, this is for you.
gcc-13 <program-that-includes-glib> $(pkg-config --cflags --libs glib-2.0)
Today is the last day of 2023. It is about 8 o’clock in the morning and I forgot to sleep. My new year’s resolution is to sleep before midnight everyday.
F = (((an * x + an-1) * x + an-2) * x + ... + a1) * x + a0
# A is an array of coefficients
# evaluates nth degree polynomial with coefficients at x
# A is an array of coefficients [a0, a1, ..., an]
# evaluates the nth degree polynomial at x by Horner's rule
def Horner(A, x: int) -> int:
    ans = 0
    for a in A[::-1]:  # highest-degree coefficient first
        ans = ans * x + a
    return ans
This can be quite useful in other areas, like Rabin-Karp string matching. The polynomial can be updated in a sliding-window fashion instead of recomputed every time.
code for initializing polynomial hash
// s : string with length N > M
// si_h1 : hash of first M length substring
// A1 : our coefficient
ull si_h1 = s[0] - 'a';
for (int i = 1;i < M;++i) {
si_h1 = (si_h1 * A1) % MOD;
si_h1 = (si_h1 + (ull)(s[i] - 'a')) % MOD;
}
“sliding” the hash to the right
si_h1 = (si_h1 + MOD - (big_term1 * (ull)(s[i] - 'a')) % MOD) % MOD; // + MOD guards against unsigned underflow
si_h1 = (si_h1 * A1) % MOD;
si_h1 = (si_h1 + (ull)(s[i + M] - 'a')) % MOD;
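Putting the pieces together, here is a Python sketch of the same sliding polynomial hash (the base and modulus values below are my assumptions, not the ones from my submission):

```python
A1 = 131             # base (assumed value)
MOD = (1 << 61) - 1  # modulus (assumed value)

def window_hashes(s, m):
    # hash of s[0:m], then each later length-m window in O(1)
    h = 0
    for c in s[:m]:
        h = (h * A1 + (ord(c) - ord('a'))) % MOD
    big = pow(A1, m - 1, MOD)  # weight of the outgoing character
    out = [h]
    for i in range(len(s) - m):
        h = (h + MOD - big * (ord(s[i]) - ord('a')) % MOD) % MOD  # drop s[i]
        h = (h * A1 + (ord(s[i + m]) - ord('a'))) % MOD           # add s[i+m]
        out.append(h)
    return out
```

Equal substrings always get equal hashes, so matching windows can be found by comparing hash values.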