At my current client, we use Sonatype Nexus to store our artifacts. The repository is secured with a username and password for both publishing and downloading artifacts.
Spark supports additional repositories through the --repositories option.
We use it like this:
pyspark --repositories https://readonly:secret_password@nexus/repository/maven-public/ --packages com.example:foobar:1.0.0
Unfortunately, we ran into the following issue:
==== repo-1: tried
  https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom
  -- artifact com.example#foobar;1.0.0!foobar.jar:
  https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
    ::::::::::::::::::::::::::::::::::::::::::::::
    ::          UNRESOLVED DEPENDENCIES         ::
    ::::::::::::::::::::::::::::::::::::::::::::::
    :: com.example#foobar;1.0.0: not found
    ::::::::::::::::::::::::::::::::::::::::::::::
The strange thing is that the URL is correct. With curl we can download the dependency:
curl -s -o /dev/null -v https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom
* Hostname was NOT found in DNS cache
* Trying 35...
* Connected to foobar.com (35.xxx.xxx.x) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
    CApath: /etc/ssl/certs
...
...
200 OK
Okay, let’s debug this by using Ivy directly.
Ivy reads its repository configuration from a settings file, so I tried the following ivy.settings:
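<ivysettings>
  <settings defaultResolver="nexus"/>
  <property name="nexus-public" value="https://readonly:secret_password@nexus/repository/maven-public"/>
  <resolvers>
    <ibiblio name="nexus" m2compatible="true" root="${nexus-public}"/>
  </resolvers>
</ivysettings>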
curl -L -o ivy-2.4.0.jar 'http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar'
java -jar ivy-2.4.0.jar -settings ivy.settings -dependency com.example foobar 1.0.0 -debug
We end up with the same error, so the problem lies with Ivy, not with Spark.
==== nexus: tried
  https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom
  -- artifact com.example#foobar;1.0.0!foobar.jar:
  https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
    ::::::::::::::::::::::::::::::::::::::::::::::
    ::          UNRESOLVED DEPENDENCIES         ::
    ::::::::::::::::::::::::::::::::::::::::::::::
    :: com.example#foobar;1.0.0: not found
    ::::::::::::::::::::::::::::::::::::::::::::::
With the -debug option we find the following:
HTTP response status: 401 url=https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
CLIENT ERROR: Unauthorized url=https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
nexus: resource not reachable for com/example#foobar;1.0.0: res=https://readonly:secret_password@nexus/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
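One plausible reading of the difference between curl and Ivy, offered as an assumption rather than something verified in Ivy's source: curl turns the user:password part of a URL into an Authorization header and sends it with the first request, while Ivy seems to request the URL without that header and gets a 401 back. A minimal Python sketch of what curl effectively does with such a URL (the function name split_basic_auth is hypothetical):

```python
import base64
from urllib.parse import urlsplit, urlunsplit


def split_basic_auth(url):
    """Return (url_without_credentials, authorization_header_value_or_None)."""
    parts = urlsplit(url)
    if parts.username is None:
        # No "user:password@" prefix, so there is nothing to convert.
        return url, None
    # Rebuild the network location without the credentials.
    host = parts.hostname
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    clean_url = urlunsplit(
        (parts.scheme, host, parts.path, parts.query, parts.fragment)
    )
    # HTTP Basic authentication is base64("user:password").
    token = f"{parts.username}:{parts.password or ''}".encode("utf-8")
    return clean_url, "Basic " + base64.b64encode(token).decode("ascii")
```

Calling split_basic_auth on the repository URL used above yields the plain https://nexus/... URL plus the header value that has to accompany the request for Nexus to answer with 200 instead of 401.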
Now that we understand the issue, we can start googling. I found this Stack Overflow question.
So let’s replace the basic authentication in the URL with a credentials block:
<ivysettings>
  <settings defaultResolver="nexus"/>
  <property name="nexus-public" value="https://nexus/repository/maven-public"/>
  <credentials host="nexus" realm="Sonatype Nexus Repository Manager" username="readonly" passwd="secret_password"/>
  <resolvers>
    <ibiblio name="nexus" m2compatible="true" root="${nexus-public}"/>
  </resolvers>
</ivysettings>
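Since the settings file now contains the password in plain text, it can be handy to generate it instead of keeping it in version control. The following is a minimal sketch, not part of the original setup: render_ivy_settings is a hypothetical helper, and its defaults simply mirror the values used above. It inlines the repository URL instead of going through the nexus-public property.

```python
import xml.etree.ElementTree as ET


def render_ivy_settings(username, password,
                        host="nexus",
                        realm="Sonatype Nexus Repository Manager",
                        root="https://nexus/repository/maven-public"):
    """Build an Ivy settings document with a credentials element."""
    settings = ET.Element("ivysettings")
    ET.SubElement(settings, "settings", defaultResolver="nexus")
    # host and realm have to match what the server announces in its 401 response.
    ET.SubElement(settings, "credentials", host=host, realm=realm,
                  username=username, passwd=password)
    resolvers = ET.SubElement(settings, "resolvers")
    ET.SubElement(resolvers, "ibiblio", name="nexus",
                  m2compatible="true", root=root)
    # ElementTree escapes characters such as & and < in attribute values.
    return ET.tostring(settings, encoding="unicode")
```

The returned string can be written to /tmp/ivy.settings, with the username and password read from wherever the secrets live, for example environment variables.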
Now everything works like a charm. Time to fix the pyspark command.
pyspark \
  --packages com.example:foobar:1.0.0 \
  --conf spark.jars.ivySettings=/tmp/ivy.settings
Now Spark is able to download the packages as well. I’m a happy camper again.
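The same two settings can also be passed as Spark properties when the session is created from Python code instead of the command line. A minimal sketch, assuming PySpark is installed and no Spark session is running yet, since these properties are only read when Spark starts. The helper names are hypothetical; spark.jars.packages is the property behind --packages.

```python
def ivy_spark_conf(packages, ivy_settings_path):
    """Build the Spark properties equivalent to --packages plus the Ivy settings file."""
    return {
        "spark.jars.packages": ",".join(packages),
        "spark.jars.ivySettings": ivy_settings_path,
    }


def build_session(conf):
    """Create a SparkSession with the given properties applied."""
    # Imported lazily so ivy_spark_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For the example above this would be build_session(ivy_spark_conf(["com.example:foobar:1.0.0"], "/tmp/ivy.settings")).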
All that is left is to add this to our init script, so that new Dataproc clusters are initialized with this setup.
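As a rough sketch of what such an initialization step could do, the snippet below registers the settings file in a spark-defaults.conf style file, so that every pyspark session picks it up without the extra --conf flag. This is an illustration under assumptions: set_spark_default is a hypothetical helper, the location of the configuration file on the cluster has to be supplied by the caller, and only whitespace-separated "key value" lines are recognized.

```python
from pathlib import Path


def set_spark_default(conf_path, key, value):
    """Set `key value` in a spark-defaults.conf style file, replacing any existing entry."""
    path = Path(conf_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    # Drop earlier entries for the same key so that repeated runs stay idempotent.
    # Limitation: entries written as "key=value" are not detected.
    kept = [line for line in lines if line.split(None, 1)[:1] != [key]]
    kept.append(f"{key} {value}")
    path.write_text("\n".join(kept) + "\n", encoding="utf-8")
```

On a cluster node the call might look like set_spark_default("/etc/spark/conf/spark-defaults.conf", "spark.jars.ivySettings", "/tmp/ivy.settings"), where the configuration path is an assumption that should be checked against the actual image.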